Studying Seedance 1 Paper and Testing Out Multi-shot Prompts


Paper: https://arxiv.org/html/2506.09113v2
Just my commentary while studying the paper.
Abstract
Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality.
I think it is very well put. Succinctly describes the issue with current models.
In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precision and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a proposed training paradigm, which allows for natively supporting multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully-optimized post-training approaches leveraging fine-grained supervised fine-tuning, and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; (iv) excellent model acceleration achieving 10x inference speedup through multi-stage distillation strategies and system-level optimizations.
A lot to unpack here.
(i) “multi-source data curation”: Love it. I want to dive into this because you could describe a video in 1-2 words or in a full paragraph. So, it is really going to be interesting to see how they have done this.
(ii) “multi-shot generation”: This sounds interesting. This would be really helpful for character animation sequences if I can generate multi-shots in one prompt. Anyways, I will have to read up on it.
(iv) 10x speed up sounds amazing.
1 Introduction
It introduces Seedance 1.0.
a foundational video generation model with native support for bilingual (Chinese/English) video generation and multi-task versatility encompassing text-to-video synthesis and image-guided video generation. Seedance 1.0 integrates four key technical improvements:
I like the fact that the video model is bilingual. There are things that are more easily understood in one language than in another. Also, that bilinguality may help with understanding concepts in a more multi-faceted way than when using only English.
Imagine this. You can now sprinkle just a tad bit of Szechuan spice into your American dish.
Chinese language embeddings have some coverage of concepts that are foreign to English. It is like a Venn diagram of the Western and Eastern worlds.
Multi-Source Data with Comprehensive Video Captioning. Through multi-stage, multi-perspective curation and dataset balancing, we construct a large-scale high quality video dataset spanning diverse categories, styles, and sources. This enables a comprehensive learning of rich scenarios, topics, and action dynamics. Our precision video captioning system ensures accurate interpretation of user instructions while enabling fluent generation of complex video narratives.
Putting this front and center is really promising. Data is king.
Efficient Architecture Design: In our design, we decouple spatial and temporal layers with an interleaved multimodal positional encoding.
Decoupling spatial and temporal… What does it really mean? I’ll read further to understand this better. But in general, in my mind, this decoupling will definitely help with training because you can lock down either spatial or temporal layers/blocks to train just the right layers and blocks.
It also mentions “interleaved multimodal positional encoding.” I want to learn more about what this is… I think they are saying they interleave the positional encodings of the visual and text tokens, but I will have to read the full document.
Enhanced Post-Training Optimization. We use a small set of carefully collected data for SFT, which is followed by a video-tailored RLHF algorithm (Reinforcement Learning from Human Feedback). We develop feedback-driven learning algorithms using multiple well-developed reward models, which allow us to considerably improve our performance on both T2V and I2V, in terms of motion naturalness, structural coherence, and visual fidelity.
Sounds pretty standard.
Inference Acceleration. We proposed a multi-stage distillation framework to reduce the number of function evaluations (NFE) required for generation, with inference infrastructure optimization techniques, achieving over 10x end-to-end speedup with no degradation in model performance.
Sounds really promising. I don’t know a thing about NFE, but I will study up on it.
Comprehensive Generation Capabilities. Seedance 1.0 achieves superior spatiotemporal coherence and structural stability, demonstrating exceptional motion fluidity and physical plausibility. The model produces photorealistic visuals with nuanced textures and compositional richness, attaining state-of-the-art performance …
Yada yada yada—but agreed.
Precision Instruction Following. Through comprehensive learning of diverse scenarios, entities, and action semantics, Seedance 1.0 precisely interprets complex user specifications. It robustly handles multi-agent interactions, adaptive camera control, and stylistic variations while maintaining narrative continuity.
Okay… Sounds a little standard to me, but I will read on and see what is done differently in Seedance.
Multi-Shot Narrative Capabilities. Seedance 1.0 natively supports coherent multi-shot storytelling with stable view transitions while maintaining consistent subject representation across temporal-spatial transformations.
I’ve noticed this in my tests as well. Even with lots of deformations the character stays more or less the same. It is kinda too good. Have to look into how they did it.
Ultra-Fast Generation Experience…
Yup. You already mentioned it.
2 Model Design
Would be interesting to see how much they reveal here. Seedance is not an open-weights model.
2.1 Variational Autoencoder
Temporally-Causal Compression. Following MAGVIT, we adopt a temporally causal convolutional architecture for both the encoder and decoder, allowing joint spatial-temporal compression of images and videos within latent space. To be more specific, the model transforms the input data from the RGB pixel space with shape \((T' + 1, H', W', 3)\) into a continuous latent representation with shape \((T+1, H, W, C)\), where \((t, h, w, c)\) denotes time, height, width, and channel dimensions with \(r_t = T'/T\), \(r_h=H'/H\), \(r_w=W'/W\) representing the downsample ratios along these three axes, respectively. Benefiting from the causal design, the VAE model can seamlessly process image input and output in the case of \(T=T'=0\).
Okay, so the 4-D video is encoded into a smaller 4-D tensor along temporal and spatial dimensions. This sounds reasonable.
In our practice, for the sake of training and inference efficiency and overall reconstruction and generation performance, we set \((r_t, r_h, r_w) = (4, 16, 16)\) and \(C = 48\).
So each latent vector (one per patch) encodes 4 frames’ worth of a 16×16 pixel patch.
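To make the numbers concrete, here is a tiny sketch (mine, not from the paper) of the latent shape implied by those ratios:

# Sketch: latent shape implied by (r_t, r_h, r_w) = (4, 16, 16) and C = 48.
# Shapes follow the paper's convention: pixels are (T'+1, H', W', 3),
# latents are (T+1, H, W, C). Not the actual Seedance code.

def latent_shape(num_frames_plus_1, height, width, r_t=4, r_h=16, r_w=16, c=48):
    t_prime = num_frames_plus_1 - 1          # causal design: first frame handled separately
    assert t_prime % r_t == 0 and height % r_h == 0 and width % r_w == 0
    return (t_prime // r_t + 1, height // r_h, width // r_w, c)

# Example: a 121-frame (~5 s at 24 fps) clip at 480x832.
print(latent_shape(121, 480, 832))   # -> (31, 30, 52, 48)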
To accommodate the higher downsampling rate and pursue better generation performance, we remove the patchification operation on the DiT side, following the strategy adopted in DCAE
Seedance skips the extra 2×2 “patchification” step used in the original DiT: it feeds the VAE’s continuous latent grid (already 16× compressed) directly into the transformer. This is becoming common in video diffusion models because the latent maps are small enough that further down-sampling would hurt spatial fidelity.
VAE Training. Our VAE is trained with L1 reconstruction loss, KL loss, LPIPS perceptual loss and adversarial training loss. Adversarial training has been shown to be effective in improving the quality of VAE reconstruction by enforcing finer supervision on local textures and detailed structures.
L1 reconstruction loss → standard.
KL loss → standard.
LPIPS perceptual loss → This one asks “does the output feel the same?“ This uses deep-net features instead of raw pixels, so you don’t over-penalize tiny shifts.
Adversarial loss → basically a mini GAN critic shouting “that texture looks fake.“
Taking into account appearance and motion modeling simultaneously, we apply a hybrid discriminator with an architecture similar to that used in PatchGAN.
PatchGAN is a CNN that judges realism patch-by-patch (usually 70×70 or 16×16 regions) instead of over the whole frame.
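Here is a rough sketch of how these four losses could be combined with a PatchGAN-style critic. The loss weights and the discriminator layout are placeholders I made up, and I apply the critic per frame for simplicity; the paper’s hybrid discriminator also models motion.

import torch
import torch.nn as nn
import torch.nn.functional as F
import lpips  # pip install lpips; inputs expected in [-1, 1]

class PatchDiscriminator(nn.Module):
    """Judges realism patch-by-patch: output is a grid of logits, not one scalar."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, 1, 4, stride=1, padding=1),   # one logit per patch
        )
    def forward(self, x):
        return self.net(x)

perceptual = lpips.LPIPS(net="vgg")   # "does the output feel the same?"

def vae_losses(x, x_rec, mu, logvar, disc, w_kl=1e-6, w_lpips=1.0, w_adv=0.1):
    rec = F.l1_loss(x_rec, x)                                        # L1 reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # KL regularizer
    perc = perceptual(x_rec, x).mean()                               # LPIPS perceptual
    adv = -disc(x_rec).mean()                                        # generator-side adversarial term
    return rec + w_kl * kl + w_lpips * perc + w_adv * adv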
2.2 Diffusion Transformer
With the visual tokens encoded by VAE and text tokens generated by a text encoder, we employ the transformer as our diffusion backbone, where a fine-tuned decoder-only LLM serves as the text encoder. The visual tokens are then concatenated with textual tokens and fed into the transformer blocks.
So, the visual token is probably a flattened version of the latent 4-D cube from the VAE.
They add 3-D RoPE (x, y, t) to these tokens. RoPE is a positional embedding applied after tensor has been turned into a list of vectors; each latent “pixel“ becomes one sequence element with its own (x,y,t) coordinate.
For text, per-token embeddings are generated.
Those are then concatenated. In my mind:
vid₁ vid₂ … vidₘ | text₁ text₂ … textₙ
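Something like this, in code (a mental-model sketch of mine; all shapes and the projection are hypothetical):

import torch

# Flatten the latent (T, H, W, C) grid into m visual tokens, keep each token's
# (t, y, x) coordinate for 3D RoPE, then append n text tokens with 1D positions.
T, H, W, C = 9, 30, 52, 48
D = 1024                                      # hypothetical model width

latents = torch.randn(T, H, W, C)
vis_tokens = latents.reshape(T * H * W, C)    # m = T*H*W sequence elements
vis_tokens = torch.nn.Linear(C, D)(vis_tokens)

# each latent "pixel" keeps its own (t, y, x) coordinate for 3D RoPE
coords = torch.stack(torch.meshgrid(
    torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"), dim=-1)
coords = coords.reshape(-1, 3)

n = 77                                        # hypothetical caption length
text_tokens = torch.randn(n, D)               # per-token embeddings from the LLM

sequence = torch.cat([vis_tokens, text_tokens], dim=0)   # vid_1..vid_m | text_1..text_n
print(sequence.shape)                         # (T*H*W + n, D)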
Decoupled Spatial and Temporal Layers. Considering both training and inference efficiency, we build the diffusion transformer with decoupled spatial and temporal layers, where the spatial layers perform attention aggregation within each frame, while the temporal layers focus attention computation across frames.
Basically saying the spatial layers ignore the other frames, and the temporal layers look across the frames. One thing to note is that the “frames” here are latent frames, each of which encodes 4 real frames’ worth of information. So, in that sense, it would have some information about the motion. The spatial layer’s job is to hone in on pixel fidelity and that minuscule motion, while the temporal layer works at a macro level and has sight into all latent frames.
We perform window partition within each frame in the temporal layers, allowing for a global receptive field across the temporal dimensions.
This is a detail that is easy to miss. Looks like the temporal layer does not look at the frames the same way the spatial layer does. It looks at a “window partition” of each frame, so not all of the frame. It would be good to learn how it is really partitioned.
In addition, textual tokens only participate in cross-modality interaction in spatial layers.
So, this means temporal blocks don’t use the text tokens. It seems counterintuitive to me that the temporal blocks don’t get the text tokens, which embed the overall intent of the video. So, the temporal block isn’t really there to “direct” the video; it is just there to congregate all of the frames and make them play nicely with each other.
That said, the spatial attention outputs already attend to text tokens and implicitly carry the “will” to a certain extent. So, even if the temporal block does not see the explicit text tokens, it can probably infer the intent to a certain degree.
Or perhaps it is a design flaw. Or a compromise they had to make to reduce compute cost. Perhaps it would have been better if they allowed the temporal layer to see the text tokens.
Video Frames              Temporal Dense Caption
     ↓                              ↓
 VAE Encoder                       LLM
     ↓                         ↙        ↘
v_1..v_m | t_1..t_n    ...    v_1..v_m | t_1..t_n
      Shot 1                        Shot N
                      ↓
         ┌──────────────────────┐
         │  Spatial DiT Block   │
   N x   │          ↓           │
         │  Temporal DiT Block  │
         └──────────────────────┘
                      ↓
(Figure 2 transcribed in ASCII)
Above figure shows the architecture at high level. In this diagram there is a concept of “shot.“ I think the next section will probably describe them in more detail, but it looks like a new grouping that combines set of frames into one.
Also in Figure 2:
 v_1..v_m | t_1..t_n           v_1..v_m
          ↓                        ↓
┌───────────────────┐    ┌───────────────────┐
│         ↓         │    │         ↓         │
│      MM-RoPE      │    │       RoPE        │
│         ↓         │    │         ↓         │
│ Spatial Attention │    │  Spatial-temporal │
│     ↓       ↓     │    │     Attention     │
│  Vision    Text   │    │         ↓         │
│   MLP      MLP    │    │    Vision MLP     │
│     ↓       ↓     │    │         ↓         │
└───────────────────┘    └───────────────────┘
  Spatial DiT Block       Temporal DiT Block
There are two kinds of DiT blocks—spatial and temporal.
Spatial DiT block uses MM-RoPE (x, y, t + text channel tag) and looks within one latent frame. Output is then funneled into Vision MLP and Text MLP.
So, besides the positional embedding, there is an embedding that tells whether something is a visual token or a text token. And before the output of the spatial attention is passed to the Vision and Text MLPs, it is sliced back into visual tokens and text tokens, which are then passed to their corresponding MLPs.
Now, Temporal DiT block uses RoPE embedding (x, y, t). It looks across frames at the same windowed area.
In short, Spatial is flat and wide, and Temporal is deep with smaller receptive field.
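Here is how I picture the decoupling plus the window partition, as a shape-shuffling sketch. It reuses one attention module purely for illustration; the real spatial and temporal layers of course have their own weights, and the window size is my guess.

import torch
import torch.nn as nn

B, T, H, W, D = 1, 4, 16, 16, 128
x = torch.randn(B, T, H, W, D)
attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

# Spatial layer: fold time into the batch, so attention never sees other frames.
xs = x.reshape(B * T, H * W, D)
xs, _ = attn(xs, xs, xs)

# Temporal layer: partition each frame into spatial windows; tokens inside one
# window attend across ALL frames (global receptive field along time).
wh, ww = 4, 4                                   # hypothetical window size
xw = x.reshape(B, T, H // wh, wh, W // ww, ww, D)
xw = xw.permute(0, 2, 4, 1, 3, 5, 6)            # (B, nWh, nWw, T, wh, ww, D)
xw = xw.reshape(B * (H // wh) * (W // ww), T * wh * ww, D)
xw, _ = attn(xw, xw, xw)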
MMDiT Architecture. For the transformer blocks, we follow the MMDiT design in Stable Diffusion 3, where a multi-modality self-attention layer is applied exclusively in spatial layers to integrate both the visual and textual tokens, whereas a self-attention layer only processes the visual tokens in temporal layers.
It again highlights the fact that the cross attention between visual and text tokens happens in the spatial layers only.
Considering the semantic differences between visual and textual tokens, we use two separate sets of weights including adaptive layer norm, QKV projection, and MLP, for the two modalities in spatial layers.
So, meaning layer normalization for text and visual tokens happens separately, and the QKV projections and MLPs as well.
To prevent training instability, the Q and K embeddings are normalized prior to the attention matrix calculations
Seems like a standard practice nowadays.
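A minimal sketch of the dual-stream idea plus QK-norm, loosely following the public MMDiT design rather than Seedance’s actual code; all dimensions are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamAttention(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.h, self.dh = heads, d // heads
        self.qkv_vis = nn.Linear(d, 3 * d)      # separate projection weights
        self.qkv_txt = nn.Linear(d, 3 * d)      # for each modality
        self.q_norm = nn.LayerNorm(self.dh)     # Q/K normalized before the
        self.k_norm = nn.LayerNorm(self.dh)     # attention matrix, for stability

    def forward(self, vis, txt):
        m = vis.shape[1]                        # number of visual tokens
        qkv = torch.cat([self.qkv_vis(vis), self.qkv_txt(txt)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)
        split = lambda t: t.view(t.shape[0], t.shape[1], self.h, self.dh).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        q, k = self.q_norm(q), self.k_norm(k)
        out = F.scaled_dot_product_attention(q, k, v)   # one joint vision+text attention
        out = out.transpose(1, 2).reshape(vis.shape[0], -1, self.h * self.dh)
        return out[:, :m], out[:, m:]           # route back to separate vision/text MLPs

block = DualStreamAttention()
vis_out, txt_out = block(torch.randn(1, 64, 256), torch.randn(1, 16, 256))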
Multishot MM-RoPE. In this paper, in addition to using 3D RoPE encoding for visual tokens, following Seaweed and LCT, we add 3D Multi-modal RoPE (MM-RoPE) in the concatenated sequences by adding extra 1D positional encoding for textual tokens.
Mostly describing MM-RoPE. That it includes positional embedding for textual tokens.
The MM-RoPE also supports interleaved sequences of visual tokens and textual tokens, and can be extended to training video with multiple shots, where shots are organized in the temporal order of actions and each shot has its own detailed caption.
Sounds like they are using MM-RoPE’s ability to interleave visual tokens and textual tokens to support multi-shot scenarios. There will be more information on this later.
Unified Task Formulation.
The “unified“ here is basically saying that one model can do text-to-video and image-to-video.
To enable conditional video generation, we concatenate the noisy inputs with cleaned or zero-padded frames along the channel dimensions, and use binary masks to indicate which frames are instructions to follow.
“noisy inputs” is going to be the latent noise that is passed at the beginning of the diffusion process in T2V.
“zero-padded frames along the channel dimensions”: my guess is that frame positions without a conditioning image (everything except the given first frame in I2V, or all frames in plain T2V) are filled with zeros, so the concatenated conditioning channels always have the same shape.
“binary masks” are used to indicate which frames are instructions. Probably means those frames are kept fixed as the condition.
With this formulation, we can further unify different generation tasks such as text-to-image, text-to-video and image-to-video. During the training process, we mix these tasks and adjust the proportion by controlling the conditional inputs.
So, from reading this, it looks like they have the ability to provide conditioning frames at any frame locations. But in web UIs, there is only an option to provide the 1st frame. Perhaps there are other providers that allow locking other frames.
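My reading of the unified formulation, as a sketch (shapes hypothetical): noisy latents are concatenated along channels with either zero-padded frames (T2V) or clean conditioning frames (I2V), plus a binary mask marking which frames are “instructions to follow.”

import torch

T, H, W, C = 9, 30, 52, 48
noisy = torch.randn(T, H, W, C)

cond = torch.zeros(T, H, W, C)       # zero-padded everywhere by default (T2V case)
mask = torch.zeros(T, H, W, 1)

first_frame_latent = torch.randn(1, H, W, C)   # from the VAE, in the I2V case
cond[:1] = first_frame_latent        # drop in the clean conditioning frame
mask[:1] = 1.0                       # mark it as a frame to follow, not to denoise

dit_input = torch.cat([noisy, cond, mask], dim=-1)   # channels: C + C + 1
print(dit_input.shape)               # (9, 30, 52, 97)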
2.3 Diffusion Refiner
Taking into account the training and inference efficiency, we employ a cascaded diffusion framework for high-resolution (HR) video generation.
Okay, so basically, it will generate low res version then upscale.
The base model generates 480p videos first, which are then upscaled to 720p or 1080p high-resolution videos through a learned diffusion refiner model to enhance visual details and textures.
Yup.
Refiner Model Training. To facilitate training, the diffusion refiner model is initialized from the pre-trained base model. Different from the base model, the diffusion refiner model is trained with conditioning on the low-resolution (LR) videos. Specifically, the LR video is upsampled to a high resolution first, then concatenated with the diffusion noise along the channel dimension to form the input of the diffusion transformer.
Seems standard.
2.4 Prompt Engineering (PE)
Finally, we are here. I wanted to learn how the prompts were formatted for Seedance 1 Pro. Here we are, let’s dive in. Heck, so many typos in this section LoL.
As described in Sec 3.2, texts used in DiT are in the form of dense video captions.
Ok, here we go: “dense video captions.” Hopefully it is not a can of worms.
Therefore, we need to employ a large language model to convert the user prompts into corresponding caption format.
Okay, so the captions are refined by LLM to fit a certain prompt format.
To achieve this, we initialized based on Qwen2.4-14B and employ two stages to implement high-quality Prompt Engineering (PE): Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).
SFT means there was a human involved in fine tuning this caption refinement model. And RL means there was some automated scoring mechanism and the network learned to get the best score.
Supervised Fine-Tuning. In the SFT stage, we synthesize a large amount of user prompts and their dense caption expressions by manual annotation.
So, they generate captions and have humans refine them.
We specially divide the image-to-video and text-to-video tasks, as they are different in user prompt styles.
This sounds reasonable, because if you had a first frame the prompt would be like “make it run,” whereas a T2V prompt would be like “there is a boy running.”
We then adopt a full fine-tuning strategy to train the model on the annotated data to acquire basic rephrasing ability.
Sounds like instead of doing LoRA-style fine-tuning, they are doing a full fine-tune. I mean, why not, if you’ve got the resources.
Reinforcement Learning. However, due to the presence of model hallucinations, the results of the first SFT stage cannot guarantee that the semantics of the rewritten results fully meet the requirements of the user prompts.
I mean, yeah, if you give it a short caption and have it generate a highly detailed dense caption that is a paragraph long, it sure would hallucinate.
If they want a more detailed version of the caption in a certain format, they should probably have gone for VLMs instead of fine tuning the caption refinement model. VLMs have visual understanding and can generate dense captions in any format. Perhaps that wasn’t available at the time.
Therefore, we carefully collect a dataset of pairs with correct and incorrect rephrased results to perform the Direct Preference Optimization (DPO) training.
Okay, seems like DPO needs correct and incorrect pairs to work properly.
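For reference, the standard DPO objective on those (correct, incorrect) rephrasing pairs looks roughly like this; a generic sketch, not their training code.

import torch
import torch.nn.functional as F

# logp_* are summed token log-probs of the chosen/rejected rewrite under the
# current policy and under a frozen reference copy of the SFT model.
def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_ratio = logp_chosen - ref_logp_chosen        # how far the policy moved
    rejected_ratio = logp_rejected - ref_logp_rejected  # relative to the reference
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# toy usage
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))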
After the above stages, our prompt engineering model has strong ability to understand user prompts and gives precise and high-quality rephrased results in video caption format, consistent with DiT training.
Good, but there is room for improvement, like using a VLM.
3 Data
The performance of video generation models is inextricably linked to the scale, diversity, and quality of the training data. While our broader training corpus incorporates both video and image datasets, with image data preparation following methodologies similar to Seedream, this section specifically details our rigorous approach to curating video data.
This is the first video diffusion paper I’m reading, and I didn’t realize that they use image data for training. Perhaps image data is used for training the spatial layers in isolation.
We develop a systematic data processing workflow, illustrated in figure 3, to transform vast, heterogeneous raw video collections into a refined, high-quality, diverse and safe dataset for training robust video generation models. This workflow is deployed as a robust, automated system optimized for high-throughput processing of massive data volumes.
This is the talk of the town lately: garbage in, garbage out. So it is critical to create a dataset that is well balanced with high-quality labels. This is also the “data moat” that differentiates one company from another.
3.1 Data Pre-Processing
At the heart of our video data curation is multi-stage pre-processing pipeline, designed to…
Diversity-Oriented Data Sourcing. Our video data acquisition strategy prioritizes ethically and legally sourced content from diverse public and licensed repositories.
“ethically”: This is reassuring. But, in my mind, data acquisition may be easier in mainland China than in places like the USA. China is experiencing a digital content explosion; there is a constant influx of quality data. Just my thought.
“diverse”: Diverse dataset makes sense.
We aim to maximize coverage across critical dimensions, including clip duration, resolution, subject matter (e.g. humans, animals, objects), scene types (e.g. natural landscapes, urban environments), subject actions, genres (e.g. documentary, animation), artistic styles, camera kinematics, and cinematographic techniques. Raw video collections exhibit significant heterogeneity and often contain undesirable elements, posing key challenges that our pipeline is designed to address.
So, here are the pieces to the puzzle. The labels contain information on:
clip duration
resolution
subject matter
scene types
subject actions
genres
artistic styles
camera kinematics
cinematographic techniques
These are in some way encoded into the natural language prompts.
This means that to use Seedance 1 Pro effectively, we would want to match the exact prompt format the model was trained on, with all of these aspects worded into the natural language prompt.
Shot-Aware Temporal Segmentation. Raw long-form videos are not suitable for direct model training. We employ automated shot boundary detection techniques by analyzing inter-frame visual dissimilarities or utilizing pre-trained detectors to identify natural scene transitions.
This concept of a “shot” is a welcome addition. Finally we have some leash we can use to control the video-generation beast.
Subsequently, videos are segmented into shorter clips, with a maximum duration of 12 seconds. Each resulting clip may contain one or multiple temporally coherent shots, preserving local narrative flow while ensuring manageable input lengths for model ingestion.
Got it.
Video Overlay Rectification. Many source videos contain extraneous visual overlays such as logos, watermarks, subtitles, or on-screen graphics that can introduce noise or bias. Our rectification stage identifies these occlusions using a hybrid approach of heuristic rule-based systems and specialized object detection models. Frames are then adaptively cropped to maximize the retention of the primary visual content, yielding cleaner and more focused video data.
Ok, so they crop out the watermarks and subtitles. It would probably be better to run a model over the frames to remove the subtitles instead.
Quality and Safety Filtering.
Ethical filtering stuff.
Distribution Rebalancing. Raw data often exhibits significant category imbalance across various attributes. We analyze the dataset’s distribution along these dimensions by quantifying frequencies across attributes tailored to different semantic and technical perspectives, such as subject categories, scene types, dominant actions, genres, visual styles, clip duration, resolution, and motion characteristics.
Data rebalancing is done along the following dimensions:
subject categories
scene types
dominant actions
genres
visual styles
clip duration
resolution
motion characteristics
For over-represented head categories, downsampling is applied. Conversely, for under-represented tail categories, we increase their sampling probability during training and initiate targeted data acquisition to augment their presence, aiming for a more equitable and comprehensive representation of the visual world.
Looks like they use the “sampling probability” to over-expose certain dataset categories, and they also use it to guide the data acquisition team to acquire more data.
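A toy illustration of rebalancing via sampling probability (my own, not the actual pipeline): head categories get down-weighted, tail categories get up-weighted, using inverse-frequency weights with a tempering exponent.

import random
from collections import Counter

clips = ["urban"] * 800 + ["landscape"] * 150 + ["underwater"] * 50   # toy dataset
freq = Counter(clips)

alpha = 0.5   # 1.0 = fully uniform across categories, 0.0 = keep the raw distribution
weights = [1.0 / (freq[c] ** alpha) for c in clips]

resampled = random.choices(clips, weights=weights, k=1000)
print(Counter(resampled))   # tail categories now appear far more often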
3.2 Video Captioning
…
Caption Style. We adopt a dense caption style integrating dynamic and static features. For dynamic features, we meticulously describe actions and camera movements of video clip, highlighting changing elements. For static features, we elaborate on the characteristics of core characters or scenes in the video.
There is a division of dynamic and static features. Wonder how it is divided in the captions.
Caption Elements. We define specific categories for dynamic and static features respectively. Dynamic features cover categories of motions, subjects or scenes changing and camera movements, while static features include appearances, aesthetics, styles, etc.
So:
Dynamic elements
categories of motions
subjects
scenes changing
camera movements
Static elements
appearances
aesthetics
styles
We collect diverse data on such categories and conduct high-quality manual annotations for training. The trained caption model can accurately describe the critical content of complex and abstract video materials.
It’s probably this meticulous “manual” process that makes all the difference.
Model Training. We train the caption model on the annotated data with Tarsier2, a model with strong video understanding capabilities. The visual encoder is frozen and the language model is fully fine-tuned. We train on both Chinese and English data to acquire bilingual capabilities.
Oh, okay. So they use this data to train a video captioning model. That makes sense.
I also like the fact that it is bilingual. There are certain things that are more natively described in a particular language. There is also the “interlingual representation” effect, which is the idea that when a model is trained on two or more languages it forms a shared semantic space that lets knowledge hop the linguistic fence and cross-pollinate concepts. Think of it as giving the model a second set of mental Lego bricks, so it can swap colours when one language’s parts don’t quite click. Of course, at the cost of higher memory requirements.
During inference, we use our PE model described in Sec 2.4 to rephrase user prompts into detailed video captions, in which the format is aligned with the training captions in content and structure.
It is a little too vague. I wish they just told us the prompt format and gave us some examples.
3.3 Efficient Engineering Infrastructure
Engineering Infrastructure Overview. Our engineering infrastructure for data processing is illustrated in figure 4, which consists of three layers…
For the computation framework, they use BMF and Ray (open source) and Babit (internal), which optimizes resource allocation.
The underlying resource layer uses ByteCloud (internal) and Volcengine (external).
Efficient Heterogeneous Computing. To maximize resource utilization, our frameworks dynamically allocate video operations to optimal hardware (e.g., CPU for decoding, GPU for deep model inference). Asynchronous communication between computing units is used to mitigate bottlenecks introduced by the performance gaps between different types of computation hardware. To address the complexities arising from the instability of elastic computation resources, our frameworks incorporate two critical capabilities: adaptive auto-scaling to handle resource fluctuations and failure retry mechanisms for preempted tasks. Customized versions of BMF and Ray implement these optimizations, delivering near-linear scalability and extremely high throughput to efficiently process massive-scale video training data.
Got it. Let’s move on.
4 Model Training
…
4.1 Pre-Training
Diffusion Scheduling. During training, we employ the flow matching framework with velocity prediction, and a training timestep is sampled from a logit-normal distribution. Considering that videos with higher resolutions and longer duration require more noise to disrupt their signal, we then transform the training timestep with a resolution-aware shift, which increases the noise perturbation for videos with higher resolution and longer duration.
Can’t say I understand this to a satisfactory degree. My rough reading is sketched below, then moving on.
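If I squint, it looks like the timestep-shift trick popularized by SD3-style flow matching: sample t from a logit-normal, then shift it toward the high-noise end more aggressively for bigger or longer videos. A sketch of that interpretation; the exact formula Seedance uses may differ.

import torch

def sample_timesteps(batch, shift=3.0, mean=0.0, std=1.0):
    t = torch.sigmoid(torch.randn(batch) * std + mean)   # logit-normal on (0, 1)
    return shift * t / (1.0 + (shift - 1.0) * t)          # push mass toward the noisy end

# A higher-resolution / longer clip would use a larger `shift`, e.g.:
t_480p = sample_timesteps(4, shift=3.0)
t_1080p = sample_timesteps(4, shift=6.0)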
Progressive Training. To enable higher data throughput and training efficiency, we initialize the model with sufficient low-resolution text-to-image (256px) training
So, initial phase just trains a text-to-image diffusion model.
and then progressively introduce video modalities with higher resolution and higher fps in the following stages: (1) We conduct image-video joint training using 256px images and video clips from 3 to 12 seconds (12fps). (2) In the second stage, we increase the training resolution to 640px while maintaining the same duration. (3) In the final stage, we train the model with 24fps video to further improve the video smoothness. During video pre-training, we also retain a small portion of the text-to-image task to maintain semantic alignment and set the proportion of the image-to-video task to 20% to activate the ability to follow visual prompts.
Basically, the model is in constant exercise for the fundamentals (spatial) while training for the grand finale (temporal).
4.2 Continue Training (CT)
As the image-to-video task constitutes only a small fraction of pre-training, the model’s potential in this area remains underexplored. To address this, we introduce the Continue Training (CT) phase focused on strengthening image-to-video generation after pre-training. In this phase, we increase the image-to-video ratio from 20% to 40% and further refine the training dataset to improve overall multitask performance.
I see. So CT is done to improve the image-to-video task. Focusing on image-to-video makes sense because the vast majority of videos will be generated with the help of an image; it is the main use case for video-gen AI models.
High-Quality Data Selection. We select a subset of the pre-training data with higher aesthetic quality and richer motion dynamics by using a series of specialized evaluation models, including an aesthetic scorer and motion evaluators based on optical flow.
Good to know.
Since the first frame is always provided in the image-to-video task, we design two types of caption for training: (1) original long captions with detailed descriptions of both dynamic and static content, and (2) short captions that focus solely on motion dynamics by removing the static description corresponding to the first frame. This encourages stronger semantic alignment with the training objective.
The moral of the story is that you need to provide either (a) a detailed prompt with all the static features or (b) a short prompt that focuses only on the motion.
Training Strategy. During continued training, we use slightly fewer GPUs than in the pre-training stage, while maintaining an annealed learning rate schedule. The richer motion dynamics and diverse captions enable the model to generate more natural and smoother videos. Furthermore, the higher aesthetic quality of the training data leads to significant improvements in the visual fidelity of text-to-video generations. As a result, the final model supports both text-to-video and image-to-video tasks with enhanced overall performance.
Good.
4.3 Supervised Fine-Tuning (SFT)
Following CT, we perform supervised fine-tuning (SFT) to further align the model’s output with human preferences regarding visual quality and motion coherence. During this phase, the model trains on a carefully curated set of high-quality video-text pairs with manually verified captions, allowing it to generate videos with improved aesthetics and more consistent motion dynamics.
So, if CT’s purpose was to support image-to-video, this stage’s job is to improve the overall quality by fine-tuning on a subset of data with extra high quality.
Human-Curated Dataset. Ensuring data quality and distributional balance is essential. To achieve this, we define several hundred categories based on visual style, motion type, and other key attributes. We then collect data in a targeted manner within each category, resulting in a curated dataset of high-quality video samples with accurate and meaningful captions.
Well said. Even in fine-tuning, data balance is really important; otherwise you skew the model in one direction.
Model Merging. To fully leverage high-quality data, we train separate models on curated subsets designed to capture a wide range of styles, motions, and scenarios. The resulting models are subsequently merged into a single model that integrates their respective strengths.
Hmm interesting…
So, they branch the model out into multiple models by fine-tuning each on a separate dataset, making each branched model a specialist.
Then they merge them… How are they merging those models into one? In an MoE way, or just weight averaging?
Each model is trained with a smaller learning rate than in pre-training and utilizes a limited number of GPUs. Moreover, we apply early stopping at an effective point to prevent overfitting and maintain text controllability. The final merging step significantly improves both visual fidelity and motion quality.
Still not telling us how you merged the models.
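The simplest scheme that fits the description is plain (weighted) parameter averaging of the specialist checkpoints. That is purely my guess; a sketch:

import torch

def merge_checkpoints(state_dicts, weights=None):
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# hypothetical usage:
# specialists = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
# merged = merge_checkpoints(specialists, weights=[0.4, 0.3, 0.3])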
4.4 Human Feedback Alignment (RLHF)
4.4.1 Feedback Data Infrastructure
We collect prompts from training datasets and online users, and perform data balancing and information filtering on prompts to discard duplicate and ambiguous ones. We collect high-quality video data pairs for human preference labeling, including synthetic videos generated by different stages of our model. Experimental results demonstrate that the incorporation of multiple source visual materials can further enhance the domain capacity of the RM model, expand the preference upper bound of the RM, and strengthen generalization capabilities. We adopt a multi-dimensional annotation approach in the labeling process, i.e., selecting the best and worst videos under a specific labeling dimension while ensuring that the best videos are not inferior to the worst ones in other dimensions.
Take-aways:
Prompt pool: Harvested from live users. This gives them real-world prompts.
Preference pairs of videos: For each prompt they show labelers two clips and ask “which one is better on this axis?” (sharpness, motion, prompt-match, etc.).
Multi-dimensional annotations: labelers pick best & worst for one dimension while making sure the “best” isn’t secretly awful on the others.
4.4.2 Reward Model
To comprehensively enhance model performance, we design a sophisticated reward system comprising three specialized reward models: Foundational Reward Model, Motion Reward Model, and Aesthetic Reward Model.
Okay, so they have three rewards models:
Foundational reward
motion reward
aesthetic reward
Foundational reward model focuses on enhancing fundamental model capabilities, such as image-text alignment and structural stability. We employ a Vision-Language Model as the architecture of this reward model.
VLM is used for this foundational reward model.
Motion reward model helps to mitigate video artifacts while enhancing motion amplitude and vividness.
Motion reward model is used for rewarding better motions.
Given that video aesthetics primarily derive from keyframes, we design the aesthetic reward model from image-space input inspired by Seedream
Aesthetic reward model looks at the image space to reward better keyframe.
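The paper doesn’t spell out how the three rewards are combined. A common recipe is a weighted sum of the scores, maximized during feedback learning. Everything in this sketch (function names, weights, shapes) is a placeholder, not Seedance’s reward system.

import torch

def total_reward(video, prompt, foundational_rm, motion_rm, aesthetic_rm,
                 w=(1.0, 0.5, 0.5)):
    r_found = foundational_rm(video, prompt)   # VLM-based: text alignment, structure
    r_motion = motion_rm(video)                # motion amplitude / artifact score
    r_aes = aesthetic_rm(video[:, 0])          # keyframe-based aesthetic score
    return w[0] * r_found + w[1] * r_motion + w[2] * r_aes

# Feedback learning would then minimize  -total_reward(generated_video, prompt).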
4.4.3 Base Model Feedback Learning
…
4.4.4 Super-Resolution RLHF Framework
…
Skipping these sections.
5 Inference Optimizations
5.1 Model Acceleration
Key takeaways:
Two-stage distillation pipeline super-charges DiT
TSCD (Trajectory-Segmented Consistency Distillation) cuts denoising steps 4× by forcing the student to match the teacher on several mid-trajectory checkpoints—keeps quality while slashing compute.
RayFlow score distillation fine-aligns the student’s noise predictions sample-by-sample, so fast samplers stay stable even at very low NFEs and show fewer splashy artifacts.
Human-guided adversarial headroom
- They reuse APT-style adversarial training but add a human-preference discriminator; this steers the aggressively distilled model back toward outputs people actually like (no crunchy edges, no melting faces).
Result: the distilled DiT matches the original on four expert metrics—prompt match, motion smoothness, frame fidelity, and source-image consistency—while running 4× faster.
“Thin” VAE decoder halves pixel-space bottleneck
Profiling showed most latency lives in the last, wide convolution blocks of the decoder.
Shrinking those channel widths (encoder left untouched) + quick retrain yields 2× VAE speed-up with no noticeable image-quality hit.
Net gain: end-to-end video generation runs roughly 8× faster (4× from DiT, 2× from VAE) yet keeps near-teacher visual and motion quality—crucial for real-time or cloud-cost-sensitive deployment.
Overall 8x gain. The VAE decoder optimization sounds good.
5.2 Inference Infrastructure
High-Performance Kernel. Extensive kernel fusion efforts have been conducted on the model’s core modules, resulting in a cumulative 15% improvement in the model’s inference throughput.
Kernel fusion → +15% throughput
Quantization and Sparse. Building on the Seedream technical solution, … We have achieved an optimal balance between performance and efficiency.
Way over my head.
Async Offloading Strategy. Due to the extensive computational demands of attention coupled with the large model size, we developed an automated and adaptive AsyncOffloading strategy. We successfully solved the large-model deployment problem on memory-limited devices with a performance drop of less than 2%.
Yup, way over my head.
Hybrid Parallelism for Distributed VAE. Moreover, to address the issue of high GPU memory consumption due to the VAE-Decoder, we implemented an adaptive hybrid parallel strategy. This method partitions the input data along the spatial and temporal dimensions simultaneously and employs efficient collective communication for Conv3D computation. Thus, we further improved parallel scaling performance.
Skip
Pipeline Optimizations. We adopted kernel fusion, quantization, parallelization, continuous batching, prefix caching, and other common techniques to improve the overall throughput of the prompt engineering effectively. Furthermore, to tackle the issue of low encoding efficiency in long videos, we have implemented video encoding acceleration solutions.
Okay, so they even had to build video encoding acceleration solutions. Crazy…
6 Training Infrastructure
6.1 Pre-Training Optimization
To support efficient large-scale pre-training of long-context video models on thousands of GPUs, we have designed a highly optimized training infrastructure. Our system focuses on maximizing hardware efficiency, scalability, and robustness. It integrates high-performance kernel fusion, a hybrid parallelism strategy, multi-level activation checkpointing (MLAC), runtime-aware workload balancing, and multi-level fault tolerance. These components work together to ensure stable, high-throughput training under diverse workloads and hardware scales…
Thousands of GPUs.
Kernel Fusion
Parallelism
Multi-level Activation Checkpointing
Workload balancing
Multi-level fault tolerance
Okay.
6.2 Post-Training Optimization
Skip
7 Model Performance
Skip
7.2.1 SeedVideoBench 1.0
To comprehensively evaluate video generation models across diverse scenarios, we proposed SeedVideoBench-1.0, a benchmark designed through systematic analysis of real-world user prompts. This benchmark encompasses a wide range of application scenarios, including special effects, e-commerce, and professional-generated content (PGC). Additionally, a detailed taxonomy has been developed to assess model capabilities. The following section demonstrates the classification of main label categories, using text-to-video as an example.
It is a new benchmark.
Subject
Subject Description
Action Description
Camera
Aesthetic Description
I wish to check this out in the future.
7.2.2 Video Evaluation Metrics
In collaboration with film directors, we developed a set of specialized evaluation metrics for generated videos, enabling assessment from a professional perspective.
Gosh, they even hired film directors…
Motion Quality
Prompt Following
Aesthetic Quality
Preservation
7.2.3 Human Evaluation
Skip
7.3 Multi-Shot Generation
Seedance 1.0 demonstrates the capability to generate multiple consecutive shots from a single prompt, while ensuring subject continuity and stylistic coherence across frames. This enables the model to handle complex narrative techniques commonly used in cinematic storytelling. Specifically, Seedance 1.0 facilitates the construction of shot-reverse shot sequences for dialogic interaction, as well as the use of cut-in and cut-away shots to enrich narrative pacing and contextual layering. Furthermore, it supports match cuts and action cuts, enabling seamless transitions and preserving visual continuity. These competencies highlight Seedance’s proficiency in cinematic shot composition and temporal coherence, offering enhanced creative control and narrative expressiveness for video content generation. Figure 14 presents an example of continuous shot transitions generated by Seedance 1.0, which exhibits more coherent and fluid cinematic storytelling compared to other models.
cuts, cuts and cuts. Perhaps this idea came from those film directors.
Some example prompts:
Prompt 1:
Original (ZH):
从观众视角切入场内灯光聚焦,转为拳手挥拳慢动作,接着是对手反应的超近特写,最后切到裁判吹哨一刻的动静对比。
English:
Cut in from the crowd’s viewpoint as the spotlights converge in the ring → switch to the boxer throwing punches in slow-motion → ultra-close-up of the opponent’s reaction → final cut to the referee blowing the whistle, contrasting motion and stillness.
Prompt 2:
Original (ZH):
镜头从废墟中前行机器人脚步切入,切至头部光学镜头扫描特写,再转为它视角中的城市轮廓扫描图像,最终定格在墙上涂鸦‘HUMANITY?’。
English:
Start with the robot’s footsteps moving through ruins → cut to a close-up of its optical lens scanning → switch to its POV scan of the city skyline → end on a freeze-frame of graffiti on a wall reading “HUMANITY?”.
Prompt 3:
Original (ZH):
清晨,一位少年骑着自行车穿过欧洲老城区。镜头从石板路上的车轮切入,切换为街边行人的仰视视角,再是他穿梭巷道的第三人称航拍视角,最后以他驶入阳光洒满的广场作为结束画面。
English:
Early morning: a teenage boy cycles through an old European town. Cut in on the bicycle wheel rolling over cobblestones → switch to a low-angle pedestrian view from the sidewalk → then a third-person aerial shot as he weaves through an alley → finish with him riding into a sun-drenched plaza.
It is good to see some example prompts. However, I wish there were specialized tokens to describe “<Cut>” or “<Shot>,” so that we could use them in the prompts. Perhaps there are, but I haven’t seen them yet.
7.4 Multi-Style Alignment
Seedance 1.0 exhibits strong generalization across a broad spectrum of visual styles. In text-to-video (T2V) tasks, Seedance 1.0 enables direct generation of fine-grained stylistic videos, while in image-to-video (I2V) tasks, it reliably preserves visual characteristics of the reference image. The model supports a wide range of real-world cinematic styles, including black-and-white silent films, classic Hong Kong cinema, and retro Hollywood aesthetics, as well as animated and fantasy-oriented styles such as Japanese anime, cyberpunk futurism, and ink-wash animation. This multi-style adaptability facilitates seamless transitions between realism and fantasy without the need for extensive task-specific tuning. As a result, Seedance 1.0 offers exceptional versatility and controllability, making it well-suited for professional filmmaking and AIGC creation.
AIGC=”AI-generated content”
8 Conclusion
We have introduced Seedance 1.0, a native bilingual video generation foundation model that unifies multiple generation paradigms (such as text-to-video and image-to-video) and excels in instruction following, motion stability, and visual quality. We presented our technical improvements in dataset construction, efficient architecture design with training paradigm, post-training optimization, and inference acceleration, which are integrated effectively to achieve a high-performance model with fast inference. It demonstrates excellent capabilities in handling complex scenarios, multi-shot generation, and long-range temporal coherence, all while delivering fast and photorealistic generation experiences.
F*, we are here. Finally reached the end. This was an engaging read. They really knocked it out of the park.
Lessons Learned
Seedance 1 Pro supports “shots.“
It speaks Chinese and English. I should try it out in Chinese as well.
Crazy amount of manual labor went in.
Crazy amount of technical optimizations went in.
It does not reveal much about dataset size.
Each prompt should contain
Dynamic elements
categories of motions
subjects
scenes changing
camera movements
and optionally static elements
appearances
aesthetics
styles
Testing out Multi-Shot Prompts
Playing around with multi-shot prompts, starting with a prompt excerpt from the paper.
Prompt: 从观众视角切入场内灯光聚焦,转为拳手挥拳慢动作,接着是对手反应的超近特写,最后切到裁判吹哨一刻的动静对比。
Seed: 42
Result: Only shows 3 cuts out of 4.
If you translate it into English via ChatGPT and keep the same seed:
Prompt: Cut in from the crowd’s viewpoint as the spotlights converge in the ring → switch to the boxer throwing punches in slow-motion → ultra-close-up of the opponent’s reaction → final cut to the referee blowing the whistle, contrasting motion and stillness.
Seed: 42
Result: Only shows 3 cuts out of 4.
Alternate Prompt Format that is more structured. I conjured it up.
Prompt:
Shot 1: Crowd’s-eye view as the arena lights converge on the ring.
Shot 2: Slow-motion of the boxer throwing punches.
Shot 3: Ultra close-up of the opponent’s reaction.
Shot 4: Cut to the referee blowing the whistle—contrast between motion and stillness.
Seed: 42
Result: Shows all 4 cuts. Seems like being explicit is better.
Let’s try being even more explicit:
Prompt:
<SHOT> Crowd’s-eye view as the arena lights converge on the ring. <CUT> <SHOT> Slow-motion of the boxer throwing punches. <CUT> <SHOT> Ultra close-up of the opponent’s reaction. <CUT> <SHOT> Cut to the referee blowing the whistle—contrast between motion and stillness. <CUT>
Seed: 42
Result: 4 out of 4. Good.
Probably “<CUT>“ is unnecessary because “<SHOT>“ is already there.
Prompt:
<SHOT> Crowd’s-eye view as the arena lights converge on the ring. <SHOT> Slow-motion of the boxer throwing punches. <SHOT> Ultra close-up of the opponent’s reaction. <SHOT> Cut to the referee blowing the whistle—contrast between motion and stillness.
Seed: 42
Result: It generated only 2 shots. Maybe the keyword is “CUT”?
Prompt:
Crowd’s-eye view as the arena lights converge on the ring. <CUT> Slow-motion of the boxer throwing punches. <CUT> Ultra close-up of the opponent’s reaction. <CUT> Cut to the referee blowing the whistle—contrast between motion and stillness.
Seed: 42
It does give 4 cuts out of 4, but it is interesting that the video looks so different with the same seed value when these keywords are used.
Different variations work but I would stick to something explicit like:
[SHOT 1] Crowd’s-eye view as the arena lights converge on the ring. [CUT]
[SHOT 2] Slow-motion of the boxer throwing punches. [CUT]
[SHOT 3] Ultra close-up of the opponent’s reaction. [CUT]
[SHOT 4] Cut to the referee blowing the whistle—contrast between motion and stillness.
Result: Nice clean 4 cuts.
TL;DR: Overall, multi-shot prompting is a life-saver when it comes to generating videos.
Macro motion?
However, I see an issue unrelated to the multi-shot feature where the micro motions sometimes repeat themselves.
Motions are amazing at a micro scale, but when looking at the motion over a longer duration, they kinda repeat themselves or kinda “yank” at motion boundaries where the local kinetic energy is minimal. My suspicion is that this is because of the model architecture, where the temporal attention does not see the text tokens.
Anyhow, that’s it. In next post, I will design a prompt for SpriteDX with the learnings from this paper.
—Sprited Dev 🌱