Skip to content

add ltx2 vae in sana-video;#13229

Merged
yiyixuxu merged 12 commits into huggingface:main from
lawrence-cj:sana-video-ltx2vae
Mar 18, 2026
Merged

add ltx2 vae in sana-video;#13229
yiyixuxu merged 12 commits into huggingface:main from
lawrence-cj:sana-video-ltx2vae

Conversation

@lawrence-cj
Copy link
Contributor

@lawrence-cj lawrence-cj commented Mar 9, 2026

This PR adds LTX-VAE support for SANA-Video.

Cc: @dg845 @sayakpaul

GPU memory needed: 47GB for LTX refiner

SANA-Video with LTX2-Refiner:

"""Sana Video + LTX2 Refiner: Stage 1 generate latent → Stage 2 refine (3 steps)."""

import gc
import torch
from diffusers import SanaVideoPipeline, FlowMatchEulerDiscreteScheduler
from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.pipelines.ltx2.export_utils import encode_video

device = "cuda"
dtype = torch.bfloat16
prompt = "A cat walking on the grass, facing the camera."
negative_prompt = "A chaotic sequence with misshapen, deformed limbs in heavy motion blur, sudden disappearance, jump cuts, jerky movements, rapid shot changes, frames out of sync, inconsistent character shapes, temporal artifacts, jitter, and ghosting effects, creating a disorienting visual experience."
motion_score = 30
height, width, frames, frame_rate = 704, 1280, 81, 16.0
seed = 42

# ── Load all models ──
sana_pipe = SanaVideoPipeline.from_pretrained(
    "Efficient-Large-Model/SANA-Video_2B_720p_diffusers", torch_dtype=dtype,
)
sana_pipe.text_encoder.to(dtype)
sana_pipe.enable_model_cpu_offload()

ltx_pipe = LTX2Pipeline.from_pretrained("Lightricks/LTX-2", torch_dtype=dtype)
ltx_pipe.load_lora_weights(
    "Lightricks/LTX-2", adapter_name="stage_2_distilled",
    weight_name="ltx-2-19b-distilled-lora-384.safetensors",
)
ltx_pipe.set_adapters("stage_2_distilled", 1.0)
ltx_pipe.vae.enable_tiling()
ltx_pipe.enable_model_cpu_offload()

latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
    "Lightricks/LTX-2", subfolder="latent_upsampler", torch_dtype=dtype,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=ltx_pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)

# ── Stage 1: Sana Video ──
video_latent = sana_pipe(
    prompt=prompt + f" motion score: {motion_score}.", negative_prompt=negative_prompt,
    height=height, width=width, frames=frames,
    guidance_scale=6.0, num_inference_steps=50,
    generator=torch.Generator(device=device).manual_seed(seed),
    output_type="latent", return_dict=True,
).frames

del sana_pipe; gc.collect(); torch.cuda.empty_cache()

# ── Stage 1.5: Latent Upsample (2x spatial) ──
video_latent = upsample_pipe(
    latents=video_latent.to(device=device, dtype=dtype),
    latents_normalized=True,
    height=height, width=width, num_frames=frames,
    output_type="latent", return_dict=False,
)[0]
latents_mean = ltx_pipe.vae.latents_mean.view(1, -1, 1, 1, 1).to(video_latent.device, video_latent.dtype)
latents_std = ltx_pipe.vae.latents_std.view(1, -1, 1, 1, 1).to(video_latent.device, video_latent.dtype)
video_latent = (video_latent - latents_mean) * ltx_pipe.vae.config.scaling_factor / latents_std

# ── Stage 2: LTX2 Refine ──
packed = LTX2Pipeline._pack_latents(
    video_latent.to(device=device, dtype=dtype),
    patch_size=ltx_pipe.transformer_spatial_patch_size,
    patch_size_t=ltx_pipe.transformer_temporal_patch_size,
)
_, _, lF, lH, lW = video_latent.shape
pH, pW, pT = (
    lH * ltx_pipe.vae_spatial_compression_ratio,
    lW * ltx_pipe.vae_spatial_compression_ratio,
    (lF - 1) * ltx_pipe.vae_temporal_compression_ratio + 1,
)

del video_latent
gc.collect()
torch.cuda.empty_cache()

video, _audio = ltx_pipe(
    latents=packed,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=pH,
    width=pW,
    num_frames=pT,
    num_inference_steps=3,
    noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0],
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    guidance_scale=1.0,
    frame_rate=frame_rate,
    generator=torch.Generator(device=device).manual_seed(seed),
    output_type="np",
    return_dict=False,
)

video = torch.from_numpy((video * 255).round().astype("uint8"))
encode_video(
    video[0],
    fps=frame_rate,
    audio=None,
    audio_sample_rate=None,
    output_path=args.output_path,
)

Result

sana_ltx_refined.mp4

@sayakpaul
Copy link
Member

@lawrence-cj thanks for the PR! Could you also provide some sample outputs?

Comment on lines +226 to +244
if getattr(self, "vae", None):
if hasattr(self.vae.config, "scale_factor_temporal"):
self.vae_scale_factor_temporal = self.vae.config.scale_factor_temporal
elif hasattr(self.vae.config, "temporal_compression_ratio"):
# LTX2 VAE uses temporal_compression_ratio
self.vae_scale_factor_temporal = self.vae.config.temporal_compression_ratio
else:
self.vae_scale_factor_temporal = getattr(self.vae, "temporal_compression_ratio", 4)

if hasattr(self.vae.config, "scale_factor_spatial"):
self.vae_scale_factor_spatial = self.vae.config.scale_factor_spatial
elif hasattr(self.vae.config, "spatial_compression_ratio"):
# LTX2 VAE uses spatial_compression_ratio
self.vae_scale_factor_spatial = self.vae.config.spatial_compression_ratio
else:
self.vae_scale_factor_spatial = getattr(self.vae, "spatial_compression_ratio", 8)
else:
self.vae_scale_factor_temporal = 4
self.vae_scale_factor_spatial = 8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, should this be conditioned on the class type of the VAE being used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix it.

Copy link
Member

@sayakpaul sayakpaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I just left one comment. But it looks good to me.

@sayakpaul sayakpaul requested a review from dg845 March 10, 2026 03:19
@lawrence-cj
Copy link
Contributor Author

lawrence-cj commented Mar 10, 2026

Could you also provide some sample outputs?

Updated code and result.

@sayakpaul @dg845

Copy link
Collaborator

@dg845 dg845 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! The code looks good to me. However, running the example script doesn't work for me because I don't have access to the Sana_video/safetensors/sana_ltxvae_sft checkpoint. Would it be possible to provide a checkpoint for testing?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lawrence-cj
Copy link
Contributor Author

Hi @dg845 , the repo-id is updated: Efficient-Large-Model/SANA-Video_2B_720p_diffusers

Copy link
Collaborator

@yiyixuxu yiyixuxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR!
I think the code would be clear if we just add a new code path based on the type of vae

if getattr(self, "vae", None):
     if isinstance(self.vae, AutoencoderKLLTX2Vide):
        self.vae_scale_factor_temporal = self.vae.config.temporal_compression_ratio
        self.vae_scale_factor_spatial = self.vae.config.spatial_compression_ratio
    elif isinstance(self.vae,  (AutoencoderDC, AutoencoderKLWan):
       # current code

similar for the latents_mean/latents_std

Copy link
Collaborator

@dg845 dg845 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should update the vae type annotation in SanaVideoPipeline.__init__ to reflect the fact that LTX-2 VAEs are now supported (this will also fix a runtime warning):

class SanaVideoPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
    def __init__(
        ...
        vae: AutoencoderDC | AutoencoderKLWan | AutoencoderKLLTX2Video,
        ...
    ):
        ...

I also agree with #13229 (review) that the code would be more clear if we split the logic based on the VAE class. It may make the code less general but I think the improved readability is worth it unless it is necessary to support arbitrary VAEs (if so, the vae type annotation should reflect this).

@dg845
Copy link
Collaborator

dg845 commented Mar 18, 2026

As an aside, I think the example script could be simplified by letting the LTX-2 refinement pipeline sample audio_latents internally from the Gaussian prior:

...
# ── Stage 2: LTX2 Refine ──
# Pack the 5D video latent (B, C, F, H, W) into the transformer's token layout.
# (Continuation snippet: `video_latent`, `ltx_pipe`, `device`, `dtype`, etc.
# are defined earlier in the full example script.)
packed = LTX2Pipeline._pack_latents(
    video_latent.to(device=device, dtype=dtype),
    patch_size=ltx_pipe.transformer_spatial_patch_size,
    patch_size_t=ltx_pipe.transformer_temporal_patch_size,
)
# Recover pixel-space height/width/frame-count from the latent shape using the
# VAE compression ratios (temporal axis carries the extra first frame).
_, _, lF, lH, lW = video_latent.shape
pH, pW, pT = (
    lH * ltx_pipe.vae_spatial_compression_ratio,
    lW * ltx_pipe.vae_spatial_compression_ratio,
    (lF - 1) * ltx_pipe.vae_temporal_compression_ratio + 1,
)

# Release the unpacked latent before the refinement pass to lower peak VRAM.
del video_latent
gc.collect()
torch.cuda.empty_cache()

# Let audio_latents take on its default value of `None` so that latents are sampled from the prior
video, audio = ltx_pipe(
    latents=packed,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=pH,
    width=pW,
    num_frames=pT,
    num_inference_steps=3,
    noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0],
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    guidance_scale=1.0,
    frame_rate=frame_rate,
    generator=torch.Generator(device=device).manual_seed(seed),
    output_type="np",
    return_dict=False,
)

# Convert float frames in [0, 1] to uint8 and mux video + generated audio.
video = torch.from_numpy((video * 255).round().astype("uint8"))
encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=ltx_pipe.vocoder.config.output_sampling_rate,
    output_path="sana_ltx2_refined_audio.mp4",
)

The resulting generated audio is reasonably good:

sana_ltx2_refined_audio.mp4

@lawrence-cj
Copy link
Contributor Author

@dg845 thanks! Now the code logic depends on the VAE type

Copy link
Collaborator

@yiyixuxu yiyixuxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@yiyixuxu yiyixuxu merged commit c6f72ad into huggingface:main Mar 18, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants