Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

1Nanyang Technological University,  2Tencent AI Lab,  3HKUST,  4Clemson University
(* Equal Contribution    # Corresponding Author)

Abstract

Diffusion models have proven highly effective for image and video generation; however, because they are trained on single-scale data, they still face composition challenges when generating images at other sizes. Adapting a large pre-trained diffusion model to higher resolutions demands substantial computation and optimization effort, yet matching the generation quality of the low-resolution model remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge of a well-trained low-resolution model for rapid adaptation to higher-resolution image and video generation, through either a tuning-free paradigm or cheap upsampler tuning. By integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model adapts efficiently to higher resolutions while preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy that speeds up inference and improves local structural details. Compared with full fine-tuning, our approach achieves a $5\times$ training speed-up and requires only 0.002M additional tuning parameters. Extensive experiments demonstrate that our approach adapts to higher-resolution image and video synthesis after fine-tuning for just 10k steps, with virtually no extra inference time.
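The pivot-guided noise re-schedule can be pictured as follows: instead of starting high-resolution sampling from pure noise, the low-resolution "pivot" sample is upsampled and diffused forward to an intermediate timestep, from which denoising resumes. Below is a minimal sketch of this idea (function name `pivot_guided_restart` and the SDEdit-style forward-diffusion form are our assumptions, not the paper's exact formulation):

```python
import torch

def pivot_guided_restart(x_lowres, alphas_cumprod, t_pivot, scale=2):
    """Sketch of a pivot-guided noise re-schedule (assumed form):
    upsample the low-resolution pivot sample, then apply the forward
    diffusion marginal up to an intermediate timestep t_pivot so that
    high-resolution sampling can start there instead of from pure noise."""
    x_up = torch.nn.functional.interpolate(
        x_lowres, scale_factor=scale, mode="bilinear", align_corners=False)
    a_bar = alphas_cumprod[t_pivot]
    noise = torch.randn_like(x_up)
    # Standard forward-diffusion marginal:
    # x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    return a_bar.sqrt() * x_up + (1 - a_bar).sqrt() * noise

# Toy linear beta schedule, purely illustrative.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

x_low = torch.randn(1, 3, 64, 64)       # stands in for a low-res sample
x_start = pivot_guided_restart(x_low, alphas_cumprod, t_pivot=600)
print(x_start.shape)  # torch.Size([1, 3, 128, 128])
```

Starting from a partially noised pivot preserves the low-resolution model's composition while leaving the remaining denoising steps free to synthesize high-resolution detail.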

Self-Cascade Diffusion Model

We propose a self-cascade diffusion model that reuses the knowledge of a well-trained low-resolution model for rapid higher-resolution adaptation. It supports both a tuning-free paradigm and a cheap upsampler-tuning paradigm that adds only 0.002M parameters, achieving a $5\times$ training speed-up over full fine-tuning with virtually no additional inference time.
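To make the "cheap upsampler tuning" concrete, here is a hypothetical sketch of one lightweight upsampler module (class name `ScaleUpsampler` and the depthwise-residual design are our assumptions, chosen only to illustrate how the tunable parameter count stays tiny while the base model remains frozen):

```python
import torch
import torch.nn as nn

class ScaleUpsampler(nn.Module):
    """Hypothetical lightweight upsampler module (a sketch, not the
    authors' exact architecture): bilinear upsampling followed by a
    zero-initialized depthwise conv that learns a residual refinement."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Depthwise 3x3 conv: channels * 9 weights + channels biases.
        self.refine = nn.Conv2d(channels, channels, 3, padding=1,
                                groups=channels)
        nn.init.zeros_(self.refine.weight)  # starts as an identity residual
        nn.init.zeros_(self.refine.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = nn.functional.interpolate(x, scale_factor=self.scale,
                                      mode="bilinear", align_corners=False)
        return x + self.refine(x)  # residual refinement of upsampled features

up = ScaleUpsampler(channels=64)
feat = torch.randn(1, 64, 32, 32)
out = up(feat)
print(out.shape)  # torch.Size([1, 64, 64, 64])
n_params = sum(p.numel() for p in up.parameters())
print(n_params)   # 640 = 64*9 weights + 64 biases
```

Zero-initializing the refinement conv means the module initially behaves as plain bilinear upsampling, so plugging it into a frozen pre-trained network does not disturb the original generation behavior before tuning begins.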

Text to Image

2048×2048 samples generated by SD 2.1 trained at 512×512.

1024×1024 samples generated by SD 2.1 trained at 512×512.


4096×4096 samples generated by SDXL trained at 1024×1024.


Text to Video

512×512 samples generated by SD 2.1 trained at 256×256.



2048×1280×16 samples generated by VideoCrafter trained at 1024×1024×16.


BibTeX

@article{guo2024make,
  title={Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation},
  author={Guo, Lanqing and He, Yingqing and Chen, Haoxin and Xia, Menghan and Cun, Xiaodong and Wang, Yufei and Huang, Siyu and Zhang, Yong and Wang, Xintao and Chen, Qifeng and others},
  journal={arXiv preprint arXiv:2402.10491},
  year={2024}
}