Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance (NeurIPS 2024)

Kuan Heng Lin¹*, Sicheng Mo¹*, Ben Klingher¹, Fangzhou Mu², Bolei Zhou¹
¹UCLA ²NVIDIA
*Equal contribution

Getting started

Environment setup

Our code is built on top of diffusers v0.28.0. To set up the environment, please run the following.

conda env create -f environment.yaml
conda activate ctrlx
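
After activating the environment, a quick sanity check can confirm that the installed diffusers version is the one named above. The following is only a minimal sketch: the helper names `installed_version` and `describe_diffusers` are hypothetical and not part of this repository, and the check uses only the Python standard library.

```python
from importlib import metadata

# Version named in this README; everything else here is an illustrative helper.
EXPECTED_DIFFUSERS = "0.28.0"


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe_diffusers(found, expected=EXPECTED_DIFFUSERS):
    """Turn a detected version (or None) into a short human-readable status."""
    if found is None:
        return "diffusers is not installed"
    if found == expected:
        return f"diffusers {found} matches the README"
    return f"diffusers {found} differs from {expected} named in the README"


print(describe_diffusers(installed_version("diffusers")))
```

The same `installed_version` helper can be pointed at other packages, for example to check whether accelerate is present before using the offload flags.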

Running Ctrl-X

Gradio demo

We provide a user interface for testing our method. Running the following command starts the demo.

python app_ctrlx.py

Script

We also provide a script for running our method. This is equivalent to the Gradio demo.

python run_ctrlx.py \
    --structure_image assets/images/horse__point_cloud.jpg \
    --appearance_image assets/images/horse.jpg \
    --prompt "a photo of a horse standing on grass" \
    --structure_prompt "a 3D point cloud of a horse"

If appearance_image is not provided, then Ctrl-X does structure-only control. If structure_image is not provided, then Ctrl-X does appearance-only control.
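
The three control modes differ only in which image arguments are passed. As a rough sketch, the helper below assembles the command line for run_ctrlx.py from Python and leaves out any image that is not supplied. The function name `build_ctrlx_command` is hypothetical, and treating structure_prompt as optional is an assumption; only the four argument names come from the command shown above.

```python
import shlex


def build_ctrlx_command(prompt, structure_image=None, appearance_image=None,
                        structure_prompt=None):
    """Assemble the argument list for run_ctrlx.py, omitting unset options."""
    command = ["python", "run_ctrlx.py", "--prompt", prompt]
    if structure_image is not None:
        command += ["--structure_image", structure_image]
    if appearance_image is not None:
        command += ["--appearance_image", appearance_image]
    if structure_prompt is not None:  # assumed optional; not confirmed by the README
        command += ["--structure_prompt", structure_prompt]
    return command


# Structure-only control: no appearance image is passed.
structure_only = build_ctrlx_command(
    prompt="a photo of a horse standing on grass",
    structure_image="assets/images/horse__point_cloud.jpg",
    structure_prompt="a 3D point cloud of a horse",
)
print(shlex.join(structure_only))
```

To actually launch the script, the returned list can be handed to `subprocess.run(command, check=True)` from the repository root.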

Optional arguments

There are four optional arguments for both app_ctrlx.py and run_ctrlx.py:

- model_offload (flag): If enabled, offloads each component of both the base model and refiner to the CPU when not in use, reducing memory usage while slightly increasing inference time.
    - To use model_offload, accelerate must be installed. Install it manually with pip install accelerate, since environment.yaml does not list accelerate.
- sequential_offload (flag): If enabled, offloads each layer of both the base model and refiner to the CPU when not in use, significantly reducing memory usage while massively increasing inference time.
    - Similarly, accelerate must be installed to use sequential_offload.
    - If both model_offload and sequential_offload are enabled, then our code defaults to sequential_offload.
- disable_refiner (flag): If enabled, disables the refiner (and does not load it), reducing memory usage.
- model (str): When provided a safetensor checkpoint path, loads the checkpoint for the base model.
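
The precedence between the two offload flags can be sketched as follows. This is an illustration and not the repository's actual code: `apply_offload` is a hypothetical helper, and it assumes a diffusers-style pipeline object that exposes `enable_sequential_cpu_offload()`, `enable_model_cpu_offload()`, and `to()`. The stand-in class `FakePipeline` exists only so the example runs without a GPU or model weights.

```python
def apply_offload(pipe, model_offload=False, sequential_offload=False, device="cuda"):
    """Pick one placement strategy; sequential_offload wins if both flags are set."""
    if sequential_offload:
        pipe.enable_sequential_cpu_offload()  # per-layer offload: lowest VRAM, slowest
        return "sequential_offload"
    if model_offload:
        pipe.enable_model_cpu_offload()  # per-component offload: moderate savings
        return "model_offload"
    pipe.to(device)  # no offload: keep the whole pipeline on the GPU
    return "none"


class FakePipeline:
    """Stand-in that records which placement method was called."""

    def __init__(self):
        self.calls = []

    def enable_sequential_cpu_offload(self):
        self.calls.append("sequential")

    def enable_model_cpu_offload(self):
        self.calls.append("model")

    def to(self, device):
        self.calls.append(f"to:{device}")
        return self


pipe = FakePipeline()
chosen = apply_offload(pipe, model_offload=True, sequential_offload=True)
print(chosen, pipe.calls)
```

Checking sequential_offload first is what makes it the default when both flags are enabled, matching the behavior described in the list above.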

Approximate inference time and GPU VRAM usage for the Gradio demo and script (structure and appearance control) on a single NVIDIA RTX A6000 are as follows.

Flags	Inference time (s)	GPU VRAM usage (GiB)
None	28.8	18.8
`model_offload`	38.3	12.6
`sequential_offload`	169.3	3.8
`disable_refiner`	25.5	14.5
`model_offload` + `disable_refiner`	31.7	7.4
`sequential_offload` + `disable_refiner`	151.4	3.8

Here, VRAM usage is obtained via torch.cuda.max_memory_reserved(), which is the closest option in PyTorch to nvidia-smi numbers but likely still underestimates them. You can obtain these numbers on your own hardware by passing the benchmark flag to run_ctrlx.py.
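
For measuring code outside run_ctrlx.py, a similar reading can be taken by hand. The sketch below is an assumption about how such a measurement might look, not the implementation behind the benchmark flag: `benchmark` and `bytes_to_gib` are hypothetical helpers. It resets PyTorch's peak-memory counters, times one call, and reports torch.cuda.max_memory_reserved() in GiB; when PyTorch or CUDA is unavailable, it reports None for memory.

```python
import time

GIB = 1024 ** 3  # bytes per GiB, the unit used in the table above


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def benchmark(fn):
    """Run fn() once and return (result, seconds, peak_reserved_gib or None)."""
    try:
        import torch
        use_cuda = torch.cuda.is_available()
    except ImportError:
        torch, use_cuda = None, False

    if use_cuda:
        torch.cuda.reset_peak_memory_stats()  # start peak tracking from zero
        torch.cuda.synchronize()  # make sure earlier GPU work is finished

    start = time.perf_counter()
    result = fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued GPU kernels before stopping the clock
    seconds = time.perf_counter() - start

    peak_gib = bytes_to_gib(torch.cuda.max_memory_reserved()) if use_cuda else None
    return result, seconds, peak_gib


# Placeholder workload; replace the lambda with a real pipeline call.
result, seconds, peak_gib = benchmark(lambda: sum(range(1000)))
print(f"time: {seconds:.3f} s, peak reserved: {peak_gib}")
```

Synchronizing before and after the call matters because CUDA kernels run asynchronously, so the wall-clock time would otherwise exclude work still queued on the GPU.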

Have fun playing around with Ctrl-X! :D

Contact

For questions, thoughts, discussions, or anything else you would like to reach out about, please contact Jordan Lin ([email protected]).

Reference

If you use our code in your research, please cite the following work.

@inproceedings{lin2024ctrlx,
    author = {Lin, {Kuan Heng} and Mo, Sicheng and Klingher, Ben and Mu, Fangzhou and Zhou, Bolei},
    booktitle = {Advances in Neural Information Processing Systems},
    title = {Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance},
    year = {2024}
}
