<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance</title></titleStmt>
			<publicationStmt>
				<publisher>Advances in Neural Information Processing Systems (NeurIPS), 2024</publisher>
				<date>12/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10573541</idno>
					<idno type="doi"></idno>
					
					<author>Kuan Heng Lin</author><author>Sicheng Mo</author><author>Ben Klingher</author><author>Fangzhou Mu</author><author>Bolei Zhou</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Ctrl-X enables training-free and guidance-free zero-shot control of pretrained text-to-image diffusion models given any structure conditions and appearance images.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The rapid advancement of large text-to-image (T2I) generative models has made it possible to generate high-quality images from a single text prompt. However, it remains challenging to accurately specify concepts that reflect human intent using only textual descriptions. Recent approaches like ControlNet <ref type="bibr">[44]</ref> and IP-Adapter <ref type="bibr">[43]</ref> have enabled controllable image generation on top of pretrained T2I diffusion models for structure and appearance, respectively. Despite their impressive results in controllable generation, these approaches <ref type="bibr">[44,</ref><ref type="bibr">25,</ref><ref type="bibr">46,</ref><ref type="bibr">20]</ref> require fine-tuning the entire generative model or training auxiliary modules on large amounts of paired data.</p><p>Training-free approaches <ref type="bibr">[7,</ref><ref type="bibr">24,</ref><ref type="bibr">4]</ref> have been proposed to address the high overhead of these additional training stages. These methods optimize the latent embedding across diffusion steps using specially designed score functions, a process called guidance, to achieve finer-grained control than text alone. Although training-free approaches avoid the training cost, they significantly increase inference-time compute and GPU memory due to the additional backpropagation through the diffusion network. They also require sampling schedules that are 2-20 times longer. 
Furthermore, as the expected latent distribution of each time step is predefined for each diffusion model, the guidance weight of each score function must be tuned delicately; otherwise, the latent may fall out of distribution, leading to artifacts and reduced image quality.</p><p>To tackle these limitations, we present Ctrl-X, a simple training-free and guidance-free framework for T2I diffusion with structure and appearance control. We name our method "Ctrl-X" because we reformulate the controllable generation problem by 'cutting' (and 'pasting') two tasks together: spatial structure preservation and semantic-aware stylization. Our insight is that diffusion feature maps capture rich spatial structure and high-level appearance from early diffusion steps, sufficient for structure and appearance control without guidance. To this end, Ctrl-X employs feature injection and spatially-aware normalization in the attention layers to facilitate structure and appearance alignment with user-provided images. By being guidance-free, Ctrl-X eliminates additional optimization overhead and sampling steps, resulting in a 35-fold increase in inference speed compared to guidance-based methods. Figure <ref type="figure">1</ref> shows sample generation results. Moreover, Ctrl-X supports arbitrary structure conditions beyond natural images and can be applied to any T2I and even text-to-video (T2V) diffusion models. Extensive quantitative and qualitative experiments, along with a user study, demonstrate the superior image quality and appearance alignment of our method over prior works.</p><p>We summarize our contributions as follows:</p><p>1. We present Ctrl-X, a simple plug-and-play method that builds on pretrained text-to-image diffusion models to provide disentangled and zero-shot control of structure and appearance during the generation process, requiring no additional training or guidance. 2. 
Ctrl-X presents the first universal guidance-free solution that supports multiple conditional signals (structure and appearance) and model architectures (e.g. text-to-image and text-to-video). 3. Our method demonstrates superior results in comparison to previous training-based and guidance-based baselines (e.g. ControlNet + IP-Adapter <ref type="bibr">[44,</ref><ref type="bibr">43]</ref> and FreeControl <ref type="bibr">[24]</ref>) in terms of condition alignment, text-image alignment, and image quality.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related work</head><p>Training-based structure control methods require paired condition-image data to train additional modules or fine-tune the entire diffusion network to facilitate generation from spatial conditions <ref type="bibr">[44,</ref><ref type="bibr">25,</ref><ref type="bibr">20,</ref><ref type="bibr">46,</ref><ref type="bibr">42,</ref><ref type="bibr">3,</ref><ref type="bibr">47,</ref><ref type="bibr">38,</ref><ref type="bibr">49]</ref>. While pixel-level spatial control can be achieved with this approach, a significant drawback is needing a large number of condition-image pairs as training data. Although some condition data can be generated by pretrained annotators (e.g. depth and segmentation maps), other condition data is difficult to obtain from given images (e.g. 3D mesh, point cloud), making these conditions challenging to follow. Compared to these training-based methods, Ctrl-X supports conditions where paired data is challenging to obtain, making it a more flexible and effective solution.</p><p>Training-free structure control methods typically focus on specific conditions. For example, R&amp;B <ref type="bibr">[40]</ref> facilitates bounding-box guided control with region-aware guidance, and DenseDiffusion <ref type="bibr">[17]</ref> generates images with sparse segmentation map conditions by manipulating the attention weights. Universal Guidance <ref type="bibr">[4]</ref> employs various pretrained classifiers to support multiple types of condition signals. FreeControl <ref type="bibr">[24]</ref> analyzes semantic correspondence in the subspace of diffusion features and harnesses it to support spatial control from any visual condition. While these approaches do not require training data, they usually need to compute the gradient of the latent to lower an auxiliary loss, which requires substantial computing time and GPU memory. In contrast, Ctrl-X requires no guidance at the inference stage and controls structure via direct feature injection, enabling faster and more robust image generation with spatial control.</p><p>Figure <ref type="figure">2</ref>: Visualizing early diffusion features. Using 20 real, generated, and condition images of animals, we extract Stable Diffusion XL <ref type="bibr">[27]</ref> features right after decoder layer 0 convolution. We visualize the top three principal components computed for each time step across all images. t = 961 to 881 correspond to inference steps 1 to 5 of the DDIM scheduler with 50 time steps. We obtain x_t by directly adding Gaussian noise to each clean image x_0 via the diffusion forward process.</p><p>Diffusion appearance control. Existing appearance control methods that build upon pretrained diffusion models can similarly be categorized into two types (training-based vs. training-free).</p><p>Training-based appearance control methods can be divided into two categories: those trained to handle any image prompt and those overfitting to a single instance. The first category <ref type="bibr">[44,</ref><ref type="bibr">25,</ref><ref type="bibr">43,</ref><ref type="bibr">38]</ref> trains additional image encoders or adapters to align the generation process with the structure or appearance of the reference image. 
The second category <ref type="bibr">[30,</ref><ref type="bibr">14,</ref><ref type="bibr">8,</ref><ref type="bibr">2,</ref><ref type="bibr">26,</ref><ref type="bibr">31]</ref> is typically applied to customized visual content creation by fine-tuning a pretrained text-to-image model on a small set of images or binding special tokens to each instance. The main limitation of these methods is that the additional training required makes them hard to scale. In contrast, Ctrl-X offers a scalable solution that transfers appearance from any instance without training data.</p><p>Training-free appearance control methods generally follow two approaches: One approach <ref type="bibr">[1,</ref><ref type="bibr">5,</ref><ref type="bibr">41]</ref> manipulates self-attention features using pixel-level dense correspondence between the generated image and the target appearance, and the other <ref type="bibr">[7,</ref><ref type="bibr">24]</ref> extracts appearance embeddings from the diffusion network and transfers the appearance by guiding the diffusion process towards the target appearance embedding. A key limitation of these approaches is that a single text-controlled target cannot fully capture the details of the target image, and the latter methods require additional optimization steps. By contrast, our method exploits the spatial correspondence of self-attention layers to achieve semantically-aware appearance transfer without targeting specific subjects.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Preliminaries</head><p>Diffusion models are a family of probabilistic generative models characterized by two processes: The forward process iteratively adds Gaussian noise to a clean image x_0 to obtain x_t for time step t \in [1, T], which can be reparameterized in terms of a noise schedule \bar{\alpha}_t where x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon for \epsilon \sim \mathcal{N}(0, I) (Equation 1); the backward process generates images by iteratively denoising an initial Gaussian noise x_T \sim \mathcal{N}(0, I), also known as diffusion sampling <ref type="bibr">[13]</ref>. This process uses a parameterized denoising network \epsilon_\theta conditioned on a text prompt c, where at time step t we obtain a cleaner x_{t-1} from \epsilon_\theta(x_t | t, c) (Equation 2). Formally, \epsilon_\theta(x_t | t, c) \approx -\sigma_t \nabla_{x_t} \log p(x_t) approximates a score function scaled by a noise schedule \sigma_t that points toward a high density of data, i.e., x_0, at noise level t <ref type="bibr">[34]</ref>.</p><p>Guidance. The iterative inference of diffusion enables us to guide the sampling process with auxiliary information. Guidance modifies Equation 2 to compose additional score functions that point toward richer and more specifically conditioned distributions <ref type="bibr">[4,</ref><ref type="bibr">7]</ref>, expressed as \hat{\epsilon} = \epsilon_\theta(x_t | t, c) + s \, g(x_t; t, y), where g is an energy function and s is the guidance strength. In practice, g can range from classifier-free guidance (where g = \epsilon_\theta and y = \emptyset, i.e. the empty prompt), which improves image quality and prompt adherence for T2I diffusion <ref type="bibr">[12,</ref><ref type="bibr">29]</ref>, to arbitrary gradients \nabla_{x_t} \ell(\epsilon_\theta(x_t | t, c) | t, y) computed from auxiliary models or diffusion features, common to guidance-based controllable generation <ref type="bibr">[4,</ref><ref type="bibr">7,</ref><ref type="bibr">24]</ref>. Thus, guidance provides great customizability in the type and variety of conditioning for controllable generation, as it only requires a loss that can be backpropagated to x_t. 
However, this backpropagation requirement often translates to slow inference and high memory usage. Moreover, as guidance-based methods often compose multiple energy functions, tuning the guidance strength s for each g may be finicky and cause issues of robustness. Thus, Ctrl-X avoids guidance and provides instant applicability to larger T2I and T2V models with minor hyperparameter tuning.</p><p>Diffusion U-Net architecture. Many pretrained T2I diffusion models are text-conditioned U-Nets, which contain an encoder and a decoder that downsample and then upsample the input x_t to predict \epsilon, with long skip connections between matching encoder and decoder resolutions <ref type="bibr">[13,</ref><ref type="bibr">29,</ref><ref type="bibr">27]</ref>. Each encoder/decoder block contains convolution layers, self-attention layers, and cross-attention layers: The first two control both structure and appearance, and the last injects textual information. Thus, many training-free controllable generation methods utilize these layers, through direct manipulation <ref type="bibr">[11,</ref><ref type="bibr">36,</ref><ref type="bibr">18,</ref><ref type="bibr">1,</ref><ref type="bibr">41]</ref> or for computing guidance losses <ref type="bibr">[7,</ref><ref type="bibr">24]</ref>, with self-attention most commonly used: Let h_{l,t} \in \mathbb{R}^{(hw) \times c} be the diffusion feature with height h, width w, and channel size c at time step t right before attention layer l. Then, the self-attention operation is \mathrm{Attn}(h_{l,t}) = A V with A = \mathrm{softmax}(Q K^\top / \sqrt{d}), where Q = h_{l,t} W_Q, K = h_{l,t} W_K, and V = h_{l,t} W_V are linear transformations which produce the query Q, key K, and value V, respectively, and softmax is applied across the second (hw)-dimension. (Generally, c = d for diffusion models.) Intuitively, the attention map A \in \mathbb{R}^{(hw) \times (hw)} encodes how each pixel in Q corresponds to each in K, which then rearranges and weighs V. This correspondence is the basis for Ctrl-X's spatially-aware appearance transfer.</p></div>
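<p>The forward process and guidance above admit a compact numerical sketch. In the NumPy illustration below, the linear noise schedule and the classifier-free-guidance form are our own illustrative assumptions, not the exact schedules or models used in this paper:

```python
import numpy as np

T = 50
# Toy noise schedule: alpha_bar decreases from nearly 1 (clean) to nearly 0 (pure noise).
alpha_bar = np.linspace(0.9999, 0.0001, T)

def forward_diffuse(x0, t, rng):
    """Equation 1: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def classifier_free_guidance(eps_cond, eps_uncond, s):
    """One common guidance form: push the conditional prediction away from
    the unconditional one by guidance strength s."""
    return eps_uncond + s * (eps_cond - eps_uncond)
```

With s = 1 this reduces to the conditional prediction; larger s strengthens prompt adherence at the cost of diversity.</p>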
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Guidance-free structure and appearance control</head><p>Ctrl-X is a general framework for training-free, guidance-free, and zero-shot T2I diffusion with structure and appearance control. Given a structure image I^s and an appearance image I^a, Ctrl-X manipulates a pretrained T2I diffusion model \epsilon_\theta to generate an output image I^o that inherits the structure of I^s and the appearance of I^a.</p><p>Method overview. Our method is illustrated in Figure <ref type="figure">3</ref> and is summarized as follows: Given clean structure and appearance latents I^s = x^s_0 and I^a = x^a_0, we first directly obtain noised structure and appearance latents x^s_t and x^a_t via the diffusion forward process, then extract their U-Net features from a pretrained T2I diffusion model. When denoising the output latent x^o_t, we inject convolution and self-attention features from x^s_t and leverage self-attention correspondence to transfer spatially-aware appearance statistics from x^a_t to x^o_t, achieving structure and appearance control.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Feed-forward structure control</head><p>Structure control of T2I diffusion requires transferring structure information from I^s = x^s_0 to x^o_t, especially during early time steps. To this end, we initialize x^o_T = x^s_T \sim \mathcal{N}(0, I) and obtain x^s_t via the diffusion forward process in Equation 1 with x^s_0 and randomly sampled \epsilon \sim \mathcal{N}(0, I). Inspired by the observation that diffusion features contain rich layout information <ref type="bibr">[36,</ref><ref type="bibr">18,</ref><ref type="bibr">24]</ref>, we perform feature and self-attention injection as follows: For U-Net layer l and diffusion time step t, let f^o_{l,t} and f^s_{l,t} be features/activations after the convolution block from x^o_t and x^s_t, and let A^o_{l,t} and A^s_{l,t} be the attention maps of the self-attention block from x^o_t and x^s_t. Then, we replace f^o_{l,t} with f^s_{l,t} (feature injection) and A^o_{l,t} with A^s_{l,t} (self-attention injection).</p><p>Figure <ref type="figure">3</ref>: Overview of Ctrl-X. (a) Ctrl-X pipeline: At each sampling step t, we obtain x^s_t and x^a_t via the forward diffusion process, then feed them into the T2I diffusion model to obtain their convolution and self-attention features. Then, we inject convolution and self-attention features from x^s_t and leverage self-attention correspondence to transfer spatially-aware appearance statistics from x^a_t to x^o_t. (b) Spatially-aware appearance transfer: We exploit self-attention correspondence between x^o_t and x^a_t to compute weighted feature statistics M and S applied to x^o_t.</p><p>In contrast to <ref type="bibr">[36,</ref><ref type="bibr">18,</ref><ref type="bibr">24]</ref>, we do not perform inversion and instead directly use forward diffusion (Equation <ref type="formula">1</ref>) to obtain x^s_t. 
We observe that x^s_t obtained via the forward diffusion process contains sufficient structure information even at very early (high-noise) time steps, as shown in Figure <ref type="figure">2</ref>. This also reduces the appearance leakage common to inversion-based methods, as observed by FreeControl <ref type="bibr">[24]</ref>. We study our feed-forward structure control method in Sections 5.1 and 5.2.</p><p>We apply feature injection for layers l \in L_{feat} and self-attention injection for layers l \in L_{self}, and we do so for (normalized) time steps within the structure control schedule \tau_s \in [0, 1], i.e., the first \tau_s fraction of sampling steps.</p></div>
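<p>The injection rule above amounts to overwriting output-branch features with structure-branch features for the scheduled layers and steps. A minimal sketch follows; the function name, the layer set as a Python set, and the schedule check are our illustrative assumptions (the actual method hooks into U-Net convolution and self-attention layers):

```python
import numpy as np

def inject_structure(feats_out, feats_struct, layer, step_frac, tau_s, layers_feat):
    """Replace the output branch's features with the structure branch's features
    for layers in L_feat while the normalized sampling progress is inside the
    structure schedule tau_s (i.e. during the first tau_s fraction of steps)."""
    in_schedule = tau_s >= step_frac      # step_frac: 0 at the first step, 1 at the last
    if in_schedule and layer in layers_feat:
        return feats_struct.copy()        # feature injection: f_out replaced by f_struct
    return feats_out                      # outside the schedule: leave features untouched
```

The same gating pattern would apply to self-attention map injection, with the layer set L_self and the same schedule.</p>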
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Spatially-aware appearance transfer</head><p>Inspired by prior works that define appearance as feature statistics <ref type="bibr">[15,</ref><ref type="bibr">21]</ref>, we consider appearance transfer to be a stylization task. T2I diffusion self-attention transforms the value V with attention map A, where the latter represents how pixels in Q correspond to pixels in K. As observed by Cross-Image Attention <ref type="bibr">[1]</ref>, Q K^\top can represent the semantic correspondence between two images when Q and K are computed from features of each, even when the two images differ significantly in structure. Thus, inspired by AdaAttN <ref type="bibr">[21]</ref>, we propose spatially-aware appearance transfer, where we exploit this correspondence to generate self-attention-weighted mean and standard deviation maps from x^a_t to normalize x^o_t: For any self-attention layer l, let h^o_{l,t} and h^a_{l,t} be diffusion features right before self-attention for x^o_t and x^a_t, respectively. Then, we compute the attention map A = \mathrm{softmax}(\mathrm{norm}(h^o_{l,t}) W_Q (\mathrm{norm}(h^a_{l,t}) W_K)^\top / \sqrt{d}), where norm is applied across the spatial dimension (hw). Notably, we normalize h^o_{l,t} and h^a_{l,t} first to remove appearance statistics and thus isolate structural correspondence. Then, we compute the mean and standard deviation maps M and S of h^a_{l,t} weighted by A and use them to normalize h^o_{l,t}: M := A h^a_{l,t} and S := \sqrt{A (h^a_{l,t} \odot h^a_{l,t}) - M \odot M}, with h^o_{l,t} \leftarrow S \odot \mathrm{norm}(h^o_{l,t}) + M. M and S, weighted by structural correspondences between I^o and I^a, are spatially-aware feature statistics of x^a_t which are transferred to x^o_t. Lastly, we perform layer l self-attention on h^o_{l,t} as normal. 
We apply appearance transfer for layers l \in L_{app}, and we do so for (normalized) time steps within the appearance control schedule \tau_a \in [0, 1].</p><p>Figure 4: Ctrl-X supports a diverse variety of structure images for both (a) structure and appearance controllable generation and (b) prompt-driven conditional generation.</p><p>Structure and appearance control. Finally, we replace \epsilon_\theta in Equation <ref type="formula">2</ref> with \epsilon_\theta(x^o_t | t, c; \{f^s_{l,t}\}_{l \in L_{feat}}, \{A^s_{l,t}\}_{l \in L_{self}}, \{h^a_{l,t}\}_{l \in L_{app}}), where \{f^s_{l,t}\}_{l \in L_{feat}}, \{A^s_{l,t}\}_{l \in L_{self}}, and \{h^a_{l,t}\}_{l \in L_{app}} respectively correspond to x^s_t features for feature injection, x^s_t attention maps for self-attention injection, and x^a_t features for appearance transfer.</p></div>
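<p>The attention-weighted statistics transfer of Section 4.2 can be sketched in NumPy. This is an AdaAttN-style illustration under our own simplifications (random projection matrices, single-head attention), not the released implementation:

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_norm(h):
    """Normalize each channel across the spatial (hw) dimension."""
    mu = h.mean(axis=0, keepdims=True)
    sigma = h.std(axis=0, keepdims=True) + 1e-6
    return (h - mu) / sigma

def appearance_transfer(h_out, h_app, W_q, W_k, d):
    """Attention-weighted mean/std transfer (AdaAttN-style sketch).
    h_out, h_app: (hw, c) pre-attention features of output and appearance images."""
    Q = spatial_norm(h_out) @ W_q              # normalize first to isolate structure
    K = spatial_norm(h_app) @ W_k
    A = softmax(Q @ K.T / np.sqrt(d), axis=1)  # (hw_out, hw_app) correspondence map
    M = A @ h_app                              # attention-weighted mean map
    S = np.sqrt(np.maximum(A @ (h_app * h_app) - M * M, 0.0))  # weighted std map
    return S * spatial_norm(h_out) + M         # renormalize the output features
```

If the appearance features are spatially constant, every row of A yields the same statistics and the output collapses to that constant, matching the intuition that M and S carry appearance while the normalized output features carry structure.</p>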
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments</head><p>We present extensive quantitative and qualitative results to demonstrate the structure preservation and appearance alignment of Ctrl-X on T2I diffusion. Appendix A contains more implementation details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">T2I diffusion with structure and appearance control</head><p>Baselines. For training-based methods, ControlNet <ref type="bibr">[44]</ref> and T2I-Adapter <ref type="bibr">[25]</ref> learn an auxiliary module that injects a condition image into a pretrained diffusion model for structure alignment. We then combine them with IP-Adapter <ref type="bibr">[43]</ref>, a trained module for image prompting and thus appearance transfer. Uni-ControlNet <ref type="bibr">[46]</ref> adds a feature extractor to ControlNet to achieve multi-image structure control of selected condition types, along with image prompting for global/appearance control. Splicing ViT Features <ref type="bibr">[35]</ref> trains a U-Net from scratch per source-appearance image pair to minimize their DINO-ViT self-similarity distance and global [CLS] token loss. (For structure conditions not supported by a training-based baseline, we convert them to canny edge maps.) For guidance-based methods, FreeControl <ref type="bibr">[24]</ref> enforces structure and appearance alignment via backpropagated score functions computed from diffusion feature subspaces. For guidance-free methods, Cross-Image Attention <ref type="bibr">[1]</ref> manipulates attention weights to transfer appearance while maintaining structure. We run all methods on SDXL v1.0 <ref type="bibr">[27]</ref> when possible and on their default base models otherwise.</p><p>Dataset. Our method supports T2I diffusion with appearance transfer and arbitrary-condition structure control. Since no benchmarks exist for such a flexible task, we create a new dataset comprising 256 diverse structure-appearance pairs. The structure images consist of 31% natural images, 49% ControlNet-supported conditions (e.g. canny, depth, segmentation), and 20% in-the-wild conditions (e.g. 3D mesh, point cloud), and the appearance images are a mix of Web and generated images. 
We use templates and hand-annotation for the structure, appearance, and output text prompts.</p><p>Evaluation metrics. For quantitative evaluation, we report two widely-adopted metrics: DINO Self-sim measures the self-similarity distance <ref type="bibr">[35]</ref> between the structure and output image in the DINO-ViT <ref type="bibr">[6]</ref> feature space, where a lower distance indicates better structure preservation; DINO-I measures the cosine similarity between the DINO-ViT [CLS] tokens of the appearance and output images <ref type="bibr">[30]</ref>, where a higher score indicates better appearance transfer.</p><p>Qualitative results. As shown in Figures <ref type="figure">4</ref> and <ref type="figure">5</ref>, Ctrl-X faithfully preserves structure from structure images ranging from natural images and ControlNet-supported conditions (e.g. HED, segmentation) to in-the-wild conditions (e.g. wireframe, 3D mesh) not possible in prior training-based methods, while adeptly transferring appearance from the appearance image with semantic correspondence. Moreover, as shown in Figure <ref type="figure">6</ref>, Ctrl-X is capable of multi-subject generation, capturing strong semantic correspondence between different subjects and the background and achieving balanced structure and appearance alignment. In contrast, ControlNet + IP-Adapter <ref type="bibr">[44,</ref><ref type="bibr">43]</ref> often fails to maintain the structure and/or transfer the subjects' or background's appearances.</p><p>Comparison to baselines. Figure <ref type="figure">5</ref> and Table <ref type="table">2</ref> compare Ctrl-X to the baselines qualitatively and quantitatively, respectively. 
Moreover, our user study in Table <ref type="table">4</ref>, Appendix A shows how often participants preferred Ctrl-X over each baseline on result quality, structure fidelity, appearance fidelity, and overall fidelity.</p><p>For training-based and guidance-based methods, despite Uni-ControlNet <ref type="bibr">[46]</ref> and FreeControl's <ref type="bibr">[24]</ref> stronger structure preservation (smaller DINO self-similarity), they generally struggle to enforce faithful appearance transfer and yield worse DINO-I scores, which is particularly visible in Figure <ref type="figure">5</ref> rows 1 and 3. Since the training-based methods combine a structure control module (ControlNet <ref type="bibr">[44]</ref> or T2I-Adapter <ref type="bibr">[25]</ref>) with a separately-trained appearance transfer module IP-Adapter <ref type="bibr">[43]</ref>, the two modules sometimes exert conflicting control signals at the cost of appearance transfer (e.g. row 1), and for ControlNet, structure preservation as well.</p><p>Table 1: Inference efficiency comparison. Columns: training-based, preprocessing time (s), inference latency (s), total time (s), peak GPU memory usage (GiB).
Splicing ViT Features [35]: yes, 0.00, 1557.09, 1557.09, 3.95
Uni-ControlNet [46]: yes, 0.00, 6.96, 6.96, 7.36
ControlNet + IP-Adapter [44, 43]: yes, 0.00, 6.21, 6.21, 18.09
T2I-Adapter + IP-Adapter [25, 43]: yes, 0.00, 4.37, 4.37, 13.28
Cross-Image Attention [1]: no, 18.33, 24.47, 42.80, 8.85
FreeControl [24]: no, 239.36, 139.53, 378.89, 44.34
Ctrl-X (ours): no, 0.00, 10.91, 10.91, 11.51</p><p>For Uni-ControlNet, compressing the appearance image to a few prompt tokens results in often inaccurate appearance transfer (e.g. rows 4 and 5) and structure bleed artifacts (e.g. row 6). For FreeControl, its appearance score function from extracted embeddings may not sufficiently capture more complex appearance correspondences, which, along with needing per-image hyperparameter tuning, results in lower-contrast outputs and sometimes failed appearance transfer (e.g. row 4). 
Moreover, despite Splicing ViT Features <ref type="bibr">[35]</ref> having the best self-similarity and DINO-I scores in Table <ref type="table">2</ref>, Figure <ref type="figure">5</ref> reveals that its output images are often blurry and display structure-image appearance leakage on non-natural images (e.g. rows 3, 5, and 6). It benchmarks well because its per-image training minimizes the DINO metrics directly.</p><p>There is a trade-off between structure consistency (self-similarity) and appearance similarity (DINO-I), as these are competing metrics: increasing structure preservation corresponds to worse appearance similarity, which we show in Figure <ref type="figure">11</ref>, Appendix B by varying control schedules. As single metrics are not representative of overall method performance, we survey overall fidelity in our user study (Table <ref type="table">4</ref>, Appendix A), where Ctrl-X achieved the best overall fidelity while matching training-based methods on result quality, structure fidelity, and appearance fidelity, showcasing our method's ability to balance the conflicting, disentangled tasks of structure and appearance control.</p><p>The guidance-free baseline Cross-Image Attention <ref type="bibr">[1]</ref>, in contrast, is less robust and more sensitive to the structure image, as its inverted structure latents contain strong appearance information. This causes both poorer structure alignment and frequent appearance leakage or artifacts (e.g. row 6) from the structure to the output images, resulting in worse DINO self-similarity and DINO-I scores. Similarly, Ctrl-X results are consistently preferred over Cross-Image Attention ones in our user study across all metrics (Table <ref type="table">4</ref>, Appendix A). In practice, we find Cross-Image Attention to be sensitive to its domain name, which is used for attention masking to isolate subjects, and it thus sometimes fails to produce outputs with cross-modal pairs (e.g. 
wireframes to photos).</p><p>Inference efficiency. We study the inference time, preprocessing time, and peak GPU memory usage of our method compared to the baselines, all with base model SDXL v1.0 except Uni-ControlNet (SD v1.5), Cross-Image Attention (SD v1.5), and Splicing ViT Features (U-Net). Table <ref type="table">1</ref> reports the average inference time using a single NVIDIA H100 GPU. Ctrl-X is slightly slower than training-based ControlNet (1.76×) and T2I-Adapter (2.50×) with IP-Adapter, yet significantly faster than per-image-trained Splicing ViT (0.0070×), guidance-based FreeControl (0.029×), and guidance-free Cross-Image Attention (0.25×), where each factor in parentheses is Ctrl-X's total time relative to that baseline's. Moreover, for methods with SDXL v1.0 as the base model, Ctrl-X has lower peak GPU memory usage than training-based methods and significantly lower memory than training-free methods. Our training-free and guidance-free method achieves run time and peak GPU memory usage comparable to training-based methods, indicating its flexibility.</p><p>Extension to prompt-driven conditional generation. Ctrl-X also supports prompt-driven conditional generation, where it generates an output image complying with the given text prompt while aligning with the structure of the structure image, as shown in Figures <ref type="figure">4</ref> and <ref type="figure">7</ref>. Inspired by FreeControl <ref type="bibr">[24]</ref>, instead of requiring a given I^a, Ctrl-X can jointly generate I^a from the text prompt alongside I^o, where we obtain x^a_{t-1} via denoising with Equation 2 from x^a_t without control. Baselines, qualitative and quantitative analysis, and implementation details are available in Appendix C.</p><p>Extension to video diffusion models. Ctrl-X is training-free, guidance-free, and demonstrates competitive runtime. Thus, we can directly apply our method to text-to-video (T2V) models, as seen in Figure <ref type="figure">17</ref>, Appendix D. 
Our method closely aligns the structure between the structure and output videos while transferring temporally consistent appearance from the appearance image.</p></div>
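<p>As a side note on the metrics used in this section, DINO-I reduces to a cosine similarity between two [CLS] embeddings. A minimal sketch, with plain vectors standing in for DINO-ViT outputs:

```python
import numpy as np

def dino_i(cls_app, cls_out):
    """Cosine similarity between [CLS] tokens of the appearance and output images.
    Higher means closer appearance; the inputs here are stand-ins for the
    embeddings a real DINO-ViT would produce."""
    num = float(np.dot(cls_app, cls_out))
    den = float(np.linalg.norm(cls_app) * np.linalg.norm(cls_out)) + 1e-12
    return num / den
```

The small constant in the denominator only guards against zero vectors and does not affect scores for real embeddings.</p>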
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Ablations</head><p>Effect of control. As seen in Figure <ref type="figure">8</ref>(a), structure control is responsible for structure preservation (appearance-only vs. ours). Also, structure control alone cannot isolate structure information, displaying strong structure-image appearance leakage and poor-quality outputs (structure-only vs. ours), as it merely injects structure features, which creates the semantic correspondence for appearance control.</p><p>Figure 8: Ablations. We study ablations on control (no control, structure-only, appearance-only, ours), appearance transfer method (with vs. without attention weighting), and inversion vs. our method.</p><p>Figure 9: Ctrl-X can struggle with localizing the corresponding subject in the appearance image during appearance transfer when the subject is too small.</p><p>Appearance transfer method. As we consider appearance transfer a stylization task, we compare our appearance statistics transfer with and without attention weighting in Figure <ref type="figure">8(b)</ref>. Without weighting (equivalent to AdaIN <ref type="bibr">[15]</ref>), the normalization is global and ignores the semantic correspondence between the appearance and output images, so the outputs are low-contrast.</p><p>Effect of inversion. We compare DDIM inversion vs. forward diffusion (ours) for obtaining x^o_T = x^s_T and x^s_t in Figure <ref type="figure">8</ref>(c). Inversion displays appearance leakage from structure images for challenging conditions (left) while being similar to our method in others (right). Considering inversion's additional cost and model inference time, forward diffusion is a better choice for our method.</p></div>
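<p>The unweighted variant in the ablation above is plain AdaIN: the same global appearance statistics are applied to every output pixel, with no semantic correspondence. A minimal sketch for contrast, under our own simplification of operating directly on (hw, c) feature matrices:

```python
import numpy as np

def adain(h_out, h_app, eps=1e-6):
    """Global AdaIN: match per-channel mean/std of h_out to those of h_app over
    the spatial dimension. Unlike attention-weighted transfer, every output
    pixel receives identical statistics, ignoring semantic correspondence."""
    mu_o = h_out.mean(axis=0, keepdims=True)
    std_o = h_out.std(axis=0, keepdims=True) + eps
    mu_a = h_app.mean(axis=0, keepdims=True)
    std_a = h_app.std(axis=0, keepdims=True)
    return std_a * (h_out - mu_o) / std_o + mu_a
```

After this operation the output features carry the appearance image's global channel statistics exactly, which is why purely global normalization tends to wash out local contrast.</p>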
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>We present Ctrl-X, a training-free and guidance-free framework for structure and appearance control of any T2I and T2V diffusion model. Ctrl-X utilizes pretrained T2I diffusion model feature correspondences, supports arbitrary structure image conditions, works with multiple model architectures, and achieves competitive structure preservation and superior appearance transfer compared to training- and guidance-based methods while enjoying the low overhead benefits of guidance-free methods. As shown in Figure <ref type="figure">9</ref>, the key limitation of Ctrl-X is that its semantic-aware appearance transfer may fail to capture the target appearance when the instance is small, due to the low resolution of the feature map. We hope our method and findings can unveil new possibilities and research on controllable generation as generative models become bigger and more capable.</p><p>Broader impacts. Ctrl-X makes controllable generation more accessible and flexible by supporting multiple conditional signals (structure and appearance) and model architectures without the computational overhead of additional training or optimization. However, this accessibility also makes it easier to use pretrained T2I/T2V models for malicious applications (e.g., deepfakes), especially since the controllability enables users to generate specific images, raising ethical concerns around consent and crediting artists whose work is used as condition images. In response to these safety concerns, T2I and T2V models have become more secure. Likewise, Ctrl-X can inherit the same safeguards, and its plug-and-play nature allows the open-source community to scrutinize and improve its safety.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Method, implementation, and evaluation details</head><p>More details on feed-forward structure control. We inject diffusion features after convolution skip connections. Since we initialize x_T^o as random Gaussian noise, the image structure after the first inference step likely does not align with I_s, as observed by <ref type="bibr">[36]</ref>. Thus, injecting before skip connections results in weaker structure control and image artifacts, as we would be summing features f_t^o and f_t^s with conflicting structure information.</p><p>More details on inference. With classifier-free guidance, inspired by <ref type="bibr">[24,</ref><ref type="bibr">1]</ref>, we only control the prompt-conditioned noise prediction, 'steering' the diffusion process away from uncontrolled generation and thus strengthening structure and appearance alignment. Also, since structure and appearance control can result in an out-of-distribution x_{t-1} after applying Equation <ref type="formula">2</ref>, we apply n_r steps of self-recurrence. Particularly, after obtaining x_{t-1}^o with structure and appearance control, we repeat the diffusion step n_r times for (normalized) time steps t ∈ [τ_r0, τ_r1], where τ_r0, τ_r1 ∈ [0, 1]. Notably, the self-recurrence steps occur without structure or appearance control, and we observe generally fewer artifacts and slightly better appearance transfer when self-recurrence is enabled.</p><p>Comparison to prior works. We compare Ctrl-X to prior works in terms of capabilities in Table <ref type="table">3</ref>. Compared to the baselines, our method is the only work which supports appearance and structure control with any structure conditions while being training-free and guidance-free.</p><p>Experiment hyperparameters. 
For both T2I diffusion with structure and appearance control and structure-only conditional generation, we use Stable Diffusion XL (SDXL) v1.0 <ref type="bibr">[27]</ref> for all Ctrl-X experiments, unless stated otherwise. For SDXL, we set L_feat = {0}_decoder, L_self = {0, 1, 2}_decoder, L_app = {1, 2, 3, 4}_decoder ∪ {2, 3, 4, 5}_encoder, and τ_s = τ_a = 0.6. We sample I_o with 50 steps of DDIM sampling with η = 1 <ref type="bibr">[33]</ref>, applying self-recurrence with n_r = 2 for τ_r0 = 0.1 and τ_r1 = 0.5. We implement Ctrl-X with Diffusers <ref type="bibr">[37]</ref> and run all experiments on a single NVIDIA A6000 GPU, except when evaluating inference efficiency in Table <ref type="table">1</ref>, where we run on a single NVIDIA H100 GPU.</p><p>More details on evaluation metrics. To evaluate structure and appearance control results (Table <ref type="table">2</ref>), we report DINO Self-sim and DINO-I. For DINO Self-sim, we compute the self-similarity (i.e., mean squared error) between the structure and output image in the DINO-ViT <ref type="bibr">[6]</ref> feature space, where we use the base-sized model with patch size 8 following Splicing ViT Features <ref type="bibr">[35]</ref>. For DINO-I, we compute the cosine similarity between the DINO-ViT [CLS] tokens of the appearance and output images, where we use the small-sized model with patch size 16 following DreamBooth <ref type="bibr">[30]</ref>.</p><p>To evaluate prompt-driven controllable generation results (Table <ref type="table">5</ref>), we report DINO Self-sim, CLIP score, and LPIPS. DINO Self-sim is computed the same way as for structure and appearance control. For CLIP score, we compute the cosine similarity between the output image and text prompt in the CLIP embedding space, where we use the large-sized model with patch size 14 (ViT-L/14) following FreeControl <ref type="bibr">[24]</ref>. 
For LPIPS, we compute the appearance deviation of the output image from the structure image, where we use the official lpips package <ref type="bibr">[45]</ref> with AlexNet (net="alex").</p><p>User study. We follow the setting of the user study from DenseDiffusion <ref type="bibr">[17]</ref>: we compare Ctrl-X to baselines on structure and appearance control in Table <ref type="table">4</ref>, displaying the average human preference percentages of how often participants preferred our method over each baseline. We randomly selected 15 sample pairs from our dataset and ran each sample pair through 7 methods: Splicing ViT Features <ref type="bibr">[35]</ref>, Uni-ControlNet <ref type="bibr">[46]</ref>, ControlNet + IP-Adapter <ref type="bibr">[44,</ref><ref type="bibr">43]</ref>, T2I-Adapter + IP-Adapter <ref type="bibr">[25,</ref><ref type="bibr">43]</ref>, Cross-Image Attention <ref type="bibr">[1]</ref>, FreeControl <ref type="bibr">[24]</ref>, and Ctrl-X. We invited 10 users to evaluate pairs of results, each consisting of our method, Ctrl-X, and a baseline method. For each comparison, users assessed 15 pairs between Ctrl-X and each baseline based on four criteria: "the quality of displayed images," "the fidelity to the structure reference," "the fidelity to the appearance reference," and "overall fidelity to both structure and appearance reference," which we denote result quality, structure fidelity, appearance fidelity, and overall fidelity, respectively. We collected 150 comparison results between Ctrl-X and each individual baseline method and report the human preference rate, the percentage of times participants preferred our results over the baselines. The user study demonstrates that Ctrl-X outperforms training-free baselines and has competitive performance compared to training-based baselines. The user study (Figure <ref type="figure">10</ref>) was conducted via Amazon Mechanical Turk.</p></div>
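The self-recurrence procedure described in Appendix A can be sketched in a few lines. This is a hedged illustration, assuming a one-step forward-diffusion re-noising q(x_t | x_{t-1}) with a per-step coefficient `alpha(t)` and a hypothetical uncontrolled sampler step `denoise_step`; the actual implementation operates on SDXL latents:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_recur(x_prev, t_norm, denoise_step, alpha, n_r=2, tau_r0=0.1, tau_r1=0.5):
    # Self-recurrence sketch: for normalized time t in [tau_r0, tau_r1],
    # re-noise x_{t-1} back to level t with one forward-diffusion step, then
    # denoise again WITHOUT structure or appearance control; repeat n_r times.
    if not (tau_r0 <= t_norm <= tau_r1):
        return x_prev  # outside the recurrence window: no-op
    for _ in range(n_r):
        a = alpha(t_norm)  # per-step noise schedule coefficient (assumed given)
        noise = rng.standard_normal(x_prev.shape)
        x_t = np.sqrt(a) * x_prev + np.sqrt(1.0 - a) * noise  # q(x_t | x_{t-1})
        x_prev = denoise_step(x_t, t_norm)  # uncontrolled reverse step
    return x_prev
```

With the paper's settings n_r = 2, τ_r0 = 0.1, τ_r1 = 0.5, only the middle portion of the sampling trajectory is re-denoised.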
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Structure and appearance schedules and higher-level conditions</head><p>Ctrl-X has two hyperparameters, the structure control schedule (τ_s) and appearance control schedule (τ_a), which enable finer control over the influence of the structure and appearance images on the output. As structure alignment and appearance transfer are conflicting tasks, controlling the two schedules allows the user to determine the best tradeoff between the two. The default values of τ_s = 0.6 and τ_a = 0.6 we choose work well for most, but not all, structure-appearance image pairs. Particularly, this control enables better results for challenging structure-appearance pairs and allows our method to be used with higher-level conditions without clear subject outlines.</p><p>Effect of control schedules. We vary the structure and appearance control schedules (τ_s and τ_a) as seen in Figure <ref type="figure">11</ref>. Decreasing structure control can make cross-class structure-appearance pairs (e.g., horse normal map with puppy appearance) look more realistic, as doing so trades strict structure adherence for more sensible subject shapes in challenging scenarios. Decreasing appearance control trades appearance alignment for fewer artifacts. Note that, generally, τ_s ≤ τ_a, as structure control requires appearance transfer to realize the structure information and avoid structure image appearance leakage, most prominently demonstrated in Figure <ref type="figure">8(a)</ref>.</p><p>Higher-level structure conditions. By decreasing the structure control schedule τ_s from the default 0.6 to 0.3-0.5, Ctrl-X can handle sparser and higher-level structure conditions such as bounding boxes and human pose skeletons/keypoints, shown in Figure <ref type="figure">12</ref>. 
Not only does this make our method applicable to other higher-level control types, but it also generally reduces structure image appearance leakage with challenging structure conditions.</p></div>
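Concretely, the two schedules gate when structure and appearance injection are active during sampling. A minimal sketch, assuming (as a plausible convention, not the paper's stated implementation) that a schedule τ means "apply control for the first τ fraction of denoising steps":

```python
def control_flags(step, num_steps, tau_s=0.6, tau_a=0.6):
    # Returns (use_structure, use_appearance) for a 0-indexed sampling step.
    # A smaller tau_s (e.g. 0.3-0.5) suits sparse, higher-level conditions
    # such as bounding boxes or pose keypoints.
    frac = step / num_steps
    return frac < tau_s, frac < tau_a
```

For a 50-step schedule, τ_s = 0.6 keeps structure injection on for the first 30 steps; lowering τ_s to 0.3 halts it after 15 while appearance transfer continues.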
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Extension to prompt-driven controllable generation</head><p>Ctrl-X also supports prompt-driven conditional generation, where it generates an output image complying with the given text prompt while aligning with the structure from the structure image, as shown in Figures <ref type="figure">4</ref> and <ref type="figure">7</ref>. Inspired by FreeControl <ref type="bibr">[24]</ref>, instead of using a given I_a, Ctrl-X can jointly generate I_a based on the text prompt alongside I_o, where we obtain x_{t-1}^a via denoising with Equation 2 from x_t^a without control.</p><p>Baselines. For training-based methods, we test ControlNet <ref type="bibr">[44]</ref> and T2I-Adapter <ref type="bibr">[25]</ref>. For guidance-based methods, we test FreeControl <ref type="bibr">[24]</ref>, where we generate an appearance image alongside the output image instead of inverting a given appearance image. For guidance-free methods, SDEdit <ref type="bibr">[23]</ref> adds noise to the input image and denoises it with a pretrained diffusion model to preserve structure. Prompt-to-Prompt <ref type="bibr">[11]</ref> and Plug-and-Play <ref type="bibr">[36]</ref> manipulate features and attention of pretrained T2I models for prompt-driven image editing. InfEdit <ref type="bibr">[41]</ref> uses three-branch attention manipulation and consistent multi-step sampling for fast, consistent image editing. Ctrl-X is also more robust than guidance-based and guidance-free methods across a wide variety of condition types. (We run ControlNet <ref type="bibr">[44]</ref> and T2I-Adapter <ref type="bibr">[25]</ref> on SD v1.5 <ref type="bibr">[29]</ref> instead of SDXL v1.0 <ref type="bibr">[27]</ref>, as the latter frequently generates low-contrast, flat results for the two methods.)</p><p>Table <ref type="table">5</ref>: Quantitative comparison on conditional generation. Ctrl-X outperforms all training-based and guidance-free baselines in prompt alignment (CLIP score). 
Although many baselines seem to better preserve structure with low DINO self-similarity distances, the low distances mainly come from severe structure image appearance leakage (high LPIPS), as also shown in Figure <ref type="figure">13</ref>. Also, though FreeControl displays better structure preservation and prompt alignment, it still experiences appearance leakage, which results in poor image quality (Figure <ref type="figure">13</ref>).</p><table><row role="label"><cell>Method</cell><cell>Training</cell><cell>Self-sim ↓ (ControlNet-supported)</cell><cell>CLIP score ↑</cell><cell>LPIPS ↑</cell><cell>Self-sim ↓ (New condition)</cell><cell>CLIP score ↑</cell><cell>LPIPS ↑</cell></row><row><cell>ControlNet [44]</cell><cell>✓</cell><cell>0.126</cell><cell>0.298</cell><cell>0.657</cell><cell>0.092</cell><cell>0.302</cell><cell>0.507</cell></row><row><cell>T2I-Adapter [25]</cell><cell>✓</cell><cell>0.096</cell><cell>0.303</cell><cell>0.504</cell><cell>0.068</cell><cell>0.302</cell><cell>0.415</cell></row><row><cell>SDEdit [23]</cell><cell>✗</cell><cell>0.102</cell><cell>0.300</cell><cell>0.366</cell><cell>0.096</cell><cell>0.309</cell><cell>0.373</cell></row><row><cell>Prompt-to-Prompt [11]</cell><cell>✗</cell><cell>0.100</cell><cell>0.276</cell><cell>0.370</cell><cell>0.097</cell><cell>0.287</cell><cell>0.357</cell></row><row><cell>Plug-and-Play [36]</cell><cell>✗</cell><cell>0.056</cell><cell>0.282</cell><cell>0.272</cell><cell>0.050</cell><cell>0.292</cell><cell>0.301</cell></row><row><cell>InfEdit [41]</cell><cell>✗</cell><cell>0.117</cell><cell>0.314</cell><cell>0.523</cell><cell>0.102</cell><cell>0.311</cell><cell>0.442</cell></row><row><cell>FreeControl [24]</cell><cell>✗</cell><cell>0.108</cell><cell>0.340</cell><cell>0.557</cell><cell>0.104</cell><cell>0.339</cell><cell>0.492</cell></row><row><cell>Ctrl-X (ours)</cell><cell>✗</cell><cell>0.134</cell><cell>0.322</cell><cell>0.635</cell><cell>0.135</cell><cell>0.326</cell><cell>0.590</cell></row></table><p>Dataset. Our controllable generation dataset comprises 175 diverse image-prompt pairs with the same (structure) images as Section 5.1. It consists of 71% ControlNet-supported conditions and 29% new conditions. We use the same hand-annotated structure prompts and hand-create output prompts with inspiration from Plug-and-Play's datasets <ref type="bibr">[36]</ref>. See more details in Appendix E.</p><p>Evaluation metrics. For quantitative evaluation, we report three widely-adopted metrics: DINO Self-sim from Section 5.1 measures structure preservation; CLIP score <ref type="bibr">[28]</ref> measures the similarity between the output image and text prompt in the CLIP embedding space, where a higher score suggests stronger image-text alignment; LPIPS distance <ref type="bibr">[45]</ref> measures the appearance deviation of the output image from the structure image, where a higher distance suggests lower appearance leakage from the structure image.</p><p>Qualitative results. 
As shown in Figures <ref type="figure">4</ref> and <ref type="figure">13</ref>, Ctrl-X generates high-quality images with great structure preservation and close prompt alignment. Our method can extract structure information from a wide range of condition types and produces results of diverse modalities based on the prompt.</p><p>Comparison to baselines. Figure <ref type="figure">7</ref> and Table <ref type="table">5</ref> compare our method to the baselines. Training-based methods typically better preserve structure, with lower DINO self-similarity distances, at the cost of worse prompt adherence, with lower CLIP scores. This is because these modules are trained on condition-output pairs which limit the output distribution of the base T2I model, especially for in-the-wild conditions where the produced canny maps are unusual. Our method, in contrast, transfers appearance from a jointly-generated appearance image that utilizes the full generation power of the base T2I model and is neither domain-limited by training nor greatly affected by hyperparameters.</p><p>In contrast, guidance-based and guidance-free methods display appearance leakage from the structure image. The guidance-based FreeControl requires per-image hyperparameter tuning, resulting in fluctuating image quality and appearance leakage when run with its default hyperparameters. Thus, even if it displays slightly higher prompt adherence (higher CLIP score), the appearance leakage often produces lower-quality output images (lower LPIPS). Guidance-free methods, on the other hand, share (inverted) latents (SDEdit, Prompt-to-Prompt, Plug-and-Play) or inject diffusion features (all) from the structure image without the appearance regularization which Ctrl-X's jointly-generated appearance image provides. Consequently, though structure is preserved well with better DINO self-similarity distances, undesirable structure image appearance is also transferred over, resulting in worse LPIPS scores. 
For example, all guidance-based and guidance-free baselines display the magenta-blue-green colors of the dining room normal map (row 3), the color-patchy look of the car and mountain sparse map (row 7), and the red background of the 3D squirrel mesh (row 8).</p></div>
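The DINO-based metrics used throughout these comparisons reduce to simple operations once the ViT features are extracted. A sketch under our own assumptions (feature extraction omitted; inputs are precomputed DINO-ViT [CLS] tokens and self-similarity matrices):

```python
import numpy as np

def dino_i(cls_app, cls_out):
    # DINO-I: cosine similarity between the DINO-ViT [CLS] tokens of the
    # appearance and output images (higher = better appearance alignment).
    a, b = np.asarray(cls_app, float), np.asarray(cls_out, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dino_self_sim(sim_struct, sim_out):
    # DINO Self-sim: mean squared error between the two images' (N, N)
    # feature self-similarity matrices (lower = closer structure).
    d = np.asarray(sim_struct, float) - np.asarray(sim_out, float)
    return float(np.mean(d ** 2))
```

LPIPS is computed with the official lpips package (net="alex"), so it is not re-sketched here.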
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Additional results</head><p>Additional structure and appearance control results. We present additional results of structure and appearance control in Figure <ref type="figure">14</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Structure Appearance</head><note type="other">Figure 14: Additional results of structure and appearance control. We present additional Ctrl-X results of structure and appearance control.</note><note type="other">Figure 16: Structure-only control, with jointly generated appearance images for prompts such as "a photo of a white cat sitting in a forest during sunset," "a photo of a horse standing in a field at night," "a photo of a medieval soldier standing on a barren field, raining," "a photo of a Victorian library, sunlight streaming in," "a realistic photo of a cyberpunk city at night, neon lights," and "a photo of an ornate teapot in a museum display."</note></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IP-Adapter Appearance</head><p>We display the jointly generated appearance images for prompt-driven conditional generation. Ctrl-X appearance transfer preserves the image quality of the generated appearances, so structure-only control retains the quality of the base model.</p><p>Appearance-only control. Ctrl-X is a method which disentangles control from given structure and appearance images, balancing structure alignment and appearance transfer when the two tasks are inherently conflicting. However, Ctrl-X can also achieve appearance-only control by simply dropping the structure control branch (and thus not needing to generate a structure image), as shown in Figure <ref type="figure">15</ref>. Our method displays better appearance alignment for both subjects and background compared to the training-based IP-Adapter <ref type="bibr">[43]</ref>.</p><p>Structure-only control. For prompt-driven conditional (structure-only) generation, Ctrl-X needs to jointly generate an appearance image, where the jointly generated image is equivalent to vanilla SDXL v1.0 generation. We display the outputs alongside these appearance images in Figure <ref type="figure">16</ref>, where there is minimal quality difference between the generated appearance images and the appearance-transferred output images, indicating that the need for appearance transfer does not greatly impact image quality. Thus, Ctrl-X adheres well to the quality of its base models.</p><note type="other">Figure 17: Extension to text-to-video (T2V) models. Ctrl-X can be directly applied to T2V models for video structure and appearance control, with AnimateDiff <ref type="bibr">[9]</ref> with Realistic Vision v5.1 <ref type="bibr">[32]</ref> and LaVie <ref type="bibr">[39]</ref> as examples.</note><p>Extension to video diffusion models. We also present results of our method directly applied to text-to-video (T2V) diffusion models in Figure <ref type="figure">17</ref>, namely AnimateDiff <ref type="bibr">[9]</ref> with base model Realistic Vision v5.1 <ref type="bibr">[32]</ref> and LaVie <ref type="bibr">[39]</ref>. A playable video version of the AnimateDiff T2V results can be found in the attached supplementary zip file as ctrl_x_animatediff.mp4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Dataset details</head><p>For our dataset, we list all images present in the paper and their associated sources and licenses in dataset_sources.pdf in the supplementary materials zip. All academic datasets which we use are cited here <ref type="bibr">[3,</ref><ref type="bibr">10,</ref><ref type="bibr">43,</ref><ref type="bibr">24,</ref><ref type="bibr">48,</ref><ref type="bibr">22,</ref><ref type="bibr">19,</ref><ref type="bibr">36,</ref><ref type="bibr">16]</ref>. We publicly release our dataset in our code release: <ref type="url">https://github.com/genforce/ctrl-x</ref>.</p><p>Overview. Our dataset consists of 177 1024 × 1024 images divided into 16 types across 7 categories. We split the images into condition images (67 images: "canny edge map", "metadrive", "3d mesh", "3d humanoid", "depth map", "human pose image", "point cloud", "sketch", "line drawing", "HED edge drawing", "normal map", and "segmentation mask") and natural images (110 images: "photo", "painting", "cartoon" and "birds eye view"), with the largest type being "photo" (83 images). The condition images are further divided into two groups in our paper: ControlNet-supported conditions ("canny edge map", "depth map", "human pose image", "line drawing", "HED edge drawing", "normal map", and "segmentation mask") and in-the-wild conditions ("metadrive", "3D mesh", "3D humanoid", "point cloud", and "sketch"). All of our images fall into one of seven categories: "animals" (52 images), "buildings" (11 images), "humans" (28 images), "objects" (29 images), "rooms" (24 images), "scenes" (22 images) and "vehicles" (11 images). About two thirds of the images come from the Web, while the remaining third is generated using SDXL 1.0 <ref type="bibr">[27]</ref> or converted from natural images using the ControlNet Annotators packaged in controlnet-aux <ref type="bibr">[44]</ref>. 
We hand-annotate each of these images with a text prompt and other metadata (e.g. type). Then, these images, prompts, and metadata are combined to form the structure and appearance control dataset and the conditional generation dataset, detailed below.</p><p>T2I diffusion with structure and appearance control dataset. This dataset consists of 256 pairs of images from the image dataset described above. This dataset is used to evaluate our method and the baselines' ability to generate images adhering to the structure of a condition or natural image while aligning to the appearance of a second natural image. Each pair contains a structure image (which may be a condition or natural image) and an appearance image (which is a natural image).</p><p>The dataset also includes a structure prompt for the structure image (e.g. "a canny edge map of a horse galloping"), an appearance prompt for the appearance image (e.g. "a painting of a tawny horse in a field"), and one target prompt for the output image (e.g. "a painting of a tawny horse galloping") generated by combining the metadata of the appearance and structure prompts via a template, with a few edge cases hand-annotated. Image pairs are constructed from two images from the same category (e.g. "animals"), and the majority of pairs consist of images of the same subject (e.g. "horse"), but we include 30 pairs of cross-subject images (e.g. "cat" and "dog") to test the methods' ability to generalize structure information across subjects.</p><p>In practice, when running Ctrl-X, we set the appearance prompt to be the same as the output prompt instead of our hand-annotated appearance prompt. We found little difference between the two.</p><p>Conditional generation dataset. 
The conditional dataset combines conditional images with both template-generated and hand-written output prompts (inspired by Plug-and-Play <ref type="bibr">[36]</ref> and FreeControl <ref type="bibr">[24]</ref>) to evaluate our method and the baselines' ability to construct an image adhering to the structure of the input image while complying with the given prompt. Each entry in the conditional dataset consists of a condition image combined with a unique prompt. We have 175 such condition-prompt pairs from the set of 66 condition images above.</p></div>
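The template-based target-prompt generation described above can be illustrated with a toy sketch; the metadata keys (`style`, `subject`, `action`) are hypothetical, not the dataset's actual schema:

```python
def combine_prompts(appearance_meta, structure_meta):
    # Toy target-prompt template: take the style and subject from the
    # appearance prompt's metadata and the action/pose from the structure
    # prompt's metadata (edge cases are hand-annotated in the real dataset).
    return (f"{appearance_meta['style']} {appearance_meta['subject']} "
            f"{structure_meta['action']}")
```

For example, an appearance prompt "a painting of a tawny horse in a field" and a structure prompt "a canny edge map of a horse galloping" would yield the target prompt "a painting of a tawny horse galloping".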
<div xmlns="http://www.tei-c.org/ns/1.0"><head>NeurIPS Paper Checklist</head><p>1. Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes] Justification: Yes, all of our claims accurately reflect the paper's contributions and scope.</p><p>Particularly, we claim that Ctrl-X is a training-free and guidance-free structure and appearance control method which supports arbitrary structure conditions and diffusion models-all claims we show in the main paper and Appendix. Guidelines:</p><p>&#8226; The answer NA means that the abstract and introduction do not include the claims made in the paper. &#8226; The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. &#8226; The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. &#8226; It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Limitations</head><p>Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Yes, we discuss the limitations of our work in Section 6.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. &#8226; The authors are encouraged to create a separate "Limitations" section in their paper.</p><p>&#8226; The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. &#8226; The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. &#8226; The authors should reflect on the factors that influence the performance of the approach.</p><p>For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. &#8226; The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. &#8226; If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. 
&#8226; While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Theory Assumptions and Proofs</head><p>Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?</p><p>Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We publicly release our code and our data (for quantitative evaluation) at <ref type="url">https://github.com/genforce/ctrl-x</ref>.</p><p>Guidelines:</p><p>&#8226; The answer NA means that paper does not include experiments requiring code.</p><p>&#8226; Please see the NeurIPS code and data submission guidelines (<ref type="url">https://nips.cc/  public/guides/CodeSubmissionPolicy</ref>) for more details. &#8226; While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). &#8226; The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (<ref type="url">https:  //nips.cc/public/guides/CodeSubmissionPolicy</ref>) for more details. &#8226; The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. &#8226; The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. &#8226; At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). 
&#8226; Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: Yes, we provide all the relevant hyperparameters in Appendix A and dataset details in Section 5.1 and Appendix E. Our work is training-free (and guidance-free), so we do not have any training (and optimization) details. Guidelines: &#8226; The answer NA means that the paper does not include experiments. &#8226; The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. &#8226; The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [No] Justification: No, our work does not report error bars, because we could not perform enough runs for the per-image training-based and guidance-based baselines for the resulting error bars to be meaningful, as their long inference time makes doing more runs too computationally expensive. Guidelines: &#8226; The answer NA means that the paper does not include experiments. &#8226; The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. 
&#8226; The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). &#8226; The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) &#8226; The assumptions made should be given (e.g., Normally distributed errors). &#8226; It should be clear whether the error bar is the standard deviation or the standard error of the mean. &#8226; It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. &#8226; For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). &#8226; If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Yes, we explained in 5.1 that we used a single NVIDIA A6000 GPU for all experiments, and we also report inference times and peak GPU memory usages in Table 1 on a single NVIDIA H100 GPU. Guidelines: &#8226; The answer NA means that the paper does not include experiments. &#8226; The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. &#8226; The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. 
&#8226; The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Code Of Ethics</head><p>Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics <ref type="url">https://neurips.cc/public/EthicsGuidelines</ref>?</p><p>Answer: [Yes]</p><p>Justification: Yes, our research conforms with the NeurIPS Code of Ethics in every respect.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. &#8226; If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. &#8226; The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10.">Broader Impacts</head><p>Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?</p><p>Answer: [Yes]</p><p>Justification: We discuss potential positive and negative societal impacts in Section 6.</p><p>Guidelines:</p><p>&#8226; The answer NA means that there is no societal impact of the work performed.</p><p>&#8226; If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. &#8226; Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. &#8226; The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out.
For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.</p><p>&#8226; The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.</p><p>&#8226; If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).</p></div>
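The Safeguards answer that follows mentions screening outputs with the Diffusers default safety checker. A minimal sketch of such post-hoc filtering is below; the helper name `screen_images` is ours for illustration (not part of the Ctrl-X release), while the commented classes are the real Diffusers/Transformers APIs.

```python
from typing import List, Sequence


def screen_images(images: Sequence, has_nsfw: Sequence[bool]) -> List:
    """Keep only the images the safety checker did not flag as NSFW."""
    if len(images) != len(has_nsfw):
        raise ValueError("expected one NSFW flag per image")
    return [img for img, flagged in zip(images, has_nsfw) if not flagged]


# In practice the per-image flags come from the pretrained checker shipped
# with Diffusers (weights are downloaded on first use), roughly:
#
#   from transformers import CLIPImageProcessor
#   from diffusers.pipelines.stable_diffusion.safety_checker import (
#       StableDiffusionSafetyChecker,
#   )
#
#   checker = StableDiffusionSafetyChecker.from_pretrained(
#       "CompVis/stable-diffusion-safety-checker")
#   processor = CLIPImageProcessor.from_pretrained(
#       "openai/clip-vit-base-patch32")
#   clip_input = processor(images=pil_images, return_tensors="pt").pixel_values
#   np_images, has_nsfw = checker(images=np_images, clip_input=clip_input)
#   safe_images = screen_images(pil_images, has_nsfw)
```

Because the screening runs on finished samples, it wraps any training-free pipeline without modifying the diffusion process itself.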
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="11.">Safeguards</head><p>Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?</p><p>Answer: [Yes]</p><p>Justification: Our code can easily incorporate the Diffusers <ref type="bibr">[37]</ref> default safety checker to screen for and remove NSFW outputs. Moreover, since our work is training-free, its output domain is inherited from the base model, so the qualitative examples we show have all the safeguards that SDXL v1.0 <ref type="bibr">[27]</ref> has. We recognize that properly safeguarding image/video generators is still an open research problem and current safeguards are far from perfect. However, as future T2I and T2V generative models are released with more safeguards built in, our method can seamlessly inherit the same safeguards.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper poses no such risks.</p><p>&#8226; Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. &#8226; Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. &#8226; We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.</p><p>12. 
Licenses for existing assets</p><p>Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?</p><p>Answer: [Yes]</p><p>Justification: We cite all models and code inspirations we use in both the main paper body and the Appendix, as listed in References. For our dataset, we list all images present in the paper, along with their sources and licenses, in dataset_sources.pdf in the supplementary materials zip.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not use existing assets.</p><p>&#8226; The authors should cite the original paper that produced the code package or dataset.</p><p>&#8226; The authors should state which version of the asset is used and, if possible, include a URL. &#8226; The name of the license (e.g., CC-BY 4.0) should be included for each asset.</p><p>&#8226; For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. &#8226; If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. &#8226; For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. &#8226; If this information is not available online, the authors are encouraged to reach out to the asset's creators.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="13.">New Assets</head><p>Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?</p><p>Answer: [Yes]</p><p>Justification: We detail our dataset for quantitative evaluation in Section 5.1 and Appendix E. The dataset is publicly released alongside our code at <ref type="url">https://github.com/genforce/ctrl-x</ref>, with accompanying documentation on how to use it.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not release new assets.</p><p>&#8226; Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. &#8226; The paper should discuss whether and how consent was obtained from people whose asset is used. &#8226; At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="14.">Crowdsourcing and Research with Human Subjects</head><p>Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?</p><p>Answer: [Yes]</p><p>Justification: This paper conducts a user study using Amazon Mechanical Turk, with details, example instructions, and screenshots provided in Appendix A. We compensate participants at the local minimum rates provided by the platform.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. &#8226; Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.</p><p>&#8226; According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="15.">Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects</head><p>Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?</p><p>Answer: [NA]</p><p>Justification: This paper does not involve crowdsourcing nor research with human subjects.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. &#8226; Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. 
If you obtained IRB approval, you should clearly state this in the paper.</p><p>&#8226; We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. &#8226; For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.</p></div></body>
		</text>
</TEI>
