<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Practical Guide for Incorporating Symmetry in Diffusion Policy</title></titleStmt>
			<publicationStmt>
				<publisher>Neural Information Processing Systems</publisher>
				<date>01/01/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10652226</idno>
					<idno type="doi"></idno>
					
					<author>Dian Wang</author><author>Boce Hu</author><author>Shuran Song</author><author>Robin Walters</author><author>Robert Platt</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Recently, equivariant neural networks for policy learning have shown promising improvements in sample efficiency and generalization; however, their wide adoption faces substantial barriers due to implementation complexity. Equivariant architectures typically require specialized mathematical formulations and custom network design, posing significant challenges when integrating with modern policy frameworks like diffusion-based models. In this paper, we explore a number of straightforward and practical approaches to incorporate symmetry benefits into diffusion policies without the overhead of full equivariant designs. Specifically, we investigate (i) invariant representations via relative trajectory actions and eye-in-hand perception, (ii) integrating equivariant vision encoders, and (iii) symmetric feature extraction with pretrained encoders using Frame Averaging. We first prove that combining eye-in-hand perception with relative or delta action parameterization yields inherent SE(3)-invariance, thus improving policy generalization. We then perform a systematic experimental study of these design choices for integrating symmetry in diffusion policies, and conclude that an invariant representation with equivariant feature extraction significantly improves policy performance. Our method achieves performance on par with or exceeding fully equivariant architectures while greatly simplifying implementation.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Although recent advancements in incorporating symmetry in robotic policy learning have shown promising results <ref type="bibr">[52,</ref><ref type="bibr">60]</ref>, the practical impact of this approach remains limited due to significant implementation challenges. While equivariant models <ref type="bibr">[62,</ref><ref type="bibr">11]</ref> can substantially improve sample efficiency and generalization, integrating these symmetric properties into modern policy learning frameworks presents several obstacles. First, symmetry reasoning must be tailored specifically to each policy formulation, requiring different mathematical analyses and architectures across policy frameworks like Q-learning <ref type="bibr">[57]</ref>, actor-critic methods <ref type="bibr">[59]</ref>, and diffusion models <ref type="bibr">[3]</ref>. This creates a steep learning curve for practitioners seeking to leverage symmetry benefits. Second, state-of-the-art equivariant architectures often introduce considerable complexity with specialized layers <ref type="bibr">[4,</ref><ref type="bibr">15]</ref> and require structured inputs <ref type="bibr">[11,</ref><ref type="bibr">34]</ref>, which do not align naturally with modern policy components like diffusion-based action generation <ref type="bibr">[5]</ref>, eye-in-hand perception <ref type="bibr">[7]</ref>, or pretrained vision encoders. As a result, robotics researchers frequently face a difficult choice: adopting complex equivariant architectures with specialized implementations, or keeping their implementations simple and practical at the cost of forgoing symmetry benefits. This dilemma is particularly pronounced in diffusion-based visuomotor policies <ref type="bibr">[5]</ref>, which have emerged as a powerful paradigm for generating smooth, multimodal robot actions. 
Although prior works have tried to implement equivariant diffusion models <ref type="bibr">[51,</ref><ref type="bibr">65,</ref><ref type="bibr">60]</ref>, the diffusion-denoising process significantly increases the difficulty of incorporating equivariant structure.</p><p>Figure <ref type="figure">1:</ref> We propose a number of practical approaches for incorporating symmetry in Diffusion Policy, achieving performance comparable to fully equivariant policies while maintaining simplicity.</p><p>In this paper, we present a practical guide for incorporating symmetry into diffusion policies without requiring significant design overhead or sacrificing the advantages of modern policy formulations. We systematically investigate several straightforward approaches that achieve this balance: (i) invariant representations through relative trajectory actions and eye-in-hand perception, (ii) equivariant vision encoders that can be incorporated into standard diffusion frameworks, and (iii) Frame Averaging <ref type="bibr">[48]</ref> techniques that enable symmetrization with pretrained encoders. Our approach offers a compelling alternative to end-to-end equivariant architectures: rather than choosing between symmetry benefits and implementation practicality, our methods demonstrate that these advantages can be achieved with minimal architectural changes and overhead. As shown in Figure <ref type="figure">1</ref>, our method achieves excellent performance while maintaining implementation simplicity, positioning it at an optimal balance point in the symmetry-practicality tradeoff. Notably, our work using a single eye-in-hand image input reaches performance similar to that of Wang et al. 
<ref type="bibr">[60]</ref>, which uses four cameras to reconstruct the 3D voxel grid input.</p><p>Our contributions can be summarized as follows:</p><p>• We prove that eye-in-hand perception and relative trajectory actions inherently possess SE(3)-invariance, significantly improving the policy's generalization. • We demonstrate that transitioning from absolute action representations to relative trajectory actions provides a straightforward improvement in policy performance, both with eye-in-hand perception and with extrinsic perception. • We propose a novel approach that integrates a symmetric encoder into standard diffusion policy learning, achieved either by using an equivariant network in end-to-end training or by using Frame Averaging with pretrained encoders, and show that it can significantly improve performance. • We show that combining invariant perception and action representations with a pretrained encoder and Frame Averaging achieves state-of-the-art results on the MimicGen <ref type="bibr">[42]</ref> benchmark while maintaining low architectural complexity and computational overhead compared to fully equivariant methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Diffusion Policies: Denoising diffusion models have transformed generative modeling in vision, achieving state-of-the-art results in image and video synthesis <ref type="bibr">[18,</ref><ref type="bibr">55]</ref> as well as planning <ref type="bibr">[29,</ref><ref type="bibr">33]</ref>. Recently, Chi et al. <ref type="bibr">[5,</ref><ref type="bibr">6]</ref> introduced Diffusion Policy, which extends diffusion models to robotic visuomotor control by denoising action trajectories conditioned on observations. Subsequent extensions include applications to reinforcement learning <ref type="bibr">[61,</ref><ref type="bibr">49]</ref>, incorporation of 3D inputs <ref type="bibr">[68,</ref><ref type="bibr">31]</ref>, hierarchical policies <ref type="bibr">[64,</ref><ref type="bibr">40,</ref><ref type="bibr">71]</ref>, and large vision-language action models <ref type="bibr">[45,</ref><ref type="bibr">2]</ref>. A key limitation of diffusion methods is their heavy demand for training data. To mitigate this, recent works have injected domain symmetries as equivariant constraints into the denoising network, thereby boosting sample efficiency and generalization <ref type="bibr">[3,</ref><ref type="bibr">51,</ref><ref type="bibr">60,</ref><ref type="bibr">65,</ref><ref type="bibr">56]</ref>. However, they typically require complex equivariant denoising models. In contrast, our approach integrates symmetry by combining invariant observation and action representations with an equivariant vision encoder, greatly simplifying implementation.</p><p>Equivariant Policy Learning: Robotic policies often require generalizing across spatial transformations of the environment. Traditional methods often achieve this via extensive data augmentation <ref type="bibr">[69,</ref><ref type="bibr">70]</ref>. 
Recently, equivariant models, a class of methods that are mathematically constrained to be equivariant <ref type="bibr">[8,</ref><ref type="bibr">9,</ref><ref type="bibr">62,</ref><ref type="bibr">4,</ref><ref type="bibr">11,</ref><ref type="bibr">15,</ref><ref type="bibr">34,</ref><ref type="bibr">35]</ref>, have been widely adopted in robotics to automatically instantiate spatial generalization. Such models have been applied across robot learning, including equivariant reinforcement learning <ref type="bibr">[57,</ref><ref type="bibr">59,</ref><ref type="bibr">58,</ref><ref type="bibr">44,</ref><ref type="bibr">43,</ref><ref type="bibr">32,</ref><ref type="bibr">37,</ref><ref type="bibr">19]</ref>, imitation learning <ref type="bibr">[30,</ref><ref type="bibr">66,</ref><ref type="bibr">14]</ref>, grasp learning <ref type="bibr">[74,</ref><ref type="bibr">75,</ref><ref type="bibr">24,</ref><ref type="bibr">21,</ref><ref type="bibr">36]</ref>, and pick-place policies <ref type="bibr">[52,</ref><ref type="bibr">53,</ref><ref type="bibr">46,</ref><ref type="bibr">50,</ref><ref type="bibr">22,</ref><ref type="bibr">25,</ref><ref type="bibr">23,</ref><ref type="bibr">27,</ref><ref type="bibr">13,</ref><ref type="bibr">26]</ref>. However, these methods often demand complex symmetry reasoning and equivariant layers, which can hinder scalability. By contrast, our framework introduces symmetry in a more modular fashion, making it easier to implement and adapt.</p><p>Policy Learning using Eye-in-Hand Images: Eye-in-hand perception, using a camera mounted on the robot's end-effector, has been a popular choice in manipulation because it is simple and calibration-free. For instance, Jangir et al. <ref type="bibr">[28]</ref> learn a shared latent space between egocentric and external views to train hybrid-input policies. Hsu et al. 
<ref type="bibr">[20]</ref> show that eye-in-hand images (alone or combined with external cameras) yield higher success rates and better generalization, and similar findings have been reported in <ref type="bibr">[41]</ref>. Consequently, many recent frameworks retain eye-in-hand imagery <ref type="bibr">[72,</ref><ref type="bibr">1,</ref><ref type="bibr">73,</ref><ref type="bibr">2,</ref><ref type="bibr">38]</ref>.</p><p>Another advantage is rapid data collection using a handheld gripper <ref type="bibr">[54,</ref><ref type="bibr">67,</ref><ref type="bibr">47,</ref><ref type="bibr">7,</ref><ref type="bibr">63,</ref><ref type="bibr">16]</ref>. In this work, we theoretically analyze how eye-in-hand observation, when paired with relative or delta trajectory actions, yields inherent symmetry advantages for diffusion-based policies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Background</head><p>Problem Statement: We consider behavior cloning for visuomotor policy learning in robotic manipulation, where the goal is to learn a policy that maps an observation o to an action a, mimicking an expert policy. Both o and a may span multiple time steps, i.e., o = {o_{t-(m-1)}, ..., o_{t-1}, o_t}, a = {a_t, a_{t+1}, ..., a_{t+(n-1)}}, where m is the number of past observations and n is the number of future action steps. At time step t, the observation o_t = (I_t, T_t, w_t) contains the visual information I_t (e.g., images), the pose of the gripper in the world frame T_t ∈ SE(3), and the gripper aperture w_t ∈ R. The action a_t = (A_t, w_t) specifies a target pose A_t ∈ SE(3) of the gripper and an open-width command w_t ∈ R.</p><p>To simplify the notation for our analysis, we omit the gripper command w_t in the action and focus on the pose command by writing a = {A_t, A_{t+1}, ..., A_{t+n-1}} (while in the actual implementation the policy controls both the pose and the aperture).</p><p>Diffusion Policy: Chi et al. 
<ref type="bibr">[5]</ref> introduced Diffusion Policy, which formulates the behavior cloning problem as learning a Denoising Diffusion Probabilistic Model (DDPM) <ref type="bibr">[18]</ref> over action trajectories. Diffusion Policy learns a noise prediction network ε_θ, which is trained to predict the noise ε_k added to an action a, i.e., ε_θ(o, a + ε_k, k) ≈ ε_k. The training loss is</p><p>L = MSE(ε_k, ε_θ(o, a + ε_k, k)),</p><p>where ε_k is a random noise conditioned on a randomly sampled denoising step k. At inference, starting from a^K ∼ N(0, I), the model iteratively denoises</p><p>a^{k-1} = α(a^k - γ ε_θ(o, a^k, k) + ε),</p><p>where ε ∼ N(0, σ²I). α, γ, σ are functions of the denoising step k (also known as the noise schedule). The action a^0 is the final executed action trajectory.</p><p>Equivariance: A function f : X → Y is G-equivariant if f(ρ_x(g)x) = ρ_y(g)f(x) for all g ∈ G, where ρ_x and ρ_y are representations of G on X and Y. Equivariance ensures that applying g before or after f yields the same result, thus f is G-symmetric. When ρ is clear, we often write g • x or gx.</p><p>For 2D images, the group SO(2) = {Rot_θ : 0 ≤ θ &lt; 2π} of planar rotations and its subgroup C_u = {Rot_θ : θ ∈ {2πi/u | 0 ≤ i &lt; u}} of rotations by multiples of 2π/u are often used to construct equivariant neural networks <ref type="bibr">[62,</ref><ref type="bibr">4]</ref> that can capture rotated features. The regular representation ρ_reg : G → R^{u×u} of C_u is of particular interest in this paper; it defines how C_u acts on a vector x ∈ R^u by u × u permutation matrices. Intuitively, the vector x can be viewed as containing information for each rotation in C_u. Let r_v ∈ C_u = {1, r_1, ..., r_{u-1}} and x = (x_1, ..., x_u) ∈ R^u; then ρ_reg(r_v)x = (x_{u-v+1}, ..., x_u, x_1, x_2, ..., x_{u-v}) cyclically permutes the coordinates of x. 
</p><p>Figure 2: We use only an eye-in-hand image as the input, and use an equivariant encoder to acquire symmetry-aware features from it. In policy denoising, both the noisy action and the noise-free action output are in the gripper frame. Other components remain identical to the original Diffusion Policy. Compared with Equivariant Diffusion Policy (right), our approach is significantly simpler while maintaining comparable experimental performance.</p></div>
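The cyclic action of the regular representation described in the Background is easy to check numerically. Below is a minimal NumPy sketch (our illustration, not code from the paper; the function name `rho_reg` is ours): it constructs the u × u permutation matrix for a rotation r_v in C_u and verifies that it cyclically shifts the coordinates of a vector.

```python
import numpy as np

def rho_reg(v: int, u: int) -> np.ndarray:
    """Permutation matrix of the regular representation of C_u:
    the rotation r_v cyclically shifts the coordinates of a u-vector by v."""
    P = np.zeros((u, u))
    for i in range(u):
        P[(i + v) % u, i] = 1.0
    return P

u = 8
x = np.arange(u, dtype=float)   # one feature entry per rotation in C_8
shifted = rho_reg(1, u) @ x     # action of the generator r_1

# r_1 maps (x_1, ..., x_u) to (x_u, x_1, ..., x_{u-1}), i.e. np.roll(x, 1)
assert np.allclose(shifted, np.roll(x, 1))
# representation property: rho(r_3) rho(r_5) = rho(r_8) = identity in C_8
assert np.allclose(rho_reg(3, u) @ rho_reg(5, u), np.eye(u))
```

An equivariant encoder with regular-representation output produces exactly such vectors: rotating its input image by 2π/u permutes the output features by ρ_reg.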
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Approaches for Incorporating Symmetry in Diffusion Policy</head><p>In this section, we introduce three practical approaches for incorporating symmetry into diffusion policies without requiring complex end-to-end equivariant architecture design. First, we examine how invariant action and perception representations naturally induce symmetric properties. Second, we explore integrating equivariant vision encoders that extract symmetry-aware features while maintaining standard diffusion heads. Finally, we present how to leverage pre-trained vision encoders in an equivariant way through Frame Averaging <ref type="bibr">[48]</ref>. Together, these approaches offer a spectrum of options for balancing symmetry benefits with implementation simplicity. As shown in Figure <ref type="figure">2</ref>, our proposed approach (middle) requires minimal architectural change compared with the original Diffusion Policy <ref type="bibr">[5]</ref> (left), and is much simpler than the fully equivariant model <ref type="bibr">[60]</ref> (right).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Representing Actions as Absolute, Relative, and Delta Trajectory</head><p>Choosing the right representation for actions and observations is crucial for sample-efficient policy learning in robotic manipulation. When a robotic task and environment exhibit rigid-body symmetries, incorporating an equivariant or invariant representation can significantly enhance generalization to unseen object configurations and poses. In this section, we explore three natural trajectory action representations: absolute, relative, and delta trajectories, highlighting their symmetry properties under global SE(3) transformations. Notice that although relative trajectories were introduced by Chi et al. <ref type="bibr">[7]</ref>, their symmetric advantages have not yet been explored.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 1 (Absolute Trajectory Action)</head><p>An absolute trajectory action specifies future gripper poses directly in the world frame as</p><p>a = {A_t, A_{t+1}, ..., A_{t+n-1}},</p><p>where each A_{t+i} ∈ SE(3) is the desired pose at time t + i. Definition 2 (Relative Trajectory Action). Let T_t ∈ SE(3) be the current gripper pose in the world frame. A relative trajectory action is a sequence a^r = {A^r_t, A^r_{t+1}, ..., A^r_{t+n-1}}, where each A^r_{t+i} ∈ SE(3) specifies the gripper's pose relative to its initial frame at time t. The corresponding absolute poses are recovered via</p><p>A_{t+i} = T_t A^r_{t+i}.</p><p>Definition 3 (Delta Trajectory Action). A delta trajectory action is a sequence of incremental transforms expressed in a moving local frame, a^d = {A^d_t, A^d_{t+1}, ..., A^d_{t+n-1}}, where A^d_{t+i} ∈ SE(3) represents the incremental motion at time step t + i expressed relative to the gripper's frame at the previous time step t + i - 1 (with A^d_t relative to the current pose T_t). The absolute poses are reconstructed as:</p><p>A_{t+i} = T_t A^d_t A^d_{t+1} ··· A^d_{t+i}.</p><p>We now formalize the transformation properties of these action representations (see Appendix A for the proof): Proposition 1 (Equivariance and Invariance under SE(3)). Consider a global transformation g ∈ SE(3) applied to the world coordinate frame, which transforms the current gripper pose as T_t → gT_t. Under this transformation:</p><p>1. The absolute trajectory action transforms equivariantly, i.e., g • a = {gA_t, gA_{t+1}, ...}.</p><p>2. The relative trajectory action is invariant, i.e., g • a^r = a^r.</p><p>3. The delta trajectory action is invariant, i.e., g • a^d = a^d. A key advantage of using relative or delta trajectory action representations is that, if the policy being modeled is equivariant, i.e., π(go) = gπ(o), the learned function π : o → a^r or π : o → a^d becomes invariant, i.e., π(go) = π(o). This makes the underlying denoising network ε_θ also invariant, significantly reducing the function space and potentially easing the training.</p></div>
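To make Definitions 1–3 and Proposition 1 concrete, here is a small NumPy sketch (our illustration, not the paper's code; all function names are hypothetical) using 4 × 4 homogeneous matrices: it converts an absolute trajectory to relative and delta form and checks numerically that both are unchanged by a global SE(3) transform g.

```python
import numpy as np

def se3(R, t):
    """Build a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def Rz(th):
    """Rotation about the z-axis."""
    return np.array([[np.cos(th), -np.sin(th), 0.0],
                     [np.sin(th),  np.cos(th), 0.0],
                     [0.0, 0.0, 1.0]])

def to_relative(T_t, abs_traj):
    """Definition 2: A^r_{t+i} = T_t^{-1} A_{t+i} (poses in the frame at time t)."""
    return [np.linalg.inv(T_t) @ A for A in abs_traj]

def to_delta(T_t, abs_traj):
    """Definition 3: each increment is expressed in the previous pose's frame."""
    prev, deltas = T_t, []
    for A in abs_traj:
        deltas.append(np.linalg.inv(prev) @ A)
        prev = A
    return deltas

# toy current gripper pose and absolute trajectory
T_t = se3(Rz(0.3), np.array([0.1, 0.2, 0.3]))
abs_traj = [se3(Rz(0.3 + 0.1 * i), np.array([0.1 * i, 0.0, 0.3])) for i in range(1, 4)]

# global SE(3) transform of the world frame: every pose T becomes g T
g = se3(Rz(1.2), np.array([-0.5, 0.4, 0.0]))
g_traj = [g @ A for A in abs_traj]

# Proposition 1: relative and delta trajectories are invariant under g
for f in (to_relative, to_delta):
    before, after = f(T_t, abs_traj), f(g @ T_t, g_traj)
    assert all(np.allclose(a, b) for a, b in zip(before, after))
```

The invariance is just cancellation on the left, e.g. (gT_t)^{-1}(gA_{t+i}) = T_t^{-1}A_{t+i}, which is the right-multiplication argument used in Appendix A.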
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">SE(3)-Invariant Policy Learning</head><p>Moreover, when the relative or delta trajectory is combined with eye-in-hand perception, it naturally yields an SE(3)-invariant canonicalization. Specifically, consider an agent equipped with a gripper-mounted camera, which captures an eye-in-hand image I_t. When an arbitrary transformation g ∈ SE(3) is applied to the world, the visual input from the eye-in-hand camera remains unchanged since the relative position and orientation between the camera and the world do not vary. For the input observation o_t = (I_t, T_t, w_t) consisting of the invariant image I_t, a gripper pose T_t, and gripper open-width w_t, applying the transformation g to the world frame affects only the gripper pose T_t,</p><p>g • o_t = (I_t, gT_t, w_t).</p><p>Let us first assume that the policy does not depend on the gripper pose T_t; then the policy π : o → a is SE(3)-equivariant when using eye-in-hand perception and relative or delta action:</p><p>Proposition 2. Let π : o → a^r or π : o → a^d be a function mapping from the observation o to the relative trajectory a^r or the delta trajectory a^d. Assume an eye-in-hand observation is used where Equation 4 is satisfied and π does not depend on T_t. If the policy π : o → a reconstructs the absolute trajectory a using Equation 2 or 3, then π is SE(3)-equivariant, i.e., π(go) = gπ(o).</p><p>See Appendix B for the proof. This equivariance property implies that once a policy is learned, it automatically generalizes across different poses in space without additional training data, thus enhancing sample efficiency and robustness. 
As shown in Figure <ref type="figure">3</ref>, when an SE(3) transformation is applied to the world, both the perception and the action remain invariant.</p><p>In practice, the assumption that π does not depend on T_t will not hold perfectly, because changes in T_t affect the network's prediction; thus we will only have an approximate invariance property.</p><p>Since T_t provides important information to the policy despite breaking the symmetry, we choose not to explicitly constrain the policy to be invariant to T_t. Still, our experiments demonstrate significant performance improvements when employing this approximate invariance property.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Equivariant Vision Encoders</head><p>While the invariant representations described in Section 4.2 theoretically achieve SE(3)-equivariant policy learning, we experimentally found that they do not fully match the performance of end-to-end equivariant policies <ref type="bibr">[60]</ref>. This is because equivariant neural networks not only guarantee global symmetric transformations but, more importantly, extract richer local features that can capture the underlying symmetries of the problem domain. Instead of falling back to a network architecture that is end-to-end equivariant, we propose a novel approach that incorporates an equivariant vision encoder to extract symmetry-aware features while preserving a standard non-equivariant diffusion backbone. This preserves the benefits of symmetric feature extraction without the complexity of a fully equivariant model.</p><p>Specifically, we can replace the standard CNN vision encoder in a Diffusion Policy with an equivariant CNN that operates on the group C_u ⊂ SO(2). This encoder maps the input eye-in-hand image to a feature vector that transforms according to the regular representation of C_u, providing the diffusion head with a richer representation carrying explicit information about how features transform under rotations, significantly enhancing learning.</p></div>
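In the paper, the equivariant encoder is a full equivariant ResNet; to illustrate the underlying mechanism in isolation, here is a minimal single-layer NumPy sketch (our hypothetical example, not the actual encoder). Correlating the image with the four rotated copies of a single filter and applying global sum pooling yields a feature vector that transforms by the regular representation of C_4: rotating the input by 90° cyclically permutes the features.

```python
import numpy as np

def corr_sum(img, w):
    """Valid cross-correlation of a square image with a square filter w,
    followed by global sum pooling (a stand-in for conv + pooling)."""
    k = img.shape[0] - w.shape[0] + 1
    total = 0.0
    for i in range(k):
        for j in range(k):
            total += np.sum(img[i:i + w.shape[0], j:j + w.shape[0]] * w)
    return total

def c4_features(img, w):
    """C_4 'lifting' layer: one filter, correlated in all four orientations.
    The output 4-vector transforms by the regular representation of C_4."""
    return np.array([corr_sum(img, np.rot90(w, i)) for i in range(4)])

rng = np.random.default_rng(1)
img = rng.standard_normal((6, 6))   # toy square eye-in-hand image
w = rng.standard_normal((3, 3))     # toy learnable filter

f = c4_features(img, w)
f_rot = c4_features(np.rot90(img), w)
# rotating the input by 90 degrees cyclically permutes the feature vector
assert np.allclose(f_rot, np.roll(f, 1))
```

Libraries such as escnn enforce this kind of constraint at every layer (and for finer groups like C_8) while keeping spatial resolution; the sketch above only captures the single-layer idea.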
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1">Incorporating Pretrained Vision Encoders with Frame Averaging</head><p>Although there exists a wide variety of equivariant neural network architectures <ref type="bibr">[8,</ref><ref type="bibr">62,</ref><ref type="bibr">4]</ref>, they typically require defining each layer to be equivariant by constraining the weights with specialized kernels <ref type="bibr">[10]</ref>. This usually implies that the network is built specifically for a task and is trained from scratch. However, modern computer vision has seen tremendous progress through large-scale pretrained models, which provide powerful general-purpose representations. To bridge the gap between equivariance and pre-training, we employ Frame Averaging <ref type="bibr">[48]</ref> to turn an arbitrary function Φ : X → Y into a G-equivariant network by averaging over a Frame F : X → 2^G \ {∅} that satisfies equivariance as a set, F(gx) = gF(x):</p><p>Ψ(x) = (1/|F(x)|) Σ_{g ∈ F(x)} ρ_y(g) Φ(ρ_x(g)⁻¹ x),</p><p>where Ψ : X → Y has the equivariance property Ψ(ρ_x(g)x) = ρ_y(g)Ψ(x). For a finite group G, one can set the frame to be the whole group, F(x) = G, and Equation 5 becomes symmetrization:</p><p>Ψ(x) = (1/|G|) Σ_{g ∈ G} ρ_y(g) Φ(ρ_x(g)⁻¹ x).</p><p>When using a pretrained encoder with Frame Averaging, we obtain the benefits of both powerful pretrained representations and explicit rotational equivariance, allowing us to leverage state-of-the-art vision backbones without sacrificing symmetry properties.</p></div>
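As a concrete illustration of the symmetrization in Equation 6, the sketch below (ours, not the paper's code; `phi` is a stand-in for a frozen pretrained backbone) averages an arbitrary feature extractor over G = C_4. For brevity we use the trivial output representation, so the averaged features are C_4-invariant; the equivariant case would additionally apply ρ_y(g) to each term before averaging.

```python
import numpy as np

def phi(img):
    """Arbitrary non-equivariant encoder stand-in; in practice this would be
    a frozen pretrained network such as an ImageNet ResNet-18."""
    v = img.reshape(-1)
    return np.array([v @ np.sin(np.arange(v.size)), v.max(), np.sum(v ** 2)])

def symmetrize(encoder, img, u=4):
    """Frame Averaging over the whole finite group G = C_4 (Equation 6),
    with trivial output representation: Psi(x) = (1/|G|) sum_g phi(g^-1 x)."""
    return np.mean([encoder(np.rot90(img, -i)) for i in range(u)], axis=0)

rng = np.random.default_rng(2)
img = rng.standard_normal((8, 8))

# phi alone is not rotation-symmetric...
assert not np.allclose(phi(img), phi(np.rot90(img)))
# ...but its Frame Average is C_4-invariant
assert np.allclose(symmetrize(phi, img), symmetrize(phi, np.rot90(img)))
```

In the C_8 setting used in the experiments, the eye-in-hand image would be rotated by multiples of 45° (with interpolation), each copy encoded by the pretrained backbone, and the eight outputs combined; only the combination step is new, so the backbone itself needs no modification.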
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments</head><p>In this section, we conduct a systematic experimental study comparing different approaches for incorporating symmetry in diffusion policies. We investigate the following key research questions: 1. Invariant representations: How do the invariant action and perception representations analyzed in Section 4.1 impact diffusion policy learning performance? 2. Equivariant vision encoders: Can diffusion policies benefit from incorporating equivariant vision encoders?</p><p>3. Pre-trained encoders: How effectively can we leverage pre-trained encoders with Frame Averaging (Equation <ref type="formula">5</ref>)?</p><p>4. Comparison to end-to-end equivariant diffusion: How do these approaches compare with fully equivariant diffusion policies <ref type="bibr">[60]</ref>?</p><p>We evaluate our approaches on 12 robotic manipulation tasks in the MimicGen <ref type="bibr">[42]</ref> benchmark, as illustrated in Figure <ref type="figure">4</ref>. We perform an additional Robomimic <ref type="bibr">[41]</ref> experiment in Appendix D.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Action and Observation Representation</head><p>We first evaluate the three action representations (absolute trajectory, relative trajectory, and delta trajectory) discussed in Section 4.1 across two different observation settings, Large FOV In-Hand and In-Hand + External. Figure <ref type="figure">2</ref> illustrates the differences between these configurations.</p><p>The results, presented in Table <ref type="table">1</ref>, demonstrate several key findings. First, relative trajectory consistently outperforms absolute trajectory in 11 (for Large FOV In-Hand) or 10 (for In-Hand + External) out of 12 tasks. On average, relative trajectory provides a 5.9% improvement over absolute trajectory with Large FOV In-Hand observations and a 7.4% improvement with In-Hand + External observations. These results align with our theoretical analysis in Section 4.1, confirming that the symmetry properties of relative trajectory representations contribute to better performance. However, despite having the same theoretical guarantee, delta trajectory empirically performs poorly, underperforming absolute trajectory by 2.9% on average, and performing well only in relatively simple tasks. We hypothesize that this is because delta trajectory can be interpreted as a sequence of velocity vectors, containing less temporal and structural information for the denoising process. Notice that similar observations of the underperformance of velocity control in diffusion policy learning were also reported in prior works <ref type="bibr">[5,</ref><ref type="bibr">60]</ref>. When comparing across observation settings using relative trajectory, we find that Large FOV In-Hand generally performs better than or on par with In-Hand + External. However, when averaged across all tasks, the Large FOV In-Hand setup underperforms by 2.7%. 
This performance gap is primarily due to a significant drop in the Coffee Preparation task, where the eye-in-hand view alone provides insufficient information for completing this long-horizon task. Moreover, we found that tasks like Threading sometimes encounter occlusion challenges. As shown in Figure <ref type="figure">5</ref>, those limitations of a single eye-in-hand image constitute the majority of the failure modes. Despite these drawbacks, leveraging the invariant observation and action representations provides a 4.7% improvement compared with the original Diffusion Policy, which uses In-Hand + External views and absolute trajectory.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Integrating Symmetry into the Vision Encoder</head><p>Having established the advantages of invariant action representations, we now investigate different approaches for incorporating symmetry into the vision encoder component of diffusion policies. We compare four methods: CNN Encoder (CNN Enc): A standard ResNet-18 <ref type="bibr">[17]</ref> without any symmetry constraints, trained from scratch; Equivariant Encoder (Equi Enc): An equivariant ResNet-18 architecture implemented with equivariant layers using the escnn <ref type="bibr">[4]</ref> library, enforcing C_8-equivariance with outputs in the regular representation of C_8; Pretrained Encoder (Pretrain): A standard ResNet-18 pretrained on ImageNet-1k <ref type="bibr">[12]</ref>, without explicit symmetry constraints; Pretrained Encoder with Frame Averaging (Pretrain + FA): A pretrained ResNet-18 enhanced with Frame Averaging (Equation <ref type="formula">5</ref>) to achieve C_8-equivariance without modifying the underlying network architecture.</p><p>Table <ref type="table">2</ref> presents our findings across all 12 manipulation tasks. The results reveal several important insights: First, comparing non-pretrained encoders (Equi Enc vs. CNN Enc), we observe that incorporating equivariance improves performance in 11 out of 12 tasks, yielding a substantial 9.1% average improvement. This confirms that explicit symmetry constraints significantly benefit diffusion policy learning. Second, in the pretrained encoder setting, adding Frame Averaging (Pretrain + FA vs. Pretrain) leads to a 4.1% average performance improvement, with superior results in 7 out of 12 tasks. This demonstrates that symmetry benefits can be obtained even when leveraging powerful pretrained representations. 
Third, comparing our approaches to Equivariant Diffusion Policy <ref type="bibr">[60]</ref> (EquiDiff), we find that both Equi Enc and Pretrain + FA achieve competitive performance.</p><p>Specifically, our Equi Enc approach outperforms image-based EquiDiff (Im) on average, while Pretrain + FA achieves results only 2.5% below voxel-based EquiDiff (Vo). This is particularly impressive considering that EquiDiff (Vo) utilizes RGBD inputs from four cameras and employs a substantially more complex architecture. In contrast, our Pretrain + FA approach requires only a single eye-in-hand RGB image and minimal equivariant reasoning, making it considerably more practical for real-world deployment. Overall, these results suggest that integrating symmetry through equivariant encoders provides significant performance benefits for diffusion policies, with Frame Averaging offering an elegant way to leverage powerful pretrained representations while maintaining equivariance properties.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Discussion</head><p>In this paper, we present a practical guide for incorporating symmetry in diffusion policies, achieving performance competitive with or exceeding fully equivariant architectures while requiring significantly less implementation complexity. Notably, our method performs only 2.5% below voxel-based EquiDiff, despite using only a single eye-in-hand RGB image compared to EquiDiff's four RGBD cameras. Our approach not only sets a new state of the art for RGB eye-in-hand diffusion policies but, more importantly, addresses the trade-off between architectural complexity and sample efficiency when introducing symmetries into policy learning.</p><p>Concretely, we investigate three straightforward approaches for incorporating symmetry: invariant representations through relative trajectory actions and eye-in-hand perception, integrating equivariant vision encoders, and using Frame Averaging with pretrained encoders. Our extensive experimental evaluation across 12 manipulation tasks in MimicGen yields several important findings. First, we demonstrate that relative trajectory actions consistently outperform absolute trajectory actions, confirming our theoretical analysis that the relative trajectory induces SE(3)-invariance. This finding is particularly valuable because a simple coordinate frame change in the action representation can bring a 5-7% improvement. Second, we found that incorporating symmetry through equivariant vision encoders significantly enhances performance, by 9.1%, highlighting the value of symmetry-aware features while avoiding complex end-to-end reasoning. Lastly, we show that Frame Averaging provides an elegant solution for leveraging the power of pre-trained vision encoders while maintaining equivariance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Limitations</head><p>There are several limitations of this work that suggest directions for future research. First, only leveraging an eye-in-hand image assumes a good coverage of the entire workspace (thus we use an enlarged FOV in our experiments); however, as shown in Figure <ref type="figure">5</ref>, the limited view still constitutes the most significant failure mode of our system. In future works, this could be addressed by using a fish-eye camera <ref type="bibr">[7]</ref>, or a memory mechanism to maintain context across timesteps. Second, while our approaches are theoretically applicable to other policy learning frameworks beyond diffusion models, such as ACT <ref type="bibr">[72]</ref>, we limited our investigation to diffusion policies and only experimented in the MimicGen <ref type="bibr">[42]</ref> and Robomimic <ref type="bibr">[41]</ref> benchmarks. Third, leveraging an equivariant encoder, especially with Frame Averaging, could be computationally expensive. Our method roughly takes twice the GPU hours to train compared with the original Diffusion Policy, but is twice as fast as EquiDiff (Im). Finally, although our method is well-suited for real-world deployment on systems like UMI <ref type="bibr">[7]</ref> , we have not yet demonstrated this transfer to physical robots.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Proof of Proposition 1</head><p>Proof. 1. Absolute trajectory equivariance: Given an absolute trajectory action a = {A t+i } n-1 i=0 , each pose A t+i is defined in the world frame. Under the transformation g, each pose transforms as gA t+i , i = 0, . . . , n -1. Thus, by definition, the absolute trajectory transforms equivariantly: g &#8226; a = {gA t , gA t+1 , . . . } 2. Relative trajectory invariance: Consider a relative trajectory action a r = {A r t+i } n-1 i=0 defined in the local gripper frame at the initial time step t. The corresponding absolute poses are obtained as A t+i = T t A r t+i . Under the global transform g, the absolute pose becomes g &#8226; A t+i = gT t A r t+i . Since the relative pose A r t+i appears as a right multiplication factor, it remains unchanged under the global transform. Hence, we have invariance: g &#8226; a r = a r 3. Delta trajectory invariance: For a delta trajectory action</p><p>t+i is expressed in the gripper's local frame at time t + i -1. The absolute pose reconstruction is given by</p><p>Under the global transform g, we have g</p><p>where each incremental transform A d t+j is multiplied on the right and thus remains unaffected by the global transformation g. Therefore, the delta trajectory action is invariant:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Proof of Proposition 2</head><p>Proof. We treat the two cases in parallel. In both, the policy</p><p>outputs a sequence of absolute poses A t+i &#8712; SE(3). Internally it first predicts a "local" sequence</p><p>(either relative or delta) and then reconstructs absolute poses by anchoring to the current gripper pose T t .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Case 1: Relative trajectories. By Definition 2,</head><p>A t+i = T t A r t+i , i = 0, . . . , n -1, where A r t+i is the ith pose in the relative sequence &#960;</p><p>By assumption, &#960; does not depend on the gripper pose T t explicitly. Thus, applying the transformation g to the observation has no effect on the relative trajectory prediction:</p><p>Reconstructing absolute poses from the transformed observation gives, for each i, &#960;(g</p><p>Case 2: Delta trajectories. By Definition, the delta-reconstruction is</p><p>where {A d t+j } is the delta-sequence from &#960;(o t ). Again invariance of &#960; under g gives &#960;(g</p><p>In both cases we have shown that the entire trajectory satisfies &#960;(g</p><p>Lift Can Square tool hang Obs Action Method Mean 100 PH 100 MH 100 PH 100 MH 100 PH 100 MH 100 PH Large FOV In-Hand Rel Traj Ours 93.4 100.0&#177;0.0 100.0&#177;0.0 99.3&#177;0.7 100.0&#177;0.0 88.0&#177;2.3 92.0&#177;0.0 74.7&#177;2.4 Voxel Abs Traj EquiDiff 90.4 100.0&#177;0.0 100.0&#177;0.0 99.3&#177;0.7 96.7&#177;0.7 84.0&#177;1.2 76.7&#177;1.3 76.0&#177;0.0 In-Hand + External Abs Traj DiffPo 87.9 100.0&#177;0.0 100.0&#177;0.0 100.0&#177;0.0 95.3&#177;0.7 85.3&#177;0.7 70.7&#177;0.7 64.0&#177;5.8</p><p>Table <ref type="table">3</ref>: The performance of our method compared with the baselines in Robomimic. We experiment with 100 Proficient-Human (PH) or Multi-Human (MH) demos in each environment. Results averaged over three seeds. &#177; indicates standard error.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Training Detail</head><p>We follow the training setup and hyper-parameters of the prior works <ref type="bibr">[5,</ref><ref type="bibr">60,</ref><ref type="bibr">7]</ref>. Specifically, our RGB observation has a size of 3 &#215; 84 &#215; 84 (which will be random cropped to 3 &#215; 76 &#215; 76 during training), and all tasks have a full 6 DoF SE(3) action space. The observation contains two steps of history observation, and the output of the denoising process is a sequence of 16 action steps. We use all 16 steps for training but only execute eight steps in evaluation. In all pretraining encoder variations, we use two steps of proprioceptive observation but only one step of visual observation, following Chi et al. <ref type="bibr">[7]</ref>. The vision encoder's output dimension is 64 for for CNN Enc (following <ref type="bibr">[5]</ref>), 128&#215;8 for Equi Enc (128 channel regular representation of C 8 , following <ref type="bibr">[60]</ref>), and 512 for Pretrain (following <ref type="bibr">[7]</ref>). The diffusion UNet has [512, 1024, 2048] hidden channels for end-to-end training variations (following Chi et al. <ref type="bibr">[5]</ref>), and [256, 512, 1024] hidden channels for pretraining encoder variations (following Chi et al. <ref type="bibr">[7]</ref>). We train our models with the AdamW <ref type="bibr">[39]</ref> optimizer (with a learning rate of 10 -4 and weight decay of 10 -6 ) and Exponential Moving Average (EMA). We use a cosine learning rate scheduler with 500 warm-up steps. We use DDPM <ref type="bibr">[18]</ref> with 100 denoising steps for both training and evaluation. We perform training for 600 epochs, and evaluate the method every 10 episodes (60 evaluations in total). All trainings are performed on a single GPU, where we perform training on internal clusters and desktops with different GPU models. 
Each training of the Pretrain + FA method takes from 3 hours (Stack D1) to 24 hours (Pick Place D0), due to the different sizes of the dataset. The total amount of compute used in this project is roughly 3000 GPU hours.</p></div>
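The learning-rate schedule (cosine decay with 500 warm-up steps) and the EMA update described above can be sketched in plain Python. TOTAL_STEPS is an assumed placeholder (the actual step count depends on dataset size), and the names are ours:

```python
import math

# Hyper-parameters from Appendix C; TOTAL_STEPS is illustrative.
BASE_LR, WARMUP, TOTAL_STEPS = 1e-4, 500, 100_000

def cosine_with_warmup(step):
    """Linear warmup for WARMUP steps, then cosine decay to zero."""
    if step < WARMUP:
        return BASE_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))

class EMA:
    """Exponential moving average of a flat parameter dict,
    updated once per optimizer step."""
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = dict(params)

    def update(self, params):
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v
```

At evaluation time the EMA (shadow) weights, not the live weights, would be loaded into the policy, matching common diffusion-policy practice.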
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Robomimic Experiment</head><p>In this section, we perform an experiment in the Robomimic <ref type="bibr">[41]</ref> environments. We compare our Pretrain + FA method against EquiDiff <ref type="bibr">[60]</ref> and the vanilla Diffusion Policy <ref type="bibr">[5]</ref>. As shown in Table <ref type="table">3</ref>, our method generally outperforms the baselines, achieving an average improvement of 3% compared with EquiDiff.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Pretraining and Frame Averaging with External View</head><p>In this experiment, we extend our analysis from Section 5.2 to the external view setting to verify whether our findings generalize across different observation configurations. Specifically, we perform an additional experiment on using pretrained encoders (with and without Frame Averaging) with In-Hand + External view. Similar to Section 5.2, we consider three methods in each view setting: 1) No Pretrain: A standard ResNet-18 <ref type="bibr">[17]</ref> trained from scratch; 2) Pretrain: A standard ResNet-18 pretrained on ImageNet-1k <ref type="bibr">[12]</ref>, without explicit symmetry constraints; 3) Pretrain + FA: A pretrained ResNet-18 enhanced with Frame Averaging (Equation <ref type="formula">5</ref>) to achieve C 8 -equivariance without modifying the underlying network architecture.</p><p>As shown in Table <ref type="table">4</ref>, the benefits of Frame Averaging remain consistent across both observation settings. In the In-Hand + External view setting, Pretrain + FA yields a 6.7% improvement compared with not using Frame Averaging (Pretrain), and a 15.5% improvement compared with training from scratch (No Pretrain). Notably, Pretrain + FA in In-Hand + External view outperforms EquiDiff (Im) in all tasks, and even achieves a 1% higher average performance compared with EquiDiff (Vo). This is particularly impressive considering the additional complexity of EquiDiff, as discussed in Section 5.2. Comparing Pretrain + FA across different views, despite using an additional camera, In-Hand + External only outperforms Large FOV In-Hand by 2.5%. This finding verifies our analysis in Section 4.2, and suggests Large FOV In-Hand is preferable in many applications due to its simplicity  (requiring only a single eye-in-hand camera) and easier transferability to diverse robotic platforms beyond tabletop manipulation scenarios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F Pretraining and Frame Averaging with Absolute Action and Multi-Camera Observation</head><p>To demonstrate generality beyond the in-hand setup, we evaluate the proposed Pretrain+FA encoder under the same observation/action setting as Diffusion Policy and EquiDiff (in-hand + external views and absolute actions) in four different environments. As shown in Table <ref type="table">5</ref>, employing the Pretrain+FA encoder yields a significant 21.5% and 9.7% improvement for Diffusion Policy and EquiDiff, respectively. These results confirm the encoder's plug-and-play nature across observation and action parameterizations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G Ablation Study</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.1 Isolating Relative Trajectory and Symmetric Feature Extraction</head><p>We explicitly isolate each component in our design. As shown in Table <ref type="table">6</ref>, starting from the full model, replacing the relative trajectory with absolute trajectory reduces performance by 11.1% across four tasks, while replacing the encoder with a non-pretrained CNN reduces performance more significantly by 19.6%. This result highlights the complementary benefits from symmetry-aware features and invariant action parameterization. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.2 Ablating Proprioception</head><p>We test Proposition 2's assumption by removing the gripper pose from the policy input in Table <ref type="table">7</ref>. As expected, this tighter symmetry assumption reduces performance-indicating that proprioception, while symmetry-breaking, provides valuable context. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.3 Ablating Symmetry Group</head><p>We vary the SO(2) discretization used by the equivariant encoder. As shown in Table <ref type="table">8</ref>, reducing the cyclic group order degrades performance (C8 &gt; C4 &gt; C2), aligning with prior observations <ref type="bibr">[62,</ref><ref type="bibr">58]</ref> that C8 is a practical sweet spot for 2D rotational symmetry. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H EquiDiff with Large FOV In-Hand Only</head><p>In this experiment, we evaluate EquiDiff with only a large-FOV eye-in-hand camera setting. The comparison is shown in Table <ref type="table">9</ref>. Compared to the original external-camera configuration, EquiDiff's mean success drops by 12.8%, underscoring the benefit EquiDiff derives from external viewpoint signals. This result also justifies our choice of using the original EquiDiff observation setting in our main results.</p><p>Table <ref type="table">9</ref>: Performance of EquiDiff under a large FOV in-hand-only camera setting vs. the original in-hand + external and voxel (multi-RGBD) settings. Removing external cameras substantially reduces EquiDiff performance on several tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I Broader Impact</head><p>This work has a couple of positive and negative social impacts. First, we provide a simple policy learning framework from eye-in-hand images, which could benefit the development of household robots or assistive robots that could be beneficial for society. However, since it is a behavior cloning algorithm and the robots' behavior completely depends on the training data, it could also potentially be used to train robots with harmful behavior. Therefore, it is important for future works to highlight safety monitoring, especially when deploying in real-world environments with human presence.</p><p>NeurIPS Paper Checklist</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Claims</head><p>Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?</p><p>Answer: [Yes] Justification: Our abstract and introduction clearly state the claims of this paper, including the contributions and the assumption that we are using an eye-in-hand camera. The claims made match our theoretical and experimental results.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the abstract and introduction do not include the claims made in the paper. &#8226; The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. &#8226; The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. &#8226; It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Limitations</head><p>Question: Does the paper discuss the limitations of the work performed by the authors?</p><p>Answer: [Yes]</p><p>Justification: We include a detailed limitation section in the paper, including the assumption and potential negative impact of using eye-in-hand camera, the limitation of the testing dataset, and the computational efficiency.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. &#8226; The authors are encouraged to create a separate "Limitations" section in their paper.</p><p>&#8226; The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. &#8226; The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. &#8226; The authors should reflect on the factors that influence the performance of the approach.</p><p>For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. &#8226; The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. &#8226; If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. 
&#8226; While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Theory assumptions and proofs</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5.</head><p>Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We include the code for all experiments in the supplementary material, including data generation and all the models. All the results will be reproducible using the code.</p><p>In the final version of the paper, we will provide a GitHub repository for the code of the project.</p><p>Guidelines:</p><p>&#8226; The answer NA means that paper does not include experiments requiring code.</p><p>&#8226; Please see the NeurIPS code and data submission guidelines (<ref type="url">https://nips.cc/  public/guides/CodeSubmissionPolicy</ref>) for more details. &#8226; While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). &#8226; The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (<ref type="url">https:  //nips.cc/public/guides/CodeSubmissionPolicy</ref>) for more details. &#8226; The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. &#8226; The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. &#8226; At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). 
&#8226; Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental setting/details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We provide the experimental setting in both the paper and the code. Guidelines: &#8226; The answer NA means that the paper does not include experiments. &#8226; The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. &#8226; The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment statistical significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We include the standard error of all our experiments, which is calculated from three random seeds. Guidelines: &#8226; The answer NA means that the paper does not include experiments. &#8226; The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. &#8226; The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). &#8226; The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) &#8226; The assumptions made should be given (e.g., Normally distributed errors). 
&#8226; It should be clear whether the error bar is the standard deviation or the standard error of the mean. &#8226; It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. &#8226; For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). &#8226; If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments compute resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We discuss the compute resources in the appendix. Guidelines: &#8226; The answer NA means that the paper does not include experiments. &#8226; The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. &#8226; The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. &#8226; The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper). 9. Code of ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics <ref type="url">https://neurips.cc/public/EthicsGuidelines</ref>? 
Answer: [Yes] Justification: Our work conforms with the NeurIPS Code of Ethics in every respect.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.</p><p>&#8226; If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. &#8226; The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10.">Broader impacts</head><p>Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?</p><p>Answer: [Yes]</p><p>Justification: We include a broader impact statement in the appendix, discussing the potential positive and negative societal impacts.</p><p>Guidelines:</p><p>&#8226; The answer NA means that there is no societal impact of the work performed.</p><p>&#8226; If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. &#8226; Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. &#8226; The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. &#8226; The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. 
&#8226; If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="11.">Safeguards</head><p>Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?</p><p>Answer: [NA] Justification: Our work does not have a high risk of misuse.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper poses no such risks.</p><p>&#8226; Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. &#8226; Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. &#8226; We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.</p><p>12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?</p><p>Answer: [Yes]</p><p>Justification: All the prior works used are properly cited, we will also cite the codebases used in the paper in the GitHub repo.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not use existing assets.</p><p>&#8226; The authors should cite the original paper that produced the code package or dataset.</p><p>&#8226; The authors should state which version of the asset is used and, if possible, include a URL. 
&#8226; The name of the license (e.g., CC-BY 4.0) should be included for each asset.</p><p>&#8226; For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.</p><p>&#8226; If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. &#8226; For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. &#8226; If this information is not available online, the authors are encouraged to reach out to the asset's creators. 13. New assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: We will include those in the supplementary material as well as the GitHub repo for the final submission. Guidelines:</p><p>&#8226; The answer NA means that the paper does not release new assets.</p><p>&#8226; Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. &#8226; The paper should discuss whether and how consent was obtained from people whose asset is used.</p><p>&#8226; At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and research with human subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: This paper does not involve crowdsourcing nor research with human subjects. 
Guidelines: &#8226; The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. &#8226; Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. &#8226; According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: This paper does not involve crowdsourcing nor research with human subjects. Guidelines: &#8226; The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. &#8226; Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. &#8226; We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. &#8226; For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review. 16. Declaration of LLM usage Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? 
Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, declaration is not required. Answer: [NA] Justification: The core method development in this research does not involve LLMs as any important, original, or non-standard components. We only use LLM for editing, which is denoted in Openreview. Guidelines: &#8226; The answer NA means that the core method development in this research does not involve LLMs as any important, original, or non-standard components. &#8226; Please refer to our LLM policy (<ref type="url">https://neurips.cc/Conferences/2025/LLM</ref>)</p><p>for what should or should not be described.</p></div></body>
		</text>
</TEI>
