<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Approximate Equivariance in Reinforcement Learning</title></titleStmt>
			<publicationStmt>
				<publisher>International Conference on Artificial Intelligence and Statistics</publisher>
				<date when="2025">2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10652224</idno>
					<author>Jung Yeon Park</author><author>Sujay Bhatt</author><author>Sihan Zeng</author><author>Lawson LS Wong</author><author>Alec Koppel</author><author>Sumitra Ganesh</author><author>Robin Walters</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Equivariant neural networks have shown great success in reinforcement learning, improving sample efficiency and generalization when there is symmetry in the task. However, in many problems, only approximate symmetry is present, which makes imposing exact symmetry inappropriate. Recently, approximately equivariant networks have been proposed for supervised classification and modeling physical systems. In this work, we develop approximately equivariant algorithms in reinforcement learning (RL). We define approximately equivariant MDPs and theoretically characterize the effect of approximate equivariance on the optimal Q function. We propose novel RL architectures using relaxed group and steerable convolutions and experiment on several continuous control domains and stock trading with real financial data. Our results demonstrate that the approximately equivariant network performs on par with exactly equivariant networks when exact symmetries are present, and outperforms them when the domains exhibit approximate symmetry. As an added byproduct of these techniques, we observe increased robustness to noise at test time. Our code is available at https://github.com/jypark0/approx_equiv_rl.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Symmetry is a powerful inductive bias that can be used to improve generalization and data efficiency in deep learning. One way to leverage symmetry is through equivariant neural networks, which are model classes constrained to respect the symmetry of a known ground truth. Equivariant neural networks have successfully been applied to image classification <ref type="bibr">(Cohen and Welling, 2016;</ref><ref type="bibr">Worrall et al., 2017)</ref>, particle physics <ref type="bibr">(Bogatskiy et al., 2020)</ref>, molecular biology <ref type="bibr">(Satorras et al., 2021;</ref><ref type="bibr">Thomas et al., 2018)</ref>, and robotic manipulation <ref type="bibr">(Wang et al., 2022b)</ref>. Empirical studies have demonstrated that equivariant networks require far less data than their standard network counterparts <ref type="bibr">(Winkels and Cohen, 2018;</ref><ref type="bibr">Wang et al., 2022b)</ref>, can have fewer parameters <ref type="bibr">(Weiler and Cesa, 2019;</ref><ref type="bibr">He et al., 2022)</ref>, and can generalize better to unseen data <ref type="bibr">(Wang et al., 2020;</ref><ref type="bibr">Fuchs et al., 2020)</ref>.</p><p>Figure <ref type="figure">1</ref>: An approximately equivariant policy &#960; on a Reacher domain, where the goal is to determine the torques (green, magenta) to apply on each joint for the fingertip to reach the target (red). Due to wear, the first joint is more responsive to positive torques. When the state is flipped, the policy also flips the actions but can learn to adjust for symmetry breaking factors.</p><p>However, equivariant neural networks crucially assume that the data is perfectly symmetric in both the inputs and outputs, which may not be true in real-world data such as fluid dynamics <ref type="bibr">(Wang et al., 2022c)</ref> or financial data <ref type="bibr">(Black, 1986)</ref>.
By relaxing the strict equivariance constraints, approximately equivariant networks can outperform exactly equivariant and unconstrained networks in the presence of asymmetry. While various approaches to achieve approximate equivariance have been proposed <ref type="bibr">(Wang et al., 2022c;</ref><ref type="bibr">van der Ouderaa et al., 2022;</ref><ref type="bibr">McNeela, 2023;</ref><ref type="bibr">Kim et al., 2023)</ref>, they focus on vision-based tasks or dynamics modeling.</p><p>One area where symmetry has been especially useful is in reinforcement learning (RL), where equivariant networks greatly improve sample efficiency <ref type="bibr">(Wang et al., 2022b;</ref><ref type="bibr">Zhu et al., 2022)</ref>, a key challenge in RL. However, most works consider exact symmetry and use exactly equivariant networks, which cannot address symmetry breaking in the reward or transition functions or noise in the observations. In this work, we employ relaxed group and steerable convolutional neural networks for RL <ref type="bibr">(Wang et al., 2022c)</ref>; they are flexible enough to adapt to approximate equivariance but also have improved efficiency and robustness.</p><p>In this paper, we theoretically and empirically investigate approximately equivariant reinforcement learning. Our key contributions are to:</p><p>&#8226; formalize the notion of approximately equivariant MDPs and prove the (optimal) value function in such MDPs exhibits approximate equivariance, motivating the use of approximately equivariant RL, &#8226; introduce a novel approximately equivariant RL architecture using relaxed group convolutions, &#8226; demonstrate improved sample efficiency and robustness to noise for our approximately equivariant RL compared to other baselines with or without symmetry biases, &#8226; successfully apply approximately equivariant RL to real-world financial data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>Equivariant Reinforcement Learning Early works explored equivalence classes in reinforcement learning from the lens of abstractions by defining MDP homomorphisms <ref type="bibr">(Ravindran and Barto, 2002;</ref><ref type="bibr">Zinkevich and Balch, 2001)</ref>. More recently, several approaches have combined RL with equivariant neural network function approximators (<ref type="bibr">Van der Pol et al., 2020;</ref><ref type="bibr">Wang et al., 2022b;</ref><ref type="bibr">Mondal et al., 2020)</ref> with significantly improved sample efficiency. However, all of these works considered perfectly symmetric domains where the policy is constrained to be exactly equivariant. This paper considers domains with symmetry breaking factors where exactly equivariant networks can be suboptimal.</p><p>Approximate Equivariant Architectures There has been recent interest in exploring approximate equivariance and approximately equivariant neural networks <ref type="bibr">(Finzi et al., 2021;</ref><ref type="bibr">Wang et al., 2022c;</ref><ref type="bibr">Romero and Lohit, 2022;</ref><ref type="bibr">van der Ouderaa et al., 2022;</ref><ref type="bibr">McNeela, 2023;</ref><ref type="bibr">Petrache and Trivedi, 2024;</ref><ref type="bibr">Samudre et al., 2024)</ref>. <ref type="bibr">Wang et al. (2022c, 2024b)</ref> use a linear combination of exactly equivariant convolution kernels with learnable weights to achieve relaxed equivariance and discover symmetry breaking factors. van der Ouderaa et al. (2022) define a nonstationary kernel and a tunable frequency parameter to control the amount of approximate equivariance. McNeela (2023) proposes using a neural network to approximate the exponential map from the Lie algebra to the group to learn almost equivariant functions.
Petrache and Trivedi (2024) give theoretical bounds on when approximate equivariance can improve generalization. However, none of these works studied approximate equivariance in RL, the main focus of this work.</p><p>Closest to our setting is Residual Pathway Priors <ref type="bibr">(Finzi et al., 2021)</ref>, which considered soft equivariance constraints in model-free RL. They construct a relaxed equivariant neural network layer as the sum of an exactly equivariant and a non-equivariant layer with a prior on the equivariant layer. We take a different approach in this work and use relaxed group convolutions <ref type="bibr">(Wang et al., 2022c)</ref>, which are flexible enough to learn different outputs for each transformation.</p><p>Learning with Latent Symmetry Other works also apply equivariant neural networks to domains with latent symmetry. These are cases where the full state has exact symmetry but only partial observations with an unknown group action are available to the model. <ref type="bibr">Park et al. (2022)</ref> learn the out-of-plane rotations from 2D images using a symmetric embedding network while others have learned 3D rotational features from images using manifold latent variables <ref type="bibr">(Falorsi et al., 2018)</ref> or disentanglement <ref type="bibr">(Quessard et al., 2020)</ref>. <ref type="bibr">Wang et al. (2022a)</ref> find that equivariant models where the group acts directly on observation space perform well in RL even with camera skew or occlusions. They define extrinsic equivariance (transformed samples are outside the data distribution) and show that it can be beneficial in some scenarios but harmful in others <ref type="bibr">(Wang et al., 2024a)</ref>. Unlike these works where the observation is partial and does not contain full information about the state, we assume that the domains are fully observable and consider various symmetry breaking factors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">BACKGROUND</head><p>In this section, we provide some background on symmetry groups and equivariant functions. As building blocks of exactly and approximately equivariant networks, we also describe exact and relaxed group convolutions, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Groups and Equivariance</head><p>A symmetry group G is a set equipped with a binary operation that satisfies associativity, existence of an identity, and existence of inverses. A group can act on a vector space X via a group representation &#961; X which homomorphically assigns to each element g &#8712; G an invertible matrix &#961; X (g) &#8712; GL(X). For example, for a finite group G, the regular representation acts on R |G| by permuting basis elements {e g : g &#8712; G} as &#961; reg (g)e h = e gh .</p><p>A function f : X &#8594; Y is G-equivariant if f (&#961; X (g)x) = &#961; Y (g)f (x) for all g &#8712; G and x &#8712; X. That is, transformations of the input x by g correspond to transformations of the output by the same group element. We can enforce this constraint in equivariant neural networks to learn only over the space of equivariant functions by replacing linear layers with group or steerable convolutional layers. One benefit of enforcing equivariance is lower sample complexity as the network searches over a reduced function class.</p></div>
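To make the regular representation concrete, the following sketch (ours, not from the paper; `regular_rep` is a hypothetical helper name) builds &#961; reg (g) for the cyclic group Z_n as a permutation matrix and checks the homomorphism property:

```python
import numpy as np

def regular_rep(g, n):
    """Permutation matrix of g in the regular representation of Z_n.

    Basis vectors e_h are indexed by group elements h = 0..n-1 and
    rho_reg(g) sends e_h to e_{g+h mod n}.
    """
    P = np.zeros((n, n))
    for h in range(n):
        P[(g + h) % n, h] = 1.0
    return P

n = 4
g1, g2 = 1, 3
# Homomorphism property: rho(g1) @ rho(g2) == rho(g1 + g2 mod n)
assert np.allclose(regular_rep(g1, n) @ regular_rep(g2, n),
                   regular_rep((g1 + g2) % n, n))
```

Since 1 + 3 = 0 in Z_4, the product above is the identity matrix, illustrating that inverses map to inverse matrices.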
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Group Convolution</head><p>One method of constructing equivariant network layers is by group convolution <ref type="bibr">(Cohen and Welling, 2016)</ref>, which we briefly describe here. Group convolutions map between features which are signals over the group, f : G &#8594; R. For inputs not natively of this form, a lift operation must first be performed. Let &#968; &#952; : G &#8594; R be the convolutional kernel parameterized by &#952;. A G-equivariant group convolutional layer is defined as (f &#8902; &#968; &#952; )(g) = &#8721; h&#8712;G f (h)&#968; &#952; (g -1 h).</p><p>Equivariance follows from the fact that the kernel depends only on the product g -1 h and not the specific elements (g, h). For example, if we consider equivariance across translations, we obtain the standard convolution where h, g &#8712; Z 2 and g -1 h = h -g. Another possible approach to constructing equivariant network layers is with G-steerable convolutions (Cohen and Welling, 2017), which can generalize to continuous groups.</p></div>
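The group convolution can be prototyped directly from its definition. The sketch below (our illustration; `group_conv` and `shift` are hypothetical names) uses the cyclic group Z_n, where g&#8315;&#185;h = h - g mod n, and verifies equivariance numerically:

```python
import numpy as np

def group_conv(f, psi):
    """(f * psi)(g) = sum_h f(h) psi(g^{-1} h) for the cyclic group Z_n,
    where g^{-1} h = h - g (mod n)."""
    n = len(f)
    return np.array([sum(f[h] * psi[(h - g) % n] for h in range(n))
                     for g in range(n)])

def shift(f, u):
    """Left action of u in Z_n on a signal f: (u . f)(h) = f(u^{-1} h)."""
    return np.roll(f, u)

rng = np.random.default_rng(0)
f, psi = rng.standard_normal(6), rng.standard_normal(6)
u = 2
# Equivariance: convolving the shifted signal equals shifting the output.
assert np.allclose(group_conv(shift(f, u), psi), shift(group_conv(f, psi), u))
```

Substituting h' = h - u in the sum shows why this holds: the shift passes through the convolution untouched because the kernel only sees the difference h - g.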
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Relaxed Group Convolution</head><p>A key component of our method is the relaxed version of the group convolution <ref type="bibr">(Wang et al., 2022c)</ref>. The kernel &#968; is replaced with several kernels {&#968; l } L l=1 and the output is composed as a linear combination. The relaxed group convolution is defined as (f &#8902; &#968;)(g) = &#8721; h&#8712;G f (h) &#8721; L l=1 w l (h)&#968; l &#952; (g -1 h), where w l are the relaxed weights and each &#968; l &#952; is constrained to be exactly equivariant. Note that as w l (h) depends on the specific element h, this breaks the strict equivariance of the group convolution. <ref type="bibr">Wang et al. (2022c)</ref> also introduce relaxed versions of steerable convolutions; see <ref type="bibr">Wang et al. (2022c)</ref> or Appendix C.2 for more details.</p></div>
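The element-dependent weights w_l(h) are exactly what break strict equivariance. A minimal numerical check of this (again on Z_n, with hypothetical helper names, not the paper's implementation):

```python
import numpy as np

def relaxed_group_conv(f, psis, w):
    """Relaxed group convolution on Z_n:
    (f * psi)(g) = sum_h f(h) sum_l w[l, h] psi_l(g^{-1} h).
    Each psi_l alone gives an exactly equivariant convolution; the
    element-dependent weights w[l, h] relax that constraint."""
    n, L = len(f), len(psis)
    out = np.zeros(n)
    for g in range(n):
        out[g] = sum(f[h] * sum(w[l, h] * psis[l][(h - g) % n]
                                for l in range(L))
                     for h in range(n))
    return out

rng = np.random.default_rng(1)
n, L = 6, 3
f = rng.standard_normal(n)
psis = rng.standard_normal((L, n))
w_const = np.ones((L, n))              # constant weights: still equivariant
w_free = rng.standard_normal((L, n))   # element-dependent: relaxed

shift = lambda x, u: np.roll(x, u)
u = 2
equiv_err = lambda w: np.abs(
    relaxed_group_conv(shift(f, u), psis, w)
    - shift(relaxed_group_conv(f, psis, w), u)).max()
assert equiv_err(w_const) < 1e-10      # exact equivariance recovered
assert equiv_err(w_free) > 1e-3        # strict equivariance is broken
```

When the w_l(h) are constant in h, the layer collapses to an ordinary group convolution with kernel &#8721;_l w_l &#968;_l, which is why exact equivariance is recovered in that limit.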
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Approximate Equivariance</head><p>There have been several different definitions of approximate, relaxed, or partial equivariance. In this paper, we use the definition given by <ref type="bibr">Petrache and Trivedi (2024)</ref>. We give some background to build up to the definition. Let G be a group and f : X &#8594; Y, x &#8614; y be the task function.</p><p>Definition 1 (Equivariance Error). For g &#8712; G and x &#8712; X, the equivariance error ee(f, g, x) is defined as ee(f, g, x) := &#8741;f (gx) -gf (x)&#8741;.</p><p>The equivariance error measures exactly how far a function is from perfect equivariance with respect to G for a particular x. For an exactly G-equivariant function, ee(f, g, x) = 0 for all g &#8712; G and x &#8712; X.</p><p>Definition 2 (&#949;-stabilizer). The &#949;-stabilizer of f and G is defined as Stab &#949; (f, G) := {g &#8712; G : ee(f, g, x) &#8804; &#949; for all x &#8712; X}.</p><p>The &#949;-stabilizer gives the set of group elements for which the equivariance error is under some threshold.</p><p>We adopt the definition of approximate equivariance where f has bounded equivariance error for all g &#8712; G, in contrast to partial equivariance, where Stab &#949; (f, G) is a proper subset of G.</p></div>
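Both definitions are easy to evaluate numerically on a toy example. The sketch below is ours (not from the paper): it assumes G = Z_4 acting on R&#178; by 90&#176; rotations and a scalar output with trivial action, and it computes an &#949;-stabilizer over a finite set of sample points rather than all of X:

```python
import numpy as np

def rot(k):
    """Rotation by k * 90 degrees: the action of k in Z_4 on R^2."""
    c, s = np.cos(k * np.pi / 2), np.sin(k * np.pi / 2)
    return np.array([[c, -s], [s, c]])

# A nearly rotation-invariant function with a small symmetry-breaking term.
delta = 0.05
f = lambda x: x @ x + delta * x[0]

def ee(f, k, x):
    """Equivariance error ee(f, g, x) = ||f(gx) - g f(x)||; the output is
    a scalar with trivial action, so this reduces to |f(gx) - f(x)|."""
    return abs(f(rot(k) @ x) - f(x))

def eps_stabilizer(f, xs, eps):
    """Group elements whose equivariance error stays below eps on all xs."""
    return {k for k in range(4) if all(ee(f, k, x) <= eps for x in xs)}

xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
assert eps_stabilizer(f, xs, 1e-9) == {0}          # only the identity is exact
assert eps_stabilizer(f, xs, 0.2) == {0, 1, 2, 3}  # all errors are below 0.2
```

With delta = 0, every rotation would land in the 1e-9 stabilizer; the small linear term bounds the error by roughly 2&#183;delta&#183;|x&#8320;| instead, so f is approximately but not partially equivariant on these points.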
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">METHOD: APPROXIMATELY EQUIVARIANT REINFORCEMENT LEARNING</head><p>We first theoretically characterize the problem by defining approximately equivariant Markov decision processes (MDP). We then prove that environments with approximate symmetry admit approximately invariant Q functions. This motivates our method of using approximately equivariant neural networks to learn the policy and Q function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Approximately Equivariant MDP</head><p>Consider an infinite-horizon discounted-reward Markov decision process (MDP) represented by a tuple M = (S, A, P, R, &#947;) with state space S, action space A, instantaneous reward function R : S &#215; A &#8594; R, a transition function P : S &#215; A &#8594; &#8710; S and discount factor &#947; &#8712; (0, 1).</p><p>Let &#960; : S &#8594; &#8710; A be a policy giving the probability &#960;(a|s) of taking action a in state s. The expected cumulative rewards of using the policy starting from state s (or state s and action a) are the value functions V &#960; (s) = E &#960; [&#8721; t&#8805;0 &#947; t R(s t , a t ) | s 0 = s] and Q &#960; (s, a) = E &#960; [&#8721; t&#8805;0 &#947; t R(s t , a t ) | s 0 = s, a 0 = a].</p><p>The goal is to find a policy &#960; * that maximizes the expected return with an initial state distribution &#958;, i.e. &#960; * = argmax &#960; E s&#8764;&#958; [V &#960; (s)]. We denote the corresponding optimal value functions by V * and Q * .</p><p>Let G be a group acting on S and A. Denote the action of an element g &#8712; G on s and a by gs and ga, respectively. We now extend the definition of Equivariant MDPs <ref type="bibr">(Van der Pol et al., 2020)</ref> to cases where the symmetry is approximate.</p><p>Definition 4 ((G, &#1013; R , &#1013; P )-invariant MDP). An MDP M is (G, &#1013; R , &#1013; P )-invariant if for all s &#8712; S, a &#8712; A, and g &#8712; G, the reward mismatch satisfies |R(s, a) -R(gs, ga)| &#8804; &#1013; R and the transition mismatch, measured by the integral probability metric d F generated by a class of functions F, satisfies d F (P (&#8226;|s, a), g -1 P (&#8226;|gs, ga)) &#8804; &#1013; P ,</p><p>where the Minkowski functional w.r.t. F is &#961; F (f ) := inf{&#961; &#8805; 0 : f /&#961; &#8712; F}.</p><p>For the total variation distance &#961; F (f ) := 1 2 (max f -min f ) and for the Kantorovich metric &#961; F (f ) := &#8741;f &#8741; Lip .</p><p>The following theorem provides a characterization of the gap between the value functions in the original and symmetry transformed domain, for the (G, &#1013; R , &#1013; P )-invariant MDP described in Definition 4. Theorem 1 highlights that the Q-function is approximately group-invariant, where the approximation is now a function of the reward and transition mismatch, the discount factor, and the Minkowski functional evaluated on the optimal value function. Theorem 1. Let the rewards R be bounded R min &#8804; R &#8804; R max , 0 &#8804; &#947; &lt; 1 and let g &#8712; G be an onto mapping.
For any state s and action a, we have</p><p>|Q * (s, a) -Q * (gs, ga)| &#8804; &#945; := (&#1013; R + &#947;&#1013; P &#961; F (V * ))/(1 -&#947;).</p><p>Theorem 1 implies that when the invariance mismatch is small -i.e., when the domain has only minor symmetry violations -the Q-function is approximately group-invariant. A proof is provided in Appendix A. Note that, in Theorem 1, when the Kantorovich metric is used for uncertainty characterization, &#961; F (V * ) = &#8741;V * &#8741; Lip , where &#8741; &#8226; &#8741; Lip is the Lipschitz norm of the value function <ref type="bibr">(Gelada et al., 2019)</ref>. For the total variation distance, &#961; F (V * ) = 1 2 (max V * -min V * ).</p><p>Also, from Theorem 1, it is clear that when &#947; &#8712; [0, 1), we obtain a non-trivial characterization, while &#947; = 1 results in a trivial and uninformative bound. This is a limitation of the infinite-horizon setting, and can be remedied by considering an arbitrary finite-horizon setup. We do this for the sake of completeness in Appendix B. We not only show that the finite-horizon setup allows for a time-dependent transition function, but also obtain an approximate group-invariance of the time-dependent Q-function in terms of elements similar to those that appear in Theorem 1.</p><p>There are different ways to use the above results. One can discover how approximate the value functions are and learn &#945;, or one can incorporate approximate equivariance into the model and leverage the benefits of equivariance. We take the latter approach and consider approximately equivariant networks for the policy and critic in domains with inexact symmetry.</p></div>
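As a toy numerical illustration of this kind of bound (our construction, not the paper's; we assume the special case &#1013; P = 0, where the bound reduces to &#1013; R /(1 - &#947;)), one can build a small mirrored MDP and compare Q * across the group action:

```python
import numpy as np

# A 4-state, 2-action toy MDP where g mirrors states s -> (S-1) - s and
# actions a -> 1 - a. Transitions are made exactly symmetric; rewards
# break the symmetry by at most eps_R.
S, A, gamma, eps_R = 4, 2, 0.9, 0.1
g_s = lambda s: S - 1 - s
g_a = lambda a: A - 1 - a

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over S
R = rng.standard_normal((S, A))
for s in range(S // 2):                      # g is an involution: fix half
    for a in range(A):
        # Exact transition symmetry: P(gs'|gs, ga) = P(s'|s, a).
        P[g_s(s), g_a(a)] = P[s, a][[g_s(t) for t in range(S)]]
        # eps_R-approximate reward symmetry.
        R[g_s(s), g_a(a)] = R[s, a] + eps_R * rng.uniform(-1, 1)

Q = np.zeros((S, A))
for _ in range(2000):                        # value iteration to convergence
    Q = R + gamma * P @ Q.max(axis=1)

gap = max(abs(Q[s, a] - Q[g_s(s), g_a(a)]) for s in range(S) for a in range(A))
bound = eps_R / (1 - gamma)                  # assumed bound with eps_P = 0
assert gap <= bound + 1e-8
```

The check works because Q * (gs, ga) is the fixed point of a Bellman operator whose reward differs from R by at most &#1013; R , and &#947;-contraction turns that reward gap into a value gap of at most &#1013; R /(1 - &#947;).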
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Approximately Equivariant Actor-Critic</head><p>We propose approximately equivariant versions of two commonly used actor-critic algorithms, DrQv2 <ref type="bibr">(Yarats et al., 2021)</ref> and SAC <ref type="bibr">(Haarnoja et al., 2018)</ref>. In doing so, we generalize exactly equivariant versions of SAC <ref type="bibr">(Wang et al., 2022b)</ref> and DrQv2 (Wang et al., 2022a) from previous works by replacing strictly equivariant layers with relaxed equivariant layers.</p><p>Illustrative Example We first illustrate how to apply our proposed approximately equivariant actor-critic architecture on the Reacher domain; see Figure 2. The objective is to actuate a two-joint arm so that the end effector reaches the red point. The state is a stack of consecutive images s &#8712; R C&#215;H&#215;W and the action a &#8712; R 2 corresponds to torques for the first and second joints. There is clear rotational and reflectional symmetry in this domain. If the state (image) is rotated, the action should be invariant, as the actions are angular torques. If the state is reflected, then the action would also correspondingly be flipped (in sign). However, as in the example in Figure <ref type="figure">1</ref>, the first joint is more responsive to positive torques, which breaks rotational and reflectional symmetry.</p><p>For this domain, we implement approximate equivariance to the group D 2 of vertical reflections and &#960; rotations. The group D 2 transforms the input states by image transformations, where the input images are reflected or rotated. Latent representations are images z : R 2 &#8594; R C where g &#8712; D 2 acts on the pixel axes by image transformation and on the channel axis by permutations corresponding to the regular representation of D 2 , i.e. (gz)(x, y) = &#961; reg (g)z(g -1 &#8226;(x, y)).
Note that the latent representations can be high-dimensional, consisting of a direct sum of several different or repeated low-dimensional representations of D 2 . For the output, the torques a 1 and a 2 are scalars that change sign under reflection but are invariant under rotations.</p><p>Encoder, Policy, and Critic We extend exactly equivariant versions of SAC <ref type="bibr">(Wang et al., 2022b)</ref> and DrQv2 <ref type="bibr">(Wang et al., 2022a)</ref> by replacing each group convolution with a relaxed group convolution in the encoder, policy, and critics. Practically, each relaxed group convolution layer contains L exactly equivariant kernels &#968; l and the output is a linear combination of the outputs of these convolutions with relaxed weights w l (g). The w l (g) also transform as the regular representation of G; see Section 3.1 for the definition for finite groups.</p><p>The encoder E and the policy &#960; are approximately equivariant. The latent state z output by E is defined to transform as the direct sum of regular representations of G. The action representation is domain-specific. The critics are approximately invariant and output scalars q(s, a) that are fixed by G, i.e. transform via the trivial representation. For more details, please see Section 5 and Appendix C.</p><p>In the case of continuous groups, we can also construct relaxed steerable versions of the encoder, policy, and critics. Analogous to the group convolution case, we replace the exactly equivariant steerable convolutions with relaxed steerable convolutions. See Appendix C for more details.</p></div>
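The D 2 action on latent feature maps described above, (gz)(x, y) = &#961; reg (g)z(g -1 &#183;(x, y)), can be sketched in a few lines (our illustration, not the paper's code): the pixel axes transform by image rotation/reflection and the four channels permute by the regular representation:

```python
import numpy as np

# Elements of D2 ~ Z2 x Z2 written as (i, j): i = rotate by pi, j = mirror.
GROUP = [(0, 0), (0, 1), (1, 0), (1, 1)]
mul = lambda g, h: (g[0] ^ h[0], g[1] ^ h[1])

def act(g, z):
    """Action of g on a feature map z of shape (4, H, W): the pixel axes
    transform by image rotation/reflection, and the 4 channels permute by
    the regular representation, i.e. channel h moves to channel g * h."""
    i, j = g
    out = z
    if i:                                  # rotation by pi flips both axes
        out = out[:, ::-1, ::-1]
    if j:                                  # vertical mirror flips one axis
        out = out[:, :, ::-1]
    perm = [GROUP.index(mul(g, h)) for h in GROUP]
    new = np.empty_like(out)
    new[perm] = out                        # channel h -> channel g*h
    return new

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 5, 5))
for g in GROUP:
    for h in GROUP:
        # The action is a homomorphism: acting by h then g equals g*h.
        assert np.allclose(act(g, act(h, z)), act(mul(g, h), z))
```

An exactly equivariant layer commutes with `act`; a relaxed layer is only required to do so approximately, which is what the relaxed weights buy in domains like the worn Reacher joint.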
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EXPERIMENTS</head><p>We study how approximately equivariant RL compares to methods with exact equivariance and with no equivariance, in domains with exact symmetry as well as various symmetry breaking factors, aiming to elucidate when approximate equivariance should be preferred. We consider standard continuous control domains and stock trading with real-world data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Continuous Control</head><p>We first experiment on four continuous control domains in the DeepMind Control Suite <ref type="bibr">(Tassa et al., 2018)</ref>. Similar to <ref type="bibr">Wang et al. (2022a)</ref>, we consider a subset of the domains which have apparent symmetry. Acrobot, Cartpole, and BallInCup have reflectional symmetry described by the group D 1 and Reacher has D 2 symmetry. For all domains, the observations are a stack of 3 consecutive RGB images.</p><p>We modify the domains to carefully control the type and degree of symmetry breaking that is present. We first remove fixed background features such as random stars in the sky and checkered floors (see Figure <ref type="figure">4</ref>). These features break symmetry to some extent since they do not transform with the underlying state, but give a form of mild symmetry breaking termed extrinsic equivariance, which has an inconsistent impact on equivariant models <ref type="bibr">(Wang et al., 2022a)</ref>. We then introduce several different symmetry breaking factors for each domain: 1) repeat action -the action is repeated twice in a certain region of the domain, 2) gravity -gravity is modified from the force vector (0, 0, -9.81) to (a, -a, -9.81) where a &#8800; 0, and 3) reflect action -the action direction is flipped in certain regions of the domain. repeat action and reflect action test local symmetry breaking factors, while gravity tests a global symmetry breaking factor. See Appendix D.1 for more details.</p><p>Models For the continuous control tasks, we implement an approximately equivariant (ApproxEquiv) version of a SOTA image-based RL algorithm, DrQv2 <ref type="bibr">(Yarats et al., 2021)</ref>. We compare with exactly equivariant (ExactEquiv) and non-equivariant (NonEquiv) versions of the same architecture. We largely use the hyperparameters from Yarats et al.
(2021) but reduce the latent dimension for more tractable computation for all methods. We also compare against an approximately equivariant model, Residual Pathway Priors (RPP) <ref type="bibr">(Finzi et al., 2021)</ref>, and a self-supervised symmetry-aware model, SiT <ref type="bibr">(Weissenbacher et al., 2024)</ref>. We extend RPP to the DrQv2 architecture by using RPP layers in the encoder, policy, and critics. We find that RPP is somewhat sensitive to the speed &#964; of the critic moving average (as mentioned in the original paper), and had to reduce its value for Acrobot and BallInCup for stability. We also extend SiT to the DrQv2 architecture by using an SiT as the encoder and standard MLPs for the policy and critics.</p><p>Although we adapted the code from the official SiT implementation, we were unable to modify the input image sizes and had to use the image size used in the original paper (64px).</p><p>Results Figure <ref type="figure">3</ref> shows the total episode reward over training. As expected, we confirm that NonEquiv has much lower sample efficiency than the models with a symmetry bias. In the repeat action and reflect action variants of Acrobot, ApproxEquiv significantly outperforms ExactEquiv and RPP. It does slightly worse than ExactEquiv on the Reacher domain but beats RPP, suggesting that the symmetry breaking we introduced was not strong enough to induce incorrect equivariance. It is also possible that ExactEquiv can infer the symmetry breaking factors from the 3 frames of input, making the task a case of extrinsic equivariance where an equivariant model can succeed <ref type="bibr">(Wang et al., 2022a)</ref>. In CartPole and BallInCup, all methods perform similarly and learn an optimal policy quickly.
In domains with exact symmetry (original), our method ApproxEquiv performs similarly to ExactEquiv, showing there is no cost in performance from giving the model the ability to adapt to symmetry breaking in cases where it is not needed. This result supports Proposition 3.1 from <ref type="bibr">Wang et al. (2024b)</ref>, which proves that relaxed group convolutions initialized to be exactly equivariant stay exactly equivariant when trained with exact data symmetry.</p><p>We visualize the relaxed weights of the first layers of the encoder and policy over all runs in Figure <ref type="figure">5</ref>. If these weights are equal, the model is equivariant; the more they differ, the more the model has relaxed the symmetry constraint. For Acrobot and CartPole, the weights differ more for the modified domains than the original symmetric domain, especially for the encoder, while the policy weights vary more for the modified domains of BallInCup. This indicates the relaxed equivariant models have adapted to the symmetry breaking in the domains.</p><p>To quantitatively evaluate the models, we select the best-performing policy from all runs and measure the total reward over 50 episodes. The results echo the training curves in Figure <ref type="figure">3</ref>, where ApproxEquiv performs well, particularly in the domains with symmetry breaking factors (see Table <ref type="table">1</ref>).</p><p>To test whether approximately equivariant models are robust to noisy observations, we also consider variants of the domains where Gaussian noise is added to the input images only at test time (&#963; = 0.02 for Acrobot and Reacher, &#963; = 0.06 for CartPole and BallInCup).</p><p>Interestingly, we find that our approach is more robust to noisy inputs than ExactEquiv or NonEquiv, especially on the BallInCup and Reacher domains.
We further experiment with training on noisy data and testing on noisy domains to see which policies are more robust; see Table <ref type="table">4</ref> in Appendix E. We find that in the BallInCup domains, the approximately equivariant agent is still more robust to noise than the exactly equivariant or non-equivariant baselines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Stock Trading</head><p>We also consider a stock trading task using real-world price data, formulated as an MDP <ref type="bibr">(Liu et al., 2018)</ref>. Given a fixed amount of initial cash, the objective is to learn the optimal number of stocks to buy and sell (once daily) to maximize the portfolio value. The state consists of the current cash balance, the stock prices, the number of shares in the current portfolio, and other technical indicators of each stock. The actions are the number of stocks to buy and sell for each stock. The reward is the scaled difference in portfolio values between consecutive timesteps. We assume that the market dynamics are not affected by our trading. There is a small 0.1% transaction cost for every trade. We use real financial data scraped from Yahoo Finance (yfi, 1997) and consider the stocks in the Dow Jones index.</p><p>Models For this domain, we use SAC <ref type="bibr">(Haarnoja et al., 2018)</ref> as our RL algorithm and consider equivariance to both the translation group and the scale-translation group across the time dimension. Temporal translations can be useful as the most recent history of stock prices informs the agent's actions, and this information may be approximately preserved across time. Temporal scaling could also be beneficial as there could be market seasonality, which is only approximately shared across different time scales. As our actions do not affect stock prices, which are in turn directly correlated with the reward, we learn an approximately invariant policy and invariant critic for both symmetry groups. As before, we compare approximately equivariant, strictly equivariant, and unconstrained models. We evaluate each method on the final portfolio value (equivalent to the total episode reward), annualized return, and the Sharpe ratio (Sharpe, 1994), which is a standard financial metric that measures an asset's risk-adjusted performance.
We also include as baselines a uniform holding strategy Uniform, where we initially buy equal values of each stock and hold, and the Dow Jones index ^DJI.</p><p>Results Table <ref type="table">2</ref> lists the average test results of the learned policies on the stock trading domain. The ApproxEquiv models for both translation (T) and scale-translation (ST) outperform all baselines, with annualized returns of 10.6% and 12.0% respectively. The Exact ST-Equiv model outperforms NonEquiv, while the Exact T-Equiv model does worse. These observations suggest that temporal scale and translation symmetries can be good biases in analyzing financial data and that translation symmetry may be more approximate than scale. We also visualize 10 episode rollouts of the best-performing policies in Figure <ref type="figure">6</ref>, with the portfolio values on the left and transaction costs on the right. The Approx ST-Equiv method achieves the highest portfolio value for most timesteps and incurs lower transaction costs than the exactly equivariant policies. We note that overall the annualized returns are fairly low, as the test dataset from 2021-01-01 to 2024-07-01 includes both the COVID-19 pandemic and the 2022 stock market decline.</p><p>We visualize the relaxed weights of the first layer of the encoder across translation (left) and scale (right) in Figure <ref type="figure">7</ref>. For translation, our model places higher weights on the very last timestep. This matches our intuition, as the most recent stock prices and portfolio holdings would be most informative in determining the optimal action. For scale, we find that the relaxed weights do not differ greatly, but there is increased variance with increasing scale.</p></div>
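For reference, the evaluation quantities above can be computed as in the following generic sketch (ours, not the paper's implementation; `sharpe_ratio` and `step_portfolio` are hypothetical helpers), with the 0.1% proportional transaction cost applied per trade:

```python
import numpy as np

def sharpe_ratio(daily_returns, risk_free_rate=0.0, periods=252):
    """Annualized Sharpe ratio: mean excess return over its standard
    deviation, scaled by the square root of trading periods per year."""
    excess = np.asarray(daily_returns) - risk_free_rate / periods
    return np.sqrt(periods) * excess.mean() / excess.std(ddof=1)

def step_portfolio(cash, shares, prices, trades, cost_rate=0.001):
    """One trading step: apply integer buy/sell trades at current prices
    with a proportional transaction cost, then report portfolio value."""
    trades = np.asarray(trades, dtype=float)
    prices = np.asarray(prices, dtype=float)
    cash -= trades @ prices                       # pay/receive for trades
    cash -= cost_rate * np.abs(trades) @ prices   # 0.1% transaction cost
    shares = shares + trades
    return cash, shares, cash + shares @ prices

cash, shares = 10_000.0, np.zeros(2)
# Buy 10 shares of the first stock at $100: cost 1000 plus a $1 fee.
cash, shares, value = step_portfolio(cash, shares,
                                     prices=[100.0, 50.0], trades=[10, 0])
assert abs(value - 9999.0) < 1e-9   # fee is the only value lost immediately
```

The per-episode reward in this formulation would be the (scaled) change in `value` between consecutive calls, so the transaction cost directly penalizes churning, which is consistent with the lower costs observed for the Approx ST-Equiv policy.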
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">DISCUSSION</head><p>We proposed a novel approximately equivariant architecture using relaxed group convolutions for model-free reinforcement learning. Our experimental results on continuous control domains and a stock trading problem with real-world data demonstrate that the approximately equivariant model performs similarly to an exactly equivariant model in domains with perfect symmetry but outperforms it in most domains with symmetry breaking factors. This suggests that our method can act as a much more flexible alternative to exactly equivariant agents, boosting sample efficiency in a wider variety of settings while also being more robust to perturbations.</p><p>Limitations and Future Work While we did consider real-world data in the stock trading domain, our continuous control domains used simplified observations and synthetic symmetry breaking. Furthermore, exactly equivariant networks perform better in some modified domains than others (Reacher vs. Acrobot). Another limitation is that, as with all equivariant networks, the symmetry group and how it acts on the state and action spaces need to be known in advance.</p><p>An interesting future direction could be to quantify exactly what types of symmetry breaking factors could lead to higher performance for approximately equivariant RL, possibly by measuring equivariance error. Other future work includes proving bounds on the optimal policies &#960;(s) and &#960;(gs) or applying approximately equivariant RL in robotic manipulation, where kinematic constraints or obstacles can break symmetry.</p><p>1.
For all models and algorithms presented, check if you include:</p><p>(a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes]</p><p>(b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [No]</p><p>(c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Yes]</p><p>2. For any theoretical claim, check if you include:</p><p>(a) Statements of the full set of assumptions of all theoretical results. [Yes]</p><p>(b) Complete proofs of all theoretical results. [Yes]</p><p>(c) Clear explanations of any assumptions. [Yes]</p><p>3. For all figures and tables that present empirical results, check if you include:</p><p>(a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Yes]</p><p>(b) All the training details (e.g., data splits, hyperparameters, how they were chosen). [Yes]</p><p>(c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Yes]</p><p>(d) A description of the computing infrastructure used (e.g., type of GPUs, internal cluster, or cloud provider). [Yes]</p><p>4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:</p><p>(a) Citations of the creator, if your work uses existing assets. [Yes]</p><p>(b) The license information of the assets, if applicable. [Yes]</p><p>(c) New assets either in the supplemental material or as a URL, if applicable. [Yes]</p><p>(d) Information about consent from data providers/curators. [No]</p><p>(e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable]</p><p>5. 
If you used crowdsourcing or conducted research with human subjects, check if you include:</p><p>(a) The full text of instructions given to participants and screenshots. [Not Applicable]</p><p>(b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable]</p><p>(c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable]</p><p>Proof. We prove the result by induction. First, note that the result holds for T by definition. Suppose the result holds for t + 1, and consider the difference at time t:</p><p>|Q_t(s_t, a_t) − Q_t(gs_t, ga_t)| ≤ |R(s_t, a_t) − R(gs_t, ga_t)| + γ |∫_S V_{t+1}(s_{t+1}) P(s_{t+1} | s_t, a_t) − ∫_S V_{t+1}(gs_{t+1}) P(gs_{t+1} | gs_t, ga_t)| ≤ ε_R + γ |∫_S V_{t+1}(s_{t+1}) P(s_{t+1} | s_t, a_t) − ∫_S V_{t+1}(gs_{t+1}) P(s_{t+1} | s_t, a_t)| + γ |∫_S V_{t+1}(gs_{t+1}) P(s_{t+1} | s_t, a_t) − ∫_S V_{t+1}(gs_{t+1}) P(gs_{t+1} | gs_t, ga_t)|</p><p>The last inequality follows by adding and subtracting the cross term ∫_S V_{t+1}(gs_{t+1}) P(s_{t+1} | s_t, a_t) and applying the triangle (Minkowski) inequality. Further, note that the first integral term is bounded by the induction assumption, and the second by the fact that, when g is onto, sup_s |V_{t+1}(gs)| = sup_s |V_{t+1}(s)|. The result follows.</p><p>Proof. We have by definition,</p><p>Similarly, we have</p><p>We now prove Theorem 1. Let B(S) denote the Banach space of bounded real-valued functions on S. We define the Bellman optimality operator B : B(S) → B(S) such that for any uniformly bounded function V ∈ B(S),</p><p>It is known that V* is the (unique) fixed point of B, i.e., BV* = V*. We note that V* also satisfies, for any s,</p><p>V*(gs) = sup_{a∈A} [ R(gs, ga) + γ ∫_S V*(gs′) P(gs′ | gs, ga) ].</p><p>To see this, consider the following argument. By definition,</p><p>Q*(gs, ga) = R(gs, ga) + γ ∫_S V*(s′) P(s′ | gs, ga).</p><p>Since g ∈ G permutes the elements of S, re-indexing the integral via the substitution s′ ↦ gs′, we have</p><p>Q*(gs, ga) = R(gs, ga) + γ ∫_S sup_{a′∈A} Q*(gs′, ga′) P(gs′ | gs, ga).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proof of Theorem 1:</head><p>Consider a sequence of value functions V^(n) on the symmetry-transformed domain as follows: V^(0)(gs) = 0 and V^(n+1) = B V^(n). For an arbitrary T, using Proposition 1 we have, for any t ∈ {1, …, T},</p><p>where</p><p>From Proposition 2, noting that V(s) = sup_a Q(s, a), we have</p><p>By the Banach fixed-point theorem, we know that lim</p><p>. Therefore, taking the limit, we have</p><p>A similar argument establishes the result for Q using the onto function g. The claims in Theorem 1 follow by recognizing that V_t and Q_t exactly equal V* and Q*.</p></div>
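As a sanity check on the limiting behavior, the recursion above can be unrolled in closed form. The following is a reconstruction under the stated assumptions, with the constants being our reading of Propositions 1–2 rather than a quotation of Theorem 1:

```latex
% Assumed fixed point of the recursion
% \alpha = \epsilon_R + \gamma\bigl(\epsilon_P \|V^*\|_\infty + \alpha\bigr):
\alpha \;=\; \frac{\epsilon_R + \gamma\,\epsilon_P\,\|V^*\|_\infty}{1-\gamma},
\qquad\text{so}\qquad
|V^*(s) - V^*(gs)| \;\le\; \frac{\epsilon_R + \gamma\,\epsilon_P\,\|V^*\|_\infty}{1-\gamma}.
```

The 1/(1 − γ) factor is what makes the bound vacuous as γ → 1, which is exactly the issue addressed by the undiscounted analysis in Appendix B.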
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B CASE OF FINITE HORIZON: NO DISCOUNTING</head><p>As is clear from Theorem 1, as γ → 1 the bound becomes trivial and is no longer useful. In this section we briefly discuss the case where the discount factor γ = 1. In this setting, we allow the transition functions to depend on t.</p><p>Proposition 3. Let |R(gs_t, ga_t) − R(s_t, a_t)| ≤ ε_R and d_F(P_t(gs′_t | gs_t, ga_t), P_t(s′_t | s_t, a_t)) ≤ ε_P(t). For a finite-horizon MDP of duration T, we have |Q_t(s_t, a_t) − Q_t(gs_t, ga_t)| ≤ α_t and |V_t(s_t) − V_t(gs_t)| ≤ α_t, where α_{T+1} = 0 and, for t ∈ {1, 2, …, T},</p><p>Proof. The proof proceeds as in Proposition 1. We have</p><p>|Q_t(s_t, a_t) − Q_t(gs_t, ga_t)| ≤ |R(s_t, a_t) − R(gs_t, ga_t)| + |∫_S V_{t+1}(s_{t+1}) P_t(s_{t+1} | s_t, a_t) − ∫_S V_{t+1}(gs_{t+1}) P_t(gs_{t+1} | gs_t, ga_t)| ≤ ε_R + |∫_S V_{t+1}(s_{t+1}) P_t(s_{t+1} | s_t, a_t) − ∫_S V_{t+1}(gs_{t+1}) P_t(s_{t+1} | s_t, a_t)| + |∫_S V_{t+1}(gs_{t+1}) P_t(s_{t+1} | s_t, a_t) − ∫_S V_{t+1}(gs_{t+1}) P_t(gs_{t+1} | gs_t, ga_t)|</p><p>The result follows by recursion.</p></div>
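For reference, the recursion in Proposition 3 unrolls into an explicit sum. The following reconstruction assumes the recursion has the form α_t = ε_R + ε_P(t) sup_s |V_{t+1}(s)| + α_{t+1} suggested by the displayed inequalities (an assumption, since the display itself is not reproduced above):

```latex
\alpha_t \;=\; \sum_{k=t}^{T}\Bigl(\epsilon_R + \epsilon_P(k)\,\sup_{s}\,|V_{k+1}(s)|\Bigr).
```

Without discounting, the equivariance error thus grows at most linearly in the remaining horizon T − t + 1, rather than as 1/(1 − γ).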
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C BACKGROUND AND METHOD</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1 Equivariance with Group Convolutions</head><p>Group convolutions <ref type="bibr">(Cohen and Welling, 2016)</ref> generalize standard convolutions, which are translation-equivariant, to be equivariant to a group G. Group convolutions act on signals over the group, f : G → R. As many data samples are not natively of this form (e.g. an image), the input must first be lifted to a function on G. For example, let f_0 : Z² → R be the input signal, a grayscale image, and let H = D_2 be the group. The lifting convolution lifts</p><p>where h ∈ H. Practically, the lift operation creates |H| (the order of the group H) transformed images by acting on x by h⁻¹. Typically the lift operation is the first layer of the network, followed by subsequent group convolutions, nonlinearities, or other equivariant layers. We use relaxed versions of the lift and group convolutions as described in <ref type="bibr">Wang et al. (2022c)</ref> and the main paper.</p></div>
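To make the lifting operation concrete, here is a minimal NumPy sketch for the rotation group C_4 (chosen instead of the D_2 example above only because 90° rotations are easy to write with np.rot90; the function name lift_c4 is ours, not from the paper's code):

```python
import numpy as np

def lift_c4(image):
    """Lift a planar signal f0 : Z^2 -> R to a signal on the group C4,
    producing one channel per group element by acting on the pixel
    grid with each inverse rotation h^-1."""
    # np.rot90(image, k) rotates by 90*k degrees counterclockwise, so
    # rotating by -k implements the action of h_k^{-1}.
    return np.stack([np.rot90(image, -k) for k in range(4)])

# Usage: the lifted signal has |C4| = 4 orientation channels, and lifting
# a rotated image cyclically shifts the lifted channels (equivariance).
img = np.arange(9, dtype=float).reshape(3, 3)
lifted = lift_c4(img)
assert lifted.shape == (4, 3, 3)
assert np.allclose(lift_c4(np.rot90(img)), np.roll(lifted, 1, axis=0))
```

The final assertion is the lifting-layer equivariance property: rotating the input corresponds to a cyclic permutation of the group channels of the output.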
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 Steerable Convolutions</head><p>As an alternative to group convolutions, one can use steerable convolutions <ref type="bibr">(Weiler et al., 2018)</ref>, which use weight tying to generalize to continuous groups and are more parameter-efficient. Let H &lt; O(2) be a subgroup that acts on R² by matrix multiplication and on the input and output channel spaces R^c and R^d by representations ρ_in and ρ_out, respectively. Then G = H ⋉ R². Given an input signal f : R² → R^c, a standard convolution over R² with kernel ψ is G-equivariant if and only if the kernel satisfies</p><p>ψ(hx) = ρ_out(h) ψ(x) ρ_in(h)⁻¹</p><p>for all h ∈ H. Intuitively, this kernel constraint ensures that the output features transform by ρ_out when the input features are transformed by ρ_in. Kernel bases satisfying this constraint have been derived for many common subgroups of E(2); see <ref type="bibr">Weiler and Cesa (2019)</ref> for more details.</p><p>Using the example of grayscale images as in Section C.1, let the input feature be f : Z² → R and let {ψ_k}, k = 1, …, K, be a precomputed, nontrainable equivariant basis of K kernels that satisfy Eq. ( <ref type="formula">9</ref>). Assume that the numbers of input and output channels are both 1, and let w ∈ R^K be the trainable coefficients of the kernels. Then a G-steerable convolution is defined as</p><p>(ψ ⋆ f)(x) = Σ_{k=1}^{K} w_k (ψ_k ⋆ f)(x),</p><p>where x ∈ Z² is the spatial position and w_k is the weight associated with kernel ψ_k.</p><p>Relaxed Steerable Convolution As described in <ref type="bibr">Wang et al. (2022c)</ref>, one can also use relaxed versions of steerable convolutions by letting the trainable weights w also depend on the spatial position y. A relaxed G-steerable convolution is defined as</p><p>(ψ ⋆ f)(y) = Σ_{k=1}^{K} w_k(y) (ψ_k ⋆ f)(y).</p><p>Allowing the trainable weights w_k to also depend on the absolute spatial position y breaks the equivariance constraint in Eq. 
( <ref type="formula">9</ref>).</p><p>By replacing relaxed group convolutions with relaxed steerable convolutions, we can also design a variant of our proposed approximately equivariant RL architecture (Figure <ref type="figure">8</ref>).</p></div>
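As an illustration of the relaxation, below is a hedged NumPy sketch of a relaxed steerable convolution for a single input/output channel. The function name and the toy delta-kernel check are our own illustrative choices, not the paper's implementation, and the delta "basis" is only a placeholder rather than a genuine equivariant basis satisfying the kernel constraint:

```python
import numpy as np

def relaxed_steerable_conv2d(f, basis, w):
    """Relaxed G-steerable convolution on Z^2, single in/out channel.

    f:     (H, W) input signal
    basis: (K, k, k) precomputed, nontrainable kernel basis {psi_k}
    w:     (K, H, W) trainable coefficients; letting them vary with the
           output position y is what relaxes equivariance -- making w
           constant over (H, W) recovers the exact steerable version.
    """
    K, k, _ = basis.shape
    pad = k // 2
    fp = np.pad(f, pad)
    H, W = f.shape
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            patch = fp[y:y + k, x:x + k]
            # per-position mixture of the basis kernel responses
            out[y, x] = sum(w[j, y, x] * np.sum(basis[j] * patch)
                            for j in range(K))
    return out

# Sanity check with a toy one-element "basis" (a centered delta kernel):
# the output is then a position-dependent reweighting of the input.
f = np.arange(25, dtype=float).reshape(5, 5)
delta = np.zeros((1, 3, 3)); delta[0, 1, 1] = 1.0
w = np.ones((1, 5, 5)); w[0, 2, 2] = 3.0
out = relaxed_steerable_conv2d(f, delta, w)
assert np.allclose(out, w[0] * f)
```

A practical implementation would vectorize the spatial loops and use a genuine precomputed equivariant basis; the loop form is kept here to expose the per-position weights w_k(y).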
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D EXPERIMENT DETAILS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.1 Continuous Control</head><p>Acrobot We use the swingup task. The domain consists of two joints, and the goal is to apply torque to the inner joint so that both joints are near vertical. We use D_1 as the symmetry group, i.e. vertical reflection, and the action a ∈ R transforms via the sign representation ρ_sign, where ρ_sign(flip)(a) = −a. For variants, we consider 1) repeat action − the action is repeated when the inner joint is in the fourth quadrant, and 2) gravity − gravity g⃗ = [0, 0, −9.81] is modified to [−2, 2, −9.81].</p><p>CartPole We consider the swingup task. The domain consists of a pole swinging on a cart, and the goal is to move the cart left or right (a ∈ R) to make the pole upright. The symmetry group and action representation are the same as in Acrobot, D_1 and ρ_sign. For variants, we consider 1) repeat action − the action is repeated when the pole is in the first quadrant, 2) gravity − gravity is modified to [0.2, −0.2, −9.81], and 3) reflect action − the action is reflected when the pole angle is in [0, π/4]. Gravity is modified less than in Acrobot, as larger values forced the cart out of frame.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Cup Catch</head><p>The domain consists of a ball attached to the bottom of a cup, and the goal is to move the cup to catch the ball inside the cup. The action (x, z) ∈ R² is the cup's spatial position. The symmetry group is D_1 and the action representation is ρ_sign ⊕ ρ_0, where the x position transforms via the sign representation and z transforms via the trivial representation ρ_0. For variants, we consider 1) repeat action − the action is repeated when the ball's x position is greater than 0.0 and its z position is greater than 0.3, 2) gravity − gravity is modified to [−2, 2, −9.81], and 3) reflect action − the action is reflected under the same condition as repeat action.</p><p>Reacher We consider the hard task. The domain consists of two joints, and the goal is to apply torques so that the end effector reaches the target. The action is a ∈ R². The symmetry group is D_2, i.e. vertical reflections and π rotations, and the action transforms via the quotient representation 2ρ_quot, where the torques for both joints are invariant to rotations and flip signs under vertical reflections. For variants, we consider 1) repeat action − the action is repeated when the inner joint angle is in [0, π/2], and 2) reflect action − the action is reflected when the inner joint angle is in [π/2, π].</p></div>
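The symmetry-breaking variants above share a simple pattern: a state-dependent condition that alters how the chosen action is applied. Below is a hedged sketch of the "repeat action" variant as an environment wrapper; the class names, the toy environment, and the step/reset signatures are illustrative assumptions, not dm_control's actual API or the paper's code:

```python
class RepeatActionWrapper:
    """Apply the agent's action twice whenever `condition(obs)` holds.
    This mimics the 'repeat action' symmetry-breaking variants above."""

    def __init__(self, env, condition):
        self.env = env
        self.condition = condition  # maps last observation -> bool
        self._obs = None

    def reset(self):
        self._obs = self.env.reset()
        return self._obs

    def step(self, action):
        obs, reward, done = self.env.step(action)
        if not done and self.condition(self._obs):
            # symmetry breaking: the same action is applied a second time
            obs, r2, done = self.env.step(action)
            reward += r2
        self._obs = obs
        return obs, reward, done

class ToyEnv:
    """1-D integer chain used only to exercise the wrapper."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 10

# Usage: with the condition active, one agent step advances the env twice.
env = RepeatActionWrapper(ToyEnv(), condition=lambda obs: obs % 2 == 0)
obs = env.reset()                # obs = 0, condition holds
obs, reward, done = env.step(0)  # action applied twice: obs = 2, reward = 2.0
```

A "reflect action" variant would follow the same pattern, negating the action (via ρ_sign) instead of repeating it when the condition holds.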
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.1.1 Training Details</head><p>For all DeepMind Control Suite (DMC) domains, we fix the episode length to 1000 and use RGB images of size 85 × 85. We considered four domains of varying difficulty, of which Acrobot is the hardest. In the original DrQv2 implementation <ref type="bibr">(Yarats et al., 2021)</ref>, the encoder reduces the spatial dimensions to 35 × 35, which is then flattened as input to the policy and critic. We follow <ref type="bibr">Wang et al. (2022a)</ref> and further reduce the spatial dimensions to 7 × 7 for faster training for all models. We reduce the replay buffer size from 1,000,000 to 500,000 to slightly reduce the memory footprint. All other hyperparameters are kept the same as in <ref type="bibr">Yarats et al. (2021)</ref>.</p><p>For the exactly equivariant and approximately equivariant models, we reduce the number of channels by a factor of |G|, the order of the group, to preserve roughly the same number of parameters as the non-equivariant model. We use L = 1 filters for the approximately equivariant model in all experiments.</p><p>RPP contains both non-equivariant and exactly equivariant layers and thus has roughly twice as many parameters as ExactEquiv. For the critic moving-average speed τ, we use the default τ = 0.01 for CartPole and Reacher and τ = 0.009 for Acrobot and Ball in Cup.</p><p>The plots in Figure <ref type="figure">3</ref> show the mean reward over 10 episodes, evaluated every 20,000 environment steps. For the results in Table <ref type="table">1</ref>, we use σ = 0.02 for Acrobot and Reacher and σ = 0.06 for CartPole and Ball in Cup.</p><p>The continuous control experiments were run on single GPUs of different types. Acrobot was run on an Nvidia RTX 4090 and all other experiments were run on an Nvidia RTX 2080 Ti. 
We note that the wall-clock time for training both exactly and approximately equivariant models is longer than that for a non-equivariant model, even though they are generally more sample-efficient. This is because equivariant neural networks often incur extra implementation overhead: for group convolutions, the kernel must be transformed and the outputs stacked, and for steerable convolutions, the basis must be projected onto kernel matrices at every forward pass.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2 Stock Trading</head><p>We formulate the stock trading problem as an MDP as described in <ref type="bibr">Liu et al. (2018)</ref>. The state consists of the cash balance c_t, the stock prices p_t^n, the number of shares in the current portfolio h_t^n, and other technical indicators i_t^n for stock n ∈ {1, . . . , N} at time t. The actions x_t^n are the numbers of shares to buy or sell for each stock n and are bounded to [−M, M], where M was set to 100. The reward r_t is the scaled difference in portfolio value between consecutive timesteps, and we assume that the market dynamics are not affected by our trading. There is a small transaction cost ε_n = 0.001 for every trade. Initially, the portfolio contains 0 shares and the cash balance is 1,000,000. This can be formulated as a constrained program as follows </p></div>
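To fix ideas, one transition of the trading MDP just described can be sketched as follows. The function names, the REWARD_SCALE constant, and the exact cost accounting are our illustrative assumptions (in particular, this sketch notes but does not enforce the constraint that purchases cannot exceed the cash balance); M = 100 and the transaction cost rate 0.001 are taken from the text:

```python
import numpy as np

M = 100                   # per-stock trade limit from the text
TRANSACTION_COST = 0.001  # epsilon_n from the text
REWARD_SCALE = 1e-4       # assumed scaling constant (illustrative)

def trade_step(cash, holdings, prices, action):
    """One transition of the trading MDP (hedged sketch).
    cash: balance c_t; holdings: shares h_t, shape (N,);
    prices: p_t, shape (N,); action: shares to buy (+) or sell (-)."""
    action = np.clip(np.asarray(action, dtype=float), -M, M)
    trades = np.maximum(action, -holdings)  # cannot sell shares we do not hold
    # (a full implementation would also cap purchases by the cash balance)
    cost = TRANSACTION_COST * np.abs(trades) @ prices
    cash = cash - trades @ prices - cost
    return cash, holdings + trades

def portfolio_value(cash, holdings, prices):
    return cash + holdings @ prices

def reward(v_prev, v_next):
    # r_t: scaled change in portfolio value across the timestep
    return REWARD_SCALE * (v_next - v_prev)
```

Under this accounting, every executed trade shrinks the portfolio value by the transaction cost, which is why the rollouts in Figure 6 report transaction costs alongside portfolio values.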
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E ROBUSTNESS WITH NOISE AUGMENTATION</head><p>Table <ref type="table">4</ref> shows the results from training policies with noisy inputs and evaluating their robustness to noise at test time. This experiment tests whether our approximately equivariant method remains more robust to noise than baselines that are themselves trained with noise augmentation. We find that, on the modified domain, our approximately equivariant method is more robust than the baselines even when all models are trained with noise augmentation.</p></div>
		</text>
</TEI>
