<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Push-Grasp Policy Learning Using Equivariant Models and Grasp Score Optimization</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>11/01/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10652220</idno>
					<idno type="doi">10.1109/LRA.2025.3606392</idno>
					<title level='j'>IEEE Robotics and Automation Letters</title>
<idno type="issn">2377-3774</idno>
<biblScope unit="volume">10</biblScope>
<biblScope unit="issue">11</biblScope>					

					<author>Boce Hu</author><author>Heng Tian</author><author>Dian Wang</author><author>Haojie Huang</author><author>Xupeng Zhu</author><author>Robin Walters</author><author>Robert Platt</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Goal-conditioned robotic grasping in cluttered environments remains a challenging problem due to occlusions caused by surrounding objects, which prevent direct access to the target object. A promising solution to mitigate this issue is combining pushing and grasping policies, enabling active rearrangement of the scene to facilitate target retrieval. However, existing methods often overlook the rich geometric structures inherent in such tasks, thus limiting their effectiveness in complex, heavily cluttered scenarios. To address this, we propose the Equivariant Push-Grasp Network, a novel framework for joint pushing and grasping policy learning. Our contributions are twofold: (1) leveraging SE(2) equivariance to improve both pushing and grasping performance and (2) a grasp score optimization-based training strategy that simplifies the joint learning process. Experimental results show that our method improves grasp success rates by 45% in simulation and by 35% in real-world scenarios compared to strong baselines, representing a significant advancement in push-grasp policy learning.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>First, conventional network architectures struggle to represent the extensive state and action spaces associated with the push-grasp task, leading to poor generalization in novel, cluttered scenarios. Second, these methods are often sample-inefficient, as they require extensive data, heavy augmentation, and long training times <ref type="bibr">[5]</ref>. Lastly, many existing approaches involve complex training processes, often relying on alternating optimization between grasping and pushing prediction networks <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>.</p><p>In this letter, we introduce the Equivariant Push-Grasp (EPG) Network, a novel framework for efficient goal-conditioned push-grasp policy learning in cluttered environments. EPG leverages inherent task symmetries to improve both sample efficiency and performance. Specifically, we model the pushing and grasping policies using SE(2)-equivariant neural networks, embedding rotational and translational symmetry as an inductive bias. This design substantially enhances the model's generalization and data efficiency. Furthermore, we propose a self-supervised training approach that optimizes the pushing policy with a reward signal defined as the change in grasping scores before and after each push. This formulation simplifies the training procedure and naturally couples the learning of pushing and grasping.</p><p>In summary, our contributions are threefold. First, we propose a fully SE(2)-equivariant push-grasp framework that leverages the symmetry of environment dynamics as an inductive bias to boost policy learning efficiency. Second, we introduce a novel training strategy that treats the learned grasping policy as part of the environment, serving as a critic to guide and optimize the learning of the pushing policy. Lastly, extensive experiments in both simulation and real-world environments validate the effectiveness of our approach. 
Our proposed EPG achieves a 45% improvement in grasp success rates in simulation and a 35% improvement in real-world scenarios compared to prior baselines <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>, <ref type="bibr">[33]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Pushing and Grasping in Cluttered Environments</head><p>Target grasping in cluttered environments is challenging due to object overlap, occlusions, and the need for precise selection in densely populated scenes. Early approaches <ref type="bibr">[9]</ref>, <ref type="bibr">[10]</ref> evaluated SE(2) grasp configurations from top-down images but primarily focused on isolated objects or sparse environments. Recent advances <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref>, <ref type="bibr">[14]</ref> have made progress toward handling denser scenes, but often struggle in highly cluttered environments or when specific target objects must be retrieved.</p><p>Non-prehensile manipulations, such as pushing, provide effective solutions for separating objects or clearing clutter. The synergy between pushing and grasping has been widely studied to explore their combined potential. Zeng et al. <ref type="bibr">[5]</ref> established a self-supervised framework for unified push-grasp policies using deep Q-learning, demonstrating the benefit of strategic pushing in creating grasp opportunities, but with limited generalization to complex environments. Tang et al. <ref type="bibr">[6]</ref> extended the action space from SE(2) to SE(3) to enable more flexible and precise 6-DoF grasping. Building on <ref type="bibr">[5]</ref>, Xu et al. <ref type="bibr">[7]</ref> and Wang et al. <ref type="bibr">[8]</ref> proposed goal-conditioned push-grasp strategies for targeted retrieval. However, these methods suffer from simplistic network architectures and complex training procedures, which limit their effectiveness in highly dynamic and cluttered environments. Compared with these methods, our approach incorporates SE(2)-equivariance to enhance the representational capacity of both pushing and grasping policies. 
We also introduce a simple, streamlined training pipeline that reduces training complexity and hyperparameter sensitivity, thereby improving generalizability and robustness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Equivariance in Robot Learning</head><p>The integration of symmetries and equivariance properties into robotic policy learning has been proven to enhance both efficiency and performance <ref type="bibr">[15]</ref>, <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref>, <ref type="bibr">[18]</ref>, <ref type="bibr">[19]</ref>, <ref type="bibr">[20]</ref>, <ref type="bibr">[21]</ref>. In deep reinforcement learning (DRL), recent methods <ref type="bibr">[14]</ref>, <ref type="bibr">[22]</ref>, <ref type="bibr">[23]</ref>, <ref type="bibr">[24]</ref> demonstrate remarkable improvements in performance and convergence speed for SE(2) manipulation tasks. Similarly, equivariance has also proven effective in imitation learning (IL) <ref type="bibr">[4]</ref>, <ref type="bibr">[16]</ref>, <ref type="bibr">[25]</ref>, <ref type="bibr">[26]</ref>, <ref type="bibr">[27]</ref>. Closest to our approach are <ref type="bibr">[14]</ref>, <ref type="bibr">[23]</ref>, which establish foundational techniques for SE(2)-equivariant policy learning. Unlike these prior methods, which directly train a single equivariant policy via IL or RL to accomplish the entire task, our method introduces a novel pipeline that first employs IL to train a grasping network, which subsequently serves as the environment for DRL-based training of a pushing network. This two-step training strategy improves both training efficiency and generalization capabilities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. METHOD</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Problem Statement</head><p>The target object retrieval task in cluttered environments requires the agent to execute a series of push actions to clear obstructions, followed by a final grasping action to pick up the target. At each time step t, the agent observes the state O t &#8712; O and the specified target object, represented by its mask k &#8712; K, where O denotes the observation space and K is the set of all object masks in the scene. We use a top-down RGB-D image as the observation, i.e., O t &#8712; R 4&#215;h&#215;w . The agent then selects an action a t &#8712; A, where A = A push &#8746; A grasp includes all top-down grasps and horizontal pushes. Each action is represented as a tuple (type, pose), with type &#8712; {push, grasp} and pose &#8712; SE(2). To model the policy, we represent the end-effector pose as a distribution over discretized SE(2) actions, encoded as a pixel-wise dense action map of shape n &#215; h &#215; w.</p><p>Here, the spatial translation component is discretized into h &#215; w bins and the rotation component into n bins, where each pixel in the action map corresponds to a translation and each channel to a rotation angle, so the entire map defines a function over the discretized SE(2) space, similar to prior work <ref type="bibr">[14]</ref>, <ref type="bibr">[22]</ref>, <ref type="bibr">[29]</ref>.</p></div>
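To make the dense action-map encoding concrete, the following sketch (plain NumPy; the bin size and map shape are illustrative assumptions, not values from the letter) decodes the argmax of an n &#215; h &#215; w map into a discretized SE(2) pose:

```python
import numpy as np

def decode_action(q_map, bin_size=0.005):
    """Decode the argmax of a dense (n, h, w) Q-map into a discretized
    SE(2) pose. The 5 mm translation bin is an illustrative assumption."""
    n, h, w = q_map.shape
    flat = int(np.argmax(q_map))
    c, y, x = np.unravel_index(flat, q_map.shape)
    theta = 2 * np.pi * c / n        # one rotation bin per channel
    return x * bin_size, y * bin_size, theta

# Best action sits in rotation channel 3 at pixel (1, 2).
q = np.zeros((8, 4, 4), dtype=np.float32)
q[3, 1, 2] = 1.0
x, y, theta = decode_action(q)
```

Each channel indexes a rotation bin and each pixel a translation bin, so a single argmax over the map jointly selects position and orientation.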
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Overview of the Approach</head><p>The key contribution of our work is a novel push-grasp framework for efficient target object retrieval. As illustrated in Fig. <ref type="figure">2</ref>, our workflow consists of three key components: a CriticNet &#963;, a GraspNet &#960;, and a PushNet &#966;. At each time step, GraspNet and PushNet generate a grasp action and a push action with respect to the target object. CriticNet then evaluates the grasp action by assigning it a score. If the score exceeds a predefined threshold &#964; or the maximum number of push attempts is reached, the grasp action is executed. Otherwise, the push action is executed, and the process repeats with an updated observation. In the following subsections, we first describe the training process for each agent, followed by the design of equivariant networks.</p></div>
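The decision rule above can be sketched as a short control loop (Python; the function names, the threshold value, and the push budget are placeholders for the learned networks and robot primitives, not the paper's implementation):

```python
def push_grasp_episode(obs, target_mask, grasp_net, push_net, critic,
                       execute, tau=0.8, max_pushes=5):
    """Push until the critic deems the best grasp good enough (score >= tau)
    or the push budget is spent, then execute the grasp."""
    for _ in range(max_pushes):
        grasp = grasp_net(obs, target_mask)          # best grasp candidate
        if critic(obs, target_mask, grasp) >= tau:   # score exceeds threshold
            break                                    # graspable: stop pushing
        push = push_net(obs, target_mask)
        obs = execute('push', push)                  # rearrange, re-observe
    return execute('grasp', grasp_net(obs, target_mask))
```

With stub networks this loop pushes exactly until the critic's score crosses the threshold, mirroring the workflow in Fig. 2.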
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Two-Step Agent Learning</head><p>Previous works often rely on complex alternating training between grasp and push networks, which can lead to unstable convergence and difficulty in balancing learning dynamics. In contrast, we propose a simple two-step training process. First, we train a universal, goal-agnostic GraspNet together with a CriticNet that evaluates predicted grasps and returns a score. Then, we use the difference in grasp scores before and after pushing, computed from the CriticNet, as a reward signal to train a goal-conditioned PushNet. This decoupled training strategy eliminates the need for alternating optimization and its scheduling-related hyperparameters, making the training more stable, controllable, and efficient.</p><p>Step 1: GraspNet and CriticNet Training: We first train a universal target-agnostic GraspNet &#960; and a target-conditioned CriticNet &#963; using supervised data collected in simulation, which contains, for each step, the observation O t , object mask sets K, grasp poses, and binary success labels. GraspNet &#960; takes only the depth channel D t &#8712; R h&#215;w from O t as input and outputs dense, pixel-wise grasp score maps for all objects in the scene, i.e., &#960; : R h&#215;w &#8594; R n&#215;h&#215;w . Each entry represents a grasp quality for a specific location and orientation, where h &#215; w corresponds to the spatial resolution and n denotes the number of grasp orientations considered.</p><p>We train GraspNet &#960; with the Binary Cross Entropy (BCE) loss L = &#8722;[y a log Q a + (1 &#8722; y a ) log(1 &#8722; Q a )], averaged over the supervised grasp poses, where Q a is the predicted score for grasp pose a, and y a indicates grasp success. Since simulation allows supervision of many grasp poses per mask in each O t , the pixel-wise optimization enables the network to capture diverse, multimodal grasp strategies per scene.</p><p>Fig. <ref type="figure">1</ref>. The target object, specified by human instruction, is highlighted with a red mask (e.g., a banana). 
At each step, the push action direction is represented by an arrow. Our method iteratively predicts and executes push actions to create sufficient space for grasping the target. The final grasp pose is shown as a blue rectangle, with green blocks indicating the gripper's fingers.</p><p>Fig. <ref type="figure">2</ref>. Given an RGB-D observation, SAM2 <ref type="bibr">[28]</ref> generates a set of object masks. GraspNet and PushNet then use the depth image and these masks to predict candidate grasp and push actions. The target object's grasp pose is filtered using its corresponding mask, and the best candidate is selected. Finally, CriticNet evaluates the selected grasp pose against a threshold &#964; to determine whether to execute the grasp or a push action.</p><p>Similarly, CriticNet &#963; shares the dataset but receives D t , the target object mask k, the full mask set K, and a single grasp pose. Both k and K are represented as binary maps, and the grasp pose is rendered as an image. All three maps have the same spatial size as D t . The network then outputs a scalar score evaluating the grasp quality. Formally, &#963; : R 4&#215;h&#215;w &#8594; R. It is trained using the Mean Square Error (MSE) loss L = (&#375; &#8722; y) 2 , where &#375; is the predicted grasp quality score, and y is the ground-truth label corresponding to the given grasp pose. While both &#960; and &#963; output grasp scores, their roles differ: &#960; estimates pixel-wise qualities, whereas &#963; provides a more accurate feasibility estimate for a given grasp pose on the target.</p><p>Step 2: PushNet Training and CriticNet Fine-Tuning: This step is formulated as a contextual bandit problem (Fig. <ref type="figure">3</ref>). Unlike previous methods that perform complex alternating training, we treat &#960; and &#963; as part of the bandit environment to supervise the PushNet &#966; training. 
We introduce a Grasp Imagination Module, which provides a pushing reward by (1) simulating the optimal grasp predicted by &#960; in the post-push scene, and (2) evaluating the optimal grasp using &#963;. After evaluation, the simulation is restored to the post-push scene (i.e., before the grasp). As a result, the bandit environment is composed of two components: the cluttered physical scene itself and the Grasp Imagination Module.</p><p>Specifically, after the simulation scene is initialized, segmentation is first applied to obtain object masks. The Grasp Imagination Module stores the initial state and simulates grasp attempts for each mask sequentially. After each simulated grasp, the environment is restored to the initial state. The training episode for pushing begins once a grasp attempt fails. PushNet &#966; predicts the Q-value for all pushing actions, and an &#949;-greedy policy is executed. After the push action, the Grasp Imagination Module simulates the grasp action again to assess the new grasp feasibility. If the grasp succeeds, the push action is considered optimal, and the reward is 1. If the grasp fails, we use an adaptive reward defined as the difference between the predicted grasp scores before and after the push, estimated by &#963;, as a good push should improve the grasp feasibility. The episode terminates when a simulated grasp succeeds or the maximum number of pushing attempts is reached. The system then moves on to the next target mask or re-initializes the scene if all masks have been iterated.</p><p>The PushNet &#966; takes D t , target object mask k, and mask set K as input and outputs dense push score maps: &#966; : R 3&#215;h&#215;w &#8594; R n&#215;h&#215;w . The learning objective is the Huber loss between the predicted score and the reward, L = &#189;(Q a &#8722; r) 2 for |Q a &#8722; r| &#8804; &#948; and L = &#948;(|Q a &#8722; r| &#8722; &#189;&#948;) otherwise, where a and r are the selected action and corresponding reward, and Q a is the predicted push score for action a. 
In this stage, CriticNet &#963; is further fine-tuned using grasps predicted by &#960; in the Grasp Imagination Module, together with their outcomes, to mitigate the distribution shift from random to policy-driven grasps.</p><p>Our method offers several advantages over <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>, <ref type="bibr">[33]</ref>. First, it enables self-supervised learning by deriving rewards from network predictions, removing the need for manual push evaluation. Second, it is compatible with arbitrary grasp networks, enhancing robustness. Experiments show that our framework can also improve the performance of baseline grasp networks.</p></div>
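The two training signals above reduce to a few lines. This sketch (plain Python) follows the reward scheme as described; the Huber threshold delta = 1 is the common smooth-L1 default, an assumption rather than a value stated in the letter:

```python
def push_reward(grasp_succeeded, score_before, score_after):
    """Reward for a push: 1 if the imagined grasp now succeeds, otherwise
    the change in the CriticNet grasp score induced by the push."""
    if grasp_succeeded:
        return 1.0
    return score_after - score_before

def huber_loss(q, r, delta=1.0):
    """Huber loss between the predicted push score q and the reward r.
    delta=1.0 is the standard smooth-L1 default, assumed here."""
    d = abs(q - r)
    return 0.5 * d * d if d <= delta else delta * (d - 0.5 * delta)
```

A push that raises the critic's grasp score earns a positive reward even when the imagined grasp still fails, which is what lets the bandit learn from unsuccessful episodes.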
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Equivariance and Invariance in Agent Learning</head><p>A network h is equivariant to a symmetry group G if for all g &#8712; G, it satisfies h(g &#183; x) = g &#183; h(x). This property ensures that applying a transformation g to the input results in an equivalent transformation in the output. In particular, if the symmetry group is G = SE(2) (i.e., rotation around the z-axis of the world frame and translation along the x and y-axes), a planar rotation and translation of the input results in the same rotation and translation of the output. This symmetry naturally reflects the inherent structure of many table-top robotic tasks, such as grasping and pushing, while avoiding learning unnecessary out-of-plane rotation equivariance (e.g., full SO(3) rotations), which is both redundant and computationally expensive.</p><p>Specifically, we design GraspNet &#960; and PushNet &#966; to be equivariant under the product group C n &#215; T 2 , where C n = {2&#960;m/n : 0 &#8804; m &lt; n} &#8834; SO(2), with n &#8712; Z, is a finite cyclic group of discrete planar rotations, and T 2 represents the 2D translation group. For either network f &#8712; {&#960;, &#966;}, the equivariance property f(g &#183; I) = g &#183; f(I) holds, where I denotes the network input (which differs for &#960; and &#966;). The output of each network is a stack of n orientation-specific maps of shape n &#215; h &#215; w. Under a group action g &#8712; C n , the spatial dimensions h &#215; w are rotated, and the orientation channels indexed by n are cyclically permuted. Fig. <ref type="figure">4</ref> shows an example where the input is rotated by 90&#176;. Suppose in the original scene, the maximum Q-value occurs at pixel (x, y) in the 0&#176; channel. After rotation, equivariance ensures that this maximum shifts to the 90&#176; channel (i.e., permutation across orientation channels) and appears at the rotated pixel (y, x) (i.e., rotation across spatial positions). 
This structured transformation maintains the consistency of action selection under input transformations. CriticNet &#963; is designed to be invariant under the same SE(2) group. In this case, the transformation g is applied to both the observation and the grasp action simultaneously, and the output scalar remains unchanged: &#963;(g &#183; I) = &#963;(I). This invariance ensures that the predicted grasp quality is independent of the scene's absolute orientation or position.</p></div>
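The "rotate the spatial grid, cyclically shift the orientation channels" action described above can be checked numerically. The sketch below (NumPy, using C4 so that rotations are exact 90&#176; grid rotations; the elementwise "network" is a toy stand-in for an equivariant model, not the paper's architecture) verifies both the equivariance identity and the argmax shift:

```python
import numpy as np

def rotate_map(g, maps):
    """Action of the g-th element of C4 on an (n, h, w) dense action map:
    rotate the spatial grid by g*90 degrees and cyclically shift the n
    orientation channels by g."""
    spatial = np.rot90(maps, k=g, axes=(1, 2))
    return np.roll(spatial, shift=g, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 6))
f = lambda m: m ** 2        # any elementwise map commutes with rotate_map
equivariant = np.allclose(f(rotate_map(1, x)), rotate_map(1, f(x)))

# Argmax bookkeeping: a max in the 0-degree channel moves to the
# 90-degree channel at the rotated pixel, as in Fig. 4.
q = np.zeros((4, 3, 3))
q[0, 0, 2] = 1.0
shifted = np.unravel_index(np.argmax(rotate_map(1, q)), q.shape)
```

A real equivariant network satisfies the same identity by construction, which is what guarantees consistent action selection under scene rotations.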
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Network Architectures</head><p>We leverage Fully Convolutional Networks <ref type="bibr">[30]</ref> for inherent translational equivariance and use the escnn library <ref type="bibr">[31]</ref> to implement explicit SO(2) rotational equivariance. Separate architectures are designed for grasping and pushing policies to capture task-specific features.</p><p>In particular, GraspNet &#960; and CriticNet &#963; are designed to predict and evaluate grasp poses, relying primarily on accurate perception of local geometric structures. We adopt a ResNet <ref type="bibr">[32]</ref> architecture for &#963; and a U-Net architecture for &#960;, both equivariant under the cyclic group C 6 . A group pooling layer at the end of &#963; transforms its representation from equivariant to invariant. To accurately predict grasp orientations, we introduce a finer-grained orientation representation within the C 6 framework of &#960;. Each C 6 group element acts on six sub-channels, yielding a 36-dimensional orientation space with 10&#176; resolution. These sub-channels are cyclically permuted in 60&#176; increments. Furthermore, the gripper's bilateral symmetry implies that grasp orientations 180&#176; apart yield identical outcomes, reducing the prediction range from 360&#176; to 180&#176;. This symmetry corresponds to an SO(2)/C 2 quotient representation, which identifies antipodal directions as equivalent <ref type="bibr">[14]</ref>. As a result, GraspNet &#960; achieves an angular resolution of 10&#176; over a 180&#176; rotation range while preserving C 6 equivariance.</p><p>In contrast to &#963; and &#960;, PushNet &#966; requires both global geometric context of the scene and local features of surrounding objects. As shown in Fig. <ref type="figure">5</ref>, &#966; first extracts global features through an equivariant U-Net. To integrate local context, we introduce a feature fusion block. 
Here, feature maps from the U-Net are segmented by object masks, with each masked region serving as a node in a graph. Edges are defined by spatial distances, and an Equivariant Graph Attention Layer captures object interactions. The enriched graph features are merged with the original U-Net features and further refined by a second equivariant U-Net, yielding the final Q-value map for push action selection. Similar to &#960;, &#966; employs three orientation sub-channels for each C 6 group element. However, unlike grasping, pushing requires full 360&#176; rotational coverage due to its directional nature, which breaks 180&#176; rotational invariance. Consequently, &#966; achieves 20&#176; orientation resolution across the full 360&#176; rotation range.</p></div>
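The orientation-channel arithmetic above (10&#176; grasp bins over a 180&#176; quotient range, 20&#176; push bins over the full 360&#176;) can be sketched as follows; the function names are illustrative, and only the bin counts are derived from the text:

```python
def grasp_angle(channel, n_channels=18):
    """Grasp orientation for an output channel: 10-degree bins over 180
    degrees, since bilateral gripper symmetry equates antipodal grasps."""
    assert 0 <= channel < n_channels
    return channel * 180.0 / n_channels

def push_angle(channel, n_channels=18):
    """Push direction for an output channel: 20-degree bins over a full
    360 degrees, because pushes are directional."""
    assert 0 <= channel < n_channels
    return channel * 360.0 / n_channels

def canonical_grasp(theta_deg):
    """Fold an arbitrary grasp angle into the [0, 180) quotient range."""
    return theta_deg % 180.0
```

Both heads use 18 output channels per pixel, but they cover different angular ranges, which is exactly the asymmetry between the grasp quotient representation and the directional push representation.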
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. EXPERIMENTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Training Details</head><p>In step 1, we randomly initialize scenes in simulation with 2-15 objects and use SAM2 to generate the mask set. For each mask, we randomly sample 600 grasp poses and record the grasp outcomes. In total, we collect 3.6 M grasp data points (approximately 2 M positive and 1.5 M negative). GraspNet is trained for 30 epochs, while CriticNet is trained for 15 epochs. In step 2, PushNet is trained for 2,000 steps, with CriticNet fine-tuned for the same number of steps. Following prior work <ref type="bibr">[5]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>, we use fixed 10 cm open-loop pushes along the target direction to reduce action space complexity. While this choice limits adaptability, it suffices for revealing occluded objects in dense clutter. We leave learning more flexible and adaptive push actions to future work. All networks are trained on simulation data and directly transferred to real-world settings.</p></div>
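A fixed-length open-loop push as used above reduces to a straight segment from the decoded push pose (a minimal sketch; the start pose is assumed to come from the decoded action map, and the coordinate convention is illustrative):

```python
import math

def push_segment(start_xy, theta, length=0.10):
    """End point of a fixed 10 cm straight push starting at start_xy
    (metres) and heading along angle theta (radians)."""
    x, y = start_xy
    return (x + length * math.cos(theta), y + length * math.sin(theta))
```

The robot then executes a linear end-effector motion between the two points, which is what makes the primitive open-loop.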
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Experiment Setups and Tasks</head><p>To evaluate our push-grasp framework, we conduct experiments in both simulation and real-world environments, with the setup illustrated in Fig. <ref type="figure">6</ref>. We use PyBullet <ref type="bibr">[34]</ref> as our simulation environment, as it provides sufficient accuracy for our open-loop push and grasp primitives, which involve simple rigid-body interactions. The evaluation consists of three tasks:</p><p>Goal-Conditioned Push-Grasp in Clutter. This task assesses our framework's ability to retrieve a specific object from a cluttered scene. Following <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>, objects are randomly initialized, but with one designated as the target. The robot performs push actions if needed before grasping the target.</p><p>Table I. Goal-conditioned push-grasp in clutter in simulation.</p><p>Clutter Clearing. This task evaluates the ability to clear an entire scene without any predefined target or grasp sequence. The setup follows the previous task.</p><p>Goal-Conditioned Push-Grasp in Tight Layouts. Fig. <ref type="figure">7</ref> shows the task configuration. Objects are arranged in challenging geometric configurations (e.g., tight clusters, narrow gaps). This is a hard task because the robot must push strategically to create graspable space in a constrained environment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Evaluation Metrics and Baselines</head><p>We use three evaluation metrics: Grasp Success Rate (GSR), the ratio of successful grasps to total grasp attempts; Declutter Rate (DR), the proportion of grasped objects relative to the total number of objects; and Motion Efficiency (ME) <ref type="bibr">[5]</ref>, the fraction of grasp actions among all executed actions. GSR is used for all tasks, with DR applied to clutter-clearing and ME to goal-conditioned tasks.</p><p>Our method is compared with three baselines: (1) Xu et al. <ref type="bibr">[7]</ref>, a goal-conditioned push-grasp framework that utilizes multi-stage training to jointly optimize push and grasp action prediction. (2) Wang et al. <ref type="bibr">[8]</ref>, an extension of <ref type="bibr">[7]</ref> that improves performance by relaxing the constraints on Q-value selection and using object masks to guide actions. (3) Ren et al. <ref type="bibr">[33]</ref>, who simplify task coordination with a two-stage training framework (goal-agnostic followed by goal-conditioned) and propose a bifunctional network that produces accurate, high-resolution Q-value maps to enhance sample efficiency. In addition, we introduce three ablation variants to highlight our framework design. The first integrates the grasp module from <ref type="bibr">[8]</ref> into our framework, while the second applies our GraspNet within the framework of <ref type="bibr">[8]</ref>. The third replaces the equivariant networks with non-equivariant counterparts, trained with data augmentation.</p></div>
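The three metrics reduce to simple ratios over episode counts; a sketch with illustrative counts (the argument names are ours, not the paper's):

```python
def evaluation_metrics(successful_grasps, grasp_attempts,
                       total_objects, total_actions):
    """GSR, DR, and ME as defined above."""
    gsr = successful_grasps / grasp_attempts   # Grasp Success Rate
    dr = successful_grasps / total_objects     # Declutter Rate
    me = grasp_attempts / total_actions        # Motion Efficiency
    return gsr, dr, me

# e.g. 8 successful grasps out of 10 attempts, 10 objects, 25 total actions
gsr, dr, me = evaluation_metrics(8, 10, 10, 25)
```

Note that extra pushes lower ME but may raise GSR, which is the trade-off discussed in the comparison below.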
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Comparison With Baseline Methods in Simulation</head><p>We report the comparison results for the Goal-conditioned Push-Grasp in Clutter task in Table <ref type="table">I</ref>. Our method achieves the best performance, significantly outperforming all baselines. On average, across all settings with different numbers of objects, it surpasses the best baseline <ref type="bibr">[33]</ref> by 44.7% in GSR. The first two variants (Table <ref type="table">I</ref>, rows 3 and 4) show that integrating our approach into existing baselines further improves their performance, which highlights our design's effectiveness. However, our PushNet within the framework of <ref type="bibr">[8]</ref> does not yield significant improvement over the original method. This is likely because, while PushNet successfully creates graspable space, the baseline grasp module lacks sufficient capability to retrieve targets. The third variant serves two purposes: it demonstrates the advantage of equivariant networks over non-equivariant counterparts with data augmentation, and it validates the effectiveness of our training pipeline compared to baseline training strategies. Although our method's ME is similar to the baselines', this is expected, as additional push actions are necessary to ensure more successful grasps.</p><p>We also conduct an ablation study for this task, as shown in Fig. <ref type="figure">8</ref>. The bar chart compares the improvement in GSR with and without push actions. Our PushNet improves GSR by approximately 12% in highly cluttered environments. Additionally, we observe that the push module in Wang et al. <ref type="bibr">[8]</ref> contributes little to improving GSR, whereas integrating our PushNet leads to a more significant improvement.</p><p>Table <ref type="table">II</ref> shows the results of the Clutter Clearing task. Although this task is target-agnostic, push actions remain beneficial in cluttered environments. 
Since there is no specific target, we use the object with the highest score from GraspNet as the target object for each step. The results show that our method's grasping capability exceeds all baselines by a large margin, both with and without push actions. Notably, even without push actions, our method consistently outperforms all baselines that employ pushing, across all settings with different numbers of objects. This highlights the strong capacity of our GraspNet to handle cluttered scenes. Furthermore, when push actions are enabled, our method achieves additional improvements. The magnitude of this improvement is significantly greater than that observed in any of the baselines, demonstrating the strong contribution of our PushNet in creating graspable space.</p><p>Table II. Clutter clearing in simulation. Table III. Real-world goal-conditioned push-grasp in tight layouts, reported as "successful iterations / total iterations". Table IV. Real-world goal-conditioned push-grasp in clutter comparison results. Fig. <ref type="figure">8</ref>. Improvements in GSR with and without push actions, measured as the difference between 5 pushes and 0 pushes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Real World Experiments</head><p>We conduct a large-scale real-world evaluation that far exceeds the number of trials in prior baseline studies. This extensive setup reduces the influence of randomness and increases the reliability of our results. To assess the performance of our method, we evaluate it on two tasks: Goal-conditioned Push-Grasp in Clutter and Goal-Conditioned Push-Grasp in Tight Layouts. The trained model is directly transferred from simulation to the real-world environment without any fine-tuning.</p><p>The Goal-conditioned Push-Grasp in Clutter task involves grasping randomly selected targets from a set of 20 household objects placed randomly in the workspace. The real-world setup and object sets are shown in Fig. <ref type="figure">6</ref>(b) and (c). The experimental protocol follows the simulation setup. Each run consists of attempting to retrieve five target objects, with the scene randomly rearranged after each grasp to create a new cluttered layout for the next target. Each method is evaluated over 20 runs (i.e., 100 target objects in total). The target object's mask is still tracked via SAM2. Table <ref type="table">IV</ref> presents the results, comparing our method with several baselines. Our EPG significantly outperforms all baselines by at least 35% in GSR. The primary failure cases are: 1) inaccurate object masks from SAM2, which further affect PushNet and CriticNet outputs; 2) imprecise grasp poses predicted by the GraspNet. Despite these challenges, our method demonstrates strong overall stability.</p><p>The configuration of the Goal-Conditioned Push-Grasp in Tight Layouts task is in Fig. <ref type="figure">7</ref>. It contains eight different cases, each with a varying number of small boxes placed in specific positions. The objective is to grasp the yellow box, which is consistently placed at the center of surrounding boxes. 
These tasks are unseen during training and require effective strategies to solve, placing a strong demand on generalization ability. For each case, experiments are conducted over 10 iterations, where each iteration involves a randomized scene rotation and a different arrangement of the boxes. The results in Table <ref type="table">III</ref> indicate that, despite increasing task complexity, our method consistently outperforms the baselines while maintaining stable performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CONCLUSION AND LIMITATION</head><p>This letter introduces the Equivariant Push-Grasp (EPG) Network, a goal-conditioned grasping method that incorporates push actions to improve performance. EPG leverages SE(2) equivariance to enhance sample efficiency and generalization. We also propose a flexible training framework that optimizes PushNet using grasp score differences as rewards, avoiding manually designed reward functions and complex alternating training. Extensive experiments show that EPG consistently outperforms strong baselines across various tasks and settings.</p><p>However, our method has several limitations. First, it operates in an open-loop manner with fixed push distances and no force feedback, which can lead to imprecise actions in contact-sensitive scenarios. Incorporating closed-loop control with tactile sensing is a promising direction for future work. Second, EPG is limited to 4-DoF actions, which are sufficient for tabletop settings but do not generalize well to more complex 6-DoF scenarios. Nonetheless, our equivariant design can be naturally extended to support full SE(3) action spaces with minimal architectural changes. Third, EPG assumes a single-view input. In more complex or occluded scenes, our framework can generalize to multi-view inputs by fusing observations into 3D representations. Finally, EPG may require manual selection of target masks for consistency, which is inconvenient. We plan to integrate vision-language models for automatic mask generation.</p></div>
		</body>
		</text>
</TEI>
