<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Factored Pose Estimation of Articulated Objects using Efficient Nonparametric Belief Propagation</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>05/01/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10130823</idno>
					<idno type="doi">10.1109/ICRA.2019.8793973</idno>
					<title level='j'>2019 International Conference on Robotics and Automation (ICRA)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Karthik Desingh</author><author>Shiyang Lu</author><author>Anthony Opipari</author><author>Odest Chadwicke Jenkins</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Robots working in human environments often encounter a wide range of articulated objects, such as tools, cabinets, and other jointed objects. Such articulated objects can take an infinite number of possible poses, as a point in a potentially high-dimensional continuous space. A robot must perceive this continuous pose in order to manipulate the object to a desired pose. This problem of perception and manipulation of articulated objects remains a challenge due to its high dimensionality and multi-modal uncertainty. In this paper, we propose a factored approach to estimate the poses of articulated objects using an efficient non-parametric belief propagation algorithm. We consider inputs as geometrical models with articulation constraints, and observed 3D sensor data. The proposed framework produces object-part pose beliefs iteratively. The problem is formulated as a pairwise Markov Random Field (MRF) where each hidden node (continuous pose variable) models an observed object-part's pose and each edge denotes an articulation constraint between a pair of parts. We propose articulated pose estimation by a Pull Message Passing algorithm for Nonparametric Belief Propagation (PMPNBP) and evaluate its convergence properties over scenes with articulated objects.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Robots working in human environments often encounter a wide range of articulated objects, such as tools, cabinets, and other kinematically jointed objects. For example, the cabinet with three drawers shown in Figure <ref type="figure">1</ref> functions as a storage container. To accomplish storage and retrieval tasks on this container, a robot would need to perform a sequence of open and close actions on the various drawers. Executing such tasks involves repeated sense-plan-act phases, which occur under uncertainty in the robot's observations and demand a pose estimation framework capable of tracking this uncertainty. The presence of observation uncertainty and environmental occlusions poses a challenge for robots attempting to model cluttered human environments. Additionally, partial sensor observations caused by self-occlusion and environmental occlusion make the inference problem multi-modal. Further, as the number of object parts in the environment grows, the inference problem becomes high-dimensional.</p><p>Pose estimation methods have been proposed that take a generative approach to this problem <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>, <ref type="bibr">[3]</ref>. These methods aim to explain an observed scene as a collection of object and part poses, using a particle filter formulation to iteratively maintain belief over possible states. Though these approaches hold the power of modeling the world generatively, they have an inherent drawback of scaling inefficiently as the number of rigid bodies being modeled increases.</p><p>In this paper, we focus on overcoming this drawback by factoring the state as individual object parts constrained by their articulations to create an efficient inference framework for pose estimation. Generative methods exploiting articulation constraints are widely used in human pose estimation problems <ref type="bibr">[4]</ref>, <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>, where human body parts have constrained articulation. We take a similar approach and factor the problem using a Markov Random Field (MRF) formulation where each hidden node in the probabilistic graphical model represents an observed object-part's pose (continuous variable), each observed node indicates the information observed from a particular object-part, and each edge in the graph denotes the articulation constraint between a pair of parts. Inference on the graph is performed using a message passing algorithm that shares information between the parts' pose variables to produce pose beliefs for each part, collectively giving the estimated state of the articulated object.</p><p>Existing message passing approaches <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref> represent a message as a mixture of Gaussian components and provide Gibbs sampling based techniques to approximate the message product and update operations. Their message representation and message product techniques limit the number of samples used for inference and are not applicable to our application domain, which is high-dimensional and multimodal. In this paper we provide a more efficient "Pull" Message Passing algorithm for Nonparametric Belief Propagation (PMPNBP). The key idea of pull message updating is to evaluate samples taken from the belief of the receiving node with respect to the densities informing the sending node. The mixture product approximation can then be performed individually per sample, and later normalized to form a distribution. This pull updating of message distributions avoids the computational pitfalls of push updating used in <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>.</p><p>Our system takes a 3D point cloud from sensor measurement and an object geometry model in the form of a URDF (Unified Robot Description Format) as input and outputs belief samples in the continuous pose domain. We use these belief samples to compute a maximum likelihood estimate of an object-part's pose, enabling the robot to act on the object. Contributions of this paper include: a) proposal of an efficient belief propagation algorithm (PMPNBP) to estimate articulated object poses, and b) articulated object pose estimation experiments and comparisons with a traditional particle filter baseline.</p></div>
<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1"><p>Department of Electrical Engineering and Computer Science, Robotics Institute, University of Michigan, Ann Arbor {kdesingh,shiyoung,topipari,ocj}@umich.edu</p></note>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>Existing methods in the literature have set out to address the challenge of manipulating articulated objects by robots in complex human environments. Particular focus has been placed on addressing the task of estimating the kinematic models of articulated objects by a robot through interactive perception. Hausman et al. <ref type="bibr">[9]</ref> propose a particle filtering approach to estimate articulation models and plan actions that reduce model uncertainty. In <ref type="bibr">[10]</ref>, Martin et al. suggest an online interactive perception technique for estimating kinematic models by incorporating low-level point tracking and mid-level rigid body tracking with high-level kinematic model estimation over time. <ref type="bibr">Sturm et al. [11]</ref>, <ref type="bibr">[12]</ref> addressed the task of estimating articulation models in a probabilistic fashion by human demonstration of manipulation examples.</p><p>All of these approaches discover the articulated object's kinematic model by alternating between action and sensing and are important methods for a robot to reliably interact with novel articulated objects. In this paper we assume that such kinematic models, once learned for an object, can be reused to localize its articulated pose under ambiguous real-world observations. The method proposed in this paper could complement the existing body of work towards task completion in unstructured human environments.</p><p>Existing filtering based articulated object tracking frameworks <ref type="bibr">[13]</ref>, <ref type="bibr">[14]</ref>, <ref type="bibr">[15]</ref> are initialized with ground truth object poses. Our method could complement these existing tracking frameworks by providing an initial pose estimate. Additionally, belief propagation is applied to articulated pose tracking after initial pose estimation <ref type="bibr">[4]</ref>, <ref type="bibr">[5]</ref>. We consider comparisons with the tracking frameworks as a direction for future work.</p><p>Probabilistic graphical model representations such as Markov random fields (MRF) are widely used in computer vision problems where the variables take discrete labels such as foreground/background. Many algorithms have been proposed to compute the joint probability of the graphical model. Belief propagation algorithms are guaranteed to converge on tree-structured graphs. For graph structures with loops, Loopy Belief Propagation (LBP) <ref type="bibr">[16]</ref> has been shown empirically to perform well for discrete variables. The problem becomes non-trivial when the variables take continuous values. Nonparametric Belief Propagation (NBP) by Sudderth et al. <ref type="bibr">[8]</ref> and Particle Message Passing (PAMPAS) by Isard et al. <ref type="bibr">[7]</ref> provide sampling approaches to perform belief propagation with continuous variables. Both of these approaches approximate a continuous function as a mixture of weighted Gaussians and use local Gibbs sampling to approximate the product of mixtures. NBP has been effectively used in applications such as human pose estimation <ref type="bibr">[4]</ref> and hand tracking <ref type="bibr">[5]</ref> by modelling the graph as a tree structured particle network. Scene understanding problems where a scene is composed of household objects with articulations demand a large number of sampled hypotheses to infer in the high-dimensional and multi-modal state space. The algorithm proposed in this paper produces promising results and is shown to handle such demands. We reported comparisons with an existing NBP algorithm <ref type="bibr">[7]</ref> in <ref type="bibr">[17]</ref> with 2D examples.</p><p>Model based generative methods <ref type="bibr">[18]</ref>, <ref type="bibr">[19]</ref>, <ref type="bibr">[20]</ref> are increasingly being used to solve scene estimation problems where heuristics from discriminative approaches <ref type="bibr">[21]</ref>, <ref type="bibr">[22]</ref> are used to infer object poses. These approaches do not model object-object interactions or articulations and rely significantly on the effectiveness of the discriminative methods. Our framework does not rely on any prior detections but can benefit from them while inherently handling noisy priors <ref type="bibr">[8]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[17]</ref>. Chua et al. <ref type="bibr">[23]</ref> proposed a scene grammar representation and belief propagation over factor graphs for generating scenes with multiple objects satisfying the scene grammars. While their objective is similar to ours, we specifically deal with 3D observations along with continuous variables.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PROBLEM STATEMENT</head><p>We consider an articulated object $O$ to be comprised of $N$ object-parts and $N-1$ points of articulation. Such an object description conforms to the Unified Robot Description Format (URDF) commonly used in the Robot Operating System (ROS) <ref type="bibr">[24]</ref>. A kinematic model of this format can be represented as an undirected graph $G = (V, E)$ with nodes $V$ for object-parts and edges $E$ for points of articulation. If $G$ is a Markov Random Field (MRF), it may contain two types of variables, $X$ and $Y$, representing hidden and observed variables respectively. Let $Y = \{Y_s \mid Y_s \in V\}$, where $Y_s = P_s \subseteq P$, with $P$ being a point cloud observed by the robot's 3D sensor. Each object-part has an observed node in the graph $G$. $P_s$ serves as a region of interest if a trained object detector is used to find the object in the scene, but is optional in our current approach. Each observed node $Y_s$ is connected to a hidden node $X_s$ that represents the pose of the underlying object part. Let $X = \{X_s \mid X_s \in V\}$, where $X_s \in \mathbb{H}_D$ is a dual quaternion pose of an object-part. Dual quaternions <ref type="bibr">[25]</ref>, <ref type="bibr">[26]</ref> are the quaternion analogue of dual numbers. They represent a 6D pose $X_s = (x, y, z, q_w, q_x, q_y, q_z)$ as $X_s = q_r + \epsilon q_d$, where $q_r$ is the real component and $q_d$ is the dual component. Alternatively, the pose is written as $X_s = [q_r][q_d]$.</p><p>Constructing a dual quaternion $X_s$ is similar to composing rotation matrices: it is a product of dual quaternions representing translation and orientation, $X_s = dq_{pos} * dq_{ori}$, where $*$ is dual quaternion multiplication. $dq_{ori} = [q_w, q_x, q_y, q_z][0, 0, 0, 0]$ is the dual quaternion representation of pure rotation and $dq_{pos} = [1, 0, 0, 0][0, \frac{x}{2}, \frac{y}{2}, \frac{z}{2}]$ is the dual quaternion representation of pure translation. This dual quaternion representation is widely used for rigid body kinematics, where the $*$ operation is efficient and elegant compared with matrix multiplication. In addition to representing the hidden variable $X_s$, dual quaternions can capture the constraints in the edges $E$ and effectively represent articulation types such as prismatic, revolute, and fixed. This will be discussed in detail in Section IV-D.2.</p><p>Pose estimation of the articulated object involves inferring the hidden variables $X_s$ that maximize the joint probability of the graph $G$ considering only second order cliques, which is given as: <formula notation="tex">p(X, Y) = \frac{1}{Z} \prod_{(s,t) \in E} \psi_{s,t}(X_s, X_t) \prod_{s \in V} \phi_s(X_s, Y_s) \tag{1}</formula> where $\psi_{s,t}(X_s, X_t)$ is the pairwise potential between nodes $X_s$ and $X_t$, $\phi_s(X_s, Y_s)$ is the unary potential between the hidden node $X_s$ and observed node $Y_s$, and $Z$ is a normalizing factor. The problem is to infer belief over the possible articulation poses assigned to the continuous hidden variables $X$ such that the joint probability is maximized. This inference is generally performed by passing messages between hidden variables $X$ until their belief distributions converge over several iterations. After convergence, a maximum likelihood estimate of the marginal belief gives the pose estimate $X_s^{est}$ of the object-part corresponding to the node in the graph $G$. The collection of all such object-part pose estimates forms the entire object's pose estimate.</p></div>
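<div xmlns="http://www.tei-c.org/ns/1.0"><p>The construction of a pose as a product of a pure-translation and a pure-rotation dual quaternion can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names (qmul, dq_mul, pose_to_dq, dq_translation) are hypothetical, quaternions are stored as (w, x, y, z) tuples, and a dual quaternion is stored as a (real, dual) pair.</p><p><code lang="python"><![CDATA[
```python
import math


def qmul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def qadd(a, b):
    """Component-wise sum of two quaternions."""
    return tuple(p + q for p, q in zip(a, b))


def qconj(q):
    """Quaternion conjugate."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def dq_mul(a, b):
    """Dual quaternion product: (ar + e ad)(br + e bd) = ar br + e (ar bd + ad br)."""
    ar, ad = a
    br, bd = b
    return (qmul(ar, br), qadd(qmul(ar, bd), qmul(ad, br)))


def pose_to_dq(x, y, z, qw, qx, qy, qz):
    """Build X_s = dq_pos * dq_ori from a translation and a unit quaternion."""
    dq_pos = ((1.0, 0.0, 0.0, 0.0), (0.0, x / 2, y / 2, z / 2))
    dq_ori = ((qw, qx, qy, qz), (0.0, 0.0, 0.0, 0.0))
    return dq_mul(dq_pos, dq_ori)


def dq_translation(dq):
    """Recover (x, y, z) as the vector part of 2 * q_d * conj(q_r)."""
    qr, qd = dq
    _, x, y, z = qmul(qd, qconj(qr))
    return (2 * x, 2 * y, 2 * z)


half = math.sqrt(0.5)  # unit quaternion for a 90 degree rotation about z
X_s = pose_to_dq(1.0, 2.0, 3.0, half, 0.0, 0.0, half)
print(dq_translation(X_s))
```
]]></code></p><p>Recovering the translation from the product is a quick sanity check that the real part carries the orientation and the dual part carries half the translation composed with it.</p></div>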
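<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. NONPARAMETRIC BELIEF PROPAGATION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Overview</head><p>A message, denoted $m_{t \to s}$, is directed from node $t$ to node $s$ if there is an edge between the nodes in the graph $G$. The message represents the distribution of what node $t$ thinks node $s$ should take in terms of the hidden variable $X_s$. Typically, if $X_s$ is in the continuous domain, then $m_{t \to s}(X_s)$ is represented as a Gaussian mixture to approximate the real distribution: <formula notation="tex">m_{t \to s}(X_s) = \sum_{i=1}^{M} w_{ts}^{(i)} \, \mathcal{N}(X_s; \mu_{ts}^{(i)}, \Lambda_{ts}^{(i)}) \tag{2}</formula> where $M$ is the number of components, $w_{ts}^{(i)}$ is the weight associated with the $i$th component, and $\mu_{ts}^{(i)}$ and $\Lambda_{ts}^{(i)}$ are the mean and covariance of the $i$th component, respectively. We use the terms components, particles and samples interchangeably in this paper. Hence, a message can be expressed as $M$ triplets: <formula notation="tex">m_{t \to s} = \{(w_{ts}^{(i)}, \mu_{ts}^{(i)}, \Lambda_{ts}^{(i)}) : 1 \le i \le M\} \tag{3}</formula></p><p>Whether the graph has a tree or a loopy structure, computing these message updates is computationally nontrivial. The message update at iteration $n$ in a continuous domain from node $t \to s$ is given by <formula notation="tex">m^{n}_{t \to s}(X_s) \leftarrow \int_{X_t} \psi_{st}(X_s, X_t) \, \phi_t(X_t, Y_t) \prod_{u \in \rho(t) \setminus s} m^{n-1}_{u \to t}(X_t) \, dX_t \tag{4}</formula> where $\rho(t)$ is the set of neighbor nodes of $t$. The marginal belief over each hidden node at iteration $n$ is given by <formula notation="tex">bel^{n}_s(X_s) \propto \phi_s(X_s, Y_s) \prod_{t \in \rho(s)} m^{n}_{t \to s}(X_s) \approx \sum_{i=1}^{T} w_s^{(i)} \, \mathcal{N}(X_s; \mu_s^{(i)}, \Lambda_s^{(i)})</formula> where $T$ is the number of components used to represent the belief.</p></div>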
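<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the triplet representation concrete, the sketch below evaluates a Gaussian-mixture message at a query point. It assumes a scalar hidden variable so that each covariance reduces to a variance. The function names gaussian_pdf and eval_message and the example message are hypothetical and chosen only for illustration.</p><p><code lang="python"><![CDATA[
```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def eval_message(x, triplets):
    """Evaluate a message stored as (weight, mean, variance) triplets at x."""
    return sum(w * gaussian_pdf(x, mu, var) for w, mu, var in triplets)


# A bimodal message with M = 2 components whose weights sum to one.
message = [(0.7, -1.0, 0.04), (0.3, 2.0, 0.25)]
print(eval_message(-1.0, message), eval_message(0.5, message))
```
]]></code></p><p>The message density is high near either component mean and close to zero between them, which is how a mixture keeps several pose hypotheses alive at once.</p></div>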
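<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. "Push" Message Update</head><p>NBP <ref type="bibr">[8]</ref> provides a Gibbs sampling approach to compute an approximation of the product $\prod_{u \in \rho(t) \setminus s} m^{n-1}_{u \to t}(X_t)$. Assuming that $\phi_t(X_t, Y_t)$ is pointwise computable, a "pre-message" <ref type="bibr">[27]</ref> is defined as <formula notation="tex">M^{n-1}_{ts}(X_t) = \phi_t(X_t, Y_t) \prod_{u \in \rho(t) \setminus s} m^{n-1}_{u \to t}(X_t)</formula> which can be computed in the Gibbs sampling procedure. This reduces Equation <ref type="formula">4</ref> to <formula notation="tex">m^{n}_{t \to s}(X_s) \leftarrow \int_{X_t} \psi_{st}(X_s, X_t) \, M^{n-1}_{ts}(X_t) \, dX_t</formula></p></div>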
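<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm - Message update</head><p>Given input messages $m^{n-1}_{u \to t}(X_t) = \{(w_{ut}^{(i)}, \mu_{ut}^{(i)})\}_{i=1}^{M}$ for each $u \in \rho(t) \setminus s$, and methods to compute the functions $\psi_{ts}(X_t, X_s)$ and $\phi_t(X_t, Y_t)$ point-wise, the algorithm computes $m^{n}_{t \to s}(X_s) = \{(w_{ts}^{(i)}, \mu_{ts}^{(i)})\}_{i=1}^{M}$.</p><p>1. Draw $M$ independent samples $\{\mu_{ts}^{(i)}\}_{i=1}^{M}$ from $bel^{n-1}_s(X_s)$.</p><p>2. For each sample $\mu_{ts}^{(i)}$, compute the weight $w_{ts}^{(i)}$: (a) sample $\hat{X}_t^{(i)} \sim \psi_{ts}(X_t, X_s = \mu_{ts}^{(i)})$; (b) compute the unary weight $w_{unary}^{(i)}$ using $\phi_t(X_t = \hat{X}_t^{(i)}, Y_t)$; (c) compute the neighboring weight $w_{neigh}^{(i)}$ from the incoming messages $m^{n-1}_{u \to t}$ for $u \in \rho(t) \setminus s$; (d) the final weights are computed as $w_{ts}^{(i)} = w_{neigh}^{(i)} \times w_{unary}^{(i)}$.</p><p>3. The weights $\{w_{ts}^{(i)}\}_{i=1}^{M}$ are associated with the samples $\{\mu_{ts}^{(i)}\}_{i=1}^{M}$ to represent $m^{n}_{t \to s}(X_s)$.</p></div>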
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm - Belief update</head><p>Given incoming messages $m^{n}_{t \to s}(X_s) = \{(w_{ts}^{(i)}, \mu_{ts}^{(i)})\}_{i=1}^{M}$ for each $t \in \rho(s)$, and a method to compute the function $\phi_s(X_s, Y_s)$ point-wise, the algorithm computes $bel^{n}_s(X_s) \propto \phi_s(X_s, Y_s) \prod_{t \in \rho(s)} m^{n}_{t \to s}(X_s) = \{(w_s^{(i)}, \mu_s^{(i)})\}_{i=1}^{T}$.</p><p>1. For each $t \in \rho(s)$: (a) update the weights $w_{ts}^{(i)} = w_{ts}^{(i)} \times \phi_s(X_s = \mu_{ts}^{(i)}, Y_s)$; (b) normalize the weights such that $\sum_{i=1}^{M} w_{ts}^{(i)} = 1$.</p><p>2. Combine all the incoming messages to form a single set of samples and their weights $\{(w_s^{(i)}, \mu_s^{(i)})\}_{i=1}^{T}$, where $T$ is the total number of incoming samples.</p><p>3. Normalize the weights such that $\sum_{i=1}^{T} w_s^{(i)} = 1$.</p><p>4. Perform a resampling step followed by diffusion with Gaussian noise to sample a new set $\{\mu_s^{(i)}\}_{i=1}^{T}$ that represents the marginal belief of $X_s$.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the "push" update, NBP <ref type="bibr">[8]</ref> samples $\hat{X}_t^{(i)}$ from the "pre-message", followed by a pairwise sampling step in which $\psi_{st}(X_s, X_t)$ acts as the conditional density $\psi_{st}(X_s \mid X_t = \hat{X}_t^{(i)})$. The Gibbs sampling procedure is itself iterative and hence makes the computation of the "pre-message" (as the Foundation function described for PAMPAS) more expensive as $M$ increases.</p></div>
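<div xmlns="http://www.tei-c.org/ns/1.0"><p>The belief update (reweight by the unary potential, normalize per message, pool, normalize, resample, diffuse) can be sketched for a scalar pose as below. This is a simplified illustration under stated assumptions rather than the authors' code: the function belief_update, the diffusion scale sigma, the uniform fallback for vanishing weights, and the toy Gaussian unary potential are all hypothetical, and each message is assumed to be a non-empty list of (weight, sample) pairs.</p><p><code lang="python"><![CDATA[
```python
import math
import random


def belief_update(messages, unary, sigma=0.05, rng=None):
    """Return T diffused samples representing the marginal belief of X_s.

    messages: one non-empty list of (weight, sample) pairs per neighbor t of s.
    unary: callable giving the unary potential of s at a sample.
    """
    rng = rng or random.Random(0)
    weights, samples = [], []
    for pairs in messages:
        # Step 1a: reweight every sample by the unary potential.
        reweighted = [w * unary(mu) for w, mu in pairs]
        total = sum(reweighted)
        # Step 1b: normalize per message (uniform fallback if all weights vanish).
        if total > 0.0:
            reweighted = [w / total for w in reweighted]
        else:
            reweighted = [1.0 / len(pairs)] * len(pairs)
        # Step 2: pool the samples of all incoming messages.
        weights.extend(reweighted)
        samples.extend(mu for _, mu in pairs)
    # Step 3: normalize the pooled weights.
    norm = sum(weights)
    weights = [w / norm for w in weights]
    # Step 4: resample T samples, then diffuse them with Gaussian noise.
    resampled = rng.choices(samples, weights=weights, k=len(samples))
    return [mu + rng.gauss(0.0, sigma) for mu in resampled]


grid = [-1.0 + 0.04 * i for i in range(51)]
msg = [(1.0 / len(grid), x) for x in grid]
belief = belief_update([msg, msg], lambda x: math.exp(-0.5 * ((x - 0.5) / 0.1) ** 2))
print(len(belief), sum(belief) / len(belief))
```
]]></code></p><p>With two uninformative incoming messages, the resampled belief concentrates where the unary potential is large, which is the behavior expected from the product of the messages and the observation likelihood.</p></div>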
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. "Pull" Message Update</head><p>Given the overview of Nonparametric Belief Propagation in Section IV-A, we now describe our "pull" message passing algorithm. We represent each message as a set of pairs instead of the triplets in Equation <ref type="formula">3</ref>: <formula notation="tex">m_{t \to s}(X_s) = \{(w_{ts}^{(i)}, \mu_{ts}^{(i)}) : 1 \le i \le M\}</formula> Similarly, the marginal belief is summarized as a sample set <formula notation="tex">bel_s(X_s) = \{\mu_s^{(i)} : 1 \le i \le T\}</formula> where $T$ is the number of samples representing the marginal belief. We assume there exists a marginal belief over $X_s$, $bel^{n-1}_s(X_s)$, from the previous iteration. To compute $m^{n}_{t \to s}(X_s)$ at iteration $n$, we initially sample $\{\mu_{ts}^{(i)}\}_{i=1}^{M}$ from the belief $bel^{n-1}_s(X_s)$. We then pass these samples to the neighboring nodes $\rho(t) \setminus s$ and compute the weights $\{w_{ts}^{(i)}\}_{i=1}^{M}$. This step is described in Algorithm - Message update. The computation of $bel^{n}_s(X_s)$ is described in Algorithm - Belief update. The key difference between the "push" approach of earlier methods (NBP <ref type="bibr">[8]</ref> and PAMPAS <ref type="bibr">[7]</ref>) and our "pull" approach is the procedure for generating the message $m_{t \to s}$. In the "push" approach, incoming messages to $t$ determine the outgoing message $t \to s$, whereas in the "pull" approach, samples representing $s$ are drawn from its belief $bel_s$ at the previous iteration and weighted by the incoming messages to $t$. This weighting strategy is computationally efficient. Additionally, the product of incoming messages to compute $bel_s$ is approximated by a resampling step as described in Algorithm - Belief update.</p></div>
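<div xmlns="http://www.tei-c.org/ns/1.0"><p>A pull update can be sketched for scalar poses as follows. This is a hedged illustration, not the paper's implementation. It assumes a hypothetical Gaussian pairwise compatibility (compat) in which node t should sit a fixed offset away from node s, a belief stored as a plain list of samples, and a neighbor weight formed as the product, over incoming messages, of the message weights times the compatibility of each message sample. That particular weighting formula is one plausible reading of "weighted by the incoming messages to t" and should be treated as an assumption.</p><p><code lang="python"><![CDATA[
```python
import math
import random


def compat(x_t, x_s, offset, sigma):
    """Hypothetical pairwise potential: X_t should sit `offset` away from X_s."""
    return math.exp(-0.5 * ((x_t - x_s - offset) / sigma) ** 2)


def pull_message(bel_s, incoming_to_t, unary_t, offset, sigma=0.1, M=200, rng=None):
    """Return the message from t to s as M (weight, sample) pairs over X_s.

    bel_s: samples of X_s from the previous iteration.
    incoming_to_t: messages into t from its other neighbors, each a list
        of (weight, sample) pairs over X_t.
    unary_t: callable giving the unary potential of t at a pose of t.
    """
    rng = rng or random.Random(0)
    # Pull: draw the message samples from the belief of the receiving node s.
    drawn = rng.choices(bel_s, k=M)
    weights = []
    for mu_ts in drawn:
        # Pairwise sampling: propose a pose of t compatible with this sample.
        x_t_hat = mu_ts + offset + rng.gauss(0.0, sigma)
        w_unary = unary_t(x_t_hat)
        # Weight by every message arriving at t from its other neighbors.
        w_neigh = 1.0
        for msg in incoming_to_t:
            w_neigh *= sum(w * compat(mu_ut, mu_ts, offset, sigma) for w, mu_ut in msg)
        weights.append(w_neigh * w_unary)
    total = sum(weights)
    if total > 0.0:
        weights = [w / total for w in weights]
    else:
        weights = [1.0 / M] * M
    return list(zip(weights, drawn))


bel_s = [0.02 * i for i in range(101)]        # X_s hypotheses spread over [0, 2]
incoming = [[(0.5, 1.45), (0.5, 1.55)]]       # one message into t, near X_t = 1.5
message = pull_message(
    bel_s, incoming, lambda x: math.exp(-0.5 * ((x - 1.5) / 0.1) ** 2),
    offset=1.0, M=400,
)
mean_s = sum(w * mu for w, mu in message)
print(mean_s)
```
]]></code></p><p>Each sample is weighted independently, so the cost grows linearly in M per incoming message sample, and no iterative Gibbs product is needed. In this toy setup the evidence places t near 1.5, so the message mass over X_s gathers near 0.5.</p></div>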
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Potential Functions</head><p>1) Unary potential: The unary potential $\phi_t(X_t, Y_t)$ models the likelihood by measuring how well pose $X_t$ explains the point cloud observation $P_t$. The hypothesized object pose $X_t$ is used to position the given geometric object model and generate a synthetic point cloud $P_t^{*}$ that can be matched with the observation $P_t$. The synthetic point cloud is constructed using the object-part's geometric model available a priori. The likelihood is calculated as <formula notation="tex">\phi_t(X_t, Y_t) = e^{-\lambda_r \, d(P_t, P_t^{*})}</formula> where $\lambda_r$ is the scaling factor and $d(P_t, P_t^{*})$ is the sum of the 3D Euclidean distances between the observed point $p \in P_t$ and the rendered point $p^{*} \in P_t^{*}$ at each pixel location in the region of interest.</p><p>2) Pairwise Potential and Sampling: The pairwise potential $\psi_{t,s}(X_t \mid X_s)$ gives information about how compatible two object poses are given their joint articulation constraints captured by the edge between them. As mentioned in Section III, these constraints are captured using dual quaternions. Most often, the joint articulation constraints have a minimum and maximum range for either prismatic or revolute types. We capture this information from the URDF to get $R_{t|s} = [dq^{a}_{t|s}, dq^{b}_{t|s}]$, giving the limits of articulation. For a given $X_s$ and $R_{t|s}$, we find the distance between $X_t$ and the limits as $A = d(X_t, dq^{a}_{t|s})$ and $B = d(X_t, dq^{b}_{t|s})$, as well as the distance between the limits $C = d(dq^{a}_{t|s}, dq^{b}_{t|s})$. Using a joint limit kernel parameterized by $(\sigma_{pos}, \sigma_{ori})$, we evaluate the pairwise potential as: <formula notation="tex">\psi_{t,s}(X_t \mid X_s) = \exp\left(-\frac{(A_{pos} + B_{pos} - C_{pos})^2}{2\sigma_{pos}^2} - \frac{(A_{ori} + B_{ori} - C_{ori})^2}{2\sigma_{ori}^2}\right)</formula> where the subscripts $pos$ and $ori$ denote the positional and orientation components of each distance.</p><p>The pairwise sampling uses the same limits $R_{t|s}$ to sample $X_t$ given an $X_s$. We uniformly sample a dual quaternion $\tilde{X}_t$ that lies between $[dq^{a}_{t|s}, dq^{b}_{t|s}]$ and transform it back to the current frame of reference of $X_s$ by $X_t = X_s * \tilde{X}_t$.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head>Fig. 3</head><figDesc>Convergence of pose estimation on two different scenes: the first column shows the RGB image of each scene, second to fourth columns show the convergence results of PMPNBP. The second column shows randomly initialized belief particles, the third column shows the belief particles after 100 iterations, and the fourth column shows the maximum likelihood estimates of each part. The fifth column shows the estimation error (0.95 confidence interval) using PMPNBP with respect to the baseline particle filter method across 10 runs (400 particles and 100 iterations each). It can be seen that the baseline suffers from local minima while PMPNBP is able to recover from them effectively.</figDesc></figure>
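<div xmlns="http://www.tei-c.org/ns/1.0"><p>Both potentials can be illustrated with a small sketch. The exact functional forms are assumptions made for illustration: an exponential of the negative summed point distance for the unary term, and a Gaussian-shaped kernel on A + B - C for a single scalar prismatic joint in place of the dual quaternion distances. The names unary_potential and joint_limit_potential, and the default values of lam and sigma, are hypothetical.</p><p><code lang="python"><![CDATA[
```python
import math


def unary_potential(observed, rendered, lam=1.0):
    """exp(-lam * d), with d the summed Euclidean distance of matched 3D points."""
    d = sum(math.dist(p, q) for p, q in zip(observed, rendered))
    return math.exp(-lam * d)


def joint_limit_potential(x_t, lower, upper, sigma=0.05):
    """Kernel on A + B - C for a scalar prismatic joint value x_t.

    A and B are the distances to the two limits and C is the distance
    between the limits, so A + B - C is zero anywhere inside the range.
    """
    a = abs(x_t - lower)
    b = abs(x_t - upper)
    c = abs(upper - lower)
    return math.exp(-((a + b - c) ** 2) / (2.0 * sigma ** 2))


cloud = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)]
shifted = [(0.0, 0.0, 1.2), (0.1, 0.0, 1.2)]
print(unary_potential(cloud, cloud), unary_potential(cloud, shifted))
print(joint_limit_potential(0.1, 0.0, 0.4), joint_limit_potential(0.6, 0.0, 0.4))
```
]]></code></p><p>The kernel stays at its maximum for any joint value between the limits and decays smoothly once the value leaves the range, so hypotheses slightly outside the limits are penalized rather than rejected outright.</p></div>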
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. EXPERIMENTS AND RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Experimental Setup</head><p>We use a Fetch robot, a mobile manipulation platform, for our data collection. 3D data is collected using an ASUS Xtion RGBD sensor mounted on the robot. We make use of the intrinsic and extrinsic parameters of the sensor. We use CUDA-OpenGL interoperation to render synthetic scenes for a large set of poses in a single render buffer on a GPU. We render scenes as depth images, then project them back to 3D point clouds via the camera intrinsic parameters.</p><p>We use a cabinet with three drawers and a Fetch robot as our articulated objects to evaluate our method. CAD models of the objects were obtained from the Internet, and annotations of the objects' articulations were added manually using Blender to generate a URDF model (the Fetch robot comes with a URDF model). Geometrical models and articulation models can either be crowd-sourced <ref type="bibr">[28]</ref> or learned using human or robot interactions <ref type="bibr">[10]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Baseline</head><p>We implemented a Monte Carlo localization (particle filter) method that includes an object specific state representation. For example, the cabinet with three drawers has a state representation of $(x, y, z, \phi, \psi, \chi, t_a, t_b, t_c)$, where the first six elements describe the 6D pose of the object in the world and $t_a$, $t_b$, $t_c$ represent the prismatic articulation. The measurement model in the implementation uses the unary potential described in Section IV-D.1. Instead of rendering a point cloud of each object-part, the entire object in the hypothesized pose is rendered for measuring the likelihood in the particle filter. As the observations are static, the action model in the standard particle filter is replaced with a Gaussian diffusion over the object poses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Convergence Results</head><p>In Figure <ref type="figure">3</ref>, we show the convergence of the proposed method visually for two scenes containing different point cloud observations. We collected point cloud observations of the cabinet object in arbitrary poses and performed inference using both the proposed PMPNBP and the baseline Monte Carlo localization. The entire point cloud measurement is used as the observation for all object-parts. The first column shows the scene (RGB not used during inference). The second column shows the uniformly initialized pose samples of the object-parts over the entire point cloud. The third column shows the propagated belief particles for each object-part after 100 iterations. The fourth column shows the Maximum Likelihood Estimate (MLE) of each object-part using the belief particles from the third column.</p><p>For the results shown in Figure <ref type="figure">3</ref>, we ran our inference for 100 iterations with 400 particles per message. Ten different trials were used to generate the convergence plot that shows the mean and variance in error across the trials. We adopt the average distance metric (ADD) proposed in <ref type="bibr">[29]</ref>, <ref type="bibr">[20]</ref> for comparison between the methods. The point cloud model of the object-part is transformed to its ground truth dual quaternion ($dq$) and to the estimated pose's dual quaternion ($\widetilde{dq}$). ADD is the average distance between corresponding points of the two transformed models: <formula notation="tex">ADD = \frac{1}{m} \sum_{p \in \mathcal{M}} \left\| \widetilde{dq} * p * \widetilde{dq}^{c} - dq * p * dq^{c} \right\|</formula> where $\widetilde{dq}^{c}$ and $dq^{c}$ are the conjugates of the dual quaternions <ref type="bibr">[25]</ref>, <ref type="bibr">[26]</ref>, and $m$ is the number of 3D points in the model set $\mathcal{M}$.</p></div>
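<div xmlns="http://www.tei-c.org/ns/1.0"><p>The average distance metric can be sketched as below. For brevity this illustration represents each pose as a unit rotation quaternion plus a translation vector instead of a dual quaternion. The two describe the same rigid transform, but the substitution is an assumption of this sketch, as are the helper names qmul, transform and add_metric.</p><p><code lang="python"><![CDATA[
```python
import math


def qmul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def transform(point, quat, trans):
    """Rotate a 3D point by a unit quaternion, then translate it."""
    w, x, y, z = quat
    _, rx, ry, rz = qmul(qmul(quat, (0.0, *point)), (w, -x, -y, -z))
    return (rx + trans[0], ry + trans[1], rz + trans[2])


def add_metric(model, pose_gt, pose_est):
    """Mean distance between model points under the two poses.

    Each pose is a (quaternion, translation) pair.
    """
    total = 0.0
    for p in model:
        total += math.dist(transform(p, *pose_gt), transform(p, *pose_est))
    return total / len(model)


model = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
identity = ((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
lifted = ((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.1))
print(add_metric(model, identity, lifted))
```
]]></code></p><p>Because the metric averages over model points, it penalizes orientation errors in proportion to how far the points lie from the rotation axis, in addition to any translation error.</p></div>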
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Partial and incomplete observations</head><p>Articulated models suffer from self-occlusions and often environmental occlusions. By exploiting the articulation constraints of an object in the pose estimation, our inference method is able to produce a physically plausible pose that explains the partial or incomplete observations. In Figure <ref type="figure">4</ref>, we show two compelling cases that indicate the strength of our inference method. In the first case, the cabinet is occluded by the robot's arm, while in the second case, a blanket suspended from drawer 1 occludes half of the object. PMPNBP is able to recover from these occlusions and produce a plausible estimate along with a belief over possible poses. The factored approach proposed in this paper scales to objects such as a Fetch robot, which has a higher number of links and joints and more combinations of articulations than a cabinet (see Figure <ref type="figure">5</ref> and <ref type="bibr">[30]</ref> for extended results).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSION</head><p>We proposed the Pull Message Passing algorithm for Nonparametric Belief Propagation (PMPNBP), an efficient algorithm to estimate the poses of articulated objects. This problem was formulated as a graph inference problem for a Markov Random Field (MRF). We showed quantitatively that PMPNBP outperforms a baseline Monte Carlo localization method. Qualitative results were provided to show the pose estimation accuracy of PMPNBP under a variety of occlusions. We also showed the scalability of the algorithm to articulated objects such as a Fetch robot. Uncertainty in inference is inevitable in robotic perception. Our proposed PMPNBP algorithm is able to accurately estimate the pose of articulated objects and maintain belief over possible poses, which can benefit a robot in performing manipulation tasks.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2019" xml:id="foot_0"><p>International Conference on Robotics and Automation (ICRA) Palais des congres de Montreal, Montreal, Canada, May 20-24, 2019 978-1-5386-6027-0/19/$31.00 &#169;2019 IEEE</p></note>
		</body>
		</text>
</TEI>
