<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Online Learning of Unknown Dynamics for Model-Based Controllers in Legged Locomotion</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>10/01/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10319492</idno>
					<idno type="doi">10.1109/LRA.2021.3108510</idno>
					<title level='j'>IEEE Robotics and Automation Letters</title>
<idno>2377-3774</idno>
<biblScope unit="volume">6</biblScope>
<biblScope unit="issue">4</biblScope>					

					<author>Yu Sun</author><author>Wyatt L. Ubellacker</author><author>Wen-Loong Ma</author><author>Xiang Zhang</author><author>Changhao Wang</author><author>Noel V. Csomay-Shanklin</author><author>Masayoshi Tomizuka</author><author>Koushil Sreenath</author><author>Aaron D. Ames</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The performance of a model-based controller can severely suffer when its model inaccurately represents the real world dynamics. We propose to learn a time-varying, locally linear residual model along the robot's current trajectory, to compensate for the prediction errors of the controller's model. Supervised learning is performed online, as the robot is running in the unknown environment, using data collected from its immediate past. We theoretically investigate our method in its general formulation, then apply it to a bipedal controller derived from the full-order dynamics of virtual constraints, and a quadrupedal controller derived from a simplified model of contact forces. For a biped in simulation, our method consistently outperforms the baseline and a recent learning-based method. We also experiment with a 12 kg quadruped in simulation and real world, where the baseline fails to walk with 10 kg of payload but our method succeeds.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>MANY popular frameworks for controller design are based on the robot's model of dynamics. In the real world, however, this model can often turn out to be inaccurate, due to, for example, misspecification of the robot's physical parameters, mechanical wear and tear, and deployment-time interventions such as additional payload. While a well-designed controller is robust to small inaccuracies in the dynamics, large deviations may significantly degrade its performance.</p><p>Our goal is to make corrections to the model behind the controller during deployment, through online learning using onboard sensors. Since the nature of a model is to predict the future given the past, data for supervised learning of dynamics can be collected automatically without human supervision, as time goes on and the future is revealed.</p><p>Because data are generated along the very trajectory that we are trying to improve, they might not contain enough information about the entire system. Nevertheless, we find it sufficient to limit the scope of learning to a local neighborhood of the current point in the current trajectory, instead of the entire system, if the learned model is updated in real time as the trajectory evolves. Fortunately, even globally complex systems, such as the highly nonlinear hybrid systems for legged locomotion, can be locally simple. Therefore, we also find it sufficient to learn only a time-varying, locally linear model, which is computationally cheap enough to update in real time.</p><p>We first develop the intuition of online learning into a method for controllers that drive the outputs to the desired behavior based on control-affine models. We then analyze this method's theoretical properties, and evaluate it in two applications for legged locomotion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Related Work</head><p>1) System identification: Given a system with known form but unknown parameters, system identification (sysID) estimates these parameters from signals given by the system (<ref type="bibr">[1]</ref>). Recent papers have applied sysID to the inertial parameters of a humanoid (<ref type="bibr">[2]</ref>, <ref type="bibr">[3]</ref>). The parameters are assumed to be constant in time, and estimation is performed before the deployment of a controller. Thinking of identification as training and deployment as testing, sysID trains a model before deployment, and keeps the model fixed during testing. Since the goal is to model the system's behavior globally across the entire state space, sysID usually requires driving the system to diverse enough states, using diverse enough inputs. This requirement is known as persistence of excitation in control theory, and might be difficult to satisfy without many samples from the plant. In contrast, we only model the system's behavior locally, within a small neighborhood of our current state, learning a linear model even for complex systems with relatively few samples.</p><p>2) Learning dynamics: There is also a developing community in machine learning, modeling dynamics of the environment from interactions and observations (<ref type="bibr">[4]</ref>, <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref>, <ref type="bibr">[10]</ref>, <ref type="bibr">[11]</ref>). It has roughly the same goal as sysID, but often uses powerful tools from deep learning, and does not assume any specific form of the system; here, learning often produces a general prediction model. We diverge from this community in the global vs. 
local aspect (as we do from sysID), but embrace its philosophy of learning a general model with parameters that might not be interpretable.</p><p>3) Adaptive control: The intuition of adaptive control is to change the controller's parameters during deployment (<ref type="bibr">[12]</ref>). Online system identification (<ref type="bibr">[13]</ref>) is the most relevant subfield, since it directly concerns the model behind the controller. It has been successfully applied in manipulation (<ref type="bibr">[14]</ref>, <ref type="bibr">[15]</ref>, <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref>, <ref type="bibr">[18]</ref>), and for the location and inertial parameters of the center of mass of a quadruped (<ref type="bibr">[19]</ref>). For online sysID, the parameters considered are very specific, and estimation relies on the physics of the model and the particular controller for the application. Our work considers parameters in a much more general sense, closer to that of the machine learning community. Our parameters are functions of the state, and are thus inherently time-varying and abstract. In fact, in the control-affine form, every term of our dynamics is updated in real time as the state evolves. Furthermore, unlike sysID (online or not), whose goal is to identify the parameters, our goal is simply to give accurate predictions for the next timestep, again closer to the goal of learning. This allows our method to work with general model-based controllers instead of relying on the specific meanings of the parameters. Another relevant subfield is L1 adaptive control (<ref type="bibr">[20]</ref>, <ref type="bibr">[21]</ref>), which, like our work, concerns the residual dynamics, but does not use learning.</p><p>4) Online learning: Our work performs supervised learning online, which has long been a subject of research in machine learning (<ref type="bibr">[22]</ref>, <ref type="bibr">[23]</ref>, <ref type="bibr">[24]</ref>). The two central questions are: where does the label come from, and how is learning evaluated? Traditionally (<ref type="bibr">[25]</ref>), learning has been evaluated with regret, and labels can come from a potentially adversarial oracle. Recently, the computer vision community has been using self-supervised tasks to provide labels (<ref type="bibr">[26]</ref>, <ref type="bibr">[27]</ref>, <ref type="bibr">[28]</ref>, <ref type="bibr">[29]</ref>, <ref type="bibr">[30]</ref>), and the continual learning community has been evaluating with forward and backward predictions (<ref type="bibr">[4]</ref>); cf. Subsection II-B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Conventions</head><p>In this paper, vectors (a, &#945;) are bold and lowercase, matrices (A, &#8486;) are bold and uppercase, scalars and functions (of all type signatures) are not bold. We assemble matrices and vectors like in MATLAB: [A, B] concatenates A and B horizontally with a comma, and [A; B] concatenates them vertically with a semicolon. 0_n denotes the n &#215; n matrix of zeros, and 1_n denotes the n &#215; n identity matrix. Also, &#8214;&#183;&#8214; denotes the 2-norm for vectors (Euclidean norm) and matrices (spectral norm), unless stated otherwise. We express quantities in the nominal dynamics &#8113; with a bar, in the residual dynamics &#945;&#771; with a tilde, and in the true (plant) dynamics &#945; without anything on top.</p></div>
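<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers who work in NumPy rather than MATLAB, the assembly and norm conventions above can be mirrored as in the following sketch. The array names and values are placeholders chosen for illustration and do not come from the paper.</p>
<eg><![CDATA[
```python
import numpy as np

# Placeholder matrices, chosen only to illustrate the conventions.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

horizontal = np.hstack([A, B])  # [A, B]: side by side, shape (2, 4)
vertical = np.vstack([A, B])    # [A; B]: stacked, shape (4, 2)

zeros_n = np.zeros((2, 2))      # 0_n with n = 2
identity_n = np.eye(2)          # 1_n with n = 2

vec_norm = np.linalg.norm(np.array([3.0, 4.0]))  # Euclidean norm of a vector
spec_norm = np.linalg.norm(A, 2)                 # spectral norm of a matrix
```
]]></eg>
<p>Here the second argument 2 selects the largest singular value for a matrix, which matches the spectral norm used in the text.</p></div>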
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. METHOD</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Unknown Dynamics and Linear Residual Models</head><p>Given a robotic system that is characterized by rigid-body dynamics, we denote x &#8712; R^n as its state, u &#8712; R^m its vector of control inputs, and y &#8712; R^d its vector of outputs. The output dynamics can almost always be written as a second-order system of the following form (<ref type="bibr">[31]</ref>), known as control-affine (<ref type="bibr">[32]</ref>):</p><p>We consider model-based controllers whose goal is to drive the vector of tracking errors &#951; = [y; &#7823;] to zero. The bars on top of the variables indicate that they come from our assumed nominal model, which in reality can never be completely accurate. The unknown real-world dynamics are called the true (plant) model, denoted without the bars as &#945;, &#946;. We often use an alternative set of notations to write equation (1) simply as:</p><p>in order to emphasize the role of &#8113; and &#946;&#772; as time-varying parameters of the output dynamics.</p><p>To make corrections to the nominal model, we incorporate two residual parameters and obtain the following form:</p><p>where &#945;&#771; is called the weight and &#946;&#771; is called the bias. They are written as time-varying parameters, and have the same dimensions as &#8113; and &#946;&#772; respectively. The tildes on top of them emphasize that they are estimated from data.</p><p>To better understand these residual parameters, we manipulate equation (3) into:</p><p>Intuitively, the above equation says that the goal of learning is to make the residual model on the right-hand side account for the prediction errors of the nominal model on the left-hand side. It also reveals the role of labels vs. covariates, as we explain next in the context of learning.</p></div>
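<div xmlns="http://www.tei-c.org/ns/1.0"><p>The prose of this subsection constrains the structure of equations (1) to (4) fairly tightly: a control-affine nominal model, the same model with time-varying parameters, the model with a residual weight and bias added, and a rearrangement with the nominal prediction error on the left. The LaTeX below is one rendering consistent with that description. The ordering of terms and the placement of the time subscripts are assumptions, not a transcription of the typeset equations.</p>
<eg><![CDATA[
```latex
\begin{align}
\ddot{y} &= \bar{\beta}(x) + \bar{\alpha}(x)\, u \tag{1} \\
\ddot{y} &= \bar{\beta}_t + \bar{\alpha}_t\, u \tag{2} \\
\ddot{y} &= (\bar{\beta}_t + \tilde{\beta}_t) + (\bar{\alpha}_t + \tilde{\alpha}_t)\, u \tag{3} \\
\ddot{y} - (\bar{\beta}_t + \bar{\alpha}_t\, u) &= \tilde{\beta}_t + \tilde{\alpha}_t\, u \tag{4}
\end{align}
```
]]></eg>
<p>Read this way, the left-hand side of (4) plays the role of the label and u plays the role of the covariate in the regression problem of the next subsection.</p></div>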
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Data Collection and Online Learning</head><p>For real systems, sensor data can only be collected at discrete sampling intervals. We denote each sampling timestep by an integer subscript, which converts equation (<ref type="formula">4</ref>) into:</p><p>Note that we are merely sampling a continuous system at discrete timesteps, so continuous-time concepts such as acceleration are still well defined. We collect a dataset of the form D := {(label_s, covariate_s)} for s = t-k, ..., t-1, where s is the index of discrete timesteps, and k denotes the fixed size of the sliding time window. From equation (<ref type="formula">5</ref>), we have</p><p>Given a dataset, our method solves regularized least squares, a.k.a. ridge regression, on the labels and covariates. The weight of the solution is &#945;&#771;_t, and the bias is &#946;&#771;_t. Note that in textbook-style least squares, the weight is a vector, and the label and bias are scalars; for our learning problem, the weight is a matrix in R^{d&#215;m}, and the label and bias are vectors in R^d. But we can simply reduce this to d independent vector-scalar least squares problems. The same regularization is added independently to these d problems, since they share the same covariates; thus inversion of the covariance matrix, the most computationally costly step, is only performed once.</p><p>The solved parameters are then immediately used by the model-based controller to produce u_t. In both of our later examples, the baseline controller solves for u_t in an application-specific optimization problem with the assumed nominal parameters &#8113;_t and &#946;&#772;_t. We simply substitute these with &#8113;_t + &#945;&#771;_t and &#946;&#772;_t + &#946;&#771;_t respectively, as shown in Figure <ref type="figure">2</ref>.</p><p>Learning is performed online, as the controller is running with the learned parameters. At the beginning, all residual parameters are initialized to zero, because there is not enough data to learn them. 
Once we are k steps into the trajectory, we have enough data to form D as above and solve for the residual parameters; informed by them, the controller generates an improved trajectory, which in turn generates new data that are more relevant as time goes on.</p><p>The fact that D only keeps the k most recent data points implements a natural forgetting mechanism. In reinforcement learning terms, D is called the replay buffer, which stores the off-policy data that are not generated by the current controller; in our case, data in D are generated by the old controllers using the residual parameters from previous timesteps. Because we learn small, local models, we encourage forgetting so that our model capacity can be used only for the neighborhood of our current state. This is in contrast to the vast literature in reinforcement learning <ref type="bibr">[33]</ref>, <ref type="bibr">[24]</ref>, <ref type="bibr">[34]</ref>, <ref type="bibr">[4]</ref>, where the goal is to learn a large, global model; there the replay buffer contains as much historical data as possible, and various techniques are implemented to discourage forgetting.</p><p>Our method can also be viewed as bootstrapping from a "bad" controller based on an inaccurate model to a better one. This might not be feasible, however, if the initial model deviates too much from the plant. For example, if the nominal model is so far off that the robot loses balance immediately, no useful information will be contained in the data collected. Fortunately, when deviations happen gradually over time, there will more likely be enough information for learning to maintain a controller that keeps generating useful data. We study this phenomenon empirically in Section IV.</p></div>
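<div xmlns="http://www.tei-c.org/ns/1.0"><p>The data collection and learning loop described above can be sketched as a sliding-window ridge regression. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name ResidualLearner, the regularization value, and the choice to regularize the bias together with the weight are all hypothetical.</p>
<eg><![CDATA[
```python
from collections import deque

import numpy as np


class ResidualLearner:
    """Sliding-window ridge regression for a locally linear residual model.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, m, d, k=100, reg=1e-3):
        self.m, self.d, self.k, self.reg = m, d, k, reg
        # deque(maxlen=k) drops the oldest sample automatically (forgetting).
        self.buffer = deque(maxlen=k)

    def add(self, u, label):
        """Store one (covariate, label) pair: u in R^m, label in R^d."""
        self.buffer.append((np.asarray(u, dtype=float),
                            np.asarray(label, dtype=float)))

    def solve(self):
        """Return (weight, bias); both are zero until the window is full."""
        if len(self.buffer) < self.k:
            return np.zeros((self.d, self.m)), np.zeros(self.d)
        # Append a constant 1 to each covariate so the bias is learned jointly.
        U = np.array([np.append(u, 1.0) for u, _ in self.buffer])  # (k, m+1)
        Y = np.array([y for _, y in self.buffer])                  # (k, d)
        gram = U.T @ U + self.reg * np.eye(self.m + 1)
        # One linear solve is shared by all d output dimensions.
        theta = np.linalg.solve(gram, U.T @ Y)                     # (m+1, d)
        return theta[:-1].T, theta[-1]
```
]]></eg>
<p>At each timestep a controller would call add with the latest input and the observed prediction error of the nominal model, then call solve and add the returned weight and bias to its nominal parameters.</p></div>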
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Theoretical Analysis</head><p>Suppose the true (plant) output dynamics is control-affine:</p><p>We prove that our method stabilizes the tracking errors under two assumptions. The main theorem illustrates our intuition of learning in a local time window under smoothly varying dynamics, and characterizes the role of k, our window size. Denote errors in the nominal model's prediction as</p><p>with &#945;t := &#945; t -&#945;t , and &#946;t := &#946; t -&#946;t . Denote the prediction of the residual model as</p><p>Assumption 1: The model-based controller can stabilize the tracking errors &#951; = [y; &#7823;] if for some &#949; &gt; 0,</p><p>In words, Assumption 1 says that the proposed model-based controller works when the proposed (nominal plus residual) model is relatively accurate; Assumption 2 says that the deviations in dynamics are relatively smooth (in the space of parameters) over time.</p><p>In addition, we denote the motor torque saturation as &#8214;u&#8214; &lt; B. Denote u t = [u t ; 1] &#8712; R^{m+1}, and</p><p>We set k &#8805; m + 1, so &#963;_min(U) &gt; 0, i.e. the covariance matrix of ordinary least squares (OLS) has rank m + 1.</p><p>Theorem 1: Given the above assumptions, if</p><p>then the model-based controller stabilizes &#951;.</p><p>Note that any claim of stability in Theorem 1 is completely inherited from the baseline controller, when Assumption 1 holds. Our method is agnostic to the exact type of stability, e.g. exponential or asymptotic, which depends on the underlying baseline, and is orthogonal to the theory we develop.</p><p>In Theorem 1, B, d, &#948;_&#945; and &#948;_&#946; are constants determined by the application. &#949; is the model-based controller's tolerance for model inaccuracy, also independent of our method. The only quantity we tune is k, the window size, which strongly affects &#963;_min(U). With a large k, we pay a factor of k&#8730;k, intuitively due to the lag in our dataset. 
With a small k, we pay for the decrease in &#963;_min(U), as &#945;&#771; and &#946;&#771; become more sensitive to noise. The user should tune k to find a sweet spot in the middle. In practice, we use regularized least squares instead of OLS, so &#963;_min(U) is always &gt; 0 and more noise tolerant, making the balance less delicate w.r.t. the choice of k. We use k = 100 in both of our applications (100 and 200 ms respectively).</p><p>Before proving Theorem 1, we state two lemmas, whose proofs are given in Subsection A of the appendix.</p><p>Lemma 1:</p><p>. Let y_t = A_t u_t for t = 1, ..., k, and &#195; be the OLS estimator of the dataset {(y_1, u_1), ..., (y_k, u_k)}. If for t = 1, ..., k + 1, &#8214;A_{t+1} - A_t&#8214; &lt; &#948;_A, and &#8214;u_t&#8214; &lt; B, then</p><p>where </p><p>By definition, &#195;_t is the least squares solution on D. We then apply Assumption 1 and Lemma 2 to finish the proof.</p></div>
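<div xmlns="http://www.tei-c.org/ns/1.0"><p>The dependence of the analysis on the smallest singular value of the stacked inputs can be probed numerically. The sketch below assumes that U stacks one row [u_t, 1] per timestep in the window, which matches the augmented input of dimension m + 1 mentioned above; the function names are hypothetical.</p>
<eg><![CDATA[
```python
import numpy as np


def augmented_inputs(inputs):
    """Stack rows [u_t, 1] for a window of control inputs of shape (k, m)."""
    inputs = np.atleast_2d(np.asarray(inputs, dtype=float))
    ones = np.ones((inputs.shape[0], 1))
    return np.hstack([inputs, ones])


def sigma_min(inputs):
    """Return the (m + 1)-th singular value of the augmented window.

    With fewer than m + 1 rows the matrix cannot have rank m + 1,
    so 0.0 is returned.
    """
    U = augmented_inputs(inputs)
    singular_values = np.linalg.svd(U, compute_uv=False)  # sorted, descending
    n_cols = U.shape[1]
    if singular_values.size < n_cols:
        return 0.0
    return float(singular_values[n_cols - 1])
```
]]></eg>
<p>Note that k of at least m + 1 is necessary for full column rank but the inputs in the window must also vary: a window of identical inputs still gives a numerically zero value, which is one reason regularization helps in practice.</p></div>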
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. APPLICATIONS</head><p>We now apply our method to two model-based controllers, derived from two different perspectives for different robotic platforms: a Lyapunov perspective to control the full-order dynamics of bipedal robots, and a simplified dynamics based control architecture for robust quadrupedal locomotion. We focus on identifying the components of our method in the context of each controller, without elaborating on derivations of the nominal dynamics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. CLF-QP for Bipedal Locomotion</head><p>Let q be the robot's configuration, and x = [q; q&#775;] be the robot's state. We define y = h(x), where h is called the virtual constraints (<ref type="bibr">[35]</ref>). For a biped, stabilizing &#951; = [y; &#7823;] means, for example, that the torso maintains a constant posture, and the legs walk in a scissor-symmetric gait.</p><p>The nominal output dynamics, whose derivation we omit, can then be written in the familiar form of equation (<ref type="formula">1</ref>), using Lie derivatives of the nominal dynamics in the state space as &#8113; and &#946;&#772;:</p><p>where D&#772; is the inverse of the mass-inertia matrix, C&#772; is the Coriolis matrix and &#7713; is the gravity vector. While &#8113;(x) might not be square (d &#8800; m) in general, this particular bipedal controller has the same number of virtual constraints as actuated joints. Now the control law</p><p>We can then design v to stabilize the output dynamics using control Lyapunov functions (CLFs), a common tool in control theory for providing stability guarantees in legged locomotion (<ref type="bibr">[36]</ref>). Because &#951;&#775; is linear in &#951; and v, it is straightforward to find a CLF V(&#951;) by solving the Lyapunov equation (<ref type="bibr">[37]</ref>). It is then a well known fact that V&#775;(&#951;, v) &lt; -cV implies exponential stability of &#951;(t), with a constant c &gt; 0. This motivates the following CLF-based quadratic program (CLF-QP) to solve for v:</p><p>where u_min and u_max are bounds of the torque saturation constraints. Since the output dynamics is already in the form of equation (<ref type="formula">1</ref>), it is straightforward to apply our method to obtain &#945;&#771; and &#946;&#771;. We can then modify constraint C2 in the optimization problem <ref type="bibr">(17)</ref> to have</p><p>In Section IV we show that this simple modification leads to significant improvements under uncertain dynamics.</p></div>
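<div xmlns="http://www.tei-c.org/ns/1.0"><p>To illustrate how the residual parameters enter the control law, the sketch below applies standard input-output linearization with the corrected model, assuming a square, invertible weight matrix (d = m) as in this bipedal controller. It deliberately omits the CLF constraint and the torque limits of the actual quadratic program, and the function name is hypothetical.</p>
<eg><![CDATA[
```python
import numpy as np


def corrected_input(alpha_bar, beta_bar, alpha_res, beta_res, v):
    """Solve (alpha_bar + alpha_res) u + (beta_bar + beta_res) = v for u.

    Assumes the corrected weight matrix is square and invertible.
    Torque saturation and the CLF constraint are not modeled here.
    """
    alpha = np.asarray(alpha_bar, dtype=float) + np.asarray(alpha_res, dtype=float)
    beta = np.asarray(beta_bar, dtype=float) + np.asarray(beta_res, dtype=float)
    return np.linalg.solve(alpha, np.asarray(v, dtype=float) - beta)
```
]]></eg>
<p>If the residual exactly closes the gap between the nominal and true parameters, the resulting output acceleration equals the commanded v; with zero residual the nominal model's error passes straight through to the output.</p></div>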
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. MPC with Contact Force for Quadrupedal Locomotion</head><p>To control a quadrupedal system walking stably under large disturbances (such as heavy loads), we take the model predictive control (MPC) approach using the simplified dynamics from <ref type="bibr">[38]</ref> as our baseline controller.</p><p>For quadrupedal dynamics, let p, &#7767;, p&#776; &#8712; R^6 be the position, velocity and acceleration of the robot's center of mass (CoM). Let f_i &#8712; R^3 be the ground reaction force at the robot's i-th foot, with i &#8712; {1, 2, 3, 4}. We also denote f = [f_1; f_2; f_3; f_4] &#8712; R^12.</p><p>The nominal dynamics of the CoM is given by</p><p>where &#7713; &#8712; R^6 is the gravity vector, D&#772; &#8712; R^{6&#215;6} is the inverse mass matrix, and G &#8712; R^{6&#215;12} is called the grasp map, which depends on the robot's state and is assumed to be accurate.</p><p>The goal of the model-based controller is to have p and &#7767; track the desired position and velocity p_d and &#7767;_d, generated from user command. In the notation of Section II, y = p - p_d, and we want to stabilize &#951; = [y; &#7823;] around zero. This is achieved by having p&#776; track some desired acceleration p&#776;_d, generated from PD control on p_d and &#7767;_d. The model-based controller then uses equation <ref type="bibr">(19)</ref> to solve for f:</p><p>subject to stance and swing leg constraints and the friction pyramid condition; more details can be found in <ref type="bibr">[39]</ref>.</p><p>Following the outline in Section II, we modify <ref type="bibr">(19)</ref> to incorporate the linear residual model:</p><p>where D&#771; is the weight, and -g&#771; is the bias. Note that the nominal dynamics in <ref type="bibr">(19)</ref> has no Coriolis terms, a simplification often adopted in the literature for model-based controller design of quadrupeds with small angular velocity. While this simplification has been validated in many implementations, it is never completely accurate. Therefore, even if D&#772; = D and &#7713; = g, i.e. they are both accurate parameters, (<ref type="formula">19</ref>) is still an inaccurate description of the plant. We make no distinction, philosophically or algorithmically, between unknown dynamics, e.g. payload, and unmodeled dynamics, e.g. the Coriolis terms discarded by design. Our true output dynamics can take any general form. Also note that Assumption 1 is in fact not satisfied by our baseline controller due to its simplifications, e.g. massless legs. In this case, stability is left to empirical validation.</p><p>Moving on, we sample equation (<ref type="formula">21</ref>) at discrete timesteps:</p><p>and form the dataset as</p><p>After solving for D&#771; and g&#771;, we use them to modify the objective function in equation (<ref type="formula">20</ref>) as:</p><p>By definition, D&#772; + D&#771; must be positive definite; this is also necessary for the optimization problem above to make sense. For computational efficiency, we solve for D&#771; unconstrained, and find that our least squares solution in fact always gives a positive definite D&#772; + D&#771; in our experiments.</p></div>
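<div xmlns="http://www.tei-c.org/ns/1.0"><p>Because the residual weight is solved without constraints, an implementation may want to verify positive definiteness of the corrected inverse mass matrix before passing it to the optimizer. The sketch below shows one such check. The fallback to the nominal matrix is an assumption added for illustration; the text only reports that the check was never violated in the experiments.</p>
<eg><![CDATA[
```python
import numpy as np


def is_positive_definite(matrix, tol=1e-9):
    """True if x^T M x > 0 for all nonzero x, tested on the symmetric part."""
    matrix = np.asarray(matrix, dtype=float)
    sym = 0.5 * (matrix + matrix.T)
    return bool(np.linalg.eigvalsh(sym).min() > tol)


def corrected_inverse_mass(D_bar, D_res):
    """Return D_bar + D_res if it is positive definite, else D_bar.

    The fallback is a hypothetical safeguard, not part of the paper.
    """
    candidate = np.asarray(D_bar, dtype=float) + np.asarray(D_res, dtype=float)
    return candidate if is_positive_definite(candidate) else np.asarray(D_bar, dtype=float)
```
]]></eg>
<p>Testing the symmetric part handles a learned residual that is not exactly symmetric, since the quadratic form of a matrix depends only on its symmetric part.</p></div>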
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. RESULTS</head><p>Video of our experiments is available at <ref type="url">https://youtu.be/Je 2Y-FQpKw</ref> (<ref type="bibr">[40]</ref>). Simulations are performed in the PyBullet (<ref type="bibr">[41]</ref>) physics engine.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Simulation for Bipedal Walking</head><p>Our baseline controller discussed in Subsection III-A is taken from <ref type="bibr">[33]</ref>, which introduces its own setting and method for unknown dynamics. We perform simulation in their setting, and compare with their method.</p><p>The problem setting is based on RABBIT (<ref type="bibr">[42]</ref>), an underactuated planar five-link bipedal robot with seven degrees of freedom; virtual constraints and controller design are based on <ref type="bibr">[37]</ref>. Model uncertainty is introduced in <ref type="bibr">[33]</ref> by scaling the mass of each link by a factor of two in the real environment. The baseline CLF-QP controller falls within a few steps in this setting, due to the significant difference in dynamics between the nominal and true model.</p><p>By querying the plant, <ref type="bibr">[33]</ref> uses model-free reinforcement learning (RL) to train a policy that directly adds on the original control inputs u, without reasoning about the unknown dynamics in the model space. Specifically, the commanded control inputs take the form u + u_&#952;(x), where u_&#952; is a neural network policy with parameters &#952;. Their reward is designed to encourage V&#775; &lt; -cV, where the value of V&#775; is obtained by simulating in the plant. After 20,000 samples from the plant simulated using the true dynamics, their method trains a policy which walks in the true dynamics without falling.</p><p>Fig. <ref type="figure">3</ref>. Bipedal walking with mass of each link scaled by two. Both our method and that of <ref type="bibr">[33]</ref> walk stably. Their RL-based method trains on 20,000 samples from the real environment before deployment. Our method trains completely online and does not sample from or anticipate the real environment, treating it as truly unknown until the robot is deployed, and results in smaller impulses of control inputs and better tracking performance. The top panel visualizes the gait generated by our method.</p><p>Our method walks stably in the same setting, training completely online without querying the plant at all before deployment. In fact, Fig. <ref type="figure">3</ref> shows that our method enjoys smaller impulses of control inputs and better tracking performance than the RL-based method, even though the latter had privileged access to the plant before deployment to optimize exactly for these metrics.</p><p>Online learning enables us to treat the plant as truly unknown, in terms of both data and mathematical representation, while only the latter is unknown for methods that train offline like in <ref type="bibr">[33]</ref>. This philosophical difference prevents our controller from overfitting on the training environment. In particular, our controller still walks stably under the original dynamics without scaling, where the policy trained with the scaled links fails, because it overfits to the scaled dynamics.</p><p>In addition, our controller walks stably in all environments below, where the baseline and the RL-based method cannot:</p><p>1) scaling the control inputs by half, in order to simulate transmission inefficiencies and motor wear and tear; 2) scaling the mass of the torso by four, in order to simulate payload on the back of the humanoid; 3) scaling the mass of the right leg by four, as an example of asymmetric changes in dynamics.</p><p>We keep the same hyper-parameters for all the experiments above, including a window size of 100 ms (where k = 100 and each timestep is 1 ms). The robot is still able to walk under the scaled dynamics with a window size of 10 or 1000, but has a higher norm of control inputs and tracking errors.</p><p>Fig. 4. The baseline has completely fallen in 2 s, but the proposed method still walks stably after 10 s (50 kg). The bottom visualizations are captured when the payload reaches the specified mass. The torque limit is reached at 25 kg.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Simulation for Quadrupedal Robot</head><p>Our baseline controller, as discussed in Subsection III-B, is based on <ref type="bibr">[38]</ref> and used subsequently in <ref type="bibr">[39]</ref> and <ref type="bibr">[43]</ref>. Our implementation is modified from the publicly available code of <ref type="bibr">[43]</ref> on a Unitree A1 quadruped, and keeps their original parameters unless stated otherwise. The A1 weighs 12 kg and has 12 motors, three for each leg, with the stated torque limit of 35.5 Nm. We experiment in PyBullet using Unitree's URDF description, and also on a real robot. In both simulation and real world, we use a window size of k = 100 (like for the biped); the controller runs at a frequency of 500 Hz, making the dataset window 200 ms.</p><p>We command the robot to walk with a linear velocity of 0.5 m/s in the x-direction, while maintaining a CoM height of 0.24 m. Both the baseline and the proposed method can walk stably without payload, while tracking the desired velocity and height. With 6 kg of payload, however, the baseline can barely walk at 2/3 the desired velocity, and sags to 2/3 the desired height; the robot falls with 7 kg.</p><p>The proposed method walks stably with 12 kg of payload (same as its body mass), while tracking the desired velocity to within 0.05 m/s, and the desired height to within 0.01 m; all motor torques are less than 35.5 Nm. With more than 12 kg, however, tracking becomes less accurate, and with 15 kg the robot falls. Since the payload is carried from the very beginning of simulation, the robot visibly sags for the first fifth of a second, as we collect data before we can estimate the residual parameters. With 12 kg it soon recovers from the sag, but for larger payloads it struggles to get back. Next, we experiment with gradually changing dynamics. We start with an empty payload, and increase its mass by 5 kg/s, that is, 0.01 kg per 2 ms control timestep, once simulation begins. The tracking errors are shown in Figure <ref type="figure">4</ref>. The baseline falls within 2 s. We have tried to improve the baseline by tuning the PD gains for p&#776;_d, but found it ineffective. This observation is reasonable, since larger gains only make p&#776;_d more aggressive, but cannot help if the model-based controller fails to achieve it using the nominal dynamics. The proposed method walks stably even when the payload reaches 50 kg. Motor torques reach the specified limit at 25 kg (5 s), but the URDF allows simulation to keep running.</p></div>
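<div xmlns="http://www.tei-c.org/ns/1.0"><p>The timing figures quoted in this section follow from simple arithmetic on the window size, the control frequency and the payload ramp rate. The helper functions below are hypothetical and only restate those relations so the reported numbers can be checked.</p>
<eg><![CDATA[
```python
def window_duration_ms(k, control_hz):
    """Duration in milliseconds of a window of k samples at control_hz."""
    return 1000.0 * k / control_hz


def payload_kg(t_seconds, rate_kg_per_s=5.0):
    """Payload mass after t_seconds of a linear ramp starting from zero."""
    return rate_kg_per_s * t_seconds


def time_to_reach_s(mass_kg, rate_kg_per_s=5.0):
    """Time in seconds at which the ramp reaches mass_kg."""
    return mass_kg / rate_kg_per_s
```
]]></eg>
<p>With k = 100, these give a 100 ms window at 1 kHz and a 200 ms window at 500 Hz, a payload of 50 kg after 10 s, and the 25 kg torque-limit point at 5 s, in agreement with the values stated in the text.</p></div>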
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Hardware Experiments for Quadrupedal Robot</head><p>To facilitate hardware testing, we fit the Unitree A1 quadruped with a loading rig designed to hold up three standard 1 inch weight plates. The rig allows for incremental, discrete changes in load while the quadruped is in operation. The rig itself weighs 0.9 kg.</p><p>The experiments were designed to compare the performance of the baseline and proposed controllers under varying load conditions during operation. Two tests for each controller were performed: a step-in-place test and a 0.1 m/s forward motion test. The load conditions for the tests are shown in Table <ref type="table">1</ref>. Due to the manual loading process, the duration of each load varies by a small amount of transition time, typically less than 1 s. To protect the hardware from possible damage, we do not load beyond 10 kg, and limit operation at this load to 5 s.</p><p>In the transition from simulation to hardware, we had to address the problem of acceleration estimation from noisy measurements. <ref type="bibr">[43]</ref> uses a Kalman filter to fuse IMU and </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. DISCUSSION</head><p>We have introduced a method to update model parameters through online learning, while a controller is running with past, inaccurate versions of these parameters. Unlike methods in adaptive control, our method uses learned parameters that are general functions of the state, thus inherently time-varying. To the best of our knowledge, this is the first method that applies machine learning to real-time updates (500 to 1000 Hz) in hardware walking experiments.</p><p>While the nominal models in our applications are derived from classical mechanics, our method can be applied to any black box nominal model, e.g. a simulator. While our baselines are derived from classical control principles, our method can also be applied to any controller using the black box model, even a policy trained in simulation. We hope to explore these potentials in future work, under broader definitions of unknown dynamics, such as sim-to-real transfer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2) Proof of Lemma 2</head><p>We first prove the vector version in a claim, which is used in the proof of the lemma.</p><p>Claim 1: Consider y_t = &#10216;u_t, a_t&#10217; &#8712; R for t = 1, ..., k, and u_t, a_t &#8712; R^m. Suppose &#8214;u_t&#8214; &lt; B and &#8214;a_{t+1} - a_t&#8214; &lt; &#949;. Let &#227; be the OLS estimator on this dataset, then</p><p>.</p><p>Proof: Define the feasible set of weights for a dataset as A = {(a_1, a_2, ..., a_k) : y_t = &#10216;u_t, a_t&#10217;, &#8214;a_t - a_{t+1}&#8214; &#8804; &#949;, &#8704;t}.</p><p>Then a_k can only exist in the kth component of A, denoted</p><p>Our goal is to bound max_{a_k &#8712; A_k} &#8214;&#227; - a_k&#8214;. Define e_t = a_t - a_k, where e_k = 0 and &#8214;e_{t+1} - e_t&#8214; &lt; &#949;. We can rewrite A using these conditions as A = {(a_1, ..., a_k) : y_t = &#10216;u_t, a_t&#10217;, e_t = a_t - a_k, e_k = 0, &#8214;e_{t+1} - e_t&#8214; &lt; &#949;, &#8704;t}.</p><p>Define E = [e_1^T; ...; e_k^T] and A = [a_1^T; ...; a_k^T]. Also, by definition of OLS, &#227; = (U^T U)^{-1} U^T y. Therefore</p><p>where &#8226; denotes the Hadamard operator, &#963;_min(U) is the minimum non-zero singular value of U; the last equality follows from the singular value decomposition of U. Note that</p><p>which finishes the proof of Claim 1.</p><p>Now we extend the result of Claim 1 to prove Lemma 2. Note that in the context of Lemma 2, A_t &#8712; R^{d&#215;m}, and is different from the definition of A in the proof of Claim 1. We use the standard matrix norm relationship (<ref type="bibr">[44]</ref>)</p><p>for any matrix X &#8712; R^{d&#215;m}. Combining the second half of equation <ref type="bibr">(25)</ref> with the lemma's assumption, we have</p><p>As explained in Subsection II-B, the matrix-vector least squares problem is solved by reducing to d independent vector-scalar sub-problems, one for each dimension of y_t. Each sub-problem solves for one row of &#195;. From equation (<ref type="formula">26</ref>), we already know that rows of the ground truth weight matrices satisfy the smoothness assumption in Claim 1. Therefore we can apply Claim 1 to each row of &#195;, yielding</p><p>Combining this with the first half of equation (<ref type="formula">25</ref>) finishes the proof of Lemma 2.</p></div></body>
		</text>
</TEI>
