Learning agile and dynamic motor skills for legged robots
1. Learning agile and dynamic motor skills for legged robots
Kohei Nishimura, DeepX, Inc.
Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso,
Vassilios Tsounis, Vladlen Koltun, Marco Hutter
Robotic Systems Lab, ETH Zurich, Switzerland
2. Introduction
• Proposes a control method for legged robots that require complicated motor
control, combining improved simulation modeling with deep reinforcement learning
3. Background
• Legged robots are attracting attention as robots capable of operating in
diverse environments
• In research on legged robots, there are many methods for modeling and
controlling actuator behavior,
• but these studies have not considered generalization performance, ease of
tuning, or efficiency
Image sources: https://www.bostondynamics.com/spot-mini · http://biomimetics.mit.edu · https://www.anybotics.com/anymal/
4. Previous research
• Control the legged robot as a combination of modules
- e.g., assume all mass is concentrated at the robot's center of gravity,
with massless limbs attached, and compute the optimal control for that model
- Disadvantages
• Modeling inaccuracies cause control inaccuracies
• Control parameters must be re-tuned for every new robot or new
model, and the modules must be modeled and parameterized from
scratch every time the task changes
(this takes several months even for skilled engineers)
• Control by trajectory optimization
- Control using two modules: planning and tracking
- Parameter tuning for trajectory optimization is cumbersome and may fall
into local optima
- Trajectory optimization is computationally heavy and not suitable for
controlling the robot in real time
5. Robot control using RL
• Robot control using reinforcement learning has been studied as a
learning-based approach
• There are two major approaches to sim-to-real robot control using
reinforcement learning:
1. Make the simulator's behavior faithful to reality and obtain policies
that transfer easily to the real world
e.g., use direct-drive actuators, whose behavior can be modeled analytically
(Sim-to-real: Learning agile locomotion for quadruped robots)
2. Randomize the variables in the simulator and obtain a policy with
high generalization performance
e.g., randomize dynamics, add noise to observations
(Learning Dexterous In-Hand Manipulation)
6. Overview of proposed method
• The policy is learned via reinforcement learning in simulation only
- The policy receives the state and outputs the action (joint angle targets)
for the actuators
- To close the gap between the simulator and the real world:
• Make ground-contact simulation faster and more accurate
• Learn the relationship between actions and the torques of the
real-world actuators with a neural network
• Randomize the simulator's conditions (stochastic model) and learn
robust policies
8. Technique details: Improving contact simulation
• Requires a simulator that can handle the complicated contacts generated
by motion in a stable, accurate, and fast manner
• The most common approach is the penalty method (also adopted in MuJoCo)
- Small interpenetration of objects is permitted, and a repulsive force
corresponding to the penetration is generated
- Easy to implement with low computational cost, but simulation accuracy
is poor for very rigid objects
• A more accurate method is projected Gauss-Seidel (PGS)
- Computes the contact forces based on the physical constraint conditions
- It solves the constraint equations by iterative convergence, but the
number of iterations required is not stable
• Convergence can be slow, e.g., at the moment of impact
• The authors extended PGS with the bisection method, proposing a solver
that is both stable and fast, and used it in these experiments (a minimal
PGS sketch follows below)
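As a concrete illustration of the PGS iteration described above, here is a minimal Python sketch of projected Gauss-Seidel for a contact-style linear complementarity problem. The matrices and the convergence test are illustrative assumptions only; the authors' actual solver adds the bisection extension and is described in their RA-L paper.

```python
import numpy as np

def pgs_solve(A, b, iters=100, tol=1e-8):
    """Projected Gauss-Seidel for the LCP: find lam >= 0 such that
    A @ lam + b >= 0 and lam * (A @ lam + b) == 0 componentwise.
    A is assumed positive (semi-)definite, as in contact problems."""
    lam = np.zeros_like(b)
    for _ in range(iters):
        delta = 0.0
        for i in range(len(b)):
            # Update one contact impulse, then project onto lam_i >= 0
            residual = A[i] @ lam + b[i]
            new = max(0.0, lam[i] - residual / A[i, i])
            delta = max(delta, abs(new - lam[i]))
            lam[i] = new
        if delta < tol:  # iteration count varies per problem (the instability noted above)
            break
    return lam

# Toy 2-contact example with illustrative numbers
A = np.array([[2.0, 0.5], [0.5, 1.5]])
b = np.array([-1.0, 0.3])
print(pgs_solve(A, b))  # -> [0.5, 0.0]
```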
9. Technique details: Actuator Net
• Learn the relationship between actions and the torque of the real-world
actuator with a neural network (see the sketch below)
- Input
• Position error (difference between commanded and actual joint
angle) and joint angular velocity at times t, t - 0.01 s, and t - 0.02 s
- Output
• Torque at time t
- Network structure
• MLP with 3 hidden layers
• The activation function is softsign
- Training data
• Joint angle, joint angular velocity, and torque data collected at
400 Hz for 4 minutes
• The robot walks under a simple control model, with disturbances
applied during walking
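Below is a minimal PyTorch sketch of such an actuator network. The input layout (position errors and velocities at t, t - 0.01 s, t - 0.02 s) and the softsign activation follow the slide; the hidden width of 32 is an assumption for illustration.

```python
import torch
import torch.nn as nn

class ActuatorNet(nn.Module):
    """MLP mapping a short history of joint position errors and joint
    velocities (6 inputs) to the joint torque at time t (1 output)."""
    def __init__(self, hidden=32):  # hidden width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.Softsign(),
            nn.Linear(hidden, hidden), nn.Softsign(),
            nn.Linear(hidden, hidden), nn.Softsign(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        # x: [err_t, err_{t-0.01}, err_{t-0.02}, vel_t, vel_{t-0.01}, vel_{t-0.02}]
        return self.net(x)

model = ActuatorNet()
tau = model(torch.randn(8, 6))  # predicted torques for a batch of 8, shape (8, 1)
```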
11. Technique details: Reinforcement learning
At every time step t the agent receives an observation o_t ∈ O, performs an
action a_t ∈ A, and obtains a scalar reward r_t ∈ ℝ.
The aim is to find a policy π that maximizes the discounted sum of rewards
over an infinite horizon:

    maximize_π  E[ Σ_{t=0}^∞ γ^t · r_t ],  with discount factor γ ∈ (0, 1)
12. Technique details: Control Policy
• Control policy (a sketch of the network follows below)
- Input
• Position of the robot
• Base orientation of the robot
• Joint angle history (most recent 3 steps)
• Control signal history (most recent 3 steps)
• Command signal (from a human operator)
- Output
• Control signal (angle command for each actuator)
• The policy is trained with TRPO
- using the default parameters from the original paper
"Trust Region Policy Optimization (TRPO) [22], a policy gradient algorithm that
has been demonstrated to learn locomotion policies in simulation"
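A minimal PyTorch sketch of a policy network consuming the inputs listed above. All dimensions (12 joints, quaternion orientation, 3-step histories, 3-dimensional command), the tanh activation, and the layer sizes are assumptions for illustration; TRPO itself is not shown.

```python
import torch
import torch.nn as nn

N_JOINTS = 12  # ANYmal's actuated joints (assumption for this sketch)

class Policy(nn.Module):
    """Maps the slide's observation list to an angle command per actuator."""
    def __init__(self):
        super().__init__()
        obs_dim = (3               # robot position
                   + 4             # base orientation (quaternion)
                   + 3 * N_JOINTS  # joint angles, most recent 3 steps
                   + 3 * N_JOINTS  # control signals, most recent 3 steps
                   + 3)            # operator command
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.Tanh(),
            nn.Linear(256, 128), nn.Tanh(),
            nn.Linear(128, N_JOINTS),  # angle control signal for each actuator
        )

    def forward(self, obs):
        return self.net(obs)

policy = Policy()
action = policy(torch.randn(1, 82))  # 82 = 3 + 4 + 36 + 36 + 3
```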
13. Technique details: Learning Control Policy
• Build a stochastic model of the robot to account for modeling error
- about 15% of the mass is missing from the estimate due to un-modeled
cabling and electronics
• Randomize the simulator's conditions to robustify the policy
- by training with 30 different ANYmal models with stochastically
sampled inertial properties (a sampling sketch follows below)
- The center-of-mass positions, link masses, and joint positions are
randomized by adding noise sampled from
U(-2, 2) cm, U(-15, 15) %, and U(-2, 2) cm, respectively
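A sketch of how the 30 randomized models could be sampled is shown below. The noise ranges follow the slide; the nominal values and the link/joint counts are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_model(com, masses, joints):
    """Perturb one model's inertial properties, per the slide:
    CoM positions U(-2, 2) cm, link masses U(-15, 15) %, joint positions U(-2, 2) cm."""
    return (com + rng.uniform(-0.02, 0.02, com.shape),              # meters
            masses * (1 + rng.uniform(-0.15, 0.15, masses.shape)),  # relative noise
            joints + rng.uniform(-0.02, 0.02, joints.shape))        # meters

# Placeholder nominal properties (counts are assumptions)
nominal_com = np.zeros((13, 3))     # per-link center-of-mass positions
nominal_masses = np.ones(13)        # per-link masses in kg
nominal_joints = np.zeros((12, 3))  # joint frame positions

models = [sample_model(nominal_com, nominal_masses, nominal_joints) for _ in range(30)]
```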
14. Technique details: Learning Control Policy
• Naive training does not learn well
- With weak constraints on torque and joint velocity, the result is
unnatural movement
- With strong constraints on torque and joint velocity, training falls into
a local optimum where the robot does not move at all
• First learn the overall movement broadly, then refine the movement
afterwards
- e.g., the constraints on torque and joint velocity are weak at first
and are increased later in training
- Introduce curriculum variables k_c and k_d, where k_c = 1 corresponds to
the full-difficulty constraints
- Update k_c with the formula k_{c,j+1} ← (k_{c,j})^{k_d}
- where j is the reinforcement-learning iteration (see the snippet below)
- In the paper's experiments, k_0 = 0.3 and k_d = 0.997
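Since k_c starts in (0, 1) and the exponent k_d is slightly below 1, the update raises k_c monotonically toward 1, i.e. the constraint costs are gradually turned up to full strength. A few lines verify this:

```python
k_c, k_d = 0.3, 0.997  # k_0 and k_d from the slide
for j in range(3000):
    k_c = k_c ** k_d   # k_{c,j+1} <- (k_{c,j})^{k_d}
print(round(k_c, 4))   # ~0.9999: constraints approach full difficulty
```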
16. Technique details: Deployment on the physical system
• A custom MLP implementation and the trained parameter set were ported to
the robot's onboard PC.
• The network was evaluated at 200 Hz for command-conditioned and high-speed
locomotion, and at 100 Hz for recovery from a fall.
• Performance was surprisingly insensitive to the control rate.
• Even at 100 Hz, evaluating the network uses only 0.25% of the computation
available on a single CPU core.
17. Accuracy of Actuator net
• RMS error of predicted torque (see the sketch below):

            actuator net    ideal model
  train     0.740 N·m       3.55 N·m
  valid     0.996 N·m       5.74 N·m

• The collected data is split 9:1 into training and validation sets
• Compared against an analytical solution assuming an ideal actuator
- no communication delay, zero mechanical response time
- The actuator net's RMS error was smaller than that of the ideal-state
analytical solution
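A sketch of this evaluation protocol (9:1 split, RMS of predicted vs. measured torque); the data and the predictor below are placeholders, not the trained network.

```python
import numpy as np

def rms(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2))

# Placeholder log: 400 Hz x 4 min = 96,000 samples of (inputs, measured torque)
X, tau = np.random.randn(96000, 6), np.random.randn(96000)

# 9:1 split into training and validation sets
n_train = int(0.9 * len(X))
predict = lambda x: x @ np.zeros(6)  # stand-in for the actuator net or ideal model

print("train RMS [N*m]:", rms(predict(X[:n_train]), tau[:n_train]))
print("valid RMS [N*m]:", rms(predict(X[n_train:]), tau[n_train:]))
```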
18. Exp. 1: Command-conditioned locomotion
• Experiment contents
- The robot is given a command and must move according to it
- The command specifies the forward velocity, the lateral velocity, and
the heading (yaw) of the robot
• Reward function
- Angular velocity, moving speed, torque, joint speed (see appendix)
• Training
- 4 hours of wall-clock time (about 9 days of simulated time)
19. Exp. 1: Baseline method
• A model-based approach is used as the baseline
- Define a cost function for the task
- Compute the Hessian and Jacobian of the cost function and constraints,
and obtain the optimal center-of-gravity position and foot coordinates
via quadratic programming
- Compute the optimal accelerations and friction forces, solve for the
torques as another quadratic program, and send the commands to the
robot (a toy QP sketch follows below)
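To make the quadratic-programming step concrete, here is a toy QP solved with CVXPY; the cost, the Jacobian, and the constraint are illustrative assumptions, not the baseline's actual formulation.

```python
import cvxpy as cp
import numpy as np

n_contacts, n_joints = 4, 12
J = np.random.randn(3 * n_contacts, n_joints)  # stacked contact Jacobian (placeholder)
tau_des = np.random.randn(n_joints)            # desired joint torques (placeholder)

# Solve for contact forces that realize tau_des, with non-negative normal components
f = cp.Variable(3 * n_contacts)
problem = cp.Problem(cp.Minimize(cp.sum_squares(J.T @ f - tau_des)),
                     [f[2::3] >= 0])  # crude stand-in for friction-cone constraints
problem.solve()
print(f.value)
```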
22. Result 1: Command-conditioned locomotion
• Comparison of actuator modeling methods
- left: analytical actuator model, right: ideal actuator model
23. Result 1: Command-conditioned locomotion
• The fidelity of the simulator is evaluated as the difference between the
simulated and real robot's movement speed
- The simulator's behavior is quite close to that of the real machine
24. Result 1: Command-conditioned locomotion
• Control error with respect to the command, and control efficiency
(torque, power consumption)
- Compared with previous studies
25. Exp. 2: High-speed locomotion
• Experiment contents
- The task is to run as fast as possible
• Reward design and training time are the same as in Experiment 1
26. Exp. 2: High-speed locomotion
• Results
- Maximum speed in previous research: 1.2 m/s
- Maximum speed with this method: 1.6 m/s
• Discussion
- The maximum speed depends on hardware such as the actuators and other
parts
- With existing control methods, the planning computation is too heavy to
run in real time in the real environment, so high-speed control is not
possible
27. Exp. 3: Recovery from a fall
• Experiment contents
- The task is to stand up from a fallen state
- Experiments with nine initial configurations
• Reward function
- Constraints on torque, joint speed, joint acceleration, ... (see appendix)
• Training time
- 11 hours of wall-clock time (about 76 days of simulated time)
29. Conclusions
• Proposed a method for accurate and efficient control using reinforcement
learning in simulation only, and applied it to the real machine
• The proposed method learns a control policy that is robust to the
machine's state
- The policy could control the real machine without any reshaping, in
contrast to prior approaches that require months of tuning
• Future work
- Designing the reward and the initial-state distribution is difficult,
and the authors would like to improve this
- They would like to handle multiple tasks by giving the control policy a
hierarchical structure
- This has already been posted to arXiv (https://arxiv.org/pdf/1901.07517.pdf)
30. Impressions
• It is impressive that the computational requirements for both training
and inference are modest enough for practical control
• Many videos of the simulated agents were uploaded to YouTube, but I would
like to know more details about the simulator
• Reward design seems to be very difficult
• In the Experiment 1 video, the kick given to the robot is rather gentle
31. References
• Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso,
Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning Agile
and Dynamic Motor Skills for Legged Robots. Science Robotics,
4(26):eaau5872, 2019.
• Jemin Hwangbo, Joonho Lee, and Marco Hutter. Per-Contact Iteration Method
for Solving Contact Dynamics. IEEE Robotics and Automation Letters,
3:895–902, 2018.
• C. Dario Bellicoso, Fabian Jenelten, Christian Gehring, and Marco Hutter.
Dynamic Locomotion Through Online Nonlinear Motion Optimization for
Quadrupedal Robots. IEEE Robotics and Automation Letters, 3:2261–2268, 2018.