Learning agile and dynamic motor skills
for legged robots
Kohei Nishimura, DeepX, Inc.
Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso,
Vassilios Tsounis, Vladlen Koltun, Marco Hutter
Robotic Systems Lab, ETH Zurich, Switzerland
Introduction
• Proposes a control method for legged robots that require complex motor control,
combining improved simulation modeling with deep reinforcement learning
Background
• Legged robots are attracting attention as robots capable of operating in a variety
of environments
• In research on legged robots there are many methods for modeling and controlling
actuator behavior,
• but few studies consider generalization performance, ease of tuning, and efficiency
together
https://www.bostondynamics.com/spot-mini
http://biomimetics.mit.edu
https://www.anybotics.com/anymal/
Previous research
• Control the legged robot as a combination of modules
- ex. Assume that only the robot's center of gravity has mass, attach massless legs,
and compute the optimal control for that simplified model
- Disadvantages
• Modeling inaccuracy leads to control inaccuracy
• Control parameters must be re-tuned for every new robot or new model, and the
modules must be re-modeled and re-parameterized from scratch every time the
task changes
(this takes several months even for skilled engineers)
• Control by trajectory optimization
- Control using two modules: planning and tracking
- Tuning the parameters of the trajectory optimization is cumbersome and may get
stuck in local optima
- Trajectory optimization is computationally heavy and ill-suited to real-time
control of the robot
Robot control using RL
• Robot control using reinforcement learning has been studied as a
learning-based alternative
• There are two major approaches to sim2real robot control with
reinforcement learning
1. Make the simulator's behavior faithful to reality, so the learned
policy transfers easily to the real world
ex. Use direct-drive actuators, whose behavior can be modeled analytically
(Sim-to-real: Learning agile locomotion for quadruped robots)
2. Randomize the variables in the simulator and obtain a policy with
high generalization performance
ex. Randomize dynamics, add noise to observations
(Learning Dexterous In-Hand Manipulation)
Overview of proposed method
• The policy is learned via reinforcement learning in simulation only
- The policy receives the state and outputs an action (target joint angles) for
the actuators
- To close the gap between the simulator and the real world:
• Speed up and improve the simulation of ground contact
• Learn the action-to-torque relationship of the real actuators with a
neural network
• Randomize the simulator conditions (stochastic model) and train the
policy on them
Technique details : Improve contact simulation
• Requires a simulator that can handle the complicated contacts generated
by locomotion in a stable, accurate, and fast manner
• The common approach is the penalty method (also adopted in MuJoCo)
- A small penetration between objects is permitted and a corresponding
repulsive force is generated
- Easy to implement and computationally cheap, but simulation accuracy is poor
for very rigid objects
• A more accurate approach is the PGS (projected Gauss-Seidel) method
- Computes the contact forces from the physical constraint conditions
- It solves the constrained equations by iterative convergence, but the number
of iterations required is not stable
• Convergence can be slow, e.g. at the moment of impact
• The authors extend PGS with a bisection method, obtaining a solver that is
both stable and fast, and use it in these experiments
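For intuition, below is a minimal Python sketch of a plain projected Gauss-Seidel sweep for a frictionless contact LCP; the matrix A and vector b are illustrative placeholders, and the paper's bisection extension is not reproduced here.

```python
import numpy as np

def projected_gauss_seidel(A, b, iters=50):
    # Solve the LCP: find lam >= 0 with A @ lam + b >= 0 and lam_i * (A @ lam + b)_i = 0.
    # A is a symmetric PSD contact ("Delassus") matrix, b the free-motion velocity term.
    lam = np.zeros_like(b)
    for _ in range(iters):
        for i in range(len(b)):
            residual = A[i] @ lam + b[i]
            lam[i] = max(0.0, lam[i] - residual / A[i, i])  # project onto lam_i >= 0
    return lam

# Toy two-contact example
A = np.array([[2.0, 0.5],
              [0.5, 1.5]])
b = np.array([-1.0, 0.3])
lam = projected_gauss_seidel(A, b)
print(lam, A @ lam + b)  # non-negative impulses, non-negative post-contact velocities
```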
Technique details : Actuator Net
• Learn the action-to-torque relationship of the real-world actuators with a
neural network
- Input
• The joint position error (commanded minus measured angle) and the joint
velocity at times t, t − 0.01 s, and t − 0.02 s
- Output
• Torque at time t
- Network structure
• MLP with 3 hidden layers
• The activation function is softsign
- Training data
• Joint angle, joint angular velocity, and torque collected at 400 Hz
for about 4 minutes
• The robot walks with a simple controller while disturbances are applied
during walking
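A minimal PyTorch sketch of an actuator network of this shape; the hidden width (32) and the exact input layout (three position errors plus three velocities) are assumptions made here for illustration, not values quoted from the slide.

```python
import torch
import torch.nn as nn

class ActuatorNet(nn.Module):
    """Maps a short history of joint position errors and velocities to the joint torque."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.Softsign(),   # err(t), err(t-0.01), err(t-0.02), vel(t), vel(t-0.01), vel(t-0.02)
            nn.Linear(hidden, hidden), nn.Softsign(),
            nn.Linear(hidden, hidden), nn.Softsign(),
            nn.Linear(hidden, 1),                  # predicted torque at time t
        )

    def forward(self, x):
        return self.net(x)

net = ActuatorNet()
history = torch.randn(1, 6)   # one sample of the 6-dimensional input history
tau = net(history)            # shape (1, 1); trained with an MSE loss against measured torque
```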
Technique details : Reinforcement learning
At every time step t the agent obtains an observation o_t ∈ O,
performs an action a_t ∈ A, and receives a scalar reward r_t ∈ R.
The aim is to find a policy that maximizes the discounted sum of rewards over
an infinite horizon:
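The objective on the original slide was an image that did not survive extraction; in standard notation it is the usual discounted-return objective:

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right],
\qquad \gamma \in (0, 1)
```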
Technique details : Control Policy
• Control policy
- Input
• Position of the robot
• Base orientation of the robot
• Joint angle history (most recent 3 steps)
• Control signal history (most recent 3 steps)
• Command signal (from a human operator)
- Output
• Control signal (angle command for each actuator)
• The policy is trained with TRPO
- Uses the default parameters from the original TRPO paper
("Trust Region Policy Optimization (TRPO) [22], a policy gradient algorithm that
has been demonstrated to learn locomotion policies in simulation")
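For reference, the TRPO update (not shown on the slide) maximizes a surrogate objective under a KL-divergence trust region:

```latex
\max_{\theta} \; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
  \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A^{\pi_{\theta_{\text{old}}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta
```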
Technique details : Learning Control Policy
• Build a stochastic model of the robot to account for modeling error
- e.g. roughly 15% mass estimation error due to un-modeled cabling and
electronics
• Randomize the simulator conditions to robustify the policy
- Train with 30 different ANYmal models with stochastically sampled
inertial properties
- The center-of-mass positions, link masses, and joint positions are randomized
by adding noise sampled from U(−2, 2) cm, U(−15, 15)%, and U(−2, 2) cm,
respectively (see the sketch below)
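A minimal Python sketch of that randomization; the nominal values and the dictionary structure are illustrative placeholders, not the paper's actual model format.

```python
import numpy as np

def sample_anymal_model(nominal, rng):
    """Perturb nominal link properties with the uniform noise described above."""
    model = {}
    for name, link in nominal.items():
        model[name] = {
            "com": link["com"] + rng.uniform(-0.02, 0.02, size=3),              # U(-2, 2) cm per axis
            "mass": link["mass"] * (1.0 + rng.uniform(-0.15, 0.15)),            # U(-15, 15) % relative
            "joint_pos": link["joint_pos"] + rng.uniform(-0.02, 0.02, size=3),  # U(-2, 2) cm per axis
        }
    return model

rng = np.random.default_rng(0)
nominal = {"base": {"com": np.zeros(3), "mass": 16.0, "joint_pos": np.zeros(3)}}  # placeholder values
models = [sample_anymal_model(nominal, rng) for _ in range(30)]  # 30 randomized robot models
```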
Technique details : Learning Control Policy
• Naive training does not learn well
- Weakening the penalties on torque and joint velocity produces unnatural motion
- Strengthening those penalties leads to a local optimum in which the robot does
not move at all
• First learn the overall motion coarsely, then refine it
- ex. the penalties on torque and joint velocity start small and increase in the
later stages of training
- Introduce curriculum variables k_c and k_d, with k_c = 1 corresponding to the
full (difficult) penalties
- Update k_c with the rule k_{c,j+1} ← (k_{c,j})^{k_d}
- where j is the reinforcement-learning iteration index
- In the paper's experiments, k_{c,0} = 0.3 and k_d = 0.997 (see the sketch below)
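A small sketch of that curriculum schedule; since 0 < k_c < 1 and k_d < 1, k_c increases monotonically toward 1 (the full penalties).

```python
def curriculum(k0=0.3, kd=0.997, steps=10_000):
    """Yield k_c for each RL iteration using the update k_{c,j+1} = k_{c,j} ** k_d."""
    kc = k0
    for _ in range(steps):
        yield kc
        kc = kc ** kd

# Example: scale the torque / joint-velocity penalty terms in the reward by k_c each iteration.
for j, kc in enumerate(curriculum(steps=5)):
    print(j, round(kc, 4))   # 0.3, 0.3011, 0.3022, ...
```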
Diagram of proposed method
• The networks and variables used in the proposed method are summarized in the figure
Technique details : Deployment on the physical system
• Custom MLP implementation and the trained parameter set were
ported to the robot’s onboard PC.
• This network was evaluated at 200 Hz for command-conditioned/high
speed locomotion and at 100 Hz for recovery from a fall.
• Performance was surprisingly insensitive to the control rate.
• Even at 100 Hz, evaluation of the network uses only 0.25% of the
computation available on a single CPU core.
Accuracy of the Actuator Net
• The collected data is split 9:1 into training and validation sets
• Compared against an analytical solution assuming an ideal actuator
- no communication delay, zero mechanical response time
- The RMS torque error of the actuator net is smaller than that of the
ideal-state analytical model

RMS torque error   actuator net   ideal model
train              0.740 N·m      3.55 N·m
valid              0.996 N·m      5.74 N·m
Exp. 1: Command-conditioned locomotion
• Experiment contents
- The robot is given a command and must move according to it
- The command consists of the forward speed, the lateral speed, and the
heading direction of the robot
• Reward function
- Terms on angular velocity, body velocity, torque, joint speed (see appendix)
• Training
- About 4 hours of wall-clock training (roughly 9 days of simulated time)
Exp. 1 : Comparison method
• As a baseline, a model-based approach from prior work is used
- Define a cost function for the task
- Compute the Hessian and Jacobian of the cost function under the constraint
conditions, and solve for the optimal center-of-gravity position and foot
coordinates as a quadratic program
- Compute the optimal accelerations and friction forces, solve for the torques
as another quadratic program, and send the commands to the robot
Result 1: Command-conditioned locomotion
• Experimental result (proposed method)
Result 1: Command-conditioned locomotion
• Comparison of actuator modeling methods
- left: analytical actuator model, right: ideal actuator model
Result 1: Command-conditioned locomotion
• Evaluate the fidelity of the simulator by comparing the robot's velocity in
simulation and on the real machine
- The simulator's behavior is quite close to that of the real machine
Result 1: Command-conditioned locomotion
• Tracking error with respect to the command and control efficiency
(torque, power consumption)
- Compared against previous studies
Exp. 2 : High-speed locomotion
• Experiment contents
- The task is to run as fast as possible
• The reward design and training time are the same as in Experiment 1
Exp. 2 : High-speed locomotion
• Result
- Maximum speed in previous research: 1.2 m/s
- Maximum speed with this method: 1.6 m/s
• Discussion
- The maximum speed depends on hardware such as the actuators and other parts
- With the existing control method, the planning computation is too heavy to
keep up in the real environment, so high-speed control is not possible
Exp. 3: Recovery from a fall
• Experiment contents
- The task is to get up from a fallen (lying-on-the-ground) state
- Experiments with nine initial configurations
• Reward function
- Terms on torque, joint speed, joint acceleration, ... (see appendix)
• Training time
- About 11 hours of wall-clock training (roughly 76 days of simulated time)
Conclusions
• Proposed a method that controls the robot accurately and efficiently using
reinforcement learning in simulation only, and applied it to the actual machine
• The proposed method learns a control policy that is robust to the machine's state
- It was possible to control the actual machine without reshaping the
policy within 3 months
• Future work
- Designing the reward and the distribution of initial states is difficult and
should be improved
- Would like to handle multiple tasks by giving the control policy a hierarchical
structure
- The paper has already been posted to arXiv (https://arxiv.org/pdf/1901.07517.pdf)
Impressions
• It is impressive that the compute requirements for both training and
inference are modest and the robot can still be controlled
• Many videos of the simulated agents were uploaded to YouTube, but
I would like to know more details about the simulator
• Reward design seems to be very difficult
• The kick given to the robot in the Experiment 1 video looks gentle
References
• Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso,
Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning Agile
and Dynamic Motor Skills for Legged Robots. Science Robotics,
4(26):eaau5872, 2019.
• J. Hwangbo, J. Lee, M. Hutter, Per-contact iteration method for
solving contact dynamics. IEEE Robot. Autom. Lett. 3, 895–902
(2018).
• C. D. Bellicoso, F. Jenelten, C. Gehring, M. Hutter, Dynamic
locomotion through online nonlinear motion optimization for
quadrupedal robots. IEEE Robot. Autom. Lett. 3, 2261–2268 (2018).
Notation used for Reward function
Reward function in Exp. 1 & 2
• The K that appears below is a logistic kernel
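As a reminder of what such a kernel looks like (the exact constants here are quoted from memory of the paper's supplementary material and should be treated as an assumption), a bounded logistic kernel has the form:

```latex
K(x) = \frac{-1}{e^{x} + 2 + e^{-x}}, \qquad K : \mathbb{R} \to [-0.25, 0)
```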
Reward function in Exp. 1 & 2
• The reward function is the sum of the following terms
- k_c is a curriculum variable
Appendix. Reward function in Exp. 3
• The angleDiff() that appears below returns the smaller of the two possible
differences between two angles
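A minimal sketch of such a wrapped angle difference (the name angleDiff comes from the paper's appendix; this version is only an illustration of the convention):

```python
import math

def angle_diff(a, b):
    """Signed difference a - b wrapped to (-pi, pi], i.e. the smaller of the two possible differences."""
    d = (a - b) % (2.0 * math.pi)
    return d - 2.0 * math.pi if d > math.pi else d

print(angle_diff(3.0, -3.0))  # ~ -0.283 rather than 6.0
```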