
Deep Analytics - Differential Machine Learning

We combine machine learning with automatic differentiation (AAD) to reliably learn pricing approximations from small simulated datasets, effectively resolving the computational load of risk management and capital calculations such as CVA/XVA, CCR, FRTB or SIMM-MVA.


  1. 1. Material
• All the slides: www.deep-analytics.org
• Working paper on arXiv: arxiv.org/abs/2005.02347 "Differential Machine Learning"
• GitHub: github.com/differential-machine-learning
• Demonstration notebooks: github.com/differential-machine-learning/notebooks
 Designed for Google Colab
 Discuss practical implementation details
 Contain TensorFlow code
• Appendices: github.com/differential-machine-learning/appendices
 Mathematical proofs
 Implementation details
 Extensions
  2. 2. Automatic Differentiation (AAD)
• This is not strictly speaking a talk on AAD
• But we extensively apply AAD and backpropagation
• A good understanding of these techniques is a strong prerequisite
• We refer to:
 The AAD textbook: amazon.com/Modern-Computational-Finance-Parallel-Simulations-dp-1119539455/dp/1119539455
 A 15min video tutorial: towardsdatascience.com/automatic-differentiation-15min-video-tutorial-with-application-in-machine-learning-and-finance-333e18c0ecbb?source=friends_link&sk=c11be895fffa9a8276d3e74a598dbbc3
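To make the AAD prerequisite concrete, here is a toy reverse-mode AAD engine on a tape, for intuition only; it is an illustrative sketch, not the AAD library used in the talk. Each operation records its parents and local derivatives, and a single backward sweep over the tape accumulates all adjoints.

```python
import math

tape = []  # records nodes in creation order

class Num:
    def __init__(self, value, parents=()):
        self.value, self.parents, self.adjoint = value, parents, 0.0
        tape.append(self)
    def __add__(self, o):
        o = o if isinstance(o, Num) else Num(o)
        return Num(self.value + o.value, ((self, 1.0), (o, 1.0)))
    def __sub__(self, o):
        o = o if isinstance(o, Num) else Num(o)
        return Num(self.value - o.value, ((self, 1.0), (o, -1.0)))
    def __mul__(self, o):
        o = o if isinstance(o, Num) else Num(o)
        return Num(self.value * o.value, ((self, o.value), (o, self.value)))

def exp(x):
    v = math.exp(x.value)
    return Num(v, ((x, v),))  # d exp(x)/dx = exp(x)

def backprop(y):
    # single backward sweep: adjoints flow from output to leaves
    y.adjoint = 1.0
    for node in reversed(tape):
        for parent, local in node.parents:
            parent.adjoint += local * node.adjoint

# One Monte-Carlo path of a call payoff in Black & Scholes (illustrative numbers):
vol, T, K, w = 0.2, 1.0, 100.0, 0.3   # w = one fixed Brownian draw
s0 = Num(100.0)
sT = s0 * exp(Num(-0.5 * vol * vol * T + vol * w))
payoff = sT - K                        # path finishes in the money
backprop(payoff)
# s0.adjoint now holds the pathwise delta 1_{S_T > K} * S_T / S0 on this path
```

The adjoint of the input `s0` equals the pathwise differential of the payoff along that path, exactly the quantity the talk later uses as a differential label.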
  3. 3. AAD tutorial on YouTube: youtube.com/watch?v=IcQkwgPwfm4 • Slides and more here: towardsdatascience.com/automatic-differentiation-15min-video-tutorial-with-application-in-machine-learning-and-finance-333e18c0ecbb?source=friends_link&sk=c11be895fffa9a8276d3e74a598dbbc3
  4. 4. AAD in finance • AAD made differentials massively available • Gave us realtime risk reports and instantaneous calibration • Unlocked new research and development potential, with many powerful applications • In this talk, we discuss a very special application: we learn the pricing function of Derivatives from AAD pathwise differentials
  5. 5. Pricing function approximation
• Pricing in closed-form a la Black & Scholes is only available for simple instruments in simple models
• In all other cases, prices are computed with numerical methods, often Monte-Carlo
• Monte-Carlo is orders of magnitude slower than analytics
• We need fast analytics for:
 Risk management of option books
 Exotic risk reports in multiple scenarios
 Backtesting
 Value at Risk / Expected Loss
 Regulations like XVA, CCR, FRTB, SIMM-MVA, where trading books are repeatedly evaluated in different market scenarios
• Hence the intensive research on pricing function approximation for 2-3 decades
 The objective is to derive approximate pricing functions of the market state
 With speed similar to analytics
 And accuracy similar to Monte-Carlo
  6. 6. 2000: SABR
• Market realizes the importance of stochastic volatility (SV) for risk management
• Challenge: no closed form formula, Monte-Carlo far too slow
• Pat Hagan works out a mathematical approximation in Managing Smile Risk (Wilmott, 2002) for the call price E_t[ (F_T - K)^+ ] in the model dF = sigma F^beta dW, dsigma = alpha sigma dZ, dW dZ = rho dt
• Resulting in the famous SABR formula
• SABR brings SV models to trading desks and quickly becomes a market standard, which it remains to this day
  7. 7. 2019: Deep Learning Volatility
• Age of AI: similar problem, different solution
• Example: 'rough volatility' (RV) models
 New family of stochastic volatility models with excellent risk management properties
 (Read all about RV in Jim Gatheral's papers and presentations)
 Unfortunately, no closed form or accurate approximation for European calls
• Deep Learning Volatility (Horvath et al., 2019)
 Proposed pricing of calls in RV models by neural network
 Trained on examples produced by Monte-Carlo
 Trained network is the approximation: fast pricing by feedforward induction, risk sensitivities by backpropagation
 Permits calibration and risk management in realistic time, making RV suitable for production
• Note:
 Pricing function is learned from data, not derived from underlying mathematics
 A very large training set is necessary to achieve accurate approximation
 Authors train the net with ground truth labels, where each training example is computed by Monte-Carlo
 Training set is obtained at massive computation cost
 Pricing function is learned once, offline, and reused forever
  8. 8. 2020: Deep Analytics
• Our ambition: generalize and automate the production of pricing approximations
 Learn the pricing function for arbitrary schedules of cashflows, not only calls: options, exotics, hybrids, netting sets, trading books…
 In arbitrary simulation models: multifactor, multiunderlying, multicurrency, stochastic volatility…
 Online, in realtime, as part of a risk computation
 From data alone, simulated in reasonable time
 With convergence guarantees
• Achieved by:
 Covered in the presentation:
 1. Training on payoff samples, a la Longstaff-Schwartz (2001)
 2. Leveraging modern deep learning
 3. Training on pathwise differentials obtained with AAD
 Not covered in the presentation, see appendix 4: github.com/differential-machine-learning/appendices/blob/master/App4-UnsupervisedTraining.pdf
 4. Obtaining training convergence guarantees with special network architecture
 5. Controlling asymptotics with specialized algorithms
  9. 9. Context and notations
• We have a simulation model:
 Simulates a Markov state vector S under some martingale measure Q
 S represents the state of the market together with path dependencies (barriers, exercises, etc.)
• We represent transactions and trading books as collections of event driven cashflows:
 Cashflows CF_p paid at time T_p are T_p-measurable variables: CF_p = g_p(S_t, 0 <= t <= T_p)
 We assume all cashflows are appropriately smoothed, we are going to differentiate them!
 For example, discontinuous digital cashflows are represented as tight call spreads
 Smoothing is standard practice, necessary for the computation of Monte-Carlo risk reports
 We ignore discounting/numeraire to simplify notations
 We call payoff the sum of cashflows paid after some horizon date (maybe today): payoff = sum_{T_p > t} CF_p
• The price is the conditional expectation of the payoff: V_t = E^Q[ sum_{T_p > t} CF_p | S_t ] = f(S_t)
 It is therefore a deterministic (but unknown) function f of the current state S
 Could be evaluated by Monte-Carlo for every different input state
 Our objective is to learn an accurate approximation of f that evaluates with analytic speed
  10. 10. Context and notations (2)
• Assuming appropriate smoothing, differentiation and expectation commute: d/dx E^Q[ payoff | S_t = x ] = E^Q[ d(payoff)/dx | S_t = x ]
• Risk sensitivities (left-hand side) are conditional expectations of pathwise differentials (right-hand side)
 Pathwise differentials = differentials of the final payoff wrt the initial state along a path
 Random variables, measurable on maturity = payment date of the final cashflow
 Computed very efficiently with AAD, recall the video tutorial
• Monte-Carlo risk reports are computed by averaging pathwise differentials
• See detailed discussion in appendix 1: github.com/differential-machine-learning/appendices/blob/master/App1-LSM.pdf
  11. 11. Example: European call in Black & Scholes
• State vector: spot price
• Model: geometric Brownian motion, dS = sigma S dW
• State on horizon date: spot price S_T1
• Payoff: European call, paid at T2: (S_T2 - K)^+, with S_T2 = S_T1 exp( -sigma^2 (T2 - T1) / 2 + sigma (W_T2 - W_T1) )
• Price at T1: expected payoff f(S_T1) = E[ (S_T2 - K)^+ | S_T1 ]
• Given by Black & Scholes formula: f(S_T1) = S_T1 N(d1) - K N(d2)
• Pathwise differential: d(S_T2 - K)^+ / dS_T1 = 1_{S_T2 > K} S_T2 / S_T1
 In Black & Scholes: computed in closed form, effectively together with the payoff
 Random variable with expectation = delta: E[ 1_{S_T2 > K} S_T2 / S_T1 | S_T1 ] = N(d1) = df(S_T1) / dS_T1
• Extend to general case:
 Always efficiently computed with AAD
 Always unbiased estimates of risk sensitivities
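A quick numerical sketch of this slide, with illustrative parameters of our own choosing: averaging the pathwise differentials 1_{S_T > K} S_T / S0 over Monte-Carlo paths reproduces the Black & Scholes delta N(d1).

```python
import numpy as np
from math import log, sqrt
from statistics import NormalDist

# Illustrative parameters (not from the talk)
s0, K, vol, T = 1.00, 1.10, 0.20, 1.0
rng = np.random.default_rng(42)
m = 1_000_000

# Martingale GBM: S_T = S0 * exp(-vol^2 T / 2 + vol * W_T)
w = rng.standard_normal(m)
sT = s0 * np.exp(-0.5 * vol**2 * T + vol * sqrt(T) * w)

# Pathwise differential of the payoff (S_T - K)+ wrt the initial state S0
pathwise_delta = np.where(sT > K, sT / s0, 0.0)
mc_delta = pathwise_delta.mean()          # Monte-Carlo risk report: average

# Closed-form Black & Scholes delta N(d1)
d1 = (log(s0 / K) + 0.5 * vol**2 * T) / (vol * sqrt(T))
bs_delta = NormalDist().cdf(d1)
```

With a million paths the Monte-Carlo average lands within a fraction of a basis point of N(d1), illustrating the unbiasedness of pathwise differentials.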
  12. 12. Learning from samples
  13. 13. Simulated dataset
• Simulate a training set a la Longstaff-Schwartz (2001)
• Each training example is simulated with one Monte-Carlo path:
 Training example (i) picked on simulated path (i)
 Training input (i): the state vector on some horizon date (possibly today): X^(i) = S^(i)_Tex
 Training label (i): the final payoff on the same path: Y^(i) = sum_{T_p > Tex} CF^(i)_p
• The entire training set of m examples is simulated with m Monte-Carlo paths in a time similar to one pricing by Monte-Carlo
• Hence, the dataset is prepared in realistic time, suitable for realtime training
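A minimal LSM-style training set simulator, sketched here in Black & Scholes for concreteness (the talk's engine handles arbitrary models and cashflow schedules). One Monte-Carlo path per example: X is the state at the horizon T1, Y the payoff paid at T2 on the same path; all parameter values below are illustrative.

```python
import numpy as np

def simulate_dataset(m, s0=1.0, K=1.10, vol=0.2, t1=1.0, t2=2.0, seed=0):
    rng = np.random.default_rng(seed)
    # state at the horizon date T1 (training inputs)
    w1 = rng.standard_normal(m)
    x = s0 * np.exp(-0.5 * vol**2 * t1 + vol * np.sqrt(t1) * w1)
    # continue each path to T2 and collect the payoff (training labels)
    w2 = rng.standard_normal(m)
    dt = t2 - t1
    sT = x * np.exp(-0.5 * vol**2 * dt + vol * np.sqrt(dt) * w2)
    y = np.maximum(sT - K, 0.0)
    return x, y

x, y = simulate_dataset(8192)   # m examples for the cost of m paths
```

The key point the slide makes survives in the sketch: the dataset costs one simulation, because states and payoffs are simply stored along each path.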
  14. 14. Training example: X (vector of dimension n) = state at Tex, Y (real number) = payments at T_p > Tex • Step 1: sample the state on the horizon date: X^(1) = S^(1)_Tex
  15. 15. Step 2: run the Monte-Carlo path and accumulate the future payments: Y^(1) = sum_{T_p > Tex} CF^(1)_p
  16. 16. Step 3: repeat m times independently: (X^(i), Y^(i)) for i = 1, ..., m
  17. 17. Training on samples 1. Practical  Reuse the Monte-Carlo pricing engine, reconfigured as a training set simulator 2. Efficient  The entire training set is simulated for the cost of one Monte-Carlo pricing  If you are going to price your book, you get the training set for free  Just store states and payoffs for every Monte-Carlo path  Variance reduction methods (e.g. Sobol sequences) automatically reused 3. Consistent  Dataset produced with the same model as front office pricing 4. General  If you can price a transaction, then you can simulate a dataset  Script your transaction  Calibrate your model  Simulate the training set
  18. 18. Can we really learn prices from samples?
• The machine is expected to learn a pricing function
• Without having seen a single price, only samples
• Example: European call in Black-Scholes
• Can the machine figure out the Black & Scholes formula from payoff samples?
• Learn this: E[ (S_T2 - K)^+ | S_T1 ] = S_T1 N(d1) - K N(d2)
• From this: (X^(i), Y^(i)) = ( S^(i)_T1 , (S^(i)_T2 - K)^+ ), with S^(i)_T2 = S^(i)_T1 exp( -sigma^2 (T2 - T1) / 2 + sigma (W^(i)_T2 - W^(i)_T1) )
  19. 19. Yes, we can
• Mathematical guarantee:
 A universal approximator f_hat(x; w) with learnable weights w
 Capable of approximating any function when capacity (~number of weights) grows to infinity
 Includes classic basis function regression and neural networks
 Trained on sample payoffs by minimization of the mean squared error: MSE(w) = (1/m) sum_i ( Y^(i) - f_hat(X^(i); w) )^2
 Converges to the true pricing function, a.k.a. conditional expectation: f(x) = E[ Y | X = x ] = E^Q[ payoff | S_Tex = x ]
 Asymptotically in the size m of the training set and the capacity of the approximator
• Proof and discussion: appendix 1 github.com/differential-machine-learning/appendices/blob/master/App1-LSM.pdf
• Intuition:
 True price is the conditional expectation of the payoff
 Hence, the sampled label is an unbiased (noisy) estimate of the true price: Y = f(X) + epsilon, with E[ epsilon | X ] = 0
 Independent noise diversified away in a growing dataset
 Approximators converge to the closest function to f they are able to represent
 Universal approximators converge exactly to f when capacity grows
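A small numerical illustration of the guarantee, a sketch under assumptions of our own (a degree-5 polynomial as the approximator, illustrative Black & Scholes parameters): minimizing the MSE over payoff samples recovers the true price, without the machine ever seeing a price.

```python
import numpy as np
from math import log, sqrt
from statistics import NormalDist

# Illustrative parameters (not from the talk)
s0, K, vol, t1, t2 = 1.0, 1.10, 0.2, 1.0, 2.0
rng = np.random.default_rng(7)
m = 100_000

# LSM-style samples: states at T1, payoffs at T2 on the same paths
w1 = rng.standard_normal(m)
x = s0 * np.exp(-0.5 * vol**2 * t1 + vol * sqrt(t1) * w1)
w2 = rng.standard_normal(m)
dt = t2 - t1
sT = x * np.exp(-0.5 * vol**2 * dt + vol * sqrt(dt) * w2)
y = np.maximum(sT - K, 0.0)

# Minimize MSE over a degree-5 polynomial approximator
coeffs = np.polyfit(x, y, deg=5)
approx_price = np.polyval(coeffs, 1.0)      # approximation at S_T1 = 1

# Black & Scholes benchmark at the same state
d1 = (log(1.0 / K) + 0.5 * vol**2 * dt) / (vol * sqrt(dt))
d2 = d1 - vol * sqrt(dt)
nd = NormalDist().cdf
true_price = 1.0 * nd(d1) - K * nd(d2)
```

Near the center of the state distribution the fitted polynomial matches the Black & Scholes price closely, even though every label is a noisy payoff.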
  20. 20. Caveat 1
• The result is asymptotic in both the size of the dataset and the capacity of the approximator
• Finite capacity approximators (e.g. finite size neural nets) do not converge to true prices
 When the dataset grows, finite size neural nets converge to the projection of the true pricing function onto the subspace of functions attainable by the net
 The difference is called bias: the approximation error that remains when training with a dataset of infinite size
 The space of functions spanned by neural nets grows linearly in units and exponentially in layers
 Bias vanishes quickly with neural depth
 This is the main point of deep neural networks
 Excellent presentation by Matthew Dixon @ Department of Mathematics, Illinois Institute of Technology, Nov 4th 2019
• With finite size datasets
 Noise is not fully diversified away
 Training may overfit noise in the training set and fail to approximate the true function
 Overfitting is related to variance: the dependency of the final result on the noise of a particular training set
  21. 21. Caveat 2 • The global minimum of the MSE converges to the true pricing function • There is no guarantee that your optimization algorithm finds it in finite time • With neural nets, MSE is a nonconvex function of weights • There exists no algorithm guaranteed to find the global minimum • Modern heuristics, see demonstration notebook github.com/differential-machine-learning/notebooks/blob/master/DifferentialML.ipynb  Xavier-Glorot weight initialization  Normalization of inputs and labels  ADAM optimizer  Leslie Smith’s one-cycle learning rate schedule • Help training quickly converge to “good” solutions in “most” situations • But without hard guarantee, not good enough for online learning in a risk system • In appendix 4, we introduce a neural architecture with worst case training guarantees github.com/differential-machine-learning/appendices/blob/master/App4-UnsupervisedTraining.pdf
  22. 22. Training with differentials
• In summary:
 We work with a sampled dataset { (X^(i), Y^(i)) }
 Where X^(i) = S^(i)_Tex is the initial state on path i, and Y^(i) is the final payoff on the same path
 The entire dataset is computed for the cost of one Monte-Carlo simulation
 With guaranteed convergence to the true pricing function by training a universal approximator to minimize MSE
• We can do better than that:
 With AAD we can compute pathwise differentials dY^(i) / dX^(i) for very little cost
 Pathwise differentials are differential labels = differentials of labels wrt inputs
 We make them part of an extended training set { (X^(i), Y^(i), dY^(i)/dX^(i)) } to teach the machine the shape of the function
  23. 23. Deep Learning
  24. 24. Simple example: Black & Scholes
• Results produced with simple self-contained notebooks available on github.com/asavine/CompFinLecture/tree/master/MLFinance
• Implement polynomial regression and simplistic neural nets on simulated samples
• Polynomial regression: instantaneous; simplistic neural net: < one second
  25. 25. High(er) dimensional example: basket option
• Available on github.com/asavine/CompFinLecture/tree/master/MLFinance
• Toy basket option, correlated Gaussian model, dimension 10
• Correct price known in closed form (Bachelier), function of the current basket
• We know it's a 1d problem, but the machine sees it in dimension 10
• Polynomial regression: now ~one minute! Simplistic neural net: still < one second
  26. 26. Classic regression in high dimension • It looks like classic regression fails in high dimension  Hence unsuitable for general trading books  Where dimension is usually high  Think Libor Market Models on all yield curves: Markov dimension grows in the 100s • On the contrary, neural nets seem immune to dimensionality • Let us try to understand this
  27. 27. Curse of dimensionality
• Classic regression: f_hat(x; w) = sum_{j=1}^d w_j g_j(x)
• The number d of basis functions g increases exponentially with the dimension n of the input x
• Example: polynomial basis with degree up to p
 Number d of monomials of the form g_j(x) = prod_i x_i^(k_ij), with sum_i k_ij <= p
 Is d = (n + p)! / (n! p!)
• So we have a large number d of regression parameters w, and that makes regression:
 Expensive, hence computation time ~1min
 Vulnerable to overfitting, hence the poor performance: training tries to fit samples and ends up interpolating between them instead of learning the underlying pattern
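The combinatorial formula on this slide is easy to check numerically, which makes the curse tangible:

```python
from math import comb

# Number of monomials of total degree <= p in n variables:
# d = (n + p)! / (n! p!) = C(n + p, p)
def num_monomials(n, p):
    return comb(n + p, p)

print(num_monomials(1, 5))     # 6: a handful of basis functions in dimension 1
print(num_monomials(10, 5))    # 3003: already unwieldy in dimension 10
print(num_monomials(100, 5))   # 96560646: hopeless for a 100-dimensional state
```

Going from dimension 1 to dimension 100 with degree 5 blows the basis up from 6 functions to nearly 100 million, which is why classic regression becomes both slow and overfit-prone in high dimension.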
  28. 28. Classic LSM • Least Square Method (LSM, Longstaff-Schwartz, Carriere, 1999-2001)  Approximates prices by classic regression on samples  Served the industry well for 20y for Bermudan options and other types of early exercisable transactions  How is this consistent with our results? • This is because implementations of LSM do not regress on the entire state  They regress on a low number of regression variables  Regression variables are fixed, hardcoded function of the state  Derived from prior knowledge and analysis of the transaction and model • Example: Bermudan options in the Libor Market Model (LMM)  The reason why LSM was invented in the first place  LMM is a high dimensional model  The state vector contains all forward Libor rates up to the final maturity  With 3m Libors and maturity 30y, this is dimension 120  However, it is known that Bermudan options in LMM mainly depend on only two variables:  Swap rate to maturity  Discount rate to next call  Typical implementations approximate continuation values by regression on these two variables
  29. 29. LSM: manual feature extraction
• "Input" layer: model state, all 3m forward rates up to 30y, dimension 120
• "Hidden" layer: regression variables, swap rate to maturity + discount factor to next call, dimension 2
• "Regression" layer: quadratic monomials (x1, x2, x1^2, x2^2, x1 x2), dimension 5
• "Output" layer: continuation value by regression
• The transformation from state to basis is fixed and hardcoded; only the final linear regression is learned
  30. 30. Limitations and neural networks • These particular regression variables only apply to standard Bermudas in LMM • With different transactions or different models, we need different reg. variables • E.g. with stochastic volatility, state of volatility is a 3rd variable, crucial in determination of value • What about arbitrary schedules of cashflows in arbitrary models? • Cannot use prior knowledge to turn a high dim state into a low dim regression vector • Only simulated data • We must learn regression features meaningful for this schedule of cashflows in this model • Automatically, from data alone: called automatic feature extraction in Machine Learning • Neural networks excel at it (hence their success in tasks like computer vision or NLP)
  31. 31. Neural networks: automatic feature extraction
• "Input" layer: model state
• "Hidden" layers: learn the transformation of the state into regression variables
• "Regression" layer: learnt basis functions
• "Output" layer: still a linear regression
• Both the state-to-regression-vector transformation and the final linear regression are learned
  32. 32. Neural networks: breaking the curse
• Neural networks learn useful regression features in their hidden layers
• And encode them in their connection weights
• This is how neural nets adapt regression features to data
• And break the curse of dimensionality
• Just like we break it manually in the case of e.g. Bermudan options
• But neural nets do it automatically, during training
• And adapt to arbitrary schedules of cashflows and simulation models
  33. 33. Deep learning pricing functions • Appendix 4 github.com/differential-machine-learning/appendices/blob/master/App4-UnsupervisedTraining.pdf  Discusses the details of neural compared with classic regression  Combines neural nets with classic regression to obtain training convergence guarantees • Demonstration notebook github.com/differential-machine-learning/notebooks/blob/master/DifferentialML.ipynb  Discusses practical implementation details: architecture, initialization, normalization, optimization  TensorFlow implementation code
  34. 34. Deep learning limitations 1. Requires unrealistically large datasets 2. Computes poor risk sensitivities 3. Prone to overfitting on smaller datasets
  35. 35. Unrealistically large datasets • Back to basket option example  Dimension 30  Results produced with demonstration notebook github.com/differential-machine-learning/notebooks/blob/master/DifferentialML.ipynb
  36. 36. Poor risk sensitivities • Differentials of a good approximation are not always good approximations of differentials  Risk sensitivities of the approximation converge very slowly  Showing delta to the first stock in the basket
  37. 37. Computational cost • Most of the computation cost is in the simulation of the training set  Not training, usually ~1sec on entry level GPU  But the simulation of the model states  And the evaluation of the many cashflows in a large trading book • Hence, in realistic contexts we train on small datasets  In practice, 1,024 to 32,768 paths but no more • Training with small datasets is vulnerable to overfitting
  38. 38. Overfitting and variance
• High capacity ML models like neural nets have many parameters; with small training sets:
 Optimizers try to fit all the samples
 End up interpolating between samples instead of picking up underlying patterns
 And fail to achieve meaningful approximation
• Simple example: fit a degree 6 polynomial on 7 samples in Black & Scholes
 Perfect fit of the training set
 Completely missing the correct function
 With high variance: very different results depending on the noise in the training set
  39. 39. Regularization
• Conventional regularization mitigates overfitting with constraints on the learnable weights w
• Example: Tikhonov (also called ridge)
 Extend the cost function with a penalty on the size of the weights: w* = argmin_w (1/m) sum_i ( Y^(i) - f_hat(X^(i); w) )^2 + lambda ||w||_2^2
 Expressing preference for small weights
 Hyperparameter lambda: regularization strength
• Other common regularizations include
 Lasso: L1 norm penalty
 Dropout: randomly drop connections when training neural nets
• Regularization constraints effectively stop optimizers from fitting samples
• Therefore mitigate overfitting and reduce variance
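For linear-in-weights approximators, the Tikhonov objective on this slide has a closed-form solution, which a few lines of numpy make concrete (a sketch with illustrative data, not the talk's code):

```python
import numpy as np

# Ridge (Tikhonov) regression in closed form:
#   w* = argmin_w (1/m) sum_i (y_i - G_i . w)^2 + lam * ||w||^2
#      = (G'G + m * lam * I)^{-1} G'y   for design matrix G
def ridge_fit(G, y, lam):
    m, d = G.shape
    return np.linalg.solve(G.T @ G + m * lam * np.eye(d), G.T @ y)

# Noisy samples fitted with a degree-8 polynomial basis:
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)
G = np.vander(x, 9)                       # columns x^8, ..., x, 1

w_weak = ridge_fit(G, y, 1e-8)            # almost no regularization
w_strong = ridge_fit(G, y, 1.0)           # strong preference for small weights
```

As lambda grows, the solution norm shrinks: exactly the "preference for small weights" the slide describes, and the source of the bias discussed on the next slides.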
  40. 40. Tikhonov regularization
  41. 41. Bias-variance tradeoff
• Regularization bias
 Conventional regularization establishes arbitrary penalties
 Why should we prefer smaller weights?
 The only purpose is stopping optimizers from overfitting
 By reducing effective capacity
 Thereby introducing bias
• Regularization strength is key
 Too weak: overfitting, high variance
 Too strong: underfitting, high bias
 Finding the right amount of regularization is called the bias-variance tradeoff
 Theory intensively analysed in classic ML, see e.g. Bishop
 Practical implementation is costly, e.g. cross-validation
  42. 42. Overcoming the tradeoff
• Reducing variance doesn't have to increase bias
 Conventional regularization only introduces bias due to arbitrary preferences
 Variance can be reduced without bias
 E.g., trivially, by increasing the size of the training set
 Although this is generally very costly
• Example: data augmentation in computer vision
 Training image recognition requires a vast number of labeled images
 Labelling images is costly and requires human supervision
 Data augmentation: make many labeled pictures out of one: zoom, crop, recolor, rotate…
 Increases dataset size at no cost
 And effectively reduces variance without bias
• Data augmentation is a more powerful form of regularization
• In finance, differential training is also a form of data augmentation
  43. 43. Differential Machine Learning
  44. 44. Deep learning risk
• Risks are sensitivities of prices wrt market states: dV_t / dS_t
• Approximated by sensitivities of the output (approximate price) to the inputs (states): dV_t / dS_t ≈ d f_hat(S_t; w) / dS_t
• Efficiently computed by backpropagation
 Feed the state into the input layer
 Compute the approximate price by feedforward induction: feedforward equations, left to right
 Compute the differentials of the price wrt the state by backpropagation: adjoint equations, right to left
  45. 45. Unrolling backprop: twin networks • Combine feedforward induction and backpropagation in a single (twin) network of twice the depth: • The twin net computes prices and risks for twice the computation expense
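The twin-network idea can be sketched in a few lines of numpy for a single hidden layer with a softplus activation (an illustrative reduction, not the talk's architecture): the feedforward pass computes the approximate price, and the unrolled backpropagation pass reuses the same weights, right to left, to compute the gradient wrt the inputs for roughly twice the cost.

```python
import numpy as np

def twin_net(x, W1, b1, W2, b2):
    # feedforward pass, left to right: the approximate price
    z = x @ W1 + b1
    a = np.logaddexp(0.0, z)                 # softplus activation
    y = a @ W2 + b2
    # backpropagation pass, right to left: dy/dx, the approximate risks
    dz = (1.0 / (1.0 + np.exp(-z))) * W2.T   # softplus' = sigmoid, times W2
    dydx = dz @ W1.T
    return y, dydx

# Sanity check of the backprop gradient against central finite differences
rng = np.random.default_rng(1)
n, h = 5, 8
W1, b1 = rng.standard_normal((n, h)), rng.standard_normal(h)
W2, b2 = rng.standard_normal((h, 1)), 0.1
x = rng.standard_normal((1, n))

_, dydx = twin_net(x, W1, b1, W2, b2)
eps = 1e-5
fd = np.zeros(n)
for i in range(n):
    up, dn = x.copy(), x.copy()
    up[0, i] += eps
    dn[0, i] -= eps
    fd[i] = (twin_net(up, W1, b1, W2, b2)[0][0, 0]
             - twin_net(dn, W1, b1, W2, b2)[0][0, 0]) / (2 * eps)
```

The backpropagated gradient agrees with finite differences to numerical precision, which is what licenses treating the backward half of the twin net as exact differentiation of the forward half.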
  46. 46. Differential training
• Now we have a twin net g(X; w) = ( f_hat(X; w), d f_hat(X; w) / dX ), we approximate prices and risks in one computation
• But we have seen that the approximation of risks is poor and converges slowly
• The network struggles to learn the shape of the function from punctual examples
• Hence the idea: train the network on differentials
• Recall, from our AAD powered simulation engine:
 We have high quality pathwise differentials Z^(i) = dY^(i) / dX^(i) for little cost
 These are unbiased, independent estimates of the true risks: E[ Z^(i) | X^(i) ] = df(X^(i)) / dX^(i)
 So we could train our net to minimize errors on differentials: w* = argmin_w (1/m) sum_i || d f_hat(X^(i); w) / dX - Z^(i) ||^2
 With an asymptotic guarantee to represent the true risks
  47. 47. Differential regularization
• In practice, we train on a combination of value and differential MSEs: w* = argmin_w (1/m) sum_i ( Y^(i) - f_hat(X^(i); w) )^2 + lambda (1/m) sum_i || Z^(i) - d f_hat(X^(i); w) / dX ||^2
 The first term is the usual MSE; the differential labels Z^(i) are computed e.g. with AAD and fed to the training set; the differential predictions d f_hat(X^(i); w) / dX are computed with the twin net; lambda is a hyperparameter expressing preference for correct differentials
• Note the similarity with regularization, e.g. Tikhonov: w* = argmin_w (1/m) sum_i ( Y^(i) - f_hat(X^(i); w) )^2 + lambda ||w||_2^2
• Differential training is a form of regularization
 Imposes a penalty for wrong differentials
 Without changing capacity (~number of learnable weights)
 Effectively stopping optimizers from overfitting training values
 Hence, reducing variance
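For basis-function regression, the combined objective on this slide has a closed form, which makes a compact sketch of differential regularization (our own toy data; the talk's full treatment is in appendix 3):

```python
import numpy as np

# Minimize (1/m) sum (y_i - G_i w)^2 + lam (1/m) sum (z_i - dG_i w)^2,
# where dG holds the basis-function derivatives and z the differential labels:
#   w* = (G'G + lam dG'dG)^{-1} (G'y + lam dG'z)
def differential_fit(G, dG, y, z, lam):
    A = G.T @ G + lam * dG.T @ dG
    b = G.T @ y + lam * dG.T @ z
    return np.linalg.solve(A, b)

# Toy data: noisy samples of y = x^2 with noisy differential labels z ~ 2x
rng = np.random.default_rng(3)
m = 500
x = rng.uniform(-1.0, 1.0, m)
y = x**2 + 0.1 * rng.standard_normal(m)
z = 2 * x + 0.1 * rng.standard_normal(m)         # pathwise-style differentials

G = np.column_stack([np.ones(m), x, x**2])       # basis 1, x, x^2
dG = np.column_stack([np.zeros(m), np.ones(m), 2 * x])
w = differential_fit(G, dG, y, z, lam=1.0)       # expect w close to [0, 0, 1]
```

Because the differential labels are unbiased, the penalty pulls the fit toward the correct shape rather than toward arbitrarily small weights, which is exactly the contrast with Tikhonov drawn on the next slide.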
  48. 48. Differential regularization is different
• We do not express preferences, thereby introducing a bias with arbitrary constraints
• We enforce correctness of differentials, thereby reducing variance without bias
• Proof:
 With infinite preference for differentials (lambda to infinity), we train on differentials alone
 With an asymptotic convergence guarantee towards a function with all the correct differentials (by the same argument as before: pathwise differentials are unbiased estimates of true risks)
 That is, the correct function modulo an additive constant!
• There is no bias-variance tradeoff: variance is reduced for free
• Hence, (virtually) no sensitivity to lambda, contrarily to other forms of regularization
• See:
 Appendix 3: github.com/differential-machine-learning/appendices/blob/master/App3-Regression.pdf
 And the second demonstration notebook: github.com/differential-machine-learning/notebooks/blob/master/DifferentialRegression.ipynb
 For a comparison between classical and differential regularization in the context of regression
  49. 49. Toy Black-Scholes polynomial fit
  50. 50. Basket option example • Back to basket option  Dimension 30  Results produced with demonstration notebook github.com/differential-machine- learning/notebooks/blob/master/D ifferentialML.ipynb prices deltas
  51. 51. Differential training: further benefits • Empirically, differential training also stabilises sensitivity to other hyperparameters: network architecture, weight initialization, optimization settings • By construction, we solidly approximate not only prices but also risk sensitivities • The neural net learns the shape of the pricing function from differential labels • Thereby learning more effectively, and in a more stable manner • Differential regularization is superior because it injects additional information • Acting like additional datapoints, produced for very little cost with AAD • In this regard, it is similar to data augmentation in computer vision
  52. 52. Differential augmentation
• The examples highlight the most critical benefit of differential training: we can now train with a lot less samples
• This is because differential ML is a (particularly effective) form of data augmentation
• Computer vision:
 Make many pictures out of one
 Increases dataset for negligible cost
 Teaches important invariances: this is all me
• Finance:
 Differentials inject additional information
 Act like additional data points: one original point with a differential is worth two nearby points (think finite differences)
 Also increase dataset for negligible cost (e.g. with AAD)
 And teach the machine the shape of the function
  53. 53. Real life example: a worst-of 4 autocallable
• 4 correlated local volatility models
• 8,192 paths
• Script written in Superfly Analytics' "Jive" language, with explicit smoothing of the discontinuous choose(x, p, n, e) payoff
  54. 54. A worst-of 4 autocallable, no derivatives • Training from start to completion (slides 54-58 show successive stages) • Horizontal axis: targets with nested simulations • Vertical axis: network prediction • Test set: out of sample states
  59. 59. A worst-of 4 autocallable, with derivatives • Training from start to completion (slides 59-63 show successive stages) • Horizontal axis: targets with nested simulations • Vertical axis: network prediction • Test set: out of sample states
  64. 64. A worst-of 4 autocallable No derivatives With derivatives
  65. 65. A real netting set, no derivatives • Hybrid model with 20 state variables • 8,192 paths • Training from start to completion (slides 65-69 show successive stages)
  70. 70. A real netting set, no derivatives
• Performance stopped improving midway through training, when overfitting kicked in
• We finally managed to get (very) decent results, but only with at least 65,536 paths and some hyperparameter tweaking
• With differential regularization, it just works with 8,192 paths and no tweaking at all
  71. 71. A real netting set with derivatives • Training: Starting
  72. 72. A real netting set with derivatives • Training: Running …
  73. 73. A real netting set with derivatives • Training: Running ……
  74. 74. A real netting set with derivatives • Training: Running ………
  75. 75. A real netting set with derivatives • Training: Complete
  76. 76. A real netting set, with derivatives • Orders of magnitude improvement • Almost perfect approximation • A small error remains in the right asymptotic (disappears with 16,384 samples; more at the end of the talk)
  77. 77. A real netting set: no derivatives vs. with derivatives
  78. 78. Differential ML: implementation in TensorFlow • Demonstration code: github.com/differential-machine-learning/notebooks/blob/master/DifferentialML.ipynb • Including practical details like initialization, normalization and optimization • Further implementation details in this post: https://towardsdatascience.com/differential-machine-learning-f207c158064d?source=friends_link&sk=f2325c6686c11e22286f1d1d6e4daf87
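The notebook above implements the twin network in TensorFlow; as a minimal NumPy stand-in (toy one-hidden-layer network and weights of our own choosing, not the notebook's code), the idea of the twin is that the same weights produce both the value prediction and its gradient with respect to the inputs, via one backpropagation pass:

```python
import numpy as np

def softplus(z):            # smooth activation so the derivative network is well defined
    return np.log1p(np.exp(z))

def sigmoid(z):             # derivative of softplus
    return 1.0 / (1.0 + np.exp(-z))

def twin_forward(x, W1, b1, W2, b2):
    """Return the value prediction y_hat and its gradient dy_hat/dx together."""
    z = x @ W1 + b1                     # hidden pre-activations, shape (n, m)
    a = softplus(z)                     # hidden layer
    y = a @ W2 + b2                     # value prediction, shape (n, 1)
    # twin network: backpropagate through the value network with the same weights
    # chain rule: dy/dx_i = sum_j sigmoid(z_j) * W2_j * W1_ij
    dy_dx = (sigmoid(z) * W2.T) @ W1.T  # shape (n, n_in)
    return y, dy_dx

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 8
W1 = rng.standard_normal((n_in, n_hidden))
b1 = rng.standard_normal(n_hidden)
W2 = rng.standard_normal((n_hidden, 1))
b2 = rng.standard_normal(1)

x = rng.standard_normal((5, n_in))
y, dy = twin_forward(x, W1, b1, W2, b2)

# sanity check the twin's gradient against a finite-difference bump
eps = 1e-6
x_bump = x.copy()
x_bump[:, 0] += eps
y_bump, _ = twin_forward(x_bump, W1, b1, W2, b2)
fd = (y_bump - y)[:, 0] / eps
```

Training then penalizes both the value error and the derivative error, which is exactly the differential regularization discussed in the paper; in TensorFlow, `dy_dx` would come from automatic differentiation rather than the hand-written chain rule above.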
  79. 79. Conclusion: pricing and risk with AI
• AI learns from your simulation system
– To reproduce what your system does
– Orders of magnitude faster
• Efficiently learns from samples
– Makes the problem tractable in practice
– Guaranteed solid mathematics
• Classic LSM regression is not good enough
– Suffers the curse of dimensionality
– Requires hardcoded features
– Unrealistic for general trading books
• Neural nets
– Overcome the shortcomings of classic regression
– Learn useful features from data, during training
– Just work on textbook examples
• Neural nets are limited in real-world situations
– Require unrealistically large datasets
– Prone to overfitting on small datasets
– Poorly predict risk sensitivities / shape
• The missing piece is differential ML
– Train twin networks
– On datasets augmented with differentials
– Very efficiently computed with AAD
– Sharply improves risk estimates
– Corrects the shape of the pricing function
– Regularizes and prevents overfitting without bias
– Improves speed and stability of training
– Makes training resilient to hyperparameters
– Therefore, trains effectively on small datasets
  80. 80. System overview: production risk system (parallel CPU) → simulated training set {(Xᵢ, Yᵢ, ∂Yᵢ/∂Xᵢ)} (states, sample payments, path-wise differentials) → TensorFlow trainer (GPU) → learned weights → pricing/risk function f̂, ∂f̂ → XVA, CCR, FRTB, MVA…
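The "simulated training set" box can be illustrated on a toy example: for each Monte Carlo path, record the state X, the sample payment Y and the path-wise differential ∂Y/∂X. In production these differentials come from AAD through the risk system; in the Black-Scholes call sketch below (parameters and zero rate are our own illustrative choices, not from the slides) they are available in closed form:

```python
import numpy as np
from math import erf, log, sqrt

# Illustrative Black-Scholes call, zero rate: parameters are assumptions
S0, K, vol, T = 100.0, 110.0, 0.2, 1.0
rng = np.random.default_rng(42)
n = 200_000

Z = rng.standard_normal(n)
ST = S0 * np.exp(-0.5 * vol**2 * T + vol * sqrt(T) * Z)  # terminal spot per path
X = np.full(n, S0)                                       # states
Y = np.maximum(ST - K, 0.0)                              # sample payments
dYdX = (ST > K) * ST / S0                                # path-wise deltas dY/dX

# the path-wise deltas average to the true Black-Scholes delta
d1 = (log(S0 / K) + 0.5 * vol**2 * T) / (vol * sqrt(T))
bs_delta = 0.5 * (1.0 + erf(d1 / sqrt(2.0)))
```

The point of the augmented dataset is precisely that the derivative labels `dYdX` are unbiased estimates of true risk sensitivities, so the trainer can fit f̂ and ∂f̂ jointly.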
  81. 81. Conclusion: unsupervised operation
• Automated training behind the scenes requires a convergence guarantee
– Practical training of neural nets offers none: no algorithm is guaranteed to find the minimum MSE
– Modern heuristics (Xavier-Glorot initialization, dataset normalization, ADAM optimization, one-cycle scheduling) help converge to acceptable minima in most situations, but without a strong guarantee
– Risk management is not built on faith or empirical evidence
– In appendix 4, https://github.com/differential-machine-learning/appendices/blob/master/App4-UnsupervisedTraining.pdf, we explore Google's 'wide and deep' architecture and demonstrate hard worst-case guarantees in this case, allowing for unsupervised training
• Some applications depend on asymptotics
– Value at risk, expected loss, FRTB…
– ML models struggle to learn correct asymptotics for lack of substantial asymptotic data
– This is corrected with dedicated algorithms, also in appendix 4
  82. 82. Conclusion: extension to other ML models
• Differential ML is not limited to deep learning
• We obtained equally remarkable results with other kinds of machine learning models
• In the context of classic regression
– Differentials also provide effective regularization
– Without bias, unlike e.g. ridge/Tikhonov
– The solution remains analytic
– See appendix 3: github.com/differential-machine-learning/appendices/blob/master/App3-Regression.pdf
– And the demonstration notebook: github.com/differential-machine-learning/notebooks/blob/master/DifferentialRegression.ipynb
• In the context of PCA
– Differential PCA identifies the principal risk factors of a transaction or trading book
– Also provides an extremely effective data preparation and dimension reduction step
– See appendix 2: https://github.com/differential-machine-learning/appendices/blob/master/App2-Preprocessing.pdf
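The differential regression idea of appendix 3 keeps an analytic solution: penalizing the derivative error only adds a Gram matrix of basis derivatives to the normal equations. A minimal NumPy sketch (the toy target sin(2x), monomial basis and λ = 1 are our own choices, not the appendix's):

```python
import numpy as np

# Differential regression: min_w ||Phi w - y||^2 + lam ||dPhi w - z||^2
# with analytic solution w = (Phi'Phi + lam dPhi'dPhi)^-1 (Phi'y + lam dPhi'z)
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 50)
y = np.sin(2 * x)          # value labels
z = 2 * np.cos(2 * x)      # differential labels dy/dx

degree, lam = 5, 1.0
Phi = np.vander(x, degree + 1, increasing=True)  # basis: 1, x, x^2, ...
dPhi = np.zeros_like(Phi)                        # d/dx of each monomial
for k in range(1, degree + 1):
    dPhi[:, k] = k * x ** (k - 1)

# penalized normal equations, still a single linear solve
A = Phi.T @ Phi + lam * dPhi.T @ dPhi
b = Phi.T @ y + lam * dPhi.T @ z
w = np.linalg.solve(A, b)

x_test = np.linspace(-1, 1, 11)
pred = np.vander(x_test, degree + 1, increasing=True) @ w
err = np.max(np.abs(pred - np.sin(2 * x_test)))
```

Unlike ridge/Tikhonov, the extra term pulls the fit toward the correct derivatives rather than toward zero weights, which is why it regularizes without bias.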
  83. 83. Thank you for your attention In-House system of the Year 2015 Winner: Superfly Analytics at Danske Bank Excellence in Risk Management and Modelling, 2019 Winner: Superfly Analytics at Danske Bank
