
Big Sky Earth 2018 Introduction to machine learning

First course of the 2018 Big Sky Earth School, held in Kaarina, Finland
A general introduction to Machine Learning



  1. 1. Pintxos / voileipäpöytä / закуски (appetizers) of Machine Learning BigSkyEarth 2018 Kaarina
  2. 2. Menu • What is Machine Learning? • Where does it come from? • What now? • Why now? • Machine Learning in the Sky • The Machine Learning Landscape • Machine Learning Pipeline • Neural Networks • Some useful concepts • Machine Learning Tools • Zoom on selected libraries • Zoom on a few algorithms • Random Forest • Gradient Boosting • Kohonen’s map • Autoencoder • Convolutional Neural Network • Generative Adversarial Network
  3. 3. What is Machine Learning? Machine… … Learning?!?
  4. 4. What is Machine Learning? Not quite so exciting… Learning? No! Nor thinking… More like algorithms able to fit complex data relationships. More like advanced statistical inference. More like implicit programming. More like dynamically extracting information from data in order to generalize… Sobering thought: linear regression belongs to Machine Learning! That said, some mimicking is taking place: ➢ Trying to improve a system’s response to novel perception thanks to experience ➢ Artificial neural networks inspired by human biology
  5. 5. What is Machine Learning? Nested fields: Artificial Intelligence ⊃ Machine Learning ⊃ Neural Networks ⊃ Deep Learning
  6. 6. What is Machine Learning? • “Machine Learning at its most basic is the practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world.” – Nvidia • “Machine learning is the science of getting computers to act without being explicitly programmed.” – Stanford • “Machine learning is based on algorithms that can learn from data without relying on rules-based programming.”- McKinsey & Co. • “Machine learning algorithms can figure out how to perform important tasks by generalizing from examples.” – University of Washington • “The field of Machine Learning seeks to answer the question “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?” – Carnegie Mellon University Source:
  7. 7. What is Machine Learning? To summarize: A set of computing and mathematical techniques whose aim is to achieve human-level or better-than- human performance at cognitive tasks such as: •Predicting •Classifying •Generating signals / interacting •Etc. Source:
  8. 8. Complementary fields: Data Science sits at the intersection of Data Visualization, Cloud Computing, Machine Learning, and “Business” Knowledge
  9. 9. Differences between ML and Statistical Modeling
     • Statistical modeling: parametric models that try to “explain” the world, with a focus on modeling causality. Machine learning: non-parametric models that try to “mimic” the world rather than “explain” it, often using correlations as proxies for causality.
     • Statistical modeling: deduces relations for observed quantities by parameter estimation for a pre-specified model of the world. Machine learning: induces relations between observable quantities; the main goal is predictive power.
     • Statistical modeling: small data (1-100 attributes, 100-1000 examples). Machine learning: large data (10-100K attributes, 1K-100M examples).
     • Statistical modeling: scalability is typically not the major concern. Machine learning: scalability is often critical in applications.
     • Statistical modeling: based on a probabilistic approach. Machine learning: some methods are not probabilistic (SVM, neural networks, clustering, etc.).
  10. 10. Where does it come from?
  11. 11. Where does it come from? Pioneer age
      1943 – Warren McCulloch and Walter Pitts (neuroscientist and logician), artificial neurons
      1950 – Alan Turing envisions machine learning
      1952 – Arthur Samuel, self-improving checkers program
      1957 – Frank Rosenblatt, perceptron
      1959 – David H. Hubel and Torsten Wiesel, simple vs complex cells
      1960 – Henry J. Kelley, control theory → backpropagation
      1965 – Alexey Ivakhnenko and V.G. Lapa, Group Method of Data Handling, 8-layer DNN
      1980 – Kunihiko Fukushima, Neocognitron (pattern recognition), precursor of CNNs
      1982 – John Hopfield, Hopfield Network (RNN)
      1985 – Terry Sejnowski, NETtalk (English pronunciation)
      1986 – Rumelhart, Geoffrey Hinton and Ronald J. Williams, backpropagation
      1989 – Yann LeCun, handwritten digit recognition with CNNs
      1989 – Christopher Watkins, Q-learning for Reinforcement Learning
  12. 12. What now? Modern days
      1993 – Jürgen Schmidhuber, 1000-layer RNN
      1995 – Corinna Cortes and Vladimir Vapnik, SVM
      1997 – Sepp Hochreiter and Jürgen Schmidhuber, LSTM
      1997 – IBM’s Deep Blue beats Garry Kasparov
      1998 – Yann LeCun, LeNet-5: gradient-based learning applied to document recognition
      2009 – Fei-Fei Li, ImageNet dataset
      2011 – IBM’s Watson wins Jeopardy!
      2012 – Alex Krizhevsky, AlexNet CNN wins the ImageNet challenge by a wide margin
      2014 – Facebook’s DeepFace
      2014 – Ian Goodfellow, Generative Adversarial Networks
      2015-2017 – Google open-sources TensorFlow; v1.0 released in 2017
  13. 13. What now? Modern days Source:
  14. 14. Why now? ALGORITHMS × COMPUTING RESOURCES × DATA = MACHINE LEARNING. At the end of the ’90s we had the algorithms but… … we were lacking in the other departments…
  15. 15. Why now? Trillion-fold increase of computing power and storage Source:
  16. 16. Why now? Data, always more data
  17. 17. Why now? Data, always more data. Sources: Computerworld, 2011; Forbes, 2017. TB = 10^12 bytes, PB = 10^15 bytes, EB = 10^18 bytes, ZB = 10^21 bytes
  18. 18. Pause for thought: Artificial vs Natural Intelligence (number of neurons / synapses)
      Caenorhabditis elegans: 302
      Hydra vulgaris: 5,600
      Homarus americanus: 100,000
      Blatta orientalis: 1,000,000
      Nile crocodile: 80,500,000
      Digital Reasoning NN (2015): ~86,000,000 (est.) / 1.6×10^11 synapses
      Rattus Rattatouillensis: 200,000,000
      Blue-and-yellow macaw: 1,900,000,000
      Chimpanzee: 28,000,000,000
      Homo Sapiens Sapiens: 86,000,000,000 / 1.5×10^14 synapses
      African elephant: 257,000,000,000
  19. 19. Machine Learning in the Sky • Machine Learning owes a lot to astronomy: least-square regression for orbital parameter estimation (Legendre-Laplace-Gauss)
  20. 20. Machine Learning in the Sky Data Big Bang in Astronomy too: 10^9-object photometric catalogs from USNO, 2MASS, SDSS… 10^6-10^8-object spectroscopic catalogs from SDSS, LAMOST… 10^6-10^7-source multi-wavelength catalogs from WISE, eROSITA… 10^9 objects × 10^2 epochs in surveys like LSST, DES, PTF, CRTS, SNF, VVV, Pan-STARRS, Stripe 82. Spectral-image datacubes from VLA, ALMA, IFUs…
  21. 21. Machine Learning in the Sky Supernovae of data (sources: Computer World, LSST): LSST DR11 = 37×10^9 objects, 7×10^12 sources, 5.5 million 3.2-gigapixel images, 30 terabytes of data nightly. Final volume of raw image data = 60 PB. Final image collection (DR11) = 0.5 EB. Final catalog size (DR11) = 15 PB. Final disk storage = 0.4 EB. Peak number of nodes = 1750. Peak compute power in LSST data centers = about 2 PFLOPS
  22. 22. Machine Learning in the Sky Explosion in number of papers too: • From 2010 till 2018, 446 astro-ph papers on arXiv with “Machine Learning” in the abstracts • Only 5 papers in 2010 • 80% of the total were published after September 2014 • In all fields of astrophysics
  23. 23. The Machine Learning landscape Supervised Learning: • Regression: learn a real-valued function R^n → R given (Xi, Yi) • Classification: learn a discrete class function R^n → [1,k] given (Xi, Ci). Unsupervised Learning: • Clustering: learn a discrete class function R^n → [1,k] given (Xi) only • Representation Learning: learn a representing function R^n → R^k given (Xi) only
  24. 24. The Machine Learning landscape Reinforcement Learning: • Policy Optimization: learn a policy function R^n → R^k given (si, si+1, ai, ri) • Inverse RL: learn a reward function R^n → R given (si, si+1, ai). Additional categories: transfer learning, semi-supervised learning, active learning, sequence modeling, RL methods for supervised and unsupervised learning
  25. 25. The Machine Learning landscape Supervised Learning: • Regression: Linear Regression, Trees/CART, SVM/SVR, Ensemble methods, Neural Networks • Classification: Logistic Regression, Naive Bayes, Nearest neighbors, SVM, Decision trees, Ensemble methods. Unsupervised Learning: • Clustering: K-means, Hierarchical clustering, Gaussian mixtures, Hidden Markov models, NN (SOM/ART) • Representation Learning: PCA/ICA, Factor models, Dimensionality reduction, Manifold learning, NN (GAN/VAE/AR)
  26. 26. Overview of the Machine Learning landscape Reinforcement Learning: • Policy Optimization: model-based RL, model-free RL, batch/online RL, linear-model RL, neural networks • Inverse RL: model-based IRL, model-free IRL, batch/online IRL, MaxEnt IRL, neural networks. • Neural networks are the most universal (and scalable) approach • Two types of methods tend to dominate Kaggle competitions: • Ensemble methods (Random Forests and Gradient Boosting) • Deep Learning
  27. 27. Overview of the Machine Learning landscape
  28. 28. Overview of the Machine Learning landscape Top of the class
  29. 29. Machine Learning Pipeline
  30. 30. Machine Learning Pipeline (training data preparation): Raw Dataset → Load Data → Explore Data → Clean Data → Normalize → Select Features → Prepared Data → Apply Algorithm (ML Algorithms) → Evaluate & Tune → Deploy Model → Publish!
  31. 31. Machine Learning: neural networks Single neuron: a computation structure inspired by nature. Inputs x1 … xn are combined with weights w1 … wn into the activation a = Σi wi xi, and the output is z = g(a), where g is the activation function. If g = identity or sigmoid → linear/logistic regression
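As a minimal sketch (not from the slides), the single-neuron computation a = Σi wi xi, z = g(a) takes only a few lines of NumPy; the sigmoid used for g here is one common choice:

```python
import numpy as np

def neuron(x, w, b=0.0, g=lambda a: 1.0 / (1.0 + np.exp(-a))):
    """Single artificial neuron: weighted sum of inputs, then activation g."""
    a = np.dot(w, x) + b      # pre-activation a = sum_i w_i * x_i + bias
    return g(a)               # output z = g(a); default g is the sigmoid

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
print(neuron(x, w))                          # sigmoid(-0.5) ≈ 0.3775
print(neuron(x, w, g=lambda a: a))           # identity g: plain linear model, -0.5
```

With g = identity this is exactly a linear regression unit; with the sigmoid it is a logistic regression unit, as the slide notes.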
  32. 32. Machine Learning: neural networks Neural networks are connected layers of artificial neurons
  33. 33. Machine Learning: neural networks All sorts of architectures!
  34. 34. Machine Learning: neural networks Pick activation functions adapted to the desired output. For multi-class output, choose the softmax function: softmax(a)_i = exp(a_i) / Σ_j exp(a_j)
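A minimal NumPy sketch of the softmax function described above (the shift by the maximum is a standard numerical-stability trick, not something the slide mandates):

```python
import numpy as np

def softmax(a):
    """Softmax: exponentiate scores and normalize so they sum to 1."""
    e = np.exp(a - np.max(a))   # subtracting max(a) avoids overflow, result unchanged
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p, p.sum())   # a probability distribution over the classes, summing to 1
```

The output can be read directly as class probabilities, which is why softmax is the usual final layer for multi-class classification.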
  35. 35. Machine Learning: Deep Learning POWERFUL CPU/GPU × BIG DATA ⇒ LEVERAGE ALGORITHMS. Size matters: deep networks keep improving with more data where shallower models plateau
  36. 36. Some Useful Concepts • Parameters and Hyperparameters • Underfitting / Overfitting / Bias-variance trade-off • Training/Dev/Test sets • Loss or cost function • Forward propagation / Back-propagation • Batch vs mini-batch vs stochastic descent • Dimensionality reduction • Data augmentation • Performance Metrics
  37. 37. Some Useful Concepts Parameters and Hyperparameters • Parameters are learned from the data • Hyperparameters are set a priori, then tuned. Examples:
      • Linear regression: parameters are the coefficients and intercept; hyperparameter is the number of features
      • k-means: parameters are the cluster assignments; hyperparameter is the number of clusters k
      • Neural network: parameters are the weights and biases; hyperparameters include the number of layers, neurons per layer, activation functions, learning rate, epochs / batch size, etc.
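To make the distinction concrete, here is a small scikit-learn sketch (data and model choice are illustrative, not from the slides): fit_intercept is a hyperparameter we set a priori, while coef_ and intercept_ are parameters learned from the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0            # data generated as y = 2x + 1

model = LinearRegression(fit_intercept=True)   # hyperparameter: chosen by us
model.fit(X, y)
print(model.coef_, model.intercept_)           # parameters: learned, ≈[2.0] and ≈1.0
```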
  38. 38. Some Useful Concepts Underfitting and overfitting: a mismatch between the number of parameters and the data
  39. 39. Some Useful Concepts Bias-variance trade-off • Related to underfitting and overfitting • The sweet spot: know the data well, but not too well, so the model generalizes
  40. 40. Some Useful Concepts Bias-variance trade-off Low bias: the model learned the data well. Low variance: the model can generalize well. Remedies for high bias: • train longer • increase model complexity (more features, more parameters, richer architecture). Remedies for high variance: • get more data • decrease model complexity (fewer features, fewer parameters, simpler architecture) • regularization • early stopping • drop-out
  41. 41. Some Useful Concepts Training/dev/test sets • Training set to fit model with a priori hyper-parameters • Dev or (cross-)validation set to tune hyper-parameters • Test set to assess final performance of model on unseen data • Typical splits 60/20/20 or 80/10/10 or 98/1/1 in deep learning
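A possible sketch of the 60/20/20 split with scikit-learn (the ratios follow the slide; the two-step use of train_test_split is one common idiom):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First carve off 40% of the data, then split that half-and-half into dev and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_dev), len(X_test))   # 60 20 20
```

The model is fit on the training set, hyperparameters are tuned on the dev set, and the test set is touched only once, for the final performance estimate.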
  42. 42. Some Useful Concepts Loss function • Depends on the problem tackled • Measures the fit between the current output and the target output • Must decrease, on average, as training goes on
  43. 43. Some Useful Concepts Forward propagation and backpropagation Forward propagation: get estimates during training and predictions after Backpropagation: apply chain rule to gradient of loss function to adjust weights/biases
  44. 44. Some Useful Concepts Batch vs mini-batch vs stochastic descent Batch: feed the whole training set at each training epoch Mini-batch: feed subsets (random or not) at each training epoch Stochastic descent: mini-batch of size 1 It’s a tradeoff!
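The three descent flavours differ only in how much data is fed per update; a minimal NumPy sketch of a mini-batch iterator (batch_size = len(X) gives batch descent, batch_size = 1 gives stochastic descent):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield shuffled mini-batches covering the whole training set once (one epoch)."""
    idx = rng.permutation(len(X))                 # random order each epoch
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

X, y = np.arange(10).reshape(-1, 1), np.arange(10)
batches = list(minibatches(X, y, 4, np.random.default_rng(0)))
print([len(bx) for bx, by in batches])   # [4, 4, 2]: last batch is smaller
```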
  45. 45. Some Useful Concepts Dimensionality reduction • Too many features are expensive to store, slow down computation, and are subject to the curse of dimensionality • The sample space gets harder and harder to fill as dimensions grow • This is one reason why too many features lead to overfitting: the data become sparse • More and more data are needed to fill the same percentage of space. Select the features! And use PCA/ICA/SVD/LDA/QDA/Autoencoders…
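As an illustrative sketch (synthetic data, assumed dimensions), PCA from scikit-learn can compress 5 features down to the 2 directions that carry almost all the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples in 5 dimensions, but the real signal lives on a 2-D subspace
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                          # (200, 2)
print(pca.explained_variance_ratio_.sum())      # close to 1: little information lost
```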
  46. 46. Some Useful Concepts Data Augmentation • When more data are needed, make up new ones! • Translate, rotate, flip, crop, lighten/darken, add noise, dephase, etc.
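A few of the listed transformations, sketched in NumPy on a stand-in 8×8 "image" (illustrative operations, not a full augmentation pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))   # stand-in for a small grayscale image in [0, 1]

augmented = [
    np.fliplr(image),                              # horizontal flip
    np.flipud(image),                              # vertical flip
    np.rot90(image),                               # 90-degree rotation
    image + 0.05 * rng.normal(size=image.shape),   # additive noise
    np.clip(image * 1.2, 0.0, 1.0),                # brighten, clipped to valid range
]
print(len(augmented), augmented[0].shape)   # 5 new samples, same shape as the original
```

Each variant is a plausible new training sample carrying the same label as the original, which is the whole point of augmentation.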
  47. 47. Some Useful Concepts Performance metrics • Compare the error to that of the simplest method as a benchmark, e.g. linear or logistic regression. Classification problems: • Accuracy • Precision-recall / F1-score • ROC-AUC • Confusion matrix • Log-Loss. Regression problems (not discussed in this document): • MSE / RMSE / MSPE / MAE • R² / adjusted R²
  48. 48. Some Useful Concepts Classifier performance metric Accuracy [confusion matrix: true positives, false positives, false negatives, true negatives]
  49. 49. Some Useful Concepts Classifier performance metric Accuracy = (TP + TN) / All cases
  50. 50. Some Useful Concepts Classifier performance metric Accuracy = (TP + TN) / All cases • Counts whenever the classifier is right • A simple and intuitive metric BUT • Assigns the same cost to false positives and false negatives • Use with caution because of the accuracy paradox: a dumb classifier that always predicts the majority class can have better accuracy! • Absolutely avoid with highly imbalanced classes
  51. 51. Some Useful Concepts Classifier performance metric Precision vs Recall [confusion matrix: true positives, false positives, false negatives, true negatives]
  52. 52. Some Useful Concepts Classifier performance metric Precision = TP / (TP + FP) • High precision means high selectivity: a selected sample has a high probability of belonging to the correct class, but some actual positives have been brushed off • Low precision means lots of false positives
  53. 53. Some Useful Concepts Classifier performance metric Recall = TP / (TP + FN) • High recall means most positives have been identified as such, at the cost of (some) false positives • Low recall means lots of false negatives
  54. 54. Some Useful Concepts Classifier performance metric F1-score • The F1-score synthesizes both precision and recall: F1 = 2 × Precision × Recall / (Precision + Recall) • Take the desirable trade-off into account: • E.g. cancer diagnostics: better to have a higher recall, to minimize false negatives • E.g. spam detection: better to let some spam through (false negatives) than to flag legitimate emails (false positives) • E.g. zombie apocalypse scenario: better to have high precision, to avoid letting infected people into the safe zone…
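All three metrics above follow directly from the confusion-matrix counts; a small sketch with made-up counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)                          # TP / (TP + FP)
    recall = tp / (tp + fn)                             # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Illustrative counts: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(p, r, f1)   # 0.8, 0.666..., 0.727...
```

Note that F1 is the harmonic mean, so it is dragged down by whichever of precision or recall is worse.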
  55. 55. Some Useful Concepts Classifier performance metric AUC-ROC Area Under Curve – Receiver Operating Characteristics FPR : False Positive Rate A good classifier has high sensitivity and high specificity Source:
  56. 56. Some Useful Concepts Classifier performance metric AUC-ROC How good is the model at distinguishing between classes at different thresholds? How much do you pay for your true positives? A strict threshold gives few false positives but few true positives; a loose threshold gives lots of true positives but lots of false positives. Ideal case: AUC = 1
  57. 57. Some Useful Concepts Classifier performance metric AUC-ROC Source:
  58. 58. Some Useful Concepts Classifier performance metric Confusion matrix (true class vs predicted class) • Interesting for analyzing classifier performance on multiclass sets
  59. 59. Some Useful Concepts Classifier performance metric Log-Loss • Adapted to binary and multi-class data sets (if not too imbalanced) • Punishes extreme probability values when these are wrong. Binary case: LogLoss = -(1/N) Σ_{i=1}^{N} [y_i log(p_i) + (1 - y_i) log(1 - p_i)]. If more than 2 classes (M classes): LogLoss = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{M} y_{i,c} log(p_{i,c})
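A minimal NumPy sketch of the binary log-loss above, showing how confidently wrong probabilities are punished (the clipping epsilon is a standard guard against log(0), not part of the slide's formula):

```python
import numpy as np

def log_loss(y, p, eps=1e-15):
    """Binary cross-entropy: -(1/N) sum[y log p + (1-y) log(1-p)]."""
    p = np.clip(p, eps, 1 - eps)   # avoid log(0) for extreme probabilities
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
confident_right = np.array([0.9, 0.1, 0.9, 0.9])
confident_wrong = np.array([0.1, 0.9, 0.1, 0.1])
print(log_loss(y, confident_right))   # ≈ 0.105: low loss
print(log_loss(y, confident_wrong))   # ≈ 2.303: confidently wrong is punished hard
```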
  60. 60. Some Useful Concepts • So, which metric to choose??? Well, it depends…
      • Binary, balanced classes: Log-Loss if probability differences are critical; AUC-ROC if only the class prediction matters and the threshold is tuned; the F1 score is sensitive to the threshold, so tune before comparing
      • Binary, imbalanced classes: ROC-AUC score when either class may be the small one; F1 when the small class is the positive one
      • Multi-class: confusion matrix, Log-Loss, averaging of precision/recall over classes (macro-averaging)
      • There are other metrics: Cohen’s kappa, Jaccard index, G-score…
  61. 61. Machine Learning Tools Main Python libraries: • Pandas: data analysis • Spark: distributed computing • Scikit-learn: machine learning toolbox • Keras: deep learning • TensorFlow: deep learning • OpenCV: computer vision
  62. 62. Machine Learning Tools Artificial Intelligence ⊃ Machine Learning ⊃ Neural Networks ⊃ Deep Learning
  63. 63. Machine Learning Tools TensorFlow – Keras Domination
  64. 64. Machine Learning Tools TensorFlow – Keras Domination
  65. 65. Zoom on Scikit-Learn logic
      1 - Import the model: from sklearn import svm / from sklearn.neighbors import KNeighborsClassifier
      2 - Instantiate the model class: clf = svm.SVC(gamma=0.001, C=100.) / knn = KNeighborsClassifier()
      3 - Train with the fit() method: clf.fit(iris_X_train, iris_y_train)
      4 - Make predictions with predict(): clf.predict(iris_X_test[-1:]) / knn.predict(iris_X_test)
  66. 66. Zoom on Keras logic
      1 - Import the model class: from keras.models import Sequential / from keras.layers import Dense, Activation
      2 - Instantiate the model class: model = Sequential()
      3 - Add layers with the add() method, specifying input_dim or input_shape: model.add(Dense(32, input_dim=784))
      4 - Add activation functions: model.add(Activation('relu'))
      5 - Configure training with compile(loss=, optimizer=, metrics=[]): model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
      6 - Train with the fit() method: model.fit(data, labels, epochs=10, batch_size=32)
      7 - Evaluate the model performance with the evaluate() method: score = model.evaluate(x_test, y_test, verbose=0)
      8 - Make predictions with predict(): predictions = model.predict(x_test)
  67. 67. Zoom on TensorFlow logic
      1 - Define a computation graph: tf.Variable for values to be optimized (weights and biases), tf.constant as needed, tf.placeholder for inputs; all common operations have a tf counterpart
      2 - Start a TensorFlow session
      3 - Actually execute the graph, implementing nested loops on epochs and batches
  68. 68. Zoom on TensorFlow: Logic Basic example:
      # tf Graph input
      a = tf.placeholder(tf.int16)
      b = tf.placeholder(tf.int16)
      # Define some operations
      add = tf.add(a, b)
      mul = tf.multiply(a, b)
      # Launch the default graph.
      with tf.Session() as sess:
          # Run every operation with variable input
          print("Addition with variables: %i" % sess.run(add, feed_dict={a: 2, b: 3}))
          print("Multiplication with variables: %i" % sess.run(mul, feed_dict={a: 2, b: 3}))
  69. 69. Random Forest concept Random forest uses decision trees as base learners Regression Classification Decision trees are built so that the splits are prioritised by the amount of information provided
  70. 70. Random Forest concept Random forests are built by applying many decision trees to random data subsets and random feature subsets: ensemble learning (here, bagging)
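A hedged scikit-learn sketch of the idea (synthetic data; hyperparameter values are illustrative): each of the 100 trees is grown on a bootstrap sample, with a random feature subset considered at each split, exactly the bagging scheme described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 trees; predictions are combined by majority vote across the ensemble
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))   # held-out accuracy
```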
  71. 71. Gradient Boosting concept • Boosting creates a series of weak learners, where each new one focuses on the data that are hard to classify. At the end of the process all learners are weighted and combined • Boosting can lead to overfitting, stop early enough! • Many variants: Gradient Boosting, XGBoost, AdaBoost, GentleBoost • XGBoost is state-of-the-art
  72. 72. Gradient Boosting concept Gradient Boosting vs Random Forest:
      • Base learners: trees or linear regression (Gradient Boosting) vs trees (Random Forest)
      • Bias-variance of learners: stumps with high bias and low variance (Gradient Boosting) vs full trees with low bias and high variance (Random Forest)
      • Hyperparameter tuning: lots! (see next page) vs mainly the number of trees
      • Performance: #1 vs a close 2nd
  73. 73. Gradient Boosting concept Some important hyperparameters for gradient boosting (XGBoost) to limit tree growth: • max_features • min_samples_split • min_samples_leaf • max_depth
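The slide names XGBoost, but the same growth-limiting knobs exist in scikit-learn's GradientBoostingRegressor; a sketch under that assumption (synthetic data, illustrative values):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, n_informative=5,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,            # shallow trees = weak learners
    min_samples_split=4,    # growth-limiting knobs from the slide
    min_samples_leaf=2,
    max_features=0.8,       # fraction of features tried at each split
    random_state=0,
)
gbr.fit(X_tr, y_tr)
print(gbr.score(X_te, y_te))   # held-out R^2
```

Keeping the trees shallow and limiting splits is how boosting keeps its learners weak, which in turn limits overfitting.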
  74. 74. Self-organizing map concept • Inspired by the specialization of neural areas in natural brains • Each neuron on the grid starts with a random vector of the same dimension as the input • The neuron closest to a given input vector, together with its neighbours, is nudged toward that input • Used for clustering, classification and visualization • Kohonen 1984
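A minimal NumPy sketch of one Kohonen update step as described above (grid size, learning rate and neighbourhood radius are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
grid = rng.random((5, 5, 2))   # 5x5 map of initially random 2-D weight vectors

def som_step(grid, x, lr=0.5, radius=1.0):
    """One SOM update: find the best-matching unit, nudge it and its neighbours toward x."""
    dists = np.linalg.norm(grid - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)   # best-matching unit
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            d2 = (i - bmu[0]) ** 2 + (j - bmu[1]) ** 2       # distance on the grid
            h = np.exp(-d2 / (2 * radius ** 2))              # neighbourhood function
            grid[i, j] += lr * h * (x - grid[i, j])          # nudge toward the input
    return bmu

x = np.array([0.9, 0.1])
before = np.linalg.norm(grid - x, axis=2).min()
som_step(grid, x)
after = np.linalg.norm(grid - x, axis=2).min()
print(before, after)   # the map has moved closer to the input
```

Repeating this over many inputs, while shrinking lr and radius, makes nearby grid neurons specialize for nearby regions of the input space.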
  75. 75. Autoencoder concept • A neural network whose output equals the input • Hour-glass shape, as data are encoded into a compressed signal with reduced dimensions and then decoded • A way to extract meaningful features
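As a hedged sketch of the idea (not the deck's notebook), an autoencoder can be imitated with scikit-learn's MLPRegressor by training the network to reproduce its own input through a narrow hidden layer:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 8))   # 8-D data that really lives on a 2-D manifold

# The 2-unit hidden layer is the "compressed signal with reduced dimensions"
ae = MLPRegressor(hidden_layer_sizes=(2,), activation='identity',
                  solver='lbfgs', max_iter=2000, random_state=0)
ae.fit(X, X)                           # the target equals the input
mse = np.mean((X - ae.predict(X)) ** 2)
print(mse)                             # small reconstruction error despite 8 -> 2 -> 8
```

With identity activations this linear autoencoder recovers essentially the same subspace as PCA; nonlinear activations and deeper stacks give the more powerful hour-glass architectures of the slides.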
  76. 76. Autoencoder: MNIST example Encoder: 32×32 → 16×16 → 8×8; Decoder: 8×8 → 16×16 → 32×32
  77. 77. CNN concept • Convolutional Neural Networks are a category of neural networks that have proven very effective in areas such as image recognition and classification. CNNs have been successful in identifying faces, objects and traffic signs, apart from powering vision in robots and self-driving cars.
  78. 78. CNN concept 2D convolution: • Apply a filter to the image, moving it at a certain stride, to build a feature map • Use several filters (depth)
  79. 79. CNN concept Sources: What are “features”?
  80. 80. CNN concept Source: 2D Convolution: • Apply a Rectified Linear Unit (ReLU)
  81. 81. CNN concept 2D convolution: • Apply pooling to the rectified feature maps
  82. 82. CNN concept 2D convolution: • Apply convolution + ReLU + pooling several times • Pass the output to a traditional Multi-Layer Perceptron • A softmax output layer provides probabilities per class
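The convolution + ReLU + pooling chain of the last few slides can be sketched in plain NumPy (the image and filter here are illustrative; in a real CNN the filters are learned):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """'Valid' 2-D convolution (cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # slide the filter, sum the products
    return out

def relu(x):
    return np.maximum(x, 0)                       # rectification step

def max_pool(x, size=2):
    """Non-overlapping max pooling; edges that do not fit are trimmed."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.default_rng(0).random((6, 6))
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])   # toy vertical-edge filter
fmap = max_pool(relu(conv2d(image, edge_filter)))
print(fmap.shape)   # 6x6 image -> 5x5 feature map -> 2x2 after pooling
```

The pooled map would then be flattened and fed to the fully connected layers with a softmax output, as the slide describes.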
  83. 83. GAN concept The generator tries to forge the data distribution, while the discriminator tries to sort out real from fake; they share a common loss function with opposite goals (min-max) • A generative adversarial network learns a distribution, not a relationship • An alternative is the variational autoencoder
  84. 84. GAN concept Source: Who are these people???
  85. 85. Machine Learning: references Papers • Check out arXiv for machine learning in astro-ph… MOOCs • All ML courses on Coursera by Andrew Ng • “Deep Learning A-Z™: Hands-On Artificial Neural Networks” on Udemy • courses Books • “Statistics, Data Mining and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data” by Željko Ivezić et al. • “Data Science from Scratch with Python: Step-by-Step Beginner Guide for Statistics, Machine Learning, Deep Learning and NLP using Python, Numpy, Pandas, Scipy, Matplotlib, Scikit-Learn, TensorFlow” by Peter Morgan • “Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems” by Aurélien Géron
  86. 86. Hands-on exercises roadmap
      1) keras_log_reg_EX (data = MNIST)
         a) Complete the missing lines
      2) tf_log_reg_EX (data = iris.csv)
         a) Complete the missing lines
         b) Play with the learning rate
      3) sk_xgboost_regression_EX (data = Boston)
         a) Complete the missing lines
         b) Play with the learning rate
         c) Find a good value for n_estimators
         d) Have a look at feature importance and a sample tree
      4) sk_sdss_EX (data = sdss_data.csv)
         a) Reply to the questions in the notebook as you execute the cells
      5) tf_AutoEncoder_Fashion_EX (data = fashion-mnist_train.csv and fashion-mnist_test.csv)
         a) Reply to the questions
         b) Make the suggested trials
      6) keras_gan_bimodal2_EX (data generated in the notebook)
      7) Check out Others
  87. 87. KIITOS !!!