
Spark tutorial pycon 2016 part 1


Build a Machine Learning model with Apache Spark MLLib to predict flight delays using weather data

Published in: Data & Analytics


  1. David Taieb, STSM, IBM Cloud Data Services, Developer Advocate (david_taieb@us.ibm.com)
     HANDS-ON SESSION: DEVELOPING ANALYTIC APPLICATIONS USING APACHE SPARK™ AND PYTHON
     Part 1: Flight Delay Predict with Spark ML
     PyCon 2016, Portland
  2. ©2016 IBM Corporation
     Agenda
     •  Prerequisite steps to be completed before the session
     •  Flight Predict app description and architecture
     •  Train the models in the Notebook
     •  Accuracy analysis and model refinement
     •  Deploy and run the models
  3. Sign up for Bluemix
     •  Go to the IBM Bluemix website at https://console.ng.bluemix.net
     •  Click Get Started for Free
     •  Complete the form and click Create account
     •  Look for the confirmation email and click the "confirm your account" link
  4. Sign up for a free trial at Flightstats.com
     •  Sign up at https://developer.flightstats.com/signup
     •  Fill out the form and watch your email for the confirmation link (access to the APIs may take up to 24 hours)
     •  Once access is granted, go to https://developer.flightstats.com/admin/applications to view your appId and appKey (you will need them in the simple-data-pipe tool to create the training sets)
     •  Optional: get familiar with the various FlightStats APIs:
        –  https://developer.flightstats.com/api-docs/scheduledFlights/v1
        –  https://developer.flightstats.com/api-docs/airports/v1
  5. Where to find the FlightStats app id and app key
     (screenshot callouts: APP ID, APP Key)
  6. Create a new space on Bluemix
     In preparation for running the project, we create a new space on Bluemix.
     Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse.
  7. Create a Spark Instance
     Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse.
  8. Create New Spark Instance
     Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse.
  9. Agenda (repeated from slide 2)
  10. Flight App Project Description
      •  Use case
         –  Flight delays are a common disturbance during business trips
         –  Being able to predict how likely a flight is to be delayed can remove uncertainty and enable users to plan around it
         –  Idea: weather data can be a good explanatory variable for building predictive models
      •  Implementation
         –  Combine flight statistics from flightstats.com (system of record) with weather data from IBM Insight for Weather (system of operations) to build training, test and blind sets
         –  Use Spark MLlib to train predictive models and cross-validate them
         –  Create a custom card for Google Now that will automatically notify the user of an impending flight delay
         –  Propose alternate flight routes (e.g. Freebird)
  11. Get/Build/Analyze methodology
  12. Flight Predict App Architecture
      (diagram labels: Weather; Airports; Flight Schedules; Flight Status; Simple Data Pipes; Custom Connector, run every 24 hours; Metadata; Training Set; Test Set; Blind Set; Notebook)
  13. Flow Diagram
      (diagram labels: Data Acquisition; Data Preparation; Data Annotation (Ground Truth); Model Training; Cleansing, Shaping, Enrichment; Model Testing; Training Set, Test Set, Blind Set; Iterative Cross-Validation; Evaluate Performance and optimize model; Train Model; Deploy and Run Model)
      •  Iterative in nature: we are never done!
      •  We will be using this diagram as a roadmap throughout this course
  14. Get the data and build the training/test/blind sets
      In this step we'll use the Simple Data Pipes open source project to acquire data from Flightstats, combine it with weather data from IBM Insight for Weather, and save the data sets into a Cloudant NoSQL database.
      (roadmap diagram repeated from slide 13)
  15. Acquiring the data
      •  In the next section, we show how to acquire the training data by using the simple-data-pipe tool and the flight predict connector
      •  The flight predict connector combines historical flight data from flightstats.com with weather data from IBM Insight for Weather
      •  If you want to skip these steps, you can use the prebuilt dataset with the following credentials:
         –  cloudantHost: dtaieb.cloudant.com
         –  cloudantUserName: weenesserliffircedinvers
         –  cloudantPassword: 72a5c4f939a9e2578698029d2bb041d775d088b5
  16. Deploy simple-data-pipe with flightstats connector
      •  Go to https://github.com/ibm-cds-labs/simple-data-pipe
      •  Click the Deploy to Bluemix button; clicking it takes you to Bluemix
  17. Complete simple-data-pipe deployment
  18. Add an instance of IBM Weather Service on Bluemix
      •  Return to the application dashboard
      •  The Weather service is required by the flight predict connector and must be installed beforehand
      •  From the app dashboard, click Add a service or API
  19. Create an instance of IBM Weather Service on Bluemix
      •  Search for Weather
      •  Make sure to select the "premium plan" to have enough authorized API calls
  20. Checkpoint: simple data pipe app dashboard
      •  Verify that your app is correctly bound to the right services:
         –  Weather service: used to enrich flight records with weather observations
         –  Cloudant service: used to store the training, test and blind data sets
      •  You'll need to click the button indicated on the slide for the step on the next page
      •  It is recommended to increase the app memory to 1 GB
  21. Install flight predict connector
      •  Click the Edit Code button and edit package.json to add the flight predict module to the dependencies:
         –  "simple-data-pipe-connector-flightstats":"git://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git"
      •  Don't forget to add a comma at the end of the preceding line to keep the JSON valid
      •  Save your changes
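The package.json edit on slide 21 can also be checked programmatically. The sketch below is only an illustration: the starting manifest (its `name` field and the `express` dependency) is a hypothetical placeholder, not the real file from the simple-data-pipe repository. Only the connector name and git URL come from the slide. Parsing and re-serializing with `json` sidesteps the missing-comma mistake the slide warns about.

```python
import json

# Hypothetical starting manifest; the real package.json in the
# simple-data-pipe repository contains more fields and dependencies.
package_json_text = """
{
  "name": "simple-data-pipe",
  "dependencies": {
    "express": "*"
  }
}
"""

# Name and URL as given on the slide.
CONNECTOR_NAME = "simple-data-pipe-connector-flightstats"
CONNECTOR_URL = "git://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git"


def add_connector(manifest_text):
    """Return manifest text with the FlightStats connector added to dependencies."""
    manifest = json.loads(manifest_text)  # raises ValueError if the JSON is invalid
    manifest.setdefault("dependencies", {})[CONNECTOR_NAME] = CONNECTOR_URL
    return json.dumps(manifest, indent=2)


updated_text = add_connector(package_json_text)
print(updated_text)
```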
  22. Install flight predict connector
      •  Click File/Save to save your changes
  23. Redeploy simple data pipe app
      •  Use the Live Edit editor to redeploy the app
  24. Verify connector install
      •  In this step, we verify through the UI that the flight predict connector is correctly installed
      (screenshot callout: Flight connector correctly installed)
  25. Create a new FlightStats pipe
      •  Follow each screen to create and configure a new pipe
  26. Run the pipe
      •  Skip over the schedule tab
      •  In the activity tab, click Run Now to start the pipe
      •  Then open the log to monitor the activity
  27. Explore the data sets
      •  In this step, we take a moment to explore the different data sets that have been created by the simple data pipe tool
      •  From the Bluemix dashboard, click the Cloudant service tile, then the Launch button
      •  From the Cloudant dashboard, open the training database
      •  Open a document to look at the data structure
  28. Run the pipe again to build the test set
  29. Train the Models
      •  In the previous section we created the training data, and we are now ready to train the models
      •  Steps in this section:
         –  Create an IPython Notebook
         –  Load the data sets from the Cloudant database into a Spark cluster
         –  Explore the data and train the machine learning models
      (roadmap diagram repeated from slide 13)
  30. Create a new IPython Notebook
  31. Notebook tour
  32. Notebook tour: Notebook Info
  33. Notebook tour: Environment
  34. Notebook tour: Sharing
  35. Agenda (repeated from slide 2)
  36. Before we start building the app…
      •  You can optionally follow this tutorial from GitHub by using a fully built notebook:
         –  https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/blob/master/notebook/Flight%20Predict%20PyCon%202016.ipynb
  37. Optional: use prebuilt notebook
      •  Create notebook from URL
      •  Use https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/notebook/Flight%20Predict%20PyCon%202016.ipynb
  38. Using Python Packages
      •  Write code inline within cells
      •  Encapsulate helper APIs within a Python package
      •  Two ways of using helper Python packages:
         –  Egg distribution package: pip install from a PyPI server or a file server (e.g. GitHub)
            •  Persistent install across sessions
            •  Recommended in production
         –  SparkContext.addPyFile
            •  Easy addition of a Python module file
            •  Supports multiple module files via zip format
            •  Recommended during development, where frequent code changes occur
  39. Flight Predict Python Package on Github
      (screenshot callouts: Setup script for installing Python Package; Flight Predict Python library)
  40. Method 1: Install Flight Predict Package
      •  Use pip to install the Flight Predict package
      •  Recommended alternative: build an egg distribution package and deploy it to PyPI
  41. Manage Python packages
      •  Check status
      •  Uninstall package
  42. Method 2: Install py modules via sc.addPyFile
      •  addPyFile installs individual py modules and makes them available to all executor processes
      •  Works with modules in zipped files
      (code callouts: Module containing APIs for training the models; Module containing APIs for running the models)
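As a rough sketch of the zip route described on slide 42, the helper below bundles several module files into one archive and hands it to `SparkContext.addPyFile`. The file names `training.py` and `run.py` are hypothetical placeholders, and `sc` is assumed to be the SparkContext the notebook provides. The demonstration at the bottom exercises only the zip step, so it runs without Spark.

```python
import os
import tempfile
import zipfile


def zip_modules(module_paths, zip_path):
    """Bundle several .py files into one zip archive, stored at the archive root."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for path in module_paths:
            archive.write(path, arcname=os.path.basename(path))
    return zip_path


def ship_modules(sc, module_paths, zip_path):
    """Zip the modules and register the archive with the SparkContext `sc`."""
    sc.addPyFile(zip_modules(module_paths, zip_path))


# Demonstrate the zip step alone with two hypothetical placeholder modules.
work_dir = tempfile.mkdtemp()
demo_modules = []
for name in ("training.py", "run.py"):  # hypothetical file names
    path = os.path.join(work_dir, name)
    with open(path, "w") as handle:
        handle.write("# placeholder module\n")
    demo_modules.append(path)

archive_path = zip_modules(demo_modules, os.path.join(work_dir, "helpers.zip"))
with zipfile.ZipFile(archive_path) as archive:
    archive_names = sorted(archive.namelist())
print(archive_names)
```

In a notebook, `ship_modules(sc, demo_modules, archive_path)` would then make the modules importable on the executors.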
  43. Set up credentials and import required Python modules
      In this step, we import the Python modules that will be needed throughout the notebook and set up credentials for the various services.
      (code callouts: Credentials for Cloudant NoSQL Service; Credentials for Weather Service)
  44. Get Credentials for Cloudant
      From the app dashboard, click Environment Variables in the left sidebar
  45. Get Credentials for Weather
  46. Load training set in Spark SQL DataFrame …
      In this step, we use the cloudant-spark connector (https://github.com/cloudant-labs/spark-cloudant) to load data into Spark.
      Make sure to change the db name to match the one created for your training set by your activity (open the Cloudant dashboard to find the name).
  47. Loading data: Behind the scenes
      (code callouts: Use Spark SQL connector to load data into a DataFrame; connector id; Options; Cache data for optimized reuse; Create temp SQL table)
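The callouts on slide 47 (connector id, options, cache, temp table) suggest a load sequence along the following lines. This is a sketch, not the notebook's actual code: the function name and its parameters are made up for illustration, the format string and option keys follow the spark-cloudant README as best recalled, and `registerTempTable` is the Spark 1.x call. `sql_context` is assumed to be the notebook's SQLContext.

```python
def load_cloudant_db(sql_context, host, username, password, db_name, table_name):
    """Load a Cloudant database into a cached DataFrame and expose it to SQL."""
    df = (
        sql_context.read.format("com.cloudant.spark")  # connector id
        .option("cloudant.host", host)                 # connection options
        .option("cloudant.username", username)
        .option("cloudant.password", password)
        .load(db_name)                                 # database name from the Cloudant dashboard
    )
    df.cache()                        # keep the data in memory for reuse
    df.registerTempTable(table_name)  # Spark 1.x API for a temp SQL table
    return df
```

A call might look like `load_cloudant_db(sqlContext, cloudantHost, cloudantUserName, cloudantPassword, "<your training db>", "training")`, where the database name must match the one your pipe created.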
  48. Scatter plot visualization
  49. Visualization API
  50. Transform into an RDD of LabeledPoint
      (code callout: Use Spark SQL connector to load data into a DataFrame)
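To make the LabeledPoint step on slide 50 concrete, here is a minimal sketch of splitting one record into a label and a feature vector. The feature field names and sample values are hypothetical. Only the `classification` field is mentioned in the slides (slide 69). The pure-Python part runs on its own, and the trailing comment shows how it could feed MLlib's `LabeledPoint` when PySpark is available.

```python
def to_label_and_features(record, feature_names, label_name):
    """Split one record (a dict) into a numeric label and a feature list."""
    label = float(record[label_name])
    features = [float(record[name]) for name in feature_names]
    return label, features


# Hypothetical field names and values; the real documents may differ.
FEATURES = ["departure_temp", "departure_wind_speed", "departure_visibility"]
sample = {
    "departure_temp": 12.0,
    "departure_wind_speed": 25,
    "departure_visibility": 3.5,
    "classification": 2,
}
label, features = to_label_and_features(sample, FEATURES, "classification")
print(label, features)

# With PySpark available, the same function could feed an RDD of LabeledPoint:
#   from pyspark.mllib.regression import LabeledPoint
#   labeled_rdd = df.rdd.map(lambda row: LabeledPoint(
#       *to_label_and_features(row.asDict(), FEATURES, "classification")))
```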
  51. loadLabeledDataRDD API
  52. Machine Learning Algorithms
      •  Supervised Learning (requires ground truth)
         –  Continuous output
            •  Regression: Linear, Ridge, Lasso, Isotonic
            •  Decision Tree
            •  RandomForest
            •  GradientBoostedTree
         –  Discrete output
            •  Classification: Logistic Regression, SVM, NaiveBayes
            •  Decision Tree
            •  RandomForest
            •  GradientBoostedTree
            •  K-NN (available as an add-on Spark package)
      •  Unsupervised Learning (no ground-truth data required)
         –  Clustering: KMeans, Gaussian Mixture
         –  Dimensionality Reduction: PCA, SVD
         –  FP-Growth
  53. Train Logistic Regression Model
  54. Train NaiveBayes Model
  55. Train Decision Tree Model
  56. Train Random Forest Model
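The code on slides 53 to 56 did not survive in this transcript. As a hedged sketch of what training the four classifiers with the `pyspark.mllib` API can look like, the function below trains each model on an RDD of LabeledPoint. The parameter values (100 iterations, 10 trees, no categorical features) are illustrative assumptions, not the presenter's settings, and the Flight Predict package may wrap these calls differently. The PySpark imports sit inside the function so the parameter helper can be inspected without a Spark installation.

```python
def mllib_training_params(num_classes, num_trees=10):
    """Keyword arguments for each trainer; the values are illustrative assumptions."""
    return {
        "logistic_regression": {"iterations": 100, "numClasses": num_classes},
        "naive_bayes": {},
        "decision_tree": {"numClasses": num_classes, "categoricalFeaturesInfo": {}},
        "random_forest": {
            "numClasses": num_classes,
            "categoricalFeaturesInfo": {},
            "numTrees": num_trees,
        },
    }


def train_models(labeled_rdd, num_classes):
    """Train the four classifiers on an RDD of LabeledPoint (requires PySpark)."""
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS, NaiveBayes
    from pyspark.mllib.tree import DecisionTree, RandomForest

    params = mllib_training_params(num_classes)
    return {
        "logistic_regression": LogisticRegressionWithLBFGS.train(
            labeled_rdd, **params["logistic_regression"]),
        # MLlib's NaiveBayes expects non-negative feature values.
        "naive_bayes": NaiveBayes.train(labeled_rdd, **params["naive_bayes"]),
        "decision_tree": DecisionTree.trainClassifier(
            labeled_rdd, **params["decision_tree"]),
        "random_forest": RandomForest.trainClassifier(
            labeled_rdd, **params["random_forest"]),
    }
```

Each returned model exposes a `predict` method that can be applied to the feature vectors of the test set.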
  57. Naïve Bayes vs Decision Tree
      •  Naïve Bayes
         –  Probabilistic: computes the probability of a data instance being in a specific class
         –  Assumes that each feature (variable) is independent of the others
         –  Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy)
         –  Works well with a small amount of training data; doesn't need all the possibilities
         –  Doesn't work with categorical features
      •  Decision Tree
         –  Non-probabilistic: partitions the data into subsets that best describe the variable
         –  The deeper the tree, the better the model fits the data
         –  Watch out for overfitting: need to prune the tree
         –  Can handle categorical or continuous features
         –  No need for input to be scaled or standardized: set your features and go!
         –  Requires a lot of data covering all possibilities
  58. Accuracy Analysis of the Machine Learning Models
      In this section, we will perform accuracy analysis on the test data. We will start by computing the accuracy metrics for each model, including the confusion matrix. We will then use a histogram chart to understand the data distribution and refine how the classes are computed.
      (roadmap diagram repeated from slide 13)
  59. Agenda (repeated from slide 2)
  60. Load Test data
      Make sure to change the db name to match the one created for your test set by your activity (open the Cloudant dashboard to find the name).
  61. Accuracy Metrics
  62. Confusion Matrix
  63. Confusion Matrix
  64. Confusion Matrix
  65. Confusion Matrix
  66. Accuracy metrics API
      (code callouts: Compute metrics from labeled and prediction data; Get the confusion matrix and build HTML table; Output HTML; Display results HTML in Notebook cell)
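The steps named in the callouts on slide 66 (compute metrics from labeled and predicted values, build the confusion matrix, render it as HTML) can be illustrated in plain Python. This is a minimal sketch with made-up data and hypothetical function names, not the package's accuracy API.

```python
def confusion_matrix(pairs, num_classes):
    """Count (actual, predicted) pairs; rows are actual classes, columns predicted."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in pairs:
        matrix[int(actual)][int(predicted)] += 1
    return matrix


def accuracy(matrix):
    """Fraction of records on the diagonal, i.e. predicted correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / float(total) if total else 0.0


def to_html_table(matrix):
    """Render the matrix as a plain HTML table string."""
    rows = "".join(
        "<tr>" + "".join("<td>%d</td>" % cell for cell in row) + "</tr>"
        for row in matrix
    )
    return "<table>" + rows + "</table>"


# Made-up (actual, predicted) pairs for a 3-class problem.
pairs = [(0, 0), (0, 1), (1, 1), (2, 2), (2, 0), (1, 1)]
matrix = confusion_matrix(pairs, 3)
print(matrix, accuracy(matrix))
print(to_html_table(matrix))
```

In a notebook cell, the HTML string could be shown with `IPython.display.HTML`. On a Spark cluster, `pyspark.mllib.evaluation.MulticlassMetrics` offers a distributed alternative for computing the confusion matrix.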
  67. Understand the distribution of your data with Histograms
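As a small illustration of using a histogram to decide where class boundaries should fall, the sketch below bins values into fixed-width buckets. The values are made up, and treating them as delay minutes is an assumption; the slides do not say which field the histogram is drawn from.

```python
from collections import Counter


def histogram(values, bin_width):
    """Return sorted (bin_start, count) pairs for fixed-width bins."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return sorted(counts.items())


# Made-up values, assumed here to be delays in minutes.
delays = [0, 3, 7, 12, 14, 31, 45, 47, 95, 130]
bins = histogram(delays, 15)
for start, count in bins:
    print("%4d-%4d: %s" % (start, start + 15, "#" * count))
```

A heavily skewed result like this one, with most records in the first bin, is the kind of signal that would motivate redefining the classes.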
  68. Training Handler class
      •  Provides flexibility and extensibility to the application
      •  Provides a "fail fast and try something else" mechanism
      •  Enables the user to easily customize the classes of data based on how the data is distributed
      •  Enables the user to easily add training features
  69. Default Training Handler class
      (code callouts:)
      •  Return the description of each class
      •  Return the total number of classes: default is 5
      •  Re-classify a record: the default uses the s.classification field in the JSON record
      •  Names of extra features to be added: none by default
      •  Extra features to be added: the array must match the one returned by customTrainingFeaturesNames
  70. Customize Training Handler
      Provide a new classification and add the day of departure as a new feature.
      (code callouts: Inherit from defaultTrainingHandler; Add day of the week using a technique called dummy coding)
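The dummy-coding idea from slide 70 can be sketched as follows. `DefaultTrainingHandler` here is a hypothetical stand-in for the package's defaultTrainingHandler: apart from `customTrainingFeaturesNames`, which the slides name, the method names and the `departure_weekday` field are assumptions made for illustration.

```python
class DefaultTrainingHandler(object):
    """Hypothetical stand-in for the package's defaultTrainingHandler."""

    def num_classes(self):
        return 5  # default number of classes per slide 69

    def get_class(self, record):
        # By default, trust the classification stored in the record.
        return record["classification"]

    def customTrainingFeaturesNames(self):
        return []  # no extra features by default

    def custom_training_features(self, record):
        return []


DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


class DayOfWeekHandler(DefaultTrainingHandler):
    """Adds the departure day of the week as dummy-coded features."""

    def customTrainingFeaturesNames(self):
        # Dummy coding: k categories become k - 1 indicator columns;
        # the omitted category (Sunday) is the all-zeros baseline.
        return ["departure_is_" + day for day in DAYS[:-1]]

    def custom_training_features(self, record):
        day_index = record["departure_weekday"]  # assumed: 0 = Monday ... 6 = Sunday
        return [1.0 if day_index == i else 0.0 for i in range(len(DAYS) - 1)]


handler = DayOfWeekHandler()
wednesday = handler.custom_training_features({"departure_weekday": 2})
sunday = handler.custom_training_features({"departure_weekday": 6})
print(handler.customTrainingFeaturesNames())
print(wednesday, sunday)
```

The feature array has the same length as the list of names, which is the constraint slide 69 states for the two methods.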
  71. Re-train the models
  72. Re-compute accuracy
      (comparison of Models 1 and Models 2)
      •  Better accuracy for NaiveBayes and Logistic Regression
      •  Worse for DecisionTree and RandomForest
  73. Agenda (repeated from slide 2)
  74. Deploy and Run the models
      In the last section, we will simulate deploying and running the models through the notebook by calling APIs from the run package.
      (roadmap diagram repeated from slide 13)
  75. Run the predictive model
  76. runModel API
  77. Get Weather Predictions
  78. Show prediction results
  79. Resources
      •  https://developer.ibm.com/clouddataservices/
      •  https://github.com/ibm-cds-labs/simple-data-pipe
      •  https://github.com/ibm-cds-labs/pipes-connector-flightstats
      •  http://spark.apache.org/docs/latest/mllib-guide.html
      •  https://console.ng.bluemix.net/data/analytics/
  80. Thank You
