Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI

Topic: How to use big data to enhance AI
Outline:
1. Spark ETL
   Spark SQL
   Spark Streaming
2. Spark ML
   Spark ML pipeline
   Distributed model tuning
   Spark ML model and data lineage management
3. Spark XGBoost
   XGBoost introduction
   XGBoost with Spark
   XGBoost with GPU
4. Spark Deep Learning pipeline
   Transfer learning
   Build Spark ML pipeline with TensorFlow
   Model selection on distributed TF model

Published in: Technology

  1. How Spark Speedup AI. Mike Tang. Learn more about CS job hunting: scan the QR code to follow on WeChat (www.laioffer.com, laiofferhelper2).
  2. Outline
     ● Spark ecosystem
     ● Spark ML and XGBoost
     ● Spark Deep Learning pipeline
  3. Big data
  4. What is machine learning or AI?
     ⬢ Database, big data, machine learning, AI?
     ⬢ “Using algorithms to understand the patterns in data”
     ⬢ Prediction, insight
  5. History of big data
     ⬢ Application driven
       ○ Billions of web pages
       ○ New system requirements
         ■ Cheap
         ■ Robust
         ■ Efficient
       ○ 2004 Google
       ○ 2007 Yahoo
       ○ Hadoop ecosystem
         ■ HDFS
         ■ MapReduce
         ■ Yarn
       ○ 2012 Hortonworks
  6. Applications driving big data
     ⬢ The Hadoop ecosystem
       ○ How does Facebook use Hadoop?
         ■ Hive for OLAP query processing
         ■ HBase for tracking the activity of billions of users
       ○ How does Twitter use Hadoop?
         ■ Storm: stream processing for Twitter's streaming data
       ○ How does LinkedIn use Hadoop?
         ■ Kafka: subscribing to users' streaming data
       ○ How do the Hadoop components come together?
         ■ Ambari: node management and deployment of the different components
  7. The leading data science platform for big data
     [Diagram labels: Apache Spark, Hadoop, Interactive, Streaming, Batch, NoSQL, TensorFlow]
     ⬢ Apache Spark
       ○ Driven by machine learning applications
       ○ The leading computation engine for big data processing
       ○ Data pipeline across different data sources and other computation engines
       ○ Uniform data processing abstractions: RDD and DataFrame
       ○ Memory based
  8. Data pipeline for machine learning
     [Diagram labels: Resilient Distributed Dataset across four servers; stages: ETL, Exploration, Machine learning; Structured data; Raw data processing; Interactive, OLAP, Spark SQL; Feature engineering; Model training; Data product; Visualization]
  9. ML is only a small part of a real-world ML system
  10. Bring Data Science to Big Data
      [Diagram labels: Retraining; History data; Feedback data; Data scientist; Continuous updating; Deploying; Operational data; ML Model; Feature engineering; Model selection; Model tuning; ML Pipeline; Scoring]
  11. Outline
      ● Spark ecosystem
      ● Spark XGBoost
      ● Spark Deep Learning pipeline
  12. Motivation
      ⬢ Machine learning for big data
      ⬢ Example applications
        ○ House price prediction
        ○ CTR prediction
        ○ …
        ○ Product recommendation
      ⬢ ML job categories
        ○ Regression
        ○ Classification
        ○ Clustering
        ○ Etc.
      ⬢ XGBoost is good at
        ○ Regression and classification
  13. Motivation
      ⬢ XGBoost is the state-of-the-art approach on Kaggle for structured data
        ○ 80% of the teams that win competitions build on XGBoost
        ○ A tree-based model
        ○ Excellent at classification and regression
        ○ Ref: http://xgboost.readthedocs.io/en/latest/model.html
  14. Motivation
      ⬢ Ensembles and boosting make model training time consuming
        ○ Ensemble
        ○ An ensemble is a combination of prediction models whose outputs are combined into a final result
  15. Motivation
      ⬢ Ensembles and boosting make model training time consuming
        ○ Gradient boosting
        ○ Multiple rounds (1…M), each correcting the errors left by the previous round
        ○ Ref: https://www.slideshare.net/LonghowLam/machine-learning-overview
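
The "rounds 1…M" idea on this slide can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions, not XGBoost's implementation: it uses squared loss, a single numeric feature, and one-split trees ("stumps"). All names (`BoostingSketch`, `Stump`, `fitStump`, `eta` as the learning rate) are hypothetical.

```scala
// Minimal gradient boosting for squared loss on one numeric feature.
// Hypothetical sketch: not the XGBoost algorithm or API.
object BoostingSketch {
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(v: Seq[Double]): Double = if (v.isEmpty) 0.0 else v.sum / v.size

  // Fit one stump to the residuals, trying every observed value as a threshold.
  // Assumes xs is non-empty.
  def fitStump(xs: Seq[Double], residuals: Seq[Double]): Stump = {
    val rows = xs.zip(residuals)
    val candidates = xs.distinct.map { t =>
      val (l, r) = rows.partition(_._1 <= t)
      val stump = Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
      val sse = rows.map { case (x, res) => math.pow(res - stump.predict(x), 2) }.sum
      (sse, stump)
    }
    candidates.reduceLeft((a, b) => if (b._1 < a._1) b else a)._2
  }

  // The ensemble prediction is the sum of all stumps, each scaled by eta.
  def predict(model: Seq[Stump], x: Double, eta: Double): Double =
    model.map(s => eta * s.predict(x)).sum

  // Rounds 1..M: each new stump is trained on the errors left by the previous rounds.
  def fit(xs: Seq[Double], ys: Seq[Double], rounds: Int, eta: Double): Seq[Stump] =
    (1 to rounds).foldLeft(Vector.empty[Stump]) { (model, _) =>
      val residuals = xs.zip(ys).map { case (x, y) => y - predict(model, x, eta) }
      model :+ fitStump(xs, residuals)
    }

  def mse(model: Seq[Stump], xs: Seq[Double], ys: Seq[Double], eta: Double): Double =
    mean(xs.zip(ys).map { case (x, y) => math.pow(y - predict(model, x, eta), 2) })
}
```

Because every round has to wait for the residuals of the round before it, the rounds cannot run in parallel. That is why the following slides look for speedups inside a single round (distributed data, GPU split search) instead.
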
  16. Motivation
      ⬢ Training XGBoost is time consuming
      [Diagram labels: Training data; XGBoost; Model; Testing data; Model evaluation; A: Training algorithm; B: Model evaluation; C: Model tuning]
  17. Motivation
      ⬢ What should we do?
      [Diagram labels: Training data; XGBoost; Model; Testing data; Model evaluation; A: Speed up training (1. Parallel and GPU); B: Model evaluation with Spark ML; C: Auto model tuning with Spark ML]
  18. Motivation
      ⬢ From a single machine to parallel computation
        ○ Distributed training
        ○ GPU support
        ○ Works with the big data ecosystem
      ⬢ How to provide an end-to-end solution for data scientists?
        ○ Front end
          ■ An easy and efficient way to run parallel XGBoost computation
          ■ Notebook front end for model visualization
        ○ Back end
          ■ Yarn to allocate resources to applications (CPU, memory, GPU)
          ■ Docker support
  19. How Spark enhances XGBoost
      ⬢ Efficient distributed training and Spark ML pipeline
        ○ DataFrame and RDD support for efficient data preprocessing
      ⬢ Ref: http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html
  20. How Spark enhances XGBoost
      ⬢ Each XGBoost worker needs Rabit to communicate with the others
        ○ Efficient but not easy to manage
      [Diagram: training data partitions 1 to 4, one per XGBoost worker (worker1 to worker4), connected through Rabit; statistics sync: optimal split value]
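
To make the "statistics sync" step concrete, here is a small single-process simulation in plain Scala. It is not the Rabit API: the names `AllReduceSketch`, `Row`, `localHistogram`, and `allReduceSum` are hypothetical, and the sketch assumes each worker summarizes its partition as per-bin sums of gradients and hessians for one feature.

```scala
// Simulated "allreduce" over per-partition gradient histograms.
// Hypothetical sketch: this is NOT the Rabit API.
object AllReduceSketch {
  // One training row: the bin index of a single feature, plus its gradient and hessian.
  final case class Row(bin: Int, grad: Double, hess: Double)

  // Per-bin (gradient sum, hessian sum), computed locally by one worker.
  def localHistogram(partition: Seq[Row], numBins: Int): Vector[(Double, Double)] =
    partition.foldLeft(Vector.fill(numBins)((0.0, 0.0))) { (hist, row) =>
      val (g, h) = hist(row.bin)
      hist.updated(row.bin, (g + row.grad, h + row.hess))
    }

  // Element-wise sum across workers. After this step every worker would hold
  // the same global totals and could therefore pick the same split.
  // Assumes at least one histogram and equal lengths.
  def allReduceSum(histograms: Seq[Vector[(Double, Double)]]): Vector[(Double, Double)] =
    histograms.reduce { (a, b) =>
      a.zip(b).map { case ((g1, h1), (g2, h2)) => (g1 + g2, h1 + h2) }
    }
}
```

The point of the design is that workers exchange only small summaries (one pair of sums per bin), never the training rows themselves, so the communication cost does not grow with the partition size.
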
  21. XGBoost on Spark ML pipeline
      ⬢ Distributed XGBoost inside a Spark ML pipeline
      ⬢ XGBoost estimator
        ○ Extends the Spark ML estimator
      ⬢ XGBoost model
        ○ Extends the Spark ML pipelineModel
        ○ Works naturally inside a Spark ML Pipeline for model materialization
      ⬢ XGBoost parameters
        ○ Extend the Spark ML parameters
        ○ Enable automatic parameter tuning
  22. XGBoost on Spark ML pipeline
      ⬢ Distributed XGBoost
        ○ Imports (assuming the XGBoost4J-Spark 0.7x package layout):
            import ml.dmlc.xgboost4j.scala.spark.{XGBoost, XGBoostEstimator}
            import org.apache.spark.ml.Pipeline
        ○ Parameters:
            val paramMap = List(
              "eta" -> 0.1f,
              "max_depth" -> 2,
              "objective" -> "binary:logistic").toMap
        ○ Training (1 boosting round, 4 workers):
            val xgboostModelRDD = XGBoost.train(trainRDD, paramMap, 1, 4, useExternalMemory = true)
            val xgboostModelDF = XGBoost.trainWithDataFrame(trainDF, paramMap, 1, 4, useExternalMemory = true)
        ○ Prediction:
            val xgboostPredictionRDD = xgboostModelRDD.predict(trainRDD.map { x => x.features })
        ○ XGBoost inside an ML pipeline (assembler and dataset are assumed to be defined earlier):
            val xgboostEstimator = new XGBoostEstimator(Map[String, Any](
              "num_round" -> 30,
              "nworkers" -> 10,
              "objective" -> "reg:linear",
              "eta" -> 0.3,
              "max_depth" -> 6,
              "early_stopping_rounds" -> 10))
            val pipeline = new Pipeline().setStages(Array(assembler, xgboostEstimator))
            // The estimator expects the target column to be named "label".
            val pipelineData = dataset.withColumnRenamed("PE", "label")
            val pipelineModel = pipeline.fit(pipelineData)
  23. GPU speedup XGBoost
      ⬢ Where can the tree building procedure be improved?
      ⬢ Procedure to build a tree
        ○ for each feature of the input data
          ■ for each leaf of the current tree
            ● find the best split
          ■ split the leaf node
      [Diagram: tree nodes labeled A with Y/N branches]
  24. GPU speedup XGBoost
      ⬢ GPU speedup of XGBoost on a single machine
        ○ Process all nodes in the same level concurrently
        ○ Optimize split point selection
        ○ Optimize memory usage for sparse data
      ⬢ Algorithm to speed up XGBoost via GPU
        ○ Phase 1: Find splits
        ○ Phase 2: Update node positions
        ○ Phase 3: Sort node buckets
        ○ Ref: https://peerj.com/articles/cs-127/

      Instance ID     1    4    3    2
      Feature value   0.1  0.2  0.4  0.5
      Gradient        0.3  0.5  0.3  0.3
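
Phase 1 ("find splits") can be illustrated on the four instances listed on this slide. The sketch below is a sequential stand-in for what a GPU would do with a parallel prefix sum (scan); it is not code from XGBoost or from the referenced paper. The slide gives only gradients, so the hessians of 1.0 and the regularization term `lambda` used here are assumptions, and the names `SplitSketch`, `Split`, and `bestSplit` are hypothetical. The gain formula is the standard XGBoost one with the pruning penalty left out.

```scala
// Exact split search on one pre-sorted feature using prefix sums of gradient statistics.
// Hypothetical sketch: hessians and lambda are assumptions, not values from the slide.
object SplitSketch {
  final case class Split(leftCount: Int, threshold: Double, gain: Double)

  // values must be sorted ascending and contain at least two rows.
  def bestSplit(values: Vector[Double], grads: Vector[Double], hess: Vector[Double],
                lambda: Double): Split = {
    // gPrefix(i) is the gradient sum of the first i rows (so gPrefix(0) == 0.0).
    val gPrefix = grads.scanLeft(0.0)(_ + _)
    val hPrefix = hess.scanLeft(0.0)(_ + _)
    val (gTotal, hTotal) = (gPrefix.last, hPrefix.last)
    def score(g: Double, h: Double): Double = g * g / (h + lambda)

    // Candidate i puts the first i rows on the left. Once the prefix sums exist,
    // every candidate is independent, which is what makes the search parallel-friendly.
    val candidates = (1 until values.length).map { i =>
      val gain = 0.5 * (score(gPrefix(i), hPrefix(i)) +
        score(gTotal - gPrefix(i), hTotal - hPrefix(i)) - score(gTotal, hTotal))
      Split(i, (values(i - 1) + values(i)) / 2.0, gain)
    }
    candidates.reduceLeft((a, b) => if (b.gain > a.gain) b else a)
  }
}
```

A possible call with the slide's rows, already ordered by feature value:

```scala
val split = SplitSketch.bestSplit(
  Vector(0.1, 0.2, 0.4, 0.5),   // feature values
  Vector(0.3, 0.5, 0.3, 0.3),   // gradients from the slide
  Vector(1.0, 1.0, 1.0, 1.0),   // assumed hessians
  lambda = 0.0)
```
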
  25. GPU speedup XGBoost
      ⬢ XGBoost on GPU achieves a 4.x speedup over the CPU-based version
      ⬢ Ref: https://devblogs.nvidia.com/gradient-boosting-decision-trees-xgboost-cuda/
  26. GPU speedup XGBoost
      ⬢ GPUs are good, but managing a GPU cluster is not easy
        ○ Different driver versions for different GPUs
        ○ Users have to build XGBoost with GPU support
        ○ GPU resources are hard to manage
        ○ GPU resources cannot be shared
      ⬢ An ideal environment includes everything
        ○ Spark as an efficient distributed engine for data processing
        ○ Spark ML pipeline for model tuning
        ○ GPU to speed up XGBoost training
        ○ Yarn to manage the cluster's resources
        ○ Notebook for end users
  27. What you can learn from this notebook
      ⬢ Combine Spark and XGBoost
        ○ Train and deploy an XGBoost model on a unified data platform
        ○ Automatically tune the XGBoost model with a Spark ML pipeline
        ○ Speed up XGBoost training with distributed computation and GPU
        ○ Multiple users can share the same cluster with GPU and Spark
      ⬢ Benefits
        ○ End-to-end solution for an ML pipeline with XGBoost support
        ○ No need to care about GPU management
        ○ Train XGBoost with the Spark ML APIs
        ○ Visualize the prediction results in a notebook
  28. Spark and XGBoost for Fintech
      ⬢ Lending Club data
      ⬢ Spark DataFrame for ETL
      ⬢ Spark SQL for OLAP
      ⬢ Spark ML for automatic model tuning and model serving
      ⬢ Notebook links (use Databricks Community Edition):
        ○ Part 1 (https://bit.ly/2QuLQ9b): https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4999972933037924/27242371102049/8135547933712821/latest.html
        ○ Part 2 (https://bit.ly/2AZJI3Z): https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4999972933037924/27242371102070/8135547933712821/latest.html
      ⬢ Acknowledgment: https://databricks.com/blog/2018/08/09/loan-risk-analysis-with-xgboost-and-databricks-runtime-for-machine-learning.html
  29. Outline
      ● Spark ecosystem
      ● Spark XGBoost
      ● Spark Deep Learning pipeline
  30. Why Deep Learning
      Data explosion
      Computation explosion
      An AI-driven world
  32. What is deep learning
      ⬢ A set of machine learning techniques that can learn useful representations of features directly from images, text and sound.
      ⬢ Achievements
        ○ ImageNet
        ○ Google Neural Machine Translation
        ○ AlphaGo/AlphaZero
      ⬢ Benefits from big data and GPUs
  34. A typical Deep Learning workflow
      [Diagram labels: Load data; Select neural network architecture, optimize the parameters]
  35. Build your own deep learning model

      Dataset       Images (#)   Classes (#)
      ImageNet      14M          20K
      Skin cancer   129,450      757
  39. Transfer Learning Pipeline
      [Diagram labels: Pre-trained CNN model; Softmax classification (trainable parameters); Load data as DataFrame]
  40. Deep Learning in Spark MLlib Pipeline
      ⬢ Spark MLlib pipeline
        ○ Sequence of Transformers and Estimators
        ○ Simple, concise API and ease of use
      ⬢ Integrates with Spark APIs
        ○ Spark is great at scaling out computations
        ○ Image representation and reader in Spark DataFrame/Dataset (new in Spark 2.3)
      ⬢ Spark Deep Learning Pipelines (github.com/databricks/spark-deep-learning)
        ○ Plug in your own TensorFlow Graph or Keras Model as Transformers
        ○ Open source under the Apache 2.0 license
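
The phrase "sequence of Transformers and Estimators" can be unpacked with a toy model in plain Scala. This is deliberately not Spark's API: the traits and the `PipelineSketch`, `Square`, and `MeanCenter` names are hypothetical, and a plain `Vector[Double]` stands in for a DataFrame. It only shows the contract: a Transformer maps data to data, an Estimator learns something from data and returns a Transformer, and fitting a pipeline walks the stages in order.

```scala
// A toy model of the Transformer / Estimator idea.
// Hypothetical sketch: this is NOT Spark's API.
object PipelineSketch {
  type Data = Vector[Double]

  trait Transformer { def transform(data: Data): Data }
  trait Estimator { def fit(data: Data): Transformer }

  // A stateless transformer: the same function is applied to any input.
  object Square extends Transformer {
    def transform(data: Data): Data = data.map(x => x * x)
  }

  // An estimator learns state (here, the mean) and returns a fitted transformer.
  // Assumes non-empty training data.
  object MeanCenter extends Estimator {
    def fit(data: Data): Transformer = {
      val mean = data.sum / data.size
      new Transformer { def transform(d: Data): Data = d.map(_ - mean) }
    }
  }

  // Fit stages left to right, feeding each stage the output of the previous one.
  def fitPipeline(stages: Seq[Either[Transformer, Estimator]], training: Data): Seq[Transformer] = {
    val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], training)) {
      case ((done, current), stage) =>
        val t = stage match {
          case Left(transformer) => transformer
          case Right(estimator)  => estimator.fit(current)
        }
        (done :+ t, t.transform(current))
    }
    fitted
  }

  // Apply the fitted stages, in order, to new data.
  def transformAll(fitted: Seq[Transformer], data: Data): Data =
    fitted.foldLeft(data)((d, t) => t.transform(d))
}
```

In this picture, a pre-trained network plugged in as a Transformer would play the role of `Square` (no fitting needed), while a trainable classifier on top would play the role of `MeanCenter`.
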
  41. Auto ML in Spark ML pipeline
      ⬢ Spark to prepare the data
        ○ Spark Streaming
        ○ Spark SQL
      ⬢ Spark for model parameter tuning
        ○ Hyperparameters
        ○ Reduced memory usage
      ⬢ TensorFlow automatic network structure tuning
        ○ Reinforcement learning
        ○ Transfer learning
      ⬢ Model deployment as a service
  42. Case study
      ⬢ Car damage estimation
      ⬢ Intelligence agent
      ⬢ X-ray image analysis
      ⬢ Anti-terrorism
  43. What you can learn in this section
      ⬢ How to combine deep learning and Spark
      ⬢ Use DL as an operator in a Spark ML pipeline
      ⬢ Transfer learning with a DL model
      ⬢ DL model parameter tuning
      ⬢ Apply a DL model in Spark SQL
      ⬢ Notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4999972933037924/4324977500035919/8135547933712821/latest.html
      ⬢ Acknowledgment: https://docs.databricks.com/applications/deep-learning/deep-learning-pipelines.html
  45. Resources: https://drive.google.com/drive/folders/1wGKNGq7w75YKYazMZ7ytgaAtfTCgvsED?usp=sharing
