Big Data Science in Scala

This talk shows how three Scala libraries (Smile, Saddle and Spark ML) meet the requirements of new big data science projects, using click-through rate prediction as a worked example.

Published in: Data & Analytics

  1. Big Data Science in Scala. Anastasia Lieva, Data Scientist, @lievAnastazia
  2. KDnuggets polls (2014, 2015, 2016), most popular tools in data science: 1. R, 2. Python, 3. SQL
  3. Context: Real-Time Bidding. Raw requests: 100 000 requests per second, 4 terabytes per day
  4. R, Python, SQL, Scala
  5. R, Python, SQL, Scala (Spark ML/DataFrame/SQL, Smile, Saddle)
  6. Spark: Preprocessing, Machine Learning, Evaluation. Saddle and Smile: Preprocessing, Machine Learning, Evaluation
  7. Problem: optimize the click rate of delivered ads. We want to estimate the probability that an ad will be clicked, depending on: request configuration, proposed creative, user history, third-party information
  8. Algorithm: Random Forest, averaging the decisions from all the trees. [Diagram: two decision trees, one over os, Category and City, the other over adType, adSize and weekDay, each with Yes/No leaves]
  9. Raw data (sample of 13 GB): { "time":"2016-06-09T0:25:28Z", "bidfloor":2.88, "appOrSite":"app", "adType":"banner", "categories":"games,news,football", "carrier":"208-10", "os":"iOS", "connectionType":3, "coords":[48.929256439208984, 2.4255824089050293], "adSize":[320, 50], "exchange":"xxxxx", [...], "clicked":true }
  10-13. Os            MaxPrice  Time
         Android       7.3       2016-06-09T0:25:28Z
         iOS           4.55      2016-05-09T14:23:12Z
         WindowsPhone  2.89      2016-06-09T11:35:11Z
  14. Os            MaxPrice  Time                  Click
      Android       7.3       2016-06-09T0:25:28Z   False
      iOS           4.55      2016-05-09T14:23:12Z  True
      WindowsPhone  2.89      2016-06-09T11:35:11Z  False
  15. Same table as slide 14, alongside a numeric table with the same columns (Os, MaxPrice, Time): 3.0 6.0 1.0 5.0 3.0 5.0 1.0 2.0 3.0
  16. Preprocessing: Spark ML. Extraction: extracting features from "raw" data. Transformation: scaling, converting, or modifying features. Selection: selecting a subset from a larger set of features.
  17. Preprocessing: Spark ML. Extraction: extracting features from "raw" data (TF-IDF, Spark SQL). Transformation: scaling, converting, or modifying features (Bucketizer, StringIndexer, IndexToString, VectorAssembler). Selection: selecting a subset from a larger set of features (ChiSqSelector).
  18. Preprocessing: Saddle. Array-backed, specialized data structures. Pandas-like operations: dealing with missing values; index transformation tools; extracting, slicing and mapping row- or column-wise; groupBy/join/concat; sorting/pivoting.
  19. Learning: Spark ML. DataFrame-based API. Classification, regression, linear methods, decision trees, tree ensembles.
  20. Learning: Spark ML. DataFrame-based API. Classification, regression, linear methods, decision trees, tree ensembles. Pipeline interface: TF-IDF, StringIndexer, Assembler, Random Forest, Evaluation.
  21. Compare performance: Spark
  22. Learning: Smile. Array-backed API. Classification, regression, linear methods, decision trees, tree ensembles.
  23. Learning: Smile. Classification, regression, linear methods, decision trees, tree ensembles. ★ Visualisation ★ Missing-value imputation ★ Association rule mining ★ Manifold learning ★ Multi-dimensional scaling ★ Feature selection and dimensionality reduction
  24. Preprocessing: Saddle. Create the dataframe and balance the data.
  25. Preprocessing: Spark ML. Create the dataframe and balance the data.
  26. Preprocessing: Spark ML. Index categorical data.
      timestamp   os             osIdx
      1465037789  iOS            1
      1464983457  Windows Phone  2
      1465019529  Android        0
      1464974567  iOS            1
      1465018552  Android        0
  27. Preprocessing: Saddle. Index categorical data.
  28. Preprocessing: Saddle. Split randomly into train and test sets, then convert to the input type required by the Smile Random Forest implementation.
  29. Preprocessing: Spark ML. Conversion and sampling.
  30. Learning: construct the classifier and set hyperparameters (Smile, Spark ML).
  31. Learning: train the model and predict on the test dataframe (Spark ML, Smile).
  32. Learning: evaluate the model (Spark ML, Smile).
  33. Compare Spark and Smile Random Forest: classification metrics. [Chart, annotated "the higher the better" and "the lower the better"]
  34. Compare Spark and Smile Random Forest: running time on 13 GB, in minutes. [Chart]
  35. Compare preprocessing: Spark vs Saddle
  36. My List[tools] for THIS project: Preprocessing: Spark. Machine Learning (Random Forest): Smile.
  37. Your Option[tools] for YOUR project: Spark, Smile, Saddle
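
Slide 8 describes Random Forest as averaging the decisions of all the trees. A minimal sketch of that idea in plain Scala follows. The `BidRequest` fields and the three hand-written "trees" are hypothetical stand-ins chosen for illustration; a real forest learns its splits from data, and neither Spark ML nor Smile is used here.

```scala
// Hypothetical request fields, loosely named after the raw-data slide.
final case class BidRequest(os: String, adType: String, weekDay: Int)

object ForestVoteSketch {
  // A "tree" is reduced to a function that answers click (true) or no click (false).
  type Tree = BidRequest => Boolean

  // Toy hand-written trees (assumption); a trained forest would learn these rules.
  val trees: Seq[Tree] = Seq(
    (r: BidRequest) => r.os == "iOS",
    (r: BidRequest) => r.adType == "banner" && r.weekDay >= 6,
    (r: BidRequest) => r.os != "WindowsPhone" && r.adType == "banner"
  )

  // Average of the binary decisions: the fraction of trees voting "click".
  def clickProbability(forest: Seq[Tree], request: BidRequest): Double =
    if (forest.isEmpty) 0.0
    else forest.count(tree => tree(request)).toDouble / forest.size

  def main(args: Array[String]): Unit = {
    val request = BidRequest(os = "iOS", adType = "banner", weekDay = 2)
    // Two of the three toy trees vote "click" for this request.
    println(clickProbability(trees, request))
  }
}
```

Treating the vote share as a probability is the simplest reading of "averaging the decisions"; library implementations may average per-tree class probabilities instead.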
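
The Spark ML steps named in slides 17, 20, 25, 26 and 29 to 32 (balance the data, index categorical columns, assemble features, train a Random Forest, evaluate) can be sketched as a single pipeline. The code shown on the original slides is not part of this transcript, so what follows is an assumption-laden illustration rather than the speaker's implementation: the feature columns are picked from the raw-data slide, the input path `args(0)`, the hyperparameter values and the 80/20 split are all hypothetical.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CtrPipelineSketch {
  // Column names come from the raw-data slide; choosing these four is an assumption.
  val categoricalCols: Seq[String] = Seq("os", "adType", "appOrSite")
  val numericCols: Seq[String] = Seq("bidfloor")

  def buildPipeline(): Pipeline = {
    // One StringIndexer per categorical column, e.g. os -> osIdx.
    val indexers: Seq[PipelineStage] = categoricalCols.map { c =>
      new StringIndexer()
        .setInputCol(c)
        .setOutputCol(c + "Idx")
        .setHandleInvalid("keep") // categories unseen in training get an extra index
    }
    val assembler = new VectorAssembler()
      .setInputCols((categoricalCols.map(_ + "Idx") ++ numericCols).toArray)
      .setOutputCol("features")
    val forest = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(50)  // illustrative value
      .setMaxDepth(10)  // illustrative value
    new Pipeline().setStages((indexers ++ Seq[PipelineStage](assembler, forest)).toArray)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ctr-sketch").master("local[*]").getOrCreate()

    // args(0): hypothetical path to JSON-lines bid requests shaped like slide 9.
    val labeled = spark.read.json(args(0))
      .na.drop((categoricalCols ++ numericCols) :+ "clicked")
      .withColumn("label", col("clicked").cast("double"))

    // Balance the data by downsampling the non-click class to roughly the click count.
    val clicks = labeled.filter(col("label") === 1.0).count()
    val nonClicks = labeled.filter(col("label") === 0.0).count()
    val keep = if (nonClicks == 0) 1.0 else math.min(1.0, clicks.toDouble / nonClicks)
    val balanced = labeled.stat.sampleBy("label", Map(0.0 -> keep, 1.0 -> 1.0), 42L)

    val Array(train, test) = balanced.randomSplit(Array(0.8, 0.2), seed = 42L)
    val model = buildPipeline().fit(train)

    // Default metric of BinaryClassificationEvaluator is area under the ROC curve.
    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .evaluate(model.transform(test))
    println(s"Area under ROC on the test split: $auc")
    spark.stop()
  }
}
```

Keeping the indexers, assembler and classifier in one `Pipeline` means the category-to-index mapping learned on the training split is reused unchanged when scoring the test split.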
