End-to-End Data Pipelines with Apache Spark

  1. End-to-End Data Pipelines with Apache Spark
     Burak Yavuz, December 27, 2015
  2. Who Am I?
     • Software Engineer at Databricks
     • MS Management Science & Eng. @ Stanford University
     • BS Mechanical Eng. @ Bogazici University, Istanbul
     • Contributor to Spark Core, MLlib, SQL, and Streaming
     • Maintainer of Spark Packages
  3. Outline
     • Intro - Spark & Ecosystem
     • Build an End-to-End Data Product
       • Step 1: Understand your Data
         • SparkSQL - DataFrames
       • Step 2: Build your Service
         • SparkMLlib - ML Pipelines
       • Step 3: Monitor your Service
         • Spark Streaming
         • Kafka
  4. Timeline of Spark
     • 2010: a research paper
     • 2010-13: a project under github/mesos
     • 2013-14: Apache incubating -> TLP
     • 2014: the most active project in the ASF
  5. Apache Spark
  6. Spark Ecosystem
     • 770 contributors
     • 6000+ forks on GitHub
     • 14000+ commits!
     Source: https://github.com/apache/spark
  7. [Image slide] Source: http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf
  8. [Image slide] Source: http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf
  9. [Image slide] Source: http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf
  10. [Image slide]
  11. Spark Packages (http://spark-packages.org)
     • a community index of 3rd-party packages
     • helps users find packages
     • helps package developers meet users
     • users provide feedback through voting and commenting
     • index maintained by Databricks
     [Diagram labels: 3rd Party Packages, Community]
  12. Types of Packages Currently Available
     • Data Source Connectors
       • spark-avro, spark-redshift, spark-mongodb, spark-sequoiadb, spark-cassandra-connector, …
     • Deployment Scripts
       • spark_azure, spark_gce, sbt-spark-ec2
     • Machine Learning Algorithms
       • spark-hash, spark-mrmr-feature-selection, streaming-matrix-factorization, generalized-kmeans-clustering
     • and many more…
  13. What’s new in Spark 1.6
     • Dataset API
     • Automatic memory configuration
     • Optimized state storage in Spark Streaming
     • Pipeline persistence in Spark ML
  14. Demo
     Source Code: http://brkyvz.github.io/spark-pipeline
     Scenario: As an e-commerce company, we would like to recommend products that users may like in order to increase sales and profit.
     Dataset: http://jmcauley.ucsd.edu/data/amazon/
     • 18 GB
     • 82.83 million reviews
     We will use a subset with 24 million reviews.
  15. [Image slide]
  16. [Image slide]
  17. Recommendation Engines
     • Finding Similar Items
     • Clustering using:
       • Metadata
       • Matrix Factorization
     • Frequent Itemsets
     • Ranking
     • Rating Prediction using:
       • Matrix Factorization
  18. Architecture
     [Diagram labels: Web Service 1, Web Service 2, Web Service 3, Cassandra, Sales Data, Database, Spark, Sales + Ratings, Rating Data, ML Model, Recommendations, Request]
  19. Step 1: Understand your Data
  20. Step 2: Build your Service
  21. Solution Proposal
     Use Matrix Factorization to understand customers and items. Then:
     1) Predict the rating for a product for a given user
     2) Find similar products, and show top k
  22. Matrix Factorization
     Source: https://databricks-training.s3.amazonaws.com/slides/Spark_Summit_MLlib_070214_v2.pdf
  23. Matrix Factorization
     Source: https://databricks-training.s3.amazonaws.com/slides/Spark_Summit_MLlib_070214_v2.pdf
  24. [Image slide] Source: https://databricks-training.s3.amazonaws.com/slides/Spark_Summit_MLlib_070214_v2.pdf
  25. Step 3: Monitor your Service
  26. Apache Kafka
     • Distributed messaging system
     • High-throughput
     • Fast
     • Scalable
     • Durable
     • http://kafka.apache.org/
  27. Architecture
     [Diagram labels: Web Service 1, Web Service 2, Web Service 3, Kafka, Spark Streaming]
  28. Architecture
     [Diagram labels: Web Service 1, Web Service 2, Web Service 3, Kafka, Spark Streaming]
  29. Thank you.
     burak@databricks.com
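
The solution proposed on slide 21 (factorize the ratings matrix, then predict a user's rating for a product and list the top k similar products) can be illustrated with a minimal sketch. This is not the code from the demo repository: it is plain Python on a made-up toy dataset, and it trains with stochastic gradient descent instead of the ALS algorithm that Spark MLlib provides for large datasets. The function names (`factorize`, `predict`, `similar_items`, `rmse`), the hyperparameters, and the ratings themselves are all assumptions chosen for illustration.

```python
import math
import random

# Hypothetical toy ratings as (user, item, rating) triples; not from the Amazon dataset.
RATINGS = [
    ("ann", "camera", 5.0), ("ann", "tripod", 4.0), ("ann", "novel", 1.0),
    ("bob", "camera", 4.0), ("bob", "tripod", 5.0), ("bob", "cookbook", 2.0),
    ("cat", "novel", 5.0), ("cat", "cookbook", 4.0), ("cat", "camera", 1.0),
    ("dan", "novel", 4.0), ("dan", "cookbook", 5.0), ("dan", "tripod", 2.0),
]


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def factorize(ratings, rank=2, epochs=1000, lr=0.01, reg=0.02, seed=0):
    """Learn one factor vector per user and per item with SGD.

    Each observed rating r(u, i) is approximated by dot(user_f[u], item_f[i]).
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    user_f = {u: [rng.uniform(0.1, 0.5) for _ in range(rank)] for u in users}
    item_f = {i: [rng.uniform(0.1, 0.5) for _ in range(rank)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            pu, qi = user_f[u], item_f[i]
            err = r - dot(pu, qi)
            for k in range(rank):
                pu_k, qi_k = pu[k], qi[k]  # keep old values for a symmetric update
                pu[k] += lr * (err * qi_k - reg * pu_k)
                qi[k] += lr * (err * pu_k - reg * qi_k)
    return user_f, item_f


def predict(user_f, item_f, user, item):
    """Step 1 of the proposal: predicted rating of `item` for `user`."""
    return dot(user_f[user], item_f[item])


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def similar_items(item_f, item, k):
    """Step 2 of the proposal: the k items whose factors are closest to `item`."""
    scored = [(cosine(item_f[item], vec), other)
              for other, vec in item_f.items() if other != item]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def rmse(ratings, user_f, item_f):
    """Root-mean-square error of the model on the given ratings."""
    total = sum((r - predict(user_f, item_f, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(total / len(ratings))


user_f, item_f = factorize(RATINGS)
print("training RMSE:", rmse(RATINGS, user_f, item_f))
print("predicted rating (ann, cookbook):", predict(user_f, item_f, "ann", "cookbook"))
print("items most similar to camera:", similar_items(item_f, "camera", 2))
```

In a deployment like the one sketched on slide 18, the factor vectors would be computed by Spark and served to the web services; the in-memory dictionaries here only stand in for that model.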
