Introduction to Spark (Intern Event Presentation)

2. What is Apache Spark?
Fast and general computing engine for clusters
Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• 100x faster than Hadoop MapReduce for some apps

3. About Databricks
Founded by creators of Spark in 2013
Offers a hosted cloud service built on Spark
• Interactive workspace with notebooks, dashboards, jobs

4. Community Growth
[Chart: contributors per month to Spark, 2010 to 2015]
Most active open source project in big data

5. Spark Programming Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs)
• Collections of objects stored in memory or disk across a cluster
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure
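The three RDD properties above can be illustrated with a toy, single-process model. This is only a sketch and not how Spark is implemented: the class `ToyRDD` and its methods `from_list`, `lose_cache`, and `collect` are hypothetical names invented here. Each dataset remembers the recipe (lineage) that produced it, so a lost in-memory copy can be recomputed from its parent.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: lazy transformations plus lineage-based rebuild."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute              # lineage: how to (re)build the data
        self._keep_in_memory = False         # set by cache()
        self._cached: Optional[List] = None  # in-memory copy, if any

    @classmethod
    def from_list(cls, data: List) -> "ToyRDD":
        return cls(lambda: list(data))

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformation: only records the step; nothing is computed yet.
        return ToyRDD(lambda: (fn(x) for x in self._materialize()))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._materialize() if pred(x)))

    def cache(self) -> "ToyRDD":
        # Ask for the data to be kept in memory once it is first computed.
        self._keep_in_memory = True
        return self

    def lose_cache(self) -> None:
        # Simulate a failure that wipes the in-memory copy.
        self._cached = None

    def _materialize(self) -> List:
        if self._cached is not None:
            return self._cached
        data = list(self._compute())         # rebuild from lineage
        if self._keep_in_memory:
            self._cached = data
        return data

    def collect(self) -> List:
        # Action: triggers the actual computation.
        return list(self._materialize())

    def count(self) -> int:
        return len(self._materialize())


nums = ToyRDD.from_list([1, 2, 3, 4, 5, 6])
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

first = evens_squared.collect()
evens_squared.lose_cache()          # pretend the cached copy was lost
rebuilt = evens_squared.collect()   # recomputed from the recorded steps
```

In this sketch `rebuilt` equals `first` because the filter and map steps are replayed from the source list, which mirrors the idea of rebuilding lost data from lineage rather than from replicas.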

6. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")                      # Base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD
    messages = errors.map(lambda s: s.split('\t')[2])
    messages.cache()

    messages.filter(lambda s: "MySQL" in s).count()           # Action
    messages.filter(lambda s: "Redis" in s).count()
    . . .

[Diagram: the driver sends tasks to three workers and receives results; each worker reads one block (Block 1 to 3) and keeps its partition in a cache (Cache 1 to 3)]
Result: full-text search of Wikipedia in 0.5 sec (vs 20s for on-disk data)
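To see what each step of the log-mining pipeline computes, the same filter and map logic can be run on a small in-memory list with plain Python. This is a sketch only: the sample log lines and their tab-separated layout (level, date, message) are assumptions made up for illustration, and no cluster or Spark API is involved.

```python
# Hypothetical tab-separated log lines: level, date, message.
log_lines = [
    "ERROR\t2015-06-01\tMySQL connection refused",
    "INFO\t2015-06-01\tjob started",
    "ERROR\t2015-06-01\tRedis timeout",
    "ERROR\t2015-06-02\tMySQL deadlock detected",
]

# Same predicate as lines.filter(...): keep only error lines.
errors = [s for s in log_lines if s.startswith("ERROR")]

# Same function as errors.map(...): take the third tab-separated field.
messages = [s.split("\t")[2] for s in errors]

# Same queries as the two count() actions.
mysql_count = sum(1 for m in messages if "MySQL" in m)
redis_count = sum(1 for m in messages if "Redis" in m)

print(messages)
print(mysql_count, redis_count)
```

Here `messages` is built once and then scanned twice, which is the access pattern that makes caching the intermediate dataset worthwhile.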

7. Example: Logistic Regression
Iterative algorithm used in machine learning
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]
• Hadoop: 110 s / iteration
• Spark: first iteration 80 s, further iterations 1 s
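The per-iteration times quoted for this chart imply the following totals. This is a back-of-the-envelope sketch that assumes the quoted figures hold exactly for every iteration; the function names are hypothetical.

```python
def hadoop_total_seconds(iterations: int) -> int:
    """Total time if every iteration costs 110 s (data re-read each time)."""
    return 110 * iterations


def spark_total_seconds(iterations: int) -> int:
    """Total time if the first iteration costs 80 s and each later one 1 s."""
    if iterations <= 0:
        return 0
    return 80 + 1 * (iterations - 1)


# Iteration counts shown on the chart's x-axis.
for n in (1, 5, 10, 20, 30):
    print(n, hadoop_total_seconds(n), spark_total_seconds(n))
```

Under these assumptions, 30 iterations take 3300 s with the Hadoop figure and 109 s with the Spark figures; the gap widens with the iteration count because only the first Spark iteration pays the cost of loading the data.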

8. On-Disk Performance
Time to sort 100TB
• 2013 Record: Hadoop, 2100 machines, 72 minutes
• 2014 Record: Spark, 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org
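The two records can be compared on wall-clock time and on total machine time with simple arithmetic. The sketch below uses only the figures on the slide; machine-minutes is a rough measure chosen here for illustration and ignores hardware differences between the two clusters.

```python
# Figures from the slide.
hadoop_machines, hadoop_minutes = 2100, 72
spark_machines, spark_minutes = 207, 23

# Wall-clock speedup and reduction in cluster size.
time_speedup = hadoop_minutes / spark_minutes
machine_ratio = hadoop_machines / spark_machines

# Rough resource usage: machines multiplied by minutes.
hadoop_machine_minutes = hadoop_machines * hadoop_minutes
spark_machine_minutes = spark_machines * spark_minutes

print(f"{time_speedup:.2f}x faster on {machine_ratio:.1f}x fewer machines")
print(hadoop_machine_minutes, spark_machine_minutes)
```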

9. Higher-Level Libraries
Spark, with libraries on top:
• Spark Streaming: real-time
• Spark SQL: structured data
• MLlib: machine learning
• GraphX: graph

10. Higher-Level Libraries

    # Load data using SQL
    points = ctx.sql("select latitude, longitude from tweets")

    # Train a machine learning model
    model = KMeans.train(points, 10)

    # Apply it to a stream
    sc.twitterStream(...) \
      .map(lambda t: (model.predict(t.location), 1)) \
      .reduceByWindow("5s", lambda a, b: a + b)
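The last step of this slide (predict a cluster for each tweet, then aggregate over a 5-second window) can be imitated in plain Python. This is a sketch under several assumptions: the cluster centers are fixed by hand rather than trained with KMeans, events are (timestamp, location) pairs invented for illustration, windows are non-overlapping, and the helper names `nearest_center` and `counts_per_window` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def nearest_center(point: Point, centers: List[Point]) -> int:
    """Return the index of the closest center (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: (point[0] - centers[i][0]) ** 2
        + (point[1] - centers[i][1]) ** 2,
    )


def counts_per_window(
    events: List[Tuple[float, Point]],
    centers: List[Point],
    window_seconds: float = 5.0,
) -> Dict[int, Counter]:
    """Count events per predicted cluster within each non-overlapping window."""
    windows: Dict[int, Counter] = defaultdict(Counter)
    for timestamp, location in events:
        window_index = int(timestamp // window_seconds)
        windows[window_index][nearest_center(location, centers)] += 1
    return dict(windows)


# Hand-picked centers standing in for a trained model.
centers = [(0.0, 0.0), (10.0, 10.0)]

# Hypothetical stream: (seconds since start, (latitude, longitude)).
events = [
    (0.5, (1.0, 1.0)),
    (2.0, (9.0, 9.5)),
    (4.9, (0.2, -0.3)),
    (6.0, (11.0, 10.0)),
]

result = counts_per_window(events, centers)
print(result)
```

Each event plays the role of the `(model.predict(t.location), 1)` pair, and the per-window `Counter` plays the role of summing those ones within a window.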

11. Demo

12. Spark Community
Over 1000 production users, clusters up to 8000 nodes
Many talks online at spark-summit.org

14. Ongoing Work
• Speeding up Spark through code generation and binary processing (Project Tungsten)
• R interface to Spark (SparkR)
• Real-time machine learning library
• Frontend and backend work in Databricks (visualization, collaboration, auto-scaling, …)

15. Thank you. We’re hiring!

Editor's notes

Add “variables” to the “functions” in functional programming
100 GB of data on 50 m1.xlarge EC2 machines
Alibaba, Tencent
At Berkeley, we have been working on a solution since 2009. This solution consists of a software stack for data analytics, called the Berkeley Data Analytics Stack. The centerpiece of this stack is Spark. Spark has seen significant adoption, with hundreds of companies using it, of which around sixteen have contributed code back. In addition, Spark has been deployed on clusters that exceed 1,000 nodes.
