2016 Spark Summit East Keynote: Matei Zaharia

•

128 recomendaciones•39,953 vistas

Databricks CTO and Spark creator Matei Zaharia's keynote at Spark Summit East 2016: Planned major expansions to Apache Spark

Software

Spark 2.0
Matei Zaharia
February 17, 2016

2015: A Great Year for Spark
2014 2015
Summit
Attendees
2014 2015
Meetup
Members
2014 2015
Total
Contributors
3900
1100
66K
12K
500
1000

Meetup Groups: January 2015
source: meetup.com

Meetup Groups: January 2016
source: meetup.com

New Components
DataFrames
SparkR
Data Sources
Project Tungsten
Streaming ML
Kafka Connector
ML Pipelines
Debug UI
Dataset API

Spark 2.0
Next major release, coming in April / May
Builds on all we learned in past 2 years

Versioning in Spark
In reality, we hate breaking APIs!
Will notdo so exceptfor some dependency conflicts(e.g.Guava)
1.6.0
Patch version (only bug fixes)
Major version (may change APIs)
Minor version (addsAPIs/ features)

Major Features in 2.0
TungstenPhase 2
speedupsof 5-10x
StructuredStreaming
real-time engine
on SQL/DataFrames
Unifying Datasets
and DataFrames

Background on Project Tungsten
CPU speedshave not kept up with I/O in past 5 years
Bring Spark performance closerto bare metal, through:
• Native memory management
• Runtime code generation

Tungsten So Far
Spark 1.4–1.6 added binary storage and basic code gen
DataFrame + Dataset APIs enable Tungstenin userprograms
• Alsoused underSpark SQL + parts of MLlib

New in 2.0
Whole-stage code generation
• Remove expensive iteratorcalls
• Fuse across multiple operators
Spark 1.6 14M
rows/s
Spark 2.0 125M
rows/s
Parquet
in 1.6
11M
rows/s
Parquet
in 2.0
90M
rows/s
Optimized input / output
• Parquet + built-incache
Automatically applies to SQL, DataFrames, Datasets

Background
Real-time processingis increasinglyimportant
Most apps needto combine it with batch & interactive queries
• Trackstate using a stream, then run SQL queries
• Train an ML model offline, then update it
Spark is very well-suitedto do this

Structured Streaming
High-levelstreaming APIbuilt on Spark SQL engine
• Declarative API that extendsDataFrames / Datasets
• Eventtime, windowing,sessions,sources& sinks
Also supports interactive & batch queries
• Aggregate datain a stream,then serve using JDBC
• Change queriesat runtime
• Build and apply ML models Not just streaming, but
“continuous applications”

Goal: end-to-end continuous applications
Example
Reporting Applications
ML Model
Ad-hoc Queries
Traditionalstreaming
Other processingtypes
Kafka DatabaseETL

Details on Structured Streaming
Spark 2.0 will have a first version focusedon ETL [SPARK-8360]
Later versions will add more operators & libraries
See Reynold’s keynote tomorrow for a deep dive!

Datasets and DataFrames
In 2015, we added DataFrames & Datasets as structured data APIs
• DataFrames are collections of rows with a schema
• Datasets add static types,e.g. Dataset[Person]
• Both run on Tungsten
Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]

Example
case class User(name: String, id: Int)
case class Message(user: User, text: String)
dataframe = sqlContext.read.json(“log.json”) // DataFrame, i.e. Dataset[Row]
messages = dataframe.as[Message] // Dataset[Message]
users = messages.filter(m => m.text.contains(“Spark”))
.map(m => m.user) // Dataset[User]
pipeline.train(users) // MLlib takes either DataFrames or Datasets

Benefits
Simpler to understand
• Onlykept Dataset separate to keep binary compatibility in 1.x
Libraries can take data of both forms
With Streaming, same API will also work on streams

Long-Term
RDD will remain the low-levelAPIin Spark
Datasets & DataFrames give richer semanticsand optimizations
• New libraries will increasingly use these as interchange format
• Examples: Structured Streaming,MLlib, GraphFrames

Más contenido relacionado

La actualidad más candente

Jump Start with Apache Spark 2.0 on DatabricksDatabricks

Distributed ML in Apache SparkDatabricks

Spark DataFrames and ML PipelinesDatabricks

Exceptions are the Norm: Dealing with Bad Actors in ETLDatabricks

Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...Databricks

Large-Scale Data Science in Apache Spark 2.0Databricks

Introduction to Apache Spark 2.0Knoldus Inc.

Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ew...Spark Summit

Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit

Building a Data Pipeline from Scratch - Joe CrobakHakka Labs

Apache Spark Usage in the Open Source EcosystemDatabricks

Spark Summit EU talk by Stephan KesslerSpark Summit

Building Realtime Data Pipelines with Kafka Connect and Spark StreamingJen Aman

An Introduction to Sparkling Water by Michal MalohlavaSpark Summit

A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks

Writing Continuous Applications with Structured Streaming PySpark APIDatabricks

Spark streaming state of the unionDatabricks

Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonSpark Summit

Composable Parallel Processing in Apache Spark and WeldDatabricks

Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Databricks

La actualidad más candente (20)

Jump Start with Apache Spark 2.0 on Databricks

Distributed ML in Apache Spark

Spark DataFrames and ML Pipelines

Exceptions are the Norm: Dealing with Bad Actors in ETL

Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...

Large-Scale Data Science in Apache Spark 2.0

Introduction to Apache Spark 2.0

Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ew...

Spark Summit EU talk by Shay Nativ and Dvir Volk

Building a Data Pipeline from Scratch - Joe Crobak

Apache Spark Usage in the Open Source Ecosystem

Spark Summit EU talk by Stephan Kessler

Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

An Introduction to Sparkling Water by Michal Malohlava

A Journey into Databricks' Pipelines: Journey and Lessons Learned

Writing Continuous Applications with Structured Streaming PySpark API

Spark streaming state of the union

Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson

Composable Parallel Processing in Apache Spark and Weld

Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...

Destacado

Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Databricks

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks

Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks

Apache Spark 2.0: Faster, Easier, and SmarterDatabricks

Realtime Analytical Query Processing and Predictive Model Building on High Di...Spark Summit

Parallelizing Existing R Packages with SparkRDatabricks

Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.

The Future of Real-Time in SparkReynold Xin

Flink vs. SparkSlim Baltagi

What's New in Spark 2?Eyal Ben Ivri

Introduction to Spark (Intern Event Presentation)Databricks

Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit

Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...Spark Summit

Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit

Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark Summit

Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks

Apache Spark RDDsDean Chen

Taking Spark Streaming to the Next Level with Datasets and DataFramesDatabricks

Real-time Platform for Second Look Business Use Case Using Spark and Kafka: S...Spark Summit

Lessons Learned From Running Spark On DockerSpark Summit

Destacado (20)

Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...

Apache Spark 2.0: Faster, Easier, and Smarter

Realtime Analytical Query Processing and Predictive Model Building on High Di...

Parallelizing Existing R Packages with SparkR

Top 5 Mistakes to Avoid When Writing Apache Spark Applications

The Future of Real-Time in Spark

Flink vs. Spark

What's New in Spark 2?

Introduction to Spark (Intern Event Presentation)

Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer

Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...

Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...

Spark and the Future of Advanced Analytics by Thomas Dinsmore

Spark SQL Deep Dive @ Melbourne Spark Meetup

Apache Spark RDDs

Taking Spark Streaming to the Next Level with Datasets and DataFrames

Real-time Platform for Second Look Business Use Case Using Spark and Kafka: S...

Lessons Learned From Running Spark On Docker

Similar a 2016 Spark Summit East Keynote: Matei Zaharia

The structured streaming upgrade to Apache Spark and how enterprises can bene...Impetus Technologies

Simplifying Big Data Applications with Apache Spark 2.0Spark Summit

What’s new in Apache Spark 2.3DataWorks Summit

Jump Start with Apache Spark 2.0 on DatabricksAnyscale

Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn

Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks

Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks

New directions for Apache Spark in 2015Databricks

Spark + AI Summit 2020 イベント概要Paulo Gutierrez

Semantic Web Serverswebhostingguy

Jump Start on Apache® Spark™ 2.x with Databricks Databricks

Jumpstart on Apache Spark 2.2 on DatabricksDatabricks

Spark and machine learning in microservices architectureStepan Pushkarev

Spark Streaming @ Berlin Apache Spark Meetup, March 2015Stratio

Introduction to Datasource V2 APIdatamantra

Spark Streaming and MLlib - Hyderabad Spark GroupPhaneendra Chiruvella

Trends for Big Data and Apache Spark in 2017 by Matei ZahariaSpark Summit

Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson

MLeap: Release Spark ML PipelinesDataWorks Summit/Hadoop Summit

Apache Flink: Past, Present and FutureGyula Fóra

Similar a 2016 Spark Summit East Keynote: Matei Zaharia (20)

The structured streaming upgrade to Apache Spark and how enterprises can bene...

Simplifying Big Data Applications with Apache Spark 2.0

What’s new in Apache Spark 2.3

Jump Start with Apache Spark 2.0 on Databricks

Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...

Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell

Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0

New directions for Apache Spark in 2015

Spark + AI Summit 2020 イベント概要

Semantic Web Servers

Jump Start on Apache® Spark™ 2.x with Databricks

Jumpstart on Apache Spark 2.2 on Databricks

Spark and machine learning in microservices architecture

Spark Streaming @ Berlin Apache Spark Meetup, March 2015

Introduction to Datasource V2 API

Spark Streaming and MLlib - Hyderabad Spark Group

Trends for Big Data and Apache Spark in 2017 by Matei Zaharia

Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...

MLeap: Release Spark ML Pipelines

Apache Flink: Past, Present and Future

Más de Databricks

DW Migration Webinar-March 2022.pptxDatabricks

Data Lakehouse Symposium | Day 1 | Part 1Databricks

Data Lakehouse Symposium | Day 1 | Part 2Databricks

Data Lakehouse Symposium | Day 2Databricks

Data Lakehouse Symposium | Day 4Databricks

5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks

Democratizing Data Quality Through a Centralized PlatformDatabricks

Learn to Use Databricks for Data ScienceDatabricks

Why APM Is Not the Same As ML MonitoringDatabricks

The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks

Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks

Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks

Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks

Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks

Sawtooth Windows for Feature AggregationsDatabricks

Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks

Re-imagine Data Monitoring with whylogs and SparkDatabricks

Raven: End-to-end Optimization of ML Prediction QueriesDatabricks

Processing Large Datasets for ADAS Applications using Apache SparkDatabricks

Massive Data Processing in Adobe Using Delta LakeDatabricks

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx

Data Lakehouse Symposium | Day 1 | Part 1

Data Lakehouse Symposium | Day 1 | Part 2

Data Lakehouse Symposium | Day 2

Data Lakehouse Symposium | Day 4

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Democratizing Data Quality Through a Centralized Platform

Learn to Use Databricks for Data Science

Why APM Is Not the Same As ML Monitoring

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Stage Level Scheduling Improving Big Data and AI Integration

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Scaling your Data Pipelines with Apache Spark on Kubernetes

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Sawtooth Windows for Feature Aggregations

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Re-imagine Data Monitoring with whylogs and Spark

Raven: End-to-end Optimization of ML Prediction Queries

Processing Large Datasets for ADAS Applications using Apache Spark

Massive Data Processing in Adobe Using Delta Lake

Último

Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions

VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics

Patterns for automating API delivery. API conferencessuser9e7c64

Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki

Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1

Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions

Strategies for using alternative queries to mitigate zero resultsJean Silva

Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp

Salesforce Implementation Services PPT By ABSYZABSYZ Inc

Precise and Complete Requirements? An Elusive GoalLionel Briand

Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley

Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools

SAM Training Session - How to use EXCEL ?Alexandre Beguel

Osi security architecture in network.pptxVinzoCenzo

UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz

OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan

Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp

eSoftTools IMAP Backup Software and migration toolsosttopstonverter

A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska

JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver

2016 Spark Summit East Keynote: Matei Zaharia

1. Spark 2.0 Matei Zaharia February 17, 2016

2. 2015: A Great Year for Spark 2014 2015 Summit Attendees 2014 2015 Meetup Members 2014 2015 Total Contributors 3900 1100 66K 12K 500 1000

3. Meetup Groups: January 2015 source: meetup.com

4. Meetup Groups: January 2016 source: meetup.com

5. New Components DataFrames SparkR Data Sources Project Tungsten Streaming ML Kafka Connector ML Pipelines Debug UI Dataset API

6. Spark 2.0 Next major release, coming in April / May Builds on all we learned in past 2 years

7. Versioning in Spark In reality, we hate breaking APIs! Will notdo so exceptfor some dependency conflicts(e.g.Guava) 1.6.0 Patch version (only bug fixes) Major version (may change APIs) Minor version (addsAPIs/ features)

8. Major Features in 2.0 TungstenPhase 2 speedupsof 5-10x StructuredStreaming real-time engine on SQL/DataFrames Unifying Datasets and DataFrames

9. Tungsten Phase 2

10. Background on Project Tungsten CPU speedshave not kept up with I/O in past 5 years Bring Spark performance closerto bare metal, through: • Native memory management • Runtime code generation

11. Tungsten So Far Spark 1.4–1.6 added binary storage and basic code gen DataFrame + Dataset APIs enable Tungstenin userprograms • Alsoused underSpark SQL + parts of MLlib

12. New in 2.0 Whole-stage code generation • Remove expensive iteratorcalls • Fuse across multiple operators Spark 1.6 14M rows/s Spark 2.0 125M rows/s Parquet in 1.6 11M rows/s Parquet in 2.0 90M rows/s Optimized input / output • Parquet + built-incache Automatically applies to SQL, DataFrames, Datasets

13. Structured Streaming

14. Background Real-time processingis increasinglyimportant Most apps needto combine it with batch & interactive queries • Trackstate using a stream, then run SQL queries • Train an ML model offline, then update it Spark is very well-suitedto do this

15. Structured Streaming High-levelstreaming APIbuilt on Spark SQL engine • Declarative API that extendsDataFrames / Datasets • Eventtime, windowing,sessions,sources& sinks Also supports interactive & batch queries • Aggregate datain a stream,then serve using JDBC • Change queriesat runtime • Build and apply ML models Not just streaming, but “continuous applications”

16. Goal: end-to-end continuous applications Example Reporting Applications ML Model Ad-hoc Queries Traditionalstreaming Other processingtypes Kafka DatabaseETL

17. Details on Structured Streaming Spark 2.0 will have a first version focusedon ETL [SPARK-8360] Later versions will add more operators & libraries See Reynold’s keynote tomorrow for a deep dive!

18. Datasets & DataFrames

19. Datasets and DataFrames In 2015, we added DataFrames & Datasets as structured data APIs • DataFrames are collections of rows with a schema • Datasets add static types,e.g. Dataset[Person] • Both run on Tungsten Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]

20. Example case class User(name: String, id: Int) case class Message(user: User, text: String) dataframe = sqlContext.read.json(“log.json”) // DataFrame, i.e. Dataset[Row] messages = dataframe.as[Message] // Dataset[Message] users = messages.filter(m => m.text.contains(“Spark”)) .map(m => m.user) // Dataset[User] pipeline.train(users) // MLlib takes either DataFrames or Datasets

21. Benefits Simpler to understand • Onlykept Dataset separate to keep binary compatibility in 1.x Libraries can take data of both forms With Streaming, same API will also work on streams

22. Long-Term RDD will remain the low-levelAPIin Spark Datasets & DataFrames give richer semanticsand optimizations • New libraries will increasingly use these as interchange format • Examples: Structured Streaming,MLlib, GraphFrames

23. Thank you! Enjoy Spark Summit

2016 Spark Summit East Keynote: Matei Zaharia

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (20)

Similar a 2016 Spark Summit East Keynote: Matei Zaharia

Similar a 2016 Spark Summit East Keynote: Matei Zaharia (20)

Más de Databricks

Más de Databricks (20)

Último

Último (20)

2016 Spark Summit East Keynote: Matei Zaharia