New directions for Apache Spark in 2015


This talk covers new directions for Apache Spark in 2015: higher-level interfaces for data science (DataFrames, SparkR, machine learning pipelines) and a platform API for external data sources. It also summarizes Spark's growth in 2014 to 500 total contributors, 370,000 lines of code, and 500 active production deployments. The stated goal is for Spark to become one engine for all data sources, workloads, and environments.


New Directions for Spark in 2015
Matei Zaharia
February 20, 2015

What is Apache Spark?
Fast and general engine for big data processing with
libraries for SQL, streaming, advanced analytics
Most active open source project in big data

About Databricks
Founded by the creators of Spark in 2013
Largest organization contributing to Spark
– 3/4 of the code in 2014
End-to-end hosted service, Databricks Cloud

2014: an Amazing Year for Spark
Total contributors: 150 => 500
Lines of code: 190K => 370K
500 active production deployments

Contributors per Month to Spark
[Figure: contributors per month, 2011 to 2015 (y-axis 0 to 100)]
Most active project at Apache

On-Disk Sort Record: Time to sort 100TB
2013 Record: Hadoop, 2100 machines, 72 minutes
2014 Record: Spark, 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org
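
The two records differ in both cluster size and wall-clock time, so a rough way to compare them is total machine-minutes. The sketch below only does arithmetic on the figures quoted on the slide; the helper names are illustrative, and machine-minutes is an assumed comparison metric, not one the benchmark itself reports.

```python
# Figures quoted on the slide (Daytona GraySort, 100 TB sorted on disk).
DATA_TB = 100

records = {
    "Hadoop (2013)": {"machines": 2100, "minutes": 72},
    "Spark (2014)": {"machines": 207, "minutes": 23},
}


def machine_minutes(entry):
    """Total cluster time consumed: machines multiplied by wall-clock minutes."""
    return entry["machines"] * entry["minutes"]


def tb_per_minute(entry):
    """Aggregate sort throughput of the whole cluster."""
    return DATA_TB / entry["minutes"]


hadoop = records["Hadoop (2013)"]
spark = records["Spark (2014)"]

speedup = hadoop["minutes"] / spark["minutes"]            # wall-clock ratio
fewer_machines = hadoop["machines"] / spark["machines"]   # cluster-size ratio
resource_ratio = machine_minutes(hadoop) / machine_minutes(spark)

for name, entry in records.items():
    print(f"{name}: {machine_minutes(entry)} machine-minutes, "
          f"{tb_per_minute(entry):.2f} TB/min")
print(f"Spark: {speedup:.1f}x faster on {fewer_machines:.1f}x fewer machines "
      f"({resource_ratio:.1f}x fewer machine-minutes)")
```

This ignores hardware differences between the two clusters, so it is a back-of-the-envelope comparison only.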

[Figure: distributors and applications]

New Directions in 2015
Data Science: high-level interfaces similar to single-machine tools
Platform Interfaces: plug in data sources and algorithms

DataFrames
Similar API to data frames in R and Pandas
Automatically optimized via Spark SQL
Coming in Spark 1.3

    df = jsonFile("tweets.json")
    (df[df["user"] == "matei"]
        .groupBy("date")
        .sum("retweets"))

[Figure: running time for Python, Scala, and DataFrame implementations (axis 0 to 10)]
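
To make the semantics of the query above concrete, here is a minimal single-machine sketch in plain Python (no Spark involved). The sample records are hypothetical; only the field names `user`, `date`, and `retweets` come from the slide. Spark would express the same filter, group, and sum declaratively and let Spark SQL optimize and distribute it.

```python
from collections import defaultdict

# Hypothetical sample of tweets.json records; field names follow the slide.
tweets = [
    {"user": "matei", "date": "2015-02-19", "retweets": 3},
    {"user": "matei", "date": "2015-02-19", "retweets": 5},
    {"user": "matei", "date": "2015-02-20", "retweets": 2},
    {"user": "other", "date": "2015-02-20", "retweets": 9},
]


def retweets_per_date(rows, user):
    """Filter rows by user, group them by date, and sum retweets per group."""
    totals = defaultdict(int)
    for row in rows:
        if row["user"] == user:                     # df[df["user"] == "matei"]
            totals[row["date"]] += row["retweets"]  # .groupBy("date").sum("retweets")
    return dict(totals)


result = retweets_per_date(tweets, "matei")
print(result)
```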

R Interface (SparkR)
Arrives in Spark 1.4 (June)
Exposes DataFrames, RDDs, and ML library in R

    df = jsonFile("tweets.json")
    summarize(
      group_by(
        df[df$user == "matei",],
        "date"),
      sum("retweets"))

Machine Learning Pipelines
High-level API inspired by SciKit-Learn
Featurization, evaluation, model tuning

    tokenizer = Tokenizer()
    tf = HashingTF(numFeatures=1000)
    lr = LogisticRegression()
    pipe = Pipeline([tokenizer, tf, lr])
    model = pipe.fit(df)

[Figure: pipeline diagram, DataFrame → tokenizer → TF → LR → model]
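
The pipeline idea (a chain of feature transformers ending in an estimator, all trained through one `fit` call) can be sketched in plain Python. The classes below are toy stand-ins that borrow the slide's names; they are not Spark's implementations, and the `fit(texts, labels)` signature, the training data, and the hyperparameters are assumptions made for illustration.

```python
import math
import zlib


class Tokenizer:
    """Transformer: lowercase each text and split it into words."""

    def transform(self, texts):
        return [t.lower().split() for t in texts]


class HashingTF:
    """Transformer: hash tokens into a fixed-length term-frequency vector."""

    def __init__(self, num_features=1000):
        self.num_features = num_features

    def transform(self, token_lists):
        vectors = []
        for tokens in token_lists:
            vec = [0.0] * self.num_features
            for tok in tokens:
                # crc32 is stable across runs, unlike the built-in hash().
                vec[zlib.crc32(tok.encode("utf-8")) % self.num_features] += 1.0
            vectors.append(vec)
        return vectors


class LogisticRegression:
    """Estimator: logistic regression trained by stochastic gradient descent."""

    def __init__(self, epochs=200, step_size=0.5):
        self.epochs = epochs
        self.step_size = step_size
        self.weights = []
        self.bias = 0.0

    def _prob(self, x):
        z = self.bias + sum(w * v for w, v in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y):
        self.weights = [0.0] * len(X[0])
        self.bias = 0.0
        for _ in range(self.epochs):
            for xi, yi in zip(X, y):
                err = self._prob(xi) - yi
                self.weights = [w - self.step_size * err * v
                                for w, v in zip(self.weights, xi)]
                self.bias -= self.step_size * err
        return self

    def predict(self, X):
        return [1 if self._prob(x) >= 0.5 else 0 for x in X]


class Pipeline:
    """Run every transformer in order, then fit or apply the final estimator."""

    def __init__(self, stages):
        self.stages = stages

    def _featurize(self, texts):
        data = texts
        for stage in self.stages[:-1]:
            data = stage.transform(data)
        return data

    def fit(self, texts, labels):
        self.stages[-1].fit(self._featurize(texts), labels)
        return self

    def predict(self, texts):
        return self.stages[-1].predict(self._featurize(texts))


# Hypothetical toy data: label 1 for positive remarks, 0 for failures.
texts = ["spark is fast", "spark is great", "slow job failed", "job failed again"]
labels = [1, 1, 0, 0]

pipe = Pipeline([Tokenizer(), HashingTF(num_features=1000), LogisticRegression()])
model = pipe.fit(texts, labels)
print(model.predict(["spark is great", "job failed"]))
```

Because featurization lives inside the pipeline, the same steps are applied at training and prediction time, which is the main design benefit of the pattern.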

External Data Sources
Platform API to plug smart data sources into Spark
Returns DataFrames usable in Spark apps or SQL
Pushes logic into sources

Query sent to Spark:
    SELECT * FROM mysql_users u JOIN hive_logs h WHERE u.lang = 'en'
Query Spark pushes to the source:
    SELECT * FROM users WHERE lang = 'en'
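
"Pushes logic into sources" means the filter runs inside the external system, so fewer rows cross the wire. The sketch below illustrates the idea with Python's built-in `sqlite3` module standing in for the external `users` source; the table contents and function names are hypothetical, and this is not Spark's data source API.

```python
import sqlite3


def make_users_db():
    """Create an in-memory table standing in for the external users source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, lang TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("ana", "es"), ("bob", "en"), ("chen", "zh"), ("dee", "en")],
    )
    return conn


def scan_without_pushdown(conn, lang):
    """Fetch every row, then filter in the engine (here: plain Python)."""
    rows = conn.execute("SELECT name, lang FROM users").fetchall()
    return len(rows), [r for r in rows if r[1] == lang]


def scan_with_pushdown(conn, lang):
    """Send the predicate to the source so only matching rows come back."""
    rows = conn.execute(
        "SELECT name, lang FROM users WHERE lang = ?", (lang,)
    ).fetchall()
    return len(rows), rows


conn = make_users_db()
fetched_all, result_a = scan_without_pushdown(conn, "en")
fetched_few, result_b = scan_with_pushdown(conn, "en")
print(f"without pushdown: fetched {fetched_all} rows, kept {len(result_a)}")
print(f"with pushdown:    fetched {fetched_few} rows, kept {len(result_b)}")
```

Both scans return the same answer; the difference is how many rows leave the source before filtering.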

Goal: one engine for all data sources, workloads and environments

To Learn More
Two free massive online courses on Spark: databricks.com/moocs
Try Databricks Cloud: databricks.com

