1
SPARK
INSTRUCTOR:
DR. SHIYONG LU
BY:
SRINATH REDDY KOTU
GRADUATE STUDENT
2
Data Processing Goals
Low latency (interactive) queries on
historical data: enable faster decisions
E.g., identify why a site is slow and fix it
Low latency queries on live data (streaming):
enable decisions on real-time data
E.g., detect & block worms in real time (a worm may
infect 1 million hosts in 1.3 seconds)
Sophisticated data processing: enable
“better” decisions
E.g., anomaly detection, trend analysis
3
The Need for Unification (1/2)
Today’s state-of-the-art analytics stack
Batch stack
(e.g., Hadoop)
Input
Splitter
Streaming stack
(e.g., Storm)
Real-Time
Analytics
Ad-Hoc queries
on historical data
Interactive queries
on historical data
Interactive queries (e.g.,
HBase, Impala, SQL)
Challenges:
Need to maintain three separate stacks
Expensive and complex
Hard to compute consistent metrics across stacks
Hard and slow to share data across stacks
4
Data Processing Stack
Data Processing Layer
Resource Management Layer
Storage Layer
5
Hadoop Stack
Data Processing Layer
Resource Management Layer
Storage Layer
…
Hadoop MR
Hive Pig
HBase Storm
Hadoop Yarn
HDFS, S3, …
6
BDAS Stack
Data Processing Layer
Resource Management Layer
Storage Layer
Mesos
Spark
Spark
Streaming Shark SQL
BlinkDB
GraphX
MLlib
MLBase
HDFS, S3, …
Tachyon
7
How do BDAS & Hadoop fit together?
Mesos Mesos
Spark
Spark
Streaming Shark SQL
BlinkDB
GraphX
MLlib
MLBase
HDFS, S3, …
Tachyon
Hadoop Yarn
Spark
Streaming
Shark
SQL
GraphX
ML library
BlinkDB
MLbase
Spark Hadoop MR
Hive Pig HBase
Storm
8
Apache Mesos (cluster manager)
Enable multiple frameworks to share same
cluster resources (e.g., Hadoop, Storm, Spark)
Twitter’s large scale deployment
6,000+ servers,
500+ engineers running jobs on Mesos
Mesosphere: startup to commercialize Mesos
9
Apache Spark
Distributed Execution Engine
Fault-tolerant, efficient in-memory storage (RDDs)
Powerful programming model and APIs (Scala,
Python, Java)
Fast: up to 100x faster than Hadoop
Easy to use: 5-10x less code than Hadoop
General: support interactive & iterative apps
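The RDD bullet above can be illustrated with a toy model. What follows is a minimal pure-Python sketch, not Spark's API: `ToyRDD` and its `map`, `filter`, `compute`, `drop_cache`, and `collect` methods are hypothetical names chosen for illustration. It shows the idea behind RDD fault tolerance: a dataset remembers the transformations (its lineage) that produced it, so a lost in-memory partition can be recomputed rather than restored from a replica.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions plus the lineage that built them."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent        # upstream ToyRDD, or None for source data
        self.fn = fn                # per-partition function applied to the parent
        self._source = partitions   # only set for source datasets
        self._cache = {}            # partition index -> computed list (in memory)

    def num_partitions(self):
        if self.parent is None:
            return len(self._source)
        return self.parent.num_partitions()

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        """Return partition i, recomputing it from lineage if it is not cached."""
        if i not in self._cache:
            if self.parent is None:
                self._cache[i] = list(self._source[i])
            else:
                self._cache[i] = self.fn(self.parent.compute(i))
        return self._cache[i]

    def drop_cache(self, i):
        """Simulate losing the in-memory copy of partition i (e.g. a node failure)."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


source = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
evens_squared = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = evens_squared.collect()
evens_squared.drop_cache(1)       # lose one cached partition
after = evens_squared.collect()   # rebuilt from lineage, not from a replica
```

Only the missing partition is recomputed; the other partition is served from the in-memory cache, which is the property that makes lineage-based recovery cheap in this toy model.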
10
Spark Streaming
Large scale streaming computation
Implement streaming as a sequence of <1s jobs
Fault tolerant
Handle stragglers
Ensure exactly-once semantics
Integrated with Spark: unifies batch, interactive,
and streaming computations
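The "sequence of <1s jobs" bullet describes a micro-batch design. Here is a minimal pure-Python sketch under stated assumptions: events are `(timestamp_seconds, word)` pairs, the batch interval is 0.5 s, and `micro_batches` and `count_words` are hypothetical helper names, not Spark Streaming APIs. Each interval's events are handed to an ordinary batch function, and running state is carried from one batch to the next.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive fixed-length intervals."""
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // interval)].append(value)
    # Emit batches in time order; intervals with no events are skipped.
    return [buckets[k] for k in sorted(buckets)]


def count_words(batch):
    """An ordinary batch job: the same code could run on historical data."""
    return Counter(batch)


events = [(0.1, "spark"), (0.3, "mesos"), (0.6, "spark"),
          (0.9, "spark"), (1.2, "mesos")]

running_total = Counter()
per_batch = []
for batch in micro_batches(events, interval=0.5):
    counts = count_words(batch)     # one small batch job per interval
    per_batch.append(counts)
    running_total.update(counts)    # state carried across batches
```

Because `count_words` knows nothing about streaming, the same function serves both the batch and the streaming path, which is the unification the slide refers to.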
11
Shark
Hive over Spark: full support for HQL and UDFs
Up to 100x faster than Hive when input is in memory
Up to 5-10x faster when input is on disk
Running on hundreds of nodes at Yahoo!
12
BlinkDB
Trade between query performance and accuracy
using sampling
Why?
In-memory processing doesn’t guarantee interactive
processing
E.g., tens of seconds just to scan 512 GB of RAM!
Gap between memory capacity and transfer rate
increasing
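The trade between performance and accuracy can be sketched with plain uniform sampling. This is an illustrative assumption, not BlinkDB's implementation: `approx_mean` is a hypothetical helper, the data is synthetic, and the error bar uses a simple normal approximation. The point is only that reading a small sample gives an estimate with a quantifiable error, and that a larger sample narrows the error at the cost of reading more rows.

```python
import math
import random
import statistics


def approx_mean(values, sample_size, seed=0):
    """Estimate the mean from a uniform random sample.

    Returns (estimate, half_width), where half_width is an approximate
    95% confidence half-width from the normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * std_err


# Synthetic "table column": 200,000 response times in milliseconds.
rng = random.Random(42)
latencies = [rng.uniform(0, 200) for _ in range(200_000)]

exact = statistics.fmean(latencies)                  # full scan
small, small_err = approx_mean(latencies, 1_000)     # reads 0.5% of the rows
large, large_err = approx_mean(latencies, 50_000)    # reads 25% of the rows
```

A caller would pick the sample size from the error it can tolerate or the time it can afford, which is the choice the slide describes.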
13
GraphX
Combine data-parallel and graph-parallel
computations
Provide powerful abstractions:
PowerGraph, Pregel implemented in less than 20
LOC!
Leverage Spark’s fault tolerance
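To give a feel for why a Pregel-style abstraction fits in a few lines, here is a minimal single-machine sketch in pure Python. It is not GraphX code: the `pregel` function and its simplified signature are assumptions for illustration, with the three user-supplied functions playing the roles of a vertex program, a message sender, and a message combiner. The example runs single-source shortest paths in bulk-synchronous supersteps.

```python
import math


def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iters=50):
    """Run bulk-synchronous supersteps until no messages are sent.

    vertices: dict of vertex_id -> value
    edges: list of (src, dst, weight)
    """
    # Superstep 0: every vertex receives the initial message.
    values = {v: vprog(v, val, initial_msg) for v, val in vertices.items()}
    active = set(values)
    for _ in range(max_iters):
        inbox = {}
        for src, dst, w in edges:
            if src not in active:
                continue
            msg = send_msg(values[src], values[dst], w)
            if msg is not None:
                inbox[dst] = merge_msg(inbox[dst], msg) if dst in inbox else msg
        if not inbox:
            break
        for v, msg in inbox.items():
            values[v] = vprog(v, values[v], msg)
        active = set(inbox)   # only vertices that received a message stay active
    return values


# Single-source shortest paths from vertex "a".
edges = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0), ("c", "d", 1.0)]
start = {v: (0.0 if v == "a" else math.inf) for v in "abcd"}

dist = pregel(
    start, edges, initial_msg=math.inf,
    vprog=lambda vid, d, msg: min(d, msg),
    send_msg=lambda d_src, d_dst, w: d_src + w if d_src + w < d_dst else None,
    merge_msg=min,
)
```

The loop over edges is the data-parallel part and the per-vertex update is the graph-parallel part; a distributed engine would run both over partitioned data rather than Python dicts.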
14
MLlib and MLbase
MLlib: high quality library for ML algorithms
MLbase: make ML accessible to non-experts
Declarative interface: allow users to say what they
want
E.g., classify(data)
Automatically pick best algorithm for given data, time
Allow developers to easily add and test new
algorithms
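The declarative `classify(data)` idea can be sketched as a function that tries several candidate algorithms and returns whichever scores best on held-out rows. This is a toy illustration in pure Python, not MLbase's interface or optimizer: the candidate set, the `holdout_every` parameter, and all helper names are assumptions.

```python
from collections import Counter


def train_majority(train):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def train_nearest_neighbor(train):
    """1-nearest-neighbor on squared Euclidean distance."""
    def predict(x):
        nearest = min(
            train,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
        )
        return nearest[1]
    return predict


# New algorithms are added by registering another training function here.
CANDIDATES = {"majority": train_majority, "nearest_neighbor": train_nearest_neighbor}


def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)


def classify(data, holdout_every=3):
    """Train each candidate, score it on held-out rows, return the best one."""
    holdout = data[::holdout_every]
    train = [row for i, row in enumerate(data) if i % holdout_every != 0]
    scored = {name: accuracy(fit(train), holdout) for name, fit in CANDIDATES.items()}
    best = max(scored, key=scored.get)
    return best, CANDIDATES[best](data), scored


# Two well-separated clusters: label 0 near the origin, label 1 near (10, 10).
data = [((i % 3, i % 2), 0) for i in range(9)]
data += [((10 + i % 3, 10 + i % 2), 1) for i in range(9)]
data = data[::2] + data[1::2]   # reorder so held-out rows include both labels

best_name, model, scores = classify(data)
```

The user only states the task; the choice of algorithm is made from the data, and `CANDIDATES` is the single place a developer would extend.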
15
Tachyon
In-memory, fault-tolerant storage system
Flexible API, including HDFS API
Allow multiple frameworks (including Hadoop) to
share in-memory data
16
Thank You


Editor's notes

  1. So what does this mean? Well, this means that we want low response time on historical data, since the faster we can make a decision the better. We want the ability to perform queries on live data, since decisions on real-time data are better than decisions on stale data. Finally, we want to perform sophisticated processing on massive data since, in principle, processing more data will lead to better decisions.