SlideShare una empresa de Scribd logo
1 de 36
In-Memory Cluster Computing for
Iterative and Interactive Applications
Presented By
Kelly Technologies
www.kellytechno.com
 Commodity clusters have become an important
computing platform for a variety of applications
 In industry: search, machine translation, ad
targeting, …
 In research: bioinformatics, NLP, climate simulation,
…
 High-level cluster programming models like
MapReduce power many of these apps
 Theme of this work: provide similarly powerful
abstractions for a broader class of applications
www.kellytechno.com
Current popular programming models for
clusters transform data flowing from stable
storage to stable storage
E.g., MapReduce:
MapMap
MapMap
MapMap
ReduceReduce
ReduceReduce
Input Output
www.kellytechno.com
 Current popular programming models for
clusters transform data flowing from stable
storage to stable storage
 E.g., MapReduce:
MapMap
MapMap
MapMap
ReduceReduce
ReduceReduce
Input Output
www.kellytechno.com
 Acyclic data flow is a powerful abstraction, but
is not efficient for applications that repeatedly
reuse a working set of data:
 Iterative algorithms (many in machine
learning)
 Interactive data mining tools (R, Excel,
Python)
 Spark makes working sets a first-class concept
to efficiently support these apps
www.kellytechno.com
 Provide distributed memory abstractions for
clusters to support apps with working sets
 Retain the attractive properties of
MapReduce:
 Fault tolerance (for crashes & stragglers)
 Data locality
 Scalability
Solution: augment data flow model with
“resilient distributed datasets” (RDDs)
www.kellytechno.com
 We conjecture that Spark’s combination of data
flow with RDDs unifies many proposed cluster
programming models
 General data flow models: MapReduce, Dryad, SQL
 Specialized models for stateful apps: Pregel (BSP),
HaLoop (iterative MR), Continuous Bulk Processing
 Instead of specialized APIs for one type of app,
give user first-class control of distrib. datasets
www.kellytechno.com
 Spark programming model
 Example applications
 Implementation
 Demo
 Future work
www.kellytechno.com
 Resilient distributed datasets (RDDs)
 Immutable collections partitioned across cluster
that can be rebuilt if a partition is lost
 Created by transforming data in stable storage
using data flow operators (map, filter, group-by,
…)
 Can be cached across parallel operations
 Parallel operations on RDDs
 Reduce, collect, count, save, …
 Restricted shared variables
 Accumulators, broadcast variables
www.kellytechno.com
 Load error messages from a log into memory,
then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(‘t’)(2))
cachedMsgs = messages.cache()
Block 1Block 1
Block 2Block 2
Block 3Block 3
WorkerWorker
WorkerWorker
WorkerWorker
DriverDriver
cachedMsgs.filter(_.contains(“foo”)).count
cachedMsgs.filter(_.contains(“bar”)).count
. . .
tasks
results
Cache 1Cache 1
Cache 2Cache 2
Cache 3Cache 3
Base RDDBase RDD
Transformed RDDTransformed RDD
Cached RDDCached RDD
Parallel operationParallel operation
Result: full-text search of
Wikipedia in <1 sec (vs 20 sec for
on-disk data)
www.kellytechno.com
 An RDD is an immutable, partitioned, logical
collection of records
 Need not be materialized, but rather contains
information to rebuild a dataset from stable storage
 Partitioning can be based on a key in each
record (using hash or range partitioning)
 Built using bulk transformations on other RDDs
 Can be cached for future reuse
www.kellytechno.com
Transformations
(define a new RDD)
map
filter
sample
union
groupByKey
reduceByKey
join
cache
…
Parallel operations
(return a result to driver)
reduce
collect
count
save
lookupKey
…
www.kellytechno.com
 RDDs maintain lineage information that can
be used to reconstruct lost partitions
 Ex:cachedMsgs = textFile(...).filter(_.contains(“error”))
.map(_.split(‘t’)(2))
.cache()
HdfsRDD
path: hdfs://…
HdfsRDD
path: hdfs://…
FilteredRDD
func: contains(...)
FilteredRDD
func: contains(...)
MappedRDD
func: split(…)
MappedRDD
func: split(…)
CachedRDDCachedRDD
www.kellytechno.com
 Consistency is easy due to immutability
 Inexpensive fault tolerance (log lineage rather
than replicating/checkpointing data)
 Locality-aware scheduling of tasks on partitions
 Despite being restricted, model seems
applicable to a broad variety of applications
www.kellytechno.com
Concern RDDs Distr. Shared Mem.
Reads Fine-grained Fine-grained
Writes Bulk transformations Fine-grained
Consistency Trivial (immutable) Up to app / runtime
Fault recovery Fine-grained and low-
overhead using lineage
Requires checkpoints
and program rollback
Straggler
mitigation
Possible using
speculative execution
Difficult
Work placement Automatic based on
data locality
Up to app (but runtime
aims for transparency)
www.kellytechno.com
DryadLINQ
Language-integratedAPIwith
SQL-likeoperationsonlazy
datasets
Cannothaveadatasetpersist
acrossqueries
Relationaldatabases
Lineage/provenance,logical
logging,materializedviews
Piccolo
Parallelprogramswithshared
distributedtables;similarto
distributedsharedmemory
IterativeMapReduce(Twister
andHaLoop)
Cannotdefinemultiple
distributeddatasets,run
differentmap/reducepairson
them,orquerydatainteractively
RAMCloud
Allowsrandomread/writetoall
cells,requiringloggingmuchlike
distributedsharedmemory
systems
www.kellytechno.com
 Spark programming model
 Example applications
 Implementation
 Demo
 Future work
www.kellytechno.com
 Goal: find best line separating two sets of
points
+
–
+
+
+
+
+
+
+
+
– –
–
–
–
–
–
–
+
target
–
random initial line
www.kellytechno.com
 val data = spark.textFile(...).map(readPoint).cache()
 var w = Vector.random(D)
 for (i <- 1 to ITERATIONS) {
 val gradient = data.map(p =>
 (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
 ).reduce(_ + _)
 w -= gradient
 }
 println("Final w: " + w)
www.kellytechno.com
127 s / iteration
first iteration 174 s
further iterations 6 s
www.kellytechno.com
 MapReduce data flow can be expressed
using RDD transformations
res = data.flatMap(rec => myMapFunc(rec))
.groupByKey()
.map((key, vals) => myReduceFunc(key, vals))
Or with combiners:
res = data.flatMap(rec => myMapFunc(rec))
.reduceByKey(myCombiner)
.map((key, val) => myReduceFunc(key, val))
www.kellytechno.com
val lines = spark.textFile(“hdfs://...”)
val counts = lines.flatMap(_.split(“s”))
.reduceByKey(_ + _)
counts.save(“hdfs://...”)
www.kellytechno.com
 Graph processing framework from Google that
implements Bulk Synchronous Parallel model
 Vertices in the graph have state
 At each superstep, each node can update its
state and send messages to nodes in future
step
 Good fit for PageRank, shortest paths, …
www.kellytechno.com
Input graphInput graph Vertex state 1Vertex state 1 Messages 1Messages 1
Superstep 1Superstep 1
Vertex state 2Vertex state 2 Messages 2Messages 2
Superstep 2Superstep 2
. . .
Group by vertex IDGroup by vertex ID
Group by vertex IDGroup by vertex ID
www.kellytechno.com
Input graphInput graph Vertex ranks 1Vertex ranks 1 Contributions 1Contributions 1
Superstep 1 (add contribs)Superstep 1 (add contribs)
Vertex ranks 2Vertex ranks 2 Contributions 2Contributions 2
Superstep 2 (add contribs)Superstep 2 (add contribs)
. . .
Group & add by
vertex
Group & add by
vertex
Group & add by vertexGroup & add by vertex
www.kellytechno.com
 Separate RDDs for immutable graph state and
for vertex states and messages at each
iteration
 Use groupByKey to perform each step
 Cache the resulting vertex and message RDDs
 Optimization: co-partition input graph and
vertex state RDDs to reduce communication
www.kellytechno.com
 Twitter spam classification (Justin Ma)
 EM alg. for traffic prediction (Mobile
Millennium)
 K-means clustering
 Alternating Least Squares matrix factorization
 In-memory OLAP aggregation on Hive data
 SQL on Spark (future work)
www.kellytechno.com
 Spark programming model
 Example applications
 Implementation
 Demo
 Future work
www.kellytechno.com
 Spark runs on the Mesos
cluster manager [NSDI
11], letting it share
resources with Hadoop &
other apps
 Can read from any
Hadoop input source (e.g.
HDFS)
 ~6000 lines of Scala code thanks to building on
Mesos
SparkSpark HadoopHadoop MPIMPI
MesosMesos
NodeNode NodeNode NodeNode NodeNode
…
www.kellytechno.com
 Scala closures are Serializable Java objects
 Serialize on driver, load & run on workers
 Not quite enough
 Nested closures may reference entire outer scope
 May pull in non-Serializable variables not used inside
 Solution: bytecode analysis + reflection
 Shared variables implemented using custom
serialized form (e.g. broadcast variable contains
pointer to BitTorrent tracker)
www.kellytechno.com
 Modified Scala interpreter to allow Spark to
be used interactively from the command line
 Required two changes:
 Modified wrapper code generation so that each
“line” typed has references to objects for its
dependencies
 Place generated classes in distributed filesystem
 Enables in-memory exploration of big data
www.kellytechno.com
 Spark programming model
 Example applications
 Implementation
 Demo
 Future work
www.kellytechno.com
 Further extend RDD capabilities
 Control over storage layout (e.g. column-oriented)
 Additional caching options (e.g. on disk, replicated)
 Leverage lineage for debugging
 Replay any task, rebuild any intermediate RDD
 Adaptive checkpointing of RDDs
 Higher-level analytics tools built on top of Spark
www.kellytechno.com
 By making distributed datasets a first-class
primitive, Spark provides a simple, efficient
programming model for stateful data analytics
 RDDs provide:
 Lineage info for fault recovery and debugging
 Adjustable in-memory caching
 Locality-aware parallel operations
 We plan to make Spark the basis of a suite of
batch and interactive data analysis tools
www.kellytechno.com
Setofpartitions
Preferredlocationsfor
eachpartition
Optionalpartitioning
scheme(hashorrange)
Storagestrategy(lazy
orcached)
ParentRDDs(forminga
lineageDAG)
www.kellytechno.com
Presented By
Kelly Technologies
www.kellytechno.com

Más contenido relacionado

La actualidad más candente

Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
vithakur
 

La actualidad más candente (20)

Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
 
2014.06.24.what is ubix
2014.06.24.what is ubix2014.06.24.what is ubix
2014.06.24.what is ubix
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
Map/Reduce intro
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Spark: Taming Big Data
Spark: Taming Big DataSpark: Taming Big Data
Spark: Taming Big Data
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
SparkSQL and Dataframe
SparkSQL and DataframeSparkSQL and Dataframe
SparkSQL and Dataframe
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 

Destacado

Destacado (10)

Oracle training in hyderabad
Oracle training in hyderabadOracle training in hyderabad
Oracle training in hyderabad
 
Hadoop training institute in hyderabad
Hadoop training institute in hyderabadHadoop training institute in hyderabad
Hadoop training institute in hyderabad
 
Oracle training-institutes-in-hyderabad
Oracle training-institutes-in-hyderabadOracle training-institutes-in-hyderabad
Oracle training-institutes-in-hyderabad
 
Data science institutes in hyderabad
Data science institutes in hyderabadData science institutes in hyderabad
Data science institutes in hyderabad
 
Hadoop institutes in hyderabad
Hadoop institutes in hyderabadHadoop institutes in hyderabad
Hadoop institutes in hyderabad
 
Sas training in hyderabad
Sas training in hyderabadSas training in hyderabad
Sas training in hyderabad
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 

Similar a Spark training-in-bangalore

End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
Akhil Das
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 

Similar a Spark training-in-bangalore (20)

Scala+data
Scala+dataScala+data
Scala+data
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 

Más de Kelly Technologies

Salesforce crm-training-in-bangalore
Salesforce crm-training-in-bangaloreSalesforce crm-training-in-bangalore
Salesforce crm-training-in-bangalore
Kelly Technologies
 

Más de Kelly Technologies (13)

Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
 
Data science training in hyderabad
Data science training in hyderabadData science training in hyderabad
Data science training in hyderabad
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
Websphere mb training in hyderabad
Websphere mb training in hyderabadWebsphere mb training in hyderabad
Websphere mb training in hyderabad
 
Hadoop training institutes in bangalore
Hadoop training institutes in bangaloreHadoop training institutes in bangalore
Hadoop training institutes in bangalore
 
Hadoop training institute in bangalore
Hadoop training institute in bangaloreHadoop training institute in bangalore
Hadoop training institute in bangalore
 
Tableau training in bangalore
Tableau training in bangaloreTableau training in bangalore
Tableau training in bangalore
 
Salesforce crm-training-in-bangalore
Salesforce crm-training-in-bangaloreSalesforce crm-training-in-bangalore
Salesforce crm-training-in-bangalore
 
Qlikview training in hyderabad
Qlikview training in hyderabadQlikview training in hyderabad
Qlikview training in hyderabad
 
Project Management Planning training in hyderabad
Project Management Planning training in hyderabadProject Management Planning training in hyderabad
Project Management Planning training in hyderabad
 
Oracle training in_hyderabad
Oracle training in_hyderabadOracle training in_hyderabad
Oracle training in_hyderabad
 
Ax finance training in hyderabad
Ax finance training in hyderabadAx finance training in hyderabad
Ax finance training in hyderabad
 

Último

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Último (20)

This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Third Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxThird Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 

Spark training-in-bangalore

  • 1. In-Memory Cluster Computing for Iterative and Interactive Applications Presented By Kelly Technologies www.kellytechno.com
  • 2.  Commodity clusters have become an important computing platform for a variety of applications  In industry: search, machine translation, ad targeting, …  In research: bioinformatics, NLP, climate simulation, …  High-level cluster programming models like MapReduce power many of these apps  Theme of this work: provide similarly powerful abstractions for a broader class of applications www.kellytechno.com
  • 3. Current popular programming models for clusters transform data flowing from stable storage to stable storage E.g., MapReduce: MapMap MapMap MapMap ReduceReduce ReduceReduce Input Output www.kellytechno.com
  • 4.  Current popular programming models for clusters transform data flowing from stable storage to stable storage  E.g., MapReduce: MapMap MapMap MapMap ReduceReduce ReduceReduce Input Output www.kellytechno.com
  • 5.  Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:  Iterative algorithms (many in machine learning)  Interactive data mining tools (R, Excel, Python)  Spark makes working sets a first-class concept to efficiently support these apps www.kellytechno.com
  • 6.  Provide distributed memory abstractions for clusters to support apps with working sets  Retain the attractive properties of MapReduce:  Fault tolerance (for crashes & stragglers)  Data locality  Scalability Solution: augment data flow model with “resilient distributed datasets” (RDDs) www.kellytechno.com
  • 7.  We conjecture that Spark’s combination of data flow with RDDs unifies many proposed cluster programming models  General data flow models: MapReduce, Dryad, SQL  Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MR), Continuous Bulk Processing  Instead of specialized APIs for one type of app, give user first-class control of distrib. datasets www.kellytechno.com
  • 8.  Spark programming model  Example applications  Implementation  Demo  Future work www.kellytechno.com
  • 9.  Resilient distributed datasets (RDDs)  Immutable collections partitioned across cluster that can be rebuilt if a partition is lost  Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)  Can be cached across parallel operations  Parallel operations on RDDs  Reduce, collect, count, save, …  Restricted shared variables  Accumulators, broadcast variables www.kellytechno.com
  • 10.  Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(_.startsWith(“ERROR”)) messages = errors.map(_.split(‘t’)(2)) cachedMsgs = messages.cache() Block 1Block 1 Block 2Block 2 Block 3Block 3 WorkerWorker WorkerWorker WorkerWorker DriverDriver cachedMsgs.filter(_.contains(“foo”)).count cachedMsgs.filter(_.contains(“bar”)).count . . . tasks results Cache 1Cache 1 Cache 2Cache 2 Cache 3Cache 3 Base RDDBase RDD Transformed RDDTransformed RDD Cached RDDCached RDD Parallel operationParallel operation Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) www.kellytechno.com
  • 11.  An RDD is an immutable, partitioned, logical collection of records  Need not be materialized, but rather contains information to rebuild a dataset from stable storage  Partitioning can be based on a key in each record (using hash or range partitioning)  Built using bulk transformations on other RDDs  Can be cached for future reuse www.kellytechno.com
  • 12. Transformations (define a new RDD) map filter sample union groupByKey reduceByKey join cache … Parallel operations (return a result to driver) reduce collect count save lookupKey … www.kellytechno.com
  • 13.  RDDs maintain lineage information that can be used to reconstruct lost partitions  Ex:cachedMsgs = textFile(...).filter(_.contains(“error”)) .map(_.split(‘t’)(2)) .cache() HdfsRDD path: hdfs://… HdfsRDD path: hdfs://… FilteredRDD func: contains(...) FilteredRDD func: contains(...) MappedRDD func: split(…) MappedRDD func: split(…) CachedRDDCachedRDD www.kellytechno.com
  • 14.  Consistency is easy due to immutability  Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data)  Locality-aware scheduling of tasks on partitions  Despite being restricted, model seems applicable to a broad variety of applications www.kellytechno.com
  • 15. Concern RDDs Distr. Shared Mem. Reads Fine-grained Fine-grained Writes Bulk transformations Fine-grained Consistency Trivial (immutable) Up to app / runtime Fault recovery Fine-grained and low- overhead using lineage Requires checkpoints and program rollback Straggler mitigation Possible using speculative execution Difficult Work placement Automatic based on data locality Up to app (but runtime aims for transparency) www.kellytechno.com
  • 17.  Spark programming model  Example applications  Implementation  Demo  Future work www.kellytechno.com
  • 18.  Goal: find best line separating two sets of points + – + + + + + + + + – – – – – – – – + target – random initial line www.kellytechno.com
  • 19.  val data = spark.textFile(...).map(readPoint).cache()  var w = Vector.random(D)  for (i <- 1 to ITERATIONS) {  val gradient = data.map(p =>  (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x  ).reduce(_ + _)  w -= gradient  }  println("Final w: " + w) www.kellytechno.com
  • 20. 127 s / iteration first iteration 174 s further iterations 6 s www.kellytechno.com
  • 21.  MapReduce data flow can be expressed using RDD transformations res = data.flatMap(rec => myMapFunc(rec)) .groupByKey() .map((key, vals) => myReduceFunc(key, vals)) Or with combiners: res = data.flatMap(rec => myMapFunc(rec)) .reduceByKey(myCombiner) .map((key, val) => myReduceFunc(key, val)) www.kellytechno.com
  • 22. val lines = spark.textFile(“hdfs://...”) val counts = lines.flatMap(_.split(“s”)) .reduceByKey(_ + _) counts.save(“hdfs://...”) www.kellytechno.com
  • 23.  Graph processing framework from Google that implements Bulk Synchronous Parallel model  Vertices in the graph have state  At each superstep, each node can update its state and send messages to nodes in future step  Good fit for PageRank, shortest paths, … www.kellytechno.com
  • 24. Input graphInput graph Vertex state 1Vertex state 1 Messages 1Messages 1 Superstep 1Superstep 1 Vertex state 2Vertex state 2 Messages 2Messages 2 Superstep 2Superstep 2 . . . Group by vertex IDGroup by vertex ID Group by vertex IDGroup by vertex ID www.kellytechno.com
  • 25. Input graphInput graph Vertex ranks 1Vertex ranks 1 Contributions 1Contributions 1 Superstep 1 (add contribs)Superstep 1 (add contribs) Vertex ranks 2Vertex ranks 2 Contributions 2Contributions 2 Superstep 2 (add contribs)Superstep 2 (add contribs) . . . Group & add by vertex Group & add by vertex Group & add by vertexGroup & add by vertex www.kellytechno.com
  • 26.  Separate RDDs for immutable graph state and for vertex states and messages at each iteration  Use groupByKey to perform each step  Cache the resulting vertex and message RDDs  Optimization: co-partition input graph and vertex state RDDs to reduce communication www.kellytechno.com
  • 27.  Twitter spam classification (Justin Ma)  EM alg. for traffic prediction (Mobile Millennium)  K-means clustering  Alternating Least Squares matrix factorization  In-memory OLAP aggregation on Hive data  SQL on Spark (future work) www.kellytechno.com
  • 28.  Spark programming model  Example applications  Implementation  Demo  Future work www.kellytechno.com
  • 29.  Spark runs on the Mesos cluster manager [NSDI 11], letting it share resources with Hadoop & other apps  Can read from any Hadoop input source (e.g. HDFS)  ~6000 lines of Scala code thanks to building on Mesos SparkSpark HadoopHadoop MPIMPI MesosMesos NodeNode NodeNode NodeNode NodeNode … www.kellytechno.com
  • 30.  Scala closures are Serializable Java objects  Serialize on driver, load & run on workers  Not quite enough  Nested closures may reference entire outer scope  May pull in non-Serializable variables not used inside  Solution: bytecode analysis + reflection  Shared variables implemented using custom serialized form (e.g. broadcast variable contains pointer to BitTorrent tracker) www.kellytechno.com
  • 31.  Modified Scala interpreter to allow Spark to be used interactively from the command line  Required two changes:  Modified wrapper code generation so that each “line” typed has references to objects for its dependencies  Place generated classes in distributed filesystem  Enables in-memory exploration of big data www.kellytechno.com
  • 32.  Spark programming model  Example applications  Implementation  Demo  Future work www.kellytechno.com
  • 33.  Further extend RDD capabilities  Control over storage layout (e.g. column-oriented)  Additional caching options (e.g. on disk, replicated)  Leverage lineage for debugging  Replay any task, rebuild any intermediate RDD  Adaptive checkpointing of RDDs  Higher-level analytics tools built on top of Spark www.kellytechno.com
  • 34.  By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics  RDDs provide:  Lineage info for fault recovery and debugging  Adjustable in-memory caching  Locality-aware parallel operations  We plan to make Spark the basis of a suite of batch and interactive data analysis tools www.kellytechno.com

Notas del editor

  1. Explain that trends in data collection rates vs processor and IO speeds are driving this
  2. Also applies to Dryad, SQL, etc Benefits: easy to do fault tolerance and
  3. Also applies to Dryad, SQL, etc Benefits: easy to do fault tolerance and
  4. RDDs = first-class way to manipulate and persist intermediate datasets
  5. Key idea:
  6. You write a single program  similar to DryadLINQ Distributed data sets with parallel operations on them are pretty standard; the new thing is that they can be reused across ops Variables in the driver program can be used in parallel ops; accumulators useful for sending information back, cached vars are an optimization Mention cached vars useful for some workloads that won’t be shown here Mention it’s all designed to be easy to distribute in a fault-tolerant fashion
  7. Note that dataset is reused on each gradient computation
  8. This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  9. Mention it’s designed to be fault-tolerant