SlideShare una empresa de Scribd logo
1 de 54
SIKS Big Data Course
Part Two
Prof.dr.ir. Arjen P. de Vries
arjen@acm.org
Enschede, December 7, 2016
Recap Spark
 Data Sharing Crucial for:
- Interactive Analysis
- Iterative machine learning algorithms
 Spark RDDs
- Distributed collections, cached in memory across cluster nodes
 Keep track of Lineage
- To ensure fault-tolerance
- To optimize processing based on knowledge of the data partitioning
RDDs in More Detail
RDDs additionally provide:
- Control over partitioning, which can be used to optimize data
placement across queries.
- usually more efficient than the sort-based approach of Map
Reduce
- Control over persistence (e.g. store on disk vs in RAM)
- Fine-grained reads (treat RDD as a big table)
Slide by Matei Zaharia, creator Spark, http://spark-project.org
Scheduling Process
rdd1.join(rdd2)
.groupBy(…)
.filter(…)
RDD Objects
build operator DAG
agnostic to
operators!
agnostic to
operators!
doesn’t know
about stages
doesn’t know
about stages
DAGScheduler
split graph into
stages of tasks
submit each
stage as ready
DAG
TaskScheduler
TaskSet
launch tasks via
cluster manager
retry failed or
straggling tasks
Cluster
manager
Worker
execute tasks
store and serve
blocks
Block
manager
Threads
Task
stage
failed
RDD API Example
// Read input file
val input = sc.textFile("input.txt")
val tokenized = input
.map(line => line.split(" "))
.filter(words => words.size > 0) // remove
empty lines
val counts = tokenized // frequency of log
levels
.map(words => (words(0), 1)).
.reduceByKey{ (a, b) => a + b, 2 } 6
RDD API Example
// Read input file
val input = sc.textFile( )
val tokenized = input
.map(line => line.split(" "))
.filter(words => words.size > 0) // remove
empty lines
val counts = tokenized // frequency of log
levels
.map(words => (words(0), 1)).
.reduceByKey{ (a, b) => a + b } 7
Transformations
sc.textFile().map().filter().map().reduceByKey()
8
DAG View of RDD’s
textFile() map() filter() map() reduceByKey()
9
Mapped RDD
Partition
1
Partition
2
Partition
3
Filtered RDD
Partition
1
Partition
2
Partition
3
Mapped RDD
Partition
1
Partition
2
Partition
3
Shuffle RDD
Partition
1
Partition
2
Hadoop RDD
Partition
1
Partition
2
Partition
3
input tokenized
counts
Transformations build up a DAG, but don’t “do
anything”
10
How runJob Works
Needs to compute my parents, parents, parents, etc all
the way back to an RDD with no dependencies (e.g.
HadoopRDD).
11
Mapped RDD
Partition
1
Partition
2
Partition
3
Filtered RDD
Partition
1
Partition
2
Partition
3
Mapped RDD
Partition
1
Partition
2
Partition
3
Hadoop RDD
Partition
1
Partition
2
Partition
3
input tokenized
counts
runJob(counts)
Physical
Optimizations
1. Certain types of transformations can be
pipelined.
2. If dependent RDD’s have already been
cached (or persisted in a shuffle) the graph
can be truncated.
Pipelining and truncation produce a set of
stages where each stage is composed of
tasks
12
Scheduler Optimizations
Pipelines narrow ops.
within a stage
Picks join algorithms
based on partitioning
(minimize shuffles)
Reuses previously
cached data
join
union
groupBy
map
Stage 3
Stage 1
Stage 2
A: B:
C: D:
E:
F:
G:
= previously computed partition
Task
Task Details
Stage boundaries are only at input RDDs or
“shuffle” operations
So, each task looks like this:
Task
f1  f2  …
Task
f1  f2  …
map output file
or master
external
storage
fetch map
outputs
and/or
How runJob Works
Needs to compute my parents, parents, parents, etc all
the way back to an RDD with no dependencies (e.g.
HadoopRDD).
15
Mapped RDD
Partition
1
Partition
2
Partition
3
Filtered RDD
Partition
1
Partition
2
Partition
3
Mapped RDD
Partition
1
Partition
2
Partition
3
Shuffle RDD
Partition
1
Partition
2
Hadoop RDD
Partition
1
Partition
2
Partition
3
input tokenized counts
runJob(counts)
16
How runJob Works
Needs to compute my parents, parents, parents, etc all
the way back to an RDD with no dependencies (e.g.
HadoopRDD).
input tokenized counts
Mapped RDD
Partition
1
Partition
2
Partition
3
Filtered RDD
Partition
1
Partition
2
Partition
3
Mapped RDD
Partition
1
Partition
2
Partition
3
Shuffle RDD
Partition
1
Partition
2
Hadoop RDD
Partition
1
Partition
2
Partition
3
runJob(counts)
Stage Graph
Task 1
Task 2
Task 3
Task 1
Task 2
Stage 1 Stage 2
Each task
will:
1.Read
Hadoop
input
2.Perform
maps and
filters
3.Write
partial sums
Each task
will:
1.Read
partial sums
2.Invoke user
function
passed to
runJob.
Shuffle write Shuffle readInput
read
Physical Execution Model
 Distinguish between:
- Jobs: complete work to be done
- Stages: bundles of work that can execute together
- Tasks: unit of work, corresponds to one RDD partition
 Defining stages and tasks should not require deep
knowledge of what these actually do
- Goal of Spark is to be extensible, letting users define new
RDD operators
RDD Interface
Set of partitions (“splits”)
List of dependencies on parent RDDs
Function to compute a partition given parents
Optional preferred locations
Optional partitioning info (Partitioner)
Captures all current Spark operations!Captures all current Spark operations!
Example: HadoopRDD
partitions = one per HDFS block
dependencies = none
compute(partition) = read corresponding block
preferredLocations(part) = HDFS block location
partitioner = none
Example: FilteredRDD
partitions = same as parent RDD
dependencies = “one-to-one” on parent
compute(partition) = compute parent and filter it
preferredLocations(part) = none (ask parent)
partitioner = none
Example: JoinedRDD
partitions = one per reduce task
dependencies = “shuffle” on each parent
compute(partition) = read and join shuffled data
preferredLocations(part) = none
partitioner = HashPartitioner(numTasks)
Spark will now know
this data is hashed!
Spark will now know
this data is hashed!
DependencyTypes
union join with inputs not
co-partitioned
map, filter
join with
inputs co-
partitioned
“Narrow” deps:
groupByKey
“Wide” (shuffle) deps:
Improving Efficiency
 Basic Principle: Avoid Shuffling!
Filter Input Early
Avoid groupByKey on Pair RDDs
 All key-value pairs will be shuffled accross the network, to a
reducer where the values are collected together
groupByKey
“Wide” (shuffle) deps:
aggregateByKey
 Three inputs
- Zero-element
- Merging function within partition
- Merging function across partitions
val initialCount = 0;
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts =
(p1: Int, p2: Int) => p1 + p2
val countByKey =
kv.aggregateByKey(initialCount)
(addToCounts,sumPartitionCounts)
Combiners!
combineByKey
val result = input.combineByKey(
(v) => (v, 1),
(acc: (Int, Int), v) => (acc.1 + v, acc.2 + 1),
(acc1: (Int, Int), acc2: (Int, Int)) =>
(acc1._1 + acc2._1, acc1._2 + acc2._2) )
.map{
case (key, value) =>
(key, value._1 / value._2.toFloat) }
result.collectAsMap().map(println(_))
Control the Degree of Parallellism
 Repartition
- Concentrate effort - increase use of nodes
 Coalesce
- Reduce number of tasks
Broadcast Values
 In case of a join with a small RHS or LHS, broadcast the
small set to every node in the cluster
Broadcast Variables
 Create with SparkContext.broadcast(initVal)
 Access with .value inside tasks
 Immutable!
- If you modify the broadcast value after creation, that change is
local to the node
Maintaining Partitioning
 mapValues instead of map
 flatMapValues instead of flatMap
- Good for tokenization!
The best trick of all, however…
Use Higher Level API’s!
DataFrame APIs for core processing
Works across Scala, Java, Python and R
Spark ML for machine learning
Spark SQL for structured query processing
38
Higher-Level Libraries
Spark
Spark
Streaming
real-time
Spark SQL
structured
data
MLlib
machine
learning
GraphX
graph
Combining Processing
Types
// Load data using SQL
points = ctx.sql(“select latitude, longitude from tweets”)
// Train a machine learning model
model = KMeans.train(points, 10)
// Apply it to a stream
sc.twitterStream(...)
.map(t => (model.predict(t.location), 1))
.reduceByWindow(“5s”, (a, b) => a + b)
Performance of
Composition
Separate computing frameworks:
…
HDFS
read
HDFS
write
HDFS
read
HDFS
write
HDFS
read
HDFS
write
HDFS
write
HDFS
read
Spark:
Encode Domain Knowledge
 In essence, nothing more than libraries with pre-cooked
code – that still operates over the abstraction of RDDs
 Focus on optimizations that require domain knowledge
Spark MLLib
Data Sets
Challenge: Data
Representation
Java objects often many times larger than
data
class User(name: String, friends: Array[Int])
User(“Bobby”, Array(1, 2))
User 0x… 0x…
String
3
0
1 2
Bobby
5 0x…
int[]
char[] 5
DataFrames / Spark SQL
Efficient library for working with structured
data
» Two interfaces: SQL for data analysts and
external apps, DataFrames for complex
programs
» Optimized computation and storage underneath
Spark SQL added in 2014, DataFrames in
2015
Spark SQL Architecture
Logical
Plan
Logical
Plan
Physical
Plan
Physical
Plan
CatalogCatalog
Optimizer
RDDsRDDs
…
Data
Source
API
SQLSQL
Data
Frames
Data
Frames
Code
Generator
DataFrame API
DataFrames hold rows with a known
schema and offer relational operations
through a DSL
c = HiveContext()
users = c.sql(“select * from users”)
ma_users = users[users.state == “MA”]
ma_users.count()
ma_users.groupBy(“name”).avg(“age”)
ma_users.map(lambda row: row.user.toUpper())
Expression AST
What DataFrames Enable
1. Compact binary representation
• Columnar, compressed cache; rows for
processing
1. Optimization across operators (join
reordering, predicate pushdown, etc)
2. Runtime code generation
Performance
Performance
Data Sources
Uniform way to access structured data
» Apps can migrate across Hive, Cassandra, JSON, …
» Rich semantics allows query pushdown into data
sources
Spark
SQL
users[users.age > 20]
select * from users
Examples
JSON:
JDBC:
Together:
select user.id, text from tweets
{
“text”: “hi”,
“user”: {
“name”: “bob”,
“id”: 15 }
}
tweets.json
select age from users where lang = “en”
select t.text, u.age
from tweets t, users u
where t.user.id = u.id
and u.lang = “en”
Spark
SQL
{JSON}
select id, age from
users where lang=“en”
Thanks
 Matei Zaharia, MIT (https://cs.stanford.edu/~matei/)
 Paul Wendell, Databricks
 http://spark-project.org

Más contenido relacionado

La actualidad más candente

02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013WANdisco Plc
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Cognizant
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapakapa rohit
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFSBrendan Tierney
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetupvmoorthy
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesKelly Technologies
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1Vemula Ravi
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 

La actualidad más candente (20)

02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFS
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetup
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hadoop technology doc
Hadoop technology docHadoop technology doc
Hadoop technology doc
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 

Similar a Bigdata processing with Spark - part II

Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Sameer Farooqui
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaAtif Akhtar
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Big Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureBig Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureDr. Christian Betz
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xincaidezhi655
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 

Similar a Bigdata processing with Spark - part II (20)

Spark
SparkSpark
Spark
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Spark learning
Spark learningSpark learning
Spark learning
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Big Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureBig Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and Clojure
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 

Más de Arjen de Vries

Masterclass Big Data (leerlingen)
Masterclass Big Data (leerlingen) Masterclass Big Data (leerlingen)
Masterclass Big Data (leerlingen) Arjen de Vries
 
Beverwedstrijd Big Data (klas 3/4/5/6)
Beverwedstrijd Big Data (klas 3/4/5/6) Beverwedstrijd Big Data (klas 3/4/5/6)
Beverwedstrijd Big Data (klas 3/4/5/6) Arjen de Vries
 
Beverwedstrijd Big Data (groep 5/6 en klas 1/2)
Beverwedstrijd Big Data (groep 5/6 en klas 1/2)Beverwedstrijd Big Data (groep 5/6 en klas 1/2)
Beverwedstrijd Big Data (groep 5/6 en klas 1/2)Arjen de Vries
 
Web Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineWeb Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineArjen de Vries
 
Information Retrieval and Social Media
Information Retrieval and Social MediaInformation Retrieval and Social Media
Information Retrieval and Social MediaArjen de Vries
 
Information Retrieval intro TMM
Information Retrieval intro TMMInformation Retrieval intro TMM
Information Retrieval intro TMMArjen de Vries
 
ACM SIGIR 2017 - Opening - PC Chairs
ACM SIGIR 2017 - Opening - PC ChairsACM SIGIR 2017 - Opening - PC Chairs
ACM SIGIR 2017 - Opening - PC ChairsArjen de Vries
 
Data Science Master Specialisation
Data Science Master SpecialisationData Science Master Specialisation
Data Science Master SpecialisationArjen de Vries
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big DataArjen de Vries
 
TREC 2016: Looking Forward Panel
TREC 2016: Looking Forward PanelTREC 2016: Looking Forward Panel
TREC 2016: Looking Forward PanelArjen de Vries
 
The personal search engine
The personal search engineThe personal search engine
The personal search engineArjen de Vries
 
Models for Information Retrieval and Recommendation
Models for Information Retrieval and RecommendationModels for Information Retrieval and Recommendation
Models for Information Retrieval and RecommendationArjen de Vries
 
Better Contextual Suggestions by Applying Domain Knowledge
Better Contextual Suggestions by Applying Domain KnowledgeBetter Contextual Suggestions by Applying Domain Knowledge
Better Contextual Suggestions by Applying Domain KnowledgeArjen de Vries
 
Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013
Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013
Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013Arjen de Vries
 
ESSIR 2013 - IR and Social Media
ESSIR 2013 - IR and Social MediaESSIR 2013 - IR and Social Media
ESSIR 2013 - IR and Social MediaArjen de Vries
 
Looking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterpriseLooking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterpriseArjen de Vries
 
Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?Arjen de Vries
 
Searching Political Data by Strategy
Searching Political Data by StrategySearching Political Data by Strategy
Searching Political Data by StrategyArjen de Vries
 
How to Search Annotated Text by Strategy?
How to Search Annotated Text by Strategy?How to Search Annotated Text by Strategy?
How to Search Annotated Text by Strategy?Arjen de Vries
 

Más de Arjen de Vries (20)

Doing a PhD @ DOSSIER
Doing a PhD @ DOSSIERDoing a PhD @ DOSSIER
Doing a PhD @ DOSSIER
 
Masterclass Big Data (leerlingen)
Masterclass Big Data (leerlingen) Masterclass Big Data (leerlingen)
Masterclass Big Data (leerlingen)
 
Beverwedstrijd Big Data (klas 3/4/5/6)
Beverwedstrijd Big Data (klas 3/4/5/6) Beverwedstrijd Big Data (klas 3/4/5/6)
Beverwedstrijd Big Data (klas 3/4/5/6)
 
Beverwedstrijd Big Data (groep 5/6 en klas 1/2)
Beverwedstrijd Big Data (groep 5/6 en klas 1/2)Beverwedstrijd Big Data (groep 5/6 en klas 1/2)
Beverwedstrijd Big Data (groep 5/6 en klas 1/2)
 
Web Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineWeb Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search Engine
 
Information Retrieval and Social Media
Information Retrieval and Social MediaInformation Retrieval and Social Media
Information Retrieval and Social Media
 
Information Retrieval intro TMM
Information Retrieval intro TMMInformation Retrieval intro TMM
Information Retrieval intro TMM
 
ACM SIGIR 2017 - Opening - PC Chairs
ACM SIGIR 2017 - Opening - PC ChairsACM SIGIR 2017 - Opening - PC Chairs
ACM SIGIR 2017 - Opening - PC Chairs
 
Data Science Master Specialisation
Data Science Master SpecialisationData Science Master Specialisation
Data Science Master Specialisation
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big Data
 
TREC 2016: Looking Forward Panel
TREC 2016: Looking Forward PanelTREC 2016: Looking Forward Panel
TREC 2016: Looking Forward Panel
 
The personal search engine
The personal search engineThe personal search engine
The personal search engine
 
Models for Information Retrieval and Recommendation
Models for Information Retrieval and RecommendationModels for Information Retrieval and Recommendation
Models for Information Retrieval and Recommendation
 
Better Contextual Suggestions by Applying Domain Knowledge
Better Contextual Suggestions by Applying Domain KnowledgeBetter Contextual Suggestions by Applying Domain Knowledge
Better Contextual Suggestions by Applying Domain Knowledge
 
Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013
Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013
Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013
 
ESSIR 2013 - IR and Social Media
ESSIR 2013 - IR and Social MediaESSIR 2013 - IR and Social Media
ESSIR 2013 - IR and Social Media
 
Looking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterpriseLooking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterprise
 
Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?
 
Searching Political Data by Strategy
Searching Political Data by StrategySearching Political Data by Strategy
Searching Political Data by Strategy
 
How to Search Annotated Text by Strategy?
How to Search Annotated Text by Strategy?How to Search Annotated Text by Strategy?
How to Search Annotated Text by Strategy?
 

Último

(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...Scintica Instrumentation
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Silpa
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfrohankumarsinghrore1
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsOrtegaSyrineMay
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Silpa
 
Velocity and Acceleration PowerPoint.ppt
Velocity and Acceleration PowerPoint.pptVelocity and Acceleration PowerPoint.ppt
Velocity and Acceleration PowerPoint.pptRakeshMohan42
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxRenuJangid3
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusNazaninKarimi6
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Silpa
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY1301aanya
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceAlex Henderson
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Serviceshivanisharma5244
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfSumit Kumar yadav
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curveAreesha Ahmad
 

Último (20)

(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Velocity and Acceleration PowerPoint.ppt
Velocity and Acceleration PowerPoint.pptVelocity and Acceleration PowerPoint.ppt
Velocity and Acceleration PowerPoint.ppt
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curve
 

Bigdata processing with Spark - part II

  • 1. SIKS Big Data Course Part Two Prof.dr.ir. Arjen P. de Vries arjen@acm.org Enschede, December 7, 2016
  • 2.
  • 3. Recap Spark  Data Sharing Crucial for: - Interactive Analysis - Iterative machine learning algorithms  Spark RDDs - Distributed collections, cached in memory across cluster nodes  Keep track of Lineage - To ensure fault-tolerance - To optimize processing based on knowledge of the data partitioning
  • 4. RDDs in More Detail RDDs additionally provide: - Control over partitioning, which can be used to optimize data placement across queries. - usually more efficient than the sort-based approach of Map Reduce - Control over persistence (e.g. store on disk vs in RAM) - Fine-grained reads (treat RDD as a big table) Slide by Matei Zaharia, creator Spark, http://spark-project.org
  • 5. Scheduling Process rdd1.join(rdd2) .groupBy(…) .filter(…) RDD Objects build operator DAG agnostic to operators! agnostic to operators! doesn’t know about stages doesn’t know about stages DAGScheduler split graph into stages of tasks submit each stage as ready DAG TaskScheduler TaskSet launch tasks via cluster manager retry failed or straggling tasks Cluster manager Worker execute tasks store and serve blocks Block manager Threads Task stage failed
  • 6. RDD API Example // Read input file val input = sc.textFile("input.txt") val tokenized = input .map(line => line.split(" ")) .filter(words => words.size > 0) // remove empty lines val counts = tokenized // frequency of log levels .map(words => (words(0), 1)). .reduceByKey{ (a, b) => a + b, 2 } 6
  • 7. RDD API Example // Read input file val input = sc.textFile( ) val tokenized = input .map(line => line.split(" ")) .filter(words => words.size > 0) // remove empty lines val counts = tokenized // frequency of log levels .map(words => (words(0), 1)). .reduceByKey{ (a, b) => a + b } 7
  • 9. DAG View of RDD’s textFile() map() filter() map() reduceByKey() 9 Mapped RDD Partition 1 Partition 2 Partition 3 Filtered RDD Partition 1 Partition 2 Partition 3 Mapped RDD Partition 1 Partition 2 Partition 3 Shuffle RDD Partition 1 Partition 2 Hadoop RDD Partition 1 Partition 2 Partition 3 input tokenized counts
  • 10. Transformations build up a DAG, but don’t “do anything” 10
  • 11. How runJob Works Needs to compute my parents, parents, parents, etc all the way back to an RDD with no dependencies (e.g. HadoopRDD). 11 Mapped RDD Partition 1 Partition 2 Partition 3 Filtered RDD Partition 1 Partition 2 Partition 3 Mapped RDD Partition 1 Partition 2 Partition 3 Hadoop RDD Partition 1 Partition 2 Partition 3 input tokenized counts runJob(counts)
  • 12. Physical Optimizations 1. Certain types of transformations can be pipelined. 2. If dependent RDD’s have already been cached (or persisted in a shuffle) the graph can be truncated. Pipelining and truncation produce a set of stages where each stage is composed of tasks 12
  • 13. Scheduler Optimizations Pipelines narrow ops. within a stage Picks join algorithms based on partitioning (minimize shuffles) Reuses previously cached data join union groupBy map Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: G: = previously computed partition Task
  • 14. Task Details Stage boundaries are only at input RDDs or “shuffle” operations So, each task looks like this: Task f1  f2  … Task f1  f2  … map output file or master external storage fetch map outputs and/or
  • 15. How runJob Works Needs to compute my parents, parents, parents, etc all the way back to an RDD with no dependencies (e.g. HadoopRDD). 15 Mapped RDD Partition 1 Partition 2 Partition 3 Filtered RDD Partition 1 Partition 2 Partition 3 Mapped RDD Partition 1 Partition 2 Partition 3 Shuffle RDD Partition 1 Partition 2 Hadoop RDD Partition 1 Partition 2 Partition 3 input tokenized counts runJob(counts)
  • 16. 16 How runJob Works Needs to compute my parents, parents, parents, etc all the way back to an RDD with no dependencies (e.g. HadoopRDD). input tokenized counts Mapped RDD Partition 1 Partition 2 Partition 3 Filtered RDD Partition 1 Partition 2 Partition 3 Mapped RDD Partition 1 Partition 2 Partition 3 Shuffle RDD Partition 1 Partition 2 Hadoop RDD Partition 1 Partition 2 Partition 3 runJob(counts)
  • 17. Stage Graph Task 1 Task 2 Task 3 Task 1 Task 2 Stage 1 Stage 2 Each task will: 1.Read Hadoop input 2.Perform maps and filters 3.Write partial sums Each task will: 1.Read partial sums 2.Invoke user function passed to runJob. Shuffle write Shuffle readInput read
  • 18. Physical Execution Model  Distinguish between: - Jobs: complete work to be done - Stages: bundles of work that can execute together - Tasks: unit of work, corresponds to one RDD partition  Defining stages and tasks should not require deep knowledge of what these actually do - Goal of Spark is to be extensible, letting users define new RDD operators
  • 19. RDD Interface Set of partitions (“splits”) List of dependencies on parent RDDs Function to compute a partition given parents Optional preferred locations Optional partitioning info (Partitioner) Captures all current Spark operations!Captures all current Spark operations!
  • 20. Example: HadoopRDD partitions = one per HDFS block dependencies = none compute(partition) = read corresponding block preferredLocations(part) = HDFS block location partitioner = none
  • 21. Example: FilteredRDD partitions = same as parent RDD dependencies = “one-to-one” on parent compute(partition) = compute parent and filter it preferredLocations(part) = none (ask parent) partitioner = none
  • 22. Example: JoinedRDD partitions = one per reduce task dependencies = “shuffle” on each parent compute(partition) = read and join shuffled data preferredLocations(part) = none partitioner = HashPartitioner(numTasks) Spark will now know this data is hashed! Spark will now know this data is hashed!
  • 23. DependencyTypes union join with inputs not co-partitioned map, filter join with inputs co- partitioned “Narrow” deps: groupByKey “Wide” (shuffle) deps:
  • 24. Improving Efficiency  Basic Principle: Avoid Shuffling!
  • 26. Avoid groupByKey on Pair RDDs  All key-value pairs will be shuffled accross the network, to a reducer where the values are collected together groupByKey “Wide” (shuffle) deps:
  • 27. aggregateByKey  Three inputs - Zero-element - Merging function within partition - Merging function across partitions val initialCount = 0; val addToCounts = (n: Int, v: String) => n + 1 val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2 val countByKey = kv.aggregateByKey(initialCount) (addToCounts,sumPartitionCounts) Combiners!
  • 28. combineByKey val result = input.combineByKey( (v) => (v, 1), (acc: (Int, Int), v) => (acc.1 + v, acc.2 + 1), (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2) ) .map{ case (key, value) => (key, value._1 / value._2.toFloat) } result.collectAsMap().map(println(_))
  • 29. Control the Degree of Parallellism  Repartition - Concentrate effort - increase use of nodes  Coalesce - Reduce number of tasks
  • 30. Broadcast Values  In case of a join with a small RHS or LHS, broadcast the small set to every node in the cluster
  • 31.
  • 32.
  • 33.
  • 34. Broadcast Variables  Create with SparkContext.broadcast(initVal)  Access with .value inside tasks  Immutable! - If you modify the broadcast value after creation, that change is local to the node
  • 35. Maintaining Partitioning  mapValues instead of map  flatMapValues instead of flatMap - Good for tokenization!
  • 36.
  • 37. The best trick of all, however…
  • 38. Use Higher Level API’s! DataFrame APIs for core processing Works across Scala, Java, Python and R Spark ML for machine learning Spark SQL for structured query processing 38
  • 40. Combining Processing Types // Load data using SQL points = ctx.sql(“select latitude, longitude from tweets”) // Train a machine learning model model = KMeans.train(points, 10) // Apply it to a stream sc.twitterStream(...) .map(t => (model.predict(t.location), 1)) .reduceByWindow(“5s”, (a, b) => a + b)
  • 41. Performance of Composition Separate computing frameworks: … HDFS read HDFS write HDFS read HDFS write HDFS read HDFS write HDFS write HDFS read Spark:
  • 42. Encode Domain Knowledge  In essence, nothing more than libraries with pre-cooked code – that still operates over the abstraction of RDDs  Focus on optimizations that require domain knowledge
  • 45. Challenge: Data Representation Java objects often many times larger than data class User(name: String, friends: Array[Int]) User(“Bobby”, Array(1, 2)) User 0x… 0x… String 3 0 1 2 Bobby 5 0x… int[] char[] 5
  • 46. DataFrames / Spark SQL Efficient library for working with structured data » Two interfaces: SQL for data analysts and external apps, DataFrames for complex programs » Optimized computation and storage underneath Spark SQL added in 2014, DataFrames in 2015
  • 48. DataFrame API DataFrames hold rows with a known schema and offer relational operations through a DSL c = HiveContext() users = c.sql(“select * from users”) ma_users = users[users.state == “MA”] ma_users.count() ma_users.groupBy(“name”).avg(“age”) ma_users.map(lambda row: row.user.toUpper()) Expression AST
  • 49. What DataFrames Enable 1. Compact binary representation • Columnar, compressed cache; rows for processing 1. Optimization across operators (join reordering, predicate pushdown, etc) 2. Runtime code generation
  • 52. Data Sources Uniform way to access structured data » Apps can migrate across Hive, Cassandra, JSON, … » Rich semantics allows query pushdown into data sources Spark SQL users[users.age > 20] select * from users
  • 53. Examples JSON: JDBC: Together: select user.id, text from tweets { “text”: “hi”, “user”: { “name”: “bob”, “id”: 15 } } tweets.json select age from users where lang = “en” select t.text, u.age from tweets t, users u where t.user.id = u.id and u.lang = “en” Spark SQL {JSON} select id, age from users where lang=“en”
  • 54. Thanks  Matei Zaharia, MIT (https://cs.stanford.edu/~matei/)  Paul Wendell, Databricks  http://spark-project.org

Notas del editor

  1. NOT a modified version of Hadoop