SlideShare una empresa de Scribd logo
1 de 73
Descargar para leer sin conexión
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
After Dark 1.5+End-to-End, Real-time, Advanced Analytics:

Reference Big Data Processing Pipeline
Melbourne Spark Meetup
Dec 09, 2015
Chris Fregly
Principal Data Solutions Engineer
advancedspark.com
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Who Am I?
2

Streaming Data Engineer
Open Source Committer


Data Solutions Engineer

Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
Founder
Advanced Apache Meetup
Author
Advanced .
Due 2016
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Advanced Apache Spark Meetup
Meetup Metrics
Top 5 Most Active Spark Meetup!
1700+ Members in just 4 mos!!
2100+ Downloads of Docker Image

 with many Meetup Demos!!!

 @ advancedspark.com
Meetup Goals
Code dive deep into Spark and related open source code bases
Study integrations with Cassandra, ElasticSearch,Tachyon, S3,

 
 BlinkDB, Mesos, YARN, Kafka, R, etc
Surface and share patterns and idioms of well-designed, 

 
 distributed, big data processing systems
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza (Nov 10th)
San Francisco Advanced Spark (Nov 12th)
4
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 26th)
Istanbul Spark Meetup (Nov 28th)
Singapore Pre-Strata Spark Meetup (Dec 1st)
Sydney Spark Meetup (Dec 8th)
Melbourne Spark Meetup (Dec 9th)
San Francisco Advanced Spark (Dec 10th)
San Francisco Univ of San Francisco (Dec 11th)
Toronto Spark Meetup (Dec 14th)
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Topics of This Talk (15-20 mins each)
  Spark Streaming and Spark ML

Kafka, Cassandra, ElasticSearch, Redis, Docker
  Spark Core

Tuning and Profiling
  Spark SQL

Tuning and Customizing
5
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Live, Interactive Demo!
6
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Audience Participation Required!
7
You -> 
 Audience Instructions
 Navigate to 

sparkafterdark.com
 Click
3 actresses & 3 actors




Data ->
Scientist
This is Totally Anonymous!!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Topics of This Talk (15-20 mins each)
  Spark Streaming and Spark ML

Kafka, Cassandra, ElasticSearch, Redis, Docker
  Spark Core

Tuning and Profiling
  Spark SQL

Tuning and Customizing
8
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Mechanical Sympathy
“Hardware and software working together.”

- Martin Thompson

 http://mechanical-sympathy.blogspot.com


“Whatever your data structure, my array will beat it.”

- Scott Meyers

 Every C++ Book, basically

9
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Spark and Mechanical Sympathy
10
Project 

Tungsten
(Spark 1.4-1.6+)
GraySort
Challenge
(Spark 1.1-1.2)
Minimize Memory & GC
Maximize CPU Cache
Saturate Network I/O
Saturate Disk I/O
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Line and Cache Sizes
11
My

Laptop
My

SoftLayer

BareMetal
aka
“LLC”
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Sympathy (AlphaSort paper)
Key (10 bytes) + Pointer (4 bytes) = 14 bytes

12
Key
 Ptr
Pre-process & Pull Key From Record
Ptr
Key-Prefix
2x CPU Cache-line Friendly!
Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
Ptr
Must Dereference to Compare Key
Key
Pointer (4 bytes) = 4 bytes

Key
 Ptr
Pad
/Pad
 CPU Cache-line Friendly!
Dereference full key only 
to resolve prefix duplicates
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Performance Comparison
13
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Sequential vs Random Memory Access
14
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demo!
Performance Impact of Random Pointer vs. Sequential Key Sort

15
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Instrumenting and Monitoring CPU
Use Linux perf command!
16
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Results of Random vs. Sequential Sort
17
Naïve Random Access
Pointer Sort
Cache Friendly Sequential
Key/Pointer Sort
Ptr
 Key
Must Dereference to Compare Key
Key
 Ptr
Pre-process & Pull Key From Record
-35%
-90%
-68%
-26%
% Change
-55%
perf stat –event 
L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demo!
Performance Impact of CPU Cache & Prefetch Misses

18
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Naïve Matrix Multiplication
// Dot product of each row & column vector
for (i <- 0 until numRowA)
for (j <- 0 until numColsB)
for (k <- 0 until numColsA)
res[ i ][ j ] += matA[ i ][ k ] * matB[ k ][ j ];

19
Bad: Row-wise traversal,

 not using full CPU cache line,

ineffective pre-fetching
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CPU Cache Friendly Matrix Multiplication
// Transpose B
for (i <- 0 until numRowsB)
for (j <- 0 until numColsB)
matBT[ i ][ j ] = matB[ j ][ i ];
20
Good: Full CPU Cache Line,

Effective Prefetching
OLD: matB [ k ][ j ];
// Modify algo for Transpose B
for (i <- 0 until numRowsA)
for (j <- 0 until numColsB)
for (k <- 0 until numColsA)
res[ i ][ j ] += matA[ i ][ k ] * matBT[ j ][ k ];
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Results Of Matrix Multiply Challenge
Cache-Friendly Matrix Multiply
21
Naïve Matrix Multiply
perf stat –event 
L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, 
LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
-96%
-93%
-93%
-70%
-53%
% Change
-63%
+8543%?
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Thread and Context Switch Sympathy
Problem
Atomically Increment 2 Counters 

(each at different increments) by
1000’s of Simultaneous Threads

22
Possible Solutions
  Synchronized Immutable
  Synchronized Mutable
  AtomicReference CAS
  Volatile???
Context Switches are Expen$ive!!
aka
“LLC”
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Synchronized Immutable Counters
case class Counters(left: Int, right: Int)

object SynchronizedImmutableCounters {
var counters = new Counters(0,0)
def getCounters(): Counters = {

this.synchronized { counters }
}
def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
this.synchronized {
counters = new Counters(counters.left + leftIncrement, 

 
 
 
 
 
 
 
 
 counters.right + rightIncrement)
}
}
}
23
Locks whole
outer object!!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Synchronized Mutable Counters
class MutableCounters(left: Int, right: Int) {
def increment(leftIncrement: Int, rightIncrement: Int): Unit={
this.synchronized {…}
}
def getCountersTuple(): (Int, Int) = {
this.synchronized{ (counters.left, counters.right) }
}
}
object SynchronizedMutableCounters {
val counters = new MutableCounters(0,0)
…
def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
counters.increment(leftIncrement, rightIncrement)
}
}
24
Locks just
MutableCounters
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Lock-Free AtomicReference Counters
object LockFreeAtomicReferenceCounters {
val counters = new AtomicReference[Counters](new Counters(0,0))
…
def increment(leftIncrement: Int, rightIncrement: Int) : Long = {
var originalCounters: Counters = null
var updatedCounters: Counters = null
do {
originalCounters = getCounters()
updatedCounters = new Counters(originalCounters.left+ leftIncrement,
originalCounters.right+ rightIncrement)
}

 // Retry lock-free, optimistic compareAndSet() until AtomicRef updates
while !(counters.compareAndSet(originalCounters, updatedCounters))
}
} 
25
Lock Free!!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Lock-Free AtomicLong Counters
object LockFreeAtomicLongCounters {
// a single Long (64-bit) will maintain 2 separate Ints (32-bits each)
val counters = new AtomicLong()
…
def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
var originalCounters = 0L
var updatedCounters = 0L
do {
originalCounters = counters.get()

 …

 // Store two 32-bit Int into one 64-bit Long

 // Use >>> 32 and << 32 to set and retrieve each Int from the Long

 } 
// Retry lock-free, optimistic compareAndSet() until AtomicLong updates
while !(counters.compareAndSet(originalCounters, updatedCounters))
}
} 
26
Q: Why not 

@volatile long?
A: The JVM does not
guarantee atomic

updates of 64-bit 
longs and doubles
Lock Free!!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demo!
Performance Impact of Thread Blocking & Context Switching
27
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Results of 2 Counter Synchronization
Immutable Case Class





28
Lock-Free AtomicLong
-64%
-46%
-17%
-33%
perf stat –event 
context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses, 
LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
% Change
-31%
-32%
-33%
-27%
case class Counters(left: Int, right: Int)
...
this.synchronized {
counters = new Counters(counters.left + leftIncrement, 

 
 
 
 counters.right + rightIncrement)
}
val counters = new AtomicLong()
…
do {
…
} while !(counters.compareAndSet(originalCounters, 

updatedCounters))
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Profile Visualizations: Flame Graphs
29
Example: Spark Word Count
Java Stack Traces are Good! JDK 1.8+
(-XX:-Inline -XX:+PreserveFramePointer)
Plateaus are Bad!
I/O stalls, Heavy CPU

serialization, etc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Project Tungsten: CPU and Memory
Create Custom Data Structures & Algorithms

Operate on serialized and compressed ByteArrays!
Minimize Garbage Collection

Reuse ByteArrays

In-place updates for aggregations
Maximize CPU Cache Effectiveness

8-byte alignment

AlphaSort-based Key-Prefix
Utilize Catalyst Dynamic Code Generation

Dynamic optimizations using entire query plan

Developer implements genCode() to create Scala source code (String)

Project Janino compiles source code into JVM bytecode
30
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Why is CPU the Bottleneck?
CPU for serialization, hashing, & compression

Spark 1.2 updates saturated Network, Disk I/O

10x increase in I/O throughput relative to CPU

More partition, pruning, and pushdown support

Newer columnar file formats help reduce I/O
31
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Custom Data Structs & Algos: Aggs
UnsafeFixedWidthAggregationMap

Uses BytesToBytesMap internally

In-place updates of serialized aggregation

No object creation on hot-path


TungstenAggregate & TungstenAggregationIterator

Operates directly on serialized, binary UnsafeRow

2 steps to avoid single-key OOMs
  Hash-based (grouping) agg spills to disk if needed
  Sort-based agg performs external merge sort on spills
32
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Custom Data Structs & Algos: Sorting
o.a.s.util.collection.unsafe.sort. 


UnsafeSortDataFormat

UnsafeExternalSorter

UnsafeShuffleWriter

UnsafeInMemorySorter

RecordPointerAndKeyPrefix



33
Ptr
Key-Prefix
2x CPU Cache-line Friendly!
SortDataFormat<RecordPointerAndKeyPrefix, Long[ ]>

Note: Mixing multiple subclasses of SortDataFormat 
simultaneously will prevent JIT inlining.
Supports merging compressed records
(if compression CODEC supports it, ie. LZF)
In-place external sorting of spilled BytesToBytes data
AlphaSort-based, 8-byte aligned sort key
In-place sorting of BytesToBytesMap data
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Code Generation
Problem

Boxing creates excessive objects

Expression tree evaluations are costly

JVM can’t inline polymorphic impls

 
Lack of polymorphism == poor code design
Solution

Code generation enables inlining

Rewrite and optimize code using overall plan, 8-byte align

Defer source code generation to each operator, UDF, UDAF

Use Janino to compile generated source code into bytecode

 
(More IDE friendly than Scala quasiquotes)
34
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Autoscaling Spark Workers (Spark 1.5+)
Scaling up is easy J

SparkContext.addExecutors() until max is reached
Scaling down is hard L

SparkContext.removeExecutors()

Lose RDD cache inside Executor JVM

Must rebuild active RDD partitions in another Executor JVM
Uses External Shuffle Service from Spark 1.1-1.2

If Executor JVM dies/restarts, shuffle keeps shufflin’!
35
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
“Hidden” Spark Submit REST API
http://arturmkrtchyan.com/apache-spark-hidden-rest-api
Submit Spark Job

curl -X POST http://127.0.0.1:6066/v1/submissions/create 

 
--header "Content-Type:application/json;charset=UTF-8" 

 
--data ’{"action" : "CreateSubmissionRequest”, 
"mainClass" : "org.apache.spark.examples.SparkPi”,

 
 
"sparkProperties" : {

 
 
 "spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar",

 
 "spark.app.name" : "SparkPi",…

 
}}’
Get Spark Job Status

curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request>
Kill Spark Job

curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request>
36
(the snitch)
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Outline
  Spark Streaming and Spark ML

Kafka, Cassandra, ElasticSearch, Redis, Docker
  Spark Core

Tuning and Profiling
  Spark SQL

Tuning and Customizing
37
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Parquet Columnar File Format
Based on Google Dremel paper ~2010
Collaboration with Twitter and Cloudera
Columnar storage format for fast columnar aggs
Supports evolving schema
Supports pushdowns
Support nested partitions
Tight compression
Min/max heuristics enable file and chunk skipping
38
Min/Max Heuristics
For Chunk Skipping
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Partitions
Partition Based on Data Access Patterns

/genders.parquet/gender=M/…

 
 
 
 
 
 
 /gender=F/… <-- Use Case: Access Users by Gender

 
 
 
 
 
 /gender=U/…
Dynamic Partition Creation (Write)

Dynamically create partitions on write based on column (ie. Gender)

SQL: INSERT TABLE genders PARTITION (gender) SELECT …

DF: gendersDF.write.format("parquet").partitionBy("gender")

.save("/genders.parquet")
Partition Discovery (Read)

Dynamically infer partitions on read based on paths (ie. /gender=F/…)

SQL: SELECT id FROM genders WHERE gender=F

DF: gendersDF.read.format("parquet").load("/genders.parquet/").select($"id"). 


 
 
.where("gender=F")
39
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Pruning
Partition Pruning

Filter out rows by partition


 
SELECT id, gender FROM genders WHERE gender = ‘F’

Column Pruning

Filter out columns by column filter

Extremely useful for columnar storage formats (ie. Parquet)

 
Skip entire blocks of columns


 
SELECT id, gender FROM genders

40
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Pushdowns
aka. Predicate or Filter Pushdowns 
Predicate returns true or false for given function
Filters rows deep into the data source
Reduces number of rows returned
Data Source must implement PrunedFilteredScan


 
def buildScan(requiredColumns: Array[String], 
filters: Array[Filter]): RDD[Row]
41
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demo!
Compare Performance of File Formats, Partitions, and Pushdowns
42
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Filter Pushdown and Collapsing
43
Filter pushdown (Parquet)
Filter collapsing (next best thing)
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Join Between Partitioned & Unpartitioned
44
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Join Between Partitioned & Partitioned
45
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
46
Cartesian Join vs. Inner Join
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
47
Broadcast Join vs. Normal Shuffle Join
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Visualizing the Query Plan
48
Effectiveness 
of Filter
Cost-based
Join Optimization
Similar to
MapReduce

Map-side Join
& DistributedCache
Peak Memory for
Joins and Aggs
UnsafeFixedWidthAggregationMap

getPeakMemoryUsedBytes()
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Data Source API
Relations (o.a.s.sql.sources.interfaces.scala)

BaseRelation (abstract class): Provides schema of data

 
TableScan (impl): Read all data from source 


 
PrunedFilteredScan (impl): Column pruning & predicate pushdowns

 
InsertableRelation (impl): Insert/overwrite data based on SaveMode

RelationProvider (trait/interface): Handle options, BaseRelation factory

Filters (o.a.s.sql.sources.filters.scala)

Filter (abstract class): Handles all filters supported by this source

 
EqualTo (impl)

 
GreaterThan (impl)

 
StringStartsWith (impl)
49
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Native Spark SQL Data Sources
50
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
JSON Data Source
DataFrame

val ratingsDF = sqlContext.read.format("json")
.load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or –
val ratingsDF = sqlContext.read.json

("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
CREATE TABLE genders USING json
OPTIONS 
(path "file:/root/pipeline/datasets/dating/genders.json.bz2")

51
json() convenience method
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Parquet Data Source
Configuration

spark.sql.parquet.filterPushdown=true

spark.sql.parquet.mergeSchema=false (unless your schema is evolving)

spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable())

spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames

val gendersDF = sqlContext.read.format("parquet")

 .load("file:/root/pipeline/datasets/dating/genders.parquet")

gendersDF.write.format("parquet").partitionBy("gender")

 .save("file:/root/pipeline/datasets/dating/genders.parquet") 
SQL

CREATE TABLE genders USING parquet

OPTIONS 

 
(path "file:/root/pipeline/datasets/dating/genders.parquet")

52
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
ElasticSearch Data Source
Github

https://github.com/elastic/elasticsearch-hadoop

Maven

org.elasticsearch:elasticsearch-spark_2.10:2.1.0

Code

val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", 


 
 
 
 
 
 
 "es.port" -> "<port>")

df.write.format("org.elasticsearch.spark.sql”).mode(SaveMode.Overwrite)

 
.options(esConfig).save("<index>/<document-type>")

53
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Cassandra Data Source (DataStax)
Github

https://github.com/datastax/spark-cassandra-connector

Maven

com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code

ratingsDF.write

 
.format("org.apache.spark.sql.cassandra")

 
.mode(SaveMode.Append)

 
.options(Map("keyspace"->"<keyspace>",

 
 
 
 
 
 "table"->"<table>")).save(…)

54
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Mythical New Cassandra Data Source
By-pass Cassandra CQL “front door”

(which is optimized for transactional data)

Bulk read and write directly against SSTables

(Similar to 5 year old NetflixOSS project Aegisthus)

Promotes Cassandra to first-class analytics option

Replicated analytics cluster not needed
55
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
DB2 and BigSQL Data Sources (IBM)
Coming Soon!
56
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Creating a Custom Data Source
① Study existing implementations

o.a.s.sql.execution.datasources.jdbc.JDBCRelation
② Extend base traits & implement required methods

o.a.s.sql.sources.{BaseRelation,PrunedFilterScan}

Spark JDBC (o.a.s.sql.execution.datasources.jdbc)

 class JDBCRelation extends BaseRelation

 
with PrunedFilteredScan 


 
with InsertableRelation
DataStax Cassandra (o.a.s.sql.cassandra)

 class CassandraSourceRelation extends BaseRelation

 
with PrunedFilteredScan 


 
with InsertableRelation!
57
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demo!
Create a Custom Data Source
58
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Publishing Custom Data Sources
59
spark-packages.org
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
IBM | spark.tc
Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in 

SPARK-8159, SPARK-9571
Every UDF must use Expressions and

implement Expression.genCode()
to participate in the fun
Lambdas (RDD or Dataset API)
and sqlContext.udf.registerFunction()
are not enough!!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Creating a Custom UDF with Code Gen
① Study existing implementations
o.a.s.sql.catalyst.expressions.Substring


② Extend and implement base trait

 o.a.s.sql.catalyst.expressions.Expression.genCode



③ Don’t forget about Python!

 python.pyspark.sql.functions.py
61
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demo!
Creating a Custom UDF participating in Code Generation
62
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Spark 1.6 Improvements
63
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Adaptive Query Execution
Adapt query execution using data from previous stages
Dynamically choose spark.sql.shuffle.partitions (default 200)
64
Broadcast Join
(popular keys)
Shuffle Join
(not-so-popular keys)
Hybrid

Join
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Adaptive Memory Management
Spark <1.6

Manual configure between 2 memory regions

 
Spark execution engine (shuffles, joins, sorts, aggs)

 
 
spark.shuffle.memoryFraction

 
RDD Data Cache

 
 
spark.storage.memoryFraction 
Spark 1.6+

Unified memory regions

Dynamically expand/contract memory regions

Supports minimum for RDD storage (LRU Cache)
65
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Metrics
Shows exact memory usage per operator & per node
Helps debugging and identifying skew
66
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Spark SQL API
Datasets type safe API (similar to RDDs) utilizing Tungsten

val ds = sqlContext.read.text("ratings.csv").as[String]

val df = ds.flatMap(_.split(",")).filter(_ != "").toDF() // RDD API, convert to DF

val agg = df.groupBy($"rating").agg(count("*") as "ct”).orderBy($"ct" desc)
Typed Aggregators used alongside UDFs and UDAFs

val simpleSum = new Aggregator[Int, Int, Int] with Serializable {

 def zero: Int = 0

 def reduce(b: Int, a: Int) = b + a 

 def merge(b1: Int, b2: Int) = b1 + b2

 def finish(b: Int) = b

}.toColumn

val sum = Seq(1,2,3,4).toDS().select(simpleSum)
Query files directly without registerTempTable()

%sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json`
67
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Spark Streaming State Management
New trackStateByKey()

Store deltas, compaction later

No longer scan every key

Sessionization TTL

Integrated A/B Testing 

Shows failure output in Admin UI



68
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Thank You!!!
Chris Fregly
IBM Spark Tech Center 
(http://spark.tc)
San Francisco, California, USA
advancedspark.com
Sign up for the Meetup and Book
Clone, Contribute, Commit on Github
Run All Demos in Docker
2100+ Docker Downloads!!
Find me on LinkedIn, Twitter, Github, Email, Fax
69
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
What’s Next?
70
After Dark 1.6+
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
What’s Next?
Autoscaling Docker/Spark Workers

Completely Docker-based

Docker Compose, Google Kubernetes
Lots of Demos and Examples

More Zeppelin & IPython/Jupyter notebooks

More advanced analytics use cases
More Performance Profiling

Work closely with Netflix & Databricks

Identify & fix performance bottlenecks

71
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza (Nov 10th)
San Francisco Advanced Spark (Nov 12th)
72
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 26th)
Istanbul Spark Meetup (Nov 28th)
Singapore Pre-Strata Spark Meetup (Dec 1st)
Sydney Spark Meetup (Dec 8th)
Melbourne Spark Meetup (Dec 9th)
San Francisco Advanced Spark (Dec 10th)
San Francisco Univ of San Francisco (Dec 11th)
Toronto Spark Meetup (Dec 14th)
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
advancedspark.com
@cfregly

Más contenido relacionado

La actualidad más candente

Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Chris Fregly
 
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Chris Fregly
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016 Chris Fregly
 
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Chris Fregly
 
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Chris Fregly
 
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChris Fregly
 
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016Chris Fregly
 
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Chris Fregly
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Chris Fregly
 
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Chris Fregly
 
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...Athens Big Data
 
Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Chris Fregly
 
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Chris Fregly
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Chris Fregly
 
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...Chris Fregly
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Chris Fregly
 
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Chris Fregly
 
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...Chris Fregly
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Chris Fregly
 

La actualidad más candente (20)

Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
 
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016
 
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
 
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
 
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
 
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
 
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
 
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
 
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
 
Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015
 
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
 
Similarity at Scale
Similarity at ScaleSimilarity at Scale
Similarity at Scale
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016
 
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
 
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
 
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
 

Similar a Melbourne Spark Meetup Dec 09 2015

Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Chris Fregly
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Chris Fregly
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationCraig Chao
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
An Insider’s Guide to Maximizing Spark SQL Performance
 An Insider’s Guide to Maximizing Spark SQL Performance An Insider’s Guide to Maximizing Spark SQL Performance
An Insider’s Guide to Maximizing Spark SQL PerformanceTakuya UESHIN
 
What's New in Spark 2?
What's New in Spark 2?What's New in Spark 2?
What's New in Spark 2?Eyal Ben Ivri
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
 
Advertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-MobileAdvertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-MobileDatabricks
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj
 
Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)Databricks
 
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...Anya Bida
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 

Similar a Melbourne Spark Meetup Dec 09 2015 (17)

Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimization
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
An Insider’s Guide to Maximizing Spark SQL Performance
 An Insider’s Guide to Maximizing Spark SQL Performance An Insider’s Guide to Maximizing Spark SQL Performance
An Insider’s Guide to Maximizing Spark SQL Performance
 
What's New in Spark 2?
What's New in Spark 2?What's New in Spark 2?
What's New in Spark 2?
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Advertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-MobileAdvertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-Mobile
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)
 
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 

Más de Chris Fregly

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataChris Fregly
 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfChris Fregly
 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupChris Fregly
 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedChris Fregly
 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine LearningChris Fregly
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...Chris Fregly
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon BraketChris Fregly
 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-PersonChris Fregly
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapChris Fregly
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...Chris Fregly
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Chris Fregly
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Chris Fregly
 
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Chris Fregly
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...Chris Fregly
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...Chris Fregly
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Chris Fregly
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...Chris Fregly
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Chris Fregly
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...Chris Fregly
 
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...Chris Fregly
 

Más de Chris Fregly (20)

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and Data
 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdf
 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine Learning
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon Braket
 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:Cap
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
 
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
 

Último

Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxSasikiranMarri
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfkalichargn70th171
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapIshara Amarasekera
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdfSteve Caron
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxRTS corp
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...kalichargn70th171
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 

Último (20)

Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery Roadmap
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 

Melbourne Spark Meetup Dec 09 2015

  • 1. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc After Dark 1.5+End-to-End, Real-time, Advanced Analytics:
 Reference Big Data Processing Pipeline Melbourne Spark Meetup Dec 09, 2015 Chris Fregly Principal Data Solutions Engineer advancedspark.com
  • 2. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Who Am I? 2 Streaming Data Engineer Open Source Committer
 Data Solutions Engineer
 Apache Contributor Principal Data Solutions Engineer IBM Technology Center Founder Advanced Apache Meetup Author Advanced . Due 2016
  • 3. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Advanced Apache Spark Meetup Meetup Metrics Top 5 Most Active Spark Meetup! 1700+ Members in just 4 mos!! 2100+ Downloads of Docker Image with many Meetup Demos!!! @ advancedspark.com Meetup Goals Code dive deep into Spark and related open source code bases Study integrations with Cassandra, ElasticSearch,Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc Surface and share patterns and idioms of well-designed, distributed, big data processing systems
  • 4. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Upcoming Meetups and Conferences London Spark Meetup (Oct 12th) Scotland Data Science Meetup (Oct 13th) Dublin Spark Meetup (Oct 15th) Barcelona Spark Meetup (Oct 20th) Madrid Big Data Meetup (Oct 22nd) Paris Spark Meetup (Oct 26th) Amsterdam Spark Summit (Oct 27th) Brussels Spark Meetup (Oct 30th) Zurich Big Data Meetup (Nov 2nd) Geneva Spark Meetup (Nov 5th) San Francisco Datapalooza (Nov 10th) San Francisco Advanced Spark (Nov 12th) 4 Oslo Big Data Hadoop Meetup (Nov 19th) Helsinki Spark Meetup (Nov 20th) Stockholm Spark Meetup (Nov 23rd) Copenhagen Spark Meetup (Nov 25th) Budapest Spark Meetup (Nov 26th) Istanbul Spark Meetup (Nov 28th) Singapore Pre-Strata Spark Meetup (Dec 1st) Sydney Spark Meetup (Dec 8th) Melbourne Spark Meetup (Dec 9th) San Francisco Advanced Spark (Dec 10th) San Francisco Univ of San Francisco (Dec 11th) Toronto Spark Meetup (Dec 14th)
  • 5. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Topics of This Talk (15-20 mins each)   Spark Streaming and Spark ML
 Kafka, Cassandra, ElasticSearch, Redis, Docker   Spark Core
 Tuning and Profiling   Spark SQL
 Tuning and Customizing 5
  • 6. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Live, Interactive Demo! 6
  • 7. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Audience Participation Required! 7 You -> Audience Instructions  Navigate to 
 sparkafterdark.com  Click 3 actresses & 3 actors Data -> Scientist This is Totally Anonymous!!
  • 8. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Topics of This Talk (15-20 mins each)   Spark Streaming and Spark ML
 Kafka, Cassandra, ElasticSearch, Redis, Docker   Spark Core
 Tuning and Profiling   Spark SQL
 Tuning and Customizing 8
  • 9. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Mechanical Sympathy “Hardware and software working together.” - Martin Thompson http://mechanical-sympathy.blogspot.com “Whatever your data structure, my array will beat it.” - Scott Meyers Every C++ Book, basically 9
  • 10. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Spark and Mechanical Sympathy 10 Project 
 Tungsten (Spark 1.4-1.6+) GraySort Challenge (Spark 1.1-1.2) Minimize Memory & GC Maximize CPU Cache Saturate Network I/O Saturate Disk I/O
  • 11. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Line and Cache Sizes 11 My
 Laptop My
 SoftLayer
 BareMetal aka “LLC”
  • 12. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Sympathy (AlphaSort paper) Key (10 bytes) + Pointer (4 bytes) = 14 bytes 12 Key Ptr Pre-process & Pull Key From Record Ptr Key-Prefix 2x CPU Cache-line Friendly! Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes Ptr Must Dereference to Compare Key Key Pointer (4 bytes) = 4 bytes Key Ptr Pad /Pad CPU Cache-line Friendly! Dereference full key only to resolve prefix duplicates
  • 13. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Performance Comparison 13
  • 14. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Sequential vs Random Memory Access 14
  • 15. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Performance Impact of Random Pointer vs. Sequential Key Sort 15
  • 16. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Instrumenting and Monitoring CPU Use Linux perf command! 16 http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
  • 17. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Results of Random vs. Sequential Sort 17 Naïve Random Access Pointer Sort Cache Friendly Sequential Key/Pointer Sort Ptr Key Must Dereference to Compare Key Key Ptr Pre-process & Pull Key From Record -35% -90% -68% -26% % Change -55% perf stat –event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses
  • 18. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Performance Impact of CPU Cache & Prefetch Misses 18
  • 19. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Naïve Matrix Multiplication // Dot product of each row & column vector for (i <- 0 until numRowA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[ i ][ j ] += matA[ i ][ k ] * matB[ k ][ j ]; 19 Bad: Row-wise traversal, not using full CPU cache line,
 ineffective pre-fetching
  • 20. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU Cache Friendly Matrix Multiplication // Transpose B for (i <- 0 until numRowsB) for (j <- 0 until numColsB) matBT[ i ][ j ] = matB[ j ][ i ]; 20 Good: Full CPU Cache Line,
 Effective Prefetching OLD: matB [ k ][ j ]; // Modify algo for Transpose B for (i <- 0 until numRowsA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[ i ][ j ] += matA[ i ][ k ] * matBT[ j ][ k ];
  • 21. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Results Of Matrix Multiply Challenge Cache-Friendly Matrix Multiply 21 Naïve Matrix Multiply perf stat –event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend -96% -93% -93% -70% -53% % Change -63% +8543%?
  • 22. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Thread and Context Switch Sympathy Problem Atomically Increment 2 Counters 
 (each at different increments) by 1000’s of Simultaneous Threads 22 Possible Solutions   Synchronized Immutable   Synchronized Mutable   AtomicReference CAS   Volatile??? Context Switches are Expen$ive!! aka “LLC”
  • 23. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Synchronized Immutable Counters case class Counters(left: Int, right: Int) object SynchronizedImmutableCounters { var counters = new Counters(0,0) def getCounters(): Counters = { this.synchronized { counters } } def increment(leftIncrement: Int, rightIncrement: Int): Unit = { this.synchronized { counters = new Counters(counters.left + leftIncrement, counters.right + rightIncrement) } } } 23 Locks whole outer object!!
  • 24. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Synchronized Mutable Counters class MutableCounters(left: Int, right: Int) { def increment(leftIncrement: Int, rightIncrement: Int): Unit={ this.synchronized {…} } def getCountersTuple(): (Int, Int) = { this.synchronized{ (counters.left, counters.right) } } } object SynchronizedMutableCounters { val counters = new MutableCounters(0,0) … def increment(leftIncrement: Int, rightIncrement: Int): Unit = { counters.increment(leftIncrement, rightIncrement) } } 24 Locks just MutableCounters
  • 25. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Lock-Free AtomicReference Counters object LockFreeAtomicReferenceCounters { val counters = new AtomicReference[Counters](new Counters(0,0)) … def increment(leftIncrement: Int, rightIncrement: Int) : Long = { var originalCounters: Counters = null var updatedCounters: Counters = null do { originalCounters = getCounters() updatedCounters = new Counters(originalCounters.left+ leftIncrement, originalCounters.right+ rightIncrement) } // Retry lock-free, optimistic compareAndSet() until AtomicRef updates while !(counters.compareAndSet(originalCounters, updatedCounters)) } } 25 Lock Free!!
  • 26. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Lock-Free AtomicLong Counters object LockFreeAtomicLongCounters { // a single Long (64-bit) will maintain 2 separate Ints (32-bits each) val counters = new AtomicLong() … def increment(leftIncrement: Int, rightIncrement: Int): Unit = { var originalCounters = 0L var updatedCounters = 0L do { originalCounters = counters.get() … // Store two 32-bit Int into one 64-bit Long // Use >>> 32 and << 32 to set and retrieve each Int from the Long } // Retry lock-free, optimistic compareAndSet() until AtomicLong updates while !(counters.compareAndSet(originalCounters, updatedCounters)) } } 26 Q: Why not 
 @volatile long? A: The JVM does not guarantee atomic
 updates of 64-bit longs and doubles Lock Free!!
  • 27. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Performance Impact of Thread Blocking & Context Switching 27
  • 28. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Results of 2 Counter Synchronization Immutable Case Class 28 Lock-Free AtomicLong -64% -46% -17% -33% perf stat –event context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses, LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend % Change -31% -32% -33% -27% case class Counters(left: Int, right: Int) ... this.synchronized { counters = new Counters(counters.left + leftIncrement, counters.right + rightIncrement) } val counters = new AtomicLong() … do { … } while !(counters.compareAndSet(originalCounters, 
 updatedCounters))
  • 29. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Profile Visualizations: Flame Graphs 29 Example: Spark Word Count Java Stack Traces are Good! JDK 1.8+ (-XX:-Inline -XX:+PreserveFramePointer) Plateaus are Bad! I/O stalls, Heavy CPU
 serialization, etc
  • 30. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Project Tungsten: CPU and Memory Create Custom Data Structures & Algorithms Operate on serialized and compressed ByteArrays! Minimize Garbage Collection Reuse ByteArrays In-place updates for aggregations Maximize CPU Cache Effectiveness 8-byte alignment AlphaSort-based Key-Prefix Utilize Catalyst Dynamic Code Generation Dynamic optimizations using entire query plan Developer implements genCode() to create Scala source code (String) Project Janino compiles source code into JVM bytecode 30
  • 31. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Why is CPU the Bottleneck? CPU for serialization, hashing, & compression Spark 1.2 updates saturated Network, Disk I/O 10x increase in I/O throughput relative to CPU More partition, pruning, and pushdown support Newer columnar file formats help reduce I/O 31
  • 32. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Custom Data Structs & Algos: Aggs UnsafeFixedWidthAggregationMap Uses BytesToBytesMap internally In-place updates of serialized aggregation No object creation on hot-path TungstenAggregate & TungstenAggregationIterator Operates directly on serialized, binary UnsafeRow 2 steps to avoid single-key OOMs   Hash-based (grouping) agg spills to disk if needed   Sort-based agg performs external merge sort on spills 32
  • 33. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Custom Data Structs & Algos: Sorting o.a.s.util.collection.unsafe.sort. UnsafeSortDataFormat UnsafeExternalSorter UnsafeShuffleWriter UnsafeInMemorySorter RecordPointerAndKeyPrefix
 33 Ptr Key-Prefix 2x CPU Cache-line Friendly! SortDataFormat<RecordPointerAndKeyPrefix, Long[ ]>
 Note: Mixing multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining. Supports merging compressed records (if compression CODEC supports it, ie. LZF) In-place external sorting of spilled BytesToBytes data AlphaSort-based, 8-byte aligned sort key In-place sorting of BytesToBytesMap data
  • 34. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Code Generation Problem Boxing creates excessive objects Expression tree evaluations are costly JVM can’t inline polymorphic impls Lack of polymorphism == poor code design Solution Code generation enables inlining Rewrite and optimize code using overall plan, 8-byte align Defer source code generation to each operator, UDF, UDAF Use Janino to compile generated source code into bytecode (More IDE friendly than Scala quasiquotes) 34
  • 35. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Autoscaling Spark Workers (Spark 1.5+) Scaling up is easy J SparkContext.addExecutors() until max is reached Scaling down is hard L SparkContext.removeExecutors() Lose RDD cache inside Executor JVM Must rebuild active RDD partitions in another Executor JVM Uses External Shuffle Service from Spark 1.1-1.2 If Executor JVM dies/restarts, shuffle keeps shufflin’! 35
  • 36. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark “Hidden” Spark Submit REST API http://arturmkrtchyan.com/apache-spark-hidden-rest-api Submit Spark Job curl -X POST http://127.0.0.1:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data ’{"action" : "CreateSubmissionRequest”, "mainClass" : "org.apache.spark.examples.SparkPi”, "sparkProperties" : { "spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar", "spark.app.name" : "SparkPi",… }}’ Get Spark Job Status curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request> Kill Spark Job curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request> 36 (the snitch)
  • 37. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Outline   Spark Streaming and Spark ML
 Kafka, Cassandra, ElasticSearch, Redis, Docker   Spark Core
 Tuning and Profiling   Spark SQL
 Tuning and Customizing 37
  • 38. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Parquet Columnar File Format Based on Google Dremel paper ~2010 Collaboration with Twitter and Cloudera Columnar storage format for fast columnar aggs Supports evolving schema Supports pushdowns Support nested partitions Tight compression Min/max heuristics enable file and chunk skipping 38 Min/Max Heuristics For Chunk Skipping
  • 39. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Partitions Partition Based on Data Access Patterns /genders.parquet/gender=M/… /gender=F/… <-- Use Case: Access Users by Gender /gender=U/… Dynamic Partition Creation (Write) Dynamically create partitions on write based on column (ie. Gender) SQL: INSERT TABLE genders PARTITION (gender) SELECT … DF: gendersDF.write.format("parquet").partitionBy("gender")
 .save("/genders.parquet") Partition Discovery (Read) Dynamically infer partitions on read based on paths (ie. /gender=F/…) SQL: SELECT id FROM genders WHERE gender=F DF: gendersDF.read.format("parquet").load("/genders.parquet/").select($"id"). .where("gender=F") 39
  • 40. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Pruning Partition Pruning Filter out rows by partition SELECT id, gender FROM genders WHERE gender = ‘F’ Column Pruning Filter out columns by column filter Extremely useful for columnar storage formats (ie. Parquet) Skip entire blocks of columns SELECT id, gender FROM genders 40
  • 41. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Pushdowns aka. Predicate or Filter Pushdowns Predicate returns true or false for given function Filters rows deep into the data source Reduces number of rows returned Data Source must implement PrunedFilteredScan def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] 41
  • 42. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Compare Performance of File Formats, Partitions, and Pushdowns 42
  • 43. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Filter Pushdown and Collapsing 43 Filter pushdown (Parquet) Filter collapsing (next best thing)
  • 44. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Join Between Partitioned & Unpartitioned 44
  • 45. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Join Between Partitioned & Partitioned 45
  • 46. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark 46 Cartesian Join vs. Inner Join
  • 47. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark 47 Broadcast Join vs. Normal Shuffle Join
  • 48. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Visualizing the Query Plan 48 Effectiveness of Filter Cost-based Join Optimization Similar to MapReduce
 Map-side Join & DistributedCache Peak Memory for Joins and Aggs UnsafeFixedWidthAggregationMap
 getPeakMemoryUsedBytes()
  • 49. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Data Source API Relations (o.a.s.sql.sources.interfaces.scala) BaseRelation (abstract class): Provides schema of data TableScan (impl): Read all data from source PrunedFilteredScan (impl): Column pruning & predicate pushdowns InsertableRelation (impl): Insert/overwrite data based on SaveMode RelationProvider (trait/interface): Handle options, BaseRelation factory Filters (o.a.s.sql.sources.filters.scala) Filter (abstract class): Handles all filters supported by this source EqualTo (impl) GreaterThan (impl) StringStartsWith (impl) 49
  • 50. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Native Spark SQL Data Sources 50
  • 51. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark JSON Data Source DataFrame val ratingsDF = sqlContext.read.format("json") .load("file:/root/pipeline/datasets/dating/ratings.json.bz2") -- or – val ratingsDF = sqlContext.read.json
 ("file:/root/pipeline/datasets/dating/ratings.json.bz2") SQL Code CREATE TABLE genders USING json OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2") 51 json() convenience method
  • 52. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Parquet Data Source Configuration spark.sql.parquet.filterPushdown=true spark.sql.parquet.mergeSchema=false (unless your schema is evolving) spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable()) spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo] DataFrames val gendersDF = sqlContext.read.format("parquet") .load("file:/root/pipeline/datasets/dating/genders.parquet") gendersDF.write.format("parquet").partitionBy("gender") .save("file:/root/pipeline/datasets/dating/genders.parquet") SQL CREATE TABLE genders USING parquet OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet") 52
  • 53. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark ElasticSearch Data Source Github https://github.com/elastic/elasticsearch-hadoop Maven org.elasticsearch:elasticsearch-spark_2.10:2.1.0 Code val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", 
 "es.port" -> "<port>") df.write.format("org.elasticsearch.spark.sql”).mode(SaveMode.Overwrite) .options(esConfig).save("<index>/<document-type>") 53
  • 54. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Cassandra Data Source (DataStax) Github https://github.com/datastax/spark-cassandra-connector Maven com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1 Code ratingsDF.write .format("org.apache.spark.sql.cassandra") .mode(SaveMode.Append) .options(Map("keyspace"->"<keyspace>", "table"->"<table>")).save(…) 54
  • 55. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Mythical New Cassandra Data Source By-pass Cassandra CQL “front door” (which is optimized for transactional data) Bulk read and write directly against SSTables (Similar to 5 year old NetflixOSS project Aegisthus) Promotes Cassandra to first-class analytics option Replicated analytics cluster not needed 55
  • 56. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark DB2 and BigSQL Data Sources (IBM) Coming Soon! 56
  • 57. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Creating a Custom Data Source ① Study existing implementations o.a.s.sql.execution.datasources.jdbc.JDBCRelation ② Extend base traits & implement required methods o.a.s.sql.sources.{BaseRelation,PrunedFilterScan} Spark JDBC (o.a.s.sql.execution.datasources.jdbc) class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation DataStax Cassandra (o.a.s.sql.cassandra) class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation! 57
  • 58. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Create a Custom Data Source 58
  • 59. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Publishing Custom Data Sources 59 spark-packages.org
  • 60. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc IBM | spark.tc Spark SQL UDF Code Generation 100+ UDFs now generating code More to come in Spark 1.6+ Details in 
 SPARK-8159, SPARK-9571 Every UDF must use Expressions and
 implement Expression.genCode() to participate in the fun Lambdas (RDD or Dataset API) and sqlContext.udf.registerFunction() are not enough!!
  • 61. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Creating a Custom UDF with Code Gen ① Study existing implementations o.a.s.sql.catalyst.expressions.Substring ② Extend and implement base trait o.a.s.sql.catalyst.expressions.Expression.genCode ③ Don’t forget about Python! python.pyspark.sql.functions.py 61
  • 62. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demo! Creating a Custom UDF participating in Code Generation 62
  • 63. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Spark 1.6 Improvements 63
  • 64. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Adaptive Query Execution Adapt query execution using data from previous stages Dynamically choose spark.sql.shuffle.partitions (default 200) 64 Broadcast Join (popular keys) Shuffle Join (not-so-popular keys) Hybrid
 Join
  • 65. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Adaptive Memory Management Spark <1.6 Manual configure between 2 memory regions Spark execution engine (shuffles, joins, sorts, aggs) spark.shuffle.memoryFraction RDD Data Cache spark.storage.memoryFraction Spark 1.6+ Unified memory regions Dynamically expand/contract memory regions Supports minimum for RDD storage (LRU Cache) 65
  • 66. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Metrics Shows exact memory usage per operator & per node Helps debugging and identifying skew 66
  • 67. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Spark SQL API Datasets type safe API (similar to RDDs) utilizing Tungsten val ds = sqlContext.read.text("ratings.csv").as[String] val df = ds.flatMap(_.split(",")).filter(_ != "").toDF() // RDD API, convert to DF val agg = df.groupBy($"rating").agg(count("*") as "ct”).orderBy($"ct" desc) Typed Aggregators used alongside UDFs and UDAFs val simpleSum = new Aggregator[Int, Int, Int] with Serializable { def zero: Int = 0 def reduce(b: Int, a: Int) = b + a def merge(b1: Int, b2: Int) = b1 + b2 def finish(b: Int) = b }.toColumn val sum = Seq(1,2,3,4).toDS().select(simpleSum) Query files directly without registerTempTable() %sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json` 67
  • 68. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Spark Streaming State Management New trackStateByKey() Store deltas, compaction later No longer scan every key Sessionization TTL Integrated A/B Testing Shows failure output in Admin UI 68
  • 69. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Thank You!!! Chris Fregly IBM Spark Tech Center (http://spark.tc) San Francisco, California, USA advancedspark.com Sign up for the Meetup and Book Clone, Contribute, Commit on Github Run All Demos in Docker 2100+ Docker Downloads!! Find me on LinkedIn, Twitter, Github, Email, Fax 69
  • 70. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc What’s Next? 70 After Dark 1.6+
  • 71. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark What’s Next? Autoscaling Docker/Spark Workers Completely Docker-based Docker Compose, Google Kubernetes Lots of Demos and Examples More Zeppelin & IPython/Jupyter notebooks More advanced analytics use cases More Performance Profiling Work closely with Netflix & Databricks Identify & fix performance bottlenecks 71
  • 72. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Upcoming Meetups and Conferences London Spark Meetup (Oct 12th) Scotland Data Science Meetup (Oct 13th) Dublin Spark Meetup (Oct 15th) Barcelona Spark Meetup (Oct 20th) Madrid Big Data Meetup (Oct 22nd) Paris Spark Meetup (Oct 26th) Amsterdam Spark Summit (Oct 27th) Brussels Spark Meetup (Oct 30th) Zurich Big Data Meetup (Nov 2nd) Geneva Spark Meetup (Nov 5th) San Francisco Datapalooza (Nov 10th) San Francisco Advanced Spark (Nov 12th) 72 Oslo Big Data Hadoop Meetup (Nov 19th) Helsinki Spark Meetup (Nov 20th) Stockholm Spark Meetup (Nov 23rd) Copenhagen Spark Meetup (Nov 25th) Budapest Spark Meetup (Nov 26th) Istanbul Spark Meetup (Nov 28th) Singapore Pre-Strata Spark Meetup (Dec 1st) Sydney Spark Meetup (Dec 8th) Melbourne Spark Meetup (Dec 9th) San Francisco Advanced Spark (Dec 10th) San Francisco Univ of San Francisco (Dec 11th) Toronto Spark Meetup (Dec 14th)
  • 73. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark advancedspark.com @cfregly