IBM Spark
 spark.tc
Spark After Dark 1.5
High Performance, Real-time, Streaming,
Machine Learning, Natural Language Processing,
Text Analytics, and Recommendations

Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center
** We’re Hiring -- Only Nice People, Please!! **
November 20, 2015
Power of data. Simplicity of design. Speed of innovation.
Who Am I?
Streaming Data Engineer
Open Source Committer
Data Solutions Engineer
Apache Spark Contributor
Principal Data Solutions Engineer
IBM Spark Technology Center
Founder
Advanced Apache Spark Meetup
Author
Advanced Spark (due 2016)
My Ma’s First Time in California
Random Slide: More Ma “First Time” Pics
In California
 Using Chopsticks
 Using “New” iPhone
Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza.io (Nov 10th)
San Francisco Advanced Spark (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 26th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Spark (Dec 8th)
Mountain View Advanced Spark (Dec 10th)
Toronto Spark Meetup (Dec 14th)
Austin Data Days Conference (Jan 2016)
Advanced Apache Spark Meetup
Meetup Metrics
1600+ Members in just 4 mos!
Top 5 Most Active Spark Meetup!!

Meetup Goals
  Dig deep into codebase of Spark and related projects
  Study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
  Surface and share patterns and idioms of these well-designed, distributed, big data components
All Slides and Code Are Available!

advancedspark.com
slideshare.net/cfregly
github.com/fluxcapacitor
hub.docker.com/r/fluxcapacitor

What is “Spark After Dark”?
Spark-based, Advanced Analytics Reference App
End-to-End, Scalable, Real-time Big Data Pipeline
Demonstration of Spark & Related Big Data Projects
github.com/fluxcapacitor
Tools of This Talk
  Kafka
  Redis
  Docker
  Ganglia
  Cassandra
  Parquet, JSON, ORC, Avro
  Apache Zeppelin Notebooks
  Spark SQL, DataFrames, Hive
  ElasticSearch, Logstash, Kibana
  Spark ML, GraphX, Stanford CoreNLP
…
github.com/fluxcapacitor
hub.docker.com/r/fluxcapacitor
Themes of this Talk
 Filter
 Off-Heap 
 Parallelize 
 Approximate
 Find Similarity
 Minimize Seeks
 Maximize Scans
 Customize for Workload
 Tune Performance At Every Layer
  Be Nice, Collaborate!
Like a Mom!!
Presentation Outline
 Spark Core: Tuning & Mechanical Sympathy
 Spark SQL: Query Optimizing & Catalyst
 Spark Streaming: Scaling & Approximations
 Spark ML: Featurizing & Recommendations
Spark Core: Tuning & Mechanical Sympathy
Understand and Acknowledge Mechanical Sympathy

Study AlphaSort and 100 TB GraySort Challenge

Dive Deep into Project Tungsten

Mechanical Sympathy
Hardware and software working together in harmony.

- Martin Thompson

 http://mechanical-sympathy.blogspot.com


Whatever your data structure, my array will beat it.

- Scott Meyers

 Every C++ Book, basically

Hair
Sympathy
- Bruce Jenner
Spark and Mechanical Sympathy
Project Tungsten (Spark 1.4-1.6+)
  Minimize Memory and GC
  Maximize CPU Cache Locality
GraySort Challenge (Spark 1.1-1.2)
  Saturate Network I/O
  Saturate Disk I/O
AlphaSort Technique: Sort 100-Byte Records
Naïve
  List [Pointer]
  Must dereference key for comparison
  Ptr -> (Key, Value): dereference for key comparison
AlphaSort
  List [(Key, Pointer)]
  Key is directly available for comparison
  (Key, Ptr) -> Value: dereference not required!
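The difference between the two layouts can be sketched in plain Scala. This is only an illustration of the access pattern, not Spark's or AlphaSort's actual code: `AlphaSortSketch`, `Record`, `naiveSort` and `alphaSort` are hypothetical names, and JVM tuples do not give the contiguous memory layout the real technique relies on.

```scala
object AlphaSortSketch {
  // A 100-byte record: the first 10 bytes are the key, the rest is the value.
  final case class Record(bytes: Array[Byte]) {
    def key: Array[Byte] = bytes.take(10)
  }

  // Compare two keys lexicographically as unsigned bytes.
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    var i = 0
    while (i < a.length && i < b.length) {
      val c = (a(i) & 0xff) - (b(i) & 0xff)
      if (c != 0) return c
      i += 1
    }
    a.length - b.length
  }

  // Naive: sort "pointers" (indices); every comparison dereferences two records.
  def naiveSort(records: Array[Record]): Array[Int] =
    records.indices.toArray.sortWith((x, y) => compareKeys(records(x).key, records(y).key) < 0)

  // AlphaSort-style: copy each key next to its pointer once, then sort the
  // (key, pointer) pairs without touching the records again.
  def alphaSort(records: Array[Record]): Array[Int] = {
    val keyed: Array[(Array[Byte], Int)] =
      records.zipWithIndex.map { case (r, i) => (r.key, i) }
    keyed.sortWith((x, y) => compareKeys(x._1, y._1) < 0).map(_._2)
  }
}
```

Both functions return the record indices in key order; only the second keeps the key beside the pointer during the sort.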
CPU Cache Line and Memory Sympathy
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
  Not CPU cache-line friendly!
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
  CPU cache-line friendly!
Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
  2x CPU cache-line friendly!
* 4-byte pointer assumes compressed OOPs
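The three record sizes above can be checked with a little arithmetic. The sketch below assumes a 64-byte cache line, which is common on x86 but is an assumption here; `CacheLineMath` and its members are hypothetical names.

```scala
object CacheLineMath {
  // Assumed cache line size in bytes; check your own hardware.
  val CacheLineBytes = 64

  // Number of whole records that fit in one cache line.
  def recordsPerLine(recordBytes: Int): Int = CacheLineBytes / recordBytes

  // True when records packed back to back never straddle a line boundary.
  def neverStraddles(recordBytes: Int): Boolean = CacheLineBytes % recordBytes == 0
}
```

Under that assumption, 14-byte entries straddle line boundaries, 16-byte entries fit 4 per line, and 8-byte entries fit 8 per line, which is where the "2x" comes from.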
Performance Comparison
Similar Trick: Direct Cache Access (DCA)
Pull out packet header alongside pointer to payload
CPU Cache Line Sizes
(Screenshots: CPU cache line sizes on my laptop and on my SoftLayer bare-metal server)
Cache Hits: Sequential v Random Access
Mechanical Sympathy
CPU Cache Lines and Matrix Multiplication

CPU Cache Naïve Matrix Multiplication
// Dot product of each row of A with each column of B
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matB(k)(j)

Bad: matB is walked down a column (a new row for every k), so each read lands on a different CPU cache line and pre-fetching is ineffective
CPU Cache Friendly Matrix Multiplication


// Transpose B (matBT has numColsB rows and numRowsB columns)
for (i <- 0 until numRowsB)
  for (j <- 0 until numColsB)
    matBT(j)(i) = matB(i)(j)

// Modify dot product calculation for B Transpose
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matBT(j)(k)

Good: Full CPU cache line, effective prefetching
OLD: res(i)(j) += matA(i)(k) * matB(k)(j)
Reference j before k
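A self-contained version of the two loops might look like the sketch below. The object and method names are hypothetical, and matrices are plain `Array[Array[Double]]`, which is an assumption rather than what the demo used. It only shows that the two variants compute the same product; it makes no timing claim.

```scala
object MatMulSketch {
  type Matrix = Array[Array[Double]]

  // Naive: the inner loop reads b(k)(j), jumping to a different row of b on every step.
  def naive(a: Matrix, b: Matrix): Matrix = {
    val rows = a.length; val cols = b(0).length; val inner = b.length
    val res = Array.ofDim[Double](rows, cols)
    for (i <- 0 until rows; j <- 0 until cols; k <- 0 until inner)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  // Cache-friendly: transpose b once so the inner loop scans two rows sequentially.
  def cacheFriendly(a: Matrix, b: Matrix): Matrix = {
    val rows = a.length; val cols = b(0).length; val inner = b.length
    val bT = Array.ofDim[Double](cols, inner)
    for (i <- 0 until inner; j <- 0 until cols)
      bT(j)(i) = b(i)(j)
    val res = Array.ofDim[Double](rows, cols)
    for (i <- 0 until rows; j <- 0 until cols; k <- 0 until inner)
      res(i)(j) += a(i)(k) * bT(j)(k)
    res
  }
}
```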
Instrumenting and Monitoring CPU
Use Linux perf command!
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
Demo!
Compare CPU Naïve & Cache-Friendly Matrix Multiplication
Results of Matrix Multiply Comparison
Naïve Matrix Multiply vs. Cache-Friendly Matrix Multiply
(Chart: ratios of ~27x, ~13x, ~13x, ~2x and ~10x between the two runs across the counters below; 55 hp vs. 550 hp)
perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
(JVM run with -XX:-Inline)
Mechanical Sympathy
CPU Cache Lines and Lock-Free Thread Sync

CPU Cache Naïve Tuple Counters
object CacheNaiveTupleIncrement {
  var tuple = (0,0)
  …

  def increment(leftIncrement: Int, rightIncrement: Int) : (Int, Int) = {
    this.synchronized {
      tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement)
      tuple
    }
  }
}
CPU Cache Naïve Case Class Counters
case class MyTuple(left: Int, right: Int)

object CacheNaiveCaseClassCounters {
  var tuple = new MyTuple(0,0)
  …

  def increment(leftIncrement: Int, rightIncrement: Int) : MyTuple = {
    this.synchronized {
      tuple = new MyTuple(tuple.left + leftIncrement,
                          tuple.right + rightIncrement)
      tuple
    }
  }
}
CPU Cache Friendly Lock-Free Counters
import java.util.concurrent.atomic.AtomicLong

object CacheFriendlyLockFreeCounters {
  // a single Long (8-bytes) will maintain 2 separate Ints (4-bytes each)
  val tuple = new AtomicLong()
  …
  def increment(leftIncrement: Int, rightIncrement: Int) : Long = {
    var originalLong = 0L
    var updatedLong = 0L
    do {
      originalLong = tuple.get()
      val originalRightInt = originalLong.toInt // cast originalLong to Int to get right counter
      val originalLeftInt = (originalLong >>> 32).toInt // shift right to get left counter
      val updatedRightInt = originalRightInt + rightIncrement // increment right counter
      val updatedLeftInt = originalLeftInt + leftIncrement // increment left counter
      updatedLong = updatedLeftInt.toLong << 32 // put the left counter in the upper 32 bits
      updatedLong |= updatedRightInt & 0xFFFFFFFFL // mask so a negative right counter cannot clobber the left
    } while (!tuple.compareAndSet(originalLong, updatedLong)) // retry if another thread won the race
    updatedLong
  }
}

Q: Why not @volatile long?
A: @volatile makes each single read or write of a 64-bit long atomic, but not the read-modify-write sequence; compareAndSet does. (The Java Memory Model only allows non-atomic access for non-volatile longs and doubles.)
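Because the counter pair comes back as one packed Long, callers need a way to split it again. The sketch below is a minimal, self-contained variant with hypothetical helper names (`pack`, `leftOf`, `rightOf`); it uses a plain `while` loop so it also compiles on Scala 3, which dropped `do ... while`.

```scala
import java.util.concurrent.atomic.AtomicLong

object PackedCounters {
  private val packed = new AtomicLong(0L)

  // Left counter in the upper 32 bits, right counter in the lower 32 bits.
  def pack(left: Int, right: Int): Long = (left.toLong << 32) | (right & 0xFFFFFFFFL)
  def leftOf(p: Long): Int = (p >>> 32).toInt
  def rightOf(p: Long): Int = p.toInt

  // Lock-free update: recompute from the current value and retry until the CAS succeeds.
  def increment(leftInc: Int, rightInc: Int): (Int, Int) = {
    var updated = 0L
    var done = false
    while (!done) {
      val original = packed.get()
      updated = pack(leftOf(original) + leftInc, rightOf(original) + rightInc)
      done = packed.compareAndSet(original, updated)
    }
    (leftOf(updated), rightOf(updated))
  }
}
```

Masking the right counter with `0xFFFFFFFFL` keeps a negative value from sign-extending into the left counter's bits.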
Demo!
Compare CPU Naïve & Cache-Friendly Tuple Counter Sync
Results of Counters Comparison
Compared: Naïve Tuple Counters, Naïve Case Class Counters, Cache Friendly Lock-Free Counters
(Chart: ratios of ~2x, ~1.5x, ~3.5x, ~2x, ~2x, ~1.5x, ~1.5x and ~1.5x between the runs)
Profiling Visualizations: Flame Graphs
Example: Spark Word Count
Java Stack Traces (-XX:+PreserveFramePointer)
Plateaus are Bad!!
100TB Daytona GraySort Challenge
Focus on Network and Disk I/O Optimizations
Improve Data Structs/Algos for Sort & Shuffle
Saturate Network and Disk Controllers
Winning Results
Spark Goals
  Saturate Network I/O
  Saturate Disk I/O
(2013) (2014)
Winning Hardware Configuration
Compute
  206 Workers, 1 Master (AWS EC2 i2.8xlarge)
  32 Intel Xeon CPU E5-2670 @ 2.5 GHz
  244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
  3 GBps mixed read/write disk I/O per node
Network
  AWS Placement Groups, VPC, Enhanced Networking
  Single Root I/O Virtualization (SR-IOV)
  10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)
Winning Software Configuration
Spark 1.2, OpenJDK 1.7
Disable caching, compression, speculative execution, shuffle spill
Force NODE_LOCAL task scheduling for optimal data locality
HDFS 2.4.1 short-circuit local reads, 2x replication
Empirically chose between 4-6 partitions per CPU core
  206 nodes * 32 cores = 6,592 cores
  6,592 cores * 4 = 26,368 partitions
  6,592 cores * 6 = 39,552 partitions
  6,592 cores * 4.25 ≈ 28,000 partitions (empirical best)
Range partitioning takes advantage of sequential keyspace
  Required ~10s of sampling 79 keys from each partition
New Sort Shuffle Manager for Spark 1.2
Original “hash-based” vs. new “sort-based”
①  Use fewer OS resources (socket buffers, file descriptors)
②  TimSort partitions in-memory
③  MergeSort partitions on-disk into a single master file
④  Serve partitions from master file: seek once, sequential scan
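The "single master file, seek once, sequential scan" idea can be sketched with an in-memory byte array standing in for the file. This is not Spark's shuffle code: `SortShuffleSketch`, `MapOutput` and `write` are hypothetical names, and each record is simplified to a (partition id, one payload byte) pair.

```scala
object SortShuffleSketch {
  // One mapper's output: a single data "file" plus an index of partition offsets.
  final case class MapOutput(data: Array[Byte], offsets: Array[Int]) {
    // Serve one partition: a single seek (the offset), then a sequential scan.
    def partition(p: Int): Array[Byte] = data.slice(offsets(p), offsets(p + 1))
  }

  def write(records: Seq[(Int, Byte)], numPartitions: Int): MapOutput = {
    // Sort by partition id so each partition becomes one contiguous run.
    val sorted = records.sortBy(_._1)
    val counts = new Array[Int](numPartitions)
    sorted.foreach { case (p, _) => counts(p) += 1 }
    // offsets(p) is where partition p starts; the last entry is the end of the data.
    val offsets = counts.scanLeft(0)(_ + _)
    MapOutput(sorted.map(_._2).toArray, offsets)
  }
}
```

A hash-based layout would instead need one output per (mapper, partition) pair; here each mapper produces one data array and one small index.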
Asynchronous Network Module
Switch from synchronous java.nio to asynchronous Netty
Switch to zero-copy epoll
  Use only kernel-space between disk and network controllers
Custom memory management
  spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning
  spark.shuffle.io.preferDirectBufs=true
    Reuse off-heap buffers
  spark.shuffle.io.numConnectionsPerPeer=8 (for example)
    Increase to saturate hosts with multiple disks (8 x 800 GB SSD)
Details in SPARK-2468
Custom Algorithms and Data Structures
Optimized for sort & shuffle workloads
o.a.s.util.collection.TimSort[K,V]
  Based on JDK 1.7 TimSort
  Performs best with partially-sorted runs
  Optimized for elements of (K,V) pairs
  Sorts impl of SortDataFormat (i.e. KVArraySortDataFormat)
o.a.s.util.collection.AppendOnlyMap
  Open addressing hash, quadratic probing
  Array of [(K, V), (K, V)]
  Good memory locality
  Keys never removed, values only append
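A stripped-down map in the same spirit is sketched below: open addressing, keys and values interleaved in one flat array, and no removal. It is an illustration under assumptions, not Spark's AppendOnlyMap: `FlatMap` is a hypothetical name, capacity is fixed, keys are Strings, values are Ints, and the probe sequence uses triangular-number steps as one form of quadratic probing.

```scala
object AppendOnlyMapSketch {
  // Fixed-capacity sketch: capacity must be a power of two; no resizing, no removal.
  final class FlatMap(capacity: Int) {
    require(capacity > 0 && (capacity & (capacity - 1)) == 0, "capacity must be a power of two")
    private val mask = capacity - 1
    // Keys and values interleaved in one array: [k0, v0, k1, v1, ...]
    private val data = new Array[AnyRef](2 * capacity)
    private var count = 0

    // Find the slot holding `key`, or the first empty slot on its probe sequence.
    private def slotFor(key: AnyRef): Int = {
      var pos = key.hashCode & mask
      var step = 1
      while (data(2 * pos) != null && data(2 * pos) != key) {
        pos = (pos + step) & mask // offsets 1, 3, 6, 10, ... from the home slot
        step += 1
      }
      pos
    }

    // Insert or update in place; `f` receives the previous value, if any.
    def changeValue(key: String)(f: Option[Int] => Int): Int = {
      val pos = slotFor(key)
      val isNew = data(2 * pos) == null
      if (isNew) {
        // Keep one slot empty so lookups of absent keys always terminate.
        require(count < capacity - 1, "sketch map is full")
        data(2 * pos) = key
        count += 1
      }
      val old = if (isNew) None else Some(data(2 * pos + 1).asInstanceOf[Integer].intValue)
      val updated = f(old)
      data(2 * pos + 1) = Integer.valueOf(updated)
      updated
    }

    def get(key: String): Option[Int] = {
      val pos = slotFor(key)
      if (data(2 * pos) == null) None
      else Some(data(2 * pos + 1).asInstanceOf[Integer].intValue)
    }

    def size: Int = count
  }
}
```

A word count, for example, would call `changeValue(word)(old => old.getOrElse(0) + 1)` for each word.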
Daytona GraySort Challenge Goal Success






1.1 GBps/node network I/O (Reducers)
  Theoretical max = 1.25 GBps for 10 Gbps Ethernet
3 GBps/node disk I/O (Mappers)
Aggregate cluster network I/O!
  220 GBps / 206 nodes ≈ 1.1 GBps per node
Shuffle Performance Tuning Tips
Hash Shuffle Manager (Deprecated)
  spark.shuffle.consolidateFiles (Mapper)
  o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files
  Increase spark.shuffle.file.buffer (Reducer)
  Increase spark.reducer.maxSizeInFlight if memory allows
Use Smaller Number of Larger Executors
  Minimizes intermediate files and overall shuffle
  More opportunity for PROCESS_LOCAL
  Many threads (1 per CPU)
SQL: BroadcastHashJoin vs. ShuffledHashJoin
  spark.sql.autoBroadcastJoinThreshold
  Use DataFrame.explain(true) or EXPLAIN to verify
Project Tungsten
Data Structs & Algos Operate Directly on Byte Arrays
Maximize CPU Cache Locality, Minimize GC
Utilize Dynamic Code Generation
SPARK-7076 (Spark 1.4)
Quick Review of Project Tungsten JIRAs
SPARK-7076 (Spark 1.4)
Why is CPU the Bottleneck?
CPU is used for serialization, hashing, compression!

Network and Disk I/O bandwidth are relatively high

GraySort optimizations improved network & shuffle

Partitioning, pruning, and predicate pushdowns

Binary, compressed, columnar file formats (Parquet)
Yet Another Spark Shuffle Manager!
spark.shuffle.manager =
  hash (Deprecated)
    < 10,000 reducers
    Output partition file hashes the key of (K,V) pair
    Mapper creates an output file per partition
    Leads to M*P output files for all partitions
  sort (GraySort Challenge)
    > 10,000 reducers
    Default from Spark 1.2-1.5
    Mapper creates single output file for all partitions
    Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory
    Uses custom data structures and algorithms for sort-shuffle workload
    Wins Daytona GraySort Challenge
  tungsten-sort (Project Tungsten)
    Default since 1.5
    Modification of existing sort-based shuffle
    Uses sun.misc.Unsafe for self-managed memory and garbage collection
    Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms
    Perform joins, sorts, and other operators on both serialized and compressed byte buffers
CPU & Memory Optimizations
Custom Managed Memory
  Reduces GC overhead
  Both on and off heap
  Exact size calculations
Direct Binary Processing
  Operate on serialized/compressed arrays
  Kryo can reorder/sort serialized records
  LZF can reorder/sort compressed records
More CPU Cache-aware Data Structs & Algorithms
  o.a.s.sql.catalyst.expressions.UnsafeRow
  o.a.s.unsafe.map.BytesToBytesMap
Code Generation (default in 1.5)
  Generate source code from overall query plan
  100+ UDFs converted to use code generation
Related classes: UnsafeFixedWidthAggregationMap, TungstenAggregationIterator, CodeGenerator, GenerateUnsafeRowJoiner, UnsafeSortDataFormat, UnsafeShuffleSortDataFormat, PackedRecordPointer, UnsafeRow, UnsafeInMemorySorter, UnsafeExternalSorter, UnsafeShuffleWriter, UnsafeProjection (mostly same join code), UnsafeShuffleManager, UnsafeShuffleInMemorySorter, UnsafeShuffleExternalSorter
Details in SPARK-7075
sun.misc.Unsafe
Info

addressSize()

pageSize()
Objects

allocateInstance()

objectFieldOffset()
Classes

staticFieldOffset()

defineClass()

defineAnonymousClass()

ensureClassInitialized()
Synchronization

monitorEnter()

tryMonitorEnter()

monitorExit()

compareAndSwapInt()

putOrderedInt()
Arrays

arrayBaseOffset()

arrayIndexScale()
Memory

allocateMemory()

copyMemory()

freeMemory()

getAddress() – not guaranteed after GC

getInt()/putInt()

getBoolean()/putBoolean()

getByte()/putByte()

getShort()/putShort()

getLong()/putLong()

getFloat()/putFloat()

getDouble()/putDouble()

getObjectVolatile()/putObjectVolatile()
Used by 

Tungsten
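A minimal sketch of the Memory group of calls is shown below: allocate an off-heap block, write and read longs at raw addresses, then free it. `UnsafeSketch` and `roundTrip` are hypothetical names, and obtaining the instance through the private `theUnsafe` field is a common workaround rather than a supported API. This assumes a JDK where these methods are still enabled; recent JDKs deprecate them and may warn or refuse.

```scala
object UnsafeSketch {
  // Reflectively fetch the singleton; Unsafe.getUnsafe() rejects application code.
  private val unsafe: sun.misc.Unsafe = {
    val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    field.get(null).asInstanceOf[sun.misc.Unsafe]
  }

  // Write `values` into a freshly allocated off-heap block, read them back, free the block.
  def roundTrip(values: Array[Long]): Array[Long] = {
    val address = unsafe.allocateMemory(values.length * 8L)
    try {
      var i = 0
      while (i < values.length) {
        unsafe.putLong(address + i * 8L, values(i))
        i += 1
      }
      Array.tabulate(values.length)(j => unsafe.getLong(address + j * 8L))
    } finally {
      // Off-heap memory is invisible to the GC, so it must be freed explicitly.
      unsafe.freeMemory(address)
    }
  }
}
```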
Spark + sun.misc.Unsafe
org.apache.spark.sql.execution.
aggregate.SortBasedAggregate
aggregate.TungstenAggregate
aggregate.AggregationIterator
aggregate.udaf
aggregate.utils
SparkPlanner
rowFormatConverters
UnsafeFixedWidthAggregationMap
UnsafeExternalSorter
UnsafeExternalRowSorter
UnsafeKeyValueSorter
UnsafeKVExternalSorter
local.ConvertToUnsafeNode
local.ConvertToSafeNode
local.HashJoinNode
local.ProjectNode
local.LocalNode
local.BinaryHashJoinNode
local.NestedLoopJoinNode
joins.HashJoin
joins.HashSemiJoin
joins.HashedRelation
joins.BroadcastHashJoin
joins.ShuffledHashOuterJoin (not yet converted)
joins.BroadcastHashOuterJoin
joins.BroadcastLeftSemiJoinHash
joins.BroadcastNestedLoopJoin
joins.SortMergeJoin
joins.LeftSemiJoinBNL
joins.SortMergeOuterJoin
Exchange
SparkPlan
UnsafeRowSerializer
SortPrefixUtils
sort
basicOperators
aggregate.SortBasedAggregationIterator
aggregate.TungstenAggregationIterator
datasources.WriterContainer
datasources.json.JacksonParser
datasources.jdbc.JDBCRDD
org.apache.spark.
unsafe.Platform
unsafe.KVIterator
unsafe.array.LongArray
unsafe.array.ByteArrayMethods
unsafe.array.BitSet
unsafe.bitset.BitSetMethods
unsafe.hash.Murmur3_x86_32
unsafe.map.BytesToBytesMap
unsafe.map.HashMapGrowthStrategy
unsafe.memory.TaskMemoryManager
unsafe.memory.ExecutorMemoryManager
unsafe.memory.MemoryLocation
unsafe.memory.UnsafeMemoryAllocator
unsafe.memory.MemoryAllocator (trait/interface)
unsafe.memory.MemoryBlock
unsafe.memory.HeapMemoryAllocator
unsafe.sort.RecordComparator
unsafe.sort.PrefixComparator
unsafe.sort.PrefixComparators
unsafe.sort.UnsafeSorterSpillWriter
serializer.DummySerializationInstance
shuffle.unsafe.UnsafeShuffleManager
shuffle.unsafe.UnsafeShuffleSortDataFormat
shuffle.unsafe.SpillInfo
shuffle.unsafe.UnsafeShuffleWriter
shuffle.unsafe.UnsafeShuffleExternalSorter
shuffle.unsafe.PackedRecordPointer
shuffle.ShuffleMemoryManager
util.collection.unsafe.sort.UnsafeSorterSpillMerger
util.collection.unsafe.sort.UnsafeSorterSpillReader
util.collection.unsafe.sort.UnsafeSorterSpillWriter
util.collection.unsafe.sort.UnsafeShuffleInMemorySorter
util.collection.unsafe.sort.UnsafeInMemorySorter
util.collection.unsafe.sort.RecordPointerAndKeyPrefix
util.collection.unsafe.sort.UnsafeSorterIterator
network.shuffle.ExternalShuffleBlockResolver
scheduler.Task
rdd.SqlNewHadoopRDD
executor.Executor
org.apache.spark.sql.catalyst.expressions.
regexpExpressions
BoundAttribute
SortOrder
SpecializedGetters
ExpressionEvalHelper
UnsafeArrayData
UnsafeReaders
UnsafeMapData
Projection
LiteralGenerator
UnsafeRow
JoinedRow
InputFileName
SpecificMutableRow
codegen.CodeGenerator
codegen.GenerateProjection
codegen.GenerateUnsafeRowJoiner
codegen.GenerateSafeProjection
codegen.GenerateUnsafeProjection
codegen.BufferHolder
codegen.UnsafeRowWriter
codegen.UnsafeArrayWriter
complexTypeCreator
rows
literals
misc
stringExpressions
Over 200 source files affected!!
Traditional Java Object Row Layout
4-byte String




Multi-field Object


Custom Data Structures for Workload


UnsafeRow (Dense Binary Row)
  Dense, 8-bytes per field (word-aligned)
TaskMemoryManager (Virtual Memory Address)
  OS-Style Memory Paging
BytesToBytesMap (Dense Binary HashMap)
  AlphaSort-Style (Key + Pointer)
UnsafeRow Layout Example
Pre-Tungsten




Tungsten
Custom Memory Management
o.a.s.memory.
  TaskMemoryManager & MemoryConsumer
    Memory management: virtual memory allocation, paging
    Off-heap: direct 64-bit address
    On-heap: 13-bit page num + 27-bit page offset
    2^13 pages * 2^27 page size = 1 TB RAM per Task
o.a.s.shuffle.sort.
  PackedRecordPointer
    64-bit word
      (24-bit partition key, (13-bit page num, 27-bit page offset))
o.a.s.unsafe.types.
  UTF8String
    Primitive Array[Byte]
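The 64-bit layout described above (24-bit partition, 13-bit page number, 27-bit page offset) can be sketched with plain shifts and masks. `PackedPointerSketch` and its members are hypothetical names, not the methods of Spark's PackedRecordPointer.

```scala
object PackedPointerSketch {
  val PartitionBits = 24
  val PageBits = 13
  val OffsetBits = 27 // 24 + 13 + 27 = 64 bits in total

  // Layout, high bits to low bits: [partition | page number | offset in page]
  def pack(partition: Int, page: Int, offset: Int): Long = {
    require(partition >= 0 && partition < (1 << PartitionBits), "partition out of range")
    require(page >= 0 && page < (1 << PageBits), "page out of range")
    require(offset >= 0 && offset < (1 << OffsetBits), "offset out of range")
    (partition.toLong << (PageBits + OffsetBits)) | (page.toLong << OffsetBits) | offset.toLong
  }

  def partition(p: Long): Int = (p >>> (PageBits + OffsetBits)).toInt
  def page(p: Long): Int = ((p >>> OffsetBits) & ((1L << PageBits) - 1)).toInt
  def offset(p: Long): Int = (p & ((1L << OffsetBits) - 1)).toInt

  // 2^13 pages * 2^27 bytes per page = 2^40 bytes addressable per task
  val addressableBytesPerTask: Long = 1L << (PageBits + OffsetBits)
}
```

The last value reproduces the slide's arithmetic: 2^40 bytes is 1 TB (binary).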
UnsafeFixedWidthAggregationMap
Aggregations
o.a.s.sql.execution.
  UnsafeFixedWidthAggregationMap
    Uses BytesToBytesMap
    In-place updates of serialized data
    No object creation on hot-path
    Improved external agg support
    No OOMs for large, single key aggs
o.a.s.sql.catalyst.expressions.codegen.
  GenerateUnsafeRowJoiner
    Combine 2 UnsafeRows into 1
o.a.s.sql.execution.aggregate.
  TungstenAggregate & TungstenAggregationIterator
    Operates directly on serialized, binary UnsafeRow
    2 Steps: hash-based agg (grouping), then sort-based agg
    Supports spilling and external merge sorting
Equality
Bitwise comparison on UnsafeRow

No need to calculate equals(), hashCode()

Row 1
Equals!
Row 2
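The idea of comparing rows as raw bytes can be sketched as follows. The encoding here is a hypothetical stand-in (8 bytes per field, nothing else); the real UnsafeRow format carries more than this, so treat it only as an illustration of byte-wise equality.

```scala
import java.nio.ByteBuffer
import java.util.Arrays

object BinaryRowEquality {
  // Hypothetical fixed-width encoding: every field occupies 8 bytes.
  def encode(fields: Long*): Array[Byte] = {
    val buf = ByteBuffer.allocate(8 * fields.length)
    fields.foreach(f => buf.putLong(f))
    buf.array()
  }

  // Two rows are equal exactly when their bytes are equal; no per-field equals() or hashCode().
  def sameRow(a: Array[Byte], b: Array[Byte]): Boolean = Arrays.equals(a, b)
}
```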
Joins
Surprisingly, not many code changes

o.a.s.sql.catalyst.expressions.

UnsafeProjection

 
Converts InternalRow to UnsafeRow
Sorting
o.a.s.util.collection.unsafe.sort.
  UnsafeSortDataFormat
  UnsafeInMemorySorter
  UnsafeExternalSorter
  RecordPointerAndKeyPrefix
  UnsafeShuffleWriter
AlphaSort-Style Cache Friendly
  Ptr + Key-Prefix: 2x CPU Cache-line Friendly!
Using multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining. This affects sort & shuffle performance.
Supports merging compressed records if compression CODEC supports it (LZF)
Spilling
Efficient Spilling
  Exact data size is known
  No need to maintain heuristics & approximations
  Controls amount of spilling
Spill merge on compressed, binary records!
  If compression CODEC supports it
Exact peak memory for Spark jobs: UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()
Code Generation
Problem
Boxing causes excessive object creation 
Expensive expression tree evals per row
JVM can’t inline polymorphic impls
Solution
Codegen by-passes virtual function calls
Defer source code generation to each operator, UDF, UDAF
Use Scala quasiquote macros for Scala AST source code gen
Rewrite and optimize code for overall plan, 8-byte align, etc
Use Janino to compile generated source code into bytecode
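The contrast between evaluating an expression tree per row and generating one flat piece of source for it can be sketched like this. Everything here is hypothetical (`Expr`, `Col`, `Lit`, `Add`, `Mul`, `generateMethod`); it only produces a Java source string and stops short of compiling it, where the slide says Spark hands the generated source to Janino.

```scala
object CodegenSketch {
  sealed trait Expr {
    def eval(row: Array[Any]): Any // interpreted: one virtual call and boxed values per node, per row
    def genCode: String            // generated: a fragment of one flat Java expression
  }
  final case class Col(index: Int) extends Expr {
    def eval(row: Array[Any]): Any = row(index)
    def genCode: String = s"row[$index]"
  }
  final case class Lit(value: Long) extends Expr {
    def eval(row: Array[Any]): Any = value
    def genCode: String = s"${value}L"
  }
  final case class Add(l: Expr, r: Expr) extends Expr {
    def eval(row: Array[Any]): Any =
      l.eval(row).asInstanceOf[Long] + r.eval(row).asInstanceOf[Long]
    def genCode: String = s"(${l.genCode} + ${r.genCode})"
  }
  final case class Mul(l: Expr, r: Expr) extends Expr {
    def eval(row: Array[Any]): Any =
      l.eval(row).asInstanceOf[Long] * r.eval(row).asInstanceOf[Long]
    def genCode: String = s"(${l.genCode} * ${r.genCode})"
  }

  // Wrap the generated expression in a method over primitive longs: no boxing, no tree walk.
  def generateMethod(e: Expr): String =
    s"public long eval(long[] row) { return ${e.genCode}; }"
}
```

For `Add(Mul(Col(0), Lit(2)), Col(1))` the generated body is a single arithmetic expression over `row[0]` and `row[1]`, which a compiler can inline far more easily than the polymorphic `eval` calls.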
Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in SPARK-8159, SPARK-9571
Each implements Expression.genCode()!
Creating a Custom UDF with Codegen
Study existing implementations
  https://github.com/apache/spark/pull/7214/files
Extend base trait
  o.a.s.sql.catalyst.expressions.Expression.genCode()
Register the function
  o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction()
Augment DataFrame with new UDF (Scala implicits)
  o.a.s.sql.functions.scala
Don’t forget about Python!
  python.pyspark.sql.functions.py
Who Benefits from Project Tungsten?
Users of DataFrames



All Spark SQL Queries

Catalyst



All RDDs

Serialization, Compression, and Aggregations
Project Tungsten Performance Results
[Charts: Query Time; Garbage Collection. Callout: OOM'd on Large Dataset!]
Presentation Outline
 Spark Core: Tuning & Mechanical Sympathy
 Spark SQL: Query Optimizing & Catalyst
 Spark Streaming: Scaling & Approximations
 Spark ML: Featurizing & Recommendations
Spark SQL: Query Optimizing & Catalyst
Explore DataFrames/Datasets/DataSources, Catalyst

Review Partitions, Pruning, Pushdowns, File Formats

Create a Custom DataSource API Implementation
DataFrames
Inspired by R and Pandas DataFrames

Schema-aware
Cross language support

SQL, Python, Scala, Java, R
Levels performance of Python, Scala, Java, and R

Generates JVM bytecode vs serializing to Python
DataFrame is container for logical plan

Lazy transformations represented as tree
Only logical plan is sent from Python -> JVM

Only results returned from JVM -> Python
UDF and UDAF Support

Custom UDF support using registerFunction()

Experimental UDAF support (e.g. HyperLogLog)
Supports existing Hive metastore if available

Small, file-based Hive metastore created if not available
*DataFrame.rdd returns underlying RDD if needed
Use DataFrames instead of RDDs!!
Spark and Hive
Early days, Shark was “Hive on Spark”
Hive Optimizer slowly replaced with Catalyst
Always use HiveContext – even if not using Hive!

If no Hive, a small Hive metastore file is created
Spark 1.5+ supports all Hive versions 0.12+

Separate classloaders for isolation

Breaks dependency between Spark internal Hive version and User’s external Hive version
Catalyst Optimizer



Optimize DataFrame Transformation Tree
Subquery elimination: use aliases to collapse subqueries
Constant folding: replace expression with constant
Simplify filters: remove unnecessary filters
Predicate/filter pushdowns: avoid unnecessary data load
Projection collapsing: avoid unnecessary projections
Create Custom Rules
Rules are Scala Case Classes
val newPlan = MyFilterRule(analyzedPlan)
Implements o.a.s.sql.catalyst.rules.Rule
Apply to any plan stage
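To make the idea of an optimizer rule concrete, here is a minimal sketch of constant folding as a bottom-up tree rewrite. It is a plain-Python analogy, not Catalyst code: the node classes `Literal`, `Column`, `Add`, and `GreaterThan` and the function `fold_constants` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class GreaterThan:
    left: object
    right: object


def fold_constants(expr):
    """Rule: replace any operator whose inputs are all literals with one literal."""
    if isinstance(expr, (Add, GreaterThan)):
        # Rewrite children first so folding propagates upward.
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            if isinstance(expr, Add):
                return Literal(left.value + right.value)
            return Literal(left.value > right.value)
        return type(expr)(left, right)
    # Leaves (Literal, Column) are returned unchanged.
    return expr


# WHERE age > (18 + 3)  is rewritten to  WHERE age > 21
predicate = GreaterThan(Column("age"), Add(Literal(18), Literal(3)))
optimized = fold_constants(predicate)
```

The rule takes a plan fragment and returns a new one, which mirrors the `val newPlan = MyFilterRule(analyzedPlan)` shape on the slide.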
DataSources API
Relations (o.a.s.sql.sources.interfaces.scala)
  BaseRelation (abstract class): Provides schema of data
    TableScan (impl): Read all data from source
    PrunedFilteredScan (impl): Column pruning & predicate pushdowns
    InsertableRelation (impl): Insert/overwrite data based on SaveMode
  RelationProvider (trait/interface): Handle options, BaseRelation factory
Execution (o.a.s.sql.execution.commands.scala)
  RunnableCommand (trait/interface): Common commands like EXPLAIN
    ExplainCommand (impl: case class)
    CacheTableCommand (impl: case class)
Filters (o.a.s.sql.sources.filters.scala)
  Filter (abstract class): Handles all predicates/filters supported by this source
    EqualTo (impl)
    GreaterThan (impl)
    StringStartsWith (impl)
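The division of labor between a relation and the filters pushed into it can be sketched in a few lines. This is a plain-Python analogy rather than the Scala API: `InMemoryRelation` is a hypothetical source backed by a list of dicts, and `build_scan` only mimics the shape of `PrunedFilteredScan.buildScan(requiredColumns, filters)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EqualTo:
    attribute: str
    value: object

    def matches(self, row):
        return row[self.attribute] == self.value


@dataclass(frozen=True)
class GreaterThan:
    attribute: str
    value: object

    def matches(self, row):
        return row[self.attribute] > self.value


class InMemoryRelation:
    """Hypothetical data source: provides a schema and a pruned, filtered scan."""

    def __init__(self, rows):
        self.rows = rows

    def schema(self):
        # Column names, taken from the first row for simplicity.
        return sorted(self.rows[0].keys()) if self.rows else []

    def build_scan(self, required_columns, filters):
        # Predicate pushdown: rows are rejected inside the source.
        # Column pruning: only the requested columns are materialized.
        for row in self.rows:
            if all(f.matches(row) for f in filters):
                yield tuple(row[c] for c in required_columns)


relation = InMemoryRelation([
    {"id": 1, "gender": "F", "age": 31},
    {"id": 2, "gender": "M", "age": 27},
    {"id": 3, "gender": "F", "age": 45},
])
result = list(relation.build_scan(["id"], [EqualTo("gender", "F"), GreaterThan("age", 40)]))
```

Only the matching row's `id` leaves the source; the `gender` and `age` columns are never handed to the query engine.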
Native Spark SQL DataSources
Query Plan Debugging
gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)
DataFrame.queryExecution.logical
DataFrame.queryExecution.analyzed
DataFrame.queryExecution.optimizedPlan
DataFrame.queryExecution.executedPlan
Query Plan Visualization & Metrics
Effectiveness of Filter
CPU Cache-Friendly Binary Format
Cost-based Join Optimization
Similar to MapReduce Map-side Join
Peak Memory for Joins and Aggs
JSON Data Source
DataFrame

val ratingsDF = sqlContext.read.format("json")
  .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or --
val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
CREATE TABLE genders USING json
OPTIONS 
(path "file:/root/pipeline/datasets/dating/genders.json.bz2")

json() convenience method
JDBC Data Source
Add Driver to Spark JVM System Classpath

$ export SPARK_CLASSPATH=<jdbc-driver.jar>

DataFrame

val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
  "url" -> "jdbc:postgresql://hostname:port/database",
  "dbtable" -> "schema.tablename")
val df = sqlContext.read.format("jdbc").options(jdbcConfig).load()

SQL

CREATE TABLE genders USING jdbc
OPTIONS (url, dbtable, driver, …)

Parquet Data Source
Configuration

spark.sql.parquet.filterPushdown=true

spark.sql.parquet.mergeSchema=true

spark.sql.parquet.cacheMetadata=true

spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames

val gendersDF = sqlContext.read.format("parquet")
  .load("file:/root/pipeline/datasets/dating/genders.parquet")
gendersDF.write.format("parquet").partitionBy("gender")
  .save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL
CREATE TABLE genders USING parquet
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")

ORC Data Source
Configuration

spark.sql.orc.filterPushdown=true
DataFrames

val gendersDF = sqlContext.read.format("orc")
  .load("file:/root/pipeline/datasets/dating/genders")
gendersDF.write.format("orc").partitionBy("gender")
  .save("file:/root/pipeline/datasets/dating/genders")
SQL
CREATE TABLE genders USING orc
OPTIONS (path "file:/root/pipeline/datasets/dating/genders")

Third-Party Spark SQL DataSources
spark-packages.org
CSV DataSource (Databricks)
Github
https://github.com/databricks/spark-csv
Maven

com.databricks:spark-csv_2.10:1.2.0
Code

val gendersCsvDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
  .toDF("id", "gender")
toDF() is required if CSV does not contain header
ElasticSearch DataSource (Elastic.co)
Github

https://github.com/elastic/elasticsearch-hadoop

Maven

org.elasticsearch:elasticsearch-spark_2.10:2.1.0

Code

val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
  "es.port" -> "<port>")
df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
  .options(esConfig).save("<index>/<document-type>")

Elasticsearch Tips
Change id field to not_analyzed to avoid analysis (tokenization)
Use term filter to build and cache the query
Perform multiple aggregations in a single request
Adapt scoring function to current trends at query time
AWS Redshift Data Source (Databricks)
Github

https://github.com/databricks/spark-redshift

Maven

com.databricks:spark-redshift:0.5.0

Code

val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
  .option("query", "select x, count(*) from my_table group by x")
  .option("tempdir", "s3n://tmpdir")
  .load(...)
UNLOAD and copy to tmp bucket in S3 enables parallel reads
DB2 and BigSQL DataSources (IBM)
Coming Soon!
Cassandra DataSource (DataStax)
Github

https://github.com/datastax/spark-cassandra-connector

Maven

com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code

ratingsDF.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "<keyspace>", "table" -> "<table>")).save(…)

Cassandra Pushdown Support
spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala


Pushdown Predicate Rules

1. Only push down non-partition key column predicates with =, >, <, >=, <= predicates.
2. Only push down primary key column predicates with = or IN predicates.
3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
5. For cluster column predicates, only the last predicate can be a non-EQ predicate (including an IN predicate), and the preceding column predicates must be EQ predicates. If there is only one cluster column predicate, it can be any non-IN predicate.
6. No predicates are pushed down if there is any OR condition or NOT IN condition.
7. Multiple predicates for the same column cannot be pushed down if any of them is an equality or IN predicate.

New Cassandra DataSource
Bypass CQL, which is optimized for transactional data

Instead, do bulk reads/writes directly on SSTables

Similar to the 5-year-old Netflix open source project Aegisthus

Promotes Cassandra to first-class Analytics Option

Potentially only part of DataStax Enterprise?!

Please mail a nasty letter to your local DataStax office
Rumor of REST DataSource (Databricks)
Coming Soon?






Ask Michael Armbrust
Spark SQL Lead @ Databricks
Custom DataSource (Me and You!)
Coming Right Now!
DEMO ALERT!!
Create a Custom DataSource
Study Existing Native & Third-Party Data Sources
Native
Spark JDBC (o.a.s.sql.execution.datasources.jdbc)
class JDBCRelation extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation
Third-Party
DataStax Cassandra (o.a.s.sql.cassandra)
class CassandraSourceRelation extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation

Demo!
Create a Custom DataSource
Contribute a Custom Data Source
spark-packages.org

Managed by

Contains links to external github projects

Ratings and comments

Declare Spark version support for each package
Examples

https://github.com/databricks/spark-csv

https://github.com/databricks/spark-avro

https://github.com/databricks/spark-redshift
Parquet Columnar File Format
Based on Google Dremel 
Collaboration with Twitter and Cloudera
Self-describing, evolving schema
Fast columnar aggregation
Supports filter pushdowns
Columnar storage format
Excellent compression
Types of Compression
Run Length Encoding: Repeated data
Dictionary Encoding: Fixed set of values
Delta, Prefix Encoding: Sorted data
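The three encodings listed above can each be sketched in a few lines. These are simplified illustrations of the ideas, not Parquet's actual on-disk encodings; the function names and sample values are made up for the example.

```python
from itertools import groupby


def run_length_encode(values):
    """Repeated data: store each run as (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Fixed set of values: store each distinct value once, then small integer codes."""
    dictionary = {}
    codes = []
    for value in values:
        # setdefault assigns the next free code the first time a value is seen.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), codes


def delta_encode(sorted_values):
    """Sorted data: store the first value, then the differences between neighbors."""
    if not sorted_values:
        return []
    deltas = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + deltas


rle = run_length_encode(["F", "F", "F", "M", "M", "F"])
dictionary, codes = dictionary_encode(["U", "F", "M", "F", "U"])
deltas = delta_encode([1000, 1003, 1004, 1010])
```

Columnar layouts help all three: values from one column sit next to each other, so runs, small dictionaries, and small deltas are far more common than in row-oriented data.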
Demo!
Demonstrate File Formats, Partition Schemes, and Query Plans
Hive JDBC ODBC ThriftServer
Allow BI Tools to Query and Process Spark Data
Register Permanent Table

CREATE TABLE ratings(fromuserid INT, touserid INT, rating INT) 

USING org.apache.spark.sql.json 

OPTIONS (path "datasets/dating/ratings.json.bz2")
Register Temp Table

ratingsDF.registerTempTable("ratings_temp") 
Configuration

spark.sql.thriftServer.incrementalCollect=true

spark.driver.maxResultSize > 10gb (default: 1g)
Demo!
Query and Process Spark Data from BI Tools
Presentation Outline
 Spark Core: Tuning & Mechanical Sympathy
 Spark SQL: Query Optimizing & Catalyst
 Spark Streaming: Scaling & Approximations
 Spark ML: Featurizing & Recommendations
Spark Streaming: Scaling & Approximations
Discuss Delivery Guarantees, Parallelism, and Stability

Compare Receiver and Receiver-less Impls

Demonstrate Stream Approximations
Non-Parallel Receiver Implementation
Receiver Implementation (Kinesis)
  KinesisRDD partitions store relevant offsets
  Single receiver required to see all data/offsets
  Kinesis offsets not deterministic like Kafka
  Partitions rebuild from Kinesis using offsets
  No Write Ahead Log (WAL) needed
  Optimizes happy path by avoiding the WAL
  At least once delivery guarantee
Parallel Receiver-less Implementation (Kafka)
Receiver-less Implementation (Kafka)
  KafkaRDD partitions store relevant offsets
  Each partition acts as a Receiver
  Tasks/Executors pull from Kafka in parallel
  Partitions rebuild from Kafka using offsets
  No Write Ahead Log (WAL) needed
  Optimizes happy path by avoiding the WAL
  At least once delivery guarantee
Maintain Stability of Stream Processing
Rate Limiting

Since Spark 1.2

Fixed limit on number of messages per second

Potential to drop messages on the floor

Back Pressure

Since Spark 1.5 (Typesafe Contribution)

More dynamic than rate limiting

Push back on reliable, buffered source (Kafka, Kinesis)

Fundamentals of Control Theory and Observability
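The difference between a fixed limit and a feedback loop can be shown with a toy simulation. This is a deliberately simplified sketch, not Spark's rate estimator: the function names, the constant `capacity_per_second`, and the rule "next rate equals the throughput just observed" are all assumptions made for illustration.

```python
def fixed_rate_limit(messages, max_messages):
    """Static limit: keep at most max_messages, drop the rest on the floor."""
    kept = messages[:max_messages]
    return kept, len(messages) - len(kept)


def simulate_back_pressure(initial_rate, batch_interval, capacity_per_second, num_batches):
    """Feedback loop: each batch's observed throughput sets the next ingest rate.

    Messages that are not pulled stay in the buffered source (e.g. Kafka);
    nothing is dropped.
    """
    rate = initial_rate
    history = []
    for _ in range(num_batches):
        pulled = int(rate * batch_interval)
        processing_seconds = pulled / capacity_per_second
        history.append((pulled, processing_seconds))
        if processing_seconds > 0:
            # Observed messages/second becomes the new ingest rate.
            rate = pulled / processing_seconds
    return history


kept, dropped = fixed_rate_limit(list(range(2000)), 500)

# Source offers 2000 msg/s, the job can only process 500 msg/s, batches are 1 second.
history = simulate_back_pressure(
    initial_rate=2000, batch_interval=1.0, capacity_per_second=500, num_batches=3
)
```

In this toy run the first batch takes 4 seconds to process a 1-second batch, which is the unstable case; the feedback step then settles the ingest rate at what the job can sustain.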
Streaming Approximations
HyperLogLog and CountMin Sketch
HyperLogLog (HLL) Approx Distinct Count
  Approximate count distinct
  Twitter’s Algebird
  Better than HashSet
  Low, fixed memory
  Only 1.5K, 2% error, 10^9 counts (tunable)

 Redis HLL: 12K per key, 0.81%, 2^64 counts
  Spark’s countApproxDistinctByKey()
  Streaming example in Spark codebase
http://research.neustar.biz/
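For readers who want to see why HyperLogLog needs only a small, fixed amount of memory, here is a minimal sketch of the basic algorithm. It is not Algebird's, Redis's, or Spark's implementation: the class name, the use of SHA-1 as the hash, and the default precision `p` are choices made for this illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HLL: 2**p small registers, regardless of how many items are added."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=12)
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates never raise a register, so they are free
estimate = hll.count()
```

The memory cost is the 4096 registers no matter how many ids are added; raising `p` lowers the error at the cost of more registers, which is the "tunable" trade-off mentioned on the slide.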
CountMin Sketch (CMS) Approx Count
 Approximate count
 Twitter’s Algebird
 Better than HashMap
 Low, fixed memory
 Known error bounds
 Large num counters
 Streaming example in Spark codebase
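A Count-Min Sketch can be sketched just as briefly. This is an illustrative implementation, not Algebird's: the class name, the SHA-1 based row hashes, and the default `width` and `depth` are assumptions for the example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-looking hash per row, derived by salting with the row number.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


cms = CountMinSketch(width=1000, depth=5)
for _ in range(100):
    cms.add("spark")
for _ in range(10):
    cms.add("kafka")
for i in range(500):
    cms.add(f"rare-{i}")

spark_estimate = cms.estimate("spark")
kafka_estimate = cms.estimate("kafka")
```

Unlike a HashMap keyed by item, the table size is fixed up front by `width * depth`; a wider table tightens the error bound and a deeper one lowers the chance of exceeding it.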
Demo!
Using HLL and CMS for Streaming Count Approximations
Monte Carlo Simulations
From Manhattan Project (Atomic bomb)
Simulate movement of neutrons
Law of Large Numbers (LLN)
Average of results of many trials

Converge on expected value
SparkPi example in Spark codebase

1 Argument: # of trials
Pi ~= (# red dots / # total dots) * 4
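The estimate above can be reproduced on a single machine. This is a local sketch of the same idea as the SparkPi example, not a copy of it; the function name and the fixed seed are choices made here so the run is repeatable.

```python
import random


def estimate_pi(num_trials, seed=42):
    """Throw random darts at the square [-1, 1] x [-1, 1] and count hits in the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_trials):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1          # a "red dot": the dart landed inside the circle
    # circle area / square area = pi / 4, so scale the hit ratio by 4
    return 4.0 * inside / num_trials


rough = estimate_pi(1_000)
better = estimate_pi(200_000)
```

By the Law of Large Numbers the hit ratio converges on pi / 4, so more trials give a tighter estimate; in Spark the trials are simply split across partitions and the hit counts are summed.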
Demo!
Using a Monte Carlo Simulation to Estimate Pi
Streaming Best Practices
Get Data Out of Streaming ASAP

Processing interval may exceed batch interval

Leads to unstable streaming system
Please Don’t…

Use updateStateByKey() like an in-memory DB

Put streaming jobs on the request/response hot path
Use Separate Jobs for Different Batch Intervals

Small Batch Interval: Store raw data (Redis, Cassandra, etc)

Medium Batch Interval: Transform, join, process data

High Batch Interval: Model training
Gotchas

Tune streamingContext.remember()
Use Approximations!!
Presentation Outline
 Spark Core: Tuning & Mechanical Sympathy
 Spark SQL: Query Optimizing & Catalyst
 Spark Streaming: Scaling & Approximations
 Spark ML: Featurizing & Recommendations
Spark ML: Featurizing & Recommendations
Understand Similarity and Dimension Reduction

Demonstrate Sampling and Bucketing

Generate Recommendations
Live, Interactive Demo!
sparkafterdark.com
Audience Participation Needed!!
Audience Instructions
  Navigate to sparkafterdark.com
  Click 3 actresses and 3 actors

  Wait for us to analyze together!
Note: This is totally anonymous!!

Project Links
  https://github.com/fluxcapacitor/pipeline
  https://hub.docker.com/r/fluxcapacitor
Similarity
Types of Similarity
Euclidean
Linear-based measure
Suffers from Magnitude bias
Cosine
Angle-based measure
Adjusts for magnitude bias
Jaccard
Set intersection / union
Suffers Popularity bias
Log Likelihood
Netflix “Shawshank” Problem
Adjusts for popularity bias
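The first three measures are easy to compare side by side. The sketch below uses made-up vectors and like-sets (not the data in the slide's table) and omits log likelihood; the function names are chosen for this example.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance: sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Angle between vectors: ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Set overlap: size of intersection divided by size of union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Same direction, different magnitude: far apart by Euclidean, identical by cosine.
light_rater = [1, 2, 3]
heavy_rater = [2, 4, 6]
distance = euclidean_distance(light_rater, heavy_rater)
angle_similarity = cosine_similarity(light_rater, heavy_rater)

# Hypothetical like-sets: 2 shared likes out of 4 distinct people.
overlap = jaccard_similarity({"Ali", "Matei", "Reynold"}, {"Matei", "Reynold", "Patrick"})
```

The two raters agree perfectly in taste but differ in scale, which is the magnitude bias that cosine similarity adjusts for.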

	Ali	Matei	Reynold	Patrick	Andy
Kimberly	1	1	1	1
Leslie	1	1
Meredith	1	1	1
Lisa	1	1	1
Holden	1	1	1	1	1
All-Pairs Similarity Comparison
Compare everything to everything
aka. “pair-wise similarity” or “similarity join”
Naïve shuffle: O(m*n^2); m=rows, n=cols

Minimize shuffle through approximations!
Reduce m (rows)
Sampling and bucketing 
Reduce n (cols)
Remove most frequent value (e.g. 0)
Principal Component Analysis
Dimension reduction!!
Dimension Reduction
Sampling and Bucketing
Reduce m: DIMSUM Sampling
“Dimension Independent Matrix Square Using MR”
Remove rows with low similarity probability
MLlib: RowMatrix.columnSimilarities(…)




Twitter: 40% efficiency gain vs. Cosine Similarity 

Reduce m: LSH Bucketing
“Locality Sensitive Hashing”
Split m into b buckets 
Use similarity hash algorithm
Requires pre-processing of data
Parallel compare bucket contents 
O(m*n^2) -> O(m*n/b*b^2);

m=rows, n=cols, b=buckets
ie. 500k x 500k matrix

O(1.25e17) -> O(1.25e13); b=50
github.com/mrsqueeze/spark-hash
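Two things on this slide can be checked with a short script: the arithmetic of the 500k x 500k example, and the effect of bucketing on the number of comparisons. The sketch is not the spark-hash implementation; it uses a hypothetical MinHash signature as the "similarity hash", with a salted MD5 standing in for the hash family, and made-up like-sets.

```python
import hashlib
from itertools import combinations

# 1. The slide's arithmetic, evaluated exactly as written.
m = n = 500_000
b = 50
naive_cost = m * n ** 2               # O(m*n^2)
bucketed_cost = m * n // b * b ** 2   # O(m*n/b*b^2)


# 2. Bucketing: only rows that share a signature are compared.
def stable_hash(token, seed):
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=4):
    """Similar sets are likely to share minimum hash values."""
    return tuple(min(stable_hash(x, seed) for x in items) for seed in range(num_hashes))


def bucketize(rows, num_hashes=4):
    buckets = {}
    for key, items in rows.items():
        buckets.setdefault(minhash_signature(items, num_hashes), []).append(key)
    return buckets


rows = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "c"},
    "u3": {"x", "y"},
    "u4": {"x", "y"},
    "u5": {"q"},
}
buckets = bucketize(rows)
all_pairs = len(list(combinations(rows, 2)))
bucket_pairs = sum(len(list(combinations(keys, 2))) for keys in buckets.values())
```

With these inputs the naive cost is 1.25e17 and the bucketed cost is 1.25e13, matching the slide, and the toy data needs 2 within-bucket comparisons instead of 10 all-pairs comparisons. Using the full signature as the bucket key only groups near-identical sets; practical LSH schemes band the signature to trade recall against bucket size.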
Reduce n: Remove Most Frequent Value
Eliminate most-frequent value
Represent other values with (index,value) pairs
Converts O(m*n^2) -> O(m*nnz^2); 

nnz=num nonzeros, nnz << n





Note: Choose most frequent value (may not be 0)
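A minimal sketch of this representation, assuming a plain Python list as the dense row; the function names are made up for the example. It picks whatever value is most frequent rather than assuming 0.

```python
from collections import Counter


def to_sparse(dense):
    """Drop the most frequent value; keep the rest as (index, value) pairs."""
    most_frequent, _ = Counter(dense).most_common(1)[0]
    pairs = [(i, v) for i, v in enumerate(dense) if v != most_frequent]
    return most_frequent, len(dense), pairs


def to_dense(most_frequent, length, pairs):
    """Rebuild the original row from the sparse form."""
    dense = [most_frequent] * length
    for i, v in pairs:
        dense[i] = v
    return dense


row = [0, 0, 5, 0, 0, 0, 3, 0]
default, length, pairs = to_sparse(row)        # 2 stored pairs instead of 8 values

# The most frequent value does not have to be 0.
default_b, length_b, pairs_b = to_sparse([7, 7, 7, 1, 7])
```

Pairwise work then scales with the number of stored pairs (nnz) instead of the full column count n.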
Recommendations
Summary Statistics and Top-K Historical Analysis
Collaborative Filtering and Clustering
Text Featurization and NLP
Types of Recommendations
Non-personalized

No preference or behavior data for user, yet
aka “Cold Start Problem”

Personalized

User-Item Similarity

Items that others with similar prefs have liked
Item-Item Similarity

Items similar to your previously-liked items
Recommendation Terminology
Feedback
Explicit: like, rating
Implicit: search, click, hover, view, scroll
Feature Engineering
Dimension reduction, polynomial expansion
Hyper-parameter Tuning
K-Folds Cross Validation, Grid Search
Pipelines/Workflows
Chaining together Transformers and Estimators
Single Machine ML Algorithms
Stay Local, Distribute As Needed
Helps migration of existing single-node algos to Spark
Convert between Spark and Pandas DataFrames
New “pdspark” package: integration w/ scikitlearn, R
Non-Personalized Recommendations
Use Aggregate Data to Generate Recommendations
  Top Users by Like Count

“I might like users who have the most-likes overall
based on historical data.”
SparkSQL, DataFrames: Summary Stat, Aggs
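The aggregation behind this recommendation is a group-by and count. The sketch below does it locally with made-up rows shaped like the deck's ratings(fromuserid, touserid, rating) table; treating rating == 1 as a "like" is an assumption for the example.

```python
from collections import Counter

# (fromuserid, touserid, rating): hypothetical sample rows
ratings = [
    (1, 10, 1), (2, 10, 1), (3, 10, 1),
    (1, 20, 1), (2, 20, 1),
    (3, 30, 1), (4, 30, 0),
]


def top_liked_users(ratings, k):
    """Count likes received per user and return the k most-liked users."""
    likes = Counter(to_user for _, to_user, rating in ratings if rating == 1)
    return likes.most_common(k)


top_two = top_liked_users(ratings, 2)
```

Because the result ignores who is asking, it works for brand-new users, which is why it suits the cold start case.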






  Top Influencers by Like Graph


“I might like the most-influential users in overall like graph.”
GraphX: PageRank
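PageRank on a like graph can be sketched with a few lines of power iteration. This is the textbook formulation run locally, not GraphX's implementation; the edge list, damping factor, and iteration count are illustrative assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """edges are (liker, liked) pairs; returns a rank per user that sums to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A user who likes nobody spreads their rank evenly over everyone.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# a, b and d all like c; c likes a.
likes = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
ranks = pagerank(likes)
most_influential = max(ranks, key=ranks.get)
```

Note that `a` ends up ranked well above `b` and `d` despite receiving a single like, because that like comes from the most influential user.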







Demo!
Generate Non-Personalized Recommendations
Personalized Recommendations
Understand Similarity and Personalized Recommendations
  Like Behavior of Similar Users
“I like the same people that you like. 

What other people did you like that I haven’t seen?” 
MLlib: Matrix Factorization, User-Item Similarity
Demo!
Generate Personalized Recommendations using 

Collaborative Filtering & Matrix Factorization
  Similar Text-based Profiles as Me


“Our profiles have similar keywords and named entities. 

We might like each other!”
MLlib: Word2Vec, TF/IDF, k-skip n-grams
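Of the techniques named here, TF/IDF is the simplest to show end to end. The sketch below is a local illustration with made-up profile text, whitespace tokenization, and the smoothed IDF formula log((N + 1) / (df + 1)); it does not cover Word2Vec or k-skip n-grams.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    num_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log((num_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical profile keywords
profiles = [
    "hiking spark coffee",
    "spark hiking wine",
    "opera ballet wine",
]
vectors = tf_idf_vectors(profiles)
me_vs_hiker = cosine(vectors[0], vectors[1])
me_vs_opera_fan = cosine(vectors[0], vectors[2])
hiker_vs_opera_fan = cosine(vectors[1], vectors[2])
```

Profiles that share keywords score higher, and rarer keywords contribute more to the score than common ones.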
  Similar Profiles to Previous Likes


“Your profile text has similar keywords and named entities to
other profiles of people I like. I might like you, too!”
MLlib: Word2Vec, TF/IDF, Doc Similarity
  Relevant, High-Value Emails

 
 “Your initial email references a lot of things in my profile.

I might like you for making the effort!”
MLlib: Word2Vec, TF/IDF, Entity Recognition






Demo!
Feature Engineering for Text/NLP Use Cases
The Future of Recommendations
  Eigenfaces: Facial Recognition
“Your face looks similar to others that I’ve liked.

I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity




Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  NLP Conversation Starter Bot! 
“If your responses to my generic opening
lines are positive, I may read your profile.” 

MLlib: TF/IDF, DecisionTrees,
Sentiment Analysis
Maintaining the Spark
  Recommendations for Couples
“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity

GraphX: Nearest Neighbors, Shortest Path



 
[Figure: movies linked by similar plots and similar actors]
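One way to read "something in between" is as the midpoint of the shortest path between the two movies in a similarity graph. The sketch below uses a breadth-first search in plain Python instead of GraphX; the graph, its edges, and the placeholder titles "Movie A" to "Movie D" are invented for illustration, and only the two endpoint titles come from the slide.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search over an unweighted graph.
    Returns the list of nodes from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def something_in_between(graph, mine, yours):
    """Middle node of the shortest path between two movies."""
    path = shortest_path(graph, mine, yours)
    if path is None:
        return None
    return path[len(path) // 2]


# Hypothetical similarity graph: an edge means "similar plot or actors".
graph = {
    "Mad Max": ["Movie A", "Movie D"],
    "Movie A": ["Mad Max", "Movie B"],
    "Movie B": ["Movie A", "Movie C"],
    "Movie C": ["Movie B", "Message In a Bottle"],
    "Movie D": ["Mad Max"],
    "Message In a Bottle": ["Movie C"],
}

path = shortest_path(graph, "Mad Max", "Message In a Bottle")
pick = something_in_between(graph, "Mad Max", "Message In a Bottle")
```

In practice the edges would come from item-item similarity scores above some threshold, and a weighted shortest path could prefer stronger similarities.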
Final Recommendation!
  Get Off the Computer & Meet People!
Thank you, Helsinki!!
Chris Fregly @cfregly
IBM Spark Technology Center 
San Francisco, CA, USA
Relevant Links
advancedspark.com
Sign up for the book & global meetup!
github.com/fluxcapacitor/pipeline
Clone, contribute, and commit code!
hub.docker.com/r/fluxcapacitor/pipeline/wiki
Run all demos in your own environment with Docker!
More Relevant Links
http://meetup.com/Advanced-Apache-Spark-Meetup
http://advancedspark.com
http://github.com/fluxcapacitor/pipeline
http://hub.docker.com/r/fluxcapacitor/pipeline
https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html
http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
http://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Tutorial
http://techblog.netflix.com/2015/07/java-in-flames.html
http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
http://docs.scala-lang.org/overviews/quasiquotes/intro.html
http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches)
http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do)

What’s Next?
Autoscaling Spark Workers
  Completely Docker-based
  Docker Compose and Docker Machine
Lots of Demos and Examples!
  Zeppelin & IPython/Jupyter notebooks
  Advanced streaming use cases
  Advanced ML, Graph, and NLP use cases
Performance Tuning and Profiling
  Work closely with Brendan Gregg & Netflix
  Surface & share more low-level details of Spark internals
Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza.io (Nov 10th)
San Francisco Advanced Spark (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 26th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Spark (Dec 8th)
Mountain View Advanced Spark (Dec 10th)
Toronto Spark Meetup (Dec 14th)
Austin Data Days Conference (Jan 2016)

Más contenido relacionado

La actualidad más candente

Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Chris Fregly
 
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsDC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsChris Fregly
 
Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015Chris Fregly
 
Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Chris Fregly
 
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Chris Fregly
 
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Chris Fregly
 
Dallas DFW Data Science Meetup Jan 21 2016
Dallas DFW Data Science Meetup Jan 21 2016Dallas DFW Data Science Meetup Jan 21 2016
Dallas DFW Data Science Meetup Jan 21 2016Chris Fregly
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Chris Fregly
 
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Chris Fregly
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016 Chris Fregly
 
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Chris Fregly
 
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016Chris Fregly
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Chris Fregly
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Chris Fregly
 
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChris Fregly
 
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Chris Fregly
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Chris Fregly
 
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...Chris Fregly
 
Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Chris Fregly
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Chris Fregly
 

La actualidad más candente (20)

Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015
 
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsDC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
 
Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015
 
Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015
 
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
 
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
 
Dallas DFW Data Science Meetup Jan 21 2016
Dallas DFW Data Science Meetup Jan 21 2016Dallas DFW Data Science Meetup Jan 21 2016
Dallas DFW Data Science Meetup Jan 21 2016
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
 
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016
 
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
 
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016
 
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
 
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
 
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
 
Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
 

Destacado

The reference collection
The reference collectionThe reference collection
The reference collectionvargas8854
 
Университет в кармане
Университет в карманеУниверситет в кармане
Университет в карманеkulibin
 
ECHA Website Customer Insight Study Summary Report
ECHA Website Customer Insight Study Summary ReportECHA Website Customer Insight Study Summary Report
ECHA Website Customer Insight Study Summary ReportNikolaos Vaslamatzis
 
1 samuel 14 commentary
1 samuel 14 commentary1 samuel 14 commentary
1 samuel 14 commentaryGLENN PEASE
 
Planificacion de Su Seguridad Económica 2012
Planificacion de Su Seguridad Económica   2012Planificacion de Su Seguridad Económica   2012
Planificacion de Su Seguridad Económica 2012CGLFINS
 
พลังประชาชน ที่บุรีรัมย์
พลังประชาชน ที่บุรีรัมย์พลังประชาชน ที่บุรีรัมย์
พลังประชาชน ที่บุรีรัมย์konthaiuk
 
關中麥客
關中麥客關中麥客
關中麥客nhush
 
How to find and close more business (without spending a thing)
How to find and close more business (without spending a thing)How to find and close more business (without spending a thing)
How to find and close more business (without spending a thing)Heinz Marketing Inc
 
Trulia Metro Movers Report - Fall 2011
Trulia Metro Movers Report - Fall 2011Trulia Metro Movers Report - Fall 2011
Trulia Metro Movers Report - Fall 2011Trulia
 
1 samuel 20 commentary
1 samuel 20 commentary1 samuel 20 commentary
1 samuel 20 commentaryGLENN PEASE
 
Aplicacion escritorio web
Aplicacion escritorio webAplicacion escritorio web
Aplicacion escritorio webmarianap611
 
Press Festival Via011
Press Festival Via011Press Festival Via011
Press Festival Via011Welcome Luiz
 
Deloitte Tech Trends 2014 Technical Debt
Deloitte Tech Trends 2014 Technical DebtDeloitte Tech Trends 2014 Technical Debt
Deloitte Tech Trends 2014 Technical DebtCAST
 
1 samuel 19 commentary
1 samuel 19 commentary1 samuel 19 commentary
1 samuel 19 commentaryGLENN PEASE
 
水晶振動子のシミュレーション
水晶振動子のシミュレーション水晶振動子のシミュレーション
水晶振動子のシミュレーションTsuyoshi Horigome
 

Destacado (18)

The reference collection
The reference collectionThe reference collection
The reference collection
 
Zaragoza turismo-55
Zaragoza turismo-55Zaragoza turismo-55
Zaragoza turismo-55
 
Университет в кармане
Университет в карманеУниверситет в кармане
Университет в кармане
 
ECHA Website Customer Insight Study Summary Report
ECHA Website Customer Insight Study Summary ReportECHA Website Customer Insight Study Summary Report
ECHA Website Customer Insight Study Summary Report
 
TDD for Testers
TDD for TestersTDD for Testers
TDD for Testers
 
1 samuel 14 commentary
1 samuel 14 commentary1 samuel 14 commentary
1 samuel 14 commentary
 
Planificacion de Su Seguridad Económica 2012
Planificacion de Su Seguridad Económica   2012Planificacion de Su Seguridad Económica   2012
Planificacion de Su Seguridad Económica 2012
 
พลังประชาชน ที่บุรีรัมย์
พลังประชาชน ที่บุรีรัมย์พลังประชาชน ที่บุรีรัมย์
พลังประชาชน ที่บุรีรัมย์
 
關中麥客
關中麥客關中麥客
關中麥客
 
How to find and close more business (without spending a thing)
How to find and close more business (without spending a thing)How to find and close more business (without spending a thing)
How to find and close more business (without spending a thing)
 
Trulia Metro Movers Report - Fall 2011
Trulia Metro Movers Report - Fall 2011Trulia Metro Movers Report - Fall 2011
Trulia Metro Movers Report - Fall 2011
 
1 samuel 20 commentary
1 samuel 20 commentary1 samuel 20 commentary
1 samuel 20 commentary
 
Aplicacion escritorio web
Aplicacion escritorio webAplicacion escritorio web
Aplicacion escritorio web
 
Press Festival Via011
Press Festival Via011Press Festival Via011
Press Festival Via011
 
An easy way into your sap systems v3.0
An easy way into your sap systems v3.0An easy way into your sap systems v3.0
An easy way into your sap systems v3.0
 
Deloitte Tech Trends 2014 Technical Debt
Deloitte Tech Trends 2014 Technical DebtDeloitte Tech Trends 2014 Technical Debt
Deloitte Tech Trends 2014 Technical Debt
 
1 samuel 19 commentary
1 samuel 19 commentary1 samuel 19 commentary
1 samuel 19 commentary
 
水晶振動子のシミュレーション
水晶振動子のシミュレーション水晶振動子のシミュレーション
水晶振動子のシミュレーション
 

Similar a Helsinki Spark Meetup Nov 20 2015

Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Chris Fregly
 
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...Athens Big Data
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationCraig Chao
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationDatabricks
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
 
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...Anya Bida
 
Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)Databricks
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Chris Fregly
 
Scaling Up Machine Learning Experimentation at Tubi 5x and Beyond
Scaling Up Machine Learning Experimentation at Tubi 5x and BeyondScaling Up Machine Learning Experimentation at Tubi 5x and Beyond
Scaling Up Machine Learning Experimentation at Tubi 5x and BeyondScyllaDB
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
 
What's New in Spark 2?
What's New in Spark 2?What's New in Spark 2?
What's New in Spark 2?Eyal Ben Ivri
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooksAndrey Vykhodtsev
 
"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk
"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk
"Technical Challenges behind Visual IDE for React Components" Tetiana MandziukFwdays
 
Apache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why CareApache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why CareDatabricks
 

Similar a Helsinki Spark Meetup Nov 20 2015 (15)

Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
 
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimization
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
 
Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
 
Scaling Up Machine Learning Experimentation at Tubi 5x and Beyond
Scaling Up Machine Learning Experimentation at Tubi 5x and BeyondScaling Up Machine Learning Experimentation at Tubi 5x and Beyond
Scaling Up Machine Learning Experimentation at Tubi 5x and Beyond
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
What's New in Spark 2?
What's New in Spark 2?What's New in Spark 2?
What's New in Spark 2?
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks
 
"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk
"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk
"Technical Challenges behind Visual IDE for React Components" Tetiana Mandziuk
 
Apache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why CareApache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why Care
 

Más de Chris Fregly

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataChris Fregly
 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfChris Fregly
 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupChris Fregly
 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedChris Fregly
 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine LearningChris Fregly
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...Chris Fregly
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon BraketChris Fregly
 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-PersonChris Fregly
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapChris Fregly
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...Chris Fregly
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Helsinki Spark Meetup Nov 20 2015

  • 1. IBM Spark (spark.tc). After Dark 1.5: High Performance, Real-time, Streaming, Machine Learning, Natural Language Processing, Text Analytics, and Recommendations. Chris Fregly, Principal Data Solutions Engineer, IBM Spark Technology Center. ** We’re Hiring -- Only Nice People, Please!! ** November 20, 2015
  • 2. Who Am I? Streaming Data Engineer, Open Source Committer; Data Solutions Engineer, Apache Contributor; Principal Data Solutions Engineer, IBM Spark Technology Center; Founder, Advanced Apache Spark Meetup; Author, Advanced . Due 2016. My Ma’s First Time in California
  • 3. Random Slide: More Ma “First Time” Pics: In California; Using Chopsticks; Using “New” iPhone
  • 4. Upcoming Meetups and Conferences: London Spark Meetup (Oct 12th); Scotland Data Science Meetup (Oct 13th); Dublin Spark Meetup (Oct 15th); Barcelona Spark Meetup (Oct 20th); Madrid Big Data Meetup (Oct 22nd); Paris Spark Meetup (Oct 26th); Amsterdam Spark Summit (Oct 27th); Brussels Spark Meetup (Oct 30th); Zurich Big Data Meetup (Nov 2nd); Geneva Spark Meetup (Nov 5th); San Francisco Datapalooza.io (Nov 10th); San Francisco Advanced Spark (Nov 12th); Oslo Big Data Hadoop Meetup (Nov 19th); Helsinki Spark Meetup (Nov 20th); Stockholm Spark Meetup (Nov 23rd); Copenhagen Spark Meetup (Nov 25th); Budapest Spark Meetup (Nov 26th); Singapore Strata Conference (Dec 1st); San Francisco Advanced Spark (Dec 8th); Mountain View Advanced Spark (Dec 10th); Toronto Spark Meetup (Dec 14th); Austin Data Days Conference (Jan 2016)
  • 5. Advanced Apache Spark Meetup. Meetup Metrics: 1600+ members in just 4 months! Top 5 most active Spark Meetup!! Meetup Goals: dig deep into the codebase of Spark and related projects; study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R; surface and share patterns and idioms of these well-designed, distributed, big data components
  • 6. All Slides and Code Are Available! advancedspark.com; slideshare.net/cfregly; github.com/fluxcapacitor; hub.docker.com/r/fluxcapacitor
  • 7. What is “After Dark”? Spark-based, Advanced Analytics Reference App; End-to-End, Scalable, Real-time Big Data Pipeline; Demonstration of Spark & Related Big Data Projects. github.com/fluxcapacitor
  • 8. Tools of This Talk: Kafka; Redis; Docker; Ganglia; Cassandra; Parquet, JSON, ORC, Avro; Apache Zeppelin Notebooks; Spark SQL, DataFrames, Hive; ElasticSearch, Logstash, Kibana; Spark ML, GraphX, Stanford CoreNLP … github.com/fluxcapacitor; hub.docker.com/r/fluxcapacitor
  • 9. Themes of this Talk: Filter; Off-Heap; Parallelize; Approximate; Find Similarity; Minimize Seeks; Maximize Scans; Customize for Workload; Tune Performance At Every Layer. Be Nice, Collaborate! Like a Mom!!
  • 10. Presentation Outline: Spark Core: Tuning & Mechanical Sympathy; Spark SQL: Query Optimizing & Catalyst; Spark Streaming: Scaling & Approximations; Spark ML: Featurizing & Recommendations
  • 11. Spark Core: Tuning & Mechanical Sympathy. Understand and Acknowledge Mechanical Sympathy; Study AlphaSort and 100TB GraySort Challenge; Dive Deep into Project Tungsten
  • 12. Mechanical Sympathy. “Hardware and software working together in harmony.” - Martin Thompson, http://mechanical-sympathy.blogspot.com. “Whatever your data structure, my array will beat it.” - Scott Meyers, Every C++ Book, basically. “Hair Sympathy” - Bruce Jenner
  • 13. Spark and Mechanical Sympathy: Project Tungsten (Spark 1.4-1.6+); GraySort Challenge (Spark 1.1-1.2); Minimize Memory and GC; Maximize CPU Cache Locality; Saturate Network I/O; Saturate Disk I/O
  • 14. AlphaSort Technique: Sort 100-Byte Records. Naïve: List[Pointer]; must dereference each pointer to reach the key for comparison. AlphaSort: List[(Key, Pointer)]; key is directly available for comparison, dereference not required!
  • 15. CPU Cache Line and Memory Sympathy. Key (10 bytes) + Pointer (4 bytes*) = 14 bytes: not CPU cache-line friendly! Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes: CPU cache-line friendly! Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes: 2x CPU cache-line friendly! (*4-byte pointers with Compressed OOPs)
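The key-prefix layout above can be sketched in a few lines of Scala. This is a minimal illustration, not Spark's or AlphaSort's actual code: the names `KeyPrefixSortSketch`, `keyPrefix`, `compareFullKeys` and `sortedPointers` are hypothetical, a "pointer" is assumed to be a record index into one flat byte buffer, and ties on the 4-byte prefix are assumed to be resolved by a second pass over the full 10-byte key.

```scala
import java.util.Arrays

object KeyPrefixSortSketch {
  val RecordSize = 100 // bytes per record
  val KeySize = 10     // the leading bytes of each record form the key

  /** First 4 key bytes of record `ptr`, as an unsigned big-endian value. */
  def keyPrefix(buf: Array[Byte], ptr: Int): Long = {
    val off = ptr * RecordSize
    ((buf(off) & 0xFFL) << 24) | ((buf(off + 1) & 0xFFL) << 16) |
      ((buf(off + 2) & 0xFFL) << 8) | (buf(off + 3) & 0xFFL)
  }

  /** Unsigned byte-wise comparison of the full keys of records p and q. */
  def compareFullKeys(buf: Array[Byte], p: Int, q: Int): Int = {
    var i = 0
    while (i < KeySize) {
      val a = buf(p * RecordSize + i) & 0xFF
      val b = buf(q * RecordSize + i) & 0xFF
      if (a != b) return a - b
      i += 1
    }
    0
  }

  /** Returns record indices ("pointers") in ascending key order. */
  def sortedPointers(buf: Array[Byte]): Array[Int] = {
    val n = buf.length / RecordSize
    // One 8-byte primitive per record: prefix in the high 32 bits, pointer in
    // the low 32 bits. Flipping the sign bit makes signed Long order match
    // unsigned prefix order.
    val packed = Array.tabulate(n) { p =>
      ((keyPrefix(buf, p) << 32) | p.toLong) ^ Long.MinValue
    }
    Arrays.sort(packed) // compares primitives only; no record is touched
    val ptrs = packed.map(x => (x & 0xFFFFFFFFL).toInt)
    // Runs that share a prefix are re-sorted on the full key (the only
    // place where records are dereferenced).
    var start = 0
    while (start < n) {
      var end = start + 1
      while (end < n && (packed(end) >>> 32) == (packed(start) >>> 32)) end += 1
      if (end - start > 1) {
        val run = ptrs.slice(start, end).sortWith((p, q) => compareFullKeys(buf, p, q) < 0)
        Array.copy(run, 0, ptrs, start, run.length)
      }
      start = end
    }
    ptrs
  }
}
```

Eight such 8-byte entries fit in a typical 64-byte cache line, which is the point of the "2x cache-line friendly" layout; the tie-breaking pass is one possible design and costs more when many keys share a prefix.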
  • 16. Performance Comparison
  • 17. Similar Trick: Direct Cache Access (DCA). Pull out packet header alongside pointer to payload
  • 18. CPU Cache Line Sizes: My Laptop; My SoftLayer BareMetal
  • 19. Cache Hits: Sequential vs. Random Access
  • 20. Mechanical Sympathy: CPU Cache Lines and Matrix Multiplication
  • 21. CPU Cache Naïve Matrix Multiplication
       // Dot product of each row & column vector
       for (i <- 0 until numRowsA)
         for (j <- 0 until numColsB)
           for (k <- 0 until numColsA)
             res(i)(j) += matA(i)(k) * matB(k)(j)
     Bad: the inner loop walks down a column of matB (a different row on every step), so it does not use the full CPU cache line and pre-fetching is ineffective
  • 22. CPU Cache Friendly Matrix Multiplication
       // Transpose B (matBT has numColsB rows and numRowsB columns)
       for (i <- 0 until numRowsB)
         for (j <- 0 until numColsB)
           matBT(j)(i) = matB(i)(j)
       // Modify dot product calculation for B Transpose
       for (i <- 0 until numRowsA)
         for (j <- 0 until numColsB)
           for (k <- 0 until numColsA)
             res(i)(j) += matA(i)(k) * matBT(j)(k)
       // OLD: res(i)(j) += matA(i)(k) * matB(k)(j)
     Good: full CPU cache line, effective prefetching. Reference j before k
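The two loop nests above can be put side by side as a self-contained sketch. The object and method names (`MatMulSketch`, `multiplyNaive`, `multiplyCacheFriendly`) are illustrative rather than taken from the talk's repository, and the sketch assumes non-empty, rectangular matrices stored as `Array[Array[Double]]`.

```scala
object MatMulSketch {
  type Matrix = Array[Array[Double]]

  /** Naive version: the inner loop reads b(k)(j), jumping to a new row of b on every step. */
  def multiplyNaive(a: Matrix, b: Matrix): Matrix = {
    val (rowsA, colsA, colsB) = (a.length, b.length, b(0).length)
    val res = Array.ofDim[Double](rowsA, colsB)
    for (i <- 0 until rowsA; j <- 0 until colsB; k <- 0 until colsA)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  /** Transpose of b: result(j)(k) == b(k)(j). */
  def transpose(b: Matrix): Matrix =
    Array.tabulate(b(0).length, b.length)((j, k) => b(k)(j))

  /** Cache-friendly version: both operands are scanned sequentially along a row. */
  def multiplyCacheFriendly(a: Matrix, b: Matrix): Matrix = {
    val bT = transpose(b)
    val (rowsA, colsA, colsB) = (a.length, b.length, b(0).length)
    val res = Array.ofDim[Double](rowsA, colsB)
    for (i <- 0 until rowsA; j <- 0 until colsB) {
      val rowA = a(i)
      val rowBT = bT(j)
      var sum = 0.0
      var k = 0
      while (k < colsA) { sum += rowA(k) * rowBT(k); k += 1 }
      res(i)(j) = sum
    }
    res
  }
}
```

Both methods return the same product; only the memory access pattern differs. The one-off transpose costs an extra pass over B, which pays off once the matrices are large enough that B no longer fits in cache.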
  • 23. Instrumenting and Monitoring CPU: Use Linux perf command! http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
  • 24. Demo! Compare CPU Naïve & Cache-Friendly Matrix Multiplication
  • 25. Results of Matrix Multiply Comparison: Naïve Matrix Multiply vs. Cache-Friendly Matrix Multiply. Figure annotations: ~27x, ~13x, ~13x, ~2x, ~10x; 55 hp, 550 hp. perf stat -XX:-Inline --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
  • 26. Mechanical Sympathy: CPU Cache Lines and Lock-Free Thread Sync
  • 27. CPU Cache Naïve Tuple Counters
       object CacheNaiveTupleIncrement {
         var tuple = (0, 0)
         …
         def increment(leftIncrement: Int, rightIncrement: Int): (Int, Int) = {
           this.synchronized {
             tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement)
             tuple
           }
         }
       }
  • 28. CPU Cache Naïve Case Class Counters
       case class MyTuple(left: Int, right: Int)
       object CacheNaiveCaseClassCounters {
         var tuple = new MyTuple(0, 0)
         …
         def increment(leftIncrement: Int, rightIncrement: Int): MyTuple = {
           this.synchronized {
             tuple = new MyTuple(tuple.left + leftIncrement, tuple.right + rightIncrement)
             tuple
           }
         }
       }
  • 29. CPU Cache Friendly Lock-Free Counters
       import java.util.concurrent.atomic.AtomicLong

       object CacheFriendlyLockFreeCounters {
         // a single Long (8 bytes) maintains 2 separate Ints (4 bytes each)
         val tuple = new AtomicLong()
         …
         def increment(leftIncrement: Int, rightIncrement: Int): Long = {
           var originalLong = 0L
           var updatedLong = 0L
           do {
             originalLong = tuple.get()
             val originalRightInt = originalLong.toInt               // cast originalLong to Int to get right counter
             val originalLeftInt = (originalLong >>> 32).toInt       // shift right to get left counter
             val updatedRightInt = originalRightInt + rightIncrement // increment right counter
             val updatedLeftInt = originalLeftInt + leftIncrement    // increment left counter
             updatedLong = updatedLeftInt                            // update the new long with the left counter
             updatedLong = updatedLong << 32                         // shift the new long left
             updatedLong |= (updatedRightInt & 0xFFFFFFFFL)          // mask, then OR in the right counter (no sign extension into the left half)
           } while (tuple.compareAndSet(originalLong, updatedLong) == false)
           updatedLong
         }
       }
     Q: Why not @volatile long? A: A @volatile Long makes each individual read and write atomic, but the read-modify-write increment is still a race; compareAndSet on an AtomicLong retries until the whole update applies atomically. (Without volatile, the Java Memory Model does not even guarantee atomic writes of 64-bit longs or doubles.)
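A self-contained variant of the packed-counter idea, with helpers to read the two counters back out, might look like the sketch below. The names `PackedCountersSketch`, `pack`, `unpack` and `PackedCounters` are hypothetical, and a plain `while` loop stands in for the slide's `do`/`while` so the same code also compiles on newer Scala versions.

```scala
import java.util.concurrent.atomic.AtomicLong

object PackedCountersSketch {
  /** Left counter in the high 32 bits, right counter in the low 32 bits. */
  def pack(left: Int, right: Int): Long =
    (left.toLong << 32) | (right & 0xFFFFFFFFL) // mask avoids sign extension

  def unpack(packed: Long): (Int, Int) =
    ((packed >>> 32).toInt, packed.toInt)

  final class PackedCounters {
    private val state = new AtomicLong(0L)

    /** Adds to both counters in one atomic step; returns the updated pair. */
    def increment(leftInc: Int, rightInc: Int): (Int, Int) = {
      var result = (0, 0)
      var done = false
      while (!done) {
        val current = state.get()
        val (l, r) = unpack(current)
        result = (l + leftInc, r + rightInc)
        // Succeeds only if no other thread changed `state` in the meantime.
        done = state.compareAndSet(current, pack(result._1, result._2))
      }
      result
    }

    def get: (Int, Int) = unpack(state.get())
  }
}
```

Under contention a thread may retry the loop several times, but it never blocks and never allocates a lock; the trade-off is that each counter is limited to 32 bits.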
  • 30. Demo! Compare CPU Naïve & Cache-Friendly Tuple Counter Sync
  • 31. Results of Counters Comparison: Naïve Tuple Counters; Naïve Case Class Counters; Cache Friendly Lock-Free Counters. Figure annotations: ~2x, ~1.5x, ~3.5x, ~2x, ~2x, ~1.5x, ~1.5x, ~1.5x
  • 32. Profiling Visualizations: Flame Graphs. Example: Spark Word Count. Java Stack Traces (-XX:+PreserveFramePointer). Plateaus are Bad!!
  • 33. 100TB Daytona GraySort Challenge: Focus on Network and Disk I/O Optimizations; Improve Data Structs/Algos for Sort & Shuffle; Saturate Network and Disk Controllers
  • 34. Winning Results (2013) (2014). Spark Goals: Saturate Network I/O; Saturate Disk I/O
  • 35. Winning Hardware Configuration
       Compute
         206 Workers, 1 Master (AWS EC2 i2.8xlarge)
         32 Intel Xeon CPU E5-2670 @ 2.5 GHz
         244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
         3 GBps mixed read/write disk I/O per node
       Network
         AWS Placement Groups, VPC, Enhanced Networking
         Single Root I/O Virtualization (SR-IOV)
         10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)
  • 36. Winning Software Configuration: Spark 1.2, OpenJDK 1.7. Disable caching, compression, speculative execution, shuffle spill. Force NODE_LOCAL task scheduling for optimal data locality. HDFS 2.4.1 short-circuit local reads, 2x replication. Empirically chose between 4-6 partitions per CPU core: 206 nodes * 32 cores = 6592 cores; 6592 cores * 4 = 26,368 partitions; 6592 cores * 6 = 39,552 partitions; 6592 cores * 4.25 = ~28,000 partitions (empirical best). Range partitioning takes advantage of sequential keyspace; required ~10s of sampling 79 keys from each partition
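The partition arithmetic on this slide can be checked with a few lines of Scala. This is only a sketch of the slide's numbers; the object and method names (`GraySortPartitions`, `partitions`) are made up for illustration and are not part of Spark.

```scala
// Sketch of the partition-count arithmetic from the slide (names are hypothetical).
object GraySortPartitions {
  val nodes = 206
  val coresPerNode = 32
  val totalCores: Int = nodes * coresPerNode

  // Partition count for a given partitions-per-core multiplier, rounded to the nearest whole number.
  def partitions(perCore: Double): Long = math.round(totalCores * perCore)

  def main(args: Array[String]): Unit = {
    println(s"total cores = $totalCores")
    Seq(4.0, 4.25, 6.0).foreach { m =>
      println(s"$m partitions/core -> ${partitions(m)} partitions")
    }
  }
}
```

The 4.25 multiplier gives 28,016, which the slide rounds to 28,000.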
  • 37. New Sort Shuffle Manager for Spark 1.2: original “hash-based” vs. new “sort-based”. ① Use less OS resources (socket buffers, file descriptors) ② TimSort partitions in-memory ③ MergeSort partitions on-disk into a single master file ④ Serve partitions from master file: seek once, sequential scan
  • 38. Asynchronous Network Module (details in SPARK-2468). Switch to asynchronous Netty vs. synchronous java.nio. Switch to zero-copy epoll: use only kernel-space between disk and network controllers. Custom memory management. spark.shuffle.blockTransferService=netty. Spark-Netty Performance Tuning: spark.shuffle.io.preferDirectBufs=true (reuse off-heap buffers); spark.shuffle.io.numConnectionsPerPeer=8 (for example; increase to saturate hosts with multiple disks, 8 x 800GB SSD)
  • 39. Custom Algorithms and Data Structures, optimized for sort & shuffle workloads. o.a.s.util.collection.TimSort[K,V]: based on JDK 1.7 TimSort; performs best with partially-sorted runs; optimized for elements of (K,V) pairs; sorts impl of SortDataFormat (e.g. KVArraySortDataFormat). o.a.s.util.collection.AppendOnlyMap: open addressing hash, quadratic probing; array of [(K, V), (K, V)]; good memory locality; keys never removed, values only append
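To make the AppendOnlyMap bullet points concrete, here is a toy map in the same spirit: open addressing, an incrementing probe step, and keys and values interleaved in one flat array. It is a simplified sketch, not Spark's implementation. `ToyAppendOnlyMap`, `slotFor`, and the fixed power-of-two capacity without resizing are assumptions made for brevity, and null keys are not supported.

```scala
// Toy append-only hash map (NOT Spark's AppendOnlyMap): fixed capacity, no resizing, no removal.
final class ToyAppendOnlyMap[K, V](capacity: Int = 64) {
  require(capacity > 1 && (capacity & (capacity - 1)) == 0, "capacity must be a power of two")
  private val mask = capacity - 1
  // Layout: [k0, v0, k1, v1, ...] so a key and its value are adjacent in memory.
  private val data = new Array[Any](2 * capacity)
  private var count = 0

  def size: Int = count

  // Returns the slot holding `key`, or the first empty slot on its probe sequence.
  // The step grows by one on each collision (probe offsets 1, 3, 6, ...).
  private def slotFor(key: K): Int = {
    var pos = key.hashCode & mask
    var step = 1
    while (data(2 * pos) != null && data(2 * pos) != key) {
      pos = (pos + step) & mask
      step += 1
    }
    pos
  }

  // Insert or update in place; `update` receives the old value if the key exists.
  def changeValue(key: K)(update: Option[V] => V): V = {
    val pos = slotFor(key)
    val isNew = data(2 * pos) == null
    if (isNew) {
      // Always keep one slot empty so that probing terminates.
      require(count < capacity - 1, "toy map is full")
      data(2 * pos) = key
      count += 1
    }
    val old = if (isNew) None else Some(data(2 * pos + 1).asInstanceOf[V])
    val v = update(old)
    data(2 * pos + 1) = v
    v
  }

  def get(key: K): Option[V] = {
    val pos = slotFor(key)
    if (data(2 * pos) == null) None else Some(data(2 * pos + 1).asInstanceOf[V])
  }
}
```

A word count then becomes `words.foreach(w => counts.changeValue(w)(_.getOrElse(0) + 1))`, with each update touching a single array slot.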
  • 40. Daytona GraySort Challenge Goal: Success. 1.1 GBps/node network I/O (Reducers); theoretical max = 1.25 GBps for 10 Gbps Ethernet. 3 GBps/node disk I/O (Mappers). Aggregate cluster network I/O: 220 GBps / 206 nodes ~= 1.1 GBps per node
  • 41. Shuffle Performance Tuning Tips. Hash Shuffle Manager (Deprecated): spark.shuffle.consolidateFiles (Mapper); o.a.s.shuffle.FileShuffleBlockResolver. Intermediate Files: increase spark.shuffle.file.buffer (Reducer); increase spark.reducer.maxSizeInFlight if memory allows. Use Smaller Number of Larger Executors, each with many threads (1 per CPU): minimizes intermediate files and overall shuffle; more opportunity for PROCESS_LOCAL. SQL: BroadcastHashJoin vs. ShuffledHashJoin: spark.sql.autoBroadcastJoinThreshold; use DataFrame.explain(true) or EXPLAIN to verify
  • 42. Project Tungsten (SPARK-7076, Spark 1.4): Data Structs & Algos Operate Directly on Byte Arrays; Maximize CPU Cache Locality, Minimize GC; Utilize Dynamic Code Generation
  • 43. Quick Review of Project Tungsten Jiras: SPARK-7076 (Spark 1.4)
  • 44. Why is CPU the Bottleneck? CPU is used for serialization, hashing, compression! Network and disk I/O bandwidth are relatively high: GraySort optimizations improved network & shuffle; partitioning, pruning, and predicate pushdowns; binary, compressed, columnar file formats (Parquet)
  • 45. Yet Another Spark Shuffle Manager! spark.shuffle.manager = hash (Deprecated): < 10,000 reducers; output partition file hashes the key of (K,V) pair; mapper creates an output file per partition; leads to M*P output files for all partitions. sort (GraySort Challenge): > 10,000 reducers; default from Spark 1.2-1.5; mapper creates single output file for all partitions; minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory; uses custom data structures and algorithms for sort-shuffle workload; wins Daytona GraySort Challenge. tungsten-sort (Project Tungsten): default since 1.5; modification of existing sort-based shuffle; uses sun.misc.Unsafe for self-managed memory and garbage collection; maximizes CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms; performs joins, sorts, and other operators on both serialized and compressed byte buffers
  • 46. CPU & Memory Optimizations (details in SPARK-7075). Custom Managed Memory: reduces GC overhead; both on and off heap; exact size calculations. Direct Binary Processing: operate on serialized/compressed arrays; Kryo can reorder/sort serialized records; LZF can reorder/sort compressed records. More CPU Cache-aware Data Structs & Algorithms: o.a.s.sql.catalyst.expressions.UnsafeRow; o.a.s.unsafe.map.BytesToBytesMap. Code Generation (default in 1.5): generate source code from overall query plan; 100+ UDFs converted to use code generation. Diagram labels: UnsafeFixedWidthAggregationMap, TungstenAggregationIterator, CodeGenerator, GenerateUnsafeRowJoiner, UnsafeSortDataFormat, UnsafeShuffleSortDataFormat, PackedRecordPointer, UnsafeRow, UnsafeInMemorySorter, UnsafeExternalSorter, UnsafeShuffleWriter, Mostly Same Join Code, UnsafeProjection, UnsafeShuffleManager, UnsafeShuffleInMemorySorter, UnsafeShuffleExternalSorter
  • 47. sun.misc.Unsafe. Info: addressSize(), pageSize(). Objects: allocateInstance(), objectFieldOffset(). Classes: staticFieldOffset(), defineClass(), defineAnonymousClass(), ensureClassInitialized(). Synchronization: monitorEnter(), tryMonitorEnter(), monitorExit(), compareAndSwapInt(), putOrderedInt(). Arrays: arrayBaseOffset(), arrayIndexScale(). Memory (used by Tungsten): allocateMemory(), copyMemory(), freeMemory(), getAddress() (not guaranteed after GC), getInt()/putInt(), getBoolean()/putBoolean(), getByte()/putByte(), getShort()/putShort(), getLong()/putLong(), getFloat()/putFloat(), getDouble()/putDouble(), getObjectVolatile()/putObjectVolatile()
  • 48. Spark + sun.misc.Unsafe (over 200 source files affected!!). org.apache.spark.sql.execution: aggregate.SortBasedAggregate, aggregate.TungstenAggregate, aggregate.AggregationIterator, aggregate.udaf, aggregate.utils, SparkPlanner, rowFormatConverters, UnsafeFixedWidthAggregationMap, UnsafeExternalSorter, UnsafeExternalRowSorter, UnsafeKeyValueSorter, UnsafeKVExternalSorter, local.ConvertToUnsafeNode, local.ConvertToSafeNode, local.HashJoinNode, local.ProjectNode, local.LocalNode, local.BinaryHashJoinNode, local.NestedLoopJoinNode, joins.HashJoin, joins.HashSemiJoin, joins.HashedRelation, joins.BroadcastHashJoin, joins.ShuffledHashOuterJoin (not yet converted), joins.BroadcastHashOuterJoin, joins.BroadcastLeftSemiJoinHash, joins.BroadcastNestedLoopJoin, joins.SortMergeJoin, joins.LeftSemiJoinBNL, joins.SortMergeOuterJoin, Exchange, SparkPlan, UnsafeRowSerializer, SortPrefixUtils, sort, basicOperators, aggregate.SortBasedAggregationIterator, aggregate.TungstenAggregationIterator, datasources.WriterContainer, datasources.json.JacksonParser, datasources.jdbc.JDBCRDD. org.apache.spark: unsafe.Platform, unsafe.KVIterator, unsafe.array.LongArray, unsafe.array.ByteArrayMethods, unsafe.array.BitSet, unsafe.bitset.BitSetMethods, unsafe.hash.Murmur3_x86_32, unsafe.map.BytesToBytesMap, unsafe.map.HashMapGrowthStrategy, unsafe.memory.TaskMemoryManager, unsafe.memory.ExecutorMemoryManager, unsafe.memory.MemoryLocation, unsafe.memory.UnsafeMemoryAllocator, unsafe.memory.MemoryAllocator (trait/interface), unsafe.memory.MemoryBlock, unsafe.memory.HeapMemoryAllocator, unsafe.sort.RecordComparator, unsafe.sort.PrefixComparator, unsafe.sort.PrefixComparators, unsafe.sort.UnsafeSorterSpillWriter, serializer.DummySerializationInstance, shuffle.unsafe.UnsafeShuffleManager, shuffle.unsafe.UnsafeShuffleSortDataFormat, shuffle.unsafe.SpillInfo, shuffle.unsafe.UnsafeShuffleWriter, shuffle.unsafe.UnsafeShuffleExternalSorter, shuffle.unsafe.PackedRecordPointer, shuffle.ShuffleMemoryManager, util.collection.unsafe.sort.UnsafeSorterSpillMerger, util.collection.unsafe.sort.UnsafeSorterSpillReader, util.collection.unsafe.sort.UnsafeSorterSpillWriter, util.collection.unsafe.sort.UnsafeShuffleInMemorySorter, util.collection.unsafe.sort.UnsafeInMemorySorter, util.collection.unsafe.sort.RecordPointerAndKeyPrefix, util.collection.unsafe.sort.UnsafeSorterIterator, network.shuffle.ExternalShuffleBlockResolver, scheduler.Task, rdd.SqlNewHadoopRDD, executor.Executor. org.apache.spark.sql.catalyst.expressions: regexpExpressions, BoundAttribute, SortOrder, SpecializedGetters, ExpressionEvalHelper, UnsafeArrayData, UnsafeReaders, UnsafeMapData, Projection, LiteralGenerator, UnsafeRow, JoinedRow, InputFileName, SpecificMutableRow, codegen.CodeGenerator, codegen.GenerateProjection, codegen.GenerateUnsafeRowJoiner, codegen.GenerateSafeProjection, codegen.GenerateUnsafeProjection, codegen.BufferHolder, codegen.UnsafeRowWriter, codegen.UnsafeArrayWriter, complexTypeCreator, rows, literals, misc, stringExpressions
  • 49. Traditional Java Object Row Layout (diagram labels: 4-byte String; Multi-field Object)
  • 50. Custom Data Structures for Workload. UnsafeRow (Dense Binary Row): dense, 8 bytes per field (word-aligned). TaskMemoryManager (Virtual Memory Address): OS-style memory paging. BytesToBytesMap (Dense Binary HashMap): AlphaSort-style (Key + Pointer)
  • 51. UnsafeRow Layout Example: Pre-Tungsten vs. Tungsten
  • 52. Custom Memory Management. o.a.s.memory.TaskMemoryManager & MemoryConsumer: memory management (virtual memory allocation, paging); off-heap: direct 64-bit address; on-heap: 13-bit page num + 27-bit page offset. o.a.s.shuffle.sort.PackedRecordPointer: 64-bit word (24-bit partition key, (13-bit page num, 27-bit page offset)); 2^13 pages * 2^27 page size = 1 TB RAM per Task. o.a.s.unsafe.types.UTF8String: primitive Array[Byte]
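The 24 + 13 + 27 bit split described for PackedRecordPointer can be illustrated with plain bit shifts. The sketch below is not Spark's class: `ToyPackedPointer` and its method names are hypothetical, and placing the partition ID in the highest bits is an assumption about the layout.

```scala
// Toy 64-bit packed pointer: [24-bit partition | 13-bit page number | 27-bit offset in page].
// Illustrative only; field order and names are assumptions, not Spark's PackedRecordPointer.
object ToyPackedPointer {
  val PartitionBits = 24
  val PageBits = 13
  val OffsetBits = 27

  private val PageMask = (1L << PageBits) - 1
  private val OffsetMask = (1L << OffsetBits) - 1

  def pack(partition: Int, page: Int, offset: Long): Long = {
    require(partition >= 0 && partition < (1 << PartitionBits), "partition out of range")
    require(page >= 0 && page < (1 << PageBits), "page out of range")
    require(offset >= 0 && offset <= OffsetMask, "offset out of range")
    (partition.toLong << (PageBits + OffsetBits)) | (page.toLong << OffsetBits) | offset
  }

  def partition(p: Long): Int = (p >>> (PageBits + OffsetBits)).toInt
  def page(p: Long): Int = ((p >>> OffsetBits) & PageMask).toInt
  def offset(p: Long): Long = p & OffsetMask

  // 2^13 pages * 2^27 bytes per page = 2^40 bytes = 1 TB addressable per task.
  val addressableBytes: Long = (1L << PageBits) * (1L << OffsetBits)
}
```

Packing everything into one `Long` means a sorter can move 8-byte words around instead of object references.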
  • 53. UnsafeFixedWidthAggregationMap Aggregations. o.a.s.sql.execution.UnsafeFixedWidthAggregationMap: uses BytesToBytesMap; in-place updates of serialized data; no object creation on hot-path; improved external agg support; no OOMs for large, single-key aggs. o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner: combine 2 UnsafeRows into 1. o.a.s.sql.execution.aggregate.TungstenAggregate & TungstenAggregationIterator: operates directly on serialized, binary UnsafeRow; 2 steps: hash-based agg (grouping), then sort-based agg; supports spilling and external merge sorting
  • 54. Equality: bitwise comparison on UnsafeRow; no need to calculate equals(), hashCode() (diagram: Row 1 Equals! Row 2)
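The idea of comparing rows as raw bytes can be sketched without Spark. The example below encodes fixed-width rows as 8 bytes per field and compares the byte arrays directly; `ToyBinaryRow`, `encode`, and `sameRow` are hypothetical names, and the layout is far simpler than UnsafeRow (no null bitmap, no variable-length fields).

```scala
import java.nio.ByteBuffer
import java.util.Arrays

// Toy fixed-width binary row: every field occupies one 8-byte word (illustrative, not UnsafeRow).
object ToyBinaryRow {
  def encode(fields: Long*): Array[Byte] = {
    val buf = ByteBuffer.allocate(8 * fields.length)
    fields.foreach(f => buf.putLong(f))
    buf.array()
  }

  // Two rows are equal exactly when their bytes are equal; no per-field equals() calls.
  def sameRow(a: Array[Byte], b: Array[Byte]): Boolean = Arrays.equals(a, b)
}
```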
  • 55. Joins: surprisingly, not many code changes. o.a.s.sql.catalyst.expressions.UnsafeProjection: converts InternalRow to UnsafeRow
  • 56. Sorting. o.a.s.util.collection.unsafe.sort: UnsafeSortDataFormat, UnsafeInMemorySorter, UnsafeExternalSorter, RecordPointerAndKeyPrefix, UnsafeShuffleWriter. AlphaSort-style, cache friendly (Ptr + Key-Prefix): 2x CPU cache-line friendly! Using multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining; this affects sort & shuffle performance. Supports merging compressed records if compression CODEC supports it (LZF)
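A rough sketch of the AlphaSort-style "pointer + key prefix" idea follows: sort small (prefix, pointer) entries, and only dereference the pointer to the full record when two prefixes tie. This is not Spark's sorter. `ToyPrefixSort`, `Entry`, and `prefixOf` are hypothetical, the "pointer" is simply an array index, and keys are assumed to be ASCII strings without NUL characters so that byte order matches string order.

```scala
import java.nio.charset.StandardCharsets

// Toy prefix sort: compare 8-byte key prefixes first, touch the full record only on ties.
object ToyPrefixSort {
  final case class Entry(prefix: Long, pointer: Int)

  // First 8 bytes of the key, big-endian, zero-padded on the right (assumes ASCII keys).
  def prefixOf(key: String): Long = {
    val bytes = key.getBytes(StandardCharsets.US_ASCII)
    var p = 0L
    var i = 0
    while (i < 8) {
      val b = if (i < bytes.length) bytes(i) & 0xFFL else 0L
      p = (p << 8) | b
      i += 1
    }
    p
  }

  // Returns record indices ("pointers") in sorted key order.
  def sortedPointers(records: Array[String]): Array[Int] = {
    val entries = records.indices.map(i => Entry(prefixOf(records(i)), i)).toArray
    val ord = new Ordering[Entry] {
      def compare(a: Entry, b: Entry): Int = {
        val c = java.lang.Long.compareUnsigned(a.prefix, b.prefix)
        // Only on a prefix tie do we follow the pointers to the full records.
        if (c != 0) c else records(a.pointer).compareTo(records(b.pointer))
      }
    }
    entries.sorted(ord).map(_.pointer)
  }
}
```

Most comparisons are decided by the 8-byte prefix stored next to the pointer, which is what keeps the sort friendly to CPU cache lines.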
  • 57. Spilling. Efficient spilling: exact data size is known; no need to maintain heuristics & approximations; controls amount of spilling. Spill merge on compressed, binary records (if compression CODEC supports it)! Exact peak memory for Spark jobs: UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()
  • 58. Code Generation. Problem: boxing causes excessive object creation; expensive expression tree evals per row; JVM can’t inline polymorphic impls. Solution: codegen by-passes virtual function calls; defer source code generation to each operator, UDF, UDAF; use Scala quasiquote macros for Scala AST source code gen; rewrite and optimize code for overall plan, 8-byte align, etc; use Janino to compile generated source code into bytecode
  • 59. Spark SQL UDF Code Generation: 100+ UDFs now generating code; each implements Expression.genCode()! More to come in Spark 1.6+. Details in SPARK-8159, SPARK-9571
  • 60. Creating a Custom UDF with Codegen. Study existing implementations: https://github.com/apache/spark/pull/7214/files. Extend base trait: o.a.s.sql.catalyst.expressions.Expression.genCode(). Register the function: o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction(). Augment DataFrame with new UDF (Scala implicits): o.a.s.sql.functions.scala. Don’t forget about Python! python.pyspark.sql.functions.py
  • 61. Who Benefits from Project Tungsten? Users of DataFrames; all Spark SQL queries (Catalyst); all RDDs (serialization, compression, and aggregations)
  • 62. Project Tungsten Performance Results (charts: Query Time; Garbage Collection; annotation: OOM’d on Large Dataset!)
  • 63. Presentation Outline: Spark Core: Tuning & Mechanical Sympathy; Spark SQL: Query Optimizing & Catalyst; Spark Streaming: Scaling & Approximations; Spark ML: Featurizing & Recommendations
  • 64. Spark SQL: Query Optimizing & Catalyst. Explore DataFrames/Datasets/DataSources, Catalyst; Review Partitions, Pruning, Pushdowns, File Formats; Create a Custom DataSource API Implementation
  • 65. DataFrames (use DataFrames instead of RDDs!!). Inspired by R and Pandas DataFrames. Schema-aware. Cross-language support: SQL, Python, Scala, Java, R. Levels performance of Python, Scala, Java, and R: generates JVM bytecode vs. serializing to Python. DataFrame is container for logical plan: lazy transformations represented as tree; only logical plan is sent from Python -> JVM; only results returned from JVM -> Python. UDF and UDAF support: custom UDF support using registerFunction(); experimental UDAF support (e.g. HyperLogLog). Supports existing Hive metastore if available; small, file-based Hive metastore created if not available. DataFrame.rdd returns underlying RDD if needed
  • 66. Spark and Hive. Early days: Shark was “Hive on Spark”; Hive optimizer slowly replaced with Catalyst. Always use HiveContext, even if not using Hive! If no Hive, a small Hive metastore file is created. Spark 1.5+ supports all Hive versions 0.12+: separate classloaders for isolation; breaks dependency between Spark's internal Hive version and the user’s external Hive version
  • 67. Catalyst Optimizer. Optimize DataFrame transformation tree: subquery elimination (use aliases to collapse subqueries); constant folding (replace expression with constant); simplify filters (remove unnecessary filters); predicate/filter pushdowns (avoid unnecessary data load); projection collapsing (avoid unnecessary projections). Create custom rules: rules are Scala case classes that implement o.a.s.sql.catalyst.rules.Rule and can be applied to any plan stage: val newPlan = MyFilterRule(analyzedPlan)
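Constant folding, one of the optimizations listed above, can be shown on a toy expression tree. This sketch deliberately does not use the Catalyst API: `Expr`, `Lit`, `Attr`, `Add`, and `ToyConstantFolding` are hypothetical stand-ins for Catalyst's expression classes and rule objects, chosen so the example runs without Spark on the classpath.

```scala
// Toy expression tree and a constant-folding "rule" (illustrative only, not Catalyst classes).
sealed trait Expr
final case class Lit(value: Int) extends Expr
final case class Attr(name: String) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr

object ToyConstantFolding {
  // Rewrites the tree bottom-up, replacing Add(Lit, Lit) with a single literal.
  def apply(e: Expr): Expr = e match {
    case Add(l, r) =>
      (apply(l), apply(r)) match {
        case (Lit(a), Lit(b)) => Lit(a + b)
        case (fl, fr)         => Add(fl, fr)
      }
    case other => other
  }
}
```

Calling `ToyConstantFolding(plan)` mirrors the shape of the slide's `MyFilterRule(analyzedPlan)`: a rule is a function from one tree to an equivalent, cheaper tree.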
  • 68. DataSources API. Relations (o.a.s.sql.sources.interfaces.scala): BaseRelation (abstract class): provides schema of data; TableScan (impl): read all data from source; PrunedFilteredScan (impl): column pruning & predicate pushdowns; InsertableRelation (impl): insert/overwrite data based on SaveMode; RelationProvider (trait/interface): handle options, BaseRelation factory. Execution (o.a.s.sql.execution.commands.scala): RunnableCommand (trait/interface): common commands like EXPLAIN; ExplainCommand (impl: case class); CacheTableCommand (impl: case class). Filters (o.a.s.sql.sources.filters.scala): Filter (abstract class): handles all predicates/filters supported by this source; EqualTo (impl); GreaterThan (impl); StringStartsWith (impl)
  • 69. Native Spark SQL DataSources
  • 70. Query Plan Debugging: gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true); DataFrame.queryExecution.logical; DataFrame.queryExecution.analyzed; DataFrame.queryExecution.optimizedPlan; DataFrame.queryExecution.executedPlan
  • 71. Query Plan Visualization & Metrics (diagram annotations: Effectiveness of Filter; CPU Cache Friendly Binary Format; Cost-based Join Optimization, Similar to MapReduce Map-side Join; Peak Memory for Joins and Aggs)
  • 72. JSON Data Source. DataFrame: val ratingsDF = sqlContext.read.format("json").load("file:/root/pipeline/datasets/dating/ratings.json.bz2") or, using the json() convenience method: val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2"). SQL Code: CREATE TABLE genders USING json OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
  • 73. JDBC Data Source. Add Driver to Spark JVM System Classpath: $ export SPARK_CLASSPATH=<jdbc-driver.jar>. DataFrame: val jdbcConfig = Map("driver" -> "org.postgresql.Driver", "url" -> "jdbc:postgresql://hostname:port/database", "dbtable" -> "schema.tablename"); sqlContext.read.format("jdbc").options(jdbcConfig).load(). SQL: CREATE TABLE genders USING jdbc OPTIONS (url, dbtable, driver, …)
  • 74. Parquet Data Source. Configuration: spark.sql.parquet.filterPushdown=true; spark.sql.parquet.mergeSchema=true; spark.sql.parquet.cacheMetadata=true; spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]. DataFrames: val gendersDF = sqlContext.read.format("parquet").load("file:/root/pipeline/datasets/dating/genders.parquet"); gendersDF.write.format("parquet").partitionBy("gender").save("file:/root/pipeline/datasets/dating/genders.parquet"). SQL: CREATE TABLE genders USING parquet OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
  • 75. ORC Data Source. Configuration: spark.sql.orc.filterPushdown=true. DataFrames: val gendersDF = sqlContext.read.format("orc").load("file:/root/pipeline/datasets/dating/genders"); gendersDF.write.format("orc").partitionBy("gender").save("file:/root/pipeline/datasets/dating/genders"). SQL: CREATE TABLE genders USING orc OPTIONS (path "file:/root/pipeline/datasets/dating/genders")
  • 76. Third-Party Spark SQL DataSources: spark-packages.org
  • 77. CSV DataSource (Databricks). Github: https://github.com/databricks/spark-csv. Maven: com.databricks:spark-csv_2.10:1.2.0. Code: val gendersCsvDF = sqlContext.read.format("com.databricks.spark.csv").load("file:/root/pipeline/datasets/dating/gender.csv.bz2").toDF("id", "gender") (toDF() is required if CSV does not contain header)
  • 78. ElasticSearch DataSource (Elastic.co). Github: https://github.com/elastic/elasticsearch-hadoop. Maven: org.elasticsearch:elasticsearch-spark_2.10:2.1.0. Code: val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>"); df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite).options(esConfig).save("<index>/<document-type>")
  • 79. Elasticsearch Tips: change id field to not_analyzed to avoid indexing; use term filter to build and cache the query; perform multiple aggregations in a single request; adapt scoring function to current trends at query time
  • 80. AWS Redshift Data Source (Databricks). Github: https://github.com/databricks/spark-redshift. Maven: com.databricks:spark-redshift:0.5.0. Code: val df: DataFrame = sqlContext.read.format("com.databricks.spark.redshift").option("url", "jdbc:redshift://<hostname>:<port>/<database>…").option("query", "select x, count(*) from my_table group by x").option("tempdir", "s3n://tmpdir").load(...). UNLOAD and copy to tmp bucket in S3 enables parallel reads
  • 81. DB2 and BigSQL DataSources (IBM): Coming Soon!
  • 82. Cassandra DataSource (DataStax). Github: https://github.com/datastax/spark-cassandra-connector. Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1. Code: ratingsDF.write.format("org.apache.spark.sql.cassandra").mode(SaveMode.Append).options(Map("keyspace"->"<keyspace>", "table"->"<table>")).save(…)
  • 83. Cassandra Pushdown Support: spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala. Pushdown Predicate Rules: 1. Only push down non-partition key column predicates with =, >, <, >=, <= predicates. 2. Only push down primary key column predicates with = or IN predicates. 3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates. 4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed. 5. For cluster column predicates, only the last predicate can be a non-EQ predicate (including an IN predicate), and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, it can be any non-IN predicate. 6. No predicates are pushed down if there is any OR condition or NOT IN condition. 7. Multiple predicates for the same column cannot be pushed down if any of them is an equality or IN predicate.
  • 84. New Cassandra DataSource. By-pass CQL, which is optimized for transactional data; instead, do bulk reads/writes directly on SSTables. Similar to the 5-year-old Netflix open source project Aegisthus. Promotes Cassandra to first-class analytics option. Potentially only part of DataStax Enterprise?! Please mail a nasty letter to your local DataStax office
  • 85. Rumor of REST DataSource (Databricks): Coming Soon? Ask Michael Armbrust, Spark SQL Lead @ Databricks
  • 86. Custom DataSource (Me and You!): Coming Right Now! DEMO ALERT!!
  • 87. Create a Custom DataSource: Study Existing Native & Third-Party Data Sources. Native Spark JDBC (o.a.s.sql.execution.datasources.jdbc): class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation. Third-Party DataStax Cassandra (o.a.s.sql.cassandra): class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation
  • 88. Demo! Create a Custom DataSource
  • 89. Contribute a Custom Data Source. spark-packages.org: Managed by [image]; contains links to external github projects; ratings and comments; declare Spark version support for each package. Examples: https://github.com/databricks/spark-csv, https://github.com/databricks/spark-avro, https://github.com/databricks/spark-redshift
  • 90. Parquet Columnar File Format: based on Google Dremel; collaboration with Twitter and Cloudera; self-describing, evolving schema; fast columnar aggregation; supports filter pushdowns; columnar storage format; excellent compression
  • 91. Click to edit Master text styles Click to edit Master text styles IBM Spark spark.tc Click to edit Master text styles spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Types of Compression Run Length Encoding: Repeated data Dictionary Encoding: Fixed set of values Delta, Prefix Encoding: Sorted data 91
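The three encodings on this slide can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not Parquet's actual on-disk encoders; the function names and sample values are made up for the example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer code into a lookup table."""
    dictionary = {}
    codes = []
    for value in values:
        # A new value gets the next free code; a known value reuses its code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


def delta_encode(sorted_values):
    """Keep the first value, then only the difference from the previous one."""
    if not sorted_values:
        return []
    deltas = [sorted_values[0]]
    for previous, current in zip(sorted_values, sorted_values[1:]):
        deltas.append(current - previous)
    return deltas


if __name__ == "__main__":
    print(run_length_encode(["a", "a", "a", "b", "b", "a"]))
    print(dictionary_encode(["US", "FI", "US", "US"]))
    print(delta_encode([1000, 1003, 1004, 1010]))
```

Each encoding pays off on the kind of data the slide names: long runs, a small fixed vocabulary, or sorted values whose neighbors are close together.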
  • 92. Demo! Demonstrate File Formats, Partition Schemes, and Query Plans
  • 93. Hive JDBC ODBC ThriftServer: Allow BI tools to query and process Spark data. Register Permanent Table: CREATE TABLE ratings(fromuserid INT, touserid INT, rating INT) USING org.apache.spark.sql.json OPTIONS (path "datasets/dating/ratings.json.bz2"). Register Temp Table: ratingsDF.registerTempTable("ratings_temp"). Configuration: spark.sql.thriftServer.incrementalCollect=true; spark.driver.maxResultSize > 10gb (default)
  • 94. Demo! Query and Process Spark Data from BI Tools
  • 95. Presentation Outline: Spark Core: Tuning & Mechanical Sympathy; Spark SQL: Query Optimizing & Catalyst; Spark Streaming: Scaling & Approximations; Spark ML: Featurizing & Recommendations
  • 96. Spark Streaming: Scaling & Approximations. Discuss Delivery Guarantees, Parallelism, and Stability; Compare Receiver and Receiver-less Impls; Demonstrate Stream Approximations
  • 97. Non-Parallel Receiver Implementation
  • 98. Receiver Implementation (Kinesis): KinesisRDD partitions store relevant offsets; single receiver required to see all data/offsets; Kinesis offsets not deterministic like Kafka; partitions rebuild from Kinesis using offsets; no Write Ahead Log (WAL) needed; optimizes happy path by avoiding the WAL; at-least-once delivery guarantee.
  • 99. Parallel Receiver-less Implementation (Kafka)
  • 100. Receiver-less Implementation (Kafka): KafkaRDD partitions store relevant offsets; each partition acts as a Receiver; tasks/executors pull from Kafka in parallel; partitions rebuild from Kafka using offsets; no Write Ahead Log (WAL) needed; optimizes happy path by avoiding the WAL; at-least-once delivery guarantee.
  • 101. Maintain Stability of Stream Processing: Rate Limiting (since Spark 1.2): fixed limit on number of messages per second; potential to drop messages on the floor. Back Pressure (since Spark 1.5, Typesafe contribution): more dynamic than rate limiting; push back on reliable, buffered source (Kafka, Kinesis); fundamentals of control theory and observability.
  • 102. Streaming Approximations: HyperLogLog and CountMin Sketch
  • 103. HyperLogLog (HLL) Approx Distinct Count: Approximate count distinct; Twitter’s Algebird; better than HashSet; low, fixed memory; only 1.5K, 2% error, 10^9 counts (tunable); Redis HLL: 12K per key, 0.81% error, 2^64 counts; Spark’s countApproxDistinctByKey(); streaming example in Spark codebase. (Image source: http://research.neustar.biz/)
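To make the "low, fixed memory" point concrete, here is a minimal HyperLogLog sketch in plain Python. It is not Algebird's, Redis's, or Spark's implementation; the class name, the SHA-1 based hash, and the default precision are assumptions chosen for illustration. With precision 11 there are 2048 registers, which would pack into roughly 1.5 KB at 6 bits each (the Python list used here is not packed that tightly).

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog: 2**precision registers, memory fixed at construction."""

    def __init__(self, precision=11):
        self.p = precision
        self.m = 1 << precision          # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Hypothetical hash choice: first 64 bits of SHA-1 of the item's string form.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                  # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100_000):
        hll.add(f"user-{i}")
        hll.add(f"user-{i}")   # duplicates do not change the estimate
    print(round(hll.count()))
```

The expected relative error is about 1.04 / sqrt(m), so roughly 2.3% for 2048 registers; raising the precision trades memory for accuracy.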
  • 104. CountMin Sketch (CMS) Approx Count: Approximate count; Twitter’s Algebird; better than HashMap; low, fixed memory; known error bounds; large number of counters; streaming example in Spark codebase.
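A Count-Min Sketch can be sketched just as briefly. Again this is an illustrative toy rather than Algebird's implementation; the class name, the width and depth defaults, and the SHA-1 based row hashes are assumptions.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: depth rows of width counters, fixed memory."""

    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One hash per row, derived by salting the item with the row number.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever inflate a counter, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(self.table[row][col] for row, col in self._columns(item))


if __name__ == "__main__":
    cms = CountMinSketch()
    for _ in range(100):
        cms.add("popular-user")
    for i in range(1000):
        cms.add(f"other-{i}")
    print(cms.estimate("popular-user"))
```

The "known error bounds" on the slide come from this structure: the estimate is never below the true count, and the overcount shrinks as width grows while the failure probability shrinks as depth grows.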
  • 105. Demo! Using HLL and CMS for Streaming Count Approximations
  • 106. Monte Carlo Simulations: From the Manhattan Project (atomic bomb): simulate movement of neutrons. Law of Large Numbers (LLN): the average of the results of many trials converges on the expected value. SparkPi example in Spark codebase; 1 argument: # of trials. Pi ~= (# red dots / # total dots) * 4
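The dot-counting formula can be tried on a single machine in plain Python. This is a sketch of the idea and not the SparkPi source; the function name, the seed, and the choice to sample only the unit quarter-circle (which gives the same ratio as a full circle in a square) are assumptions.

```python
import random


def estimate_pi(num_trials, seed=42):
    """Estimate Pi as 4 * (points inside the quarter circle / total points)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_trials):
        x, y = rng.random(), rng.random()   # uniform point in the unit square
        if x * x + y * y <= 1.0:            # a "red dot": inside the circle
            inside += 1
    return 4.0 * inside / num_trials


if __name__ == "__main__":
    for trials in (100, 10_000, 1_000_000):
        print(trials, estimate_pi(trials))
```

Increasing the number of trials illustrates the Law of Large Numbers: the estimate's spread shrinks roughly with the square root of the trial count.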
  • 107. Demo! Using a Monte Carlo Simulation to Estimate Pi
  • 108. Streaming Best Practices: Get Data Out of Streaming ASAP: processing interval may exceed batch interval, which leads to an unstable streaming system. Please Don’t: use updateStateByKey() like an in-memory DB; put streaming jobs on the request/response hot path. Use Separate Jobs for Different Batch Intervals: small batch interval: store raw data (Redis, Cassandra, etc.); medium batch interval: transform, join, process data; high batch interval: model training. Gotchas: tune streamingContext.remember(). Use approximations!!
  • 109. Presentation Outline: Spark Core: Tuning & Mechanical Sympathy; Spark SQL: Query Optimizing & Catalyst; Spark Streaming: Scaling & Approximations; Spark ML: Featurizing & Recommendations
  • 110. Spark ML: Featurizing & Recommendations. Understand Similarity and Dimension Reduction; Demonstrate Sampling and Bucketing; Generate Recommendations
  • 111. Live, Interactive Demo! sparkafterdark.com
  • 112. Audience Participation Needed!! [Callout: You are here] Audience Instructions: navigate to sparkafterdark.com; click 3 actresses and 3 actors; wait for us to analyze together! Note: This is totally anonymous!! Project Links: https://github.com/fluxcapacitor/pipeline https://hub.docker.com/r/fluxcapacitor
  • 113. Similarity
  • 114. Types of Similarity: Euclidean: linear-based measure; suffers from magnitude bias. Cosine: angle-based measure; adjusts for magnitude bias. Jaccard: set intersection / union; suffers from popularity bias. Log Likelihood: Netflix “Shawshank” Problem; adjusts for popularity bias. [Figure: like matrix with rows Kimberly, Leslie, Meredith, Lisa, Holden and columns Ali, Matei, Reynold, Patrick, Andy; cell positions not recoverable from the transcript]
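The first three measures are short enough to write out directly. The sketch below uses plain Python with made-up rating vectors and like sets (the names are borrowed from the slide's figure, but the values are hypothetical); log-likelihood similarity is left out.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; grows with magnitude even if direction matches."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    light_rater = [1, 2, 3]
    heavy_rater = [2, 4, 6]   # same taste, twice the magnitude
    print(euclidean_distance(light_rater, heavy_rater))  # far apart
    print(cosine_similarity(light_rater, heavy_rater))   # essentially identical
    # Hypothetical like sets for two users.
    print(jaccard_similarity({"Ali", "Matei", "Andy"}, {"Matei", "Andy", "Patrick"}))
```

The two raters point in the same direction, so cosine similarity treats them as the same while Euclidean distance does not, which is the magnitude bias the slide refers to.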
  • 115. All-Pairs Similarity Comparison: Compare everything to everything, aka “pair-wise similarity” or “similarity join”. Naïve shuffle: O(m*n^2); m=rows, n=cols. Minimize shuffle through approximations! Reduce m (rows): sampling and bucketing. Reduce n (cols): remove most frequent value (i.e. 0); Principal Component Analysis. Dimension reduction!!
  • 116. Dimension Reduction: Sampling and Bucketing
  • 117. Reduce m: DIMSUM Sampling. “Dimension Independent Matrix Square Using MR”; remove rows with low similarity probability; MLlib: RowMatrix.columnSimilarities(…); Twitter: 40% efficiency gain vs. Cosine Similarity.
  • 118. Reduce m: LSH Bucketing. “Locality Sensitive Hashing”; split m into b buckets; use similarity hash algorithm; requires pre-processing of data; parallel compare bucket contents. O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets. E.g. 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50. (github.com/mrsqueeze/spark-hash)
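The slide's numbers can be checked with a few lines of arithmetic. The sketch below assumes the bucketed expression is read left to right, ((m*n)/b)*b^2, which is the reading that reproduces the 1.25e13 figure; the function names are made up.

```python
def naive_cost(m, n):
    """All-pairs comparison cost from the slide: m * n^2."""
    return m * n ** 2


def bucketed_cost(m, n, b):
    """The slide's bucketed expression read left to right: ((m * n) / b) * b^2."""
    return m * n // b * b ** 2


if __name__ == "__main__":
    m = n = 500_000
    b = 50
    print(f"{naive_cost(m, n):.3e}")        # 1.250e+17
    print(f"{bucketed_cost(m, n, b):.3e}")  # 1.250e+13
    print(naive_cost(m, n) // bucketed_cost(m, n, b))  # 10000
```

Under that reading, bucketing with b=50 cuts the comparison count by a factor of 10,000 for this matrix.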
  • 119. Reduce n: Remove Most Frequent Value. Eliminate most-frequent value; represent other values with (index,value) pairs. Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n. Note: Choose most frequent value (may not be 0).
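A minimal sketch of the (index,value) representation, in plain Python with hypothetical function names. The dot product helper assumes the dropped value is 0; if the most frequent value is something else, the omitted positions would need separate handling.

```python
from collections import Counter


def to_sparse(row):
    """Drop the most frequent value and keep the rest as (index, value) pairs."""
    most_frequent, _ = Counter(row).most_common(1)[0]
    pairs = [(i, v) for i, v in enumerate(row) if v != most_frequent]
    return most_frequent, pairs


def sparse_dot(a_pairs, b_pairs):
    """Dot product over stored entries only (valid when the dropped value is 0)."""
    b = dict(b_pairs)
    return sum(v * b[i] for i, v in a_pairs if i in b)


if __name__ == "__main__":
    print(to_sparse([0, 0, 5, 0, 3, 0]))   # most frequent value is 0
    print(to_sparse([1, 1, 1, 4, 1]))      # most frequent value is 1, not 0
    print(sparse_dot([(2, 5), (4, 3)], [(2, 2), (5, 7)]))
```

Work now scales with the number of stored pairs (nnz) instead of the full column count n.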
  • 120. Recommendations: Summary Statistics and Top-K Historical Analysis; Collaborative Filtering and Clustering; Text Featurization and NLP
  • 121. Types of Recommendations: Non-personalized: no preference or behavior data for the user yet, aka “Cold Start Problem”. Personalized: User-Item Similarity: items that others with similar prefs have liked. Item-Item Similarity: items similar to your previously-liked items.
  • 122. Recommendation Terminology: Feedback: explicit (like, rating); implicit (search, click, hover, view, scroll). Feature Engineering: dimension reduction, polynomial expansion. Hyper-parameter Tuning: K-Folds Cross Validation, Grid Search. Pipelines/Workflows: chaining together Transformers and Estimators.
  • 123. Single Machine ML Algorithms: Stay local, distribute as needed; helps migration of existing single-node algos to Spark; convert between Spark and Pandas DataFrames; new “pdspark” package: integration w/ scikit-learn, R.
  • 124. Non-Personalized Recommendations: Use Aggregate Data to Generate Recommendations
  • 125. Top Users by Like Count: “I might like users who have the most likes overall, based on historical data.” SparkSQL, DataFrames: Summary Stats, Aggs
  • 126. Top Influencers by Like Graph: “I might like the most-influential users in the overall like graph.” GraphX: PageRank
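To show what "most influential in the like graph" means, here is a small power-iteration PageRank in plain Python. It is the textbook normalized form, not GraphX's implementation; the function name, the damping and iteration defaults, and the sample like edges are assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (liker, liked) edges; ranks sum to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A user who likes nobody spreads their rank evenly over everyone.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical like graph: (who liked, who was liked).
    likes = [
        ("Kimberly", "Ali"), ("Leslie", "Ali"), ("Meredith", "Ali"),
        ("Lisa", "Matei"), ("Ali", "Matei"),
    ]
    for user, score in sorted(pagerank(likes).items(), key=lambda kv: -kv[1]):
        print(f"{user:10s} {score:.3f}")
```

In this made-up graph Matei has fewer direct likes than Ali but ranks higher, because the like from the well-liked Ali carries more weight than a like from a user nobody likes.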
  • 127. Demo! Generate Non-Personalized Recommendations
  • 128. Personalized Recommendations: Understand Similarity and Personalized Recommendations
  • 129. Like Behavior of Similar Users: “I like the same people that you like. What other people did you like that I haven’t seen?” MLlib: Matrix Factorization, User-Item Similarity
  • 130. Demo! Generate Personalized Recommendations using Collaborative Filtering & Matrix Factorization
  • 131. Similar Text-based Profiles as Me: “Our profiles have similar keywords and named entities. We might like each other!” MLlib: Word2Vec, TF/IDF, k-skip n-grams
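As a rough picture of how TF/IDF turns profile text into comparable keyword weights, here is a plain-Python sketch. It is not MLlib's HashingTF/IDF pipeline; the function name, the smoothed IDF variant log((N+1)/(df+1)), and the sample profiles are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    num_docs = len(documents)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)
        weights.append({
            # Raw term count scaled by a smoothed inverse document frequency.
            term: count * math.log((num_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_freq.items()
        })
    return weights


if __name__ == "__main__":
    # Hypothetical tokenized profiles.
    profiles = [
        ["spark", "hiking", "coffee"],
        ["spark", "coffee", "coffee"],
        ["spark", "sailing"],
    ]
    for weights in tf_idf(profiles):
        print(weights)
```

A word that appears in every profile ("spark" here) gets weight 0, so only distinguishing keywords contribute when the weighted vectors are later compared with a similarity measure.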
  • 132. Similar Profiles to Previous Likes: “Your profile text has similar keywords and named entities to other profiles of people I like. I might like you, too!” MLlib: Word2Vec, TF/IDF, Doc Similarity
  • 133. Relevant, High-Value Emails: “Your initial email references a lot of things in my profile. I might like you for making the effort!” MLlib: Word2Vec, TF/IDF, Entity Recognition [Figure labels: Her Email, My Profile]
  • 134. Demo! Feature Engineering for Text/NLP Use Cases
  • 135. The Future of Recommendations
  • 136. Eigenfaces: Facial Recognition: “Your face looks similar to others that I’ve liked. I might like you.” MLlib: RowMatrix, PCA, Item-Item Similarity. Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  • 137. NLP Conversation Starter Bot! “If your responses to my generic opening lines are positive, I may read your profile.” MLlib: TF/IDF, DecisionTrees, Sentiment Analysis [Figure labels: Positive, Negative]
  • 138. Maintaining the Spark
  • 139. Recommendations for Couples: “I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.” MLlib: RowMatrix, Item-Item Similarity. GraphX: Nearest Neighbors, Shortest Path. [Figure labels: similar plots, similar actors]
  • 140. Final Recommendation!
  • 141. Get Off the Computer & Meet People! Thank you, Helsinki!! Chris Fregly, @cfregly, IBM Spark Technology Center, San Francisco, CA, USA. Relevant Links: advancedspark.com (sign up for the book & global meetup!); github.com/fluxcapacitor/pipeline (clone, contribute, and commit code!); hub.docker.com/r/fluxcapacitor/pipeline/wiki (run all demos in your own environment with Docker!)
  • 142. More Relevant Links: http://meetup.com/Advanced-Apache-Spark-Meetup http://advancedspark.com http://github.com/fluxcapacitor/pipeline http://hub.docker.com/r/fluxcapacitor/pipeline http://sortbenchmark.org/ApacheSpark2014.pdf https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html http://0x0fff.com/spark-architecture-shuffle/ http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/ http://docs.scala-lang.org/overviews/quasiquotes/intro.html http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches) http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do) https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html http://www.brendangregg.com/perf.html https://perf.wiki.kernel.org/index.php/Tutorial http://techblog.netflix.com/2015/07/java-in-flames.html http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
  • 143. What’s Next?
  • 144. What’s Next? Autoscaling Spark Workers: completely Docker-based; Docker Compose and Docker Machine. Lots of Demos and Examples: Zeppelin & IPython/Jupyter notebooks; advanced streaming use cases; advanced ML, Graph, and NLP use cases. Performance Tuning and Profiling: work closely with Brendan Gregg & Netflix; surface & share more low-level details of Spark internals.
  • 145. Upcoming Meetups and Conferences: London Spark Meetup (Oct 12th); Scotland Data Science Meetup (Oct 13th); Dublin Spark Meetup (Oct 15th); Barcelona Spark Meetup (Oct 20th); Madrid Big Data Meetup (Oct 22nd); Paris Spark Meetup (Oct 26th); Amsterdam Spark Summit (Oct 27th); Brussels Spark Meetup (Oct 30th); Zurich Big Data Meetup (Nov 2nd); Geneva Spark Meetup (Nov 5th); San Francisco Datapalooza.io (Nov 10th); San Francisco Advanced Spark (Nov 12th); Oslo Big Data Hadoop Meetup (Nov 19th); Helsinki Spark Meetup (Nov 20th); Stockholm Spark Meetup (Nov 23rd); Copenhagen Spark Meetup (Nov 25th); Budapest Spark Meetup (Nov 26th); Singapore Strata Conference (Dec 1st); San Francisco Advanced Spark (Dec 8th); Mountain View Advanced Spark (Dec 10th); Toronto Spark Meetup (Dec 14th); Austin Data Days Conference (Jan 2016)