Shark
1. Shark: SQL and Rich
Analytics at Scale
November 26, 2012
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf
AMPLab, EECS, UC Berkeley
Presented by Alexander Ivanichev
Reynold S. Xin, Josh Rosen, Matei Zaharia,
Michael J. Franklin, Scott Shenker, Ion Stoica
2. Agenda
• Problems with existing database solutions
• What is Shark?
• Shark architecture overview
• Extending Spark for SQL
• Shark performance and examples
3. Challenges in modern data analysis
Data size growing
• Processing has to scale out over large clusters
• Faults and stragglers complicate DB design
Complexity of analysis increasing
• Massive ETL (web crawling)
• Machine learning, graph processing
• Leads to long running jobs
4. Existing analysis solutions
MPP Databases
• Vertica, SAP HANA, Teradata, Google Dremel, Google PowerDrill, Cloudera Impala...
• Fast!
• Generally not fault-tolerant; challenging for long running queries as clusters scale up.
• Lack rich analytics such as machine learning and graph algorithms.
MapReduce
• Apache Hive, Google Tenzing, Turn Cheetah...
• Enables fine-grained fault tolerance, resource sharing, scalability.
• Expressive enough for machine learning algorithms.
• High latency; often dismissed for interactive workloads.
5. What’s good about MapReduce?
A data warehouse
• scales out to thousands of nodes in a fault-tolerant manner
• puts structure/schema onto HDFS data (schema-on-read)
• compiles HiveQL queries into MapReduce jobs
• flexible and extensible: supports UDFs, scripts, custom serializers, storage formats
Data-parallel operations
• Apply the same operations on a defined set of data
Fine-grained, deterministic tasks
• Enables fault-tolerance & straggler mitigation
But slow: 30+ seconds even for simple queries
6. Shark research
Shows MapReduce model can be extended to support SQL efficiently
• Started from a powerful MR-like engine (Spark)
• Extended the engine in various ways
The artifact: Shark, a fast engine on top of MR
• Performant SQL
• Complex analytics in the same engine
• Maintains MR benefits, e.g. fault-tolerance
7. What is Shark?
A new data analysis system that is:
MapReduce-based
• Built on top of RDDs and the Spark execution engine
• Scales out and is fault-tolerant
Performant
• Speedups of up to 100x over Hive
• Supports low-latency, interactive queries through in-memory computation
Expressive and flexible
• Supports both SQL and complex analytics such as machine learning
• Compatible with Apache Hive (storage, UDFs, types, metadata, etc.)
8. Shark on top of the Spark engine
Fast MapReduce-like engine
• In-memory storage for fast iterative computations
• General execution graphs
• Designed for low latency (~100ms jobs)
Compatible with Hadoop storage APIs
• Reads/writes to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
Growing open source platform
• 17+ companies contributing code
10. Shark architecture
Architecture diagram: the master node runs the Spark runtime alongside the HDFS NameNode, the metastore (system catalog), and the resource manager's scheduler in the master process. Each slave node runs the Spark runtime with an execution engine and an in-memory store (memstore), next to a resource manager daemon and an HDFS DataNode. Shark can query an existing Hive warehouse without modification and returns results much faster.
11. Shark architecture - overview
Spark
• Supports partial DAG execution
• Optimizes join algorithms at run time
Features of Shark
• Supports general computation
• Provides an in-memory storage abstraction: the RDD
• Engine is optimized for low latency
12. Shark architecture - overview
RDD
• Spark’s main abstraction: the Resilient Distributed Dataset (RDD)
• A collection stored in an external storage system, or derived from another dataset
• Can contain arbitrary data types
Benefits of RDDs
• Returns data at the speed of DRAM
• Uses lineage for fault recovery
• Speedy, parallelized recovery
• Immutability provides a foundation for relational processing
Example RDD operators: groupByKey, join, mapPartitions, sortByKey, distinct, filter
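The operators above can be sketched on plain Scala collections. This is a local analogy only (real RDDs are partitioned across a cluster and evaluated lazily), but the per-element, data-parallel semantics match:

```scala
// Local sketch of RDD-style operators using plain Scala collections.
// Real RDDs are distributed and lazy; the semantics, however, are the same.
val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

// groupByKey: gather all values that share a key
val grouped: Map[String, Seq[Int]] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }

// join: match up entries from two datasets on their key
val other = Seq(("a", "x"), ("b", "y"))
val joined = for { (k, v) <- pairs; (k2, w) <- other if k == k2 }
             yield (k, (v, w))

// filter and distinct behave exactly like their RDD counterparts
val filtered     = pairs.filter { case (_, v) => v > 1 }
val distinctKeys = pairs.map(_._1).distinct   // List(a, b)
```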
13. Shark architecture - overview
Fault tolerance guarantees
• Shark can tolerate the loss of any set of worker nodes.
• Recovery is parallelized across the cluster.
• The deterministic nature of RDDs also enables straggler mitigation.
• Recovery works even in queries that combine SQL and machine learning UDFs.
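A toy illustration of the lineage idea, assuming nothing about Spark internals: a derived dataset records the deterministic function that produced it, so any lost partition can be recomputed from its parent rather than restarting the query.

```scala
// Toy sketch of lineage-based recovery (not Spark's implementation):
// a derived dataset keeps its parent partitions plus the deterministic
// function that produced it, so a lost partition is recomputed on demand.
final case class Derived[A, B](parents: Seq[Seq[A]], f: Seq[A] => Seq[B]) {
  def partition(i: Int): Seq[B] = f(parents(i))  // what a worker computes
}

val input   = Seq(Seq(1, 2), Seq(3, 4))          // two input "partitions"
val doubled = Derived(input, (p: Seq[Int]) => p.map(_ * 2))

val before = doubled.partition(0)                // held on some worker
// ...that worker is lost; recovery simply re-runs the lineage:
val after  = doubled.partition(0)
assert(before == after)                          // determinism => exact recovery
```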
15. Analyzing data on Shark
CREATE EXTERNAL TABLE wiki
(id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://spark-data/wikipedia-sample/';
SELECT COUNT(*) FROM wiki WHERE text LIKE '%Berkeley%';
16. Caching data in Shark
CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS
SELECT * FROM wiki;
CREATE TABLE wiki_cached AS SELECT * FROM wiki;
Either form creates a table stored in the cluster’s memory using RDD.cache() (a table name ending in _cached enables caching by convention).
17. Tuning the degree of parallelism
• Shark relies on Spark to infer the number of map tasks (automatically, based on input size).
• The number of reduce tasks must be specified by the user:
SET mapred.reduce.tasks=499;
• Too small a value can cause out-of-memory errors on slaves.
• It is usually safe to set a higher value, since task-launch overhead in Spark is low.
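Why the reduce-task count matters can be sketched with a simple hash-partitioning model (the key set and counts here are invented for illustration): shuffle output is hash-partitioned by key across k reduce tasks, so a small k forces each reducer to buffer a large slice of the data.

```scala
// Sketch: shuffle output is hash-partitioned by key across k reduce
// tasks. With few tasks, one reducer can hold a huge slice (risking
// out-of-memory); with many tasks, slices shrink at a modest launch cost.
val keys = (1 to 10000).map(i => s"key$i")

def largestReducerLoad(k: Int): Int =
  keys.groupBy(key => math.abs(key.hashCode) % k).values.map(_.size).max

val fewTasks  = largestReducerLoad(4)    // at least 2500 keys on one reducer
val manyTasks = largestReducerLoad(499)  // far smaller per-reducer slices
```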
18. Extending Spark for SQL
• Partial DAG Execution
• Columnar Memory Store
• Machine Learning Integration
• Hash-based Shuffle vs Sort-based Shuffle
• Data Co-partitioning
• Partition Pruning based on Range Statistics
• Distributed Data Loading
• Distributed sorting
• Better push-down of limits
19. Partial DAG Execution (PDE)
Partial DAG execution (PDE)
• Static query optimization
• Dynamic query optimization at run time
• Gathers customizable statistics from materialized map outputs
Examples of statistics
• Partition sizes and record counts
• List of “heavy hitters” (frequently occurring keys)
• Approximate histograms
How would you optimize the following query?
SELECT * FROM table1 a
JOIN table2 b ON a.key = b.key
WHERE my_udf(b.field1, b.field2) = true;
• Hard to estimate the cardinality of the UDF filter!
• Without cardinality estimates, a cost-based optimizer breaks down.
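The dynamic side of PDE can be sketched as follows: after map tasks materialize their output, per-partition statistics are inspected and the join strategy is picked at run time. The threshold and strategy names below are invented for illustration; Shark's actual decision logic is more involved.

```scala
// Sketch of run-time join selection from materialized map-output
// statistics (threshold invented for illustration).
case class PartitionStats(sizeBytes: Long, recordCount: Long)

def chooseJoinStrategy(buildSide: Seq[PartitionStats],
                       broadcastLimit: Long = 10L * 1024 * 1024): String = {
  val totalBytes = buildSide.map(_.sizeBytes).sum
  if (totalBytes <= broadcastLimit)
    "map join"       // small side fits in memory: broadcast it to every task
  else
    "shuffle join"   // otherwise hash-partition both sides on the join key
}

val small = Seq(PartitionStats(1L << 20, 50000))            // ~1 MB table
val large = Seq.fill(100)(PartitionStats(1L << 24, 100000)) // ~1.6 GB table
```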
21. Columnar memory store
• Simply caching records as JVM objects is insufficient.
• Shark employs column-oriented storage: a partition of
columns is one MapReduce “record”.
• Benefits: compact representation, CPU-efficient compression,
cache locality.
Row storage: (1, John, 4.1), (2, Mike, 3.5), (3, Sally, 6.4)
Column storage: [1, 2, 3], [John, Mike, Sally], [4.1, 3.5, 6.4]
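The two layouts from the example can be sketched in Scala (the column names id/name/score are assumed for illustration): a column store keeps each column in its own contiguous array, so a scan over one column touches no other fields and compresses well.

```scala
// Sketch of row vs. column layout for the example table
// (column names id/name/score are assumed for illustration).
case class Row(id: Int, name: String, score: Double)

// Row storage: one JVM object per record.
val rowStore = Seq(Row(1, "John", 4.1), Row(2, "Mike", 3.5), Row(3, "Sally", 6.4))

// Column storage: one primitive array per column; a partition of
// columns forms a single MapReduce "record".
val ids    = rowStore.map(_.id).toArray      // Array(1, 2, 3)
val names  = rowStore.map(_.name).toArray    // Array(John, Mike, Sally)
val scores = rowStore.map(_.score).toArray   // Array(4.1, 3.5, 6.4)

// A scan over one column never deserializes the others:
val avgScore = scores.sum / scores.length
```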
22. Machine learning support
Language integration
• Shark allows queries to perform machine learning, e.g. logistic
regression over a user database.
• Example: a data analysis pipeline that performs logistic regression over a
database table.
Execution engine integration
• A common abstraction allows machine learning computations and SQL
queries to share workers and cached data.
• Enables end-to-end fault tolerance.
Shark treats machine learning as a first-class citizen.
23. Integrating machine learning with SQL
def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")),
             ...)
}
val trainedVector = logRegress(features.cache())

A unified system for query processing and machine learning
27. Summary
• By using Spark as the execution engine and employing both novel and traditional database
techniques, Shark bridges the gap between MapReduce and MPP databases.
• Users can conveniently mix the best parts of both SQL and MapReduce-style programming
and avoid moving data between systems.
• Shark provides fine-grained fault recovery across both types of operations.
• With in-memory computation, users can choose to load high-value data into Shark’s memory store
for fast analytics.
• Shark can answer queries up to 100x faster than Hive, run machine learning up to 100x faster than
Hadoop MapReduce, and achieve performance comparable to MPP databases.