Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3

Spark Saturday: Spark SQL &
DataFrames Workshop w/
Apache Spark 2.3
Jules S. Damji
Apache Spark Developer & Community Advocate
Spark Saturday , Santa ClaraAugust 4th,2018

I have used Apache Spark Before…

I have used SQL or Spark SQL Before…

I know the difference between
DataFrame and RDDs…

Spark Community &Developer Advocate @
Databricks
Developer Advocate @ Hortonworks
Software engineering @: Sun Microsystems,
Netscape, @Home, VeriSign, Scalix, Centrify,
LoudCloud/Opsware, ProQuest
https://www.linkedin.com/in/dmatrix
@2twitme,
jules@databricks.com

Morning Afternoon
Agenda for the day
• Introduction to DataFrames &
Datasets
• DataFrames Labs
• Break
• DeveloperCertification
• Get to know Databricks
• Overview of Spark
Fundamentals & Architecture
• Unified APIs:SparkSessions,
SQL, DataFrames, Datasets…
• Break
• Spark SQL Labs
• Lunch

Get to know Databricks
• Keepthis URL Open in a separate tab https://dbricks.co/spark-saturday-bayarea
• Labs @Copyrightedby Databricks. Cannotbe repurposedfor Commercialuse!
Use This

Big Data Systems of Yesterday…
MapReduce/Hadoop
Generalbatch
processing
Drill
Storm
Pregel Giraph
Dremel Mahout
Storm Impala
Drill . . .
Specialized systems
for newworkloads
Hard to combine in pipelines

MapReduce
Generalbatch
processing
Unified engine
Big Data Systems Today
?
Pregel
Dremel Millwheel
Drill
Giraph
ImpalaStorm
S4 . . .
Specialized systems
for newworkloads

Faster, Easier to Use, Unified
13
First Distributed
Processing Engine
Specialized Data
Processing Engines
Unified Data
Processing Engine

Apache Spark Philosophy
Unified engine for complete
data applications
High-level user-friendly APIs
SQLStreaming ML Graph
…
DL
Applications

An Analogy ….
New applications

Unified engineacross diverse workloads &
environments

Apache Spark: The First Unified Analytics Engine
Runtime
Delta
Spark Core Engine
Big Data Processing
ETL + SQL + Streaming
Machine Learning
MLlib + SparkR
Uniquelycombines Data & AI technologies

DATABRICKS WORKSPACE
Databricks Delta ML Frameworks
DATABRICKS CLOUD SERVICE
DATABRICKS RUNTIME
Reliable & Scalable Simple & Integrated
Databricks Unified Analytics Platform
APIs
Jobs
Models
Notebooks
Dashboards End to end ML lifecycle

Common Spark Use Cases
2
ETL
MACHINE
LEARNING
SQL
ANALYTICS
STREAMING

The Benefits of Apache Spark?
SPEED
100x faster than
Hadoop for large
scale data processing
EASE OF USE
Simple APIs for
operating on large
data sets
UNIFIED ENGINE
Packaged withhigher-
level libraries (SQL,
Streaming, ML, Graph)

Apache Spark at Massive Scale
23
60TB+
Compressed
data
250,000+
# of tasks in a
single job
4.5-6x
CPU performance
improvement over Hive
https://databricks.com/blog/2016/08/31/apache-
spark-scale-a-60-tb-production-use-case.html

Apache Spark Architecture
Deployments
Modes
• Local
• Standalone
• YARN
• Mesos

Driver
+
Executor
Driver
+
Executor
Container
EC2 Machine
Student-1 Notebook
Student-2 Notebook
Container
JVM
JVM
Local Mode in Databricks

30 GB Container
30 GB Container
22 GB JVM
22 GB JVM
S
S
S
S
S
S
S
S
Ex.
Ex.
30 GB Container
30 GB Container
22 GB JVM
22 GB JVM
S
S
S
S
Dr
Ex.
... ...
Standalone Mode

Spark Deployment Modes
As spark 2.3 Kubernetes

Native Spark App in K8S
• New Spark scheduler backend
• Driver runs in a Kubernetes pod created
by the submission client and creates
pods that run the executors in response
to requests from the Spark scheduler.
[K8S-34377] [SPARK-18278]
• Make direct use of Kubernetes clusters for
multi-tenancy and sharing through
Namespaces and Quotas, as well as
administrative features such as Pluggable
Authorization, and Logging.
29

Spark on Kubernetes
Supported:
• Supports Kubernetes1.6 and up
• Supports cluster mode only
• Staticresource allocation only
• Supports Java and Scala
applications
• Can use container-local and
remote dependencies that are
downloadable
30
In roadmap (2.4):
• Client mode
• Dynamic resource allocation +
external shuffle service
• Python and R support
• Submission client local dependencies
+ Resource staging server (RSS)
• Non-secured and KerberizedHDFS
access (injection of Hadoop
configuration)

Apache Spark Application
Anatomy

Apache Spark Architecture
An Anatomy ofan Application
Spark Application
• Jobs
• Stages
• Tasks

S S
Container
S*
*
*
*
*
*
*
*
JVM
T
*
*
DF/RDD
A Spark Executor

Resilient Distributed Dataset
(RDD)

A Resilient Distributed Dataset
(RDD)
1. Distributed Data Abstraction
Logical Model Across Distributed Storage
S3, Blob or HDFS

2. Resilient & Immutable
RDD RDD RDDT
RDD à Tà RDD -> RDD
T
T = Transformation

3. Compile-time Type-safe
Integer RDD
String or Text RDD
Double or Binary RDD

4. Unstructured/Structured Data: Text (logs, tweets,
articles, social)

5. Lazy
RDD RDD RDDT
RDD à Tà RDD à Tà RDD
T
T = Transformation
A = Action
RDD RDD RDDT A

2 kinds of Actions
collect, count, reduce, take, show..saveAsTextFile, (HDFS, S3, SQL, NoSQL, etc.)

Unified API Foundation for
the Future: SparkSessions,
DataFrame, Dataset, MLlib,
Structured Streaming

Major Themes in Apache Spark 2.x
TungstenPhase 2
speedupsof 5-10x
& Catalyst Optimizer
Faster
StructuredStreaming
real-time engine
on SQL / DataFrames
Smarter
Unifying Datasets
and DataFrames &
SparkSessions
Easier

SparkSession – A Unified entry point to
Spark
• Conduit to Spark
– Creates Datasets/DataFrames
– Reads/writes data
– Works with metadata
– Sets/gets Spark
Configuration
– Driver uses for Cluster
resource management

SparkSession vs SparkContext
SparkSessions Subsumes
• SparkContext
• SQLContext
• HiveContext
• StreamingContext
• SparkConf

Long Term
• RDD as the low-level API in Spark
• For control and certain type-safety in Java/Scala
• Datasets & DataFrames give richer semantics &
optimizations
• For semi-structured data and DSL like operations
• New libraries will increasingly use these as interchange
format
• Examples: Structured Streaming, MLlib, GraphFrames,
and Deep Learning Pipelines

DataFrames/Dataset, Spark
SQL & Catalyst Optimizer

The not so secret truth…
SQL
is not about SQL
is about more thanSQL

10
Is About Creating and Running Spark Programs
Faster:
•  Write less code
•  Read less data
•  Do less work
• optimizerdoes the hard work
Spark SQL: The wholestory

Spark SQL Architecture
Logical
Plan
Physical
Plan
Catalog
Optimizer
RDDs
…
Data
Source
API
SQL DataFrames
Code
Generator
Datasets

59
Using Catalyst in Spark SQL
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
RDDs
Selected
Physical Plan
Analysis
Logical
Optimization
Physical
Planning
CostModel
Physical
Plans
Code
Generation
Catalog
Analysis: analyzinga logicalplan to resolve references
Logical Optimization: logicalplan optimization
Physical Planning: Physical planning
Code Generation:Compileparts of the query to Java bytecode
SQL AST
DataFrame
Datasets

LOGICAL OPTIMIZATIONS PHYSICAL OPTIMIZATIONS
Catalyst Optimizations
• Catalyst compiles operations into
physical plan for execution and
generates JVM byte code
• Intelligently choose between
broadcast joins and shuffle joins to
reduce network traffic
• Lower level optimizations:
eliminate expensive object
allocations and reduce virtual
functions calls
• Push filter predicate down to data
source, so irrelevant data can be
skipped
• Parquet: skip entire blocks, turn
comparisons into cheaper integer
comparisons via dictionary coding
• RDMS: reduce amount of data traffic
by pushing down predicates

PhysicalPlan
with Predicate Pushdown
and Column Pruning
join
optimized
scan
(events)
optimized
scan
(users)
LogicalPlan
filter
join
PhysicalPlan
join
scan
(users)events file userstable
61
scan
(events)
filter
users.join(events, users("id") === events("uid")) .
filter(events("date") > "2015-01-01")
DataFrame Optimization

Columns: Predicate pushdown
spark.read
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "people")
.load()
.where($"name" === "michael")
62
You Write
Spark Translates
For Postgres SELECT * FROM people WHERE name = 'michael'

Columns: Predicate pushdown
SELECT firstName, LastName, SSN, COO, title FROM
people where firstName = ‘jules’ and COO = ‘tz’;
63
You Write
SparkWill Push it down
To Postgres or Parquet SELECT <items, item, …items > FROM people WHERE <condition>

43
Spark Core (RDD)
Catalyst & Tungsten
DataFrame/DatasetSQL
MLPipelines Structured
Streaming
{ JSON }
JDBC
andmore…
FoundationalSpark2.x Components
Spark SQL
GraphFrames DL Pipelines
TensorFrames

DataFrames & Datasets
Spark 2.x APIs

Background: What is in an RDD?
•Dependencies
• Partitions (with optional localityinfo)
• Compute function: Partition =>Iterator[T]
Opaque Computation
& Opaque Data

Structured APIs In Spark
68
SQL DataFrames Datasets
Syntax
Errors
Analysis
Errors
Runtime Compile
Time
Runtime
Compile
Time
Compile
Time
Runtime
Analysis errors are reported before a distributed job starts

Unification of APIs in Spark 2.0

DataFrame API code.
// convert RDD -> DF with column names
val df = parsedRDD.toDF("project", "page", "numRequests")
//filter, groupBy, sum, and then agg()
df.filter($"project" === "en").
groupBy($"page").
agg(sum($"numRequests").as("count")).
limit(100).
show(100)
project page numRequests
en 23 45
en 24 200

Take DataFrame à SQL Table à Query
df. createOrReplaceTempView(("edits")
val results = spark.sql("""SELECT page, sum(numRequests)
AS count FROM edits WHERE project = 'en' GROUP BY page
LIMIT 100""")
results.show(100)
project page numRequests
en 23 45
en 24 200

Easy to write code... Believe it!
from pyspark.sql.functions import avg
dataRDD = sc.parallelize([("Jim", 20), ("Anne", 31), ("Jim", 30)])
dataDF = dataRDD.toDF(["name", "age"])
# Using RDD code to compute aggregate average
(dataRDD.map(lambda (x,y): (x, (y,1))) .reduceByKey(lambda x,y: (x[0] +y[0], x[1]
+y[1])) .map(lambda (x, (y, z)): (x, y / z)))
# Using DataFrame
dataDF.groupBy("name").agg(avg("age"))
name age
Jim 20
Ann 31
Jim 30

Why structure APIs?
data.map { case (dept, age) => dept -> (age, 1) }
.reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2)}
.map { case (dept, (age, c)) => dept -> age / c }
select dept, avg(age) from data group by 1
SQL
DataFrame
RDD
data.groupBy("dept").avg("age")

Type-safe:operate
on domain objects
with compiled
lambda functions
8
Dataset API in Spark 2.x
v a l d f = s p a r k .r e ad.j s on( "pe opl e.js on ")
/ / Convert data to domain o b j e c ts .
case c l a s s Person(name: S tr i n g , age: I n t )
v a l d s : Dataset[Person] = d f.a s [P e r s on ]
v a l fi l te r D S = d s . f i l t e r ( p = > p . a g e > 30)

Datasets: Lightning-fast Serialization with Encoders

DataFrames are Faster than RDDs

Why When
DataFrames & Datasets
• StructuredData schema
• Code optimization & performance
• Space efficiency with Tungsten
• High-level APIs and DSL
• StrongType-safety
• Ease-of-use & Readability
• What-to-do

BLOG: http://dbricks.co/3-apis
Spark Summit Talk: http://dbricks.co/summit-3aps

Databricks Developer
Certification for Apache
Spark 2.x

83
Why: Build Your Skills - Certification
● The industry standard for Apache Spark certification from
original creators at Databricks
○ Validate your overall knowledge on Apache Spark
○ Assure clients that you are up-to-date with the fast
moving Apache Spark project with features in new
releases

84
What: Build Your Skills - Certification
● Databricks Certification Exam
○ The test is approximately 3 hours and is proctored
either online or at a test center
○ Series of randomly generated multiple choice
questions
○ Test fee is $300
○ Two editions: Scala & Python
○ Can take it twice

85
How To Prepare for Certification
• Knowledge of Apace Spark Basics
• Structured Streaming, Spark Architecture, MLlib,
Performance & Debugging, Spark SQL, GraphFrames,
Programming Languages (offered Python or Scala)
• Experience Developing Spark apps in production
• Courses:
• Databricks Apache Spark Programing 105 & 110
• Getting Started with Apache Spark SQL
• 7 Steps for a Developer to Learn Apache Spark
• Spark: The Definitive Guide

86
Where To Sign for Certification
REGISTER: Databricks Certified Developer: Apache Spark 2.X
LOGISTICS: How to Take the Exam

https://dbricks.co/developer-cert

Resources
• Getting Started Guide with Apache Spark on
Databricks
• docs.databricks.com
• Spark Programming Guide
• Structured Streaming Programming Guide
• Databricks Engineering Blogs
• spark-packages.org

https://databricks.com/company/careers

Do you have any questions for my preparedanswers?

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3

Similar a Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3 (20)

Más de Databricks

Más de Databricks (20)

Último

Último (20)

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3