SlideShare una empresa de Scribd logo
1 de 33
Descargar para leer sin conexión
Spark DataFrames:
Simple and Fast Analytics
on Structured Data
Michael Armbrust
Spark Summit 2015 - June, 15th
Graduated
from Alpha
in 1.3
• Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014)
SQL!About Me and
2
0
50
100
150
200
250
# Of Commits Per Month
0
50
100
150
200
# of Contributors
2
3
SELECT&COUNT(*)&
FROM&hiveTable&
WHERE&hive_udf(data)&&
• Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014)
•  Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
SQL!About Me and
Improved
multi-version
support in 1.4
4
• Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014)
•  Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
•  Connect existing BI tools to Spark through JDBC
SQL!About Me and
• Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014)
•  Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
•  Connect existing BI tools to Spark through JDBC
•  Bindings in Python, Scala, Java, and R
5
SQL!About Me and
• Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014)
•  Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
•  Connect existing BI tools to Spark through JDBC
•  Bindings in Python, Scala, Java, and R
• @michaelarmbrust
•  Lead developer of Spark SQL @databricks
6
SQL!About Me and
The not-so-secret truth...
7
is about more than SQL.
!
SQL!
Spark SQL: The whole story
Creating and Running Spark Programs Faster:
•  Write less code
•  Read less data
•  Let the optimizer do the hard work
8
DataFrame
noun – [dey-tuh-freym]
9
1.  A distributed collection of rows organized into
named columns.
2.  An abstraction for selecting, filtering, aggregating
and plotting structured data (cf. R, Pandas).
3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3).
!
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
df&=&sqlContext.read&&
&&.format("json")&&
&&.option("samplingRatio",&"0.1")&&
&&.load("/home/michael/data.json")&
&
df.write&&
&&.format("parquet")&&
&&.mode("append")&&
&&.partitionBy("year")&&
&&.saveAsTable("fasterData")!
! 10
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
df&=&sqlContext.read&&
&&.format("json")&&
&&.option("samplingRatio",&"0.1")&&
&&.load("/home/michael/data.json")&
&
df.write&&
&&.format("parquet")&&
&&.mode("append")&&
&&.partitionBy("year")&&
&&.saveAsTable("fasterData")!
!
read and write&
functions create
new builders for
doing I/O
11
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
Builder methods
specify:
•  Format
•  Partitioning
•  Handling of
existing data
df&=&sqlContext.read&&
&&.format("json")&&
&&.option("samplingRatio",&"0.1")&&
&&.load("/home/michael/data.json")&
&
df.write&&
&&.format("parquet")&&
&&.mode("append")&&
&&.partitionBy("year")&&
&&.saveAsTable("fasterData")!
! 12
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
load(…), save(…) or
saveAsTable(…)&
finish the I/O
specification
df&=&sqlContext.read&&
&&.format("json")&&
&&.option("samplingRatio",&"0.1")&&
&&.load("/home/michael/data.json")&
&
df.write&&
&&.format("parquet")&&
&&.mode("append")&&
&&.partitionBy("year")&&
&&.saveAsTable("fasterData")!
! 13
Write Less Code: Input & Output
Spark SQL’s Data Source API can read and write DataFrames
using a variety of formats.
14
{ JSON }
Built-In External
JDBC
and more…
Find more sources at http://spark-packages.org/
ETL Using Custom Data Sources
sqlContext.read&
&&.format("com.databricks.spark.jira")&
&&.option("url",&"https://issues.apache.org/jira/rest/api/latest/search")&
&&.option("user",&"marmbrus")&
&&.option("password",&"*******")&
&&.option("query",&"""&
&&&&|project&=&SPARK&AND&&
&&&&|component&=&SQL&AND&&
&&&&|(status&=&Open&OR&status&=&"In&Progress"&OR&status&=&Reopened)""".stripMargin)&
&&.load()&
&&.repartition(1)&
&&.write&
&&.format("parquet")&
&&.saveAsTable("sparkSqlJira")&
15
Write Less Code: High-Level Operations
Solve common problems concisely using DataFrame functions:
•  Selecting columns and filtering
•  Joining different data sources
•  Aggregation (count, sum, average, etc)
•  Plotting results with Pandas
16
Write Less Code: Compute an Average
private&IntWritable&one&=&&
&&new&IntWritable(1)&
private&IntWritable&output&=&
&&new&IntWritable()&
proctected&void&map(&
&&&&LongWritable&key,&
&&&&Text&value,&
&&&&Context&context)&{&
&&String[]&fields&=&value.split("t")&
&&output.set(Integer.parseInt(fields[1]))&
&&context.write(one,&output)&
}&
&
IntWritable&one&=&new&IntWritable(1)&
DoubleWritable&average&=&new&DoubleWritable()&
&
protected&void&reduce(&
&&&&IntWritable&key,&
&&&&Iterable<IntWritable>&values,&
&&&&Context&context)&{&
&&int&sum&=&0&
&&int&count&=&0&
&&for(IntWritable&value&:&values)&{&
&&&&&sum&+=&value.get()&
&&&&&count++&
&&&&}&
&&average.set(sum&/&(double)&count)&
&&context.Write(key,&average)&
}&
data&=&sc.textFile(...).split("t")&
data.map(lambda&x:&(x[0],&[x.[1],&1]))&&
&&&.reduceByKey(lambda&x,&y:&[x[0]&+&y[0],&x[1]&+&y[1]])&&
&&&.map(lambda&x:&[x[0],&x[1][0]&/&x[1][1]])&&
&&&.collect()&
17
Write Less Code: Compute an Average
Using RDDs
&
data&=&sc.textFile(...).split("t")&
data.map(lambda&x:&(x[0],&[int(x[1]),&1]))&&
&&&.reduceByKey(lambda&x,&y:&[x[0]&+&y[0],&x[1]&+&y[1]])&&
&&&.map(lambda&x:&[x[0],&x[1][0]&/&x[1][1]])&&
&&&.collect()&
&
&
&Using DataFrames
&
sqlCtx.table("people")&&
&&&.groupBy("name")&&
&&&.agg("name",&avg("age"))&&
&&&.collect()&&
!
Full API Docs
•  Python
•  Scala
•  Java
•  R
18
Not Just Less Code: Faster Implementations
19
0 2 4 6 8 10
RDD Scala
RDD Python
DataFrame Scala
DataFrame Python
DataFrame R
DataFrame SQL
Time to Aggregate 10 million int pairs (secs)
20
Demo
Combine data from with data from
Running in
•  Hosted Spark in the cloud
•  Notebooks with integrated visualization
•  Scheduled production jobs
https://accounts.cloud.databricks.com/
6/15/2015 demo - Databricks
https://demo.cloud.databricks.com/#notebook/43587 1/5
> 
Command took 2.08s -- by dbadmin at 6/15/2015, 1:17:07 PM on michael (54 GB)
> 
> 
%run /home/michael/ss.2015.demo/spark.sql.lib ...
%sql SELECT * FROM sparkSqlJira
6/15/2015 demo - Databricks
https://demo.cloud.databricks.com/#notebook/43587 2/5
6/15/2015 demo - Databricks
https://demo.cloud.databricks.com/#notebook/43587 3/5
Command took 1.95s -- by dbadmin at 6/15/2015, 1:18:46 PM on michael (54 GB)
> 
rawPRs: org.apache.spark.sql.DataFrame = [commenters: array<struct<data:struct<asked_to_close:boolea
n,avatar:string,body:string,date:array<string>,diff_hunk:string,said_lgtm:boolean,url:string>,usernam
e:string>>, components: array<string>, is_mergeable: boolean, jira_issuetype_icon_url: string, jira_i
ssuetype_name: string, jira_priority_icon_url: string, jira_priority_name: string, last_jenkins_comme
nt: struct<body:string,html_url:string,user:struct<login:string>>, last_jenkins_outcome: string, line
s_added: bigint, lines_changed: bigint, lines_deleted: bigint, number: bigint, parsed_title: struct<j
iras:array<bigint>,metadata:string,title:string>, state: string, updated_at: string, user: string]
Command took 2.01s -- by dbadmin at 6/15/2015, 1:19:08 PM on michael (54 GB)
> 
val rawPRs = sqlContext.read
.format("com.databricks.spark.rest")
.option("url", "https://spark-prs.appspot.com/search-open-prs")
.load()
display(rawPRs)
6/15/2015 demo - Databricks
https://demo.cloud.databricks.com/#notebook/43587 4/5
Command took 0.26s -- by dbadmin at 6/15/2015, 1:19:39 PM on michael (54 GB)
> 
import org.apache.spark.sql.functions._
sparkPRs: org.apache.spark.sql.DataFrame = [component: string, pr_jira: string, title: string, jira_i
ssuetype_icon_url: string, jira_priority_icon_url: string, number: bigint, commenters: array<struct<d
ata:struct<asked_to_close:boolean,avatar:string,body:string,date:array<string>,diff_hunk:string,sai
d_lgtm:boolean,url:string>,username:string>>, user: string, last_jenkins_outcome: string, is_mergeabl
e: boolean]
import org.apache.spark.sql.functions._
val sparkPRs = rawPRs
.select(
  // "Explode" nested array to create one row per item.
  explode($"components").as("component"),
 
  // Use a built-in function to construct the full 'SPARK-XXXX' key
  concat("SPARK-", $"parsed_title.jiras"(0)).as("pr_jira"),
// Other required columns.
  $"parsed_title.title",
$"jira_issuetype_icon_url",
$"jira_priority_icon_url",
$"number",
$"commenters",
$"user",
$"last_jenkins_outcome",
$"is_mergeable")
.where($"component" === "SQL") // Select only SQL PRs
6/15/2015 demo - Databricks
https://demo.cloud.databricks.com/#notebook/43587 5/5
Command took 7.55s -- by dbadmin at 6/15/2015, 1:20:15 PM on michael (54 GB)
> 
✗
✗
✗
✗
table("sparkSqlJira")
.join(sparkPRs, $"key" === $"pr_jira")
.jiraTable
Plan Optimization & Execution
21
SQL AST
DataFrame
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
RDDs
Selected
Physical Plan
Analysis
Logical
Optimization
Physical
Planning
CostModel
Physical
Plans
Code
Generation
Catalog
DataFrames and SQL share the same optimization/execution pipeline
Seamlessly Integrated
Intermix DataFrame operations with
custom Python, Java, R, or Scala code
zipToCity&=&udf(lambda&zipCode:&<custom&logic&here>)&
&
def&add_demographics(events):&
&&&u&=&sqlCtx.table("users")&
&&&events&&
&&&&&.join(u,&events.user_id&==&u.user_id)&&&&&&&
&&&&&.withColumn("city",&zipToCity(df.zip))&
Augments any
DataFrame
that contains
user_id&
22
Optimize Entire Pipelines
Optimization happens as late as possible, therefore
Spark SQL can optimize even across functions.
23
events&=&add_demographics(sqlCtx.load("/data/events",&"json"))&
&&
training_data&=&events&&
&&.where(events.city&==&"San&Francisco")&&
&&.select(events.timestamp)&&
&&.collect()&&
24
def&add_demographics(events):&
&&&u&=&sqlCtx.table("users")&&&&&&&&&&&&&&&&&&&&&#&Load&Hive&table&
&&&events&&
&&&&&.join(u,&events.user_id&==&u.user_id)&&&&&&#&Join&on&user_id&&&&&&
&&&&&.withColumn("city",&zipToCity(df.zip))&&&&&&#&Run&udf&to&add&city&column&
&events&=&add_demographics(sqlCtx.load("/data/events",&"json"))&&
training_data&=&events.where(events.city&==&"San&Francisco").select(events.timestamp).collect()&&
Logical Plan
filter
join
events file users table
expensive
only join
relevent users
Physical Plan
join
scan
(events)
filter
scan
(users)
24
25
def&add_demographics(events):&
&&&u&=&sqlCtx.table("users")&&&&&&&&&&&&&&&&&&&&&#&Load&partitioned&Hive&table&
&&&events&&
&&&&&.join(u,&events.user_id&==&u.user_id)&&&&&&#&Join&on&user_id&&&&&&
&&&&&.withColumn("city",&zipToCity(df.zip))&&&&&&#&Run&udf&to&add&city&column&
&
Optimized Physical Plan
with Predicate Pushdown
and Column Pruning
join
optimized
scan
(events)
optimized
scan
(users)
events&=&add_demographics(sqlCtx.load("/data/events",&"parquet"))&&
training_data&=&events.where(events.city&==&"San&Francisco").select(events.timestamp).collect()&&
Logical Plan
filter
join
events file users table
Physical Plan
join
scan
(events)
filter
scan
(users)
25
Machine Learning Pipelines
26
tokenizer&=&Tokenizer(inputCol="text",!outputCol="words”)&
hashingTF&=&HashingTF(inputCol="words",!outputCol="features”)&
lr&=&LogisticRegression(maxIter=10,&regParam=0.01)&
pipeline&=&Pipeline(stages=[tokenizer,&hashingTF,&lr])&
&
df&=&sqlCtx.load("/path/to/data")!
model&=&pipeline.fit(df)!	
ds0 ds1 ds2 ds3tokenizer hashingTF lr.model
lr
Pipeline Model!
Find out more during Joseph’s Talk: 3pm Today
Project Tungsten: Initial Results
27
0
50
100
150
200
1x 2x 4x 8x 16x
Average GC
time per
node
(seconds)
Data set size (relative)
Default
Code Gen
Tungsten onheap
Tungsten offheap
Find out more during Josh’s Talk: 5pm Tomorrow
Questions?
Spark SQL Office Hours Today
-  Michael Armbrust 1:45-2:30
-  Yin Huai 3:40-4:15
Spark SQL Office Hours Tomorrow
-  Reynold 1:45-2:30

Más contenido relacionado

La actualidad más candente

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Anya Bida
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark Summit
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Databricks
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteSpark Summit
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka Dori Waldman
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabAbhinav Singh
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsDatabricks
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Spark Summit
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingDatabricks
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkDatabricks
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 

La actualidad más candente (20)

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache Spark
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 

Similar a Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summit 2015

Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for TrainingBryan Yang
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...CloudxLab
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesDatabricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkMartin Toshev
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowKristian Alexander
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
WebNet Conference 2012 - Designing complex applications using html5 and knock...
WebNet Conference 2012 - Designing complex applications using html5 and knock...WebNet Conference 2012 - Designing complex applications using html5 and knock...
WebNet Conference 2012 - Designing complex applications using html5 and knock...Fabio Franzini
 
Spark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleSpark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleMateusz Dymczyk
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkDatabricks
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R PackagesCraig Warman
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performanceEngine Yard
 

Similar a Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summit 2015 (20)

Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Php frameworks
Php frameworksPhp frameworks
Php frameworks
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
WebNet Conference 2012 - Designing complex applications using html5 and knock...
WebNet Conference 2012 - Designing complex applications using html5 and knock...WebNet Conference 2012 - Designing complex applications using html5 and knock...
WebNet Conference 2012 - Designing complex applications using html5 and knock...
 
Spark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleSpark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scale
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performance
 

Más de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 

Último (20)

Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 

Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summit 2015

  • 1. Spark DataFrames: Simple and Fast Analytics on Structured Data Michael Armbrust Spark Summit 2015 - June, 15th
  • 2. Graduated from Alpha in 1.3 • Spark SQL •  Part of the core distribution since Spark 1.0 (April 2014) SQL!About Me and 2 0 50 100 150 200 250 # Of Commits Per Month 0 50 100 150 200 # of Contributors 2
  • 3. 3 SELECT&COUNT(*)& FROM&hiveTable& WHERE&hive_udf(data)&& • Spark SQL •  Part of the core distribution since Spark 1.0 (April 2014) •  Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments SQL!About Me and Improved multi-version support in 1.4
  • 4. 4 • Spark SQL •  Part of the core distribution since Spark 1.0 (April 2014) •  Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments •  Connect existing BI tools to Spark through JDBC SQL!About Me and
  • 5. • Spark SQL •  Part of the core distribution since Spark 1.0 (April 2014) •  Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments •  Connect existing BI tools to Spark through JDBC •  Bindings in Python, Scala, Java, and R 5 SQL!About Me and
  • 6. • Spark SQL •  Part of the core distribution since Spark 1.0 (April 2014) •  Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments •  Connect existing BI tools to Spark through JDBC •  Bindings in Python, Scala, Java, and R • @michaelarmbrust •  Lead developer of Spark SQL @databricks 6 SQL!About Me and
  • 7. The not-so-secret truth... 7 is about more than SQL. ! SQL!
  • 8. Spark SQL: The whole story Creating and Running Spark Programs Faster: •  Write less code •  Read less data •  Let the optimizer do the hard work 8
  • 9. DataFrame noun – [dey-tuh-freym] 9 1.  A distributed collection of rows organized into named columns. 2.  An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas). 3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3). !
  • 10. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: df&=&sqlContext.read&& &&.format("json")&& &&.option("samplingRatio",&"0.1")&& &&.load("/home/michael/data.json")& & df.write&& &&.format("parquet")&& &&.mode("append")&& &&.partitionBy("year")&& &&.saveAsTable("fasterData")! ! 10
  • 11. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: df&=&sqlContext.read&& &&.format("json")&& &&.option("samplingRatio",&"0.1")&& &&.load("/home/michael/data.json")& & df.write&& &&.format("parquet")&& &&.mode("append")&& &&.partitionBy("year")&& &&.saveAsTable("fasterData")! ! read and write& functions create new builders for doing I/O 11
  • 12. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: Builder methods specify: •  Format •  Partitioning •  Handling of existing data df&=&sqlContext.read&& &&.format("json")&& &&.option("samplingRatio",&"0.1")&& &&.load("/home/michael/data.json")& & df.write&& &&.format("parquet")&& &&.mode("append")&& &&.partitionBy("year")&& &&.saveAsTable("fasterData")! ! 12
  • 13. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: load(…), save(…) or saveAsTable(…)& finish the I/O specification df&=&sqlContext.read&& &&.format("json")&& &&.option("samplingRatio",&"0.1")&& &&.load("/home/michael/data.json")& & df.write&& &&.format("parquet")&& &&.mode("append")&& &&.partitionBy("year")&& &&.saveAsTable("fasterData")! ! 13
  • 14. Write Less Code: Input & Output Spark SQL’s Data Source API can read and write DataFrames using a variety of formats. 14 { JSON } Built-In External JDBC and more… Find more sources at http://spark-packages.org/
  • 15. ETL Using Custom Data Sources sqlContext.read& &&.format("com.databricks.spark.jira")& &&.option("url",&"https://issues.apache.org/jira/rest/api/latest/search")& &&.option("user",&"marmbrus")& &&.option("password",&"*******")& &&.option("query",&"""& &&&&|project&=&SPARK&AND&& &&&&|component&=&SQL&AND&& &&&&|(status&=&Open&OR&status&=&"In&Progress"&OR&status&=&Reopened)""".stripMargin)& &&.load()& &&.repartition(1)& &&.write& &&.format("parquet")& &&.saveAsTable("sparkSqlJira")& 15
  • 16. Write Less Code: High-Level Operations Solve common problems concisely using DataFrame functions: •  Selecting columns and filtering •  Joining different data sources •  Aggregation (count, sum, average, etc) •  Plotting results with Pandas 16
  • 17. Write Less Code: Compute an Average private&IntWritable&one&=&& &&new&IntWritable(1)& private&IntWritable&output&=& &&new&IntWritable()& proctected&void&map(& &&&&LongWritable&key,& &&&&Text&value,& &&&&Context&context)&{& &&String[]&fields&=&value.split("t")& &&output.set(Integer.parseInt(fields[1]))& &&context.write(one,&output)& }& & IntWritable&one&=&new&IntWritable(1)& DoubleWritable&average&=&new&DoubleWritable()& & protected&void&reduce(& &&&&IntWritable&key,& &&&&Iterable<IntWritable>&values,& &&&&Context&context)&{& &&int&sum&=&0& &&int&count&=&0& &&for(IntWritable&value&:&values)&{& &&&&&sum&+=&value.get()& &&&&&count++& &&&&}& &&average.set(sum&/&(double)&count)& &&context.Write(key,&average)& }& data&=&sc.textFile(...).split("t")& data.map(lambda&x:&(x[0],&[x.[1],&1]))&& &&&.reduceByKey(lambda&x,&y:&[x[0]&+&y[0],&x[1]&+&y[1]])&& &&&.map(lambda&x:&[x[0],&x[1][0]&/&x[1][1]])&& &&&.collect()& 17
  • 18. Write Less Code: Compute an Average Using RDDs & data&=&sc.textFile(...).split("t")& data.map(lambda&x:&(x[0],&[int(x[1]),&1]))&& &&&.reduceByKey(lambda&x,&y:&[x[0]&+&y[0],&x[1]&+&y[1]])&& &&&.map(lambda&x:&[x[0],&x[1][0]&/&x[1][1]])&& &&&.collect()& & & &Using DataFrames & sqlCtx.table("people")&& &&&.groupBy("name")&& &&&.agg("name",&avg("age"))&& &&&.collect()&& ! Full API Docs •  Python •  Scala •  Java •  R 18
  • 19. Not Just Less Code: Faster Implementations 19 0 2 4 6 8 10 RDD Scala RDD Python DataFrame Scala DataFrame Python DataFrame R DataFrame SQL Time to Aggregate 10 million int pairs (secs)
  • 20. 20 Demo Combine data from with data from Running in •  Hosted Spark in the cloud •  Notebooks with integrated visualization •  Scheduled production jobs https://accounts.cloud.databricks.com/
  • 21. 6/15/2015 demo - Databricks https://demo.cloud.databricks.com/#notebook/43587 1/5 >  Command took 2.08s -- by dbadmin at 6/15/2015, 1:17:07 PM on michael (54 GB) >  >  %run /home/michael/ss.2015.demo/spark.sql.lib ... %sql SELECT * FROM sparkSqlJira
  • 22. 6/15/2015 demo - Databricks https://demo.cloud.databricks.com/#notebook/43587 2/5
  • 23. 6/15/2015 demo - Databricks https://demo.cloud.databricks.com/#notebook/43587 3/5 Command took 1.95s -- by dbadmin at 6/15/2015, 1:18:46 PM on michael (54 GB) >  rawPRs: org.apache.spark.sql.DataFrame = [commenters: array<struct<data:struct<asked_to_close:boolea n,avatar:string,body:string,date:array<string>,diff_hunk:string,said_lgtm:boolean,url:string>,usernam e:string>>, components: array<string>, is_mergeable: boolean, jira_issuetype_icon_url: string, jira_i ssuetype_name: string, jira_priority_icon_url: string, jira_priority_name: string, last_jenkins_comme nt: struct<body:string,html_url:string,user:struct<login:string>>, last_jenkins_outcome: string, line s_added: bigint, lines_changed: bigint, lines_deleted: bigint, number: bigint, parsed_title: struct<j iras:array<bigint>,metadata:string,title:string>, state: string, updated_at: string, user: string] Command took 2.01s -- by dbadmin at 6/15/2015, 1:19:08 PM on michael (54 GB) >  val rawPRs = sqlContext.read .format("com.databricks.spark.rest") .option("url", "https://spark-prs.appspot.com/search-open-prs") .load() display(rawPRs)
  • 24. 6/15/2015 demo - Databricks https://demo.cloud.databricks.com/#notebook/43587 4/5 Command took 0.26s -- by dbadmin at 6/15/2015, 1:19:39 PM on michael (54 GB) >  import org.apache.spark.sql.functions._ sparkPRs: org.apache.spark.sql.DataFrame = [component: string, pr_jira: string, title: string, jira_i ssuetype_icon_url: string, jira_priority_icon_url: string, number: bigint, commenters: array<struct<d ata:struct<asked_to_close:boolean,avatar:string,body:string,date:array<string>,diff_hunk:string,sai d_lgtm:boolean,url:string>,username:string>>, user: string, last_jenkins_outcome: string, is_mergeabl e: boolean] import org.apache.spark.sql.functions._ val sparkPRs = rawPRs .select(   // "Explode" nested array to create one row per item.   explode($"components").as("component"),     // Use a built-in function to construct the full 'SPARK-XXXX' key   concat("SPARK-", $"parsed_title.jiras"(0)).as("pr_jira"), // Other required columns.   $"parsed_title.title", $"jira_issuetype_icon_url", $"jira_priority_icon_url", $"number", $"commenters", $"user", $"last_jenkins_outcome", $"is_mergeable") .where($"component" === "SQL") // Select only SQL PRs
  • 25. 6/15/2015 demo - Databricks https://demo.cloud.databricks.com/#notebook/43587 5/5 Command took 7.55s -- by dbadmin at 6/15/2015, 1:20:15 PM on michael (54 GB) >  ✗ ✗ ✗ ✗ table("sparkSqlJira") .join(sparkPRs, $"key" === $"pr_jira") .jiraTable
  • 26. Plan Optimization & Execution 21 SQL AST DataFrame Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan Analysis Logical Optimization Physical Planning CostModel Physical Plans Code Generation Catalog DataFrames and SQL share the same optimization/execution pipeline
  • 27. Seamlessly Integrated Intermix DataFrame operations with custom Python, Java, R, or Scala code zipToCity&=&udf(lambda&zipCode:&<custom&logic&here>)& & def&add_demographics(events):& &&&u&=&sqlCtx.table("users")& &&&events&& &&&&&.join(u,&events.user_id&==&u.user_id)&&&&&&& &&&&&.withColumn("city",&zipToCity(df.zip))& Augments any DataFrame that contains user_id& 22
  • 28. Optimize Entire Pipelines Optimization happens as late as possible, therefore Spark SQL can optimize even across functions. 23 events&=&add_demographics(sqlCtx.load("/data/events",&"json"))& && training_data&=&events&& &&.where(events.city&==&"San&Francisco")&& &&.select(events.timestamp)&& &&.collect()&&
  • 30. 25 def&add_demographics(events):& &&&u&=&sqlCtx.table("users")&&&&&&&&&&&&&&&&&&&&&#&Load&partitioned&Hive&table& &&&events&& &&&&&.join(u,&events.user_id&==&u.user_id)&&&&&&#&Join&on&user_id&&&&&& &&&&&.withColumn("city",&zipToCity(df.zip))&&&&&&#&Run&udf&to&add&city&column& & Optimized Physical Plan with Predicate Pushdown and Column Pruning join optimized scan (events) optimized scan (users) events&=&add_demographics(sqlCtx.load("/data/events",&"parquet"))&& training_data&=&events.where(events.city&==&"San&Francisco").select(events.timestamp).collect()&& Logical Plan filter join events file users table Physical Plan join scan (events) filter scan (users) 25
  • 32. Project Tungsten: Initial Results 27 0 50 100 150 200 1x 2x 4x 8x 16x Average GC time per node (seconds) Data set size (relative) Default Code Gen Tungsten onheap Tungsten offheap Find out more during Josh’s Talk: 5pm Tomorrow
  • 33. Questions? Spark SQL Office Hours Today -  Michael Armbrust 1:45-2:30 -  Yin Huai 3:40-4:15 Spark SQL Office Hours Tomorrow -  Reynold 1:45-2:30