© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Spark + AI Summit Europe, London, Oct 4, 2018
Pitfalls of Apache Spark at Scale
Cesar Delgado @hpcfarmer
DB Tsai @dbtsai
Apple Siri Open Source Team
• We're Spark, Hadoop, and HBase PMC members / committers / contributors
• We're advocates for open source
• Pushing our internal changes back upstream
• Working with the communities to review pull requests and develop new features and bug fixes
Apple Siri
The world’s largest virtual assistant service powering every
iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod
Apple Siri Data
• Machine learning is used to personalize your experience
throughout your day
• We believe privacy is a fundamental human right
Apple Siri Scale
• Large volume of requests, served from data centers all over the world
• Hadoop / YARN cluster has thousands of nodes
• HDFS holds hundreds of PB
• 100s of TB of raw event data per day
Siri Data Pipeline
• Downstream data consumers were running the same expensive queries with expensive joins
• Different teams had their own repos and built their own jars, making data lineage hard to track
• Raw client request data is tricky to process, as it requires a deep understanding of the business logic
Unified Pipeline
• Single repo for Spark applications across Siri
• Shared business-logic code to avoid discrepancies
• Raw data is cleaned, joined, and transformed into one standardized data model for data consumers to query
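A minimal sketch of what such a unified transform might look like. The names (`RawEvent`, `StandardEvent`, `cleanAndTransform`, the paths) are hypothetical stand-ins; the real models have far more fields.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical raw and standardized models; the real schemas are much larger.
case class RawEvent(userId: Long, device: String, payload: String, utcDay: String)
case class StandardEvent(userId: Long, device: String, counts: Long, utcDay: String)

object UnifiedPipeline {
  // Shared business logic lives in one place, so every downstream
  // consumer sees the same cleaning and transformation semantics.
  def cleanAndTransform(raw: Dataset[RawEvent]): Dataset[StandardEvent] = {
    import raw.sparkSession.implicits._
    raw.filter(_.device.nonEmpty)
      .map(e => StandardEvent(e.userId, e.device, e.payload.length.toLong, e.utcDay))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("unified-pipeline").getOrCreate()
    import spark.implicits._
    val raw = spark.read.parquet("/data/raw/events").as[RawEvent]
    cleanAndTransform(raw)
      .write
      .partitionBy("utcDay") // stored in Parquet, partitioned by UTC day
      .parquet("/data/standardized/events")
  }
}
```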
Technical Details about Strongly Typed Data
• The schema of the data is checked in as a case class, and CI ensures schema changes won't break the jobs
• Deeply nested relational data model with 5 top-level columns
• Around 2k fields in total
• Stored in Parquet format, partitioned by UTC day
• Data consumers query subsets of the data
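One way such a CI check might be wired up (a sketch, not the deck's actual tooling): derive the Spark schema directly from the checked-in case class via `Encoders.product` and compare it against a baseline JSON file committed to the repo. `Contact`/`FullName` follow the deck's later example; the baseline path is an assumption.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{DataType, StructType}

case class FullName(first: String, middle: String, last: String)
case class Contact(id: Int, name: FullName, address: String)

object SchemaCheck {
  def main(args: Array[String]): Unit = {
    // Derive the Spark schema straight from the case class.
    val current: StructType = Encoders.product[Contact].schema

    // Baseline schema committed to the repo (hypothetical path);
    // an incompatible schema change fails CI when the two diverge.
    val baselineJson = scala.io.Source.fromFile("schemas/contact.json").mkString
    val baseline = DataType.fromJson(baselineJson).asInstanceOf[StructType]

    require(current == baseline, s"Schema drift detected:\n$current\nvs\n$baseline")
  }
}
```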
Review of APIs of Spark

DataFrame: relational untyped APIs introduced in Spark 1.3. From Spark 2.0,

type DataFrame = Dataset[Row]

Dataset: supports all the untyped APIs in DataFrame + typed functional APIs
Review of DataFrame
val df: DataFrame = Seq(
(1, "iPhone", 5),
(2, "MacBook", 4),
(3, "iPhone", 3),
(4, "iPad", 2),
(5, "appleWatch", 1)
).toDF("userId", "device", "counts")
df.printSchema()
"""
|root
||-- userId: integer (nullable = false)
||-- device: string (nullable = true)
||-- counts: integer (nullable = false)
|
""".stripMargin
Review of DataFrame
df.withColumn("counts", $"counts" + 1)
.filter($"device" === "iPhone").show()
"""
|+------+------+------+
||userId|device|counts|
|+------+------+------+
|| 1|iPhone| 6|
|| 3|iPhone| 4|
|+------+------+------+
""".stripMargin

// $"counts" == df("counts")
// via implicit conversion
Execution Plan - Dataframe Untyped APIs
df.withColumn("counts", $"counts" + 1).filter($"device" === "iPhone").explain(true)
"""
|== Parsed Logical Plan ==
|'Filter ('device = iPhone)
|+- Project [userId#3, device#4, (counts#5 + 1) AS counts#10]
| +- Relation[userId#3,device#4,counts#5] parquet
|
|== Physical Plan ==
|*Project [userId#3, device#4, (counts#5 + 1) AS counts#10]
|+- *Filter (isnotnull(device#4) && (device#4 = iPhone))
| +- *FileScan parquet [userId#3,device#4,counts#5]
| Batched: true, Format: Parquet,
| PartitionFilters: [],
| PushedFilters: [IsNotNull(device), EqualTo(device,iPhone)],
| ReadSchema: struct<userId:int,device:string,counts:int>
|
""".stripMargin
Review of Dataset
case class ErrorEvent(userId: Long, device: String, counts: Long)
val ds = df.as[ErrorEvent]
ds.map(row => ErrorEvent(row.userId, row.device, row.counts + 1))
.filter(row => row.device == "iPhone").show()
"""
|+------+------+------+
||userId|device|counts|
|+------+------+------+
|| 1|iPhone| 6|
|| 3|iPhone| 4|
|+------+------+------+
“"".stripMargin
// It’s really easy to put existing Java / Scala code here.
Execution Plan - Dataset Typed APIs
ds.map { row => ErrorEvent(row.userId, row.device, row.counts + 1) }.filter { row =>
row.device == "iPhone"
}.explain(true)
"""
|== Physical Plan ==
|*SerializeFromObject [
| assertnotnull(input[0, com.apple.ErrorEvent, true]).userId AS userId#27L,
| assertnotnull(input[0, com.apple.ErrorEvent, true]).device AS device#28,
| assertnotnull(input[0, com.apple.ErrorEvent, true]).counts AS counts#29L]
|+- *Filter <function1>.apply
| +- *MapElements <function1>, obj#26: com.apple.ErrorEvent
|+- *DeserializeToObject newInstance(class com.apple.ErrorEvent), obj#25:
| com.apple.ErrorEvent
| +- *FileScan parquet [userId#3,device#4,counts#5]
| Batched: true, Format: Parquet,
| PartitionFilters: [], PushedFilters: [],
| ReadSchema: struct<userId:int,device:string,counts:int>
""".stripMargin
Strongly Typed Pipeline
• Typed Datasets are used to guarantee schema consistency
• Enables Java / Scala interoperability between systems
• Increases data scientist productivity
Drawbacks of Strongly Typed Pipeline
• Datasets are slower than DataFrames: https://tinyurl.com/dataset-vs-dataframe
• In a Dataset, many POJOs are created for each row, resulting in high GC pressure
• Data consumers typically query subsets of the data, but schema pruning and predicate pushdown do not work well on nested fields
In Spark 2.3.1
case class FullName(first: String, middle: String, last: String)
case class Contact(id: Int,
name: FullName,
address: String)
sql("select name.first from contacts").where("name.first = 'Jane'").explain(true)
"""
|== Physical Plan ==
|*(1) Project [name#10.first AS first#23]
|+- *(1) Filter (isnotnull(name#10) && (name#10.first = Jane))
| +- *(1) FileScan parquet [name#10] Batched: false, Format: Parquet,
PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema:
struct<name:struct<first:string,middle:string,last:string>>
|
""".stripMargin
In Spark 2.4 with Schema Pruning
"""
|== Physical Plan ==
|*(1) Project [name#10.first AS first#23]
|+- *(1) Filter (isnotnull(name#10) && (name#10.first = Jane))
| +- *(1) FileScan parquet [name#10] Batched: false, Format: Parquet,
PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema:
struct<name:struct<first:string>>
|
""".stripMargin
• [SPARK-4502], [SPARK-25363] Parquet nested column pruning
In Spark 2.4 with Schema Pruning + Predicate Pushdown
"""
|== Physical Plan ==
|*(1) Project [name#10.first AS first#23]
|+- *(1) Filter (isnotnull(name#10) && (name#10.first = Jane))
| +- *(1) FileScan parquet [name#10] Batched: false, Format: Parquet,
PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name.first,Jane)],
ReadSchema: struct<name:struct<first:string>>
|
""".stripMargin
• [SPARK-4502], [SPARK-25363] Parquet nested column pruning
• [SPARK-17636] Parquet nested Predicate Pushdown
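Note that in stock Spark 2.4 the nested-column pruning from SPARK-25363 ships behind a feature flag and is off by default (the talk's numbers come from a build that also carries SPARK-17636). A minimal way to try the pruning, assuming a `spark` session and the `contacts` table from the earlier slides:

```scala
// Nested schema pruning is disabled by default in Spark 2.4;
// enable it before running the query to get the pruned ReadSchema.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

spark.sql("select name.first from contacts")
  .where("name.first = 'Jane'")
  .explain(true)
```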
Production Query - Finding a Needle in a Haystack
Spark 2.3.1
Spark 2.4 with [SPARK-4502], [SPARK-25363], and [SPARK-17636]
• 21x faster in wall-clock time
• 8x less data read
• Savings on electricity bills in many data centers
Future Work
• Spark Streaming can be used to aggregate request data at the edge first, reducing data movement
• Enhance Dataset performance by analyzing JVM bytecode and turning closures into Catalyst expressions
• Build a table format on top of Parquet using Spark's Datasource V2 API that tracks individual data files with richer metadata, to manage versioning and enhance query performance
Conclusions
With some work, engineering rigor, and some optimizations, Spark can run at very large scale at lightning speed
Thank you

Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 

Pitfalls of Apache Spark at Scale with Cesar Delgado and DB Tsai

  • 7. Unified Pipeline
• Single repo for the Spark applications across Siri
• Shared business logic code to avoid any discrepancy
• Raw data is cleaned, joined, and transformed into one standardized data model for data consumers to query
  • 8. Technical Details about Strongly Typed Data
• Schema of the data is checked in as case classes, and CI ensures schema changes won't break the jobs
• Deeply nested relational model data with 5 top-level columns
• Around 2k fields in total
• Stored in Parquet format, partitioned by UTC day
• Data consumers query on a subset of the data
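The checked-in-schema idea above can be sketched with nested case classes. This is a heavily simplified, hypothetical stand-in (the names, fields, and path below are illustrative, not Siri's actual model):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Nested case classes mirror the deeply nested relational model; a CI
// job can compile all Spark jobs against them so a breaking schema
// change fails the build instead of a production run.
case class DeviceInfo(model: String, osVersion: String)
case class Event(eventId: Long, utcDay: String, device: DeviceInfo)

object ReadEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-events").getOrCreate()
    import spark.implicits._

    // Because the data is partitioned by UTC day, the partition filter
    // prunes whole directories before any Parquet file is opened.
    val events: Dataset[Event] = spark.read
      .parquet("/data/events")             // hypothetical path
      .where($"utcDay" === "2018-10-04")
      .as[Event]                           // analysis-time schema check

    events.show()
  }
}
```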
  • 9. Review of Spark APIs
• DataFrame: relational untyped APIs introduced in Spark 1.3. Since Spark 2.0,
    type DataFrame = Dataset[Row]
• Dataset: supports all the untyped APIs of DataFrame, plus typed functional APIs
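Because `DataFrame` is only a type alias, moving between the two API styles is a one-line conversion. A minimal sketch (assuming a `SparkSession` named `spark` is in scope):

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

// Untyped: columns are resolved by name at analysis time.
val df: DataFrame = Seq((1, "iPhone"), (2, "iPad")).toDF("userId", "device")

// Typed: the same data viewed as a Dataset of tuples; field types are
// now checked by the Scala compiler.
val ds: Dataset[(Int, String)] = df.as[(Int, String)]

// Both views share the same underlying plan, so mixing them is cheap.
ds.filter(_._2 == "iPhone").toDF().show()
```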
  • 10. Review of DataFrame
    val df: DataFrame = Seq(
      (1, "iPhone", 5),
      (2, "MacBook", 4),
      (3, "iPhone", 3),
      (4, "iPad", 2),
      (5, "appleWatch", 1)
    ).toDF("userId", "device", "counts")

    df.printSchema()
    """
      |root
      | |-- userId: integer (nullable = false)
      | |-- device: string (nullable = true)
      | |-- counts: integer (nullable = false)
      |""".stripMargin
  • 11. Review of DataFrame
    df.withColumn("counts", $"counts" + 1)
      .filter($"device" === "iPhone").show()
    """
      |+------+------+------+
      ||userId|device|counts|
      |+------+------+------+
      ||     1|iPhone|     6|
      ||     3|iPhone|     4|
      |+------+------+------+
      |""".stripMargin

    // $"counts" == df("counts") // via implicit conversion
  • 12. Execution Plan - DataFrame Untyped APIs
    df.withColumn("counts", $"counts" + 1).filter($"device" === "iPhone").explain(true)
    """
      |== Parsed Logical Plan ==
      |'Filter ('device = iPhone)
      |+- Project [userId#3, device#4, (counts#5 + 1) AS counts#10]
      |   +- Relation[userId#3,device#4,counts#5] parquet
      |
      |== Physical Plan ==
      |*Project [userId#3, device#4, (counts#5 + 1) AS counts#10]
      |+- *Filter (isnotnull(device#4) && (device#4 = iPhone))
      |   +- *FileScan parquet [userId#3,device#4,counts#5]
      |        Batched: true, Format: Parquet,
      |        PartitionFilters: [],
      |        PushedFilters: [IsNotNull(device), EqualTo(device,iPhone)],
      |        ReadSchema: struct<userId:int,device:string,counts:int>
      |""".stripMargin
  • 13. Review of Dataset
    case class ErrorEvent(userId: Long, device: String, counts: Long)

    val ds = df.as[ErrorEvent]
    ds.map(row => ErrorEvent(row.userId, row.device, row.counts + 1))
      .filter(row => row.device == "iPhone").show()
    """
      |+------+------+------+
      ||userId|device|counts|
      |+------+------+------+
      ||     1|iPhone|     6|
      ||     3|iPhone|     4|
      |+------+------+------+
      |""".stripMargin

    // It's really easy to put existing Java / Scala code here.
  • 14. Execution Plan - Dataset Typed APIs
    ds.map { row =>
      ErrorEvent(row.userId, row.device, row.counts + 1)
    }.filter { row => row.device == "iPhone" }.explain(true)
    """
      |== Physical Plan ==
      |*SerializeFromObject [
      |  assertnotnull(input[0, com.apple.ErrorEvent, true]).userId AS userId#27L,
      |  assertnotnull(input[0, com.apple.ErrorEvent, true]).device, true) AS device#28,
      |  assertnotnull(input[0, com.apple.ErrorEvent, true]).counts AS counts#29L]
      |+- *Filter <function1>.apply
      |   +- *MapElements <function1>, obj#26: com.apple.ErrorEvent
      |      +- *DeserializeToObject newInstance(class com.apple.ErrorEvent), obj#25:
      |           com.apple.ErrorEvent
      |         +- *FileScan parquet [userId#3,device#4,counts#5]
      |              Batched: true, Format: Parquet,
      |              PartitionFilters: [], PushedFilters: [],
      |              ReadSchema: struct<userId:int,device:string,counts:int>
      |""".stripMargin
  • 15. Strongly Typed Pipeline
• Typed Datasets are used to guarantee schema consistency
• Enables Java/Scala interoperability between systems
• Increases Data Scientist productivity
  • 16. Drawbacks of Strongly Typed Pipeline
• Datasets are slower than DataFrames: https://tinyurl.com/dataset-vs-dataframe
• In Datasets, many POJOs are created for each row, resulting in high GC pressure
• Data consumers typically query on subsets of the data, but schema pruning and predicate pushdown do not work well with nested fields
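One common mitigation for the typed-API cost (a general pattern, not something the talk claims to use) is to keep filters and projections in the untyped API, where Catalyst can push them into the Parquet scan, and switch to typed objects only afterwards:

```scala
// Untyped first: this filter becomes a pushed-down Parquet predicate and
// the select prunes columns, so far fewer rows reach the typed stage and
// far fewer POJOs are ever allocated.
val result = df
  .filter($"device" === "iPhone")
  .select($"userId", $"device", $"counts")
  .as[ErrorEvent]                          // typed only after pruning
  .map(e => e.copy(counts = e.counts + 1)) // object code where it's needed
```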
  • 17. In Spark 2.3.1
    case class FullName(first: String, middle: String, last: String)
    case class Contact(id: Int, name: FullName, address: String)

    sql("select name.first from contacts").where("name.first = 'Jane'").explain(true)
    """
      |== Physical Plan ==
      |*(1) Project [name#10.first AS first#23]
      |+- *(1) Filter (isnotnull(name#10) && (name#10.first = Jane))
      |   +- *(1) FileScan parquet [name#10] Batched: false, Format: Parquet,
      |        PartitionFilters: [], PushedFilters: [IsNotNull(name)],
      |        ReadSchema: struct<name:struct<first:string,middle:string,last:string>>
      |""".stripMargin
  • 18. In Spark 2.4 with Schema Pruning
    """
      |== Physical Plan ==
      |*(1) Project [name#10.first AS first#23]
      |+- *(1) Filter (isnotnull(name#10) && (name#10.first = Jane))
      |   +- *(1) FileScan parquet [name#10] Batched: false, Format: Parquet,
      |        PartitionFilters: [], PushedFilters: [IsNotNull(name)],
      |        ReadSchema: struct<name:struct<first:string>>
      |""".stripMargin
• [SPARK-4502], [SPARK-25363] Parquet nested column pruning
  • 19. In Spark 2.4 with Schema Pruning + Predicate Pushdown
    """
      |== Physical Plan ==
      |*(1) Project [name#10.first AS first#23]
      |+- *(1) Filter (isnotnull(name#10) && (name#10.first = Jane))
      |   +- *(1) FileScan parquet [name#10] Batched: false, Format: Parquet,
      |        PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name.first,Jane)],
      |        ReadSchema: struct<name:struct<first:string>>
      |""".stripMargin
• [SPARK-4502], [SPARK-25363] Parquet nested column pruning
• [SPARK-17636] Parquet nested predicate pushdown
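In Spark 2.4, the nested-column pruning shown above sits behind an internal flag that defaults to off. The configuration key below is believed correct for Spark 2.4, but verify it against your Spark version before relying on it:

```scala
// Enable Parquet nested schema pruning ([SPARK-4502], [SPARK-25363]).
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// Re-running the query should now show the pruned
// ReadSchema: struct<name:struct<first:string>> in the plan.
spark.sql("select name.first from contacts where name.first = 'Jane'").explain(true)
```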
  • 20. Production Query - Finding a Needle in a Haystack
  • 21. Spark 2.3.1
  • 22. Spark 2.4 with [SPARK-4502], [SPARK-25363], and [SPARK-17636]
  • 23.
• 21x faster in wall-clock time
• 8x less data read
• Saving on electric bills in many data centers
  • 24. Future Work
• Use Spark Streaming to aggregate the request data at the edge first, reducing data movement
• Enhance Dataset performance by analyzing JVM bytecode and turning closures into Catalyst expressions
• Build a table format on top of Parquet using Spark's DataSource V2 API that tracks individual data files with richer metadata, to manage versioning and enhance query performance
  • 25. Conclusions
With engineering rigor and some optimizations, Spark can run at very large scale at lightning speed.
  • 26. Thank you