Albert Franzi (@FranziCros), Alpha Health
Modular Apache Spark:
Transform Your Code into
Pieces
2
Slides available at:
http://bit.ly/SparkAI2019-afranzi
About me
3
2013 · 2014 · 2015 · 2017 · 2018 · 2019
Modular Apache Spark:
Transform Your Code into Pieces
4
5
We will learn:
- How we reduced test execution time by skipping unaffected tests → fewer coffees
- How we simplified our Spark code by modularizing it
- How we increased the test coverage of our Spark code by using the spark-testing-base library provided by Holden Karau
6
We will learn:
- How we reduced test execution time by skipping unaffected tests → fewer coffees
- How we simplified our Spark code by modularizing it
- How we increased the test coverage of our Spark code by using the spark-testing-base library provided by Holden Karau
7
Did you play with duplicated code across your Spark jobs?
@ The Shining 1980
Have you ever experienced the joy of code-reviewing a never-ending stream of Spark chaos?
8
Please! Don’t ever play with duplicated code!
@ The Shining 1980
Fragment the Spark Job
9
Spark Readers · Transformations · Joins · Aliases · Formatters · Spark Writers · Task Context · ...
Readers / Writers
10
Spark Readers
Spark Writers
● Enforce schemas
● Use schemas to read only the fields you are going to use
● Provide a Reader per Dataset & attach its sources to it
● Share schemas & sources between Readers & Writers
● GDPR compliant by design
Readers
11
Spark Readers - GDPR
val userBehaviourSchema: StructType = ???
val userBehaviourPath =
Path("s3://<bucket>/user_behaviour/year=2018/month=10/day=03/hour=12/gen=27/")
val userBehaviourReader = ReaderBuilder(PARQUET)
.withSchema(userBehaviourSchema)
.withPath(userBehaviourPath)
.buildReader()
val df: DataFrame = userBehaviourReader.read()
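The slides leave userBehaviourSchema as ???; as a hedged sketch with hypothetical field names, a minimal schema that declares only the consumed fields could look like this:

import org.apache.spark.sql.types._

// Hypothetical fields: declaring only what the job consumes lets
// Parquet prune every other column (including any PII) at read time.
val userBehaviourSchema: StructType = StructType(Seq(
  StructField("user_id", StringType, nullable = false),
  StructField("event_type", StringType),
  StructField("object_type", StringType),
  StructField("event_time", TimestampType)
))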
Readers
12
Spark Readers
val userBehaviourSchema: StructType = ???
// Path structure - s3://<bucket>/user_behaviour/[year]/[month]/[day]/[hour]/[gen]/
val userBehaviourBasePath = Path("s3://<bucket>/user_behaviour/")
val startDate: ZonedDateTime = ZonedDateTime.now(ZoneOffset.UTC)
val halfDay: Duration = Duration.ofHours(12)
val userBehaviourPaths: Seq[Path] = PathBuilder
.latestGenHourlyPaths(userBehaviourBasePath, startDate, halfDay)
val userBehaviourReader = ReaderBuilder(PARQUET)
.withSchema(userBehaviourSchema)
.withPath(userBehaviourPaths: _*)
.buildReader()
val df: DataFrame = userBehaviourReader.read()
Readers
13
val userBehaviourSchema: StructType = ???
val userBehaviourBasePath = Path("s3://<bucket>/user_behaviour/")
val startDate: ZonedDateTime = ZonedDateTime.now(ZoneOffset.UTC)
val halfDay: Duration = Duration.ofHours(12)
val userBehaviourReader = ReaderBuilder(PARQUET)
.withSchema(userBehaviourSchema)
.withHourlyPathBuilder(userBehaviourBasePath, startDate, halfDay)
.buildReader()
val df: DataFrame = userBehaviourReader.read()
Spark Readers
Readers
14
val df: DataFrame = UserBehaviourReader.read(startDate, halfDay)
Spark Readers
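A plausible sketch of what hides behind that one-liner, reusing the builder pieces from the previous slides (the object body is an assumption, not the author's actual code):

import java.time.{Duration, ZonedDateTime}

// Hypothetical wrapper: the dataset-specific reader owns its schema and
// base path, so consumers only supply the time window they care about.
object UserBehaviourReader {
  private val basePath = Path("s3://<bucket>/user_behaviour/")

  def read(startDate: ZonedDateTime, duration: Duration): DataFrame =
    ReaderBuilder(PARQUET)
      .withSchema(userBehaviourSchema)
      .withHourlyPathBuilder(basePath, startDate, duration)
      .buildReader()
      .read()
}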
Transforms
15
Transformation
def transform[U](t: (Dataset[T]) ⇒ Dataset[U]): Dataset[U]
Dataset[T] ⇒ magic ⇒ Dataset[U]
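A minimal usage sketch of the built-in transform, with a hypothetical cleanup function:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, upper}

// Any DataFrame => DataFrame function plugs straight into .transform.
def upperCasedCity(df: DataFrame): DataFrame =
  df.withColumn("city", upper(col("city")))

val result = df.transform(upperCasedCity)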
Transforms
16
Transformation
def withGreeting(df: DataFrame): DataFrame = {
df.withColumn("greeting", lit("hello world"))
}
def extractFromJson(colName: String,
outputColName: String,
jsonSchema: StructType)(df: DataFrame): DataFrame = {
df.withColumn(outputColName, from_json(col(colName), jsonSchema))
}
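Chaining both through the built-in transform could then look like this (the payload column and its schema are assumptions):

// extractFromJson is curried, so partially applying it
// yields the DataFrame => DataFrame that .transform expects.
val enriched = df
  .transform(withGreeting)
  .transform(extractFromJson("payload", "payload_json", payloadSchema))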
Transforms
17
Transformation
def onlyClassifiedAds(df: DataFrame): DataFrame = {
df.filter(col("event_type") === "View")
.filter(col("object_type") === "ClassifiedAd")
}
def dropDuplicates(df: DataFrame): DataFrame = {
df.dropDuplicates()
}
def cleanedCity(df: DataFrame): DataFrame = {
df.withColumn("city", getCityUdf(col("object.location.address")))
}
val cleanupTransformations: Seq[DataFrame => DataFrame] = Seq(
dropDuplicates,
cleanedCity,
onlyClassifiedAds
)
val df: DataFrame = UserBehaviourReader.read(startDate, halfDay)
val classifiedAdsDF = df.transforms(cleanupTransformations: _*)
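transforms (plural, varargs) is not part of vanilla Spark; a small implicit helper along these lines could provide it (a sketch, the author's actual implementation is not shown):

import org.apache.spark.sql.DataFrame

// Hypothetical helper: folds the given transformations over the DataFrame,
// applying them left to right via the built-in transform.
implicit class DataFrameTransforms(df: DataFrame) {
  def transforms(ts: (DataFrame => DataFrame)*): DataFrame =
    ts.foldLeft(df)((acc, t) => acc.transform(t))
}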
Transforms
18
Transformation
val cleanupTransformations: Seq[DataFrame => DataFrame] = Seq(
dropDuplicates,
cleanedCity,
onlyClassifiedAds
)
val df: DataFrame = UserBehaviourReader.read(startDate, halfDay)
val classifiedAdsDF = df.transforms(cleanupTransformations: _*)
“As a data consumer, I only need to pick which transformations I would like to apply, instead of coding them from scratch.”
“It’s like cooking: engineers provide manufactured ingredients (transformations), and Data Scientists use the required ones for a successful recipe.”
Transforms - Links of Interest
“DataFrame transformations can be defined with arguments so they don’t make
assumptions about the schema of the underlying DataFrame.” - by Matthew Powers.
bit.ly/Spark-ChainingTransformations
bit.ly/Spark-SchemaIndependentTransformations
19
Transformation
github.com/MrPowers/spark-daria
“Spark helper methods to maximize developer productivity.”
20
We will learn:
- How we reduced test execution time by skipping unaffected tests → fewer coffees
- How we simplified our Spark code by modularizing it
- How we increased the test coverage of our Spark code by using the spark-testing-base library provided by Holden Karau
21
Did you put untested Spark jobs into production?
“Mars Climate Orbiter destroyed
because of a Metric System Mixup (1999)”
Testing with Holden Karau
22
github.com/holdenk/spark-testing-base
“Base classes to use when writing tests with Spark.”
● Share Spark Context between tests
● Provide methods to make tests easier
○ Fixture Readers
○ Json to DF converters
○ Extra validators
github.com/MrPowers/spark-fast-tests
“An alternative to spark-testing-base to run tests in parallel without restarting Spark Session
after each test file.”
Testing SharedSparkContext
23
package com.holdenkarau.spark.testing
import java.util.Date
import org.apache.spark._
import org.scalatest.{BeforeAndAfterAll, Suite}
/**
* Shares a local `SparkContext` between all tests in a suite
* and closes it at the end. You can share between suites by enabling
* reuseContextIfPossible.
*/
trait SharedSparkContext extends BeforeAndAfterAll with SparkContextProvider {
self: Suite =>
...
protected implicit def reuseContextIfPossible: Boolean = false
...
}
Testing
24
package com.alpha.data.test
trait SparkSuite extends DataFrameSuiteBase {
self: Suite =>
override def reuseContextIfPossible: Boolean = true
protected def createDF(data: Seq[Row], schema: StructType): DataFrame = {
spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
}
protected def jsonFixtureToDF(fileName: String, schema: Option[StructType] = None): DataFrame = {
val fixtureContent = readFixtureContent(fileName)
val fixtureJson = fixtureContentToJson(fixtureContent)
jsonToDF(fixtureJson, schema)
}
protected def checkSchemas(inputSchema: StructType, expectedSchema: StructType): Unit = {
assert(inputSchema.fields.sortBy(_.name).deep == expectedSchema.fields.sortBy(_.name).deep)
}
...
}
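A hedged sketch of a suite built on top of SparkSuite, reusing onlyClassifiedAds from the Transforms slides (data and names are hypothetical):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.scalatest.FunSuite

class CleanupTransformationsSuite extends FunSuite with SparkSuite {

  test("onlyClassifiedAds keeps only ClassifiedAd view events") {
    val schema = StructType(Seq(
      StructField("event_type", StringType),
      StructField("object_type", StringType)))
    // createDF comes from SparkSuite and reuses the shared Spark context.
    val df = createDF(Seq(
      Row("View", "ClassifiedAd"),
      Row("Click", "Banner")), schema)

    assert(df.transform(onlyClassifiedAds).count() == 1)
  }
}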
Testing
25
Spark Readers · Transformations · Spark Writers · Task Context
Testing each piece independently helps when testing everything together.
Test
26
We will learn:
- How we reduced test execution time by skipping unaffected tests → fewer coffees
- How we simplified our Spark code by modularizing it
- How we increased the test coverage of our Spark code by using the spark-testing-base library provided by Holden Karau
27
Are tests taking too long to execute?
Junit4Git by Raquel Pau
28
github.com/rpau/junit4git
“Junit Extensions for Test Impact Analysis.”
“This is a JUnit extension that ignores those tests that are not related with your last
changes in your Git repository.”
Junit4Git - Gradle conf
29
configurations {
agent
}
dependencies {
testCompile("org.walkmod:scalatest4git_2.11:${version}")
agent "org.walkmod:junit4git-agent:${version}"
}
test.doFirst {
jvmArgs "-javaagent:${configurations.agent.singleFile}"
}
@RunWith(classOf[ScalaGitRunner])
Junit4Git - Travis with Git notes
30
before_install:
- echo -e "machine github.com\nlogin $GITHUB_TOKEN" >> ~/.netrc
after_script:
- git push origin refs/notes/tests:refs/notes/tests
.travis.yml
Junit4Git - Runners
31
import org.scalatest.junit.JUnitRunner
@RunWith(classOf[JUnitRunner])
abstract class UnitSpec extends FunSuite
import org.scalatest.junit.ScalaGitRunner
@RunWith(classOf[ScalaGitRunner])
abstract class IntegrationSpec extends FunSuite
Tests to run always
Tests to run on code changes
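In practice the split could look like this sketch (suite names are hypothetical):

// Always executed: plain JUnitRunner.
class PathBuilderSpec extends UnitSpec {
  test("half a day spans twelve hourly paths") { /* ... */ }
}

// Skipped by ScalaGitRunner when no related class changed in Git.
class UserBehaviourReaderSpec extends IntegrationSpec {
  test("reads user behaviour with the enforced schema") { /* ... */ }
}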
Junit4Git
32
[Sequence diagram: Runner, Listener, Agent, Test Method. On testStarted the listener starts the report; each class loaded during the test is instrumented (loadClass, instrumentClass) and recorded via addTestClass; on testFinished the report is closed.]
Junit4Git - Notes
33
afranzi:~/Projects/data-core$ git notes show
[
{
"test": "com.alpha.data.core.ContextSuite",
"method": "Context Classes with CommonBucketConf should contain a commonBucket Field",
"classes": [
"com.alpha.data.core.Context",
"com.alpha.data.core.ContextSuite$$anonfun$1$$anon$1",
"com.alpha.data.core.Context$",
"com.alpha.data.core.CommonBucketConf$class",
"com.alpha.data.core.Context$$anonfun$getConfigParamWithLogging$1"
]
}
...
]
com.alpha.data.core.ContextSuite > Context Classes with CommonBucketConf should contain a commonBucket Field SKIPPED
com.alpha.data.core.ContextSuite > Context Classes with BucketConf should contain a bucket Field SKIPPED
Junit4Git - Links of Interest
34
bit.ly/SlidesJunit4Git
Real Impact Testing Analysis For JVM
bit.ly/TestImpactAnalysis
The Rise of Test Impact Analysis
Summary
35
Use Transforms as pills
Modularize your code
Don’t duplicate code between Spark Jobs
Don’t be afraid of testing everything
Share all you built as a library, so others can reuse your code.
36