Validating
Big Data & ML Pipelines
With Apache Spark & Friends:
knowing when you crash
Melinda Seckington
Some links (slides & etc will be at):
http://bit.ly/2E5qFsC
CatLoversShow
Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
What is going to be covered:
● What validation is & why you should do it for your data pipelines
● How to make simple validation rules & our current limitations
● ML Validation - Guessing if our black box is “correct”
● Cute & scary pictures
○ I promise at least one cat
○ And at least one picture of my scooter club
Andrew
Who I think you wonderful humans are?
● Nice* people
● Like silly pictures
● Possibly familiar with Scala; if you're new, WELCOME!
● Possibly Familiar with one of Spark, BEAM, or a similar system (but also ok if
not)
● Want to make better software
○ (or models, or w/e)
● Or just want to make software good enough to not have to keep your resume
up to date
● Open to the idea that pipeline validation can be explained with a scooter club
that is definitely not a gang.
Tests are not perfect: See Motorcycles/Scooters/...
● Are not property checking
● It’s just multiple choice
● You don’t even need one to ride a scoot!
So why should you validate?
● tl;dr - Your tests probably aren’t perfect
● You want to know when you're aboard the failboat
● Our code will most likely fail at some point
○ Sometimes data sources fail in new & exciting ways (see “Call me Maybe”)
○ That jerk on that other floor changed the meaning of a field :(
○ Our tests won’t catch all of the corner cases that the real world finds
● We should try and minimize the impact
○ Avoid making potentially embarrassing recommendations
○ Save having to be woken up at 3am to do a roll-back
○ Specifying a few simple invariants isn’t all that hard
○ Repeating Holden’s mistakes is still not fun
So why should you test & validate:
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
So why should you test & validate - cont
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
What happens when we don’t
This talk is being recorded so we’ll leave it at:
● Go home after an accident rather than checking on bones
Or with computers:
● Breaking a feature that cost a few million dollars
● Every search result was a coffee shop
● Rabbit (“bunny”) versus rabbit (“queue”) versus rabbit (“health”)
● VA, BoA, etc.
itsbruce
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Let's focus on validation some more:
*Can be used during integration tests to further validate integration results
So how do we validate our jobs?
● The idea is, at some point, you made software which worked.
○ If you didn't, you probably want to run it a few times and manually validate it
● Maybe you manually tested and sampled your results
● Hopefully you did a lot of other checks too
● But we can't do that every time; our pipelines are no longer write-once,
run-once. They are often write-once, run-forever, debug-forever.
How many people have something like this?
val data = ...
val parsed = data.flatMap(x =>
  try {
    Some(parse(x))
  } catch {
    case _: Exception => None // Whatever, it's JSON
  }
)
Lilithis
But we need some data...
val data = ...
data.cache()
val validData = data.filter(isValid)
val badData = data.filter(x => !isValid(x))
if (validData.count() < badData.count()) {
// Ruh Roh! Special business error handling goes here
}
...
Pager photo by Vitachao CC-SA 3
Well that’s less fun :(
● Our optimizer can’t just magically chain everything together anymore
● My flatMap.map.map is fnur :(
● Now I’m blocking on a thing in the driver
Sn.Ho
Counters* to the rescue**!
● Both BEAM & Spark have their own counters
○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc.
○ Visible in the UI; you can also register a listener from the spark-validator project
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc.
○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting
option
● We can _pretend_ we still have nice functional code
*Counters are your friends, but the kind of friends who steal your lunch money
** In a similar way to how regular expressions can solve problems….
Miguel Olaya
So what does that look like?
val parsed = data.flatMap(x => try {
  val result = parse(x)
  happyCounter.add(1)
  Some(result)
} catch {
  case _: Exception =>
    sadCounter.add(1)
    None // Whatever, it's JSON
}
)
// Special business data logic (aka wordcount)
// Much much later* business error logic goes here
Pager photo by Vitachao CC-SA 3
Phoebe Baker
Ok but what about those *s
● Beam counters are implementation dependent
● Spark counters aren’t great for data properties
● etc.
Miguel Olaya
+ You need to understand your domain, like bubbles
General Rules for making Validation rules
● According to a sad survey most people check execution time & record count
● spark-validator is still in early stages but interesting proof of concept
○ I’m going to rewrite it over the holidays as a two-stage job (one to collect metrics in your main
application and a second to validate).
○ I was probably a bit sleep deprived when I wrote it because looking at it… idk
● Sometimes your rules will misfire and you'll need to manually approve a job
● Remember those property tests? Could be Validation rules
● Historical data
● Domain specific solutions
Photo by:
Paul Schadler
Input Schema Validation
● Handling the “wrong” type of cat
● Many many different approaches
○ filter/flatMap stages
○ Working in Scala/Java? .as[T] (see the sketch after this list)
○ Manually specify your schema after doing inference the first time :p
● Unless you're working on mnist.csv there is a good chance your validation is
going to be fuzzy (reject some records, accept others)
● How do we know if we’ve rejected too much?
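A minimal sketch of both approaches, assuming a SparkSession named spark and a hypothetical CatRecord case class / cats.json input:

case class CatRecord(name: String, age: Long, breed: String)

import spark.implicits._
import org.apache.spark.sql.types._

// Let .as[T] enforce the expected columns & types at analysis time
val typed = spark.read.json("cats.json").as[CatRecord]

// Or pin the schema explicitly instead of re-running inference every run
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", LongType, nullable = true),
  StructField("breed", StringType, nullable = true)))
val pinned = spark.read.schema(schema).json("cats.json")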
Bradley Gordon
As a relative rule:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.map { x =>
  if (isValid(x)) ok += 1 else bad += 1
  // Actual parse logic here
}
// An action (e.g. count, save, etc.)
if (bad.value > 0.1 * ok.value) {
  throw new Exception("bad data - do not use results")
  // Optional cleanup
}
// Mark as safe
// Mark as safe
P.S: If you are interested in this check out spark-validator (still early stages).
Found Animals Foundation
Do you do property testing like the cool kids?
● Yes in theory they’re already “tested” but...
● Common function to check accumulator values between validation & tests (sketch below)
● The real world can be fuzzier
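A tiny sketch of that shared helper, with hypothetical names and an arbitrary 10% threshold:

// Used by both the unit tests and the production validation step
def recordCountsOk(ok: Long, bad: Long, maxBadRatio: Double = 0.1): Boolean =
  bad <= maxBadRatio * (ok + bad)

// In a test: assert(recordCountsOk(ok = 990, bad = 10))
// In the pipeline: if (!recordCountsOk(happyCounter.value, sadCounter.value)) { /* halt & page someone */ }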
Counters in BEAM: (1 of 2)
private final Counter matchedWords =
Metrics.counter(FilterTextFn.class, "matchedWords");
private final Counter unmatchedWords =
Metrics.counter(FilterTextFn.class, "unmatchedWords");
// Your special business logic goes here (aka shell out to Fortran or Cobol)
Luke Jones
Counters in BEAM: (2 of 2)
Long matchedWordsValue = metrics.metrics().queryMetrics(
new MetricsFilter.Builder()
.addNameFilter("matchedWords")).counters().next().committed();
Long unmatchedWordsValue = metrics.metrics().queryMetrics(
new MetricsFilter.Builder()
.addNameFilter("unmatchedWords")).counters().next().committed();
assertThat("unmatchWords less than matched words",
unmatchedWordsValue,
lessThan(matchedWordsValue));
Luke Jones
% of data change
● Not just invalid records, if a field’s value changes everywhere it could still be
“valid” but have a different meaning
○ Remember that example about almost recommending illegal content?
● Join and see the number of rows that differ on each side (sketch after this list)
● Expensive operation, but doable if your data changes slowly / at a constant-ish rate
○ Sometimes done as a separate parallel job
● Can also be used on output if applicable
○ You do have a table/file/as applicable to roll back to right?
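A hedged sketch of that comparison, assuming hypothetical previous and current DataFrames with "id" and "value" columns:

// Rows that disappeared, appeared, or changed between the two snapshots
val removed = previous.join(current, Seq("id"), "left_anti").count()
val added   = current.join(previous, Seq("id"), "left_anti").count()
val changed = previous.join(current, Seq("id"))
  .filter(previous("value") =!= current("value")).count()

// Arbitrary threshold - tune it to how fast your data normally drifts
if (removed + added + changed > 0.05 * previous.count()) {
  throw new Exception("More than 5% of rows changed - hold the pipeline")
}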
Validation rules can be a separate stage(s)
● Sometimes data validation runs in parallel in a separate process
● Combined with counters/metrics from your job
● Can then be compared by a separate job that looks at the results and
decides if the pipeline should continue (see the sketch below)
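One possible shape for that follow-on job, with hypothetical metric paths, record type, and thresholds:

import spark.implicits._
import org.apache.spark.sql.functions.avg

case class RunMetrics(runId: String, inputRows: Long, invalidRows: Long, runtimeSec: Long)

// The main job writes its counter values out as a small record; this job reads them back
val history = spark.read.parquet("metrics/history").as[RunMetrics]
val latest  = spark.read.parquet("metrics/latest").as[RunMetrics].head()

val avgRows = history.agg(avg($"inputRows")).as[Double].head()

// Halt downstream stages (e.g. via your workflow tool) if this run looks off
if (latest.inputRows < 0.5 * avgRows || latest.invalidRows > 0.1 * latest.inputRows) {
  throw new Exception(s"Run ${latest.runId} looks suspicious - do not publish results")
}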
What tools exist for scooter validation?
The Scala Rider Packtalk(R)* could be
used to keep in touch with everyone
and if they fall out of the mesh we
can go find them.
*No endorsement implied in any
manner
TFDV: Magic*
● Counters, schema inference, anomaly detection, oh my!
# Compute statistics over a new set of data
new_stats = tfdv.generate_statistics_from_csv(NEW_DATA)
# Compare how new data conforms to the schema
anomalies = tfdv.validate_statistics(new_stats, schema)
# Display anomalies inline
tfdv.display_anomalies(anomalies)
Details:
https://medium.com/tensorflow/introducing-tensorflow-data-validation-data-understanding-validation-and-monitoring-at-scale-d38e3952c2f0
Not just data changes: Software too
● Things change! Yay! Often for the better.
○ Especially with handling edge cases like NA fields
○ Don't expect the results to change - do a side-by-side run + diff (sketch below)
● Have an ML model?
○ Welcome to new params - or old params with different default values.
○ We’ll talk more about that later
● Excellent PyData London talk about how this can impact
ML models
○ Done with sklearn; shows vast differences in CV results from only changing
the version number
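A small sketch of the side-by-side diff, with hypothetical output paths for the old and new versions of the job:

val oldOut = spark.read.parquet("output/old-version")
val newOut = spark.read.parquet("output/new-version")

// Rows produced by one version but not the other
val onlyOld = oldOut.except(newOut).count()
val onlyNew = newOut.except(oldOut).count()
println(s"$onlyOld rows only in the old output, $onlyNew rows only in the new output")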
Francesco
Extra considerations for ML jobs:
● Harder to look at output size and say if it's good
● We can look at the cross-validation performance
● Fixed test set performance
● Number of iterations / convergence rate
● Number of features selected / number of features
changed in selection
● (If applicable) delta in model weights or tree size or ... (sketch below)
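For example, a hedged sketch of the "delta in model weights" check, with hypothetical model paths and an arbitrary threshold:

import org.apache.spark.ml.classification.LogisticRegressionModel

val oldModel = LogisticRegressionModel.load("models/yesterday")
val newModel = LogisticRegressionModel.load("models/today")

// L2 distance between the two coefficient vectors
val delta = math.sqrt(
  oldModel.coefficients.toArray.zip(newModel.coefficients.toArray)
    .map { case (a, b) => (a - b) * (a - b) }.sum)

if (delta > 10.0) {
  println(s"Model drifted by $delta - flag for human review instead of auto-deploying")
}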
Jennifer C.
Cross-validation
because saving a test set is effort
● Trains on X% of the data and tests on Y%
○ Multiple times switching the samples
● org.apache.spark.ml.tuning has the tools for auto fitting
using CV (see the sketch below)
○ If you're going to use this for auto-tuning please please save a test set
○ Otherwise your models will look awesome and perform like a Ford
Pinto (or whatever a crappy car is here. Maybe a Renault Reliant?)
● Not perfect: sometimes the real world will still surprise
○ Do canary & A/B testing when possible as well
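A hedged sketch of that tuning flow, assuming a hypothetical DataFrame data with "features" and "label" columns:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Hold out a real test set *before* any auto-tuning happens
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

val lr = new LogisticRegression()
val grid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val model = cv.fit(train)
// Only the untouched test set tells you how the selected model really performs
val testScore = new BinaryClassificationEvaluator().evaluate(model.transform(test))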
Jonathan Kotta
False sense of security:
● A/B test please even if CV says many many $s
● Rank-based things can have training bias with previous
orders
● Non-displayed options: unlikely to be chosen
● Sometimes can find previous formulaic corrections
● Sometimes we can “experimentally” determine
● Other times we just hope it’s better than nothing
● Try and make sure your ML isn’t evil or re-encoding
human biases but stronger
Some ending notes
● Your validation rules don’t have to be perfect
○ But they should be good enough they alert infrequently
● You should have a way for the human operator to
override.
● Just like tests, try and make your validation rules
specific and actionable
○ "# of input rows changed" is not a great message; "table XYZ grew
unexpectedly to Y%" is much better
● While you can use (some of) your tests as a basis for
your rules, your rules need tests too
○ e.g. add junk records/pure noise and see if it rejects (sketch below)
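A quick sketch of that kind of rule test, reusing the hypothetical isValid from the earlier slides:

// Pure junk should be rejected by the validation rule
val junk = Seq("", "%%%%", "{not json", "<<<garbage>>>")
assert(junk.forall(x => !isValid(x)))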
James Petts
Related talks & blog posts
● Testing Spark Best Practices (Spark Summit 2014)
● Every Day I’m Shuffling (Strata 2015) & slides
● Spark and Spark Streaming Unit Testing
● Making Spark Unit Testing With Spark Testing Base
● Testing strategy for Apache Spark jobs
● The BEAM programming guide
Interested in OSS (especially Spark)?
● Check out my Twitch & Youtube for livestreams - http://twitch.tv/holdenkarau
& https://www.youtube.com/user/holdenkarau
Becky Lai
Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
Available today, not a lot on testing and almost nothing on
validation, but that should not stop you from buying several
copies (if you have an expense account).
Cats love it!
Amazon sells it: http://bit.ly/hkHighPerfSpark :D
Sign up for the mailing list @
http://www.distributedcomputing4kids.com
And some upcoming talks:
● January
○ Data Day Texas
● February
○ Fosdem
● March
○ Strata San Francisco
● April
○ Strata London
● May
○ Code Mania
Sparkling Pink Panda Scooter group photo by Kenzi
k thnx bye! (or questions…)
If you want to fill out a survey:
http://bit.ly/holdenTestingSpark
Give feedback on this presentation
http://bit.ly/holdenTalkFeedback
I’m sadly heading out to
Berlin right after this but
e-mail me:
holden@pigscanfly.ca
Related packages
● spark-testing-base: https://github.com/holdenk/spark-testing-base
● sscheck: https://github.com/juanrh/sscheck
● spark-validator: https://github.com/holdenk/spark-validator *Proof of
concept, do not actually use*
● spark-perf - https://github.com/databricks/spark-perf
● spark-integration-tests - https://github.com/databricks/spark-integration-tests
● scalacheck - https://www.scalacheck.org/
Becky Lai
Traditional theory (Models)
● Human decides it's time to “update their models”
● Human goes through a model update run-book
● Human does other work while their “big-data” job runs
● Human deploys X% new models
● Looks at graphs
● Presses deploy
Andrew
Traditional practice (Models)
● Human is cornered by stakeholders and forced to update models
● Spends a few hours trying to remember where the guide is
● Gives up and kind of wings it
● Comes back to a trained model
● Human deploys X% models
● Human reads reddit/hacker news/etc.
● Presses deploy
Bruno Caimi
New possible practice (sometimes)
● Computer kicks off job (probably at an hour boundary because *shrug*) to
update model
● Workflow tool notices new model is available
● Computer deploys X% models
● Software looks at monitoring graphs, uses statistical test to see if it’s bad
● Robot rolls it back & pager goes off
● Human presses override and deploys anyway
Henrique Pinto
