@holdenkarau
Effectively Contributing to
Apache Spark
Spark BCN 2019
I am on the PMC but this represents my own personal views
Where you can find the slides for this talk
http://bit.ly/2Ogr2BP
Who am I?
Holden
● Preferred pronouns: she/her
● Co-author of the Learning Spark & High Performance Spark books
● OSS Big Data Developer advocate @ Google
● Spark PMC & Committer
● Twitter @holdenkarau
● Live stream code & reviews: http://bit.ly/holdenLiveOSS
● http://www.slideshare.net/hkarau
What we are going to explore together!
Getting a change into Apache Spark & the components
involved:
● The current state of the Apache Spark dev community
● Reasons to contribute to Apache Spark
● Different ways to contribute
● Places to find things to contribute
● Tooling around code & doc contributions
Torsten Reuschling
Who I think you wonderful humans are?
● Nice* people
● Don’t mind pictures of cats
● May know some Apache Spark?
● Want to contribute to Apache Spark
Why I’m assuming you might want to contribute:
● Fix your own bugs/problems with Apache Spark
● Learn more about distributed systems (for fun or profit)
● Improve your Scala/Python/R/Java experience
● You <3 functional programming and want to trick more
people into using it
● “Credibility” of some vague type
● You just like hacking on random stuff and Spark seems
shiny
What’s the state of the Spark dev community?
● Really large number of contributors
● Active PMC members & committers are somewhat concentrated
○ Better than we used to be
● Also a lot in the SF Bay Area - but certainly not exclusively
so
gigijin
How can we contribute to Spark?
● Direct code in the Apache Spark code base
● Code in packages built on top of Spark
● Code reviews
● Yak shaving (aka fixing things that Spark uses)
● Documentation improvements & examples
● Books, Talks, and Blogs
● Answering questions (mailing lists, Stack Overflow, etc.)
● Testing & Release Validation
Andrey
Which is right for you?
● Direct code in the Apache Spark code base
○ High visibility, some things can only really be done here
○ Can take a lot longer to get changes in
● Code in packages built on top of Spark
○ Really great for things like formats or standalone features
● Yak shaving (aka fixing things that Spark uses)
○ Super important to do sometimes - can take even longer to get in
romana klee
Which is right for you? (continued)
● Code reviews
○ High visibility to PMC, can be faster to get started, easier to time
box
○ Less tracked in metrics
● Documentation improvements & examples
○ Lots of places to contribute - mixed visibility - large impact
● Advocacy: Books, Talks, and Blogs
○ Can be high visibility
romana klee
Contributing Code Directly to Spark
● Maybe we encountered a bug we want to fix
● Maybe we’ve got a feature we want to add
● Either way we should see if other people are doing it
● And if what we want to do is complex, it might be better
to find something simple to start with
● It’s dangerous to go alone - take this
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Jon Nelson
The different pieces of Spark
● Apache Spark “Core”
● SQL & DataFrames
● Streaming
● Language APIs: Scala, Java, Python, & R
● Graph Tools: Bagel & GraphX
● Spark ML
● MLlib
● Community Packages
● Spark on Yarn
● Spark on Mesos
● Standalone Spark
The different pieces of Spark: 2.0+
● Apache Spark “Core”
● SQL & DataFrames
● Streaming
● Structured Streaming
● Language APIs: Scala, Java, Python, & R
● Graph Tools: Bagel & GraphX
● Spark ML
● MLlib
● Community Packages
The different pieces of Spark: 3+?
● Apache Spark “Core”
● SQL & DataFrames
● Streaming
● Structured Streaming
● Language APIs: Scala, Java, Python, & R
● Graph Tools: Bagel & GraphX
● Spark ML
● MLlib
● Community Packages
● Spark on Yarn
● Spark on Mesos
● Spark on Kubernetes
● Standalone Spark
Choosing a component?
● Core
○ Conservative to external changes, but biggest impact
● ML / MLlib
○ ML is the home of the future - you can improve existing algorithms -
new algorithms face uphill battle
● Structured Streaming
○ Current API is in a lot of flux so it is difficult for external
participation
● SQL
○ Lots of fun stuff - very active - I have limited personal experience
● Python / R
○ Improve coverage of current APIs, structural change hard
● GraphX - Dead, see GraphFrames instead
Rikki's Refuge
Choosing a component? (cont)
● Kubernetes
○ New, lots of active work and reviewers
● YARN
○ Old faithful, always needs a little work. Hadoop 3 support
● Mesos
○ Needs some love, probably easy-ish-path to committer (still hard)
● Standalone
○ Not a lot going on
Rikki's Refuge
Onto JIRA - Issue tracking funtimes
● It’s like Bugzilla or FogBugz
● There is an Apache JIRA for many Apache projects
● You can (and should) sign up for an account
● All changes in Spark (now) require a JIRA
● https://www.youtube.com/watch?v=ca8n9uW3afg
● Check it out at:
○ https://issues.apache.org/jira/browse/SPARK
What can we do with ASF JIRA?
● Search for issues (remember to filter to Spark project)
● Create new issues
○ search first to see if someone else has reported it
● Comment on issues to let people know we are working on it
● Ask people for clarification or help
○ e.g. “Reading this I think you want the null values to be replaced by
a string when processing - is that correct?”
○ @mentions work here too
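Searching with the Spark project filter can also be scripted. The sketch below builds a search link, assuming the usual JIRA issue-navigator URL with a `jql` query parameter; the helper name `jira_search_url` and the example search text are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed issue-navigator endpoint of the ASF JIRA.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/"


def jira_search_url(text, project="SPARK", open_only=True):
    """Build a JIRA search URL filtered to one project (hypothetical helper)."""
    clauses = [f"project = {project}"]
    if open_only:
        clauses.append("resolution = Unresolved")
    # Escape double quotes so the search text stays one JQL string literal.
    escaped = text.replace('"', '\\"')
    clauses.append(f'text ~ "{escaped}"')
    return JIRA_SEARCH + "?" + urlencode({"jql": " AND ".join(clauses)})


print(jira_search_url("null values replaced by string"))
```

Pasting the printed link into a browser is a quick way to check whether someone has already reported the problem before creating a new issue.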
What can’t we do with ASF JIRA?
● Assign issues (to ourselves or other people)
○ In lieu of assigning we can “watch” & comment
● Post long design documents (create a Google Doc & link to
it from the JIRA)
● Tag issues
○ While we can add tags, they often get removed
Finding a good “starter” issue:
● There are explicit starter tags in JIRA we can search for
● But often the starter tag isn’t applied
● Read through and look for simple issues
● Pick something in the same component you eventually want
to work in
○ And/or consider improving the non-Scala language API for the
component(s) you want to work on.
● Look at the reporter and commenters - is there a
committer or someone whose name you recognize?
● Leave a comment that says you are going to start working
on this
Find an issue you want to work on
https://issues.apache.org/jira/browse/SPARK
Also grep for TODO in components you are interested in (e.g.
grep -r TODO ./python/pyspark or grep -R TODO ./core/src)
Look between language APIs and see if anything is missing
that you think is interesting -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
http://spark.apache.org/docs/latest/api/python/index.html
neko kabachi
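The grep for TODO can be turned into a per-file count to see where the leftovers cluster. This is a minimal sketch, not a project tool: `count_markers` is a made-up helper, and the file extensions and the `./python/pyspark` path assume it is run from the root of a Spark checkout.

```python
import os
import re
from collections import Counter

MARKER = re.compile(r"\b(TODO|FIXME)\b")


def count_markers(root, extensions=(".py", ".scala", ".java", ".R")):
    """Count TODO/FIXME lines per file under root, like a recursive grep."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                hits = sum(1 for line in handle if MARKER.search(line))
            if hits:
                counts[path] = hits
    return counts


if __name__ == "__main__":
    # Assumes the current directory is the root of a Spark checkout.
    for path, hits in count_markers("./python/pyspark").most_common(10):
        print(f"{hits:4d}  {path}")
```

Files with many markers are not automatically good starter issues, but they are a reasonable place to start reading.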
Explore things that make sense to revisit
https://issues.apache.org/jira/browse/SPARK
Consider looking for issues which we couldn't fix due to our
compatibility requirements and should revisit for 3+
Maurizio Zanetti
Finding SPIPs:
https://issues.apache.org/jira/browse/SPARK-24374?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22SPIP%22
Large pieces of work
Not the easiest to contribute to, but you can see the design
Warrick Wynne
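The SPIP search link is easier to read once its query string is decoded. A small sketch using only Python's standard library; the URL is copied from the slide, and the variable names are just for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the slide, split only to keep the source lines short.
SPIP_SEARCH = (
    "https://issues.apache.org/jira/browse/SPARK-24374?jql=projec"
    "t%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Pro"
    "gress%22%2C%20Reopened)%20AND%20text%20~%20%22SPIP%22"
)

# parse_qs percent-decodes the value and returns a list per parameter.
jql = parse_qs(urlsplit(SPIP_SEARCH).query)["jql"][0]
print(jql)
# project = SPARK AND status in (Open, "In Progress", Reopened) AND text ~ "SPIP"
```

In other words, the link is a plain text search for "SPIP" over unresolved issues in the Spark project.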
But before we get too far:
● Spark wishes to maintain compatibility between releases
● We're working on 3 though so this is the time to break
things
Meagan Fisher
Getting at the code: yay for GitHub :)
● https://github.com/apache/spark
● Make a fork of it
● Clone it locally
dougwoods
Building Spark
./build/sbt
or
./build/mvn
Working in Python? Make sure to build the package target so
your Python code will run :)
You can quickly verify the build with the Spark shell :)
Kara
What about documentation changes?
● Still use JIRAs to track
● We can’t edit the wiki :(
● But a lot of documentation lives in docs/*.md
Kreg Steppe
Building Spark’s docs
./docs/README.md has a lot of info - but quickly:
SKIP_API=1 jekyll build
SKIP_API=1 jekyll serve --watch
*Requires a recentish jekyll - install instructions assume
ruby2.0 only; on Debian-based systems use gem2.0 in place of
gem (s/gem/gem2.0/)
Finding your way around the project
● Organized into sub-projects by directory
● IntelliJ is very popular with Spark developers
○ The free version is fine
● Some people like using emacs + ensime or magit too
● Language specific code is in each sub directory
Testing the issue
The spark-shell can often be a good way to verify the issue
reported in the JIRA is still occurring and come up with a
reasonable test.
Once you’ve got a handle on the issue in the spark-shell (or
if you decide to skip that step) check out
./[component]/src/test for Scala or doctests for Python
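For the Python side, a doctest is an example session embedded in a docstring; the test runner executes it and compares the result against the output shown. A minimal standalone sketch follows, where `replace_nulls` is a made-up function and not part of PySpark.

```python
import doctest


def replace_nulls(values, default="unknown"):
    """Replace None entries with a default string.

    >>> replace_nulls(["a", None, "b"])
    ['a', 'unknown', 'b']
    >>> replace_nulls([None], default="")
    ['']
    """
    return [default if value is None else value for value in values]


if __name__ == "__main__":
    # Runs every example in this module's docstrings and reports mismatches.
    print(doctest.testmod())
```

The examples double as documentation, which is why a reproduction worked out in the shell often translates directly into a test.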
While we get our code working:
● Remember to follow the style guides
○ https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
● Please always add tests
○ For development we can run scala test with “sbt [module]/testOnly”
○ In python we can specify module with ./python/run-tests
● ./dev/lint-scala & ./dev/lint-python check for some style
● Changing the API? Make sure we pass MiMa, or update its exclusions!
○ Sometimes it’s OK to make breaking changes, and MiMa can be a bit
overzealous so adding exceptions is common
A bit more on MiMa
● Spark wishes to maintain binary compatibility
○ in non-experimental components
○ 3.0 can be different
● MiMa exclusions can be added if we verify (and document
how we verified) the compatibility
● Often MiMa is a bit over sensitive so don’t feel stressed
- feel free to ask for help if confused
Julie Krawczyk
Making the change:
No arguing about which editor please - kthnx
Making a doc change? Look inside docs/*.md
Making a code change? grep, IntelliJ, or GitHub's in-project
code search can all help you find what you're looking
for.
Python API change parity update?
Yay! Let’s make a PR :)
● Push to your branch
● Visit GitHub
● Create PR (put JIRA name in title as well as component)
○ Components control where our PR shows up in
https://spark-prs.appspot.com/
● If you’ve been whitelisted tests will run
● Otherwise it will wait for someone to verify
● Tag it “WIP” if it’s a work in progress (but maybe wait)
[puamelia]
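Putting the JIRA name and component in the title can be sketched as a small formatter. This assumes the `[SPARK-xxxx][COMPONENT] Title` convention from Spark's contributing guide, with `[WIP]` placed after the component; `format_pr_title` is a made-up helper and `SPARK-12345` is a placeholder issue number.

```python
import re

JIRA_ID = re.compile(r"^SPARK-\d+$")


def format_pr_title(jira_id, components, summary, wip=False):
    """Return a title like '[SPARK-12345][PYTHON] Summary' (illustrative helper)."""
    if not JIRA_ID.match(jira_id):
        raise ValueError(f"expected an id like SPARK-12345, got {jira_id!r}")
    tags = [jira_id] + [component.upper() for component in components]
    if wip:
        tags.append("WIP")
    return "".join(f"[{tag}]" for tag in tags) + " " + summary.strip()


# SPARK-12345 is a placeholder, not the issue this change belongs to.
print(format_pr_title("SPARK-12345", ["python", "docs"], "Fix typo in example"))
# [SPARK-12345][PYTHON][DOCS] Fix typo in example
```

The component tags are what let dashboards and reviewers filter for the PRs they care about, so they are worth getting right.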
Code review time
● Note: this is after the pull request creation
● I believe code reviews should be done in the open
○ With an exception of when we are deciding if we want to try and
submit a change
○ Even then should have hopefully decided that back at the JIRA stage
● My personal beliefs & your org’s may not align
● If you have the time you can contribute by reviewing
others’ code too (please!)
Mitchell Joyce
And now onto the actual code review...
● Most often committers will review your code (eventually)
● Other people can help too
● People can be very busy (check the release schedule)
● If you don’t get traction try pinging people
○ Me ( @holdenkarau - I'm not an expert everywhere but I can look)
○ The author of the JIRA (even if not a committer)
○ The shepherd of the JIRA (if applicable)
○ The person who wrote the code you are changing (git blame)
○ Active committers for the component
Mitchell Joyce
What does the review look like?
● LGTM - Looks good to me
○ Individual thinks the code looks good - ready to merge (sometimes
LGTM pending tests or LGTM but check with @[name]).
● SGTM - Sounds good to me (normally in response to a
suggestion)
● Sometimes you get sent back to the drawing board
● Not all PRs get in - it’s OK!
○ Don’t feel bad & don’t get discouraged.
● Mixture of in-line comments & general comments
● You can see some videos of my live reviews at
http://bit.ly/holdenLiveOSS
Phil Long
That’s a pretty standard small PR
● It took some time to get merged in
● It was fairly simple
● Review cycles are long - so move on to other things
● Only two reviewers
● Apache Spark Jenkins comments on build status :)
○ “Jenkins retest this please” is great
● Big PRs - like making PySpark pip installable - can have >
10 reviewers and take a long time
● Sometimes it can be hard to find reviewers - tag your PRs
& ping people on GitHub
James Joel
Don’t get discouraged
David Martyn Hunt
It is normal to not get every pull request accepted
Sometimes other people will “scoop” you on your
pull request
Sometimes people will be super helpful with your
pull request
Don’t get discouraged
David Martyn Hunt
If you don’t hear anything there is a good chance it
is a “soft no” - but you can ping me and I can try
and help.
The community has been trying to get better at
explicit “Won’t Fix” or saying no on PRs
So who was that “Spark QA”/SparkJenkins/etc.?
● Automated pull request builder
● Jenkins based
● Runs all of the tests & style checks
● Lives in Berkeley
● Test logs live on, artifacts not so much
● https://amplab.cs.berkeley.edu/jenkins
Some changes require even more testing
● spark-perf (common for ML changes)
● spark-sql-perf (common for SQL changes)
● spark-integration-tests (integration testing)
Image of FLG by Eric Kilby
While we are waiting:
● Keep merging in master when we get out of sync
● If we don’t, Jenkins can’t run :(
● We get out of sync surprisingly quickly!
● If our pull request gets older than 30 days it might get
auto-closed
● If you don’t hear anything try pinging the dev list to
see if it's a “soft no” (and/or ping me :))
Moyan Brenn
In review: Where do we get started?
● Search for “starter” on JIRA
● Look on the mailing list for problems
● Stack Overflow - lots of questions, some of which are bugs
● grep for TODO, broken, FIXME
● Compare APIs between languages
● Customer/user reports?
Serena
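Comparing APIs between languages comes down to a set difference over method names. A toy sketch with hypothetical method lists; in practice the names would come from the API docs or from `dir()` on the Python class, and `missing_methods` is a made-up helper.

```python
def missing_methods(reference_api, other_api):
    """Return public names present in reference_api but absent from other_api."""
    public = {name for name in reference_api if not name.startswith("_")}
    return sorted(public - set(other_api))


# Hypothetical method lists, for illustration only.
scala_methods = ["count", "distinct", "groupBy", "newShinyThing"]
python_methods = ["count", "distinct", "groupBy", "_jdf"]

print(missing_methods(scala_methods, python_methods))
# ['newShinyThing']
```

Anything in the result is a candidate for a parity JIRA, after checking that the gap is not intentional.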
What about doing reviews?
● You don't need to be an expert (just will be slower)
● It's OK to leave suggestions like "I think this does X but
it's a little confusing, maybe add a comment"
● First pass reviews from others are super useful
● Helping people find the right reviewers is useful
● We have over 450 open pull requests (> 150 "active")
● You can drill down by component in
https://spark-prs.appspot.com/
What about when we want to make big changes?
● Talk with the community
○ Developer mailing list dev@spark.apache.org
○ User mailing list user@spark.apache.org
● Consider if it can be published as a spark-package
● Create a public design document (Google Doc normally)
● Be aware this will be somewhat of an uphill battle (I’m
sorry)
● You can look at SPIPs (Spark's versions of PEPs)
Other resources:
● “Contributing to Apache Spark” -
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
● Programming guide (along with JavaDoc, PyDoc, ScalaDoc,
etc.) - http://spark.apache.org/docs/latest/
● Developer list -
http://apache-spark-developers-list.1001551.n3.nabble.com/
What things can be good Spark packages?
● Input formats (especially Spark SQL, Streaming)
● Machine learning pipeline components & algorithms
● Testing support
● Monitoring data sinks
● Deployment tools
frankieleon
Making a package
● Relatively simple - need to publish to Maven Central
● Listed on http://spark-packages.org
● Cross building (Spark versions) not super easy
○ I use a perl script (don’t tell on me)
● If you’re building with sbt check out
https://github.com/databricks/sbt-spark-package to make
it easy to publish
● Used to do API compatibility checks
● Sometimes flaky - just republish if it doesn’t go
through
frankieleon
How about writing a book?
● Can be lots of fun
● Can also take up 100% of your “free” time
● Can get you invited to more nerd parties
● Most of the publishers are looking to improve/broaden
their Spark book line up
● Like an old book that hasn’t been updated? Talk to the
publisher about updating it.
Kreg Steppe
How about yak shaving?
● Lots of areas need shaving
● JVM deps are easier to update, Python deps are not :(
● Things built on top are a great place to go yak shaving
○ Jupyter etc.
Jason Crane
Testing/Release Validation
● Join the dev@ list and look for [VOTE] threads
○ Check and see if Spark deploys on your environment
○ If your application still works, or if we need to fix something
○ Great way to keep your Spark application working with less work
● Adding more automated tests is good too
○ Especially integration tests
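One small, scriptable part of checking a release candidate is confirming that a downloaded artifact matches its published checksum. This sketch assumes the candidate ships a SHA-512 digest next to the tarball and that only the hex digits are passed in; both helper names are made up for illustration.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-512 so large tarballs need not fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare against the published digest, ignoring whitespace and case."""
    cleaned = "".join(published_hex.split()).lower()
    return sha512_of(path) == cleaned
```

A matching checksum only shows the download is intact; whether the application still works on the candidate is the part that needs your own environment.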
Spark Videos
● Apache Spark Youtube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
Paul Anderson
● Learning Spark
● Fast Data Processing with Spark (Out of Date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Spark in Action
● High Performance Spark
● Learning PySpark
High Performance Spark!
You can buy it today! On the internet!
Cats love it*
*Or at least the box it comes in. If buying for a cat, get
print rather than e-book.
Sign up for the mailing list @
And some upcoming talks:
● March
○ Dataworks Barcelona -- tomorrow
○ Strata San Francisco -- next week
● April
○ Spark Summit
● May
○ KiwiCoda Mania
● June
○ "Secret" (for another week or so)
● July
○ OSCON Portland
○ Skills Matter in London
k thnx bye :)
If you care about Spark testing and don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
It’s performance review season, so help a friend out and
fill out this survey with your talk feedback
http://bit.ly/holdenTalkFeedback

Más contenido relacionado

La actualidad más candente

Building Recoverable (and optionally async) Pipelines with Apache Spark (+ s...
Building Recoverable (and optionally async) Pipelines with Apache Spark  (+ s...Building Recoverable (and optionally async) Pipelines with Apache Spark  (+ s...
Building Recoverable (and optionally async) Pipelines with Apache Spark (+ s...Holden Karau
 
Validating big data pipelines - Scala eXchange 2018
Validating big data pipelines -  Scala eXchange 2018Validating big data pipelines -  Scala eXchange 2018
Validating big data pipelines - Scala eXchange 2018Holden Karau
 
Powering tensor flow with big data using apache beam, flink, and spark cern...
Powering tensor flow with big data using apache beam, flink, and spark   cern...Powering tensor flow with big data using apache beam, flink, and spark   cern...
Powering tensor flow with big data using apache beam, flink, and spark cern...Holden Karau
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkHolden Karau
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache SparkHolden Karau
 
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Debugging Spark:  Scala and Python - Super Happy Fun Times @ Data Day Texas 2018Debugging Spark:  Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018Holden Karau
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Holden Karau
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018Holden Karau
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Holden Karau
 
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark... Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...Databricks
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Testing and validating distributed systems with Apache Spark and Apache Beam ...
Testing and validating distributed systems with Apache Spark and Apache Beam ...Testing and validating distributed systems with Apache Spark and Apache Beam ...
Testing and validating distributed systems with Apache Spark and Apache Beam ...Holden Karau
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 
Extending spark ML for custom models now with python!
Extending spark ML for custom models  now with python!Extending spark ML for custom models  now with python!
Extending spark ML for custom models now with python!Holden Karau
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Holden Karau
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsHolden Karau
 
Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Holden Karau
 

La actualidad más candente (19)

Building Recoverable (and optionally async) Pipelines with Apache Spark (+ s...
Building Recoverable (and optionally async) Pipelines with Apache Spark  (+ s...Building Recoverable (and optionally async) Pipelines with Apache Spark  (+ s...
Building Recoverable (and optionally async) Pipelines with Apache Spark (+ s...
 
Validating big data pipelines - Scala eXchange 2018
Validating big data pipelines -  Scala eXchange 2018Validating big data pipelines -  Scala eXchange 2018
Validating big data pipelines - Scala eXchange 2018
 
Powering tensor flow with big data using apache beam, flink, and spark cern...
Powering tensor flow with big data using apache beam, flink, and spark   cern...Powering tensor flow with big data using apache beam, flink, and spark   cern...
Powering tensor flow with big data using apache beam, flink, and spark cern...
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New York
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache Spark
 
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Debugging Spark:  Scala and Python - Super Happy Fun Times @ Data Day Texas 2018Debugging Spark:  Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
 
Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016Apache Spark Super Happy Funtimes - CHUG 2016
Apache Spark Super Happy Funtimes - CHUG 2016
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Debugging PySpark - PyCon US 2018
Debugging PySpark -  PyCon US 2018Debugging PySpark -  PyCon US 2018
Debugging PySpark - PyCon US 2018
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018
 
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark... Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark...
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Testing and validating distributed systems with Apache Spark and Apache Beam ...
Testing and validating distributed systems with Apache Spark and Apache Beam ...Testing and validating distributed systems with Apache Spark and Apache Beam ...
Testing and validating distributed systems with Apache Spark and Apache Beam ...
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Extending spark ML for custom models now with python!
Extending spark ML for custom models  now with python!Extending spark ML for custom models  now with python!
Extending spark ML for custom models now with python!
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
 
Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?
 

Similar a Contributing to Apache Spark 3

Getting Started Contributing to Apache Spark – From PR, CR, JIRA, and Beyond
Getting Started Contributing to Apache Spark – From PR, CR, JIRA, and BeyondGetting Started Contributing to Apache Spark – From PR, CR, JIRA, and Beyond
Getting Started Contributing to Apache Spark – From PR, CR, JIRA, and BeyondDatabricks
 
A Glimpse At The Future Of Apache Spark 3.0 With Deep Learning And Kubernetes
A Glimpse At The Future Of Apache Spark 3.0 With Deep Learning And KubernetesA Glimpse At The Future Of Apache Spark 3.0 With Deep Learning And Kubernetes
A Glimpse At The Future Of Apache Spark 3.0 With Deep Learning And KubernetesLightbend
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Holden Karau
 
Making the big data ecosystem work together with python apache arrow, spark,...
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...Holden Karau
 
Debugging Apache Spark - Scala & Python super happy fun times 2017
Debugging Apache Spark -   Scala & Python super happy fun times 2017Debugging Apache Spark -   Scala & Python super happy fun times 2017
Debugging Apache Spark - Scala & Python super happy fun times 2017Holden Karau
 
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)Holden Karau
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYCHolden Karau
 
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...
Powering tensorflow with big data (apache spark, flink, and beam)   dataworks...Powering tensorflow with big data (apache spark, flink, and beam)   dataworks...
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...Holden Karau
 
Contributing to Apache Spark 3

  • 1. @holdenkarau Effectively Contributing to Apache Spark Spark BCN 2019 I am on the PMC but this represents my own personal views
  • 2. @holdenkarau Where you can find the slides for this talk http://bit.ly/2Ogr2BP
  • 3. @holdenkarau Who am I? Holden ● Preferred pronouns: she/her ● Co-author of the Learning Spark & High Performance Spark books ● OSS Big Data Developer advocate @ Google ● Spark PMC & Committer ● Twitter @holdenkarau ● Live stream code & reviews: http://bit.ly/holdenLiveOSS ● http://www.slideshare.net/hkarau
  • 5. @holdenkarau What we are going to explore together! Getting a change into Apache Spark & the components involved: ● The current state of the Apache Spark dev community ● Reason to contribute to Apache Spark ● Different ways to contribute ● Places to find things to contribute ● Tooling around code & doc contributions Torsten Reuschling
  • 6. @holdenkarau Who I think you wonderful humans are? ● Nice* people ● Don’t mind pictures of cats ● May know some Apache Spark? ● Want to contribute to Apache Spark
  • 7. @holdenkarau Why I’m assuming you might want to contribute: ● Fix your own bugs/problems with Apache Spark ● Learn more about distributed systems (for fun or profit) ● Improve your Scala/Python/R/Java experience ● You <3 functional programming and want to trick more people into using it ● “Credibility” of some vague type ● You just like hacking on random stuff and Spark seems shiny
  • 8. @holdenkarau What’s the state of the Spark dev community? ● Really large number of contributors ● Active PMC & Committers somewhat concentrated ○ Better than we used to be ● Also a lot of SF Bay Area - but certainly not exclusively so gigijin
  • 9. @holdenkarau How can we contribute to Spark? ● Direct code in the Apache Spark code base ● Code in packages built on top of Spark ● Code reviews ● Yak shaving (aka fixing things that Spark uses) ● Documentation improvements & examples ● Books, Talks, and Blogs ● Answering questions (mailing lists, stack overflow, etc.) ● Testing & Release Validation Andrey
  • 10. @holdenkarau Which is right for you? ● Direct code in the Apache Spark code base ○ High visibility, some things can only really be done here ○ Can take a lot longer to get changes in ● Code in packages built on top of Spark ○ Really great for things like formats or standalone features ● Yak shaving (aka fixing things that Spark uses) ○ Super important to do sometimes - can take even longer to get in romana klee
  • 11. @holdenkarau Which is right for you? (continued) ● Code reviews ○ High visibility to PMC, can be faster to get started, easier to time box ○ Less tracked in metrics ● Documentation improvements & examples ○ Lots of places to contribute - mixed visibility - large impact ● Advocacy: Books, Talks, and Blogs ○ Can be high visibility romana klee
  • 12. @holdenkarau Contributing Code Directly to Spark ● Maybe we encountered a bug we want to fix ● Maybe we’ve got a feature we want to add ● Either way we should see if other people are doing it ● And if what we want to do is complex, it might be better to find something simple to start with ● It’s dangerous to go alone - take this https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Jon Nelson
  • 13. @holdenkarau The different pieces of Spark Apache Spark “Core” SQL & DataFrames Streaming Language APIs Scala, Java, Python, & R Graph Tools Spark ML bagel & Graph X MLLib Community Packages Spark on Yarn Spark on Mesos Standalone Spark
  • 14. @holdenkarau The different pieces of Spark: 2.0+ Apache Spark “Core” SQL & DataFrames Streaming Language APIs Scala, Java, Python, & R Graph Tools Spark ML bagel & Graph X MLLib Community Packages Structured Streaming
  • 15. @holdenkarau The different pieces of Spark: 3+? Apache Spark “Core” SQL & DataFrames Streaming Language APIs Scala, Java, Python, & R Graph Tools Spark ML bagel & Graph X MLLib Community Packages Structured Streaming Spark on Yarn Spark on Mesos Spark on Kubernetes Standalone Spark
  • 16. @holdenkarau Choosing a component? ● Core ○ Conservative to external changes, but biggest impact ● ML / MLlib ○ ML is the home of the future - you can improve existing algorithms - new algorithms face uphill battle ● Structured Streaming ○ Current API is in a lot of flux so it is difficult for external participation ● SQL ○ Lots of fun stuff - very active - I have limited personal experience ● Python / R ○ Improve coverage of current APIs, structural change hard ● GraphX - Dead see GraphFrames instead Rikki's Refuge
  • 17. @holdenkarau Choosing a component? (cont) ● Kubernetes ○ New, lots of active work and reviewers ● YARN ○ Old faithful, always needs a little work. Hadoop 3 support ● Mesos ○ Needs some love, probably easy-ish-path to committer (still hard) ● Standalone ○ Not a lot going on Rikki's Refuge
  • 18. @holdenkarau Onto JIRA - Issue tracking funtimes ● It’s like bugzilla or fog bugz ● There is an Apache JIRA for many Apache projects ● You can (and should) sign up for an account ● All changes in Spark (now) require a JIRA ● https://www.youtube.com/watch?v=ca8n9uW3afg ● Check it out at: ○ https://issues.apache.org/jira/browse/SPARK
  • 19. @holdenkarau What we can do with ASF JIRA? ● Search for issues (remember to filter to Spark project) ● Create new issues ○ search first to see if someone else has reported it ● Comment on issues to let people know we are working on it ● Ask people for clarification or help ○ e.g. “Reading this I think you want the null values to be replaced by a string when processing - is that correct?” ○ @mentions work here too
  • 20. @holdenkarau What can’t we do with ASF JIRA? ● Assign issues (to ourselves or other people) ○ In lieu of assigning we can “watch” & comment ● Post long design documents (create a Google Doc & link to it from the JIRA) ● Tag issues ○ While we can add tags, they often get removed
  • 22. @holdenkarau Finding a good “starter” issue: ● There are explicit starter tags in JIRA we can search for ● But often the starter tag isn’t applied ● Read through and look for simple issues ● Pick something in the same component you eventually want to work in ○ And or consider improving the non-Scala language API for the component(s) you want to work on. ● Look at the reporter and commenters - is there a committer or someone whose name you recognize? ● Leave a comment that says you are going to start working on this
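The JIRA search described above can also be written as a JQL query. The following is a minimal sketch, not an official tool: the `starter` label value, the `PySpark` component name, and the `/jira/issues/?jql=` search URL are assumptions for illustration, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed search endpoint on the ASF JIRA; adjust if the instance differs.
JIRA_SEARCH_URL = "https://issues.apache.org/jira/issues/"


def starter_issue_jql(component=None, label="starter"):
    """Build a JQL string for unresolved Spark issues carrying a label."""
    clauses = [
        "project = SPARK",           # remember to filter to the Spark project
        f'labels = "{label}"',       # assumed label name for starter issues
        "resolution = Unresolved",   # skip issues that are already closed
    ]
    if component:
        clauses.append(f'component = "{component}"')
    return " AND ".join(clauses) + " ORDER BY updated DESC"


def starter_issue_url(component=None, label="starter"):
    """Return a browser URL that runs the JQL query above."""
    query = urlencode({"jql": starter_issue_jql(component, label)})
    return JIRA_SEARCH_URL + "?" + query


if __name__ == "__main__":
    print(starter_issue_jql("PySpark"))
    print(starter_issue_url("PySpark"))
```

Since the starter tag often isn’t applied, a query like this is only a starting point; reading through recent issues in the component is still worthwhile.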
  • 23. @holdenkarau Find an issue you want to work on https://issues.apache.org/jira/browse/SPARK Also grep for TODO in components you are interested in (e.g. grep -r TODO ./python/pyspark or grep -R TODO ./core/src) Look between language APIs and see if anything is missing that you think is interesting - http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package http://spark.apache.org/docs/latest/api/python/index.html neko kabachi
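For anyone who prefers a script to grep, the TODO search can be sketched in Python. This is an illustrative helper under stated assumptions: `find_markers` is a hypothetical name, and the list of file extensions is a guess at what is worth scanning in a Spark checkout.

```python
import os
import re

# Match TODO or FIXME as whole words.
MARKER = re.compile(r"\b(TODO|FIXME)\b")


def find_markers(root, extensions=(".py", ".scala", ".java", ".R")):
    """Yield (path, line_number, text) for each TODO/FIXME line under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue  # skip files we did not ask for
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if MARKER.search(line):
                        yield path, number, line.strip()


if __name__ == "__main__":
    # Run from the root of a Spark checkout.
    for path, number, text in find_markers("./python/pyspark"):
        print(f"{path}:{number}: {text}")
```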
  • 24. @holdenkarau Explore things that make sense to revisit https://issues.apache.org/jira/browse/SPARK Consider looking for issues which we couldn't fix due to our compatibility requirements and should revisit for 3+ Maurizio Zanetti
  • 27. @holdenkarau But before we get too far: ● Spark wishes to maintain compatibility between releases ● We're working on 3 though so this is the time to break things Meagan Fisher
  • 28. @holdenkarau Getting at the code: yay for GitHub :) ● https://github.com/apache/spark ● Make a fork of it ● Clone it locally dougwoods
  • 30. @holdenkarau Building Spark ./build/sbt or ./build/mvn Working in Python? Make sure to build the package target so your Python code will run :) You can quickly verify build with the Spark Shell :) Kara
  • 31. @holdenkarau What about documentation changes? ● Still use JIRAs to track ● We can’t edit the wiki :( ● But a lot of documentation lives in docs/*.md Kreg Steppe
  • 32. @holdenkarau Building Spark’s docs ./docs/README.md has a lot of info - but quickly: SKIP_API=1 jekyll build SKIP_API=1 jekyll serve --watch *Requires a recentish jekyll - install instructions assume ruby2.0 only, on debian based s/gem/gem2.0/
  • 33. @holdenkarau Finding your way around the project ● Organized into sub-projects by directory ● IntelliJ is very popular with Spark developers ○ The free version is fine ● Some people like using emacs + ensime or magit too ● Language specific code is in each sub directory
  • 34. @holdenkarau Testing the issue The spark-shell can often be a good way to verify the issue reported in the JIRA is still occurring and come up with a reasonable test. Once you’ve got a handle on the issue in the spark-shell (or if you decide to skip that step) check out ./[component]/src/test for Scala or doctests for Python
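As a reminder of what a Python doctest looks like, here is a minimal, self-contained sketch. `replace_nulls` is a hypothetical function invented for illustration (loosely echoing the null-replacement example from the JIRA slide) and is not part of PySpark.

```python
import doctest


def replace_nulls(values, default):
    """Replace None entries in a list with a default value.

    The examples below are the tests: doctest runs each ``>>>`` line
    and compares the result with the text that follows it.

    >>> replace_nulls([1, None, 3], 0)
    [1, 0, 3]
    >>> replace_nulls([], "missing")
    []
    """
    return [default if value is None else value for value in values]


if __name__ == "__main__":
    # Run every doctest in this module and print the summary.
    print(doctest.testmod())
```

Writing the failing example from the JIRA as a doctest first gives you a reproducible check to run while you work on the fix.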
  • 35. @holdenkarau While we get our code working: ● Remember to follow the style guides ○ https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide ● Please always add tests ○ For development we can run scala test with “sbt [module]/testOnly” ○ In python we can specify module with ./python/run-tests ● ./dev/lint-scala & ./dev/lint-python check for some style ● Changing the API? Make sure we pass or you update MiMa! ○ Sometimes it’s OK to make breaking changes, and MiMa can be a bit overzealous so adding exceptions is common
  • 36. @holdenkarau A bit more on MiMa ● Spark wishes to maintain binary compatibility ○ in non-experimental components ○ 3.0 can be different ● MiMa exclusions can be added if we verify (and document how we verified) the compatibility ● Often MiMa is a bit over sensitive so don’t feel stressed - feel free to ask for help if confused Julie Krawczyk
  • 37. @holdenkarau Making the change: No arguing about which editor please - kthnx Making a doc change? Look inside docs/*.md Making a code change? grep or intellij or github inside project codesearch can all help you find what you're looking for.
  • 39. @holdenkarau Yay! Let’s make a PR :) ● Push to your branch ● Visit github ● Create PR (put JIRA name in title as well as component) ○ Components control where our PR shows up in https://spark-prs.appspot.com/ ● If you’ve been whitelisted tests will run ● Otherwise will wait for someone to verify ● Tag it “WIP” if it’s a work in progress (but maybe wait) [puamelia]
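Spark PR titles commonly follow the shape `[SPARK-NNNN][COMPONENT] Summary`, which is how the JIRA name and component end up in the title. The sketch below formats and parses that shape; the helper names are hypothetical, the issue number `12345` in the usage lines is a placeholder, and the regular expression is an assumption about the convention rather than the project's own validation.

```python
import re

# Assumed title shape: [SPARK-<number>] then one or more [TAG] groups, then a summary.
TITLE_PATTERN = re.compile(r"^\[SPARK-(\d+)\]((?:\[[A-Z0-9 ]+\])+) (\S.*)$")


def format_pr_title(jira_number, components, summary):
    """Build a title such as '[SPARK-12345][PYTHON] Fix something'."""
    tags = "".join(f"[{component.upper()}]" for component in components)
    return f"[SPARK-{jira_number}]{tags} {summary.strip()}"


def parse_pr_title(title):
    """Return (jira_number, components, summary), or None if the shape is wrong."""
    match = TITLE_PATTERN.match(title)
    if match is None:
        return None
    components = re.findall(r"\[([A-Z0-9 ]+)\]", match.group(2))
    return int(match.group(1)), components, match.group(3)


if __name__ == "__main__":
    title = format_pr_title(12345, ["python", "docs"], "Fix typo in README")
    print(title)
    print(parse_pr_title(title))
```

Note that this sketch treats a `[WIP]` tag like any other bracketed component.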
  • 40. @holdenkarau Code review time ● Note: this is after the pull request creation ● I believe code reviews should be done in the open ○ With an exception of when we are deciding if we want to try and submit a change ○ Even then should have hopefully decided that back at the JIRA stage ● My personal beliefs & your org’s may not align ● If you have the time you can contribute by reviewing others’ code too (please!) Mitchell Joyce
  • 41. @holdenkarau And now onto the actual code review... ● Most often committers will review your code (eventually) ● Other people can help too ● People can be very busy (check the release schedule) ● If you don’t get traction try pinging people ○ Me ( @holdenkarau - I'm not an expert everywhere but I can look) ○ The author of the JIRA (even if not a committer) ○ The shepherd of the JIRA (if applicable) ○ The person who wrote the code you are changing (git blame) ○ Active committers for the component Mitchell Joyce
  • 42. @holdenkarau What does the review look like? ● LGTM - Looks good to me ○ Individual thinks the code looks good - ready to merge (sometimes LGTM pending tests or LGTM but check with @[name]). ● SGTM - Sounds good to me (normally in response to a suggestion) ● Sometimes get sent back to the drawing board ● Not all PRs get in - it’s OK! ○ Don’t feel bad & don’t get discouraged. ● Mixture of in-line comments & general comments ● You can see some videos of my live reviews at http://bit.ly/holdenLiveOSS Phil Long
  • 52. @holdenkarau That’s a pretty standard small PR ● It took some time to get merged in ● It was fairly simple ● Review cycles are long - so move on to other things ● Only two reviewers ● Apache Spark Jenkins comments on build status :) ○ “Jenkins retest this please” is great ● Big PRs - like making PySpark pip installable can have > 10 reviewers and take a long time ● Sometimes it can be hard to find reviewers - tag your PRs & ping people on github James Joel
  • 53. @holdenkarau Don’t get discouraged David Martyn Hunt It is normal to not get every pull request accepted Sometimes other people will “scoop” you on your pull request Sometimes people will be super helpful with your pull request
  • 54. @holdenkarau Don’t get discouraged David Martyn Hunt If you don’t hear anything there is a good chance it is a “soft no” - but you can ping me and I can try and help. The community has been trying to get better at explicit “Won’t Fix” or saying no on PRs
  • 55. @holdenkarau So who was that “Spark QA”/SparkJenkins/etc.? ● Automated pull request builder ● Jenkins based ● Runs all of the tests & style checks ● Lives in Berkeley ● Test logs live on, artifacts not so much ● https://amplab.cs.berkeley.edu/jenkins
  • 56. @holdenkarau Some changes require even more testing ● spark-perf (common for ML changes) ● spark-sql-perf (common for SQL changes) ● spark-integration-tests (integration testing) Image of FLG by Eric Kilby
  • 57. @holdenkarau While we are waiting: ● Keep merging in master when we get out of sync ● If we don’t jenkins can’t run :( ● We get out of sync surprisingly quickly! ● If our pull request gets older than 30 days it might get auto-closed ● If you don’t hear anything try pinging the dev list to see if it's a “soft no” (and or ping me :)) Moyan Brenn
  • 58. @holdenkarau In review: Where do we get started? ● Search for “starter” on JIRA ● Look on the mailing list for problems ● Stackoverflow - lots of questions some of which are bugs ● grep TODO broken FIXME ● Compare APIs between languages ● Customer/user reports? Serena
  • 59. @holdenkarau What about doing reviews? ● You don't need to be an expert (just will be slower) ● It's OK to leave suggestions like "I think this does X but it's a little confusing maybe add a comment" ● First pass reviews from others are super useful ● Helping people find the right reviewers is useful ● We have over 450 open pull requests (> 150 "active") ● You can drill down by component in https://spark-prs.appspot.com/
  • 60. @holdenkarau What about when we want to make big changes? ● Talk with the community ○ Developer mailing list dev@spark.apache.org ○ User mailing list user@spark.apache.org ● Consider if it can be published as a spark-package ● Create a public design document (google doc normally) ● Be aware this will be somewhat of an uphill battle (I’m sorry) ● You can look at SPIPs (Spark's versions of PEPs)
  • 61. @holdenkarau Other resources: ● “Contributing to Apache Spark” - https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ● Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.) - http://spark.apache.org/docs/latest/ ● Developer list - http://apache-spark-developers-list.1001551.n3.nabble.com/
  • 62. @holdenkarau What things can be good Spark packages? ● Input formats (especially Spark SQL, Streaming) ● Machine learning pipeline components & algorithms ● Testing support ● Monitoring data sinks ● Deployment tools frankieleon
  • 63. @holdenkarau Making your package ● Relatively simple - need to publish to maven central ● Listed on http://spark-packages.org ● Cross building (Spark versions) not super easy ○ I use a perl script (don’t tell on me) ● If you’re building with sbt check out https://github.com/databricks/sbt-spark-package to make it easy to publish ● Used to do API compatibility checks ● Sometimes flaky - just republish if it doesn’t go through frankieleon
  • 64. @holdenkarau How about writing a book? ● Can be lots of fun ● Can also take up 100% of your “free” time ● Can get you invited to more nerd parties ● Most of the publishers are looking to improve/broaden their Spark book line up ● Like an old book that hasn’t been updated? Talk to the publisher about updating it. Kreg Steppe
  • 65. @holdenkarau How about yak shaving? ● Lots of areas need shaving ● JVM deps are easier to update, Python deps are not :( ● Things built on top are a great place to go yak shaving ○ Jupyter etc. Jason Crane
  • 66. @holdenkarau Testing/Release Validation ● Join the dev@ list and look for [VOTE] threads ○ Check and see if Spark deploys on your environment ○ If your application still works, or if we need to fix something ○ Great way to keep your Spark application working with less work ● Adding more automated tests is good too ○ Especially integration tests
  • 67. @holdenkarau Spark Videos ● Apache Spark Youtube Channel ● My Spark videos on YouTube - ○ http://bit.ly/holdenSparkVideos ● Spark Summit 2014 training ● Paco’s Introduction to Apache Spark Paul Anderson
  • 68. @holdenkarau Learning Spark ● Fast Data Processing with Spark (Out of Date) ● Fast Data Processing with Spark (2nd edition) ● Advanced Analytics with Spark ● Spark in Action ● High Performance Spark ● Learning PySpark
  • 69. @holdenkarau High Performance Spark! You can buy it today! On the internet! Cats love it* *Or at least the box it comes in. If buying for a cat, get print rather than e-book.
  • 70. @holdenkarau Sign up for the mailing list @
  • 71. @holdenkarau And some upcoming talks: ● March ○ Dataworks Barcelona -- tomorrow ○ Strata San Francisco -- next week ● April ○ Spark Summit ● May ○ KiwiCoda Mania ● June ○ "Secret" (for another week or so) ● July ○ OSCON Portland ○ Skills Matter in London
  • 72. @holdenkarau k thnx bye :) If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark . Will tweet results “eventually” @holdenkarau Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF It’s performance review season, so help a friend out and fill out this survey with your talk feedback http://bit.ly/holdenTalkFeedback