Are General Purpose Systems Eating
the Big Data World?
with your friend
@holdenkarau
Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams & live coding: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
Some things that may color my views:
● I’m a Spark committer -- if Spark continues to win I can probably make more
$s
● My employer cares about BEAM (and Spark and other things)
● I work primarily in Python & Scala these days
● I like functional programming
● Probably some others I’m forgetting
On the other hand:
● I’ve worked on Spark for a long time and know a lot of its faults
● My goals are pretty flexible
● I have x86 assembly code tattooed on my back
Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
● On twitter @BooProgrammer
● Wishes EU cookie warnings were about real cookies
● Also confused about GDPR
Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Think the JVM is kind of cool
Lori Erickson
What are we going to talk about?
● The state of the open big data ecosystem
● Why general purpose matters
○ Where it doesn’t* matter**
● The different kinds of unification and why it matters
● Evolution of Spark/Flink and friends
● What BEAM aims to do
○ Where we are today
○ Where we need to be tomorrow
● Other ways the data world can be unified
● A limited number of GDPR jokes if you opt-in today
● And of course lots of wordcount examples
Our ever growing ecosystem:
http://mattturck.com/bigdata2017/
Enter: Spark
● Integrated different tools which traditionally required different systems
○ Mahout, Hive, etc. Not great at all of these but “good enough”
● e.g. can use the same system to do ML and SQL (see the sketch below)
*Often written in Python!
Apache Spark
[Diagram: the Spark stack — SQL, DataFrames & Datasets; Structured Streaming; Streaming; Spark ML; MLlib; Bagel & GraphX; GraphFrames; with Scala, Java, Python, & R bindings]
Paul Hudson
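To make that concrete, here's a minimal sketch (my addition, not from the deck) of one SparkSession doing both SQL and a Spark ML stage. It assumes a running SparkSession named spark; the data and column names are made up.

# One engine, two workloads: SQL filtering feeding an ML pipeline stage
from pyspark.ml.feature import Tokenizer

df = spark.createDataFrame([("hi boo",), ("boo",)], ["text"])
df.createOrReplaceTempView("docs")

# SQL against the same data...
filtered = spark.sql("SELECT text FROM docs WHERE length(text) > 3")
# ...then an ML feature transformer, no interchange step needed
tokens = Tokenizer(inputCol="text", outputCol="words").transform(filtered)
tokens.show()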
General purpose systems probably are eating the world
● Operations overhead
● Moving data from System 1 to System 2 (sqoop and friends)
○ Interchange: write everything out to disk (or HDFS/GCS/S3/etc.)
● We still have specialized tools, but being built on top of general frameworks
○ e.g. see mahout on Spark
○ Less closely tied things like Hive/Pig on Spark
○ TF.Transform etc.
Tambako The Jaguar
Even then, lots of general purpose tools:
Resulting in (e.g. Mahout’s per-backend modules):
flink/
h2o/
hdfs/
integration/
mr/
spark/
viennacl/
And language silos (Scala, Java, Python, Go, etc.!)
Photo by: photobom Photo: Fritz Schuman (ScalaDays CPH)
And cloud silos….
Photo By: Zechariah Judy
& We’re getting better with cloud silos
● EMR (elastic map reduce) like services on AWS, Azure & Google
○ EMR
○ Azure HDInsight
○ Google Dataproc
● You can move your Spark job super easily
● Vendors are becoming cross-platform too (see Databricks)
torne (where's my lens ca
We’re getting better with data silos
● Customers demand access to their data in many platforms
● Vendors seem to listen
○ Notable exceptions: formats without a vendor, or a vendor with a closed-source, locked-in platform
● e.g. Even if your data is a mixture of Parquet & BigQuery (or Redshift) you can access both
We’re starting to recognize these problems:
● Spark is adding more runners (see SparkR) and integrating Arrow
● Small projects (like Sparkling ML) work to expose Python libs to JVM and vice
versa
● Apache Arrow is focused on improved formats for interop (see the sketch below)
○ Included in Spark 2.3 to vastly improve Python performance
○ Being used by GPU vendors too!
● Apache BEAM is adding a unified runner SDK API
○ Currently each language has to support each backend
○ Common core
● Apache BEAM is attempting to unify streaming APIs
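A rough illustration (my addition) of the interop Arrow enables: the same columnar batch can be handed between runtimes without a pickle round trip. Assumes pandas and pyarrow are installed.

import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"word": ["boo", "hi"], "count": [2, 1]})
table = pa.Table.from_pandas(pdf)  # columnar, language-agnostic layout
roundtrip = table.to_pandas()      # back again, no pickling involved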
What’s the state of non-JVM big data?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes (see the sketch below)
● Sockets
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask -- but hard to talk to existing ecosystem
David Brown
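A tiny sketch (illustrative only, not any particular framework's actual protocol) of the Unix-pipes interchange style: the JVM launches a Python worker and streams records over stdin/stdout.

import sys

# Hadoop-Streaming-esque mapper: read lines, emit tab-separated pairs
for line in sys.stdin:
    for word in line.split():
        # the JVM side would collect and reduce these
        sys.stdout.write(word + "\t1\n")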
Spark in Scala, how does PySpark work?
● Py4J + pickling + JSON and magic
○ This can be kind of slow sometimes
● Distributed collections are often collections of pickled
objects
● Spark SQL (and DataFrames) avoid some of this
○ Sometimes we can make them go fast and compile them to the JVM
● Features aren’t automatically exposed, but exposing
them is normally simple.
● SparkR depends on similar magic (see the sketch below)
kristin klein
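To show what that magic costs, a small sketch (my addition; assumes a SparkSession named spark): a plain Python UDF round-trips every row through pickling, while the equivalent built-in SQL function stays in the JVM.

from pyspark.sql.functions import udf, length, col
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([("boo",), ("hi boo",)], ["text"])

# Python UDF: rows get pickled, shipped to a Python worker, and back
py_len = udf(lambda s: len(s), IntegerType())
df.select(py_len(col("text"))).show()

# Built-in function: compiled into the JVM plan, no Python round trip
df.select(length(col("text"))).show()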
So what does that look like?
[Diagram: the Python driver talks to the JVM via Py4J; JVM workers (Worker 1 … Worker K) exchange data with Python workers over pipes]
*For a small price of your fun libraries. Bad idea.
That was a bad idea, buuut…..
● Work going on in Scala land to translate simple Scala
into SQL expressions - need the Dataset API
○ Maybe we can try similar approaches with Python?
● POC: use Jython for simple UDFs (e.g. 2.7 compat & no native libraries) - SPARK-15369
○ Early benchmarking w/word count 5% slower than native Scala UDF,
close to 2x faster than regular Python
● Willing to share your Python UDFs for benchmarking? -
http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it’s ok!
The “future”*: faster interchange
● By future I mean availability starting in the next 3-6 months (with more
improvements after).
○ Yes much of this code exists, it just isn’t released yet so I’m sure we’ll find all sorts of bugs
and ways to improve.
○ Relatedly you can help us test in Spark 2.3 when we start the RC process to catch bugs early!
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
Andrew Skudder
*Arrow: likely the future. I really hope so. Spark 2.3 and beyond!
With early work happening to support GPUs/TF.
What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html *Vendor benchmark. Be careful.
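For a taste of the vectorized UDFs the linked post benchmarks, here's a minimal sketch against the Spark 2.3 pandas_udf API (my addition; assumes pyarrow is installed and a SparkSession named spark):

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR)
def plus_one(v):
    # v arrives as a pandas.Series for a whole Arrow record batch,
    # instead of one pickled row at a time
    return v + 1

spark.range(0, 1000).select(plus_one('id')).show(5)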
Arrow (a poorly drawn big data view)
Logos trademarks of their respective projects
Juha Kettunen
*ish
BEAM backends:
● BEAM nominally supports*
○ Dataflow
○ Flink*
○ Spark*
○ IBM Streams, etc.
● Goal of more than just lowest-common-denominator, think of it like a compiler**
*Supports as in early-stage, but we’re working on it (and we’d love your help!)
**But you know, in the same sense I compare Spark Streams to pandas coming down a wooden slide.
BEAM Languages
● JVM: Scala*, Java, etc.
● non-JVM: Python w/Go and more coming
BEAM Beyond the JVM
● This part doesn’t work outside of Google’s hosted environment yet, so I’m
going to be light on the details
● tl;dr : uses grpc / protobuf
● But exciting new plans (w/ some early code) to unify the runners and ease the support of different languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
BEAM Beyond the JVM: Random collab branch
[Diagram: the Beam portability layer wiring SDKs to runners; most pieces marked *ish]
Nick
What do (some of) the different APIs look like?
● Everyone's favourite: Streaming Word Count Example
● And then windowed wordcount!
● (And also a peek at Tensorflow in case anyone is trying to raise a series A)
Spark wordcount (Python*)
# Assumes `lines` is a (streaming) DataFrame with a string column `value`
from pyspark.sql.functions import explode, split

# Split the lines into words
words = lines.select(
  # explode turns each item in an array into a separate row
  explode(
    split(lines.value, ' ')
  ).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()
Flink wordcount (Java)
DataStream<String> text = ...
DataStream<Tuple2<String, Integer>> counts =
  text.map(line -> line.toLowerCase().split("\\W+"))
    .flatMap((String[] tokens, Collector<Tuple2<String, Integer>> out) -> {
      Arrays.stream(tokens)
        .forEach(t -> out.collect(new Tuple2<>(t, 1)));
    })
    // group by the tuple field "0" and sum up tuple field "1"
    .keyBy(0)
    .sum(1);
BEAM wordcount (Java)
p.apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    for (String word : c.element().split("\\W+")) {
      c.output(word);
    }}}))
 .apply(Count.<String>perElement())
Go Wordcount (BEAM)
func CountWords(s beam.Scope, lines beam.PCollection) beam.PCollection {
  s = s.Scope("CountWords")
  // Convert lines of text into individual words.
  col := beam.ParDo(s, extractFn, lines)
  // Count the number of times each word occurs.
  return stats.Count(s, col)
}
Ada Doglace
Go Wordcount (BEAM) continued
p := beam.NewPipeline()
s := p.Root()
lines := textio.Read(s, *input)
counted := CountWords(s, lines)
formatted := beam.ParDo(s, formatFn, counted)
textio.Write(s, *output, formatted)
What about windowed word count?
Trish Hamme
Christer van der Meeren
Flink windowed wordcount (Scala)
val counts: DataStream[(String, Int)] = text
  // split up the lines into pairs (2-tuples) containing: (word, 1)
  .flatMap(_.toLowerCase.split("\\W+"))
  .map((_, 1))
  .keyBy(0)
  // create windows of windowSize records slided every slideSize records
  .countWindow(windowSize, slideSize)
  // sum up tuple field "1"
  .sum(1)
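And for comparison, a hedged sketch (my addition) of a time-windowed wordcount in the Beam Python SDK; I/O and pipeline options are omitted, and `lines` is assumed to be a PCollection of timestamped strings.

import apache_beam as beam
from apache_beam.transforms import window

counts = (
    lines
    | "Split" >> beam.FlatMap(lambda line: line.lower().split())
    | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 60s windows
    | "PairWithOne" >> beam.Map(lambda w: (w, 1))
    | "CountPerWord" >> beam.CombinePerKey(sum)
)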
Multi-language pipelines?!?
● Totally! I mean kind-of. Spark is ok right?
○ Beam is working on it :) #comejoinus
○ Easier for Python calling Java
○ If SQL counts for sure. If not, yes*.
● Technically what happens when you use Scala Spark + Tensorflow
● And same with Beam, etc.
● Right now painful to write in practice in OSS land outside of libraries
(commercial vendors have some solutions but I stick to OSS when I can)
● So let’s focus on libraries (and look ahead at more general approaches)
Jennifer C.
A proof of concept: Sparkling ML
● A place for useful Spark ML pipeline stages to live
○ Including both feature transformers and estimators
● The why
○ Spark ML can’t keep up with every new algorithm
○ Spark ML is largely Scala based, cool Python tools for NLP exist as well
● Lots of cool ML on Spark tools exist, but many don’t play nice with Spark ML
or together.
● We make it easier to expose Python transformers into Scala land and vice
versa.
● Our repo is at: https://github.com/sparklingpandas/sparklingml
Supporting non-English languages
With spaCy, so you know, more than English*
def inner(inputString):
  # `lang`, `fields`, and `lookup_field_or_none` are defined in the
  # enclosing sparklingml code (not shown here)
  nlp = SpacyMagic.get(lang)
  def spacyTokenToDict(token):
    """Convert the input token into a dictionary"""
    return dict(map(lookup_field_or_none, fields))
  return list(map(spacyTokenToDict, list(nlp(inputString))))
And from the JVM:
val transformer = new SpacyTokenizePython()
transformer.setLang("en")
val input = spark.createDataset(
List(InputData("hi boo"), InputData("boo")))
transformer.setInputCol("input")
transformer.setOutputCol("output")
val result = transformer.transform(input).collect()
Alexy Khrabrov
How might this all play out?
● A portability layer like Beam could save us
○ And commoditize the execution engine layer
● One execution engine becomes good enough at everything
● Instead of a compiler-like unifier we see something like streaming SQL become our unifier
○ Personally you can pry my functions out of my cold dead hands but that’s all right
● People realize their big data problem is actually three small data problems in a trench coat
Shawn Carpenter
What does this mean for special systems?
● More special systems will run on top of the same general purpose engine
○ Important to work with general purpose engines to make sure they will serve the future special
needs.
● Think about inputs and outputs and how to share them nicely and efficiently
with other systems (please not just files)
● Some of them will become more like libraries
○ This is perhaps “less cool” than being a standalone system, but maybe better
● More collaboration hopefully
What does “sharing” look like?
val sparkSql = ...
// Start hive on the engine
HiveThriftServer2.startWithContext(sparkSql)
// Start snappy data
val snappy = new org.apache.spark.sql.SnappySession(sparkSql.sparkContext)
// Do cool things with both
...
Want to help?
● Lots of places to contribute
● We’d love help on Spark and or BEAM
○ Improved execution engine support in BEAM
○ More data sources & specialized format support (everywhere)
○ Better Python & non-JVM language support
● I’m sure Flink would too, but I’m less involved with them
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore, Jeff
Bezos needs another newspaper and I want a cup of
coffee.
http://bit.ly/hkHighPerfSpark
And some upcoming talks:
● June
○ Spark Summit SF (Accelerating Tensorflow & Accelerating Python +
Dependencies)
○ Scala Days NYC
○ FOSS Back Stage & BBuzz (Berlin)
● July
○ Curry On Amsterdam
○ OSCON Portland
● September
○ Strata NYC
● October
○ Reversim (Tel Aviv)
That’s all Folks!
Happy GDPR Friday! - May your last minute deploys
not break too much!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)