SlideShare una empresa de Scribd logo
1 de 42
Descargar para leer sin conexión
Apache Spark plus 
many other frameworks: 
How Spark fits into the Big Data landscape 
Big Data Everywhere - Chicago, 2014-10-01 
bigdataeverywhere.com/chicago/ 
Licensed under a Creative Commons Attribution-NonCommercial- 
NoDerivatives 4.0 International License
What is Spark?
What is Spark? 
Developed in 2009 at UC Berkeley AMPLab, then 
open sourced in 2010, Spark has since become 
one of the largest OSS communities in big data, 
with over 200 contributors in 50+ organizations 
spark.apache.org 
“Organizations that are looking at big data challenges – 
including collection, ETL, storage, exploration and analytics – 
should consider Spark for its in-memory performance and 
the breadth of its model. It supports advanced analytics 
solutions on Hadoop clusters, including the iterative model 
required for machine learning and graph analysis.” 
Gartner, Advanced Analytics and Data Science (2014)
What is Spark?
What is Spark? 
Spark Core is the general execution engine for the 
Spark platform that other functionality is built atop: 
! 
• in-memory computing capabilities deliver speed 
• general execution model supports wide variety 
of use cases 
• ease of development – native APIs in Java, Scala, 
Python (+ SQL, Clojure, R)
What is Spark? 
WordCount in 3 lines of Spark 
WordCount in 50+ lines of Java MR
What is Spark? 
Sustained exponential growth, as one of the most 
active Apache projects ohloh.net/orgs/apache
A Brief History
A Brief History: Functional Programming for Big Data 
Theory, Eight Decades Ago: 
what can be computed? 
Haskell Curry 
haskell.org 
Alonso Church 
wikipedia.org 
Praxis, Four Decades Ago: 
algebra for applicative systems 
John Backus 
acm.org 
David Turner 
wikipedia.org 
Reality, Two Decades Ago: 
machine data from web apps 
Pattie Maes 
MIT Media Lab
A Brief History: Functional Programming for Big Data 
circa late 1990s: 
explosive growth e-commerce and machine data 
implied that workloads could not fit on a single 
computer anymore… 
notable firms led the shift to horizontal scale-out 
on clusters of commodity hardware, especially 
for machine learning use cases at scale
A Brief History: Functional Programming for Big Data 
circa 2002: 
mitigate risk of large distributed workloads lost 
due to disk failures on commodity hardware… 
Google File System 
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung 
research.google.com/archive/gfs.html 
! 
MapReduce: Simplified Data Processing on Large Clusters 
Jeffrey Dean, Sanjay Ghemawat 
research.google.com/archive/mapreduce.html
A Brief History: Functional Programming for Big Data 
2002 
2004 
MapReduce paper 
2002 
MapReduce @ Google 
2004 2006 2008 2010 2012 2014 
2006 
Hadoop @ Yahoo! 
2014 
Apache Spark top-level 
2010 
Spark paper 
2008 
Hadoop Summit
A Brief History: Functional Programming for Big Data 
MapReduce 
Pregel Giraph 
Dremel Drill 
S4 Storm 
F1 
MillWheel 
General Batch Processing Specialized Systems: 
Impala 
GraphLab 
iterative, interactive, streaming, graph, etc. 
Tez 
MR doesn’t compose well for large applications, 
and so specialized systems emerged as workarounds
A Brief History: Functional Programming for Big Data 
circa 2010: 
a unified engine for enterprise data workflows, 
based on commodity hardware a decade later… 
Spark: Cluster Computing with Working Sets 
Matei Zaharia, Mosharaf Chowdhury, 
Michael Franklin, Scott Shenker, Ion Stoica 
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf 
! 
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for 
In-Memory Cluster Computing 
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, 
Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica 
usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
A Brief History: Functional Programming for Big Data 
In addition to simple map and reduce operations, 
Spark supports SQL queries, streaming data, and 
complex analytics such as machine learning and 
graph algorithms out-of-the-box. 
Better yet, combine these capabilities seamlessly 
into one integrated workflow…
TL;DR: Applicative Systems and Functional Programming – RDDs 
action value 
RDD 
RDD 
RDD 
transformations RDD 
// transformed RDDs! 
val errors = lines.filter(_.startsWith("ERROR"))! 
val messages = errors.map(_.split("t")).map(r => r(1))! 
messages.cache() 
// action 1! 
messages.filter(_.contains("mysql")).count()
TL;DR: Generational trade-offs for handling Big Compute 
Cheap 
Memory 
Cheap 
Storage 
Cheap 
Network 
recompute 
replicate 
reference 
(RDD) 
(DFS) 
(URI)
TL;DR: Big Compute in Applicative Systems, by the numbers… 
1. Express business logic in a preferred native 
language (Scala, Java, Python, Clojure, SQL, R, etc.) 
leveraging FP/closures 
2. Build a graph of what must be computed 
3. Rewrite the graph into stages using graph 
reduction to determine how to move predicates, 
what can be computed in parallel, where 
synchronization barriers are required, etc. 
(Wadsworth, Henderson, Turner, et al.) 
4. Handle synchronization using Akka and 
reactive programming, with an LRU 
to manage memory working sets 
5. Profit
TL;DR: Big Compute…Implications 
Of course, if you can define the structure of workloads 
in terms of abstract algebra, this all becomes much more 
interesting – having vast implications on machine learning 
at scale, IoT, industrial applications, optimization in general, 
etc., as we retool the industrial plant 
However, we’ll leave that for another talk… 
http://justenoughmath.com/
Unifying the Pieces
Unifying the Pieces: Spark SQL 
// http://spark.apache.org/docs/latest/sql-programming-guide.html! 
! 
val sqlContext = new org.apache.spark.sql.SQLContext(sc)! 
import sqlContext._! 
! 
// define the schema using a case class! 
case class Person(name: String, age: Int)! 
! 
// create an RDD of Person objects and register it as a table! 
val people = sc.textFile("examples/src/main/resources/ 
people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))! 
! 
people.registerAsTable("people")! 
! 
// SQL statements can be run using the SQL methods provided by sqlContext! 
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")! 
! 
// results of SQL queries are SchemaRDDs and support all the ! 
// normal RDD operations…! 
// columns of a row in the result can be accessed by ordinal! 
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
Unifying the Pieces: Spark Streaming 
// http://spark.apache.org/docs/latest/streaming-programming-guide.html! 
! 
import org.apache.spark.streaming._! 
import org.apache.spark.streaming.StreamingContext._! 
! 
// create a StreamingContext with a SparkConf configuration! 
val ssc = new StreamingContext(sparkConf, Seconds(10))! 
! 
// create a DStream that will connect to serverIP:serverPort! 
val lines = ssc.socketTextStream(serverIP, serverPort)! 
! 
// split each line into words! 
val words = lines.flatMap(_.split(" "))! 
! 
// count each word in each batch! 
val pairs = words.map(word => (word, 1))! 
val wordCounts = pairs.reduceByKey(_ + _)! 
! 
// print a few of the counts to the console! 
wordCounts.print()! 
! 
ssc.start() // start the computation! 
ssc.awaitTermination() // wait for the computation to terminate
MLI: An API for Distributed Machine Learning 
Evan Sparks, Ameet Talwalkar, et al. 
International Conference on Data Mining (2013) 
http://arxiv.org/abs/1310.5426 
Unifying the Pieces: MLlib 
// http://spark.apache.org/docs/latest/mllib-guide.html! 
! 
val train_data = // RDD of Vector! 
val model = KMeans.train(train_data, k=10)! 
! 
// evaluate the model! 
val test_data = // RDD of Vector! 
test_data.map(t => model.predict(t)).collect().foreach(println)!
Unifying the Pieces: GraphX 
// http://spark.apache.org/docs/latest/graphx-programming-guide.html! 
! 
import org.apache.spark.graphx._! 
import org.apache.spark.rdd.RDD! 
! 
case class Peep(name: String, age: Int)! 
! 
val vertexArray = Array(! 
(1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),! 
(3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),! 
(5L, Peep("Leslie", 45))! 
)! 
val edgeArray = Array(! 
Edge(2L, 1L, 7), Edge(2L, 4L, 2),! 
Edge(3L, 2L, 4), Edge(3L, 5L, 3),! 
Edge(4L, 1L, 1), Edge(5L, 3L, 9)! 
)! 
! 
val vertexRDD: RDD[(Long, Peep)] = sc.parallelize(vertexArray)! 
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)! 
val g: Graph[Peep, Int] = Graph(vertexRDD, edgeRDD)! 
! 
val results = g.triplets.filter(t => t.attr > 7)! 
! 
for (triplet <- results.collect) {! 
println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")! 
}
Unifying the Pieces: Summary 
Demo, if time permits (perhaps in the hallway): 
Twitter Streaming Language Classifier 
databricks.gitbooks.io/databricks-spark-reference-applications/ 
twitter_classifier/README.html 
! 
For many more Spark resources online, check: 
databricks.com/spark-training-resources
TL;DR: Engineering is about costs 
Sure, maybe you’ll squeeze slightly better performance 
by using many specialized systems… 
However, putting on an Eng Director hat, would you 
be also prepared to pay the corresponding costs of: 
• learning curves for your developers across 
several different frameworks 
• ops for several different kinds of clusters 
• maintenance + troubleshooting mission-critical 
apps across several systems 
• tech-debt for OSS that ignores the math (80 yrs!) 
plus the fundamental h/w trade-offs
Integrations
Spark Integrations: 
Discover 
Insights 
Clean Up 
Your Data 
Run 
Sophisticated 
Analytics 
Integrate With 
Many Other 
Systems 
Use Lots of Different 
Data Sources 
cloud-based notebooks… ETL… the Hadoop ecosystem… 
widespread use of PyData… advanced analytics in streaming… 
rich custom search… web apps for data APIs… 
low-latency + multi-tenancy…
Spark Integrations: Unified platform for building Big Data pipelines 
Databricks Cloud 
databricks.com/blog/2014/07/14/databricks-cloud-making- 
big-data-easy.html 
youtube.com/watch?v=dJQ5lV5Tldw#t=883
Spark Integrations: The proverbial Hadoop ecosystem 
Spark + Hadoop + HBase + etc. 
mapr.com/products/apache-spark 
vision.cloudera.com/apache-spark-in-the-apache-hadoop-ecosystem/ 
hortonworks.com/hadoop/spark/ 
databricks.com/blog/2014/05/23/pivotal-hadoop-integrates-the- 
full-apache-spark-stack.html 
unified compute 
hadoop ecosystem
Spark Integrations: Leverage widespread use of Python 
Spark + PyData 
spark-summit.org/2014/talk/A-platform-for-large-scale-neuroscience 
cwiki.apache.org/confluence/display/SPARK/PySpark+Internals 
unified compute 
Py Data
Spark Integrations: Advanced analytics for streaming use cases 
Kafka + Spark + Cassandra 
datastax.com/documentation/datastax_enterprise/4.5/ 
datastax_enterprise/spark/sparkIntro.html 
http://helenaedelson.com/?p=991 
github.com/datastax/spark-cassandra-connector 
github.com/dibbhatt/kafka-spark-consumer 
unified compute 
data streams columnar key-value
Spark Integrations: Rich search, immediate insights 
Spark + ElasticSearch 
databricks.com/blog/2014/06/27/application-spotlight-elasticsearch. 
unified compute 
html 
elasticsearch.org/guide/en/elasticsearch/hadoop/current/ 
spark.html 
spark-summit.org/2014/talk/streamlining-search-indexing- 
using-elastic-search-and-spark 
document search
Spark Integrations: Building data APIs with web apps 
Spark + Play 
typesafe.com/blog/apache-spark-and-the-typesafe-reactive- 
platform-a-match-made-in-heaven 
unified compute 
web apps
Spark Integrations: The case for multi-tenancy 
Spark + Mesos 
spark.apache.org/docs/latest/running-on-mesos.html 
+ Mesosphere + Google Cloud Platform 
ceteri.blogspot.com/2014/09/spark-atop-mesos-on-google-cloud.html 
unified compute 
cluster resources
Resources
certification: 
Apache Spark developer certificate program 
• http://oreilly.com/go/sparkcert 
• defined by Spark experts @Databricks 
• assessed by O’Reilly Media 
• preview @Strata NY
community: 
spark.apache.org/community.html 
video+slide archives: spark-summit.org 
local events: Spark Meetups Worldwide 
resources: databricks.com/spark-training-resources 
workshops: databricks.com/spark-training 
Intro to Spark 
Spark 
AppDev 
Spark 
DevOps 
Spark 
DataSci 
Distributed ML 
on Spark 
Streaming Apps 
on Spark 
Spark + 
Cassandra
books: 
Fast Data Processing 
with Spark 
Holden Karau 
Packt (2013) 
shop.oreilly.com/product/ 
9781782167068.do 
Spark in Action 
Chris Fregly 
Manning (2015*) 
sparkinaction.com/ 
Learning Spark 
Holden Karau, 
Andy Konwinski, 
Matei Zaharia 
O’Reilly (2015*) 
shop.oreilly.com/product/ 
0636920028512.do
events: 
Strata NY + Hadoop World 
NYC, Oct 15-17 
strataconf.com/stratany2014 
Big Data TechCon 
SF, Oct 27 
bigdatatechcon.com 
Strata EU 
Barcelona, Nov 19-21 
strataconf.com/strataeu2014 
Data Day Texas 
Austin, Jan 10 
datadaytexas.com 
Strata CA 
San Jose, Feb 18-20 
strataconf.com/strata2015 
Spark Summit East 
NYC, Mar 18-19 
spark-summit.org/east 
Spark Summit 2015 
SF, Jun 15-17 
spark-summit.org
presenter: 
monthly newsletter for updates, 
events, conf summaries, etc.: 
liber118.com/pxn/ 
Just Enough Math 
O’Reilly, 2014 
justenoughmath.com 
preview: youtu.be/TQ58cWgdCpA 
Enterprise Data Workflows 
with Cascading 
O’Reilly, 2013 
shop.oreilly.com/product/ 
0636920028536.do

Más contenido relacionado

La actualidad más candente

Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Databricks
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
Understanding Lucene Search Performance
Understanding Lucene Search PerformanceUnderstanding Lucene Search Performance
Understanding Lucene Search PerformanceLucidworks (Archived)
 
REST and JAX-RS
REST and JAX-RSREST and JAX-RS
REST and JAX-RSGuy Nir
 
Intro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the CloudIntro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the CloudDaniel Zivkovic
 
H2O Big Join Slides
H2O Big Join SlidesH2O Big Join Slides
H2O Big Join SlidesSri Ambati
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Holden Karau
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007Baruch Sadogursky
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Holden Karau
 
The Semantics of SPARQL
The Semantics of SPARQLThe Semantics of SPARQL
The Semantics of SPARQLOlaf Hartig
 
Pg nordic-day-2014-2 tb-enough
Pg nordic-day-2014-2 tb-enoughPg nordic-day-2014-2 tb-enough
Pg nordic-day-2014-2 tb-enoughRenaud Bruyeron
 
WebTech Tutorial Querying DBPedia
WebTech Tutorial Querying DBPediaWebTech Tutorial Querying DBPedia
WebTech Tutorial Querying DBPediaKatrien Verbert
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14Sri Ambati
 
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaSpeeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaDatabricks
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkDatabricks
 
Linking the world with Python and Semantics
Linking the world with Python and SemanticsLinking the world with Python and Semantics
Linking the world with Python and SemanticsTatiana Al-Chueyr
 
Velox at SF Data Mining Meetup
Velox at SF Data Mining MeetupVelox at SF Data Mining Meetup
Velox at SF Data Mining MeetupDan Crankshaw
 

La actualidad más candente (20)

Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Understanding Lucene Search Performance
Understanding Lucene Search PerformanceUnderstanding Lucene Search Performance
Understanding Lucene Search Performance
 
REST and JAX-RS
REST and JAX-RSREST and JAX-RS
REST and JAX-RS
 
Intro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the CloudIntro to PySpark: Python Data Analysis at scale in the Cloud
Intro to PySpark: Python Data Analysis at scale in the Cloud
 
H2O Big Join Slides
H2O Big Join SlidesH2O Big Join Slides
H2O Big Join Slides
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
 
The Semantics of SPARQL
The Semantics of SPARQLThe Semantics of SPARQL
The Semantics of SPARQL
 
Pg nordic-day-2014-2 tb-enough
Pg nordic-day-2014-2 tb-enoughPg nordic-day-2014-2 tb-enough
Pg nordic-day-2014-2 tb-enough
 
WebTech Tutorial Querying DBPedia
WebTech Tutorial Querying DBPediaWebTech Tutorial Querying DBPedia
WebTech Tutorial Querying DBPedia
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14
 
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaSpeeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
 
Bio2RDF@BH2010
Bio2RDF@BH2010Bio2RDF@BH2010
Bio2RDF@BH2010
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache Spark
 
Linking the world with Python and Semantics
Linking the world with Python and SemanticsLinking the world with Python and Semantics
Linking the world with Python and Semantics
 
Big data apache spark + scala
Big data   apache spark + scalaBig data   apache spark + scala
Big data apache spark + scala
 
Velox at SF Data Mining Meetup
Velox at SF Data Mining MeetupVelox at SF Data Mining Meetup
Velox at SF Data Mining Meetup
 

Destacado

หนังสือภาษาไทย Spark Internal
หนังสือภาษาไทย Spark Internalหนังสือภาษาไทย Spark Internal
หนังสือภาษาไทย Spark InternalBhuridech Sudsee
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
BDM32: AdamCloud Project - Part II
BDM32: AdamCloud Project - Part IIBDM32: AdamCloud Project - Part II
BDM32: AdamCloud Project - Part IIDavid Lauzon
 
BDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 DebriefingBDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 DebriefingDavid Lauzon
 
QCon2016--Drive Best Spark Performance on AI
QCon2016--Drive Best Spark Performance on AIQCon2016--Drive Best Spark Performance on AI
QCon2016--Drive Best Spark Performance on AILex Yu
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZDataFactZ
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceSachin Aggarwal
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKTaposh Roy
 
Cassandra Data Maintenance with Spark
Cassandra Data Maintenance with SparkCassandra Data Maintenance with Spark
Cassandra Data Maintenance with SparkDataStax Academy
 
Make 2016 your year of SMACK talk
Make 2016 your year of SMACK talkMake 2016 your year of SMACK talk
Make 2016 your year of SMACK talkDataStax Academy
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Databricks
 
Spark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark Summit
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internalsAnton Kirillov
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Databricks
 

Destacado (20)

หนังสือภาษาไทย Spark Internal
หนังสือภาษาไทย Spark Internalหนังสือภาษาไทย Spark Internal
หนังสือภาษาไทย Spark Internal
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
BDM32: AdamCloud Project - Part II
BDM32: AdamCloud Project - Part IIBDM32: AdamCloud Project - Part II
BDM32: AdamCloud Project - Part II
 
BDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 DebriefingBDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 Debriefing
 
QCon2016--Drive Best Spark Performance on AI
QCon2016--Drive Best Spark Performance on AIQCon2016--Drive Best Spark Performance on AI
QCon2016--Drive Best Spark Performance on AI
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Fun[ctional] spark with scala
Fun[ctional] spark with scalaFun[ctional] spark with scala
Fun[ctional] spark with scala
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
 
Apache Spark
Apache Spark Apache Spark
Apache Spark
 
Cassandra Data Maintenance with Spark
Cassandra Data Maintenance with SparkCassandra Data Maintenance with Spark
Cassandra Data Maintenance with Spark
 
Make 2016 your year of SMACK talk
Make 2016 your year of SMACK talkMake 2016 your year of SMACK talk
Make 2016 your year of SMACK talk
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Spark Deep Dive
Spark Deep DiveSpark Deep Dive
Spark Deep Dive
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
 
Spark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas Dinsmore
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 

Similar a Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How Spark Fits Into the Big Data Landscape (Databricks)

How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?Paco Nathan
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEPaco Nathan
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkIke Ellis
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic WebIvan Herman
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Sri Ambati
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 

Similar a Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How Spark Fits Into the Big Data Landscape (Databricks) (20)

How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic Web
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 

Más de BigDataEverywhere

Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)BigDataEverywhere
 
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...BigDataEverywhere
 
Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protect...
Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protect...Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protect...
Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protect...BigDataEverywhere
 
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant) Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant) BigDataEverywhere
 
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...BigDataEverywhere
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop BigDataEverywhere
 
Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...
Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...
Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...BigDataEverywhere
 

Más de BigDataEverywhere (7)

Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
 
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
 
Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protect...
Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protect...Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protect...
Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protect...
 
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant) Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
 
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop
 
Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...
Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...
Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...
 

Último

How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 

Último (20)

How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 

Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How Spark Fits Into the Big Data Landscape (Databricks)

  • 1. Apache Spark plus many other frameworks: How Spark fits into the Big Data landscape Big Data Everywhere - Chicago, 2014-10-01 bigdataeverywhere.com/chicago/ Licensed under a Creative Commons Attribution-NonCommercial- NoDerivatives 4.0 International License
  • 3. What is Spark? Developed in 2009 at UC Berkeley AMPLab, then open sourced in 2010, Spark has since become one of the largest OSS communities in big data, with over 200 contributors in 50+ organizations spark.apache.org “Organizations that are looking at big data challenges – including collection, ETL, storage, exploration and analytics – should consider Spark for its in-memory performance and the breadth of its model. It supports advanced analytics solutions on Hadoop clusters, including the iterative model required for machine learning and graph analysis.” Gartner, Advanced Analytics and Data Science (2014)
  • 5. What is Spark? Spark Core is the general execution engine for the Spark platform that other functionality is built atop: ! • in-memory computing capabilities deliver speed • general execution model supports wide variety of use cases • ease of development – native APIs in Java, Scala, Python (+ SQL, Clojure, R)
  • 6. What is Spark? WordCount in 3 lines of Spark WordCount in 50+ lines of Java MR
  • 7. What is Spark? Sustained exponential growth, as one of the most active Apache projects ohloh.net/orgs/apache
  • 8.
  • 10. A Brief History: Functional Programming for Big Data Theory, Eight Decades Ago: what can be computed? Haskell Curry haskell.org Alonso Church wikipedia.org Praxis, Four Decades Ago: algebra for applicative systems John Backus acm.org David Turner wikipedia.org Reality, Two Decades Ago: machine data from web apps Pattie Maes MIT Media Lab
  • 11. A Brief History: Functional Programming for Big Data circa late 1990s: explosive growth e-commerce and machine data implied that workloads could not fit on a single computer anymore… notable firms led the shift to horizontal scale-out on clusters of commodity hardware, especially for machine learning use cases at scale
  • 12. A Brief History: Functional Programming for Big Data circa 2002: mitigate risk of large distributed workloads lost due to disk failures on commodity hardware… Google File System Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung research.google.com/archive/gfs.html ! MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean, Sanjay Ghemawat research.google.com/archive/mapreduce.html
  • 13. A Brief History: Functional Programming for Big Data 2002 2004 MapReduce paper 2002 MapReduce @ Google 2004 2006 2008 2010 2012 2014 2006 Hadoop @ Yahoo! 2014 Apache Spark top-level 2010 Spark paper 2008 Hadoop Summit
  • 14. A Brief History: Functional Programming for Big Data MapReduce Pregel Giraph Dremel Drill S4 Storm F1 MillWheel General Batch Processing Specialized Systems: Impala GraphLab iterative, interactive, streaming, graph, etc. Tez MR doesn’t compose well for large applications, and so specialized systems emerged as workarounds
  • 15. A Brief History: Functional Programming for Big Data circa 2010: a unified engine for enterprise data workflows, based on commodity hardware a decade later… Spark: Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf ! Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
  • 16. A Brief History: Functional Programming for Big Data In addition to simple map and reduce operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box. Better yet, combine these capabilities seamlessly into one integrated workflow…
  • 17. TL;DR: Applicative Systems and Functional Programming – RDDs action value RDD RDD RDD transformations RDD // transformed RDDs! val errors = lines.filter(_.startsWith("ERROR"))! val messages = errors.map(_.split("t")).map(r => r(1))! messages.cache() // action 1! messages.filter(_.contains("mysql")).count()
  • 18. TL;DR: Generational trade-offs for handling Big Compute Cheap Memory Cheap Storage Cheap Network recompute replicate reference (RDD) (DFS) (URI)
  • 19. TL;DR: Big Compute in Applicative Systems, by the numbers… 1. Express business logic in a preferred native language (Scala, Java, Python, Clojure, SQL, R, etc.) leveraging FP/closures 2. Build a graph of what must be computed 3. Rewrite the graph into stages using graph reduction to determine how to move predicates, what can be computed in parallel, where synchronization barriers are required, etc. (Wadsworth, Henderson, Turner, et al.) 4. Handle synchronization using Akka and reactive programming, with an LRU to manage memory working sets 5. Profit
  • 20. TL;DR: Big Compute…Implications Of course, if you can define the structure of workloads in terms of abstract algebra, this all becomes much more interesting – having vast implications on machine learning at scale, IoT, industrial applications, optimization in general, etc., as we retool the industrial plant However, we’ll leave that for another talk… http://justenoughmath.com/
  • 22. Unifying the Pieces: Spark SQL // http://spark.apache.org/docs/latest/sql-programming-guide.html! ! val sqlContext = new org.apache.spark.sql.SQLContext(sc)! import sqlContext._! ! // define the schema using a case class! case class Person(name: String, age: Int)! ! // create an RDD of Person objects and register it as a table! val people = sc.textFile("examples/src/main/resources/ people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))! ! people.registerAsTable("people")! ! // SQL statements can be run using the SQL methods provided by sqlContext! val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")! ! // results of SQL queries are SchemaRDDs and support all the ! // normal RDD operations…! // columns of a row in the result can be accessed by ordinal! teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
  • 23. Unifying the Pieces: Spark Streaming // http://spark.apache.org/docs/latest/streaming-programming-guide.html! ! import org.apache.spark.streaming._! import org.apache.spark.streaming.StreamingContext._! ! // create a StreamingContext with a SparkConf configuration! val ssc = new StreamingContext(sparkConf, Seconds(10))! ! // create a DStream that will connect to serverIP:serverPort! val lines = ssc.socketTextStream(serverIP, serverPort)! ! // split each line into words! val words = lines.flatMap(_.split(" "))! ! // count each word in each batch! val pairs = words.map(word => (word, 1))! val wordCounts = pairs.reduceByKey(_ + _)! ! // print a few of the counts to the console! wordCounts.print()! ! ssc.start() // start the computation! ssc.awaitTermination() // wait for the computation to terminate
  • 24. MLI: An API for Distributed Machine Learning Evan Sparks, Ameet Talwalkar, et al. International Conference on Data Mining (2013) http://arxiv.org/abs/1310.5426 Unifying the Pieces: MLlib // http://spark.apache.org/docs/latest/mllib-guide.html! ! val train_data = // RDD of Vector! val model = KMeans.train(train_data, k=10)! ! // evaluate the model! val test_data = // RDD of Vector! test_data.map(t => model.predict(t)).collect().foreach(println)!
  • 25. Unifying the Pieces: GraphX // http://spark.apache.org/docs/latest/graphx-programming-guide.html! ! import org.apache.spark.graphx._! import org.apache.spark.rdd.RDD! ! case class Peep(name: String, age: Int)! ! val vertexArray = Array(! (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),! (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),! (5L, Peep("Leslie", 45))! )! val edgeArray = Array(! Edge(2L, 1L, 7), Edge(2L, 4L, 2),! Edge(3L, 2L, 4), Edge(3L, 5L, 3),! Edge(4L, 1L, 1), Edge(5L, 3L, 9)! )! ! val vertexRDD: RDD[(Long, Peep)] = sc.parallelize(vertexArray)! val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)! val g: Graph[Peep, Int] = Graph(vertexRDD, edgeRDD)! ! val results = g.triplets.filter(t => t.attr > 7)! ! for (triplet <- results.collect) {! println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")! }
  • 26. Unifying the Pieces: Summary Demo, if time permits (perhaps in the hallway): Twitter Streaming Language Classifier databricks.gitbooks.io/databricks-spark-reference-applications/ twitter_classifier/README.html ! For many more Spark resources online, check: databricks.com/spark-training-resources
  • 27. TL;DR: Engineering is about costs Sure, maybe you’ll squeeze slightly better performance by using many specialized systems… However, putting on an Eng Director hat, would you be also prepared to pay the corresponding costs of: • learning curves for your developers across several different frameworks • ops for several different kinds of clusters • maintenance + troubleshooting mission-critical apps across several systems • tech-debt for OSS that ignores the math (80 yrs!) plus the fundamental h/w trade-offs
  • 29. Spark Integrations: Discover Insights Clean Up Your Data Run Sophisticated Analytics Integrate With Many Other Systems Use Lots of Different Data Sources cloud-based notebooks… ETL… the Hadoop ecosystem… widespread use of PyData… advanced analytics in streaming… rich custom search… web apps for data APIs… low-latency + multi-tenancy…
  • 30. Spark Integrations: Unified platform for building Big Data pipelines Databricks Cloud databricks.com/blog/2014/07/14/databricks-cloud-making- big-data-easy.html youtube.com/watch?v=dJQ5lV5Tldw#t=883
  • 31. Spark Integrations: The proverbial Hadoop ecosystem Spark + Hadoop + HBase + etc. mapr.com/products/apache-spark vision.cloudera.com/apache-spark-in-the-apache-hadoop-ecosystem/ hortonworks.com/hadoop/spark/ databricks.com/blog/2014/05/23/pivotal-hadoop-integrates-the- full-apache-spark-stack.html unified compute hadoop ecosystem
  • 32. Spark Integrations: Leverage widespread use of Python Spark + PyData spark-summit.org/2014/talk/A-platform-for-large-scale-neuroscience cwiki.apache.org/confluence/display/SPARK/PySpark+Internals unified compute Py Data
  • 33. Spark Integrations: Advanced analytics for streaming use cases Kafka + Spark + Cassandra datastax.com/documentation/datastax_enterprise/4.5/ datastax_enterprise/spark/sparkIntro.html http://helenaedelson.com/?p=991 github.com/datastax/spark-cassandra-connector github.com/dibbhatt/kafka-spark-consumer unified compute data streams columnar key-value
  • 34. Spark Integrations: Rich search, immediate insights Spark + ElasticSearch databricks.com/blog/2014/06/27/application-spotlight-elasticsearch. unified compute html elasticsearch.org/guide/en/elasticsearch/hadoop/current/ spark.html spark-summit.org/2014/talk/streamlining-search-indexing- using-elastic-search-and-spark document search
  • 35. Spark Integrations: Building data APIs with web apps Spark + Play typesafe.com/blog/apache-spark-and-the-typesafe-reactive- platform-a-match-made-in-heaven unified compute web apps
  • 36. Spark Integrations: The case for multi-tenancy Spark + Mesos spark.apache.org/docs/latest/running-on-mesos.html + Mesosphere + Google Cloud Platform ceteri.blogspot.com/2014/09/spark-atop-mesos-on-google-cloud.html unified compute cluster resources
  • 38. certification: Apache Spark developer certificate program • http://oreilly.com/go/sparkcert • defined by Spark experts @Databricks • assessed by O’Reilly Media • preview @Strata NY
  • 39. community: spark.apache.org/community.html video+slide archives: spark-summit.org local events: Spark Meetups Worldwide resources: databricks.com/spark-training-resources workshops: databricks.com/spark-training Intro to Spark Spark AppDev Spark DevOps Spark DataSci Distributed ML on Spark Streaming Apps on Spark Spark + Cassandra
  • 40. books: Fast Data Processing with Spark Holden Karau Packt (2013) shop.oreilly.com/product/ 9781782167068.do Spark in Action Chris Fregly Manning (2015*) sparkinaction.com/ Learning Spark Holden Karau, Andy Konwinski, Matei Zaharia O’Reilly (2015*) shop.oreilly.com/product/ 0636920028512.do
  • 41. events: Strata NY + Hadoop World NYC, Oct 15-17 strataconf.com/stratany2014 Big Data TechCon SF, Oct 27 bigdatatechcon.com Strata EU Barcelona, Nov 19-21 strataconf.com/strataeu2014 Data Day Texas Austin, Jan 10 datadaytexas.com Strata CA San Jose, Feb 18-20 strataconf.com/strata2015 Spark Summit East NYC, Mar 18-19 spark-summit.org/east Spark Summit 2015 SF, Jun 15-17 spark-summit.org
  • 42. presenter: monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/ Just Enough Math O’Reilly, 2014 justenoughmath.com preview: youtu.be/TQ58cWgdCpA Enterprise Data Workflows with Cascading O’Reilly, 2013 shop.oreilly.com/product/ 0636920028536.do