Apache Spark & Elasticsearch 
Holden Karau - UMD 2014 
Now with 
delicious 
Spark SQL*
Who am I? 
Holden Karau 
● Software Engineer @ Databricks 
● I’ve worked with Elasticsearch before 
● I prefer she/her for pronouns 
● Author of a book on Spark and co-writing another 
● github https://github.com/holdenk 
○ Has all of the code from this talk :) 
● e-mail holden@databricks.com 
● twitter: @holdenkarau 
● linkedin: https://www.linkedin.com/in/holdenkarau
What is Elasticsearch? 
● Lucene-based distributed search system 
● Powerful tokenizing, stemming & other IR tools 
● Geographic query support 
● Capable of scaling to many nodes
Elasticsearch
Talk overview 
Goal: understand how to work with ES & Spark 
● Spark & Spark streaming let us re-use indexing code 
● We can customize the ES connector to write to the shard 
based on partition 
● Illustrate with twitter & show top tags per region 
● Maybe a live demo of the above demo* 
Assumptions: 
● Familiar(ish) with Search 
● Can read Scala 
Things you don’t have to worry about: 
● All the code is on-line, so don’t worry if you miss 
some 
*If we have extra time at the end
Spark + Elasticsearch 
● We can index our data on-line & off-line 
● Gain the power to query our data 
○ based on location 
○ free text search 
○ etc. 
[Diagram: Twitter → Spark Streaming → Elasticsearch; Spark Query: Top Hash Tags; Spark Re-Indexing ← Twitter]
Why should you care? 
Small differences between off-line and on-line 
Spot the difference picture from http://en.wikipedia.org/wiki/Spot_the_difference#mediaviewer/File: 
Spot_the_difference.png
Cat picture from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Let's start with the on-line pipeline 
val ssc = new StreamingContext(master, "IndexTweetsLive", 
Seconds(1)) 
val tweets = TwitterUtils.createStream(ssc, None)
Let's get ready to write the data into 
Elasticsearch 
Photo by Cloned Milkmen
Let's get ready to write the data into 
Elasticsearch 
def setupEsOnSparkContext(sc: SparkContext) = { 
  val jobConf = new JobConf(sc.hadoopConfiguration) 
  jobConf.set("mapred.output.format.class", 
    "org.elasticsearch.hadoop.mr.EsOutputFormat") 
  jobConf.setOutputCommitter(classOf[FileOutputCommitter]) 
  jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, 
    "twitter/tweet") 
  FileOutputFormat.setOutputPath(jobConf, new Path("-")) 
  jobConf 
}
Add a schema 
curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d ' 
{ 
"tweet" : { 
"properties" : { 
"message" : {"type" : "string"}, 
"hashTags" : {"type" : "string"}, 
"location" : {"type" : "geo_point"} 
} 
} 
}'
Let's format our tweets 
def prepareTweets(tweet: twitter4j.Status) = { 
  … 
  val hashTags = tweet.getHashtagEntities().map(_.getText()) 
  val fields = HashMap( 
    "docid" -> tweet.getId().toString, 
    "message" -> tweet.getText(), 
    "hashTags" -> hashTags.mkString(" "), 
    "location" -> s"$lat,$lon" 
  ) 
  // Convert to HadoopWritable types 
  mapToOutput(fields) 
}
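The elided part of prepareTweets produces the lat/lon pair that gets interpolated into the geo_point field. A minimal sketch of just that formatting step, using a hypothetical helper on a plain (lat, lon) pair rather than a real twitter4j.Status:

```scala
// Hypothetical helper: formats an optional (lat, lon) pair the way
// prepareTweets interpolates s"$lat,$lon" for the ES geo_point field.
// A real implementation would read tweet.getGeoLocation() (twitter4j).
def formatLocation(coords: Option[(Double, Double)]): String =
  coords match {
    case Some((lat, lon)) => s"$lat,$lon"
    case None             => ""  // tweets without geo data get no location
  }

// Elasticsearch's geo_point type accepts a "lat,lon" string.
val loc = formatLocation(Some((38.98, -76.93)))  // "38.98,-76.93"
```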
And save them... 
tweets.foreachRDD{(tweetRDD, time) => 
val sc = tweetRDD.context 
// The jobConf isn't serializable so we create it here 
val jobConf = SharedESConfig.setupEsOnSparkContext(sc, 
esResource, Some(esNodes)) 
// Convert our tweets to something that can be indexed 
val tweetsAsMap = tweetRDD.map( 
SharedIndex.prepareTweets) 
tweetsAsMap.saveAsHadoopDataset(jobConf) 
}
Elasticsearch now has 
Spark SQL! 
case class tweetCS(docid: String, message: String, 
  hashTags: String, location: Option[String])
Format as a Schema RDD 
def prepareTweetsCaseClass(tweet: twitter4j.Status) = { 
  val hashTags = tweet.getHashtagEntities().map(_.getText()).mkString(" ") 
  tweetCS(tweet.getId().toString, tweet.getText(), 
    hashTags, 
    tweet.getGeoLocation() match { 
      case null => None 
      case loc => { 
        val lat = loc.getLatitude() 
        val lon = loc.getLongitude() 
        Some(s"$lat,$lon") 
      } 
    }) 
}
And save them… with SQL 
tweets.foreachRDD{(tweetRDD, time) => 
val sc = tweetRDD.context 
// The jobConf isn't serializable so we create it here 
val sqlCtx = new SQLContext(sc) 
import sqlCtx.createSchemaRDD 
val tweetsAsCS = createSchemaRDD( 
tweetRDD.map(SharedIndex.prepareTweetsCaseClass)) 
tweetsAsCS.saveToEs(esResource) 
}
Now let’s query them! 
{"filtered" : { 
  "query" : { "match_all" : {} }, 
  "filter" : { 
    "geo_distance" : { 
      "distance" : "${dist}km", 
      "location" : { 
        "lat" : "${lat}", 
        "lon" : "${lon}" 
      } 
    } 
  } 
}}
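The ${dist}, ${lat} and ${lon} placeholders suggest the query string is built with Scala string interpolation before being handed to es.query. A sketch under that assumption (the geoQuery helper name is hypothetical):

```scala
// Build the geo-distance filter query as a JSON string via interpolation.
// dist is in kilometres; lat/lon give the centre of the search circle.
def geoQuery(dist: Int, lat: Double, lon: Double): String =
  s"""{"filtered" : {
     |  "query" : { "match_all" : {} },
     |  "filter" : {
     |    "geo_distance" : {
     |      "distance" : "${dist}km",
     |      "location" : { "lat" : "$lat", "lon" : "$lon" }
     |    }
     |  }
     |}}""".stripMargin

// e.g. tweets within 100km of College Park, MD:
val query = geoQuery(100, 38.98, -76.93)
```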
Now let’s find the hash tags :) 
// Set our query 
jobConf.set("es.query", query) 
// Create an RDD of the tweets 
val currentTweets = sc.hadoopRDD(jobConf, 
classOf[EsInputFormat[Object, MapWritable]], 
classOf[Object], classOf[MapWritable]) 
// Convert to a format we can work with 
val tweets = currentTweets.map{ case (key, value) => 
SharedIndex.mapWritableToInput(value) } 
// Extract the hashtags 
val hashTags = tweets.flatMap{t => 
t.getOrElse("hashTags", "").split(" ") 
}
and extract the top hashtags 
object WordCountOrdering extends Ordering[(String, Int)]{ 
def compare(a: (String, Int), b: (String, Int)) = { 
b._2 compare a._2 
} 
} 
val ht = hashTags.map(x => (x, 1)).reduceByKey((x,y) => x+y) 
val topTags = ht.takeOrdered(40)(WordCountOrdering)
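The reduceByKey/takeOrdered pipeline is just a word count with a descending ordering; the same logic runs on a plain Scala collection with no Spark cluster needed (the sample tags here are made up for illustration):

```scala
// Same top-N-by-count logic as the RDD version, on a local Seq.
object WordCountOrdering extends Ordering[(String, Int)] {
  // Descending by count: b compared against a.
  def compare(a: (String, Int), b: (String, Int)) = b._2 compare a._2
}

val hashTags = Seq("#spark", "#es", "#spark", "#spark", "#es", "#scala")
// groupBy + size mirrors map(x => (x, 1)).reduceByKey(_ + _)
val counts = hashTags.groupBy(identity).map { case (tag, occ) =>
  (tag, occ.size)
}
// sorted(WordCountOrdering).take(n) mirrors takeOrdered(n)(WordCountOrdering)
val topTags = counts.toSeq.sorted(WordCountOrdering).take(2)
// topTags == Seq(("#spark", 3), ("#es", 2))
```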
or with SQL... 
// Create a Schema RDD of the tweets 
val currentTweets = sqlCtx.esRDD(esResource, query) 
// Extract the hashtags. We could do this in a 
// more SQL way but I’m more comfortable in Scala 
val hashTags = currentTweets.select('hashTags).flatMap{t => 
  t.getString(0).split(" ") 
} 
*We used a customized connector to handle location information
NYC                      SF 
#MEX,11                  #Job,6 
#Job,11                  #Jobs,5 
#Jobs,8                  #MEX,4 
#nyc,7                   #TweetMyJobs,3 
#CarterFollowMe,7        #TeenWolfSeason4,2 
#Mexico,6                #CRO,2 
#BRA,6                   #Roseville,2 
#selfie,5                #Healthcare,2 
#TweetMyJobs,5           #GOT7COMEBACK,2 
#LHHATL,5                #autodeskinterns,2 
#NYC,5 
#ETnow,4 
#TeenWolf,4 
#CRO,4
UMD          SF 
I,24         to,14 
to,18        in,13 
a,13         the,11 
the,9        of,11 
me,8         a,9 
and,7        I,9 
in,7         and,8 
you,6        you,6 
my,5         my,6 
that,5       for,5 
for,5        our,5 
is,5 
of,5 
it,5 

I didn't have enough time to index anything fun :(
Indexing Part 2 
(electric boogaloo) 
Writing directly to a node with the correct shard saves us network overhead 
Screen shot of elasticsearch-head http://mobz.github.io/elasticsearch-head/
So what does that give us? 
Spark sets the filename to part-[part #] 
If we use the same partitioner we can write 
directly 
[Diagram: Partitions 1 & 2 → ES Node 1; Partition 3 → ES Node 2]
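The shard-aware write boils down to routing each Spark partition to the node that owns the matching shard. A toy sketch of that routing decision, assuming shard i lives on node (i % numNodes); the real customized connector looks up the cluster's actual shard allocation instead:

```scala
// Hypothetical routing: map a Spark partition number to an ES node.
// With matching partitioners, each partition writes straight to the
// node hosting its shard, avoiding an extra network hop.
def nodeForPartition(partition: Int, nodes: Seq[String]): String =
  nodes(partition % nodes.size)

val nodes = Seq("es-node-1", "es-node-2")
val route0 = nodeForPartition(0, nodes)  // "es-node-1"
val route1 = nodeForPartition(1, nodes)  // "es-node-2"
val route2 = nodeForPartition(2, nodes)  // wraps back to "es-node-1"
```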
Re-index all the things* 
// Fetch them from twitter 
val t4jt = tweets.flatMap{ tweet => 
val twitter = TwitterFactory.getSingleton() 
val tweetID = tweet.getOrElse("docid", "") 
Option(twitter.showStatus(tweetID.toLong)) 
} 
t4jt.map(SharedIndex.prepareTweets) 
.saveAsHadoopDataset(jobConf) 
*Until you hit your twitter rate limit…. oops
Demo time! 
my ip is 10.109.32.18 
“Useful” links 
● Feedback holden@databricks.com 
● My twitter: https://twitter.com/holdenkarau 
● Customized ES connector*: https://github.com/holdenk/elasticsearch-hadoop 
● Demo code: https://github.com/holdenk/elasticsearchspark 
● Elasticsearch: http://www.elasticsearch.org/ 
● Spark: http://spark.apache.org/ 
● Spark streaming: http://spark.apache.org/streaming/ 
● Elasticsearch Spark documentation: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html 
● http://databricks.com/blog/2014/06/27/application-spotlight-elasticsearch.html
So what did we cover? 
● Indexing data with Spark to Elasticsearch 
● Sharing indexing code between Spark & Spark Streaming 
● Using Elasticsearch for geolocal** data in Spark 
● Making our indexing aware of Elasticsearch 
● Lots* of cat pictures 
* There were more before.
Cat photo from https://www.flickr.com/photos/deerwooduk/579761138/in/photolist-4GCc4z-4GCbAV-6Ls27-34evHS-5UBnJv-TeqMG-4iNNn5-4w7s61- 
6GMLYS-6H5QWY-6aJLUT-tqfrf-6mJ1Lr-84kGX-6mJ1GB-vVqN6-dY8aj5-y3jK-7C7P8Z-azEtd/

Más contenido relacionado

La actualidad más candente

Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Holden Karau
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Holden Karau
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Holden Karau
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Holden Karau
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopHolden Karau
 
How ElasticSearch lives in my DevOps life
How ElasticSearch lives in my DevOps lifeHow ElasticSearch lives in my DevOps life
How ElasticSearch lives in my DevOps life琛琳 饶
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Holden Karau
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Beyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauBeyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauSpark Summit
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Holden Karau
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Holden Karau
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark MLHolden Karau
 
Scaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextScaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextRafał Kuć
 
Analyzing Log Data With Apache Spark
Analyzing Log Data With Apache SparkAnalyzing Log Data With Apache Spark
Analyzing Log Data With Apache SparkSpark Summit
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Holden Karau
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...Holden Karau
 

La actualidad más candente (20)

Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
 
How ElasticSearch lives in my DevOps life
How ElasticSearch lives in my DevOps lifeHow ElasticSearch lives in my DevOps life
How ElasticSearch lives in my DevOps life
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Beyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauBeyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden Karau
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
 
Python database interfaces
Python database  interfacesPython database  interfaces
Python database interfaces
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
 
Scaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextScaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - Sematext
 
Analyzing Log Data With Apache Spark
Analyzing Log Data With Apache SparkAnalyzing Log Data With Apache Spark
Analyzing Log Data With Apache Spark
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 

Destacado

Talkin' to my Generation: How to Market to Baby Boomers and Beyond Using Soci...
Talkin' to my Generation: How to Market to Baby Boomers and Beyond Using Soci...Talkin' to my Generation: How to Market to Baby Boomers and Beyond Using Soci...
Talkin' to my Generation: How to Market to Baby Boomers and Beyond Using Soci...AWCConnect
 
El 23 de diciembre
El 23 de diciembreEl 23 de diciembre
El 23 de diciembrelroczey
 
El 3 de enero
El 3 de eneroEl 3 de enero
El 3 de enerolroczey
 
Bio movie chapter 37
Bio movie  chapter 37Bio movie  chapter 37
Bio movie chapter 37allybove
 
El 6 de febrero
El 6 de febreroEl 6 de febrero
El 6 de febrerolroczey
 
Cover Letter and Resume
Cover Letter and ResumeCover Letter and Resume
Cover Letter and Resumekhiara_albaran
 
The 2012 Project: A Year of Opportunity for Women -- March Webinar
The 2012 Project: A Year of Opportunity for Women -- March WebinarThe 2012 Project: A Year of Opportunity for Women -- March Webinar
The 2012 Project: A Year of Opportunity for Women -- March WebinarAWCConnect
 
O voceiro do piñeiro manso 2014
O voceiro do piñeiro manso 2014O voceiro do piñeiro manso 2014
O voceiro do piñeiro manso 2014trasnoparoleiro
 
El 24 de febrero
El 24 de febreroEl 24 de febrero
El 24 de febrerolroczey
 
Play & Learn Toys Holiday 2011 Gift Guide
Play & Learn Toys Holiday 2011 Gift GuidePlay & Learn Toys Holiday 2011 Gift Guide
Play & Learn Toys Holiday 2011 Gift GuideSoConnected
 
El primero de marzo
El primero de marzoEl primero de marzo
El primero de marzolroczey
 
El 2 de enero
El 2 de eneroEl 2 de enero
El 2 de enerolroczey
 
El 7 de febrero
El 7 de febreroEl 7 de febrero
El 7 de febrerolroczey
 
El 10 de enero
El 10 de eneroEl 10 de enero
El 10 de enerolroczey
 
Gramer book #2
Gramer book #2Gramer book #2
Gramer book #2carlielynn
 
IGPS I Assignment 4: Overarching Presentation
IGPS I Assignment 4: Overarching PresentationIGPS I Assignment 4: Overarching Presentation
IGPS I Assignment 4: Overarching Presentationze1337
 
Cg's senior project proposal approval pp
Cg's senior project proposal approval ppCg's senior project proposal approval pp
Cg's senior project proposal approval ppcarlielynn
 

Destacado (20)

Talkin' to my Generation: How to Market to Baby Boomers and Beyond Using Soci...
Talkin' to my Generation: How to Market to Baby Boomers and Beyond Using Soci...Talkin' to my Generation: How to Market to Baby Boomers and Beyond Using Soci...
Talkin' to my Generation: How to Market to Baby Boomers and Beyond Using Soci...
 
El 23 de diciembre
El 23 de diciembreEl 23 de diciembre
El 23 de diciembre
 
El 3 de enero
El 3 de eneroEl 3 de enero
El 3 de enero
 
Bio movie chapter 37
Bio movie  chapter 37Bio movie  chapter 37
Bio movie chapter 37
 
El 6 de febrero
El 6 de febreroEl 6 de febrero
El 6 de febrero
 
Cover Letter and Resume
Cover Letter and ResumeCover Letter and Resume
Cover Letter and Resume
 
The 2012 Project: A Year of Opportunity for Women -- March Webinar
The 2012 Project: A Year of Opportunity for Women -- March WebinarThe 2012 Project: A Year of Opportunity for Women -- March Webinar
The 2012 Project: A Year of Opportunity for Women -- March Webinar
 
O voceiro do piñeiro manso 2014
O voceiro do piñeiro manso 2014O voceiro do piñeiro manso 2014
O voceiro do piñeiro manso 2014
 
El 24 de febrero
El 24 de febreroEl 24 de febrero
El 24 de febrero
 
Play & Learn Toys Holiday 2011 Gift Guide
Play & Learn Toys Holiday 2011 Gift GuidePlay & Learn Toys Holiday 2011 Gift Guide
Play & Learn Toys Holiday 2011 Gift Guide
 
El primero de marzo
El primero de marzoEl primero de marzo
El primero de marzo
 
El 2 de enero
El 2 de eneroEl 2 de enero
El 2 de enero
 
Experience
Experience Experience
Experience
 
El 7 de febrero
El 7 de febreroEl 7 de febrero
El 7 de febrero
 
Presentation 1
Presentation 1Presentation 1
Presentation 1
 
El 10 de enero
El 10 de eneroEl 10 de enero
El 10 de enero
 
C de campo2012
C de campo2012C de campo2012
C de campo2012
 
Gramer book #2
Gramer book #2Gramer book #2
Gramer book #2
 
IGPS I Assignment 4: Overarching Presentation
IGPS I Assignment 4: Overarching PresentationIGPS I Assignment 4: Overarching Presentation
IGPS I Assignment 4: Overarching Presentation
 
Cg's senior project proposal approval pp
Cg's senior project proposal approval ppCg's senior project proposal approval pp
Cg's senior project proposal approval pp
 

Similar a Spark with Elasticsearch - umd version 2014

Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...PROIDEA
 
Toying with spark
Toying with sparkToying with spark
Toying with sparkRaymond Tay
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with ScalaHimanshu Gupta
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...DataStax Academy
 
Spark Programming
Spark ProgrammingSpark Programming
Spark ProgrammingTaewook Eom
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to SparkLi Ming Tsai
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Alexey Zinoviev
 
CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourPeter Friese
 
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...NoSQLmatters
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterprisePatrick McFadin
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScyllaDB
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 

Similar a Spark with Elasticsearch - umd version 2014 (20)

Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
 
Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
 
Toying with spark
Toying with sparkToying with spark
Toying with spark
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
 
CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 Hour
 
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 

Spark with Elasticsearch - umd version 2014

  • 1. Apache Spark & Elasticsearch Holden Karau - UMD 2014 Now with delicious Spark SQL*
  • 2. Who am I? Holden Karau ● Software Engineer @ Databricks ● I’ve worked with Elasticsearch before ● I prefer she/her for pronouns ● Author of a book on Spark and co-writing another ● github https://github.com/holdenk ○ Has all of the code from this talk :) ● e-mail holden@databricks.com ● twitter: @holdenkarau ● linkedin: https://www.linkedin.com/in/holdenkarau
  • 3. What is Elasticsearch? ● Lucene based distributed search system ● Powerful tokenizing, stemming & other IR tools ● Geographic query support ● Capable of scaling to many nodes
  • 5. Talk overview Goal: understand how to work with ES & Spark ● Spark & Spark streaming let us re-use indexing code ● We can customize the ES connector to write to the shard based on partition ● Illustrate with twitter & show top tags per region ● Maybe a live demo of the above demo* Assumptions: ● Familiar(ish) with Search ● Can read Scala Things you don’t have to worry about: ● All the code is on-line, so don’t worry if you miss some *If we have extra time at the end
  • 6. Spark + Elasticsearch ● We can index our data on-line & off-line ● Gain the power to query our data ○ based on location ○ free text search ○ etc. Twitter Spark Streaming Elasticsearch Spark Query: Top Hash Tags Spark Re- Indexing Twitter
  • 7. Why should you care? Small differences between off-line and on-line Spot the difference picture from http://en.wikipedia.org/wiki/Spot_the_difference#mediaviewer/File: Spot_the_difference.png
  • 8. Cat picture from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
  • 9. Lets start with the on-line pipeline val ssc = new StreamingContext(master, "IndexTweetsLive", Seconds(1)) val tweets = TwitterUtils.createStream(ssc, None)
  • 10. Lets get ready to write the data into Elasticsearch Photo by Cloned Milkmen
  • 11. Lets get ready to write the data into Elasticsearch def setupEsOnSparkContext(sc: SparkContext) = { val jobConf = new JobConf(sc.hadoopConfiguration) jobConf.set("mapred.output.format.class", "org.elasticsearch.hadoop.mr.EsOutputFormat") jobConf.setOutputCommitter(classOf[FileOutputCommitter]) jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, “twitter/tweet”) FileOutputFormat.setOutputPath(jobConf, new Path("-")) jobconf }
  • 12. Add a schema
curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '
{
  "tweet" : {
    "properties" : {
      "message" : {"type" : "string"},
      "hashTags" : {"type" : "string"},
      "location" : {"type" : "geo_point"}
    }
  }
}'
  • 13. Let's format our tweets
def prepareTweets(tweet: twitter4j.Status) = {
  … // extract lat & lon from the tweet's location (elided on the slide)
  val hashTags = tweet.getHashtagEntities().map(_.getText())
  val fields = HashMap(
    "docid" -> tweet.getId().toString,
    "message" -> tweet.getText(),
    "hashTags" -> hashTags.mkString(" "),
    "location" -> s"$lat,$lon")
  // Convert to Hadoop Writable types
  mapToOutput(fields)
}
  • 14. And save them...
tweets.foreachRDD{(tweetRDD, time) =>
  val sc = tweetRDD.context
  // The jobConf isn't serializable so we create it here
  val jobConf = SharedESConfig.setupEsOnSparkContext(sc,
    esResource, Some(esNodes))
  // Convert our tweets to something that can be indexed
  val tweetsAsMap = tweetRDD.map(SharedIndex.prepareTweets)
  tweetsAsMap.saveAsHadoopDataset(jobConf)
}
  • 15. Elasticsearch now has Spark SQL!
case class tweetCS(docid: String, message: String,
  hashTags: String, location: Option[String])
  • 16. Format as a Schema RDD
def prepareTweetsCaseClass(tweet: twitter4j.Status) = {
  tweetCS(tweet.getId().toString, tweet.getText(), hashTags,
    tweet.getGeoLocation() match {
      case null => None
      case loc => {
        val lat = loc.getLatitude()
        val lon = loc.getLongitude()
        Some(s"$lat,$lon")
      }
    })
}
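The `case null => None` match can also be written with `Option(...)`, which wraps a possibly-null reference directly. A self-contained sketch of that null handling (the `Loc` class is a stand-in for twitter4j's `GeoLocation`, introduced here only so the example runs without twitter4j):

```scala
// Stand-in for twitter4j.GeoLocation, so the sketch is self-contained.
case class Loc(lat: Double, lon: Double)

// Option(x) yields None when x is null and Some(x) otherwise --
// the same behavior as the explicit `case null => None` match above.
def formatLocation(maybeNull: Loc): Option[String] =
  Option(maybeNull).map(loc => s"${loc.lat},${loc.lon}")
```

Either style works; `Option(...)` just keeps the null check in one place.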
  • 17. And save them… with SQL
tweets.foreachRDD{(tweetRDD, time) =>
  val sc = tweetRDD.context
  // The SQLContext isn't serializable so we create it here
  val sqlCtx = new SQLContext(sc)
  import sqlCtx.createSchemaRDD
  val tweetsAsCS = createSchemaRDD(
    tweetRDD.map(SharedIndex.prepareTweetsCaseClass))
  tweetsAsCS.saveToEs(esResource)
}
  • 18. Now let’s query them!
{"filtered" : {
  "query" : { "match_all" : {} },
  "filter" : {
    "geo_distance" : {
      "distance" : "${dist}km",
      "location" : { "lat" : "${lat}", "lon" : "${lon}" }
    }
  }
}}
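The `${dist}`, `${lat}` and `${lon}` placeholders are filled in with Scala string interpolation before the query is handed to the connector. A minimal sketch of building that query string (the parameter values in the usage below are made up for illustration):

```scala
// Build the geo_distance filter query via string interpolation.
// dist is in kilometers; lat/lon are the center of the search circle.
def geoQuery(dist: Int, lat: Double, lon: Double): String =
  s"""{"filtered" : {
     |  "query" : { "match_all" : {} },
     |  "filter" : {
     |    "geo_distance" : {
     |      "distance" : "${dist}km",
     |      "location" : { "lat" : "$lat", "lon" : "$lon" }
     |    }
     |  }
     |}}""".stripMargin
```

For example, `geoQuery(100, 40.7, -74.0)` produces the filter for a 100 km radius around New York.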
  • 19. Now let’s find the hash tags :)
// Set our query
jobConf.set("es.query", query)
// Create an RDD of the tweets
val currentTweets = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Object, MapWritable]],
  classOf[Object], classOf[MapWritable])
// Convert to a format we can work with
val tweets = currentTweets.map { case (key, value) =>
  SharedIndex.mapWritableToInput(value)
}
// Extract the hashtags
val hashTags = tweets.flatMap{t =>
  t.getOrElse("hashTags", "").split(" ")
}
  • 20. and extract the top hashtags
object WordCountOrdering extends Ordering[(String, Int)] {
  def compare(a: (String, Int), b: (String, Int)) = {
    b._2 compare a._2
  }
}
val ht = hashTags.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val topTags = ht.takeOrdered(40)(WordCountOrdering)
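The `reduceByKey` + `takeOrdered` pattern above is just "word count, then top N by count". A plain-Scala sketch of the same logic on a local collection (no cluster needed, so it's easy to poke at in a REPL):

```scala
// Count occurrences, then take the top N by descending count --
// the local-collection analogue of reduceByKey + takeOrdered.
def topTags(tags: Seq[String], n: Int): Seq[(String, Int)] =
  tags.groupBy(identity)
    .map { case (tag, occurrences) => (tag, occurrences.size) }
    .toSeq
    .sortBy { case (_, count) => -count }
    .take(n)
```

On an RDD you want `takeOrdered(n)(ordering)` rather than a full sort, since it only keeps the top n elements per partition before merging.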
  • 21. or with SQL...
// Create a Schema RDD of the tweets
val currentTweets = sqlCtx.esRDD(esResource, query)
// Extract the hashtags. We could do this in a
// more SQL way but I'm more comfortable in Scala
val hashTags = currentTweets.select('hashTags).flatMap{t =>
  t.getString(0).split(" ")
}
*We used a customized connector to handle location information
  • 22.
NYC: #MEX,11 #Job,11 #Jobs,8 #nyc,7 #CarterFollowMe,7 #Mexico,6 #BRA,6 #selfie,5 #TweetMyJobs,5 #LHHATL,5 #NYC,5 #ETnow,4 #TeenWolf,4 #CRO,4
SF: #Job,6 #Jobs,5 #MEX,4 #TweetMyJobs,3 #TeenWolfSeason4,2 #CRO,2 #Roseville,2 #Healthcare,2 #GOT7COMEBACK,2 #autodeskinterns,2
  • 23.
UMD: I,24 to,18 a,13 the,9 me,8 and,7 in,7 you,6 my,5 that,5 for,5 is,5 of,5 it,5
SF: to,14 in,13 the,11 of,11 a,9 I,9 and,8 you,6 my,6 for,5 our,5
I didn’t have enough time to index anything fun :(
  • 24. Indexing Part 2 (electric boogaloo) Writing directly to a node with the correct shard saves us network overhead Screen shot of elasticsearch-head http://mobz.github.io/elasticsearch-head/
  • 25. So what does that give us? Spark names each output file part-[partition #]. If our partitioning matches Elasticsearch's shard assignment, each partition writes directly to the node hosting its shard. Partition 1, Partition 2, Partition 3 → ES Node 1 holds partitions {1,2}, ES Node 2 holds partition {3}
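The idea: if the partitioner assigns each document to the same numbered partition that Elasticsearch's routing assigns it to as a shard, every output partition maps to exactly one shard. A simplified illustration (Elasticsearch actually routes with a murmur3 hash of the routing key; plain `hashCode` is used here only to keep the sketch self-contained):

```scala
// Simplified stand-in for ES routing: shard = hash(routing key) mod shard count.
// (ES uses murmur3 internally; hashCode here is illustrative only.)
def shardFor(docid: String, numShards: Int): Int =
  math.abs(docid.hashCode % numShards)
```

A Spark `Partitioner` whose `getPartition` used the real routing hash would then keep every document in a partition destined for a single shard, which is what lets the customized connector skip the extra network hop.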
  • 26. Re-index all the things*
// Fetch them from twitter
val t4jt = tweets.flatMap{ tweet =>
  val twitter = TwitterFactory.getSingleton()
  val tweetID = tweet.getOrElse("docid", "")
  Option(twitter.showStatus(tweetID.toLong))
}
t4jt.map(SharedIndex.prepareTweets)
  .saveAsHadoopDataset(jobConf)
*Until you hit your twitter rate limit…. oops
  • 27. Demo time! my ip is 10.109.32.18 *Until you hit your twitter rate limit…. oops
  • 28. “Useful” links
● Feedback holden@databricks.com
● My twitter: https://twitter.com/holdenkarau
● Customized ES connector*: https://github.com/holdenk/elasticsearch-hadoop
● Demo code: https://github.com/holdenk/elasticsearchspark
● Elasticsearch: http://www.elasticsearch.org/
● Spark: http://spark.apache.org/
● Spark streaming: http://spark.apache.org/streaming/
● Elasticsearch Spark documentation: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html
● http://databricks.com/blog/2014/06/27/application-spotlight-elasticsearch.html
  • 29. So what did we cover? ● Indexing data with Spark to Elasticsearch ● Sharing indexing code between Spark & Spark Streaming ● Using Elasticsearch for geolocal data in Spark ● Making our indexing aware of Elasticsearch ● Lots* of cat pictures * There were more before.
  • 30. Cat photo from https://www.flickr.com/photos/deerwooduk/579761138/in/photolist-4GCc4z-4GCbAV-6Ls27-34evHS-5UBnJv-TeqMG-4iNNn5-4w7s61- 6GMLYS-6H5QWY-6aJLUT-tqfrf-6mJ1Lr-84kGX-6mJ1GB-vVqN6-dY8aj5-y3jK-7C7P8Z-azEtd/