SlideShare una empresa de Scribd logo
1 de 24
Descargar para leer sin conexión
Spark with Elasticsearch
Who am I? 
Holden Karau 
● Software Engineer @ Databricks 
● I’ve worked with Elasticsearch before 
● I prefer she/her for pronouns 
● Author of a book on Spark and co-writing another 
● github https://github.com/holdenk 
○ Has all of the code from this talk :) 
● e-mail holden@databricks.com 
● @holdenkarau
What is Elasticsearch? 
● Lucene based distributed search system 
● Powerful tokenizing, stemming & other 
IR tools 
● Geographic query support 
● Capable of scaling to many nodes
Elasticsearch
Talk overview 
Goal: understand how to work with ES & Spark 
● Spark & Spark streaming let us re-use indexing code 
● We can customize the ES connector to write to the shard 
based on partition 
● Illustrate with twitter & show top tags per region 
● Maybe a live demo of the above demo* 
Assumptions: 
● Familiar(ish) with Search 
● Can read Scala 
Things you don’t have to worry about: 
● All the code is on-line, so don’t worry if you miss 
some 
*If we have extra time at the end
Spark + Elasticsearch 
● We can index our data on-line & off-line 
● Gain the power to query our data 
○ based on location 
○ free text search 
○ etc. 
Twitter Spark 
Streaming 
Elasticsearch Spark Query: 
Top Hash Tags 
Spark Re- 
Indexing 
Twitter
Why should you care? 
Small differences between off-line and on-line 
Spot the difference picture from http://en.wikipedia.org/wiki/Spot_the_difference#mediaviewer/File: 
Spot_the_difference.png
Cat picture from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Lets start with the on-line pipeline 
val ssc = new StreamingContext(master, "IndexTweetsLive", 
Seconds(1)) 
val tweets = TwitterUtils.createStream(ssc, None)
Lets get ready to write the data into 
Elasticsearch 
Photo by Cloned Milkmen
Lets get ready to write the data into 
Elasticsearch 
def setupEsOnSparkContext(sc: SparkContext) = { 
val jobConf = new JobConf(sc.hadoopConfiguration) 
jobConf.set("mapred.output.format.class", 
"org.elasticsearch.hadoop.mr.EsOutputFormat") 
jobConf.setOutputCommitter(classOf[FileOutputCommitter]) 
jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, 
“twitter/tweet”) 
FileOutputFormat.setOutputPath(jobConf, new Path("-")) 
jobconf 
}
Add a schema 
curl -XPUT 'http://localhost: 
9200/twitter/tweet/_mapping' -d ' 
{ 
"tweet" : { 
"properties" : { 
"message" : {"type" : "string"}, 
"hashTags" : {"type" : "string"}, 
"location" : {"type" : "geo_point"} 
} 
} 
}'
Lets format our tweets 
def prepareTweets(tweet: twitter4j.Status) = { 
… 
val hashTags = tweet.getHashtagEntities().map(_.getText()) 
HashMap( 
"docid" -> tweet.getId().toString, 
"message" -> tweet.getText(), 
"hashTags" -> hashTags.mkString(" "), 
"location" -> s"$lat,$lon" 
) 
} 
} 
// Convert to HadoopWritable types 
mapToOutput(fields) 
}
And save them... 
tweets.foreachRDD{(tweetRDD, time) => 
val sc = tweetRDD.context 
// The jobConf isn’t serilizable so we create it here 
val jobConf = SharedESConfig.setupEsOnSparkContext(sc, 
esResource, Some(esNodes)) 
// Convert our tweets to something that can be indexed 
val tweetsAsMap = tweetRDD.map( 
SharedIndex.prepareTweets) 
tweetsAsMap.saveAsHadoopDataset(jobConf) 
}
Now let’s query them! 
{"filtered" : { 
"query" : { 
"match_all" : {} 
} 
,"filter" : 
{"geo_distance" : 
{ 
"distance" : "${dist}km", 
"location" : 
{ 
"lat" : "${lat}", 
"lon" : "${lon}" 
}}}}}}
Now let’s find the hash tags :) 
// Set our query 
jobConf.set("es.query", query) 
// Create an RDD of the tweets 
val currentTweets = sc.hadoopRDD(jobConf, 
classOf[EsInputFormat[Object, MapWritable]], 
classOf[Object], classOf[MapWritable]) 
// Convert to a format we can work with 
val tweets = currentTweets.map{ case (key, value) => 
SharedIndex.mapWritableToInput(value) } 
// Extract the hashtags 
val hashTags = tweets.flatMap{t => 
t.getOrElse("hashTags", "").split(" ") 
}
and extract the top hashtags 
object WordCountOrdering extends Ordering[(String, Int)]{ 
def compare(a: (String, Int), b: (String, Int)) = { 
b._2 compare a._2 
} 
} 
val ht = hashtags.map(x => (x, 1)).reduceByKey((x,y) => x+y) 
val topTags = ht.takeOrdered(40)(WordCountOrdering)
NYC SF 
#MEX,11 
#Job,11 
#Jobs,8 
#nyc,7 
#CarterFollowMe,7 
#Mexico,6 
#BRA,6 
#selfie,5 
#TweetMyJobs,5 
#LHHATL,5 
#NYC,5 
#ETnow,4 
#TeenWolf,4 
#CRO,4 
#Job,6 
#Jobs,5 
#MEX,4 
#TweetMyJobs,3 
#TeenWolfSeason4,2 
#CRO,2 
#Roseville,2 
#Healthcare,2 
#GOT7COMEBACK,2 
#autodeskinterns,2
Indexing Part 2 
(electric boogaloo) 
Writing directly to a node with the correct shard saves us network overhead 
Screen shot of elasticsearch-head http://mobz.github.io/elasticsearch-head/
So what does that give us? 
Spark sets the filename to part-[part #] 
If we have same partitioner we write 
directly 
Partition 1 
Partition 2 
Partition 3 
ES Node 1 
Partition {1,2} 
ES Node 2 
Partition {3}
Re-index all the things* 
// Fetch them from twitter 
val t4jt = tweets.flatMap{ tweet => 
val twitter = TwitterFactory.getSingleton() 
val tweetID = tweet.getOrElse("docid", "") 
Option(twitter.showStatus(tweetID.toLong)) 
} 
t4jt.map(SharedIndex.prepareTweets) 
.saveAsHadoopDataset(jobConf) 
*Until you hit your twitter rate limit…. oops
“Useful” links 
● Feedback holden@databricks.com 
● Customized ES connector*: https://github. 
com/holdenk/elasticsearch-hadoop 
● Demo code: https://github.com/holdenk/elasticsearchspark 
● Elasticsearch: http://www.elasticsearch.org/ 
● Spark: http://spark.apache.org/ 
● Spark streaming: http://spark.apache.org/streaming/ 
● Elasticsearch Spark documentation: http://www.elasticsearch. 
org/guide/en/elasticsearch/hadoop/current/spark.html 
● http://databricks.com/blog/2014/06/27/application-spotlight-elasticsearch. 
html
So what did we cover? 
● Indexing data with Spark to Elasticsearch 
● Sharing indexing code between Spark & Spark Streaming 
● Using Elasticsearch for geolocal data in Spark 
● Making our indexing aware of Elasticsearch 
● Lots* of cat pictures 
* There were more before.
Cat photo from https://www.flickr.com/photos/deerwooduk/579761138/in/photolist-4GCc4z-4GCbAV-6Ls27-34evHS-5UBnJv-TeqMG-4iNNn5-4w7s61- 
6GMLYS-6H5QWY-6aJLUT-tqfrf-6mJ1Lr-84kGX-6mJ1GB-vVqN6-dY8aj5-y3jK-7C7P8Z-azEtd/

Más contenido relacionado

La actualidad más candente

How ElasticSearch lives in my DevOps life
How ElasticSearch lives in my DevOps lifeHow ElasticSearch lives in my DevOps life
How ElasticSearch lives in my DevOps life琛琳 饶
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Holden Karau
 
Machine Learning in a Twitter ETL using ELK
Machine Learning in a Twitter ETL using ELK Machine Learning in a Twitter ETL using ELK
Machine Learning in a Twitter ETL using ELK hypto
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...Holden Karau
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopHolden Karau
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Modern Data Stack France
 
Logstash-Elasticsearch-Kibana
Logstash-Elasticsearch-KibanaLogstash-Elasticsearch-Kibana
Logstash-Elasticsearch-Kibanadknx01
 
Simple search with elastic search
Simple search with elastic searchSimple search with elastic search
Simple search with elastic searchmarkstory
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Holden Karau
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupHolden Karau
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Holden Karau
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Modelssparktc
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionDatabricks
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015Holden Karau
 
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Sematext Group, Inc.
 
Side by Side with Elasticsearch and Solr
Side by Side with Elasticsearch and SolrSide by Side with Elasticsearch and Solr
Side by Side with Elasticsearch and SolrSematext Group, Inc.
 

La actualidad más candente (20)

How ElasticSearch lives in my DevOps life
How ElasticSearch lives in my DevOps lifeHow ElasticSearch lives in my DevOps life
How ElasticSearch lives in my DevOps life
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Machine Learning in a Twitter ETL using ELK
Machine Learning in a Twitter ETL using ELK Machine Learning in a Twitter ETL using ELK
Machine Learning in a Twitter ETL using ELK
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
 
Logstash-Elasticsearch-Kibana
Logstash-Elasticsearch-KibanaLogstash-Elasticsearch-Kibana
Logstash-Elasticsearch-Kibana
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Simple search with elastic search
Simple search with elastic searchSimple search with elastic search
Simple search with elastic search
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Apache spark basics
Apache spark basicsApache spark basics
Apache spark basics
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015
 
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
 
Side by Side with Elasticsearch and Solr
Side by Side with Elasticsearch and SolrSide by Side with Elasticsearch and Solr
Side by Side with Elasticsearch and Solr
 

Destacado

Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling   global big data tech conference 2015 sjBeyond shuffling   global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sjHolden Karau
 
Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Effective testing for spark programs scala bay preview (pre-strata ny 2015)Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Effective testing for spark programs scala bay preview (pre-strata ny 2015)Holden Karau
 
Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Holden Karau
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark MLHolden Karau
 
JP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツ
JP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツJP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツ
JP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツHolden Karau
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法 ※講演は翻訳資料にて行います。 - Getting the Best...
PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法  ※講演は翻訳資料にて行います。 - Getting the Best...PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法  ※講演は翻訳資料にて行います。 - Getting the Best...
PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法 ※講演は翻訳資料にて行います。 - Getting the Best...Holden Karau
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache SparkHolden Karau
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Holden Karau
 

Destacado (9)

Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling   global big data tech conference 2015 sjBeyond shuffling   global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sj
 
Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Effective testing for spark programs scala bay preview (pre-strata ny 2015)Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Effective testing for spark programs scala bay preview (pre-strata ny 2015)
 
Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
 
JP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツ
JP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツJP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツ
JP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツ
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法 ※講演は翻訳資料にて行います。 - Getting the Best...
PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法  ※講演は翻訳資料にて行います。 - Getting the Best...PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法  ※講演は翻訳資料にて行います。 - Getting the Best...
PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法 ※講演は翻訳資料にて行います。 - Getting the Best...
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache Spark
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016
 

Similar a Spark with Elasticsearch

Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...NoSQLmatters
 
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...PROIDEA
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
When to NoSQL and when to know SQL
When to NoSQL and when to know SQLWhen to NoSQL and when to know SQL
When to NoSQL and when to know SQLSimon Elliston Ball
 
Hands On Spring Data
Hands On Spring DataHands On Spring Data
Hands On Spring DataEric Bottard
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Fazendo mágica com ElasticSearch
Fazendo mágica com ElasticSearchFazendo mágica com ElasticSearch
Fazendo mágica com ElasticSearchPedro Franceschi
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout source{d}
 
Solving the Riddle of Search: Using Sphinx with Rails
Solving the Riddle of Search: Using Sphinx with RailsSolving the Riddle of Search: Using Sphinx with Rails
Solving the Riddle of Search: Using Sphinx with Railsfreelancing_god
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...Andrew Lamb
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with ClojureDmitry Buzdin
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 
Compass Framework
Compass FrameworkCompass Framework
Compass FrameworkLukas Vlcek
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015Alexander DEJANOVSKI
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBantoinegirbal
 
2011 Mongo FR - MongoDB introduction
2011 Mongo FR - MongoDB introduction2011 Mongo FR - MongoDB introduction
2011 Mongo FR - MongoDB introductionantoinegirbal
 

Similar a Spark with Elasticsearch (20)

Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
 
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
4Developers 2018: Pyt(h)on vs słoń: aktualny stan przetwarzania dużych danych...
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
When to NoSQL and when to know SQL
When to NoSQL and when to know SQLWhen to NoSQL and when to know SQL
When to NoSQL and when to know SQL
 
Hands On Spring Data
Hands On Spring DataHands On Spring Data
Hands On Spring Data
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
 
Fazendo mágica com ElasticSearch
Fazendo mágica com ElasticSearchFazendo mágica com ElasticSearch
Fazendo mágica com ElasticSearch
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
Solving the Riddle of Search: Using Sphinx with Rails
Solving the Riddle of Search: Using Sphinx with RailsSolving the Riddle of Search: Using Sphinx with Rails
Solving the Riddle of Search: Using Sphinx with Rails
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with Clojure
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Compass Framework
Compass FrameworkCompass Framework
Compass Framework
 
Angular Schematics
Angular SchematicsAngular Schematics
Angular Schematics
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
2011 Mongo FR - MongoDB introduction
2011 Mongo FR - MongoDB introduction2011 Mongo FR - MongoDB introduction
2011 Mongo FR - MongoDB introduction
 

Último

9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 

Último (20)

20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 

Spark with Elasticsearch

  • 2. Who am I? Holden Karau ● Software Engineer @ Databricks ● I’ve worked with Elasticsearch before ● I prefer she/her for pronouns ● Author of a book on Spark and co-writing another ● github https://github.com/holdenk ○ Has all of the code from this talk :) ● e-mail holden@databricks.com ● @holdenkarau
  • 3. What is Elasticsearch? ● Lucene based distributed search system ● Powerful tokenizing, stemming & other IR tools ● Geographic query support ● Capable of scaling to many nodes
  • 5. Talk overview Goal: understand how to work with ES & Spark ● Spark & Spark streaming let us re-use indexing code ● We can customize the ES connector to write to the shard based on partition ● Illustrate with twitter & show top tags per region ● Maybe a live demo of the above demo* Assumptions: ● Familiar(ish) with Search ● Can read Scala Things you don’t have to worry about: ● All the code is on-line, so don’t worry if you miss some *If we have extra time at the end
  • 6. Spark + Elasticsearch ● We can index our data on-line & off-line ● Gain the power to query our data ○ based on location ○ free text search ○ etc. Twitter Spark Streaming Elasticsearch Spark Query: Top Hash Tags Spark Re- Indexing Twitter
  • 7. Why should you care? Small differences between off-line and on-line Spot the difference picture from http://en.wikipedia.org/wiki/Spot_the_difference#mediaviewer/File: Spot_the_difference.png
  • 8. Cat picture from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
  • 9. Lets start with the on-line pipeline val ssc = new StreamingContext(master, "IndexTweetsLive", Seconds(1)) val tweets = TwitterUtils.createStream(ssc, None)
  • 10. Lets get ready to write the data into Elasticsearch Photo by Cloned Milkmen
  • 11. Lets get ready to write the data into Elasticsearch def setupEsOnSparkContext(sc: SparkContext) = { val jobConf = new JobConf(sc.hadoopConfiguration) jobConf.set("mapred.output.format.class", "org.elasticsearch.hadoop.mr.EsOutputFormat") jobConf.setOutputCommitter(classOf[FileOutputCommitter]) jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, “twitter/tweet”) FileOutputFormat.setOutputPath(jobConf, new Path("-")) jobconf }
  • 12. Add a schema curl -XPUT 'http://localhost: 9200/twitter/tweet/_mapping' -d ' { "tweet" : { "properties" : { "message" : {"type" : "string"}, "hashTags" : {"type" : "string"}, "location" : {"type" : "geo_point"} } } }'
  • 13. Lets format our tweets def prepareTweets(tweet: twitter4j.Status) = { … val hashTags = tweet.getHashtagEntities().map(_.getText()) HashMap( "docid" -> tweet.getId().toString, "message" -> tweet.getText(), "hashTags" -> hashTags.mkString(" "), "location" -> s"$lat,$lon" ) } } // Convert to HadoopWritable types mapToOutput(fields) }
  • 14. And save them... tweets.foreachRDD{(tweetRDD, time) => val sc = tweetRDD.context // The jobConf isn’t serilizable so we create it here val jobConf = SharedESConfig.setupEsOnSparkContext(sc, esResource, Some(esNodes)) // Convert our tweets to something that can be indexed val tweetsAsMap = tweetRDD.map( SharedIndex.prepareTweets) tweetsAsMap.saveAsHadoopDataset(jobConf) }
  • 15. Now let’s query them! {"filtered" : { "query" : { "match_all" : {} } ,"filter" : {"geo_distance" : { "distance" : "${dist}km", "location" : { "lat" : "${lat}", "lon" : "${lon}" }}}}}}
  • 16. Now let’s find the hash tags :) // Set our query jobConf.set("es.query", query) // Create an RDD of the tweets val currentTweets = sc.hadoopRDD(jobConf, classOf[EsInputFormat[Object, MapWritable]], classOf[Object], classOf[MapWritable]) // Convert to a format we can work with val tweets = currentTweets.map{ case (key, value) => SharedIndex.mapWritableToInput(value) } // Extract the hashtags val hashTags = tweets.flatMap{t => t.getOrElse("hashTags", "").split(" ") }
  • 17. and extract the top hashtags object WordCountOrdering extends Ordering[(String, Int)]{ def compare(a: (String, Int), b: (String, Int)) = { b._2 compare a._2 } } val ht = hashtags.map(x => (x, 1)).reduceByKey((x,y) => x+y) val topTags = ht.takeOrdered(40)(WordCountOrdering)
  • 18. NYC SF #MEX,11 #Job,11 #Jobs,8 #nyc,7 #CarterFollowMe,7 #Mexico,6 #BRA,6 #selfie,5 #TweetMyJobs,5 #LHHATL,5 #NYC,5 #ETnow,4 #TeenWolf,4 #CRO,4 #Job,6 #Jobs,5 #MEX,4 #TweetMyJobs,3 #TeenWolfSeason4,2 #CRO,2 #Roseville,2 #Healthcare,2 #GOT7COMEBACK,2 #autodeskinterns,2
  • 19. Indexing Part 2 (electric boogaloo) Writing directly to a node with the correct shard saves us network overhead Screen shot of elasticsearch-head http://mobz.github.io/elasticsearch-head/
  • 20. So what does that give us? Spark sets the filename to part-[part #] If we have same partitioner we write directly Partition 1 Partition 2 Partition 3 ES Node 1 Partition {1,2} ES Node 2 Partition {3}
  • 21. Re-index all the things* // Fetch them from twitter val t4jt = tweets.flatMap{ tweet => val twitter = TwitterFactory.getSingleton() val tweetID = tweet.getOrElse("docid", "") Option(twitter.showStatus(tweetID.toLong)) } t4jt.map(SharedIndex.prepareTweets) .saveAsHadoopDataset(jobConf) *Until you hit your twitter rate limit…. oops
  • 22. “Useful” links ● Feedback holden@databricks.com ● Customized ES connector*: https://github. com/holdenk/elasticsearch-hadoop ● Demo code: https://github.com/holdenk/elasticsearchspark ● Elasticsearch: http://www.elasticsearch.org/ ● Spark: http://spark.apache.org/ ● Spark streaming: http://spark.apache.org/streaming/ ● Elasticsearch Spark documentation: http://www.elasticsearch. org/guide/en/elasticsearch/hadoop/current/spark.html ● http://databricks.com/blog/2014/06/27/application-spotlight-elasticsearch. html
  • 23. So what did we cover? ● Indexing data with Spark to Elasticsearch ● Sharing indexing code between Spark & Spark Streaming ● Using Elasticsearch for geolocal data in Spark ● Making our indexing aware of Elasticsearch ● Lots* of cat pictures * There were more before.
  • 24. Cat photo from https://www.flickr.com/photos/deerwooduk/579761138/in/photolist-4GCc4z-4GCbAV-6Ls27-34evHS-5UBnJv-TeqMG-4iNNn5-4w7s61- 6GMLYS-6H5QWY-6aJLUT-tqfrf-6mJ1Lr-84kGX-6mJ1GB-vVqN6-dY8aj5-y3jK-7C7P8Z-azEtd/