Wifi
•GTVisitor  
•hotel/guest  
•pass:  76FCE
Data Engineering with
Solr and Spark
Grant Ingersoll
@gsingers
CTO, Lucidworks
Lucidworks Fusion Is Search-Driven Everything
Fusion is built on three core principles:
• Drive next generation relevance via Content, Collaboration and Context
• Harness best in class Open Source: Apache Solr + Spark
• Simplify application development and reduce ongoing maintenance
Fusion Architecture
[Architecture diagram]
• Admin UI and REST API; security built in
• Core Services: Connectors, Alerting/Messaging, NLP, Pipelines, Blob Storage, Scheduling, Recommenders/Signals, …
• Apache Spark: Workers, Cluster Mgr.
• Apache Solr: Shards; HDFS (optional)
• Apache ZooKeeper (ZK 1 … ZK N): Shared Config Mgmt, Leader Election, Load Balancing
• Connectors to data sources: Database, Web, File, Logs, Hadoop, Cloud
https://twitter.com/gsingers/status/700459516362625026
Get Started
https://github.com/Lucidworks/fusion-examples/tree/master/great-wide-open-2016
• Why  Search  for  Data  Engineering?  
• Quick  intro  to  Solr  
• Quick  intro  to  Spark  
• Solr  +  Spark  
• Relevance  101  
• Machine  learning  with  Spark  and  Solr  
• What’s  next?
Let’s  Do  This
Examples  throughout!
The Importance of Importance
Search-Driven Everything
• Customer Service
• Customer Insights
• Fraud Surveillance
• Research Portal
• Online Retail
• Digital Content
• Data Engineering, esp. with text, is a strange and magical world filled with…
– Evil villains
– Jesters
– Wizards
– Unicorns
– Heroes!
• In other words, no system will be perfect
Caveat Emptor: Data Engineering Edition
• You  will  spend  most  of  your  time  in  data  
engineering,  search,  machine  learning  and  NLP  
doing  “grunt”  work  nicely  labeled  as:  
– Preprocessing  
– Feature  Selection  
– Sampling  
– Validation/testing/etc.  
– Content  extraction  
– ETL  
• Corollary:  Start  with  simple,  tried  and  true  
algorithms,  then  iterate
Why  do  data  engineering  with  Solr  and  Spark?
Solr
• Data exploration and visualization
• Easy ingestion and feature selection
• Powerful ranking features
• Quick and dirty classification and clustering
• Simple operation and scaling
• Stats and math built in
Spark
• Advanced machine learning: MLlib, Mahout, Deeplearning4j
• Fast, large scale iterative algorithms
• General purpose batch/streaming compute engine
• Whole collection analysis!
• Lots of integrations with other big data systems
Solr Key Features
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
Lucene  for  the  Win!
• Vector Space or Probabilistic, it’s your choice!
• Killer FST
• Wicked fast
• Pluggable compression, queries, indexing and
more
• Advanced Similarity Models
• Lang. Modeling, Divergence from Random,
more
• Easy to plug-in ranking
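To make the "probabilistic" option above concrete, here is a minimal sketch of the textbook BM25 per-term score. It is an illustration under stated assumptions, not Lucene's actual implementation (which encodes document lengths lossily and differs in details between versions); the function name and the default `k1` and `b` values are choices made for this example.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Textbook BM25 score for one query term in one document.

    tf          -- how often the term occurs in the document
    doc_len     -- length of the document (in terms)
    avg_doc_len -- average document length in the collection
    doc_freq    -- number of documents containing the term
    num_docs    -- total number of documents
    """
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: long documents need more occurrences to score the same.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

# Term frequency saturates: the second occurrence adds less than the first.
one = bm25_term_score(1, 100, 100, 10, 1000)
two = bm25_term_score(2, 100, 100, 10, 1000)
print(one, two - one)
```

The saturation and length-normalization terms are what separate this model from a plain TF-IDF vector-space score.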
Solr  and  Your  Tools
• Data ingest:
• JSON, CSV, XML, Rich types (PDF, etc.), custom
• Clients for Python, R, Java, .NET and more
• http://cran.r-project.org/web/packages/solr/index.html, amongst
others
• Output formats: JSON, CSV, XML, custom
Basics  of  Solr  Requests
• Querying:
• Simple: term, phrases, boolean, wildcards, weights
• Advanced: query parsers, spatial, etc.
• Facets: term, query, range, pivot, stats
• Highlighting
• Spell checking
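As a rough illustration of how these request features combine, the sketch below builds a single Solr `/select` URL carrying a query, term facets, highlighting and spell checking. The parameter names (`q`, `facet`, `facet.field`, `hl`, `hl.fl`, `spellcheck`) are standard Solr request parameters; the host, port, collection and field names are assumptions made for the example, and spell checking only works if the request handler has the spellcheck component configured.

```python
from urllib.parse import urlencode

def build_select_url(base_url, collection, query,
                     facet_fields=(), highlight_fields=(), spellcheck=False, rows=10):
    """Build a Solr /select URL with optional faceting, highlighting and spell checking."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    if highlight_fields:
        params.append(("hl", "true"))
        params.append(("hl.fl", ",".join(highlight_fields)))
    if spellcheck:
        params.append(("spellcheck", "true"))
    return f"{base_url}/{collection}/select?{urlencode(params)}"

# Hypothetical host, collection and fields, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr", "tweets",
    'text_t:"data engineering" AND lang_s:en',
    facet_fields=["type_s", "lang_s"],
    highlight_fields=["text_t"],
    spellcheck=True,
)
print(url)
```

The resulting URL can be fetched with any HTTP client; the response format follows the `wt` parameter.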
Solr Basics
Spark  Key  Features
• General purpose, high powered cluster computing system
• Modern, faster alternative to MapReduce
• 3x faster w/ 10x less hardware for Terasort
• Great for iterative algorithms
• APIs for Java, Scala, Python and R
• Rich set of add-on libraries for machine learning, graph processing,
integrations with SQL and other systems
• Deploys: Standalone, Hadoop YARN, Mesos
Spark  Basics
• Resilient Distributed Datasets
• Spark SQL provides a Data Source, which provides a
DataFrame
• DataFrames — a DSL for distributed data manipulation
• Seamless integration with other Spark tech: SparkR,
Python
Spark Components
[Component stack diagram]
• Languages: Scala, Java, Python, R
• Components: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (BSP)
• Engine: Spark Core (Execution Model, The Shuffle, Caching)
• Cluster mgmt: Hadoop YARN, Mesos, Standalone
• Storage and shared memory: HDFS, Tachyon
Why  Spark  for  Solr?
• Build the index very, very quickly!
• Aggregations
• Boosts, stats, iterative computations
• Offline compute to update index with additional info (e.g.
PageRank, popularity)
• Whole corpus analytics, clustering, classification
• Joins with other storage (Cassandra, HDFS, DB, HBase)
Why  Solr  for  Spark?
• Massive simplification of operations!
• Non “dumb” distributed, resilient storage
• Random access with smart queries
• Table scans
• Advanced filtering, feature selection
• Schemaless when you want, predefined when you don’t
• Spatial, columnar, sparse
Spark  +  Solr  in  Anger
http://github.com/lucidworks/spark-solr
// Read the "tweets" Solr collection as a Spark DataFrame (Spark 1.x API)
Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);
options.put("collection", "tweets");

DataFrame df = sqlContext.read().format("solr").options(options).load();
// Count the documents whose type_s field equals "echo"
long count = df.filter(df.col("type_s").equalTo("echo")).count();
Spark  Shell  in  a  Nutshell
• Common commands
• Solr in Spark: queries, filters and other requests
• See commands.md in the Github repo
But is it relevant?
Tales from the
trenches
Look before
you leap
• Wing it
• Ask — Caveat Emptor
• Log analysis
• Experimentation: A/B (A/A) testing
Approaches
• Precision/Recall (also, Mean Avg. Precision)
• Mean Reciprocal Rank (MRR)
• Number of {Zero|Embarrassing} Results
• Inter-Annotator Agreement
• Normalized Discounted Cumulative Gain (NDCG)
Common  Metrics
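The metrics above are simple to compute once relevance judgments exist. The following is a minimal sketch, assuming binary judgments for precision, recall and MRR and graded gains for NDCG; the function names and the sample document ids are invented for the example.

```python
import math

def precision_recall(ranked, relevant, k):
    """Precision@k and recall@k for one ranked result list."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    recall = hits / len(relevant) if relevant else 0.0
    return hits / k, recall

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is returned."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

def ndcg(ranked, gains, k):
    """NDCG@k, where gains maps document id to a graded relevance value."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return dcg([gains.get(doc, 0) for doc in ranked[:k]]) / ideal if ideal > 0 else 0.0

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(precision_recall(ranked, relevant, 2))   # precision and recall in the top 2
print(mean_reciprocal_rank([(ranked, relevant), (["a", "b"], {"a"})]))
print(ndcg(["c", "b", "a"], {"a": 3, "b": 2, "c": 1}, 3))
```

Counting zero-result queries needs no formula at all: it is a log-analysis job, which is one reason it makes a good first metric.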
Tips and Traps
Algorithms: The mainstay of any approach. Leverages Lucene/Solr's built-in similarity engine, function queries and other capabilities to determine importance based on the core index.
Collective Intelligence: Especially effective for curating the long tail. Feedback from users and other systems provides key insights into importance. Can also be used to inform the business about trends and interests.
Editors/Rules: Should be used sparingly to handle key situations such as promotions and edge cases. Review often. Encourage experimentation instead. Works well for landing pages, boosts and blocks where you know the answers. Not to be confused with curating content.
Big  Picture  on  Relevance
• Similarity Models
Default, BM25F, others
• Function Queries, Reranking, Boosts
• Phrases are almost always a win (edismax does most of this for you)
e.g.: (exact match terms)^100 AND ("termA termB…"~10)^50 AND (termA AND termB…)^10 AND (termA OR termB)
• Mind your analysis
Algorithms
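One way to get the phrase boosting described above without hand-writing the layered query is to let edismax do it. The sketch below assembles request parameters using the real edismax options `qf` (query fields), `pf` (phrase fields) and `ps` (phrase slop); the field names and boost values are assumptions chosen for illustration, not recommendations from the talk.

```python
from urllib.parse import urlencode

def edismax_params(user_query, query_fields, phrase_fields, phrase_slop=10, rows=10):
    """Build edismax request parameters that boost documents matching the query as a phrase."""
    def boosts(fields):
        # Render {"title_t": 10} as "title_t^10".
        return " ".join(f"{name}^{boost}" for name, boost in fields.items())
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": boosts(query_fields),   # fields searched term by term
        "pf": boosts(phrase_fields),  # fields where a whole-phrase match earns extra score
        "ps": str(phrase_slop),       # how far apart phrase terms may be
        "rows": str(rows),
    }

# Hypothetical fields: title matches count more than body matches.
params = edismax_params(
    "solr spark",
    query_fields={"title_t": 10, "body_t": 1},
    phrase_fields={"title_t": 100, "body_t": 50},
)
print(urlencode(params))
```

Keeping the boosts in data rather than in a query string also makes them easy to vary in an A/B test.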
• UI, UI, UI!
• 1000’s of rules
• Second is the first loser
• Local minimum
• Pet peeve queries
• Oprah effect
• Assumptions
It’s a trap!
Level up
• Spark ships with good out of the box machine learning capabilities
• Spark-Solr brings enhanced feature selection tools via Lucene analyzers
• Examples
k-means
word2vec
Find synonyms
Machine  Learning  at  Work
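For readers who have not met k-means, here is a small pure-Python sketch of the algorithm itself (Lloyd's iteration). It is not Spark's MLlib implementation and does not distribute any work; in Spark you would use the library's own KMeans. The function name, the sample points and the starting centroids are made up for the example.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared Euclidean distance).
        assignments = [
            min(range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid where it was
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignments

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assignments = kmeans(points, [(0, 0), (10, 10)])
print(centroids, assignments)
```

With text, the points would be feature vectors produced by analysis, which is where the Lucene analyzers mentioned above come in.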
Sneak Peek
• Parallel Execution of SQL across SolrCloud
• Realtime Map-Reduce (“ish”) Functionality
• Parallel Relational Algebra
• Builds on streaming capabilities in 5.x
• JDBC client in the works
Just When You Thought SQL was Dead
Full, Parallelized, SQL Support
• Lots of Functions:
– Search, Merge, Group, Unique, Parallel, Select, Reduce, Select, innerJoin, hashJoin, Top, Rollup, Facet, Stats, Update, JDBC, Intersect, Complement, Logit
• Composable Streams
• Query optimization built in
SQL Guts
Example
select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i)
from collection1
where text='XXXX'
group by str_s
rollup(
   search(collection1,
          q="(text:XXXX)",
          qt="/export",
          fl="str_s,field_i",
          partitionKeys=str_s,
          sort="str_s asc",
          zkHost="localhost:9989/solr"),
   over=str_s,
   count(*),
   sum(field_i),
   min(field_i),
   max(field_i),
   avg(field_i))
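To show what the rollup step computes, here is a small local sketch of the same idea: because the underlying search returns tuples sorted by `str_s`, aggregates can be produced in a single pass over the stream, one group at a time. This is an in-memory illustration rather than Solr's code; the function name and sample tuples are invented, and only the field names come from the example above.

```python
from itertools import groupby
from operator import itemgetter

def rollup(tuples, over, field):
    """Aggregate a stream of tuples that is already sorted by the `over` field."""
    # groupby only merges adjacent items, which is why the sort order matters.
    for key, group in groupby(tuples, key=itemgetter(over)):
        values = [t[field] for t in group]
        yield {
            over: key,
            "count(*)": len(values),
            f"sum({field})": sum(values),
            f"min({field})": min(values),
            f"max({field})": max(values),
            f"avg({field})": sum(values) / len(values),
        }

# Sorted by str_s, as the sort="str_s asc" parameter would guarantee.
stream = [
    {"str_s": "a", "field_i": 1},
    {"str_s": "a", "field_i": 3},
    {"str_s": "b", "field_i": 10},
]
results = list(rollup(stream, over="str_s", field="field_i"))
print(results)
```

Each group is finished as soon as the key changes, so nothing beyond the current group has to be held in memory.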
• Provides replication between two or more SolrCloud clusters located in two or more data centers
• Uses existing transaction logs
• Asynchronous indexing
• No Single Point of Failure or bottlenecks
• Leader-to-leader communication to ensure updates are only sent once
Never Go Down, or at least Recover Quickly!
Cross Data Center Replication
• Graph Traversal
• Find all tweets mentioning “Solr” by me or people I follow
• Find all draft blog posts about “Parallel SQL” written by a developer
• Find 3-star hotels in NYC my friends stayed in last year
• BM25F Default Similarity
• Geo3D search
Make Connections, Get Better Results
• Jetty 9.3 and http2 (6.x)
• Fully multiplexed over a single connection
• Reduced chance of distributed deadlock
• Backup/Restore API
• Optimizations to distributed search algorithm
• AngularJS-based UI
But Wait! There’s More!
2016
OCTOBER 13-16, 2016
BOSTON, MA
Resources
• This code: https://github.com/Lucidworks/fusion-examples/tree/master/great-wide-open-2016
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Book: http://www.manning.com/ingersoll
• Solr: http://lucene.apache.org/solr
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @gsingers
Appendix  A:  SQL  details
Streaming API & Expressions
●API
○ Java API to provide programming framework
○ Returns tuples as a JSON stream
○ org.apache.solr.client.solrj.io
●Expressions
○ String Query Language
○ Serialization format
○ Allows non-Java programmers to access Streaming API
DocValues must be enabled for any field to be returned
Streaming Expression Request
curl --data-urlencode 'stream=search(sample,
                                      q="*:*",
                                      fl="id,field_i",
                                      sort="field_i asc")' http://localhost:8901/solr/sample/stream
Streaming Expression Response
{"responseHeader": {"status": 0, "QTime": 1},
 "tuples": {
   "numFound": -1,
   "start": -1,
   "docs": [
     {"id": "doc1", "field_i": 1},
     {"id": "doc2", "field_i": 2},
     {"EOF": true}]
 }}
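A client consuming this response reads tuples until it meets the `EOF` marker. The sketch below does that for the response shape shown on this slide; other Solr versions may wrap the tuples differently, so the key names are an assumption tied to this example, and the function name is invented.

```python
import json

def read_tuples(response_text):
    """Yield the data tuples from a streaming response, stopping at the EOF marker."""
    payload = json.loads(response_text)
    for doc in payload["tuples"]["docs"]:
        if doc.get("EOF"):
            break  # the final tuple only signals the end of the stream
        yield doc

# The response from the slide, embedded as a string so the sketch runs offline.
sample = """{"responseHeader": {"status": 0, "QTime": 1},
 "tuples": {"numFound": -1, "start": -1,
   "docs": [{"id": "doc1", "field_i": 1},
            {"id": "doc2", "field_i": 2},
            {"EOF": true}]}}"""
tuples = list(read_tuples(sample))
print(tuples)
```

For large result sets an incremental JSON parser would be a better fit than loading the whole body at once.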
Architecture
●MapReduce-ish
○ Borrows Shuffling concept from M/R
●Logical tiers for performing the query
○ SQL tier: translates SQL to streaming expressions for parallel query plan,
selects worker nodes, merges results
○ Worker tier: executes parallel query plan, streams tuples from data tables
back
○ Data Table tier: queries SolrCloud collections, performs initial sort and
partitioning of results for worker nodes
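The shuffle borrowed from MapReduce boils down to one rule: tuples with the same partition key must reach the same worker, so that each worker can group or join its share independently. Here is a minimal sketch of that rule. The CRC32 hash is a stand-in chosen because it is deterministic; the slides do not say which hash Solr uses, and the function names and sample tuples are invented.

```python
import zlib
from collections import defaultdict

def partition_for(key, num_workers):
    """Map a partition key to a worker index with a deterministic hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_workers

def shuffle(tuples, partition_key, num_workers):
    """Route every tuple to the worker that owns its partition key."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[partition_for(t[partition_key], num_workers)].append(t)
    return buckets

tuples = [
    {"str_s": "a", "field_i": 1},
    {"str_s": "b", "field_i": 2},
    {"str_s": "a", "field_i": 3},
    {"str_s": "c", "field_i": 4},
]
buckets = shuffle(tuples, "str_s", num_workers=2)
for worker, items in sorted(buckets.items()):
    print(worker, items)
```

Because routing depends only on the key, each data node can partition its own results without coordinating with the others.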
JDBC Client
●Parallel SQL includes a “thin” JDBC client
●Expanded to include SQL Clients such as DbVisualizer
(SOLR-8502)
●Client only works with Parallel SQL features
Learning More
Joel Bernstein’s presentation at Lucene Revolution:
●https://www.youtube.com/watch?v=baWQfHWozXc
Apache Solr Reference Guide:
●https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
●https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface
Spark  Architecture
[Architecture diagram]
• My Spark App: my-spark-job.jar (w/ shaded deps) with a SparkContext (driver), which holds the RDD Graph, DAG Scheduler, Block tracker and Shuffle tracker
• Spark Master (daemon): keeps track of live workers, Web UI on port 8080, Task Scheduler, restarts failed tasks
• Spark Worker Node (1...N of these): Spark Slave (daemon) plus Spark Executor (JVM process) holding Tasks and Cache
• The executor runs in a separate process from the slave daemon
• Each task works on some partition of a data set to apply a transformation or action
• Losing a master prevents new applications from being executed; HA can be achieved using ZooKeeper and multiple master nodes
• Tasks are assigned based on data locality: when selecting which node to execute a task on, the master takes data locality into account

Más contenido relacionado

La actualidad más candente

MySQL Group Replication - HandsOn Tutorial
MySQL Group Replication - HandsOn TutorialMySQL Group Replication - HandsOn Tutorial
MySQL Group Replication - HandsOn TutorialKenny Gryp
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performancePostgreSQL-Consulting
 
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오PgDay.Seoul
 
[Pgday.Seoul 2021] 1. 예제로 살펴보는 포스트그레스큐엘의 독특한 SQL
[Pgday.Seoul 2021] 1. 예제로 살펴보는 포스트그레스큐엘의 독특한 SQL[Pgday.Seoul 2021] 1. 예제로 살펴보는 포스트그레스큐엘의 독특한 SQL
[Pgday.Seoul 2021] 1. 예제로 살펴보는 포스트그레스큐엘의 독특한 SQLPgDay.Seoul
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Spark Summit
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019Brendan Gregg
 
ELK Stack - Kibana操作實務
ELK Stack - Kibana操作實務ELK Stack - Kibana操作實務
ELK Stack - Kibana操作實務Kedy Chang
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Alexey Lesovsky
 
Load Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - SlidesLoad Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - SlidesSeveralnines
 
Text tagging with finite state transducers
Text tagging with finite state transducersText tagging with finite state transducers
Text tagging with finite state transducerslucenerevolution
 
Scouter와 influx db – grafana 연동 가이드
Scouter와 influx db – grafana 연동 가이드Scouter와 influx db – grafana 연동 가이드
Scouter와 influx db – grafana 연동 가이드Ji-Woong Choi
 
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요Jo Hoon
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 

La actualidad más candente (20)

MySQL Group Replication - HandsOn Tutorial
MySQL Group Replication - HandsOn TutorialMySQL Group Replication - HandsOn Tutorial
MySQL Group Replication - HandsOn Tutorial
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
 
How to Design Indexes, Really
How to Design Indexes, ReallyHow to Design Indexes, Really
How to Design Indexes, Really
 
ZIO Queue
ZIO QueueZIO Queue
ZIO Queue
 
JSON in Solr: from top to bottom
JSON in Solr: from top to bottomJSON in Solr: from top to bottom
JSON in Solr: from top to bottom
 
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
 
[Pgday.Seoul 2021] 1. 예제로 살펴보는 포스트그레스큐엘의 독특한 SQL
[Pgday.Seoul 2021] 1. 예제로 살펴보는 포스트그레스큐엘의 독특한 SQL[Pgday.Seoul 2021] 1. 예제로 살펴보는 포스트그레스큐엘의 독특한 SQL
[Pgday.Seoul 2021] 1. 예제로 살펴보는 포스트그레스큐엘의 독특한 SQL
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
ELK Stack - Kibana操作實務
ELK Stack - Kibana操作實務ELK Stack - Kibana操作實務
ELK Stack - Kibana操作實務
 
Load Data Fast!
Load Data Fast!Load Data Fast!
Load Data Fast!
 
Flower and celery
Flower and celeryFlower and celery
Flower and celery
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
 
Load Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - SlidesLoad Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - Slides
 
Text tagging with finite state transducers
Text tagging with finite state transducersText tagging with finite state transducers
Text tagging with finite state transducers
 
Scouter와 influx db – grafana 연동 가이드
Scouter와 influx db – grafana 연동 가이드Scouter와 influx db – grafana 연동 가이드
Scouter와 influx db – grafana 연동 가이드
 
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 

Similar a Data Engineering with Solr and Spark

Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and SparkLucidworks
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16MLconf
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksLucidworks
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.Jurriaan Persyn
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrJake Mannix
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit
 
Webinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and GraphWebinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and GraphLucidworks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Solr as a Spark SQL Datasource
Solr as a Spark SQL DatasourceSolr as a Spark SQL Datasource
Solr as a Spark SQL DatasourceChitturi Kiran
 
Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher lucenerevolution
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCLucidworks (Archived)
 
0bbleedingedge long-140614012258-phpapp02 lynn-langit
0bbleedingedge long-140614012258-phpapp02 lynn-langit0bbleedingedge long-140614012258-phpapp02 lynn-langit
0bbleedingedge long-140614012258-phpapp02 lynn-langitData Con LA
 
Bleeding Edge Databases
Bleeding Edge DatabasesBleeding Edge Databases
Bleeding Edge DatabasesLynn Langit
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big featuresDavid Smiley
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 

Similar a Data Engineering with Solr and Spark (20)

Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+Solr
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir Volk
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Webinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and GraphWebinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and Graph
 
ETL 2.0 Data Engineering for developers
ETL 2.0 Data Engineering for developersETL 2.0 Data Engineering for developers
ETL 2.0 Data Engineering for developers
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Solr as a Spark SQL Datasource
Solr as a Spark SQL DatasourceSolr as a Spark SQL Datasource
Solr as a Spark SQL Datasource
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Elastic pivorak
Elastic pivorakElastic pivorak
Elastic pivorak
 
0bbleedingedge long-140614012258-phpapp02 lynn-langit
0bbleedingedge long-140614012258-phpapp02 lynn-langit0bbleedingedge long-140614012258-phpapp02 lynn-langit
0bbleedingedge long-140614012258-phpapp02 lynn-langit
 
Bleeding Edge Databases
Bleeding Edge DatabasesBleeding Edge Databases
Bleeding Edge Databases
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big features
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 

Más de Lucidworks

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategyLucidworks
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceLucidworks
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsLucidworks
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesLucidworks
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Lucidworks
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchLucidworks
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Lucidworks
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyLucidworks
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Lucidworks
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchLucidworks
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondLucidworks
 

Data Engineering with Solr and Spark

  • 3. Data Engineering with Solr and Spark. Grant Ingersoll (@gsingers), CTO, Lucidworks
  • 4. Lucidworks Fusion Is Search-Driven Everything
    Fusion is built on three core principles:
      • Drive next-generation relevance via Content, Collaboration and Context
      • Harness best-in-class open source: Apache Solr + Spark
      • Simplify application development and reduce ongoing maintenance
  • 5. Fusion Architecture (diagram; labels only)
      • REST API, Admin UI
      • Connectors: Database, Web, File, Logs, Hadoop, Cloud
      • Core Services: Alerting/Messaging, NLP, Pipelines, Blob Storage, Scheduling, Recommenders/Signals, …
      • Apache Spark: Worker, Worker, Cluster Mgr.
      • Apache Solr: Shards, Shards
      • HDFS (Optional)
      • Apache ZooKeeper (ZK 1 … ZK N): Shared Config Mgmt, Leader Election, Load Balancing
      • Security built in
  • 8. Let’s Do This
      • Why Search for Data Engineering?
      • Quick intro to Solr
      • Quick intro to Spark
      • Solr + Spark
      • Relevance 101
      • Machine learning with Spark and Solr
      • What’s next?
    Examples throughout!
  • 9. The Importance of Importance
  • 10. Search-Driven Everything: Customer Service, Customer Insights, Fraud Surveillance, Research Portal, Online Retail, Digital Content
  • 11. Caveat Emptor: Data Engineering Edition
      • Data engineering, esp. with text, is a strange and magical world filled with…
          – Evil villains
          – Jesters
          – Wizards
          – Unicorns
          – Heroes!
      • In other words, no system will be perfect
  • 12. You will spend most of your time in data engineering, search, machine learning and NLP doing “grunt” work nicely labeled as:
          – Preprocessing
          – Feature selection
          – Sampling
          – Validation/testing/etc.
          – Content extraction
          – ETL
      • Corollary: start with simple, tried and true algorithms, then iterate
  • 13. Why do data engineering with Solr and Spark?
    Solr:
      • Data exploration and visualization
      • Easy ingestion and feature selection
      • Powerful ranking features
      • Quick and dirty classification and clustering
      • Simple operation and scaling
      • Stats and math built in
    Spark:
      • Advanced machine learning: MLlib, Mahout, Deeplearning4j
      • Fast, large-scale iterative algorithms
      • General-purpose batch/streaming compute engine
      • Lots of integrations with other big data systems
    Whole collection analysis!
  • 14. Solr Key Features
      • Apache Lucene
      • Grouping and joins
      • Stats, expressions, transformations and more
      • Language detection
      • Extensible
      • Massive scale/fault tolerance
      • Full text search (information retrieval)
      • Facets/guided navigation galore!
      • Lots of data types
      • Spelling, auto-complete, highlighting
      • Cursors
      • More Like This
      • De-duplication
  • 15. Lucene for the Win!
      • Vector space or probabilistic, it’s your choice!
      • Killer FST
      • Wicked fast
      • Pluggable compression, queries, indexing and more
      • Advanced similarity models
          – Language modeling, Divergence from Random, more
      • Easy to plug in ranking
  • 16. Solr and Your Tools
      • Data ingest:
          – JSON, CSV, XML, rich types (PDF, etc.), custom
      • Clients for Python, R, Java, .NET and more
          – http://cran.r-project.org/web/packages/solr/index.html, amongst others
      • Output formats: JSON, CSV, XML, custom
  • 17. Basics of Solr Requests
      • Querying:
          – Simple: terms, phrases, boolean, wildcards, weights
          – Advanced: query parsers, spatial, etc.
      • Facets: term, query, range, pivot, stats
      • Highlighting
      • Spell checking
  • 19. Spark Key Features
      • General-purpose, high-powered cluster computing system
      • Modern, faster alternative to MapReduce
          – 3x faster w/ 10x less hardware for Terasort
      • Great for iterative algorithms
      • APIs for Java, Scala, Python and R
      • Rich set of add-on libraries for machine learning, graph processing, integrations with SQL and other systems
      • Deploys: Standalone, Hadoop YARN, Mesos
  • 20. Spark Basics
      • Resilient Distributed Datasets
      • Spark SQL provides a Data Source, which provides a DataFrame
      • DataFrames: a DSL for distributed data manipulation
      • Seamless integration with other Spark tech: SparkR, Python
  • 21. Spark Components (diagram; labels only)
      • Components: Spark Core, Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (BSP)
      • Engine: Execution Model, The Shuffle, Caching
      • Cluster mgmt: Hadoop YARN, Mesos, Standalone
      • Shared memory: Tachyon
      • HDFS
      • Languages: Scala, Java, Python, R
  • 22. Why Spark for Solr?
      • Build the index very, very quickly!
      • Aggregations
      • Boosts, stats, iterative computations
      • Offline compute to update index with additional info (e.g. PageRank, popularity)
      • Whole corpus analytics, clustering, classification
      • Joins with other storage (Cassandra, HDFS, DB, HBase)
  • 23. Why Solr for Spark?
      • Massive simplification of operations!
      • Non-“dumb” distributed, resilient storage
      • Random access with smart queries
      • Table scans
      • Advanced filtering, feature selection
      • Schemaless when you want, predefined when you don’t
      • Spatial, columnar, sparse
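  • 24. Spark + Solr in Anger
    http://github.com/lucidworks/spark-solr

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.sql.DataFrame;

    // zkHost (the ZooKeeper connect string) and sqlContext are assumed to be defined already.
    Map<String, String> options = new HashMap<String, String>();
    options.put("zkhost", zkHost);
    options.put("collection", "tweets");

    // Load the Solr collection as a DataFrame, then count the documents whose type_s is "echo".
    DataFrame df = sqlContext.read().format("solr").options(options).load();
    long count = df.filter(df.col("type_s").equalTo("echo")).count();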
  • 25. Spark Shell in a Nutshell
      • Common commands
      • Solr in Spark: queries, filters and other requests
      • See commands.md in the GitHub repo
  • 27. But is it relevant?
  • 30. Approaches
      • Wing it
      • Ask (caveat emptor)
      • Log analysis
      • Experimentation: A/B (A/A) testing
  • 31. Common Metrics
      • Precision/Recall (also Mean Average Precision)
      • Mean Reciprocal Rank (MRR)
      • Number of {Zero|Embarrassing} Results
      • Inter-Annotator Agreement
      • Normalized Discounted Cumulative Gain (NDCG)
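Two of the metrics on the Common Metrics slide can be sketched in a few lines of plain Java. This is an illustrative sketch, not code from the talk: the class and method names are hypothetical, judgments are assumed to arrive as arrays in rank order, and NDCG is computed with linear gain and a log2(rank + 1) discount, which is only one of several variants in use.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Solr or Spark): MRR and NDCG over ranked judgments. */
final class RelevanceMetrics {

    /** Mean Reciprocal Rank: mean of 1/rank of the first relevant hit per query (0 if none). */
    static double meanReciprocalRank(List<boolean[]> rankedRelevance) {
        if (rankedRelevance.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (boolean[] ranked : rankedRelevance) {
            for (int i = 0; i < ranked.length; i++) {
                if (ranked[i]) {
                    sum += 1.0 / (i + 1);
                    break; // only the first relevant hit counts
                }
            }
        }
        return sum / rankedRelevance.size();
    }

    /** Discounted cumulative gain: grade at rank r is divided by log2(r + 1). */
    static double dcg(int[] grades) {
        double total = 0.0;
        for (int i = 0; i < grades.length; i++) {
            total += grades[i] / (Math.log(i + 2) / Math.log(2));
        }
        return total;
    }

    /** NDCG: DCG of the actual ranking divided by DCG of the ideal (descending) ranking. */
    static double ndcg(int[] grades) {
        int[] ideal = grades.clone();
        Arrays.sort(ideal); // ascending; reverse in place to get the ideal order
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
            int tmp = ideal[i];
            ideal[i] = ideal[j];
            ideal[j] = tmp;
        }
        double idealDcg = dcg(ideal);
        return idealDcg == 0.0 ? 0.0 : dcg(grades) / idealDcg;
    }

    public static void main(String[] args) {
        // Query 1: first relevant hit at rank 2; query 2: at rank 1.
        List<boolean[]> queries =
                Arrays.asList(new boolean[] {false, true}, new boolean[] {true});
        System.out.println("MRR  = " + meanReciprocalRank(queries));
        System.out.println("NDCG = " + ndcg(new int[] {1, 3, 0}));
    }
}
```

In practice the judgments would come from log analysis or annotators, as the Approaches slide suggests; the arithmetic stays the same.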
  • 33. Big Picture on Relevance
      • Algorithms: The mainstay of any approach. Leverages Lucene/Solr’s built-in similarity engine, function queries and other capabilities to determine importance based on the core index.
      • Collective Intelligence: Especially effective for curating the long tail; feedback from users and other systems provides key insights into importance. Can also be used to inform the business about trends and interests.
      • Editors/Rules: Should be used sparingly to handle key situations such as promotions and edge cases. Review often. Encourage experimentation instead. Works well for landing pages, boosts and blocks where you know the answers. Not to be confused with curating content.
  • 34. Algorithms
      • Similarity models: Default, BM25F, others
      • Function queries, reranking, boosts
      • Phrases are almost always a win (edismax does most of this for you), e.g.:
        (exact match terms)^100 AND ("termA termB…"~10)^50 AND (termA AND termB…)^10 AND (termA OR termB)
      • Mind your analysis
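The tiered phrase query on the Algorithms slide can be assembled mechanically from the user's terms. The sketch below mirrors the slide's pattern literally, including the AND between tiers; the class and method names are hypothetical, and it assumes the terms are plain words with no query-syntax characters to escape.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical builder for the slide's tiered query: exact phrase, sloppy phrase, all terms, any term. */
final class PhraseBoostQuery {

    /** Assumes terms are plain words; escaping of Lucene special characters is omitted. */
    static String build(List<String> terms) {
        String phrase = "\"" + String.join(" ", terms) + "\"";
        String allTerms = String.join(" AND ", terms);
        String anyTerm = String.join(" OR ", terms);
        return "(" + phrase + ")^100"          // exact phrase, highest boost
                + " AND (" + phrase + "~10)^50" // same phrase with a slop of 10
                + " AND (" + allTerms + ")^10"  // every term somewhere in the field
                + " AND (" + anyTerm + ")";     // at least one term, no boost
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("solr", "spark")));
    }
}
```

As the slide notes, edismax does most of this for you, so hand-building the string is mainly useful for seeing what the tiers contribute.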
  • 35. It’s a trap!
      • UI, UI, UI!
      • 1000’s of rules
      • Second is the first loser
      • Local minimum
      • Pet peeve queries
      • Oprah effect
      • Assumptions
  • 37. Machine Learning at Work
      • Spark ships with good out-of-the-box machine learning capabilities
      • Spark-Solr brings enhanced feature selection tools via Lucene analyzers
      • Examples:
          – k-means
          – word2vec: find synonyms
  • 39. Just When You Thought SQL Was Dead: Full, Parallelized SQL Support
      • Parallel execution of SQL across SolrCloud
      • Realtime Map-Reduce (“ish”) functionality
      • Parallel relational algebra
      • Builds on streaming capabilities in 5.x
      • JDBC client in the works
  • 40. SQL Guts
      • Lots of functions: Search, Merge, Group, Unique, Parallel, Select, Reduce, innerJoin, hashJoin, Top, Rollup, Facet, Stats, Update, JDBC, Intersect, Complement, Logit
      • Composable streams
      • Query optimization built in
    Example (SQL statement, then the corresponding streaming expression):

    select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i)
    from collection1
    where text='XXXX'
    group by str_s

    rollup(
        search(collection1,
               q="(text:XXXX)",
               qt="/export",
               fl="str_s,field_i",
               partitionKeys=str_s,
               sort="str_s asc",
               zkHost="localhost:9989/solr"),
        over=str_s,
        count(*),
        sum(field_i),
        min(field_i),
        max(field_i),
        avg(field_i))
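What the rollup in the SQL Guts example computes can be illustrated in memory: because the inner search is sorted on str_s, equal keys arrive next to each other and each group can be aggregated in a single pass. The sketch below shows that idea only. It is not Solr's implementation, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-pass rollup over a stream already sorted by key (not Solr code). */
final class RollupSketch {

    /** Running aggregates for one key: count, sum, min, max, avg. */
    static final class Bucket {
        final String key;
        long count;
        long sum;
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;

        Bucket(String key) {
            this.key = key;
        }

        void add(long value) {
            count++;
            sum += value;
            min = Math.min(min, value);
            max = Math.max(max, value);
        }

        double avg() {
            return (double) sum / count;
        }
    }

    /** keys must be sorted so that equal keys are adjacent, like sort="str_s asc". */
    static List<Bucket> rollup(String[] keys, long[] values) {
        List<Bucket> buckets = new ArrayList<>();
        Bucket current = null;
        for (int i = 0; i < keys.length; i++) {
            // A change of key closes the previous group and opens a new one.
            if (current == null || !current.key.equals(keys[i])) {
                current = new Bucket(keys[i]);
                buckets.add(current);
            }
            current.add(values[i]);
        }
        return buckets;
    }

    public static void main(String[] args) {
        for (Bucket b : rollup(new String[] {"a", "a", "b"}, new long[] {1, 3, 5})) {
            System.out.println(b.key + " count=" + b.count + " sum=" + b.sum
                    + " min=" + b.min + " max=" + b.max + " avg=" + b.avg());
        }
    }
}
```

The single pass only works on sorted input, which is why the slide's search call sorts on the same field it rolls up over.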
  • 41. Never Go Down, or at Least Recover Quickly! Cross Data Center Replication
      • Provides replication between two or more SolrCloud clusters located in two or more data centers
      • Uses existing transaction logs
      • Asynchronous indexing
      • No single point of failure or bottlenecks
      • Leader-to-leader communication to ensure updates are only sent once
  • 42. Make Connections, Get Better Results
      • Graph traversal
          – Find all tweets mentioning “Solr” by me or people I follow
          – Find all draft blog posts about “Parallel SQL” written by a developer
          – Find 3-star hotels in NYC my friends stayed in last year
      • BM25F default similarity
      • Geo3D search
  • 43. But Wait! There’s More!
      • Jetty 9.3 and http2 (6.x)
          – Fully multiplexed over a single connection
          – Reduced chance of distributed deadlock
      • Backup/Restore API
      • Optimizations to distributed search algorithm
      • AngularJS-based UI
  • 45. Resources
      • This code: https://github.com/Lucidworks/fusion-examples/tree/master/great-wide-open-2016
      • Company: http://www.lucidworks.com
      • Our blog: http://www.lucidworks.com/blog
      • Book: http://www.manning.com/ingersoll
      • Solr: http://lucene.apache.org/solr
      • Fusion: http://www.lucidworks.com/products/fusion
      • Twitter: @gsingers
  • 46. Appendix A: SQL details
  • 47. Streaming API & Expressions
    ● API
      ○ Java API to provide programming framework
      ○ Returns tuples as a JSON stream
      ○ org.apache.solr.client.solrj.io
    ● Expressions
      ○ String query language
      ○ Serialization format
      ○ Allows non-Java programmers to access Streaming API
    DocValues must be enabled for any field to be returned
  • 48. Streaming Expression Request
    curl --data-urlencode 'stream=search(sample,
             q="*:*",
             fl="id,field_i",
             sort="field_i asc")' http://localhost:8901/solr/sample/stream
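For clients that are not using curl, the form body sent by the request above can be built with the standard library alone. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, only the expression string and the URL-encoded stream parameter are constructed (no HTTP call is made), and the Charset overload of URLEncoder.encode requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that reproduces what curl's --data-urlencode sends for a search expression. */
final class StreamRequest {

    /** Builds search(collection, q="...", fl="...", sort="...") as on the slide. */
    static String searchExpression(String collection, String q, String fl, String sort) {
        return "search(" + collection
                + ", q=\"" + q + "\""
                + ", fl=\"" + fl + "\""
                + ", sort=\"" + sort + "\")";
    }

    /** Form-encodes the expression as the value of the stream parameter. */
    static String formBody(String expression) {
        return "stream=" + URLEncoder.encode(expression, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String expr = searchExpression("sample", "*:*", "id,field_i", "field_i asc");
        System.out.println(expr);
        System.out.println(formBody(expr));
    }
}
```

The resulting body would be POSTed to the collection's /stream handler with content type application/x-www-form-urlencoded, the same as the curl call.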
  • 49. Streaming Expression Response
    {"responseHeader": {"status": 0, "QTime": 1},
     "tuples": {
         "numFound": -1,
         "start": -1,
         "docs": [
             {"id": "doc1", "field_i": 1},
             {"id": "doc2", "field_i": 2},
             {"EOF": true}]
     }}
  • 50. Architecture
    ● MapReduce-ish
      ○ Borrows shuffling concept from M/R
    ● Logical tiers for performing the query
      ○ SQL tier: translates SQL to streaming expressions for parallel query plan, selects worker nodes, merges results
      ○ Worker tier: executes parallel query plan, streams tuples from data tables back
      ○ Data Table tier: queries SolrCloud collections, performs initial sort and partitioning of results for worker nodes
  • 51. JDBC Client
    ● Parallel SQL includes a “thin” JDBC client
    ● Expanded to include SQL clients such as DbVisualizer (SOLR-8502)
    ● Client only works with Parallel SQL features
  • 52. Learning More
    Joel Bernstein’s presentation at Lucene Revolution:
    ● https://www.youtube.com/watch?v=baWQfHWozXc
    Apache Solr Reference Guide:
    ● https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
    ● https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface
  • 53. Spark Architecture (diagram; labels and notes)
      • My Spark App: my-spark-job.jar (w/ shaded deps)
      • SparkContext (driver): RDD graph, DAG scheduler, block tracker, shuffle tracker
      • Spark Master (daemon): keeps track of live workers, web UI on port 8080, task scheduler, restarts failed tasks
      • Spark Worker Node (1...N of these): Spark Slave (daemon), Spark Executor (JVM process), Tasks, Cache
      • The executor runs in a separate process from the slave daemon
      • Each task works on some partition of a data set to apply a transformation or action
      • Losing a master prevents new applications from being executed; HA can be achieved using ZooKeeper and multiple master nodes
      • Tasks are assigned based on data locality: when selecting which node to execute a task on, the master takes data locality into account