Next Generation Search with
Lucene and Solr 4
Grant Ingersoll
CTO, LucidWorks
Read More: http://ibm.co/1dJvL9k

© Copyright 2013
Search is Dead, Long Live Search
• Search is Everywhere!

• The Bar is Raised

• Holistic view of the data AND the users is critical

© 2013 LucidWorks
Search is good for…
• Classic: Fast, fuzzy text matching across a large
document collection
• NoSQL and De-normalized data
- "light" relational

• Top N problems
• Faceting, slicing and dicing of numerical/enumerated
data
• Spatial, spell checking, record linkage, highlighting

Lucene: Speed and Memory
• Native Near Real Time (NRT) support
- Per segment
- FieldCache can be controlled to only load new segments
- Soft commit -- faster without fsync, allows quicker update
visibility

• DWPT (DocumentsWriterPerThread)
- Faster, more consistent indexing speed

• Faster fuzzy & wildcard query processing
• String -> BytesRef
- Much improved data structure
- … means less memory and less garbage collection effort
Up and to the Right

• http://people.apache.org/~mikemccand/lucenebench/indexing.html

Lucene: Flexibility
• Flexible Index Formats
- New posting list codecs: Block, Simple Text, Append (HDFS..),
etc
- Pulsing codec: improves performance of primary key searches,
inlining docs, positions, and payloads, saves disk seeks

• Pluggable Scoring
- Decoupled from TF/IDF
- Built in alternatives include BM25 & DFR, and others
» http://en.wikipedia.org/wiki/Okapi_BM25
» http://terrier.org/docs/v3.5/dfr_description.html
- Add your own
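
To make the BM25 alternative concrete, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not Lucene's implementation; the function name, the toy documents, and the parameter values k1=1.2 and b=0.75 (commonly used defaults) are assumptions for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" keeps the value non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, already tokenized.
docs = [
    "solr is a search server built on lucene".split(),
    "lucene is a search library".split(),
    "hadoop stores data in hdfs".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
```

The first two documents match both terms equally often, so the shorter one scores higher; the third matches nothing and scores zero.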

FS(A|T)
• Keys:
- byte[] – write-once
- Linear-time build of minimal automata from sorted input (O(n log n) if the input is not sorted, which isn't our case)

- Compression
- Reverse lookups
- Weights (used for auto-suggest)
- Pluggable Algebra

• Uses:
- Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
- FuzzyQuery is 100x faster -- http://bit.ly/hgO65c

• More:
- http://slidesha.re/vKtpVA
- http://bit.ly/Pkjyu0

- "Smaller Representation of Finite State Automata"
» Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118-192.
Recent Additions
• Replication module

• New Faceting capabilities
• New Suggester to handle infix suggestions

• Analysis Additions
- Norwegian, Scandinavian alternatives

• Memory and FST improvements

Solr 4: New Features
• Search/Faceting/Relevance
- New Relevance Function Queries (tf, df, others)
- Pivot Faceting
- Pseudo-join
- Improved Spatial (more later)
- Full support for Lucene Codecs, pluggable scoring

• Indexing
- New Update Processors, including scripting option
- Near real time

• Codec/Similarity support from Lucene 4
• Other
- New Admin UI
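
As a rough illustration of how a few of these features look on the wire, the sketch below builds query strings for pivot faceting, the pseudo-join, and function queries returned as fields. The field names (category, in_stock, manufacturer_id, name, text) and the /select path are assumptions; only the parameter syntax (facet.pivot, {!join from=... to=...}, tf(), docfreq()) is Solr's.

```python
from urllib.parse import urlencode, parse_qs

def build_query(params):
    """Return a relative Solr select URL for the given request parameters."""
    return "/select?" + urlencode(params)

# Pivot faceting: nested facet counts, category first, then in_stock within it.
pivot = build_query({"q": "*:*", "facet": "true",
                     "facet.pivot": "category,in_stock"})

# Pseudo-join: find docs whose id matches manufacturer_id of docs matching name:widget.
join = build_query({"q": "{!join from=manufacturer_id to=id}name:widget"})

# Relevance function queries returned as pseudo-fields in the field list.
func = build_query({"q": "text:solr",
                    "fl": "id,tf(text,'solr'),docfreq(text,'solr')"})
```

Passing the parameters through urlencode keeps the braces, quotes, and colons in the local-params and function syntax safely escaped.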
Geospatial improvements
• Index shapes other than points (circles, polygons, etc)
• More complex interactions than point in a circle using
Well Known Text
• Indexing:
- "geo":"43.17614,-90.57341"
- "geo":"Circle(4.56,1.23 d=0.0710)"
- "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"

• Searching:
- fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
- fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
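
The indexing and searching strings above can be generated rather than hand-typed. This is a minimal sketch that only formats the values shown on the slide; the helper names and document ids are hypothetical, and nothing here talks to a Solr server.

```python
import json
from urllib.parse import urlencode

def bbox_filter(field, min_x, min_y, max_x, max_y):
    """Filter query for shapes intersecting a rectangle (minX minY maxX maxY)."""
    return f'{field}:"Intersects({min_x} {min_y} {max_x} {max_y})"'

def polygon_filter(field, points):
    """Filter query for shapes intersecting a WKT polygon; the ring must be closed."""
    ring = ", ".join(f"{x} {y}" for x, y in points)
    return f'{field}:"Intersects(POLYGON(({ring})))"'

# Documents to index: a point and a circle in the same "geo" field.
docs = [
    {"id": "1", "geo": "43.17614,-90.57341"},
    {"id": "2", "geo": "Circle(4.56,1.23 d=0.0710)"},
]
payload = json.dumps(docs)

box_fq = bbox_filter("geo", -74.093, 41.042, -69.347, 44.558)
poly_fq = polygon_filter(
    "geo", [(-10, 30), (-40, 40), (-10, -20), (40, 20), (0, 0), (-10, 30)]
)
query_string = urlencode({"q": "*:*", "fq": box_fq})
```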
Scaling Solr
• Distributed/sharded indexing & search
- Auto distributes updates and queries to appropriate shards
- Near Real Time (NRT) indexing capable

• Dynamically scalable
- New SolrCloud instances add indexing and query capacity
- Supports re-balancing

• Reliable
- No single point of failure
- Transactions logged
- Robust, automatic recovery

• http://wiki.apache.org/solr/SolrCloud
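
The idea of routing each update to the "appropriate shard" can be pictured as hash-range partitioning of the document id. The sketch below is only an illustration of that idea: it uses CRC32 from the standard library as a stand-in hash, which is an assumption and not SolrCloud's actual router, and the function name is hypothetical.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Map a document id to a shard index by slicing the 32-bit hash space."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # unsigned value in 0 .. 2**32 - 1
    range_size = 2**32 // num_shards        # each shard owns one contiguous slice
    # Clamp so the leftover values at the top of the space land in the last shard.
    return min(h // range_size, num_shards - 1)

# The same id always lands on the same shard, so updates and
# real-time gets for a document can be sent to one place.
assignments = {doc_id: shard_for(doc_id, 4) for doc_id in ("doc1", "doc2", "doc3")}
```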

Solr as NoSQL
• Characteristics
- Non-traditional data stores
- Not designed for SQL type queries
- Distributed fault tolerant architecture
- Document oriented, data format agnostic (JSON, XML, CSV, binary)

• Update durability via transaction log
• Real-time /get fetches latest version w/o hard commit
• Versioning and optimistic locking
- w/ Real Time GET, allows read/write/update w/o conflicts

• Atomic updates
- Can add/remove/change and increment a field in existing doc
w/o re-indexing
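
A minimal sketch of what an atomic update with optimistic locking might look like as a JSON request body. The modifiers set, add and inc and the _version_ field are Solr's; the helper name, the document id, the field names and the version number are made up for illustration.

```python
import json

ALLOWED_OPS = {"set", "add", "inc"}

def atomic_update(doc_id, ops, version=None):
    """Build a JSON body that changes individual fields of one existing document.

    `ops` maps field name -> (modifier, value), e.g. {"copies": ("inc", 1)}.
    """
    doc = {"id": doc_id}
    if version is not None:
        # Optimistic locking: the update is rejected if the stored
        # document's version no longer matches this value.
        doc["_version_"] = version
    for field, (op, value) in ops.items():
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported modifier: {op}")
        doc[field] = {op: value}
    return json.dumps([doc])

# Hypothetical document: bump a counter and replace the title in one request.
body = atomic_update(
    "book1",
    {"copies": ("inc", 1), "title": ("set", "Taming Text")},
    version=12345,
)
```

A typical flow would read the current _version_ with a real-time get, build the body above, and retry from the read if the server reports a version conflict.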

Recent Additions
• HDFS backed directory for storing index and
transaction logs in Apache Hadoop
• New Core discovery capabilities
• Schemaless/External Schema/Field Guessing
• Schema APIs
• Add documents from the Admin UI

Applications

… Find your Keys, Store Your Content
• Lucene/Solr is a fast key-value
store
- Bonus: search your values!

• NoSQL before NoSQL was cool
• Solr: distributed key/value
- Durable, Isolated, Redundant, Fast,
Real-time
- Joins, Column Storage

• Solr or Tika + Lucene can index
popular office formats
• Solr can backup/replicate and
scale as content grows
… Find Love! Upsell! Cross-sell!
• Cross recommendation as search
- with search used to build cross recommendation!

• Recommend content to people who exhibit certain behaviors (clicks,
query terms, other)
• (Ab)use of a search engine
- but not as a search engine for content
- more like a search engine for behavior

• See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal
Recommendation Algorithms
- http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms

• Go get Mahout/Myrrix or just do it in y(our) search engine

… Avoid Delays

… Wibbly-wobbly Timey-wimey Stuff
• Leverage Solr’s new
spatial capabilities to
index non-spatial data,
such as time ranges
- Useful for Open Hours, Shifts,
etc.

• Query using rectangle
intersections
- q=shift:"Intersects(0 19 23 365)"

https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
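
One way to read the query above, assuming each time range is indexed as a 2-D point (x = start, y = end) on a 0..365 scale: a range overlaps the window 19..23 exactly when its start is at most 23 and its end is at least 19, which is the rectangle "0 19 23 365". The sketch below checks that logic in plain Python; the helper names and sample shifts are hypothetical.

```python
def overlap_rectangle(query_start, query_end, min_value=0, max_value=365):
    """Rectangle (min_x, min_y, max_x, max_y) covering every (start, end)
    point whose range overlaps [query_start, query_end]."""
    return (min_value, query_start, query_end, max_value)

def point_in_rectangle(point, rect):
    """True if the (start, end) point lies inside the rectangle."""
    x, y = point
    min_x, min_y, max_x, max_y = rect
    return min_x <= x <= max_x and min_y <= y <= max_y

def shift_query(field, query_start, query_end):
    """Format the rectangle as an Intersects query on `field`."""
    min_x, min_y, max_x, max_y = overlap_rectangle(query_start, query_end)
    return f'{field}:"Intersects({min_x} {min_y} {max_x} {max_y})"'

rect = overlap_rectangle(19, 23)
shifts = {"early": (1, 18), "mid": (10, 20), "late": (24, 30)}
matches = sorted(name for name, pt in shifts.items() if point_in_rectangle(pt, rect))
query = shift_query("shift", 19, 23)
```

Only the "mid" shift (10..20) overlaps the 19..23 window, and the generated query string is the one on the slide.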

Summary
• Lucene/Solr 4.x:
- Faster
- More Flexible
- Easier than ever scaling
- More reliable than ever

• If you need to rank a bunch of stuff according to some
notion of similarity, a search engine is the way to go

Where to Next?
• Full article: http://ibm.co/1dJvL9k
• http://www.lucidworks.com
• http://lucene.apache.org/
• Training: http://bit.ly/lws-training

• LucidWorks Search (Solr++) more info: http://bit.ly/lws-moreinfo
• Twitter: @gsingers, @LucidWorks
• Taming Text: http://www.manning.com/ingersoll

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Data IO: Next Generation Search with Lucene and Solr 4

  • 1. Next Generation Search with Lucene and Solr 4
      Grant Ingersoll, CTO, LucidWorks
      Read More: http://ibm.co/1dJvL9k
      © Copyright 2013
  • 2. Search is Dead, Long Live Search
      • Search is Everywhere!
      • The Bar is Raised
      • Holistic view of the data AND the users is critical
  • 3. Search is good for…
      • Classic: Fast, fuzzy text matching across a large document collection
      • NoSQL and de-normalized data
          - "light" relational
      • Top N problems
      • Faceting, slicing and dicing of numerical/enumerated data
      • Spatial, spell checking, record linkage, highlighting
  • 5. Lucene: Speed and Memory
      • Native Near Real Time (NRT) support
          - Per segment
          - FieldCache can be controlled to only load new segments
          - Soft commit: faster without fsync, allows quicker update visibility
      • DWPT (Document Writer per Thread)
          - Faster, more consistent indexing speed
      • Faster fuzzy & wildcard query processing
      • String -> BytesRef
          - Much improved data structure
          - … means less memory and less garbage collection effort
  • 6. Up and to the Right
      • http://people.apache.org/~mikemccand/lucenebench/indexing.html
  • 7. Lucene: Flexibility
      • Flexible Index Formats
          - New posting list codecs: Block, Simple Text, Append (HDFS..), etc.
          - Pulsing codec: improves performance of primary key searches by inlining docs, positions, and payloads, which saves disk seeks
      • Pluggable Scoring
          - Decoupled from TF/IDF
          - Built-in alternatives include BM25, DFR, and others
              » http://en.wikipedia.org/wiki/Okapi_BM25
              » http://terrier.org/docs/v3.5/dfr_description.html
          - Add your own
  • 8. FS(A|T)
      • Keys:
          - byte[] – write-once
          - Linear time build of min. automata (nlogn if not sorted, which isn't our case)
          - Compression
          - Reverse lookups
          - Weights (used for auto-suggest)
          - Pluggable Algebra
      • Uses:
          - Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
          - FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
      • More:
          - http://slidesha.re/vKtpVA
          - http://bit.ly/Pkjyu0
          - "Smaller Representation of Finite State Automata"
              » Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118-192.
  • 9. Recent Additions
      • Replication module
      • New Faceting capabilities
      • New Suggester to handle infix suggestions
      • Analysis Additions
          - Norwegian, Scandinavian alternatives
      • Memory and FST improvements
  • 11. Solr 4: New Features
      • Search/Faceting/Relevance
          - New Relevance Function Queries (tf, df, others)
          - Pivot Faceting
          - Pseudo-join
          - Improved Spatial (more later)
          - Full support for Lucene Codecs, pluggable scoring
      • Indexing
          - New Update Processors, including scripting option
          - Near real time
      • Codec/Similarity support from Lucene 4
      • Other
          - New Admin UI
  • 12. Geospatial improvements
      • Index shapes other than points (circles, polygons, etc.)
      • More complex interactions than point in a circle using Well Known Text
      • Indexing:
          - "geo":"43.17614,-90.57341"
          - "geo":"Circle(4.56,1.23 d=0.0710)"
          - "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
      • Searching:
          - fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
          - fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
  • 13. Scaling Solr
      • Distributed/sharded indexing & search
          - Auto distributes updates and queries to appropriate shards
          - Near Real Time (NRT) indexing capable
      • Dynamically scalable
          - New SolrCloud instances add indexing and query capacity
          - Supports re-balancing
      • Reliable
          - No single point of failure
          - Transactions logged
          - Robust, automatic recovery
      • http://wiki.apache.org/solr/SolrCloud
  • 14. Solr as NoSQL
      • Characteristics
          - Non-traditional data stores
          - Not designed for SQL type queries
          - Distributed fault tolerant architecture
          - Document oriented, data format agnostic (JSON, XML, CSV, binary)
      • Update durability via transaction log
      • Real-time /get fetches latest version w/o hard commit
      • Versioning and optimistic locking
          - w/ Real Time GET, allows read/write/update w/o conflicts
      • Atomic updates
          - Can add/remove/change and increment a field in existing doc w/o re-indexing
  • 15. Recent Additions
      • HDFS backed directory for storing index and transaction logs in Apache Hadoop
      • New Core discovery capabilities
      • Schemaless/External Schema/Field Guessing
      • Schema APIs
      • Add documents from the Admin UI
  • 17. … Find your Keys, Store Your Content
      • Lucene/Solr is a fast key-value store
          - Bonus: search your values!
      • NoSQL before NoSQL was cool
      • Solr: distributed key/value
          - Durable, Isolated, Redundant, Fast, Real-time
          - Joins, Column Storage
      • Solr or Tika + Lucene can index popular office formats
      • Solr can backup/replicate and scale as content grows
  • 18. … Find Love! Upsell! Cross-sell!
      • Cross recommendation as search
          - with search used to build cross recommendation!
      • Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
      • (Ab)use of a search engine
          - but not as a search engine for content
          - more like a search engine for behavior
      • See Ted Dunning's talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
          - http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
      • Go get Mahout/Myrrix or just do it in y(our) search engine
  • 19. … Avoid Delays
  • 20. … Wibbly-wobbly Timey-wimey Stuff
      • Leverage Solr's new spatial capabilities to index non-spatial data, such as time ranges
          - Useful for Open Hours, Shifts, etc.
      • Query using rectangle intersections
          - q = shift:"Intersects(0 19 23 365)"
      • https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
  • 21. Summary
      • Lucene/Solr 4.x:
          - Faster
          - More Flexible
          - Easier than ever scaling
          - More reliable than ever
      • If you need to rank a bunch of stuff according to some notion of similarity, a search engine is the way to go
  • 22. Where to Next?
      • Full article: http://ibm.co/1dJvL9k
      • http://www.lucidworks.com
      • http://lucene.apache.org/
      • Training: http://bit.ly/lws-training
      • LucidWorks Search (Solr++) more info: http://bit.ly/lws-moreinfo
      • Twitter: @gsingers, @LucidWorks
      • Taming Text: http://www.manning.com/ingersoll
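
The geospatial slide (12) shows `fq` filters written as `Intersects(...)` expressions. As a minimal sketch of how a client might assemble those strings, the Python below rebuilds the two search examples from the slide. The helper names `intersects_rect` and `intersects_polygon` are hypothetical, and reading the four rectangle numbers as min-x, min-y, max-x, max-y is an assumption rather than something the slide states.

```python
from urllib.parse import urlencode


def intersects_rect(field, min_x, min_y, max_x, max_y):
    """Build a filter matching shapes that intersect a rectangle.

    Assumption: the four numbers are min-x, min-y, max-x, max-y.
    """
    return f'{field}:"Intersects({min_x} {min_y} {max_x} {max_y})"'


def intersects_polygon(field, points):
    """Build a filter for a WKT polygon; `points` must form a closed ring."""
    if points[0] != points[-1]:
        raise ValueError("polygon ring must end where it starts")
    ring = ", ".join(f"{x} {y}" for x, y in points)
    return f'{field}:"Intersects(POLYGON(({ring})))"'


# The two search examples from the slide, rebuilt from numbers.
rect_fq = intersects_rect("geo", -74.093, 41.042, -69.347, 44.558)
poly_fq = intersects_polygon(
    "geo", [(-10, 30), (-40, 40), (-10, -20), (40, 20), (0, 0), (-10, 30)]
)

# URL-encode the parameters before appending them to a select request.
query_string = urlencode({"q": "*:*", "fq": rect_fq})
```

Building the expression from numbers rather than by hand keeps the quoting and the closed-ring requirement in one place.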

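Slide 20 queries time ranges with a rectangle intersection but does not spell out the mapping. One plausible reading, stated here as an assumption, is that each shift is indexed as the point (x = start, y = end), so "overlaps the range 19 to 23" becomes "start <= 23 and end >= 19", which is a rectangle in that plane. The sketch below uses hypothetical helper names and checks that the rectangle test agrees with the plain overlap test.

```python
def overlaps(start, end, q_start, q_end):
    """Plain interval test: does [start, end] overlap [q_start, q_end]?"""
    return start <= q_end and end >= q_start


def point_in_rect(x, y, min_x, min_y, max_x, max_y):
    """Rectangle containment for a point (x, y)."""
    return min_x <= x <= max_x and min_y <= y <= max_y


def overlap_query(field, q_start, q_end, lo=0, hi=365):
    """Build the rectangle query for 'ranges overlapping [q_start, q_end]'.

    Assumption: ranges are indexed as points (start, end) with values
    between `lo` and `hi`, so the rectangle spans x in [lo, q_end]
    and y in [q_start, hi].
    """
    return f'{field}:"Intersects({lo} {q_start} {q_end} {hi})"'


# Reproduces the query string shown on the slide.
shift_query = overlap_query("shift", 19, 23)
```

Under this reading, the `0` and `365` on the slide are just the bounds of the value space, and only the middle two numbers carry the query range.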
Editor's notes

  1. The bar is raised: when we first started Lucid, most problems were about standing up Lucene or Solr or dealing with performance issues. Now the large majority are about taking search to the next level: better relevance, personalization, recommendations, and so on.
  2. Power users are often more likely to recover. Tools for recovery: auto-suggest, related searches, spelling suggestions.
  3. Characteristics. Conflicts from other clients.
  4. Oh, BTW, it can do search over the values. Keys can be anything, not just strings.
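
The "Solr as NoSQL" slide mentions atomic updates and optimistic locking, and note 3 refers to conflicts from other clients. As a rough sketch, the Python below builds a JSON update body using Solr's `set`, `add`, and `inc` field modifiers plus the `_version_` field. The helper name `atomic_update` and the field names `price` and `views` are hypothetical, and the body is only constructed, not sent.

```python
import json

# Field modifiers this sketch accepts.
ALLOWED_MODIFIERS = {"set", "add", "inc"}


def atomic_update(doc_id, ops, version=None):
    """Return a JSON body that changes only the listed fields of one doc.

    `ops` maps a field name to a (modifier, value) pair. If `version` is
    given, it is sent as `_version_` so the update can be rejected when
    another client has changed the document in the meantime.
    """
    doc = {"id": doc_id}
    if version is not None:
        doc["_version_"] = version
    for field, (modifier, value) in ops.items():
        if modifier not in ALLOWED_MODIFIERS:
            raise ValueError(f"unsupported modifier: {modifier}")
        doc[field] = {modifier: value}
    return json.dumps([doc])


# Hypothetical document: set a price and bump a counter in one request.
body = atomic_update(
    "book1", {"price": ("set", 9.99), "views": ("inc", 1)}, version=12345
)
```

A client would typically read the current `_version_` through the real-time `/get` handler, post a body like this, and retry on a version conflict.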