SlideShare una empresa de Scribd logo
1 de 20
Where It All Began Using Apache Hadoop for Search with Apache Lucene and Solr
Topics Search What is: Apache Lucene? Apache Nutch? Apache Solr? Where does Hadoop (ecosystem) fit? Indexing Search Other
Search 101 Search tools are designed for dealing with fuzzy data Works well with structured and unstructured data Performs well when dealing with large volumes of data Many apps don’t need the limits that databases place on content Search fits well alongside a DB too Given a user’s information need, (query) find and, optionally, score content relevant to that need Many different ways to solve this problem, each with tradeoffs What’s “relevant” mean?
Search 101 Relevance Indexing Finds  and maps terms and documents  Conceptually similar to a book index At the heart of fast search/retrieve Vector Space Model (VSM) for relevance Common across many search engines Apache Lucene is a highly optimized implementation of the VSM
Lucene is a mature, high performance Java API to provide search capabilities to applications Supports indexing, searching and a number of other commonly used search features (highlighting, spell checking, etc.) Not a crawler and doesn’t know anything about Adobe PDF, MS Word, etc. Created in 1997 and now part of the Apache Software Foundation Important to note that Lucene does not have distributed index (shard) support
Nutch ASF project aimed at providing large scale crawling, indexing and searching using Lucene and other technologies Mike Cafarella and Doug Cutting originally created Hadoop as part of Nutch based on the Google paper by Dean and Ghemawat http://labs.google.com/papers/mapreduce.html Only much later did it spin out to become the Hadoop that we all know In other words, Hadoop was born from the need to scale search crawling and indexing Originally used Lucene for search/indexing, now uses Solr
Solr Solr is the Lucene based search server providing the infrastructure required for most users to work with Lucene  Without knowing Java! Also provides: Easy setup and configuration Faceting Highlighting Replication/Sharding Lucene Best Practices http://search.lucidimagination.com
Lucene Basics Content is modeled via Documents and Fields Content can be text, integers, floats, dates, custom Analysis can be employed to alter content before indexing Searches are supported through a wide range of Query options Keyword Terms Phrases Wildcards, other
Quick Solr Demo Pre-reqs: Apache Ant 1.7.x SVN svn co https://svn.apache.org/repos/asf/lucene/dev/trunksolr-trunk cdsolr-trunk/solr/ ant example cd example java –jar start.jar cdexampledocs; java –jar post.jar *.xml http://localhost:8983/solr/browse
Anatomy of a Distributed Search System Users Input Docs Application Fan In/Out Shard[0] Shard[n] Sharding Alg. Coordination Layer Searchers Indexers … … … … … Shard[0] Shard[n]
Sharding Algorithm Good document distribution across shards is important Simple approach: hash(id) % numShards Fine if number of shards doesn’t change or easy to reindex Better: Consistent Hashing http://en.wikipedia.org/wiki/Consistent_hashing Also key: how to deal with the shape/size of the cluster changing
Hadoop and Search Much of the Hadoop ecosystem is useful for search related functionality Indexing Process of adding documents to inverted index to make them searchable In most cases, batch-oriented and embarrassingly parallel, so Hadoop Core can help Search Query the index and return documents and other info (facets, etc.) related to the result set Subsecond response time usually required ZooKeeper, Avro and others are still useful
Indexing (Lucene) Hadoop ships with contrib/index Almost no documentation, but… Good example of map-side indexing Mapper does analysis and creates in memory index which is written out to segments Indexes merged on the reduce side Katta http://katta.sourceforge.net Shard management, distributed search, etc. Both give you large amount of control, but you have to build out all the search framework around it
Indexing (Solr) https://issues.apache.org/jira/browse/SOLR-1301 Map side formats Reduce-side indexing Creates indexes on local file system (outside of HDFS) and copies to default FS (HDFS, etc.) Manually install index into a Solr core once built https://issues.apache.org/jira/browse/SOLR-1045 Map-side indexing Incomplete, but based on Hadoop contrib/index Write a distributed Update Handler to handle on the server side
Indexing (Nutch to Solr) Use Nutch to crawl content, Solr to index and serve Doesn’t support indexing to Solr shards just yet Need to write/use Solr distributed Update Handler Still useful for smaller crawls (< 100M pages) http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
Searching Hadoop Core is not all that useful for distributed search Exception: Hadoop RPC layer, possibly Exception: Log analysis, etc. for search related items Other Hadoop ecosystem tools are useful: Apache ZooKeeper (more in a moment) HDFS – storage of shards (pull down to local disk) Avro, Thrift, Protocol Buffers (serialization utilities)
ZooKeeper and Search ZooKeeper is a centralized service for coordination, configuration, naming and distributed synchronization In the context of search, it’s useful for: Sharing configuration across nodes Maintaining status about shards Up/down/latency/rebalancing and more Coordinating searches across shards/load balancing
ZooKeeper and Search (Practical) Katta employs ZooKeeper for search coordination, etc. Query distribution, status, etc. Solr Cloud All the benefits of Solr + ZooKeeper for coordinating distributed capabilities Query distribution, configuration sharing, status, etc. About to be committed to Solr trunk http://wiki.apache.org/solr/SolrCloud
Other Search Related Tasks Log Analysis Query analytics Related Searches Relevance assessments Classification and Clustering Mahout – http://mahout.apache.org HBase and other stores for documents Avro, Thrift, Protocol Buffers for serialization of objects across the wire
Resources http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/ http://hadoop.apache.org http://nutch.apache.org http://lucene.apache.org http://www.lucidimagination.com

Más contenido relacionado

La actualidad más candente

Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Search domain basics
Search domain basicsSearch domain basics
Search domain basicspmanvi
 
Ahsay Backup Software v7 - Datasheet
Ahsay Backup Software v7 - DatasheetAhsay Backup Software v7 - Datasheet
Ahsay Backup Software v7 - DatasheetRonnie Chan
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big featuresDavid Smiley
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrLucidworks (Archived)
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy SokolenkoProvectus
 
adaptTo() 2014 - Integrating Open Source Search with CQ/AEM
adaptTo() 2014 - Integrating Open Source Search with CQ/AEMadaptTo() 2014 - Integrating Open Source Search with CQ/AEM
adaptTo() 2014 - Integrating Open Source Search with CQ/AEMtherealgaston
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 
Tips for Tuning Solr Search: No Coding Required
Tips for Tuning Solr Search: No Coding RequiredTips for Tuning Solr Search: No Coding Required
Tips for Tuning Solr Search: No Coding RequiredAcquia
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrTrey Grainger
 
Scala and jvm_languages_praveen_technologist
Scala and jvm_languages_praveen_technologistScala and jvm_languages_praveen_technologist
Scala and jvm_languages_praveen_technologistpmanvi
 
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...lucenerevolution
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Lucidworks
 
Solr at zvents 6 years later & still going strong
Solr at zvents   6 years later & still going strongSolr at zvents   6 years later & still going strong
Solr at zvents 6 years later & still going stronglucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
Basic System Design Geliyoo Search Engine
Basic System Design Geliyoo Search EngineBasic System Design Geliyoo Search Engine
Basic System Design Geliyoo Search EngineXtremcoin and Geliyoo
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
 

La actualidad más candente (20)

Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Search domain basics
Search domain basicsSearch domain basics
Search domain basics
 
Ahsay Backup Software v7 - Datasheet
Ahsay Backup Software v7 - DatasheetAhsay Backup Software v7 - Datasheet
Ahsay Backup Software v7 - Datasheet
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big features
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
 
EVOLVE'13 | Enhance | External Search | Matthias Wermund
EVOLVE'13 | Enhance | External Search | Matthias WermundEVOLVE'13 | Enhance | External Search | Matthias Wermund
EVOLVE'13 | Enhance | External Search | Matthias Wermund
 
adaptTo() 2014 - Integrating Open Source Search with CQ/AEM
adaptTo() 2014 - Integrating Open Source Search with CQ/AEMadaptTo() 2014 - Integrating Open Source Search with CQ/AEM
adaptTo() 2014 - Integrating Open Source Search with CQ/AEM
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Tips for Tuning Solr Search: No Coding Required
Tips for Tuning Solr Search: No Coding RequiredTips for Tuning Solr Search: No Coding Required
Tips for Tuning Solr Search: No Coding Required
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Scala and jvm_languages_praveen_technologist
Scala and jvm_languages_praveen_technologistScala and jvm_languages_praveen_technologist
Scala and jvm_languages_praveen_technologist
 
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
 
Solr at zvents 6 years later & still going strong
Solr at zvents   6 years later & still going strongSolr at zvents   6 years later & still going strong
Solr at zvents 6 years later & still going strong
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
How Solr Search Works
How Solr Search WorksHow Solr Search Works
How Solr Search Works
 
Basic System Design Geliyoo Search Engine
Basic System Design Geliyoo Search EngineBasic System Design Geliyoo Search Engine
Basic System Design Geliyoo Search Engine
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearch
 

Destacado

Migration from FAST ESP to Solr
Migration from FAST ESP to SolrMigration from FAST ESP to Solr
Migration from FAST ESP to SolrTNR Global
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...rhatr
 
Enterprise Search: An Information Architect's Perspective
Enterprise Search: An Information Architect's PerspectiveEnterprise Search: An Information Architect's Perspective
Enterprise Search: An Information Architect's PerspectivePeter Morville
 
Real-time searching of big data with Solr and Hadoop
Real-time searching of big data with Solr and HadoopReal-time searching of big data with Solr and Hadoop
Real-time searching of big data with Solr and HadoopRogue Wave Software
 
Web scraping with nutch solr
Web scraping with nutch solrWeb scraping with nutch solr
Web scraping with nutch solrMike Frampton
 
Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & SolrLucidworks
 
Enterprise Search Summit Keynote: A Big Data Architecture for Search
Enterprise Search Summit Keynote: A Big Data Architecture for SearchEnterprise Search Summit Keynote: A Big Data Architecture for Search
Enterprise Search Summit Keynote: A Big Data Architecture for SearchSearch Technologies
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...lucenerevolution
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaLucidworks
 
Dlvr.it 使用說明
Dlvr.it 使用說明Dlvr.it 使用說明
Dlvr.it 使用說明waytorich
 
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & SolrFaster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & SolrKen Krugler
 

Destacado (11)

Migration from FAST ESP to Solr
Migration from FAST ESP to SolrMigration from FAST ESP to Solr
Migration from FAST ESP to Solr
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
 
Enterprise Search: An Information Architect's Perspective
Enterprise Search: An Information Architect's PerspectiveEnterprise Search: An Information Architect's Perspective
Enterprise Search: An Information Architect's Perspective
 
Real-time searching of big data with Solr and Hadoop
Real-time searching of big data with Solr and HadoopReal-time searching of big data with Solr and Hadoop
Real-time searching of big data with Solr and Hadoop
 
Web scraping with nutch solr
Web scraping with nutch solrWeb scraping with nutch solr
Web scraping with nutch solr
 
Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
Enterprise Search Summit Keynote: A Big Data Architecture for Search
Enterprise Search Summit Keynote: A Big Data Architecture for SearchEnterprise Search Summit Keynote: A Big Data Architecture for Search
Enterprise Search Summit Keynote: A Big Data Architecture for Search
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
 
Dlvr.it 使用說明
Dlvr.it 使用說明Dlvr.it 使用說明
Dlvr.it 使用說明
 
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & SolrFaster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
 

Similar a TriHUG: Lucene Solr Hadoop

Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchMark Miller
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5israelekpo
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Get involved with the Apache Software Foundation
Get involved with the Apache Software FoundationGet involved with the Apache Software Foundation
Get involved with the Apache Software FoundationShalin Shekhar Mangar
 
Big Data - Hadoop Ecosystem
Big Data -  Hadoop Ecosystem Big Data -  Hadoop Ecosystem
Big Data - Hadoop Ecosystem nuriadelasheras
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchpmanvi
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceChris Nauroth
 
ElasticSearch Basic Introduction
ElasticSearch Basic IntroductionElasticSearch Basic Introduction
ElasticSearch Basic IntroductionMayur Rathod
 
Apache Solr-Webinar
Apache Solr-WebinarApache Solr-Webinar
Apache Solr-WebinarEdureka!
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01David Smiley
 
ElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedBeyondTrees
 
Apace Solr Web Development.pdf
Apace Solr Web Development.pdfApace Solr Web Development.pdf
Apace Solr Web Development.pdfAbanti Aazmin
 

Similar a TriHUG: Lucene Solr Hadoop (20)

Apache Lucene Searching The Web
Apache Lucene Searching The WebApache Lucene Searching The Web
Apache Lucene Searching The Web
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data Search
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Get involved with the Apache Software Foundation
Get involved with the Apache Software FoundationGet involved with the Apache Software Foundation
Get involved with the Apache Software Foundation
 
Searchlight
SearchlightSearchlight
Searchlight
 
Big Data - Hadoop Ecosystem
Big Data -  Hadoop Ecosystem Big Data -  Hadoop Ecosystem
Big Data - Hadoop Ecosystem
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Cloudera search
Cloudera searchCloudera search
Cloudera search
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
 
Solr 8 interview
Solr 8 interview Solr 8 interview
Solr 8 interview
 
ElasticSearch Basic Introduction
ElasticSearch Basic IntroductionElasticSearch Basic Introduction
ElasticSearch Basic Introduction
 
Apache Solr-Webinar
Apache Solr-WebinarApache Solr-Webinar
Apache Solr-Webinar
 
Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
ElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learned
 
Apace Solr Web Development.pdf
Apace Solr Web Development.pdfApace Solr Web Development.pdf
Apace Solr Web Development.pdf
 
Bigdata ppt
Bigdata pptBigdata ppt
Bigdata ppt
 
Bigdata
BigdataBigdata
Bigdata
 

Más de Grant Ingersoll

This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineGrant Ingersoll
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Grant Ingersoll
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopGrant Ingersoll
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xGrant Ingersoll
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and MahoutGrant Ingersoll
 
Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopGrant Ingersoll
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionGrant Ingersoll
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemGrant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Grant Ingersoll
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsGrant Ingersoll
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopGrant Ingersoll
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantGrant Ingersoll
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsGrant Ingersoll
 

Más de Grant Ingersoll (20)

Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
 
Intro to Search
Intro to SearchIntro to Search
Intro to Search
 
Open Source Search FTW
Open Source Search FTWOpen Source Search FTW
Open Source Search FTW
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
 
Taming Text
Taming TextTaming Text
Taming Text
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and Mahout
 
Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
 

Último

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 

Último (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 

TriHUG: Lucene Solr Hadoop

  • 1. Where It All Began Using Apache Hadoop for Search with Apache Lucene and Solr
  • 2. Topics Search What is: Apache Lucene? Apache Nutch? Apache Solr? Where does Hadoop (ecosystem) fit? Indexing Search Other
  • 3. Search 101 Search tools are designed for dealing with fuzzy data Works well with structured and unstructured data Performs well when dealing with large volumes of data Many apps don’t need the limits that databases place on content Search fits well alongside a DB too Given a user’s information need, (query) find and, optionally, score content relevant to that need Many different ways to solve this problem, each with tradeoffs What’s “relevant” mean?
  • 4. Search 101 Relevance Indexing Finds and maps terms and documents Conceptually similar to a book index At the heart of fast search/retrieve Vector Space Model (VSM) for relevance Common across many search engines Apache Lucene is a highly optimized implementation of the VSM
  • 5. Lucene is a mature, high performance Java API to provide search capabilities to applications Supports indexing, searching and a number of other commonly used search features (highlighting, spell checking, etc.) Not a crawler and doesn’t know anything about Adobe PDF, MS Word, etc. Created in 1997 and now part of the Apache Software Foundation Important to note that Lucene does not have distributed index (shard) support
  • 6. Nutch ASF project aimed at providing large scale crawling, indexing and searching using Lucene and other technologies Mike Cafarella and Doug Cutting originally created Hadoop as part of Nutch based on the Google paper by Dean and Ghemawat http://labs.google.com/papers/mapreduce.html Only much later did it spin out to become the Hadoop that we all know In other words, Hadoop was born from the need to scale search crawling and indexing Originally used Lucene for search/indexing, now uses Solr
  • 7. Solr Solr is the Lucene based search server providing the infrastructure required for most users to work with Lucene Without knowing Java! Also provides: Easy setup and configuration Faceting Highlighting Replication/Sharding Lucene Best Practices http://search.lucidimagination.com
  • 8. Lucene Basics Content is modeled via Documents and Fields Content can be text, integers, floats, dates, custom Analysis can be employed to alter content before indexing Searches are supported through a wide range of Query options Keyword Terms Phrases Wildcards, other
  • 9. Quick Solr Demo Pre-reqs: Apache Ant 1.7.x SVN svn co https://svn.apache.org/repos/asf/lucene/dev/trunksolr-trunk cdsolr-trunk/solr/ ant example cd example java –jar start.jar cdexampledocs; java –jar post.jar *.xml http://localhost:8983/solr/browse
  • 10. Anatomy of a Distributed Search System Users Input Docs Application Fan In/Out Shard[0] Shard[n] Sharding Alg. Coordination Layer Searchers Indexers … … … … … Shard[0] Shard[n]
  • 11. Sharding Algorithm Good document distribution across shards is important Simple approach: hash(id) % numShards Fine if number of shards doesn’t change or easy to reindex Better: Consistent Hashing http://en.wikipedia.org/wiki/Consistent_hashing Also key: how to deal with the shape/size of the cluster changing
  • 12. Hadoop and Search Much of the Hadoop ecosystem is useful for search related functionality Indexing Process of adding documents to inverted index to make them searchable In most cases, batch-oriented and embarrassingly parallel, so Hadoop Core can help Search Query the index and return documents and other info (facets, etc.) related to the result set Subsecond response time usually required ZooKeeper, Avro and others are still useful
  • 13. Indexing (Lucene) Hadoop ships with contrib/index Almost no documentation, but… Good example of map-side indexing Mapper does analysis and creates in memory index which is written out to segments Indexes merged on the reduce side Katta http://katta.sourceforge.net Shard management, distributed search, etc. Both give you large amount of control, but you have to build out all the search framework around it
  • 14. Indexing (Solr) https://issues.apache.org/jira/browse/SOLR-1301 Map side formats Reduce-side indexing Creates indexes on local file system (outside of HDFS) and copies to default FS (HDFS, etc.) Manually install index into a Solr core once built https://issues.apache.org/jira/browse/SOLR-1045 Map-side indexing Incomplete, but based on Hadoop contrib/index Write a distributed Update Handler to handle on the server side
  • 15. Indexing (Nutch to Solr) Use Nutch to crawl content, Solr to index and serve Doesn’t support indexing to Solr shards just yet Need to write/use Solr distributed Update Handler Still useful for smaller crawls (< 100M pages) http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
  • 16. Searching Hadoop Core is not all that useful for distributed search Exception: Hadoop RPC layer, possibly Exception: Log analysis, etc. for search related items Other Hadoop ecosystem tools are useful: Apache ZooKeeper (more in a moment) HDFS – storage of shards (pull down to local disk) Avro, Thrift, Protocol Buffers (serialization utilities)
  • 17. ZooKeeper and Search ZooKeeper is a centralized service for coordination, configuration, naming and distributed synchronization In the context of search, it’s useful for: Sharing configuration across nodes Maintaining status about shards Up/down/latency/rebalancing and more Coordinating searches across shards/load balancing
  • 18. ZooKeeper and Search (Practical) Katta employs ZooKeeper for search coordination, etc. Query distribution, status, etc. Solr Cloud All the benefits of Solr + ZooKeeper for coordinating distributed capabilities Query distribution, configuration sharing, status, etc. About to be committed to Solr trunk http://wiki.apache.org/solr/SolrCloud
  • 19. Other Search Related Tasks Log Analysis Query analytics Related Searches Relevance assessments Classification and Clustering Mahout – http://mahout.apache.org HBase and other stores for documents Avro, Thrift, Protocol Buffers for serialization of objects across the wire
  • 20. Resources http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/ http://hadoop.apache.org http://nutch.apache.org http://lucene.apache.org http://www.lucidimagination.com

Notas del editor

  1. Do this
  2. Talk about why the need to do this