Big Search w/ Big Data Principles
Basis Technology Open Source Search 2012
Eric Pugh | epugh@o19s.com | @dep4b




Tuesday, October 2, 2012
What is Big Search?

Who am I?
      • Principal of OpenSource Connections
             - Solr/Lucene Search Consultancy
      • Member of the Apache Software Foundation
      • SOLR-284 UpdateRichDocuments (July 07)
      • Fascinated by the art of software development


CO-AUTHOR (2nd edition!)




Telling some war stories




                    • Prototyping
                    • Application Development
                    • Maintaining Your Big Search Indexes


Not an intro to SolrCloud!

                    • Great tutorials given by Tomás Fernández Löbbe from LucidWorks yesterday!




Background for Client X’s Project
                    • Big Data is any data set that is primarily at rest due to the difficulty of working with it.
                    • Hundreds of millions of documents to search.
                    • Limited selection of tools available.
                    • Aggressive timeline.
                    • All the data must be searched per query.
                    • On the Solr 3.x line.

Telling some stories

                    • Prototyping
                    • Application Development
                    • Maintaining Your Big Search Indexes


Boy meets Girl Story

[Diagram: Metadata and Content Files feed the Ingest Pipeline, which writes to a stack of Solr instances.]



Bash Rocks
                    • Remote Solr stop/start scripts
                    • Remote Indexer stop/start scripts
                    • Performance Monitoring
                    • Content Extraction scripts (+Java)
                    • Ingestor Scripts (+Java)
                    • Artifact Deployment (CM)

Make it easy to change approach




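Make it easy to change sharding

        import java.util.List;
        import java.util.Map;
        import org.apache.solr.common.SolrInputDocument;

        // IndexStrategy is the project's own interface. The implementation is loaded
        // by class name, so the sharding approach can be swapped without touching callers.
        public void run(Map options, List<SolrInputDocument> docs) throws
                InstantiationException, IllegalAccessException, ClassNotFoundException {
            IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
                    "com.o19s.solr.ModShardIndexStrategy").newInstance();
            indexStrategy.configure(options);

            // Hand every document to the configured strategy.
            for (SolrInputDocument doc : docs) {
                indexStrategy.addDocument(doc);
            }
        }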
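The slide names the strategy class but does not show its body. A minimal sketch of what a mod-based sharding strategy might do, assuming documents are routed by hashing their id. `ShardRouter` and `ModShardRouter` are hypothetical names for illustration, not the project's `IndexStrategy` API:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the routing part of a sharding strategy. */
interface ShardRouter {
    int shardFor(String docId);
}

/** Routes a document to a shard by hashing its id modulo the shard count. */
class ModShardRouter implements ShardRouter {
    private final int numShards;

    ModShardRouter(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.numShards = numShards;
    }

    @Override
    public int shardFor(String docId) {
        // floorMod keeps the result in [0, numShards) even for negative hash codes.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /** Groups document ids into per-shard batches; index i holds the ids for shard i. */
    List<List<String>> partition(List<String> docIds) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            batches.add(new ArrayList<>());
        }
        for (String id : docIds) {
            batches.get(shardFor(id)).add(id);
        }
        return batches;
    }
}
```

Because the shard is a pure function of the id, re-running an ingest sends each document to the same shard, as long as the shard count does not change.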




Separate JVM from Solr Cores
                    • Step 1: Fire up empty Solrs on all the servers (nohup &).
                    • Step 2: Verify they started cleanly.
                    • Step 3: Create Cores (curl "http://search1.o19s.com:8983/solr/admin/cores?action=CREATE&name=run2")
                    • Step 4: Create an “aggregator” core, passing in URLs of Cores. (&property.shards=)
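Steps 3 and 4 lend themselves to scripting. The sketch below only builds the CoreAdmin request URLs. It assumes the default `/admin/cores` admin path, and the class and method names are illustrative. How the aggregator core consumes the `shards` property (for example through `${shards}` in its solrconfig.xml) is also an assumption, not something the slide states:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

class CoreAdminUrls {
    /** Builds a CoreAdmin CREATE request URL for one core on one Solr instance. */
    static String createCore(String solrBase, String coreName) {
        return solrBase + "/admin/cores?action=CREATE&name=" + encode(coreName);
    }

    /** Builds the CREATE URL for an aggregator core, passing the shard list as a core property. */
    static String createAggregator(String solrBase, String coreName, List<String> shardCores) {
        String shards = String.join(",", shardCores);
        return createCore(solrBase, coreName) + "&property.shards=" + encode(shards);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

Generating the URLs from a list of hosts and core names makes it cheap to rebuild the same layout for the next run (run3, run4, and so on).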
Go Wide Quickly



[Diagram: three servers, search1.o19s.com, search2.o19s.com, and search3.o19s.com, each run several Solr JVMs on ports such as :8983, :8984, and :8985, and each JVM hosts a stack of shard cores (shard1 up to shard8 or shard12).]

Simple Pipeline


                    •      Simple pipeline

                    •      mv is atomic
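A rename within one filesystem is atomic, so a consumer watching a directory never sees a half-written file. A minimal sketch of that hand-off in Java, assuming a work directory and a ready directory on the same filesystem (both directory roles and the class name are hypothetical):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class AtomicHandoff {
    /**
     * Writes content to a temp file in workDir, then renames it into readyDir.
     * ATOMIC_MOVE throws AtomicMoveNotSupportedException when the move cannot
     * be done atomically, for example across filesystems.
     */
    static Path publish(Path workDir, Path readyDir, String fileName, String content)
            throws IOException {
        Path tmp = Files.createTempFile(workDir, fileName, ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        Path target = readyDir.resolve(fileName);
        return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The next pipeline stage only ever lists the ready directory, so every file it picks up is complete.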




Don’t Move Files
                    • SCP across machines is slow and error prone.
                    • An NFS share is a single point of failure.
                    • A clustered file system like GFS (Global File System) can have “fencing” issues.
                    • HDFS shines here.
                    • ZooKeeper shines here.

Can you test your changes?


JVM tuning is black art
                    -verbose:gc
                    -XX:+PrintGCDetails
                    -server
                    -Xmx8G
                    -Xms8G
                    -XX:MaxPermSize=256m
                    -XX:PermSize=256m
                    -XX:+AggressiveHeap
                    -XX:+DisableExplicitGC
                    -XX:ParallelGCThreads=16
                    -XX:+UseParallelOldGC
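Since the effect of these flags is hard to predict, it helps to read back what the JVM actually applied and how much collecting it has done. A small sketch using the standard `java.lang.management` beans; the class name is illustrative:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

class JvmSettingsReport {
    /** Returns the JVM arguments, the max heap, and per-collector GC totals as text. */
    static String report() {
        StringBuilder sb = new StringBuilder();
        List<String> args = ManagementFactory.getRuntimeMXBean().getInputArguments();
        sb.append("JVM args: ").append(args).append('\n');
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        sb.append("Max heap (MB): ").append(maxHeapMb).append('\n');
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName()).append(": ")
              .append(gc.getCollectionCount()).append(" collections, ")
              .append(gc.getCollectionTime()).append(" ms\n");
        }
        return sb.toString();
    }
}
```

Logging this report before and after an indexing run gives a rough before/after comparison for each flag change.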

Run, don’t Walk




Telling some stories

                    • Prototyping
                    • Application Development
                    • Maintaining Your Big Search Indexes


Using Solr as key/value store

[Diagram: a Solr Key/Value Cache sits alongside the Ingest Pipeline, which reads Metadata and Content Files and writes to a stack of Solr instances.]



Using Solr as key/value store
                    • Thousands of queries per second without real time get.
        http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html

                    • How fast with real time get?
        http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
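The two lookups differ only in the handler and in the parameter that carries the id. A sketch that builds both URLs and fetches one of them, using the core URL from the slide. Note that `/get` is the real-time get handler introduced in Solr 4 and needs the update log enabled. The class and method names are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

class EnrichmentLookup {
    /** Key lookup through the normal search handler. */
    static String selectUrl(String coreUrl, String id, String fields) {
        return coreUrl + "/select?q=id:" + id + "&fl=" + fields;
    }

    /** Key lookup through the real-time get handler. */
    static String realTimeGetUrl(String coreUrl, String id, String fields) {
        return coreUrl + "/get?id=" + id + "&fl=" + fields;
    }

    /** Fetches the response body as text; requires a running Solr at the given URL. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(2000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString("UTF-8");
        } finally {
            conn.disconnect();
        }
    }
}
```

Timing a loop of `fetch` calls against each URL is one simple way to answer the "how fast?" question for a given data set.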




Push schema definition to the application
                    • Not “schema less”
                    • Just different owner of schema!
                    • Schema may have common set of fields like id, type, timestamp, version
                    • Nothing required.
        q=intensity_i:[0 TO 70]&fq=TYPE:streetlamp_monitor
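A field name like `intensity_i` suggests dynamic fields, where a suffix selects the type. A sketch of how an application could own the schema by deriving field names from Java value types. It assumes the target schema declares the suffix patterns found in Solr's example schema (`*_i`, `*_l`, `*_d`, `*_f`, `*_b`, `*_dt`, `*_s`); the helper itself is hypothetical:

```java
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

class DynamicFieldNamer {
    /** Appends the dynamic-field suffix that matches the value's Java type. */
    static String fieldName(String name, Object value) {
        if (value instanceof Integer) return name + "_i";
        if (value instanceof Long) return name + "_l";
        if (value instanceof Double) return name + "_d";
        if (value instanceof Float) return name + "_f";
        if (value instanceof Boolean) return name + "_b";
        if (value instanceof Date) return name + "_dt";
        // Anything else is indexed as a plain string.
        return name + "_s";
    }

    /** Renames every key of an application record to its typed Solr field name. */
    static Map<String, Object> toSolrFields(Map<String, Object> record) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            fields.put(fieldName(e.getKey(), e.getValue()), e.getValue());
        }
        return fields;
    }
}
```

With this convention a new document type can add fields without a schema change on the Solr side.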




Don’t do expensive things in Solr

                    • Tika content extraction aka Solr Cell

                    • UpdateRequestProcessorChain


Beware JavaBin

[Diagram: the Ingest Pipeline talks to a Solr Key/Value Cache running Solr 3.4 and to a stack of Solr instances running Solr 4. Which SolrJ version do I use?]

No JavaBin

             Give me /update/avro!
             • Avoid Jarmageddon
             • Reflection? Ugh.


Avro!
                    • Supports serialization of data readable from multiple languages
                    • It’s smart XML, w/o the XML!
                    • Handles forward and backward compatibility between versions of an object
                    • Compact and fast to read.

Avro!

[Diagram: the Ingest Pipeline reads Metadata and Content Files, uses the Solr Key/Value Cache, and sends .avro data to a stack of Solr instances.]

Telling some stories

                    • Prototyping
                    • Application Development
                    • Maintaining Your Big Search Indexes


Upgrade Lucene Indexes Easily
                   • Don’t reindex!
                   • Try out new versions of Lucene based search engines.
                                                          David Lyle
        java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir



Indexing is Easy and Quick


CHEAP AND CHEERFUL

NRT versus BigData



The tension between scale and update rate

[Diagram: a spectrum from “10 million” to “100’s of millions”, with a “Bad Place” in the middle.]




Grim Reaper

Delayed Replication
        <requestHandler name="/replication" class="solr.ReplicationHandler">
          <lst name="slave">
            <str name="masterUrl">http://localhost:8983/solr/replication</str>
            <str name="pollInterval">36:00:00</str>
          </lst>
        </requestHandler>




Enable/Disable

                    • SOLR-3301

        <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
          <lst name="invariants">
            <str name="q">MY HARD QUERY</str>
            <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_2</str>
          </lst>
          <lst name="defaults">
            <str name="echoParams">all</str>
          </lst>
          <str name="healthcheckFile">server-enabled.txt</str>
        </requestHandler>
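With `healthcheckFile` configured, the ping handler reports the node as down whenever that file is missing, and SOLR-3301 lets the handler create or remove the file on request. A sketch of the same toggle done directly on the filesystem. Where the file lives relative to the core is an assumption here, so the caller passes the full path, and the class name is illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class HealthcheckToggle {
    private final Path healthcheckFile;

    HealthcheckToggle(Path healthcheckFile) {
        this.healthcheckFile = healthcheckFile;
    }

    /** Puts the node into rotation by creating the healthcheck file if it is absent. */
    void enable() throws IOException {
        if (!Files.exists(healthcheckFile)) {
            Files.createFile(healthcheckFile);
        }
    }

    /** Takes the node out of rotation by removing the healthcheck file. */
    void disable() throws IOException {
        Files.deleteIfExists(healthcheckFile);
    }

    boolean isEnabled() {
        return Files.exists(healthcheckFile);
    }
}
```

A load balancer that polls `/admin/ping` then stops routing to a node shortly after `disable()` runs, without the Solr JVM being stopped.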




Provisioning

                    • Chef/Puppet
                    • ZooKeeper
                    • Have you versioned everything to build an
                           index over again?




TRADITIONAL ENVIRONMENT




POOLED ENVIRONMENT (think Cloud!)

Do I need Failover?

                    • Can I build quickly?
                    • Do I have a reliable cluster of servers?
                    • Am I spread across data centers?
                    • Failover is sooo 90’s...

Telling some stories

                    • Prototyping
                    • Application Development
                    • Maintaining Your Big Search Indexes


One more thought...



Measuring the impact of our algorithm changes is just getting harder with Big Data.

Project SolrPanl

Thank you!

                             Questions?

                    • epugh@o19s.com
                    • @dep4b
                    • www.opensourceconnections.com

 
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
 
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon Hughes
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
 

OSSCON: Big Search 4 Big Data

  • 1. Big Search w/ Big Data Principles Basis Technology Open Source Search 2012 Eric Pugh | epugh@o19s.com | @dep4b Tuesday, October 2, 2012
  • 2. What is Big Search?
  • 3. Who am I? • Principal of OpenSource Connections - Solr/Lucene Search Consultancy • Member of Apache Software Foundation • SOLR-284 UpdateRichDocuments (July 07) • Fascinated by the art of software development
  • 4. CO-AUTHOR (2nd edition!)
  • 5. Telling some war stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 6. Not an intro to SolrCloud! • Great tutorials given by Tomás Fernández Löbbe from LucidWorks yesterday!
  • 7. Background for Client X’s Project • Big Data is any data set that is primarily at rest due to the difficulty of working with it. • 100’s of millions of documents to search • Limited selection of tools available. • Aggressive timeline. • All the data must be searched per query. • On Solr 3.x line
  • 8. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 9. Boy meets Girl Story [diagram labels: Metadata, Content Files, Ingest Pipeline, Solr (four instances)]
  • 11. Bash Rocks • Remote Solr stop/start scripts • Remote Indexer stop/start scripts • Performance Monitoring • Content Extraction scripts (+Java) • Ingestor Scripts (+Java) • Artifact Deployment (CM)
  • 12. Make it easy to change approach
  • 13. Make it easy to change sharding
        public void run(Map options, List<SolrInputDocument> docs)
            throws InstantiationException, IllegalAccessException, ClassNotFoundException {
          IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
              "com.o19s.solr.ModShardIndexStrategy").newInstance();
          indexStrategy.configure(options);
          for (SolrInputDocument doc : docs) {
            indexStrategy.addDocument(doc);
          }
        }
  • 14. Separate JVM from Solr Cores • Step 1: Fire up empty Solr instances on all the servers (nohup &). • Step 2: Verify they started cleanly. • Step 3: Create Cores (curl http://search1.o19s.com:8983/solr/admin?action=create&name=run2) • Step 4: Create an “aggregator” core, passing in URLs of Cores. (&property.shards=)
  • 15. Go Wide Quickly
  • 16. [diagram labels: hosts search1.o19s.com, search2.o19s.com, search3.o19s.com; ports :8983, :8984, :8985; shards shard1, shard8, shard12]
  • 17. Simple Pipeline • Simple pipeline • mv is atomic
  • 18. Don’t Move Files • SCP across machines is slow/error prone • NFS share, single point of failure. • Clustered file system like GFS (Global File System) can have “fencing” issues • HDFS shines here. • ZooKeeper shines here.
  • 19. Can you test your changes?
  • 20. JVM tuning is black art -verbose:gc -XX:+PrintGCDetails -server -Xmx8G -Xms8G -XX:MaxPermSize=256m -XX:PermSize=256m -XX:+AggressiveHeap -XX:+DisableExplicitGC -XX:ParallelGCThreads=16 -XX:+UseParallelOldGC
  • 22. Run, don’t Walk
  • 23. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 24. Using Solr as key/value store [diagram labels: Solr Key/Value Cache, Metadata, Content Files, Ingest Pipeline, Solr (four instances)]
  • 25. Using Solr as key/value store • thousands of queries per second without real time get. http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html • how fast with real time get? http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
  • 26. Push schema definition to the application • Not “schema less” • Just different owner of schema! • Schema may have common set of fields like id, type, timestamp, version • Nothing required. q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
  • 27. Don’t do expensive things in Solr • Tika content extraction aka Solr Cell • UpdateRequestProcessorChain
  • 28. Don’t do expensive things in Solr • Tika content extraction aka Solr Cell • UpdateRequestProcessorChain
  • 29. Beware JavaBin [diagram labels: Solr Key/Value Cache, Metadata, Content Files, Ingest Pipeline, Solr (four instances)]
  • 30. Beware JavaBin [diagram labels: Solr Key/Value Cache, Solr 3.4, Metadata, Content Files, Ingest Pipeline, Solr (four instances)]
  • 31. Beware JavaBin [diagram labels: Solr Key/Value Cache, Solr 3.4, Solr 4, Metadata, Content Files, Ingest Pipeline, Solr (four instances)]
  • 32. Beware JavaBin [diagram labels: Solr Key/Value Cache, Solr 3.4, Solr 4, Metadata, Content Files, Ingest Pipeline, Solr (four instances); callout: Which SolrJ version do I use?]
  • 33. No JavaBin • Give me /update/avro! • Avoid Jarmaggeddon • Reflection? Ugh.
  • 34. Avro! • Supports serialization of data readable from multiple languages • It’s smart XML, w/o the XML! • Handles forward and reverse versions of an object • Compact and fast to read.
  • 35. Avro! [diagram labels: Solr Key/Value Cache, .avro, Metadata, Content Files, Ingest Pipeline, Solr (four instances)]
  • 36. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 37. Upgrade Lucene Indexes Easily • Don’t reindex! • Try out new versions of Lucene-based search engines. David Lyle java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
  • 38. Indexing is Easy and Quick
  • 39. CHEAP AND CHEERFUL < >
  • 40. NRT versus BigData
  • 41. The tension between scale and update rate [chart labels: 10 million, Bad Place, 100’s of millions]
  • 43. Delayed Replication
        <requestHandler name="/replication" class="solr.ReplicationHandler">
          <lst name="slave">
            <str name="masterUrl">http://localhost:8983/solr/replication</str>
            <str name="pollInterval">36:00:00</str>
          </lst>
        </requestHandler>
  • 44. Enable/Disable • SOLR-3301
  • 45. Enable/Disable
        <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
          <lst name="invariants">
            <str name="q">MY HARD QUERY</str>
            <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_2</str>
          </lst>
          <lst name="defaults">
            <str name="echoParams">all</str>
          </lst>
          <str name="healthcheckFile">server-enabled.txt</str>
        </requestHandler>
  • 46. Provisioning • Chef/Puppet • ZooKeeper • Have you versioned everything to build an index over again?
  • 48. POOLED ENVIRONMENT (think Cloud!)
  • 49. Do I need Failover? • Can I build quickly? • Do I have a reliable cluster of servers? • Am I spread across data centers? • Is sooo 90’s....
  • 50. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
  • 51. One more thought...
  • 52. Measuring the impact of our algorithm changes is just getting harder with Big Data.
  • 54. Thank you! Questions? • epugh@o19s.com • @dep4b • www.opensourceconnections.com
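
Slide 13 loads a class named ModShardIndexStrategy by reflection but does not show its body. As a rough illustration of what "mod" sharding usually means, the sketch below routes a document ID to one of N shards by hashing the ID and taking the remainder. The class name ModShardRouter, its methods, and the shard URLs are hypothetical and are not taken from the talk; the real strategy may work differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of modulo-based shard routing; not the talk's ModShardIndexStrategy. */
final class ModShardRouter {
    private final List<String> shardUrls;

    ModShardRouter(List<String> shardUrls) {
        if (shardUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one shard URL is required");
        }
        this.shardUrls = new ArrayList<>(shardUrls);
    }

    /** Shard position for this ID; floorMod keeps the result non-negative for negative hash codes. */
    int shardIndexFor(String docId) {
        return Math.floorMod(docId.hashCode(), shardUrls.size());
    }

    /** URL of the shard that should receive the document with this ID. */
    String shardUrlFor(String docId) {
        return shardUrls.get(shardIndexFor(docId));
    }

    public static void main(String[] args) {
        // Assumed core URLs, modeled on the run2_N naming that appears on slide 45.
        ModShardRouter router = new ModShardRouter(Arrays.asList(
                "http://search1.o19s.com:8983/solr/run2_1",
                "http://search1.o19s.com:8983/solr/run2_2",
                "http://search1.o19s.com:8983/solr/run2_3"));
        for (String id : Arrays.asList("DOC45242", "DOC45243", "DOC45244")) {
            System.out.println(id + " -> " + router.shardUrlFor(id));
        }
    }
}
```

The same ID always lands on the same shard as long as the shard count is fixed; changing the count reassigns most IDs, which is one reason to keep the strategy swappable as the slide suggests.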
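
Slide 17 notes that mv is atomic, which is what makes a directory-based pipeline safe: a stage writes a file somewhere private and then renames it into the directory the next stage watches, so the next stage never sees a half-written file. The sketch below shows the same idea in Java using Files.move with ATOMIC_MOVE. The class name, method name, and directory layout are assumptions for illustration, and the staging and ready directories are assumed to live on the same file system, since an atomic move generally cannot cross file systems.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical pipeline hand-off: write privately, then rename into the watched directory. */
final class AtomicHandoff {

    /**
     * Writes content to a temp file in stagingDir, then atomically moves it to
     * readyDir/fileName. Throws AtomicMoveNotSupportedException (an IOException)
     * if the platform cannot perform the move atomically.
     */
    static Path publish(Path stagingDir, Path readyDir, String fileName, String content)
            throws IOException {
        Path tmp = Files.createTempFile(stagingDir, "incoming-", ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        return Files.move(tmp, readyDir.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("pipeline");
        Path staging = Files.createDirectory(root.resolve("staging"));
        Path ready = Files.createDirectory(root.resolve("ready"));
        Path published = publish(staging, ready, "DOC45242.xml", "<doc/>");
        System.out.println("published " + published);
    }
}
```

A consumer that only lists the ready directory can then process whatever it finds without any locking between stages.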
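
Slide 25 contrasts two ways of fetching a stored document by ID: a normal search against /select and a lookup through the real-time get handler at /get. The sketch below only builds the two request URLs so the difference in shape is easy to see; it does not contact Solr. The class and method names are hypothetical, the core URL is copied from the slide, and IDs are assumed to be plain alphanumerics that need no query-parser escaping.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds the two lookup URLs shown on slide 25. */
final class SolrLookupUrls {

    /** Search-style lookup: /select?q=id:<id>&fl=<fields>. */
    static String selectUrl(String coreUrl, String id, String... fields) {
        return coreUrl + "/select?q=" + enc("id:" + id) + "&fl=" + enc(String.join(",", fields));
    }

    /** Real-time get lookup: /get?id=<id>&fl=<fields>. */
    static String getUrl(String coreUrl, String id, String... fields) {
        return coreUrl + "/get?id=" + enc(id) + "&fl=" + enc(String.join(",", fields));
    }

    /** URL-encodes one parameter value as UTF-8. */
    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String core = "http://localhost:8983/solr/run2_enrichment";
        System.out.println(selectUrl(core, "DOC45242", "entities", "html"));
        System.out.println(getUrl(core, "DOC45242", "entities", "html"));
    }
}
```

The encoded forms differ from the slide only in that the colon and comma are percent-encoded, which Solr decodes back to the same parameter values.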