Solr, Lucene & Hadoop @ Etsy



Thursday, May 10, 12
david@etsy.com
                       4 Years Lucene and Solr @ Etsy


History of Search @ Etsy
                       Hadoop + HBase Indexing
                             (in development)

                             Replication

About
                        Us

13MM Listings
                         39MM Unique Visitors
                       880K Shops / 150 Countries
                            100+ Engineers

Architecture
                        Overview

Overview
                       •Search: +n slaves
                       •Web: +n webs
                       •Database: +n db shards
                       •Memcached: +n caches

Thrift
                       •Web > Search: query = hats for cats
                       •Search > Web: result = 402, 283, 837
                       •Search: +n slaves
                       •Web: +n webs

Hydration
                       •Database: +n shards
                       •Memcached: +n caches
                       •Web: +n webs
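
A minimal sketch of the hydration step, assuming the search tier returns only listing IDs (as on the Thrift slide) and the web tier fills in the listing data from Memcached first and from the owning DB shard on a miss. The dict-based cache and shards, the modulo shard routing, and the function names are hypothetical stand-ins, not Etsy's code.

```python
# Hypothetical in-memory stand-ins for Memcached and two DB shards.
cache = {402: {"id": 402, "title": "cat hat"}}
shards = [
    {402: {"id": 402, "title": "cat hat"}},
    {283: {"id": 283, "title": "tiny top hat"},
     837: {"id": 837, "title": "knit cat beanie"}},
]


def shard_for(listing_id, shards):
    """Pick the shard that owns an ID (modulo routing is an assumption)."""
    return shards[listing_id % len(shards)]


def hydrate(ids, cache, shards):
    """Turn search result IDs into full records, preserving result order."""
    listings = []
    for listing_id in ids:
        row = cache.get(listing_id)
        if row is None:
            # Cache miss: read from the shard, then populate the cache.
            row = shard_for(listing_id, shards).get(listing_id)
            if row is not None:
                cache[listing_id] = row
        if row is not None:
            listings.append(row)
    return listings


results = hydrate([402, 283, 837], cache, shards)
print([row["title"] for row in results])
```

Keeping the search response down to IDs means the index never has to hold display data, which is the point the later "Store Nothing" slide makes.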

The Results




History of Search
                            at Etsy

History of Search
                2007
                       •1 Million Listings
                       •A Single “Master” Postgres Database
                       •PHP > Twisted > Stored Proc > TSearch
                       •18 “Baby” Postgres Databases
                       •Baby Replicator
History of Search
                2008
                       •2 Million Listings
                       •A Single “Master” Postgres Database
                       •PHP > Solr
                       •4 Solr Slaves + 2 Masters
                       •Baby Replicator + DIH for Reindexing
History of Search
                2009
                       •4 Million Listings
                       •A Single “Master” Postgres Database
                       •PHP > Solr
                       •6 Solr Slaves + 2 Masters
                       •Webs > ActiveMQ > Solr
History of Search
                2010
                       •7 Million Listings
                       •A Single “Master” Postgres Database
                       •PHP > Thrift > Solr
                       •10 Solr Slaves + 1 Master
                       •Custom Import Handler
History of Search
                2011
               •10 Million Listings
               •“Master” Postgres Database + DB SHARDS!
               •PHP > Thrift > Solr
               •24 Solr Slaves + 1 Master
               •Custom Import Handler
Future of Search
                2012
                       •?? Million Listings
                       •MORE DB SHARDS!
                       •PHP > Thrift > Solr
                       •?? Solr Slaves + 1 Master
                       •HBase + Hadoop Indexers
What Did We Learn?

Lucene + Solr > TSearch
                       http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/


Love Lucene + Solr Trunk!


Run, Don’t Walk...




Deployinator
                       Fork it: https://github.com/etsy/deployinator


Smoker


StatsD, Graph Everything!
                           Fork it: https://github.com/etsy/statsd
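
A minimal sketch of what "graph everything" looks like from the client side, assuming the plain-text StatsD line format (`name:value|ms` for timers, `name:value|c` for counters) sent over UDP to the default port 8125. The metric names and helper functions are hypothetical.

```python
import socket


def format_timer(name, millis):
    """Build a StatsD timer line, e.g. 'search.perform_search:42|ms'."""
    return f"{name}:{millis}|ms"


def format_counter(name, delta=1):
    """Build a StatsD counter line, e.g. 'search.queries:1|c'."""
    return f"{name}:{delta}|c"


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; a lost packet only costs one data point."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()


# Usage (not executed here): send_metric(format_timer("search.perform_search", 42))
print(format_timer("search.perform_search", 42))
print(format_counter("search.queries"))
```

Because the transport is UDP, instrumenting a code path adds very little latency and cannot block the request if the collector is down.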


95th Percentile
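
A small sketch of why the 95th percentile is worth graphing alongside the median: one slow request barely moves the median but dominates the tail. The nearest-rank method and the sample timings are assumptions for illustration; the slides do not say how the percentile is computed.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value with pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical search latencies in milliseconds, with one slow outlier.
timings = [12, 15, 11, 14, 13, 16, 12, 15, 13, 400]

print("median:", percentile(timings, 50))
print("p95:", percentile(timings, 95))
```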


start · build_query · perform_search · receive_search_ads · search_side_response ·
                create_event_logger · set_tpl_vars · tpl_render · receive_search_ads_post_render
Solr Top Level Cache > Memcached


etsy-index.properties

    $ cat /search/data/person/index/etsy-index.properties
    #Tue Mar 27 13:05:51 EDT 2012
    max_update_time=2012-03-27T17:05:51.955Z
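
A minimal sketch of reading this file, assuming it holds plain `key=value` lines with `#` comments and that `max_update_time` is a UTC timestamp in the format shown. The parser below is a hypothetical helper and does not handle Java properties escaping.

```python
from datetime import datetime, timezone

SAMPLE = """#Tue Mar 27 13:05:51 EDT 2012
max_update_time=2012-03-27T17:05:51.955Z
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_update_time(value):
    """Convert '2012-03-27T17:05:51.955Z' into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


props = parse_properties(SAMPLE)
max_update_time = parse_update_time(props["max_update_time"])
print(max_update_time.isoformat())
```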

Check Index Size
               Don’t Install if < 50% Current Size


Check if Index is Too Old
                       Don’t Update if > 10 Days Old
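
The two safety checks above can be sketched as simple guard functions. This assumes "size" means bytes on disk and that the age of an index is measured from its `max_update_time`; both readings, along with the function and constant names, are assumptions.

```python
from datetime import datetime, timedelta, timezone

MIN_SIZE_RATIO = 0.5            # don't install if < 50% of current size
MAX_AGE = timedelta(days=10)    # don't update if > 10 days old


def size_ok(new_bytes, current_bytes):
    """Reject a new index that shrank to less than half of the live one."""
    return new_bytes >= MIN_SIZE_RATIO * current_bytes


def fresh_enough(max_update_time, now):
    """Reject an index whose last update is more than ten days behind."""
    return now - max_update_time <= MAX_AGE


last_update = datetime(2012, 3, 27, 17, 5, 51, tzinfo=timezone.utc)
print(size_ok(new_bytes=40, current_bytes=100))
print(fresh_enough(last_update, datetime(2012, 4, 1, tzinfo=timezone.utc)))
```

Both checks fail closed: a suspiciously small or stale index is never swapped in automatically, so a broken indexing run cannot empty the live search results.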

What Did We Learn?




                       Store Nothing


Keep Denormalized Data


                       DB Shards > PHP Denormalizer > JSON > Search Database
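
A minimal sketch of the denormalizer stage, assuming it joins rows that live in different shard tables into one flat JSON document per listing. The real denormalizer is PHP; this Python version, its field names, and its sample rows are hypothetical.

```python
import json


def denormalize(listing, shop, tags):
    """Flatten a listing and its related rows into one JSON document."""
    doc = {
        "listing_id": listing["listing_id"],
        "title": listing["title"],
        "shop_id": shop["shop_id"],
        "shop_name": shop["name"],
        "tags": sorted(tags),
    }
    return json.dumps(doc, sort_keys=True)


listing = {"listing_id": 402, "title": "Hats for cats", "shop_id": 7}
shop = {"shop_id": 7, "name": "CatHattery"}
doc_json = denormalize(listing, shop, ["hat", "cat"])
print(doc_json)
```

Storing the flattened document means a reindex can read one record per listing instead of repeating the joins against every shard.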




                       Full Reindex > Apply Incremental > Install




                       Full Reindex > Apply Incremental > Apply Incremental > Install
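
The reindex sequence above (full reindex, one or more incremental applies, then install) can be sketched as follows. This assumes each incremental pass picks up rows modified after the index's recorded `max_update_time`; that link, the integer timestamps, and the function names are assumptions, not something the slides state.

```python
def full_reindex(rows, as_of):
    """Build a fresh index from every row modified up to the snapshot time."""
    index = {pk: row for pk, row in rows.items() if row["modified"] <= as_of}
    return index, as_of


def apply_incremental(index, max_update_time, rows, now):
    """Re-apply rows changed since the last pass and return the new high-water mark."""
    for pk, row in rows.items():
        if max_update_time < row["modified"] <= now:
            index[pk] = row
    return now


rows = {1: {"title": "a", "modified": 10}, 2: {"title": "b", "modified": 20}}
index, mark = full_reindex(rows, as_of=20)

# Writes that land while the full reindex is running.
rows[2] = {"title": "b2", "modified": 25}
rows[3] = {"title": "c", "modified": 30}
mark = apply_incremental(index, mark, rows, now=30)

# More writes arrive before install, so a second pass catches up.
rows[1] = {"title": "a2", "modified": 35}
mark = apply_incremental(index, mark, rows, now=40)
print(sorted(index), mark)
```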




Indexer
                       Database




HBase + Hadoop
                          Indexing
HBase + Hadoop Indexing




                       Why HBase?


HBase + Hadoop Indexing

                       DB Shards > PHP Denormalizer > JSON > HBase

HBase + Hadoop Indexing

                listings_denormalized
               {NAME => 'listings_denormalized', FAMILIES => [
                 {NAME => 'listing_data', BLOOMFILTER => 'ROW',
                  REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY',
                  VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536',
                  IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}




HBase + Hadoop Indexing

                listings_denormalized_modified_index
               {NAME => 'listings_denormalized_modified_index', FAMILIES => [
                 {NAME => 'pks', BLOOMFILTER => 'ROW',
                  REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY',
                  VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536',
                  IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}




HBase + Hadoop Indexing




                                   SOLR-1301
                       https://issues.apache.org/jira/browse/SOLR-1301


HBase + Hadoop Indexing



                       Solr Output Format > Disk > HDFS
                       •Solr Document Converter
                       •Solr Requires POSIX Disk
                       •Index Copied Back to HDFS

HBase + Hadoop Indexing

                       •Not Great with Multi-Core Configs
                        •Added Solr Multi-Core Support
                       •Solr Config Issues
                        •Added ENV support for Configs
                       •Uses “new” style Hadoop API
                        •Added Support for both Old and New
HBase + Hadoop Indexing




                       SolrInputDocumentWritable
    public class SolrInputDocumentWritable extends SolrInputDocument
    implements org.apache.hadoop.io.Writable {
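
Hadoop's `Writable` contract boils down to two methods: `write(DataOutput)` serializes the object and `readFields(DataInput)` restores it. The sketch below mirrors that round trip in Python for a document of string fields. The length-prefixed layout and the function names are assumptions for illustration and do not reproduce the wire format of the Java class on the slide.

```python
import io
import struct


def write_doc(fields, out):
    """Serialize a {name: value} document: field count, then length-prefixed strings."""
    out.write(struct.pack(">i", len(fields)))
    for name, value in fields.items():
        for text in (name, value):
            data = text.encode("utf-8")
            out.write(struct.pack(">i", len(data)))
            out.write(data)


def read_doc(inp):
    """Rebuild the document by reading fields back in the order they were written."""
    (count,) = struct.unpack(">i", inp.read(4))
    fields = {}
    for _ in range(count):
        pair = []
        for _ in range(2):
            (size,) = struct.unpack(">i", inp.read(4))
            pair.append(inp.read(size).decode("utf-8"))
        fields[pair[0]] = pair[1]
    return fields


original = {"listing_id": "402", "title": "hats for cats"}
buffer = io.BytesIO()
write_doc(original, buffer)
buffer.seek(0)
restored = read_doc(buffer)
print(restored)
```

Making the document itself serializable lets it travel between the map and reduce phases as an ordinary Hadoop value.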



HBase + Hadoop Indexing




                         Oozie


HBase + Hadoop Indexing



                       Oozie + HBase?


HBase + Hadoop Indexing




                       ScanStringGenerator
    http://blog.ozbuyucusu.com/2011/07/21/using-hbase-tablemapper-via-oozie-workflow/


HBase + Hadoop Indexing
                       Hadoop                     Indexer
                       Oozie                      Start
                       Map            HBase       Copy
                       Reduce         HDFS        Merge
                       Solr Output    Disk        Install

HBase + Hadoop Indexing




                       IndexerActionMain


HBase + Hadoop Indexing




                       Deployinator


HBase + Hadoop Indexing




                       IndexCompare


HBase + Hadoop Indexing

    $ ./compare

    ERROR: please provide two index directories

    example: ./compare -p 0.1 -i user_id ./index ./index-1332867952588
    options:
        -p --percent= percent of the index to check
        -i --id=      primary key id field in the index
        -h --hash=    comparison or hash field in the index
        <index> <index>




HBase + Hadoop Indexing
      $ ./compare \
      /search/data/person/index-1332867952588/ \
      /search/data/person/index-1335378487672

        id field: user_id
      hash field: hash
      percentage: 0.0010
           files: /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672

      /search/data/person/index-1332867952588 contains 1515512 docs
      /search/data/person/index-1335378487672 contains 14837972 docs
      1516 of 1516 documents are the same
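
A minimal sketch of the sampling comparison that the output above suggests: pick a fraction of the documents in the first index by ID and check that the second index holds the same hash for each. The in-memory dicts stand in for real Lucene indexes, and the sampling and rounding strategy are assumptions; the actual tool's internals are not shown on the slides.

```python
import random


def compare(index_a, index_b, fraction, seed=0):
    """Sample a fraction of index_a's IDs and count matching hashes in index_b."""
    ids = sorted(index_a)
    sample_size = max(1, round(len(ids) * fraction))
    rng = random.Random(seed)
    sample = rng.sample(ids, sample_size)
    same = sum(1 for doc_id in sample if index_b.get(doc_id) == index_a[doc_id])
    return same, sample_size


# Hypothetical indexes mapping the ID field to the hash field.
old_index = {user_id: f"hash-{user_id}" for user_id in range(1000)}
new_index = dict(old_index)

same, checked = compare(old_index, new_index, fraction=0.1)
print(f"{same} of {checked} documents are the same")
```

With the numbers from the slide, a fraction of 0.0010 over 1515512 documents rounds to the 1516 documents reported.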




HBase + Hadoop Indexing




                       Copy and Merge


HBase + Hadoop Indexing




                       Open Source


Replication

Replication




Replication

                       Master > Slaves (+n slaves)

BitTorrent
                       Replication
BitTorrent

                Using BitTornado:




Replication
                BitTorrent + Solr

Replication

                       Fork of TTorrent: https://github.com/etsy/ttorrent
                                      Multi-File Support
                                      Large File Support

                               Fork BitTorrent: Coming Soon
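
Multi-file support matters because a Solr index is a directory of segment files. In the BitTorrent scheme, a multi-file torrent treats its files as one concatenated byte stream, cuts it into fixed-size pieces, and stores a SHA-1 hash per piece, so a piece can span a file boundary. The sketch below shows that piece hashing on in-memory data. It illustrates the general protocol rather than the ttorrent fork's code, and the sample file contents are hypothetical.

```python
import hashlib


def piece_hashes(files, piece_length):
    """Hash fixed-size pieces over the concatenation of all files, in order."""
    hashes = []
    buffer = b""
    for data in files:
        buffer += data
        while len(buffer) >= piece_length:
            hashes.append(hashlib.sha1(buffer[:piece_length]).digest())
            buffer = buffer[piece_length:]
    if buffer:
        # The final piece may be shorter than piece_length.
        hashes.append(hashlib.sha1(buffer).digest())
    return hashes


# Two hypothetical index files of 10 and 6 bytes, hashed in 8-byte pieces.
files = [b"a" * 10, b"b" * 6]
hashes = piece_hashes(files, piece_length=8)
print(len(hashes), "pieces")
```

Each slave can verify every piece independently and fetch pieces from its peers, so the master does not have to push the full index to every slave itself.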




Need a job?

Thanks!

david@etsy.com

Solr, Lucene and Hadoop @ Etsy

  • 1. Solr, Lucene & Hadoop @ Thursday, May 10, 12
  • 2. david@etsy.com 4 Years Lucene and Solr @ Etsy Thursday, May 10, 12
  • 3. History of Search @ Etsy Hadoop + HBase Indexing (in development) Replication Thursday, May 10, 12
  • 4. About Us Thursday, May 10, 12
  • 8. 13MM Listings 39MM Unique Visitors 880K Shops / 150 Countries 100+ Engineers Thursday, May 10, 12
  • 9. Architecture Overview Thursday, May 10, 12
  • 10. Overview Search Web Database +n slaves +n webs +n db shards Memcached +n caches Thursday, May 10, 12
  • 11. Thrift Search Web slave web query = hats for cats slave web result = 402, 283, 837 +n slaves +n webs Thursday, May 10, 12
  • 12. Hydration Database shard shard Web web +n shards web Memcached +n webs cache cache +n caches Thursday, May 10, 12
  • 14. History of Search at Etsy Thursday, May 10, 12
  • 15. History of Search 2007 •1 Million Listings •A Single “Master” Postgres Database •PHP > Twisted > Stored Proc > TSearch •18 “Baby” Postgres Databases •Baby Replicator Thursday, May 10, 12
  • 16. History of Search 2008 •2 Million Listings •A Single “Master” Postgres Database •PHP > Solr •4 Solr Slaves + 2 Masters •Baby Replicator + DIH for Reindexing Thursday, May 10, 12
  • 17. History of Search 2009 •4 Million Listings •A Single “Master” Postgres Database •PHP > Solr •6 Solr Slaves + 2 Masters •Webs >ActiveMQ > Solr Thursday, May 10, 12
  • 18. History of Search 2010 •7 Million Listings •A Single “Master” Postgres Database •PHP > Thrift > Solr •10 Solr Slaves + 1 Master •Custom Import Handler Thursday, May 10, 12
  • 19. History of Search 2011 •10 Million Listings •“Master” Postgres Database + DB SHARDS! •PHP > Thrift > Solr •24 Solr Slaves + 1 Master •Custom Import Handler Thursday, May 10, 12
  • 20. Future of Search 2012 •?? Million Listings •MORE DB SHARDS! •PHP > Thrift > Solr •?? Solr Slaves + 1 Master •HBase + Hadoop Indexers Thursday, May 10, 12
  • 21. What Did We Learn? Thursday, May 10, 12
  • 22. Lucene + Solr > TSearch http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/ Thursday, May 10, 12
  • 23. Love Lucene + Solr Trunk! Thursday, May 10, 12
  • 25. Deployinator Fork it: https://github.com/etsy/deployinator Thursday, May 10, 12
  • 27. StatsD, Graph Everything! Fork it: https://github.com/etsy/statsd Thursday, May 10, 12
  • 30. start · build_query · perform_search · receive_search_ads · search_side_response · create_event_logger · set_tpl_vars · tpl_render · receive_search_ads_post_render Thursday, May 10, 12
  • 31. Solr Top Level Cache > Memcached
  • 32. etsy-index.properties
      $ cat /search/data/person/index/etsy-index.properties
      #Tue Mar 27 13:05:51 EDT 2012
      max_update_time=2012-03-27T17:05:51.955Z
  • 33. Check Index Size: Don’t Install if < 50% Current Size
  • 34. Check if Index is Too Old: Don’t Update if > 10 Days Old
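The two safety checks on slides 33 and 34 can be sketched as small guard functions. This is an illustration under assumptions: index size is measured in bytes, age is taken from the `max_update_time` value shown in `etsy-index.properties`, and the function names are hypothetical rather than taken from Etsy's code.

```python
from datetime import datetime, timedelta, timezone

MIN_SIZE_RATIO = 0.5           # slide 33: don't install if < 50% of current size
MAX_AGE = timedelta(days=10)   # slide 34: don't update if > 10 days old


def read_max_update_time(properties_text: str) -> datetime:
    """Extract max_update_time from properties text in the format shown on slide 32."""
    for raw in properties_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip the timestamp comment and blank lines
        key, value = line.split("=", 1)
        if key.strip() == "max_update_time":
            parsed = datetime.strptime(value.strip(), "%Y-%m-%dT%H:%M:%S.%fZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("max_update_time not found")


def ok_to_install(new_size_bytes: int, current_size_bytes: int) -> bool:
    """Refuse a new index that shrank to less than half of the live one."""
    return new_size_bytes >= MIN_SIZE_RATIO * current_size_bytes


def ok_to_update(max_update_time: datetime, now: datetime) -> bool:
    """Refuse incremental updates on an index older than ten days."""
    return now - max_update_time <= MAX_AGE
```

A suspiciously small index usually means a failed or partial build, so rejecting it keeps a bad index from reaching the slaves.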
  • 35. What Did We Learn? Store Nothing
  • 37. Diagram: DB Shard (x3), PHP Denormalizer, JSON, Search Database
  • 38. Diagram: Full Reindex, Apply Incremental, Install
  • 39. Diagram: Full Reindex, Apply Incremental, Apply Incremental, Install
  • 40. Indexer Database
  • 41. HBase + Hadoop Indexing
  • 42. HBase + Hadoop Indexing: Why HBase?
  • 43. HBase + Hadoop Indexing. Diagram: DB Shard (x3), PHP Denormalizer, JSON, HBase
  • 44. HBase + Hadoop Indexing: listings_denormalized
      {NAME => 'listings_denormalized',
       FAMILIES => [{NAME => 'listing_data', BLOOMFILTER => 'ROW',
                     REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY',
                     VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536',
                     IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}
  • 45. HBase + Hadoop Indexing: listings_denormalized_modified_index
      {NAME => 'listings_denormalized_modified_index',
       FAMILIES => [{NAME => 'pks', BLOOMFILTER => 'ROW',
                     REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY',
                     VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536',
                     IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}
  • 46. HBase + Hadoop Indexing: SOLR-1301 https://issues.apache.org/jira/browse/SOLR-1301
  • 47. HBase + Hadoop Indexing (diagram labels: Solr Disk, HDFS)
      • Solr Document Converter Output Format
      • Solr Requires POSIX Disk
      • Index Copied Back to HDFS
  • 48. HBase + Hadoop Indexing
      • Not Great with Multi-Core Configs
      • Added Solr Multi-Core Support
      • Solr Config Issues
      • Added ENV support for Configs
      • Uses “new” style Hadoop API
      • Added Support for both Old and New
  • 49. HBase + Hadoop Indexing: SolrInputDocumentWritable
      public class SolrInputDocumentWritable extends SolrInputDocument
          implements org.apache.hadoop.io.Writable {
  • 50. HBase + Hadoop Indexing: Oozie
  • 51. HBase + Hadoop Indexing: Oozie + HBase?
  • 52. HBase + Hadoop Indexing: ScanStringGenerator http://blog.ozbuyucusu.com/2011/07/21/using-hbase-tablemapper-via-oozie-workflow/
  • 53. HBase + Hadoop Indexing Hadoop Indexer Oozie Start Map HBase Copy Reduce HDFS Merge Solr Disk Install Output
  • 54. HBase + Hadoop Indexing: IndexerActionMain
  • 55. HBase + Hadoop Indexing: Deployinator
  • 56. HBase + Hadoop Indexing: IndexCompare
  • 57. HBase + Hadoop Indexing
      $ ./compare
      ERROR: please provide two index directories
      example: ./compare -p 0.1 -i user_id ./index ./index-1332867952588
      options:
        -p --percent=   percent of the index to check
        -i --id=        primary key id field in the index
        -h --hash=      comparison or hash field in the index
        <index> <index>
  • 58. HBase + Hadoop Indexing
      $ ./compare /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672
      id field: user_id
      hash field: hash
      percentage: 0.0010
      files: /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672
      /search/data/person/index-1332867952588 contains 1515512 docs
      /search/data/person/index-1335378487672 contains 14837972 docs
      1516 of 1516 documents are the same
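The sampling comparison that the `compare` output suggests can be sketched as follows. This is not the IndexCompare source: plain dictionaries mapping the id field to the hash field stand in for the two Lucene indexes, and the function names and the fixed seed are assumptions. The 1516 documents checked on slide 58 are consistent with taking a 0.001 fraction of 1515512 docs and rounding up, which is the inference used here.

```python
import math
import random


def sample_size(doc_count: int, fraction: float) -> int:
    """Number of documents to check, rounded up so a tiny index still gets sampled."""
    return math.ceil(doc_count * fraction)


def compare_indexes(old: dict, new: dict, fraction: float, seed: int = 0):
    """Sample ids from `old` and count how many carry the same hash in `new`.

    `old` and `new` map primary key -> hash field value (stand-ins for real indexes).
    Returns (same, checked).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    ids = rng.sample(sorted(old), sample_size(len(old), fraction))
    same = sum(1 for doc_id in ids if new.get(doc_id) == old[doc_id])
    return same, len(ids)
```

Comparing a stored hash per document rather than every field keeps the check cheap enough to run before each install of a freshly built index.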
  • 59. HBase + Hadoop Indexing: Copy and Merge
  • 60. HBase + Hadoop Indexing: Open Source
  • 63. Replication Slaves Master +n slaves
  • 65. BitTorrent Replication
  • 66. Bit Torrent Using BitTornado:
  • 67. Replication Bit Torrent + Solr
  • 68. Replication Bit Torrent + Solr
  • 71. Replication
      • Fork of TTorrent: https://github.com/etsy/ttorrent
      • Multi-File Support
      • Large File Support
      • Fork BitTorrent: Coming Soon
  • 72. Need a job?