Elasticsearch & “PeopleSearch”
      Leveraging Elasticsearch @
About Traackr

A search engine
A people discovery engine
Subscription-based
Migrated from Solr to
Elasticsearch in Q3 ’12
About me
14+ years of experience building
full-stack web software systems
with a past focus on e-commerce
and publishing

VP Engineering @ Traackr,
responsible for building
engineering capability to enable
Traackr's growth goals

about.me/george-stathis
About this talk


 Short intro to Elasticsearch
 How search is done @ Traackr
 Why Elasticsearch was the right fit
About Elasticsearch
Lucene under the covers
Distributed from the ground up
Full support for Lucene Near Real-Time search
Native JSON Query DSL
Automatic schema detection (“schema-less”)
Supports document types
Elasticsearch - Distributed
 Indices broken into shards

 shards have 0 or more replicas

 data nodes hold one or more shards

 data nodes can coordinate/forward
 requests

 automatic routing & rebalancing but
 overrides available

 Default discovery mode is multicast (zen
 discovery); unicast is available for
 multicast-unfriendly networks. An AWS
 plug-in is available, as is a ZooKeeper
 plug-in made possible by Sonian.

 YouTube demo: http://youtu.be/l4ReamjCxHo

 Source: https://confluence.oceanobservatories.org/display/CIDev/Indexing+with+ElasticSearch
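
To make the routing bullet concrete: a document is assigned to a primary shard from a hash of its routing value (the document id by default), taken modulo the number of primary shards. The sketch below is only an illustration. The name `pick_shard` is hypothetical, and `zlib.crc32` is a stand-in hash chosen for this example, not the hash function Elasticsearch itself uses.

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number using a stand-in hash."""
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    # Same routing value always yields the same shard number.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    for doc_id in ("post-1", "post-2", "post-3"):
        print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

Because the shard number depends on the shard count, overriding the routing value (e.g. routing by author) keeps related documents on the same shard.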
Elasticsearch - NRT

Uses Lucene’s IndexReader.open(IndexWriter writer, boolean applyAllDeletes)
Opens a near real time IndexReader from the IndexWriter
By default, refreshes and makes new updates available for search every second
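
A toy model of the near-real-time behaviour described above, assuming nothing about Lucene internals: writes land in a buffer and only become searchable after a refresh. The class `ToyNrtIndex` and its methods are hypothetical names for this sketch, not an Elasticsearch or Lucene API.

```python
class ToyNrtIndex:
    """Toy model: documents are searchable only after refresh()."""

    def __init__(self):
        self._buffer = []      # indexed but not yet visible to searches
        self._searchable = []  # visible to searches

    def index(self, doc: str) -> None:
        self._buffer.append(doc)

    def refresh(self) -> None:
        # Elasticsearch triggers this periodically (every second by default).
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term: str) -> list:
        return [d for d in self._searchable if term.lower() in d.lower()]


if __name__ == "__main__":
    idx = ToyNrtIndex()
    idx.index("Elasticsearch NRT search")
    print(idx.search("nrt"))  # nothing visible before the refresh
    idx.refresh()
    print(idx.search("nrt"))  # the document is visible after the refresh
```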
Elasticsearch - JSON DSL
  # Query String
  curl 'localhost:9200/test/_search?pretty=1' -d '{
      "query" : {
          "query_string" : {
              "query" : "tags:scala"
          }
      }
  }'
  Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh


  # Range
  curl 'localhost:9200/test/_search?pretty=1' -d '{
      "query" : {
           "range" : {
                     "price" : { "gt" : 15 }
           }
      }
  }'
  Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh
Elasticsearch - JSON DSL (cont)


# Filtered Query
#     Filters are similar to queries, except they do no scoring
#     and are easily cached.
#     There are many filter types as well, including range and term
curl 'localhost:9200/test/_search?pretty=1' -d '{
    "query" : {
        "filtered" : {
            "query" : {
                    "query_string" : {
                             "query" : "tags:scala"
                    }
            },
            "filter" : {
                    "range" : {
                             "price" : { "gt" : 15 }
                    }
            }
        }
    }
}'
Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh
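
The same filtered query can be assembled programmatically before being sent to `_search`. This is a minimal sketch: the helper name `filtered_query` is hypothetical and the code only builds and prints the JSON body, without contacting a cluster. Note that later Elasticsearch releases replaced `filtered` with a `bool` query carrying a `filter` clause, so this shape applies to the era of these slides.

```python
import json


def filtered_query(query_string: str, field: str, gt: float) -> dict:
    """Build a scored query_string query restricted by an unscored range filter."""
    return {
        "query": {
            "filtered": {
                "query": {"query_string": {"query": query_string}},
                "filter": {"range": {field: {"gt": gt}}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(filtered_query("tags:scala", "price", 15), indent=2))
```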
Elasticsearch - Schema
Dynamic object mapping with intelligent defaults
Can be turned off
Can be overridden globally or on a per index basis:

 {
     "_default_" : {
       "date_formats" : ["yyyy-MM-dd", "dd-MM-yyyy", "date_optional_time"]
     }
 }
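
To illustrate what a list of date formats means for dynamic mapping, the sketch below tries each configured pattern in order and reports the first that parses a string value. It is a simplified imitation, not Elasticsearch's detection code: the function `detect_date_format` and the hand-written pattern table are assumptions, and `date_optional_time` is left out.

```python
from datetime import datetime

# Java-style pattern -> Python strptime equivalent (hand-written for this sketch).
PATTERNS = {
    "yyyy-MM-dd": "%Y-%m-%d",
    "dd-MM-yyyy": "%d-%m-%Y",
}


def detect_date_format(value: str, formats=("yyyy-MM-dd", "dd-MM-yyyy")):
    """Return the first configured pattern that parses the value, else None."""
    for name in formats:
        try:
            datetime.strptime(value, PATTERNS[name])
            return name
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    for sample in ("2012-09-30", "30-09-2012", "not a date"):
        print(sample, "->", detect_date_format(sample))
```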
Elasticsearch Demo
Search @ Traackr
   Answering authors by searching posts
Traackr search requirements

Posts are coming in at about 1 million a day
Each author averages several hundred posts
Posts need to be available for search immediately
Relevance and sorting have to be rolled up/grouped at
the author level
Early approach to search
search posts

group matched posts by author

for each grouped set, add up the
lucene scores of the posts

combine sum of post scores with
author social and website metrics
for final group score

sort groups (i.e. authors)

try to do this quickly!
Performance hit: combining the sum of post scores with author social and
website metrics for the final group score
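
The steps of the early approach can be sketched at the application layer as follows. This is an illustration under stated assumptions, not Traackr's code: `rank_authors` is a hypothetical name, and multiplying the summed post score by a single per-author metric is an invented combination rule, since the slides do not say how the two were combined.

```python
from collections import defaultdict


def rank_authors(post_hits, author_metrics):
    """Group post hits by author, sum their scores, then sort authors.

    post_hits: iterable of (author_id, lucene_score) pairs
    author_metrics: dict of author_id -> float (would come from the database)
    """
    score_sums = defaultdict(float)
    for author_id, score in post_hits:
        score_sums[author_id] += score
    # Hypothetical combination: scale summed relevance by the author metric.
    final = {a: s * author_metrics.get(a, 1.0) for a, s in score_sums.items()}
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    hits = [("a", 1.0), ("b", 2.0), ("a", 1.5)]
    print(rank_authors(hits, {"a": 2.0, "b": 1.0}))
```

Every step after the search itself runs outside the engine, which is where the extra latency comes from.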
Room for improvement

How can we avoid the “late binding” performance
penalty?
  Get the search engine to do as much of the scoring
  as possible
  Store all data needed for displaying results in the
  search engine (i.e. no db calls)
Alternatives - Denormalize?
 Index authors and their posts together
 under one document.
 Pros
    straightforward
    built-in post relevance sum
 Cons
    each profile change would trigger the
    reindexing of all the author’s posts
    each new post would trigger the
    reindexing of all the author’s posts +
    profile
    a non-starter for real-time search
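
A back-of-the-envelope sketch of why the denormalized layout is costly: if every new post re-sends the whole author document, the number of posts rewritten grows with the author's history. The function `reindex_cost` and the figure of 300 existing posts are assumptions for illustration only.

```python
def reindex_cost(existing_posts: int, new_posts: int) -> int:
    """Count posts rewritten when each new post re-sends the whole author document."""
    rewritten = 0
    for _ in range(new_posts):
        existing_posts += 1
        rewritten += existing_posts  # whole document, including every older post
    return rewritten


if __name__ == "__main__":
    # Three new posts for an author with 300 existing posts.
    print(reindex_cost(300, 3))
```

With posts indexed as separate documents, the same three new posts would cost three writes.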
Alternatives - Solr Join?
 “In many cases, documents have relationships between them and it is too expensive to denormalize
 them. Thus, a join operation is needed. Preserving the document relationship allows documents to
 be updated independently without having to reindex large numbers of denormalized documents.” -
 http://wiki.apache.org/solr/Join

 E.g. Find all post docs matching "search engines", then join them against author docs and return
 that list of authors:

 ...?q={!join+from=author_id+to=id}search+engines

 Pros
     addresses the issue of loading author profiles from db
 Cons
     Does not preserve the post relevance scores -> non-starter
     Submit patch to get scores? Wouldn’t touch SOLR-2272 with a ten-foot pole.
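
For reference, the join request parameter above can be built and URL-encoded like this. It is a sketch only: `join_query` is a hypothetical helper, and the code constructs the query string without sending it to a Solr instance.

```python
from urllib.parse import parse_qs, urlencode


def join_query(from_field: str, to_field: str, terms: str) -> str:
    """Build the URL-encoded q parameter for a Solr join query."""
    q = "{!join from=%s to=%s}%s" % (from_field, to_field, terms)
    return urlencode({"q": q})


if __name__ == "__main__":
    encoded = join_query("author_id", "id", "search engines")
    print(encoded)
    # Decoding gives back the readable local-params form.
    print(parse_qs(encoded)["q"][0])
```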
Alternatives - Solr Grouping?
  Groups results by a given document field (e.g. author_id)
  http://wiki.apache.org/solr/FieldCollapsing
  ...&q=real+time+search&group=true&group.field=author_id
[...]
  "grouped":{
    "author_id":{
      "matches":2,
      "groups":[{
          "groupValue":"04e3bc5078344ad1a065815f0bb9f14d",
          "doclist":{"maxScore":3.456747, "numFound":1,"start":0,"docs":[
               {
                 "id":"5d09240934eb331bada1ff3f0b773153",
                 "title":"Refresh API",
                 "url":"http://www.elasticsearch.org/guide/reference/api/admin-indices-refresh.html",
                 "author_id":"04e3bc5078344ad1a065815f0bb9f14d"}]
           }},
        {
          "groupValue":"9e4f40e1aa82f2e1a9368748d1268082",
          "doclist":{"maxScore":2.456747,"numFound":2,"start":0,"docs":[
               {
                 "id":"831ce82bdff34abeb495f260bc7d67d2",
                 "title":"Realtime Search: Solr vs Elasticsearch",
                 "url":"http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/",
                 "author_id":"9e4f40e1aa82f2e1a9368748d1268082"},
               [...]]
          }}]}}
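
A grouped response of this shape can be reduced to one row per author on the client side. The sketch below uses a small made-up sample (the group values and scores are invented, not taken from the slide) and a hypothetical helper `groups_by_max_score`. It shows that each group only exposes the maximum post score and a count.

```python
import json

# Made-up sample in the same shape as a Solr grouped response.
SAMPLE = json.loads("""
{"grouped": {"author_id": {"matches": 3, "groups": [
  {"groupValue": "author-a", "doclist": {"maxScore": 2.4, "numFound": 2, "docs": []}},
  {"groupValue": "author-b", "doclist": {"maxScore": 3.4, "numFound": 1, "docs": []}}
]}}}
""")


def groups_by_max_score(response: dict, field: str):
    """Return (group value, max score, post count) rows, best max score first."""
    groups = response["grouped"][field]["groups"]
    rows = [
        (g["groupValue"], g["doclist"]["maxScore"], g["doclist"]["numFound"])
        for g in groups
    ]
    return sorted(rows, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    for row in groups_by_max_score(SAMPLE, "author_id"):
        print(row)
```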
Alternatives - Solr Grouping?
 Pros
   Faster than doing grouping at the app layer: no
   need for post counting

   Possible to sort groups by sum of post relevance
   scores inside the engine (with some custom
   work)

 Cons
   No concept of author; author profiles still need to
   be fetched from db, so still suffers from some
   performance penalty

   Submit patch for group sort options? Not a lot of
   interest in sorting groups by anything other than
   max score:

        Don’t want to be stuck maintaining custom
        Solr code (been there, done that with HBase:
        http://www.slideshare.net/gstathis/finding-the-right-nosql-db-for-the-job-the-path-to-a-nonrdbms-solution-at-traackr)
Alternatives - Elasticsearch!
 Supports document types and parent/child document mappings:
 http://www.elasticsearch.org/guide/reference/mapping/parent-field.html

 {
     "post" : {
       "_parent" : {
         "type" : "author"
       }
     }
 }

 Out-of-the-box support for querying child documents and obtaining their parents:
 http://www.elasticsearch.org/guide/reference/query-dsl/top-children-query.html

 curl 'localhost:9200/traackr/_search?pretty=1' -d '{
     "query": {
       "top_children": {
         "type": "post",
         "query": {
           "query_string": {
             "query": "elasticsearch NRT"
           }
         },
         "score": "sum"
       }
     }
 }'

 Can order parent results by sum of child scores!

    Con: memory heavy

 Parent documents can be sorted by sum/avg/max of child scores
           Big win
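
A small sketch of the top_children request and of what the `score` option means for an author's final score. Both helpers, `top_children_query` and `roll_up`, are hypothetical names. `roll_up` only imitates the sum/avg/max roll-up in plain Python and says nothing about how the engine computes it internally.

```python
import json

SCORE_MODES = ("max", "sum", "avg")


def top_children_query(child_type: str, query_string: str, score: str = "sum") -> dict:
    """Build a top_children request body that scores parents from matching children."""
    if score not in SCORE_MODES:
        raise ValueError("score must be one of %s" % (SCORE_MODES,))
    return {
        "query": {
            "top_children": {
                "type": child_type,
                "query": {"query_string": {"query": query_string}},
                "score": score,
            }
        }
    }


def roll_up(child_scores, mode: str = "sum") -> float:
    """Imitate how child (post) scores roll up into one parent (author) score."""
    if mode == "sum":
        return sum(child_scores)
    if mode == "max":
        return max(child_scores)
    if mode == "avg":
        return sum(child_scores) / len(child_scores)
    raise ValueError("unknown mode: %s" % mode)


if __name__ == "__main__":
    print(json.dumps(top_children_query("post", "elasticsearch NRT"), indent=2))
    print(roll_up([1.0, 2.0, 3.0], "sum"))
```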
Top Children Demo
Other Elasticsearch benefits
 Lucene: don’t have to give up query syntax if you come from Solr

 In-JVM nodes: can use Java API to unit test different permutations of indexing
 configurations (e.g. different analyzers and tokenizers): great help for testing search
 on a qualitative basis; allows for embedded ES instances

 Index API and Cluster API: a great deal of cluster and index configuration changes
 can be made on the fly through curl API calls without restarting the cluster; very
 convenient for testing and cluster management

 Warmer API: significant help in avoiding search time drops due to segment merges;
 https://github.com/elasticsearch/elasticsearch/issues/1913

 Percolators: register queries and let the engine tell you which queries match on a
 given document; great potential for real-time;
 http://www.elasticsearch.org/guide/reference/api/percolate.html
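
The percolator idea (search in reverse) can be pictured with a toy in-memory version: queries are registered up front, and each incoming document is checked against them. `ToyPercolator` is a hypothetical class for illustration. It matches on whole lowercase words only and is not the Elasticsearch percolate API.

```python
class ToyPercolator:
    """Toy reverse search: stored queries are matched against incoming documents."""

    def __init__(self):
        self._queries = {}  # query name -> set of required terms

    def register(self, name: str, terms) -> None:
        self._queries[name] = {t.lower() for t in terms}

    def percolate(self, document: str) -> list:
        words = set(document.lower().split())
        # A query matches when all of its terms appear in the document.
        return sorted(n for n, terms in self._queries.items() if terms <= words)


if __name__ == "__main__":
    p = ToyPercolator()
    p.register("es-nrt", ["elasticsearch", "nrt"])
    p.register("solr", ["solr"])
    print(p.percolate("Elasticsearch NRT search at Traackr"))
```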
Q&A

Más contenido relacionado

La actualidad más candente

ElasticSearch - DevNexus Atlanta - 2014
ElasticSearch - DevNexus Atlanta - 2014ElasticSearch - DevNexus Atlanta - 2014
ElasticSearch - DevNexus Atlanta - 2014Roy Russo
 
Solr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseSolr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseAlexandre Rafalovitch
 
Simple search with elastic search
Simple search with elastic searchSimple search with elastic search
Simple search with elastic searchmarkstory
 
Elasticsearch an overview
Elasticsearch   an overviewElasticsearch   an overview
Elasticsearch an overviewAmit Juneja
 
Elastic search Walkthrough
Elastic search WalkthroughElastic search Walkthrough
Elastic search WalkthroughSuhel Meman
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchSperasoft
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchJason Austin
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to ElasticsearchClifford James
 
Cool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchCool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchclintongormley
 
Socialite, the Open Source Status Feed Part 2: Managing the Social Graph
Socialite, the Open Source Status Feed Part 2: Managing the Social GraphSocialite, the Open Source Status Feed Part 2: Managing the Social Graph
Socialite, the Open Source Status Feed Part 2: Managing the Social GraphMongoDB
 
Elasticsearch as a search alternative to a relational database
Elasticsearch as a search alternative to a relational databaseElasticsearch as a search alternative to a relational database
Elasticsearch as a search alternative to a relational databaseKristijan Duvnjak
 
Side by Side with Elasticsearch and Solr
Side by Side with Elasticsearch and SolrSide by Side with Elasticsearch and Solr
Side by Side with Elasticsearch and SolrSematext Group, Inc.
 
Data Exploration with Elasticsearch
Data Exploration with ElasticsearchData Exploration with Elasticsearch
Data Exploration with ElasticsearchAleksander Stensby
 
Distributed percolator in elasticsearch
Distributed percolator in elasticsearchDistributed percolator in elasticsearch
Distributed percolator in elasticsearchmartijnvg
 
Elastic search apache_solr
Elastic search apache_solrElastic search apache_solr
Elastic search apache_solrmacrochen
 
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)Gruter_TECHDAY_2014_01_SearchEngine (in Korean)
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)Gruter
 

La actualidad más candente (20)

ElasticSearch - DevNexus Atlanta - 2014
ElasticSearch - DevNexus Atlanta - 2014ElasticSearch - DevNexus Atlanta - 2014
ElasticSearch - DevNexus Atlanta - 2014
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Solr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseSolr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by Case
 
Simple search with elastic search
Simple search with elastic searchSimple search with elastic search
Simple search with elastic search
 
Elasticsearch an overview
Elasticsearch   an overviewElasticsearch   an overview
Elasticsearch an overview
 
Elastic search Walkthrough
Elastic search WalkthroughElastic search Walkthrough
Elastic search Walkthrough
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
Cool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchCool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearch
 
Socialite, the Open Source Status Feed Part 2: Managing the Social Graph
Socialite, the Open Source Status Feed Part 2: Managing the Social GraphSocialite, the Open Source Status Feed Part 2: Managing the Social Graph
Socialite, the Open Source Status Feed Part 2: Managing the Social Graph
 
Elasticsearch as a search alternative to a relational database
Elasticsearch as a search alternative to a relational databaseElasticsearch as a search alternative to a relational database
Elasticsearch as a search alternative to a relational database
 
Side by Side with Elasticsearch and Solr
Side by Side with Elasticsearch and SolrSide by Side with Elasticsearch and Solr
Side by Side with Elasticsearch and Solr
 
Data Exploration with Elasticsearch
Data Exploration with ElasticsearchData Exploration with Elasticsearch
Data Exploration with Elasticsearch
 
elasticsearch
elasticsearchelasticsearch
elasticsearch
 
Distributed percolator in elasticsearch
Distributed percolator in elasticsearchDistributed percolator in elasticsearch
Distributed percolator in elasticsearch
 
Elastic search apache_solr
Elastic search apache_solrElastic search apache_solr
Elastic search apache_solr
 
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)Gruter_TECHDAY_2014_01_SearchEngine (in Korean)
Gruter_TECHDAY_2014_01_SearchEngine (in Korean)
 
MongoDB and Schema Design
MongoDB and Schema DesignMongoDB and Schema Design
MongoDB and Schema Design
 

Destacado

OseeGenius - Semantic search engine and discovery platform
OseeGenius - Semantic search engine and discovery platformOseeGenius - Semantic search engine and discovery platform
OseeGenius - Semantic search engine and discovery platform@CULT Srl
 
JBug_React_and_Flux_2015
JBug_React_and_Flux_2015JBug_React_and_Flux_2015
JBug_React_and_Flux_2015Lukas Vlcek
 
Building search app with ElasticSearch
Building search app with ElasticSearchBuilding search app with ElasticSearch
Building search app with ElasticSearchLukas Vlcek
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
MoSQL: An Elastic Storage Engine for MySQL
MoSQL: An Elastic Storage Engine for MySQLMoSQL: An Elastic Storage Engine for MySQL
MoSQL: An Elastic Storage Engine for MySQLAlex Tomic
 
Social Miner: Webinar people marketing em 30 min
Social Miner: Webinar people marketing em 30 minSocial Miner: Webinar people marketing em 30 min
Social Miner: Webinar people marketing em 30 minSocial Miner
 
Oxalide Academy : Workshop #3 Elastic Search
Oxalide Academy : Workshop #3 Elastic SearchOxalide Academy : Workshop #3 Elastic Search
Oxalide Academy : Workshop #3 Elastic SearchOxalide
 
Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...
Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...
Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...Simone Onofri
 
quick intro to elastic search
quick intro to elastic search quick intro to elastic search
quick intro to elastic search medcl
 
Using Elastic Search Outside Full-Text Search
Using Elastic Search Outside Full-Text SearchUsing Elastic Search Outside Full-Text Search
Using Elastic Search Outside Full-Text SearchSumy PHP User Grpoup
 
Elastic search adaptto2014
Elastic search adaptto2014Elastic search adaptto2014
Elastic search adaptto2014Vivek Sachdeva
 
[Case machine learning- iColabora]Text Mining - classificando textos com Elas...
[Case machine learning- iColabora]Text Mining - classificando textos com Elas...[Case machine learning- iColabora]Text Mining - classificando textos com Elas...
[Case machine learning- iColabora]Text Mining - classificando textos com Elas...Jozias Rolim
 
03. ElasticSearch : Data In, Data Out
03. ElasticSearch : Data In, Data Out03. ElasticSearch : Data In, Data Out
03. ElasticSearch : Data In, Data OutOpenThink Labs
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in SlingTommaso Teofili
 
PHP Experience 2016 - [Workshop] Elastic Search: Turbinando sua aplicação PHP
PHP Experience 2016 - [Workshop] Elastic Search: Turbinando sua aplicação PHPPHP Experience 2016 - [Workshop] Elastic Search: Turbinando sua aplicação PHP
PHP Experience 2016 - [Workshop] Elastic Search: Turbinando sua aplicação PHPiMasters
 
Elastic Search Indexing Internals
Elastic Search Indexing InternalsElastic Search Indexing Internals
Elastic Search Indexing InternalsGaurav Kukal
 
Varnish & blue/green deployments
Varnish & blue/green deploymentsVarnish & blue/green deployments
Varnish & blue/green deploymentsOxalide
 

Destacado (20)

OseeGenius - Semantic search engine and discovery platform
OseeGenius - Semantic search engine and discovery platformOseeGenius - Semantic search engine and discovery platform
OseeGenius - Semantic search engine and discovery platform
 
JBug_React_and_Flux_2015
JBug_React_and_Flux_2015JBug_React_and_Flux_2015
JBug_React_and_Flux_2015
 
Building search app with ElasticSearch
Building search app with ElasticSearchBuilding search app with ElasticSearch
Building search app with ElasticSearch
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
MoSQL: An Elastic Storage Engine for MySQL
MoSQL: An Elastic Storage Engine for MySQLMoSQL: An Elastic Storage Engine for MySQL
MoSQL: An Elastic Storage Engine for MySQL
 
Social Miner: Webinar people marketing em 30 min
Social Miner: Webinar people marketing em 30 minSocial Miner: Webinar people marketing em 30 min
Social Miner: Webinar people marketing em 30 min
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Oxalide Academy : Workshop #3 Elastic Search
Oxalide Academy : Workshop #3 Elastic SearchOxalide Academy : Workshop #3 Elastic Search
Oxalide Academy : Workshop #3 Elastic Search
 
Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...
Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...
Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...
 
Oak / Solr integration
Oak / Solr integrationOak / Solr integration
Oak / Solr integration
 
Elastic search
Elastic searchElastic search
Elastic search
 
quick intro to elastic search
quick intro to elastic search quick intro to elastic search
quick intro to elastic search
 
Using Elastic Search Outside Full-Text Search
Using Elastic Search Outside Full-Text SearchUsing Elastic Search Outside Full-Text Search
Using Elastic Search Outside Full-Text Search
 
Elastic search adaptto2014
Elastic search adaptto2014Elastic search adaptto2014
Elastic search adaptto2014
 
[Case machine learning- iColabora]Text Mining - classificando textos com Elas...
[Case machine learning- iColabora]Text Mining - classificando textos com Elas...[Case machine learning- iColabora]Text Mining - classificando textos com Elas...
[Case machine learning- iColabora]Text Mining - classificando textos com Elas...
 
03. ElasticSearch : Data In, Data Out
03. ElasticSearch : Data In, Data Out03. ElasticSearch : Data In, Data Out
03. ElasticSearch : Data In, Data Out
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in Sling
 
PHP Experience 2016 - [Workshop] Elastic Search: Turbinando sua aplicação PHP
PHP Experience 2016 - [Workshop] Elastic Search: Turbinando sua aplicação PHPPHP Experience 2016 - [Workshop] Elastic Search: Turbinando sua aplicação PHP
PHP Experience 2016 - [Workshop] Elastic Search: Turbinando sua aplicação PHP
 
Elastic Search Indexing Internals
Elastic Search Indexing InternalsElastic Search Indexing Internals
Elastic Search Indexing Internals
 
Varnish & blue/green deployments
Varnish & blue/green deploymentsVarnish & blue/green deployments
Varnish & blue/green deployments
 

Similar a Elasticsearch & "PeopleSearch"

Workshop: Learning Elasticsearch
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning ElasticsearchAnurag Patel
 
Using ElasticSearch as a fast, flexible, and scalable solution to search occu...
Using ElasticSearch as a fast, flexible, and scalable solution to search occu...Using ElasticSearch as a fast, flexible, and scalable solution to search occu...
Using ElasticSearch as a fast, flexible, and scalable solution to search occu...kristgen
 
Elasticsearch And Ruby [RuPy2012]
Elasticsearch And Ruby [RuPy2012]Elasticsearch And Ruby [RuPy2012]
Elasticsearch And Ruby [RuPy2012]Karel Minarik
 
ElasticSearch for .NET Developers
ElasticSearch for .NET DevelopersElasticSearch for .NET Developers
ElasticSearch for .NET DevelopersBen van Mol
 
Elastic search intro-@lamper
Elastic search intro-@lamperElastic search intro-@lamper
Elastic search intro-@lampermedcl
 
Full-Text Search Explained - Philipp Krenn - Codemotion Rome 2017
Full-Text Search Explained - Philipp Krenn - Codemotion Rome 2017Full-Text Search Explained - Philipp Krenn - Codemotion Rome 2017
Full-Text Search Explained - Philipp Krenn - Codemotion Rome 2017Codemotion
 
曾勇 Elastic search-intro
曾勇 Elastic search-intro曾勇 Elastic search-intro
曾勇 Elastic search-introShaoning Pan
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchRuslan Zavacky
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearchMinsoo Jun
 
Elastic search and Symfony3 - A practical approach
Elastic search and Symfony3 - A practical approachElastic search and Symfony3 - A practical approach
Elastic search and Symfony3 - A practical approachSymfonyMu
 
Finding the right stuff, an intro to Elasticsearch (at Rug::B)
Finding the right stuff, an intro to Elasticsearch (at Rug::B) Finding the right stuff, an intro to Elasticsearch (at Rug::B)
Finding the right stuff, an intro to Elasticsearch (at Rug::B) Michael Reinsch
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyondErnesto Reig
 
10gen Presents Schema Design and Data Modeling
10gen Presents Schema Design and Data Modeling10gen Presents Schema Design and Data Modeling
10gen Presents Schema Design and Data ModelingDATAVERSITY
 
Introduction to MongoDB and Workshop
Introduction to MongoDB and WorkshopIntroduction to MongoDB and Workshop
Introduction to MongoDB and WorkshopAhmedabadJavaMeetup
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)javier ramirez
 
Using Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 FlowUsing Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 FlowKarsten Dambekalns
 

Similar a Elasticsearch & "PeopleSearch" (20)

Workshop: Learning Elasticsearch
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning Elasticsearch
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Using ElasticSearch as a fast, flexible, and scalable solution to search occu...
Using ElasticSearch as a fast, flexible, and scalable solution to search occu...Using ElasticSearch as a fast, flexible, and scalable solution to search occu...
Using ElasticSearch as a fast, flexible, and scalable solution to search occu...
 
Elasticsearch And Ruby [RuPy2012]
Elasticsearch And Ruby [RuPy2012]Elasticsearch And Ruby [RuPy2012]
Elasticsearch And Ruby [RuPy2012]
 
ElasticSearch for .NET Developers
ElasticSearch for .NET DevelopersElasticSearch for .NET Developers
ElasticSearch for .NET Developers
 
Elastic search intro-@lamper
Elastic search intro-@lamperElastic search intro-@lamper
Elastic search intro-@lamper
 
Full-Text Search Explained - Philipp Krenn - Codemotion Rome 2017
Full-Text Search Explained - Philipp Krenn - Codemotion Rome 2017Full-Text Search Explained - Philipp Krenn - Codemotion Rome 2017
Full-Text Search Explained - Philipp Krenn - Codemotion Rome 2017
 
曾勇 Elastic search-intro
曾勇 Elastic search-intro曾勇 Elastic search-intro
曾勇 Elastic search-intro
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearch
 
Elastic search and Symfony3 - A practical approach
Elastic search and Symfony3 - A practical approachElastic search and Symfony3 - A practical approach
Elastic search and Symfony3 - A practical approach
 
Elastic Search
Elastic SearchElastic Search
Elastic Search
 
Finding the right stuff, an intro to Elasticsearch (at Rug::B)
Finding the right stuff, an intro to Elasticsearch (at Rug::B) Finding the right stuff, an intro to Elasticsearch (at Rug::B)
Finding the right stuff, an intro to Elasticsearch (at Rug::B)
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
 
10gen Presents Schema Design and Data Modeling
10gen Presents Schema Design and Data Modeling10gen Presents Schema Design and Data Modeling
10gen Presents Schema Design and Data Modeling
 
Introduction to MongoDB and Workshop
Introduction to MongoDB and WorkshopIntroduction to MongoDB and Workshop
Introduction to MongoDB and Workshop
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)
 
Using Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 FlowUsing Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 Flow
 
Elasticsearch as a Database?
Elasticsearch as a Database?Elasticsearch as a Database?
Elasticsearch as a Database?
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 

Elasticsearch & "PeopleSearch"

  • 1. Elasticsearch & “PeopleSearch” Leveraging Elasticsearch @
  • 2. About Traackr A search engine A people discovery engine Subscription-based Migrated from Solr to Elasticsearch in Q3 ’12
  • 3. About me 14+ years of experience building full-stack web software systems with a past focus on e- commerce and publishing VP Engineering @ Traackr, responsible for building engineering capability to enable Traackr's growth goals about.me/george-stathis
  • 4. About this talk Short intro to Elasticsearch How search is done @ Traackr Why Elasticsearch was the right fit
  • 5. About Elasticsearch Lucene under the covers Distributed from the ground up Full support for Lucene Near Real-Time search Native JSON Query DSL Automatic schema detection (“schema-less”) Supports document types
  • 6. Elasticsearch - Distributed Indices broken into shards shards have 0 or more replicas data nodes hold one or more shards data nodes can coordinate/forward requests automatic routing & rebalancing but overrides available Default mode is multicast (zen discovery), unicast available for multicast unfriendly networks, AWS plug-in available, Zookeeper plug-in available made possible by Sonian. YouTube demo: http://youtu.be/ Source: https://confluence.oceanobservatories.org/display/CIDev/Indexing+with+ElasticSearch l4ReamjCxHo
  • 7. Elasticsearch - NRT Uses Lucene’s IndexReader.open(IndexWriter writer, boolean applyAllDeletes) Opens a near real time IndexReader from the IndexWriter By default, flushes and makes new updates available every second
  • 8. Elasticsearch - JSON DSL # Query String curl 'localhost:9200/test/_search?pretty=1' -d '{ "query" : { "query_string" : { "query" : "tags:scala" } } }' Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh # Range curl 'localhost:9200/test/_search?pretty=1' -d '{ "query" : { "range" : { "price" : { "gt" : 15 } } } }' Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh
  • 9. Elasticsearch - JSON DSL (cont) # Filtered Query # Filters are similar to queries, except they do no scoring # and are easily cached. # There are many filter types as well, including range and term curl 'localhost:9200/test/_search?pretty=1' -d '{ "query" : { "filtered" : { "query" : { "query_string" : { "query" : "tags:scala" } }, "filter" : { "range" : { "price" : { "gt" : 15 } } } } } }' Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh
  • 10. Elasticsearch - Schema Dynamic object mapping with intelligent defaults Can be turned off Can be overridden globally or on a per index basis: { "_default_" : { "date_formats" : ["yyyy-MM-dd", "dd-MM-yyyy", "date_optional_time"], } }
  • 12. Search @ Traackr Answering authors by searching posts
  • 13. Traackr search requirements Posts are coming in at about 1 million a day Each author averages several hundred posts Posts need to be available for search immediately Relevance and sorting has to be rolled up/grouped at the author level
  • 14. Early approach to search search posts group matched posts by author for each grouped set, add up the lucene scores of the posts combine sum of post scores with author social and website metrics for final group score sort groups (i.e. authors) try to do this quickly!
  • 15. Early approach to search
      - Search posts
      - Group matched posts by author
      - For each grouped set, add up the Lucene scores of the posts
      - Combine the sum of post scores with author social and website metrics for the final group score
      - Sort groups (i.e. authors)
      - Try to do this quickly!
      Callout: Performance hit
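The application-layer steps above can be sketched roughly as follows. This is an illustration only: the slides do not say how post scores and author metrics were combined, so the additive formula, the `metric_weight` parameter, and the function name `rank_authors` are all assumptions.

```python
from collections import defaultdict


def rank_authors(post_hits, author_metrics, metric_weight=1.0):
    """Group post hits by author, sum their scores, and sort authors.

    post_hits: iterable of (author_id, lucene_score) pairs from a post search.
    author_metrics: dict of author_id -> precomputed social/website metric.
    The additive combination below is a hypothetical stand-in.
    """
    totals = defaultdict(float)
    for author_id, score in post_hits:
        totals[author_id] += score  # sum of post scores per author

    ranked = [
        (author_id, total + metric_weight * author_metrics.get(author_id, 0.0))
        for author_id, total in totals.items()
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)  # best author first
    return ranked


hits = [("a1", 2.0), ("a2", 3.0), ("a1", 1.5)]
metrics = {"a1": 0.5, "a2": 0.25}
ranking = rank_authors(hits, metrics)
print(ranking)
```

Every matching post has to be pulled out of the engine before the loop can run, which is where the performance hit comes from when authors average several hundred posts each.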
  • 16. Room for improvement
      - How can we avoid the “late binding” performance penalty?
      - Get the search engine to do as much of the scoring as possible
      - Store all data needed for displaying results in the search engine (i.e. no db calls)
  • 17. Alternatives - Denormalize?
      Index authors and their posts together under one document.
      Pros
      - Straightforward
      - Built-in post relevance sum
      Cons
      - Each profile change would trigger the reindexing of all the author’s posts
      - Each new post would trigger the reindexing of all the author’s posts + profile
      - A non-starter for real-time search
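A rough back-of-the-envelope calculation shows why the reindexing cost rules this option out. The daily post volume comes from the requirements slide; the figure of 300 posts per author is a hypothetical stand-in for "several hundred", and the function name is illustrative.

```python
POSTS_PER_DAY = 1_000_000      # "about 1 million a day" from the requirements slide
AVG_POSTS_PER_AUTHOR = 300     # hypothetical value for "several hundred"


def posts_written_per_day(new_posts, avg_posts_per_author, denormalized):
    """Approximate number of post bodies the indexer writes per day.

    With one document per post, each new post is written once.
    With one document per author, each new post rewrites the whole
    author document, i.e. roughly all of that author's posts.
    """
    if denormalized:
        return new_posts * avg_posts_per_author
    return new_posts


separate = posts_written_per_day(POSTS_PER_DAY, AVG_POSTS_PER_AUTHOR, denormalized=False)
combined = posts_written_per_day(POSTS_PER_DAY, AVG_POSTS_PER_AUTHOR, denormalized=True)
print(separate, combined, combined // separate)
```

Under these assumed numbers the denormalized layout writes about 300 times as much data per day, before counting profile changes.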
  • 18. Alternatives - Solr Join?
      “In many cases, documents have relationships between them and it is too expensive to denormalize them. Thus, a join operation is needed. Preserving the document relationship allows documents to be updated independently without having to reindex large numbers of denormalized documents.” - http://wiki.apache.org/solr/Join
      E.g. find all post docs matching "search engines", then join them against author docs and return that list of authors:
      ...?q={!join+from=author_id+to=id}search+engines
      Pros
      - Addresses the issue of loading author profiles from db
      Cons
      - Does not preserve the post relevance scores -> non-starter
      - Submit a patch to get scores? Wouldn’t touch SOLR-2272 with a ten-foot pole.
  • 19. Alternatives - Solr Grouping?
      Groups results by a given document field (e.g. author_id): http://wiki.apache.org/solr/FieldCollapsing
      ...&q=real+time+search&group=true&group.field=author_id
      [...]
      "grouped":{
        "author_id":{
          "matches":2,
          "groups":[{
              "groupValue":"04e3bc5078344ad1a065815f0bb9f14d",
              "doclist":{"maxScore":3.456747,"numFound":1,"start":0,"docs":[
                  {
                    "id":"5d09240934eb331bada1ff3f0b773153",
                    "title":"Refresh API",
                    "url":"http://www.elasticsearch.org/guide/reference/api/admin-indices-refresh.html",
                    "author_id":"04e3bc5078344ad1a065815f0bb9f14d"}]
              }},
            {
              "groupValue":"9e4f40e1aa82f2e1a9368748d1268082",
              "doclist":{"maxScore":2.456747,"numFound":2,"start":0,"docs":[
                  {
                    "id":"831ce82bdff34abeb495f260bc7d67d2",
                    "title":"Realtime Search: Solr vs Elasticsearch",
                    "url":"http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/",
                    "author_id":"9e4f40e1aa82f2e1a9368748d1268082"},
                  [...]]
              }}]}}
  • 20. Alternatives - Solr Grouping?
      Pros
      - Faster than doing grouping at the app layer: no need for post counting
      - Possible to sort groups by sum of post relevance scores inside the engine (with some custom work)
      Cons
      - No concept of author; author profiles still need to be fetched from db, so it still suffers from some performance penalty
      - Submit a patch for group sort options? Not a lot of interest in sorting groups by anything other than max score
      - Don’t want to be stuck maintaining custom Solr code (been there, done that with HBase: http://www.slideshare.net/gstathis/finding-the-right-nosql-db-for-the-job-the-path-to-a-nonrdbms-solution-at-traackr)
  • 21. Alternatives - Elasticsearch!
      - Supports document types and parent/child document mappings: http://www.elasticsearch.org/guide/reference/mapping/parent-field.html
      {
          "post" : {
              "_parent" : {
                  "type" : "author"
              }
          }
      }
      - Out-of-the-box support for querying child documents and obtaining their parents: http://www.elasticsearch.org/guide/reference/query-dsl/top-children-query.html. Con: memory heavy
      curl 'localhost:9200/traackr/_search?pretty=1' -d '{
          "query": {
              "top_children": {
                  "type": "post",
                  "query": {
                      "query_string": {
                          "query": "elasticsearch NRT"
                      }
                  },
                  "score": "sum"
              }
          }
      }'
      - Parent documents can be sorted by sum/avg/max of child scores
      Callout: can order parent results by sum of child scores!
  • 22. Alternatives - Elasticsearch! (repeats slide 21 with an added callout)
      Callout: Big win
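To see what the `"score": "sum"` setting in the `top_children` query changes, the effect of the sum/avg/max modes can be imitated outside the engine. The sketch below is a toy model, not how Elasticsearch implements the query; the function name `score_parents` and the sample scores are made up for illustration.

```python
from collections import defaultdict

# How child (post) scores are folded into one parent (author) score.
SCORE_MODES = {
    "sum": sum,
    "max": max,
    "avg": lambda scores: sum(scores) / len(scores),
}


def score_parents(child_hits, mode="sum"):
    """child_hits: iterable of (parent_id, child_score) pairs."""
    combine = SCORE_MODES[mode]
    by_parent = defaultdict(list)
    for parent_id, score in child_hits:
        by_parent[parent_id].append(score)
    scored = {pid: combine(scores) for pid, scores in by_parent.items()}
    # Highest-scoring parent first.
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


child_hits = [("author-1", 1.0), ("author-1", 1.5), ("author-2", 2.0)]
by_sum = score_parents(child_hits, "sum")
by_max = score_parents(child_hits, "max")
by_avg = score_parents(child_hits, "avg")
print(by_sum, by_max, by_avg)
```

With `sum`, an author with several relevant posts outranks an author with one strong post, which is the roll-up the earlier slides did at the application layer; with `max` or `avg` the order flips in this example.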
  • 24. Other Elasticsearch benefits
      - Lucene: don’t have to give up query syntax if you come from Solr
      - In-JVM nodes: can use the Java API to unit test different permutations of indexing configurations (e.g. different analyzers and tokenizers); a great help for testing search on a qualitative basis; allows for embedded ES instances
      - Index API and Cluster API: a great deal of cluster and index configuration changes can be made on the fly through curl API calls without restarting the cluster; very convenient for testing and cluster management
      - Warmer API: significant help in avoiding search time drops due to segment merges; https://github.com/elasticsearch/elasticsearch/issues/1913
      - Percolators: register queries and let the engine tell you which queries match on a given document; great potential for real-time; http://www.elasticsearch.org/guide/reference/api/percolate.html
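The percolator idea inverts the usual search flow: queries are stored first, and each incoming document is checked against them. The toy Python class below illustrates only that inversion; it is not the Elasticsearch percolate API, and `ToyPercolator`, its methods, and the all-terms-must-match rule are assumptions made for the example.

```python
class ToyPercolator:
    """In-memory stand-in: stored queries are sets of required terms."""

    def __init__(self):
        self._queries = {}

    def register(self, name, required_terms):
        # Store the query under a name, normalised to lowercase terms.
        self._queries[name] = {term.lower() for term in required_terms}

    def percolate(self, text):
        # Return the names of all stored queries whose terms all occur in the text.
        tokens = set(text.lower().split())
        return sorted(
            name for name, terms in self._queries.items() if terms <= tokens
        )


percolator = ToyPercolator()
percolator.register("es-nrt", ["elasticsearch", "nrt"])
percolator.register("solr", ["solr"])
matches = percolator.percolate("Elasticsearch NRT search explained")
print(matches)
```

In a real-time setting, each new post would be percolated as it arrives so that interested subscribers can be notified without re-running their searches.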
  • 25. Q&A

Editor's notes

  6. - Important to differentiate from Solr Cloud
     - Solr Cloud is in trunk but not quite out yet; it will come out with Lucene 4.0
     - Solr Cloud uses Zookeeper to coordinate the cluster; in ES, coordination is built into every node (there is an issue with nodes losing connectivity with the cluster and electing themselves as master; ES can use ZK as a plugin)
     - ES uses multicast, so if the network does not support it, you need to switch to unicast
     - Both support distributed NRT
     - Refer to http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/
  10. - Talk about how ES differs from Solr in that it detects the fields based on the content; Solr has the wildcard definitions
      - Solr schema.xml vs. ES REST API-driven JSON DSL config, which can be dynamic
  11. If the curl statements get snoozes, show the real app demo
  21. If the curl statements get snoozes, show the real app demo
  22. Percolators? Don’t trigger when a record is available for searching (Igor’s comment)