SlideShare una empresa de Scribd logo
1 de 25
Descargar para leer sin conexión
Transforming the house hunting experience

       Alex Burmester, Alexander Kanarsky, Trulia
About Us

 Trulia: Founded in 2005
 Technology company in downtown San
  Francisco transforming real estate search.
 14M unique visitors per month
 Large and active user community
 Strong industry relationships: agents, brokers,
  over 500k registered professionals
 Helping consumers make informed decisions
 About the speakers


                                                    2
Trulia’s Marketplace

Consumers                                              Professionals
• $1Tr Real Estate                                      • $20Bn Real Estate Mktg
  Economy                                               • $7Bn Online Today

• ~5M Home sales/yr           MARKETPLACE               • >1M Realtors
• ~16M Leases/yr                                        • >1M Landlords/Managers
• >70M Homeowners                                       • >1M Local Contractors




 Most Compelling                                        Essential Marketing
      Product:                                              Solution:
  Comprehensive                                       Huge exposure, qualified
    data, unique                                      customers, powerful tools
 insights, great UX

           We create value by connecting consumers and professionals
Problem overview

 Real Estate search specifics
 Traditional Search: Agent based/MLS
 No single MLS country wide
 Largest financial transaction for most people
  with potential for huge life tradeoffs
 Lots of data but silo'd off with data quality
  issues and differing disclosure rules by area.
The House Hunting Process

 Buying, investing, renting
 How online search changes this
 Trulia was first site to mash-up lots of data
  nationwide, maps, heatmaps, tax info, rentals,
  street-view
 Focus on speed
 Relevancy of data
 Amount of information
 Targeted local RE search experience
Challenges

   Integrating large diverse datasets
   Data quality and freshness
   Delivering the right data to the right people
   Scaling for growth
   Meaning of Location, location, location
Legacy Approach

   MLS-like database search
   MySQL 4.x
   Limitations on data processing
   Limitations on scalability
   Replication problems
   Update speed problems
Why Solr?
 Fast, flexible attribute search, faceting
 Non-uniform data handling, dynamic schema
 Easy, uniform API for indexing, query
 Excellent Full-Text Search
 Indexing, Search scale great (Hadoop indexing,
  distributed queries, replication)
 Stored content: can be used as a data store
 Add-Ons: Geospatial search, Field Collapsing,
  Automatic Data Import via DIH
 Strong Developers/Users Community
Hadoop Indexing

  Based on SOLR-1301 patch

  Reduce-side indexing

  Map-side partitioning (sharding)

  Input Data in HDFS

  Output Indexes in HDFS

  Local FS for temporary data processing

  120+ mil. documents index in less than 1 hour

  Paired with Index Manager
Hadoop Indexing

                                              HDFS
                                  Mappers       Part 1
        Partitioners                            Part 2
                                                Part 3
                                                Part 4
                                                …
                                                Part N


                                             Input data

                       Embedded
                         Solr     Reducers
                                                Shard 1
Local                                           ...
 FS                                             Shard K




                                             Output data
Index management

  Index Manager: custom controller

  Manages index deployment from HDFS

  Controls load balancing

  Manages forced replication on slaves

  Interacts with caching layer

  Handles Direct indexing

  Visualizes the state of the indexes

  Alternatives: Katta, Solr Cloud
Index Management

                                                                                Search
      HDFS                                                                     Requests


        Shard 1
        Shard 2
        Shards 3

                                                                          Caching/Load Balancing Layer


                                                   Master                 Slave 1             Slave N
Pulls data
from HDFS                                                                               ...
                                                    Shard 1                   Shard 1            Shard 1
                    Index                           Shard 2                   Shard 2            Shard 2
                   Manager                          Shard 3                   Shard 3            Shard 3
                             Installs
                             Master Index




                                            Controls slaves replication
Direct Indexer

  Handles various updates between the batches

  Patches, pictures updates etc.

  Only does certain type of documents

  Directly updates master index shards

  Interacts with Index Manager (scope of
changes)

    Runs on a regular basis
Direct Indexer
                                   Data Sources
                                                                                  Search
      HDFS                                                                       Requests


        Shard 1                      Direct Indexer
        Shard 2
        Shards 3

                                                                            Caching/Load Balancing Layer


                                                         Master             Slave 1             Slave N
Pulls data
from HDFS                                                                                 ...
                                                      Shard 1                   Shard 1            Shard 1
                    Index                             Shard 2                   Shard 2            Shard 2
                   Manager                            Shard 3                   Shard 3            Shard 3
                               Installs
                               Master Index




                                              Controls slaves replication
Distributed Search

Query Aggregator uses the same partitioning schema
as Hadoop Indexer, so only the relevant shards are
queried. Query extra parameters define the scope.

            Query 1                       Shard 1
Requests                  Query
                                            ...
                        Aggregator
                                          Shard N




            Query 2                       Shard 1
                          Query
Requests                Aggregator
                                            ...
                                          Shard N
Distributed Search

  Many servers, many shards                Aggregated Query time
                                                distribution

  Request is sent only to       100%



relevant shards                 90%




  For search I/O wait matters   80%




  So SSDs rule :-)
                                70%


                                60%

  30-50 qps per shard                                                   >1s
                                50%                                     <1s

  90% of queries < 100 ms,      40%
                                                                        <100
                                                                        <10


  99% within 1sec               30%


  Separate indexes for some     20%

documents                       10%


  50% of queries < 10 ms         0%
                                       6     7   8   9   10   11   12

  Caching/load balancing
                                       Time period: 6 am to 12 pm
matters
Geospatial search

    Local search is super important

    Mobile search is naturally spatial-oriented

    Communities, neighborhoods, enclaves

    POI search

    Comparable Properties

    Personalized Search: Factoring census data in

    How to enable it with Solr?

    No standard solution yet
Geospatial Solr/Lucene

    SOLR-733 patch

    Local Lucene / Local Solr (Patrick O'Leary)

    Spatial Lucene in Lucene 3.1

    JTeam's SSP (Chris Male)

    SOLR-2155 patch (David Smiley)

    Lucene-Spatial-Playground (David Smiley, Chris
    Male, Ryan McKinley)

    Spatial functions is Solr 3.1
Geospatial search


  What kind of spatial search do we need?

  Radius search

  Polygon search

  Sort by distance is needed for radius search

  Implementation chosen: SSP-based, fast
geometry

   INFO: [active] webapp= path=/select/ params={fl=streetAddress_s,lat,lng&wt=kml&q={!spatial+polygons%3D37.77960208641734,-
122.43867874145508;37.783265262376574,-122.43945121765137;37.78414711095678,-122.4330997467041;37.788827515747435,-
122.43404388427734;37.79038758480464,-122.4221134185791;37.781908551707666,-122.420654296875;37.774988939930005,-
122.42932319641113;37.774988939930005,-122.4360179901123;37.77817746896081,-122.43687629699707;37.777634750327046,-
122.43867874145508;37.77960208641734,-122.43867874145508;}*:*&rows=1000} hits=75 status=0 QTime=6
Geospatial Search

  Circle search: given (lat, lng,
radius) return list of docs with
lat/lng indexed

  Sorting: distance from the
center

  Polygon search: given a
polygon (set of lat/lng pairs)
return list of docs with lat/lng
indexed

  Bounding boxes: lat,lng
range queries

  Postponed distance
calculation (query response
write time)
Geospatial Search


    Spatial query parser: {!spatial <params>}

    Can be wrapped as a filter query

    Facet by spatial queries

    Planar geometry: fast, but has limitations

    Good for our use cases, YMMV

    Polygon compression: Google's delta-encoding

    TBD: shapes indexing
Wrap Up
 We are reshaping real estate search - with the right
  tools!
 Hadoop + Solr way is great for frequent batch
  indexing
 Solr Search is fast - and highly scalable through
  sharding and replication
 Geospatial Solr search is still not standardized,
  lots of progress in development
 Open Source: Trulia contributes



                                                         22
Links
 SOLR-1301: Hadoop Indexing Contrib
  • issues.apache.org/jira/browse/SOLR-1301
 Spatial Solr Plugin (SSP)
  • www.jteam.nl/products/spatialsolrplugin.html
 SSP Polygons/Polylines Extension
  • sourceforge.net/projects/ssplex/files/




                                                   23
Contact Information
 Trulia
   • www.trulia.com
   • www.linkedin.com/company/trulia
 Alex Burmester
   • aburmester@trulia.com
 Alexander Kanarsky
   • akanarsky@trulia.com




                                       24
The End


    Q&A

    Trulia is hiring!
     – www.trulia.com/jobs
     – Work in downtown San Francisco with great people
     – Looking for engineers specializing in search, distributed data
       processing, visualization/data science, mobile platforms, front
       end and more.

    Thank you for coming!

Más contenido relacionado

Destacado (12)

Leveraging Advertising And Technology To Scale Your Business
Leveraging Advertising And Technology To Scale Your BusinessLeveraging Advertising And Technology To Scale Your Business
Leveraging Advertising And Technology To Scale Your Business
 
Staying Ahead in a World of Change
Staying Ahead in a World of ChangeStaying Ahead in a World of Change
Staying Ahead in a World of Change
 
10 Tips From Top Real Estate Agents
10 Tips From Top Real Estate Agents10 Tips From Top Real Estate Agents
10 Tips From Top Real Estate Agents
 
What We Thought We Knew: Surprising Truths About Buyers and Sellers
What We Thought We Knew: Surprising Truths About Buyers and SellersWhat We Thought We Knew: Surprising Truths About Buyers and Sellers
What We Thought We Knew: Surprising Truths About Buyers and Sellers
 
Housing Market Trends Report: Spring 2016
Housing Market Trends Report: Spring 2016Housing Market Trends Report: Spring 2016
Housing Market Trends Report: Spring 2016
 
Do I Stay Or Do I Grow?
Do I Stay Or Do I Grow?Do I Stay Or Do I Grow?
Do I Stay Or Do I Grow?
 
The Zillow Zero Marketing Strategy
The Zillow Zero Marketing StrategyThe Zillow Zero Marketing Strategy
The Zillow Zero Marketing Strategy
 
Subject pronouns to_be_present_continuous
Subject pronouns to_be_present_continuousSubject pronouns to_be_present_continuous
Subject pronouns to_be_present_continuous
 
Structuring Your Real Estate Team For Success
Structuring Your Real Estate Team For SuccessStructuring Your Real Estate Team For Success
Structuring Your Real Estate Team For Success
 
Training Your Real Estate Team
Training Your Real Estate TeamTraining Your Real Estate Team
Training Your Real Estate Team
 
What will tech VCs fund this year?
What will tech VCs fund this year? What will tech VCs fund this year?
What will tech VCs fund this year?
 
Zillow Group's Product Roadmap
Zillow Group's Product RoadmapZillow Group's Product Roadmap
Zillow Group's Product Roadmap
 

Similar a Transforming house hunting with solr at trulia - By Kanarsky Alexander and Alex Burmester

HPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with KattaHPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with KattaTed Dunning
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduceFARUK BERKSÖZ
 
Laserdata i skyen - Geomatikkdagene 2013
Laserdata i skyen - Geomatikkdagene 2013Laserdata i skyen - Geomatikkdagene 2013
Laserdata i skyen - Geomatikkdagene 2013Geodata AS
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messagesyarapavan
 
Realtime Apache Hadoop at Facebook
Realtime Apache Hadoop at FacebookRealtime Apache Hadoop at Facebook
Realtime Apache Hadoop at Facebookparallellabs
 
Balancing Replication and Partitioning in a Distributed Java Database
Balancing Replication and Partitioning in a Distributed Java DatabaseBalancing Replication and Partitioning in a Distributed Java Database
Balancing Replication and Partitioning in a Distributed Java DatabaseBen Stopford
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache AccumuloJared Winick
 
Webinar: General Technical Overview of MongoDB
Webinar: General Technical Overview of MongoDBWebinar: General Technical Overview of MongoDB
Webinar: General Technical Overview of MongoDBMongoDB
 
Sharding Architectures
Sharding ArchitecturesSharding Architectures
Sharding Architecturesguest0e6d5e
 
Federated HDFS
Federated HDFSFederated HDFS
Federated HDFShuguk
 
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...kcitp
 
Making Big Data Analytics Interactive and Real-­Time
 Making Big Data Analytics Interactive and Real-­Time Making Big Data Analytics Interactive and Real-­Time
Making Big Data Analytics Interactive and Real-­TimeSeven Nguyen
 
MongoDB Basic Concepts
MongoDB Basic ConceptsMongoDB Basic Concepts
MongoDB Basic ConceptsMongoDB
 
Scaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseScaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseAge Mooij
 
Decade architecture discussion 20110311
Decade architecture discussion 20110311Decade architecture discussion 20110311
Decade architecture discussion 20110311chenlijiang
 
Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)
Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)
Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)Sharad Agarwal
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopGeorge Ang
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataJetlore
 

Similar a Transforming house hunting with solr at trulia - By Kanarsky Alexander and Alex Burmester (20)

ElephantDB
ElephantDBElephantDB
ElephantDB
 
HPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with KattaHPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with Katta
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
 
Laserdata i skyen - Geomatikkdagene 2013
Laserdata i skyen - Geomatikkdagene 2013Laserdata i skyen - Geomatikkdagene 2013
Laserdata i skyen - Geomatikkdagene 2013
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messages
 
Realtime Apache Hadoop at Facebook
Realtime Apache Hadoop at FacebookRealtime Apache Hadoop at Facebook
Realtime Apache Hadoop at Facebook
 
Balancing Replication and Partitioning in a Distributed Java Database
Balancing Replication and Partitioning in a Distributed Java DatabaseBalancing Replication and Partitioning in a Distributed Java Database
Balancing Replication and Partitioning in a Distributed Java Database
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
 
Webinar: General Technical Overview of MongoDB
Webinar: General Technical Overview of MongoDBWebinar: General Technical Overview of MongoDB
Webinar: General Technical Overview of MongoDB
 
Sharding Architectures
Sharding ArchitecturesSharding Architectures
Sharding Architectures
 
Federated HDFS
Federated HDFSFederated HDFS
Federated HDFS
 
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
 
Making Big Data Analytics Interactive and Real-­Time
 Making Big Data Analytics Interactive and Real-­Time Making Big Data Analytics Interactive and Real-­Time
Making Big Data Analytics Interactive and Real-­Time
 
MongoDB Basic Concepts
MongoDB Basic ConceptsMongoDB Basic Concepts
MongoDB Basic Concepts
 
Scaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseScaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBase
 
Decade architecture discussion 20110311
Decade architecture discussion 20110311Decade architecture discussion 20110311
Decade architecture discussion 20110311
 
Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)
Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)
Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using Hadoop
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Hadoop Inside
Hadoop InsideHadoop Inside
Hadoop Inside
 

Más de lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Más de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Último

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Último (20)

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Transforming house hunting with solr at trulia - By Kanarsky Alexander and Alex Burmester

  • 1. Transforming the house hunting experience Alex Burmester, Alexander Kanarsky, Trulia
  • 2. About Us  Trulia: Founded in 2005  Technology company in downtown San Francisco transforming real estate search.  14M unique visitors per month  Large and active user community  Strong industry relationships: agents, brokers, over 500k registered professionals  Helping consumers make informed decisions  About the speakers 2
  • 3. Trulia’s Marketplace Consumers Professionals • $1Tr Real Estate • $20Bn Real Estate Mktg Economy • $7Bn Online Today • ~5M Home sales/yr MARKETPLACE • >1M Realtors • ~16M Leases/yr • >1M Landlords/Managers • >70M Homeowners • >1M Local Contractors Most Compelling Essential Marketing Product: Solution: Comprehensive Huge exposure, qualified data, unique customers, powerful tools insights, great UX We create value by connecting consumers and professionals
  • 4. Problem overview  Real Estate search specifics  Traditional Search: Agent based/MLS  No single MLS country wide  Largest financial transaction for most people with potential for huge life tradeoffs  Lots of data but silo'd off with data quality issues and differing disclosure rules by area.
  • 5. The House Hunting Process  Buying, investing, renting  How online search changes this  Trulia was first site to mash-up lots of data nationwide, maps, heatmaps, tax info, rentals, street-view  Focus on speed  Relevancy of data  Amount of information  Targeted local RE search experience
  • 6. Challenges  Integrating large diverse datasets  Data quality and freshness  Delivering the right data to the right people  Scaling for growth  Meaning of Location, location, location
  • 7. Legacy Approach  MLS-like database search  MySQL 4.x  Limitations on data processing  Limitations on scalability  Replication problems  Update speed problems
  • 8. Why Solr?  Fast, flexible attribute search, faceting  Non-uniform data handling, dynamic schema  Easy, uniform API for indexing, query  Excellent Full-Text Search  Indexing, Search scale great (Hadoop indexing, distributed queries, replication)  Stored content: can be used as a data store  Add-Ons: Geospatial search, Field Collapsing, Automatic Data Import via DIH  Strong Developers/Users Community
  • 9. Hadoop Indexing  Based on SOLR-1301 patch  Reduce-side indexing  Map-side partitioning (sharding)  Input Data in HDFS  Output Indexes in HDFS  Local FS for temporary data processing  120+ mil. documents index in less than 1 hour  Paired with Index Manager
  • 10. Hadoop Indexing HDFS Mappers Part 1 Partitioners Part 2 Part 3 Part 4 … Part N Input data Embedded Solr Reducers Shard 1 Local ... FS Shard K Output data
  • 11. Index management  Index Manager: custom controller  Manages index deployment from HDFS  Controls load balancing  Manages forced replication on slaves  Interacts with caching layer  Handles Direct indexing  Visualizes the state of the indexes  Alternatives: Katta, Solr Cloud
  • 12. Index Management Search HDFS Requests Shard 1 Shard 2 Shards 3 Caching/Load Balancing Layer Master Slave 1 Slave N Pulls data from HDFS ... Shard 1 Shard 1 Shard 1 Index Shard 2 Shard 2 Shard 2 Manager Shard 3 Shard 3 Shard 3 Installs Master Index Controls slaves replication
  • 13. Direct Indexer  Handles various updates between the batches  Patches, pictures updates etc.  Only does certain type of documents  Directly updates master index shards  Interacts with Index Manager (scope of changes)  Runs on a regular basis
  • 14. Direct Indexer Data Sources Search HDFS Requests Shard 1 Direct Indexer Shard 2 Shards 3 Caching/Load Balancing Layer Master Slave 1 Slave N Pulls data from HDFS ... Shard 1 Shard 1 Shard 1 Index Shard 2 Shard 2 Shard 2 Manager Shard 3 Shard 3 Shard 3 Installs Master Index Controls slaves replication
  • 15. Distributed Search Query Aggregator uses the same partitioning schema as Hadoop Indexer, so only the relevant shards are queried. Query extra parameters define the scope. Query 1 Shard 1 Requests Query ... Aggregator Shard N Query 2 Shard 1 Query Requests Aggregator ... Shard N
  • 16. Distributed Search  Many servers, many shards Aggregated Query time distribution  Request is sent only to 100% relevant shards 90%  For search I/O wait matters 80% So SSDs rule :-) 70%  60%  30-50 qps per shard >1s 50% <1s  90% of queries < 100 ms, 40% <100 <10  99% within 1sec 30%  Separate indexes for some 20% documents 10%  50% of queries < 10 ms 0% 6 7 8 9 10 11 12  Caching/load balancing Time period: 6 am to 12 pm matters
  • 17. Geospatial search  Local search is super important  Mobile search is naturally spatial-oriented  Communities, neighborhoods, enclaves  POI search  Comparable Properties  Personalized Search: Factoring census data in  How to enable it with Solr?  No standard solution yet
  • 18. Geospatial Solr/Lucene  SOLR-733 patch  Local Lucene / Local Solr (Patrick O'Leary)  Spatial Lucene in Lucene 3.1  JTeam's SSP (Chris Male)  SOLR-2155 patch (David Smiley)  Lucene-Spatial-Playground (David Smiley, Chris Male, Ryan McKinley)  Spatial functions is Solr 3.1
  • 19. Geospatial search  What kind of spatial search do we need?  Radius search  Polygon search  Sort by distance is needed for radius search  Implementation chosen: SSP-based, fast geometry  INFO: [active] webapp= path=/select/ params={fl=streetAddress_s,lat,lng&wt=kml&q={!spatial+polygons%3D37.77960208641734,- 122.43867874145508;37.783265262376574,-122.43945121765137;37.78414711095678,-122.4330997467041;37.788827515747435,- 122.43404388427734;37.79038758480464,-122.4221134185791;37.781908551707666,-122.420654296875;37.774988939930005,- 122.42932319641113;37.774988939930005,-122.4360179901123;37.77817746896081,-122.43687629699707;37.777634750327046,- 122.43867874145508;37.77960208641734,-122.43867874145508;}*:*&rows=1000} hits=75 status=0 QTime=6
  • 20. Geospatial Search  Circle search: given (lat, lng, radius) return list of docs with lat/lng indexed  Sorting: distance from the center  Polygon search: given a polygon (set of lat/lng pairs) return list of docs with lat/lng indexed  Bounding boxes: lat,lng range queries  Postponed distance calculation (query response write time)
  • 21. Geospatial Search  Spatial query parser: {!spatial <params>}  Can be wrapped as a filter query  Facet by spatial queries  Planar geometry: fast, but has limitations  Good for our use cases, YMMV  Polygon compression: Google's delta-encoding  TBD: shapes indexing
  • 22. Wrap Up  We are reshaping real estate search - with the right tools!  Hadoop + Solr way is great for frequent batch indexing  Solr Search is fast - and highly scalable through sharding and replication  Geospatial Solr search is still not standardized, lots of progress in development  Open Source: Trulia contributes 22
  • 23. Links  SOLR-1301: Hadoop Indexing Contrib • issues.apache.org/jira/browse/SOLR-1301  Spatial Solr Plugin (SSP) • www.jteam.nl/products/spatialsolrplugin.html  SSP Polygons/Polylines Extension • sourceforge.net/projects/ssplex/files/ 23
  • 24. Contact Information  Trulia • www.trulia.com • www.linkedin.com/company/trulia  Alex Burmester • aburmester@trulia.com  Alexander Kanarsky • akanarsky@trulia.com 24
  • 25. The End  Q&A  Trulia is hiring! – www.trulia.com/jobs – Work in downtown San Francisco with great people – Looking for engineers specializing in search, distributed data processing, visualization/data science, mobile platforms, front end and more.  Thank you for coming!