+




    Engineering Challenges
    in Vertical Search Engines
    Aleksandar Bradic, Senior Director,
    Engineering and R&D, Vast.com
+
    Introduction

        Vertical Search
             Search focused on vertical data
             Vertical Data – data inherently described by its structure:
                Items/Properties for sale (Automotive, Real Estate..)

                  Geographical Data (Neighborhoods, Locations..)
                  Services (Hotels, Transportation..)
                  Businesses (Restaurants, Nightlife..)
                  Events (Concerts, Plays..)
                  Auction items (Collectibles, Art..)
                  Metadata (News, Social Data, Reviews..)
                  …
+
    Introduction

        Vertical Search != Full Text Search
             Full Text Search queries:
                “Cheap tickets for Broadway shows this week”
                “Trendy Restaurants in San Francisco near SoMa”
                “3-day trips from NYC to anywhere under $1000”
             Vertical Search queries:
                “price-sorted results below two standard deviations from tickets
                 category with Broadway as location and date range of 2010-04-11 to
                 2010-04-18”
                “distance-sorted results relative to center of SF/SoMa matching the
                 appropriate threshold of composite score of user review scores and
                 historical change in query/review volume”
                “total cost-sorted results for all 3-day intervals within next 6 months
                 combining hotel and airfare price below max value of $1000 for all
                 valid locations”
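The first vertical query above can be sketched as a filter on structured fields followed by a sort. This is only an illustration: the `Listing` fields, the function name, and the reading of "below two standard deviations" as "more than two standard deviations below the mean price of the matching subset" are assumptions, not something the slides specify.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass
class Listing:
    # Hypothetical record layout for a "tickets" vertical.
    category: str
    location: str
    event_date: date
    price: float


def cheap_listings(listings, category, location, start, end, k=2.0):
    """Return listings priced more than k standard deviations below the mean, cheapest first."""
    # Subset selection on structured fields (category, location, date range).
    subset = [
        item for item in listings
        if item.category == category
        and item.location == location
        and start <= item.event_date <= end
    ]
    if len(subset) < 2:
        return []
    prices = [item.price for item in subset]
    threshold = mean(prices) - k * pstdev(prices)
    # Keep only the low-side outliers, then sort by price.
    return sorted((item for item in subset if item.price < threshold),
                  key=lambda item: item.price)
```

A full text engine would have to infer all of this from the words "cheap", "Broadway" and "this week"; here each constraint is an explicit predicate on a typed field.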
+
    Introduction

        Vertical Search = search on structured data

        Vertical Search at Web-Scale:
             Web-Scale datasets
             Web-Scale query volumes
             Interactive operation
             Low latency requirements
             Utility maximization across all involved parties

        => loads of fun ! : )
+
    @Vast.com

        Vast.com : Vertical Search & Analytics Platform

        Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest
         Airlines, etc..
+
    @Vast.com

        Daily processing of up to 1Tb of unstructured and semi-structured
         Web data

        Managing an operational dataset of ~150M records across multiple
         verticals

        Handling peak search query loads of > 1000 queries/sec



        We’re hiring ! : )
+
    Challenges in Vertical Search
    Engines
        Web Data Retrieval

        Unstructured Data

        Data Processing Infrastructures

        Vertical Search

        Data Analytics

        Computational Advertising
+
    Web Data Retrieval

        Crawler Architecture
             Queue Management
             Crawl Ordering Policies
             Duplicate URL Detection
             Content Hash Management
             Politeness Management
             Coverage Measurement
             Freshness Optimization
             Incremental Crawling
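A few of the crawler concerns listed above (queue management, duplicate URL detection, content hash management, politeness) can be sketched in a single small class. This is a minimal, single-process illustration; the class name `Frontier`, the normalization rules and the fixed per-host delay are assumptions, and a production crawler would need far more (robots.txt handling, persistence, crawl ordering policies).

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Lower-case scheme and host, drop the fragment, keep the query string as-is.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class Frontier:
    """FIFO crawl queue with duplicate-URL and duplicate-content detection."""

    def __init__(self, min_delay=1.0):
        self.queue = deque()
        self.seen_urls = set()
        self.seen_hashes = set()
        self.last_fetch = {}  # host -> time of the last permitted fetch
        self.min_delay = min_delay

    def add(self, url):
        """Enqueue a URL unless an equivalent one was already seen."""
        url = normalize_url(url)
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        self.queue.append(url)
        return True

    def is_new_content(self, body):
        """Content hash check: True only the first time these exact bytes are seen."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True

    def may_fetch(self, url, now):
        """Politeness: allow a fetch only if the host has rested for min_delay seconds."""
        host = urlsplit(url).netloc
        last = self.last_fetch.get(host)
        if last is not None and now - last < self.min_delay:
            return False
        self.last_fetch[host] = now
        return True
```

An exact hash only catches byte-identical pages; near-duplicate detection needs a different technique and is outside this sketch.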
+
    Web Data Retrieval

        “Deep Web” crawling
             Locating Deep Web Content Sources
             Selecting Relevant Sources
             Estimating Database Size
             Understanding Content / Form Detection
             Automatic Dispatch of HTML Forms
             Predicting content in free text forms
             Crawling non-HTML Content
             Estimating Query Result Sparsity
             URL Generation problem
             Query Covering Problem
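One way to look at the query covering problem is as a set cover instance: given the records each candidate form query returns, pick a small set of queries that together return every record. The sketch below uses the standard greedy heuristic for set cover; the slides do not say which formulation or algorithm was used, so the function name and input shape are assumptions.

```python
def greedy_query_cover(query_results):
    """Pick queries greedily until every known record is returned by at least one query.

    query_results maps a query string to the set of record ids it returns.
    """
    uncovered = set().union(*query_results.values())
    chosen = []
    while uncovered:
        # Take the query that returns the most records not yet covered.
        best = max(query_results, key=lambda q: len(query_results[q] & uncovered))
        chosen.append(best)
        uncovered -= query_results[best]
    return chosen
```

In practice the result sets are not known up front and have to be estimated from samples, which is where the database size and result sparsity estimation items come in.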
+
    Web Data Retrieval

        Focused (Topical) Crawling
             Content Classification
             Link Content Prediction
             Topic Relevance Estimation

        Modeling Temporal Characteristics
             Site-Level Evolution
             Page-Level Evolution

        Adversarial Crawling
             Web Spam Detection
             Cloaked Content Detection
+
    Unstructured Data

        Unstructured Data – information that does not have a
         pre-defined data model

        Handling Unstructured Data:
             Data Cleaning
             Tagging with Metadata
             Vertical Classification
             Schema Matching
             Information Extraction


    Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!

    make    model   year    trim          price     ???
    Ford    Focus   2008    Convertible   $7000     Absolute Beauty !!!!
+
    Unstructured Data

        Information extraction from unstructured, ungrammatical
         data
             Reference Sets - relational data sets that consist of a collection
              of known entities with associated common attributes
             Reference Set Selection
             Reference Set Generation
             Record Linkage : Finding the “best matching” member of the
              reference set for a given post
             Challenge : Automatic Generation of Reference Sets
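Record linkage against a reference set can be sketched for the Ford Focus post shown earlier. The reference set contents, the token-overlap score and the regular expressions below are all illustrative assumptions; a real system would use a much larger reference set and a more robust similarity measure.

```python
import re

# Hypothetical reference set: known (make, model, trim) entities.
REFERENCE_SET = [
    ("Ford", "Focus", "Convertible"),
    ("Ford", "Focus", "Sedan"),
    ("Ford", "Fiesta", "Hatchback"),
    ("Honda", "Civic", "Sedan"),
]


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_post(post, reference_set=REFERENCE_SET):
    """Return the reference entry sharing the most tokens with the post, plus regex-extracted fields."""
    tokens = tokenize(post)

    def overlap(entry):
        return len(tokens & tokenize(" ".join(entry)))

    # Note: with no overlap at all this still returns the first entry;
    # a real system would reject matches below some score threshold.
    best = max(reference_set, key=overlap)
    year = re.search(r"\b(19|20)\d{2}\b", post)
    price = re.search(r"\$\s?(\d[\d,]*)", post)
    return {
        "make": best[0],
        "model": best[1],
        "trim": best[2],
        "year": int(year.group(0)) if year else None,
        "price": int(price.group(1).replace(",", "")) if price else None,
    }
```

Attributes that appear in the reference set (make, model, trim) come from the linked entity, while free attributes such as year and price are pulled from the post itself.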
+
    Data Processing Infrastructures

        Infrastructures for continuous processing of unbounded streams
         of unstructured data
        Information Extraction as part of processing (non-trivial
         computation per processed entry)

        Inherently distributed infrastructures - in order to support
         performance and scalability

        Time-to-site constraints. Ability to process out-of-band data.

        Support for complex operations on aggregated data
         (de-duplication, static ranking, data enrichment, data
         cleaning/filtering …)

        Support for data archival and off-line analysis
+
    Data Processing Infrastructures

        Distributed Computing Platforms:

             Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)

             Stream-oriented (Flume, S4, Stream SQL…)

             Distributed Data Stores (Dynamo/Cassandra/Riak…)

        The curse of the CAP Theorem:
             It is impossible for a distributed system to simultaneously provide
              all three of the following guarantees:
                Consistency
                Availability
                Partition tolerance
+
    Vertical Search

        Large-Scale structured data search

        Providing both analytic functionality and the canonical set of
         Information Retrieval functionality

        Entries are represented in Vector Space Model

        Each result is represented as a data point – a tuple consisting of
         the appropriate number of fields :

         (make, model, year, trim …)
+
    Vertical Search

        Search in Vector Space Model
             Resulting subset generation
             Sorting as linearization using selected metric
             Dynamic subset criteria calculation
             Search Result Clustering
             “Similar” result search
             …



… with up to ~100 ms response time
… at 10M+ records in index
… handling 100+ queries/sec/host
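Three of the operations above (subset generation, sorting as linearization, and "similar" result search) can be sketched on records treated as points. The field choice (year, price, mileage), the min-max scaling and the Euclidean distance are assumptions made for illustration; this in-memory version says nothing about how to meet the latency and index-size targets.

```python
import math

# Each record is a point: (year, price, mileage). Field choice is illustrative.
RECORDS = [
    (2008, 7000.0, 90000.0),
    (2010, 9500.0, 60000.0),
    (2009, 7200.0, 85000.0),
    (2015, 15000.0, 30000.0),
]


def select(records, predicate):
    """Resulting subset generation: keep the records that satisfy the predicate."""
    return [r for r in records if predicate(r)]


def linearize(records, metric):
    """Sorting as linearization: order points by a scalar metric."""
    return sorted(records, key=metric)


def similar(records, target, k=1):
    """'Similar' result search: k nearest neighbours after per-field min-max scaling."""
    dims = range(len(target))
    lo = [min(r[d] for r in records) for d in dims]
    hi = [max(r[d] for r in records) for d in dims]

    def scale(r):
        # Scale each field to [0, 1] so no single field dominates the distance.
        return [(r[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0 for d in dims]

    t = scale(target)
    others = [r for r in records if r != target]
    return sorted(others, key=lambda r: math.dist(scale(r), t))[:k]
```

For example, `linearize(select(RECORDS, lambda r: r[1] < 10000), lambda r: r[1])` gives the sub-$10,000 records ordered by price.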
+
    Vertical Search

        Faceted Search
             fac-et (fas’it) :
                1. One of the flat polished surfaces cut on a gemstone or occurring
                 naturally on a crystal.
                2. One of numerous aspects, as of a subject.


             Vocabulary problem for faceted data
             Facet Design / selection
                “the keywords that are assigned by indexers are often at
                  odds with those tried by searchers.”
                Selection of information-distinguishing facet values
             User-specific faceted search
             Dynamic correlated facet generation
             Distributing facet computation
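The core computation behind faceted search is counting, for the current result set, how many records carry each value of each facet, and then narrowing the set when a value is selected. A minimal sketch, assuming records are plain dictionaries and using hypothetical function names:

```python
from collections import Counter


def facet_counts(records, facets):
    """Count how many records in the current result set carry each value of each facet."""
    counts = {facet: Counter() for facet in facets}
    for record in records:
        for facet in facets:
            value = record.get(facet)
            if value is not None:
                counts[facet][value] += 1
    return counts


def refine(records, facet, value):
    """Narrow the result set by selecting one facet value."""
    return [r for r in records if r.get(facet) == value]
```

Because per-facet counts from disjoint partitions of the data can simply be added together, this step lends itself to the distributed computation mentioned in the last bullet.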
+
    Data Analytics

        Clickstream Data Analysis

        Learning from implicit user feedback

        Anonymous user clustering

        Learning to rank

        Inventory/Market Trends

        Rare Event detection

        Price Prediction

        Spam Content detection
+
    Data Analytics

        Challenges:
             “Good Deal” detection
             Recommendation Systems for Vertical Data with no explicit user
              feedback
             Accuracy of Automatic Valuation Models
             Data-driven feature design
             Click Prediction
             User Behavior Modeling
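“Good Deal” detection can be sketched as a comparison between the asking price and the output of a valuation model. The one-feature least-squares model (price against age), the 15% discount threshold and the tuple layout below are assumptions chosen to keep the example short; the slides do not describe the models actually used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (xs must not all be equal)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def good_deals(listings, discount=0.15):
    """Flag listings priced at least `discount` below the fitted price for their age.

    listings is a list of (listing_id, age_years, price) tuples.
    """
    slope, intercept = fit_line([a for _, a, _ in listings], [p for _, _, p in listings])
    deals = []
    for listing_id, age, price in listings:
        predicted = slope * age + intercept
        if predicted > 0 and price <= (1 - discount) * predicted:
            deals.append(listing_id)
    return deals
```

How trustworthy the flag is depends directly on the accuracy of the valuation model, which is why that accuracy appears as a challenge of its own in the list above.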
+
    Computational Advertising

        The central problem of computational advertising is to find
         the "best match" between a given user in a given context and a
         suitable advertisement.




    [Figure: search results page with ads shown alongside the search results]
+
    Computational Advertising

        Vertical Search presents an additional challenge in the sense
         that any of the actual search results can be “sponsored”




    [Figure: search results page in which individual results are marked “ad ?”]
+
    Computational Advertising

        Central challenge:
             Find the “best match” between a given user in a given context
              and a suitable advertisement
             “best match” – maximizing the value for :
                  Users
                  Advertisers
                  Publishers
             Each of the parties has a different set of utilities:
                Users want relevance

                  Advertisers want ROI and volume
                  Publishers want revenue per impression/search
+
    Computational Advertising

        CTR (Click-Through Rate) Estimation:
             Reactive (statistically significant historical CTR)
             Predictive (CTR estimated from features of ads)
             Hybrid (historical + predictive)


             Personalization of CTR Computation ?
             Dynamic CTR Estimation (online algorithms)




                                  P(click) = ?
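One common way to combine the reactive and predictive estimates is additive smoothing, where the feature-based prediction acts as a prior that is gradually overridden by observed data. The sketch below assumes that form; the function name and the `prior_strength` parameter are illustrative, and the slides do not state which hybrid scheme was used.

```python
def hybrid_ctr(clicks, impressions, predicted_ctr, prior_strength=100.0):
    """Blend observed CTR with a feature-based prediction.

    The prediction counts as `prior_strength` pseudo-impressions, so it dominates
    for new ads and fades as real impressions accumulate.
    """
    return (clicks + prior_strength * predicted_ctr) / (impressions + prior_strength)
```

With no history the estimate equals the prediction, and with many impressions it approaches the historical `clicks / impressions`, which covers the case where the reactive estimate is not yet statistically significant.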
+
    Computational Advertising

        Analytical Apparatus:
             Regression Analysis (Linear, Logistic, probit model, High
              Dimensional methods)
             Game Theory (Nash Equilibria, dominant strategy)
             Auction Theory (Vickrey, GSP, VCG…)
             Graph Theory (random walks on graphs, graph matching, etc.)
             Information Retrieval Techniques (similarity metrics, etc.)
             …
+
    Conclusion

        Vertical Search & Analytics at Web Scale == fun !!!

        A source of a large number of relevant research & engineering
         problems !

        An opportunity to apply a wide spectrum of techniques from all
         areas of Computer Science and Engineering !




                                       Jump on the bandwagon ! : )

Más contenido relacionado

Similar a Engineering challenges in vertical search engines

SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITYSEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITYAmit Sheth
 
Data Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! ResearchData Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! ResearchYury Lifshits
 
Building Predictive Analytics on Big Data Platforms
Building Predictive Analytics on Big Data PlatformsBuilding Predictive Analytics on Big Data Platforms
Building Predictive Analytics on Big Data PlatformsOlha Hrytsay
 
Semantic Web Technologies
Semantic Web TechnologiesSemantic Web Technologies
Semantic Web TechnologiesKANIMOZHIUMA
 
Óscar Méndez - Big data: de la investigación científica a la gestión empresarial
Óscar Méndez - Big data: de la investigación científica a la gestión empresarialÓscar Méndez - Big data: de la investigación científica a la gestión empresarial
Óscar Méndez - Big data: de la investigación científica a la gestión empresarialFundación Ramón Areces
 
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachCoping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachAndre Freitas
 
Applications of Semantic Technology in the Real World Today
Applications of Semantic Technology in the Real World TodayApplications of Semantic Technology in the Real World Today
Applications of Semantic Technology in the Real World TodayAmit Sheth
 
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...Amazon Web Services
 
Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studydeep.bi
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)Jeremy Cabral
 
Building a Real-Time Geospatial-Aware Recommendation Engine
 Building a Real-Time Geospatial-Aware Recommendation Engine Building a Real-Time Geospatial-Aware Recommendation Engine
Building a Real-Time Geospatial-Aware Recommendation EngineAmazon Web Services
 
Liquid Query: Multi-domain Exploratory Search on the Web
Liquid Query: Multi-domain Exploratory Search on the WebLiquid Query: Multi-domain Exploratory Search on the Web
Liquid Query: Multi-domain Exploratory Search on the WebAlessandro Bozzon
 
Introduction to Artificial Intelligence on AWS
Introduction to Artificial Intelligence on AWSIntroduction to Artificial Intelligence on AWS
Introduction to Artificial Intelligence on AWSAmazon Web Services
 
webmining overview
webmining overviewwebmining overview
webmining overviewabon
 
Data Science, Personalisation & Product management
Data Science, Personalisation & Product managementData Science, Personalisation & Product management
Data Science, Personalisation & Product managementBhaskar Krishnan
 
Data-Mining-ppt (1).pdf
Data-Mining-ppt (1).pdfData-Mining-ppt (1).pdf
Data-Mining-ppt (1).pdfParvathyparu25
 
Big Data Explained - Case study: Website Analytics
Big Data Explained - Case study: Website AnalyticsBig Data Explained - Case study: Website Analytics
Big Data Explained - Case study: Website Analyticsdeep.bi
 
Semantic Interoperability & Information Brokering in Global Information Systems
Semantic Interoperability & Information Brokering in Global Information SystemsSemantic Interoperability & Information Brokering in Global Information Systems
Semantic Interoperability & Information Brokering in Global Information SystemsAmit Sheth
 
SLA Nov2009 Public
SLA Nov2009 PublicSLA Nov2009 Public
SLA Nov2009 Publicaspoerri
 
Ranking in Google Since The Advent of The Knowledge Graph
Ranking in Google Since The Advent of The Knowledge GraphRanking in Google Since The Advent of The Knowledge Graph
Ranking in Google Since The Advent of The Knowledge GraphBill Slawski
 

Similar a Engineering challenges in vertical search engines (20)

SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITYSEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
 
Data Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! ResearchData Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! Research
 
Building Predictive Analytics on Big Data Platforms
Building Predictive Analytics on Big Data PlatformsBuilding Predictive Analytics on Big Data Platforms
Building Predictive Analytics on Big Data Platforms
 
Semantic Web Technologies
Semantic Web TechnologiesSemantic Web Technologies
Semantic Web Technologies
 
Óscar Méndez - Big data: de la investigación científica a la gestión empresarial
Óscar Méndez - Big data: de la investigación científica a la gestión empresarialÓscar Méndez - Big data: de la investigación científica a la gestión empresarial
Óscar Méndez - Big data: de la investigación científica a la gestión empresarial
 
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachCoping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
 
Applications of Semantic Technology in the Real World Today
Applications of Semantic Technology in the Real World TodayApplications of Semantic Technology in the Real World Today
Applications of Semantic Technology in the Real World Today
 
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
 
Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case study
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)
 
Building a Real-Time Geospatial-Aware Recommendation Engine
 Building a Real-Time Geospatial-Aware Recommendation Engine Building a Real-Time Geospatial-Aware Recommendation Engine
Building a Real-Time Geospatial-Aware Recommendation Engine
 
Liquid Query: Multi-domain Exploratory Search on the Web
Liquid Query: Multi-domain Exploratory Search on the WebLiquid Query: Multi-domain Exploratory Search on the Web
Liquid Query: Multi-domain Exploratory Search on the Web
 
Introduction to Artificial Intelligence on AWS
Introduction to Artificial Intelligence on AWSIntroduction to Artificial Intelligence on AWS
Introduction to Artificial Intelligence on AWS
 
webmining overview
webmining overviewwebmining overview
webmining overview
 
Data Science, Personalisation & Product management
Data Science, Personalisation & Product managementData Science, Personalisation & Product management
Data Science, Personalisation & Product management
 
Data-Mining-ppt (1).pdf
Data-Mining-ppt (1).pdfData-Mining-ppt (1).pdf
Data-Mining-ppt (1).pdf
 
Big Data Explained - Case study: Website Analytics
Big Data Explained - Case study: Website AnalyticsBig Data Explained - Case study: Website Analytics
Big Data Explained - Case study: Website Analytics
 
Semantic Interoperability & Information Brokering in Global Information Systems
Semantic Interoperability & Information Brokering in Global Information SystemsSemantic Interoperability & Information Brokering in Global Information Systems
Semantic Interoperability & Information Brokering in Global Information Systems
 
SLA Nov2009 Public
SLA Nov2009 PublicSLA Nov2009 Public
SLA Nov2009 Public
 
Ranking in Google Since The Advent of The Knowledge Graph
Ranking in Google Since The Advent of The Knowledge GraphRanking in Google Since The Advent of The Knowledge Graph
Ranking in Google Since The Advent of The Knowledge Graph
 

Más de ITDogadjaji.com

Supporting clusters in Serbia
Supporting clusters in SerbiaSupporting clusters in Serbia
Supporting clusters in SerbiaITDogadjaji.com
 
Outsourcing Center Serbia
Outsourcing Center SerbiaOutsourcing Center Serbia
Outsourcing Center SerbiaITDogadjaji.com
 
Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...
Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...
Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...ITDogadjaji.com
 
How to Web 2011 Event Presentation
How to Web 2011 Event PresentationHow to Web 2011 Event Presentation
How to Web 2011 Event PresentationITDogadjaji.com
 
Panel intro: The European Startup: Opportunities
Panel intro: The European Startup: Opportunities Panel intro: The European Startup: Opportunities
Panel intro: The European Startup: Opportunities ITDogadjaji.com
 
ShoutEm - It's alright to pivot
ShoutEm - It's alright to pivotShoutEm - It's alright to pivot
ShoutEm - It's alright to pivotITDogadjaji.com
 
How to deal with the media without screwing up
How to deal with the media without screwing upHow to deal with the media without screwing up
How to deal with the media without screwing upITDogadjaji.com
 
VC 101: getting to first base
VC 101: getting to first baseVC 101: getting to first base
VC 101: getting to first baseITDogadjaji.com
 
From Ljubljana into the world
From Ljubljana into the worldFrom Ljubljana into the world
From Ljubljana into the worldITDogadjaji.com
 
How to Web 2010 - Event presentation
How to Web 2010 - Event presentationHow to Web 2010 - Event presentation
How to Web 2010 - Event presentationITDogadjaji.com
 

Más de ITDogadjaji.com (20)

Game Design 101
Game Design 101Game Design 101
Game Design 101
 
Uvod u Gejmifikaciju
Uvod u GejmifikacijuUvod u Gejmifikaciju
Uvod u Gejmifikaciju
 
Supporting clusters in Serbia
Supporting clusters in SerbiaSupporting clusters in Serbia
Supporting clusters in Serbia
 
Outsourcing Center Serbia
Outsourcing Center SerbiaOutsourcing Center Serbia
Outsourcing Center Serbia
 
ICT Clusters
ICT ClustersICT Clusters
ICT Clusters
 
Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...
Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...
Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...
 
How to Web 2011 Event Presentation
How to Web 2011 Event PresentationHow to Web 2011 Event Presentation
How to Web 2011 Event Presentation
 
Panel intro: The European Startup: Opportunities
Panel intro: The European Startup: Opportunities Panel intro: The European Startup: Opportunities
Panel intro: The European Startup: Opportunities
 
Mobipatrol
MobipatrolMobipatrol
Mobipatrol
 
Mediatoolkit
MediatoolkitMediatoolkit
Mediatoolkit
 
Taksiko
TaksikoTaksiko
Taksiko
 
SiteCake
SiteCakeSiteCake
SiteCake
 
ShoutEm - It's alright to pivot
ShoutEm - It's alright to pivotShoutEm - It's alright to pivot
ShoutEm - It's alright to pivot
 
How to (Win on the) Web
How to (Win on the) WebHow to (Win on the) Web
How to (Win on the) Web
 
How to deal with the media without screwing up
How to deal with the media without screwing upHow to deal with the media without screwing up
How to deal with the media without screwing up
 
VC 101: getting to first base
VC 101: getting to first baseVC 101: getting to first base
VC 101: getting to first base
 
birthdaysRock.com
birthdaysRock.combirthdaysRock.com
birthdaysRock.com
 
From Ljubljana into the world
From Ljubljana into the worldFrom Ljubljana into the world
From Ljubljana into the world
 
How to Web 2010 - Event presentation
How to Web 2010 - Event presentationHow to Web 2010 - Event presentation
How to Web 2010 - Event presentation
 
Ekspertlink
EkspertlinkEkspertlink
Ekspertlink
 

Último

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 

Último (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

Engineering challenges in vertical search engines

  • 1. + Engineering Challenges in Vertical Search Engines Aleksandar Bradic, Senior Director, Engineering and R&D, Vast.com
  • 2. + Introduction   Vertical Search   Search focused on vertical data   Vertical Data – data inherently described by its structure:   Items/Properties for sale (Automotive, Real Estate..)   Geographical Data (Neighborhoods, Locations..)   Services (Hotels, Transportation..)   Businesses (Restaurants, Nightlife..)   Events (Concerts, Plays..)   Auction items (Collectibles, Art..)   Metadata (News, Social Data, Reviews..)   …
  • 3. + Introduction   Vertical Search != Full Text Search   Full Text Search queries:   “Cheap tickets for Broadway shows this week”   “Trendy Restaurants in San Francisco near SoMa”   “3-day trips from NYC to anywhere under $1000”   Vertical Search queries:   “price-sorted results below two standard deviations from the tickets category with Broadway as location and date range of 2010-04-11 to 2010-04-18”   “distance-sorted results relative to center of SF/SoMa matching the appropriate threshold of composite score of user review scores and historical change in query/review volume”   “total cost-sorted results for all 3-day intervals within next 6 months combining hotel and airfare price below max value of $1000 for all valid locations”
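To make the contrast concrete, the first vertical query on this slide can be sketched as a filter over structured fields followed by a sort. This is only an illustration: the `Listing` type, its field names, and the reading of “below two standard deviations” as “priced below mean minus `k` standard deviations of the matching set” are assumptions, not something the deck specifies.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass(frozen=True)
class Listing:  # hypothetical record type, for illustration only
    category: str
    location: str
    day: date
    price: float


def cheap_listings(listings, category, location, start, end, k=2.0):
    """Price-sorted listings priced below (mean - k * stddev) of the matching set."""
    # Structured filter: exact match on category/location, range match on date.
    matching = [
        x for x in listings
        if x.category == category and x.location == location and start <= x.day <= end
    ]
    if len(matching) < 2:
        return []
    prices = [x.price for x in matching]
    # Assumed interpretation of the threshold (see the note above).
    threshold = mean(prices) - k * pstdev(prices)
    return sorted((x for x in matching if x.price < threshold), key=lambda x: x.price)
```

None of the steps involves text matching; every condition is a comparison on a typed field, which is the sense in which the slide separates vertical search from full text search.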
  • 4. + Introduction   Vertical Search = search on structured data   Vertical Search at Web-Scale:   Web-Scale datasets   Web-Scale query volumes   Interactive operation   Low latency requirements   Utility maximization across all involved parties   => loads of fun ! : )
  • 5. + @Vast.com   Vast.com : Vertical Search & Analytics Platform   Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest Airlines, etc..
  • 6. + @Vast.com   Daily processing of up to 1Tb of unstructured and semi-structured Web data   Managing an operational dataset of ~150M records across multiple verticals   Handling peak search loads of > 1000 queries/sec   We’re hiring ! : )
  • 7. + Challenges in Vertical Search Engines   Web Data Retrieval   Unstructured Data   Data Processing Infrastructures   Vertical Search   Data Analytics   Computational Advertising
  • 8. + Web Data Retrieval   Crawler Architecture   Queue Management   Crawl Ordering Policies   Duplicate URL Detection   Content Hash Management   Politeness Management   Coverage Measurement   Freshness Optimization   Incremental Crawling
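Two of the items on this slide, duplicate URL detection and content hash management, can be sketched with the Python standard library. This is a minimal single-process sketch under assumptions: the `SeenFilter` class and the particular normalization rules are hypothetical, and a crawler at the scale described in the deck would need a distributed or probabilistic store rather than in-memory sets.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Map trivially different spellings of a URL to one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    # Drop the port when it is absent or is the scheme default.
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment is dropped: it does not change the fetched document.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class SeenFilter:  # hypothetical helper, for illustration only
    def __init__(self):
        self._urls = set()
        self._digests = set()

    def is_new_url(self, url):
        key = normalize_url(url)
        if key in self._urls:
            return False
        self._urls.add(key)
        return True

    def is_new_content(self, body):
        # Exact-duplicate detection only; near-duplicates need other techniques.
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True
```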
  • 9. + Web Data Retrieval   “Deep Web” crawling   Locating Deep Web Content Sources   Selecting Relevant Sources   Estimating Database Size   Understanding Content / Form Detection   Automatic Dispatch of HTML Forms   Predicting content in free text forms   Crawling non-HTML Content   Estimating Query Result Sparsity   URL Generation problem   Query Covering Problem
  • 10. + Web Data Retrieval   Focused (Topical) Crawling   Content Classification   Link Content Prediction   Topic Relevance Estimation   Modeling Temporal Characteristics   Site-Level Evolution   Page-Level Evolution   Adversarial Crawling   Web Spam Detection   Cloaked Content Detection
  • 11. + Unstructured Data   Unstructured Data – information that does not have a predefined data model   Handling Unstructured Data:   Data Cleaning   Tagging with Metadata   Vertical Classification   Schema Matching   Information Extraction   Example post: “Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!” → make, model, year, trim, price ???
  • 12. + Unstructured Data   Information extraction from unstructured, ungrammatical data   Reference Sets - relational data sets that consist of a collection of known entities with associated common attributes   Reference Set Selection   Reference Set Generation   Record Linkage: finding the “best matching” member of the reference set for the corresponding post   Challenge: Automatic Generation of Reference Sets
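The record linkage step can be illustrated with a toy matcher that scores each reference entity by how many of its attribute values appear in the post. Everything here is an assumption made for the sketch: the three-entry `REFERENCE_SET`, the token-overlap score, and the regular expressions for year and price. The deck does not say which matching method is actually used.

```python
import re

# Hypothetical reference set: known entities with common attributes.
REFERENCE_SET = [
    {"make": "Ford", "model": "Focus", "trim": "Convertible"},
    {"make": "Ford", "model": "Fiesta", "trim": "Hatchback"},
    {"make": "Honda", "model": "Civic", "trim": "Sedan"},
]


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_record(post, reference_set):
    """Return the reference entity with the most attribute values found in the post."""
    post_tokens = tokens(post)

    def score(entity):
        return sum(1 for value in entity.values() if tokens(value) <= post_tokens)

    best = max(reference_set, key=score)
    return dict(best) if score(best) > 0 else None


def extract(post, reference_set):
    """Combine the linked entity with year and price pulled out by simple patterns."""
    record = link_record(post, reference_set) or {}
    year = re.search(r"(?<!\$)\b(?:19|20)\d{2}\b", post)  # four digits not preceded by "$"
    price = re.search(r"\$\s?(\d[\d,]*)", post)
    if year:
        record["year"] = int(year.group())
    if price:
        record["price"] = int(price.group(1).replace(",", ""))
    return record
```

Calling `extract` on the post from the previous slide fills the make, model, trim, year, and price fields. The sketch only works because the reference set was written by hand, which is why the slide names automatic reference set generation as the open challenge.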
  • 13. + Data Processing Infrastructures   Infrastructures for continuous processing of unbounded streams of unstructured data   Information Extraction as part of processing (non-trivial computation per processed entry)   Inherently distributed infrastructures - in order to support performance and scalability   Time-to-site constraints. Ability to process out-of-band data.   Support for complex operations on aggregated data (de-duplication, static ranking, data enrichment, data cleaning/filtering …)   Support for data archival and off-line analysis
  • 14. + Data Processing Infrastructures
  • 15. + Data Processing Infrastructures   Distributed Computing Platforms:   Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)   Stream-oriented (Flume, S4, Stream SQL…)   Distributed Data Stores (Dynamo/Cassandra/Riak…)   The curse of CAP Theorem:   It is impossible for a distributed system to simultaneously provide all three of the following guarantees:   Consistency   Availability   Partition tolerance
  • 16. + Vertical Search   Large-Scale structured data search   Providing both analytic and canonical sets of Information Retrieval functionalities   Entries are represented in Vector Space Model   Each result is represented as a data point – a tuple consisting of the appropriate number of fields: (make, model, year, trim …)
  • 17. + Vertical Search   Search in Vector Space Model   Resulting subset generation   Sorting as linearization using selected metric   Dynamic subset criteria calculation   Search Result Clustering   “Similar” result search   …   … with up to ~100 ms response time   … at 10M+ records in index   … handling 100+ queries/sec/host
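Three of the operations on this slide (subset generation, sorting as linearization, and “similar” result search) can be sketched over a handful of in-memory records. The records, field names, and the choice of min-max normalization with Euclidean distance are assumptions for illustration. A linear scan like this would not meet the latency and volume figures quoted on the slide; it only shows what each operation computes.

```python
import math

# Hypothetical records: each one is a point with numeric fields.
RECORDS = [
    {"id": 1, "year": 2008, "price": 7000.0, "mileage": 60000.0},
    {"id": 2, "year": 2009, "price": 7500.0, "mileage": 55000.0},
    {"id": 3, "year": 2001, "price": 2500.0, "mileage": 150000.0},
    {"id": 4, "year": 2015, "price": 15000.0, "mileage": 20000.0},
]
FIELDS = ("year", "price", "mileage")


def subset(records, ranges):
    """Subset generation: keep records whose fields fall inside every given range."""
    return [r for r in records if all(lo <= r[f] <= hi for f, (lo, hi) in ranges.items())]


def linearize(records, metric):
    """Sorting as linearization: order points along a single selected metric."""
    return sorted(records, key=metric)


def similar(records, target, k=1):
    """'Similar' result search: nearest neighbours after min-max normalization."""
    spans = {f: (min(r[f] for r in records), max(r[f] for r in records)) for f in FIELDS}

    def vec(r):
        return [(r[f] - lo) / (hi - lo) if hi > lo else 0.0 for f, (lo, hi) in spans.items()]

    target_vec = vec(target)
    others = [r for r in records if r["id"] != target["id"]]
    return sorted(others, key=lambda r: math.dist(vec(r), target_vec))[:k]
```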
  • 18. + Vertical Search   Faceted Search   fac-et (fas’it) :   1. One of the flat polished surfaces cut on a gemstone or occurring naturally on a crystal.   2. One of numerous aspects, as of a subject.   Vocabulary problem for faceted data   Facet Design / selection   “the keywords that are assigned by indexers are often at odds with those tried by searchers.”   Selection of information-distinguishing facet values   User-specific faceted search   Dynamic correlated facet generation   Distributing facet computation
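A small sketch of facet computation over a result set follows. Counting values per facet is the standard part. Using Shannon entropy to decide which facet is most “information-distinguishing” is an assumption made for this example, one plausible heuristic among several, and the function names are hypothetical.

```python
import math
from collections import Counter


def facet_counts(results, facet):
    """Number of results per value of one facet."""
    return Counter(r[facet] for r in results if facet in r)


def facet_entropy(results, facet):
    """Shannon entropy (bits) of the facet's value distribution over the results."""
    counts = facet_counts(results, facet)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def most_distinguishing_facet(results, facets):
    """Assumed heuristic: the facet whose values split the results most evenly."""
    return max(facets, key=lambda f: facet_entropy(results, f))
```

A facet on which every result has the same value has zero entropy and offers the user nothing to refine by, so this heuristic would rank it last.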
  • 19. + Data Analytics   Clickstream Data Analysis   Learning from implicit user feedback   Anonymous user clustering   Learning to rank   Inventory/Market Trends   Rare Event detection   Price Prediction   Spam Content detection
  • 20. + Data Analytics   Challenges:   “Good Deal” detection   Recommendation Systems for Vertical Data with no explicit user feedback   Accuracy of Automatic Valuation Models   Data-driven feature design   Click Prediction   User Behavior Modeling
  • 21. + Computational Advertising   The central problem of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement.   [Figure labels: ads, ads, search results]
  • 22. + Computational Advertising   Vertical Search presents an additional challenge in the sense that any of the actual search results can be “sponsored”   [Figure labels: ad ?, ad ?]
  • 23. + Computational Advertising   Central challenge:   Find the “best match” between a given user in a given context and a suitable advertisement   “best match” – maximizing the value for :   Users   Advertisers   Publishers   Each of the parties has different set of utilities:   Users want relevance   Advertisers want ROI and volume   Publishers want revenue per impression/search
  • 24. + Computational Advertising   CTR (Click-Through Rate) Estimation:   Reactive (statistically significant historical CTR)   Predictive (CTR estimated from features of ads)   Hybrid (historical + predictive)   Personalization of CTR Computation ?   Dynamic CTR Estimation (online algorithms)   P(click) = ?
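One common way to combine the reactive and predictive estimates is additive smoothing, where the model's prediction acts as a prior worth a fixed number of pseudo-impressions. The sketch below assumes that form. The function name and the `prior_strength` parameter are hypothetical, and the deck does not state which hybrid scheme is meant.

```python
def hybrid_ctr(clicks, impressions, predicted_ctr, prior_strength=100.0):
    """Blend historical CTR with a feature-based prediction.

    With no history the result equals predicted_ctr (purely predictive);
    as impressions grow it approaches clicks / impressions (purely reactive).
    """
    if clicks < 0 or impressions < 0 or clicks > impressions:
        raise ValueError("need 0 <= clicks <= impressions")
    if not 0.0 <= predicted_ctr <= 1.0:
        raise ValueError("predicted_ctr must be a probability")
    # prior_strength is the assumed number of pseudo-impressions given to the prediction.
    return (clicks + prior_strength * predicted_ctr) / (impressions + prior_strength)
```

A new ad with no impressions is scored by the prediction alone, while an ad with many impressions is scored almost entirely by its observed clicks, which matches the intent of the “historical + predictive” item.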
  • 25. + Computational Advertising   Analytical Apparatus:   Regression Analysis (Linear, Logistic, Probit model, High Dimensional methods)   Game Theory (Nash Equilibria, dominant strategy)   Auction Theory (Vickrey, GSP, VCG…)   Graph Theory (random walks on graphs, graph matching, etc.)   Information Retrieval Techniques (similarity metrics, etc.)   …
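As an illustration of the auction theory item, here is a minimal generalized second-price (GSP) allocation in its simplest textbook form: slots go to advertisers in descending bid order, and each winner pays the next-highest bid per click. The function name, the `reserve` parameter, and the alphabetical tie-break are assumptions for the sketch. Production auctions typically also weight bids by a quality or CTR estimate, which is omitted here.

```python
def gsp_auction(bids, slots, reserve=0.0):
    """Allocate ad slots by bid rank; each winner pays the next-highest bid.

    bids maps advertiser name to bid per click. Returns a list of
    (advertiser, price_per_click) pairs ordered from the top slot down.
    """
    ranked = sorted(
        ((bid, name) for name, bid in bids.items() if bid >= reserve),
        key=lambda pair: (-pair[0], pair[1]),  # highest bid first, name breaks ties
    )
    outcome = []
    for i, (_, name) in enumerate(ranked[:slots]):
        # The last ranked bidder has nobody below, so it pays the reserve.
        price = ranked[i + 1][0] if i + 1 < len(ranked) else reserve
        outcome.append((name, price))
    return outcome
```

With a single slot this reduces to a Vickrey (second-price) auction, which is the relationship between the first two mechanisms listed on the slide.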
  • 26. + Conclusion   Vertical Search & Analytics at Web Scale == fun !!!   Source of a large number of relevant research & engineering problems !   Opportunity to apply a wide spectrum of techniques across all areas of Computer Science and Engineering ! Jump on the bandwagon ! : )