Search at Tumblr
Yufei Pan
Director of Search, Tumblr
16 January 2013
Tumblr - Follow the World’s Creators
Founded
● David Karp
● February 2007

Publishing Platform
● 163 million blogs
● 72 billion posts

Social Network
● Follow, Mention
● Like, Reblog
About search@tumblr
● Most important way to discover great content
○ 50M searches a day

● Limited search for a long time (2007-2012)
○ Tagged page
■ mysql lookup of a single tag id
■ sorted by reverse chronological order
○ Finding blog
■ navigate through curated directories
About search@tumblr
● Search Team
○ July 2012: Jak joined as first search engineer!
○ Jak, Yufei, Bennett, Beitao, Patrick, Adam

● Features launched in 2013
○ Post search, Blog search, Theme search
○ Typeaheads, Recommendation, Trends
Whole New Search
Post search
● full text search
● top and recent
● post type filtering

Blog search
● name & title
● top tags in posts
● blog highlights

Related search
● term co-occurrence
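
The slide names term co-occurrence as the basis for related search without giving details. A minimal sketch of one way to do it, assuming co-occurrence is counted over tags that appear on the same post (the function names and sample posts are hypothetical, not taken from the talk):

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(posts_tags):
    """Count how often each pair of tags appears on the same post."""
    counts = defaultdict(Counter)
    for tags in posts_tags:
        # Normalize and de-duplicate so one post counts each pair once.
        unique = sorted({t.strip().lower() for t in tags if t.strip()})
        for a, b in combinations(unique, 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def related_terms(counts, term, k=3):
    """Return up to k terms that co-occur most often with `term`."""
    pairs = counts.get(term, {}).items()
    ranked = sorted(pairs, key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


posts = [
    ["cats", "gif", "funny"],
    ["cats", "gif"],
    ["cats", "photography"],
    ["dogs", "gif"],
]
counts = build_cooccurrence(posts)
print(related_terms(counts, "cats"))  # ['gif', 'funny', 'photography']
```

Raw counts favor very common tags; a production system would likely normalize by tag frequency, which this sketch leaves out.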
Typeahead Autocompletes
Search Autocomplete, Mention Autocomplete, Tag Suggest

● Interactive guide of tumblr content
● High volume of traffic
● Low latency
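
High traffic and low latency push typeahead toward a plain lookup at request time. A minimal sketch of that idea, assuming completions are precomputed per prefix from popularity scores (the scores, the `k` cutoff, and the prefix length cap are hypothetical):

```python
from collections import defaultdict


def build_prefix_table(term_scores, k=3, max_prefix_len=10):
    """Precompute the top-k completions for every prefix of every term."""
    buckets = defaultdict(list)
    for term, score in term_scores.items():
        for n in range(1, min(len(term), max_prefix_len) + 1):
            buckets[term[:n]].append((score, term))
    table = {}
    for prefix, entries in buckets.items():
        # Highest score first; ties broken alphabetically.
        entries.sort(key=lambda e: (-e[0], e[1]))
        table[prefix] = [term for _, term in entries[:k]]
    return table


def suggest(table, typed):
    """Serve suggestions with a single dictionary lookup."""
    return table.get(typed.strip().lower(), [])


table = build_prefix_table(
    {"fashion": 90, "fandom": 75, "food": 60, "fall": 20, "art": 80}
)
print(suggest(table, "fa"))  # ['fashion', 'fandom', 'fall']
```

The table trades memory for latency: all sorting happens offline, so the online path does no ranking work.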
Recommendations
Personalized Recommendation

Weekly Dashboard Digest
Trends
Trending Tags

Trending Blogs
Theme Search
Search Architecture
● Layers: Online, Data, Offline
● Search services: Post Search, Blog Search, Typeahead, Related Tags, Blog Recommend, Blog Highlights, Blog Top Tags, Trending Tags, Trending Blogs, Trending Posts
● Search Online Framework
● Indices, signals, and models: Recent Post Index, Blog Full Index, Theme Index, Blog Top-K Index, Follower Counts, Post Notecount, Post Model, Personalized, Blog Index, Trending Blogs, Trending Posts, Trending Tags, Related Tag Index, Blog Global Rank, Blog Model, User Model, Typeahead Indices, Top Post Index, Blog Top Posts, Blog Top Tags, Two Degree, Like Root, Blog Feedback, In-Blog Tag Index, Global Tag Index
● Search Offline Framework
● Storage and sources: Rediscover, Solr, MySQL, Activity Streams (Fire Geyser), Scribe logs, Sqoop tables (HDFS)
● Nginx, Linux
Software Stack
● Search Online
○ HAProxy, Nginx, PHP
○ Memcache
○ Icinga, Scribe, OpenTSDB

● Search Data
○ Solr, Redis, MySQL

● Search Offline
○ Sqoop, Hadoop
○ Java, Hive, Pig, Scalding, Python
Search Online Framework
● Search Services
● SearchBase
● Search Flow Execution, Multi-level Caching, Search Logging, Async Execution, Search Editorial

● QueryIF: SimpleQuery, PersonalizedQuery, AdvancedPostQuery, TimeSliceQuery, TrendTagQuery
● RetrieverIF: SolrPostRetriever, MysqlPostRetriever, SMPostRetriever, TumblelogRetriever, TagTypeaheadReteriever
● SignalFetcherIF: NotecountFetcher, FollowercountFetcher, TumblelogGlobalRankFetcher, RecommendationSignalFetcher, BlogTopTagFetcher
● RankerIF: TopPostRanker, TumblelogRanker, RelatedPostRanker, TumblelogMixingRanker
● DocFetcherIF: PostFetcher, TumblelogFetcher, TagFetcher
● FilterIF: PostFilter, TumblelogFilter, TagFilter
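
The interface names suggest a search flow assembled from pluggable stages. A minimal sketch of that composition, assuming the stages run in the order retrieve, fetch signals, rank, fetch documents, filter (the order, the `SearchFlow` class, and the in-memory `POSTS` data with its `nsfw` field are all hypothetical; the production framework ran on PHP):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SearchFlow:
    """One search service built from interchangeable stages."""
    retrieve: Callable[[str], List[int]]                     # cf. RetrieverIF
    fetch_signals: Callable[[List[int]], Dict[int, float]]   # cf. SignalFetcherIF
    rank: Callable[[List[int], Dict[int, float]], List[int]] # cf. RankerIF
    fetch_docs: Callable[[List[int]], List[dict]]            # cf. DocFetcherIF
    keep: Callable[[dict], bool]                             # cf. FilterIF

    def run(self, query: str, limit: int = 10) -> List[dict]:
        ids = self.retrieve(query)
        signals = self.fetch_signals(ids)
        ranked = self.rank(ids, signals)
        docs = self.fetch_docs(ranked)
        return [d for d in docs if self.keep(d)][:limit]


# In-memory stand-ins for the index, signal store, and document store.
POSTS = {
    1: {"id": 1, "text": "cat gif", "notes": 5, "nsfw": False},
    2: {"id": 2, "text": "cat photo", "notes": 40, "nsfw": False},
    3: {"id": 3, "text": "cat meme", "notes": 90, "nsfw": True},
}

flow = SearchFlow(
    retrieve=lambda q: [pid for pid, p in POSTS.items() if q in p["text"].split()],
    fetch_signals=lambda ids: {pid: float(POSTS[pid]["notes"]) for pid in ids},
    rank=lambda ids, sig: sorted(ids, key=lambda pid: -sig[pid]),
    fetch_docs=lambda ids: [POSTS[pid] for pid in ids],
    keep=lambda doc: not doc["nsfw"],
)
print([d["id"] for d in flow.run("cat")])  # [2, 1]
```

Swapping one stage, for example a different ranker, yields a different service without touching the other stages, which is presumably why the slide lists several implementations per interface.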
Search Batch Processing
● Search Data (Redis)
● Search Workflow Engine: Workflow Composition, Dependency Resolution, Automatic Versioning, Data Verification, Execution Logging, Failure Detection/Alert
● Search Task Base: Hive Jobs, Term Generators, Streaming Jobs, Pig Jobs, Scalding Jobs, Top-K Indexer, Delta Propagator, Lucille2 Classes
● Scribe Logs, Sqoop Tables (HDFS)
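
Dependency resolution in a workflow engine usually comes down to a topological sort of the job graph. A minimal sketch using Python's standard `graphlib` (Python 3.9+); the job names and their dependencies are hypothetical, loosely borrowed from the labels on the slide:

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return job names so that every job runs after its dependencies.

    `dependencies` maps each job to the set of jobs it depends on.
    Raises CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


jobs = {
    "sqoop_import": set(),
    "term_generator": {"sqoop_import"},
    "topk_indexer": {"term_generator"},
    "delta_propagator": {"topk_indexer"},
}
order = execution_order(jobs)
print(order)
# ['sqoop_import', 'term_generator', 'topk_indexer', 'delta_propagator']

try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle detected")
```

Detecting a cycle before any job starts fits the failure detection role the slide assigns to the engine.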
Indexing
● 3-Tier indices
○ Index all posts
■ 600+ machines
○ Recent (6W) + Popular (4Y) + Existing tag table
■ Down to 40 machines
■ Minor loss in coverage
■ Serve up to 4K qps (non-cached)

● Lean index
○ Separate signals from index
■ Eliminate high volume re-indexing
■ Independent signal engineering from indexing
○ Separate document text from index
■ Dropping the memory footprint
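
The lean index idea can be sketched with three separate stores. This is a toy illustration under stated assumptions, not the Solr setup from the talk: the index maps terms to post ids only, while signals and document text live elsewhere (`LeanIndex`, `publish`, and `search` are hypothetical names):

```python
from collections import defaultdict


class LeanIndex:
    """Inverted index holding only term -> post ids (no text, no signals)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, post_id, text):
        for term in text.lower().split():
            self.postings[term].add(post_id)

    def lookup(self, term):
        return set(self.postings.get(term.lower(), ()))


index = LeanIndex()
notecounts = {}  # signal store, updated independently of the index
documents = {}   # document store, read only for the final result page


def publish(post_id, text):
    documents[post_id] = text
    notecounts[post_id] = 0
    index.add(post_id, text)


def search(term):
    hits = index.lookup(term)
    ranked = sorted(hits, key=lambda pid: (-notecounts.get(pid, 0), pid))
    return [documents[pid] for pid in ranked]


publish(1, "sunset over brooklyn")
publish(2, "brooklyn bridge at night")
notecounts[2] += 25  # a signal update; the index is not touched
print(search("brooklyn"))
# ['brooklyn bridge at night', 'sunset over brooklyn']
```

Because the note count changes far more often than the post text, keeping it out of the index avoids re-indexing a post on every like or reblog.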
Ranking
● Quickly evolving!
● Major ranking signals in production
○ Global popularity
■ likes, reblogs, follows
○ Local popularity
■ popularity projected on <user, query>
● blog search: aggregated likes on query term
● blog recommendation: follow counts among friends

○ Textual relevancy
■ how: exact match, query proximity
■ where: name, title, tag, mention, body, etc
○ Recency
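
The slide lists signal families but not how they are combined. One common approach is a weighted sum, sketched below; the weights, the log damping, the 30-day half-life, and the field names are all hypothetical and are not the production ranker:

```python
import math

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"global": 1.0, "local": 2.0, "text": 3.0, "recency": 1.0}


def score(doc, now, half_life_days=30.0):
    """Combine the four signal families into one score."""
    # Global popularity: damp large counts with a logarithm.
    global_pop = math.log1p(doc["likes"] + doc["reblogs"])
    # Local popularity: likes attributed to the query term.
    local_pop = math.log1p(doc["likes_on_query_term"])
    # Textual relevancy: reduced here to an exact-match flag.
    text = 1.0 if doc["exact_match"] else 0.0
    # Recency: exponential decay, halving every `half_life_days`.
    age_days = max(0.0, (now - doc["created"]) / 86400.0)
    recency = 0.5 ** (age_days / half_life_days)
    return (
        WEIGHTS["global"] * global_pop
        + WEIGHTS["local"] * local_pop
        + WEIGHTS["text"] * text
        + WEIGHTS["recency"] * recency
    )


now = 1_700_000_000
old = {"likes": 100, "reblogs": 50, "likes_on_query_term": 10,
       "exact_match": True, "created": now - 90 * 86400}
new = dict(old, created=now - 86400)  # same post, one day old
ranked = sorted([old, new], key=lambda d: score(d, now), reverse=True)
print(ranked[0] is new)  # True
```

With all other signals equal, the newer post ranks first; with learning to rank, the hand-set weights would be replaced by a trained model.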
Duplicate Elimination (DE)
● Index-time DE
○ post signature
■ number of tags > N1
■ md5 hash of normalized tag list

● Search-time DE
○ Media DE
■ posts with same media hashes.
○ Near DE
■ posts with tags > N2
■ mark as near duplicate if diff <= N3 tags
■ older posts selected as original
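
The two tag-based checks can be sketched as follows. The thresholds N1, N2, and N3 are not given on the slide, so the values here are placeholders; treating "normalized" as lowercased, trimmed, de-duplicated, and sorted, and "diff" as the symmetric difference of the two tag sets, are also assumptions:

```python
import hashlib

N1, N2, N3 = 3, 3, 1  # hypothetical thresholds; the slide gives no values


def normalize(tags):
    """Lowercase, trim, de-duplicate, and sort the tag list."""
    return sorted({t.strip().lower() for t in tags if t.strip()})


def post_signature(tags):
    """Index-time DE: md5 of the normalized tag list, only if > N1 tags."""
    norm = normalize(tags)
    if len(norm) <= N1:
        return None  # too few tags for a meaningful signature
    joined = "\x1f".join(norm)  # unit separator avoids ambiguous joins
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def is_near_duplicate(tags_a, tags_b):
    """Search-time near DE: both posts have > N2 tags and differ by <= N3."""
    a, b = set(normalize(tags_a)), set(normalize(tags_b))
    if len(a) <= N2 or len(b) <= N2:
        return False
    return len(a ^ b) <= N3


def pick_original(post_a, post_b):
    """The older post is kept as the original."""
    return post_a if post_a["created"] <= post_b["created"] else post_b


sig1 = post_signature(["Cats", "gif", "funny", "lol"])
sig2 = post_signature(["lol", "funny", "GIF ", " cats"])
print(sig1 == sig2)  # True
print(is_near_duplicate(["a", "b", "c", "d"], ["a", "b", "c", "d", "e"]))  # True
```

The minimum tag count matters: posts with only one or two common tags would otherwise collide constantly and be dropped as duplicates.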
Search Platform
● A curvy road
○ Started with ElasticSearch
○ Switched to SolrCloud due to reliability
○ Ended up with Solr + Customized Clustering

● Our takes
○ ElasticSearch and SolrCloud have great functionality
■ distributed indexing and search
■ easy cluster management
○ Solr still seems much more reliable under high indexing load and search traffic.
Offline Precomputation
● Benefits
○ Minimize the search online latency
○ More sophisticated/expensive computation

● Limitation
○ Loss of freshness
○ Expensive for longtail query and results

● Precomputed
○ Typeaheads
○ Related search
○ Blog recommendation
○ Top posts of Blog / User
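
The trade-off on this slide can be sketched as a head/tail split: precompute results for frequent queries offline and compute the long tail on demand. The frequency cutoff, function names, and fallback behavior are hypothetical, since the talk does not say how tail queries are handled:

```python
from collections import Counter


def precompute(query_log, compute, min_count=2):
    """Offline step: compute results only for queries seen often enough."""
    freq = Counter(q.strip().lower() for q in query_log)
    return {q: compute(q) for q, n in freq.items() if n >= min_count}


def serve(query, table, compute):
    """Online step: use the precomputed table, else fall back to computing."""
    q = query.strip().lower()
    if q in table:
        return table[q], "precomputed"
    return compute(q), "online"


def expensive_related_search(q):
    # Stand-in for a costly computation.
    return [q + " art", q + " gif"]


log = ["cats", "Cats", "doctor who", "cats", "rare query"]
table = precompute(log, expensive_related_search)
print(serve("cats", table, expensive_related_search))
# (['cats art', 'cats gif'], 'precomputed')
print(serve("rare query", table, expensive_related_search)[1])  # online
```

The table is only as fresh as the last offline run, which is the loss of freshness the slide lists as a limitation.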
What’s Next
● Inblog search
○ full text search on all posts in a blog
○ original posts, reblogs, likes

● Ranking
○ more effective and spam-resilient signals
○ learning to rank

● Topical interest modeling
○ supervised and unsupervised
○ blog content and user activities
○ interest based blog recommendation

● Content discovery
○ trending content in various categories
Q&A
Question: Are you hiring?
Answer: Yeah! Check it out at http://www.tumblr.com/jobs

More questions please, :-)

Search at Tumblr (nyc search meetup)
