Building a big social network
search system using Lucene
Aleksey Shevchuk
Lead developer @ Odnoklassniki
Agenda
Functions and architecture
Problems & solutions
About Odnoklassniki social network
• Audience:
– 200 million accounts
– Up to 6 million users online
– More than 40 million visitors a day
• Every second:
– 290,000 web pages and 100,000 photos viewed
– 4,000 search requests, with an average search time of 70 ms
Why did we choose Lucene?
• Back in 2009 our user search was based on MS SQL, which simplified the initial requirements definition
• We wanted an open-source solution written in Java
• Tests showed that Solr underperformed for us
• We developed our own server around Lucene
Search system duties today
Users
Video
Music
Groups
Communities
Events
Gifts
Locations
Hobbies
Help
Group users
Quick portal search
Expanded portal search
Architecture
[Diagram labels: Search facade, Maker + DB, Query, Services, Entity cache, Presentation; flows: Event, Search, Update, Replication, Get]
Architecture: maker
• Collects notifications about changed entities
• Uses Cassandra to store additional entity data
• Responsible for domain index writing
• Controls index replication to query servers
Architecture: query
• Many servers with different hardware configurations
• Unified application
• Indexes are stored on disk for a quick start
• Queries are executed in heap memory
– IndexReader rewritten to eliminate unnecessary operations
– Our own stored field retrieval method:
• No garbage
• Values are accessed without actual deserialization
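The idea of reading a stored value without deserializing the whole record can be illustrated with a small Python sketch. It is not the Odnoklassniki implementation: the record layout, the field names and the helper functions are assumptions made up for this example, and Python still allocates a small integer object on each read, so only the fixed-offset access pattern carries over, not the "no garbage" property.

```python
import struct

# Hypothetical fixed-width record: user_id (uint32), birth_year (uint16), gender (uint8).
# "<" means little-endian with no padding, so each record is exactly 7 bytes.
RECORD = struct.Struct("<IHB")
BIRTH_YEAR_OFFSET = 4  # bytes from the start of a record
GENDER_OFFSET = 6


def pack_records(rows):
    """Write all records into one contiguous buffer, indexed by document number."""
    buf = bytearray(RECORD.size * len(rows))
    for i, row in enumerate(rows):
        RECORD.pack_into(buf, i * RECORD.size, *row)
    return bytes(buf)


def birth_year(buf, doc_id):
    """Read one field at a computed offset; the rest of the record is never decoded."""
    return struct.unpack_from("<H", buf, doc_id * RECORD.size + BIRTH_YEAR_OFFSET)[0]


def gender(buf, doc_id):
    """A single-byte field can be read by plain indexing."""
    return buf[doc_id * RECORD.size + GENDER_OFFSET]


stored = pack_records([(7, 1985, 1), (8, 1990, 2)])
print(birth_year(stored, 1), gender(stored, 0))
```

Because every record has the same width, locating a field is pure arithmetic on the document number, which is what makes skipping deserialization possible.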
Architecture: search facade
• Creates & manages personal indexes
• Schedules query execution
• Reduces query results to search results
• Loads data for result rendering
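One plausible reading of the "reduce" step is merging the per-shard hit lists returned by the query servers into a single top-k list. The sketch below assumes exactly that; the `(score, doc_id)` hit shape and the function name `reduce_results` are hypothetical and not taken from the talk.

```python
import heapq


def reduce_results(shard_results, k):
    """Merge per-shard hit lists into the k best hits overall.

    shard_results: list of lists of (score, doc_id) tuples, in any order.
    Ties on score are broken by doc_id because tuples compare element-wise.
    """
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, all_hits)


shards = [[(0.9, "a"), (0.2, "b")], [(0.7, "c")], []]
print(reduce_results(shards, 2))
```

`heapq.nlargest` keeps only k candidates in memory at a time, so the facade does not need to sort every hit from every shard.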
Problems & solutions
Problem: spelling vs performance
• Most of the content is in Russian:
– Proper Russian
– Common misspellings
– Misspellings made by people who are trying to write in Russian
– Russian words written in Latin script (Translit & Crazy Russian)
– Wrong keyboard layout
• A few examples, with common misspellings omitted:
– машина = мышына, масына, mashina, moshina
– Кашин = kashin, кашен, ka6in
– Kosheen = кошин, cosheen, koshin
Solution: spelling vs performance
• Reduce the number of terms using phonetics:
MOSHINO = машина, мышына, масына, mashina, moshina
• The query is expanded with a few phonetic keys:
– Common misspellings
– Synonyms we know
• Distinguish spellings using a 1-byte hash code per term
– If possible, perform the hash check only for top documents
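The phonetic reduction and the 1-byte spelling hash can be sketched in Python as follows. This is a toy illustration, not the algorithm used at Odnoklassniki: the transliteration table covers only the letters in the examples, the rules (fold "sh" into "s", collapse all vowels into one class, drop doubled letters) are assumptions, and the resulting key differs from the slide's MOSHINO. The hash is taken as the low byte of CRC32, which is also an assumption.

```python
import zlib

# Toy Cyrillic-to-Latin table; "6" is a common Translit stand-in for "ш".
TRANSLIT = {"м": "m", "а": "a", "ш": "sh", "и": "i", "н": "n", "ы": "y",
            "с": "s", "к": "k", "е": "e", "о": "o", "6": "sh"}
VOWELS = set("aeiouy")


def phonetic_key(word):
    """Map spelling variants of a word to one coarse key."""
    latin = "".join(TRANSLIT.get(ch, ch) for ch in word.lower())
    latin = latin.replace("sh", "s")
    out = []
    for ch in latin:
        ch = "a" if ch in VOWELS else ch   # all vowels fall into one class
        if not out or out[-1] != ch:       # drop doubled letters
            out.append(ch)
    return "".join(out).upper()


def spelling_hash(word):
    """1-byte code that tells apart spellings sharing a phonetic key."""
    return zlib.crc32(word.lower().encode("utf-8")) & 0xFF


def build_index(docs):
    """docs: {doc_id: word}. Postings are stored under the phonetic key only."""
    index = {}
    for doc_id, word in docs.items():
        index.setdefault(phonetic_key(word), []).append((doc_id, spelling_hash(word)))
    return index


def search(index, query):
    """Return doc ids for the key, postings with the query's spelling hash first.

    For brevity this checks the hash of every posting; the slide suggests
    doing the check only for top documents when possible.
    """
    wanted = spelling_hash(query)
    postings = index.get(phonetic_key(query), [])
    return [doc_id for doc_id, h in sorted(postings, key=lambda p: p[1] != wanted)]


index = build_index({1: "машина", 2: "mashina", 3: "moshina"})
print(phonetic_key("мышына"), search(index, "mashina"))
```

The coarse key keeps the term dictionary small, and the hash byte recovers most of the precision lost by the key without storing the original spelling in the postings.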
Problem: personal index availability
• Queries take 5–100 ms
• Personal index composition takes 50–300 ms
[Diagram labels: Cache, Cache; Service, Service, Service, each marked *2]
• Network load on cache servers quickly hit 700 Mb/s
• Meanwhile, there was no CPU load on cache servers
Solution: personal index availability
[Diagram labels: Service 0-19, Service 20-39, Service 40-59, Service 60-79, Service 80-99; example value 37]
• Bind users to specific servers
• Store personal indexes locally (in off-heap memory)
• Define a substitution order
• Total network load is under 100 Mb/s
• CPU load is even across all servers
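Binding users to servers with a fallback order can be sketched as below. The partition ranges follow the diagram (0-19 through 80-99), but everything else is an assumption: deriving the partition as `user_id % 100`, the server names, and using ring order as the substitution order are all hypothetical choices for illustration.

```python
PARTITIONS = 100
# Hypothetical server names, one per partition range from the diagram.
SERVERS = ["service-0-19", "service-20-39", "service-40-59",
           "service-60-79", "service-80-99"]


def partition_of(user_id):
    """Assumed mapping from a user to one of 100 partitions."""
    return user_id % PARTITIONS


def substitution_order(user_id):
    """Servers to try for this user: the owner first, then the rest in ring order."""
    owner = partition_of(user_id) * len(SERVERS) // PARTITIONS
    return [SERVERS[(owner + i) % len(SERVERS)] for i in range(len(SERVERS))]


def pick_server(user_id, alive):
    """Return the first live server in the user's substitution order, or None."""
    for server in substitution_order(user_id):
        if server in alive:
            return server
    return None


print(pick_server(37, set(SERVERS)))
print(pick_server(37, {"service-0-19", "service-40-59"}))
```

Because every caller computes the same order for a given user, the personal index is built on one server in the normal case and on one predictable substitute when that server is down. Plain ring order sends all of a failed server's users to a single neighbor; a per-user shuffled order would spread them more evenly.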
Problem: gender and country filters
• Usually an index is split into shards until the average query time falls within the required bounds
– This solves the response time problem
– All possible documents are still checked
• There are two filters that make user queries slow:
– Gender
– One very popular country
Solution: gender and country filters
• Remove these condition checks: saves 17% CPU
• Exclude documents that cannot match these filters: saves another 12% CPU
[Diagram labels: Russian males, Russian females, Other males, Other females]
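The four groups in the diagram suggest partitioning documents by gender and by whether they belong to the popular country, so that a filtered query only visits partitions that can match and never evaluates the filter per document. The sketch below assumes that reading; the class, the field names and the `"RU"` country code are hypothetical.

```python
from collections import defaultdict

POPULAR_COUNTRY = "RU"  # assumed code for the "one very popular country"


def partition_key(gender, country):
    return (country == POPULAR_COUNTRY, gender)


class PartitionedIndex:
    def __init__(self):
        self.parts = defaultdict(list)  # (is_popular_country, gender) -> documents

    def add(self, user):
        self.parts[partition_key(user["gender"], user["country"])].append(user)

    def search(self, name_prefix, gender=None, popular_country=None):
        """popular_country: True = popular country only, False = others only, None = any.

        The gender and country filters are applied once per partition,
        never once per document.
        """
        hits = []
        for (is_popular, part_gender), docs in self.parts.items():
            if gender is not None and part_gender != gender:
                continue
            if popular_country is not None and is_popular != popular_country:
                continue
            hits.extend(u["id"] for u in docs if u["name"].startswith(name_prefix))
        return hits


index = PartitionedIndex()
for user in [
    {"id": 1, "name": "Anna", "gender": "f", "country": "RU"},
    {"id": 2, "name": "Anton", "gender": "m", "country": "RU"},
    {"id": 3, "name": "Anna", "gender": "f", "country": "DE"},
    {"id": 4, "name": "Andrei", "gender": "m", "country": "KZ"},
]:
    index.add(user)
print(index.search("An", gender="f", popular_country=True))
```

A filter on a specific non-popular country would still need a per-document check inside the "other" partitions; this sketch leaves that case out.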
Problem: users online search
• People wish to quickly find a person they can talk to
• At any given moment, only a small fraction of users are online
• Standard solution: filter the general search results down to online users:
+ easy to implement
+ reliable
– slow, especially for random-user queries
– wastes CPU
Solution: users online search
• Create a separate index containing only online users:
+ works quickly
+ no tricks required
– more than 200,000 changes per minute
– correct results depend on index maker availability
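A separate online-only index is small but write-heavy: more than 200,000 changes per minute is roughly 3,300 updates per second. The sketch below shows the shape of such an index driven by login and logout events; the event names, the class and the prefix search are assumptions for illustration, not the actual design.

```python
class OnlineIndex:
    """Holds only users who are currently online, keyed by user id."""

    def __init__(self):
        self.names = {}  # user_id -> lowercased name

    def apply(self, event, user_id, name=None):
        """Apply one change from the (hypothetical) event stream."""
        if event == "login":
            self.names[user_id] = name.lower()
        elif event == "logout":
            self.names.pop(user_id, None)

    def search(self, prefix):
        """Every hit is online by construction, so no per-document online check is needed."""
        prefix = prefix.lower()
        return sorted(uid for uid, n in self.names.items() if n.startswith(prefix))


online = OnlineIndex()
online.apply("login", 1, "Anna")
online.apply("login", 2, "Anton")
online.apply("logout", 1)
print(online.search("an"))
```

The last drawback on the slide is visible here: if a logout event is lost or delayed, the user stays in the index and the results are wrong until the update arrives.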
Problem: user search inside group
• This kind of search is in demand among group owners
• Some numbers:
– 200 million users in 16 shards
– 7 million groups in 8 shards
– Each group has from 1 to several million users
– Number of group-to-user connections: billions
• “Dummy solutions” were not tested
Solution: user search inside group
• We use the mechanics from personal indexes
• Currently indexed groups are updated with changes
• Small group indexes are discarded after 1 hour
• Big group indexes are kept until the application restarts
[Diagram labels: Groups, Users, Search façade, Heap memory, Off-heap memory, Portal services, Small groups]
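The lifecycle of group indexes described above can be sketched as a cache with two retention policies. The details are assumptions: the size threshold that separates small from big groups, counting the hour from index creation rather than last use, and the `load_members` callback are all hypothetical.

```python
import time

SMALL_GROUP_TTL_SECONDS = 3600   # "after 1 hour" from the slide
BIG_GROUP_THRESHOLD = 10_000     # hypothetical boundary between small and big groups


class GroupIndexCache:
    def __init__(self, load_members, clock=time.monotonic):
        self.load_members = load_members  # callback: group_id -> iterable of user ids
        self.clock = clock
        self.indexes = {}                 # group_id -> (member set, created_at)

    def get(self, group_id):
        """Return the member index for a group, building it on first use."""
        self.evict_expired()
        if group_id not in self.indexes:
            self.indexes[group_id] = (set(self.load_members(group_id)), self.clock())
        return self.indexes[group_id][0]

    def on_membership_change(self, group_id, user_id, joined):
        """Only currently indexed groups are updated; other changes are ignored."""
        entry = self.indexes.get(group_id)
        if entry is None:
            return
        if joined:
            entry[0].add(user_id)
        else:
            entry[0].discard(user_id)

    def evict_expired(self):
        """Drop small-group indexes older than the TTL; big ones stay until restart."""
        now = self.clock()
        for group_id, (members, created_at) in list(self.indexes.items()):
            is_small = len(members) < BIG_GROUP_THRESHOLD
            if is_small and now - created_at >= SMALL_GROUP_TTL_SECONDS:
                del self.indexes[group_id]


cache = GroupIndexCache(lambda group_id: range(3))
print(sorted(cache.get("chess-club")))
```

Passing the clock in as a parameter keeps the eviction rule testable without waiting an hour.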
More information
Aleksey Shevchuk
@AlekseyShevchuk
aleksey.shevchuk@odnoklassniki.ru
odnoklassniki.ru/mrSearch
Odnoklassniki.ru: http://v.ok.ru
Integration with Odnoklassniki.ru: http://connect.ok.ru
one-nio: slideshare.net/m0nstermind/presentations, github.com/odnoklassniki/one-nio
Cassandra: github.com/odnoklassniki/apache-cassandra