Building a big social network search system using Lucene

Building a big social network
search system using Lucene
Aleksey Shevchuk
Lead developer @ Odnoklassniki

Agenda
Functions and architecture
Problems & solutions

About Odnoklassniki social network
• Audience:
– 200 mln accounts;
– Up to 6 mln users online;
– More than 40 mln visitors a day
• Every second:
– 290,000 web pages and 100,000 photos viewed;
– 4,000 search requests, with an average search time of 70 ms

Why did we choose Lucene?
• Back in 2009 our user search was based on MS SQL, which simplified the initial requirements definition
• We wanted an open-source solution written in Java
• Tests showed that Solr underperformed for our workload
• We developed our own server around Lucene

Search system duties today
Users
Video
Music
Groups
Communities
Events
Gifts
Locations
Hobbies
Help
Group users

Architecture
[Architecture diagram. Components: Presentation, Search facade, Query, Maker + DB, Services, Entity cache. Labeled flows: Event, Search, Update, Query, Replication, Get.]

Architecture: maker
• Collects notifications about changed entities
• Uses Cassandra to store additional entity data
• Responsible for domain index writing
• Controls index replication to query servers

Architecture: query
• Many servers with different hardware configurations
• Unified application
• Indexes are stored on disk for a quick start
• Queries are executed in heap memory
– IndexReader rewritten to eliminate unnecessary operations
– Custom stored-field retrieval method:
• No garbage
• Accessing values without actual deserialization
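
The "no garbage" retrieval idea can be illustrated with a fixed-width record layout that is read by offset, so a field value comes back as a primitive without building an intermediate object. This is only a sketch: the class name, the record layout and the field names are assumptions made for illustration, not the format actually used at Odnoklassniki.

```java
import java.nio.ByteBuffer;

/** Hypothetical fixed-width stored-field table: every document occupies RECORD_SIZE bytes. */
final class StoredFieldTable {
    private static final int RECORD_SIZE = 12;      // assumed layout: long userId + int birthYear
    private static final int USER_ID_OFFSET = 0;
    private static final int BIRTH_YEAR_OFFSET = 8;

    private final ByteBuffer data;

    StoredFieldTable(int docCount) {
        // Heap buffer, in line with the slide's note that queries run in heap memory.
        this.data = ByteBuffer.allocate(docCount * RECORD_SIZE);
    }

    void put(int docId, long userId, int birthYear) {
        int base = docId * RECORD_SIZE;
        data.putLong(base + USER_ID_OFFSET, userId);
        data.putInt(base + BIRTH_YEAR_OFFSET, birthYear);
    }

    // Absolute reads return primitives, so no object is allocated per call.
    long userId(int docId) {
        return data.getLong(docId * RECORD_SIZE + USER_ID_OFFSET);
    }

    int birthYear(int docId) {
        return data.getInt(docId * RECORD_SIZE + BIRTH_YEAR_OFFSET);
    }
}
```

A scorer or filter could call `birthYear(docId)` for every candidate document without producing work for the garbage collector.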

Architecture: search facade
• Creates and manages personal indexes
• Schedules query execution
• Reduces query results to search results
• Loads data for result rendering
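
The "reduce" step can be pictured as merging the hit lists returned by several query servers and keeping only the best k. The sketch below assumes a simple score-ordered merge with a bounded heap; `ResultReducer` and `Hit` are hypothetical names, and the real facade may rank and merge differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class ResultReducer {
    /** A scored hit returned by one query server (hypothetical type). */
    static final class Hit {
        final long docId;
        final float score;
        Hit(long docId, float score) { this.docId = docId; this.score = score; }
    }

    private static final Comparator<Hit> BY_SCORE = Comparator.comparingDouble(h -> h.score);

    /** Merges per-shard hit lists and returns the k best hits, highest score first. */
    static List<Hit> reduce(List<List<Hit>> shardResults, int k) {
        // Min-heap on score: the weakest of the current top-k sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<>(BY_SCORE);
        for (List<Hit> shard : shardResults) {
            for (Hit hit : shard) {
                heap.offer(hit);
                if (heap.size() > k) {
                    heap.poll(); // drop the lowest-scoring hit
                }
            }
        }
        List<Hit> top = new ArrayList<>(heap);
        top.sort(BY_SCORE.reversed());
        return top;
    }
}
```

The heap never grows beyond k + 1 entries, so memory stays bounded regardless of how many shards respond.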

Problem: spelling vs performance
• Most of the content is in Russian:
– Proper Russian
– Common misspellings
– Misspellings by people who are trying to write in Russian
– Russian words written in Latin (Translit & Crazy Russian)
– Wrong keyboard layout
• A few examples, with common misspellings omitted:
– машина = мышына, масына, mashina, moshina
– Кашин = kashin, кашен, ka6in
– Kosheen = кошин, cosheen, koshin

Solution: spelling vs performance
• Reduce the number of terms using phonetics:
MOSHINO = машина, мышына, масына, mashina, moshina
• The query is expanded with a few phonetic keys:
– Common misspellings
– Synonyms we know
• Distinguish the exact spelling using a 1-byte hash code per term
– If possible, perform the hash check only for top documents
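
To make the idea concrete, here is a toy phonetic key plus a 1-byte spelling hash. It is not the algorithm from the talk (which yields keys such as MOSHINO): the transliteration table, the "all vowels are one class" rule and the use of `String.hashCode` are assumptions chosen only to keep the sketch short.

```java
import java.util.Locale;

final class PhoneticKeys {
    // Partial Cyrillic-to-Latin table, just enough for the slide's examples (assumption).
    private static String translit(char c) {
        switch (c) {
            case '\u0430': return "a";  // а
            case '\u0435': return "e";  // е
            case '\u0438': return "i";  // и
            case '\u043e': return "o";  // о
            case '\u0443': return "u";  // у
            case '\u044b': return "y";  // ы
            case '\u043a': return "k";  // к
            case '\u043c': return "m";  // м
            case '\u043d': return "n";  // н
            case '\u0441': return "s";  // с
            case '\u0448': return "sh"; // ш
            default: return String.valueOf(c);
        }
    }

    /** Toy phonetic key: transliterate, fold "sh" into "s", map every vowel to 'A'. */
    static String key(String word) {
        StringBuilder latin = new StringBuilder();
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            latin.append(translit(c));
        }
        String folded = latin.toString().replace("sh", "s");
        StringBuilder key = new StringBuilder();
        for (char c : folded.toCharArray()) {
            char cls = "aeiouy".indexOf(c) >= 0 ? 'A' : Character.toUpperCase(c);
            // Collapse runs of the same class, e.g. "ee" becomes a single 'A'.
            if (key.length() == 0 || key.charAt(key.length() - 1) != cls) {
                key.append(cls);
            }
        }
        return key.toString();
    }

    /** 1-byte hash of the exact (case-insensitive) spelling, stored next to the phonetic term. */
    static byte spellingHash(String word) {
        return (byte) word.toLowerCase(Locale.ROOT).hashCode();
    }
}
```

With this toy normalisation, машина, мышына, масына, mashina and moshina all reduce to the same key, so the index holds one term instead of five. The stored `spellingHash` byte can then be compared with the hash of the query's spelling to prefer exact matches, accepting that a single byte will sometimes collide.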

Problem: personal index availability
• Queries take 5-100 ms
• Personal index composition takes 50-300 ms
[Diagram: two Cache servers and three Service instances, each labeled *2]
• Network load on the cache servers quickly hit 700 Mb/s
• Meanwhile, there was no CPU load on the cache servers

Solution: personal index availability
[Diagram: five service partitions: Service 0-19, Service 20-39, Service 40-59, Service 60-79, Service 80-99; example key: 37]
• Bind users to specific servers
• Store personal indexes locally (in off-heap memory)
• Define a substitution order
• Total network load is under 100 Mb/s
• CPU load is even across all servers
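
One way to read the binding scheme is sketched below. It assumes that a user falls into one of 100 buckets by `userId % 100`, that five servers own 20 buckets each (matching the 0-19 ... 80-99 ranges), and that the substitution order is simply "the next servers in ring order". All three are assumptions; the talk does not state the actual hashing or failover policy.

```java
import java.util.ArrayList;
import java.util.List;

final class UserRouting {
    private static final int BUCKETS = 100;  // assumed from the 0-99 ranges on the slide
    private static final int SERVERS = 5;    // five ranges of 20 buckets each
    private static final int BUCKETS_PER_SERVER = BUCKETS / SERVERS;

    /** Bucket of a user in the range 0..99 (assumed: user id modulo 100). */
    static int bucket(long userId) {
        return (int) Math.floorMod(userId, (long) BUCKETS);
    }

    /** Index of the server that normally holds this user's personal index. */
    static int primaryServer(long userId) {
        return bucket(userId) / BUCKETS_PER_SERVER;
    }

    /** Servers to try in order: the primary first, then the others in ring order (assumed policy). */
    static List<Integer> substitutionOrder(long userId) {
        int primary = primaryServer(userId);
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < SERVERS; i++) {
            order.add((primary + i) % SERVERS);
        }
        return order;
    }
}
```

Under these assumptions a user in bucket 37 is served by the second server (buckets 20-39). Because every caller computes the same order, a personal index is rebuilt on a predictable substitute when the primary is down.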

Problem: gender and country filters
• Usually an index is split into shards until the average query time falls within acceptable bounds
– This solves the response-time problem
– All possible documents are still checked
• Two filters make user queries slow:
– Gender
– One very popular country

Solution: gender and country filters
• Remove these condition checks: saves 17% CPU
• Exclude documents that cannot match these filters: saves another 12% CPU
[Diagram: index split into four parts: Russian males, Russian females, Other males, Other females]
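
The four-way split can be sketched as choosing which sub-indexes a query has to visit. The enum and method names below are hypothetical, and the sketch assumes the country filter is reduced to "the popular country or not"; a filter on some other specific country would still need its own check inside the "Other" parts.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionSelector {
    enum Gender { MALE, FEMALE }
    enum Partition { RUSSIAN_MALES, RUSSIAN_FEMALES, OTHER_MALES, OTHER_FEMALES }

    /**
     * Returns the partitions that can contain matches.
     * A null argument means the query has no such filter.
     */
    static List<Partition> select(Gender gender, Boolean popularCountry) {
        List<Partition> result = new ArrayList<>();
        for (Partition p : Partition.values()) {
            boolean male = p == Partition.RUSSIAN_MALES || p == Partition.OTHER_MALES;
            boolean russian = p == Partition.RUSSIAN_MALES || p == Partition.RUSSIAN_FEMALES;
            if (gender != null && (gender == Gender.MALE) != male) {
                continue; // this partition holds only the other gender
            }
            if (popularCountry != null && popularCountry.booleanValue() != russian) {
                continue; // this partition holds only the other country group
            }
            result.add(p);
        }
        return result;
    }
}
```

Inside a selected partition every document already satisfies the gender and popular-country conditions, so the query needs no per-document check for them, and documents in the skipped partitions are never visited.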

Problem: users online search
• People want to quickly find someone they can talk to
• At any given moment, only a small fraction of users are online
• Standard solution: filter the general search results down to online users:
+ easy to implement
+ reliable
– slow, especially for a random-users query
– wastes CPU

Solution: users online search
• Create a separate index containing only online users:
+ works quickly
+ no tricks required
– more than 200,000 changes/minute
– correct results depend on index maker availability

Problem: user search inside group
• This kind of search is in demand from group owners
• Some numbers:
– 200 million users in 16 shards
– 7 million groups in 8 shards
– Each group has from 1 to several million users
– Number of group-to-user connections: billions
• “Naive” solutions were not tested

Solution: user search inside group
• We use the mechanics from personal indexes
• Currently indexed groups are updated with changes
• Small group indexes are discontinued after 1 hour
• Big group indexes are kept until the application restarts
[Diagram labels: Groups, Users, Portal services, Search façade, Heap memory, Off-heap memory, Small groups]
More information
Aleksey Shevchuk
@AlekseyShevchuk
aleksey.shevchuk@odnoklassniki.ru
odnoklassniki.ru/mrSearch
Odnoklassniki.ru
http://v.ok.ru
Integration with Odnoklassniki.ru
http://connect.ok.ru
one-nio
slideshare.net/m0nstermind/presentations
github.com/odnoklassniki/one-nio
Cassandra
github.com/odnoklassniki/apache-cassandra