Presented by Aleksey Shevchuk, Lead developer, odnoklassniki.ru
This talk explains how the search system of the social network Odnoklassniki works. Each day 40 million people use Odnoklassniki to communicate and entertain themselves, and these activities are hard to imagine without a proper search system. A dozen big indexes and thousands of small indexes respond to more than 4000 searches per second at peak times. Users can search within specific sections of the site or across the whole site. The search system decides which indexes to query and which results to show. To improve relevance, we use information from the social graph and various activity statistics available for indexed entities. Query log analysis? Again Lucene!
3. About Odnoklassniki social network
• Audience:
– 200 mln accounts;
– Up to 6 mln users online;
– More than 40 mln visitors a day
• Within a second:
– 290 000 web pages and 100 000 photos viewed;
– 4000 search requests, with an average search time of 70 ms
4. Why did we choose Lucene?
• Back in 2009 we had user search based on MS SQL,
which simplified defining the initial requirements
• We wanted an open-source solution written in Java
• Tests showed that Solr underperformed for our workload
• We developed our own server around Lucene
5. Search system duties today
Users
Video
Music
Groups
Communities
Events
Gifts
Locations
Hobbies
Help
Group users
9. Architecture: maker
• Collects notifications about changed entities
• Uses Cassandra to store additional entity data
• Responsible for domain index writing
• Controls index replication to query servers
10. Architecture: query
• Many servers in different hardware configurations
• Unified application
• Indexes are stored on disk for a quick start
• Queries are executed in heap memory
– IndexReader rewritten to eliminate unnecessary operations
– Own stored field retrieval method:
• No garbage
• Accessing values without actual deserialization
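The "no garbage, no deserialization" idea can be illustrated with a minimal sketch. The slides do not describe the real storage format, so the fixed-width record layout, the field names, and the class `StoredFieldsSketch` below are all assumptions made for illustration. Fields are read as primitives straight from a byte buffer by offset, so no per-document object is allocated.

```java
import java.nio.ByteBuffer;

class StoredFieldsSketch {
    // Hypothetical fixed-width record: long userId (8) + int birthYear (4) + byte gender (1).
    private static final int RECORD_SIZE = 13;
    private static final int USER_ID_OFFSET = 0;
    private static final int BIRTH_YEAR_OFFSET = 8;
    private static final int GENDER_OFFSET = 12;

    // One heap buffer holds the stored fields of every document.
    private final ByteBuffer data;

    StoredFieldsSketch(int maxDocs) {
        data = ByteBuffer.allocate(maxDocs * RECORD_SIZE);
    }

    void put(int docId, long userId, int birthYear, byte gender) {
        int base = docId * RECORD_SIZE;
        data.putLong(base + USER_ID_OFFSET, userId);
        data.putInt(base + BIRTH_YEAR_OFFSET, birthYear);
        data.put(base + GENDER_OFFSET, gender);
    }

    // Each accessor reads a primitive at a computed offset: no object is created.
    long userId(int docId) {
        return data.getLong(docId * RECORD_SIZE + USER_ID_OFFSET);
    }

    int birthYear(int docId) {
        return data.getInt(docId * RECORD_SIZE + BIRTH_YEAR_OFFSET);
    }

    byte gender(int docId) {
        return data.get(docId * RECORD_SIZE + GENDER_OFFSET);
    }
}
```

A caller that only needs one field of a hit (for example `birthYear(docId)` for ranking) pays for exactly one buffer read.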
11. Architecture: search facade
• Creates & manages personal indexes
• Schedules query execution
• Reduces query results to search results
• Loads data for result rendering
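The "reduce" step can be sketched as a top-N merge of per-shard hit lists. This is only an illustration: the `Hit` type, the score-only ordering, and the `reduce` signature are assumptions, and the real facade presumably applies more ranking logic than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class ResultReducerSketch {
    // Hypothetical hit: an entity id plus a relevance score from one query server.
    static final class Hit {
        final long entityId;
        final float score;

        Hit(long entityId, float score) {
            this.entityId = entityId;
            this.score = score;
        }
    }

    // Merges per-shard hit lists and keeps the topN best-scored hits, best first.
    static List<Hit> reduce(List<List<Hit>> shardResults, int topN) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        // Min-heap: the weakest hit kept so far sits on top and is evicted first.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (List<Hit> shard : shardResults) {
            for (Hit hit : shard) {
                heap.offer(hit);
                if (heap.size() > topN) {
                    heap.poll();
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort(byScore.reversed());
        return merged;
    }
}
```

The bounded heap keeps memory proportional to `topN` rather than to the total number of hits returned by all shards.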
13. Problem: spelling vs performance
• Most of the content is in Russian:
– Proper Russian
– Common misspellings
– Misspellings made by people who try to write in Russian
– Russian words written in Latin script (Translit & Crazy Russian)
– Wrong keyboard layout
• A few examples, with common misspellings omitted:
– машина = мышына, масына, mashina, moshina
– Кашин = kashin, кашен, ka6in
– Kosheen = кошин, cosheen, koshin
14. Solution: spelling vs performance
• Reduce the number of terms using phonetics:
MOSHINO = машина, мышына, масына, mashina, moshina
• The query is expanded with a few phonetic keys:
– Common misspellings
– Synonyms we know
• Distinguish spellings using a 1-byte hash code per term
– If possible, perform the hash check only for top documents
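The slides do not give the actual phonetic algorithm, so the following is only a rough sketch of the two ideas: collapsing spelling variants into one key, and keeping a 1-byte hash to tell the exact spellings apart. The transliteration table, the consonant-skeleton key, and the use of `String.hashCode()` are all assumptions. The key is deliberately coarser than the MOSHINO key on the slide: it drops vowels entirely.

```java
import java.util.Locale;
import java.util.Map;

class PhoneticKeySketch {
    // Hypothetical, deliberately tiny table: Cyrillic letters (and the
    // "6" used for "ш" in informal translit) mapped to Latin.
    private static final Map<Character, String> TRANSLIT = Map.ofEntries(
        Map.entry('м', "m"), Map.entry('а', "a"), Map.entry('ш', "sh"),
        Map.entry('и', "i"), Map.entry('н', "n"), Map.entry('ы', "y"),
        Map.entry('с', "s"), Map.entry('о', "o"), Map.entry('к', "k"),
        Map.entry('е', "e"), Map.entry('6', "sh"));

    // Crude phonetic key: transliterate, merge similar consonants, drop vowels.
    static String phoneticKey(String term) {
        StringBuilder latin = new StringBuilder();
        for (char c : term.toLowerCase(Locale.ROOT).toCharArray()) {
            latin.append(TRANSLIT.getOrDefault(c, String.valueOf(c)));
        }
        String merged = latin.toString().replace("sh", "s").replace('c', 'k');
        return merged.replaceAll("[aeiouy]", "").toUpperCase(Locale.ROOT);
    }

    // 1-byte hash of the exact (lower-cased) spelling. Terms that share a
    // phonetic key can be told apart by comparing this byte, at the cost
    // of occasional collisions.
    static byte spellingHash(String term) {
        return (byte) term.toLowerCase(Locale.ROOT).hashCode();
    }
}
```

With this sketch, all five variants of "машина" from the slide map to the single index term `MSN`. The hash byte would then be compared only for the best-ranked documents, when the exact spelling matters.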
15. Problem: personal index availability
• Queries take 5 – 100 ms
• Personal index composition takes 50 - 300 ms
[Diagram: two Cache nodes and three Service nodes, each Service labeled ×2]
• Network load on cache servers quickly hit 700 Mb/s
• Meanwhile, there was no CPU load on cache servers
16. Solution: personal index availability
[Diagram: Service 0-19, Service 20-39, Service 40-59, Service 60-79, Service 80-99; example value 37]
• Bind users to specific servers
• Store personal indexes locally (in off-heap memory)
• Determine a substitution order
• Whole network load is under 100 Mb/s
• Even CPU load on all servers
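Binding users to servers with a fallback order could be sketched as follows. The slides show only the ranges 0-19 through 80-99, so deriving the range from `userId % 100`, the ring-style substitution order, and all names here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class UserBindingSketch {
    static final int PARTITIONS = 100; // ranges 0-19 ... 80-99, as on the slide
    static final int SERVERS = 5;

    // Assumption: a user's partition is derived from the user id.
    static int partitionOf(long userId) {
        return (int) Math.floorMod(userId, (long) PARTITIONS);
    }

    // Each server owns a contiguous block of 20 partitions.
    static int primaryServer(long userId) {
        return partitionOf(userId) / (PARTITIONS / SERVERS);
    }

    // Hypothetical substitution order: the primary first, then the
    // following servers in ring order.
    static List<Integer> substitutionOrder(long userId) {
        int primary = primaryServer(userId);
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < SERVERS; i++) {
            order.add((primary + i) % SERVERS);
        }
        return order;
    }

    // Returns the first live server in substitution order, or -1 if none is alive.
    static int route(long userId, boolean[] alive) {
        for (int server : substitutionOrder(userId)) {
            if (alive[server]) {
                return server;
            }
        }
        return -1;
    }
}
```

Because every caller computes the same order, a user's personal index is built on at most one substitute server when the primary is down, rather than being shipped across the network from a shared cache.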
17. Problem: gender and country filters
• Usually an index is split into shards until the average query
time falls within some bounds
– This solves the response time problem
– All possible documents are still checked
• There are 2 filters that make user queries slow:
– Gender
– One very popular country
18. Solution: gender and country filters
• Remove these condition checks – saves 17% CPU
• Exclude documents that cannot match these filters
– saves another 12% CPU
[Diagram: index partitions: Russian males, Russian females, Other males, Other females]
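One way to read this slide: the index is partitioned by the two expensive filters, so a query only visits partitions that can match, and the filter clauses themselves are dropped. The sketch below illustrates the partition selection only. The `Gender` enum, the nullable parameters meaning "no filter", and the method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class FilterPartitionSketch {
    enum Gender { MALE, FEMALE }

    // Partition names follow the four segments named on the slide.
    static String partitionName(boolean russian, Gender gender) {
        String country = russian ? "Russian" : "Other";
        String sex = gender == Gender.MALE ? "males" : "females";
        return country + " " + sex;
    }

    // Selects the partitions a query must visit. A null argument means the
    // user did not set that filter. Documents inside a selected partition
    // already satisfy the filter, so no per-document check is needed.
    static List<String> partitionsToSearch(Boolean russian, Gender gender) {
        List<String> result = new ArrayList<>();
        for (boolean r : new boolean[] {true, false}) {
            if (russian != null && russian != r) {
                continue; // country filter excludes this half of the index
            }
            for (Gender g : Gender.values()) {
                if (gender != null && gender != g) {
                    continue; // gender filter excludes this partition
                }
                result.add(partitionName(r, g));
            }
        }
        return result;
    }
}
```

For example, a search for women in the popular country touches one partition out of four, and a gender-only filter touches two.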
19. Problem: users online search
• People wish to quickly find a person they can talk to
• At any given moment, only small fraction of users
are online
• Standard solution – filter the general search results
down to online users:
+ easy to implement
+ reliable
– slow, especially for random-user queries
– wastes CPU
20. Solution: users online search
• Create separate index, with online users only:
+ works quickly
+ no tricks required
– more than 200,000 changes/minute
– correct results depend on index maker availability
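As a toy illustration of a separate online-only index, the sketch below keeps an in-memory inverted index that is updated on every presence change. The real system is built on Lucene. The class, its event methods, and the term-per-user model are assumptions, and compound updates here are not atomic.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class OnlineUsersIndexSketch {
    // term -> ids of online users whose profile contains that term
    private final Map<String, Set<Long>> postings = new ConcurrentHashMap<>();
    // remembered so that a user's postings can be removed on logout
    private final Map<Long, String[]> termsByUser = new ConcurrentHashMap<>();

    // Hypothetical presence event: the user is added to the index.
    void userCameOnline(long userId, String... terms) {
        userWentOffline(userId); // drop stale postings from an earlier session
        termsByUser.put(userId, terms);
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> ConcurrentHashMap.newKeySet()).add(userId);
        }
    }

    // Hypothetical presence event: the user is removed from the index.
    void userWentOffline(long userId) {
        String[] terms = termsByUser.remove(userId);
        if (terms == null) {
            return;
        }
        for (String term : terms) {
            Set<Long> ids = postings.get(term);
            if (ids != null) {
                ids.remove(userId);
            }
        }
    }

    // Every hit is online by construction, so no post-filtering is needed.
    Set<Long> search(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }
}
```

The trade-off from the slide is visible here: queries are cheap, but every missed or delayed presence event leaves the index wrong until it is corrected.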
21. Problem: user search inside group
• This kind of search is in demand from group owners
• Some numbers:
– 200 million users in 16 shards
– 7 million groups in 8 shards
– Each group has from 1 to several million users
– The number of group-to-user connections is in the billions
• Naive (“dummy”) solutions were not tested
22. Solution: user search inside group
[Diagram labels: Groups, Users]
• We use mechanics from personal indexes
• Groups that are currently indexed are updated with changes
• Small group indexes are discarded after 1 hour
• Big group indexes are kept until the application restarts
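The retention policy in the bullets above can be sketched as a small cache. The slides do not say where the line between small and big groups lies, or whether the hour is counted from creation or from last use. The threshold of 100 000 members, counting from creation, and all names below are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

class GroupIndexCacheSketch {
    static final long SMALL_GROUP_TTL_MS = 60L * 60 * 1000; // 1 hour, from the slide
    static final int BIG_GROUP_THRESHOLD = 100_000;         // hypothetical cut-off

    // Stand-in for a built per-group index: only its metadata is kept here.
    private static final class Entry {
        final int memberCount;
        final long createdAtMs;

        Entry(int memberCount, long createdAtMs) {
            this.memberCount = memberCount;
            this.createdAtMs = createdAtMs;
        }
    }

    private final Map<Long, Entry> indexes = new HashMap<>();

    // Registers a freshly built index for the group.
    void build(long groupId, int memberCount, long nowMs) {
        indexes.put(groupId, new Entry(memberCount, nowMs));
    }

    boolean isIndexed(long groupId) {
        return indexes.containsKey(groupId);
    }

    // Drops small-group indexes older than the TTL. Big-group indexes are
    // never evicted and live until the application restarts.
    void evictExpired(long nowMs) {
        indexes.values().removeIf(e ->
            e.memberCount < BIG_GROUP_THRESHOLD
                && nowMs - e.createdAtMs >= SMALL_GROUP_TTL_MS);
    }
}
```

Passing the clock value in as `nowMs` keeps the policy easy to test. A real implementation would presumably also release the memory held by each evicted index.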
[Diagram: Search façade (heap memory, off-heap memory), Portal services, Small groups]