Topic : Near Real Time Search in e-commerce
Event : Cloud xTech (Expedia's Internal Conference)
Employer : Consultant @ Lucidworks
Attendance Type : External Speaker @Expedia
Place : Taj Vivanta, Gurgaon
Credits : Flipkart (work done), Expedia, Lucidworks
Date : 13th June 2017
2. Near Real time Indexing
Building Real Time Search Index For E-Commerce
Umesh Prasad
Advisory & Professional Services @Lucidworks
aka Techie/Individual Contributor
Yogesh Ahuja
Senior Search Architect @Lucidworks
3. ● SOLR/LUCENE
○ User → Hacker → Contributor
○ [ Lucene 2.1 to 6.5 ]
● Advisory/Consulting @ Lucidworks (4 months)
● Search & Data Platform @ Flipkart
○ 4.5 years
● Payments @ Amazon
○ 1.4 years
● Vertical Search @ Verse Innovation & Naukri (4.8 years)
○ LUCENE 2.1
14. Agenda
• Ecommerce Search
• Need for Real Time Search
• SolrCloud Solution
• Alternatives
• First Principles Approach
• Q & A
15. E-commerce Search
50 main categories
500 sub categories
231 million docs
- 90 million sku
- 160 million listings
- result collapsing
drill down filters
top positions at premium
20. Product /Listing: Important Attributes
Seller Rating Service
Catalogue Service
Promise Service
Availability Service
Offer Service
Pricing Service
Product aka SKU
Listings
25. Update rates by signal
signal             updates/sec (normal)   updates/sec (peak)   updates/hr
text / catalogue   ~10                    ~100                 ~100K
pricing            ~100                   ~1K                  ~10 million
availability       ~100                   ~10K                 ~10 million
offer              ~100                   ~10K                 ~10 million
seller rating      ~10                    ~1K                  ~1 million
signal 6           ~10                    ~100                 ~1 million
signal 7           ~100                   ~10K                 ~10 million
signal 8           ~100                   ~10K                 ~10 million
S1.3 : SolrCloud : Monitoring Lens
[Very High Update Rates]
27. S1.5 : SolrCloud : Internal Email
• Update = Delete + Add
– Block Join Index ⇒ Update Whole Block (Product + Listings)
• Each updated document is streamed synchronously to all replicas
– Reduces indexing throughput
• Soft commit is not free
– Soft commit ⇒ new in-memory segment
– Lots of merges
– Huge document churn / deletes
– All caches still need to be regenerated
– Filter cache misses especially hurt performance
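A minimal sketch of the "Update = Delete + Add" point above, using a toy append-only index rather than Lucene itself. The class `ToyBlockIndex` and its methods are hypothetical and exist only for illustration. The one thing it models is that written documents are immutable, so changing a single listing's price re-adds the whole product block and leaves the old copies behind as deletes.

```python
class ToyBlockIndex:
    """Toy append-only index: documents are never modified in place."""

    def __init__(self):
        self.docs = []      # every document ever written, in docid order
        self.live = set()   # docids that are not marked deleted
        self.blocks = {}    # product_id -> docids of its current block

    def add_block(self, product_id, product, listings):
        """Write a product and all of its listings as one contiguous block."""
        docids = []
        for doc in listings + [product]:   # children first, parent last
            self.docs.append(dict(doc))
            docids.append(len(self.docs) - 1)
        self.live.update(docids)
        self.blocks[product_id] = docids
        return docids

    def update_listing(self, product_id, listing_id, **changes):
        """Change one listing: delete the old block, re-add the whole block."""
        old = self.blocks[product_id]
        block = [dict(self.docs[d]) for d in old]
        for doc in block:
            if doc.get("listing_id") == listing_id:
                doc.update(changes)
        self.live.difference_update(old)   # old copies become deletes
        return self.add_block(product_id, block[-1], block[:-1])

    def deleted_count(self):
        return len(self.docs) - len(self.live)


index = ToyBlockIndex()
index.add_block(
    "p1",
    {"product_id": "p1", "title": "phone"},
    [{"listing_id": "l%d" % i, "price": 100 + i} for i in range(3)],
)
index.update_listing("p1", "l1", price=95)
written = len(index.docs)         # original block plus the rewritten block
deleted = index.deleted_count()   # the superseded copies
```

In this toy model, one price change on a product with three listings writes four new documents and marks four as deleted, which is the churn the slide attributes to block-join indexes under high update rates.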
28. S2xx : Alternatives
1. Updatable DocValues
a. [LUCENE-5189], [SOLR-5944], [under-the-hood],[benchmarking]
b. Doesn’t scale to a large number of fields/multi valued fields
2. Parallel Indexes : Basically Term Partitioned Indexes [Updates in Redis]
a. ParallelReader : Warning: It is up to you to make sure all indexes are created and modified the same way.
b. Prototype worked :
i. Works for small indexes + lots of updates or huge index + daily build
ii. Not for large index + streaming updates + lots of qps
iii. Pulling Changes + Document Building killed it
3. Lucene Codecs API :
a. Works for Prototype : Research/Algorithms/small index
b. Not for the e-commerce marketplace use case: handling index corruption and 2-phase commit is difficult.
38. Near Real Time Solr Architecture
[Architecture diagram. Labels: Ingestion pipeline; Catalogue, Pricing, Availability, Offers, Seller Quality, Others; Kafka; Solr Master; Lucene Updates (Commit + Replicate + Reopen); NRT Updates; Solr; Lucene; Matching, Faceting, Ranking; NRT Forward Index; NRT Inverted store; Redis; Bootstrap]
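One way to read the diagram is that fast-changing fields bypass Lucene commits and land in an NRT forward index that ranking consults at query time. The sketch below illustrates that idea under stated assumptions: a plain dict stands in for the forward store, a list stands in for the Kafka stream, and the `bootstrap` argument stands in for loading a snapshot at startup. `NrtForwardIndex` and `search_sorted_by_price` are hypothetical names, not part of Solr or of the system described in the talk.

```python
class NrtForwardIndex:
    """In-memory store of fast-changing fields, keyed by listing id."""

    def __init__(self, bootstrap=None):
        # bootstrap stands in for loading a snapshot at startup
        self.values = dict(bootstrap or {})

    def apply(self, event):
        """Apply one update event; no index commit is involved."""
        fields = self.values.setdefault(event["listing_id"], {})
        fields[event["field"]] = event["value"]

    def get(self, listing_id, field, default=None):
        return self.values.get(listing_id, {}).get(field, default)


def search_sorted_by_price(matched_ids, forward_index):
    """Sort already-matched listing ids by their current price."""
    return sorted(
        matched_ids,
        key=lambda lid: forward_index.get(lid, "price", float("inf")),
    )


forward = NrtForwardIndex(bootstrap={"l1": {"price": 500}, "l2": {"price": 450}})
matched = ["l1", "l2"]            # ids the inverted index matched
before = search_sorted_by_price(matched, forward)

update_stream = [{"listing_id": "l1", "field": "price", "value": 399}]
for event in update_stream:       # stands in for consuming a topic
    forward.apply(event)
after = search_sorted_by_price(matched, forward)
```

Here the sort order changes as soon as the event is applied, without any commit, replicate, or reopen step on the matching side.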
39. Accomplishments
• Real time sorting
• Real time filtering : PostFilter
– Higher latency
• Near real time filtering : cached DocIdSet
– No consistency between lookup and filtering
• Independent of Lucene commits
• Query latency comparable to DocValues
– Consistent 99% performance
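The trade-off between the two filtering modes listed above can be sketched as follows. This is an illustration only: `post_filter` and `CachedInStockSet` are hypothetical names, and a dict stands in for the live attribute store. The per-document check does work proportional to the number of matches on every query, while the cached set is cheap to consult but can disagree with a live lookup until it is refreshed.

```python
def post_filter(matched_ids, in_stock):
    """Real-time filtering: check every matched id against the live store."""
    return [lid for lid in matched_ids if in_stock.get(lid, False)]


class CachedInStockSet:
    """Near-real-time filtering: a snapshot set rebuilt only on refresh()."""

    def __init__(self, in_stock):
        self.in_stock = in_stock
        self.snapshot = frozenset()
        self.refresh()

    def refresh(self):
        self.snapshot = frozenset(k for k, v in self.in_stock.items() if v)

    def filter(self, matched_ids):
        return [lid for lid in matched_ids if lid in self.snapshot]


in_stock = {"l1": True, "l2": True, "l3": False}
cached = CachedInStockSet(in_stock)
matched = ["l1", "l2", "l3"]

in_stock["l2"] = False                         # a real-time availability update
live_result = post_filter(matched, in_stock)   # sees the update immediately
stale_result = cached.filter(matched)          # still uses the old snapshot
cached.refresh()
fresh_result = cached.filter(matched)          # consistent again after refresh
```

Between the update and the refresh, the cached filter still returns `l2` even though a direct lookup reports it out of stock, which is the inconsistency the slide notes.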
40. Accomplishments @ Flipkart
● Real time consumption for ~150 Signals
● 2X reduction in out-of-stock products shown
● Production instances handling ~50K real-time updates/second