near real time search in e-commerce

OCTOBER 11-14, 2016 • BOSTON, MA
EXPEDIA INTERNAL CONFERENCE • JUNE 13th 2017

Near Real time Indexing
Building Real Time Search Index For E-Commerce
Umesh Prasad
Advisory & Professional Services @Lucidworks
aka Techie/Individual Contributor
Yogesh Ahuja
Senior Search Architect @Lucidworks

● SOLR/LUCENE
○ User → Hacker → Contributor
○ [ Lucene 2.1 to 6.5 ]
● Advisory/Consulting @ Lucidworks (4 months)
● Search & Data Platform @ Flipkart
○ 4.5 years
● Payments @ Amazon
○ 1.4 years
● Vertical Search @ Verse Innovation & Naukri (4.8 years)
○ LUCENE 2.1

Agenda
• Ecommerce Search
• Need for Real Time Search
• SolrCloud Solution
• Alternatives
• First Principle approach
• Q & A

E-commerce Search
50 main categories
500 sub categories
231 million docs
- 90 million SKUs
- 160 million listings
- result collapsing
drill down filters
top positions at premium

800K active users
160K requests per sec
- 40K service
- 10K Solr
median: 11 ms
99th percentile: 1.1 sec

!! Flipkart [Sherlock] has BBD Deals[an Offer] ??[expired]

!! Steal Deals !!- Stolen .. @Search/Sherlock Team : Please investigate
Please read the code and review architecture very carefully

Product / Listing: Important Attributes
• Product aka SKU
• Listings
• Attributes come from:
– Seller Rating Service
– Catalogue Service
– Promise Service
– Availability Service
– Offer Service
– Pricing Service

Summary : Lucene Document
• Product/SKU [Parent Document]
– Listing [Child Document]
• Query = Mostly SKU Attributes [Free Text]
• Filters = SKU + Listing Attributes [Drill Down]
• Ranking = SKU + Listing Attributes [Explicit/Relevance]
• Index Time Join aka Block Join
– [Best Performance]
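
The parent/child layout above can be sketched in a few lines. This is a toy model, not Lucene code: it assumes the usual block join convention that the child listings of a SKU are indexed contiguously with the parent SKU as the last document of the block. The names `build_block`, `parent_of` and `skus_with_listing`, and the sample IDs and prices, are hypothetical.

```python
from bisect import bisect_left

def build_block(sku, listings):
    """Return the documents of one block: child listings first, parent SKU last."""
    docs = [dict(listing, type="listing") for listing in listings]
    docs.append(dict(sku, type="sku"))
    return docs

# Doc IDs are simply positions in this list, as in an append-only segment.
index = []
index += build_block({"id": "P1", "brand": "Apple"},
                     [{"id": "L1", "price": 45000}, {"id": "L2", "price": 44000}])
index += build_block({"id": "P2", "brand": "Samsung"},
                     [{"id": "L3", "price": 23000}])

# Sorted doc IDs of all parents (a real engine would keep this as a bitset).
parent_ids = [i for i, d in enumerate(index) if d["type"] == "sku"]

def parent_of(child_doc_id):
    """The parent of a child is the first parent doc ID at or after the child."""
    return parent_ids[bisect_left(parent_ids, child_doc_id)]

def skus_with_listing(predicate):
    """Filter on listing attributes, then join each hit to its SKU."""
    hits = {parent_of(i) for i, d in enumerate(index)
            if d["type"] == "listing" and predicate(d)}
    return [index[p]["id"] for p in sorted(hits)]
```

For example, `skus_with_listing(lambda d: d["price"] < 30000)` joins listing L3 to its SKU P2. The join needs no lookup table because the block layout itself encodes the relationship, which is also why the whole block has to be rewritten when any document in it changes.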

Out Of Stock, but Why Show?
Index has stale availability data
234K products

S1.1 : SolrCloud : Principal Engineer Lens
[Diagram]
• Ingestion pipeline → Shard Leader
– Batch of documents
– For each Document: Versioning → Update Log → Forward to Replica
– Auto commit / Soft Commit
• Shard Replica (×6): each one re-opens its searcher

S1.2 : SolrCloud : Director Lens

S1.3 : SolrCloud : Monitoring Lens
[Very High Update Rates]

                   updates / sec             updates / hr
                   normal      peak
text / catalogue   ~10         ~100          ~100K
pricing            ~100        ~1K           ~10 million
availability       ~100        ~10K          ~10 million
offer              ~100        ~10K          ~10 million
seller rating      ~10         ~1K           ~1 million
signal 6           ~10         ~100          ~1 million
signal 7           ~100        ~10K          ~10 million
signal 8           ~100        ~10K          ~10 million

S1.4 : SolrCloud : Principal Engineer Lens
[Diagram] Ingestion pipeline
• Sources: Catalogue, Pricing, Availability, Offers ...
• Updates Stream 1, 2, 3 → Change Propagation → Document Builder → Documents {L1, L2 … P1} → Solr/Lucene
● Lucene doesn’t support Partial Updates
● Update = Delete + Add

S1.5 : SolrCloud : Internal Email
• Update = Delete + Add
– Block Join Index ⇒ Update Whole Block (Product + Listings)
• Updated Document gets streamed to all replicas in sync
– Reduces indexing throughput
• Soft commit is Not Free
– Soft commit ⇒ In Memory Segment
– Lots of Merges
– Huge document churn / deletes
– All caches still need to be re-generated
– Filter cache misses especially hurt performance
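
The cost of "Update = Delete + Add" on a block join index can be illustrated with a toy append-only index. This is a sketch under simplifying assumptions (a single in-memory segment, no merges, no replication); `AppendOnlyIndex` and its methods are hypothetical names, not Solr or Lucene APIs.

```python
class AppendOnlyIndex:
    """Toy index: documents are never modified in place, only appended."""

    def __init__(self):
        self.docs = []     # append-only document storage
        self.live = []     # live[i] becomes False once doc i is deleted
        self.blocks = {}   # SKU id -> doc IDs of its current block

    def add_block(self, sku_id, block):
        ids = []
        for doc in block:
            self.docs.append(doc)
            self.live.append(True)
            ids.append(len(self.docs) - 1)
        self.blocks[sku_id] = ids

    def update_listing(self, sku_id, listing_id, **changes):
        """Changing one listing deletes and re-adds the whole block."""
        new_block = []
        for i in self.blocks[sku_id]:
            doc = dict(self.docs[i])
            if doc["id"] == listing_id:
                doc.update(changes)
            new_block.append(doc)
            self.live[i] = False          # tombstone the old copy
        self.add_block(sku_id, new_block)

    def deleted_count(self):
        return self.live.count(False)


idx = AppendOnlyIndex()
idx.add_block("P1", [{"id": "L1", "price": 45000},
                     {"id": "L2", "price": 44000},
                     {"id": "P1", "brand": "Apple"}])
idx.update_listing("P1", "L1", price=43000)
```

After this single price change the index holds six documents, three of them deleted. At thousands of updates per second that churn is what drives the merges and cache regeneration listed above.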

S2xx : Alternatives
1. Updatable DocValues
a. [LUCENE-5189], [SOLR-5944], [under-the-hood],[benchmarking]
b. Doesn’t scale to a large number of fields/multi valued fields
2. Parallel Indexes : Basically Term Partitioned Indexes [Updates in Redis]
a. ParallelReader : Warning: It is up to you to make sure all indexes are created and modified the same way.
b. Prototype worked :
i. Works for small indexes + lots of updates or huge index + daily build
ii. Not for large index + streaming updates + lots of qps
iii. Pulling Changes + Document Building killed it
3. Lucene Codecs API :
a. Works for Prototype : Research/Algorithms/small index
b. Not for e-commerce marketplace use case. Index corruption/2 phase commit is difficult.

Lucene Index → Lucene Segment

Documents
ProductA   brand : Apple     availability : T   price : 45000
ProductB   brand : Samsung   availability : T   price : 23000
ProductC   brand : Apple     availability : F   price : 5000

Document ID Mappings
0 ProductA
1 ProductB
2 ProductC

Posting List (Inverted Index): Terms → Sparse Bitsets
brand : Apple      0 , 2
brand : Samsung    1
availability : T   0 , 1

DocValues (columnar data)
Price   45000 23000 5000
A Typical Search Flow
[Diagram]
• Flow: Query → Query Rewrite → Matching → Ranking → Faceting / Stats → Other Components → Results
• Example query: samsung mobiles, Offer : exchange offer, price desc
• After rewrite: category : mobiles, brand : samsung, Offer : exchange offer
• Data read per Lucene Segment: Posting List (Inverted Index), Doc Values (Forward Index), NRT Store

NRT Forward Index - Considerations
● Lookup efficiency
– 50th percentile : ~10K matches
– 99th percentile : ~1 million matches
● Data on Java heap
– Memory efficiency

HashMap based Implementation
Lucene Segment (DocId → ProductId)
0 ProductB
1 ProductA
2 ProductC
3 ProductD

NRT Forward Index
ProductId   Availability   Price
ProductA    True           100
ProductB    False          150
ProductC    False          200
ProductD    True           250

Lookup Engine
DocId : 3, field : price → ProductId(3) = ProductD → <ProductD,price> → 250
Latency : ~10 secs for ~1 Million lookups
Foreign Key + Array Based Implementation
Lucene Segment (DocId → ProductId)
0 ProductB
1 ProductA
2 ProductC
3 ProductD

DocId - NrtId
0 → 3
1 → 0
2 → 1
3 → 2

NRT Forward Index (Segment Independent)
NrtId   ProductId   Price   Availability   Status
0       ProductA    100     T              01
1       ProductC    200     F              10
2       ProductD    250     F              01
3       ProductB    150     T              00

Lookup Engine
DocId : 3, field : price → NrtId(3) = 2 → Price(2) = 250
Latency : ~100 ms for ~1 Million lookups
NRT Store Filter - PostFilter
PostFilter(Price:[100 TO 150])
Lucene Segment (DocId → ProductId)
0 ProductB
1 ProductA
2 ProductC
3 ProductD

DocId - NrtId
0 → 3
1 → 0
2 → 1
3 → 2

NRT Forward Index (Segment Independent)
NrtId   ProductId   Price   Availability   Status
0       ProductA    100     T              01
1       ProductC    200     F              10
2       ProductD    250     F              01
3       ProductB    150     T              00

DocId : 3 → NrtId(3) = 2 → Price(2) = 250 → outside [100 TO 150] → Don’t Delegate

for d in [matched-docs]
    collect d
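
The post-filter loop can be sketched with the same sample data. The class below is loosely modeled on the way a Solr PostFilter wraps a delegating collector, but it is plain Python with hypothetical names (`PricePostFilter`, `collect`), not the Solr API: each matched doc is checked against the live NRT price and passed on only if it falls in range.

```python
doc_to_nrt = [3, 0, 1, 2]        # per-segment doc ID -> NrtId
price = [100, 200, 250, 150]     # NRT forward index, indexed by NrtId

class PricePostFilter:
    """Runs after the main query; sees only documents that already matched."""

    def __init__(self, delegate, low, high):
        self.delegate = delegate   # next collector in the chain
        self.low = low
        self.high = high

    def collect(self, doc_id):
        current_price = price[doc_to_nrt[doc_id]]
        if self.low <= current_price <= self.high:
            self.delegate(doc_id)  # in range: pass downstream
        # otherwise: don't delegate, the document is dropped


collected = []
post_filter = PricePostFilter(collected.append, 100, 150)
for d in [0, 1, 2, 3]:             # docs matched by the main query
    post_filter.collect(d)
```

With the filter `Price:[100 TO 150]`, docs 0 (150) and 1 (100) are delegated, while docs 2 (200) and 3 (250) are not. The filter reads the forward index at query time, so it always reflects the latest price, at the cost of one lookup per matched document.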

NRT Filter
NRT Store - Invert index
[Diagram] NRT Forward Store → NRT Inverter → NRT DocIdSet Cache

Lucene Segment (DocId → ProductId)
0 ProductB
1 ProductA
2 ProductC
3 ProductD

NRT DocIdSet Cache
Availability : T   0 3
Offer : O1         2 3

Filter Offer:O1 → cached DocIdSet
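
A rough sketch of the inverter idea: scan the forward store once per (field, value), cache the resulting set of doc IDs, and reuse it as a filter until it is refreshed. `NrtInverter` and its methods are hypothetical names, and the `offer` column values are assumed so that the result matches the `Offer : O1` entry on the slide.

```python
class NrtInverter:
    """Builds and caches per-segment doc ID sets from NRT forward columns."""

    def __init__(self, doc_to_nrt, columns):
        self.doc_to_nrt = doc_to_nrt   # per-segment doc ID -> NrtId
        self.columns = columns         # field -> values indexed by NrtId
        self.cache = {}                # (field, value) -> frozenset of doc IDs

    def doc_id_set(self, field, value):
        key = (field, value)
        if key not in self.cache:
            column = self.columns[field]
            self.cache[key] = frozenset(
                d for d, n in enumerate(self.doc_to_nrt) if column[n] == value)
        return self.cache[key]

    def refresh(self):
        """Drop cached sets so the next filter sees current values."""
        self.cache.clear()


# Assumed sample data: ProductC (NrtId 1) and ProductD (NrtId 2) carry offer O1.
columns = {"offer": ["none", "O1", "O1", "none"]}
inverter = NrtInverter([3, 0, 1, 2], columns)

before = inverter.doc_id_set("offer", "O1")
columns["offer"][1] = "none"                   # ProductC's offer expires
stale = inverter.doc_id_set("offer", "O1")     # served from cache
inverter.refresh()
fresh = inverter.doc_id_set("offer", "O1")
```

Here `before` and `stale` both contain docs 2 and 3, and only `fresh` drops doc 2. That gap between the forward store and the cached set is the price of fast filtering: lookups are real time, cached filters are only near real time.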

Solr Integration Points
• ValueSources
• Filtering
– Custom Filter Implementation for cached DocIdSet
– Custom PostFilter
• Query
– Wrapper over Filter
• Custom FacetComponent

Near Real Time Solr Architecture
[Diagram]
• Sources: Catalogue, Pricing, Availability, Offers, Seller Quality, Others
• Ingestion pipeline → Kafka
• Lucene Updates → Solr Master → Commit + Replicate + Reopen → Solr (Lucene)
• NRT Updates → NRT Forward Index (Bootstrap: Redis) → NRT Inverted store
• Solr query path: Matching, Ranking, Faceting

Accomplishments
• Real time sorting
• Real time filtering : PostFilter
– Higher latency
• Near real time filtering : cached DocIdSet
– No consistency between lookup and filtering
• Independent of Lucene commits
• Query latency comparable to DocValues
– Consistent 99th percentile performance

Accomplishments @ Flipkart
● Real time consumption for ~150 Signals
● 2X reduction in out-of-stock products shown
● Production instances of ~50K updates/second real time

Search @ Flipkart
• Catalogue
– ~ 50 main categories
– ~ 5000 sub-categories
– ~ 231 million documents
– ~ 90 million SKUs
– ~ 160 million listings
• E-commerce Marketplace
– ~ 100K Sellers
– Local Sellers
– Regional Availability
– Logistics Constraints

Professional LinkedIn through Logos

Personal Journey through Pictures
