SlideShare una empresa de Scribd logo
1 de 45
Descargar para leer sin conexión
OCTOBER 11-14, 2016 •
BOSTON, MA
EXPEDIA INTERNAL
CONFERENCE
JUNE 13th 2017
Near Real time Indexing
Building Real Time Search Index For E-Commerce
Umesh Prasad
Advisory & Professional Services @Lucidworks
aka Techie/Individual Contributor
Yogesh Ahuja
Senior Search Architect @Lucidworks
● SOLR/LUCENE
○ User → Hacker →
Contributor
○ [ Lucene 2.1 to 6.5 ]
● Advisory/Consulting @
Lucidworks (4 months)
● Search & Data Platform @
Flipkart
○ 4.5 years
● Payments @ Amazon
○ 1.4 years
● Vertical Search @ Verse
Innovation & Naukri ( 4.8
years )
○ LUCENE 2.1
Agenda
• Ecommerce Search
• Need for Real Time Search
• SolrCloud Solution
• Alternatives
• First Principle approach
• Q & A
E-commerce Search
50 main categories
500 sub categories
231 million docs
- 90 million sku
- 160 million
listings
- result
collapsing
drill down
filters
top positions at
premium
800K active
users
160K requests per sec
- 40K service
- 10k solr
median : 11 ms
99th perc: 1.1 sec
!! Flipkart [Sherlock] has BBD Deals[an Offer] ??[expired]
!! Steal Deals !!- Stolen .. @Search/Sherlock Team : Please investigate
Please read the code and review architecture very carefully
CUSTOMER EXPERIENCE
Product /Listing: Important Attributes
Seller
Rating
Service
catalogue
service
Promise
Service
Availability
Service
Offer
Service
Pricing
Service
Product aka SKU
Listings
Summary : Lucene Document
• Product/SKU [Parent Document]
– Listing [Child Document]
• Query = Mostly SKU Attributes [Free Text]
• Filters = SKU + Listing Attributes [Drill Down]
• Ranking = SKU + Listing Attributes [Explicit/Relevance]
• Index Time Join aka Block Join
– [Best Performance]
Out Of Stock, but Why Show?
Index has Stale
Availability Data
234K
Products
Shard
Replica
Shard
Replica
Shard
Replica
Shard
Replica
Shard
Replica
Shard
Replica
Re-open
searcher
Re-open
searcher
Re-open
searcher
Re-open
searcher
Re-open
searcher
Re-open
searcher
Ingestion pipeline Shard
Leader
Auto commit
Soft Commit
Batch of
documents
For each Document
Versioning
Update Log
Forward to Replica
S1.1 : SolrCloud : Principal Engineer Lens
S1.2 : SolrCloud : Director Lens
updates / sec updates /hr
normal Peak
text / catalogue ~10 ~100 ~100K
pricing ~100 ~1K ~10 million
availability ~100 ~10K ~10 million
offer ~100 ~10K ~10 million
seller rating ~10 ~1K ~1 million
signal 6 ~10 ~100 ~1 million
signal 7 ~100 ~10K ~10 million
signal 8 ~100 ~10K ~10 million
S1.3 : SolrCloud : Monitoring Lens
[Very High Update Rates]
Ingestion pipeline
Catalogue Pricing Availability Offers ...
Document Builder
Solr/Lucene
Change
Propagation
Documents
{L1,L2 … P1}
Updates Stream 1
Updates Stream 2
Updates Stream 3
● Lucene doesn’t support Partial Updates
● Update = Delete + Add
S1.4 : SolrCloud : Principal Engineer Lens
S1.5 : SolrCloud : Internal Email
• Update = Delete + Add
– Block Join Index ⇒ Update Whole Block (Product + Listings)
• Updated Document gets streamed to all replicas in sync
– Reduces indexing throughput
• Soft commit is Not Free
– Soft commit ⇒ In Memory Segment
– Lots of Merges
– Huge document churn / deletes
– All caches still need to be re-generated
– Filter Cache miss specially hurts performance
S2xx : Alternatives
1. Updatable DocValues
a. [LUCENE-5189], [SOLR-5944], [under-the-hood],[benchmarking]
b. Doesn’t scale to a large number of fields/multi valued fields
2. Parallel Indexes : Basically Term Partitioned Indexes [Updates in Redis]
a. ParallelReader : Warning: It is up to you to make sure all indexes are created and modified the same way.
b. Prototype worked :
i. Works for small indexes + lots of updates or huge index + daily build
ii. Not for large index + streaming updates + lots of qps
iii. Pulling Changes + Document Building killed it
3. Lucene Codecs API :
a. Works for Prototype : Research/Algorithms/small index
b. Not for e-commerce marketplace use case. Index corruption/2 phase
commit is difficult.
S3xx : Drawing Board
ProductA
brand : Apple
availability : T
price : 45000
ProductB
brand : Samsung
availability : T
price : 23000
ProductC
brand : Apple
availability : F
price : 5000
Document ID
Mappings
Posting List
(Inverted Index)
DocValues
(columunar data)
Lucene Segment
Lucene Index
0 ProductA
1 ProductB
2 ProductC
45000 23000 5000Price
availability : T
brand : Samsung
brand : Apple 0 , 2
1
0 , 1
Terms
Sparse
Bitsets
A Typical Search Flow
Query Rewrite
Results
Query
Matching
Ranking Faceting
Stats
Posting List
Doc Values
Other
Components
Lucene Segment
Inverted Index
Forward Index
NRT Store
samsung mobiles
Offer : exchange offer
price desc
category : mobiles
brand : samsung
Offer : exchange offer
NRT Forward Index - Considerations
● Lookup efficiency
– 50th percentile : ~10K matches
– 99th percentile : ~1 million matches
● Data on Java heap
– Memory efficiency
HashMap based Implementation
NRT Forward IndexLucene Segment
Lookup Engine
0 ProductB
1 ProductA
2 ProductC
3 ProductD
ProductD
ProductA
ProductB
ProductC
ProductD
True
False
False
True
100
150
200
250
ProductId(3) <ProductD,price>
250
ProductId Availability Price
Latency : ~10 secs for ~1 Million
lookups
DocId : 3
field : price
Foreign Key + Array Based Implementation
Lookup Engine
Lucene Segment
0 ProductB
1 ProductA
2 ProductC
3 ProductD
250
DocId - NrtId
0
1
2
3
3
0
1
2
NrtId(3)
2
Price(2)
NRT Forward Index (Segment Independent)
100 200 250 150Price
0 ProductA
1 ProductC
2 ProductD
3 ProductB
Availability T F F T
Status 01 10 01 00
Latency : ~100 ms for ~1 Million lookups
DocId : 3
field : price
NRT Store Filter - PostFilter
PostFilter(Price:[100 TO 150])
Lucene Segment
0 ProductB
1 ProductA
2 ProductC
3 ProductD
Don’t
Delegate
DocId - NrtId
0
1
2
3
3
0
1
2
DocId : 3
NrtId(3)
2
Price(2)
NRT Forward Index (Segment Independent)
100 200 250 150Price
0 ProductA
1 ProductC
2 ProductD
3 ProductB
Availability T F F T
Status 01 10 01 00
for d in [matched-docs]
collect d
NRT Filter
NRT Store - Invert index
NRT Forward Store
NRT Inverter
Lucene Segment
0 ProductB
1 ProductA
2 ProductC
3 ProductD
NRT DocIdSet Cache
Availability : T 0 3
Offer : O1 2 3
Offer:O1 DocIdSet
Solr Integration Points
• ValueSources
• Filtering
– Custom Filter Implementation for cached DocIdSet
– Custom PostFilter
• Query
– Wrapper over Filter
• Custom FacetComponent
Near Real Time Solr Architecture
Solr
Kafka
Ingestion pipeline
NRT Forward
Index
Ranking
Matching
Faceting
Redis
Bootstrap
NRT Inverted
store
Solr Master
NRT Updates
Lucene Updates
Catalogue
Pricing
Availability
Offers
Seller
Quality
Commit
+
Replicate
+
Reopen
Lucene
Others
Accomplishments
• Real time sorting
• Real time filtering : PostFilter
– Higher latency
• Near real time filtering : cached DocIdSet
– No consistency between lookup and filtering
• Independent of lucene commits
• Query latency comparable to DocValues
– Consistent 99% performance
Accomplishments @ Flipkart
● Real time consumption for ~150 Signals
● Reduction in shown out of stock products by 2X
● Production instances of ~50K updates/second real time
Thank you
&
Questions
Search @ Flipkart
• Catalogue
– ~ 50 main categories
– ~ 5000 sub-categories
– ~ 231 million documents
– ~ 90 million SKUs
– ~ 160 million listings
• E-commerce Marketplace
– ~ 100K Sellers
– Local Sellers
– Regional Availability
– Logistics Constraints
Professional LinkedIn through Logos
Personal Journey through Pictures
F
U
T
U
R
E
D
I
R
E
C
T
I
O
N

Más contenido relacionado

La actualidad más candente

Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, RocanaSolr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
Lucidworks
 
Big Data Pipeline and Analytics Platform
Big Data Pipeline and Analytics PlatformBig Data Pipeline and Analytics Platform
Big Data Pipeline and Analytics Platform
Sudhir Tonse
 
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
Lucidworks
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Flink Forward
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
Flink Forward
 
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick EvansRealtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans
Spark Summit
 

La actualidad más candente (20)

Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte Scale
 
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, RocanaSolr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana
 
Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At Airbnb
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Big Data Pipeline and Analytics Platform
Big Data Pipeline and Analytics PlatformBig Data Pipeline and Analytics Platform
Big Data Pipeline and Analytics Platform
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
 
Building Data Pipelines in Python
Building Data Pipelines in PythonBuilding Data Pipelines in Python
Building Data Pipelines in Python
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
 
Journeys from Kafka to Parquet
Journeys from Kafka to ParquetJourneys from Kafka to Parquet
Journeys from Kafka to Parquet
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick EvansRealtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
Fabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and BytesFabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and Bytes
 
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
 

Similar a near real time search in e-commerce

Business Track: How Criteo Scaled and Supported Massive Growth with MongoDB
Business Track: How Criteo Scaled and Supported Massive Growth with MongoDBBusiness Track: How Criteo Scaled and Supported Massive Growth with MongoDB
Business Track: How Criteo Scaled and Supported Massive Growth with MongoDB
MongoDB
 

Similar a near real time search in e-commerce (20)

Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
 
The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation Engines
 
Search-Based Serving Architecture of Embeddings-Based Recommendations (RecSys...
Search-Based Serving Architecture of Embeddings-Based Recommendations (RecSys...Search-Based Serving Architecture of Embeddings-Based Recommendations (RecSys...
Search-Based Serving Architecture of Embeddings-Based Recommendations (RecSys...
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation Engines
 
#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter
 
Patterns of Streaming Applications
Patterns of Streaming ApplicationsPatterns of Streaming Applications
Patterns of Streaming Applications
 
Event-Driven Applications Done Right - Pulsar Summit SF 2022
Event-Driven Applications Done Right - Pulsar Summit SF 2022Event-Driven Applications Done Right - Pulsar Summit SF 2022
Event-Driven Applications Done Right - Pulsar Summit SF 2022
 
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
 
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
 
Efficient Query Processing Infrastructures
Efficient Query Processing InfrastructuresEfficient Query Processing Infrastructures
Efficient Query Processing Infrastructures
 
Spark + AI Summit recap jul16 2020
Spark + AI Summit recap jul16 2020Spark + AI Summit recap jul16 2020
Spark + AI Summit recap jul16 2020
 
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014
 
Business Track: How Criteo Scaled and Supported Massive Growth with MongoDB
Business Track: How Criteo Scaled and Supported Massive Growth with MongoDBBusiness Track: How Criteo Scaled and Supported Massive Growth with MongoDB
Business Track: How Criteo Scaled and Supported Massive Growth with MongoDB
 
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
 
AWS Customer Presentation - JovianDATA
AWS Customer Presentation - JovianDATAAWS Customer Presentation - JovianDATA
AWS Customer Presentation - JovianDATA
 
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
 
How Criteo Scaled and Supported Massive Growth with MongoDB (2013)
How Criteo Scaled and Supported  Massive Growth with MongoDB (2013)How Criteo Scaled and Supported  Massive Growth with MongoDB (2013)
How Criteo Scaled and Supported Massive Growth with MongoDB (2013)
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 

near real time search in e-commerce

  • 1. OCTOBER 11-14, 2016 • BOSTON, MA EXPEDIA INTERNAL CONFERENCE JUNE 13th 2017
  • 2. Near Real time Indexing Building Real Time Search Index For E-Commerce Umesh Prasad Advisory & Professional Services @Lucidworks aka Techie/Individual Contributor Yogesh Ahuja Senior Search Architect @Lucidworks
  • 3. ● SOLR/LUCENE ○ User → Hacker → Contributor ○ [ Lucene 2.1 to 6.5 ] ● Advisory/Consulting @ Lucidworks (4 months) ● Search & Data Platform @ Flipkart ○ 4.5 years ● Payments @ Amazon ○ 1.4 years ● Vertical Search @ Verse Innovation & Naukri ( 4.8 years ) ○ LUCENE 2.1
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14. Agenda • Ecommerce Search • Need for Real Time Search • SolrCloud Solution • Alternatives • First Principle approach • Q & A
  • 15. E-commerce Search 50 main categories 500 sub categories 231 million docs - 90 million sku - 160 million listings - result collapsing drill down filters top positions at premium
  • 16. 800K active users 160K requests per sec - 40K service - 10k solr median : 11 ms 99th perc: 1.1 sec
  • 17. !! Flipkart [Sherlock] has BBD Deals[an Offer] ??[expired]
  • 18. !! Steal Deals !!- Stolen .. @Search/Sherlock Team : Please investigate Please read the code and review architecture very carefully
  • 20. Product /Listing: Important Attributes Seller Rating Service catalogue service Promise Service Availability Service Offer Service Pricing Service Product aka SKU Listings
  • 21. Summary : Lucene Document • Product/SKU [Parent Document] – Listing [Child Document] • Query = Mostly SKU Attributes [Free Text] • Filters = SKU + Listing Attributes [Drill Down] • Ranking = SKU + Listing Attributes [Explicit/Relevance] • Index Time Join aka Block Join – [Best Performance]
  • 22. Out Of Stock, but Why Show? Index has Stale Availability Data 234K Products
  • 24. S1.2 : SolrCloud : Director Lens
  • 25. updates / sec updates /hr normal Peak text / catalogue ~10 ~100 ~100K pricing ~100 ~1K ~10 million availability ~100 ~10K ~10 million offer ~100 ~10K ~10 million seller rating ~10 ~1K ~1 million signal 6 ~10 ~100 ~1 million signal 7 ~100 ~10K ~10 million signal 8 ~100 ~10K ~10 million S1.3 : SolrCloud : Monitoring Lens [Very High Update Rates]
  • 26. Ingestion pipeline Catalogue Pricing Availability Offers ... Document Builder Solr/Lucene Change Propagation Documents {L1,L2 … P1} Updates Stream 1 Updates Stream 2 Updates Stream 3 ● Lucene doesn’t support Partial Updates ● Update = Delete + Add S1.4 : SolrCloud : Principal Engineer Lens
  • 27. S1.5 : SolrCloud : Internal Email • Update = Delete + Add – Block Join Index ⇒ Update Whole Block (Product + Listings) • Updated Document gets streamed to all replicas in sync – Reduces indexing throughput • Soft commit is Not Free – Soft commit ⇒ In Memory Segment – Lots of Merges – Huge document churn / deletes – All caches still need to be re-generated – Filter Cache miss specially hurts performance
  • 28. S2xx : Alternatives 1. Updatable DocValues a. [LUCENE-5189], [SOLR-5944], [under-the-hood],[benchmarking] b. Doesn’t scale to a large number of fields/multi valued fields 2. Parallel Indexes : Basically Term Partitioned Indexes [Updates in Redis] a. ParallelReader : Warning: It is up to you to make sure all indexes are created and modified the same way. b. Prototype worked : i. Works for small indexes + lots of updates or huge index + daily build ii. Not for large index + streaming updates + lots of qps iii. Pulling Changes + Document Building killed it 3. Lucene Codecs API : a. Works for Prototype : Research/Algorithms/small index b. Not for e-commerce marketplace use case. Index corruption/2 phase commit is difficult.
  • 29. S3xx : Drawing Board
  • 30. ProductA brand : Apple availability : T price : 45000 ProductB brand : Samsung availability : T price : 23000 ProductC brand : Apple availability : F price : 5000 Document ID Mappings Posting List (Inverted Index) DocValues (columunar data) Lucene Segment Lucene Index 0 ProductA 1 ProductB 2 ProductC 45000 23000 5000Price availability : T brand : Samsung brand : Apple 0 , 2 1 0 , 1 Terms Sparse Bitsets
  • 31. A Typical Search Flow Query Rewrite Results Query Matching Ranking Faceting Stats Posting List Doc Values Other Components Lucene Segment Inverted Index Forward Index NRT Store samsung mobiles Offer : exchange offer price desc category : mobiles brand : samsung Offer : exchange offer
  • 32. NRT Forward Index - Considerations ● Lookup efficiency – 50th percentile : ~10K matches – 99th percentile : ~1 million matches ● Data on Java heap – Memory efficiency
  • 33. HashMap based Implementation NRT Forward IndexLucene Segment Lookup Engine 0 ProductB 1 ProductA 2 ProductC 3 ProductD ProductD ProductA ProductB ProductC ProductD True False False True 100 150 200 250 ProductId(3) <ProductD,price> 250 ProductId Availability Price Latency : ~10 secs for ~1 Million lookups DocId : 3 field : price
  • 34. Foreign Key + Array Based Implementation Lookup Engine Lucene Segment 0 ProductB 1 ProductA 2 ProductC 3 ProductD 250 DocId - NrtId 0 1 2 3 3 0 1 2 NrtId(3) 2 Price(2) NRT Forward Index (Segment Independent) 100 200 250 150Price 0 ProductA 1 ProductC 2 ProductD 3 ProductB Availability T F F T Status 01 10 01 00 Latency : ~100 ms for ~1 Million lookups DocId : 3 field : price
  • 35. NRT Store Filter - PostFilter PostFilter(Price:[100 TO 150]) Lucene Segment 0 ProductB 1 ProductA 2 ProductC 3 ProductD Don’t Delegate DocId - NrtId 0 1 2 3 3 0 1 2 DocId : 3 NrtId(3) 2 Price(2) NRT Forward Index (Segment Independent) 100 200 250 150Price 0 ProductA 1 ProductC 2 ProductD 3 ProductB Availability T F F T Status 01 10 01 00 for d in [matched-docs] collect d
  • 36. NRT Filter NRT Store - Invert index NRT Forward Store NRT Inverter Lucene Segment 0 ProductB 1 ProductA 2 ProductC 3 ProductD NRT DocIdSet Cache Availability : T 0 3 Offer : O1 2 3 Offer:O1 DocIdSet
  • 37. Solr Integration Points • ValueSources • Filtering – Custom Filter Implementation for cached DocIdSet – Custom PostFilter • Query – Wrapper over Filter • Custom FacetComponent
  • 38. Near Real Time Solr Architecture Solr Kafka Ingestion pipeline NRT Forward Index Ranking Matching Faceting Redis Bootstrap NRT Inverted store Solr Master NRT Updates Lucene Updates Catalogue Pricing Availability Offers Seller Quality Commit + Replicate + Reopen Lucene Others
  • 39. Accomplishments • Real time sorting • Real time filtering : PostFilter – Higher latency • Near real time filtering : cached DocIdSet – No consistency between lookup and filtering • Independent of lucene commits • Query latency comparable to DocValues – Consistent 99% performance
  • 40. Accomplishments @ Flipkart ● Real time consumption for ~150 Signals ● Reduction in shown out of stock products by 2X ● Production instances of ~50K updates/second real time
  • 42. Search @ Flipkart • Catalogue – ~ 50 main categories – ~ 5000 sub-categories – ~ 231 million documents – ~ 90 million SKUs – ~ 160 million listings • E-commerce Marketplace – ~ 100K Sellers – Local Sellers – Regional Availability – Logistics Constraints