LEAN RANKING INFRASTRUCTURE WITH SOLR
Ranking infrastructure and using Solr for lean prototyping
sergii.khomenko@stylight.com, @lc0d3r

Slide 1 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
AGENDA

1. Problem definition
2. Boosting
3. Lean approach to ranking infrastructure
4. Real-world examples

LUCENE, SOLR, ELASTICSEARCH
SOLR USERS

STYLIGHT – THE BEST PLACE TO DISCOVER FASHION

GET INSPIRED BY LOOKS CREATED BY COMMUNITY

DISCOVER THOUSANDS OF BRANDS AND MILLIONS OF PRODUCTS

STYLIGHT – INTERNATIONAL COMMUNITY
Live in 13 countries

PROBLEM DEFINITION

Ranking specifics:
• Seasonal influence
• Trends
• Cold start of new countries, shops
• Multiple dimensions of ranking model

IMPROVING RELEVANCE
TF-IDF - default scoring model in Lucene/Solr

• matching more query terms is better
• more occurrences of a query term is better
• more novel terms increase doc score more than common terms
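
The three rules above can be illustrated with a small Python sketch. This is a simplified TF-IDF, not Lucene's exact scoring formula (no field norms, coordination factor or query normalization); the function name and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_score(query_terms, doc_terms, corpus):
    """Simplified TF-IDF: sum of sqrt(tf) * idf over the query terms.

    Assumption: this only mirrors the intuition on the slide and is not
    Lucene's full similarity implementation.
    """
    counts = Counter(doc_terms)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = counts[term]
        if tf == 0:
            continue  # term does not occur in this document
        doc_freq = sum(1 for doc in corpus if term in doc)
        idf = 1.0 + math.log(n_docs / (doc_freq + 1))
        score += math.sqrt(tf) * idf
    return score


# Hypothetical mini-corpus of tokenized product titles.
d1 = ["adidas", "running", "shoes"]
d2 = ["adidas", "adidas", "jacket"]
d3 = ["nike", "running", "shoes"]
corpus = [d1, d2, d3]
```

With this corpus, a document that matches both terms of the query "adidas shoes" scores higher than one matching only "adidas", a second occurrence of "adidas" raises the score, and the rare term "jacket" contributes more than the common term "running".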

IMPROVING RELEVANCE

Stages to improve relevance in Solr:
• Editorial voting (QueryElevationComponent)
• Indexing time (analyzing content, text analysis)
• Query time (function queries, boosting)

IMPROVING RELEVANCE
Solr queries

q = +brand:adidas shop:monshowroom^3

q = +adidas monshowroom
defType = dismax
qf = brand shop^3

sort = user_ratings desc, score desc

qq = adidas
q = {!boost b=$b defType=dismax v=$qq}
b = product(popularity, clicks)
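
As a rough illustration, the boosted variant could be assembled on the client side like this. The helper names are hypothetical, the `qf` value is borrowed from the dismax example, and the boost uses Solr's `product` function over the `popularity` and `clicks` fields named on the slide.

```python
from urllib.parse import urlencode


def boosted_query_params(user_query, boost_function="product(popularity,clicks)"):
    """Build request parameters for a multiplicative boost around a dismax query.

    Assumption: field names and the qf weighting come from the slide examples;
    adapt them to the actual schema.
    """
    return {
        # The user's text goes into $qq; the boost parser wraps it.
        "q": "{!boost b=$b defType=dismax v=$qq}",
        "qq": user_query,
        "b": boost_function,
        "qf": "brand shop^3",
    }


def to_query_string(params):
    """URL-encode the parameters for a Solr /select request."""
    return urlencode(params)
```

Keeping the user input in `qq` and the boost in `b` means the ranking function can be swapped without touching the part of the request that carries the search terms.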

IMPROVING RELEVANCE
Definition of solr.ExternalFileField
<types>
  <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
  <fieldType name="file_delta2" class="solr.ExternalFileField"
             keyField="id" defVal="1.0" indexed="false" stored="false" valType="float"/>
</types>
<fields>
  <field name="delta2" type="file_delta2"/>
</fields>

IMPROVING RELEVANCE
Example of external file with boosting
cores\de_DE\products\external_delta2.txt
15062471=0.5
15062479=0.2
15062507=0.41
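
A file in this `id=value` format is easy to produce from any offline model. The sketch below is one possible way to write and read it in Python; the function names are hypothetical, and the default of 1.0 mirrors the `defVal` attribute of the field type.

```python
def format_external_file(boosts):
    """Render {doc_id: boost} as 'id=value' lines, sorted by id for stable output."""
    lines = [f"{doc_id}={value}" for doc_id, value in sorted(boosts.items())]
    return "\n".join(lines) + "\n"


def parse_external_file(text):
    """Parse 'id=value' lines back into a dict, skipping blank lines."""
    boosts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        doc_id, _, value = line.partition("=")
        boosts[doc_id] = float(value)
    return boosts


def boost_for(boosts, doc_id, default=1.0):
    """Documents missing from the file fall back to the default (defVal)."""
    return boosts.get(doc_id, default)
```

Because the boosts live outside the index, a new model only needs a new file rather than a reindex, which is what makes this field type attractive for quick ranking experiments.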

LEAN APPROACH TO RANKING

Lean manufacturing, lean enterprise, or lean production, often simply "lean",
is a production practice that considers the expenditure of resources for any goal
other than the creation of value for the end customer to be wasteful, and thus a target
for elimination.

Essentially, lean is centered on preserving value with less work.

LEAN APPROACH TO RANKING

Requirements:
• Decreasing time to implement new ranking model
• Possibility to use more dynamic ranking models
• Keeping working infrastructure alive
• A/B testing without changing entire infrastructure
• Performance level - “still fast” and “transparent”

LEAN APPROACH TO RANKING
Python benchmark
  -h, --help                       show this help message and exit
  --gaid gaid, -g gaid             Google Analytics site id
  --gadate gadate                  a date to fetch the most popular pages from Google Analytics
  --solr solr, -s solr             Solr server to benchmark performance
  --pages number, -p number        a number of top pages from Google Analytics
  --repeats number, -r number      a number of repeats for every page
  --compare compare, -c compare    different ranking algorithms to compare
  --cmpmode CMPMODE                run benchmark in comparison mode

python solr-benchmark\benchmark.py -c RankingClassical,RankingDelta2
python solr-benchmark\benchmark.py -c RankingClassical,RankingDelta2 --cmpmode 1
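
The benchmark script itself is not shown in the slides. A minimal sketch of a command-line interface with the same flags, plus a timing helper, might look like the following; the defaults, the helper names and the comma-splitting of `--compare` are assumptions.

```python
import argparse
import time


def ranking_list(value):
    """Split 'RankingClassical,RankingDelta2' into a list of names."""
    return [name for name in value.split(",") if name]


def build_parser():
    """Argument parser mirroring the flags in the help text (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Benchmark Solr ranking variants.")
    parser.add_argument("--gaid", "-g", metavar="gaid", help="Google Analytics site id")
    parser.add_argument("--gadate", metavar="gadate",
                        help="date to fetch the most popular pages for")
    parser.add_argument("--solr", "-s", metavar="solr", help="Solr server to benchmark")
    parser.add_argument("--pages", "-p", metavar="number", type=int, default=10,
                        help="number of top pages to replay")
    parser.add_argument("--repeats", "-r", metavar="number", type=int, default=3,
                        help="number of repeats for every page")
    parser.add_argument("--compare", "-c", metavar="compare", type=ranking_list,
                        default=[], help="comma-separated ranking algorithms")
    parser.add_argument("--cmpmode", type=int, default=0,
                        help="run benchmark in comparison mode")
    return parser


def mean_latency(fetch, repeats):
    """Call fetch() `repeats` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

Here `fetch` stands for any callable that issues one search request, so the same helper can time the classical and the experimental ranking side by side.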

LEAN APPROACH TO RANKING
Common search infrastructure

[Diagram: nginx, Jboss, Solr-loadbalancer, Solr nodes]

LEAN APPROACH TO RANKING
Updated

[Diagram: front-end loadbalancer, nginx, Jboss, Solr-loadbalancer, Solr nodes]

LEAN APPROACH TO RANKING
nginx / templates / conf / solr-rewrites.conf.erb
include nginx
nginx::config { "solr_dev": }
nginx::solr-ranking { "delta2":
  urls => [
    "/search.action?gender=women&brand=2271&tag=1161&tag=877&tag=468",
    "/search.action?gender=men&brand=11235&tag=10203&tag=10299&tag=10326"
  ],
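
The slides do not show how these URLs are turned into the gender, brand and tag values that the rewrite template consumes. One way to extract them, sketched in Python with a hypothetical function name, is to parse the query string:

```python
from urllib.parse import parse_qs, urlsplit


def parse_ranking_url(url):
    """Extract gender, brand and tags from a /search.action URL.

    Assumption: the mapping from the gender name to a numeric id used by the
    index is not shown in the slides, so the name is returned unchanged.
    """
    query = parse_qs(urlsplit(url).query)
    return {
        "gender": query.get("gender", [""])[0],
        "brand": int(query.get("brand", ["0"])[0]),
        "tags": [int(tag) for tag in query.get("tag", [])],
    }
```

Listing plain page URLs in the manifest keeps the experiment definition readable for people who know the site but not the index schema.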

LEAN APPROACH TO RANKING
nginx / templates / conf / solr-rewrites.conf.erb
<% urls.each do |url| -%>
if ($args ~* <% if url['gender'] > 0 -%>gender_id%3A<%= url['gender'] %>.*<% end %><% url['tags'].each do |tag| -%>tag_id%3A<%= tag %>.*<% end -%><% if url['brand'] > 0 -%>brand_id%3A%28<%= url['brand'] %>%29<% end -%>) {
  set $orig $args;
  set $args "q={!boost+b=%24b+defType=dismax+v=%24qq}&qq=id:*";
  rewrite ^(.*)$ "$1?$orig" break;
}
<% end -%>
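
To see what condition the template renders for one entry, the same string concatenation can be reproduced in Python and tried against a sample query string. The function names, the numeric gender id and the sample `fq` value are assumptions made for illustration.

```python
import re


def build_condition(gender, tags, brand):
    """Reproduce the pattern the ERB template emits for one URL entry."""
    parts = []
    if gender > 0:
        parts.append(f"gender_id%3A{gender}.*")
    for tag in tags:
        parts.append(f"tag_id%3A{tag}.*")
    if brand > 0:
        parts.append(f"brand_id%3A%28{brand}%29")
    return "".join(parts)


def matches(pattern, args):
    """Case-insensitive search, like nginx's ~* operator."""
    return re.search(pattern, args, re.IGNORECASE) is not None


# Hypothetical request arguments as they might arrive at nginx (URL-encoded).
sample_args = (
    "fq=gender_id%3A2+AND+tag_id%3A1161+AND+tag_id%3A877"
    "+AND+brand_id%3A%282271%29"
)
```

Note that the pattern is order-sensitive: gender, tags and brand must appear in the request arguments in the same order in which the template concatenates them.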

REAL-WORLD EXAMPLES

ELEPHANT-DRIVEN ARCHITECTURE
Multiple pieces to perform simple task

SIMPLIFIED VERSION
Less code – fewer bugs

REAL-WORLD EXAMPLES
Multiple points to evaluate

Stages to evaluate the model:
• R ranking model
• Independent Solr node
• For internal use cases
• Testing on some of the pages
• A/B rollout for a percentage of users
• Production rollout
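
The slides do not describe how users are assigned during the A/B stage. A common approach, sketched here in Python under that assumption, is deterministic hash bucketing so that each user always sees the same ranking; the experiment name and function names are hypothetical.

```python
import hashlib


def in_rollout(user_id, percent, experiment="ranking-delta2"):
    """Return True if the user falls into the first `percent` percent of buckets.

    The bucket is derived from a hash of the experiment name and user id,
    so the assignment is stable across requests.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100


def ranking_for(user_id, percent):
    """Pick the ranking variant for a user; the names follow the benchmark example."""
    return "RankingDelta2" if in_rollout(user_id, percent) else "RankingClassical"
```

Raising `percent` step by step moves from a small A/B sample to a full production rollout without reassigning users who were already in the test group.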

THANKS FOR YOUR ATTENTION!
Questions?

Sergii Khomenko
Data Scientist
STYLIGHT GmbH
sergii.khomenko@stylight.com
@lc0d3r
http://www.stylight.com
Nymphenburger Straße 86
80636 Munich, Germany

REFERENCE LIST

• Stack Overflow Tag Trends
http://hewgill.com/~greg/stackoverflow/stack_overflow/tags/#!lucene+solr+elasticsearch+sphinx
• Public websites using Solr
http://wiki.apache.org/solr/PublicServers
• CommonQueryParameters
http://wiki.apache.org/solr/CommonQueryParameters
• Thoughts in plain text http://lc0.github.io/
• STYLIGHT Engineering http://www.stylight.com/Engineering/
FASHION FRIDAY


Más contenido relacionado

Similar a Lean Ranking infrastructure with Solr

Introduction to SQL Server Graph DB
Introduction to SQL Server Graph DBIntroduction to SQL Server Graph DB
Introduction to SQL Server Graph DBGreg McMurray
 
Back to the future : SQL 92 for Elasticsearch @nosqlmatters Paris
Back to the future : SQL 92 for Elasticsearch @nosqlmatters ParisBack to the future : SQL 92 for Elasticsearch @nosqlmatters Paris
Back to the future : SQL 92 for Elasticsearch @nosqlmatters ParisLucian Precup
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 
Extending OutSystems with Javascript
Extending OutSystems with JavascriptExtending OutSystems with Javascript
Extending OutSystems with JavascriptRitaDias72
 
UI Realigning & Refactoring by Jina Bolton
UI Realigning & Refactoring by Jina BoltonUI Realigning & Refactoring by Jina Bolton
UI Realigning & Refactoring by Jina BoltonCodemotion
 
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...NoSQLmatters
 
R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?Villu Ruusmann
 
Building Applications with a Graph Database
Building Applications with a Graph DatabaseBuilding Applications with a Graph Database
Building Applications with a Graph DatabaseTobias Lindaaker
 
Web API Design 2013
Web API Design 2013Web API Design 2013
Web API Design 2013gidgreen
 
Data Visualization with R
Data Visualization with RData Visualization with R
Data Visualization with RSergii Khomenko
 
Feed your inner data scientist. JS Visualization Tools
Feed your inner data scientist.  JS Visualization ToolsFeed your inner data scientist.  JS Visualization Tools
Feed your inner data scientist. JS Visualization ToolsDoug Mair
 
Online course content siebel open ui
Online course content siebel open uiOnline course content siebel open ui
Online course content siebel open uiNOVA ITLABS
 
Lighting talk neo4j fosdem 2011
Lighting talk neo4j fosdem 2011Lighting talk neo4j fosdem 2011
Lighting talk neo4j fosdem 2011Jordi Valverde
 
Rails MVC by Sergiy Koshovyi
Rails MVC by Sergiy KoshovyiRails MVC by Sergiy Koshovyi
Rails MVC by Sergiy KoshovyiPivorak MeetUp
 
Back to Basics Webinar 5: Introduction to the Aggregation Framework
Back to Basics Webinar 5: Introduction to the Aggregation FrameworkBack to Basics Webinar 5: Introduction to the Aggregation Framework
Back to Basics Webinar 5: Introduction to the Aggregation FrameworkMongoDB
 
Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Johann de Boer
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
 

Similar a Lean Ranking infrastructure with Solr (20)

Introduction to SQL Server Graph DB
Introduction to SQL Server Graph DBIntroduction to SQL Server Graph DB
Introduction to SQL Server Graph DB
 
Back to the future : SQL 92 for Elasticsearch @nosqlmatters Paris
Back to the future : SQL 92 for Elasticsearch @nosqlmatters ParisBack to the future : SQL 92 for Elasticsearch @nosqlmatters Paris
Back to the future : SQL 92 for Elasticsearch @nosqlmatters Paris
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Extending OutSystems with Javascript
Extending OutSystems with JavascriptExtending OutSystems with Javascript
Extending OutSystems with Javascript
 
UI Realigning & Refactoring by Jina Bolton
UI Realigning & Refactoring by Jina BoltonUI Realigning & Refactoring by Jina Bolton
UI Realigning & Refactoring by Jina Bolton
 
Html5
Html5Html5
Html5
 
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...
 
R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?
 
Building Applications with a Graph Database
Building Applications with a Graph DatabaseBuilding Applications with a Graph Database
Building Applications with a Graph Database
 
Web API Design 2013
Web API Design 2013Web API Design 2013
Web API Design 2013
 
Mstr meetup
Mstr meetupMstr meetup
Mstr meetup
 
Data Visualization with R
Data Visualization with RData Visualization with R
Data Visualization with R
 
Solid Works
Solid WorksSolid Works
Solid Works
 
Feed your inner data scientist. JS Visualization Tools
Feed your inner data scientist.  JS Visualization ToolsFeed your inner data scientist.  JS Visualization Tools
Feed your inner data scientist. JS Visualization Tools
 
Online course content siebel open ui
Online course content siebel open uiOnline course content siebel open ui
Online course content siebel open ui
 
Lighting talk neo4j fosdem 2011
Lighting talk neo4j fosdem 2011Lighting talk neo4j fosdem 2011
Lighting talk neo4j fosdem 2011
 
Rails MVC by Sergiy Koshovyi
Rails MVC by Sergiy KoshovyiRails MVC by Sergiy Koshovyi
Rails MVC by Sergiy Koshovyi
 
Back to Basics Webinar 5: Introduction to the Aggregation Framework
Back to Basics Webinar 5: Introduction to the Aggregation FrameworkBack to Basics Webinar 5: Introduction to the Aggregation Framework
Back to Basics Webinar 5: Introduction to the Aggregation Framework
 
Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
 

Más de Sergii Khomenko

Handle your Lambdas - From event-based processing to Continuous Integration /...
Handle your Lambdas - From event-based processing to Continuous Integration /...Handle your Lambdas - From event-based processing to Continuous Integration /...
Handle your Lambdas - From event-based processing to Continuous Integration /...Sergii Khomenko
 
From Data Science to Production - deploy, scale, enjoy! / PyData Amsterdam - ...
From Data Science to Production - deploy, scale, enjoy! / PyData Amsterdam - ...From Data Science to Production - deploy, scale, enjoy! / PyData Amsterdam - ...
From Data Science to Production - deploy, scale, enjoy! / PyData Amsterdam - ...Sergii Khomenko
 
Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Building Data applications with Go: from Bloom filters to Data pipelines / FO...Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Building Data applications with Go: from Bloom filters to Data pipelines / FO...Sergii Khomenko
 
Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift /...
Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift /...Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift /...
Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift /...Sergii Khomenko
 
Helping Data Teams with Puppet / Puppet Camp London - Apr 13, 2015
Helping Data Teams with Puppet / Puppet Camp London - Apr 13, 2015Helping Data Teams with Puppet / Puppet Camp London - Apr 13, 2015
Helping Data Teams with Puppet / Puppet Camp London - Apr 13, 2015Sergii Khomenko
 
Scaling your Tableau - Migrating from Tableau Online to a proper DWH solution...
Scaling your Tableau - Migrating from Tableau Online to a proper DWH solution...Scaling your Tableau - Migrating from Tableau Online to a proper DWH solution...
Scaling your Tableau - Migrating from Tableau Online to a proper DWH solution...Sergii Khomenko
 
From simple to more advanced: Lessons learned in 13 months with Tableau
From simple to more advanced: Lessons learned in 13 months with TableauFrom simple to more advanced: Lessons learned in 13 months with Tableau
From simple to more advanced: Lessons learned in 13 months with TableauSergii Khomenko
 
Crunching data with go: Tips, tricks, use-cases
Crunching data with go: Tips, tricks, use-casesCrunching data with go: Tips, tricks, use-cases
Crunching data with go: Tips, tricks, use-casesSergii Khomenko
 

Más de Sergii Khomenko (8)

Handle your Lambdas - From event-based processing to Continuous Integration /...
Handle your Lambdas - From event-based processing to Continuous Integration /...Handle your Lambdas - From event-based processing to Continuous Integration /...
Handle your Lambdas - From event-based processing to Continuous Integration /...
 
From Data Science to Production - deploy, scale, enjoy! / PyData Amsterdam - ...
From Data Science to Production - deploy, scale, enjoy! / PyData Amsterdam - ...From Data Science to Production - deploy, scale, enjoy! / PyData Amsterdam - ...
From Data Science to Production - deploy, scale, enjoy! / PyData Amsterdam - ...
 
Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Building Data applications with Go: from Bloom filters to Data pipelines / FO...Building Data applications with Go: from Bloom filters to Data pipelines / FO...
Building Data applications with Go: from Bloom filters to Data pipelines / FO...
 
Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift /...
Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift /...Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift /...
Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift /...
 
Helping Data Teams with Puppet / Puppet Camp London - Apr 13, 2015
Helping Data Teams with Puppet / Puppet Camp London - Apr 13, 2015Helping Data Teams with Puppet / Puppet Camp London - Apr 13, 2015
Helping Data Teams with Puppet / Puppet Camp London - Apr 13, 2015
 
Scaling your Tableau - Migrating from Tableau Online to a proper DWH solution...
Scaling your Tableau - Migrating from Tableau Online to a proper DWH solution...Scaling your Tableau - Migrating from Tableau Online to a proper DWH solution...
Scaling your Tableau - Migrating from Tableau Online to a proper DWH solution...
 
From simple to more advanced: Lessons learned in 13 months with Tableau
From simple to more advanced: Lessons learned in 13 months with TableauFrom simple to more advanced: Lessons learned in 13 months with Tableau
From simple to more advanced: Lessons learned in 13 months with Tableau
 
Crunching data with go: Tips, tricks, use-cases
Crunching data with go: Tips, tricks, use-casesCrunching data with go: Tips, tricks, use-cases
Crunching data with go: Tips, tricks, use-cases
 

Último

AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 

Último (20)

AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Lean Ranking infrastructure with Solr

  • 1. LEAN RANKING INFRASTRUCTURE WITH SOLR Ranking infrastructure and using Solr for lean prototyping sergii.khomenko@stylight.com, @lc0d3r Slide 1 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
  • 2. AGENDA 1. Problem definition 2. Boosting 3. Lean approach to ranking infrastructure 4. Real-word examples Slide 2 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
  • 4. SOLR USERS Slide 4 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
  • 5. STYLIGHT – THE BEST PLACE TO DISCOVER FASHION Slide 5 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
  • 6. GET INSPIRED BY LOOKS CREATED BY COMMUNITY Slide 6 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
  • 7. DISCOVER THOUSANDS OF BRANDS AND MILLIONS OF PRODUCTS Slide 7 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
  • 8. STYLIGHT – INTERNATIONAL COMMUNITY Live in 13 countries Slide 8 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
  • 9. PROBLEM DEFINITION Slide 9 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
  • 10. PROBLEM DEFINITION Slide 10 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
  • 11. PROBLEM DEFINITION Ranking specifics: • Seasonal influence • Trends • Cold start of new countries, shops • Multiple dimensions of ranking model Slide 11 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
  • 12. IMPROVING RELEVANCE Slide 12 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
  • 13. IMPROVING RELEVANCE TF-IDF - default scoring model in Lucene/Solr • • • matching more query terms is better more occurrences of a query term is better more novel terms increase doc score more than common terms Slide 13 | STYLIGHT | Proud to bleed purple | @lc0d3r and @stylight_eng
  • 14. IMPROVING RELEVANCE • Stages to improve relevance in Solr • Editorial voting (QueryElevationComponent) • Indexing time (analyzing content, text analysis) • Query-time (function queries, boosting)
  • 15. IMPROVING RELEVANCE Solr queries
      q = +brand:adidas shop:monshowroom^3
      q = +adidas monshowroom
      defType = dismax
      qf = brand shop^3
      sort = user_ratings desc, score desc
      qq = adidas
      q = {!boost b=$b defType=dismax v=$qq}
      b = prod(popularity, clicks)
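As a sketch of how the last query on slide 15 could be sent from client code, the snippet below assembles the same parameters and URL-encodes them. The host, the core name `products`, and the function name are hypothetical; the `qf` value is borrowed from the dismax example on the same slide.

```python
from urllib.parse import urlencode


def boosted_query_params(user_query, boost_function):
    """Build request parameters for a multiplicative boost query."""
    return {
        # The boost parser multiplies the dismax score of $qq by $b.
        "q": "{!boost b=$b defType=dismax v=$qq}",
        "qq": user_query,
        "b": boost_function,
        # Assumption: reuse the field weights from the dismax example.
        "qf": "brand shop^3",
    }


params = boosted_query_params("adidas", "prod(popularity, clicks)")
# Hypothetical host and core name, for illustration only.
url = "http://localhost:8983/solr/products/select?" + urlencode(params)
print(url)
```

Keeping the user's text in `qq` and the ranking function in `b` means a new ranking model only changes one parameter, not the query itself.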
  • 16. IMPROVING RELEVANCE Definition of solr.ExternalFileField
      <types>
        <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
        <fieldType name="file_delta2" class="solr.ExternalFileField" keyField="id" defVal="1.0" indexed="false" stored="false" valType="float" />
      </types>
      <fields>
        <field name="delta2" type="file_delta2"/>
      </fields>
  • 17. IMPROVING RELEVANCE Example of external file with boosting
      coresde_DEproductsexternal_delta2.txt
      15062471=0.5
      15062479=0.2
      15062507=0.41
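The external file on slide 17 is plain `id=value` text, so a ranking model's output can be turned into it with a few lines. The sketch below is one possible way to do that, not the deck's actual tooling; the function names are hypothetical, and sorting by id is only a choice made here to keep the output deterministic.

```python
def format_external_file(boosts):
    """Render {document id: boost} as 'id=value' lines, sorted by id."""
    lines = [f"{doc_id}={value}" for doc_id, value in sorted(boosts.items())]
    return "\n".join(lines) + "\n"


def write_external_file(path, boosts):
    """Write the boosts to `path`; the caller decides where the file lives."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_external_file(boosts))


# The three values shown on the slide, in arbitrary input order.
boosts = {15062479: 0.2, 15062471: 0.5, 15062507: 0.41}
content = format_external_file(boosts)
print(content, end="")
```

Documents missing from the file fall back to the `defVal="1.0"` declared in the field type, which is neutral for a multiplicative boost.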
  • 18. LEAN APPROACH TO RANKING
  • 19. LEAN APPROACH TO RANKING Lean manufacturing, lean enterprise, or lean production, often simply, "lean", is a production practice that considers the expenditure of resources for any goal other than the creation of value for the end customer to be wasteful, and thus a target for elimination. Essentially, lean is centered on preserving value with less work.
  • 20. LEAN APPROACH TO RANKING Requirements: • Decreasing time to implement new ranking model • Possibility to use more dynamic ranking models • Keeping working infrastructure alive • A/B testing without changing entire infrastructure • Performance level - “still fast” and “transparent”
  • 21. LEAN APPROACH TO RANKING Python benchmark
      -h, --help                      show this help message and exit
      --gaid gaid, -g gaid            Google analytics site id.
      --gadate gadate                 a date to fetch the most popular pages from Google Analytics
      --solr solr, -s solr            Solr server to benchmark performance.
      --pages number, -p number       a number of top pages from Google Analytics.
      --repeats number, -r number     a number of repeats for an every page.
      --compare compare, -c compare   Different rankings algorithms to compare.
      --cmpmode CMPMODE               run benchmark in comparison mode
      python solr-benchmarkbenchmark.py -c RankingClassical,RankingDelta2
      python solr-benchmarkbenchmark.py -c RankingClassical,RankingDelta2 --cmpmode 1
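The help text on slide 21 could come from an `argparse` definition along these lines. Only the flag names and help strings are taken from the slide; the value types, the defaults, and the comma-splitting of `--compare` are assumptions, and the benchmark logic itself is not shown in the deck.

```python
import argparse


def build_parser():
    """Recreate the command-line interface shown on the slide."""
    parser = argparse.ArgumentParser(description="Python benchmark")
    parser.add_argument("--gaid", "-g", metavar="gaid",
                        help="Google analytics site id.")
    parser.add_argument("--gadate", metavar="gadate",
                        help="a date to fetch the most popular pages from Google Analytics")
    parser.add_argument("--solr", "-s", metavar="solr",
                        help="Solr server to benchmark performance.")
    # Assumption: counts are integers; the defaults are placeholders.
    parser.add_argument("--pages", "-p", metavar="number", type=int, default=10,
                        help="a number of top pages from Google Analytics.")
    parser.add_argument("--repeats", "-r", metavar="number", type=int, default=3,
                        help="a number of repeats for an every page.")
    # Assumption: ranking names arrive as one comma-separated value.
    parser.add_argument("--compare", "-c", metavar="compare", default=[],
                        type=lambda value: value.split(","),
                        help="Different rankings algorithms to compare.")
    parser.add_argument("--cmpmode", type=int, default=0,
                        help="run benchmark in comparison mode")
    return parser


args = build_parser().parse_args(
    ["-c", "RankingClassical,RankingDelta2", "--cmpmode", "1"]
)
print(args.compare, args.cmpmode)
```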
  • 22. LEAN APPROACH TO RANKING Common search infrastructure [diagram: nginx, Solr-loadbalancer, Jboss, Solr nodes]
  • 23. LEAN APPROACH TO RANKING Updated [diagram: Front-end loadbalancer, nginx, Solr-loadbalancer, Jboss, Solr nodes]
  • 24. LEAN APPROACH TO RANKING nginx / templates / conf / solr-rewrites.conf.erb
      include nginx
      nginx::config { "solr_dev": }
      nginx::solr-ranking { "delta2":
        urls => [
          "/search.action?gender=women&brand=2271&tag=1161&tag=877&tag=468",
          "/search.action?gender=men&brand=11235&tag=10203&tag=10299&tag=10326"
        ],
  • 25. LEAN APPROACH TO RANKING nginx / templates / conf / solr-rewrites.conf.erb
      <% urls.each do |url| -%>
      if ($args ~* <% if url['gender'] > 0 -%>gender_id%3A<%= url['gender'] %>.*<% end %><% url['tags'].each do |tag| -%>tag_id%3A<%= tag %>.*<% end -%><% if url['brand'] > 0 -%>brand_id%3A%28<%= url['brand'] %>%29<% end -%>) {
        set $orig $args;
        set $args "q={!boost+b=%24b+defType=dismax+v=%24qq}&qq=id:*";
        rewrite ^(.*)$ "$1?$orig" break;
      }
      <% end -%>
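To make the matching rule in the template on slide 25 easier to follow, here is a Python sketch that builds the same kind of pattern and tests it against a query string. It mirrors only the `if ($args ~* ...)` condition, not the rewrite itself. The numeric gender id and both sample argument strings are hypothetical.

```python
import re


def build_pattern(gender, tags, brand):
    """Build the case-insensitive pattern the template renders for one URL."""
    parts = []
    if gender > 0:
        parts.append(f"gender_id%3A{gender}.*")
    for tag in tags:
        parts.append(f"tag_id%3A{tag}.*")
    if brand > 0:
        parts.append(f"brand_id%3A%28{brand}%29")
    # nginx's ~* operator is a case-insensitive, unanchored regex match.
    return re.compile("".join(parts), re.IGNORECASE)


def uses_new_ranking(args, pattern):
    """Return True when the request arguments match the configured page."""
    return pattern.search(args) is not None


pattern = build_pattern(gender=1, tags=[1161, 877, 468], brand=2271)

# Hypothetical URL-encoded Solr arguments for a configured and an unrelated page.
matching_args = ("q=*%3A*&fq=gender_id%3A1&fq=tag_id%3A1161&fq=tag_id%3A877"
                 "&fq=tag_id%3A468&fq=brand_id%3A%282271%29")
other_args = "q=*%3A*&fq=gender_id%3A1&fq=brand_id%3A%2811235%29"

print(uses_new_ranking(matching_args, pattern))
print(uses_new_ranking(other_args, pattern))
```

Because the pattern fixes the order gender, tags, brand, a request only matches when its filters appear in that order in the query string.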
  • 26. REAL-WORLD EXAMPLES
  • 27. ELEPHANT-DRIVEN ARCHITECTURE Multiple pieces to perform a simple task
  • 28. SIMPLIFIED VERSION Less code – fewer bugs
  • 29. REAL-WORLD EXAMPLES
  • 30. REAL-WORLD EXAMPLES
  • 31. REAL-WORLD EXAMPLES Multiple points to evaluate. Stages to evaluate the model: • R ranking model • Independent Solr node • Internal use cases • Testing on a subset of pages • A/B rollout for a percentage of users • Production rollout
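The "A/B rollout for a percentage of users" stage on slide 31 needs a stable way to decide which users see the new ranking. The deck does not say how this was done; the sketch below shows one common approach, hash-based bucketing. The function names and the experiment label are assumptions, while the two ranking names are borrowed from the benchmark slide.

```python
import hashlib


def in_rollout(user_id, percent, experiment="delta2"):
    """Deterministically place a user in one of 100 buckets.

    Hashing the experiment label with the user id gives each user the same
    answer on every request without storing any state.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < percent


def choose_ranking(user_id, percent):
    """Pick the ranking to serve for this user at the given rollout level."""
    return "RankingDelta2" if in_rollout(user_id, percent) else "RankingClassical"


# Share of 10,000 hypothetical users that land in a 10% rollout.
share = sum(in_rollout(str(i), 10) for i in range(10000))
print(share)
```

Raising `percent` only adds users to the new ranking, so a user who has already been switched stays switched as the rollout grows.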
  • 32. THANKS FOR YOUR ATTENTION! Questions?
  • 33. Sergii Khomenko Data Scientist STYLIGHT GmbH sergii.khomenko@stylight.com @lc0d3r http://www.stylight.com Nymphenburger Straße 86 80636 Munich, Germany
  • 34. REFERENCE LIST • Stack Overflow Tag Trends http://hewgill.com/~greg/stackoverflow/stack_overflow/tags/#!lucene+solr+elasticsearch+sphinx • Public websites using Solr http://wiki.apache.org/solr/PublicServers • CommonQueryParameters http://wiki.apache.org/solr/CommonQueryParameters • Thoughts in plain text http://lc0.github.io/ • STYLIGHT Engineering http://www.stylight.com/Engineering/
  • 35. FASHION FRIDAY