PO Department
PEOPLE OPERATION’S
MONTHLY UPDATE
09/2019
1
CPU- and memory-efficient
spellchecker implementation at TIKI
2
Results for “iphone”
3
Results for “ipohne” without spellchecker
4
Results for “ipohne” with spellchecker
5
General approach
words, result = tokenize(query), []
for w in words:
    candidates = generate_candidates(w)
    # Fall back to the original word if no candidate scores above zero.
    best_c, best_score = w, 0.0
    for c in candidates:
        score = spellchecker_score(w, c)
        if score > best_score:
            best_c, best_score = c, score
    result.append(best_c)
6
Generate candidates
Generate all possible similar words:
- We need a measure of similarity; we use the Damerau-Levenshtein distance
- It allows insertions, deletions, substitutions and transpositions of symbols
- The maximum allowed distance depends on the length of the word
- Then we simply generate all edits of the 4 possible types (CPU-intensive)
- We will optimize this approach later
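The brute-force generation described above can be sketched as follows. This is a minimal illustration, not the code from the talk: the function name `edits1` and the toy `ALPHABET` are assumptions, and a Vietnamese spellchecker would need accented letters in the alphabet as well.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: toy alphabet for illustration

def edits1(word, alphabet=ALPHABET):
    """Return every string exactly one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
    substitutes = {left + ch + right[1:]
                   for left, right in splits if right for ch in alphabet}
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    # Substituting a letter with itself reproduces the word, so drop it.
    return (deletes | transposes | substitutes | inserts) - {word}

candidates = edits1("ipohne")
```

Even a six-letter word with a 26-letter alphabet produces a few hundred strings, and reaching distance 2 means applying the function again to every result, which is where the CPU cost comes from.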
Examples of Damerau-Levenshtein distance:
- distance(nguyễn, nguyên) = 1 (one substitution)
- distance(nguyễn, nguyeenx) = 3 (one substitution, two insertions)
- distance(behaivour, behaviour) = 1 (one transposition)
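The distances in these examples can be checked with a short dynamic-programming routine. The sketch below implements the optimal string alignment variant of Damerau-Levenshtein distance; the slides do not say which variant is used, so that choice is an assumption, as is the name `dl_distance`. Input is normalized to NFC so that an accented Vietnamese letter counts as a single symbol.

```python
import unicodedata

def dl_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all i characters of a
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

Called on the three word pairs listed above, it returns 1, 3 and 1 respectively.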
7
Spellchecker score
“Noisy channel” model:
- Bayes' rule: P(c|w) = P(w|c) * P(c) / P(w)
- We need the candidate c that maximizes P(c|w)
- This simplifies to maximizing P(w|c) * P(c), because P(w) is the same for all candidates
Probabilities used:
- P(c|w): probability that c was intended when w was observed
- P(w|c): probability that w is a misspelling of c (error model)
- P(c): probability of observing c (language model)
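The selection rule can be sketched in a few lines. The function name `best_candidate`, the two callables it takes, and all the probabilities below are hypothetical values chosen only to illustrate the argmax; they are not taken from the talk.

```python
def best_candidate(observed, candidates, error_prob, language_prob):
    """Pick the candidate c that maximizes P(w|c) * P(c); P(w) is dropped."""
    return max(candidates,
               key=lambda c: error_prob(observed, c) * language_prob(c))

# Hypothetical toy probabilities, for illustration only.
ERROR = {("ipohne", "iphone"): 0.8, ("ipohne", "ipohne"): 0.2}
LANGUAGE = {"iphone": 1e-2, "ipohne": 1e-6}

choice = best_candidate(
    "ipohne",
    ["iphone", "ipohne"],
    error_prob=lambda w, c: ERROR.get((w, c), 0.0),
    language_prob=lambda c: LANGUAGE.get(c, 0.0),
)
```

Because P(w) divides every candidate's score by the same number, leaving it out does not change which candidate wins.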
8
Building the language model
N-gram model:
- Build a 2-gram dictionary
- Remove 2-grams whose count falls below a certain threshold
Data used:
- All product content on Tiki
- All Tiki search queries for one year
- Some randomly crawled texts from the Vietnamese web
- Total: 5.5 GB gzipped
9
Building the language model (example)
Data (queries on Tiki):
máy rửa mặt
máy rửa mắt
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy xay sinh tố
máy sấy tóc
...
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy rửa mắt
máy xay sinh tố
máy sấy tóc
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
10
Building the language model (example)
Language model:
410 <
410 >
410 máy
410 < máy
205 máy rửa
100 máy sấy
105 máy xay
105 tóc >
100 sấy tóc
5 xay tóc
105 tóc
...
We count every single word and every pair of adjacent words in the counted
queries and write the totals into the language model. The symbols < and >
mark the start and the end of a query.
This lets us estimate the probability of observing a word either without
context or with one word of context before or after it.
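The counting step can be reproduced with a short script. This is a sketch under assumptions: the function name `build_language_model` and the string-keyed `Counter` layout are illustrative, and pruning of rare 2-grams is left out.

```python
from collections import Counter

COUNTED_QUERIES = [
    (200, "máy rửa mặt"),
    (5, "máy rửa mắt"),
    (100, "máy sấy tóc"),
    (5, "máy xay tóc"),
    (100, "máy xay sinh tố"),
]

def build_language_model(counted_queries):
    """Count single tokens and adjacent token pairs, weighted by query count."""
    counts = Counter()
    for count, query in counted_queries:
        tokens = ["<"] + query.split() + [">"]  # < and > mark the query borders
        for token in tokens:
            counts[token] += count
        for left, right in zip(tokens, tokens[1:]):
            counts[f"{left} {right}"] += count
    return counts

model = build_language_model(COUNTED_QUERIES)
```

With the five counted queries from the example, `model["máy rửa"]` is 205 and `model["tóc >"]` is 105, matching the table.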
11
Building the language model (example)
Each pair count is divided by the count of the neighbouring context word.
Query: máy => "< máy >"
P(máy) = 0.5 * (P(< máy) + P(máy >))
       = 0.5 * (410/410 + 0/410) = 0.5
Query: máy xay tóc
P(xay) = 0.5 * (P(máy xay) + P(xay tóc))
       = 0.5 * (105/410 + 5/105) ~ 0.15
P(sấy) = 0.5 * (P(máy sấy) + P(sấy tóc))
       = 0.5 * (100/410 + 100/105) ~ 0.60
The language model suggests that “sấy” is roughly four times
more likely than “xay” in this context.
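The arithmetic can be checked with a small helper. This is a sketch: the name `context_probability` is an assumption, and the `COUNTS` table simply repeats the language model entries from the example.

```python
COUNTS = {
    "<": 410, ">": 410, "máy": 410, "tóc": 105,
    "< máy": 410, "máy rửa": 205, "máy sấy": 100, "máy xay": 105,
    "tóc >": 105, "sấy tóc": 100, "xay tóc": 5,
}

def context_probability(prev_word, word, next_word, counts=COUNTS):
    """Average of the left-context and right-context 2-gram estimates."""
    def ratio(pair, context):
        total = counts.get(context, 0)
        return counts.get(pair, 0) / total if total else 0.0
    left = ratio(f"{prev_word} {word}", prev_word)
    right = ratio(f"{word} {next_word}", next_word)
    return 0.5 * (left + right)

p_xay = context_probability("máy", "xay", "tóc")
p_say = context_probability("máy", "sấy", "tóc")
```

Here `p_xay` evaluates to about 0.15 and `p_say` to about 0.60.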
12
Building the error model
Automatic extraction of P(w|c):
- Extract triplets (w1, w2, w3) from our text set
- Group triplets by (w1, *, w3) and sort each group by descending popularity
- Remove groups below a certain popularity threshold
- Remove samples whose w2 words are too far apart (using
Damerau-Levenshtein distance)
- Remove samples whose popularity is comparable to that of the most popular
sample in the group: these are likely valid alternative words, not misspellings
- Write the w2 words of all remaining samples into the error model as triplets of
(observed word, intended word, count)
Data used:
- Same as for the language model
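The extraction steps above can be sketched as follows. The function names and the three thresholds (`min_group_count`, `max_distance`, `max_ratio`) are assumptions picked for illustration, the distance filter is applied against the most popular word in each group, and the distance routine is the optimal string alignment variant.

```python
import unicodedata
from collections import Counter, defaultdict

def dl_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

def build_error_model(counted_queries, min_group_count=10,
                      max_distance=2, max_ratio=0.1):
    """Map (observed_word, intended_word) to a count. Thresholds are illustrative."""
    triplets = Counter()
    for count, query in counted_queries:
        tokens = ["<"] + unicodedata.normalize("NFC", query).split() + [">"]
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            triplets[(w1, w2, w3)] += count
    groups = defaultdict(Counter)  # (w1, w3) -> counts of middle words
    for (w1, w2, w3), count in triplets.items():
        groups[(w1, w3)][w2] += count
    model = Counter()
    for middles in groups.values():
        if sum(middles.values()) < min_group_count:
            continue  # group too rare to trust
        intended, top_count = middles.most_common(1)[0]
        model[(intended, intended)] += top_count
        for observed, count in middles.items():
            if observed == intended:
                continue
            if count > max_ratio * top_count:
                continue  # comparably popular: probably a different valid word
            if dl_distance(observed, intended) > max_distance:
                continue  # too different to be a misspelling
            model[(observed, intended)] += count
    return model

error_model = build_error_model([
    (200, "máy rửa mặt"), (5, "máy rửa mắt"), (100, "máy sấy tóc"),
    (5, "máy xay tóc"), (100, "máy xay sinh tố"),
])
```

With these settings the five example queries produce, among others, the entries (mắt, mặt) with count 5 and (xay, sấy) with count 5.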
13
Building the error model (example)
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
Triplets:
205 < máy rửa
200 rửa mặt >
5 rửa mắt >
100 máy sấy tóc
5 máy xay tóc
200 máy rửa mặt
5 máy rửa mắt
105 < máy xay
100 sinh tố >
...
We count every triplet of consecutive tokens in the counted queries,
including the start (<) and end (>) markers.
15
Building the error model (example)
Triplets (grouped):
rửa * >
    200 rửa mặt >
    5 rửa mắt >
máy * tóc
    100 máy sấy tóc
    5 máy xay tóc
máy * sinh
    100 máy xay sinh
sinh * >
    100 sinh tố >
...
Error model (format: count observed_word intended_word):
200 mặt mặt
5 mắt mặt
100 sấy sấy
5 xay sấy
100 xay xay
100 tố tố
...
16
Building the error model (example)
Query: kem rửa mắt
P(mắt|mắt) = 0/5 = 0.0: the number of times “mắt" was intended when "mắt"
was observed in the error model, divided by the total number of times "mắt"
was observed in the error model.
P(mắt|mặt) = 5/5 = 1.0: the number of times "mặt" was intended when "mắt"
was observed in the error model, divided by the same total.
According to the error model built on our data, “mắt" is therefore
extremely likely to be a misspelling of “mặt".
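The same division can be written as a small lookup. This is a sketch: the name `error_probability` and the dictionary layout keyed by (observed_word, intended_word) are assumptions; the counts are the ones from the example error model.

```python
ERROR_MODEL = {
    ("mặt", "mặt"): 200,
    ("mắt", "mặt"): 5,
    ("sấy", "sấy"): 100,
    ("xay", "sấy"): 5,
    ("xay", "xay"): 100,
    ("tố", "tố"): 100,
}

def error_probability(observed, intended, model=ERROR_MODEL):
    """Share of the sightings of `observed` in which `intended` was meant."""
    total = sum(n for (obs, _), n in model.items() if obs == observed)
    if total == 0:
        return 0.0  # word never seen in the error model
    return model.get((observed, intended), 0) / total

p_keep = error_probability("mắt", "mắt")
p_fix = error_probability("mắt", "mặt")
```

Here `p_keep` is 0.0 and `p_fix` is 1.0. For "xay", which was observed 105 times, the split is 100/105 for keeping it and 5/105 for correcting it to "sấy".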
17
Quality optimizations
Idea:
- The language model matters more when more context is available
- Instead of P(w|c)*P(c), use P(w|c)*pow(P(c),lambda)
- Lambda depends on the length of the available context
Results:
- Using a bigger lambda for longer context => better test results (the idea works!)
- For larger N-grams, machine learning is needed to optimize the lambdas
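The weighted score can be sketched as below. The lambda values are hypothetical placeholders (the talk does not give the actual ones), and both function names are assumptions.

```python
def context_lambda(context_length):
    """Hypothetical lambda schedule: more context, bigger exponent."""
    return {0: 0.5, 1: 1.0}.get(context_length, 1.5)

def weighted_score(error_prob, language_prob, context_length):
    """P(w|c) * P(c) ** lambda, with lambda chosen by context length."""
    return error_prob * language_prob ** context_lambda(context_length)

no_context = weighted_score(0.5, 0.25, 0)
two_sided = weighted_score(0.5, 0.25, 2)
```

Since P(c) is below 1, a larger exponent widens the gap between likely and unlikely candidates, so the language model has more influence on the final ranking.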
18
Performance optimizations
Important fact:
It can be proven that if Damerau-Levenshtein distance(w, c) = N, then w and c can
always be reduced to the same string by deleting no more than N single characters
from each of them. Examples below:
distance(iphone, iphobee) = 2 (one insertion, one substitution)
iphone -> iphoe VS iphobee -> iphoee -> iphoe (match!)
distance(iphone, pihoone) = 2 (one transposition, one insertion)
iphone -> ihone VS pihoone -> ihoone -> ihone (match!)
Let’s use this to optimize candidate generation!
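The two examples can be verified by enumerating delete variants. A minimal sketch, with the function name `deletes` chosen for illustration:

```python
def deletes(word, max_deletes):
    """All strings obtained from `word` by removing up to `max_deletes` characters."""
    seen = {word}
    frontier = {word}
    for _ in range(max_deletes):
        # Remove one more character from every string produced so far.
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        seen |= frontier
    return seen

shared_1 = deletes("iphone", 2) & deletes("iphobee", 2)
shared_2 = deletes("iphone", 2) & deletes("pihoone", 2)
```

`shared_1` contains "iphoe" and `shared_2` contains "ihone", the meeting points shown in the examples. Only deletions are generated, so no alphabet is needed.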
19
Performance optimizations
Problem 1 - generating candidates is CPU-intensive:
- Precompute a “deletes” dictionary
- Use only delete operations on both sides (the query word and the dictionary words)
- The distance must be re-checked: two words that meet after N deletes each can be up to 2N apart, but we need N
- Fast, but requires RAM
Problem 2 - the “deletes” dictionary requires a lot of RAM:
- Use data compression techniques
- Of those we tried, Judy dynamic arrays worked best
- RAM requirements dropped from 10.5 GB to 2.3 GB
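A precomputed deletes dictionary with the distance re-check can be sketched as follows. This is an illustration under assumptions: the names `build_index` and `lookup` are hypothetical, the distance is the optimal string alignment variant, and a plain Python dict stands in for the compressed structure, so it does nothing about the RAM problem.

```python
from collections import defaultdict

def deletes(word, max_deletes):
    """All strings obtained from `word` by removing up to `max_deletes` characters."""
    seen, frontier = {word}, {word}
    for _ in range(max_deletes):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        seen |= frontier
    return seen

def dl_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

def build_index(dictionary, max_distance):
    """Precompute: delete variant -> dictionary words that produce it."""
    index = defaultdict(set)
    for word in dictionary:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index

def lookup(query, index, max_distance):
    """Return {candidate: distance} for dictionary words within max_distance."""
    found = {}
    for variant in deletes(query, max_distance):
        for word in index.get(variant, ()):
            if word not in found:
                dist = dl_distance(query, word)  # re-check: may be up to 2N
                if dist <= max_distance:
                    found[word] = dist
    return found

index = build_index(["iphone", "ipad", "phone"], max_distance=1)
suggestions = lookup("ipohne", index, max_distance=1)
```

The re-check matters: "abc" and "bcd" both reduce to "bc" after one delete each, yet their distance is 2, so a lookup with a maximum distance of 1 has to reject the pair.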
20
Testing results
Test set:
- 5,000 random queries and 10,000 misspelled queries
- Suggestions collected through the Google API and then manually checked
- Only one marker per query
Results:
- Slightly (10-12%) worse than Google, which is acceptable given the RAM requirements
- An A/B test showed a 3-9% increase in purchases
21
Future plans
Implementation:
- Use 3-gram data (still trying to keep it RAM-optimal)
Testing:
- Use multi-marker test set
- Properly handle cases when spellchecker returns multiple variants
Thank you!
22

Más contenido relacionado

La actualidad más candente

Container Runtime Security with Falco
Container Runtime Security with FalcoContainer Runtime Security with Falco
Container Runtime Security with FalcoMichael Ducy
 
Locking and Concurrency Control
Locking and Concurrency ControlLocking and Concurrency Control
Locking and Concurrency ControlMorgan Tocker
 
Migration d'une Architecture Microservice vers une Architecture Event-Driven ...
Migration d'une Architecture Microservice vers une Architecture Event-Driven ...Migration d'une Architecture Microservice vers une Architecture Event-Driven ...
Migration d'une Architecture Microservice vers une Architecture Event-Driven ...Daniel Rene FOUOMENE PEWO
 
JavaScript Interview Questions and Answers | Full Stack Web Development Train...
JavaScript Interview Questions and Answers | Full Stack Web Development Train...JavaScript Interview Questions and Answers | Full Stack Web Development Train...
JavaScript Interview Questions and Answers | Full Stack Web Development Train...Edureka!
 
How to Avoid Common Mistakes When Using Reactor Netty
How to Avoid Common Mistakes When Using Reactor NettyHow to Avoid Common Mistakes When Using Reactor Netty
How to Avoid Common Mistakes When Using Reactor NettyVMware Tanzu
 
Lecture 3: Servlets - Session Management
Lecture 3:  Servlets - Session ManagementLecture 3:  Servlets - Session Management
Lecture 3: Servlets - Session ManagementFahad Golra
 
Grokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking VN
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
ELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learnedELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learnedTin Le
 
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...Claus Ibsen
 
The InfluxDB 2.0 Storage Engine | Jacob Marble | InfluxData
The InfluxDB 2.0 Storage Engine | Jacob Marble | InfluxDataThe InfluxDB 2.0 Storage Engine | Jacob Marble | InfluxData
The InfluxDB 2.0 Storage Engine | Jacob Marble | InfluxDataInfluxData
 
Introduction to AMQP Messaging with RabbitMQ
Introduction to AMQP Messaging with RabbitMQIntroduction to AMQP Messaging with RabbitMQ
Introduction to AMQP Messaging with RabbitMQDmitriy Samovskiy
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David AndersonVerverica
 
DMA Survival Guide
DMA Survival GuideDMA Survival Guide
DMA Survival GuideKernel TLV
 
Concurrent/ parallel programming
Concurrent/ parallel programmingConcurrent/ parallel programming
Concurrent/ parallel programmingTausun Akhtary
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Slim Baltagi
 

La actualidad más candente (20)

Input-Buffering
Input-BufferingInput-Buffering
Input-Buffering
 
Query trees
Query treesQuery trees
Query trees
 
Container Runtime Security with Falco
Container Runtime Security with FalcoContainer Runtime Security with Falco
Container Runtime Security with Falco
 
Locking and Concurrency Control
Locking and Concurrency ControlLocking and Concurrency Control
Locking and Concurrency Control
 
Migration d'une Architecture Microservice vers une Architecture Event-Driven ...
Migration d'une Architecture Microservice vers une Architecture Event-Driven ...Migration d'une Architecture Microservice vers une Architecture Event-Driven ...
Migration d'une Architecture Microservice vers une Architecture Event-Driven ...
 
Qemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System EmulationQemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System Emulation
 
JavaScript Interview Questions and Answers | Full Stack Web Development Train...
JavaScript Interview Questions and Answers | Full Stack Web Development Train...JavaScript Interview Questions and Answers | Full Stack Web Development Train...
JavaScript Interview Questions and Answers | Full Stack Web Development Train...
 
How to Avoid Common Mistakes When Using Reactor Netty
How to Avoid Common Mistakes When Using Reactor NettyHow to Avoid Common Mistakes When Using Reactor Netty
How to Avoid Common Mistakes When Using Reactor Netty
 
Lecture 3: Servlets - Session Management
Lecture 3:  Servlets - Session ManagementLecture 3:  Servlets - Session Management
Lecture 3: Servlets - Session Management
 
Grokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKI
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
ELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learnedELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learned
 
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
 
The InfluxDB 2.0 Storage Engine | Jacob Marble | InfluxData
The InfluxDB 2.0 Storage Engine | Jacob Marble | InfluxDataThe InfluxDB 2.0 Storage Engine | Jacob Marble | InfluxData
The InfluxDB 2.0 Storage Engine | Jacob Marble | InfluxData
 
Compiler Construction
Compiler ConstructionCompiler Construction
Compiler Construction
 
Introduction to AMQP Messaging with RabbitMQ
Introduction to AMQP Messaging with RabbitMQIntroduction to AMQP Messaging with RabbitMQ
Introduction to AMQP Messaging with RabbitMQ
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David Anderson
 
DMA Survival Guide
DMA Survival GuideDMA Survival Guide
DMA Survival Guide
 
Concurrent/ parallel programming
Concurrent/ parallel programmingConcurrent/ parallel programming
Concurrent/ parallel programming
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 

Similar a Grokking TechTalk #35: Efficient spellchecking

Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
c++ Data Types and Selection
c++ Data Types and Selectionc++ Data Types and Selection
c++ Data Types and SelectionAhmed Nobi
 
My app is secure... I think
My app is secure... I thinkMy app is secure... I think
My app is secure... I thinkWim Godden
 
Begin with Machine Learning
Begin with Machine LearningBegin with Machine Learning
Begin with Machine LearningNarong Intiruk
 
Spock Framework - Slidecast
Spock Framework - SlidecastSpock Framework - Slidecast
Spock Framework - SlidecastDaniel Kolman
 
My app is secure... I think
My app is secure... I thinkMy app is secure... I think
My app is secure... I thinkWim Godden
 
Railway Oriented Programming in Elixir
Railway Oriented Programming in ElixirRailway Oriented Programming in Elixir
Railway Oriented Programming in ElixirMustafa TURAN
 
Word embeddings as a service - PyData NYC 2015
Word embeddings as a service -  PyData NYC 2015Word embeddings as a service -  PyData NYC 2015
Word embeddings as a service - PyData NYC 2015François Scharffe
 
Network automation with Ansible and Python
Network automation with Ansible and PythonNetwork automation with Ansible and Python
Network automation with Ansible and PythonJisc
 
Django in the Office: Get Your Admin for Nothing and Your SQL for Free
Django in the Office: Get Your Admin for Nothing and Your SQL for FreeDjango in the Office: Get Your Admin for Nothing and Your SQL for Free
Django in the Office: Get Your Admin for Nothing and Your SQL for FreeHarvard Web Working Group
 
The why and how of moving to PHP 5.5/5.6
The why and how of moving to PHP 5.5/5.6The why and how of moving to PHP 5.5/5.6
The why and how of moving to PHP 5.5/5.6Wim Godden
 
My app is secure... I think
My app is secure... I thinkMy app is secure... I think
My app is secure... I thinkWim Godden
 
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...Codemotion
 
Dialog Engine for Product Information
Dialog Engine for Product InformationDialog Engine for Product Information
Dialog Engine for Product InformationVamsee Chamakura
 
Testing Adhearsion Applications
Testing Adhearsion ApplicationsTesting Adhearsion Applications
Testing Adhearsion ApplicationsLuca Pradovera
 
Logical Expressions in C/C++. Mistakes Made by Professionals
Logical Expressions in C/C++. Mistakes Made by ProfessionalsLogical Expressions in C/C++. Mistakes Made by Professionals
Logical Expressions in C/C++. Mistakes Made by ProfessionalsPVS-Studio
 

Similar a Grokking TechTalk #35: Efficient spellchecking (20)

Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
c++ Data Types and Selection
c++ Data Types and Selectionc++ Data Types and Selection
c++ Data Types and Selection
 
My app is secure... I think
My app is secure... I thinkMy app is secure... I think
My app is secure... I think
 
Begin with Machine Learning
Begin with Machine LearningBegin with Machine Learning
Begin with Machine Learning
 
Spock Framework - Slidecast
Spock Framework - SlidecastSpock Framework - Slidecast
Spock Framework - Slidecast
 
Spock Framework
Spock FrameworkSpock Framework
Spock Framework
 
My app is secure... I think
My app is secure... I thinkMy app is secure... I think
My app is secure... I think
 
Conf orm - explain
Conf orm - explainConf orm - explain
Conf orm - explain
 
Railway Oriented Programming in Elixir
Railway Oriented Programming in ElixirRailway Oriented Programming in Elixir
Railway Oriented Programming in Elixir
 
Word embeddings as a service - PyData NYC 2015
Word embeddings as a service -  PyData NYC 2015Word embeddings as a service -  PyData NYC 2015
Word embeddings as a service - PyData NYC 2015
 
Network automation with Ansible and Python
Network automation with Ansible and PythonNetwork automation with Ansible and Python
Network automation with Ansible and Python
 
Django in the Office: Get Your Admin for Nothing and Your SQL for Free
Django in the Office: Get Your Admin for Nothing and Your SQL for FreeDjango in the Office: Get Your Admin for Nothing and Your SQL for Free
Django in the Office: Get Your Admin for Nothing and Your SQL for Free
 
The why and how of moving to PHP 5.5/5.6
The why and how of moving to PHP 5.5/5.6The why and how of moving to PHP 5.5/5.6
The why and how of moving to PHP 5.5/5.6
 
My app is secure... I think
My app is secure... I thinkMy app is secure... I think
My app is secure... I think
 
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
 
Dialog Engine for Product Information
Dialog Engine for Product InformationDialog Engine for Product Information
Dialog Engine for Product Information
 
Testing Adhearsion Applications
Testing Adhearsion ApplicationsTesting Adhearsion Applications
Testing Adhearsion Applications
 
Logical Expressions in C/C++. Mistakes Made by Professionals
Logical Expressions in C/C++. Mistakes Made by ProfessionalsLogical Expressions in C/C++. Mistakes Made by Professionals
Logical Expressions in C/C++. Mistakes Made by Professionals
 
Php optimization
Php optimizationPhp optimization
Php optimization
 
Php101
Php101Php101
Php101
 

Más de Grokking VN

Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banksGrokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banksGrokking VN
 
Grokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles ThinkingGrokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles ThinkingGrokking VN
 
Grokking Techtalk #42: Engineering challenges on building data platform for M...
Grokking Techtalk #42: Engineering challenges on building data platform for M...Grokking Techtalk #42: Engineering challenges on building data platform for M...
Grokking Techtalk #42: Engineering challenges on building data platform for M...Grokking VN
 
Grokking Techtalk #43: Payment gateway demystified
Grokking Techtalk #43: Payment gateway demystifiedGrokking Techtalk #43: Payment gateway demystified
Grokking Techtalk #43: Payment gateway demystifiedGrokking VN
 
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking VN
 
Grokking Techtalk #38: Escape Analysis in Go compiler
 Grokking Techtalk #38: Escape Analysis in Go compiler Grokking Techtalk #38: Escape Analysis in Go compiler
Grokking Techtalk #38: Escape Analysis in Go compilerGrokking VN
 
Grokking Techtalk #37: Data intensive problem
 Grokking Techtalk #37: Data intensive problem Grokking Techtalk #37: Data intensive problem
Grokking Techtalk #37: Data intensive problemGrokking VN
 
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
 Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer... Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...Grokking VN
 
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...Grokking VN
 
SOLID & Design Patterns
SOLID & Design PatternsSOLID & Design Patterns
SOLID & Design PatternsGrokking VN
 
Grokking TechTalk #31: Asynchronous Communications
Grokking TechTalk #31: Asynchronous CommunicationsGrokking TechTalk #31: Asynchronous Communications
Grokking TechTalk #31: Asynchronous CommunicationsGrokking VN
 
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at ScaleGrokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at ScaleGrokking VN
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking VN
 
Grokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search TreeGrokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search TreeGrokking VN
 
Grokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the MagicGrokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the MagicGrokking VN
 
Grokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platformGrokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platformGrokking VN
 
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...Grokking VN
 
Grokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking VN
 
Grokking TechTalk #21: Deep Learning in Computer Vision
Grokking TechTalk #21: Deep Learning in Computer VisionGrokking TechTalk #21: Deep Learning in Computer Vision
Grokking TechTalk #21: Deep Learning in Computer VisionGrokking VN
 
Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101Grokking VN
 

Más de Grokking VN (20)

Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banksGrokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
 
Grokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles ThinkingGrokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles Thinking
 
Grokking Techtalk #42: Engineering challenges on building data platform for M...
Grokking Techtalk #42: Engineering challenges on building data platform for M...Grokking Techtalk #42: Engineering challenges on building data platform for M...
Grokking Techtalk #42: Engineering challenges on building data platform for M...
 
Grokking Techtalk #43: Payment gateway demystified
Grokking Techtalk #43: Payment gateway demystifiedGrokking Techtalk #43: Payment gateway demystified
Grokking Techtalk #43: Payment gateway demystified
 
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
 
Grokking Techtalk #38: Escape Analysis in Go compiler
 Grokking Techtalk #38: Escape Analysis in Go compiler Grokking Techtalk #38: Escape Analysis in Go compiler
Grokking Techtalk #38: Escape Analysis in Go compiler
 
Grokking Techtalk #37: Data intensive problem
 Grokking Techtalk #37: Data intensive problem Grokking Techtalk #37: Data intensive problem
Grokking Techtalk #37: Data intensive problem
 
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
 Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer... Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
 
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
 
SOLID & Design Patterns
SOLID & Design PatternsSOLID & Design Patterns
SOLID & Design Patterns
 
Grokking TechTalk #31: Asynchronous Communications
Grokking TechTalk #31: Asynchronous CommunicationsGrokking TechTalk #31: Asynchronous Communications
Grokking TechTalk #31: Asynchronous Communications
 
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at ScaleGrokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
 
Grokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search TreeGrokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search Tree
 
Grokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the MagicGrokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the Magic
 
Grokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platformGrokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platform
 
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
 
Grokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocols
 
Grokking TechTalk #21: Deep Learning in Computer Vision
Grokking TechTalk #21: Deep Learning in Computer VisionGrokking TechTalk #21: Deep Learning in Computer Vision
Grokking TechTalk #21: Deep Learning in Computer Vision
 
Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101
 

Grokking TechTalk #35: Efficient spellchecking

  • 1. CPU and memory efficient spellchecker implementation in TIKI
  • 3. Results for “ipohne” without spellchecker
  • 4. Results for “ipohne” with spellchecker
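  • 5. General approach
    # Correct each query word independently by picking its highest-scoring candidate.
    words, result = (tokenize(query), [])
    for w in words:
        candidates = generate_candidates(w)
        # Start from the original word so that a word with no candidate
        # scoring above zero is kept as-is instead of becoming None.
        best_c, best_score = (w, 0.)
        for c in candidates:
            score = spellchecker_score(w, c)
            if score > best_score:
                best_c, best_score = (c, score)
        result.append(best_c)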
  • 6. Generate candidates
    Generate all possible similar words:
    - We need a measure of similarity; we use Damerau-Levenshtein distance
    - It allows insertions, deletions, substitutions and transpositions of symbols
    - We limit the maximum allowed distance depending on the length of the word
    - Then we simply generate all edits of the 4 possible types (CPU-intensive)
    - We will optimize this approach later
    Examples of Damerau-Levenshtein distance:
    - distance(nguyễn, nguyên) = 1 (one substitution)
    - distance(nguyễn, nguyeenx) = 3 (one substitution, two insertions)
    - distance(behaivour, behaviour) = 1 (one transposition)
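The distance used on slide 6 can be sketched as follows. This is a minimal sketch of the restricted ("optimal string alignment") variant of Damerau-Levenshtein distance; the slides do not say which variant is used, so that choice is an assumption, and the name `osa_distance` is hypothetical. Inputs are NFC-normalized so that an accented Vietnamese letter counts as a single symbol.

```python
import unicodedata


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # Assumption: compare precomposed characters, so "ễ" is one symbol.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)

    prev2 = None                       # DP row for i - 2
    prev = list(range(len(b) + 1))     # DP row for i - 1 (starts as row 0)
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j] + 1,           # deletion
                cur[j - 1] + 1,        # insertion
                prev[j - 1] + cost,    # substitution (or match)
            )
            # Transposition of two adjacent symbols.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


# The three examples from the slide.
print(osa_distance("nguyễn", "nguyên"))        # 1
print(osa_distance("nguyễn", "nguyeenx"))      # 3
print(osa_distance("behaivour", "behaviour"))  # 1
```

Keeping only three rows of the table is enough here because a transposition looks back at most two rows.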
  • 7. Spellchecker score
    “Noisy channel” model:
    - Bayes' formula: P(c|w) = P(w|c) * P(c) / P(w)
    - We need to find the candidate c that maximizes P(c|w)
    - This simplifies to maximizing P(w|c) * P(c), because P(w) is the same for all candidates
    Probabilities used:
    - P(c|w): probability that c was intended when w was observed
    - P(w|c): probability that w is a misspelling of c (error model)
    - P(c): probability of observing c (language model)
  • 8. Building the language model
    N-gram model:
    - Build a 2-gram dictionary
    - Remove 2-grams below a certain threshold
    Data used:
    - All product contents on Tiki
    - All Tiki search queries for a year
    - Some randomly crawled texts from the Vietnamese Web
    - Total: 5.5 GB gzipped
  • 9. Building the language model (example)
    Data (queries on Tiki):
      máy rửa mặt
      máy rửa mắt
      máy sấy tóc
      máy xay tóc
      máy xay sinh tố
      máy rửa mắt
      máy xay sinh tố
      máy sấy tóc
      ...
      máy sấy tóc
      máy xay tóc
      máy xay sinh tố
      máy rửa mắt
      máy rửa mắt
      máy xay sinh tố
      máy sấy tóc
    Counted queries:
      200 máy rửa mặt
      5 máy rửa mắt
      100 máy sấy tóc
      5 máy xay tóc
      100 máy xay sinh tố
  • 10. Building the language model (example)
    Counted queries:
      200 máy rửa mặt
      5 máy rửa mắt
      100 máy sấy tóc
      5 máy xay tóc
      100 máy xay sinh tố
    Language model:
      410 <
      410 >
      410 máy
      410 < máy
      205 máy rửa
      100 máy sấy
      105 máy xay
      105 tóc >
      100 sấy tóc
      5 xay tóc
      105 tóc
      ...
    We count all single words and word pairs in the counted queries and write them into the language model. This lets us calculate the probability of observing a word without context, or with a context of one word before or after it. The symbols < and > mark the start and end of a query.
  • 11. Building the language model (example)
    Language model:
      410 <
      410 >
      410 máy
      410 < máy
      205 máy rửa
      100 máy sấy
      105 máy xay
      105 tóc >
      100 sấy tóc
      5 xay tóc
      105 tóc
      ...
    Query: máy => “< máy >”
      P(máy) = 0.5 * (P(< máy) + P(máy >)) = 0.5 * (410/410 + 0/410) = 0.5
    Query: máy xay tóc
      P(xay) = 0.5 * (P(máy xay) + P(xay tóc)) = 0.5 * (105/410 + 5/105) ~ 0.15
      P(sấy) = 0.5 * (P(máy sấy) + P(sấy tóc)) = 0.5 * (100/410 + 100/105) ~ 0.60
    The language model suggests that “sấy” is more likely than “xay” in this context.
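The counting and the averaged left/right context probability from slides 10 and 11 can be sketched as below. This is a minimal sketch, assuming whitespace tokenization and no pruning threshold; the function names `build_language_model` and `context_probability` are hypothetical and not taken from the talk.

```python
from collections import Counter

# Counted queries from the slides: (count, query).
COUNTED_QUERIES = [
    (200, "máy rửa mặt"),
    (5, "máy rửa mắt"),
    (100, "máy sấy tóc"),
    (5, "máy xay tóc"),
    (100, "máy xay sinh tố"),
]


def build_language_model(counted_queries):
    """Count single words and adjacent word pairs, with < and > as query borders."""
    unigrams, bigrams = Counter(), Counter()
    for count, query in counted_queries:
        tokens = ["<"] + query.split() + [">"]
        for token in tokens:
            unigrams[token] += count
        for left, right in zip(tokens, tokens[1:]):
            bigrams[(left, right)] += count
    return unigrams, bigrams


def context_probability(word, left, right, unigrams, bigrams):
    """Average of the pair frequency relative to the left and to the right neighbor."""
    p_left = bigrams[(left, word)] / unigrams[left] if unigrams[left] else 0.0
    p_right = bigrams[(word, right)] / unigrams[right] if unigrams[right] else 0.0
    return 0.5 * (p_left + p_right)


unigrams, bigrams = build_language_model(COUNTED_QUERIES)
p_single = context_probability("máy", "<", ">", unigrams, bigrams)
p_xay = context_probability("xay", "máy", "tóc", unigrams, bigrams)
p_say = context_probability("sấy", "máy", "tóc", unigrams, bigrams)
print(round(p_single, 2), round(p_xay, 2), round(p_say, 2))
```

Each pair count is divided by the count of the neighboring context word, which is how the slide arrives at terms such as 105/410 and 5/105.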
  • 12. Building the error model
    Automatic extraction of P(w|c):
    - Extract triplets (w1, w2, w3) from our text set
    - Group triplets by (w1, *, w3) and sort by descending popularity
    - Remove groupings below a certain threshold
    - Remove samples where the w2 words are too far from each other (using Damerau-Levenshtein distance)
    - Remove samples with popularity comparable to the most popular sample in the grouping
    - Write the w2 words from all remaining samples into the error model as triplets of (observed word, intended word, count)
    Data used:
    - Same as for the language model
  • 13. Building the error model (example)
    Data (queries on Tiki):
      máy rửa mặt
      máy rửa mắt
      máy sấy tóc
      máy xay tóc
      máy xay sinh tố
      máy rửa mắt
      máy xay sinh tố
      máy sấy tóc
      ...
      máy sấy tóc
      máy xay tóc
      máy xay sinh tố
      máy rửa mắt
      máy rửa mắt
      máy xay sinh tố
      máy sấy tóc
    Counted queries:
      200 máy rửa mặt
      5 máy rửa mắt
      100 máy sấy tóc
      5 máy xay tóc
      100 máy xay sinh tố
  • 14. Building the error model (example)
    Counted queries:
      200 máy rửa mặt
      5 máy rửa mắt
      100 máy sấy tóc
      5 máy xay tóc
      100 máy xay sinh tố
    Triplets:
      205 < máy rửa
      200 rửa mặt >
      5 rửa mắt >
      100 máy sấy tóc
      5 máy xay tóc
      200 máy rửa mặt
      5 máy rửa mắt
      105 < máy xay
      100 sinh tố >
      ...
    We count all possible triplets from our counted queries data.
  • 15. Building the error model (example)
    Triplets (grouped):
      rửa * >
        200 rửa mặt >
        5 rửa mắt >
      máy * tóc
        100 máy sấy tóc
        5 máy xay tóc
      máy * sinh
        100 máy xay sinh
      sinh * >
        100 sinh tố >
      ...
    Error model (format: count observed_word intended_word):
      200 mặt mặt
      5 mắt mặt
      100 sấy sấy
      5 xay sấy
      100 xay xay
      100 tố tố
      ...
  • 16. Building the error model (example)
    Error model (format: count observed_word intended_word):
      200 mặt mặt
      5 mắt mặt
      100 sấy sấy
      5 xay sấy
      100 xay xay
      100 tố tố
      ...
    Query: kem rửa mắt
      P(mắt|mắt) = 0/5 = 0.0: the number of times “mắt” was intended when “mắt” was observed in the error model, divided by the total number of times “mắt” was observed in the error model.
      P(mắt|mặt) = 5/5 = 1.0: likewise, the number of times “mặt” was intended when “mắt” was observed, divided by the total number of times “mắt” was observed.
    According to the error model built on our data, “mắt” is therefore extremely likely to be a misspelling of “mặt”.
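The lookup described on slide 16 can be sketched as below: sum the counts for the observed word, then take the share that belongs to the given intended word. This is a minimal sketch over the example rows from the slides; the function name `error_probability` and the list-of-tuples layout are assumptions, not the production data structure.

```python
# Example error model from the slides: (count, observed_word, intended_word).
ERROR_MODEL = [
    (200, "mặt", "mặt"),
    (5, "mắt", "mặt"),
    (100, "sấy", "sấy"),
    (5, "xay", "sấy"),
    (100, "xay", "xay"),
    (100, "tố", "tố"),
]


def error_probability(observed, intended, rows=ERROR_MODEL):
    """Share of the observations of `observed` for which `intended` was meant."""
    total = sum(count for count, obs, _ in rows if obs == observed)
    matching = sum(
        count for count, obs, meant in rows
        if obs == observed and meant == intended
    )
    return matching / total if total else 0.0


# Observed word "mắt" (eye) from the query "kem rửa mắt".
p_keep = error_probability("mắt", "mắt")   # 0 / 5
p_fix = error_probability("mắt", "mặt")    # 5 / 5
print(p_keep, p_fix)  # 0.0 1.0
```

A linear scan is used only to keep the sketch short; a real index would key the rows by observed word.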
  • 17. Quality optimizations
    Idea:
    - The language model is more important when more context is available
    - Instead of P(w|c) * P(c), use P(w|c) * pow(P(c), lambda)
    - Lambda depends on the length of the available context
    Results:
    - Using a bigger lambda for longer context gives a better test result (the idea works!)
    - For bigger N-grams, machine learning is needed to optimize the lambdas
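The effect of the exponent can be illustrated with the numbers from the earlier slides for the query “máy xay tóc”. This is a minimal sketch: the error-model and language-model values are derived from the example data in the deck, while the lambda values 1.0 and 3.0 are arbitrary assumptions chosen only to show the effect (the talk does not give the values it uses), and the names `weighted_score` and `best_candidate` are hypothetical.

```python
import math


def weighted_score(p_error, p_language, lam):
    """P(w|c) * P(c) ** lam; a larger lam gives the language model more weight."""
    return p_error * math.pow(p_language, lam)


# Observed word "xay" in "máy xay tóc", with two candidates.
# p_error follows the slide 16 calculation; p_language follows slide 11.
CANDIDATES = {
    "xay": {"p_error": 100 / 105, "p_language": 0.5 * (105 / 410 + 5 / 105)},
    "sấy": {"p_error": 5 / 105, "p_language": 0.5 * (100 / 410 + 100 / 105)},
}


def best_candidate(candidates, lam):
    """Return the candidate with the highest weighted score."""
    return max(
        candidates,
        key=lambda c: weighted_score(
            candidates[c]["p_error"], candidates[c]["p_language"], lam
        ),
    )


for lam in (1.0, 3.0):  # assumed values, for illustration only
    print(lam, best_candidate(CANDIDATES, lam))
```

With the assumed lambda of 1.0 the error model dominates and the observed word is kept; with 3.0 the context outweighs it and the correction wins.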
  • 18. Performance optimizations
    Important fact: it can be proved that if the Damerau-Levenshtein distance(w, c) = N, then for any w and c there is a combination of no more than N single-character deletions on each side that leads to the same string. Examples:
    - distance(iphone, iphobee) = 2 (one insertion, one substitution)
      iphone -> iphoe VS iphobee -> iphoee -> iphoe (match!)
    - distance(iphone, pihoone) = 2 (one transposition, one insertion)
      iphone -> ihone VS pihoone -> ihoone -> ihone (match!)
    Let’s use this to optimize candidate generation!
  • 19. Performance optimizations
    Problem 1: generating candidates is CPU-intensive
    - Precompute a “deletes” dictionary
    - Use only delete operations on both sides
    - The distance must be double-checked (a match can be up to 2N apart, but we need N)
    - Fast, but requires RAM
    Problem 2: the “deletes” dictionary requires RAM
    - Use different data compression techniques
    - From what we’ve tried, Judy dynamic arrays work the best
    - We decreased RAM requirements from 10.5 GB to 2.3 GB
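The precomputed “deletes” dictionary can be sketched as below. This is a minimal in-memory sketch using plain Python sets; the function names are hypothetical and it does not reproduce the Judy-array storage mentioned in the talk. The lookup returns a superset of the true candidates, so the caller still has to filter it by the real Damerau-Levenshtein distance.

```python
from collections import defaultdict


def deletes(word, max_deletes):
    """All strings reachable by deleting up to max_deletes characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_deletes):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def build_deletes_index(vocabulary, max_deletes):
    """Map every delete-variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for word in vocabulary:
        for variant in deletes(word, max_deletes):
            index[variant].add(word)
    return index


def lookup_candidates(query, index, max_deletes):
    """Dictionary words sharing a delete-variant with the query (unverified)."""
    found = set()
    for variant in deletes(query, max_deletes):
        found |= index.get(variant, set())
    return found


index = build_deletes_index(["iphone", "cdef"], max_deletes=2)
print(lookup_candidates("iphobee", index, 2))  # contains "iphone" (via "iphoe")
print(lookup_candidates("pihoone", index, 2))  # contains "iphone" (via "ihone")
# "abcd" and "cdef" both reduce to "cd" but are 4 edits apart,
# which is why the distance must be double-checked afterwards.
print(lookup_candidates("abcd", index, 2))
```

The index trades memory for speed: only deletions are generated at query time, instead of all four edit types over the whole alphabet.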
  • 20. Testing results
    Testing set:
    - 5,000 random queries, 10,000 misspelled queries
    - Suggestions collected through the Google API and then manually checked
    - Only one marker per query
    Results:
    - Slightly (10-12%) worse than Google (acceptable for such RAM requirements)
    - An A/B test shows a 3-9% increase in purchases
  • 21. Future plans
    Implementation:
    - Use 3-gram data (still trying to keep it RAM-optimal)
    Testing:
    - Use a multi-marker test set
    - Properly handle cases when the spellchecker returns multiple variants