This document provides an overview of the technical challenges in launching Indeed's job search platform around the world. It discusses how Indeed tokenizes and indexes jobs in different languages, including the challenges posed by Chinese, Japanese, and Korean text. It describes Indeed's approaches to language detection, stemming, and query expansion, which improve recall and relevance across many international markets. Key techniques include n-gram tokenization, detection by Unicode block, Bayesian classification, term expansion maps kept separate from the index, and rule-based stemming. The goal is a search system that is scalable, generic, and able to support job search across languages and regions.
16. Precision
Job seeker searches for “architect”
10 jobs returned:
8 building architect jobs Relevant
2 software architect jobs Not Relevant
Precision: 8 / 10
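The precision figure on this slide is the fraction of returned jobs that are relevant. A minimal sketch of that arithmetic follows; the `precision` helper is a hypothetical name, not something from the talk.

```python
from fractions import Fraction

def precision(relevant_returned: int, total_returned: int) -> Fraction:
    """Fraction of returned results that are relevant."""
    if total_returned == 0:
        raise ValueError("precision is undefined when nothing is returned")
    return Fraction(relevant_returned, total_returned)

# Slide example: 10 jobs returned for "architect", 8 of them relevant.
architect_precision = precision(8, 10)
print(float(architect_precision))
```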
22. Senior Software Engineer - Search
Indeed - Austin, TX
Indeed.com is seeking a Senior Software Engineer
responsible for the information retrieval system that
powers Indeed’s job search website.
If you are an engineer who's passionate about building
innovative products...
Job Description - English
23. Senior Software Engineer - Search
Indeed - Austin, TX
Indeed.com is seeking a Senior Software Engineer
responsible for the information retrieval system that
powers Indeed’s job search website.
If you are an engineer who's passionate about building
innovative products...
Tokenization
24. Inverted Index
● Like index in the back of a book
● words = tokens, page numbers = doc ids
25. Inverted Index
Token Job A Job B Job C
assistant ✔
developer ✔
engineer ✔
lawyer ✔ ✔
paralegal ✔ ✔
retrieval ✔
26. Inverted Indexes
Allow you to:
● Quickly find all documents containing a
token
● Perform boolean queries, e.g. “java AND
developer”
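The book-index analogy and the boolean query above can be sketched in a few lines. This is only an illustration: the whitespace tokenizer, the job titles, and the function names are assumptions, not Indeed's implementation.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each token to the set of doc ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search_and(index, *tokens):
    """Boolean AND: doc ids that contain every token."""
    postings = [index.get(t.lower(), set()) for t in tokens]
    return set.intersection(*postings) if postings else set()

# Hypothetical job titles, not taken from the slides.
jobs = {
    "A": "Senior Java Developer",
    "B": "Java Architect",
    "C": "Paralegal Assistant",
}
index = build_inverted_index(jobs)
print(sorted(search_and(index, "java", "developer")))
```

Each token lookup is a single dictionary access, and the AND query is a set intersection of the posting sets.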
31. Secrétaire
Saclay
Au sein de la direction de la Qualité et de l'Environnement (DQE)
vous seconderez la secrétaire-assistante. Vos principales
missions seront :
- organisation de réunions
- l'accueil téléphonique
- la gestion des missions ..
Job Description - French
59. Language Detection options
● HTTP Content-Language response header
○ Most sites don’t provide this header
○ May not be accurate
60. Language Detection - ICU4J
● ICU4J’s CharsetDetector
○ Works well for languages with single byte
encoded characters
○ Detect that language is one of
Danish, Dutch, English, French,
German, Italian, Portuguese, Swedish
68. Using Unicode Blocks
● 100% accurate
● Used in:
○ Thai
○ Greek
○ Korean
○ Hebrew
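For scripts that map to one language in this setting, detection can be a range check on code points. The sketch below is an assumption about how such a check might look; the block ranges are the standard Unicode ones, while the function name and the majority-vote rule are hypothetical.

```python
from collections import Counter

# Unicode block ranges (inclusive) for scripts used by a single language here.
SCRIPT_BLOCKS = {
    "th": [(0x0E00, 0x0E7F)],                    # Thai
    "el": [(0x0370, 0x03FF)],                    # Greek and Coptic
    "ko": [(0xAC00, 0xD7AF), (0x1100, 0x11FF)],  # Hangul Syllables, Hangul Jamo
    "he": [(0x0590, 0x05FF)],                    # Hebrew
}

def detect_by_block(text):
    """Return the language whose blocks cover the most characters, or None."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for lang, ranges in SCRIPT_BLOCKS.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[lang] += 1
    return counts.most_common(1)[0][0] if counts else None

print(detect_by_block("Γραμματέας"))
```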
69. CJ language detection
● Strongly weight Hiragana and Katakana
● Some characters (Kanji) common between
Chinese and Japanese
● p(卒 ∈ ja) = 0.99, p(卒 ∈ zh) = 0.000001
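One way to combine per-character evidence like this is a naive Bayes score with equal priors. The sketch below is an assumption about the approach, not the talk's code: it treats the slide's two values for 卒 as per-character likelihoods, and every other probability (kana, 们, the default) is made up for illustration.

```python
import math

# Per-character likelihoods p(char | language). Only the values for 卒 come from
# the slide; all other numbers here are made-up assumptions.
CHAR_PROBS = {
    "卒": {"ja": 0.99, "zh": 0.000001},
    "们": {"ja": 0.001, "zh": 0.999},     # assumed: simplified form, Chinese only
}
KANA_PROBS = {"ja": 0.999, "zh": 0.001}   # assumed: kana strongly implies Japanese
DEFAULT_PROBS = {"ja": 0.5, "zh": 0.5}    # assumed: shared Kanji carry no signal

def is_kana(ch):
    """True for Hiragana (U+3040-U+309F) or Katakana (U+30A0-U+30FF)."""
    return 0x3040 <= ord(ch) <= 0x30FF

def classify_cj(text):
    """Naive Bayes with equal priors: sum log-likelihoods per language."""
    scores = {"ja": 0.0, "zh": 0.0}
    for ch in text:
        probs = CHAR_PROBS.get(ch) or (KANA_PROBS if is_kana(ch) else DEFAULT_PROBS)
        for lang in scores:
            scores[lang] += math.log(probs[lang])
    return max(scores, key=scores.get)

print(classify_cj("新卒採用のお知らせ"))
```

Summing logarithms avoids underflow on long job descriptions, where multiplying many small probabilities would round to zero.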
70. Language Results
● Did cross validation on hand labeled testing
data
● 99% accurate for text > 30 characters
○ Average job description is 200 characters
● Fast - 0.6ms per job
100. Why stemming matters
● Return all possible relevant jobs given the
user’s query, not just exact matches
101. Stemming - Lucene Analyzers
● Do stemming before adding to inverted
index
● Examples
○ PorterStemFilter
○ SnowballAnalyzer
○ EnglishMinimalStemmer
102. Inverted Index
Job A: Directrice de Documentaires
Job B: Directeur de production
Token          Job A  Job B
de             ✔      ✔
directeur      ✔      ✔
documentaires  ✔
production            ✔
103. Search with stemming tokenizers
● At search time, use the same analyzer on
the query
○ “directrice” → “directeur”
● Search for “directrice” returns both jobs
104. Modifying stem rules requires a full
index rebuild
● If roots have changed, all jobs need to be
reprocessed
105.
Token          Job A  Job B
de             ✔      ✔
directeur      ✔      ✔
documentaires  ✔
production            ✔
106. Drawbacks
● Loss of precise information
○ “Directrice” search should return exact match only
109. Term Expansion Maps
● Map from String->List<String>
● Key is root, values are tokens that stem
to that root
● driver → driver, drivers
● vendeur → vendeur, vendeuse
111. Building term expansion map
for each language
    for each term in language
        root = Stemmer.stem(term)
        termMap[root].append(term)
● Takes ~1.5 minutes on index with 2
million tokens and 18 languages
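A runnable version of the loop above might look like the following. The suffix rules are toy stand-ins for real stemmers, and keying the map by (language, root) is an assumption made here to keep languages apart; the slides describe the map only as String->List<String>.

```python
from collections import defaultdict

# Toy suffix rules standing in for real per-language stemmers (assumption).
TOY_RULES = {
    "en": [("s", "")],
    "fr": [("euse", "eur")],
}

def stem(term, lang):
    """Apply the first matching suffix rule for the language, if any."""
    for suffix, replacement in TOY_RULES.get(lang, []):
        if term.endswith(suffix) and len(term) > len(suffix):
            return term[: -len(suffix)] + replacement
    return term

def build_term_expansion_map(terms_by_lang):
    """Map (lang, root) -> sorted list of index terms that stem to that root."""
    term_map = defaultdict(set)
    for lang, terms in terms_by_lang.items():
        for term in terms:
            term_map[(lang, stem(term, lang))].add(term)
    return {key: sorted(values) for key, values in term_map.items()}

term_map = build_term_expansion_map({
    "en": ["driver", "drivers"],
    "fr": ["vendeur", "vendeuse"],
})
print(term_map)
```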
119. Job A: Directrice de documentaires
Job B: Directeur de production
Token          Job A  Job B
de             ✔      ✔
directeur             ✔
directrice     ✔
documentaires  ✔
production            ✔
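With exact tokens in the index, stemming can move to query time: look up the query's root in the term expansion map and OR together all of its variants. The sketch below assumes that flow; the `ROOTS` lookup stands in for a stemmer and `search` is a hypothetical name.

```python
# Unstemmed index from the slide: tokens are stored exactly as they appear.
INDEX = {
    "de": {"A", "B"},
    "directeur": {"B"},
    "directrice": {"A"},
    "documentaires": {"A"},
    "production": {"B"},
}

# Hypothetical term expansion map entry: root -> tokens that stem to it.
TERM_MAP = {"directeur": ["directeur", "directrice"]}
# Hypothetical stemmer output, hard-coded for this example.
ROOTS = {"directrice": "directeur", "directeur": "directeur"}

def search(query_term, expand=True):
    """Return job ids for one query term, optionally OR-ing all its variants."""
    if not expand:
        return set(INDEX.get(query_term, set()))
    root = ROOTS.get(query_term, query_term)
    variants = TERM_MAP.get(root, [query_term])
    results = set()
    for variant in variants:          # boolean OR over the expanded terms
        results |= INDEX.get(variant, set())
    return results

print(sorted(search("directrice")))                 # expanded search
print(sorted(search("directrice", expand=False)))   # exact match only
```

Because the index keeps both "directeur" and "directrice", an exact-match search stays possible, which the stemmed index could not offer.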
120. Benefits
● Modifying stem rules doesn’t require index
rebuilds
○ Takes minutes on index with millions of jobs
○ Had flexibility to iteratively implement stemming
rules as we came across different use cases
124. Scale Stemming
● Indeed continued international expansion
● Needed stemming to scale without code
deploys and coordination between
developers and country managers.
152. [Diagram: components are Country Managers, Stem Rule Editor (EN: s → ‘’, ces → y, …; FR: e → é, u → ù, …), Jobs Index Builder, Term Expansion Map (sale → sale, sales; policy → policy, policies), Search Service, and Job Seekers, who send a query and receive results.]
153. Term expansion map storage
● Custom serialization format
○ Store string array as UTF8 bytes and offsets
○ Front encoding for additional compression
● 2X smaller than using Java native
serialization
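Front encoding (also called front coding) stores each term in a sorted list as the length of the prefix it shares with the previous term plus the remaining suffix. The sketch below shows the idea on Python strings; the real format works on UTF-8 bytes with offsets, and its exact layout is not given in the slides.

```python
def front_encode(sorted_terms):
    """Encode each term as (shared-prefix length with the previous term, suffix)."""
    encoded, previous = [], ""
    for term in sorted_terms:
        shared = 0
        limit = min(len(term), len(previous))
        while shared < limit and term[shared] == previous[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded

def front_decode(encoded):
    """Rebuild the original terms from (prefix length, suffix) pairs."""
    terms, previous = [], ""
    for shared, suffix in encoded:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms

terms = ["policies", "policy", "sale", "sales"]
encoded = front_encode(terms)
print(encoded)
```

Stemming variants sort next to each other and share long prefixes, so a term expansion map is a favorable input for this kind of compression.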
155. Scalable
27 languages use stemming rules
Re-used language detection and stemming
libraries in resume search
156. Efficient
● Term expansion map in Europe index has 2
million terms in 18 languages - 60MB on
disk
● Building term expansion maps takes ~ 1.5
minutes
● Doing boolean query for stemming adds
~5ms to median search time (~35ms)
157. Stemming helps job seekers
Searches that return no jobs reduced by 60%
with stemming
3% to 5% more clicks
168. Job Bid x eCTR = Value
A $3.00 5% $0.15
B $2.00 10% $0.20
C $1.00 8% $0.08
169. Job Bid x eCTR = Value → Rank
A $3.00 5% $0.15 2
B $2.00 10% $0.20 1
C $1.00 8% $0.08 3
170. Job Bid x eCTR = Value → Rank
B $2.00 10% $0.20 1
A $3.00 5% $0.15 2
171. Job Bid x eCTR = Value → Rank
B $2.00 10% $0.20 1
A $3.00 5% $0.15 2
B could win the auction with a lower bid...
172. B could win the auction with a lower bid...
…only charge what’s needed to win!
Job Bid x eCTR = Value → Rank
B $2.00 10% $0.20 1
A $3.00 5% $0.15 2
173. B could win the auction with a lower bid...
…only charge what’s needed to win!
Job Bid x eCTR = Value → Rank
B $2.00 10% $0.20 1
A $3.00 5% $0.15 2
$1.50 x 10% = $0.15
174. B could win the auction with a lower bid...
…only charge what’s needed to win!
Cost = $1.51
Job Bid x eCTR = Value → Rank
B $2.00 10% $0.20 1
A $3.00 5% $0.15 2
178. Sponsored Jobs at Indeed
“Generalized Second Price Auction”
● Fair for employers
● Ensures sponsored results are relevant and
useful for job seekers
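The ranking and pricing steps from the preceding auction slides can be sketched as follows. Amounts are integer cents and eCTR is an exact fraction. Charging the last-ranked job its full bid is an assumption made here; the slides do not say how the final position is priced.

```python
from fractions import Fraction

def run_auction(bids):
    """Rank jobs by bid x eCTR and price each winner just above the next rank.

    `bids` maps job id -> (bid in cents, eCTR as a Fraction).
    Returns a list of (job id, cost per click in cents) in rank order.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1][0] * kv[1][1], reverse=True)
    results = []
    for position, (job, (bid, ectr)) in enumerate(ranked):
        if position + 1 < len(ranked):
            next_bid, next_ectr = ranked[position + 1][1]
            # Smallest whole-cent bid whose value strictly beats the next value.
            cost = (next_bid * next_ectr) // ectr + 1
            cost = min(int(cost), bid)
        else:
            cost = bid  # assumption: no reserve price, last job pays its bid
        results.append((job, cost))
    return results

slide_bids = {
    "A": (300, Fraction(5, 100)),
    "B": (200, Fraction(10, 100)),
    "C": (100, Fraction(8, 100)),
}
results = run_auction(slide_bids)
print(results)
```

For job B this reproduces the slide's reasoning: a bid of $1.50 would only tie A's value of $0.15, so B is charged one cent more.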
180. Sponsored Jobs at Indeed
Employers set their bid & budget
employer_id int(10) unsigned,
bid decimal(10,2) unsigned,
daily_budget decimal(10,2) unsigned,
181. Sponsored Jobs at Indeed
A builder process creates read-optimized data
structures for the auction system
182. Sponsored Jobs at Indeed
On search results page, execute auction to
determine sponsored impressions
183. Sponsored Jobs at Indeed
When job seeker clicks on sponsored result,
log information from the auction
employerId
jobId
bid
cost
…
184. Sponsored Jobs at Indeed
Process click logs to update budgets and
charge employers
185. Sponsored Jobs at Indeed
Process click logs to update budgets and
charge employers
Apply business rules during click processing:
● Fraud detection
● Duplicate click detection
186. SJ outside the US
Non-US employers wanted their jobs in
sponsored results...
187. SJ outside the US
Non-US employers wanted their jobs in
sponsored results...
...but they don’t have US Dollars
188. SJ outside the US
v1: Use credit cards
Credit card company converts charges to
employer’s currency
190. SJ outside the US
Credit Cards
+ No changes needed
- Bad UX for employers
191. SJ outside the US
Credit Cards
+ No changes needed
- Bad UX for employers
- Disadvantaged exchange rates
192. SJ outside the US
Credit Cards
+ No changes needed
- Bad UX for employers
- Disadvantaged exchange rates
- Employers bear currency risk
193. Credit Cards: Currency Risk
Desired Daily Budget: CA $100.00
Exchange rate on Jan 1: 0.9351
Set Daily Budget to: $93.51
194. Credit Cards: Currency Risk
Desired Daily Budget: CA $100.00
Exchange rate on Jan 1: 0.9351
Set Daily Budget to: $93.51
Exchange rate on Jan 31: 0.8970
Effective Daily Budget: CA $104.25
195. Credit Cards: Currency Risk
Desired Daily Budget: CA $100.00
Exchange rate on Jan 1: 0.9351
Set Daily Budget to: $93.51
Exchange rate on Jan 31: 0.8970
Effective Daily Budget: CA $104.25 (+4.25%)
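The drift in this example is plain arithmetic, reproduced below with exact decimals. The sketch reads the quoted rate as US dollars per Canadian dollar, which matches the slide's numbers; the function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def usd_budget(desired_cad, usd_per_cad):
    """USD budget to set so that it equals the desired CAD amount today."""
    return (desired_cad * usd_per_cad).quantize(CENT, rounding=ROUND_HALF_UP)

def effective_cad(budget_usd, usd_per_cad):
    """What a fixed USD budget is worth in CAD at a given rate."""
    return (budget_usd / usd_per_cad).quantize(CENT, rounding=ROUND_HALF_UP)

budget = usd_budget(Decimal("100.00"), Decimal("0.9351"))    # rate on Jan 1
later = effective_cad(budget, Decimal("0.8970"))             # rate on Jan 31
drift_percent = (later - Decimal("100.00")) / Decimal("100.00") * 100
print(budget, later, drift_percent)
```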
197. Multi-currency SJ
Employers can set bids and budgets in
preferred currency
Canadian Dollars CAD
Australian Dollars AUD
Japanese Yen JPY
Euro EUR
British Pounds GBP
Swiss Francs CHF
205. Millicents
Provide enough granularity to differentiate
similar values in different currencies
All of these are about $1.00 (USD):
£0.60 (GBP)
€0.73 (EUR)
¥102 (JPY)
206. Millicents
Provide enough granularity to differentiate
similar values in different currencies
All of these are about $1.00 (USD):
£0.60 (GBP)
€0.73 (EUR) Which is larger?
¥102 (JPY)
209. Millicents
32 bit signed values
$21,474 USD equivalent
64 bit signed values
$92 trillion USD equivalent
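These limits follow from the integer width. The sketch below assumes one millicent is 1/1000 of a US cent, which is consistent with the 32-bit figure on the slide; `max_usd` is a hypothetical helper.

```python
MILLICENTS_PER_DOLLAR = 100 * 1000  # 1 millicent = 1/1000 of a cent

def max_usd(bits):
    """Largest whole-dollar amount a signed integer of `bits` bits holds in millicents."""
    max_millicents = 2 ** (bits - 1) - 1
    return max_millicents // MILLICENTS_PER_DOLLAR

print(f"32-bit: ${max_usd(32):,}")
print(f"64-bit: ${max_usd(64):,}")
```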
210. Local Currency Values
Values in specific currency are represented
with currency code and an integer
Integer represents “minor unit”, depends on
the currency type:
(USD, 543) == $5.43
(EUR, 543) == €5.43
(JPY, 543) == ¥543
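A small sketch of the (currency code, integer) representation: the number of minor-unit digits per currency follows ISO 4217, while the symbol table and the `format_amount` helper are assumptions for illustration. It handles non-negative amounts only.

```python
# Number of decimal places in the minor unit (per ISO 4217) for a few currencies.
MINOR_UNIT_DIGITS = {"USD": 2, "EUR": 2, "JPY": 0}
SYMBOLS = {"USD": "$", "EUR": "€", "JPY": "¥"}

def format_amount(currency, minor_units):
    """Render (currency code, integer minor units) as a display string."""
    digits = MINOR_UNIT_DIGITS[currency]
    if digits == 0:
        return f"{SYMBOLS[currency]}{minor_units}"
    major, minor = divmod(minor_units, 10 ** digits)
    return f"{SYMBOLS[currency]}{major}.{minor:0{digits}d}"

for code in ("USD", "EUR", "JPY"):
    print(format_amount(code, 543))
```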
211. Local Currency Values
For each currency, preferable that the “minor
unit” is roughly equal to $0.01 USD
● Exchange rate representation
● Fairness in auction competition
212. Local Currency Values
32 bit signed values
$21 million USD (and others)
¥2.1 billion JPY
64 bit signed values
$90 quadrillion USD (and others)
¥9 quintillion JPY
219. Multi-currency SJ
During click processing, convert auction cost
(in millicents) back to employer’s currency
using same exchange rate
costInMillicents
currency
exchangeRate
→ costInCurrency
224. Revenue Reporting
If the auction millicent cost is used, there could
be errors!
Millicent Cost: 53,826 millicents
Euro Cost: €0.39483
226. Revenue Reporting
If the auction millicent cost is used, there could
be errors!
Millicent Cost: 53,826 millicents
Euro Cost: €0.39
Actual Millicent Cost: 53,168 millicents
227. Revenue Reporting
If the auction millicent cost is used, there could
be errors!
Millicent Cost: 53,826 millicents
Euro Cost: €0.39
Actual Millicent Cost: 53,168 millicents
1.2% difference!
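The discrepancy comes from rounding the charge to whole euro cents and then converting back. The sketch below reproduces the slide's numbers under one assumption: the exchange rate is not stated in the slides, so `EUR_PER_USD` is back-derived from 0.39483 / 0.53826. Function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

MILLICENTS_PER_USD = Decimal(100_000)
# Assumed rate, back-derived from the slide's numbers (0.39483 / 0.53826).
EUR_PER_USD = Decimal("0.73353")

def millicents_to_eur_cents(millicents):
    """Convert an auction cost to whole euro cents (what gets charged)."""
    eur = millicents / MILLICENTS_PER_USD * EUR_PER_USD
    return int((eur * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

def eur_cents_to_millicents(eur_cents):
    """Convert the charged amount back to millicents at the same rate."""
    usd = Decimal(eur_cents) / 100 / EUR_PER_USD
    return int((usd * MILLICENTS_PER_USD).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

auction_cost = Decimal(53_826)
charged = millicents_to_eur_cents(auction_cost)       # whole euro cents
actual = eur_cents_to_millicents(charged)             # millicents actually billed
error_percent = (auction_cost - actual) / auction_cost * 100
print(charged, actual, round(error_percent, 1))
```

Reporting revenue from the charged amount rather than the auction cost keeps the books consistent with what employers actually paid.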