@Indeedeng: RAD - How We Replicate Terabytes of Data Around the World Every Day

RAD
How We Replicate Terabytes of Data Around the World Every Day

Jason Koppe
System Administrator

Indeed is the #1 external source of hire
64% of US job searchers search on Indeed each month
[Chart: unique visitors (millions) by year, 2009 to 2015]
180 million unique users
80.2M unique US visitors per month
16M jobs
50+ countries
28 languages

How We Build Systems
fast simple resilient scalable

Job Search Browser Rendering
median ~0.5 seconds
[Chart: rendering time in milliseconds per day, Feb 24 to Mar 8]

2004 launch: a few servers, 1.8m US jobs

2004
Aggregation
MySQL
Job Search
Every job on
the web

relational database,
accessed across the network

NOT fast at full text search
NOT a search engine

Lucene™
a high-performance, full-featured
text search engine library

Lucene™
NOT a remote database,
files must be on local disk

MySQL
Database Server Lucene Index Server
Index Builder
/data/jobindex

Index Builder Index Builder Index Builder Index Builder
/data/jobindex /data/jobindex /data/jobindex /data/jobindex
MySQL

MySQL
Database Server Indexer Server
Index Builder
/data/jobindex
Search Engine
/data/jobindex
4 Search Servers

any combination of data, not just lucene

lucene + model bitset
lucene + custom binary
json + csv

MySQL
Database Server
Index Builder
Producer
Artifact Artifact
Consumers
Search Engine
Artifact
is read-optimized data stored in a directory on the file system

Producer
creates and updates a data artifact
Database Server
Index Builder
Producer
Artifact Artifact
Consumers
Search Engine
MySQL

Consumer
reads a data artifact
Database Server
Index Builder
Producer
Artifact Artifact
Consumers
Search Engine
MySQL

produce once, consume many times

MySQL
Database Server
Index Builder
Producer
Artifact Artifact
Consumers
Search Engine
Benefit: minimize database access
Benefit: compute artifact once
Benefit: scale consumers independently (expensive database server, commodity consumers)
Benefit: separate code deployables

Producer
artifact
Search Engine
Consumers
artifact
Index Builder

rsync
efficient point-to-point file transfer utility

1 consumers should reload data regularly
2 consumers should be able to roll back
3 data reload should not interrupt requests

$ ls -d jobindex.*
jobindex.1
jobindex.2
jobindex.3
new directory for new version

$ ls -d jobindex.*
jobindex.1
jobindex.2
jobindex.3
jobindex.latest -> jobindex.3
symlink to know current version

$ ls -d jobindex.*
jobindex.1
jobindex.2
jobindex.3
jobindex.4
load new data

$ ls -d jobindex.*
jobindex.1
jobindex.2
jobindex.3
jobindex.4
roll back
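
The versioned-directory scheme above can be sketched in a few lines. This is a minimal illustration, not the code Indeed ran: the helper names (`publish_version`, `point_latest_at`, `current_version`) are hypothetical, and it assumes a POSIX filesystem where renaming over an existing symlink is atomic, so readers never see a missing `jobindex.latest`.

```python
import os
import tempfile

def publish_version(base_dir, name, version):
    """Create the directory name.<version>; a producer would fill it with data."""
    path = os.path.join(base_dir, f"{name}.{version}")
    os.makedirs(path, exist_ok=True)
    return path

def point_latest_at(base_dir, name, version):
    """Repoint name.latest at name.<version> in one atomic rename."""
    link = os.path.join(base_dir, f"{name}.latest")
    tmp = os.path.join(base_dir, f".{name}.latest.tmp")
    if os.path.lexists(tmp):
        os.remove(tmp)                      # leftover from an interrupted run
    os.symlink(f"{name}.{version}", tmp)    # relative target, like the slide
    os.replace(tmp, link)                   # rename replaces the old symlink itself

def current_version(base_dir, name):
    """Return the directory name that name.latest points at."""
    return os.readlink(os.path.join(base_dir, f"{name}.latest"))

base = tempfile.mkdtemp()
for v in (1, 2, 3):
    publish_version(base, "jobindex", v)
point_latest_at(base, "jobindex", 3)
publish_version(base, "jobindex", 4)
point_latest_at(base, "jobindex", 4)        # load new data
point_latest_at(base, "jobindex", 3)        # roll back
print(current_version(base, "jobindex"))    # jobindex.3
```

Rolling back is the same operation as loading new data: only the symlink moves, so older versions stay on disk untouched.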

each new version takes disk space & time

[Chart: normal disk copy; x-axis: versions; series: total bytes on disk, disk latency, version create time]

1.8m jobs, change <2% per hour

all jobs
00:00 AM
all jobs
04:00 AM
new jobs
changed jobs
unchanged

jobindex.1: file1.bin file2.bin file3.bin (3GB)
jobindex.2: file1.bin file2.bin file3.bin file4.bin (4GB)
jobindex.3: file1.bin file2.bin file3.bin file4.bin file5.bin (5GB)
3GB + 4GB + 5GB = 12GB

jobindex.1: file1.bin file2.bin file3.bin (3GB)
jobindex.2: file1.bin file2.bin file3.bin file4.bin (1GB)
jobindex.3: file1.bin file2.bin file3.bin file4.bin file5.bin (1GB)
3GB + 1GB + 1GB = 5GB

jobindex.1: deleted
jobindex.2: file1.bin file2.bin file3.bin file4.bin (1GB)
jobindex.3: file1.bin file2.bin file3.bin file4.bin file5.bin (1GB)
deleted + 1GB + 1GB = 2GB (was 5GB)

remove referenced file of symlink, data is gone

hardlink
additional name for an existing file

jobindex.1: file1.bin file2.bin file3.bin (3GB)
jobindex.2: file1.bin file2.bin file3.bin file4.bin (1GB)
jobindex.3: file1.bin file2.bin file3.bin file4.bin file5.bin (1GB)
3GB + 1GB + 1GB = 5GB

jobindex.2: file1.bin file2.bin file3.bin file4.bin (4GB)
jobindex.3: file1.bin file2.bin file3.bin file4.bin file5.bin (1GB)
4GB + 1GB = 5GB

jobindex.3: file1.bin file2.bin file3.bin file4.bin file5.bin (5GB)
= 5GB

remove last hardlink, data is gone

artifact versions: symlinks + hardlinks + rsync
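
To make the hardlink arithmetic concrete, here is a small sketch that builds three versions where each new version hardlinks the unchanged files from the previous one. The function names are illustrative, and 1 KB files stand in for the 1 GB files on the slides. Disk usage is counted per inode, which is how the filesystem actually charges for hardlinked data.

```python
import os
import shutil
import tempfile

def new_version(prev_dir, new_dir):
    """Start a new version by hardlinking every file of the previous version."""
    os.makedirs(new_dir)
    for fname in os.listdir(prev_dir):
        os.link(os.path.join(prev_dir, fname), os.path.join(new_dir, fname))

def add_file(version_dir, fname, size):
    """Write a new file of the given size into one version."""
    with open(os.path.join(version_dir, fname), "wb") as f:
        f.write(b"x" * size)

def bytes_on_disk(*dirs):
    """Sum file sizes across directories, counting each inode only once."""
    seen, total = set(), 0
    for d in dirs:
        for fname in os.listdir(d):
            st = os.stat(os.path.join(d, fname))
            if (st.st_dev, st.st_ino) not in seen:
                seen.add((st.st_dev, st.st_ino))
                total += st.st_size
    return total

KB = 1024                                   # stands in for 1 GB
base = tempfile.mkdtemp()
v1, v2, v3 = (os.path.join(base, f"jobindex.{n}") for n in (1, 2, 3))
os.makedirs(v1)
for n in (1, 2, 3):
    add_file(v1, f"file{n}.bin", KB)
new_version(v1, v2)
add_file(v2, "file4.bin", KB)
new_version(v2, v3)
add_file(v3, "file5.bin", KB)

print(bytes_on_disk(v1, v2, v3) // KB)      # 5, not 3 + 4 + 5 = 12
shutil.rmtree(v1)                           # drop the oldest version
print(bytes_on_disk(v2, v3) // KB)          # still 5: the data survives
```

rsync has a `--link-dest` option that applies the same idea during a transfer; the slides do not say whether that option was used here.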

scale: single producer, many consumers

fast simple resilient scalable
How We Build Systems

1999: Lucene
2004: Indeed
2008: 6 countries
2009: 23 countries

jobs added or modified each month
2004: 1.8 M
2005: 4.0 M
2006: 5.2 M
2008: 7.1 M
2009: 22.5 M

1999: Lucene
2004: Indeed
2008: 6 countries
2009: 23 countries
2nd datacenter

Producer
Consumers
artifacts
DC1
Staging
Consumers
artifacts
DC2
multi-dc rsync
Staging
Consumers
artifacts
DC3

Producer
Consumers
artifacts
DC1
Staging
Consumers
artifacts
DC2
Staging
Consumers
artifacts
DC3
minimize
Internet
bandwidth

1999: Lucene
2004: Indeed
2008: 6 countries
2009: 23 countries
2011: 52 countries, 4 datacenters

jobs added or modified each month
2004: 1.8 M
2005: 4.0 M
2006: 5.2 M
2008: 7.1 M
2009: 22.5 M
2011: 32.5 M

Simple: serially copy one artifact at a time
DC1
Producer Artifacts
DC2
Staging Artifacts

Problem: serially can cause delays
Producer
Staging
New
New
New
Old
DC1
DC2

small large2 large1
Workaround: copy separately in “streams”
DC1
DC2
Staging
Producer
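
Why separate streams help can be shown with a toy calculation. The durations below are made up for illustration, and the model assumes streams do not slow each other down, which ignores shared bandwidth. Each stream copies its artifacts serially; different streams run side by side.

```python
def finish_times(streams):
    """Return the minute at which each artifact finishes copying.

    streams is a list of queues; each queue is a list of (name, minutes)
    copied one after another, while separate queues run in parallel.
    """
    done = {}
    for queue in streams:
        clock = 0
        for name, minutes in queue:
            clock += minutes
            done[name] = clock
    return done

# Hypothetical copy durations in minutes.
large1, large2, small = ("large1", 60), ("large2", 60), ("small", 1)

serial = finish_times([[large1, large2, small]])
streamed = finish_times([[large1], [large2], [small]])

print(serial)     # {'large1': 60, 'large2': 120, 'small': 121}
print(streamed)   # {'large1': 60, 'large2': 60, 'small': 1}
```

In the serial case the small artifact waits behind both large ones; with its own stream it arrives almost immediately.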

Simple: point-to-point datacenter rsync paths
DC4
DC3
DC2
DC1

Problem: Internet, why did you do that?
Down
DC4
DC3
DC2
DC1

Workaround: shift replication path
DC4
DC3
DC2
DC1

Scale: few consumers with rsync
Producer
Artifacts Consumers

Consumers
Producer
Grow: many consumers with rsync
Artifacts
Consumers

Consumers
Producer
Problem: too many consumers with rsync
Artifacts
Consumers
network
100%
used

Workaround: add more network bandwidth
Consumers
Producer
Artifacts
Consumers

Workaround: add staging tiers
Consumers
Producer
Artifacts
Staging
Artifacts Artifacts
Staging
Artifacts
Staging
Artifacts
Consumers Consumers Consumers Consumers Consumers Consumers Consumers
Staging

rsync growth required sysad intervention

1999: Lucene
2004: Indeed
2008: 6 countries
2009: 23 countries
2011: 52 countries
2014: rsync growth

100 artifacts, adding +1 producer each month

over 200 consumers, +2 each month

replicating over 21,931 TB per month

staging tiers or network bandwidth, quarterly

modify replication path, monthly

requiring too much intervention from system
administrators

[Chart: sysad vs. dev, 2014, January to December; labels +50% and +100%]

1999: Lucene
2004: Indeed
2008: 6 countries
2009: 23 countries
2011: 52 countries
2014: rsync limits

Julie Scully
Software Engineer

Jobsearch backend team produces a lot of data

RAD
“Resilient Artifact Distribution”

Design Goals
1 Minimize network bottlenecks
2 Loose coupling
3 Automatic recovery
4 Developer empowerment
5 System-wide visibility

Bittorrent: Would it work?
Sample replication to 3 consumers
Measure time and network traffic
https://github.com/shevek/ttorrent

Network Test
Total MB received + transmitted for 700MB artifact

machine      RSYNC    BITTORRENT
Producer     2,240    782
Consumer 1   746      1,226
Consumer 2   747
Consumer 3   747
Total        4,481    4,480

Timing Test
rsync: 24 minutes
bittorrent: 5.5 minutes

Data split into small pieces of equal size
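
A rough sketch of the splitting rule: the files are treated as one concatenated byte stream and cut at fixed offsets, so a piece can span a file boundary. The function name is illustrative and sizes are in MB to match the slides. With the slide's file sizes (100 + 200 + 50 = 350 MB) and 75 MB pieces, the last piece comes out at 50 MB.

```python
def split_into_pieces(files, piece_length):
    """Cut the concatenation of files into fixed-size pieces.

    files is an ordered list of (name, size). Returns one entry per piece:
    (piece_size, [names of the files that contribute bytes to it]).
    """
    total = sum(size for _, size in files)
    pieces = []
    start = 0
    while start < total:
        end = min(start + piece_length, total)
        touched = []
        offset = 0
        for name, size in files:
            # The file occupies [offset, offset + size) in the stream.
            if offset < end and offset + size > start:
                touched.append(name)
            offset += size
        pieces.append((end - start, touched))
        start = end
    return pieces

files = [("file1.bin", 100), ("file2.bin", 200), ("file3.bin", 50)]
for i, (size, names) in enumerate(split_into_pieces(files, 75), start=1):
    print(f"Piece {i}: {size} MB from {', '.join(names)}")
# Piece 1: 75 MB from file1.bin
# Piece 2: 75 MB from file1.bin, file2.bin
# Piece 3: 75 MB from file2.bin
# Piece 4: 75 MB from file2.bin
# Piece 5: 50 MB from file3.bin
```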

jobindex.1
File1.bin (100MB)
File2.bin (200MB)
File3.bin (50MB)
Piece 1: 75 MB
Piece 2: 75 MB
Piece 3: 75 MB
Piece 4: 75 MB
Piece 5: 50 MB

.torrent metadata file:
{ files:file1.bin,100MB;
file2.bin,200MB;
file3.bin,50MB }
{ piecelength:75MB }
{ infohash:XSDJSK;JDISJLD;DJKJDB;KDJBOP;FJEIODK; }

Tracker
Coordinator of the download

Seeder
Any client providing data

Seeder
Data
I have pieces for info hash
Tracker
.torrent
Info Hash
File manifest

Data .torrent
Info Hash
File manifest
Seeder Tracker
Info hash peer
Map
Ok!
I have pieces for info hash

Consumer
Any client downloading data

Peers for infohash
Peerlist
Consumer Tracker
.torrent
Info Hash
File manifest
Tracker URL
Map
Info hash peer
How a consumer gets the first piece
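
The tracker's role in this exchange is essentially a map from info hash to the peers that have announced it. The class below is a toy in-memory model, not the ttorrent tracker or the real tracker protocol (which runs over HTTP or UDP); the class and method names and the host addresses are made up for illustration.

```python
from collections import defaultdict

class ToyTracker:
    """Keeps an info hash -> peers map, like the 'Map' box on the slide."""

    def __init__(self):
        self._peers = defaultdict(set)

    def announce(self, info_hash, peer):
        """Register peer for info_hash and return the other known peers."""
        others = sorted(self._peers[info_hash] - {peer})
        self._peers[info_hash].add(peer)
        return others

tracker = ToyTracker()
seeder = ("headwater-host", 6881)
consumer1 = ("consumer-1", 6881)
consumer2 = ("consumer-2", 6881)

print(tracker.announce("INFOHASH", seeder))     # []
print(tracker.announce("INFOHASH", consumer1))  # [('headwater-host', 6881)]
print(tracker.announce("INFOHASH", consumer2))
# [('consumer-1', 6881), ('headwater-host', 6881)]
```

The second consumer is handed the first consumer as a peer as well as the original seeder, which is what lets load spread across the swarm instead of landing on the producer.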

Data .torrent
Info Hash
File manifest
Consumer/
Seeder
I have pieces for infohash
Tracker
Info hash peer
Map
It is also a seeder

Consumer 1
Seeding as it downloads
Consumer 2
Consumer 3
Seeder
SWARM

jobindex.1
File1.bin (100MB)
File2.bin (200MB)
File3.bin (50MB)
Piece 1: HASH1
Piece 2: HASH2
Piece 3: HASH3
Piece 4: HASH4
Piece 5: HASH5

jobindex.2
File1.bin (100MB)
File2.bin (200MB)
File3.bin (50MB)
File4.bin (50MB)
Piece 1: HASH1
Piece 2: HASH2
Piece 3: HASH3
Piece 4: HASH4
Piece 5: HASH6
Piece 6: HASH7

jobindex.1
File1.bin (100MB)
File2.bin (200MB)
File3.bin (50MB)
Piece 1: HASH1
Piece 2: HASH2
Piece 3: HASH3
Piece 4: HASH4
Piece 5: HASH5

jobindex.2
File0.bin (50MB)
File1.bin (100MB)
File2.bin (200MB)
File3.bin (50MB)
Piece 1: HASH6
Piece 2: HASH7
Piece 3: HASH8
Piece 4: HASH9
Piece 5: HASH10
Piece 6: HASH11

jobindex.1
File1.bin (100MB)
File2.bin (200MB)
File3.bin (50MB)
Piece 1: HASH1
Piece 2: HASH2
Piece 3: HASH3
Piece 4: HASH4
Piece 5: HASH5

jobindex.2
File1.bin (150MB)
File2.bin (200MB)
File3.bin (50MB)
Piece 1: HASH6
Piece 2: HASH7
Piece 3: HASH8
Piece 4: HASH9
Piece 5: HASH10
Piece 6: HASH11

{ files:file1.bin,100MB,DATETIME;
file2.bin,200MB,DATETIME;
file3.bin,50MB,DATETIME }
{ piecelength:75MB }
...
.torrent metadata file contents:
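
The effect of file order on piece hashes can be checked with a short experiment. BitTorrent hashes each piece with SHA-1; everything else here is an assumption for illustration: bytes stand in for megabytes, file contents are random, and the reading that the DATETIME field lets files be ordered by timestamp (so that new files land at the end) is an inference from the slide, not something it states.

```python
import hashlib
import random

def fake_file(name, size):
    """Deterministic pseudo-random content, seeded by the file name."""
    rng = random.Random(name)
    return (name, bytes(rng.getrandbits(8) for _ in range(size)))

def piece_hashes(files, piece_length):
    """SHA-1 of each fixed-size piece of the concatenated file contents."""
    stream = b"".join(data for _, data in files)
    return [hashlib.sha1(stream[i:i + piece_length]).hexdigest()
            for i in range(0, len(stream), piece_length)]

def unchanged_pieces(old, new):
    """Count piece positions whose hash is identical in both versions."""
    return sum(1 for a, b in zip(old, new) if a == b)

f0, f1, f2, f3, f4 = (fake_file(f"File{n}.bin", size)
                      for n, size in enumerate((50, 100, 200, 50, 50)))

v1 = piece_hashes([f1, f2, f3], 75)
appended = piece_hashes([f1, f2, f3, f4], 75)    # new file sorts last
prepended = piece_hashes([f0, f1, f2, f3], 75)   # new file sorts first

print(unchanged_pieces(v1, appended))    # 4: only the tail pieces are new
print(unchanged_pieces(v1, prepended))   # 0: every piece boundary shifted
```

When the new file lands at the end, a consumer that already holds the previous version can reuse the leading pieces; when it lands at the front, every piece has to be fetched again.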

File1.bin
(100MB)
File2.bin
(200MB)
File3.bin
(50MB)
jobindex.1
Piece 1: File 0, File 1
Piece 2: File 1
Piece 3: File 1, File 2
Piece 4: File 2
Piece 6: File 3
File1.bin
(100MB)
File2.bin
(200MB)
File3.bin
(50MB)
jobindex.2
File0.bin
(50MB)

File1.bin
(100MB)
File2.bin
(200MB)
File3.bin
(50MB)
jobindex.1
File1.bin
(100MB)
File2.bin
(200MB)
File3.bin
(50MB)
jobindex.2
File0.bin
(50MB)
Piece 1: File 0, File 1
Piece 2: File 1
Piece 4: File 2
Piece 6: File 3

Bittorrent Evaluation Result
substantially faster drastically reduces
network load on the
producer machine
horizontally scalable

1 Minimize network bottlenecks
2 Loose coupling
3 Automatic recovery

Headwater
The beginning of a river

Headwater
Host
Data
Producer Data
Publish
my data

Headwater takes ownership of the data
(hardlink + read-only)
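
One way to read "hardlink + read-only" is sketched below. The function name and directory layout are hypothetical; the slides do not show how Headwater implements this. The idea is that the publishing service links the producer's files into a directory it controls and strips write permission, so the producer can clean up its own copy without affecting the published data.

```python
import os
import shutil
import stat
import tempfile

def take_ownership(producer_dir, owned_dir):
    """Hardlink every file into owned_dir, then make the data read-only."""
    os.makedirs(owned_dir)
    for fname in sorted(os.listdir(producer_dir)):
        dst = os.path.join(owned_dir, fname)
        os.link(os.path.join(producer_dir, fname), dst)
        # Permissions live on the inode, so the producer's name for the
        # file becomes read-only too.
        os.chmod(dst, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

base = tempfile.mkdtemp()
producer_dir = os.path.join(base, "producer", "jobindex.1")
owned_dir = os.path.join(base, "headwater", "jobindex.1")
os.makedirs(producer_dir)
with open(os.path.join(producer_dir, "file1.bin"), "wb") as f:
    f.write(b"index data")

take_ownership(producer_dir, owned_dir)
shutil.rmtree(producer_dir)                 # producer cleans up its copy

with open(os.path.join(owned_dir, "file1.bin"), "rb") as f:
    print(f.read())                         # b'index data'
```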

Headwater
Host
Data
Producer Data
Publish
my data
Will do!

Headwater
Host
Data
Producer Data

create the .torrent metadata file

Headwater
River
Course the water carves
across the landscape

Rhone Rhone Rhone
Zookeeper
Rhone: multi-master coordinator service

Rhone
Headwater
Host
Data
Producer Data

Rhone
Headwater
Host
Data
Producer Data
data.version
torrent metadata

Rhone
Headwater
Host
Data
Producer Data
Tracker
.torrent metadata
can be retrieved
data.version
torrent metadata

Headwater
River
Course the water carves
across the landscape
Delta
The end of the river

Subscribe
to data!
Delta
Host
Data
Consumer

Make all subscribed artifacts available

Rhone
Delta
Host
Data
Consumer
Headwater
Host
Data
Producer Data

Delta
Data
Consumer
Rhone
Host

Tracker
Delta
Host
Data
Consumer Data
/rad/data

Delta
Host
Data
Consumer Data
Where’s
the latest
data?
/rad/data

It’s at
/rad/data
Delta
Host
Data
Consumer Data
Where’s the
latest data?
/rad/data

Delta
Host
Data
Consumer Data
/rad/data

Keep all subscribed artifacts current

Rhone
Data
Host
Artifact Availability Flow
Delta Headwater
Host
Data
Consumer
Data
Producer Data

2 Loose coupling
3 Automatic recovery

Rhone
Headwater
Host
Data
Producer Data
Crash!

Rhone
Headwater
Data
Producer Data
data.version
torrent metadata
Tracker
Crash!
Host

Development philosophy:
Make recovery the common case

Durable state with atomic filesystem operations

All service calls are idempotent
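
These two principles combine naturally. The sketch below is an assumption about what they could look like in practice, not RAD's code: state is written to a temporary file and renamed into place, so a crash leaves either the old file or the new one, and the declare call can be retried any number of times with the same result. The file layout and function names are hypothetical.

```python
import json
import os
import tempfile

def write_state_atomically(path, state):
    """Write JSON to a temp file in the same directory, then rename it."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".state-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())            # data is on disk before the rename
        os.replace(tmp, path)               # atomic within one POSIX filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise

def declare_version(path, artifact, version):
    """Idempotent: repeating a call leaves exactly the same state behind."""
    state = {}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    versions = set(state.get(artifact, []))
    versions.add(version)
    state[artifact] = sorted(versions)
    write_state_atomically(path, state)
    return state

state_file = os.path.join(tempfile.mkdtemp(), "state.json")
print(declare_version(state_file, "jobindex", 3))   # {'jobindex': [3]}
print(declare_version(state_file, "jobindex", 3))   # {'jobindex': [3]} again
```

Because a retry after a crash is indistinguishable from a first attempt, recovery needs no special code path.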

DC4
DC3
DC2
DC1
rsync is point-to-point

DC1
DC4
DC3
DC2
bittorrent peer-to-peer

Down
DC1
DC4
DC3
DC2
No problem with bittorrent swarm

RAD treats each artifact independently

2 Loose coupling
3 Automatic recovery
5 System-wide visibility

Adding a new artifact in the rsync system

Adding a new artifact in the RAD system

2 Loose coupling
3 Automatic recovery
4 Developer empowerment

Rhone already knows all artifacts

Rhone stores list of versions by artifact.
artifactA: version 4, version 5, version 6
artifactB: version 221, version 226, version 227, version 228
artifactC: version 1

Heartbeats from Delta and Headwater
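
Combining the version list with heartbeats is enough to answer "where is my data?". The class below is a hypothetical stand-in, not Rhone's real API; the host names and the 30-second timeout are invented for the example, and the version numbers are the ones on the slide.

```python
from collections import defaultdict

class VersionRegistry:
    """Toy model: versions per artifact plus the last heartbeat per host."""

    def __init__(self, heartbeat_timeout=30.0):
        self.versions = defaultdict(list)   # artifact -> sorted versions
        self.heartbeats = {}                # host -> (last_seen, holdings)
        self.heartbeat_timeout = heartbeat_timeout

    def declare(self, artifact, version):
        """Record a published version; declaring it twice is harmless."""
        if version not in self.versions[artifact]:
            self.versions[artifact].append(version)
            self.versions[artifact].sort()

    def latest(self, artifact):
        return self.versions[artifact][-1]

    def heartbeat(self, host, holdings, now):
        """A host reports which version of each artifact it currently has."""
        self.heartbeats[host] = (now, dict(holdings))

    def report(self, artifact, now):
        """Classify hosts as current, behind, or silent for one artifact."""
        latest = self.latest(artifact)
        result = {"current": [], "behind": [], "silent": []}
        for host, (seen, holdings) in sorted(self.heartbeats.items()):
            if artifact not in holdings:
                continue
            if now - seen > self.heartbeat_timeout:
                result["silent"].append(host)
            elif holdings[artifact] == latest:
                result["current"].append(host)
            else:
                result["behind"].append(host)
        return result

registry = VersionRegistry()
for v in (221, 226, 227, 228):
    registry.declare("artifactB", v)
registry.heartbeat("delta-1", {"artifactB": 228}, now=100.0)
registry.heartbeat("delta-2", {"artifactB": 227}, now=100.0)
registry.heartbeat("delta-3", {"artifactB": 228}, now=50.0)
print(registry.report("artifactB", now=110.0))
# {'current': ['delta-1'], 'behind': ['delta-2'], 'silent': ['delta-3']}
```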

RADAR: Developers can easily see where their data is

2004: Indeed
2008: 6 countries
2009: 23 countries
2011: 52 countries
2014: rsync limits
1st artifact migrated to RAD

Lesson learned: prevent people from
using the system incorrectly

We made configuration TOO easy

New Requirement: protect the disks

Delta
Prevent downloading artifacts that will fill the disk (and alarm)
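
A guard of this kind can be as simple as a free-space check before starting a download. This is a sketch under assumptions: the 10% reserve and the function names are invented here, and the slides do not describe how Delta actually decides or alarms.

```python
import shutil

def can_download(artifact_bytes, free_bytes, total_bytes, reserve_fraction=0.10):
    """True if the disk keeps at least reserve_fraction free after the download."""
    return free_bytes - artifact_bytes >= total_bytes * reserve_fraction

def check_path(path, artifact_bytes, reserve_fraction=0.10):
    """Apply the same rule to the filesystem that holds path."""
    usage = shutil.disk_usage(path)         # named tuple: total, used, free
    return can_download(artifact_bytes, usage.free, usage.total, reserve_fraction)

GB = 1024 ** 3
# A 1000 GB disk with 500 GB free can take a 50 GB artifact...
print(can_download(50 * GB, 500 * GB, 1000 * GB))   # True
# ...but not when only 100 GB is free, since less than 10% would remain.
print(can_download(50 * GB, 100 * GB, 1000 * GB))   # False
```

A caller would refuse the download and raise an alarm when the check fails, rather than discovering the full disk mid-transfer.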

2004: Indeed
2008: 6 countries
2009: 23 countries
2011: 52 countries
2014: rsync limits
1st artifact migrated to RAD
2015: critical artifacts migrated
2016: 80 RAD artifacts
100 artifacts in 10 years
80 new artifacts in 1 year

RAD Stats
March 23, 2016
Producer
56 unique producers
7,666 versions published
Consumer
670 unique consumers
52,357 versions downloaded

Duration of JobIndex replication in RAD v. Rsync
[Chart: replication duration over time, Jan 18 to Jan 19; series: RAD, rsync]

replicating over 65,193 TB per month

Learn More
Engineering blog & talks http://indeed.tech
Open Source http://opensource.indeedeng.io
Careers http://indeed.jobs
Twitter @IndeedEng
