Using Hadoop and HBase for DNA Matching at Scale

Ancestry DNA at Scale
Using Hadoop and HBase
September 7, 2013

What does this talk cover?
What does Ancestry do?
How did our journey with Hadoop start?
Using Hadoop as a Job Processor
DNA Matching with Hadoop and HBase
What’s next?

Discoveries Are the Key
• Over 30,000 historical content collections
• 11 billion records and images
• Records dating back to 16th century
• 4 petabytes
We are the world's largest online family history resource.

The “eureka” moment drives our business
Discoveries In Detail

Discoveries With DNA
Spit in a tube, pay $99, learn your past
Autosomal DNA tests
Over 120,000 DNA samples
700,000 SNPs for each sample
6,000,000+ 4th cousin matches
[Figure: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). Source: http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism]
[Chart: Genotyped samples; axis from 0 to 150,000]

Network Effect – Cousin Matches
[Chart: cousin matches (axis from 0 to 3,500,000) against database size (2,000; 10,053; 21,205; 40,201; 60,240; 80,405; 115,756 samples)]

Where Did We Start?
The process before Hadoop

What’s the Story?
Cast of Characters (Scientists and Software Engineers)
Pressures of a startup business
– Release a product, learn, and then scale
A senior manager, 5 developers, and a 4-member science team
Scientists
Think they can code:
• Linux
• MySQL
• Perl and/or Python
Software Engineers
Think they are Scientists:
• Biology in HS and College
• Math/Statistics
• Read science papers

DNA Input
Raw Data (A,C,T,G,0):
3 123456789_RZZZZ2_XXXXXXH3Q7U7Q2B_YYYY84598-DNA 0 0 0 -9 C C G G G G G G A A A A C C G G A
A A A C C G G G G A A G G G A A A G G A G A A C C A A A A G G A A A G G G G G C C G G A A G G G G G G
G A A A A C G A A A A G A G A A A A G G G G G G A G G G G G G G … (continues for 700,000+ SNPs)
Map File:
0 rs10005853 0 0
0 rs10015934 0 0
0 rs1004236 0 0
0 rs10059646 0 0
0 rs10085382 0 0
0 rs10123921 0 0
0 rs10127827 0 0
0 rs10155688 0 0
0 rs10162780 0 0
0 rs1017484 0 0
0 rs10188129 0 0

What Did “Get Something Running” Look Like?
Single Beefy Box – Only option is to scale Vertically
[Diagram: old version of the pipeline. Labels: Pipeline Control, Run, Watch Dog, RakeshInit, Creates run, Reruns, Heart beat, Monitor, 2) Enqueuer (DNA validation), 3) Poll status, 4) Disc Management (V2), Results Processing, Finalize. AdMixture (Ethnicity), Beagle (Phasing), and GermLine (Matching) all run on the “Beefy Box”.]

Measure Everything Principle
• Record start time, end time, duration in seconds, and sample count for every step in the pipeline, plus the full end-to-end processing time
• Put the data in pivot tables and graph each step
• Normalize the data (sample size was changing)
• Use the data collected to predict future performance
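A minimal sketch of the first and third bullets, assuming per-step timings are kept in a simple in-memory list. The names `metrics`, `timed_step`, and `seconds_per_sample` are hypothetical; the talk does not describe how the team actually stored its measurements.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory log; the real pipeline's storage is not described.
metrics = []


@contextmanager
def timed_step(step_name, sample_count):
    """Record start time, end time, duration, and sample count for one step."""
    start = time.time()
    try:
        yield
    finally:
        end = time.time()
        metrics.append({
            "step": step_name,
            "start": start,
            "end": end,
            "duration_sec": end - start,
            "samples": sample_count,
        })


def seconds_per_sample(record):
    """Normalize a duration by run size so runs of different sizes compare."""
    return record["duration_sec"] / record["samples"]


# Usage: wrap each pipeline step; the sleep stands in for real work.
with timed_step("ethnicity", sample_count=1000):
    time.sleep(0.01)
```

Normalizing by sample count is what makes runs of different sizes comparable before graphing them.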
#1

Challenges and Pain Points
Performance degrades as the DNA pool grows
• Static (by batch size)
• Linear (by DNA pool size)
• Quadratic (matching-related steps) – Time bomb
(Courtesy of Keith’s plotting)
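To illustrate why the quadratic steps are the time bomb, here is a small sketch that counts all-against-all comparisons. It assumes each unordered pair of samples is compared once; the function names are hypothetical and the real cost per comparison is not modeled.

```python
def pairwise_comparisons(pool_size):
    """Number of unordered sample pairs in a pool: n * (n - 1) / 2."""
    return pool_size * (pool_size - 1) // 2


def growth_factor(old_size, new_size):
    """How much more pairwise work the larger pool needs than the smaller one."""
    return pairwise_comparisons(new_size) / pairwise_comparisons(old_size)


# Doubling the pool roughly quadruples the matching work.
print(pairwise_comparisons(1000))
print(growth_factor(60_000, 120_000))
```

A step whose cost is static or linear stays manageable as the pool grows; a step that compares every sample with every other does not.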

Parallel Ethnicity Jobs
Use Hadoop as a job processor

Why Attack Ethnicity First?
• Smart developers, little Hadoop experience
– Using Hadoop as a job scheduler and scaling the ethnicity step
was easier than redesigning the matching step
• AdMixture is a self-contained application
– Reference panel, the user's DNA, and a seed value for inputs
– CPU intensive job that writes to stdout
• Easy to split up the input
• Looked hard enough at the matching problem to realize that an
HBase and MapReduce solution was realistic

Parallel Ethnicity Jobs
Typical run of 1000 samples. Queue up one Hadoop job
with 40 tasks, 25 samples per task
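The split described here can be sketched as plain list chunking. This is only an illustration: `split_into_tasks` and the sample ID format are hypothetical, and how each task invokes AdMixture on the cluster is not shown.

```python
def split_into_tasks(sample_ids, samples_per_task):
    """Chunk a run's samples into fixed-size groups, one group per task."""
    return [
        sample_ids[i:i + samples_per_task]
        for i in range(0, len(sample_ids), samples_per_task)
    ]


# A typical run: 1000 samples at 25 samples per task gives 40 tasks.
sample_ids = [f"sample_{n:04d}" for n in range(1000)]
tasks = split_into_tasks(sample_ids, 25)
print(len(tasks))
```

Because each sample is processed independently, the groups can run in any order and on any node.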
[Diagram: 1) MapReduce job on the Hadoop cluster (20 x 4 slots x 96g); AdMixture instances run in parallel across the servers]
#2

Results
1000 sample runs under 3 hours (one interesting bug)
[Chart: Admixture Time. AdMixture time (sec) and sum of run size per run, from 2012-03-01 to 2013-08-06; axis from 0 to 100,000]

Freed up the “Beefy Box”
• Moving AdMixture off left an additional 10 threads for
phasing and matching
• Memory was freed up for phasing and matching
• Just moving AdMixture off saved over 6 hours of
processing on the single box
– Bought us time

New Matching Algorithm
Hadoop and HBase

What is GERMLINE?
• GERMLINE is an algorithm that finds hidden relationships
within a pool of DNA
• GERMLINE also refers to the reference implementation of
that algorithm written in C++
• You can find it here :
http://www1.cs.columbia.edu/~gusev/germline/

So what's the problem?
• GERMLINE (the implementation) was not meant to be
used in an industrial setting
• Stateless
• Single threaded
• Prone to swapping (heavy memory usage)
• Generic
• Used for any DNA (fish, fruit fly, human, …)
• GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would
slow to a crawl
• Put simply : GERMLINE couldn't scale

[Chart: GERMLINE Run Times (in hours). Hours (0 to 25) against number of samples (2,500 to 60,000)]

[Chart: Projected GERMLINE Run Times (in hours). Hours (0 to 700) against number of samples (2,500 to 122,500). Series: GERMLINE run times; projected GERMLINE run times]

The Mission : Create a Scalable Matching Engine
... and thus Jermline was born
(aka "Jermline with a J")

DNA Matching : How it Works
The Input
Starbuck : ACTGACCTAGTTGAC
Adama : TTAAGCCTAGTTGAC
Kara Thrace, aka Starbuck
• Ace viper pilot
• Has a special destiny
• Not to be trifled with
Admiral Adama
• Admiral of the Colonial Fleet
• Routinely saves humanity from destruction

0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Separate into words
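A minimal sketch of the word-splitting step, assuming fixed-length, non-overlapping words of five letters as in the toy example. The function name is hypothetical, and the word length used in production is not stated in the talk.

```python
WORD_LENGTH = 5  # matches the toy example; the production value is not stated


def split_into_words(sequence, word_length=WORD_LENGTH):
    """Cut a sequence into consecutive, non-overlapping fixed-length words."""
    return [
        sequence[i:i + word_length]
        for i in range(0, len(sequence), word_length)
    ]


starbuck_words = split_into_words("ACTGACCTAGTTGAC")
adama_words = split_into_words("TTAAGCCTAGTTGAC")
print(starbuck_words)
print(adama_words)
```

The index of each word in the returned list is its position (0, 1, 2) in the slides.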

0 1 2
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
Build the hash table
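The hash table in this example can be sketched with a plain dictionary keyed by `WORD_POSITION`. This is an illustration of the idea only, not the GERMLINE source; `build_hash_table` is a hypothetical name and the five-letter word length comes from the toy example.

```python
from collections import defaultdict


def build_hash_table(samples, word_length=5):
    """Map each 'WORD_POSITION' key to the samples that carry that word there."""
    table = defaultdict(list)
    for name, sequence in samples.items():
        starts = range(0, len(sequence), word_length)
        for position, start in enumerate(starts):
            word = sequence[start:start + word_length]
            table[f"{word}_{position}"].append(name)
    return dict(table)


hash_table = build_hash_table({
    "Starbuck": "ACTGACCTAGTTGAC",
    "Adama": "TTAAGCCTAGTTGAC",
})
for key, names in hash_table.items():
    print(key, ":", ", ".join(names))
```

Any key that lists more than one sample marks a position where those samples share a word.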

Iterate through genome and find matches
Starbuck and Adama match from position 1 to position 2
0 1 2
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
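A sketch of the match-finding pass over such a hash table, assuming exact word matches only and merging consecutive shared positions into ranges. The function names are hypothetical, and the real algorithm's tolerance for mismatches is not modeled.

```python
from collections import defaultdict
from itertools import combinations


def merge_positions(positions):
    """Collapse sorted positions into inclusive (first, last) ranges."""
    ranges = []
    for position in sorted(positions):
        if ranges and position == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], position)
        else:
            ranges.append((position, position))
    return ranges


def find_matches(hash_table):
    """For every pair of samples, list the ranges of positions they share."""
    shared = defaultdict(set)
    for key, names in hash_table.items():
        position = int(key.rsplit("_", 1)[1])
        for pair in combinations(sorted(names), 2):
            shared[pair].add(position)
    return {pair: merge_positions(found) for pair, found in shared.items()}


hash_table = {
    "ACTGA_0": ["Starbuck"],
    "TTAAG_0": ["Adama"],
    "CCTAG_1": ["Starbuck", "Adama"],
    "TTGAC_2": ["Starbuck", "Adama"],
}
matches = find_matches(hash_table)
print(matches)
```

Here the single pair shares positions 1 and 2, which merge into one range from position 1 to position 2.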

Does that mean they're related?
...maybe

Baltar : TTAAGCCTAGGGGCG
But wait... what about Baltar?
Gaius Baltar
• Handsome
• Genius
• Kinda evil

Adding a new sample, the GERMLINE way

0 1 2
Baltar : TTAAG CCTAG GGGCG
ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
GGGCG_2 : Baltar
Step one : Rebuild the entire hash table from scratch, including
the new sample
The GERMLINE Way

Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
Step two : Find everybody's matches all over again, including the
new sample. (n x n comparisons)
0 1 2
ACTGA_0 : Starbuck
GGGCG_2 : Baltar
The GERMLINE Way

Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
Step three : Now, throw away the evidence!
0 1 2
ACTGA_0 : Starbuck
GGGCG_2 : Baltar
You have done this before, and you will have
to do it ALL OVER AGAIN.
The GERMLINE Way

Not so good, right?
Now let's take a look at the Jermline way.

The Jermline way
Step one : Update the hash table.
Already stored in HBase:
            Starbuck  Adama
2_ACTGA_0   1
2_TTAAG_0             1
2_CCTAG_1   1         1
2_TTGAC_2   1         1
New sample to add:
Baltar : TTAAG CCTAG GGGCG
Add a column for each new sample
Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that
position on that chromosome
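The incremental update can be sketched with a nested dictionary standing in for the HBase table: the outer key is the row key, the inner key is the column qualifier. This is an assumption-laden illustration, not the Jermline code; `add_sample` is a hypothetical name and no HBase client is used.

```python
def add_sample(word_table, chromosome, user_id, words):
    """Add one column (the user) to the rows for that user's words only."""
    touched_rows = []
    for position, word in enumerate(words):
        row_key = f"{chromosome}_{word}_{position}"
        word_table.setdefault(row_key, {})[user_id] = 1
        touched_rows.append(row_key)
    return touched_rows


# Dictionary stand-in for the rows already stored in HBase.
word_table = {
    "2_ACTGA_0": {"Starbuck": 1},
    "2_TTAAG_0": {"Adama": 1},
    "2_CCTAG_1": {"Starbuck": 1, "Adama": 1},
    "2_TTGAC_2": {"Starbuck": 1, "Adama": 1},
}
touched = add_sample(word_table, "2", "Baltar", ["TTAAG", "CCTAG", "GGGCG"])
print(touched)
```

Only the three rows for Baltar's own words are written; existing rows such as `2_ACTGA_0` are never rebuilt.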

The Jermline way
Step two : Find matches.
Already stored in HBase:
             2_Starbuck       2_Adama
2_Starbuck                    { (1, 2), ...}
2_Adama      { (1, 2), ...}
New matches to add:
Baltar and Adama match from position 0 to position 1
Baltar and Starbuck match at position 1
“Fuzzy Match” the consecutive words. Worst case: Identical twins
Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
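A sketch of finding matches for the new sample only, again with dictionaries standing in for the two HBase tables. It assumes exact word matches (the "fuzzy match" is not modeled), and all function names are hypothetical.

```python
def merge_positions(positions):
    """Collapse sorted positions into inclusive (first, last) ranges."""
    ranges = []
    for position in sorted(positions):
        if ranges and position == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], position)
        else:
            ranges.append((position, position))
    return ranges


def add_matches_for(word_table, match_table, chromosome, new_user, words):
    """Read only the new user's word rows and store matches in both directions."""
    positions_by_user = {}
    for position, word in enumerate(words):
        row = word_table.get(f"{chromosome}_{word}_{position}", {})
        for other_user in row:
            if other_user != new_user:
                positions_by_user.setdefault(other_user, []).append(position)
    new_key = f"{chromosome}_{new_user}"
    for other_user, positions in positions_by_user.items():
        other_key = f"{chromosome}_{other_user}"
        ranges = merge_positions(positions)
        match_table.setdefault(new_key, {})[other_key] = list(ranges)
        match_table.setdefault(other_key, {})[new_key] = list(ranges)
    return match_table


# Word table after Baltar's column has been added.
word_table = {
    "2_ACTGA_0": {"Starbuck": 1},
    "2_TTAAG_0": {"Adama": 1, "Baltar": 1},
    "2_CCTAG_1": {"Starbuck": 1, "Adama": 1, "Baltar": 1},
    "2_TTGAC_2": {"Starbuck": 1, "Adama": 1},
    "2_GGGCG_2": {"Baltar": 1},
}
match_table = {
    "2_Starbuck": {"2_Adama": [(1, 2)]},
    "2_Adama": {"2_Starbuck": [(1, 2)]},
}
add_matches_for(word_table, match_table, "2", "Baltar", ["TTAAG", "CCTAG", "GGGCG"])
print(match_table["2_Baltar"])
```

The existing Starbuck and Adama match is left untouched; only the cells involving Baltar are written. In this sketch a single-position match is stored as the range `(1, 1)`.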

The Jermline way
These are the updated tables after adding Baltar’s information.
Only looking at 3 samples, chromosome #2, positions 0, 1, and 2.
Very simple example of how the matching process works.

             2_Starbuck       2_Adama          2_Baltar
2_Starbuck                    { (1, 2), ...}   { (1), ...}
2_Adama      { (1, 2), ...}                    { (0,1), ...}
2_Baltar     { (1), ...}      { (0,1), ...}

            Starbuck  Adama  Baltar
2_ACTGA_0   1
2_TTAAG_0             1      1
2_CCTAG_1   1         1      1
2_TTGAC_2   1         1
2_GGGCG_2                    1

But wait ... what about
Zarek, Roslin, Hera, and Helo?

Photo by Benh Lieu Song
Run them in parallel with Hadoop!

Parallelism with Hadoop
• Batches are usually about a thousand people.
• Each mapper takes a single chromosome for a single person.
  o Three samples per task means 22 jobs with 334 tasks (1000/3) each
• MapReduce Jobs :
  Job #1 : Match Words
  • Updates the hash table
  Job #2 : Match Segments
  • Identifies areas where the samples match
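The job and task counts quoted above follow from simple arithmetic. The sketch below assumes one job per chromosome, with 22 chromosomes covered (an inference from the "22 jobs" figure and the autosomal test), and rounds the task count up; `plan_batch` is a hypothetical helper.

```python
import math

CHROMOSOMES = 22  # assumption: one job per chromosome, matching "22 jobs"


def plan_batch(batch_size, samples_per_task, chromosomes=CHROMOSOMES):
    """Return (number of jobs, tasks per job) for one batch."""
    tasks_per_job = math.ceil(batch_size / samples_per_task)
    return chromosomes, tasks_per_job


jobs, tasks_per_job = plan_batch(batch_size=1000, samples_per_task=3)
print(jobs, tasks_per_job, jobs * tasks_per_job)
```

Rounding up accounts for the last task, which holds the one sample left over after 333 full groups of three.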

How does Jermline perform?
A 1700% improvement over GERMLINE!
Along with more accurate results
#3

[Chart: Run Times For Matching (in hours). Hours (0 to 25) against number of samples (2,500 to 120,000)]

[Chart: Run Times For Matching (in hours). Hours (0 to 180) against number of samples (2,500 to 120,000). Series: GERMLINE run times; Jermline run times; projected GERMLINE run times]

Incremental Changes Over Time
• Support the business, move incrementally and adjust
• After H2, pipeline speed stays flat
• (Courtesy of Bill’s plotting)

Bottom line : Without Hadoop and HBase, this would
have been expensive and difficult.
• Previously, we ran GERMLINE on a single "beefy box".
• 12-core 2.2 GHz Opteron 6174 with 256 GB of RAM
• We had upgraded this machine until it couldn't be upgraded any more.
• Processing time was unacceptable, growth was unsustainable.
• To continue running GERMLINE on a single box, we would have required a vastly more
powerful machine, probably at the supercomputer level – at considerable cost!
• Now, we run Jermline on a cluster.
• 20 x 12-core 2 GHz Xeon E5-2620 with 96 GB of RAM
• We can now run 16 batches per day, whereas before we could only run one.
• Most importantly, growth is sustainable. To add capacity, we need only add more
nodes.
Dramatically Increased our Capacity

What’s Next?
Hadoop and HBase

Continue to Evolve the Software
• Azkaban for job control
– Nearly complete
• Phasing
– Still runs on the “Beefy Box”, 1000 samples take over 11 hours
– Total run time for 1000 samples is about 14 hours.
– Re-implement with HBase, MapReduce, Hadoop
• Version Updates
– New algorithms require us to re-run the entire DNA pool
– Burst capacity to the cloud
• Machine Learning
– Matching (V2) and Ethnicity (V3) both would benefit from a
Machine Learning approach

End of the Journey (for now) - Questions?
