Ancestry DNA at Scale
Using Hadoop and HBase
September 7, 2013
What does this talk cover?
What does Ancestry do?
How did our journey with Hadoop start?
Using Hadoop as a Job Processor
DNA Matching with Hadoop and HBase
What’s next?
Ancestry.com Mission
Discoveries Are the Key
• Over 30,000 historical content collections
• 11 billion records and images
• Records dating back to 16th century
• 4 petabytes
We are the world's largest online family history resource.
The “eureka” moment drives our business
Discoveries In Detail
Discoveries With DNA
Spit in a tube, pay $99, learn your past
Autosomal DNA tests
Over 120,000 DNA samples
700,000 SNPs for each sample
6,000,000+ 4th cousin matches
DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism).
(http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism)
[Chart: genotyped samples, axis from 0 to 150,000]
What does the customer see?
Network Effect – Cousin Matches
[Chart: cousin matches (axis 0 to 3,500,000) versus database size (2,000; 10,053; 21,205; 40,201; 60,240; 80,405; 115,756 samples)]
Where Did We Start?
The process before Hadoop
What’s the Story?
Cast of Characters (Scientists and Software Engineers)
Pressures of a startup business
– Release a product, learn, and then scale
A senior manager, 5 developers, and a 4-member science team
Scientists
Think they can code:
• Linux
• MySQL
• PERL and/or Python
Software Engineers
Think they are Scientists:
• Biology in HS and College
• Math/Statistics
• Read science papers
DNA Input
Raw Data (A,C,T,G,0):
3 123456789_RZZZZ2_XXXXXXH3Q7U7Q2B_YYYY84598-DNA 0 0 0 -9 C C G G G G G G A A A A C C G G A
A A A C C G G G G A A G G G A A A G G A G A A C C A A A A G G A A A G G G G G C C G G A A G G G G G G
G A A A A C G A A A A G A G A A A A G G G G G G A G G G G G G G … (continues for 700,000+ SNPs)
Map File:
0 rs10005853 0 0
0 rs10015934 0 0
0 rs1004236 0 0
0 rs10059646 0 0
0 rs10085382 0 0
0 rs10123921 0 0
0 rs10127827 0 0
0 rs10155688 0 0
0 rs10162780 0 0
0 rs1017484 0 0
0 rs10188129 0 0
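The two inputs above can be read together: each pair of allele letters in the raw data corresponds to one line of the map file. The sketch below assumes the layout follows the PLINK PED/MAP text convention (six leading columns, then two alleles per SNP), which the slides do not state explicitly. The function names and the shortened sample ID are hypothetical.

```python
from typing import Dict, List, Tuple

def parse_raw_line(line: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Split one raw-data line into a sample ID and a list of allele pairs.

    Assumption: the first six columns are metadata and column 2 is the
    sample ID, as in PLINK's PED layout.
    """
    fields = line.split()
    sample_id = fields[1]
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two allele calls per SNP")
    pairs = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return sample_id, pairs

def parse_map_line(line: str) -> str:
    """Return the SNP identifier (second column) from one map-file line."""
    return line.split()[1]

# Hypothetical, shortened sample line; real lines carry 700,000+ SNPs.
raw = "3 SAMPLE-DNA 0 0 0 -9 C C G G A G 0 0"
map_lines = ["0 rs10005853 0 0", "0 rs10015934 0 0",
             "0 rs1004236 0 0", "0 rs10059646 0 0"]

sample_id, pairs = parse_raw_line(raw)
snp_ids = [parse_map_line(m) for m in map_lines]
calls: Dict[str, Tuple[str, str]] = dict(zip(snp_ids, pairs))
print(sample_id, calls)
```

A call of `0 0` is kept as-is here; the slide lists `0` alongside A, C, T, G as a possible value.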
What Did “Get Something Running” Look Like?
Single Beefy Box – Only option is to scale Vertically
Old Version
[Diagram: pipeline control flow. Labels: Pipeline Control, Run, Watch Dog, RakeshInit, 2) Enqueuer (DNA validation), 3) Poll status, Results Processing, Finalize, 4) Disc Management (V2), Monitor, Heart beat, Creates run, Reruns.]
AdMixture (Ethnicity), Beagle (Phasing), and GermLine (Matching) run on the “Beefy Box”.
Measure Everything Principle
• Record start time, end time, duration in seconds, and sample count for every step in the pipeline, plus the full end-to-end processing time
• Put the data in pivot tables and graph each step
• Normalize the data (sample size was changing)
• Use the data collected to predict future performance
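One way to collect the per-step numbers listed above is a small timing wrapper around each pipeline step. This is a minimal sketch, not the pipeline's actual instrumentation: `timed_step`, the `metrics` list, and the step name are hypothetical, and the per-sample figure is one possible way to normalize for changing batch sizes.

```python
import time
from contextlib import contextmanager

metrics = []  # one record per pipeline step (hypothetical in-memory store)

@contextmanager
def timed_step(name, sample_count):
    """Record start, end, duration, and sample count for one step."""
    start = time.time()
    try:
        yield
    finally:
        end = time.time()
        duration = end - start
        metrics.append({
            "step": name,
            "start": start,
            "end": end,
            "duration_sec": duration,
            "sample_count": sample_count,
            # Normalized value, since batch sizes change between runs.
            "sec_per_sample": duration / sample_count if sample_count else None,
        })

# Stand-in workload; a real step would run phasing, matching, and so on.
with timed_step("phasing", sample_count=1000):
    sum(range(100_000))

print(metrics[0]["step"], round(metrics[0]["duration_sec"], 4))
```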
#1
Challenges and Pain Points
Performance degrades when DNA pool grows
• Static (by batch size)
• Linear (by DNA pool size)
• Quadratic (Matching related steps) – Time bomb
(Courtesy of Keith’s plotting)
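The three growth classes above behave very differently as the DNA pool grows. The sketch below shows how much slower a step gets when the pool grows from one size to another, under the simplifying assumption that cost is exactly constant, proportional to pool size, or proportional to its square. `growth_factor` is a hypothetical helper, not part of the pipeline.

```python
def growth_factor(kind: str, old_pool: int, new_pool: int) -> float:
    """Return how many times longer a step takes after the pool grows."""
    ratio = new_pool / old_pool
    factors = {
        "static": 1.0,          # depends only on batch size
        "linear": ratio,        # scales with pool size
        "quadratic": ratio ** 2,  # scales with pool size squared
    }
    return factors[kind]

# Doubling the pool: the quadratic steps take four times as long.
for kind in ("static", "linear", "quadratic"):
    print(kind, growth_factor(kind, 10_000, 20_000))
```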
Parallel Ethnicity Jobs
Use Hadoop as a job processor
Why Attack Ethnicity First?
• Smart developers, little Hadoop experience
– Using Hadoop as a job scheduler and scaling the ethnicity step
was easier than redesigning the matching step
• AdMixture is a self-contained application
– Reference panel, the user's DNA, and a seed value for inputs
– CPU intensive job that writes to stdout
• Easy to split up the input
• Looked hard enough at the matching problem to realize an HBase/MapReduce solution was realistic
Parallel Ethnicity Jobs
Typical run of 1000 samples. Queue up one Hadoop job
with 40 tasks, 25 samples per task
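The split described above is plain fixed-size chunking. A minimal sketch, with hypothetical sample IDs and a hypothetical `split_into_tasks` helper; how each task then invokes AdMixture is not shown on the slide and is left out here.

```python
def split_into_tasks(samples, samples_per_task):
    """Group a run's samples into fixed-size task inputs."""
    return [samples[i:i + samples_per_task]
            for i in range(0, len(samples), samples_per_task)]

# Hypothetical IDs for a typical run of 1000 samples.
samples = [f"sample_{i:04d}" for i in range(1000)]
tasks = split_into_tasks(samples, samples_per_task=25)

print(len(tasks), len(tasks[0]))
```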
[Diagram: 1) MapReduce job on the Hadoop cluster (20 x 4 slots x 96g); the servers each run AdMixture processes.]
#2
Results
1000 sample runs under 3 hours (one interesting bug)
[Chart: Admixture Time. AdMixture time (sec, axis 0 to 100,000) and sum of run size, per run from 2012-03-01 to 2013-08-06.]
Freed up the “Beefy Box”
• Moving AdMixture off left an additional 10 threads for
phasing and matching
• Memory was freed up for phasing and matching
• Just moving AdMixture off saved over 6 hours of processing on the single box
– Bought us time
New Matching Algorithm
Hadoop and HBase
What is GERMLINE?
• GERMLINE is an algorithm that finds hidden relationships
within a pool of DNA
• GERMLINE also refers to the reference implementation of
that algorithm written in C++
• You can find it here :
http://www1.cs.columbia.edu/~gusev/germline/
So what's the problem?
• GERMLINE (the implementation) was not meant to be
used in an industrial setting
• Stateless
• Single threaded
• Prone to swapping (heavy memory usage)
• Generic
• Used for any DNA (fish, fruit fly, human, …)
• GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would
slow to a crawl
• Put simply : GERMLINE couldn't scale
[Chart: GERMLINE Run Times (in hours). Hours (axis 0 to 25) versus number of samples (2,500 to 60,000).]
[Chart: Projected GERMLINE Run Times (in hours). Hours (axis 0 to 700) versus number of samples (2,500 to 122,500). Series: GERMLINE run times, projected GERMLINE run times.]
The Mission : Create a Scalable Matching Engine
... and thus Jermline was born
(aka "Jermline with a J")
The Input
Starbuck : ACTGACCTAGTTGAC
Adama : TTAAGCCTAGTTGAC
Kara Thrace, aka Starbuck
• Ace viper pilot
• Has a special destiny
• Not to be trifled with
Admiral Adama
• Admiral of the Colonial Fleet
• Routinely saves humanity from destruction
DNA Matching : How it Works
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Separate into words
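The word-splitting step can be sketched as fixed-length slicing. The word length of 5 matches the toy example on the slide; the word length used in production is not given, so treat it as a parameter.

```python
def split_into_words(sequence: str, word_length: int = 5):
    """Cut a sequence into consecutive, non-overlapping words."""
    return [sequence[i:i + word_length]
            for i in range(0, len(sequence), word_length)]

starbuck_words = split_into_words("ACTGACCTAGTTGAC")
adama_words = split_into_words("TTAAGCCTAGTTGAC")
print(starbuck_words)
print(adama_words)
```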
DNA Matching : How it Works
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
Build the hash table
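The hash table on this slide maps each word-plus-position key to the samples that carry it. A minimal in-memory sketch follows; `build_hash_table` is a hypothetical name, and an ordinary Python dict stands in for whatever structure the real implementation uses.

```python
from collections import defaultdict

def build_hash_table(samples, word_length=5):
    """Map "WORD_POSITION" keys to the list of samples with that word there."""
    table = defaultdict(list)
    for name, sequence in samples.items():
        for pos in range(len(sequence) // word_length):
            word = sequence[pos * word_length:(pos + 1) * word_length]
            table[f"{word}_{pos}"].append(name)
    return dict(table)

samples = {
    "Starbuck": "ACTGACCTAGTTGAC",
    "Adama": "TTAAGCCTAGTTGAC",
}
hash_table = build_hash_table(samples)
for key, names in hash_table.items():
    print(key, ":", ", ".join(names))
```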
DNA Matching : How it Works
Iterate through genome and find matches
Starbuck and Adama match from position 1 to position 2
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
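Finding matches from the hash table amounts to collecting, for every pair of samples, the positions where they share a word, and then merging consecutive positions into ranges. The sketch below does exact word matching only; any tolerance for mismatches is left out, and `find_matches` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations

def find_matches(samples, word_length=5):
    """Return {(name_a, name_b): [(first_pos, last_pos), ...]}."""
    # Hash table keyed by (position, word).
    table = defaultdict(list)
    for name, sequence in samples.items():
        for pos in range(len(sequence) // word_length):
            word = sequence[pos * word_length:(pos + 1) * word_length]
            table[(pos, word)].append(name)

    # For each pair, the positions where both carry the same word,
    # visited in increasing position order.
    shared = defaultdict(list)
    for (pos, _word), names in sorted(table.items()):
        for pair in combinations(sorted(names), 2):
            shared[pair].append(pos)

    # Merge runs of consecutive positions into (first, last) ranges.
    matches = {}
    for pair, positions in shared.items():
        ranges = []
        for pos in positions:
            if ranges and pos == ranges[-1][1] + 1:
                ranges[-1] = (ranges[-1][0], pos)
            else:
                ranges.append((pos, pos))
        matches[pair] = ranges
    return matches

matches = find_matches({
    "Starbuck": "ACTGACCTAGTTGAC",
    "Adama": "TTAAGCCTAGTTGAC",
})
print(matches)
```

Pairs are keyed in alphabetical order, so the Starbuck and Adama match appears under `("Adama", "Starbuck")`.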
DNA Matching : How it Works
Does that mean they're related?
...maybe
Baltar : TTAAGCCTAGGGGCG
But wait... what about Baltar?
Gaius Baltar
• Handsome
• Genius
• Kinda evil
Adding a new sample, the GERMLINE way
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG
ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
Step one : Rebuild the entire hash table from scratch, including the new sample
The GERMLINE Way
Starbuck and Adama match from position 1 to position 2
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
Step two : Find everybody's matches all over again, including the new sample. (n x n comparisons)
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG
ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
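The cost of redoing every comparison can be put in numbers. The sketch below counts sample pairs only, which is an assumption about what drives the run time: a full rerun compares every pair in the enlarged pool, while comparing only the new samples (against the existing pool and each other) touches far fewer. Both function names are hypothetical.

```python
def full_rerun_pairs(pool_size: int, batch_size: int) -> int:
    """Pairs compared when everything is recomputed from scratch."""
    total = pool_size + batch_size
    return total * (total - 1) // 2

def incremental_pairs(pool_size: int, batch_size: int) -> int:
    """Pairs that involve at least one new sample."""
    return pool_size * batch_size + batch_size * (batch_size - 1) // 2

# Toy example from the slide: Starbuck and Adama exist, Baltar is new.
print(full_rerun_pairs(2, 1), incremental_pairs(2, 1))

# A 1,000-sample batch added to a pool of 120,000.
print(full_rerun_pairs(120_000, 1_000), incremental_pairs(120_000, 1_000))
```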
The GERMLINE Way
Starbuck and Adama match from position 1 to position 2
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
Step three : Now, throw away the evidence!
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG
ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
You have done this before, and you will have
to do it ALL OVER AGAIN.
The GERMLINE Way
Not so good, right?
Now let's take a look at the Jermline way.
Step one : Update the hash table.
Already stored in HBase:
             Starbuck   Adama
2_ACTGA_0    1
2_TTAAG_0               1
2_CCTAG_1    1          1
2_TTGAC_2    1          1
New sample to add:
Baltar : TTAAG CCTAG GGGCG
Add a column for every new sample for each user
Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that
position on that chromosome
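With the key layout above, adding a sample means writing one cell per word and reading back whoever is already in that row. The sketch below uses a nested Python dict as a stand-in for the HBase table (row key, then qualifier, then cell value); it makes no HBase client calls, and `add_sample` is a hypothetical name.

```python
from collections import defaultdict

def add_sample(word_table, chromosome, user_id, sequence, word_length=5):
    """Write the new user's words and return {existing_user: [positions]}.

    word_table maps row key "[CHROMOSOME]_[WORD]_[POSITION]" to a dict of
    {user_id: 1}, mirroring the row/qualifier/cell layout on the slide.
    """
    shared = defaultdict(list)
    for pos in range(len(sequence) // word_length):
        word = sequence[pos * word_length:(pos + 1) * word_length]
        row = word_table.setdefault(f"{chromosome}_{word}_{pos}", {})
        for other in row:          # users already stored in this row
            shared[other].append(pos)
        row[user_id] = 1           # new column for the new sample
    return dict(shared)

word_table = {}
add_sample(word_table, 2, "Starbuck", "ACTGACCTAGTTGAC")
add_sample(word_table, 2, "Adama", "TTAAGCCTAGTTGAC")

# Only Baltar's rows are touched; nothing is rebuilt.
baltar_shared = add_sample(word_table, 2, "Baltar", "TTAAGCCTAGGGGCG")
print(baltar_shared)
```

The returned positions are the raw material for the match step: Baltar shares positions 0 and 1 with Adama and position 1 with Starbuck.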
The Jermline way
New matches to add:
Baltar and Adama match from position 0 to position 1
Baltar and Starbuck match at position 1
Already stored in HBase:
              2_Starbuck        2_Adama
2_Starbuck                      { (1, 2), ...}
2_Adama       { (1, 2), ...}
“Fuzzy Match” the consecutive words. Worst case: Identical twins
Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
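The match table described above can be sketched the same way, with a nested dict standing in for HBase. The example collapses shared positions into ranges and writes each match in both directions, so that either user's row can be read on its own. Fuzzy matching is not implemented here, a single-position match is written as `(1, 1)` rather than `(1)`, and the helper names are hypothetical.

```python
def to_ranges(positions):
    """Collapse sorted positions into (first, last) runs of consecutive words."""
    ranges = []
    for pos in sorted(positions):
        if ranges and pos == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], pos)
        else:
            ranges.append((pos, pos))
    return ranges

def store_matches(match_table, chromosome, new_user, shared_positions):
    """Write the new user's matches under both users' row keys."""
    new_key = f"{chromosome}_{new_user}"
    for other, positions in shared_positions.items():
        other_key = f"{chromosome}_{other}"
        ranges = to_ranges(positions)
        match_table.setdefault(new_key, {})[other_key] = list(ranges)
        match_table.setdefault(other_key, {})[new_key] = list(ranges)

# Already stored: Starbuck and Adama match from position 1 to 2.
match_table = {
    "2_Starbuck": {"2_Adama": [(1, 2)]},
    "2_Adama": {"2_Starbuck": [(1, 2)]},
}
# Positions Baltar shares with each existing user on chromosome 2.
store_matches(match_table, 2, "Baltar", {"Adama": [0, 1], "Starbuck": [1]})
print(match_table["2_Baltar"])
```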
The Jermline way
Step two : Find matches.
              2_Starbuck        2_Adama           2_Baltar
2_Starbuck                      { (1, 2), ...}    { (1), ...}
2_Adama       { (1, 2), ...}                      { (0,1), ...}
2_Baltar      { (1), ...}       { (0,1), ...}
The Jermline way
             Starbuck   Adama   Baltar
2_ACTGA_0    1
2_TTAAG_0               1       1
2_CCTAG_1    1          1       1
2_TTGAC_2    1          1
2_GGGCG_2                       1
These are the updated tables after adding Baltar’s information
Only looking at 3 samples, chromosome #2, positions 0, 1, and 2
Very simple example of how the matching process works
But wait ... what about Zarek, Roslin, Hera, and Helo?
Photo by Benh Lieu Song
Run them in parallel with Hadoop!
• Batches are usually about a thousand people.
• Each mapper takes a single chromosome for a single person.
o Three samples per task means 22 jobs with 334 tasks (1000/3) each
• MapReduce Jobs :
Job #1 : Match Words
• Updates the hash table
Job #2 : Match Segments
• Identifies areas where the samples match
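The task counts quoted above follow from simple arithmetic. The sketch assumes one job per autosome (22, since the test is autosomal) and rounds the last partial task up; `plan_jobs` and the constant name are hypothetical.

```python
import math

AUTOSOME_COUNT = 22  # assumption: one MapReduce job per autosome

def plan_jobs(batch_size: int, samples_per_task: int):
    """Return (number of jobs, tasks per job) for one batch."""
    tasks_per_job = math.ceil(batch_size / samples_per_task)
    return AUTOSOME_COUNT, tasks_per_job

jobs, tasks_per_job = plan_jobs(batch_size=1000, samples_per_task=3)
print(jobs, tasks_per_job, jobs * tasks_per_job)
```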
Parallelism with Hadoop
How does Jermline perform?
A 1700% improvement over GERMLINE!
Along with more accurate results
#3
[Chart: Run Times For Matching (in hours). Hours (axis 0 to 25) versus number of samples (2,500 to 120,000).]
[Chart: Run Times For Matching (in hours). Hours (axis 0 to 180) versus number of samples (2,500 to 120,000). Series: GERMLINE run times, Jermline run times, projected GERMLINE run times.]
• Support the business, move incrementally and adjust
• After H2, pipeline speed stays flat
• (Courtesy of Bill’s plotting)
Incremental Changes Over Time
Bottom line : Without Hadoop and HBase, this would
have been expensive and difficult.
• Previously, we ran GERMLINE on a single "beefy box".
• 12-core 2.2 GHz Opteron 6174 with 256 GB of RAM
• We had upgraded this machine until it couldn't be upgraded any more.
• Processing time was unacceptable, growth was unsustainable.
• To continue running GERMLINE on a single box, we would have required a vastly more powerful machine, probably at the supercomputer level – at considerable cost!
• Now, we run Jermline on a cluster.
• 20 x 12-core 2 GHz Xeon E5-2620 with 96 GB of RAM
• We can now run 16 batches per day, whereas before we could only run one.
• Most importantly, growth is sustainable. To add capacity, we need only add more nodes.
Dramatically Increased our Capacity
What’s Next?
Hadoop and HBase
Continue to Evolve the Software
• Azkaban for job control
– Nearly complete
• Phasing
– Still runs on the “Beefy Box”, 1000 samples take over 11 hours
– Total run time for 1000 samples is about 14 hours.
– Re-implement with HBase, MapReduce, Hadoop
• Version Updates
– New algorithms require us to re-run the entire DNA pool
– Burst capacity to the cloud
• Machine Learning
– Matching (V2) and Ethnicity (V3) both would benefit from a
Machine Learning approach
End of the Journey (for now) - Questions?
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Using Hadoop and HBase for DNA Matching at Scale

  • 1. Ancestry DNA at Scale: Using Hadoop and HBase. September 7, 2013
  • 2. What does this talk cover? • What does Ancestry do? • How did our journey with Hadoop start? • Using Hadoop as a Job Processor • DNA Matching with Hadoop and HBase • What’s next?
  • 4. Discoveries Are the Key • Over 30,000 historical content collections • 11 billion records and images • Records dating back to 16th century • 4 petabytes We are the world's largest online family history resource.
  • 5. The “eureka” moment drives our business Discoveries In Detail
  • 6. Discoveries With DNA • Spit in a tube, pay $99, learn your past • Autosomal DNA tests • Over 120,000 DNA samples • 700,000 SNPs for each sample • 6,000,000+ 4th cousin matches [Figure: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). (http://en.wikipedia.org/wiki/Single-nucleiotide_polymorphism)] [Chart: genotyped samples over time; y-axis 0 to 150,000]
  • 7. What does the customer see?
  • 8. Network Effect – Cousin Matches [Chart: cousin matches (y-axis, 0 to 3,500,000) against database size (x-axis: 2,000; 10,053; 21,205; 40,201; 60,240; 80,405; 115,756)]
  • 9. Where Did We Start? The process before Hadoop
  • 10. What’s the Story? • Cast of Characters (Scientists and Software Engineers) • Pressures of a startup business – Release a product, learn, and then scale • Sr. Manager, 5 developers, and a 4-member Science Team. Scientists think they can code: • Linux • MySQL • Perl and/or Python. Software Engineers think they are Scientists: • Biology in HS and College • Math/Statistics • Read science papers
  • 11. DNA Input. Raw Data (A,C,T,G,0): 3 123456789_RZZZZ2_XXXXXXH3Q7U7Q2B_YYYY84598-DNA 0 0 0 -9 C C G G G G G G A A A A C C G G A A A A C C G G G G A A G G G A A A G G A G A A C C A A A A G G A A A G G G G G C C G G A A G G G G G G G A A A A C G A A A A G A G A A A A G G G G G G A G G G G G G G … (continues for 700,000+ SNPs) Map File: 0 rs10005853 0 0 0 rs10015934 0 0 0 rs1004236 0 0 0 rs10059646 0 0 0 rs10085382 0 0 0 rs10123921 0 0 0 rs10127827 0 0 0 rs10155688 0 0 0 rs10162780 0 0 0 rs1017484 0 0 0 rs10188129 0 0
  • 12. What Did “Get Something Running” Look Like? Single Beefy Box – the only option is to scale vertically. [Diagram: old-version pipeline. Labels: Pipeline Control, Run Watch Dog, Init, Enqueuer (DNA validation), Monitor, Poll status, Heartbeat, Creates run, Reruns, Results Processing, Finalize, Disc Management (V2). AdMixture (Ethnicity), Beagle (Phasing), and GERMLINE (Matching) all run on the “Beefy Box”.]
  • 13. Measure Everything Principle • Record start time, end time, duration in seconds, and sample count for every step in the pipeline, plus the full end-to-end processing time • Put the data in pivot tables and graph each step • Normalize the data (sample size was changing) • Use the data collected to predict future performance (Story #1)
  • 14. Challenges and Pain Points. Performance degrades when the DNA pool grows • Static (by batch size) • Linear (by DNA pool size) • Quadratic (matching-related steps) – Time bomb (Courtesy of Keith’s plotting)
  • 15. Parallel Ethnicity Jobs. Use Hadoop as a job processor
  • 16. Why Attack Ethnicity First? • Smart developers, little Hadoop experience – Using Hadoop as a job scheduler and scaling the ethnicity step was easier than redesigning the matching step • AdMixture is a self-contained application – Reference panel, the user's DNA, and a seed value for inputs – CPU-intensive job that writes to stdout • Easy to split up the input • Looked hard enough at the matching problem to realize that an HBase/MapReduce solution was realistic
  • 17. Parallel Ethnicity Jobs. Typical run of 1000 samples: queue up one Hadoop job with 40 tasks, 25 samples per task. [Diagram: MapReduce on a Hadoop cluster (20 x 4 slots x 96g); each server runs AdMixture tasks.] (Story #2)
  • 18. Results: 1000-sample runs finish in under 3 hours (one interesting bug). [Chart: AdMixture time (sec, 0 to 100,000) and sum of run size, per run from 2012-03-01 to 2013-08-06]
  • 19. Freed up the “Beefy Box” • Moving AdMixture off left an additional 10 threads for phasing and matching • Memory was freed up for phasing and matching • Just moving AdMixture off saved over 6 hours of processing on the single box – Bought us time
  • 21. What is GERMLINE? • GERMLINE is an algorithm that finds hidden relationships within a pool of DNA • GERMLINE also refers to the reference implementation of that algorithm written in C++ • You can find it here : http://www1.cs.columbia.edu/~gusev/germline/
  • 22. So what's the problem? • GERMLINE (the implementation) was not meant to be used in an industrial setting • Stateless • Single threaded • Prone to swapping (heavy memory usage) • Generic • Used for any DNA (fish, fruit fly, human, …) • GERMLINE performs poorly on large data sets • Our metrics predicted exactly where the process would slow to a crawl • Put simply : GERMLINE couldn't scale
  • 24. Projected GERMLINE Run Times (in hours). [Chart: hours (y-axis, 0 to 700) against number of samples (x-axis, 2,500 to 122,500); series: GERMLINE run times, projected GERMLINE run times]
  • 25. The Mission : Create a Scalable Matching Engine ... and thus Jermline was born (aka "Jermline with a J")
  • 26. DNA Matching : How it Works. The Input: Starbuck : ACTGACCTAGTTGAC; Adama : TTAAGCCTAGTTGAC. Kara Thrace, aka Starbuck • Ace viper pilot • Has a special destiny • Not to be trifled with. Admiral Adama • Admiral of the Colonial Fleet • Routinely saves humanity from destruction
  • 27. DNA Matching : How it Works. Separate into words (positions 0, 1, 2): Starbuck : ACTGA CCTAG TTGAC; Adama : TTAAG CCTAG TTGAC
  • 28. DNA Matching : How it Works. Build the hash table: ACTGA_0 : Starbuck; TTAAG_0 : Adama; CCTAG_1 : Starbuck, Adama; TTGAC_2 : Starbuck, Adama
  • 29. DNA Matching : How it Works. Iterate through the genome and find matches: Starbuck and Adama match from position 1 to position 2
  • 30. Does that mean they're related? ...maybe
  • 31. But wait... what about Baltar? Baltar : TTAAGCCTAGGGGCG. Gaius Baltar • Handsome • Genius • Kinda evil
  • 32. Adding a new sample, the GERMLINE way
  • 33. The GERMLINE Way. Step one : Rebuild the entire hash table from scratch, including the new sample. Words (positions 0, 1, 2): Starbuck : ACTGA CCTAG TTGAC; Adama : TTAAG CCTAG TTGAC; Baltar : TTAAG CCTAG GGGCG. Hash table: ACTGA_0 : Starbuck; TTAAG_0 : Adama, Baltar; CCTAG_1 : Starbuck, Adama, Baltar; TTGAC_2 : Starbuck, Adama; GGGCG_2 : Baltar
  • 34. The GERMLINE Way. Step two : Find everybody's matches all over again, including the new sample (n x n comparisons). Starbuck and Adama match from position 1 to position 2; Adama and Baltar match from position 0 to position 1; Starbuck and Baltar match at position 1
  • 35. The GERMLINE Way. Step three : Now, throw away the evidence! You have done this before, and you will have to do it ALL OVER AGAIN.
  • 36. Not so good, right? Now let's take a look at the Jermline way.
  • 37. The Jermline Way. Step one : Update the hash table. Add a column for every new sample for each user.
      Already stored in HBase:
                    Starbuck  Adama
        2_ACTGA_0   1
        2_TTAAG_0             1
        2_CCTAG_1   1         1
        2_TTGAC_2   1         1
      New sample to add: Baltar : TTAAG CCTAG GGGCG
      Key : [CHROMOSOME]_[WORD]_[POSITION]
      Qualifier : [USER ID]
      Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome
  • 38. The Jermline Way. Step two : Find matches. “Fuzzy Match” the consecutive words. Worst case: identical twins.
      Already stored in HBase:
                      2_Starbuck       2_Adama
        2_Starbuck                     { (1, 2), ...}
        2_Adama       { (1, 2), ...}
      New matches to add: Baltar and Adama match from position 0 to position 1; Baltar and Starbuck match at position 1
      Key : [CHROMOSOME]_[USER ID]
      Qualifier : [CHROMOSOME]_[USER ID]
      Cell value : A list of ranges where the two users match on a chromosome
  • 39. The Jermline Way. These are the updated tables after adding Baltar’s information. Only looking at 3 samples, chromosome #2, positions 0, 1, and 2. Very simple example of how the matching process works.
                      2_Starbuck       2_Adama          2_Baltar
        2_Starbuck                     { (1, 2), ...}   { (1), ...}
        2_Adama       { (1, 2), ...}                    { (0,1), ...}
        2_Baltar      { (1), ...}      { (0,1), ...}

                    Starbuck  Adama  Baltar
        2_ACTGA_0   1
        2_TTAAG_0             1      1
        2_CCTAG_1   1         1      1
        2_TTGAC_2   1         1
        2_GGGCG_2                    1
  • 40. But wait ... what about Zarek, Roslin, Hera, and Helo?
  • 41. Run them in parallel with Hadoop! (Photo by Benh Lieu Song)
  • 42. Parallelism with Hadoop • Batches are usually about a thousand people. • Each mapper takes a single chromosome for a single person. – Three samples per task means 22 jobs with 334 tasks (1000/3) each • MapReduce jobs: – Job #1 : Match Words (updates the hash table) – Job #2 : Match Segments (identifies areas where the samples match)
  • 43. How does Jermline perform? A 1700% improvement over GERMLINE! Along with more accurate results (Story #3)
  • 45. Run Times For Matching (in hours). [Chart: hours (y-axis, 0 to 180) against number of samples (x-axis, 2,500 to 120,000); series: GERMLINE run times, Jermline run times, projected GERMLINE run times]
  • 46. Incremental Changes Over Time • Support the business, move incrementally and adjust • After H2, pipeline speed stays flat • (Courtesy of Bill’s plotting)
  • 47. Dramatically Increased our Capacity. Bottom line : Without Hadoop and HBase, this would have been expensive and difficult. • Previously, we ran GERMLINE on a single "beefy box". – 12-core 2.2 GHz Opteron 6174 with 256 GB of RAM – We had upgraded this machine until it couldn't be upgraded any more. – Processing time was unacceptable, growth was unsustainable. – To continue running GERMLINE on a single box, we would have required a vastly more powerful machine, probably at the supercomputer level, at considerable cost! • Now, we run Jermline on a cluster. – 20 x 12-core 2 GHz Xeon E5-2620 with 96 GB of RAM – We can now run 16 batches per day, whereas before we could only run one. • Most importantly, growth is sustainable. To add capacity, we need only add more nodes.
  • 49. Continue to Evolve the Software • Azkaban for job control – Nearly complete • Phasing – Still runs on the “Beefy Box”, 1000 samples take over 11 hours – Total run time for 1000 samples is about 14 hours. – Re-implement with HBase, MapReduce, Hadoop • Version Updates – New algorithms require us to re-run the entire DNA pool – Burst capacity to the cloud • Machine Learning – Matching (V2) and Ethnicity (V3) both would benefit from a Machine Learning approach
  • 50. End of the Journey (for now) - Questions?
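
The matching walk-through in slides 26 to 39 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Jermline code: plain dictionaries stand in for the two HBase tables, the function names `split_words`, `merge_ranges`, and `add_sample` are hypothetical, the 5-letter word size is taken from the slide example, and only exact word matches are found (the "fuzzy match" step is left out).

```python
from collections import defaultdict

WORD_LEN = 5  # word size from the slide example; an assumption, not a production value


def split_words(sequence, word_len=WORD_LEN):
    """Cut a sequence into fixed-length words: 'ACTGACCTAG' -> ['ACTGA', 'CCTAG']."""
    return [sequence[i:i + word_len] for i in range(0, len(sequence), word_len)]


def merge_ranges(positions):
    """Collapse ascending positions into inclusive runs: [0, 1, 3] -> [(0, 1), (3, 3)]."""
    ranges = []
    for p in positions:
        if ranges and p == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], p)
        else:
            ranges.append((p, p))
    return ranges


def add_sample(word_table, match_table, chromosome, user, sequence):
    """Add one sample without rebuilding the tables or re-matching existing users.

    word_table:  '[CHROMOSOME]_[WORD]_[POSITION]' -> set of user ids (dict stand-in for HBase)
    match_table: frozenset({user_a, user_b}) -> list of inclusive (start, end) word ranges
    """
    shared = defaultdict(list)  # existing user -> positions where the words are identical
    for position, word in enumerate(split_words(sequence)):
        row = word_table[f"{chromosome}_{word}_{position}"]
        for other in row:
            shared[other].append(position)
        row.add(user)  # the new "column" for this user
    for other, positions in shared.items():
        match_table[frozenset((user, other))] = merge_ranges(positions)


word_table = defaultdict(set)
match_table = {}
add_sample(word_table, match_table, 2, "Starbuck", "ACTGACCTAGTTGAC")
add_sample(word_table, match_table, 2, "Adama", "TTAAGCCTAGTTGAC")
add_sample(word_table, match_table, 2, "Baltar", "TTAAGCCTAGGGGCG")
```

Each call touches only the rows for the new sample's own words, which is the contrast the slides draw with rebuilding the whole hash table and repeating every comparison for each batch.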

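The job layout described on slide 42 (one job per chromosome, three samples per task) comes down to simple arithmetic. The sketch below only illustrates that split; `plan_jobs` and the sample ids are hypothetical, and nothing here talks to Hadoop.

```python
CHROMOSOMES = range(1, 23)  # the 22 chromosomes processed, one job each


def plan_jobs(sample_ids, samples_per_task=3):
    """Return {chromosome: [task, ...]}, where each task is a list of sample ids."""
    tasks = [sample_ids[i:i + samples_per_task]
             for i in range(0, len(sample_ids), samples_per_task)]
    return {chromosome: tasks for chromosome in CHROMOSOMES}


batch = [f"sample-{n}" for n in range(1000)]  # hypothetical ids for a 1000-person batch
plan = plan_jobs(batch)
print(len(plan), len(plan[1]))  # number of jobs, tasks per job
```

For a 1000-sample batch this gives 22 jobs of 334 tasks each: 333 tasks carry three samples and the last one carries the single remaining sample.
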
Editor's notes

  1. Job Processor: As you will see, we started our Hadoop/DNA journey with something that was fairly basic and then we moved to the matching problem. DNA Matching: We will walk through an example of how matching works, discuss how GERMLINE implemented the matching, and contrast that with the Hadoop/HBase implementation we created.
  2. At Ancestry.com our mission is to help people discover, preserve and share their family history.
  3. Everything from birth certificates, obituaries, immigration records, census records, voter registration, old phone books, everything.
  4. Typically, the way it works is this: You search through our records to find one of your relatives. Once you've found enough records that you're satisfied you've found your relative, you attach them to your family tree. After that, Ancestry goes to work for you. Our search engine takes a look at your whole tree to find relatives that you may not know about yet, and presents these to you as hints. (shaky leaf) You can then examine these hints and see if they are, in fact, related to you. It's pretty cool! And the beauty of it is that, say you've found a relative who's researched their family tree pretty extensively? Well, you get to piggyback on all that research by simply adding their family tree to yours. A fine example of crowdsourcing.
  5. Spit in a tube, pay $99 and learn about your past. That is how Derrick Harris of GigaOm described what we do. DNA is found in every living cell – it is the genetic material that encodes all of the information required to create and maintain life. DNA is passed down from parent to child and is like breadcrumbs left by our ancestors. And changes in DNA across generations give us a view into history. We can take those breadcrumbs and determine with a large degree of accuracy what your ethnicity is and who else in our database might be your cousin. If we determine that you have a 4th cousin then you likely share a common ancestor with that person between 7 and 10 generations ago or 150 – 300 years ago. We have a team of data scientists and bioinformatics PhDs working on this effort and have very quickly acquired over 120,000 DNA samples for people that have family trees on our site. Each DNA sample is composed of over 700,000 SNPs or location markers. In order to compare the 700,000 SNPs from each new sample with the 700,000 SNPs from each existing sample that is already in our database we have a sophisticated pipeline of algorithms that run using Hadoop, HBase and MapReduce for parallel distributed processing. What is our confidence rate of a 4th cousin match? The average customer has close to 30 fourth cousin matches.
  6. Top left our ethnicity chart. To the right, Tree view with cousin hints and surnames in another member’s public tree. Maps pinpointing birth locations. List of surnames that appear in both trees.
  7. The bottom red line is the size of our DNA pool (i.e. each unique sample in our database). The black line is the number of cousin matches we’ve calculated at the particular DNA pool size. As you can see, the matches start to compound and grow quadratically as the pool size increases. This is a good thing. It means we can find genetic relatives for most customers who take the DNA test. The cousin matches are actually a Big Data problem for our Front End. We are looking at different ways to handle the transfer, storage, and growth of the cousin match data as the DNA pool size increases.
  8. Every scientist thinks they can code – because they have been doing it for a long time on their own or in an academic environment. But they don’t know what it means to build, deploy, and support “production” code. Software engineers understand production code. They just think they understand the math and statistics – after all they are computer scientists. They can understand the science behind DNA, after all, they took Biology in high school. That is nowhere near the education of a Bioinformatics or Population Genetics PhD. The Science Team are the domain experts and the engineers are required to build a production system to meet the domain experts’ needs. We really started light: 3 developers and 2 scientists. In fact, for the first 3 months we “borrowed” engineers from other projects to get this started.
  9. 5 possible values – not 4. A C T G and zero. Zero indicates a “read” failure at that position. No sample is perfect, extraction could be off, each run on the same sample will come up with zeros in different spots. QC checks on the sample: if there are too many “zeros” we have the lab try the extraction again. If that fails 2 more times, we issue a recollect (send another kit to the customer and ask them to submit their DNA again). The map file tells you where each value is on a particular chromosome.
  10. Ran AdMixture on 10 threads, Phasing and Germline on 10 threads. AdMixture would usually finish before Beagle (Phasing) and that freed up more memory and threads for Germline. In all, a 500 sample run took about 24 hours to complete (pool size < 25K). IF WE STAYED IN THIS CONFIGURATION (WHICH MATCHED MANY ACADEMIC ENVIRONMENTS) THE ONLY OPTION WAS TO INCREASE THE HARDWARE. MORE CPUS, MORE MEMORY. SCALING VERTICALLY JUST PLAIN SUCKS!
  11. Critically important. In software development you must measure your performance at every step. Does not matter what you are doing, if you are not measuring your performance, you can’t improve. The last point is critical. We could determine the formula for performance of key phases (correlate this) and use that formula to predict future performance at particular DNA pool sizes. We could see the problems coming and knew when we were going to have performance issues. Story #1: Our first step that was going out of control (going quadratic) was the first implementation of the relationship calculation – happens just after matching. This step was basically two nested for loops that walked over the entire DNA pool for each input sample. Simple code, it worked with small numbers, fell over fast. Time was approaching 5 hours to run. Two of my developers rewrote this in Perl and got it down to 2 minutes 30 seconds. They were ecstatic. One of our DNA Scientists (PhD in Bioinformatics, MS in Computer Science – he knows how to code) wrote an AWK command (nasty regular expressions) that ran in less than 10 seconds. My devs were humbled. For the next week, whenever they ran into Keith, they formally bowed to his skills. (All in good nature, all fun.)
  12. Static by batch size (Phasing). Some steps took a long time but were very consistent. A worry but not critical to fix up front. Linear by DNA pool size (Pipeline Initialization). Looked at ways to streamline and improve performance of these steps. Quadratic – those are the time bombs (Germline, Relationship processing, Germline results processing). The only way we knew this was coming was because we measured each step in the pipeline.
  13. KEY POINT: We knew we wanted to move this to Hadoop to solve matching. With that end goal in mind, we attacked the AdMixture/Ethnicity step. Without that initial investigation and discovery step, we could have used an MPI Linux environment or some other way to scale AdMixture.
  14. Story 2: First job we put through was a single job with 500 tasks (sample size of 500). AdMixture is a C++, multithreaded app. When we kicked up all the tasks, it did not leave enough CPU time for the “task health check” to run in the background on the Hadoop node. So the Job Controller would reach out and kill some jobs because they were “misbehaving” – when in fact they were running just fine. Remember Hadoop is intimately aware of the JVM and how it is running. Hadoop does not have a good view into other applications you choose to run. Since AdMixture was C++, Hadoop had no idea how much memory, threads, or CPU was being used per “slot”. We had to back things off, so there was enough room for the Job Controller to get an “ACK” indicating the jobs were running fine. THE ONLY WAY TO UNDERSTAND HADOOP’S CAPABILITIES AND LIMITATIONS IS TO USE IT! BE READY FOR SOME SURPRISES.
  15. Really happy with this performance. 1000 samples usually run in 2 hours 30 minutes to 2 hours 45 minutes. Two spikes to explain (this is the bug): There is a bug in AdMixture (remember, created by someone who wanted to finish their CS Masters for an academic situation) that showed up occasionally. The program would literally “get lost” and never complete. It would not GPF or throw an error, it just swallowed up the CPU and never completed. Even worse, it usually happened on chromosome 1 or 2, the biggest chromosomes we process. We put a timeout on our tasks. If a task did not finish in 2 hours, we killed it, changed the seed value and resubmitted a new task. This fixes the problem. That explains the spikes.
  16. This was a great first step. We got valuable Hadoop, MapReduce, job control experience and this first step BOUGHT US TIME! It gave us the confidence to start working on the GERMLINE matching problem.
  17. Very smart people at Columbia University came up with GERMLINE.
  18. Remember, for an academic, running a 1000 sample set through GERMLINE was “large”. I’ve talked to people who kept re-running the same 50 fish DNA samples through GERMLINE to clean up the variations between sample extractions (think of it as eliminating all the zeros). In a lot of ways, we were using GERMLINE in a way that it was not built for.
  19. Mention how we kept upgrading and tightening things up
  20. Our projections showed how bad the execution time would get. As we approached 120K for the DNA pool size, each additional 500 sample set would require 700 hours to complete – over 4 weeks.
  21. Germline with a “J” (lead engineer’s first name is Jeremy). This was a “clean room” implementation of the algorithm. Read the reference paper, don’t look at the C++ reference implementation. Work off the original (brilliant) paper.
  22. Using Battlestar Galactica for the matching example.
  23. For each person-to-person comparison, we add up the total length of their shared DNA and run that through a statistical model to see how closely they're related. This is the “Relationship Calculation” step that works on the GERMLINE output.
  24. Remind people that GERMLINE was stateless
  25. Anytime you see an N-by-N comparison in a computer problem you are working on it should send up huge red flags.
  26. HBase holds the data. (Mix between a spreadsheet and a hash table.) Adding columns is easy. Having a very sparse matrix is fine. Key is the chromosome, the word value, and position (which word). Each new sample adds a column to the table. A value of 1 in the cell indicates this user has this value at this location. A row holds all the samples in our DNA pool that have that same value. This is really a pretty simple implementation. Remember: SIMPLE SCALES.
  27. There is a second table for the fuzzy matching phase. It holds the list of ranges where two users match on a chromosome. This is used to create the output of the matching phase. Exactly where two individuals match on each chromosome.
  28. There were a whole bunch of characters on Battlestar Galactica!
  29. First run we kicked off one job with (500 samples x 22 chromosomes) 11,000 tasks using HBase 0.92 and we panicked the HBase region server. That’s where we came up with 22 jobs (one for each chromosome) with about 334 tasks per job. (We moved to HBase 0.94, which was much more stable.)
  30. Story #3: We would run samples through the old GERMLINE and the new Hadoop Jermline. For the most part, they always matched. We finally found a few runs where there were discrepancies. We had to pull in the Science Team to check – we had actually found a bug in the original GERMLINE implementation for an edge case. The clean room implementation of the Hadoop code was “more correct” than the original C++ GERMLINE reference code. Very gratifying to see – but the truth is it had us concerned and confused for about 3 days. We had made the natural assumption that the base implementation GERMLINE (with a ‘G’) was 100% correct. That assumption was wrong.
  31. This slide is a huge relief. We’ve been released and steady for a while. One note, the curve for H2 is not totally flat. It is going up ever so slightly. No worries. We can always add more nodes to the cluster and reduce the time.
  32. This is an “Agile” development story. Point out a few colors: Dark Green, Orange, Light Green at the top, and the Purple. The darker green is AdMixture, you can see when we moved to Hadoop (our H1 Release). Orange is the Matching Step (Germline to Jermline, our H2 Release). The lighter green is the pipeline finalization step. We eliminated most of this step when we released H2. We had a failsafe way to fall back to the completed steps of the previous run. We never wanted to fail in the middle of a run, destroy everything, and then have to rerun the entire pool from scratch. That was a key part of finalization pre-Jermline. The Purple is Phasing (Beagle). Static based on input size, very stable. On our hit list.
  33. The “Beefy Box” would be a good candidate for a large database server or a single node on a heavily used distributed cache (Memcached or Redis).
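
Notes 11, 12, and 20 describe fitting a formula to measured step times and using it to project run times at larger DNA pool sizes. The talk does not give the formula, so the sketch below assumes the simplest quadratic model, hours = a * pool_size², fitted by least squares. The function names and the measurements are hypothetical and are not Ancestry's data.

```python
def fit_quadratic_coefficient(pool_sizes, hours):
    """Least-squares fit of hours = a * pool_size**2 (no linear or constant term)."""
    numerator = sum(n ** 2 * t for n, t in zip(pool_sizes, hours))
    denominator = sum(n ** 4 for n in pool_sizes)
    return numerator / denominator


def project_hours(a, pool_size):
    """Projected run time in hours for a given DNA pool size."""
    return a * pool_size ** 2


# Hypothetical measurements: DNA pool size and hours to process one batch.
measured_sizes = [10_000, 20_000, 30_000]
measured_hours = [5.0, 20.0, 45.0]

a = fit_quadratic_coefficient(measured_sizes, measured_hours)
projected = project_hours(a, 60_000)
print(f"projected hours at a pool of 60,000: {projected:.1f}")
```

With real measurements, a step whose time is static or linear in pool size would fit this model poorly, which is one way to tell the "time bomb" steps apart from the rest.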