DNA
Learning from Data: Who Do You Think You Are?
Scott Sorensen and Leonid Zhukov

Ancestry.com Mission

Discoveries
It’s the “aha” moment of a discovery that drives our business!

World’s largest online family history resource
Historical Content
Over 30,000 historical content collections
11 billion records and images
Records dating back to the 16th century

World’s largest online family history resource
User Contributed Content
45 million family trees
More than 4 billion profiles
200 million stories and photos

DNA Data
Over 120,000 DNA samples
700,000 SNPs for each sample
2,000,000 4th cousin matches
Spit in a tube, pay $99, learn your past - Derrick Harris, GigaOm
[Figure: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). (http://en.wikipedia.org/wiki/Single-nucleiotide_polymorphism)]

User Behavior Data
40 million searches / day
10 million people added to trees / day
5 million Hints accepted / day
3.5 million Records attached / day
[Charts: time axis from 1/12 to 12/12]

Real-time data feed

Technology
Machine Learning

Person and record search
•  Search query

Hint suggestions system
•  Hints: suggestions to attach a record

Record linkage
•  Record linkage: finding and matching records in multiple data sets with non-unique identifiers
•  Goal: bring together information about the same person
•  Some non-unique identifiers:
–  Names: first name, last name (John Smith – 300,000 records)
–  Dates: date of birth, date of death
–  Places: place of birth, residence, place of death
–  Extra: family members, life events
•  Records are often incomplete
•  Records contain mistakes
•  Exact and fuzzy match

Life events in collections
•  Life events
–  Birth: 2.59 bln
–  Marriage: 114 mln
–  Census: 2.74 bln
–  Death: 467 mln
•  Total: 5.91 bln events

Candidate set funnel: exact match
John Smith: 300,000
John Smith, 1870: 2,200
John Smith, 1870, Boston, MA: 10
Search: high precision

Candidate set funnel: fuzzy match
John Smith: 380,000
John Smith, 1870: 97,000
John Smith, 1870, Boston, MA: 1,400
Exploration: large recall
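
The two funnels can be sketched on a toy data set. This is a minimal illustration, not Ancestry's search code: the `Record` type, the five sample records, and the fuzzy rules (initials, a two-year window, missing fields treated as possible matches) are all assumptions chosen to show why fuzzy matching keeps more candidates at every stage.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Record:
    first: str
    last: str
    birth_year: Optional[int]   # None models a missing field
    place: Optional[str]

def exact_match(rec, first, last, year=None, place=None):
    """Every supplied identifier must agree exactly (high precision)."""
    return (rec.first == first and rec.last == last
            and (year is None or rec.birth_year == year)
            and (place is None or rec.place == place))

def fuzzy_match(rec, first, last, year=None, place=None, slack=2):
    """Hypothetical relaxed rules (larger recall)."""
    is_initial = len(rec.first.rstrip(".")) == 1
    first_ok = rec.first.lower() == first.lower() or (
        is_initial and rec.first[0].lower() == first[0].lower())
    last_ok = rec.last.lower() == last.lower()
    year_ok = (year is None or rec.birth_year is None
               or abs(rec.birth_year - year) <= slack)
    place_ok = (place is None or rec.place is None
                or rec.place.lower() == place.lower())
    return first_ok and last_ok and year_ok and place_ok

def funnel(records, match, first, last, year, place):
    """Candidate-set sizes as identifiers are added one at a time."""
    return [
        sum(match(r, first, last) for r in records),
        sum(match(r, first, last, year) for r in records),
        sum(match(r, first, last, year, place) for r in records),
    ]

# Made-up sample records, for illustration only.
RECORDS = [
    Record("John", "Smith", 1870, "Boston, MA"),
    Record("John", "Smith", 1871, "Boston, MA"),
    Record("J.", "Smith", 1870, None),
    Record("John", "Smith", 1902, "Chicago, IL"),
    Record("Jane", "Smith", 1870, "Boston, MA"),
]

exact_sizes = funnel(RECORDS, exact_match, "John", "Smith", 1870, "Boston, MA")
fuzzy_sizes = funnel(RECORDS, fuzzy_match, "John", "Smith", 1870, "Boston, MA")
print(exact_sizes, fuzzy_sizes)
```

On this toy set the exact funnel narrows faster than the fuzzy one, which mirrors the precision versus recall trade-off on the two slides.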
  
Results set
Names edit distance
Extended dates
Missing fields
Short names
Initials
Exact match
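
The "names edit distance" entry can be illustrated with the standard Levenshtein distance. This is a sketch under assumptions: the slides do not say which edit distance variant or threshold is used, and `names_match` with `max_distance=1` is a hypothetical helper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def names_match(a: str, b: str, max_distance: int = 1) -> bool:
    """Hypothetical fuzzy name rule: case-insensitive, small edit distance."""
    return edit_distance(a.lower(), b.lower()) <= max_distance

print(edit_distance("Smith", "Smyth"))       # one substitution
print(names_match("Sorensen", "Sorenson"))
```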
Hints suggestion system
•  User feedback loop:
–  Accept suggestion
–  Reject suggestion

A place for machine learning
•  Supervised machine learning
•  Learn similarity measure (how to combine identifiers)
•  Training & testing sets:
–  User accepts, rejects
•  Features (> 500):
–  First and last name, DOB, POB, DOD, POD
–  Parents, children, siblings, spouses
–  Fuzzy matches
•  Similar to “learning to rank” problem
[Diagram: Person, Candidate k-set, ML suggest, Record?]
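
A sketch of how accepted and rejected hints could become labeled training rows. The slides mention more than 500 features; the seven features below, the dictionary field names, and the 0.5 and -1.0 placeholders for missing values are all assumptions made for illustration. The resulting rows could be passed to any supervised classifier.

```python
def pair_features(person: dict, record: dict) -> dict:
    """Turn one (tree person, historical record) pair into numeric features."""
    def same(key):
        a, b = person.get(key), record.get(key)
        if a is None or b is None:
            return 0.5          # missing field: neither agreement nor conflict
        return 1.0 if a == b else 0.0

    def year_gap(key):
        a, b = person.get(key), record.get(key)
        return -1.0 if a is None or b is None else float(abs(a - b))

    relatives_p = set(person.get("relatives", ()))
    relatives_r = set(record.get("relatives", ()))
    union = relatives_p | relatives_r
    return {
        "first_name_exact": same("first"),
        "last_name_exact": same("last"),
        "birth_place_exact": same("pob"),
        "death_place_exact": same("pod"),
        "birth_year_gap": year_gap("birth_year"),
        "death_year_gap": year_gap("death_year"),
        "relatives_jaccard": len(relatives_p & relatives_r) / len(union) if union else 0.0,
    }

def training_rows(hint_log):
    """hint_log: iterable of (person, record, accepted) from user feedback."""
    return [(pair_features(p, r), 1 if accepted else 0) for p, r, accepted in hint_log]

# Made-up example data.
person = {"first": "John", "last": "Smith", "birth_year": 1870,
          "pob": "Boston, MA", "relatives": ["Mary Smith", "Thomas Smith"]}
good = {"first": "John", "last": "Smith", "birth_year": 1871,
        "pob": "Boston, MA", "relatives": ["Mary Smith"]}
bad = {"first": "John", "last": "Smith", "birth_year": 1902, "pob": "Chicago, IL"}

rows = training_rows([(person, good, True), (person, bad, False)])
for features, label in rows:
    print(label, features)
```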
  
Similarity measure learning
[Diagram labels: Ancestry collections; Member trees; Feature generation; Person ID; Record ID; Label; ML Random forest; Model; Index; Top-k records candidate set; Ranked list; Training; Scoring; Hadoop; Hive]

Large scale machine learning
[Diagram labels: Hadoop HDFS; Hadoop streaming; Random forest (R) ×4; Model]

Data
Big Data – Big Picture

Family tree
•  User generated family trees:
–  45 mln family trees
–  4.9 bln profiles

Family tree as a graph (DAG)
2020 nodes
572 marriage edges
2910 family edges
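
One way to hold such a tree in memory, as a minimal sketch. It assumes that "family edges" means directed parent-to-child links and that marriage edges are undirected; the `FamilyTree` class and the sample names are hypothetical.

```python
from collections import defaultdict

class FamilyTree:
    def __init__(self):
        self.people = set()
        self.children = defaultdict(set)   # directed: parent -> children
        self.marriages = set()             # undirected: frozenset({a, b})

    def add_parent(self, parent, child):
        self.people.update((parent, child))
        self.children[parent].add(child)

    def add_marriage(self, a, b):
        self.people.update((a, b))
        self.marriages.add(frozenset((a, b)))

    def family_edge_count(self):
        return sum(len(kids) for kids in self.children.values())

    def generations(self):
        """People on the longest parent-to-child chain (assumes no cycles)."""
        memo = {}
        def depth(person):
            if person not in memo:
                kids = self.children.get(person, ())
                memo[person] = 1 + max((depth(c) for c in kids), default=0)
            return memo[person]
        return max((depth(p) for p in self.people), default=0)

tree = FamilyTree()
tree.add_marriage("Ann", "Bob")
tree.add_parent("Ann", "Cal")
tree.add_parent("Bob", "Cal")
tree.add_marriage("Cal", "Dee")
tree.add_parent("Cal", "Eve")
tree.add_parent("Dee", "Eve")

print(len(tree.people), len(tree.marriages),
      tree.family_edge_count(), tree.generations())
```

Keeping the two edge types separate leaves the parent-to-child part acyclic, which is what makes the generation count well defined.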
  
	
  
Family trees

Family trees statistics
“Power law” distribution
44 mln trees

History from family trees
500 nodes
700 edges
55 generations
[Axis label: time]

Historical immigration to the US
•  Immigration is the movement of people into a country or region to which they are not native in order to settle there
•  Immigrants are those who were born outside the US and died in the US
•  Based on family tree profiles:
–  Birth/death dates range 1500-1990
–  Select only complete profiles with FLN, POB, DOB, POD, DOD
–  Perform de-duplication, remove same ancestors from different family trees
–  Select only those with POB != US, POD == US
•  15 mln profiles (0.3% of 4.9 bln profiles)
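
The selection steps above can be sketched as a small filter. This is an illustration under assumptions, not the pipeline used for the talk: the profile field names, year-only dates, country-level places, and the exact-equality duplicate rule are all hypothetical.

```python
REQUIRED = ("name", "pob", "dob", "pod", "dod")

def is_complete(profile):
    return all(profile.get(field) is not None for field in REQUIRED)

def dedup_key(profile):
    # Hypothetical duplicate rule: identical values in all required fields.
    return tuple(profile[field] for field in REQUIRED)

def immigrants(profiles, first_year=1500, last_year=1990):
    seen, result = set(), []
    for p in profiles:
        if not is_complete(p):
            continue                      # keep only complete profiles
        if not (first_year <= p["dob"] <= last_year
                and first_year <= p["dod"] <= last_year):
            continue                      # outside the date range
        key = dedup_key(p)
        if key in seen:
            continue                      # same ancestor from another tree
        seen.add(key)
        if p["pob"] != "US" and p["pod"] == "US":
            result.append(p)              # born abroad, died in the US
    return result

# Made-up sample profiles.
PROFILES = [
    {"name": "Anna Berg", "pob": "Sweden", "dob": 1850, "pod": "US", "dod": 1920},
    {"name": "Anna Berg", "pob": "Sweden", "dob": 1850, "pod": "US", "dod": 1920},
    {"name": "John Smith", "pob": "US", "dob": 1870, "pod": "US", "dod": 1940},
    {"name": "Luigi Rossi", "pob": "Italy", "dob": 1880, "pod": "Italy", "dod": 1950},
    {"name": "Mary Kelly", "pob": "Ireland", "dob": 1845, "pod": "US", "dod": None},
]

found = immigrants(PROFILES)
share_percent = round(100 * 15e6 / 4.9e9, 1)   # 15 mln out of 4.9 bln profiles
print([p["name"] for p in found], share_percent)
```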
Immigration to the USA 1500-1990

Immigration map

Ports of arrival (1800-1980)

Data Science
•  Ancestry is building a data science team
•  We work on product data and BI
•  We are hiring
•  Special thanks to Mercator Group for infographics


ancestry-bigdatasummit-april2013
