Building a Mutation History Tree

Combining SNPs, STRs, & Genealogy
to build a Surname Origins Tree
Dr Maurice Gleeson
11th Annual FTDNA Conference
15th Nov 2015
http://gleesondna.blogspot.co.uk/
YouTube – DNA and Family History Research

Google: YouTube Genetic Genealogy Ireland

A Combined Mutation / Family History Tree
… using DNA markers when people run out
… is it possible? Can you do it?

Topics for Discussion
• Building a tree with STRs
• Building a tree with SNPs
• Combining STRs & SNPs
• Dating branching points in the tree
• Combining STRs, SNPs & genealogy
• Opportunities for the years ahead
• Challenges for the years ahead

Modal Haplotype for Lineage II
• Lots of Parallel Mutations!
o Back Mutations remain hidden
• Is resolution enough to define the tree?
• Is this the “best fit” model?
[Figure: tree with branch numbers; labelled mutations include 570 (17-18) and CDYa (38>39), the latter shown twice]
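A minimal sketch of how a modal haplotype, and each member's departures from it, could be computed. The kit names and the DYS390 values are hypothetical; only the 570 (17-18) and CDYa (38>39) changes are taken from the slide.

```python
from collections import Counter

# Hypothetical haplotypes (marker -> repeat count); not real project data.
haplotypes = {
    "kit_A": {"DYS570": 17, "CDYa": 38, "DYS390": 24},
    "kit_B": {"DYS570": 18, "CDYa": 38, "DYS390": 24},
    "kit_C": {"DYS570": 17, "CDYa": 39, "DYS390": 24},
    "kit_D": {"DYS570": 17, "CDYa": 39, "DYS390": 24},
    "kit_E": {"DYS570": 17, "CDYa": 38, "DYS390": 24},
}


def modal_haplotype(haps):
    """Return the most common value at each marker across all members."""
    markers = next(iter(haps.values())).keys()
    return {
        m: Counter(h[m] for h in haps.values()).most_common(1)[0][0]
        for m in markers
    }


def mutations_from_modal(hap, modal):
    """Return {marker: (modal value, member value)} where the two differ."""
    return {m: (modal[m], v) for m, v in hap.items() if v != modal[m]}


modal = modal_haplotype(haplotypes)
for kit, hap in haplotypes.items():
    print(kit, mutations_from_modal(hap, modal))
```

In this toy data kit_C and kit_D both show CDYa 38>39. The haplotypes alone cannot say whether that is one shared mutation or two parallel ones, which is the problem the slide raises.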

Courtesy of Ralph Taylor
G64
G39
Fluxus cladogram
• It can help
- useful to check against
the Hand-Drawn Tree
• Shows “maximum
parsimony” version
• Cumbersome, fiddly,
easy to make mistakes,
difficult to interpret,
time-consuming
• Difficult to visualise as a
“Family Tree”
• Gives all markers equal
weight & ignores differing
mutation rates
www.isogg.org/wiki/Cladogram

G64
G39
Fluxus cladogram
• Several “Best Fit” models
- at least 8 BF models …
- Tree is not anchored
• No single “most likely” option
• So not enough information
at 37 markers to define
the branching pattern
• Parallel Mutations still
persist
- 390, 392, CDYa&b
• Back Mutations also possible
• Not clear which mutation
came before which
www.isogg.org/wiki/Cladogram

Hand Drawn Tree vs Fluxus Tree v1
[Figure: both trees carry the labels 570 (17-18) and CDYa (38>39), the latter twice; branch numbers are marked]

Fluxus Cladogram (111 markers) vs Fluxus Cladogram (37 markers)
[Figure labels: G64, G39, G73 (111 markers); G64, G39 (37 markers)]
www.isogg.org/wiki/Cladogram
Courtesy of Ralph Taylor

Essential technology for project success

Fluxus Cladogram (111 markers) vs Fluxus Cladogram (37 markers)
[Figure labels: G64, G39, G73 (111 markers); G64, G39 (37 markers)]
• No weighting … but mutation rates vary by a factor of 400
• James Irvine developed an algorithm for weighting markers
weighting = 99 × (1 – mutation rate / 0.04)²
https://en.wikipedia.org/wiki/List_of_Y-STR_markers
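A short worked example of the weighting formula, taking it as 99 × (1 − rate / 0.04) squared. The function name and the rounding to two decimals are illustrative choices, not part of James Irvine's algorithm; the two rates are the CDY and 464 per-generation rates quoted elsewhere in this talk.

```python
def marker_weight(mutation_rate, max_rate=0.04, scale=99):
    """Weight a marker so that slow mutators count for more than fast ones."""
    return scale * (1 - mutation_rate / max_rate) ** 2


# Per-generation mutation rates quoted in this talk.
rates = {"CDY": 0.03531, "DYS464": 0.00566}

weights = {marker: round(marker_weight(rate), 2) for marker, rate in rates.items()}
print(weights)
```

Under this reading a fast marker such as CDY receives a weight close to 1, while a slower marker such as 464 receives a weight above 70, so a CDY change influences the cladogram far less than a change on a stable marker.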

www.isogg.org/wiki/Cladogram Courtesy of Ralph Taylor
• Torso disappears
• No alternative pathways
= 1 single “Best Fit” model
Fluxus Cladogram (111 markers) vs Fluxus Cladogram (111 markers, weighted)
[Figure labels: G64, G39, G73]

Some markers behave unusually
• Marker 389: this is tested in 2 parts – mutation in Part 1 is also
counted in Part 2 => so just use Part 2 (389ii) … and we did!
– www.familytreedna.com/learn/y-dna-testing/y-str/different-str-markers-dys389i-dys398ii-dys389-2-result-family-tree-dna-different-genographic-project/
• Multi-copy markers 464abcd
(but also 385, 459, YCAII, CDY, DYF395S1, 413)
– mutations in multi-copy markers may not be in the correct order
– Kittler test defines relative positions for 385 … not applicable here?
– www.familytreedna.com/learn/y-dna-testing/y-str/infinite-allele-palindromic-markers/
– http://www.isogg.org/wiki/DYS_464
• Multi-copy marker 464abcd: 2 types = c & g
– 464x test defines which type (but not position) … not accounted for!
– http://www.dna-fingerprint.com/static/PalindromicPres.pdf
• 464abcd, CDYa & b: fast-mutating palindromic markers
– http://www.isogg.org/wiki/RecLOH

Fluxus Cladogram (111 markers, weighted) vs Fluxus Cladogram (111 markers, weighted, no CDY or 464)

Which is more accurate?
with or without CDY & 464?
or some version in between?

How likely is it that 464 & CDY will screw things up?
• Gleeson surname origin = 1000 AD
=> Surname has had 1000 years to mutate
= 33.3 generations (30 y/gen)
• How many mutations would you expect in 1000 years?
• CDY mutation rate = 0.03531 / gen
= 1.176 per member = c.16 mutations across all 14 branches of Lineage II
Observed count is 4 for CDYa, and 3 for CDYb
=> 12/16 and 13/16 mutations respectively are hidden?
– So predictions based on CDY will be incorrect (12/16 + 13/16)/2 = 78% of the time?
• 464 mutation rate = 0.00566 / gen
= 0.188 per member = 2.6 per 14 members (on each of 464abcd)
Observed count is 0 for 464a & d, and 2 for 464b & c
=> 2.6/2.6 & 0.6/2.6 mutations respectively are hidden?
– So predictions based on 464 will be incorrect (2.6/2.6 + 0.6/2.6)/2 = 62% of the time?
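The arithmetic on this slide can be reproduced with a short sketch. It assumes, as the slide does, 1000 years at 30 years per generation and 14 members, and it treats "hidden" as expected minus observed mutations. The function names are illustrative. Without the slide's intermediate rounding (16 and 2.6), the CDY figure comes out nearer 79% than 78%.

```python
YEARS = 1000
YEARS_PER_GENERATION = 30
MEMBERS = 14


def expected_mutations(rate_per_generation):
    """Expected mutations on one marker, summed over all members."""
    generations = YEARS / YEARS_PER_GENERATION
    return rate_per_generation * generations * MEMBERS


def hidden_fraction(expected, observed):
    """Share of expected mutations that were not observed (never below 0)."""
    return max(expected - observed, 0) / expected


def mean_hidden(rate_per_generation, observed_counts):
    """Average hidden fraction over the copies of a multi-copy marker."""
    expected = expected_mutations(rate_per_generation)
    fractions = [hidden_fraction(expected, n) for n in observed_counts]
    return sum(fractions) / len(fractions)


cdy_hidden = mean_hidden(0.03531, [4, 3])            # CDYa, CDYb
dys464_hidden = mean_hidden(0.00566, [0, 2, 2, 0])   # 464a, b, c, d

print(f"CDY expected: {expected_mutations(0.03531):.1f}, hidden: {cdy_hidden:.0%}")
print(f"464 expected: {expected_mutations(0.00566):.1f}, hidden: {dys464_hidden:.0%}")
```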

How likely is it that 464 & CDY will screw things up?
• Less of a problem in those branches related within the last
200-300 years?
– less time to mutate back
– lower chance of back mutations
– more useful for branch-defining
• More of a problem with those branches more distantly
related (600-1000 yrs)?
– more time to mutate back
– higher chance of back mutations
– less useful for branch-defining
=> Choose v3a (i.e. use CDY & 464 data)
• Tree will be less than 100% correct
• Be especially wary of mutations in more distant reaches of
the tree

Caveats & Limitations
• Missing data
– Fluxus fills in the blanks - is its “best guess” valid?
– No adequate mutation rates for many markers
• The Tree is not yet “anchored”
– More so in the upper reaches of the tree (sub-branches seem stable)
– Several interpretations are still possible, even at 111 markers (v3a vs v4)
– Will this reduce as more people test? or upgrade?
– Are there hidden Back Mutations?
• Tree may be skewed by recent mutations (last 5-6 generations)
=> Triangulate on each MDKA
– Test at least 2 known distant cousins from each family branch in order to
characterise the haplotype of each MDKA
– Helps eliminate recent mutations which might cloud the interpretation
– Costly … $339 for a 111 marker test … x2 = $678
• Is there Convergence in the Tree? (e.g. 3/111)
www.isogg.org/wiki/Fluxus

• Brief overview of key concepts
• Challenges for the years ahead

http://dna-explained.com/2014/10/15/tenth-annual-family-tree-dna-conference-wrapup/
Deep Clade Panel 2.0
- Targeted subclade panels
- $119

Is fine-scale SNP testing
the best method of determining
branching patterns within a Genetic Family?
… how to do it as cheaply &
efficiently as possible?

Working with SNPs
– Opportunities & Challenges
• Declaring SNPs - false positives
• Missing SNPs - false negatives
• Constant change
– “Known, Novel, Shared & Private”
• No name, just a location
• SNP naming process unregulated
– Same SNP, different names
• Making results user-friendly
• Lots of help available
– independent verification & interpretation possible

Problems encountered with “declaring a genuine SNP”
Problem | Reason(s) | Implication
Detection | No coverage | False Negative – SNP is present on Y but remains undetected
Low no. of Calls | Poor coverage | False Negative – SNP present but fails to meet threshold criteria
Recognition | Detection Filter / Threshold too strict? | False Negative – SNP is present in data but missed by analysis; detectable by manual analysis of possible SNPs on BAM file
Localisation | Difficult location on Y (centromere, palindrome, in STR / repetitive region) | False Positive or Negative – SNP may be genuine but its exact position cannot be known for sure or may vary
Instability | Unstable SNP – frequent & unpredictable mutation | False Positive or Negative – SNP may or may not be genuine
InDels | Not SNPs, but rather a deletion (usually) | False Positive or Negative – may or may not be genuine
So is the SNP really present? … or absent?
Just because it is detected, doesn’t mean it is there …
Just because it’s not detected, doesn’t mean it isn’t there

“Known, Novel, Shared & Private”
– the fluid categorisation of SNPs
• Known SNPs (already discovered)
• New SNPs (never discovered before)
• Shared (with someone else)
• Not shared (Unique / Private)

Shared Novel Variants
No names … just positions

Private SNPs (unique)
No names … just positions

Shared Novel Variants in Z16437 subgroup
Sources compared:
• FTDNA Results (FT)
• Project Admin (LL)
• Haplogroup Admins*
• Alex Williamson (Big Tree)
• Nigel McCarthy (Munster)
• YFULL (YF)
[Figure: counts shown are 11, 2, 3, 2, 1, 4]
* Neal Downing, John Murphy, James Kane & Z255 Yahoo group

Gleeson Family Tree based on newly discovered SNP markers
Lisa Little, project member

Z255 Haplogroup Project Colour Coded Spreadsheet
(John Murphy)
Gleeson-specific SNP markers
https://groups.yahoo.com/neo/groups/R1b-Z255-Project

James Kane’s tree
www.it2kane.org/matrix/R.html
https://www.familytreedna.com/groups/r-l21-south-irish/about/background

http://www.ytree.net
Alex Williamson’s “Big Tree”

… aka BY2853
Jan 2015, Apr 2015, Jun 2015, Oct 2015
www.ytree.net/DisplayTree.php?blockID=319&star=false
Clicking on a marker or name
brings up further analysis

www.ytree.net/MutMatrix.php
Grey = no coverage
Pink = marginal coverage
My simplistic interpretation
+ Definite
* Probable
** Possible
*** Unlikely
The Big Tree: R-A5629 Mutation Matrix of Shared SNPs

Currently Unique SNPs … 3 (1), 3 (2), 13 (5) = 19 (8)
http://www.ytree.net/SNPinfoForPerson.php?personID=1288
Alex Williamson’s “Big Tree”

YFULL Novel SNPs
Alex Williamson’s “Big Tree”
www.yfull.com

• Are they really SNPs?
- different thresholds & filters
• SNPs trapped in Private Collections
- Private SNPs will be liberated as more people test
& SNPs become “not private” anymore – move up into the
shared area of the tree … but they will run out! When?
• No names, just locations
- will need to be translated into SNP names in time
=> consult Ybrowse, other utilities??
Inconsistency in “declaring
a genuine SNP”

Different strokes for different folks
Who is right?
… or more accurately …
who has estimated correctly?
End Result
SNP = definite, probable, possible, or unlikely
… subject to change ... & Sanger Sequencing?
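A toy sketch of why two analysts can reach different verdicts on the same candidate SNP. Everything here is hypothetical: the read counts, the coordinate, the thresholds and the rule that maps them onto the definite / probable / possible / unlikely scale are invented for illustration and do not describe how FTDNA, YFULL or the Big Tree actually filter calls.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    position: int        # hypothetical Y-chromosome coordinate
    derived_reads: int   # reads showing the variant
    total_reads: int     # reads covering the position


def classify(candidate, min_reads, min_fraction):
    """Grade a candidate SNP under one analyst's (hypothetical) thresholds."""
    if candidate.total_reads == 0:
        return "no coverage"
    fraction = candidate.derived_reads / candidate.total_reads
    if candidate.total_reads >= min_reads and fraction >= min_fraction:
        return "definite"
    if fraction >= min_fraction:
        return "probable"
    if fraction >= 0.5:
        return "possible"
    return "unlikely"


snp = Candidate(position=1234567, derived_reads=3, total_reads=4)

strict = classify(snp, min_reads=10, min_fraction=0.9)
lenient = classify(snp, min_reads=3, min_fraction=0.7)
print(strict, lenient)
```

The same four reads are graded "possible" under the strict settings and "definite" under the lenient ones, which is the inconsistency the slide describes.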

Despite NGS, Sanger Sequencing
will still be required
• Chip-based SNP testing will still be
needed to confirm or refute
discoveries made by NGS
• Multiple Deep Clade Panels will
need to be created
… for subclades, surnames, & genetic clusters
Some Bold Predictions …

• SNP results consistent?
• Need to tidy it up
456 15-16

• SNPs are further up the tree than STRs
• Tell us nothing about branches on left
• Only use “definite SNPs” (not probable/possible)
• Private SNPs are still trapped in Private Collections
Mutation sequence?
BY2853 > A5629 > 456 …
> G68 (Glisson, Branch 14)
> A5628
> Y16880 (Branch 2,7,6)
> A660 (Branch 9)

Nigel McCarthy
http://freepages.genealogy.rootsweb.ancestry.com/~skibbgirl/McCarthyDNAProject/
[Figure labels: G54, G39, G51, G66, G22, G42, G55, G57, G21]

Nigel McCarthy’s Z255 Group E
http://freepages.genealogy.rootsweb.ancestry.com/~skibbgirl/McCarthyDNAProject/
[Figure labels: G54, G39, G51, G73, G66, G22, G42, G55, G57, G21, G68]
Annotations on the figure:
• No BY2852 block
• Extra marker
• Private SNPs (shown in three places)
• 2 pink SNPs omitted
• Differing Modal Haplotype
• <67 markers excluded

Iain McDonald, The 2015 report to the U106 group (Sep 2015)
www.jb.man.ac.uk/~mcdonald/genetics/u106-geography-2015-revised.pdf

www.familytreedna.com/groups/tmrca-case-studies/about
Up till now, we know there are branches that come off the Modal
But which came first?
Can we place them in the correct order?

G57, 60393
G21, N74958
G55, 338070
G39, N101540
G51, 244645
• YFULL analysis offers TMRCA estimates for SNPs
… and includes Calculation Formula
-60% to +50%

G21 vs G57: TMRCA probability by number of markers tested
Markers tested | GD | 5% | 50% (MLE) | 95% | Range (%)
12 | 1 | 3 | 17 | >24 | -82% to ???
25 | 1 | 1 | 7 | 20 | -85% to +186%
37 | 1 | 0 | 3 | 10 | -100% to +233%
67 | 2 | 1 | 4 | 11 | -75% to +175%
111 | 6 | 4 | 8 | 15 | -50% to +88%
495 | 24 | 6 | 9 | 12 | -33% to +33%
MLE, Maximum Likelihood Estimate(?)
• Ranges are wide & skewed toward distant generations
• 111 markers gives the “best estimate”
with smallest upper ranges
but still almost double the mid-value
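The Range (%) column appears to express the 5% and 95% estimates relative to the 50% estimate. A small sketch under that assumption reproduces the table's figures; the function name is illustrative, and the 12-marker row is left out because its 95% value is only given as ">24".

```python
# (5%, 50%, 95%) estimates from the table, keyed by markers tested.
estimates = {
    25: (1, 7, 20),
    37: (0, 3, 10),
    67: (1, 4, 11),
    111: (4, 8, 15),
    495: (6, 9, 12),
}


def range_percent(low, mid, high):
    """Lower and upper bounds as a percentage offset from the mid estimate."""
    return round((low - mid) / mid * 100), round((high - mid) / mid * 100)


for markers, (low, mid, high) in estimates.items():
    low_pct, high_pct = range_percent(low, mid, high)
    print(f"{markers:>3} markers: {low_pct:+d}% to {high_pct:+d}%")
```

With ordinary rounding the 25-marker lower bound comes out as -86% rather than the table's -85%; the other rows match.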

• Individually extracted 5%, 50% & 95% estimates (90% Confidence Interval)
• Markers tested: White = 111, Yellow = 67, Cream = 37, Blue = 25
• 50% probability estimate ranges from 1 to >24 generations
• Use triangulation to get better overall estimate?
TMRCA Triangulation
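One plausible reading of triangulation is simply averaging the estimates from several pairwise comparisons that span the same branching point. The triangulation figures in this talk show groups such as 6,4,6 and 14,16,18,18 alongside values such as 5.3 and 16.5, which is consistent with that reading, but treating those values as plain means is an assumption. The group names below are hypothetical.

```python
from statistics import mean

# Groups of pairwise TMRCA estimates as they appear on the figure.
groups = {
    "group_1": [6, 4, 6],
    "group_2": [14, 16, 18, 18],
    "group_3": [13, 10],
}

averages = {name: round(mean(values), 1) for name, values in groups.items()}
print(averages)
```

A plain mean gives every comparison equal weight. Comparisons made at 37 markers have much wider ranges than those made at 111, so a weighted average might be preferable where marker counts differ.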

TMRCA Triangulation
[Figure: tree annotated with groups of TMRCA estimates at each branching point (e.g. 6,4,6; 3,3; 8,3,11) and summary values (e.g. 5.3, 9.5, 14.3)]

Will additional STR markers help refine TMRCA estimates?
• But … 5% differ? ... some are missing? ... not detected by NGS?
• 35 mutations between G21 & G55

http://dna-project.clan-donald-usa.org/tmrca.htm


[Figure: TMRCA triangulation tree with further groups of estimates and summary values added, e.g. 14,16,18,18 with 16.5; 13,10 with 11.5; 6,4,6 with 5.3]

[Figure: the same TMRCA triangulation tree, now with “???? ????” marked on it]

[Figure: the same TMRCA triangulation tree, still showing “???? ????”, with an MDKA Profile attached]

MDKA Profiles
http://gleesondna.blogspot.com

A Combined Mutation / Family History Tree
… using DNA markers when people run out
… is it possible?

• Brief overview of key concepts
• Opportunities for the years ahead

Lessons Learned & Future Opportunities
• Transcription errors are easy => triple-check, automate
• Re STRs
– Lots of Parallel Mutations … where are the Back Mutations?
– 111 markers best define the branching pattern
– Placement of CDY & 464 is likely to be incorrect (esp. in
upstream generations)
– Most project members have not tested other male cousins
to triangulate on their MDKA
– Convergence may be a problem (even at 3/111)
– We need more people to test
– We need more people to upgrade to 111 markers
– YFULL analysis liberates 495 STRs

• Re SNPs
– Difficult to declare a genuine SNP
– Different SNPs from different lips
– Definite, probable, possible, unlikely
– Likely to be lots of false negatives (& false positives)
– No names (locations too long)
– Naming is unregulated
– Many SNPs trapped in Private Collections
– Current NGS is discovery, not confirmatory =>
further testing (with other NGS?) needed to confirm

• Re combining STRs & SNPs
– Adding SNPs changed the upper reaches of the tree
– SNPs are still located relatively upstream - STRs offer better
definition downstream
– Start with the Modal of your Haplogroup subgroup
• Re TMRCA estimates
– SNP-based estimates work best for distant branching
points (haplogroup projects)
– STR-based estimates have wide ranges, and skewed
toward distant generations
– Even at 111, upper range ~ double the mid-value
– Even 495 markers has a wide range (+/- 33%)

• Re combining STRs, SNPs & genealogy
– We need to overlay documentary data on DNA
– Some pedigrees not supplied / incomplete
– Need to add MPRs to all (MDKA Profile)
– Need to take a One Name Study approach?
• Collate all Gleeson data worldwide
• Establish a relational database (Access?)
• Assign data to different family branches
• This early draft MHT serves as a useful basis
– Will evolve over time as more people test & upgrade
– Will facilitate collaboration between project members
– Will help attract new project members
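A minimal sketch of what a relational layout for a One Name Study might look like. It swaps in SQLite (via Python's standard library) for the Access database suggested on the slide, and the table names, column names and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE branch (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE member (
    kit       TEXT PRIMARY KEY,
    mdka      TEXT,                           -- most distant known ancestor
    branch_id INTEGER REFERENCES branch(id)
);
CREATE TABLE str_result (
    kit    TEXT REFERENCES member(kit),
    marker TEXT,
    value  INTEGER,
    PRIMARY KEY (kit, marker)
);
""")

# Hypothetical sample rows, for illustration only.
conn.execute("INSERT INTO branch VALUES (1, 'Branch 9')")
conn.execute("INSERT INTO member VALUES ('kit_A', 'hypothetical ancestor', 1)")
conn.execute("INSERT INTO str_result VALUES ('kit_A', 'DYS570', 17)")

# Assigning data to family branches becomes a join across the three tables.
rows = conn.execute("""
    SELECT m.kit, b.name, s.marker, s.value
    FROM member m
    JOIN branch b ON b.id = m.branch_id
    JOIN str_result s ON s.kit = m.kit
""").fetchall()
print(rows)
```

Keeping pedigrees, branch assignments and marker results in separate tables means a member can be moved to a different branch, as the tree evolves, without touching the DNA data.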

Vision 2020
Where will we be in 5 years’ time?
Here are some bold predictions …

What would happen if …
• Everyone upgraded to 111 markers?
– Better definition of branching pattern
– More precise TMRCA estimates (with narrower range)
• Everyone did the Big Y?
– SNPs only good for upstream branches? (<1500 AD)
– We will run out of Private SNPs
• Everyone tested on a Surname Specific Panel?
– Would elucidate branching pattern up to 1500 AD? Later?
• Everyone did Whole Genome Sequencing?
– No better than Big Y? Better coverage? Better read length?
– What will happen to Probable / Possible / Unlikely SNPs?

• (To help stimulate discussion & to learn)
• What is most useful for Surname Projects –
more SNPs or more STRs?
– More STRs … we will run out of Private SNPs
– 111 vs 50,000
– 500 vs 40?
• In 2020, FTDNA will offer 500 STRs for $129

• How do we best generate a Surname-Specific
SNP Panel?
– Q: How many discovery Big Y tests are needed to
liberate sufficient Private SNPs to adequately
define the Surname Panel?
– A: 5-10 Big Y tests per genetic cluster
– We need another few people to Big Y test, then
generate the Surname Panel for Lineage II
• In 2020, FTDNA will offer over 4000
Surname Specific SNP Panels
for $100 each

Generate MHTree
More tools
Lineage I
Lineage II
Lineage III
Lineage IV
Lineage II Mutation History Tree

Acknowledgements
• Bennett Greenspan
• Max Blankfield
• Janine Cloud
• FTDNA team
• Judy Claassen
• Lisa Little
• James Irvine
• Ralph Taylor
• John Cleary
• Haplogroup Admins
• John Murphy
• Neal Downing
• James Kane
• Alex Williamson
• Nigel McCarthy
• Dennis Wright
• Alasdair MacDonald
• YFULL team
The Genetic Genealogy Community
