Natasha de Vere - Plants Plenary
1. Barcode Wales / Codbar Cymru: A complete DNA
Barcode Dataset of a Nation’s Native Flowering
Plants: Creation, Applications and Public
Engagement
Natasha de Vere, Tim Rich, Col Ford, Sarah Trinder, Charlie Long, Chris
Moore, Danielle Satterthwaite, Helena Davies, Joe Moughan, Addie
Griffith, Laura Jones, Joel Allainguillaume, Mike Wilkinson, Tatiana
Tatarinova, Hannah Garbett, Les Baillie, Jenny Hawkins
2. Barcode Wales: Cod Bar Cymru
• DNA barcode the
native flowering
plants and
conifers of Wales
• Develop
applications that
utilise this
research
platform
3. Sample collection
• 1143 native flowering
plants and conifers
• 455 genera, 95 families,
34 orders
• 4272 individuals sampled,
3637 herbarium, 635
freshly collected
• All specimens verified by
taxonomic expert
• Herbarium vouchers and
full collection details for
all samples
4. DNA extraction, amplification and
sequencing
• Qiagen kits, modified
for herbarium material
• rbcL: 5 primer
combinations
• matK: 29 primer
combinations
• Macrogen Europe for
Sanger sequencing
5. Sequence editing and multiple
alignment
• Sequencher 4.9: contig
assembly and manual
editing
• rbcL alignment:
MUSCLE
• matK alignment:
Transalign and
Geneious Pro 5.4.4
• Sequences: BOLD and
GenBank
7. Analysis
• Interspecific and intraspecific divergence
• Species discrimination:
BLASTn
Barcode gap: min. interspecific p-distance greater than max. intraspecific (CBOL Plant Working Group 2009)
• Test discrimination using GenBank data
• Discrimination at different spatial scales, using species distribution records
• Scripts written in Python
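The barcode gap criterion can be sketched in a few lines of Python. This is a minimal illustration, not the project's own scripts: the function names, the toy sequences and the handling of gaps and ambiguous bases (skipped) are all assumptions, and the sequences are assumed to be already aligned to equal length.

```python
from itertools import combinations


def p_distance(a, b):
    """Uncorrected p-distance: share of differing sites among compared sites.

    Assumption: sites where either sequence has a gap or ambiguous base
    are skipped rather than counted as differences.
    """
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def has_barcode_gap(species, seqs_by_species):
    """True if min. interspecific distance > max. intraspecific distance."""
    own = seqs_by_species[species]
    # A species with a single individual has no intraspecific pairs,
    # so its maximum intraspecific distance is taken as 0 (assumption).
    max_intra = max((p_distance(a, b) for a, b in combinations(own, 2)),
                    default=0.0)
    inter = [p_distance(a, b)
             for other, seqs in seqs_by_species.items() if other != species
             for a in own for b in seqs]
    if not inter:
        return True  # nothing to confuse it with (assumption)
    return min(inter) > max_intra


# Hypothetical aligned toy sequences, not data from the study.
TOY = {
    "sp_A": ["ACGTACGTAC", "ACGTACGTAC"],
    "sp_B": ["ACGTACGTTC", "ACGTACGTTC"],
    "sp_C": ["ACGTACGTTC"],
}

for name in TOY:
    print(name, has_barcode_gap(name, TOY))
```

In this toy set sp_A differs from every other species and so has a gap, while sp_B and sp_C share an identical sequence and therefore cannot be told apart.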
8. Recoverability
                                            rbcL         matK         rbcL & matK
No. of spp. sequenced                       1117 (98%)   1031 (90%)   1025 (90%)
No. of spp. with > 1 individual sequenced   1041 (91%)   814 (71%)    808 (71%)
Mean no. of individuals per spp.            3            2            2
Mode of individuals per spp.                3            3            3
Range of individuals per spp.               1-9          1-8          1-8
Total no. of individuals sequenced          3304         2419         2349
In total 5,723 barcode sequences obtained for the 1143 species
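As a quick consistency check, the percentages and the overall total can be recomputed from the counts on this slide. The sketch assumes the percentages are relative to the 1143 target species and that the 5,723 total is the sum of the rbcL and matK individuals sequenced; both readings are inferred from the numbers rather than stated on the slide.

```python
TOTAL_SPECIES = 1143

# Counts copied from the recoverability slide.
SPECIES_SEQUENCED = {"rbcL": 1117, "matK": 1031, "rbcL & matK": 1025}
INDIVIDUALS_SEQUENCED = {"rbcL": 3304, "matK": 2419}


def percent_of_target(n_species):
    """Whole-number percentage of the 1143 target species."""
    return round(100 * n_species / TOTAL_SPECIES)


for marker, n in SPECIES_SEQUENCED.items():
    print(f"{marker}: {n}/{TOTAL_SPECIES} = {percent_of_target(n)}%")

total_sequences = sum(INDIVIDUALS_SEQUENCED.values())
print("total barcode sequences:", total_sequences)
```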
9. Fresh vs Herbarium
matK:
Fresh = 5 primer combinations
Herbarium = 29 primer combinations
10. Effect of herbarium specimen age
Spearman rank correlation:
rbcL rho = 0.993***
matK rho = 0.986***
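For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation of the ranks of two variables. The sketch below is a plain-Python illustration only: the slide text does not say which quantity was correlated with specimen age, so the inputs are hypothetical, and the significance test behind the asterisks is not reproduced.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical values, not data from the study.
age_class = [1, 2, 3, 4, 5, 6]
response = [2, 5, 4, 9, 12, 15]
print(spearman_rho(age_class, response))
```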
11. Intra and interspecific divergence
                                                      rbcL              matK
No. of spp. showing intraspecific variation           66/1041 (6.3%)    136/814 (16.7%)
Mean intraspecific divergence: all individuals (SD)   0.0001 (0.0005)   0.0003 (0.0009)
Mean intraspecific divergence: theta (SD)             0.0001 (0.0006)   0.0004 (0.0011)
Mean coalescent depth (max. intraspecific) (SD)       0.0001 (0.0006)   0.0004 (0.0012)
Mean interspecific divergence (SD)                    0.0063 (0.0069)   0.0174 (0.0231)
Using uncorrected p-distances
14. Testing discrimination
rbcL GenBank sequences                      Species %   Genus %   Family %   Failed %
Sequences correctly identified (n = 1346)   57          93        99         1
Taxa correctly identified (n = 592)         58          94        100        0

matK GenBank sequences                      Species %   Genus %   Family %   Failed %
Sequences correctly identified (n = 1380)   67          95        99         1
Taxa correctly identified (n = 533)         72          96        99         1
GenBank sequences queried against Barcode Wales database using BLASTn
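One way to score such a test is to compare the known taxonomy of each query with that of its top BLASTn hit and record the finest rank at which they agree. The sketch below assumes the percentages in the table are cumulative (a species-level match also counts at genus and family level), which is inferred from the columns summing to more than 100. The record layout and example pairings are invented for illustration, and running BLASTn itself is out of scope here.

```python
from collections import Counter

RANKS = ("species", "genus", "family")


def match_level(query, hit):
    """Finest rank at which the query's taxonomy equals the top hit's."""
    for rank in RANKS:
        if query[rank] == hit[rank]:
            return rank
    return "failed"


def cumulative_percentages(levels):
    """Percent of queries correct at each rank or finer, plus percent failed."""
    counts = Counter(levels)
    n = len(levels)
    running = 0
    result = {}
    for rank in RANKS:
        running += counts[rank]
        result[rank] = 100 * running / n
    result["failed"] = 100 * counts["failed"] / n
    return result


def taxon(species, family):
    """Build a record from a binomial; the genus is its first word."""
    return {"species": species, "genus": species.split()[0], "family": family}


# Invented (query, top hit) pairings, not results from the study.
PAIRS = [
    (taxon("Bellis perennis", "Asteraceae"), taxon("Bellis perennis", "Asteraceae")),
    (taxon("Carex nigra", "Cyperaceae"), taxon("Carex flacca", "Cyperaceae")),
    (taxon("Bellis perennis", "Asteraceae"), taxon("Taraxacum officinale", "Asteraceae")),
    (taxon("Quercus robur", "Fagaceae"), taxon("Carex nigra", "Cyperaceae")),
]

levels = [match_level(q, h) for q, h in PAIRS]
print(cumulative_percentages(levels))
```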
15. rbcL discrimination
Scale      n      Mean discrimination % (SD)
10x10 km   253    72 (4)
2x2 km     1116   90 (9)
Species lists generated for each square, discrimination assessed by presence of a barcode gap
16. matK discrimination
Scale      n      Mean discrimination % (SD)
10x10 km   253    81 (3)
2x2 km     1116   93 (7)
Species lists generated for each square, discrimination assessed by presence of a barcode gap
17. rbcL & matK discrimination
Scale      n      Mean discrimination % (SD)
10x10 km   253    82 (3)
2x2 km     1116   93 (6)
Species lists generated for each square, discrimination assessed by presence of a barcode gap
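The per-square assessment on the three discrimination slides can be sketched as follows: for each grid square, take its species list and count how many of those species show a barcode gap against the other species present in that square. This is an illustrative reading of the slides, not the project's code; the lookup tables, species labels and distances are hypothetical, and the distances are assumed to be precomputed.

```python
def square_discrimination(species_in_square, max_intra, min_inter):
    """Percent of species in a square with a barcode gap against co-occurring species.

    max_intra: species -> maximum intraspecific p-distance
    min_inter: frozenset({sp1, sp2}) -> minimum p-distance between the two species
    """
    discriminated = 0
    for sp in species_in_square:
        nearest = min(
            (min_inter[frozenset((sp, other))]
             for other in species_in_square if other != sp),
            default=float("inf"),  # a lone species is trivially discriminated
        )
        if nearest > max_intra[sp]:
            discriminated += 1
    return 100 * discriminated / len(species_in_square)


# Hypothetical distances, not data from the study.
MAX_INTRA = {"A": 0.0, "B": 0.0, "C": 0.001, "D": 0.0}
MIN_INTER = {
    frozenset(("A", "B")): 0.0,    # A and B share an identical barcode
    frozenset(("A", "C")): 0.01,
    frozenset(("A", "D")): 0.02,
    frozenset(("B", "C")): 0.01,
    frozenset(("B", "D")): 0.02,
    frozenset(("C", "D")): 0.005,
}

SQUARES = {"large_square": ["A", "B", "C", "D"], "small_square": ["A", "C", "D"]}
for name, species in SQUARES.items():
    print(name, square_discrimination(species, MAX_INTRA, MIN_INTER))
```

In the toy data the smaller square omits species B, so A is no longer confused with it and discrimination rises from 50% to 100%. This is the same direction of effect as the higher figures reported at the 2x2 km scale.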
18. DNA barcoding and drug discovery
• Collect wildflower honey from throughout UK
• Test antibacterial properties of honey against MRSA
and Clostridium difficile
• DNA barcode honey
• Identify plant derived
phytochemicals
• New drug discovery
routes
19. Drug discovery – prelim results
• 150 honey samples
• Agar diffusion assay,
plates with MRSA,
activity present in
some samples
• Successfully amplified
rbcL from honey
Next:
• Identify cause of antimicrobial activity
• Next gen sequencing of honey samples
20. DNA barcoding and phylogenetics
ML tree for rbcL, RAxML (GTR+CAT), 1000 bootstraps, on the CIPRES supercomputer cluster
Good match with APGIII.
56% of species form monophyletic groups, 44% with bootstrap support >70%
21. DNA barcoding and phylogenetic
ecology
ML tree for rbcL, threatened species
traced using Mesquite
25. Thank you!
• Funding from Welsh
Government, National
Botanic Garden of Wales,
National Museum Wales,
Countryside Council for
Wales, Spirent
Communications plc
• Sponsorship from the
people of Wales
• www.gardenofwales.org.uk
• Science at the Garden of
Wales on Facebook