Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading
1. Biodiversity Informatics of the Cyperaceae: Where we stand and where we’re heading
Andrew Hipp, Marlene Hahn, Ed Baker, Vince Smith and The Cariceae Working Group
2. A set of tools for Cariceae informatics
Andrew Hipp, Marlene Hahn, Ed Baker, Vince Smith and The Cariceae Working Group
7. Identify gaps in our knowledge and sampling
Formulate sampling plan
New collections
DNA sequences
DNA matrices
Multiple alignments
Species tree estimates
Revised classification
A central database for specimen-level data
8. What tools do we need?
• An easily-updated hierarchical checklist to visualize sampling progress across labs, extractions, sequences;
• A specimen-level phylogenetics pipeline that we can use to harvest existing data from NCBI as well as generate ongoing phylogenetic snapshots;
• A way to automate mapping from specimen data, so that we can visualize (and assess our visualizations of) species distributions in geographic and ecological space; and
• A platform for collaboration – a virtual research environment to bring together researchers worldwide
10. In 2011
• A flat checklist exported from WCM
• A set of spreadsheets from collaborating labs inventorying their DNA and sequence collections
• A vague idea of what trips are needed
Today
• A hierarchical checklist by subgenus, section
• A synthesis of what materials and sequences collaborators have on hand, and what taxa are unsampled
• A concrete sampling plan with trips and taxa identified*
* Okay, we’re working on this one!
11. Taxonomy
Specimen(s)
DNA extraction(s)
Sequence(s)
Trace file(s) / contig(s)
We are aiming toward a database in which the taxonomy, specimen data, DNA extractions, raw sequencing data and DNA matrices all live together and can be curated and worked on jointly by the community.
16. A centralized workflow
• Spreadsheets imported into a single Excel file
• Names cleaned (variable)
• DNA data summary formula created for each spreadsheet (ca. 5 mins per user)
• Names matched to our Scratchpads checklist
• All files exported to CSV
• Sample sheets and SP checklist imported to R
• DNA records added to checklist as nodes that are children to their taxa.
• Hierarchical checklist exported in text format, with unsampled taxa marked for searching
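The last two steps of this workflow can be sketched roughly as follows. This is a minimal Python illustration, not the group's R code: the function name `render_checklist`, the input shapes, the `[UNSAMPLED]` marker, and the placeholder taxon and voucher names are all assumptions made for the example.

```python
from collections import defaultdict


def render_checklist(taxa, dna_records, marker="[UNSAMPLED]"):
    """Return an indented text checklist (hypothetical helper).

    taxa: list of (section, species) pairs from the checklist.
    dna_records: list of (species, summary) pairs from the lab spreadsheets.
    """
    # Collect the DNA record summaries that belong to each taxon.
    by_species = defaultdict(list)
    for species, summary in dna_records:
        by_species[species].append(summary)

    # Group species under their section, preserving checklist order.
    by_section = {}
    for section, species in taxa:
        by_section.setdefault(section, []).append(species)

    lines = []
    for section, species_list in by_section.items():
        lines.append(section)
        for species in species_list:
            records = by_species.get(species, [])
            if records:
                lines.append(f"  {species}")
                # DNA records become child nodes of their taxon.
                lines.extend(f"    {summary}" for summary in records)
            else:
                # Mark unsampled taxa so a plain text search finds them.
                lines.append(f"  {species} {marker}")
    return "\n".join(lines)


# Placeholder data, invented for illustration only.
taxa = [("sect. A", "Carex sp1"), ("sect. A", "Carex sp2")]
dna = [("Carex sp1", "voucher 001: ITS, ETS")]
print(render_checklist(taxa, dna))
```

Searching the exported text for the marker string then lists every taxon that still needs material.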
17. ← Section name
← Sampled taxon with its DNA vouchers and summaries
← Unsampled taxon
18. Because Kew has coded geography using TDWG standards, we can export geographic hit-lists
23. NCBI is a morass of data.
Geneious
• Query nucleotide database (NCBI) for Organism contains: “Carex”, “Uncinia”, “Schoenoxiphium”, “Kobresia”, “Vesicarex”, or “Cymophyllus”
• Export as
– FASTA
– TAB-Delim
– XML: the only export that maintains all information in NCBI, and necessary to obtain data that can be used to connect a sequence to a specimen.
26. A workflow for specimen-level multigene datasets from NCBI
• Download from NCBI [we used Geneious, but any bulk download is fine]
• Parse out collector name, collector number, isolate number, geography
• Manually clean collector names (3 days for >6500 records)
• Identify specimens by unique combinations of collector name, collector number, isolate
• Toss out “accessions” having more than one scientific name
• Clean gene region names so that names are not duplicated (30 minutes for >6500 records)
• Export datasets to MUSCLE and align; export log file
• Manually check alignments and code logfile (D, RC; variable)
• Rerun MUSCLE and export RAxML batchfile
• Analyze
• Screen for non-monophyly; concatenate and continue!
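The specimen-identification and "toss out" steps above can be sketched in a few lines. This is a hedged Python illustration under stated assumptions, not the code the group ran: the field names (`collector`, `number`, `isolate`, `species`) and the function `group_by_specimen` are hypothetical, and collector names are assumed to be cleaned already.

```python
from collections import defaultdict


def group_by_specimen(records):
    """Group sequence records into putative specimens (hypothetical helper).

    records: iterable of dicts with the keys 'collector', 'number',
    'isolate' and 'species'.
    Returns (kept, discarded), each mapping a specimen key to its records.
    """
    groups = defaultdict(list)
    for rec in records:
        # A specimen is one unique combination of collector name,
        # collector number and isolate.
        key = (
            rec["collector"].strip().lower(),
            rec["number"].strip(),
            rec["isolate"].strip(),
        )
        groups[key].append(rec)

    kept, discarded = {}, {}
    for key, recs in groups.items():
        names = {r["species"] for r in recs}
        # A key tied to more than one scientific name is ambiguous,
        # so it is set aside rather than used.
        target = kept if len(names) == 1 else discarded
        target[key] = recs
    return kept, discarded
```

Returning the discarded groups instead of silently dropping them keeps a record of what was excluded and why.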
28. Tab-delimited metadata from NCBI / Geneious is handy, but it lacks almost all the information that could be used as voucher IDs. No way to link sequences to specimens!
However, some NCBI records do contain this data. How do we access it?
29. NCBI Specimen Record
The FEATURES/Qualifier1 section has information that allows us to connect sequences to a specific specimen (for example, some records contain the qualifier specimen_voucher).
To get this additional information, we need to export the data as an XML file and parse the data out into a usable tab-delimited file.
Other good information to export
30. We parsed the NCBI XML and embedded fields within <qualifiers1> to get voucher, DNA isolate, population variants, country, geographic coordinates, collection date, collector name, and other fields… many informative about the identity of the plants sequenced.
To make clean voucher IDs, we used last name, collection number, and DNA isolate (used by some labs).
For this analysis, sequences that could not be assigned to a single-species voucher were discarded.
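A parse of this kind might look roughly like the sketch below. It assumes NCBI's INSDSeq XML layout (`INSDSeq`, `INSDQualifier_name`, `INSDQualifier_value`), which may differ from the Geneious XML export used here, where the fields sit under `<qualifiers1>`. The function names and the chosen qualifier list are illustrative assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Source qualifiers to pull out; an illustrative selection.
WANTED = ("specimen_voucher", "isolate", "pop_variant", "country",
          "lat_lon", "collection_date", "collected_by")


def parse_insdseq(xml_text):
    """Yield one dict per sequence record (assumes INSDSeq XML)."""
    root = ET.fromstring(xml_text)
    for seq in root.iter("INSDSeq"):
        row = {
            "accession": seq.findtext("INSDSeq_primary-accession", default=""),
            "organism": seq.findtext("INSDSeq_organism", default=""),
        }
        # Start every wanted field empty so all rows share the same columns.
        row.update({name: "" for name in WANTED})
        for qual in seq.iter("INSDQualifier"):
            name = qual.findtext("INSDQualifier_name", default="")
            if name in WANTED:
                row[name] = qual.findtext("INSDQualifier_value", default="")
        yield row


def to_tsv(rows):
    """Serialize parsed rows as tab-delimited text with a header line."""
    fields = ["accession", "organism", *WANTED]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Records without a given qualifier simply get an empty cell, so the sequences that cannot be tied to a voucher are easy to spot in the resulting table.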
32. ITS, ETS, matK, trnL-trnF
3,370 DNA sequences
2,196 individuals
723 spp
397 spp with > 1 individual
31.7% of those spp monophyletic
40. Mapping GBIF Data
• Generate species list to extract GBIF data (i.e. accepted names in World Checklist).
• Download GBIF data using a wrapper to dismo::gbif (R), allowing us to capture and log errors and missing data.
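The idea of a wrapper that captures and logs errors and missing data can be illustrated with a short sketch. This is Python rather than the R wrapper around dismo::gbif described above, and the `fetch` argument is a hypothetical stand-in for whatever function performs the actual download.

```python
def download_all(species_list, fetch):
    """Download records for each species, logging failures instead of stopping.

    fetch: any callable taking a species name and returning a list of
    records (hypothetical stand-in for the real downloader).
    Returns (results, log): results maps name -> records, and log is a
    list of (name, message) pairs for errors and empty downloads.
    """
    results, log = {}, []
    for name in species_list:
        try:
            records = fetch(name)
        except Exception as err:
            # Capture the error and move on to the next species.
            log.append((name, f"error: {err}"))
            continue
        if not records:
            # No exception, but nothing came back: log as missing data.
            log.append((name, "no records returned"))
            continue
        results[name] = records
    return results, log
```

Because one failed request no longer aborts the run, the whole species list can be processed in a single pass and the log reviewed afterwards.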
41. Clean up downloaded GBIF data
• Flag duplicate specimen datasets
– Flags specimens within the same species that have identical coordinates.
– This should be expanded to include specimens that have identical locality descriptions.
• Flag imprecise location data
– Flags specimens in which the latitude is precise only to the degree or to a tenth of a degree.
– This threshold could be adjusted, but is tailored to the WorldClim database we are using (2.5 arc minutes).
• Create a delimited file for each species containing specimen data with flagged columns (reference file of which data are utilized or excluded in mapping step). This file becomes part of our analysis archive, so that we can always go back and edit or evaluate old data.
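The two flags described above can be sketched as follows. This is a minimal Python illustration, not the R code used for the project. It assumes coordinates are kept as strings, so that the number of decimal places recorded is preserved, and it flags every member of a duplicate group. The names `flag_records`, `dup_coords` and `imprecise` are hypothetical.

```python
from collections import Counter


def decimal_places(value):
    """Count the digits after the decimal point in a coordinate string."""
    text = str(value).strip()
    return len(text.split(".", 1)[1]) if "." in text else 0


def flag_records(records, min_decimals=2):
    """Return copies of the records with two added flag columns.

    records: list of dicts with 'species', 'lat' and 'lon' (as strings).
    dup_coords: another record of the same species shares these coordinates.
    imprecise: latitude given only to the degree or a tenth of a degree.
    """
    counts = Counter((r["species"], r["lat"], r["lon"]) for r in records)
    flagged = []
    for r in records:
        key = (r["species"], r["lat"], r["lon"])
        flagged.append({
            **r,
            "dup_coords": counts[key] > 1,
            "imprecise": decimal_places(r["lat"]) < min_decimals,
        })
    return flagged
```

A tenth of a degree is 6 arc minutes, which is coarser than a 2.5 arc-minute climate grid; that is presumably why the threshold sits there. Records are flagged rather than deleted, so the per-species file still documents what was excluded.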
43. Mapping "cleaned-up" dataset (Map_gbif_jpeg_imprecise)
• Maps need to be manually checked for accuracy and completeness
• We export the maps as images to a Scratchpads media gallery that can be queried or filtered by taxon
• Map reviewing is conducted in a dedicated SP2 forum
45. There are bugs to work out, though
Some taxa are missing data. Example: Carex humilis
• Map of 2331 specimen records from R code download
• Website individual species download
– Filtered for specimens with coordinate data (= 7209 records)
– Missing records include some from France, Japan, & South Korea
46. Some maps will need adjustments: in next iterations, it should be possible to automate some of this
Carex alata specimen is missing a “-” in longitude column
Carex lanceolata has specimens where the latitude and longitude are switched.
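Some of these checks could plausibly be automated along the lines below. This is a hedged sketch, not an implemented part of the workflow: both function names are hypothetical, and the expected longitude range is assumed to come from some outside source such as the coded geography for the species.

```python
def check_coordinates(lat, lon):
    """Return (lat, lon, note), repairing only unambiguous lat/lon swaps.

    A swap is detectable only when the value in the latitude column is
    outside -90..90 while the longitude value would be a valid latitude.
    """
    if abs(lat) > 90 and abs(lon) <= 90:
        return lon, lat, "swapped"
    if abs(lat) > 90 or abs(lon) > 180:
        return lat, lon, "invalid"
    return lat, lon, "ok"


def suspect_sign(lon, expected_lon_range):
    """True if lon is outside the expected range but -lon falls inside it.

    expected_lon_range: (low, high) bounds supplied by the caller; an
    assumption of this sketch, not something derived from the record.
    """
    low, high = expected_lon_range
    return not (low <= lon <= high) and (low <= -lon <= high)
```

Swaps where both values fall between -90 and 90 cannot be caught this way and would still need the manual map review.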
47. In the end, integrating clean coordinate data with WorldClim climatic data allows us to correlate climatic niche evolution with morphological and lineage diversification*.
* See Thursday talk for exciting findings in subgenus Vignea!