2. A talk of two halves
1.) Outlining the extent of the problem
(lack of) sharing, standards, care (?)
2.) What I'm trying to do about it:
Digging data out of PDFs
Re-releasing as
3. Where's the data?
Just ~4% of published phylogenetic studies in 2010
publicly archived their supporting phylo data in
Stoltzfus A, O'Meara B, Whitacre J, Mounce R, Gillespie E, Kumar S, Rosauer D, & Vos R. 2012
Sharing and re-use of phylogenetic trees (and associated data) to facilitate synthesis
BMC Research Notes 10.1186/1756-0500-5-574
Check our data yourself on Dryad here: 10.5061/dryad.h6pf365t
4. Scientists cannot be relied upon to share published data upon request
This has been known for a while now
e.g. (in Psychology) Wicherts et al 2006
But has been confirmed to be true for phylogenetics too:
Drew et al 2013 'Lost Branches in the Tree of Life'
report that just ~16% of researchers contacted supplied
the requested ('published') phylo data.
My own experience tallies with this – I soon stopped bothering to try and
ask people via email for a copy of their published data. It's a waste of time.
5. The (Single) Supplementary Data File was a Y2K solution – a dump
Many legacy journal supplementary data systems
bury data and leave it there to decompose
Often not in a re-usable form, e.g. a lazy PDF
Sometimes 'typeset', corrupting the data
A jumble of words & data where the bit you
want is on page 92 (no programmatic access)
Research Data: BURIED and really not very discoverable
Do reviewers even look at it? I think not tbh
6. I wasted too much of my PhD trying to get usable data to re-analyze
This is what I felt like...
So I tried to do something
about it...
An open letter in support of
palaeontology data archiving
www.supportpalaeodatarchiving.co.uk
Which was picked up by Nature News
Which, in turn, got me in touch with:
7. Part 2
Since few will help you to re-use their data
You've got to dig it out
and
make it re-usable yourself
AND
re-release it openly
so no-one else wastes their time doing this
8. It's not just phylogenetics.
I learned from the Open Knowledge Conference (Berlin 2011)
that a lot of different academic fields also seem to struggle to
make re-usable published data available.
If it's a common, shared problem...
why not seek a shared, cross-disciplinary solution?
9. AMI (Amanuensis)
Building upon tools first developed
in computational chemistry by the Murray-Rust lab
e.g.
ChemicalTagger → PhyloTagger (Entity tagging)
(Chem)PubCrawler → (Phylo)PubCrawler
(for getting 10,000+ PDFs to work on)
https://bitbucket.org/nickday/pub-crawler
http://www-ucc.ch.cam.ac.uk/products/software/chemicaltagger
Open Source
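To give a flavour of what "entity tagging" means here, a minimal sketch in Python. This is not PhyloTagger or ChemicalTagger code: it is a deliberately naive regular-expression stand-in, and the name `tag_binomials` and the pattern are assumptions made purely for illustration. It flags a capitalised genus followed by a lowercase epithet, so it will also produce false positives on ordinary sentence openings.

```python
import re

# Hypothetical, naive pattern: a capitalised genus followed by a lowercase
# species epithet of at least three letters. Real taggers do far more.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

def tag_binomials(text):
    """Return (genus, epithet) pairs found in text, in order of appearance."""
    return BINOMIAL.findall(text)

sample = "UV vision was scored for Turdus iliacus and Taeniopygia guttata."
print(tag_binomials(sample))
# [('Turdus', 'iliacus'), ('Taeniopygia', 'guttata')]
```

A label such as "Orthonyx x 2" is not matched, because the single-letter "x" fails the epithet part of the pattern.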
10. BBSRC grant approved
“PLUTo: Phyloinformatic Literature Unlocking Tools”
Software for making published phyloinformatic
data discoverable, open, and reusable
...I just need to get my PhD viva done & rubber-stamped
Instructions for getting the current working setup here:
(multiple repositories, dependencies & requirements!)
http://rossmounce.co.uk/2013/10/06/setting-up-ami2-on-windows/
11. PDF → AMI → HTML
Evolution of ultraviolet vision in the largest avian radiation - the passerines
Anders Ödeen 1*, Olle Håstad 2,3 and Per Alström 4
Styles, superscripts and diåcritics preserved!
12. PDF
Turdus iliacus
Taeniopygia guttata
Serinus canaria
Lanius excubitor
Melopsittacus undulatus
Pavo cristatus
Sturnus vulgaris
Dolichonyx oryzivorus
Ficedula hypoleuca
Vaccinium myrtillus
Falco tinnunculus
Turdus
Pomatostomus
Leothrix
Amytornis
Acanthisitta
Orthonyx x 2
Malurus
Cnemophilus x 4
Philesturnus x 2
Motacilla x 2
Toxorhampus x 2
13. Typical phylo tree: 60 nodes, complex and minuscule annotation,
vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
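As a rough idea of what "re-usable" could look like once a tree has been dug out of a figure, here is a short Python sketch that writes a tiny nested tree with branch lengths as Newick, a widely used plain-text tree format. The slides do not say which format AMI emits, so treat this as an assumption; the function `to_newick`, the topology and the branch lengths are all made up for illustration.

```python
def to_newick(node):
    """Serialise a node to Newick, without the trailing semicolon.

    A node is (name, branch_length, children); children is a list of nodes.
    """
    name, length, children = node
    label = name
    if children:
        # Internal node: children in parentheses, then the optional node name.
        label = "(" + ",".join(to_newick(c) for c in children) + ")" + name
    if length is not None:
        label += ":" + format(length, "g")
    return label

# Made-up topology and branch lengths, for illustration only.
tree = ("", None, [
    ("Pavo_cristatus", 0.3, []),
    ("", 0.1, [
        ("Turdus_iliacus", 0.05, []),
        ("Sturnus_vulgaris", 0.06, []),
    ]),
])
print(to_newick(tree) + ";")
# (Pavo_cristatus:0.3,(Turdus_iliacus:0.05,Sturnus_vulgaris:0.06):0.1);
```

The point of a plain-text form like this is that the branch lengths survive as numbers that anyone can re-analyse, rather than as pixels in a figure.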
15. Acknowledgements & Thanks
For the Panton Fellowship,
inspiration and support
To the organisers
of both the session:
Nico, Hilmar, Rutger
and the conference
as a whole!
For travel & accommodation
support, without which I couldn't
possibly attend TDWG
My main collaborators on PLUTo: Matthew Wills and Peter Murray-Rust