1st LEARN Workshop. Embedding Research Data as part of the research cycle. 29 Jan 2016. Presentation by Peter Murray-Rust, ContentMine.org and University of Cambridge
The Culture of Research Data, by Peter Murray-Rust
1. The Culture of Research Data
Peter Murray-Rust,
ContentMine.org and University of Cambridge
LEARN, London, UK 2016-01-29
The technology for Managing Research Data is already here…
…but we need a change of culture
Open Notebook Science
Publishers must be forced to serve us, not tyrannize us
2. Just read the big letters
He’s got zillions of slides…
4. The Right to Read is the Right to Mine
http://contentmine.org
5. Themes
• Highly domain-dependent (chem, cryst, phylo)
• Requires community and centrality
• University repositories are NOT the solution
• Openness makes it dramatically easier/better
• The publisher-academic complex is a major problem.
• Infrastructure must be open and under our control
6. WE pay for scholarly publications that WE can’t read
The Publisher-Academic complex [1]
[Diagram. Nodes: Publishers, Academia, Taxpayer, Student, Researcher. Flow labels: “Glory+?”, “$$, MS”, “review”, “$$”, “$$”, “in-kind”.]
[1] The Military-Industrial-Academic complex (1961) (Dwight D Eisenhower, US President)
8. Some topics
• Github / software mgt informs data mgt
• Open notebook science
• Open source malaria + LabTrove
• Open phylogenetics
• Computational chemistry
• Crystallography
• Early career researchers can change the world, if we let them.
• Are “publishers” tyrants or servants?
10. Why I reposit software in GitHub
I WANT TO!!!
BETTER
QUICKER
SECURE
AUDIT, BACKTRACKABLE
EASY
get collaborators
Most early career software creators have repos
How many people have USED Git?
14. Continuous Integration (Jenkins)
[Dashboard legend: Compile Fail, Inactive, Fail Tests, Pass Tests]
Every time I commit a change, 50 projects are recompiled and tested.
Impossible to do this manually!
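The rebuild-and-test loop on this slide can be sketched in a few lines. This is only an illustration, not how Jenkins is configured: the `Project` class, the `check` and `on_commit` functions, and the three example projects are hypothetical names invented here. The four status values are taken from the dashboard legend on the slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    # The four states shown in the slide's dashboard legend.
    INACTIVE = "Inactive"
    COMPILE_FAIL = "Compile Fail"
    FAIL_TESTS = "Fail Tests"
    PASS_TESTS = "Pass Tests"


@dataclass
class Project:
    # Hypothetical stand-in for one of the projects; the two callables
    # represent "run the build" and "run the test suite".
    name: str
    active: bool
    compile: Callable[[], bool]
    test: Callable[[], bool]


def check(project: Project) -> Status:
    """Build, then test, one project and report the first stage that fails."""
    if not project.active:
        return Status.INACTIVE
    if not project.compile():
        return Status.COMPILE_FAIL
    if not project.test():
        return Status.FAIL_TESTS
    return Status.PASS_TESTS


def on_commit(projects: List[Project]) -> Dict[str, Status]:
    """Re-check every project; a CI server triggers this on each commit."""
    return {p.name: check(p) for p in projects}


# Usage with three made-up projects.
projects = [
    Project("parser", True, lambda: True, lambda: True),
    Project("viewer", True, lambda: True, lambda: False),
    Project("legacy", False, lambda: True, lambda: True),
]
results = on_commit(projects)
for name, status in results.items():
    print(f"{name}: {status.value}")
```

The point of the sketch is the one made on the slide: the loop is trivial for a machine and impractical for a person once there are dozens of projects.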
16. Traditional Research and Publication
[Diagram. Stages: “Lab” work, Write, rewrite, Re-experiment, paper/thesis, publish, ???. Annotations: “Validation??”, “DATA”, “output ‘belongs’ to publisher”.]
Every process is LOSSY
17. How NOT to publish data
HT Henry Rzepa
From Henry Rzepa:
this article http://doi.org/10.1126/science.aad6252
which provides a 22 Mbyte PDF of data (mostly bitmaps of NMR
spectra) and comes in at 404 pages long. [1]
But this one http://doi.org/10.1021/jacs.5b05902 [comp chem]
is 505 pages long (the current record holder?)
[1] DATA Behind paywall
18. Computational Chemistry
The 505-page PDF was a machine-readable log file that could and should have been in a repo
19. MORE of the PDF
DATA Destruction
Blind humans and Machines cannot read this
22. JD Bernal’s 1965 vision
However large an array of facts, however rapidly they
accumulate, it is possible to keep them in order and to
extract from time to time digests containing the most
generally significant information, while indicating how to
find those items of specialized interest. To do so, however,
requires the will and the means. (Bernal, 1965)
Quoted by PMR in http://journals.iucr.org/d/issues/1998/06/01/ba0011/ba0011.pdf
24. https://en.wikipedia.org/wiki/Bermuda_Principles
• Automatic release of sequence assemblies larger than 1
kb (preferably within 24 hours).
• Immediate publication of finished annotated
sequences.
• Aim to make the entire sequence freely available in the
public domain for both research and development in
order to maximise benefits to society.
HUMAN GENOME project used Open Notebooks
30. Mat Todd (Sydney) and MANY collaborators
http://opensourcemalaria.org/ (Chrome for interactivity)
Mat Todd, Univ Sydney, runs an Open Notebook community
to create new antimalarials.
36. data is associated with the proposed
scientific endeavour prior to or at the
point of creation rather than by
annotating the data with commentary
after the experiment has taken place
University of Southampton
38. Henry Rzepa does Open Notebook Computational Chemistry… on his blog
http://www.rzepa.net/blog/?p=14272
This is a current open notebook discussion, http://www.ch.imperial.ac.uk/rzepa/blog/?p=15552 (see comments, currently 67).
42. Crystallography – a model for Data Management
• Pro-active, friendly international community
• Committed active International Union (IUCr)
• Data publication valued (1960-present)
• Community develops semantics/dictionaries
• Committed volunteer software innovators
• Heavily Open approach
• Massive and valuable re-use of data
• Culture of validation/reproducibility
• Respect and credit for tool development
47. Where to reposit published crystallography?
Proteins -> PDB, Open
BUT
Inorganics -> ICSD, Closed
Organics -> Cambridge (CCDC), Closed
SO
The community has built a Crystallography Open Database
48. Restrictions on Re-use of Crystallographic data
NOTE: The CCDC is based on data contributed by
scientists as part of publication and validation
Crystallographic data from publications now belongs to CCDC
55. Rotation-Based Learning (RBL)
Phase 1: Initiator
• No communication permitted between groups
• Attempt to reproduce existing literature
• Deliver a coherent research story by the end of Phase 1
Phase 2: Successor
• Communication between groups still prohibited
• Validate and develop the inherited research story
• Critique your predecessors
• Role of research producer vs. research user
• Can this approach help to foster awareness of reproducibility issues?
Throughout Phases 1 & 2:
• Daily lectures on open science culture & techniques
• First-hand application to own research work
• Version control using GitHub
• Daily group supervision
73. The John S. and James L. Knight Foundation is an American private, non-profit foundation
dedicated to supporting "transformational ideas that promote quality journalism, advance
media innovation, engage communities and foster the arts."[2]
DAT supports public data
74. @Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism:
"Elsevier stopped me doing my research"
http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ … #opencon #TDM
Elsevier stopped me doing my research
Chris Hartgerink
75. I am a statistician interested in detecting potentially problematic research such as data fabrication,
which results in unreliable findings and can harm policy-making, confound funding decisions, and
hampers research progress.
To this end, I am content mining results reported in the psychology literature. Content mining the
literature is a valuable avenue of investigating research questions with innovative methods. For
example, our research group has written an automated program to mine research papers for errors in
the reported results and found that 1/8 papers (of 30,000) contains at least one result that could
directly influence the substantive conclusion [1].
In new research, I am trying to extract test results, figures, tables, and other information reported in
papers throughout the majority of the psychology literature. As such, I need the research papers
published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research
papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account
potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention
to redistribute the downloaded materials, had legal access to them because my university pays a
subscription, and I only wanted to extract facts from these papers.
Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days.
This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day.
Approximately two weeks after I started downloading psychology research papers, Elsevier notified my
university that this was a violation of the access contract, that this could be considered stealing of
content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading
(which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.
I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly
hampering me in my research.
[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The
prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22.
doi: 10.3758/s13428-015-0664-2
Chris Hartgerink’s blog post
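The server-load figures quoted in the blog post can be checked with a few lines of arithmetic. This is a minimal sketch; the function names `server_load` and `min_interval_seconds` are invented here for illustration and do not come from the post. The inputs (30 GB, 10 days, 9 papers per minute) are the approximate values the author reports.

```python
def server_load(total_gb: float, days: float) -> dict:
    """Average transfer rate for a download spread evenly over `days`."""
    per_day = total_gb / days
    per_hour = per_day / 24
    per_minute = per_hour / 60
    return {
        "gb_per_day": per_day,
        "gb_per_hour": per_hour,
        "gb_per_minute": per_minute,
    }


def min_interval_seconds(max_per_minute: int) -> float:
    """Shortest pause between requests that respects a per-minute cap."""
    if max_per_minute <= 0:
        raise ValueError("max_per_minute must be positive")
    return 60.0 / max_per_minute


# Figures reported in the post: roughly 30 GB in roughly 10 days.
load = server_load(30, 10)
print(load["gb_per_day"])               # 3.0
print(load["gb_per_hour"])              # 0.125
print(round(load["gb_per_minute"], 4))  # 0.0021

# A cap of 9 papers per minute means waiting at least this long between downloads.
print(round(min_interval_seconds(9), 2))  # 6.67
```

The results agree with the post's own numbers (0.0021 GB/min, 0.125 GB/h, 3 GB/day), and the 9-per-minute cap works out to one request about every 6.7 seconds.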
76. Some Children of the Digital Enlightenment
• David Carroll & Joe McArthur: OAButton
• Rayna Stamboliyska & Pierre-Carl Langlais
• Jon Tennant
• Ross Mounce
• Jenny Molloy
• Erin McKiernan
• Jack Andraka
• Michelle Brook
• Heather Piwowar
• TheContentMine Team
• Rufus Pollock
• Jonathan Gray
• Sophie Kay
Jean-Claude Bradley [1], a chemist, developed Open Notebook Science: making the entire primary record of a research project publicly available online as it is recorded. (WP)
J-C promoted these ideas with UNDERGRADUATE scientists.
[1] Unfortunately J-C died in 2014; we held a memorial meeting in Cambridge
80. This is a current open notebook discussion, http://www.ch.imperial.ac.uk/rzepa/blog/?p=15552
(see comments, currently 67).
This is an earlier one, http://www.rzepa.net/blog/?p=14272 (with 86 comments) and also
incorporates Jsmol to visualise all the data
This one starts discussion as an open notebook http://www.ch.imperial.ac.uk/rzepa/blog/?p=1211
with the resulting formal publication at 10.1002/jcc.23985
This was the original open notebook post http://www.ch.imperial.ac.uk/rzepa/blog/?p=984 with
the resulting formal publication at 10.1038/NCHEM.596
This one incorporates open data into its citation list
http://www.ch.imperial.ac.uk/rzepa/blog/?p=15505 and is also an open notebook follow up to my
PhD thesis work, formally published in 1975 or so, thus operating in reverse to the above.
This shows some end outcomes: http://www.ch.imperial.ac.uk/rzepa/blog/?p=15313
This shows the principles: http://www.ch.imperial.ac.uk/rzepa/blog/?p=10972
This is an introductory tutorial http://www.ch.imperial.ac.uk/rzepa/blog/?p=14454
This is a critique http://www.ch.imperial.ac.uk/rzepa/blog/?p=13826
This is a “convincing case” http://www.ch.imperial.ac.uk/rzepa/blog/?p=13248
This is about metadata http://www.ch.imperial.ac.uk/rzepa/blog/?p=12932
And its use http://www.ch.imperial.ac.uk/rzepa/blog/?p=12526
You have seen this data nightmare before: http://www.ch.imperial.ac.uk/rzepa/blog/?p=12728
This is about ORCID http://www.ch.imperial.ac.uk/rzepa/blog/?p=12513
88. Ross Mounce (Bath), Panton Fellow
• Sharing research data:
http://www.slideshare.net/rossmounce
• How-to figures from PLOS/One [link]:
Ross shows how to bring figures to life:
• PLOSOne at http://bit.ly/PLOStrees
• PLOS at http://bit.ly/phylofigs (demo)
Editor's Notes
Hi, I’m here to talk about AMI, a data extraction framework and tool. First, I just want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture.
In this talk, I’m going to stress the importance of data in a specific format and its utility for automated machine processing. Then I’m going to demonstrate AMI’s architecture and the transformation of data as it flows through the process. I’m going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I’m going to introduce Andy’s ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process non-chemistry-specific data. Finally, I will demonstrate some uses of the ChemVisitor within the realm of validation and metabolism.