1. The Right to Read is the Right to Mine
http://contentmine.org
2. Background
• ContentMine aims to make large areas of scientific fact OPEN (100
million facts/year)
• We’re working with the Wellcome Trust, Europe PubMed Central, etc.
• A politically “hot” area (Hargreaves legislation, EU activity)
• 2015 Wellcome Trust workshop on text and data mining (TDM) and
Neuroscience; “rough consensus” on what was needed.
• Day workshop at Cochrane, UK (Amy Price, Anna Noel Storr, Ben
Goldacre)
• 2-day workshop at Edinburgh on Systematic Reviews of Animal Test
publications
• In the last few months we’ve prototyped a unique Open starting
point, continuously released.
• Now actively building communities (plants, clinical, animals,
psychology, crystallography, HEPhys)
3. http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html
We were stunned recently when we stumbled across an article by European
researchers in Annals of Virology [1982]: “The results seem to indicate that
Liberia has to be included in the Ebola virus endemic zone.” In the future,
the authors asserted, “medical personnel in Liberian health centers should be
aware of the possibility that they may come across active cases and thus be
prepared to avoid nosocomial epidemics,” referring to hospital-acquired
infection.
Adage in public health: “The road to inaction is paved with research
papers.”
Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing
4. [Diagram: ContentMine pipeline; labels recovered from the figure]
Tools: getpapers, quickscrape, norma, ami
Sources: daily crawl and query of EuPMC, arXiv, CORE, HAL (UNIV repos); ToC services; publisher sites (PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…)
Inputs: URLs, DOIs; PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS
norma: Normalizer, Structurer, Semantic Tagger (text, data, figures)
Document parts: abstract, methods, references, captioned figures, HTML tables
ami: search, lookup; content-mining plugins for Chem, Phylo, Trials, Crystal, Plants
Community: plugins, scrapers, queries, taggers; visualization and analysis
Outputs: catalogue, Semantic ScholarlyHTML, facts
Throughput: 30,000 pages/day
8. ContentMine Workshops and Hackdays
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build an application in a morning
Start simple: bagOfWords, Stemming, Regex, templates
10. Workshops
(1-hour -> full day or more)
2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
• JISC, London
Upcoming
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO
Collaborators
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE
• Europe PubMed Central
11. Facts Marked by “non-scientists” in ContentMine workshops
With Wikipedia everyone can be a scientist
12. Linked Open Data – the world’s knowledge
very little physical science and THESES??
[Figure: Linked Open Data cloud diagram as of September 2011,
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
Labelled regions: DBPedia; BIO; Comp; Lib; PDB; Ontologies; GOV; GOV.uk; Music, Art, Literature; Social; Knowledge bases; RDF triples]
17. Questions we can tackle
• How do we find (mentions of) clinical/animal trials?
• Is a document a trial?
• What is the subject of the trial?
• What is the methodology used?
• Do the design and practice conform to
CONSORT/ARRIVE?
• What are the outcomes?
• Can we extract specific re-usable information?
• Who is involved? (researchers, sponsors, patients?)
• Has a proposed trial been completed and reported?
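One simple way into the first question, finding mentions of trials, is to search the text for trial registry identifiers with a regular expression. The following is a minimal sketch, not ContentMine's own plugin code. It assumes two identifier formats: ClinicalTrials.gov ("NCT" followed by eight digits) and the ISRCTN registry ("ISRCTN" followed by eight digits). The function name `find_trial_ids` and the sample sentence are hypothetical.

```python
import re

# Assumed identifier formats (illustrative, not exhaustive):
#   ClinicalTrials.gov: "NCT" + 8 digits
#   ISRCTN registry:    "ISRCTN" + 8 digits
TRIAL_ID = re.compile(r"\b(NCT\d{8}|ISRCTN\d{8})\b")


def find_trial_ids(text):
    """Return registry identifiers in order of first appearance, without duplicates."""
    seen = []
    for match in TRIAL_ID.finditer(text):
        if match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Hypothetical sample sentence; the identifiers are made up.
sample = ("The study was registered as NCT01234567; "
          "a follow-up (ISRCTN12345678) cites NCT01234567 again.")
print(find_trial_ids(sample))
```

A document that mentions such an identifier is not necessarily a trial report, so a match is best treated as a candidate for the later questions (subject, methodology, outcomes) rather than as an answer.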
18. Text-based plugins
• Bag of words
(https://en.wikipedia.org/wiki/Bag-of-words_model)
• TF-IDF: term frequency, inverse document frequency
(https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
• Templates and regexes (regular expressions).
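The first two techniques can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the ami plugin implementation: the tokenizer keeps only runs of ASCII letters, term frequency is the word count divided by document length, and inverse document frequency is the unsmoothed log(N / document frequency). The function names and the three sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split on non-letters, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tf_idf(docs):
    """Return one {word: score} dict per document.

    tf  = count of the word in the document / total words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(bags)
    # Iterating a Counter yields each distinct word once per document.
    doc_freq = Counter(word for bag in bags for word in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return scores


# Hypothetical mini-corpus.
docs = [
    "the trial enrolled mice",
    "the trial enrolled patients",
    "the crystal structure was solved",
]
for doc, score in zip(docs, tf_idf(docs)):
    top = max(score, key=score.get)
    print(f"{top!r} scores highest in: {doc}")
```

Words that occur in every document (here "the") score zero, which is why TF-IDF is a common first filter before applying templates or regexes.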
22. Open Content Mining of FACTs
Machines can interpret chemical reactions
We have processed 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B EUR.
24. Bitmap Image and Tesseract OCR
[Figure: bitmap plot. y-axis: Ln bacterial load per fly (ticks 6.0 to 11.5); x-axis: days post-infection (0 to 5)]
25. AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram,
recognizes the paths and
generates the molecules. Then
she creates a stop-frame animation
showing how the 12 reactions
lead into each other
28. What we can do
• Recognize and promote autonomous sub-
communities
• Engage Early Career Researchers, including
undergraduates, and let THEM BUILD the
systems.
• COMMUNALLY build tools for data checking
• Insist on semantic data input, even if it costs
submissions