ContentMine and WikiData
Peter Murray-Rust
Wikimania,
London UK 2014-08-08
ContentMine: We use machines to
liberate 100 million facts /yr from the
scientific literature and make them free
for everyone (WikiData)
With Wikipedia we are ALL scientists
ContentMine is a social machine
WikiData is the future of science data
http://en.wikipedia.org/wiki/Tim_Berners-Lee
Everything in this presentation is ODOSOS
(Open Data, Open Standards, Open Source)
CC0, CC-BY, W3C etc., Apache2, etc. *
http://contentmine.org
http://bitbucket.org/petermr
http://wwmm.ch.cam.ac.uk
*Sorry about the Powerpoint (Power corrupts, Powerpoint corrupts absolutely (Tufte))
A promise: I (Petermr) will never sell out to non-transparent organizations.
petermr: I believe in Wikipedia
• 2006 http://en.wikipedia.org/wiki/User:Petermr
• 2006 started Open Data (term unknown then!)
• 2009: “the bit of Wikipedia that I wrote is correct” [challenging the
idea of “WP is junk”]
• 2009: “Wikipedia is the digital library of this century”
• 2012: I alert WP that Springer has copyrighted > 1000 of our
images [Springergate]
• 2014: “For facts in maths, physical and biological sciences I trust
Wikipedia.” (Wikimania2014)
A meritocratic
critical
volunteer
community
Volunteer community in chemistry: Open Data/Source/Standards
Scientific and Medical publication (STM)[+]
• World Citizens pay $400,000,000,000…
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” [*]…
• … $10,000,000,000 from academic libraries …
• … to “publishers” who forbid access to 99.9% of
citizens of the world …
[+] Figures probably ±50%
[*] arXiv preprint server costs $7 USD per paper
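The figures on this slide can be checked against each other with a few lines of arithmetic. The sketch below only multiplies the per-article costs by the article count quoted above. The constant and function names are made up for illustration, and the results should be read with the slide's own ±50% caveat in mind.

```python
# Figures copied from the slide; the slide rates them as good to about +-50%.
ARTICLES_PER_YEAR = 1_500_000
RESEARCH_COST_PER_ARTICLE = 300_000  # USD to create one article
PUBLISH_COST_PER_ARTICLE = 7_000     # USD to "publish" one article
ARXIV_COST_PER_PAPER = 7             # USD per paper on the arXiv preprint server


def yearly_totals():
    """Return (research spend, publishing spend, publish/arXiv cost ratio)."""
    research = ARTICLES_PER_YEAR * RESEARCH_COST_PER_ARTICLE
    publishing = ARTICLES_PER_YEAR * PUBLISH_COST_PER_ARTICLE
    ratio = PUBLISH_COST_PER_ARTICLE // ARXIV_COST_PER_PAPER
    return research, publishing, ratio


research, publishing, ratio = yearly_totals()
print(f"research:   ${research:,}")
print(f"publishing: ${publishing:,}")
print(f"publishing one article costs {ratio}x an arXiv paper")
```

The products come to $450 billion and $10.5 billion, which agree with the quoted $400 billion and $10 billion within the stated uncertainty.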
4 Billion USD on human genome
yielded 800 Billion USD and 4 M job-years
Gloom Warning
…three problems—flawed design, non-
publication, and poor reporting—together
meant >85% of research funds were wasted, a
global total loss >100 billion USD per year.
[Lancet 2009]
[Even more] waste clearly occurs after
publication: from poor access, poor
dissemination, and poor uptake of the findings
of research. [PLOS Medicine 2014-05-27]
Bad publication wastes science
Publishers’ PDFs destroy science
PDFs do not contain words
or subscripts!
PDFs do not contain tables
and do not have columns
SVG is turned into JPEG because it’s easier to process
Elsevier wants to control Open Data
[asked by Michelle Brook]
STM Publishers Licence
2012_03_15_Sample_Licence_Text_Data_Mining.pdf
(Summary: PMR has NO rights)
• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or
website”
• “Subscriber shall pay a […] fee”
Heather Piwowar: “negotiating with publishers [made me physically ill]”
WE WALKED OUT
• British Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R
Licences destroy Content Mining
CLOSED ACCESS MEANS PEOPLE DIE
CLOSED DATA MEANS PEOPLE DIE
Happiness Restored
http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [peer-
reviewed literature] by all scientists, scholars, teachers,
students, and other curious minds. …
…Removing access barriers to this literature will
accelerate research, enrich education, share the
learning of the rich with the poor and the poor with
the rich, make this literature as useful as it can be, and
lay the foundation for uniting humanity in a common
intellectual conversation and quest for knowledge.
(Budapest Open Access Initiative, 2002)
The Right to Read is the Right to Mine
http://contentmine.org
• Science can be read and understood by
human-machine Amanuensis-symbionts.
• Amanuenses are based on Wikipedia,
databases and software (e.g. ContentMine’s
AMI)
• The results are fed back into WP and WikiData
http://en.wikipedia.org/wiki/Symbiosis http://en.wikipedia.org/wiki/Eric_Fenby
• Crawl scientific literature
(Open Bibliography)
• Scrape each scientific article
(ContentMine-quickscrape)
• Extract the facts (ContentMine-AMI)
• Index (Wikipedia)
• Republish (WikiData)
Machine Extraction of scientific facts
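The crawl, scrape, extract and index steps above can be sketched as a chain of small functions. This is a toy illustration, not the ContentMine code: the `Article` class, the `fetch` callback and the vocabulary matching are all assumptions made for the example, and the real quickscrape and AMI tools work differently. The final "republish" step is left out because it depends on the WikiData interface.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    url: str
    text: str = ""
    facts: list = field(default_factory=list)


def crawl(bibliography):
    """Crawl: yield one Article per bibliography entry (here just a URL)."""
    for url in bibliography:
        yield Article(url=url)


def scrape(article, fetch):
    """Scrape: fill in the article text using a caller-supplied fetch function."""
    article.text = fetch(article.url)
    return article


def extract(article, vocabulary):
    """Extract: record every vocabulary term that occurs in the text."""
    lowered = article.text.lower()
    article.facts = [term for term in vocabulary if term.lower() in lowered]
    return article


def run_pipeline(bibliography, fetch, vocabulary):
    """Index: map each term found to the URLs of the articles mentioning it."""
    index = {}
    for article in crawl(bibliography):
        article = extract(scrape(article, fetch), vocabulary)
        for fact in article.facts:
            index.setdefault(fact, []).append(article.url)
    return index


# Usage with an in-memory stand-in for the literature (hypothetical pages).
pages = {"a": "Lion and soybean", "b": "Aspergillus oryzae on soybean"}
vocabulary = ["Lion", "Soybean", "Aspergillus oryzae"]
index = run_pipeline(pages.keys(), pages.get, vocabulary)
print(index)
```

Passing `fetch` in as a parameter keeps the sketch runnable offline; a real crawler would replace it with an HTTP client.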
Human-machine symbionts can read science!
WP_Lion
WP_Aspergillus_oryzae
WP_Soybean
Facts Marked by “non-scientists” in ContentMine workshops
With Wikipedia everyone can be a scientist
“nuggets” in a scientific paper
quantity
units
Value ranges
Humans aren’t designed to mine this … 
chemical
project places
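Nuggets such as quantities, units and value ranges are the kind of thing a machine can pick out mechanically. The following is a minimal sketch using a regular expression. The unit list and the example sentence are invented for illustration, and this is not how ContentMine's AMI is implemented.

```python
import re

# A tiny, hypothetical unit vocabulary; a real one would be far larger.
UNITS = ("mg", "kg", "mL", "mM", "°C", "g", "L", "K", "h")

NUMBER = r"\d+(?:\.\d+)?"
# Try longer units first so "mg" is not read as "g".
UNIT_ALTERNATIVES = "|".join(
    re.escape(unit) for unit in sorted(UNITS, key=len, reverse=True)
)
QUANTITY = re.compile(
    rf"(?P<low>{NUMBER})"
    rf"(?:\s*(?:-|–|to)\s*(?P<high>{NUMBER}))?"  # optional upper end of a range
    rf"\s*(?P<unit>{UNIT_ALTERNATIVES})(?![A-Za-z])"
)


def find_quantities(text):
    """Return (low, high, unit) tuples; high is None for a single value."""
    found = []
    for match in QUANTITY.finditer(text):
        low = float(match.group("low"))
        high = match.group("high")
        found.append((low, float(high) if high else None, match.group("unit")))
    return found


sentence = "The mixture was stirred at 25 °C for 2-3 h, then 1.5 mg was added."
for low, high, unit in find_quantities(sentence):
    print(low, high, unit)
```

The negative lookahead after the unit stops "3 hours" from being read as "3 h"; spelled-out units would need their own vocabulary entries.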
Parsing chemical sentences
A FACT, uncopyrightable, and representable by triples
http://wwmm.ch.cam.ac.uk/chemicaltagger
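A fact "representable by triples" can be pictured as a (subject, predicate, object) record. The sketch below shows one way to hold and query such records. The reaction identifiers, predicate names and values are invented placeholders, not output from ChemicalTagger.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(triples, subject: Optional[str] = None,
          predicate: Optional[str] = None, obj: Optional[str] = None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Hypothetical facts of the kind a parsed synthesis sentence might yield.
facts = [
    Triple("reaction-1", "hasSolvent", "ethanol"),
    Triple("reaction-1", "hasTemperature", "25 °C"),
    Triple("reaction-2", "hasSolvent", "water"),
]

for triple in match(facts, predicate="hasSolvent"):
    print(triple.subject, triple.obj)
```

Keeping each fact as an independent triple is what makes it easy to merge facts from many papers into one store.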
• Typical chemical synthesis
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Queues
Repos
Scientific
literature
Science
Plugins
Science
Volunteers
But we can now
turn PDFs into
Science
We can’t turn a hamburger into a cow
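[Figure: plot extracted from a dumb PDF. Labelled parts: units, ticks, quantity, scale, titles. Result: DATA!! (2000+ points), exported as CSV and as a semantic spectrum.]
[Figure: spectrum processed with a 2nd derivative and a Gaussian filter. Automatic extraction takes < 1 second.]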
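The slide names a Gaussian filter and a 2nd derivative as the processing applied to the extracted spectrum. One common way to combine the two is to smooth the data and then look for strongly negative minima of the second difference, which mark peak positions. The sketch below does that in plain Python on a synthetic signal. The parameter values and the threshold are assumptions, and this may differ from what the ContentMine software actually does.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0, radius=3):
    """Gaussian-filter the values, clamping indices at both ends."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += weight * values[j]
        out.append(acc)
    return out


def second_derivative(values):
    """Central second difference; entry i belongs to values[i + 1]."""
    return [values[i - 1] - 2 * values[i] + values[i + 1]
            for i in range(1, len(values) - 1)]


def find_peaks(values, sigma=1.0, radius=3, threshold=-0.01):
    """Indices where the smoothed second derivative has a deep local minimum."""
    d2 = second_derivative(smooth(values, sigma, radius))
    peaks = []
    for i in range(1, len(d2) - 1):
        if d2[i] < threshold and d2[i] <= d2[i - 1] and d2[i] < d2[i + 1]:
            peaks.append(i + 1)  # shift back to an index into values
    return peaks


# Synthetic spectrum: two Gaussian peaks centred at x = 20 and x = 60.
signal = [math.exp(-((x - 20) ** 2) / 18) + 0.5 * math.exp(-((x - 60) ** 2) / 18)
          for x in range(100)]
print(find_peaks(signal))
```

Smoothing first matters because differencing amplifies noise; on real extracted points the threshold would need tuning.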
Bacterial WP_phylogenetic tree
Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy
Trees from http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
WP: Clostridium_butyricum
Genbank ID
American Type
Culture Collection
(http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0036933 –
“Adaptive Evolution of HIV at HLA Epitopes Is Associated with Ethnicity in Canada”)
((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),(
(((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n
215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),
n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n
102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((
n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n1
60))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139
,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222)))
)))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(
n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((
n53,n131),n159)))))));
http://en.wikipedia.org/wiki/Digital_image_processing
http://en.wikipedia.org/wiki/Newick_format http://en.wikipedia.org/wiki/Phylogenetics
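The string above is a tree in Newick format: nested parentheses group sibling nodes, commas separate them, and a semicolon ends the tree. A minimal sketch of reading such a string follows. It assumes bare leaf names only, as in the example above, and ignores branch lengths and internal node labels, which full Newick allows. Whitespace is stripped first because the string on the slide is wrapped across lines.

```python
def parse_newick(text):
    """Parse a Newick string of bare leaf names into nested lists of strings."""
    text = "".join(text.split())  # drop spaces and line breaks
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume '('
            children = [node()]
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
            return children
        start = pos
        while text[pos] not in ",();":
            pos += 1
        return text[start:pos]

    tree = node()
    if text[pos] != ";":
        raise ValueError(f"unexpected text at position {pos}")
    return tree


def leaves(tree):
    """Flatten the nested lists into leaf names, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


small = "((n1,n2),(n3,(n4,\n n5)));"
tree = parse_newick(small)
print(tree)
print(len(leaves(tree)), "leaves")
```

Once the tree is a plain nested structure, counting taxa or comparing topologies becomes ordinary list processing.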
Open notebook science is the practice of
making the entire primary record of a research
project publicly available online as it is
recorded. (WP)
Jean-Claude Bradley was a chemist who
actively promoted Open Science in
chemistry,… He coined the term Open
Notebook Science. … A memorial
symposium was held July 14, 2014 at
Cambridge University, UK.
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Queues
Repos
Scientific
literature
Science
Plugins
Science
Volunteers
My Wikiwishes
• An Open Bibliography of science, updated
daily
• An interface for ContentMine to feed new
facts into WikiData
• Domain-specific enthusiasts to create and run
fact extraction and validation
• Wikipedia to become a C21 publisher of
science
Thanks
• Shuttleworth Foundation and Fellowship
• Contentmine.org: Michelle Brook, Jenny Molloy,
Ross Mounce, Richard Smith-Unna,
CottageLabs, Charles Oppenheim
• Open Knowledge Foundation Community
• Wikimedia Community
• Blue Obelisk Community
