Martin A Walker, SUNY Potsdam
- Chemical data in Wikipedia
- Validation of Wikipedia chemical data
- RSC Learn Chemistry
- Conclusion
- Wikipedia is designed as an encyclopedia, NOT a database, BUT many cheminformatics groups want to use data from Wikipedia.
- Since most data are entered by a human being rather than by machine, Wikipedia can often provide a data source that is independent of the main online databases.
- Could the Wikipedia chemists make the data more accessible without compromising the project’s mission? What about DBpedia?
- The Chembox on a substance page contains standard representations such as:
    - Skeletal formula
    - IUPAC name
    - InChI and InChIKey
    - CAS no. (represents substance, not structure)
    - SMILES (proprietary but de facto standard before InChI)
- These were traditionally supplied for readers to copy/paste, but we were asked to make a machine-friendly version.
Chemboxes were originally set up as tables: OK for people, but not for data mining.

[Figure: a typical chembox from 2007]
- Now designed as a set of data fields with values entered by the editor, which is better for data extraction and for validation
- Drugboxes also redesigned
- Machine-friendly formats (SMILES, InChI, InChIKey, CAS Reg. No.) included in nearly all chemboxes
- Hide/show used to avoid table “explosions”
- Collections of Wikipedia data are now available for cheminformatics groups to use
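To illustrate why named data fields are easier to mine than a hand-built table, here is a minimal sketch of field extraction. The sample wikitext, the field names, and the helper `parse_fields` are illustrative assumptions, not the actual Chembox specification; real templates are nested and need a proper wikitext parser.

```python
import re

# Hypothetical, simplified chembox wikitext (an assumption for illustration).
SAMPLE_WIKITEXT = """{{Chembox
| Name = Water
| CASNo = 7732-18-5
| SMILES = O
| StdInChI = 1S/H2O/h1H2
| StdInChIKey = XLYOFNOQVPJJNP-UHFFFAOYSA-N
}}"""

# Matches one "| name = value" line; the name may not contain "=" or "|".
FIELD_LINE = re.compile(r"^\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")


def parse_fields(wikitext):
    """Return a dict mapping field name to value for each '| name = value' line."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


fields = parse_fields(SAMPLE_WIKITEXT)
print(fields["CASNo"])
```

Once the values sit in a dictionary like this, they can be compared against an external source field by field.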
[Figure panels: simple form; full form]
Some data (e.g., InChIs for complex molecules) can be very long, and this was a hindrance to their use in Wikipedia.
- InChI can be used to define what structure is being represented when compiling a virtual database.
- InChI can provide an unambiguous reference when validating structures on Wikipedia.
- InChIKey is useful to help those using search engines.
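Because an InChIKey has a fixed 27-character layout (14 letters, a hyphen, 10 letters, a hyphen, 1 letter), it can be sanity-checked and compared as a plain string. The sketch below assumes that layout; the function names `looks_like_inchikey` and `keys_match` are hypothetical. It checks only the shape of the key and cannot tell whether the key belongs to the right compound.

```python
import re

# Shape of an InChIKey: 14 uppercase letters, 10 uppercase letters, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def looks_like_inchikey(text):
    """True if the text has the 27-character InChIKey layout."""
    return bool(INCHIKEY_PATTERN.match(text))


def keys_match(key_a, key_b):
    """Compare two keys after trimming whitespace and normalising case."""
    a = key_a.strip().upper()
    b = key_b.strip().upper()
    return looks_like_inchikey(a) and a == b


WATER_KEY = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"
print(looks_like_inchikey(WATER_KEY))
print(keys_match(" xlyofnoqvpjjnp-uhfffaoysa-n ", WATER_KEY))
```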
PROBLEM: Table creep. Users ask for the table to include the Standard Free Energy of Hydroformylation in a Black Box.

ANSWER: Put it on a sub-page, the supplementary data page (something unique to chemistry!). It is reached by a link at the bottom of the Chembox.
How I use the key terms:

Validation => “How can I be sure the data are correct?”

Curation => an ongoing process of fixing errors
- In 2008 a data validation drive was initiated for basic chemical identifiers.
- This led to a collaboration with CAS to ensure that Wikipedia CAS Registry Numbers are correct.
- More than 3500 substances have now been validated against CAS Common Chemistry as having the correct name, structure and CAS RN.
- Other fields are now being validated.
- Validated content is indicated with a check mark.
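One cheap first-pass test on a CAS RN is its built-in check digit: the last digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The sketch below shows that arithmetic; the function name `cas_check_digit_ok` is hypothetical. It catches typos only and is no substitute for checking the number against CAS Common Chemistry, since a well-formed CAS RN can still belong to a different substance.

```python
import re

# CAS RN layout: 2 to 7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(cas_rn):
    """True if the string is a well-formed CAS RN with a consistent check digit."""
    match = CAS_PATTERN.match(cas_rn.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    expected = int(match.group(3))
    # Weight 1 for the rightmost body digit, 2 for the next, and so on.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == expected


# Water: 8*1 + 1*2 + 2*3 + 3*4 + 7*5 + 7*6 = 105, and 105 mod 10 = 5.
print(cas_check_digit_ok("7732-18-5"))
```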
Every old version of an article (identified by a RevID) is preserved for posterity, visible to all, and can potentially serve as a permanent record of a validated version.
PROBLEM: This is “the encyclopedia anyone can edit”, so anyone can change the BP of water to 200 °C.

SOLUTION: A bot patrols the pages and watches for edits to key fields. Any dubious edits are flagged with a red X (next to the data) and logged.
System developed by Dirk Beetstra (Eindhoven University of Technology). It is the only such tool on Wikipedia.
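The core idea of such a patrol can be sketched as a comparison between the validated values and the values in the current revision. This is only an illustration of the principle, not the actual bot: the function `flag_dubious_edits`, the field names, and the data layout are all assumptions.

```python
def flag_dubious_edits(validated, current, watched_fields):
    """Return (field, validated_value, current_value) for each watched
    field whose current value differs from the validated one."""
    flags = []
    for field in watched_fields:
        expected = validated.get(field)
        found = current.get(field)
        # Fields with no validated value are skipped, not flagged.
        if expected is not None and found != expected:
            flags.append((field, expected, found))
    return flags


# Hypothetical field names and values for a page about water.
validated = {"BoilingPtC": "100", "CASNo": "7732-18-5"}
current = {"BoilingPtC": "200", "CASNo": "7732-18-5"}

for field, expected, found in flag_dubious_edits(
    validated, current, ["BoilingPtC", "CASNo"]
):
    print(f"X {field}: validated {expected!r}, page now says {found!r}")
```

A flagged field is only "dubious": a human still has to decide whether the edit is vandalism or a legitimate correction.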
If anyone tries to vandalize a validated field, this will be flagged by a bot soon afterwards.
- This example received a red X 11 minutes after it was vandalized.
- In 2008-2010, around 3000 chemical structures were informally checked against CAS Common Chemistry.
- PROBLEM: Structures are loaded from an external file on Wikimedia Commons, which can be “invisibly” changed.
The bot has been modified to watch changes to the RevID of the Wikimedia Commons structure image.
A few hundred images have been validated so far.
Drugboxes are patrolled by the bot, but at present WP:PHARM is not active in formal validation. Most of the work is done by Dirk Beetstra, using official lists from data sources (e.g., ChEBI).
RSC Learn Chemistry aims to enrich RSC educational content with data from ChemSpider, then make it open for educators to contribute their own content (licensed under Creative Commons).
- Wikipedia can provide a useful “virtual database” of highly curated information on common chemicals and drugs.
- Don’t forget the data page information!
- The validation effort needs to go further, and YOUR help is very welcome!
- RSC Learn Chemistry shows that chemical data can also be used to enrich an educational site.
- Congratulations to Henry and Peter, and thanks for the invitation to speak in their symposium.
- Thanks to Antony Williams for his many contributions to both Wikipedia and Learn Chemistry.
- Thanks to Aileen Day, Lorna Thomson, Duncan McMillan and RSC Education staff, and to RSC for the funding of Learn Chemistry.
- Thanks to undergraduate student Tyson Terpstra for uploading many quiz InChIs.
- Thank you for your attention!
- All of my own content in this presentation is released under a Creative Commons BY-SA 3.0 license.
- Copyright information for images is usually attributed on the slide itself.
- Content from Wikipedia and Learn Chemistry is reused via a Creative Commons BY-SA 3.0 license. For authors, please visit the original Wikipedia page and select the “history” tab.
- Other pictures not attributed should only be my own personal pictures, also CC BY-SA 3.0.

Wikipedia Chemical Data Validation and RSC Learn Chemistry Project
