2. Chemical data in Wikipedia
Validation of Wikipedia chemical data
RSC Learn Chemistry
Conclusion
3. Wikipedia is designed as an encyclopedia, NOT a
database, BUT many cheminformatics groups want to
use data from Wikipedia
Since most data are entered by a human being, rather
than by machine, Wikipedia can often provide a data
source that is independent of the main online databases
Could the Wikipedia chemists make the data more
accessible without compromising the project’s mission?
What about DBpedia?
4. The Chembox on a substance page contains
standard representations such as
Skeletal formula
IUPAC name
InChI and InChIKey
CAS no. (represents substance, not structure)
SMILES (proprietary but de facto standard before InChI)
These were traditionally supplied for use by
readers to copy/paste, but we were asked to
make a machine-friendly version
7. Now designed as a set of data
fields with values entered by
the editor – better for data
extraction and for validation
Drugboxes also redesigned
Machine-friendly formats
(SMILES, InChI, InChIKey, CAS
Reg. No.) included in nearly all
chemboxes
Hide/show used to avoid table
“explosions”
Collections of Wikipedia data
are now available for
cheminformatics groups to use
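
To illustrate why field/value pairs are easier to extract than free-form table markup, here is a minimal Python sketch. The sample wikitext is a simplified, hypothetical layout (real Chemboxes nest fields inside sub-templates), and `parse_chembox_fields` is an illustrative helper, not part of any Wikipedia tool.

```python
import re

# Matches simple "| name = value" lines in template wikitext.
FIELD_LINE = re.compile(r"^\s*\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")

def parse_chembox_fields(wikitext):
    """Return {field: value} for every non-empty '| name = value' line."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD_LINE.match(line)
        if match and match.group(2):
            fields[match.group(1)] = match.group(2)
    return fields

# Hypothetical, flattened sample; field names are for illustration only.
sample = """{{Chembox
| IUPACName = Water
| CASNo = 7732-18-5
| StdInChI = 1S/H2O/h1H2
| StdInChIKey = XLYOFNOQVPJJNP-UHFFFAOYSA-N
| SMILES = O
}}"""

fields = parse_chembox_fields(sample)
print(fields["CASNo"])
```

A line-based parser like this only works because each value sits in its own named field; it would not cope with nested templates or multi-line values.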
9. Some data (e.g., InChIs for complex molecules)
can be very long – and this was a hindrance to
their use in Wikipedia
10. InChI can be used to define what structure is
being represented when compiling a virtual
database.
InChI can provide an unambiguous reference
when validating structures on Wikipedia
InChIKey is useful to help those using search
engines
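
As a rough sketch of how these identifiers can act as a reference, the Python below compares two standard InChI strings and checks that a value has the layout of a standard InChIKey (14 letters, 10 letters, 1 letter, separated by hyphens). The function names are hypothetical, and the layout check does not recompute the key from the structure; that would need the InChI software itself.

```python
import re

# Layout of an InChIKey: 14 + 10 + 1 uppercase letters, hyphen-separated.
INCHIKEY_LAYOUT = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def looks_like_inchikey(value):
    """True if the string has the 27-character InChIKey layout."""
    return INCHIKEY_LAYOUT.fullmatch(value.strip()) is not None

def same_structure(inchi_a, inchi_b):
    """Treat two standard InChIs as the same structure if the strings match."""
    return inchi_a.strip() == inchi_b.strip()

wikipedia_inchi = "InChI=1S/H2O/h1H2"   # water, as it might be stored on a page
reference_inchi = "InChI=1S/H2O/h1H2"   # water, from a reference source

print(same_structure(wikipedia_inchi, reference_inchi))
print(looks_like_inchikey("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))
```

The fixed length of the InChIKey is also what makes it practical as a search-engine term, whereas a full InChI can be very long.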
11. PROBLEM: Table creep – users ask for the table to
include the Standard Free Energy of Hydroformylation
in a Black Box
ANSWER: Put it on a sub-page – the supplementary
data page (something unique to chemistry!).
Click on a link from the bottom of the Chembox:
14. How I use the key terms:
Validation =>
“How can I be sure the data are correct?”
Curation => an ongoing process of fixing
errors
15. In 2008 a data validation drive was
initiated for basic chemical
identifiers
Led to a collaboration with CAS, to
ensure Wikipedia CAS registry nos.
are correct
Now 3500+ substances have
been validated against CAS Common
Chemistry, as having correct name,
structure & CAS RN
Other fields now being validated
Validated content indicated with a
check mark
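
One check that can be run before any comparison with an external source is the CAS Registry Number check digit. The Python sketch below implements that published checksum rule (the digits before the check digit are weighted 1, 2, 3, ... from the right, and the sum modulo 10 must equal the last digit). It only catches typing errors; it cannot confirm that the number belongs to the named substance, which is what the comparison with CAS Common Chemistry established. The function name is illustrative.

```python
def cas_check_digit_ok(cas_rn):
    """Return True if a CAS RN like '7732-18-5' has a valid check digit."""
    parts = cas_rn.strip().split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    # Format: 2 to 7 digits, then 2 digits, then a single check digit.
    if not (2 <= len(parts[0]) <= 7 and len(parts[1]) == 2 and len(parts[2]) == 1):
        return False
    digits = parts[0] + parts[1]
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * weight
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(parts[2])

print(cas_check_digit_ok("7732-18-5"))  # water
print(cas_check_digit_ok("7732-18-6"))  # same number with a wrong check digit
```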
16. Every old version of an article (identified by a RevID) is
preserved for posterity and is visible to all, so it can
potentially serve as a permanent record of a
validated version.
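
As a small illustration of how a RevID can be cited, the Python sketch below builds a permanent link using the `oldid` parameter of `index.php`. The RevID shown is a placeholder, not a real validated revision, and `permalink` is a hypothetical helper.

```python
from urllib.parse import urlencode

def permalink(title, rev_id, site="https://en.wikipedia.org"):
    """Build a link to one specific, unchanging revision of a page."""
    query = urlencode({"title": title, "oldid": rev_id})
    return f"{site}/w/index.php?{query}"

# 123456789 is a placeholder RevID used only for illustration.
print(permalink("Water", 123456789))
# prints https://en.wikipedia.org/w/index.php?title=Water&oldid=123456789
```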
17. PROBLEM: This is “the encyclopedia anyone
can edit” – so anyone can change the BP of
water to 200 °C.
SOLUTION: A bot patrols the pages, and
watches for edits to key fields. Any dubious
edits are flagged with a red X (next to the
data), and logged.
System developed by Dirk Beetstra
(Eindhoven University of Technology). It is
the only such tool on Wikipedia.
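
The idea of watching key fields can be sketched in a few lines of Python. This is not the bot's actual implementation, which is not described here; it is a minimal illustration that assumes the validated and current field values are already available as dictionaries. The field names and the `flag_changes` helper are hypothetical.

```python
# Fields to watch; names are illustrative, not the bot's real configuration.
WATCHED_FIELDS = ("CASNo", "StdInChI", "StdInChIKey", "BoilingPtC")

def flag_changes(validated, current, watched=WATCHED_FIELDS):
    """Compare watched fields against the validated revision.

    Returns {field: "ok" | "changed"} for each watched field that has a
    validated value. A "changed" flag corresponds to the red X.
    """
    flags = {}
    for name in watched:
        if name in validated:
            same = current.get(name) == validated[name]
            flags[name] = "ok" if same else "changed"
    return flags

validated = {"CASNo": "7732-18-5", "BoilingPtC": "100"}
current = {"CASNo": "7732-18-5", "BoilingPtC": "200"}  # vandalized value

for field, status in flag_changes(validated, current).items():
    print(field, status)
```

A flag here only means the value differs from the validated one; a human still has to decide whether the edit is vandalism or a genuine correction.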
18. If anyone tries to
vandalize a validated
field, this will be
flagged by a bot soon
afterwards.
This example
received a red X 11
minutes after it was
vandalized.
20. In 2008-2010, around 3000 chemical
structures were informally checked against
CAS Common Chemistry
PROBLEM: Structures are loaded from an
external file on Wikimedia Commons, which
can be “invisibly” changed
21. The bot has been modified to watch changes
to the RevID of the Wikimedia Commons
structure image
A few hundred images validated so far
22. Drugboxes are patrolled by
the bot, but at present
WP:PHARM is not active in
formal validation. Most of the work
is done by Dirk Beetstra, using
official lists from data
sources (e.g., ChEBI).
24. Aims to enrich RSC educational content with data
from ChemSpider, then make it open for educators
to contribute their own content (licensed under
Creative Commons)
32. Wikipedia can provide a useful “virtual
database” of highly curated information on
common chemicals and drugs.
Don’t forget the data page information!
The validation effort needs to go further –
YOUR help is very welcome!
RSC Learn Chemistry shows that chemical data
can also be used to enrich an educational site.
33. Congratulations to Henry and Peter, and thanks
for the invitation to speak in their symposium.
Thanks to Antony Williams for his many
contributions to both Wikipedia and Learn
Chemistry.
Thanks to Aileen Day, Lorna Thomson, Duncan
McMillan and RSC Education staff, and to RSC for
the funding of Learn Chemistry.
Thanks to undergraduate student Tyson Terpstra
for uploading many quiz InChIs.
Thank you for your attention!
35. All of my own content in this presentation is
released under a Creative Commons BY-SA-
3.0 license
Copyright information for images is usually
attributed on the slide itself
Content from Wikipedia and Learn Chemistry
is reused via a Creative Commons BY-SA-3.0
license. For authors, please visit the original
Wikipedia page and select the “history” tab.
Other unattributed pictures are
my own personal pictures, also released under CC BY-SA 3.0.