Webinar for the Chemical Information Division of the American Chemical Society. Describes the types of chemical data in Wikipedia, and how the Wikipedia community uploads and maintains them.
Using Wikipedia as a source of chemical information
1. Prof. Martin A. Walker, SUNY Potsdam
June 27, 2013
Webinar for ACS Chemical Information Division
2. Introduction
Chemical substance data in Wikipedia
Other chemistry-related content
Behind the scenes:
•How articles are written
•WikiProjects
Conclusion
Overview
4. What is a wiki?
“A collaborative website which can be directly edited
by anyone with access to it.”
(Wiktionary, March 20, 2007)
From the Hawaiian word “wiki wiki” meaning “quick.”
Picture by
Jshapiro
WM Commons
CC license
6. Wikipedia is…
An encyclopedia in over 200
languages
An incredibly useful resource
for academia
Written by volunteers
Editable by anyone
Free to be copied, re-used
Free to use (no cost)
Operating for no profit
Wikipedia is not…
A “soapbox” or a place to
publish your own work
An authoritative resource for
academia
Written mainly by kids, or by
paid professionals
Free to re-use without
attribution
Run by a corporation
7. Traditional encyclopedia:
“Experts know best”
1
• Editors choose an expert
2
• Expert writes, based on
authoritative resources
3
• Editors review and check
facts
8. Wikipedia – a new paradigm?
“Many eyes are better”
1
• Volunteer writes, supposedly
using authoritative resources
2
• Other volunteers review and
check facts
3
• Ongoing process of adding
content then review
9. Much chemical information on the Web
is generated by machine. Wikipedia is
large, even though most information is
entered word-by-word by a human.
This means that:
• It exhibits nuances of human analysis
• Much of it first enters the Web
through Wikipedia
• It is curated by humans
• It has silly human errors!
The value to cheminformatics –
original human input
Editing Wikipedia articles
Pic by Girona7, Wikimedia Commons
CC license
10. Article pages describe a specific topic
To comment on something in the article, click on the
“Discussion” tab
To look at earlier versions, click the “History” tab
To change the article, click the “Edit” tab – but be careful!
The Wikipedia article page
13. After a general lead section (“lede”), most
decent substance articles cover these main
areas:
• Physical & chemical properties
• Preparation
• Uses
• Identifiers, physical & chemical data (in a
Chembox)
Detailed information on safety or chemical
suppliers is considered inappropriate.
Substance articles
14. Wikipedia - an encyclopedia, NOT a database
• But can it be used like a database anyway?
• What about DBpedia?
Substance data in Wikipedia
15. The Chembox on a substance page contains standard
representations such as
•Skeletal formula
•IUPAC name
•InChI and InChIKey
•CAS no. (represents substance, not structure)
•SMILES (de facto standard before InChI)
These were traditionally supplied for readers to
copy and paste, but we were asked to make a
machine-friendly version
Chemboxes & Drugboxes
16. Chemboxes were originally set up
as tables – OK for people, but not
for data mining.
EARLY CHEMBOXES
A typical
chembox
From 2007
17. Some data (e.g., InChIs for complex molecules) can be very
long – and this was a hindrance to their use in Wikipedia.
TABLE EXPLOSIONS!
18. Now designed as a set of data fields with
values entered by the editor – better for data
extraction and for validation
Drugboxes also redesigned
Machine-friendly formats (SMILES, InChI,
InChIKey, CAS Reg. No.) included in nearly all
chemboxes
Hide/show used to avoid table “explosions”
Collections of Wikipedia data are now
available for cheminformatics groups to use
NEW CHEMBOXES
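The "set of data fields" design is what makes extraction straightforward. A minimal sketch in Python of pulling name/value pairs out of template-style wikitext follows. The sample text and its field names (Name, CASNo, SMILES, StdInChIKey) are illustrative assumptions modeled on a chembox; real chemboxes nest their fields in sections, which this sketch does not handle.

```python
import re

# Matches lines of the form "| FieldName = value" (a simplifying assumption;
# real chemboxes nest fields inside section sub-templates).
FIELD_LINE = re.compile(r"^\s*\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")


def parse_template_fields(wikitext: str) -> dict[str, str]:
    """Collect '| name = value' lines from template wikitext into a dict."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD_LINE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


# Illustrative sample only; not copied from a live Wikipedia page.
sample = """{{Chembox
| Name = Water
| CASNo = 7732-18-5
| SMILES = O
| StdInChIKey = XLYOFNOQVPJJNP-UHFFFAOYSA-N
}}"""

fields = parse_template_fields(sample)
print(fields["CASNo"])
```

With the old table layout, the same extraction would have needed page-specific scraping. With named fields, one loop covers every substance page that follows the template.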
20. • InChI can be used to define what structure is being
represented when compiling a virtual database.
• InChI can provide an unambiguous reference when validating
structures on Wikipedia
• InChIKey is useful to help those using search engines
Value of the InChI and InChIKey
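As a rough illustration of why the InChIKey suits search and matching: it is a fixed-length string of three hyphen-separated blocks of uppercase letters (14, 10 and 1 characters), so it can be shape-checked and compared as plain text. The Python sketch below tests only that shape. It does not recompute the key from a structure, and the function names are hypothetical.

```python
import re

# Shape of a standard InChIKey: 14 letters, 10 letters, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def looks_like_inchikey(text: str) -> bool:
    """Return True if the text has the shape of an InChIKey (format check only)."""
    return bool(INCHIKEY_PATTERN.match(text.strip()))


def connectivity_block(inchikey: str) -> str:
    """Return the first block, which is derived from the connectivity layers."""
    return inchikey.split("-")[0]


water_key = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"  # InChIKey for water
print(looks_like_inchikey(water_key))
print(connectivity_block(water_key))
```

Comparing only the first block is one way to group records that share a skeleton while differing in stereochemistry or isotopes, which can help when compiling a virtual database from several sources.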
21. PROBLEM: Table creep – a user asks for the table to include
the Standard Free Energy of Hydroformylation in a Black Box
ANSWER: Put it on a sub-page – the supplementary data page
(chemistry is unique in Wikipedia in having these!).
Click on a link from the bottom of the Chembox:
Data pages
These do have value, with some data pages having over 50,000 hits/year
24. Maintained by the Pharmacology WikiProject,
which has a medicinal focus. This means that:
• Some items of interest to chemists may be
missing (though main ones are in the drugbox)
• There are no supplementary data pages with
spectral data, etc.
• At the “border” between drugs and chemicals,
there may be two similar substances that have
different boxes. For example:
• caffeine has a drugbox, but
paraxanthine has a chembox
Drugboxes
28. Good coverage of named organic reactions, but otherwise
coverage is patchy. Wikipedia is very weak on reactions
compared to March's Advanced Organic Chemistry,
probably because of the classic cheminformatics problem:
substances are easy to define, reactions are hard
Only a handful have ReactionBoxes. No database based on
Wikipedia reaction articles is available
Typical content:
• Mechanism
• Reagents, catalysts, conditions
• Scope & limitations
• Stereochemistry
• Variations
Reaction articles
30. • Large proportion of Wikipedia overall,
but low in chemistry – chemists tend to
be more interested in chemistry than in
people! Many more could be written.
• Mainly covers Nobel Laureates and
important historical figures, plus a few
chemists where someone has taken the
time to write an article.
• “Vanity articles” are strongly
discouraged!
Biographical articles
31. Variable coverage. These articles usually have no data
boxes, but many include templates to show related topics.
• Methods and equipment
• Constants, equations
• Theories and hypotheses
• Chemical families (e.g., “Aldehyde”)
• Terms used (e.g., “Coordination complex”)
• Many others – history, chemical companies, etc.
Concepts & other chemistry content
33. The lonely editor…
Most articles are started by a topic
enthusiast, and then expanded
& maintained by the community
if they are considered useful.
Picture: WM Commons, Public domain
These “Wikipedians” are
motivated by altruism and a love
of learning, and they want to
share their knowledge with the
world, for free. They can also
enjoy seeing their work read by
thousands, or even millions.
Picture by Ziko van Dijk, CC license
34. WikiProjects provide a place for like-minded editors to
discuss articles and organize collaborations. They also
agree on standards & templates, and assess quality.
WikiProjects
35. If you plan major changes to an article or articles, post a comment on the
article talk page and also on the relevant WikiProject talk page.
WikiProject talk pages – for informing
36. These discussions matter; the article discussed here had half a million hits
in the last year. Wikipedia’s influence may be unofficial, but it is
powerful and in many cases its definitions become the de facto standard.
…and for discussions
37. Types of chemistry article
WIKIPROJECT CHEMISTRY
Chemical concepts
Chemical reactions & processes
Chemists
WIKIPROJECT ELEMENTS
Chemical elements
WIKIPROJECT CHEMICALS
Chemical substances
WIKIPROJECT PHARMACOLOGY
Pharmaceuticals
WIKIPROJECT CELL & MOLECULAR BIOLOGY
Molecular biology
38. WikiProject Chemicals
~60 members (10-20 active)
Collaborates on writing quality
articles and standards for:
•developing data boxes for articles
•chemical naming, structure drawing
•article assessment
Data validation
Beta-Cyclodextrin
Public domain picture by Edgar181
39. ChemBoxes, article validation, chemical
names, structure drawing, style guide: all are
organized by the WikiProjects. Type
WP:MOSCHEM into Wikipedia to find the
Manual of Style for Chemistry.
WikiProjects collaborate to set
standards
40. Articles are assessed, then tagged on the talk page. A bot
compiles these assessments into lists & tables, allowing
the project to review and track their articles.
WikiProjects assess articles for quality
& importance
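The compilation step the bot performs can be pictured as a simple tally. The Python sketch below counts articles per (quality, importance) pair. The article list and its ratings are made-up illustrative data, the class labels follow Wikipedia's usual assessment scale, and this is not the actual bot's code.

```python
from collections import Counter

# Illustrative (article, quality, importance) records; not real assessment data.
assessments = [
    ("Water", "B", "Top"),
    ("Caffeine", "GA", "High"),
    ("Paraxanthine", "Start", "Low"),
    ("Aldehyde", "B", "Top"),
]


def tally_assessments(records):
    """Count articles for each (quality, importance) combination."""
    return Counter((quality, importance) for _, quality, importance in records)


table = tally_assessments(assessments)
for (quality, importance), count in sorted(table.items()):
    print(f"{quality:>5} / {importance:<4}: {count}")
```

A table of this kind lets a project see at a glance where high-importance articles are still rated at a low quality class.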
43. WikiTrust – to check trustworthiness of
contributions
Downloadable as an extension to Firefox, this
adds a tab above the article – click to see:
45. General ways to remove vandalism
Watchlists: Users watch all changes to specific pages they
care about
Huggle: Software to help Wikipedians track and remove
vandalism quickly
Bots: “Obvious” vandalism (such as deleting all content
from a page) is spotted and reverted almost immediately
by “bots” that patrol the recent edits. (Bots are scripts
that automate the editing process.)
Part of my
Watchlist from
early this morning
46. Collaborations for validating data
2007-present: ChemSpider and Antony Williams have a
longstanding collaboration with the Chemicals WikiProject,
aimed at curating data in both ChemSpider and Wikipedia.
2008-2010: CAS provided a
database of around 8000
substances to the Chemicals
WikiProject free of charge; this
collection was also used as the
basis for a new CAS open
access site for the general
public, CAS Common Chemistry
48. Since 2007 Wikipedia has collaborated with IUPAC to help
propagate IUPAC definitions.
This ensures that Wikipedia has accurate, current definitions,
and IUPAC can reach a much wider audience.
Currently, a collaboration is actively inserting IUPAC definitions
for polymer terms into articles, and editing/expanding content
as needed.
IUPAC collaboration
50. How I use the key terms:
Validation =>
“How can I be sure the data are correct?”
Curation => an ongoing process of fixing errors
Data validation
51. In 2008 a data validation drive was initiated
for basic chemical identifiers, in collaboration
with Antony Williams (ChemSpider)
Led to a collaboration with CAS, to ensure
Wikipedia CAS registry nos. are correct
Now more than 3500 substances have been
validated against CAS Common Chemistry, as
having correct name, structure & CAS RN
Other identifier fields (e.g., KEGG) have since
been validated.
Validated content indicated with a check
mark
Content validation
52. Every old version of an article (identified by a RevID) is
preserved for posterity, and can potentially
serve as a permanent record of a validated version.
The approach to validation
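Because a RevID identifies one fixed version, a link that includes it keeps pointing at the validated text however the live article changes. The Python sketch below builds such a link using the standard MediaWiki `index.php?title=...&oldid=...` form. The function name is hypothetical and the RevID shown is a placeholder, not a real validated revision.

```python
from urllib.parse import urlencode


def permanent_link(title: str, rev_id: int, host: str = "en.wikipedia.org") -> str:
    """Build a link to one specific revision of an article."""
    query = urlencode({"title": title, "oldid": rev_id})
    return f"https://{host}/w/index.php?{query}"


# 123456789 is a placeholder RevID used purely for illustration.
print(permanent_link("Water", 123456789))
```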
53. PROBLEM: This is “the encyclopedia anyone can edit” –
so anyone can change the BP of water to 200 °C.
SOLUTION: A bot patrols the pages, and watches for
edits to key fields. Any dubious edits are flagged with a
red X (next to the data), and logged.
System developed by Dirk Beetstra (Eindhoven University of
Technology). It is the only such tool on Wikipedia.
Protecting validated fields
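The core check can be sketched as a comparison between the validated values and the values in the latest edit. The Python below is an illustration of the idea only, not the bot's actual implementation. The field names and the `flag_changed_fields` helper are hypothetical.

```python
def flag_changed_fields(validated, current, watched):
    """Return the watched field names whose current value differs from the validated one."""
    return [name for name in watched if current.get(name) != validated.get(name)]


# Hypothetical field names and values, for illustration.
validated = {"BoilingPtC": "100", "CASNo": "7732-18-5"}
after_edit = {"BoilingPtC": "200", "CASNo": "7732-18-5"}

flagged = flag_changed_fields(validated, after_edit, ["BoilingPtC", "CASNo"])
print(flagged)  # fields that would receive a red X and a log entry
```

Flagging rather than reverting leaves the decision to a human editor, since a change to a validated field may be a legitimate correction.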
54. If anyone tries to vandalize a
validated field, this will be
flagged by a bot soon
afterwards.
• This example received a red X
11 minutes after it was
vandalized.
Validation protected by bot
56. In 2008-2010, around 3000 chemical structures
were informally checked against CAS Common
Chemistry
PROBLEM: Structures are loaded from an
external file on Wikimedia Commons, which can
be “invisibly” changed
Checking structures
57. The bot has been modified to watch changes to
the RevID of the Wikimedia Commons structure
image
A few hundred images validated so far
Since fall 2010
58. Drugboxes are patrolled by the
bot, but at present WP:PHARM
is not active in formal validation.
Most work done by Dirk
Beetstra, using official lists
from data sources (e.g., ChEBI).
Drugboxes
60. Type the shortcuts shown in yellow into the Wikipedia
search window
• P:CHEM takes you to the Chemistry Portal
• WP:CHEM and WP:CHEMISTRY – WikiProject pages
are often a useful place to look for guidelines and to
ask for help
• WP:MOSCHEM takes you to the Chemistry Manual of
Style – be sure to check this before making major edits
• WP:CHEAT gives a “cheat sheet” for common edits
• For general chemical information resources, Gary
Wiggins has a WikiBook available at
http://en.wikibooks.org/wiki/Chemical_Information_Sources
Useful sources
61. • Wikipedia can be a useful source of highly
curated information on chemistry, common
chemicals and drugs.
• WikiProjects and the Wikipedia community
play an important role in setting standards
and maintaining articles. Validation will
improve quality further.
• Don’t forget the data page information!
• The writing and the validation need to go
further – YOUR help is very welcome!
Conclusion
62. Thanks to Antony Williams for the invitation to
present this Webinar, and also for his many
contributions to Wikipedia.
Thanks to Dave Martinsen for moderating this
session, even while traveling!
Thanks to the Wikipedia chemists who built this
resource.
Thank you for your attention.
Acknowledgements
Picture by
Vistamommy
Flickr, CC license
64. All of my own content in this presentation is released
under a Creative Commons BY-SA-3.0 license
Copyright information for images is usually given
on the slide itself
Content from Wikipedia and Learn Chemistry is
reused via a Creative Commons BY-SA-3.0 license.
For authors, please visit the original Wikipedia page
and select the “history” tab.
Any pictures not attributed should be my own
personal pictures, also released under CC BY-SA 3.0.
Copyright information