Prof. Martin A. Walker, SUNY Potsdam
June 27, 2013
Webinar for ACS Chemical Information Division
Introduction
Chemical substance data in Wikipedia
Other chemistry-related content
Behind the scenes:
•How articles are written
•WikiProjects
Conclusion
Overview
What is a wiki?
“A collaborative website which can be directly edited
by anyone with access to it.”
(Wiktionary, March 20, 2007)
From the Hawaiian word “wiki wiki” meaning “quick.”
Picture by
Jshapiro
WM Commons
CC license
What is Wikipedia?
Wikipedia defines itself as:
"a free, web-based, collaborative, multilingual encyclopedia
project supported by the non-profit Wikimedia Foundation."
Wikipedia logo is © Wikimedia Foundation, San Francisco, CA
Wikipedia is…
An encyclopedia in over 200
languages
An incredibly useful resource
for academia
Written by volunteers
Editable by anyone
Free to be copied, re-used
Free to use (no cost)
Operating for no profit
Wikipedia is not…
A “soapbox” or a place to
publish your own work
An authoritative resource for
academia
Written mainly by kids, or by
paid professionals
Free to re-use without
attribution
Run by a corporation
Traditional encyclopedia:
“Experts know best”
1. Editors choose an expert
2. Expert writes, based on authoritative resources
3. Editors review and check facts
Wikipedia – a new paradigm?
“Many eyes are better”
1. Volunteer writes, supposedly using authoritative resources
2. Other volunteers review and check facts
3. Ongoing process of adding content, then review
Much chemical information on the Web
is generated by machine. Wikipedia is
large, even though most information is
entered word-by-word by a human.
This means that:
• It exhibits nuances of human analysis
• Much of it first enters the Web
through Wikipedia
• It is curated by humans
• It has silly human errors!
The value to cheminformatics –
original human input
Editing Wikipedia articles
Pic by Girona7, Wikimedia Commons
CC license
Article pages describe a specific topic
• To comment on something in the article, click on the
“Discussion” tab
• To look at earlier versions, click the “History” tab
• To change the article, click the “Edit” tab – but be careful!
The Wikipedia article page
Substance articles
After a general lead section (“lede”), most
decent substance articles cover these main
areas:
• Physical & chemical properties
• Preparation
• Uses
• Identifiers, physical & chemical data (in a
Chembox)
Detailed information on safety or chemical
suppliers is considered inappropriate.
Substance articles
Wikipedia - an encyclopedia, NOT a database
• But can it be used like a database anyway?
• What about DBpedia?
Substance data in Wikipedia
The Chembox on a substance page contains standard
representations such as
•Skeletal formula
•IUPAC name
•InChI and InChIKey
•CAS no. (represents substance, not structure)
•SMILES (de facto standard before InChI)
These were traditionally supplied for use by readers to
copy/paste, but we were asked to make a machine-friendly
version
Chemboxes & Drugboxes
Chemboxes were originally set up
as tables – OK for people, but not
for data mining.
EARLY CHEMBOXES
A typical
chembox
From 2007
Some data (e.g., InChIs for complex molecules) can be very
long – and this was a hindrance to their use in Wikipedia.
TABLE EXPLOSIONS!
Now designed as a set of data fields with
values entered by the editor – better for data
extraction and for validation
Drugboxes also redesigned
Machine-friendly formats (SMILES, InChI,
InChIKey, CAS Reg. No.) included in nearly all
chemboxes
Hide/show used to avoid table “explosions”
Collections of Wikipedia data are now
available for cheminformatics groups to use
NEW CHEMBOXES
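As a rough illustration of why the field-based layout is easier to mine, the sketch below pulls "| name = value" pairs out of a Chembox-style template with a regular expression. The sample wikitext is a simplified stand-in written for this example, and the helper name parse_fields is hypothetical; real Chembox markup is nested and more varied, so a production extractor would need a proper wikitext parser.

```python
import re

# Illustrative wikitext only (an assumption for this sketch);
# real Chembox markup is nested and contains many more fields.
SAMPLE_WIKITEXT = """{{Chembox
| IUPACName = Water
| CASNo = 7732-18-5
| SMILES = O
| StdInChIKey = XLYOFNOQVPJJNP-UHFFFAOYSA-N
}}"""

# One "| name = value" pair per line; spaces around "=" are optional.
FIELD_PATTERN = re.compile(
    r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE
)


def parse_fields(wikitext: str) -> dict[str, str]:
    """Collect field/value pairs into a dict (later duplicates win)."""
    return {name: value for name, value in FIELD_PATTERN.findall(wikitext)}


fields = parse_fields(SAMPLE_WIKITEXT)
print(fields["CASNo"])
```

With the old table layout, the same values would have to be recovered from table cells whose position could differ from article to article.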
FULL FORM / SIMPLE
Current form of CHEMBOX
• InChI can be used to define what structure is being
represented when compiling a virtual database.
• InChI can provide an unambiguous reference when validating
structures on Wikipedia
• InChIKey is useful to help those using search engines
Value of the InChI and InChIKey
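A minimal sketch of how an InChIKey might be handled when compiling a virtual database: a standard InChIKey is 27 characters long, made of blocks of 14, 10 and 1 uppercase letters separated by hyphens, and the first block is a hash of the connectivity layer. The function names are hypothetical, and a format check like this says nothing about whether the key matches the structure shown in the article.

```python
import re

# 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 characters in total).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_wellformed_inchikey(key: str) -> bool:
    """Return True if the string has the shape of an InChIKey."""
    return INCHIKEY_PATTERN.fullmatch(key) is not None


def connectivity_block(key: str) -> str:
    """Return the first 14-letter block, which hashes the connectivity layer."""
    if not is_wellformed_inchikey(key):
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return key.split("-")[0]


water_key = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"  # InChIKey of water
print(is_wellformed_inchikey(water_key), connectivity_block(water_key))
```

Because the key has a fixed length and uses only letters and hyphens, it survives web search engines far better than a full InChI string does.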
PROBLEM: Table creep – a user asks for the table to include
the Standard Free Energy of Hydroformylation in a Black Box
ANSWER: Put it on a sub-page – the supplementary data page
(chemistry is unique in Wikipedia in having these!).
Click on a link from the bottom of the Chembox:
Data pages
These do have value, with some data pages having over 50,000 hits/year
Data pages
Wikipedia Drug pages
Maintained by the Pharmacology WikiProject,
which has a medicinal focus. This means that:
• Some items of interest to chemists may be
missing (though main ones are in the drugbox)
• There are no supplementary data pages with
spectral data, etc.
• At the “border” between drugs and chemicals,
there may be two similar substances that have
different boxes. For example:
• caffeine has a drugbox, but
paraxanthine has a chembox
Drugboxes
Chemical reactions
Some have ReactionBoxes
Good coverage of named organic reactions, but otherwise
coverage is patchy – Wikipedia is very weak on reactions
compared to March's Advanced Organic Chemistry
• probably because of the classic cheminformatics problem –
substances are easy to define, reactions are hard
Only a handful have ReactionBoxes. No database available based
on Wikipedia reaction articles
Typical content:
• Mechanism
• Reagents, catalysts, conditions
• Scope & limitations
• Stereochemistry
• Variations
Reaction articles
Biographical articles
• Large proportion of Wikipedia overall,
but low in chemistry – chemists tend to
be more interested in chemistry than in
people! Many more could be written.
• Mainly covers Nobel Laureates and
important historical figures, plus a few
chemists where someone has taken the
time to write an article.
• “Vanity articles” are strongly
discouraged!
Biographical articles
Variable coverage. These articles usually have no data
boxes, but many include templates to show related topics.
• Methods and equipment
• Constants, equations
• Theories and hypotheses
• Chemical families (e.g., “Aldehyde”)
• Terms used (e.g., “Coordination complex”)
• Many others – history, chemical companies, etc.
Concepts & other chemistry content
The Wikipedia community
User:Polimerek – a Polish
Wikipedian and polymer chemist
Picture from Wikimedia Polska, CC license
The lonely editor…
Most articles are started by a topic
enthusiast, and then expanded
& maintained by the community
if they are considered useful.
Picture: WM Commons, Public domain
These “Wikipedians” are
motivated by altruism and a love
of learning, and they want to
share their knowledge with the
world, for free. They can also
enjoy seeing their work read by
thousands, or even millions.
Picture by Ziko van Dijk, CC license
WikiProjects provide a place for like-minded editors to
discuss articles and organize collaborations. They also
agree on standards & templates, and assess quality.
WikiProjects
If you plan major changes to an article or articles, post a comment on the
article talk page and also on the relevant WikiProject talk page.
WikiProject talk pages – for informing
These discussions matter; the article discussed here had half a million hits
in the last year. Wikipedia’s influence may be unofficial, but it is
powerful and in many cases its definitions become the de facto standard.
…and for discussions
Types of chemistry article
WIKIPROJECT CHEMISTRY
Chemical concepts
Chemical reactions & processes
Chemists
WIKIPROJECT ELEMENTS
Chemical elements
WIKIPROJECT CHEMICALS
Chemical substances
WIKIPROJECT PHARMACOLOGY
Pharmaceuticals
WIKIPROJECT CELL & MOLECULAR BIOLOGY
Molecular biology
WikiProject Chemicals
~60 members (10-20 active)
Collaborates on writing quality
articles and standards for:
•developing data boxes for articles
•chemical naming, structure drawing
•article assessment
Data validation
Beta-Cyclodextrin
Public domain picture by Edgar181
ChemBoxes, article validation, chemical
names, structure drawing, style guide: all are
organized by the WikiProjects. Type
WP:MOSCHEM into Wikipedia to find the
Manual of Style for Chemistry.
WikiProjects collaborate to set
standards
Articles are assessed, then tagged on the talk page. A bot
compiles these assessments into lists & tables, allowing
the project to review and track their articles.
WikiProjects assess articles for quality
& importance
Type WP:ASSESS into Wikipedia to see this
Article assessment – by editors
Assessment guides article
improvement priorities
WikiTrust – to check trustworthiness of
contributions
Downloadable as an extension to Firefox, this
adds a tab above the article – click to see:
General ways to remove vandalism
Watchlists: Users watch all changes to specific pages they
care about
Huggle: Software to help Wikipedians track and remove
vandalism quickly
Bots: “Obvious” vandalism (such as deleting all content
from a page) is spotted and reverted almost immediately
by “bots” that patrol the recent edits. (Bots are scripts
that automate the editing process.)
Part of my
Watchlist from
early this morning
Collaborations for validating data
2007-present: ChemSpider and Antony Williams have a
longstanding collaboration with the Chemicals WikiProject,
aimed at curating data in both ChemSpider and Wikipedia.
2008-2010: CAS provided a
database of around 8000
substances to the Chemicals
WikiProject free of charge; this
collection was also used as the
basis for a new CAS open
access site for the general
public, CAS Common Chemistry
CAS Common Chemistry
• Launched in April 2009
• Offered as a free service to
provide CAS RNs to the public.
Since 2007 Wikipedia has collaborated with IUPAC to help
propagate IUPAC definitions.
This ensures that Wikipedia has accurate, current definitions,
and IUPAC can reach a much wider audience.
Currently, a collaboration is actively inserting IUPAC definitions
for polymer terms into articles, and editing/expanding content
as needed.
IUPAC collaboration
How I use the key terms:
Validation =>
“How can I be sure the data are correct?”
Curation => an ongoing process of fixing errors
Data validation
In 2008 a data validation drive was initiated
for basic chemical identifiers, in collaboration
with Antony Williams (ChemSpider)
Led to a collaboration with CAS, to ensure
Wikipedia CAS registry nos. are correct
Now around 3500+ substances have been
validated against CAS Common Chemistry, as
having correct name, structure & CAS RN
Other identifier fields (e.g., KEGG) have since
been validated.
Validated content indicated with a check
mark
Content validation
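One cheap pre-check before comparing against an external source is the CAS RN check digit: the last digit equals the weighted sum of the preceding digits (rightmost digit times 1, the next times 2, and so on) modulo 10. The sketch below is an illustration only, with a hypothetical function name; it catches typing errors but cannot confirm that a number belongs to the substance in the article, which is what the comparison with CAS Common Chemistry is for.

```python
import re

# Format: 2 to 7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)", re.ASCII)


def cas_check_digit_ok(cas_rn: str) -> bool:
    """Return True if the string is shaped like a CAS RN and its check digit fits."""
    match = CAS_PATTERN.fullmatch(cas_rn)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... counting from the right-hand end of the body.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == int(match.group(3))


print(cas_check_digit_ok("7732-18-5"))  # water
print(cas_check_digit_ok("7732-18-4"))  # same number with a wrong check digit
```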
Every old version of an article (identified by a RevID) is
preserved for posterity, and can potentially
serve as a permanent record of a validated version.
The approach to validation
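To show how a RevID can act as a permanent record, the sketch below builds a link to one specific revision using the oldid parameter of MediaWiki's index.php. The function name and the revision number are placeholders invented for this example, not a real validated revision.

```python
from urllib.parse import urlencode


def permalink(title: str, rev_id: int) -> str:
    """Build a URL that always shows the given revision of an article."""
    query = urlencode({"title": title, "oldid": rev_id})
    return f"https://en.wikipedia.org/w/index.php?{query}"


# 123456789 is a placeholder RevID used only for illustration.
print(permalink("Water", 123456789))
```

A link of this form keeps pointing at the same text even after the live article has been edited.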
PROBLEM: This is “the encyclopedia anyone can edit” –
so anyone can change the BP of water to 200 °C.
SOLUTION: A bot patrols the pages, and watches for
edits to key fields. Any dubious edits are flagged with a
red X (next to the data), and logged.
System developed by Dirk Beetstra (Eindhoven University of
Technology). It is the only such tool on Wikipedia.
Protecting validated fields
If anyone tries to vandalize a
validated field, this will be
flagged by a bot soon
afterwards.
• This example received a red X
11 minutes after it was
vandalized.
Validation protected by bot
Validated revisionIDs
In 2008-2010, around 3000 chemical structures
were informally checked against CAS Common
Chemistry
PROBLEM: Structures are loaded from an
external file on Wikimedia Commons, which can
be “invisibly” changed
Checking structures
The bot has been modified to watch changes to
the RevID of the Wikimedia Commons structure
image
A few hundred images validated so far
Since fall 2010
Drugboxes are patrolled by the
bot, but at present WP:PHARM is
not active in formal validation.
Most work done by Dirk
Beetstra, using official lists
from data sources (e.g., ChEBI).
Drugboxes
Type the shortcuts shown in yellow into the Wikipedia
search window
• P:CHEM takes you to the Chemistry Portal
• WP:CHEM and WP:CHEMISTRY – WikiProject pages
are often a useful place to look for guidelines and to
ask for help
• WP:MOSCHEM takes you to the Chemistry Manual of
Style – be sure to check this before making major edits
• WP:CHEAT gives a “cheat sheet” for common edits
• For general chemical information resources, Gary
Wiggins has a WikiBook available at
http://en.wikibooks.org/wiki/Chemical_Information_Sources
Useful sources
• Wikipedia can be a useful source of highly
curated information on chemistry, common
chemicals and drugs.
• WikiProjects and the Wikipedia community
play an important role in setting standards
and maintaining articles. Validation will
improve quality further.
• Don’t forget the data page information!
• The writing and the validation need to go
further –YOUR help is very welcome!
Conclusion
Thanks to Antony Williams for the invitation to
present this Webinar, and also for his many
contributions to Wikipedia.
Thanks to Dave Martinsen for moderating this
session, even while traveling!
Thanks to the Wikipedia chemists who built this
resource.
Thank you for your attention.
Acknowledgements
Picture by
Vistamommy
Flickr, CC license
Thank you for your attention
All of my own content in this presentation is released
under a Creative Commons BY-SA-3.0 license
Copyright information for images is usually attributed
on the slide itself
Content from Wikipedia and Learn Chemistry is
reused via a Creative Commons BY-SA-3.0 license.
For authors, please visit the original Wikipedia page
and select the “history” tab.
Other pictures not attributed should only be my own
personal pictures, also CC BY-SA 3.0.
Copyright information

Using Wikipedia as a source of chemical information

  • 1. Prof. Martin A. Walker, SUNY Potsdam June 27, 2013 Webinar for ACS Chemical Information Division
  • 2. Introduction Chemical substance data in Wikipedia Other chemistry-related content Behind the scenes: •How articles are written •WikiProjects Conclusion Overview
  • 3.
  • 4. What is a wiki? “A collaborative website which can be directly edited by anyone with access to it.” (Wiktionary, March 20, 2007) From the Hawaiian word “wiki wiki” meaning “quick.” Picture by Jshapiro WM Commons CC license
  • 5. What isWikipedia? Wikipedia defines itself as: "a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profitWikimedia Foundation." Wikipedia logo is © Wikimedia Foundation, San Francisco, CA
  • 6. Wikipedia is… An encyclopedia in over 200 languages An incredibly useful resource for academia Written by volunteers Editable by anyone Free to be copied, re-used Free to use (no cost) Operating for no profit Wikipedia is not… A “soapbox” or a place to publish your own work An authoritative resource for academia Written mainly by kids, or by paid professionals Free to re-use without attribution Run by a corporation
  • 7. Traditional encyclopedia: “Experts know best” 1 • Editors choose an expert 2 • Expert writes, based on authoritative resources 3 • Editors review and check facts
  • 8. Wikipedia – a new paradigm? “Many eyes are better” 1 • Volunteer writes, supposedly using authoritative resources 2 • Other volunteers review and check facts 3 • Ongoing process of adding content then review
  • 9. Much chemical information on the Web is generated by machine. Wikipedia is large, even though most information is entered word-by-word by a human. This means that: • It exhibits nuances of human analysis • Much of it first enters the Web throughWikipedia • It is curated by humans • It has silly human errors! The value to cheminformatics – original human input Editing Wikipedia articles Pic by Girona7, Wikimedia Commons CC license
  • 10. Article pages describe a specific topic  To comment on something in the article, click on the “Discussion” tab  To look at earlier versions, click the “History” tab  To change the article, click the “Edit” tab – but be careful! TheWikipedia article page
  • 11.
  • 13. After a general lead section (“lede”), most decent substance articles cover these main areas: • Physical & chemical properties • Preparation • Uses • Identifiers, physical & chemical data (in a Chembox) Detailed information on safety or chemical suppliers is considered inappropriate. Substance articles
  • 14. Wikipedia - an encyclopedia, NOT a database • But can it be used like a database anyway? • What about DBpedia? Substance data inWikipedia
  • 15. The Chembox on a substance page contains standard representations such as •Skeletal formula •IUPAC name •InChI and InChIKey •CAS no. (represents substance, not structure) •SMILES (de facto standard before InChI) These were traditionally supplied for use by readers to copy/paste, but we were asked to make a machine- friendly version Chemboxes & Drugboxes
  • 16. Chemboxes were originally set up as tables – OK for people, but not for data mining. EARLY CHEMBOXES A typical chembox From 2007
  • 17. Some data (e.g., InChIs for complex molecules) can be very long – and this was a hindrance to their use in Wikipedia. TABLE EXPLOSIONS!
  • 18. Now designed as a set of data fields with values entered by the editor – better for data extraction and for validation Drugboxes also redesigned Machine-friendly formats (SMILES, InChI, InChIKey, CAS Reg. No.) included in nearly all chemboxes Hide/show used to avoid table “explosions” Collections of Wikipedia data are now available for cheminformatics groups to use NEW CHEMBOXES
  • 20. • InChI can be used to define what structure is being represented when compiling a virtual database. • InChI can provide an unambiguous reference when validating structures on Wikipedia • InChIKey is useful to help those using search engines Value of the InChI and InChIKey
  • 21. PROBLEM:Table creep – a user asks for the table to include the Standard Free Energy of Hydroformylation in a Black Box ANSWER: Put it on a sub-page – the supplementary data page (chemistry is unique in Wikipedia in having these!). Click on a link from the bottom of the Chembox: Data pages These do have value, with some data pages having over 50,000 hits/year
  • 24. Maintained by the Pharmacology WikiProject, which has a medicinal focus. This means that: • Some items of interest to chemists may be missing (though main ones are in the drugbox) • There are no supplementary data pages with spectral data, etc. • At the “border” between drugs and chemicals, there may be two similar substances that have different boxes. For example: • caffeine has a drugbox, but paraxanthine has a chembox Drugboxes
  • 25.
  • 28. Good coverage of named organic reactions, but otherwise coverage is patchy – Wikipedia is very weak on reactions compared to March  probably because of the classic cheminformatics problem – substances are easy to define, reactions are hard Only a handful have ReactionBoxes. No database available based on Wikipedia reaction articles Typical content: • Mechanism • Reagents, catalysts, conditions • Scope & limitations • Stereochemistry • Variations Reaction articles
  • 30. • Large proportion of Wikipedia overall, but low in chemistry – chemists tend to be more interested in chemistry than in people! Many more could be written. • Mainly covers Nobel Laureates and important historical figures, plus a few chemists where someone has taken the time to write an article. • “Vanity articles” are strongly discouraged! Biographical articles
  • 31. Variable coverage. These articles usually have no data boxes, but many include templates to show related topics. • Methods and equipment • Constants, equations • Theories and hypotheses • Chemical families (e.g., “Aldehyde”) • Terms used (e.g., “Coordination complex”) • Many others – history, chemical companies, etc. Concepts & other chemistry content
  • 32. The Wikipedia community User:Polimerek – a Polish Wikipedian and polymer chemist Picture from Wikimedia Polska, CC license
  • 33. The lonely editor… Most articles are started by a topic enthusiast, and then expanded & maintained by the community if they are considered useful. Picture: WM Commons, Public domain These “Wikipedians” are motivated by altruism and a love of learning, and they want to share their knowledge with the world, for free. They can also enjoy seeing their work read by thousands, or even millions. Picture by Ziko van Dijk, CC license
  • 34. WikiProjects provide a place for like-minded editors to discuss articles and organize collaborations. They also agree on standards & templates, and assess quality. WikiProjects
  • 35. If you plan major changes to an article or articles, post a comment on the article talk page and also on the relevant WikiProject talk page. WikiProject talk pages – for informing
  • 36. These discussions matter; the article discussed here had half a million hits in the last year. Wikipedia’s influence may be unofficial, but it is powerful and in many cases its definitions become the de facto standard. …and for discussions
  • 37. Types of chemistry article WIKIPROJECT CHEMISTRY Chemical concepts Chemical reactions & processes Chemists WIKIPROJECT ELEMENTS Chemical elements WIKIPROJECT CHEMICALS Chemical substances WIKIPROJECT PHARMACOLOGY Pharmaceuticals WIKIPROJECT CELL & MOLECULAR BIOLOGY Molecular biology
  • 38. WikiProject Chemicals ~60 members (10-20 active) Collaborates on writing quality articles and standards for: •developing data boxes for articles •chemical naming, structure drawing •article assessment Data validation Beta-Cyclodextrin Public domain picture by Edgar181
  • 39. ChemBoxes, article validation, chemical names, structure drawing, style guide: all are organized by the WikiProjects. Type WP:MOSCHEM into Wikipedia to find the Manual of Style for Chemistry. WikiProjects collaborate to set standards
  • 40. Articles are assessed, then tagged on the talk page. A bot compiles these assessments into lists & tables, allowing the project to review and track their articles. WikiProjects assess articles for quality & importance
  • 41. Type WP:ASSESS into Wikipedia to see this Article assessment – by editors
  • 43. WikiTrust – to check trustworthiness of contributions. Downloadable as an extension to Firefox, this adds a tab above the article – click to see:
  • 45. General ways to remove vandalism Watchlists: Users watch all changes to specific pages they care about Huggle: Software to help Wikipedians track and remove vandalism quickly Bots: “Obvious” vandalism (such as deleting all content from a page) is spotted and reverted almost immediately by “bots” that patrol the recent edits. (Bots are scripts that automate the editing process.) Part of my Watchlist from early this morning
  • 46. Collaborations for validating data 2007-present: ChemSpider and Antony Williams have a longstanding collaboration with the Chemicals WikiProject, aimed at curating data in both ChemSpider and Wikipedia. 2008-2010: CAS provided a database of around 8000 substances to the Chemicals WikiProject free of charge; this collection was also used as the basis for a new CAS open access site for the general public, CAS Common Chemistry.
  • 47. CAS Common Chemistry • Launched in April 2009 • Offered as a free service to provide CAS RNs to the public.
  • 48. Since 2007 Wikipedia has collaborated with IUPAC to help propagate IUPAC definitions. This ensures that Wikipedia has accurate, current definitions, and IUPAC can reach a much wider audience. Currently, a collaboration is actively inserting IUPAC definitions for polymer terms into articles, and editing/expanding content as needed. IUPAC collaboration
  • 50. How I use the key terms: Validation => “How can I be sure the data are correct?” Curation => an ongoing process of fixing errors Data validation
  • 51. In 2008 a data validation drive was initiated for basic chemical identifiers, in collaboration with Antony Williams (ChemSpider). This led to a collaboration with CAS, to ensure that Wikipedia CAS Registry Numbers are correct. Now over 3500 substances have been validated against CAS Common Chemistry as having the correct name, structure & CAS RN. Other identifier fields (e.g., KEGG) have since been validated. Validated content is indicated with a check mark. Content validation
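Before comparing a CAS RN against an external source, a cheap first filter is the check digit built into every CAS Registry Number: the last digit must equal the weighted sum of the other digits (weights 1, 2, 3, … counted from the right) modulo 10. The sketch below shows only that internal consistency check; it is not the validation against CAS Common Chemistry described on the slide, and a number that passes may still belong to the wrong substance. The function name is an assumption for illustration.

```python
import re

# CAS RN layout: 2 to 7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def has_valid_cas_check_digit(cas_rn):
    """Return True if the string is well formed and its check digit matches."""
    match = CAS_PATTERN.match(cas_rn)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == check


if __name__ == "__main__":
    for candidate in ("7732-18-5", "64-17-5", "7732-18-4"):
        print(candidate, has_valid_cas_check_digit(candidate))
```

For water (7732-18-5) the weighted sum is 8 + 2 + 6 + 12 + 35 + 42 = 105, and 105 mod 10 gives the check digit 5.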
  • 52. Every old version of an article (identified by its RevID) is preserved for all to see, for posterity, and can potentially serve as a permanent record of a validated version. The approach to validation
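A specific revision can be cited with a permanent link that passes the revision number as the `oldid` parameter of Wikipedia's `index.php`. The short sketch below builds such a link; the helper name `permanent_link` and the revision number in the usage example are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://en.wikipedia.org/w/index.php"


def permanent_link(title, rev_id):
    """Build a link to one fixed revision of an English Wikipedia article."""
    query = urlencode({"title": title, "oldid": rev_id})
    return BASE_URL + "?" + query


if __name__ == "__main__":
    # 123456789 is a placeholder, not the RevID of a real validated version.
    print(permanent_link("Caffeine", 123456789))
```

Because the revision number never changes, a link of this form keeps pointing at the same text even after the live article has been edited.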
  • 53. PROBLEM: This is “the encyclopedia anyone can edit” – so anyone can change the BP of water to 200 °C. SOLUTION: A bot patrols the pages, and watches for edits to key fields. Any dubious edits are flagged with a red X (next to the data), and logged. System developed by Dirk Beetstra (Eindhoven University of Technology). It is the only such tool on Wikipedia. Protecting validated fields
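The core idea of watching key fields can be sketched as a comparison between the validated values and the values in the current revision. This is only an illustration of the principle, not the actual bot; the field names and the function `flag_dubious_edits` are assumptions made for the example.

```python
# Hypothetical field names chosen for this sketch.
WATCHED_FIELDS = ("CASNo", "SMILES", "BoilingPtC")


def flag_dubious_edits(validated, current, watched=WATCHED_FIELDS):
    """Return the watched fields whose current value differs from the validated one."""
    flagged = []
    for field in watched:
        # Only fields that have a validated value can be checked.
        if field in validated and current.get(field) != validated[field]:
            flagged.append(field)
    return flagged


if __name__ == "__main__":
    validated = {"CASNo": "7732-18-5", "BoilingPtC": "100"}
    vandalized = {"CASNo": "7732-18-5", "BoilingPtC": "200"}
    print(flag_dubious_edits(validated, vandalized))
```

In this example only the boiling point is reported, since the CAS RN still matches its validated value and no validated SMILES was recorded.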
  • 54. If anyone tries to vandalize a validated field, this will be flagged by a bot soon afterwards. • This example received a red X 11 minutes after it was vandalized. Validation protected by bot
  • 56. In 2008-2010, around 3000 chemical structures were informally checked against CAS Common Chemistry. PROBLEM: Structures are loaded from an external file on Wikimedia Commons, which can be “invisibly” changed. Checking structures
  • 57. The bot has been modified to watch changes to the RevID of the Wikimedia Commons structure image. A few hundred images validated so far. Since fall 2010
  • 58. Drugboxes are patrolled by the bot, but at present WP:PHARM is not active in formal validation. Most work done by Dirk Beetstra, using official lists from data sources (e.g., ChEBI). Drugboxes
  • 60. Type the shortcuts shown in yellow into the Wikipedia search window • P:CHEM takes you to the Chemistry Portal • WP:CHEM and WP:CHEMISTRY – WikiProject pages are often a useful place to look for guidelines and to ask for help • WP:MOSCHEM takes you to the Chemistry Manual of Style – be sure to check this before making major edits • WP:CHEAT gives a “cheat sheet” for common edits • For general chemical information resources, Gary Wiggins has a WikiBook available at http://en.wikibooks.org/wiki/Chemical_Information_Sources Useful sources
  • 61. • Wikipedia can be a useful source of highly curated information on chemistry, common chemicals and drugs. • WikiProjects and the Wikipedia community play an important role in setting standards and maintaining articles. Validation will improve quality further. • Don’t forget the data page information! • The writing and the validation need to go further – YOUR help is very welcome! Conclusion
  • 62. Thanks to Antony Williams for the invitation to present this Webinar, and also for his many contributions to Wikipedia. Thanks to Dave Martinsen for moderating this session, even while traveling! Thanks to the Wikipedia chemists who built this resource. Thank you for your attention. Acknowledgements Picture by Vistamommy Flickr, CC license
  • 63. Thank you for your attention
  • 64. All of my own content in this presentation is released under a Creative Commons BY-SA-3.0 license. Copyright information for images is usually attributed on the slide itself. Content from Wikipedia and Learn Chemistry is reused via a Creative Commons BY-SA-3.0 license. For authors, please visit the original Wikipedia page and select the “history” tab. Other pictures not attributed should only be my own personal pictures, also CC BY-SA 3.0. Copyright information