1. A Cabinet of Web 2.0 Scientific
Curiosities
Ian Mulvany, Product Development
Manager, Nature Publishing Group
This talk takes a tour through science-related Web 2.0 efforts and discusses
areas of the practice of science that Web 2.0 approaches can affect.
A video of this presentation will be posted at http://videolectures.net/
2. Some of the people involved
• Timo Hannay - Director Nature.com
• Jason Wilde - Publisher Physical Sciences
• Amanda Ward - Head of Platform Technologies
• Tony Hammond - Applications Architect
• Alf Eaton - Product Development Manager
• Euan Adie - Product Development Manager
• Gavin Bell - Product Development Manager
• Hilary Spencer - Product Development Manager
• Ian Mulvany - Product Development Manager
3. • Publishing Industry Facts & Figures
• Nature
• (Some) Issues that Web 2.0 can impact
• Identity and Authority
• Content Discovery
• Citizen Science
• Google Wave
• Ongoing Challenge
• The Future
6. Costs of research Source: Research
Information Network
A significant contribution to the total cost of research is the time
required for researchers to find the appropriate material for reading.
There is an opportunity here to decrease such costs through creation
of better tools for information discovery.
source http://www.rin.ac.uk/
8. • "It is intended, first, to place before the general public the grand results
of scientific work and scientific discovery"
• "to aid scientific men ... by affording them an opportunity of discussing
the various scientific questions that arise from time to time"
Norman Lockyer
Nature is principally a scientific communication company.
We have to engage with the methods of communication that are important for science.
If we started today our starting point would naturally be the web, and not a print journal.
9. (Some) Publishing
Milestones
• 1896, Wilhelm Röntgen, X-Rays
• 1925, Raymond Dart, Australopithecus africanus
• 1938, P Kapitza, Superfluidity
• 1953, J D Watson and F H C Crick, DNA
• 1985, J C Farman, B G Gardiner and J D Shanklin, Ozone
Hole
• 1995, Michel Mayor and Didier Queloz, Extra Solar Planets
• 2001, Human Genome
10. Journal Evolution
•1869 Journal Founded
•1899 Journal Makes a Profit
•1967 Peer Review
•1971 First Expansion (until 1974)
•1992 Nature Genetics
•1995 Holtzbrinck Ownership
•1995 Nature.com
•2004 Connotea
•2007 Nature Network
Peer review was only introduced in 1967, in order
to deal with a backlog of about 3000 manuscripts.
11. Our current list of publications:
http://www.nature.com/siteindex/index.html
13. 2.0
Web 2.0 is about getting and using data.
There are two aspects: the first is about lowering the barrier to participation,
and the second is about data mining the resultant information in order
to provide better services or tools.
This can also lead to a strong first mover advantage: as the network of data
or participation gets bigger, the value in the network gets bigger.
14. Web 1.0 → Web 2.0
DoubleClick → Google AdSense
Ofoto → Flickr
Akamai → BitTorrent
mp3.com → Napster
Britannica Online → Wikipedia
personal websites → blogging
evite → upcoming.org and EVDB
domain name speculation → search engine optimization
page views → cost per click
screen scraping → web services
publishing → participation
CMS → wikis
directories (taxonomy) → tagging (folksonomy)
stickiness → syndication
19. image credit sam brown, explodingdog
We should be careful not to focus on just the technology.
Building for Machines:
• Semantic Markup
• Well documented APIs
Building for Humans:
• reduce the barrier to participation
• increase the usefulness of serendipity and recommendation
20. Stay Classy, SXSW:
Building Respectful
Software
Make your software respectful.
http://panelpicker.sxsw.com/ideas/view/3691?return=%2Fideas%2Findex%2Finteractive%2Fq%3Abuilding+respectful
21. “ While scientists have gloried in the disruptive
effect that the Web is having on publishers and
libraries, with many fields strongly pushing
open publication models, we are much more
resistant to letting it be a disruptive force in
the practice of our disciplines.”
Jim Hendler
Scientists resist
Although a data-driven approach should appeal to scientists,
science changes slowly. There are a lot of implicit norms that are hard to change.
22. NIH requests all fundholders deposit their manuscripts in PubMed Central archive: 4% compliance
Nature offers to upload to PubMed Central on behalf of authors with their permission: 30% compliance
70% of scientists can’t even be bothered to say ”yes”
Scientists resist
An example of low participation in open data models is the low uptake
of deposition of articles into PubMed Central.
23. Some Issues Where Web 2.0 May Help in Science
• Identity and Reputation
• Content Discovery
• Citizen Science
24. Humans
Public Academic
Machines
This is the framework that I'm going to use to think about the topics in this
talk. These are just two dimensions against which one can look at things; there are many other
ways of looking at these issues. When putting together these slides I got interested in the
tension between machine-oriented efforts and human-oriented efforts on the web. In addition, Web 2.0
can have a big impact on public engagement with science, so I wanted to see if I could line up these two
trends together.
26. Identity on the web is a fractured thing. It makes it difficult to manage
all of the accounts that a person has, but on the other hand it makes it easy
to present different personas to different online communities.
27. 100 000
Identity is a significant and growing issue in science. Each year India produces
100 000 postdocs.
Full names are often not revealed owing to caste discrimination.
http://www.nature.com/nature/journal/v452/n7187/full/452530d.html
28. 1.1 Billion > 129
photo: Szymon Kochanski
129 surnames are shared by 1.1 billion people, 85% of the Chinese population.
Generally, identity is a self-enforcing protocol.
It works most of the time, but ... surgeon Liu Hui padded his CV with publications by another researcher
who shared his surname and initial, and rose to become an assistant dean at Tsinghua University.
Discrepancies were noticed and he was dismissed by the university in March 2006.
29. • http://www.mluvany.net
• Scopus Author ID: 6603325879
• Thomson ResearcherID: B-2805-2008
• CrossRef Contributor ID: 62.1000/182
These are currently the most commonly discussed options for managing identity within an academic
context. Each has pros and cons, and none has gained enough momentum to be universally adopted.
Nature is currently taking a wait-and-see approach, but we would like to see an open system gain adoption.
30. Why is the issue of identity important? For reputation!
31. 1619 - 1677
Henry Oldenburg, first secretary of the Royal Society, invented the practice of peer review with the
Philosophical Transactions of the Royal Society.
His own reputation suffered: he was jailed as a suspected Dutch spy
and held in the Tower of London for a while.
32. Impact Factor™
IF (year) = A/B
A = # of citations in (year) to articles the journal published in (year - 1) + (year - 2)
B = # of articles the journal published in (year - 1) + (year - 2)
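A minimal sketch of the calculation, assuming the standard two-year definition: citations received in the census year to items the journal published in the two preceding years, divided by the number of items it published in those two years. The function name and the journal figures are made up for illustration.

```python
def impact_factor(citations_in_year, articles_prev_two_years):
    """Two-year impact factor: citations divided by citable articles."""
    if articles_prev_two_years <= 0:
        raise ValueError("need at least one published article")
    return citations_in_year / articles_prev_two_years


# Hypothetical journal: 150 + 170 articles in the two previous years,
# cited 960 times in the census year.
example_if = impact_factor(960, 150 + 170)
print(example_if)  # 3.0
```

The result is a single average for the whole journal, which is why it says little about any individual article.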
33. Impact factor measures an average statistic of a single journal.
80% of citations into a journal come from 20% of articles.
General agreement that IF is a poor measure of individual article quality.
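The skew can be made concrete with a small sketch. The citation counts below are invented, chosen only so that the top 20% of articles hold 80% of the citations; `top_share` is a hypothetical helper, not a standard bibliometric function.

```python
from statistics import mean, median


def top_share(citation_counts, fraction=0.2):
    """Share of all citations received by the most-cited `fraction` of articles."""
    ranked = sorted(citation_counts, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    top_n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:top_n]) / total


# Made-up citation counts for ten articles in one journal.
counts = [60, 20, 5, 4, 3, 3, 2, 2, 1, 0]
print(mean(counts))       # journal-level average: 10
print(median(counts))     # what a typical article receives: 3.0
print(top_share(counts))  # share held by the top 20% of articles: 0.8
```

The mean is pulled up by two heavily cited articles, while the median article sits far below it.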
35. doi/10.1371/journal.pone.0004803.g007
Other metrics can also reveal the connections between the sciences.
Bollen et al. used website access data from publishers' HTTP logs to
look at how people browsed the literature. This gave a more rounded picture
than just looking at citations.
36. There is now a move away from looking only at journal-level metrics.
37.
38. Citations
time
One thing that fascinates me about citations is that they
are unidirectional.
Also there must be more citations than papers, and yet 85% of papers
receive at most 1 citation.
39. Ideas
time
They can be used to study the flow of ideas forward in time.
40. Main-path analysis and path-dependent transitions
in HistCite™-based historiograms
Journal of the American Society for Information Science and Technology (forthcoming)
Diana Lucio-Arias & Loet Leydesdorff
Amsterdam School of Communications Research (ASCoR), University of Amsterdam
Kloveniersburgwal 48, 1012 CX Amsterdam, The Netherlands.
This is the Main-Path Analysis technique, but as yet such analysis tends to
be done on a case by case basis.
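As a rough illustration of the idea, here is a sketch of one common variant of main-path analysis: weight each citation link by its search path count (the number of source-to-sink paths that run through it) and then follow the heaviest link forward from a source. This is not the HistCite implementation or the authors' exact procedure; the graph, the function name and the alphabetical tie-breaking are all assumptions made for the example.

```python
from collections import defaultdict
from functools import lru_cache


def main_path(edges):
    """Return one main path through an acyclic citation graph.

    `edges` holds (cited, citing) pairs, so arrows point forward in time.
    Each edge is weighted by its search path count (SPC): the number of
    source-to-sink paths that pass through it. Ties go to the
    alphabetically first paper.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for cited, citing in edges:
        succ[cited].append(citing)
        pred[citing].append(cited)
        nodes.update((cited, citing))

    @lru_cache(maxsize=None)
    def paths_from_sources(node):
        parents = pred.get(node, [])
        return 1 if not parents else sum(paths_from_sources(p) for p in parents)

    @lru_cache(maxsize=None)
    def paths_to_sinks(node):
        children = succ.get(node, [])
        return 1 if not children else sum(paths_to_sinks(c) for c in children)

    def spc(u, v):
        return paths_from_sources(u) * paths_to_sinks(v)

    sources = sorted(n for n in nodes if not pred.get(n))
    # Start from the source whose best outgoing link carries the most paths.
    current = max(sources, key=lambda s: max(spc(s, c) for c in succ[s]))
    path = [current]
    while succ.get(current):
        current = max(sorted(succ[current]), key=lambda c: spc(path[-1], c))
        path.append(current)
    return path


# Toy network: A is the oldest paper, F and G the most recent.
toy_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"),
             ("C", "E"), ("E", "F"), ("E", "G")]
print(main_path(toy_edges))  # ['A', 'B', 'E', 'F']
```

A production version would need to handle ties by following all equally weighted links, and would avoid recursion on very deep citation chains.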
41. 1
Cox, D.R. (1972) Regression models and
life-tables. J. Roy. Statist. Soc. B 34:
21 000
Some papers act as a kind of black hole for citations, they get into the literature
and get cited and cited and cited.
This paper has over 21 000 citations.
The mis-citations to this paper have an h-index of 12,
a level that Hirsch had concluded “…might be a typical value for advancement to tenure…”
http://network.nature.com/people/boboh/blog/2008/06/24/outdone-by-mis-prints
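For reference, the h-index is the largest h such that h items each have at least h citations. A minimal sketch follows; the interpretation that each distinct mis-printed reference is counted as if it were a separate paper is an assumption here, and the numbers are invented.

```python
def h_index(citation_counts):
    """Largest h such that h items have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Five hypothetical items with these citation counts.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

An h-index of 12 therefore means at least 12 distinct items, each cited at least 12 times.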
43. (Figure: formats plotted by ease of contributing, from easy to hard, against ease of mining, from hard to easy.
Labels: plain text, emails, hyperlinks, Twitter, views, tags, citations?, microformats (semantic web),
academic papers, Semantic Web.)
PDF sucks: academic papers are hard to create, and PDF is hard to extract
any useful information from in a programmatic way.
44. Humans Article Writing
Peer Review
Author Identification
Article Publishing
Public Academic
Machines
This is where most of the academic publishing workflow currently lives.
It is manual work that can only be done by highly trained experts.
45. XML
At Nature we are consolidating all of our article content into a single XML
database.
46. Building a delivery
infrastructure http://www.flickr.com/photos/zhzheka/
We then deliver this content via print, RSS, paper, search queries,
to a host of endpoints.
47.
48. XML
Blue - Done
Green - Done within the last year
Yellow - coming to completion
Red - deprecated
49.
50. Extensible Containers
http://www.flickr.com/photos/cherieking/
We want to be able to extend the data that we deliver.
51. XML
Medline + MESH
We pull in MeSH terms for our articles from Medline post-publication.
52. Case Study: Nature Chemistry
We have started extracting entities from our Nature Chemistry journal, and
we hope to roll this program out to other journals.
53. Serotonin
(structure diagram with labels HO, NH2, NH)
CAS – 50-67-9
SMILES – Oc1cc2c(cc1)ncc2CCN
InChI – 1S/C10H12N2O/c11-4-3-7-6-12-10-2-1-8(13)5-9(7)10/h1-2,5-6,12-13H,3-4,11H2
InChIKey – QZAYGJVTTNCVMB-UHFFFAOYSA-N
Chemistry is a visual science! Molecules:
• CAS numbers: first appeared in 1907, owned by ACS, contain no semantics
• SMILES: 1987, not unique to a compound
• InChI/InChIKey: 200/2005
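A short sketch of what can be checked on these identifiers without any chemistry toolkit. A CAS number carries a check digit but nothing about structure, and a standard InChIKey has a fixed 14-10-1 block layout of upper-case letters. The function names are hypothetical; the InChIKey in the example is the standard key for serotonin.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text):
    """True if the text has the shape of a standard InChIKey."""
    return INCHIKEY_RE.fullmatch(text.strip()) is not None


def cas_checksum_ok(cas_number):
    """Validate a CAS number's final check digit.

    The digits before the check digit are weighted 1, 2, 3, ... from right
    to left; the weighted sum modulo 10 must equal the check digit.
    """
    m = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_number.strip())
    if not m:
        return False
    digits = m.group(1) + m.group(2)
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(m.group(3))


print(cas_checksum_ok("50-67-9"))                          # True
print(looks_like_inchikey("QZAYGJVTTNCVMB-UHFFFAOYSA-N"))  # True
```

Neither check says anything about the molecule itself; both only catch transcription damage.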
54. Compound Data
(Diagram labels: Author file, CDX, GIF/PNG, 3D)
55. Enhanced compound pages offer:
Chemdraw file
CML file
View structure in 3D
Synonyms
Chemical formula
Molecular Weight
Elemental Analysis
InChI and InChIKey
SMILES string
Links to external databases
56. PubChem
InChI
ChemSpider
We can start to link from articles into databases, and vice versa.
57. (Diagram labels: XML, xpath, TXT, UIMA, Medline + MeSH, PubChem, ChemSpider)
Schematic of our current entity extraction workflow.
Initially we are extracting chemical and compound names from Nature Chemistry articles.
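The XML to text to entities flow can be sketched in a few lines. Everything specific here is an assumption: the article XML is invented (the real NPG schema is not shown), and a small dictionary lookup stands in for the UIMA annotators used in the actual workflow.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical article XML, for illustration only.
ARTICLE_XML = """<article>
  <title>Serotonin analogues</title>
  <body>
    <p>We treated the sample with <b>serotonin</b> and caffeine.</p>
    <p>No dopamine was detected.</p>
  </body>
</article>"""

# Hypothetical lexicon standing in for a trained annotator.
COMPOUNDS = {"serotonin", "caffeine", "dopamine"}


def article_text(xml_string):
    """Select body paragraphs with an XPath-style query and flatten them to text."""
    root = ET.fromstring(xml_string)
    paragraphs = root.findall("./body/p")
    return "\n".join("".join(p.itertext()) for p in paragraphs)


def find_compounds(text, lexicon):
    """Return (name, start, end) for every word found in the lexicon."""
    hits = []
    for match in re.finditer(r"[A-Za-z][A-Za-z0-9-]*", text):
        word = match.group().lower()
        if word in lexicon:
            hits.append((word, match.start(), match.end()))
    return hits


text = article_text(ARTICLE_XML)
annotations = find_compounds(text, COMPOUNDS)
print([name for name, _, _ in annotations])
```

Keeping character offsets alongside each hit is what lets an editor later review and correct the annotations against the original text.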
58. We have a bespoke interface that allows editorial curation of the
annotations.
59. <dl class="meta">
<dt>InChI</dt>
<dd class="inchi">InChI=1/C10H14N5O7P.2Na/c11-8-5-9(13-2-12-8)15(3-14-5)</dd>
</dl>
Marking up the bold numbers makes the online
version of the paper more semantic.
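Markup like this is straightforward for a machine to consume. A minimal sketch using only the Python standard library follows; the class name `InChIExtractor` is made up, and the snippet reuses the (truncated) InChI shown on the slide.

```python
from html.parser import HTMLParser


class InChIExtractor(HTMLParser):
    """Collect the text of every <dd class="inchi"> element."""

    def __init__(self):
        super().__init__()
        self._in_inchi = False
        self._buffer = []
        self.inchis = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "dd" and "inchi" in classes:
            self._in_inchi = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_inchi:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "dd" and self._in_inchi:
            # Drop whitespace introduced by line wrapping in the markup.
            self.inchis.append("".join("".join(self._buffer).split()))
            self._in_inchi = False


SNIPPET = """<dl class="meta">
<dt>InChI</dt>
<dd class="inchi">InChI=1/
C10H14N5O7P.2Na/c11-8-5-9
(13-2-12-8)15(3-14-5)</dd>
</dl>"""

parser = InChIExtractor()
parser.feed(SNIPPET)
print(parser.inchis)
```

The same few lines would work on any page that follows the convention, which is the point of agreeing on class names.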
60. Organise metadata: create good architecture so
generated data can be easily reused across a
range of applications.
http://www.flickr.com/photos/timecollapse/
We hope to be able to extend the types of entities that
we are extracting from our articles.
61. Expanding the annotation of journal
articles from Nature Chemistry to
Nature Chemical Biology and then to
all NPG journals
Creating a central NPG database of
compounds and related journal
articles
62. InChI=1S/C32H16N8.Cu/c1-2-10-18-17(9-1)25-33-26(18)38-28-21-13-5-6-14-22(21)30(35-28)40-32-24-16-8-7-15-23(24)31(36-32)39-29-20-12-4-3-11-19(20)27(34-29)37-25;/h1-16H;
(structure diagram with Cu and N atom labels)
This then makes the article a more integrated object, with
links to databases, entities and the products of scientific research.
63. There are many curated databases that look for information about domain
specific results in the literature. An example is FlyBase, which collects
information about results using the model organism Drosophila.
64. WormBase does the same for C. elegans.
Both require a large amount of human curation. Having the body of scientific
literature be semantically annotated should help with this kind of curation.
65. Sites such as ChemSpider and CrystalEye demonstrate what can be done through
data mining the literature.
66. So we have moved into a situation in which our scholarly network
can now connect to entity databases, rather than just to articles.
67. Humans Article Writing
Peer Review
Author Identification
Article Publishing
Public Academic
Entity Extraction
Machines
Article publishing hopefully becomes enriched through semantic markup and
entity extraction.
68. Getting Social
photo credit: flickr mcgeez
We can go beyond published articles and entities and look at
both other published artefacts and the social annotation that
is associated with them.
69. The amount of grey literature available in physics has grown
steadily, as displayed by submissions to the physics arXiv.
70. Nature Precedings was the first preprint server for the life sciences.
It also includes the ability to vote and comment on submissions and
provides each submission with a unique identifier.
71. PLoS have launched PLoS Currents: Influenza, built on top of Google Knol.
Both Precedings and Currents have editorial curation of content, and allow
easy publication of objects such as posters, proceedings papers and white papers.
75. The kind of information that we can capture with Connotea includes:
• full citation information
• usage patterns (when did an item get added to our DB, how many times has it been added)
• extra metadata such as tags
• potentially social network information: how many of my friends have added this item?
76. Total number of tags
Total number of unique tags
Growth in usage of the service has been steady
77. And it displays the characteristic power law behaviour of an online network.
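One crude way to check for this kind of behaviour is to fit a straight line to log(count) against log(rank); a slope near a constant negative value suggests a power law. This is a sketch only: the tag counts are synthetic, the function name is invented, and a log-log least-squares fit is a rough estimate rather than a rigorous test.

```python
import math


def rank_frequency_slope(counts):
    """Least-squares slope of log(count) against log(rank)."""
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two non-zero counts")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Synthetic tag usage counts that follow count = 1000 / rank exactly.
tag_counts = [1000 / rank for rank in range(1, 51)]
slope = rank_frequency_slope(tag_counts)
print(round(slope, 6))  # -1.0
```

Real usage data would be noisier, especially in the long tail of tags used only once or twice.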
83. http://www.connotea.org/data/user/IanMulvany
http://www.connotea.org/data/users/tag/scifoo
http://www.connotea.org/data/user/IanMulvany/tag/scifoo
http://www.connotea.org/data/user/IanMulvany/tag/science
http://www.connotea.org/data/user/IanMulvany/tag/science2.0+citation
Examples of API calls
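The user and tag URLs follow a regular pattern, so they can be assembled programmatically. The helper below is hypothetical and only builds strings in the shape of the examples on this slide (user, then optional tags joined with "+"); it makes no network request, and the service may no longer respond at these addresses.

```python
from urllib.parse import quote

BASE = "http://www.connotea.org/data"


def connotea_user_url(user, tags=()):
    """Build a data URL for one user's bookmarks, optionally filtered by tags."""
    parts = [BASE, "user", quote(user, safe="")]
    if tags:
        parts += ["tag", "+".join(quote(t, safe="") for t in tags)]
    return "/".join(parts)


print(connotea_user_url("IanMulvany"))
print(connotea_user_url("IanMulvany", ["science2.0", "citation"]))
```

Because the query lives entirely in the path, the same call can be bookmarked, shared or fetched by a script.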
84. There are plenty of other such services currently available.
Interestingly Fuzzy has the most semantically enabled technology, but is one of the least used.
85. A few start-ups are redefining the academic paper management
space. Papers is a Mac-based “iTunes” for PDFs.
86. Mendeley provides the same kind of features, with a Last.fm-style metadata scrobbling model.
87. This allows one to see data on what is being read in Mendeley libraries.
This starts to open up a new layer of information about the impact of papers
that goes beyond what can be captured by the impact factor.
88. Nature Network
Online social communities also allow us to begin to capture conversations about science.
NPG launched Nature Network in 2007, and it is one of the most active online forums for
the discussion of science.
89. It has specific features to allow members to track the conversations that they
have participated in.
90. There are 3 main local hubs, but we track the geographic location of members,
and try to connect people with other members in their neighbourhood.
91. Bringing things together
photo: flickr Thomas Hawk
Q: How do you manage all of these streams of information?
A: Aggregation is one answer (probably not the only answer).
93. Nature Blogs finds blog posts that discuss scientific articles.
Science Blogs and researchblogging.org do much the same.
94. Scintilla is another Nature product that creates recommendations based
on a user's reading habits.
95. FriendFeed aggregates discussions around resources from different sources.
It has seen widespread adoption by the scientific digerati; the Life Scientists
room is one of the most active.
96. People are using these rooms to have real-time conversations around real-time
events. This broadcasts an event and the conversations around an event to the
web. It enables real-time distant participation.
97. streamosphere.nature.com/preview.php is an aggregator for
discussions on Twitter, FriendFeed and some other lightweight user signals.
It again aggregates over a curated list of sources.
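At its simplest, aggregation over a curated list of sources is a merge of several time-ordered streams into one timeline. The sketch below assumes each source is already sorted by timestamp; the sources, messages and function name are invented, and this is not how streamosphere itself is implemented.

```python
import heapq
from datetime import datetime


def aggregate(*streams):
    """Merge time-sorted streams of (timestamp, source, text) into one timeline."""
    return list(heapq.merge(*streams, key=lambda item: item[0]))


# Hypothetical items from two curated sources, each already in time order.
twitter = [
    (datetime(2009, 8, 1, 9, 0), "twitter", "Talk starting now"),
    (datetime(2009, 8, 1, 9, 30), "twitter", "Slides are online"),
]
friendfeed = [
    (datetime(2009, 8, 1, 9, 10), "friendfeed", "Question about identity"),
]

timeline = aggregate(twitter, friendfeed)
for when, source, text in timeline:
    print(when.isoformat(), source, text)
```

The merge is lazy per stream, so new sources can be added to the curated list without re-sorting everything already collected.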
98. So now we can see a world in which the article is no longer the
only digital artefact of note. Much more of the process of science
is becoming visible through online engagement of scientists.
99. Humans Article Writing
Peer Review
Author Identification
Article Publishing
Science Blogging/Tweeting/Social Communities
SIOC
Public Academic
Entity Extraction
Machines
Social media as it exists now is problematic
- effervescent
- closed
- siloed
- unstructured
Tools like SIOC, an ontology for social media, can help draw this layer of information
to the machine.
101. Seti@home
Folding@home
“Thinking@home”
One kind of participatory science is getting users to donate their hardware.
102. 10 000 sheep, Aaron Koblin, 2006
You can also build interfaces to people, e.g. the Mechanical Turk.
The Sheep Market was created by Aaron Koblin in 2006 by getting
10 000 turks to draw sheep.
104. http://blog.doloreslabs.com/2009/05/the-programming-language-with-the-happiest-users/
Two people checking a subset of tweets can data mine Twitter for you.
We used crowdsourcing to analyse all of the comments on PLoS articles.
105. But another, more interesting version is to get people to interact directly with your data!
• stardust at home: http://stardustathome.ssl.berkeley.edu/about.php
• http://folding.stanford.edu/
• http://fold.it/portal/
• citizen science blog: http://citizensci.com/
• great backyard bird count: http://www.birdsource.org/gbbc/
106. You need to make it engaging, like the Fold it Project, or Galaxy Zoo.
Even if machines and machine learning could answer some of these questions
(like image analysis of galaxy rotation), humans can do it now. You get the scientific
benefit now, you engage the public with science now.
107. Fold it
Stardust at home
Humans Article Writing
Peer to Patent Peer Review
Galaxy Zoo
Author Identification
Article Publishing
Science Blogging/Tweeting/Social Communities
Turk SIOC RDF
Public Academic
Entity Extraction
Seti at Home
Folding at home Machines
Now we have an interesting picture, but most of the arrows in this picture
point down. Where are the efforts to make computers more friendly to people?
One pointer to how that will happen in the future is Google Wave.
108. Google Wave
photo credit: flickr prgibbs
New product from Google, launching in September 09
For the definitive guide to Google Wave look at:
http://www.youtube.com/watch?v=v_UyVmITiYQ
110. Robot
App Engine
Gadget
html5
Embed Container
(blogger)
Of interest for developers are the APIs that Wave exposes.
Naively, one can think of Robots as allowing two-way communication with
a wave, Gadgets as pulling content into a wave, and the Embed gadget
as a tool for pushing waves into other contexts, such as blogs or wikis.
111. Importantly Google intends to open source the server code allowing anyone to run a wave server, much as anyone can
run an email server.
112. Email Thread?
Document?
Game Server?
IM? Gallery?
Group?
? ?? ?
The metaphors for what Wave is have not settled down yet.
This is a consequence of the current interface; new interfaces will be possible.
The key is that Wave enables exposing 3rd party APIs to the user in a
totally opaque way. It hides the details, and makes it easier for people
to interact with computers.
113. image credit sam brown, explodingdog
Finally we can live in a world where computers and humans can be friends.
114. Fold it
Stardust at home
Humans Article Writing
Peer to Patent Peer Review
Galaxy Zoo
Author Identification
Article Publishing
Science Blogging/Tweeting/Social Communities
Turk SIOC RDF
Public Academic
WAVE
Entity Extraction
Seti at Home
Folding at home Machines
115. • http://code.google.com/p/helpmeigor/
• http://github.com/cameronneylon/ChemSpidey/tree/master
• http://github.com/IanMulvany/janey-robot/tree/master
Some scientific robots have already been created.
121. biological pathways
http://www.reactome.org/
It's a hard problem: some data sets are big and complicated.
http://www.reactome.org/ tries to visualise pathways in the
human genome.
123. • Publishers will continue to exist but will become
communication companies
• They must learn to treat the web as a network, not a
distribution channel
• Journals should be more like databases, and vice versa
• Publishing and broadcasting are merging (or colliding?); to
some extent, the same goes for publishing and software
• The disruptive forces include new economics, lower barriers
to entry, and a complex competitive environment
Final thoughts
Some predictions for scientific publishing.
124. • Mobile devices as sensors e.g. noisetube.net
• Rich web applications building on HTML 5 will be a real
competitor to the desktop
• The problem of scientific identity will be solved
• We will have a scientific recommendation engine that works
• Frameworks for programming genetic code, much like we
now program computer code, will be available
• Computers will do much of the heavy lifting of science
• http://www.nature.com/nature/focus/arts/futures
Final thoughts
Some predictions for science.
125. “The future is already here. It's just not very evenly distributed” - William Gibson
Sci Foo is an annual weekend un-conference that brings together people
doing interesting things at the interface between science, technology and culture.
Looking at what these people are doing gives us a hint of things to come.
127. Extra image
Acknowledgements
• http://www.flickr.com/people/matthewfield/ Matthew Field, Lots Of
People
• http://www.flickr.com/people/garthimage/ Garth Burgess, Southampton
Docks
• http://13c4.wordpress.com/ Pamela Bumstead, 50 reasons not to
• http://www.flickr.com/people/mayeve/ clock
• http://www.flickr.com/people/sublimelyhappy/ Sarah Gerke, Rolodex
• http://www.flickr.com/people/thedepartment/ Kate Andrews, Library
• http://www.flickr.com/people/sirstick/ Alexander Hauser, new mail
• http://commons.wikimedia.org/wiki/User:CJ The Thinker
• Gavin Bell, helpful discussions about OpenID