Why would a publisher care about open data?
1. Why would a publisher care about open data?
Anita de Waard
November 2019
2. Why would a publisher care about open data?
What do we mean by open?
What do we mean by data?
What do we mean by a publisher?
3. data
Data, after all, is stuff machines can handle […]
we could create a world in which it would be programs
-- not just people -- that would enjoy the data.
For data, as for documents, the value of any part of the web is
increased by the amount of other stuff out there.
For documents it is the ability to follow links,
but for open data it is the ability to also interconnect and join,
to summarise and compare, to monitor, extrapolate, to infer.
Tim Berners-Lee, 2009
NOW!
• Provenance of data: STAR Methods at Cell
• Contributor Roles (CRediT) taxonomy
• Citation and linking to data and software
• Versioned linking to data & software
REAGENT/RESOURCE | SOURCE | IDENTIFIER
Antibodies
Rabbit monoclonal anti-Snail | Cell Signaling Technology | Cat#3879S; RRID: AB_2255011
Mouse monoclonal anti-Tubulin (clone DM1A) | Sigma-Aldrich | Cat#T9026; RRID: AB_477593
Rabbit polyclonal anti-BMAL1 | This paper | N/A
Bacterial and Virus Strains
pAAV-hSyn-DIO-hM3D(Gq)-mCherry | Krashes et al., 2011 | Addgene AAV5; 44361-AAV5
AAV5-EF1a-DIO-hChR2(H134R)-EYFP | Hope Center Viral Vectors Core | N/A
Cowpox virus Brighton Red | BEI Resources | NR-88
Zika-SMGC-1, GENBANK: KX266255 | Isolated from patient (Wa 2016) | N/A
Staphylococcus aureus | ATCC | ATCC 29213
Streptococcus pyogenes: M1 serotype strain: strain SF370; M1 GAS | ATCC | ATCC 700294
Biological Samples
Healthy adult BA9 brain tissue | University of Maryland Brain & Tissue Bank | Cat#UMB1455
4. 19.11.2019
Elsevier Data Solutions for Research
open
Scholix: a Linked Open Data hub to connect papers and datasets
Research Object Composer: an Open Source editor for Research Objects
5. a publisher
What does a publisher even do anymore?
[Figure: citation graph spanning 1977 to 2008; edges mean "cites"; nodes are marked as new or existing]
Example 1: Human papilloma virus causes cervical cancer
6. What does a publisher even do anymore?
Example 2: Top 20 universities in Quantum Computing
7. Why publishers care about open science:
Linear supply chains are evolving into complex, dynamic and connected value webs
[Figure: a linear chain (Researcher, Author, Editor/Publishers, Reader/User; Data, Results, Article, UI) next to a network of interlinked articles, tools, data and users]
Today: linear supply chains
Model: Castle
• Goal: selling content
• Metrics: number of units sold
• Strategy: optimize content delivery to users
• Win by reputation
The future: networked open science
Model: Marketplace
• Goal: grow number of interactions
• Metrics: number of interactions between users
• Strategy: optimize number of network interactions
• Win by trust
8. Extra Slides:
1. Elsevier in numbers
2. Research Data Management
3. Research Object Composer
4. Entellect and Life Science Solutions
5. Data analytics: Quantum Computing
6. Elsevier and Open Science
10. Elsevier by the numbers
25,000: Our products are used at more than 25,000 academic and government institutes globally
14+ m: people a month use ScienceDirect, our flagship platform for academic research
320+: Reaxys®'s ML capability enables the chemistry of drug discovery and materials innovation for over 320 pharma innovators, 130 chemical companies, and over 1100 […]
7,500: Elsevier has 7,500 employees and serves customers in over 180 countries.
430,000: Elsevier publishes 430,000 peer-reviewed articles annually
9 m: Mendeley is a scientific social media platform that enables around 9 million users worldwide to organize, write, collaborate and promote their […]
12. Elsevier Data Solutions for Research
Stages: Create & Collect, Store, Control, Collaborate, Analyze, Disseminate
Activities: Collect, Create, Extract, Store, Secure, Manage, Control, Workspaces, Researchers, Data sets, Search, Integrate, Analyze, Share, Publish, Archive
Entellect™, MACRO EDC, Hivebench, GDPR
13. How we deliver
1. Open system: through open APIs, modules can be integrated with other RDM tools
2. Data remains private at, or owned by, the institution
3. System is integrated with the researcher workflows, to ensure simple and clear use
4. Researchers continue to work the same way, avoiding additional bureaucracy and administration
14. Data Search
Retrieve active data, discover public data
Discover data
• 10 million+ datasets indexed from more than
35 repositories
• Deep indexing of data significantly enhances
the relevancy of results
• Keyword search within data files
• Filter search results by specific author,
institution, journal, subject category
Retrieve active data*
• Navigate to locally held institutional data
• Powerful keyword search and filtering
15. Data Manager
Researchers can
• Share data privately within a research project
• Invite external collaborators to join a project
• Gather research data from data sources as it’s
generated (including ELNs)
• Annotate research data with detailed, subject-
specific metadata
• Curate data according to project or institutional
workflows
• Prepare to publish data on a repository of your
choice
• Open APIs allow tailored upload forms, automated workflows, and analysis and re-upload of data files
Go from raw files to active datasets
16. Data Repository
Researchers can
• Store up to 100 GB of data per
dataset in many formats
• Describe how experiments can be
reproduced
• Keep track of dataset versions
• Create a DOI for citation (or university prefix)
Store datasets in a secure and trusted repository
17. Data Monitor
Institutions can
• Keep track of data inside
and outside institution
• Achieve credibility,
visibility and integrity of
key research outputs
• Maintain visibility of
events in RDM space
• Improve researchers' adoption of data sharing tools
• Communicate value of
data sharing to
researchers during the
research process
Encourage and monitor compliance
18. Five Facts about Elsevier and Research Data
Fact #1 Elsevier’s Mendeley Data supports the entire lifecycle of research data
The 5 modules that make up Mendeley Data are specifically designed to utilize data to its fullest potential, simplifying and enhancing current ways of working.
Fact #2 Researchers and institutions own and control all the data
Mendeley Data allows researchers to keep data private, or publish it under one of 16 open data licenses, so they stay in full control
Fact #3 Mendeley Data is an open system
It is a flexible platform: modules are designed to be used together, standalone, or combined with other Elsevier and non-Elsevier solutions
Fact #4 Mendeley Data can increase the exposure and impact of research
Mendeley Data Search indexes over 10 million datasets from more than 35
repositories
Fact #5 Elsevier is an active participant in the open data community
Elsevier partners with the open data community, and is currently working on
more than 20 projects globally
19. Mendeley Data already integrates through open APIs with the global Research Data Management ecosystem, as well as other Elsevier solutions
+ 35 repositories
(BePress planned)
• Mendeley Data Repository
datasets are automatically
synced with the Pure
curation workflow
• Projects, grants,
equipment, showcase
on portal (planned)
• Mendeley Data Search results
are visible on Scopus
• Notify new articles to Monitor
for data sharing compliance
• Datasets appear as records
on Scopus (planned)
• Mendeley Data usage is
accessible through Plum API
and widget
• PlumX metrics (citations,
usage, social mentions) are
captured and shown on
Mendeley Data Repository
Publish datasets
alongside an article
on Mendeley Data
within the SSRN
publication flow
Publish or link datasets
alongside an article on
Mendeley Data within the
ScienceDirect publication flow
Researcher and
Institutional
Dataset metrics
• User identity & login
• Library (planned)
• Notes (planned)
• Projects (planned)
Existing integration
Planned integration
• Mendeley Data indexed
by OpenAIRE index
• OpenAIRE Zenodo
repository indexed by
Mendeley Data Search
Long-term
preservation of
published datasets
Links between articles and datasets:
• Contributed by Mendeley
Data to Scholix
• Indexed by Mendeley Data
Search and Data Monitor
• Consumed by Scopus and
ScienceDirect
Integrate with machine-readable data management plans
• For more than 35 repositories the
metadata as well as the underlying
datasets are indexed by Mendeley
Data Search
• First repositories are actively
integrating with the free and open
‘push API’ of Mendeley Data
Search
• Mint DOIs for Mendeley Data
Repository
• DataCite indexed by
Mendeley Data Search
21. Building an open interoperable data ecosystem:
Aggregates
link things together
Annotations
about things & their
relationships
Container
Packaging content & links:
Zip files, BagIt, Docker images
Identification
locate things
regardless where
22. Building an open interoperable data ecosystem:
database
Open
repository
Workflow Tool
Task 1
Workflow
Input
Task 2
Task 3
Output
Research Object Composer
http://www.researchobject.org
Research Object Profiler
Add annotation and
relationships (metadata)
to collection to describe a
research object:
- URI
- Length
- Filename
- Checksums
etc.
Research Object Serializer
(a manifest itemizing file names)
Serialise Research Object in a standard format based on BagIt
[Figure: items 1, 2 and 3 combined into a Research Object (RO); Open API; Mendeley Data holding the RO]
• DOIs
• Metadata
(Findability)
• Open repo
(Accessibility)
• Versioning
• RO Standard
(Interoperability, Reusability)
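The Profiler and Serializer steps on this slide (describe each item by URI, length, filename and checksum, then itemize the files in a manifest) can be sketched as follows. This is a minimal illustration, not the RO Composer's own code: the function names, the `base_uri` value and the in-memory file contents are hypothetical, and only the manifest line layout (checksum followed by a path under `data/`) follows the BagIt convention.

```python
import hashlib
from typing import Dict, List


def describe_files(files: Dict[str, bytes], base_uri: str) -> List[dict]:
    """Build one metadata record per file: URI, filename, length, checksum."""
    records = []
    for name in sorted(files):
        content = files[name]
        records.append({
            "uri": f"{base_uri}/{name}",  # hypothetical identifier scheme
            "filename": name,
            "length": len(content),       # size in bytes
            "sha256": hashlib.sha256(content).hexdigest(),
        })
    return records


def bagit_style_manifest(records: List[dict]) -> str:
    """Itemize files as '<checksum>  data/<filename>' lines, BagIt style."""
    return "".join(f"{r['sha256']}  data/{r['filename']}\n" for r in records)


# Hypothetical payload: three items bundled into one research object.
payload = {
    "input.csv": b"a,b\n1,2\n",
    "workflow.txt": b"task 1 -> task 2 -> task 3\n",
    "output.csv": b"sum\n3\n",
}
records = describe_files(payload, "https://example.org/ro/demo")
manifest = bagit_style_manifest(records)
print(manifest)
```

A real research object would read the files from disk and add the annotations and relationships required by the Research Object specifications; the sketch only covers the per-file metadata and the manifest.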
23. Purpose of the Research Object Composer*:
• The RO Composer is not a registry of research objects, but it can list research objects currently under construction.
• The RO Composer is a microservice whose responsibility is to help other services create and deposit research objects.
• The composer acts as a temporary construction site that can be completed by multiple services (e.g. a data management system, a workflow system, a user interface).
• These clients will be jointly building a Research Object that can then be validated according to the schema, before the RO is downloaded or deposited into an archive (like Zenodo or Mendeley Data).
• Clients of the RO Composer are applications (driven by a user interface) or agents (engaged automatically from other events, e.g. a workflow run).
• The RO Composer is not a required component for this: any software may generate research objects by following Research Object specifications.
* From: https://github.com/ResearchObject/research-object-composer/blob/master/introduction.ipynb
27. Human Papilloma Virus and Cervical Cancer
1946: Papanicolaou develops PAP smear
1976: zur Hausen proposes link between HPV and Cervical Cancer
2006: Gardasil HPV vaccine approved
2008: zur Hausen awarded Nobel Prize
Study impact of intervening research in this talk
28. Early Work
1977
“a hypothesis has been presented that the virus
found in genital warts may be involved in the etiology
of human genital cancer”
30. Citation Mapping Process
• Build corpus of papers using broad search (~20,000 papers) on all aspects of cervical cancer and HPV
• Expand corpus by adding all cited works not in the original corpus
• Add cited works from the cited corpus (“grandchild” references)
• Connect the discrete steps of scientific advance by linking the works
• Apply graph mathematics to find all connected paths
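The corpus expansion described above (add cited works, then the works those cite) amounts to following citation links for two hops from the seed set. A minimal sketch, assuming the citation data is available as a mapping from each paper to the works it cites; the function name and the toy identifiers are hypothetical, not the tooling used for the talk.

```python
from typing import Dict, Iterable, Set


def expand_corpus(seed: Iterable[str],
                  cites: Dict[str, Set[str]],
                  levels: int = 2) -> Set[str]:
    """Return the seed papers plus everything cited within `levels` hops."""
    corpus = set(seed)
    frontier = set(seed)
    for _ in range(levels):
        # Gather works cited by the current frontier.
        newly_cited: Set[str] = set()
        for paper in frontier:
            newly_cited |= cites.get(paper, set())
        # Keep only works not already in the corpus; they form the next level.
        frontier = newly_cited - corpus
        corpus |= frontier
    return corpus


# Toy data: paper -> works it cites (identifiers are made up).
toy_cites = {
    "A": {"B", "C"},
    "B": {"D"},
    "C": {"D", "E"},
    "D": {"F"},
}
corpus = expand_corpus(["A"], toy_cites, levels=2)
print(sorted(corpus))  # ['A', 'B', 'C', 'D', 'E']
```

With two levels, "F" stays out of the corpus because it is three hops from the seed.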
31. Assembling The Graph
• Dense interconnected web of citations
• Filter for only cited works within 3 years of the citing work: building on relevant knowledge
[Figure: corpus with first-level and second-level cited works; recognize identities in graph]
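The three-year filter can be expressed as a condition on each citation edge. A minimal sketch, assuming every paper has a known publication year and that "within 3 years" means the cited work is zero to three years older than the citing work; the function name, the inclusive bounds and the toy data are assumptions.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (citing paper, cited paper)


def filter_recent_citations(edges: List[Edge],
                            year: Dict[str, int],
                            max_gap: int = 3) -> List[Edge]:
    """Keep edges whose cited work is at most `max_gap` years older."""
    kept = []
    for citing, cited in edges:
        if citing not in year or cited not in year:
            continue  # skip edges with unknown publication years
        gap = year[citing] - year[cited]
        if 0 <= gap <= max_gap:
            kept.append((citing, cited))
    return kept


# Toy data with made-up identifiers and years.
toy_year = {"P1": 1977, "P2": 1979, "P3": 1985, "P4": 1986}
toy_edges = [("P2", "P1"), ("P3", "P1"), ("P4", "P3"), ("P4", "P2")]
kept = filter_recent_citations(toy_edges, toy_year)
print(kept)  # [('P2', 'P1'), ('P4', 'P3')]
```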
32. Building the Corpus
'papillomaviridae' AND 'cancer' AND [article]/lim - 2,747 results from 1975-2019
• 55,414 references total cited in this set
• 29,064 unique references (the references overlap) 1870-2019
• 719,470 references cited in this set of 29,064 papers
• 259,908 unique in this set.
Total corpus of work using this method is 182,402 unique articles
• Citation network has 103,443 edges
33. Path Finding
Select “interesting” endpoints
• Significant starting point: proposal that HPV could be related to cancer
• Significant endpoint: recognition of HPV/cancer connection
Use graph traversal analytics to find all paths longer than 5 papers that connect the two ideas
Separate by year
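Finding every path between the two endpoints is a depth-first traversal that only keeps sufficiently long routes. A minimal sketch, assuming edges point from a cited (earlier) paper to the papers that cite it and that "greater than 5 papers" means a path containing at least six papers; the function name and the toy graph are hypothetical.

```python
from typing import Dict, List


def find_paths(graph: Dict[str, List[str]],
               start: str,
               end: str,
               min_papers: int = 6) -> List[List[str]]:
    """Return all simple paths from start to end with >= min_papers nodes."""
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            if len(path) >= min_papers:
                paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # never revisit a paper on the same path
                stack.append((nxt, path + [nxt]))
    return paths


# Toy graph: cited paper -> papers citing it (identifiers are made up).
toy_graph = {
    "start": ["a", "x"],
    "a": ["b"],
    "b": ["c"],
    "c": ["d"],
    "d": ["end"],
    "x": ["end"],
}
paths = find_paths(toy_graph, "start", "end")
print(paths)  # [['start', 'a', 'b', 'c', 'd', 'end']]
```

The short route through "x" is dropped because it contains only three papers. Enumerating all simple paths can grow exponentially on a dense graph, which is one reason to apply the year filter before traversal.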
44. Top 20 universities active in Quantum Computing
[Figure: scatter plot of institutions; x-axis: Publications (0 to 250); y-axis: FWCI (0 to 5)]
University of Waterloo
National University of Singapore
Massachusetts Institute of Technology
University of Science and Technology of China
University of Oxford
Tsinghua University
University of Tokyo
Harvard University
University of Maryland
University of New South Wales
University of California at Santa Barbara
ETH Zurich
University of Sydney
RAS
University of Southern California
Perimeter Institute for Theoretical Physics
University College London
Princeton University
University of Michigan
55. ELSEVIER I Elsevier Open Science: Creating value through collaboration I
CONFIDENTIAL
Global market dynamics and technologies are reconfiguring the academic ecosystem:
Macroeconomic developments
Ecological and societal sustainability
• Global population is growing; 9B people in 2050
• Challenge to produce more with less and cleaner
input
• Challenge to solve poverty and unequal
allocation of resources
Shifting power balance from West to East
• Strong economic growth in China and India
• Rise of the middle class; improvement of
educational and health care system and food
supply chain
Technological developments
The web
• Everyone is a publisher
• Content access is ubiquitous
The social web
• Professional and personal networks emerge
without traditional institutions
• Everyone is a peer reviewer
Big data
• Explosion of data through networking of
measurement tools
• Radically cheaper tools and computing
power
Social developments
• Pressure from society and funders to
justify the costs of science
• Need for reliable research results (that can
be trusted).
• Patients/citizens demand access and increased participation
• Distributed computing makes it easier to
make and share tools, content and code
• Overall need for more transparency and
accountability, also in doing and reporting
research
Emergence of open science
Open Peer Review
New social networks
Data, tools and workflows are shared
Open Data
Society is engaging more
Open APIs
Open Source Software
56. Some examples of Open Science:
Scientists build data sharing tools
Carl Kesselman builds tools to enable neuroscientists to store and share their data in a better way
Computers are scientists
Viktor Pankratius builds software programs that generate hypotheses about volcano eruptions: the software can steer drones to collect data.
Science becomes a game, which anyone can play
Lena Deus solves scientific problems through Kaggle: the system awards her points for scoring highest on Machine Learning tasks.
57. Moving to a network of connected components:
1. Take an Open Source data repository and find some Open Data:
• Deriva, an Open Source data repository
• Neuroscience data
2. Write some Open Source software to mash them up:
• Jupyter Notebook to calculate properties
3. Share outputs as OA/OD/OS:
• Share new data sets on Deriva
• Publish papers in an OA journal
• Share code on platforms like GitHub
Networked system:
1. Community adds elements to open science platforms that can be used by everyone.
2. Researchers build upon the combination of shared content/system elements. This leads to new scientific knowledge and output.
3. All sharable elements find their way to other open platforms and formats and can be re-used, causing a network effect.
[Figure: Open Research Platform connecting users A, B and C with Platform A, Platform B, Data v1, Data v2, Tools B, Tools C and an article, alongside Open Data Repositories, Open Access Journals and Code Networks]
58. Open Science represents a transition from a pipeline to a networked knowledge system:
Linear supply chains are evolving into complex, dynamic and connected value webs
[Figure: a pipeline of Suppliers, Manufacturers, Distributors and Consumers (data, tool, article, user) next to a network of interlinked articles, tools, data and users]
Today: linear supply chains
Model: Castle
• Goal: selling content
• Metrics: number of units sold
• Strategy: optimize content delivery to users
• Win by reputation
The future: networked open science
Model: Marketplace
• Goal: grow number of interactions
• Metrics: number of interactions between users
• Strategy: optimize number of network interactions
• Win by trust
59. Some current Open Science efforts:
[Figure: Open Science diagram with the elements Open Access, Open Data, Open Metrics, Research Integrity & Reproducibility, Science & Society, and Open Tools and Software]
Open Access:
- Hybrid/Gold journals, open/self-
archive options
- Contributing to CHORUS,
CrossMark, RA21
- ‘Platinum OA’ on bepress Digital
Commons
- Pilot SSRN Preprint of the Lancet
Research Integrity and Reproducibility:
Many efforts, including:
- Full GDPR Compliance across all Elsevier products
- Preregistration and Registered Reports
- STAR Methods for Cell, transparent reporting
- Plagiarism and Image manipulation detection
- Statistics checking
- Reproducibility badges/TOP guidelines
- Transparency in contributorship roles (CRediT
Taxonomy)
- Research collaborations, e.g. Humboldt, Data Integrity
Science and Society:
- Science Literacy effort: Topic Pages,
Audioslides, Science and People
- Access to content via Patient Inform,
Research4life, Bookshare and Load2Learn.
- Elsevier Foundation supporting many
projects including Green and Sustainable
Chemistry, awards for early-career women
scientists from developing world, many
more
Open Data:
- All data is open on all platforms
- Following TOP guidelines across board
- Co-leads on Enabling FAIR Data requiring data deposits in Earth & Space Science
- Co-leads Data Citation Principles in Force11
- Supporting Scholix Linked Data repository and other open data standards, efforts through RDA, ORCID, CrossRef, etc.
Open Metrics:
- CiteScore free API
- PlumX metrics and NewsFlo: free layer of
societal impact metrics on article level
- Helping lead RDA Make Data Count effort with CDL/DataCite to establish data metrics
Open Tools and Software:
- Open APIs for most products
- Many research collaborations leading to Open Source
software, e.g. Github4Labs, NIH Data commons
- Hackathons, in medicine <Elsevier Hacks>, for Mendeley
- Content and data available for research and development
and hackathons
Editor's notes
Analogies:
Manager is like OneDrive for datasets: collaborate on an active project; allows for review and approval of datasets by the library prior to publication
Manager is the Trello for research project management
RESEARCHER: Example from Wouter: Why would a psychologist use this?
Project management dashboard : It enables organized project management (where is the data? Could be dropbox)
Templates can be set up
MOVE FROM FILES TO DATASET (files with description, metadata and structure)
Manager helps make your data FAIR
INSTITUTION: Monitor allows for clear presentation and enables librarians to decide whether to keep or delete private data, especially when someone has left the institution. Archival policies. Monitor helps prevent "data loss"
Now let’s dive a little deeper into each module, starting with Repository. Counting only publications does not reflect the true amount of research created during an experiment: there is likely more than one dataset tied to a published article. By using Repository, researchers can:
Store up to 100GB of data per dataset
Ensure proper metadata tagging and storage
Increase discoverability of their dataset by easily creating a DOI to allow for citation. This also ensures datasets get counted as a research output.
Standards-based metadata framework for logically and physically bundling resources with context http://researchobject.org
So let’s get to quantum computing, which is the area we were asked to focus on within the larger topic of quantum technologies. Here we can see the institutions that create the largest number of papers on QC, with the Chinese Academy of Sciences and CNRS, two national lab systems, at or near the top.
If we flip this to look at field-weighted citation impact (FWCI), however, a measure of the work's relative impact in the field, we get a very different picture: still highly international, but with more US institutions, and notably a number of US companies producing high-impact work.
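Field-weighted citation impact is commonly described as the ratio of the citations a paper received to the citations expected for similar papers (same field, year and document type), so a value of 1.0 corresponds to the world average. A minimal sketch, assuming an institution's value is the mean of its per-paper ratios; the function names and all numbers are made up for illustration and are not taken from the analysis in these notes.

```python
from typing import List, Tuple


def paper_fwci(citations: int, expected: float) -> float:
    """Ratio of actual citations to the expected citations for similar papers."""
    return citations / expected


def institution_fwci(papers: List[Tuple[int, float]]) -> float:
    """Mean of per-paper ratios (assumed aggregation method)."""
    return sum(paper_fwci(c, e) for c, e in papers) / len(papers)


# Made-up example: (citations received, expected citations) per paper.
toy_papers = [(10, 5.0), (3, 6.0), (4, 4.0)]
value = institution_fwci(toy_papers)
print(round(value, 2))  # 1.17
```

Because the expected count differs by field and year, a paper with few citations in a low-citation field can still score above 1.0.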
The word cloud represents the top 50 semantically-derived keyphrases for the total set of papers representing quantum computing.
If we click on the specific term “polynomial approximation” in the word cloud, we find out how quickly the topic has grown over the last 5 years, and even which individuals and institutions worldwide are working on the particular concept of polynomial approximation. It’s immediately evident that quantum computing is a highly international and competitive field. And remember, 50 keyphrases exist for each of the 100,000 topics that are modeled in the topic prominence calculation.
Let’s slice the data in a different way. Here are top 20 institutions outside the US, again arranged by FWCI, who are doing important work in quantum computing. Notice anything? Virtually every one of these is a university.
Here’s the same list for the US. What is different here? For the US list alone, there are 3 large corporations, the NSF, and a DOE national lab contributing high-impact research. We know that quantum computing is being invested in and chased vigorously across the globe. The Chinese are pouring immense financial resources into this, and they have plenty of human talent, including many who are likely employed by the people in this room.
In my view, it is this nexus of different organizations, the close linkages between them, that gives the US its edge, if we have any edge. SEMATECH is another example of a complex of different organizations engaging in coordinated action. Over 90% of the research papers that Google publishes, and over 80% that IBM publishes, are done with one or more collaborators from academia.
So what does this difference look like in action? This geomap captures global research activity in quantum computing. The size of the bubble is the number of papers, the color intensity is the FWCI. Here we can see research is fairly evenly distributed in the US, Europe, and East Asia.
The Y axis here is the Field-Weighted Citation Impact for each university, while the position on the X axis is the total number of papers. Clearly UC Santa Barbara is doing something exceptional here; we’ll explore that a bit more later. Waterloo and NUS are producing a lot of papers, though at a relatively low citation impact. Generally, the more papers one is publishing, the lower the overall impact will be. (Taken from a slightly different data set.)
We can look at other proxies for quality, including the number of outputs in top percentiles—here the percentage of research in the top 10% of cited outputs, which is around 29% for the US in 2016, around 15.5% for non-US institutions.
Here’s the same map, but now the color intensity is the level of academic-corporate collaboration. The dark red are tech companies, but US universities also have much higher levels of AC collaboration than others. Europe and Asia are very pale by comparison.
Let’s look at a different and more granular view of the same information. There is a lot going on in this graphic; it’s a different way of looking at the landscape. The bluer the dot, the higher the FWCI. The thicker the line, the more papers are shared between the two nodes. Network centrality implies higher levels of connectedness. Japan is peripheral and mostly connected to other Japanese entities. China, particularly Tsinghua and UST China, are more connected, Singapore still more so. However, they are not as connected or central as a few key US, UK, Australian and Canadian institutions, and one can clearly see that a few large US corporations are also quite central here.
In my view, one remaining advantage the US seems to have (in addition to lots of high-quality research) is the nexus between industry and academia--because of the enormous manufacturing complexities, the SEMATECH kind of highly coordinated approach (academia/industry/govt) may make more sense in this sector than many--also given questions of cryptographic security and national security implications.
We can also look at three-factor analysis. Here we map total scholarly output on the Y axis. US output of 2392 papers (2008-2016) represents about 27% of global output. The X axis is the level of academic-corporate collaboration. 7.7% of US papers, but only 1.2% of non-US papers, are AC collaborations. Finally, the size of the bubble shows the number of patent citations for every thousand papers published. For the US, this is 111 citations, meaning over 11% of these papers were cited in patents worldwide. It generally takes 3-5 years before papers are cited in patents, so this likely understates the total since we have 2016 papers in here. The same measure for non-US institutions is 21.6 per 1000, less than one-fifth the level.
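The figures quoted in this note can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the variable names are illustrative, and the derived values (implied global output, approximate count of US academic-corporate papers) are back-of-the-envelope estimates rather than figures from the underlying data set.

```python
# Numbers quoted in the note above.
us_papers = 2392            # US scholarly output, 2008-2016
us_share_of_global = 0.27   # "about 27% of global output"
us_ac_share = 0.077         # academic-corporate collaboration, US
us_patent_cites_per_1000 = 111.0
non_us_patent_cites_per_1000 = 21.6

# Derived estimates (approximate, because the inputs are rounded).
implied_global_papers = round(us_papers / us_share_of_global)
us_ac_papers = round(us_papers * us_ac_share)
us_patent_cite_rate = us_patent_cites_per_1000 / 1000
non_us_to_us_ratio = non_us_patent_cites_per_1000 / us_patent_cites_per_1000

print(implied_global_papers)         # 8859
print(us_ac_papers)                  # 184
print(us_patent_cite_rate)           # 0.111
print(round(non_us_to_us_ratio, 3))  # 0.195, i.e. just under one-fifth
```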
This graphic really points out the large gap between the US and the ROW regarding UI collaboration, and overall patenting activity driven by university research as well.
The quantum computing topic is actually an aggregate made up of somewhat more and distinct granular topics—the same kinds of analysis can be done on these topics, which are generated directly from the topical model that I covered earlier.
These are the same topics by country and number of downloaded articles.
We can look at top corporations publishing in this area, and can see that the bulk are US firms with some Japanese representation as well.
Top universities for the same topic—Yale, UCSB, Berkeley, and MIT produce a great deal, with UCSB and Yale authors having a particularly high FWCI
One can always do a Keyphrase-based analysis if you want to delve into a particular aspect of the topic. Here we look at the same set of papers on flux qubits that cover the concepts of circuits, resonators, and Josephson junctions—note the number of papers from Yale has gone down from 85 to 46 here. Dr. Devoret has produced more work than anyone else covering these concepts.