Lecture for a course at NTNU, 27th January 2021
CC-BY 4.0 Dag Endresen https://orcid.org/0000-0002-2352-5497
See also http://bit.ly/biodiversityinformatics
https://www.gbif.no/events/2021/lecture-ntnu-gbif.html
Biodiversity Informatics Lecture on Open Science Principles
1. Biodiversity Informatics
Dag Endresen | GBIF Node Manager for Norway
Lecture | NTNU, Trondheim, Norway | 27th January 2021
Global Biodiversity Information Facility
Illustration by Harry Potter Bygett (NTNU) Wiki Commons CC BY 2.0
3. WHY TEACH STUDENTS OPEN SCIENCE?
v We are in the middle of an ongoing paradigm shift
in scientific practice (and impact metrics).
v The open science wave is moving fast!
v Young scientists will need different skills, than was
needed previously to succeed in academia.
v Researchers will need to develop different
approaches, than they needed in the past – to
remain relevant.
v Society is quickly gaining Big Data maturity and will
expect new services from biodiversity information
and research.
4. DATA CITATION AS A NEW CURRENCY OF SCIENCE
● Peer-reviewed scholarly papers in high impact journals
maintain considerable weight for impact metrics.
● A movement is under way to build similar status for
open data, open metadata, open material samples, and
other open scientific research products…
5. DECLARATION ON RESEARCH ASSESSMENT
● The Declaration on Research Assessment (DORA) recognizes the
need to improve the ways in which the outputs of scholarly research
are evaluated. Developed in 2012 in San Francisco.
● It has become a worldwide initiative covering all scholarly
disciplines and all key stakeholders including funders, publishers,
professional societies, institutions, and researchers.
● DORA’s vision is to advance practical and robust approaches to
research assessment globally and across all scholarly disciplines.
● The Research Council of Norway (RCN) signed DORA (May 2018)
To date (2021-01-27), 16 873 individuals and 2 139 organizations in 144 countries have signed DORA.
6. DATA CITATION PRINCIPLES
1. Data to be legitimate citable products of research.
2. Data citations giving scholarly credit and attribution.
3. In scholarly literature, whenever claims are based on data, data should always be cited.
4. Persistent method for identification of data, that is machine actionable, globally unique,
universal.
5. Data citation facilitate access to data or at least to metadata.
6. Unique identifiers that persist even beyond the lifespan of the data.
7. Data citation identify and access the specific data that support verification of the claim
(provenance, time-slice, version).
8. Flexible, but attention to interoperability of practices across communities.
Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014; https://doi.org/10.25490/a97f-egyk
7. FAIR DATA PRINCIPLES
… researchers need to do more than simply post
their data on the web for it to be re-usable.
8. WHAT IS FAIR DATA?
● FINDABLE à Data and supplementary materials have sufficiently rich
metadata and a unique and persistent identifier.
● ACCESSIBLE à Metadata and data are understandable to humans and
machines. Data is deposited in a trusted repository.
● INTEROPERABLE à Metadata use a formal, accessible, shared, and
broadly applicable language for knowledge representation.
● REUSABLE à Data and collections have a clear usage licenses and
provide accurate information on provenance.
● https://www.go-fair.org/
10. Novel possibilities… (for novel curiosity-driven research)
Open science
Traditional science
Young
Researcher
11. CAREER OPPORTUNITIES
● Skills for open research and open data
are in increasing demand!
● Enables new research methodologies
that were not possible before.
● Bring benefits for your career as a
(young) researcher.
13. REPRODUCIBILITY CRISIS
"Scientific irreproducibility —
the inability to repeat others'
experiments and reach the same
conclusion” (Nature 2016)
Baker (2016) 1,500 scientists lift the lid on reproducibility. Nature.
doi:10.1038/533452a
14. Baker (2016) 1,500 scientists lift the lid on reproducibility. Nature. doi:10.1038/533452a
Scientific irreproducibility
is a growing concern.
Open Science solution: researchers to
share their methods, data, computer code
and results in central data repositories.
For physical real-world samples we also
need herbarium specimen and bio-
repositories (museums).
15. WILL ANYBODY TRUST CLOSED SCIENCE AGAIN?
● Studies (1,2) indicates that p-hacking
is a significant problem – sometimes
even without the scientist even being
aware of doing so.
● Pre-registered (open) data provides a
good insurance against suspicion of
both data dredging (and plain data
falsification).
● “p-hacking” = occurs when researchers collect or
select data or statistical analyses until nonsignificant
results become significant (data fishing, …)
(1) Head et al. (2015) The Extent and Consequences of P-Hacking in Science . PLoS Biol. doi:10.1371/journal.pbio.1002106
(2) Ioannidis (2005). "Why Most Published Research Findings Are False". PLoS Medicine. doi:10.1371/journal.pmed.0020124.
17. DATA MANAGEMENT PLAN (DMP)
• A formal document that describes how data are to
be handled during a research project, and after the
project is completed.
• The goal is to plan data management before the
project begins.
• Including a plan for the costs of data management
and archiving.
• This saves time in the long run and promotes data
fitness for reuse.
• Reduce duplication of existing scientific studies.
• Reduce the loss of data.
https://innsida.ntnu.no/wiki/-/wiki/English/Data+management+plan
Illustration CC BY Jørgen Stamp
18. WHAT IS METADATA?
• Metadata, literally means
“data about data” are an
essential component of a data
management system,
describing such aspects as
the “what, where, when, who
and how” pertaining to a
resource.
19. METADATA SHOULD ALLOW A
USER OF THE DATA TO
• Identify & discover its existence
• Learn how to access or acquire the data
• Understand its fitness-for-use
• Learn how to obtain a copy of the data
• Learn how the data should be used
21. WHAT IS A DATA PAPER?
• A data paper is a peer reviewed document
describing a dataset, published in a peer
reviewed journal.
• It takes effort to prepare, curate and
describe data.
• Data papers provide recognition for this
effort by means of a scholarly article.
• Getting scholarly recognition for improving
the fitness for reuse for your own datasets.
• Getting scholarly recognition for preparing
and making legacy research data available
for reuse.
https://www.gbif.org/data-papers
22.
23. Intergovernmental network
and research infrastructure
Provides anyone, anywhere,
free and open access to data
about all types of life on Earth
Voluntary collaboration
through Memorandum of
Understanding (MoU)
Participant nodes, Secretariat
in Copenhagen, Denmark
WHAT IS GBIF?
https://www.gbif.org
25. A WINDOW ON EVIDENCE ABOUT WHERE SPECIES HAVE LIVED, AND WHEN
https://www.gbif.org/occurrence/search
Digitized
specimens
Observations
Literature
Remote-sensing
Environmental
DNA
Common
standards
(DwC)
Data publishing
and indexing
Data discovery and use
27. BY THE NUMBERS | 27 JANUARY 2021
62
Country
Participants
39
Organizational
Participants
5 373
Peer-review papers
using data
1 650 849 882
Species occurrence records
55 902
Datasets
1 634
Publishers
23.6 billion
Average records downloaded per month
(2020)
28. BY THE NUMBERS | 27 JANUARY 2021 -- NORWAY
124
Peer-review papers
using data (co-author
from Norway
40 894 819
Species occurrence records (published from)
301
Datasets (published from)
38
Publishers
(from Norway)
29. DATA TRENDS ON GBIF.org
https://www.gbif.org/analytics/global
% specimens
31. SPECIES OCCURRENCE RECORDS
WITH MULTIMEDIA EVIDENCE
27th January 2021
75 million records with taxonomically
identified images (1.8 million from Norway)
• 41.1 million specimens (Norway: 884 763)
• 31.3 million human observations (Norway: 923 763)
• 1.4 million material samples (Norway: 38 386)
685 806 audio files (Norway: 3 566)
2 825 videos (Norway: 4)
https://www.gbif.org/occurrence/gallery
33. SOURCES OF DATA IN GBIF: DIGITIZED SPECIMENS FROM MUSEUM COLLECTIONS
34. SOURCES OF DATA IN GBIF: TAXONOMIC LITERATURE, OLD AND NEW
Data liberation
35. GLOBAL BIODIVERSITY VS. DIGITALLY AVAILABLE DATA
Image:
FL
Fawcett
in
Wheller
Ann.
Entomol.
Soc.
Am.
1990
Troudet
et
al.
Nature
Scientific
Reports
2017
1200 mill.
animals
300 m
plants
20 m
fungi
16 m
bacteria
0,04 m
virus
36. LATIN NAMES ARE RULED BY THE CODES
Domain (Eukarya)
Kingdom (Animalia)
Phylum (Chordata)
Class (Mammalia)
Order (Primates)
Family (Hominidae)
Genus (Homo)
Species
(Homo sapiens)
PhyloCode
ICN
ICZN
37. OTU = SH,
Species
hypothesis
numbers [DOI]
OTU = BIN,
Barcode
identification
number
GBIF
backbone
taxonomy
BIN DEF0002
SH ABC0001
OTU = Operational Taxonomic Unit
Species DNA Barcode
38. SOURCES OF DATA IN GBIF: DNA SEQUENCE-DERIVED OCCURRENCE DATA
MGnify -- https://www.gbif.org/publisher/ab733144-7043-4e88-bd4f-fca7bf858880
39. NEW GBIF GUIDE: PUBLISHING
SEQUENCE-DERIVED DATA THROUGH
BIODIVERSITY DISCOVERY
PLATFORMS
• Authors from Australia, Norway, Sweden, Denmark, UNITE, and GBIF
• Based on practical mapping and data publishing experiences
• Cross-platform
• About 40 pages long ”cookbook”
v Introduction – refresh your ”data culinary” knowledge
v Categorization – what ”data ingredients” you got to publish?
v Mapping – choose and follow the ”recipe”
v Visuals – clarity and guidelines
v Future prospects
v Resources: glossary, links, references
Based on Darwin Core and MIxS data standards
https://doi.org/10.35035/doc-vf1a-nr22
41. POLICY LINKS: AICHI TARGETS
- Trend in invasive
alien species
introductions (through
Global Register of
Introduced and
Invasive Species)
- Species Protection
Index
- Protected Area
Representativeness
Index
- Comprehensiveness
of conservation of
socioeconomically/cu
lturally valuable
species
- Agrobiodiversity
Index
- Crop Wild Relative
Index
- Growth in species
occurrence records
accessible through
GBIF
- Species Status
Information Index
https://www.cbd.int/cooperation/csp/gbif.shtml | https://www.cbd.int/csp/survey/GBIF.pdf
42. A DATA RESOURCE TO SUPPORT RESEARCH AND SUSTAINABLE DEVELOPMENT
Conservation
- Protected areas
- Threatened species
- Invasive species risk
Food Security
- Crop wild relatives
- In situ, ex situ
conservation of
genetic diversity
- Fisheries planning
Climate change
- Modelling impacts on
species ranges
- Adaptation strategies
- Mitigation benefits,
risks
Human health
- Disease risk based on
occurrence of vectors,
hosts, reservoirs
- Medicinal plants
- Hazards e.g. snakebite
https://www.gbif.org/science-review
43. PEER-REVIEWED PUBLICATIONS USING GBIF-MEDIATED DATA September 2020
https://www.gbif.org/resource/search?contentType=literature&literatureType=journal&relevance=GBIF_USED&peerReview=true
626
52
89
148
169
229
249
350
407
428
696
676
743
938
0 200 400 600 800 1 000 1 200
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
Year-to-date Annual total (with projection for 2020)
~ 2-3 papers a day
#CiteTheDOI
44. DATA USE IN PEER-REVIEWED JOURNALS
https://www.gbif.org/resource/search?contentType=literature&year=2020&literatureType=journal&relevance=GBIF_USED&countriesOfResearcher=NO&peerReview=true
GBIF-mediated data use citations 2020 2019
1 United States 260 215
2 China 192 175
3 Brazil 119 95
4 United Kingdom 107 89
5 Mexico 92 64
5 Germany 79 63
7 Spain 77 51
8 Canada 63 44
9 Australia 55 53
10 France 52 48
-- Norway 20 19
78
58
220
261
255
597
47
54
176
182
214
398
Oceania
Africa
Asia
North America
Latin America
Europe
0 200 400 600 800
Peer-reviewed uses by region
2020 2019
45. HOW TO CITE DATA MEDIATED BY GBIF
1. Download data from GBIF.org
2. and receive recommended citation with a download DOI
3. Cite the DOI in published research or other work
Example: GBIF.org (12 October 2020) GBIF Occurrence Download https//doi.org/10.15468/dl.xxxxxx
https://www.gbif.org/citation-guidelines
#CiteTheDOI
46. WHY CITE DATA?
• Good academic practice for transparent and reproducible research
• Credit institutions who shared data and supported your research
• Help data publishing institutions to demonstrate value of digitization
and data publication through research
• Correct citation encourages data sharing
• Data accessed through GBIF is free for all – but not free of
obligations: see the user agreement
https://www.gbif.org/citation-guidelines
#CiteTheDOI
47. Filter
Source dataset #1
Source dataset #2
Source dataset #3
GBIF download
Process
Archive
Final state of data
Dataset DOIs Download DOI Archive DOI Bibliographic DOI
Analyze &
publish
48. Filter
Source dataset #1
Source dataset #2
Source dataset #3
GBIF download
Process
Archive
Final state of data
Dataset DOIs Download DOI Archive DOI Bibliographic DOI
Analyze &
publish
49. Filter
Source dataset #1
Source dataset #2
Source dataset #3
GBIF download
Process
Archive
Final state of data
Dataset DOIs Download DOI Archive DOI Bibliographic DOI
Analyze &
publish
(possible with persistent object identifiers)
50. DOI BASED DATA CITATION AT GBIF.ORG
NTNU Vascular plants: https://doi.org/10.15468/zrlqok
citations papers
dataset
51. THE RESEARCH DATA LIFECYCLE
https://library.sydney.edu.au/research/data-management/research-data-management.html
GBIF.org