2. Session structure
• Introductions:
– A bit about me, a bit about you, housekeeping, what is GigaDB
• (Meta)Data Handling
– Curation, BioCuration, Sharing data
• BioCuration Life Cycle and tools
– Dictionaries, CVs, spreadsheets, standards
and checklists
• OpenRefine practical
4. Communicating in-class
• Chat channel:
http://backchannelchat.com/chat/dw131
• Feel free to ask questions, or to request that I speed up or slow down
• The example files and slides are available here:
ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/
Also feel free to email: chris@gigasciencejournal.com
5. This is me
• LinkedIn:
https://hk.linkedin.com/in/chr1shunter
• ORCID ID:
http://orcid.org/0000-0002-1335-0881
6. My background
• 95-99: Applied Biology Degree (Nottingham, UK)
• 99-03: Genetics/Genomics PhD (Cambridge, UK)
• 03-04: Postdoc – function of small DNA motifs
• 04-07: Postdoc – Cancer Genome Project
• 07-09: EBI – Curator for SRA
• 09-12: EBI – Bioinformatician/Curator on Metagenomics portal
• 13-present: GigaScience Database – Lead BioCurator
7. Why tell you about me?
• An indication of what qualifies me to be
teaching you about curation!
• The sort of person that you might meet in
the role of BioCurator
• To show that you don’t need to know your
end goal to make a career, just make the
most of opportunities.
8. Who are you?
• I would like to take a few minutes to hear
from each of you (~30 seconds each)
• Name
• Background
• Scientific/academic interests
• Any idea what’s next for your career?
11. GigaScience journal
• GigaScience is an OPEN access publisher
of Life Science articles
• Highly reproducible articles
• Focus on Big data
• Peer reviewed for reliability
• Provide open access free to all
• Run as a not-for-profit to best benefit
researchers
13. What is GigaDB?
• Open access database
• Data organized into datasets
• Datasets associated with GigaScience
articles
• Manually curated
• Indexed and searchable metadata
enabling discoverability and reuse.
14. • Currently >300 datasets available
• Genomic datasets represent the majority of
data (~55%)
• ~75% of all data from BGI (or
collaborators)
• ~20 different data types represented
• All manually curated
15. Data types
• Nucleotide:
– Genomic, Transcriptomic, Metagenomic
• Mass spectrometry:
– Proteomics, Metabolomics, MS-Imaging.
• Software & Workflows
• Other
– Imaging, Neuroscience, Network analysis
17. Anatomy of a GigaDB entry
• All relevant information is held together in packets called Datasets
• Each dataset has a stable DOI page
• If required there can be a hierarchy of datasets
18. • Title
• Study type(s)
• Image
• Citation
• Description
• Funders
• Links to Google Scholar and Europe PMC to see who has cited this dataset
• Email submitter
• Link to manuscript
• Links to external resources
Cont.
19. • Samples used in the study
• Files listed as part of the study
• History of dataset changes
• Social media links
• Links to other datasets of a similar nature
20. Downloading the data
FTP
• Conventional/easy to use
• Can pull files individually from the web page
• One or multiple files using the Unix command line
• Speed = up to 1 Mb/sec
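A minimal sketch of pulling a file over FTP from a script instead of the web page, using only Python's standard library. The helper names `local_name` and `fetch_ftp` are illustrative, the URL is the example file used later in this session, and it is an assumption that the server is reachable from your network.

```python
import posixpath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def local_name(ftp_url: str) -> str:
    """Return the file name at the end of an FTP URL."""
    return posixpath.basename(urlparse(ftp_url).path)


def fetch_ftp(ftp_url: str) -> str:
    """Download one file over anonymous FTP and return the local file name."""
    target = local_name(ftp_url)
    urlretrieve(ftp_url, target)  # urllib understands ftp:// URLs
    return target


EXAMPLE_URL = (
    "ftp://climb.genomics.cn/pub/10.5524/presentations/"
    "MLIM.dir/sample_attribute_spreadsheet-example.csv"
)

# Usage (needs network access), not run automatically:
#   fetch_ftp(EXAMPLE_URL)
```

Nothing is downloaded until `fetch_ftp` is called, so the block can be pasted into a script and the call added where needed.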
24. What is data?
• “Data may exist only in the eye of the
beholder: The recognition that an
observation, artifact, or record constitutes
data is itself a scholarly act.” (Borgman,
2012)
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci
Tec, 63: 1059–1078. doi:10.1002/asi.22634
25. What is data?
• We use the term “data” to be broadly
inclusive. It includes
– digital manifestations of literature
– laboratory data: including spectrographic, genomic
sequencing, and electron microscopy data
– observational data: remote sensing, geospatial, and
socioeconomic data
– other forms of data either generated or
compiled, by humans or machines: software,
scripts, intermediary data, tabular data used to generate
charts
26. How data is created
• Gathered or produced by researchers
– Observations, experiments, or models
– Survey results
– Records (census, economic, etc.)
– Digitized/born digital text and images
27. So what is metadata?
• Data ABOUT data
• A set of data that describes and gives
information about other data.
– http://dictionary.casrai.org/Metadata
• It’s not a new concept: think about old
catalog cards
28. Curate the data
• To classify and catalog data
• Metadata is the classification and
cataloging of data to aid discoverability
and reuse.
• Strongly reliant on controlled vocabularies
and ontological terms
29. Data curation is…
• “the active and ongoing management of
data throughout its entire lifecycle of
interest and usefulness to scholarship”
Cragin et al., 2007
http://hdl.handle.net/2142/3493
• I would also add:
“the cataloging of data to increase its
usefulness”
30. Data curation…
• Is a dynamic process
– Not a one-time or one-step activity
• Happens in a lifecycle
– Creation, management, preservation
• Aims to maintain the utility of the data
31. What gets curated?
• Data
– At various stages
• Methods (sometimes)
– Algorithms, code
• Metadata
– Information about the data
• Links
– metadata can form networks of linked data to
help knowledge acquisition
32. Data curation or BioCuration?
• Distinct, but related
• Data curation is broader
• BioCuration is more specific to Biological
data curation
– “Biocuration involves the translation and
integration of information relevant to biology
into a database or resource that enables
integration of the scientific literature as well as
large data sets.”
http://biocuration.org/dissemination/who-are-we/
33. BioCuration
• The process of curating biological data
• International Society for Biocuration (ISB)
– Yearly meetings
– Society website (http://biocuration.org/)
– Discussion forum
– Job adverts
36. Why share data?
• Concepts related to the scientific method
• Reproducibility:
– Experiment can be replicated by the original
researcher or another researcher
• Reliability:
– Similar results can be achieved in other
experiments
• Re-use
– Others can make use of data in other ways
than originally intended
37. What’s important?
• An attractive, tabular lay-out in a
spreadsheet for presentational purposes?
• An accessible version that is suitable for
re-use with minimal editing?
• Both of the above?
– Consider releasing multiple formats of your
data
38. Manuscripts
• The traditional publication is a
“presentational” version of the data
– The data itself is often lurking in supplemental files as PDFs
39. Data Journals
• Publication option for datasets
– Often discipline-specific
– Can be peer-reviewed
• Sometimes they provide a means of usable
data release, and sometimes just an
independently citable version of
supplemental files.
40. Data Repositories
• Where data is stored for the long term
• Computer accessible
• Some repositories are discipline-specific
– Genomic data: GenBank / ENA
• Some repositories are built for an
organization
– For a university / institute
– For a funder
– Not-for-profit (Dryad, Figshare, GigaDB, Zenodo)
41. FYI: GigaScience is…
• Combination of
– Peer reviewed Manuscript publication
linked to a
– Manually curated Data repository
44. (Primary) BioCuration activities
• Documentation
– Keeping track of how the data was:
• Generated; used; analyzed
• Annotation
– Addition of structured information to
accompany data/files
• Connection
– Linking of files/data to related items both
within dataset and to external items
45. (Ancillary) BioCuration activities
• Collection and aggregation
– Files in directories; databases
• Storage and archiving
– Saving data (on digital media)
– Providing consistent and permanent
identifiers (DOI)
• Migration
– Active preservation of data to keep it readable
• Repeat the process on an ongoing basis
47. Dictionary
• An alphabetical reference list of terms or names
important to a particular subject or activity along
with discussion of their meanings and applications
• Casrai
– Particularly IRIDIUM (Research Data Management):
• http://dictionary.casrai.org/Category:Research_Data_Domain
– Many other dictionaries maintained by Casrai:
http://casrai.org/standards
48. Controlled Vocabularies
• A controlled vocabulary is an organized
arrangement of words and phrases used to
index content
• Can be a subset of a dictionary
49. Key:value pairs
• A key-value pair (KVP) is a set of two linked
data items: a key, which is a unique identifier for
some item of data, and the value, which is either
the data that is identified or a pointer to the
location of that data.
• Structured pairing of particular terms
• One or both can be from CVs
• Particularly used for computer-readable
metadata
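As a rough illustration of key:value pairs drawn from controlled vocabularies, the sketch below stores one sample's metadata as a Python dictionary and checks it against two small CVs. The vocabularies, the sample record and the function names are all hypothetical and are not GigaDB's actual lists.

```python
# Hypothetical controlled vocabularies, for illustration only.
ATTRIBUTE_CV = {"sex", "age", "tissue"}
SEX_CV = {"male", "female", "not determined"}

# One sample described as key:value pairs.
sample = {"sex": "female", "age": "12 weeks", "tissue": "liver"}


def unknown_keys(record: dict, allowed: set) -> list:
    """Return the keys in the record that are not in the attribute CV."""
    return sorted(key for key in record if key not in allowed)


def value_in_cv(record: dict, key: str, cv: set) -> bool:
    """Check that the value stored under `key` is a term from the CV."""
    return record.get(key) in cv
```

Because keys are compared exactly, a record that uses `Sex` instead of `sex` is flagged, which is the kind of inconsistency a curator wants to catch early.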
50. Ontologies
• a set of concepts and categories in a subject area or
domain that shows their properties and the relations
between them.
• More complex than CVs: they include relationship
information and inherited concepts
• Most ontologies in common use in
BioCuration are in fact hierarchical CVs
• Much work is being done to integrate, merge
and unify many of these into a true ontology,
which will enable semantic web applications.
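To make the idea of a hierarchical CV concrete, here is a small sketch in which each term points to its parent through an "is_a" relationship, so that a child inherits every concept above it. The terms and the hierarchy are invented for illustration and are not taken from any published ontology.

```python
# Hypothetical mini-hierarchy: child term -> parent term ("is_a").
IS_A = {
    "hepatocyte": "liver cell",
    "liver cell": "animal cell",
    "animal cell": "cell",
}


def ancestors(term: str) -> list:
    """Walk up the hierarchy and return every parent, nearest first."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def is_a(term: str, ancestor: str) -> bool:
    """True if `term` inherits from `ancestor` somewhere up the chain."""
    return ancestor in ancestors(term)
```

A search for "cell" can then also return samples annotated as "hepatocyte", which a flat CV cannot do.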
51. RDF (Resource Description Framework)
• A model for encoding semantic relationships
between items of data so that these
relationships can be interpreted computationally.
• A complete extrapolation of all ontologies
to include all CVs with dictionary
definitions and links to all related terms
• Entirely computer readable using URIs
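The sketch below shows the subject, predicate, object shape of RDF statements using plain Python tuples rather than an RDF library. The `example.org` URIs and the literal values are hypothetical; the Dublin Core terms and FOAF namespaces are real. The serializer is a simplification that treats any object starting with `http` as a URI and does not escape special characters.

```python
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical resources, for illustration only.
DATASET = "http://example.org/dataset/1"
PERSON = "http://example.org/person/1"

# Each statement is a (subject, predicate, object) triple.
triples = [
    (DATASET, DCTERMS + "title", "Example dataset"),
    (DATASET, DCTERMS + "creator", PERSON),
    (PERSON, FOAF + "name", "A. Curator"),
]


def objects(subject: str, predicate: str) -> list:
    """Return every object stated for a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def to_ntriples(triple: tuple) -> str:
    """Write one triple in N-Triples style (simplified, no escaping)."""
    s, p, o = triple
    obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
    return f"<{s}> <{p}> {obj} ."
```

Because the creator is itself a URI, a program can follow it to find the creator's name, which is the "linked" part of linked data.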
54. What’s good about spreadsheets?
• Most people are familiar with them
• No programming skills required
• Can be used to make data look pretty
(highlighting, different fonts, etc)
• Are forgiving of non-data cells (e.g.
comments)
55. What’s bad about spreadsheets?
• They allow merging of cells and other odd
formatting to appeal to the eye.
• Dates get silently reformatted
• Spreadsheet programs are not appropriate
for analysis/statistics.
• Incompatible (native) file formats with
command line software such as R
• Size limitations (requires a lot of RAM to open files
with millions of rows)
56. • Most people still use spreadsheets to
organize their own data
• Good practices with data collection can aid
downstream processes
57. Using spreadsheets wisely
• Useful reference http://kbroman.org/dataorg/
– Be consistent
– Write dates as YYYY-MM-DD
– Fill in all of the cells
– Put just one thing in a cell
– Create a data dictionary (like a CV)
– No calculations in the raw data files
– Don’t use font colour or highlighting as data
– Choose good names for things
– Make backups
– Save the data in plain text files
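Two of these rules (write dates as YYYY-MM-DD, fill in all of the cells) are easy to check automatically once the data is saved as plain text. The sketch below is one possible check using Python's standard library; the CSV content, column names and function names are made up for illustration.

```python
import csv
import io
from datetime import date

# Hypothetical raw data saved as plain-text CSV.
RAW = (
    "sample_id,collection_date,tissue\n"
    "S1,2006-05-14,liver\n"
    "S2,14/05/2006,\n"
)


def is_iso_date(text: str) -> bool:
    """True if the text parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def check_rows(csv_text: str, date_column: str) -> list:
    """Report empty cells and non-ISO dates as (line, column, problem)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column, value in row.items():
            if value.strip() == "":
                problems.append((line_number, column, "empty cell"))
        value = row[date_column]
        if value.strip() and not is_iso_date(value):
            problems.append((line_number, date_column, "date is not YYYY-MM-DD"))
    return problems
```

Running `check_rows(RAW, "collection_date")` flags both problems in the second data row and nothing in the first.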
59. Hands-on part 1 (Excel)
• First of three quick practical examples of
BioCuration
– Using Excel wisely
– Exploring the DataCite XML schema
– Rationalising data using OpenRefine
60. Excel
• Keep in mind: http://kbroman.org/dataorg/
• Using this file as a starting point:
• ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/sample_attribute_spreadsheet-example.csv
• It contains 10,000 rows of the GigaDB
sample attributes table
62. Questions
• Are the dates affected by being
manipulated via Excel?
• Do the ages all have units?
• What has happened with some of the text
in the first few rows?!
• Are all latitude and longitude values
consistent and appropriate?
63. Answers
• Some dates appear as serial dates, i.e. the
number of days after (or before) 1900-Jan-01,
e.g. 37074 = 2001-Jul-02
• Null dates have been converted to 0 or
1900-Jan-00
• Only 403 / 928 age values have units
• The hyphen (an en dash in the source) has been converted to “–”:
its UTF-8 bytes displayed in the wrong character encoding
– http://www.i18nqa.com/debug/utf8-debug.html
• Only 2 lat-long values in this subset and
they are in different formats!
– 29.097221 -83.067351
– 44.000306N, 16.01625E
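Three of these problems can be repaired programmatically. The helpers below are a sketch with illustrative names: one converts an Excel serial date back to a calendar date (assuming the workbook uses Excel's 1900 date system), one reverses the mis-decoded dash (assuming the UTF-8 bytes were read as Windows-1252), and one normalizes the two latitude/longitude styles seen here to signed decimal degrees.

```python
import re
from datetime import date, timedelta


def excel_serial_to_date(serial: int) -> date:
    """Convert an Excel 1900-system serial number to a date.

    Using 1899-12-30 as day zero compensates for Excel treating 1900 as a
    leap year, so this is correct for serials from 61 (1900-03-01) onwards.
    """
    return date(1899, 12, 30) + timedelta(days=serial)


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Windows-1252."""
    return text.encode("cp1252").decode("utf-8")


def parse_lat_lon(text: str) -> tuple:
    """Return (latitude, longitude) as signed decimal degrees."""
    parts = re.findall(r"(-?\d+(?:\.\d+)?)\s*([NSEW]?)", text.upper())
    if len(parts) != 2:
        raise ValueError(f"expected two coordinates in {text!r}")
    values = []
    for number, hemisphere in parts:
        value = float(number)
        if hemisphere in ("S", "W"):  # south and west are negative
            value = -abs(value)
        values.append(value)
    return values[0], values[1]
```

For example, `excel_serial_to_date(37074)` gives 2001-07-02, matching the value on the slide, and both coordinate strings above come back as plain `(lat, lon)` pairs.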
66. Standards
• Examples:
– Dublin core
– GSC
• Resources:
– www.BioSharing.org
• Results of the use of standards:
– www.Repositive.io
67. Dublin Core
• “The Dublin Core metadata standard is a simple
yet effective element set for describing a wide
range of networked resources.”
http://dublincore.org/documents/usageguide/index.shtml
Contributor
Coverage
Creator
Date
Description
Format
Identifier
Language
Publisher
Relation
Rights
Source
Subject
Title
Type
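As an illustration of how the fifteen elements can be used, the sketch below turns a simple record into XML with the Dublin Core element namespace, rejecting any key that is not one of the elements listed above. The record content, the `metadata` wrapper element and the function name are hypothetical; only the element names and the namespace URI come from the standard.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen Dublin Core elements, lower-cased as used in XML.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def to_dc_xml(record: dict) -> str:
    """Serialize a {element: value} record as Dublin Core XML."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for element, value in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"{element!r} is not a Dublin Core element")
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record, for illustration only.
record = {"title": "Example dataset", "creator": "A. Curator", "date": "2017-04-10"}
```

Calling `to_dc_xml(record)` produces one `dc:` element per key, while a key such as `author` raises an error because the standard's term is `creator`.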
68. Genomic Standards Consortium
• Minimum Information about any (x) Sequence –
“MIxS” *
• Covers a variety of different
“environmental packages”
• Each recommends terms from a list of
~700 defined attributes
• Each has ~10-20 mandatory attributes
• MIxS is effectively a dictionary of attributes
* Yilmaz, P et al. Nature Biotechnology 29, 415-420 (2011) doi:10.1038/nbt.1823
70. Example of MIxS compliant sample
Standards in Genomic Sciences 2016, 11:91
DOI: 10.1186/s40793-016-0213-3
Attributes
– Description: Actinoalloteichus hymeniacidonis DSM 45092, an actinomycete isolated from the marine sponge Hymeniacidon perleve
– BioProject: PRJNA273752
– strain: HPA177(T) (=DSM 45092(T))
– host: Hymeniacidon perleve
– isolation source: intertidal marine sponge from the beach of Dalian
– collection date: 2006
– geographic location: China: beach of Dalian
– sample type: pure culture
– biomaterial provider: DSM 45092
– culture collection: DSM:45092
– environment biome: intertidal zone
– host tissue sampled: washed sponge
– latitude and longitude: 38.8667 N 121.6833 E
71. Effective standards and checklists
• Make extensive use of CVs, Ontologies
and KVPs
• Uptake of new standards is usually slow
and requires incentives for users
72. Application Programming Interface
• While web pages are human readable,
machines require structured data
• An Application Programming Interface (API) provides that structured access
73. Schema design
• In order for machines to understand data
and its relationships they need to follow a
set structure (schema).
• GigaDB has a fairly complex structure as a
relational database
• DataCite is less complicated: it is stored in
XML (the comprehensive XSD that describes it is ~500
lines)
76. DataCite
• The XSD is available here:
– http://schema.datacite.org/meta/kernel-4.0/metadata.xsd
• And described here:
– http://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf
• Examples are provided here:
– http://schema.datacite.org/meta/kernel-4.0/
77. A simple DataCite example
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd">
<identifier identifierType="DOI">10.5072/D3P26Q35R-Test</identifier>
<creators>
<creator>
<creatorName>Fosmire, Michael</creatorName>
</creator>
<creator>
<creatorName>Wertz, Ruth</creatorName>
</creator>
<creator>
<creatorName>Purzer, Senay</creatorName>
</creator>
</creators>
<titles>
<title>Critical Engineering Literacy Test (CELT)</title>
</titles>
<publisher>Purdue University Research Repository (PURR)</publisher>
<publicationYear>2013</publicationYear>
<subjects>
<subject>Assessment</subject>
<subject>Information Literacy</subject>
<subject>Engineering</subject>
<subject>Undergraduate Students</subject>
<subject>CELT</subject>
<subject>Purdue University</subject>
</subjects>
<language>eng</language>
<resourceType resourceTypeGeneral="Dataset">Dataset</resourceType>
<version>1</version>
<descriptions>
<description descriptionType="Abstract">
We developed an instrument, Critical Engineering Literacy Test (CELT), which is a multiple choice instrument designed to measure undergraduate students’ scientific and
information literacy skills. It requires students to first read a technical memo and, based on the memo’s arguments, answer eight multiple choice and six open-ended response
questions. We collected data from 143 first-year engineering students and conducted an item analysis. The KR-20 reliability of the instrument was .39. Item difficulties ranged
between .17 to .83. The results indicate low reliability index but acceptable levels of item difficulties and item discrimination indices. Students were most challenged when
answering items measuring scientific and mathematical literacy (i.e., identifying incorrect information).
</description>
</descriptions>
</resource>
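Because the record follows a schema, a program can pull fields out of it reliably. The sketch below reads a shortened copy of the record above with Python's standard `xml.etree.ElementTree`; the `summarize` function and the `d` namespace prefix are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Map a short prefix to the DataCite kernel-4 namespace used by the record.
NS = {"d": "http://datacite.org/schema/kernel-4"}

# Shortened copy of the example record.
XML = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.5072/D3P26Q35R-Test</identifier>
  <creators>
    <creator><creatorName>Fosmire, Michael</creatorName></creator>
    <creator><creatorName>Wertz, Ruth</creatorName></creator>
    <creator><creatorName>Purzer, Senay</creatorName></creator>
  </creators>
  <titles>
    <title>Critical Engineering Literacy Test (CELT)</title>
  </titles>
  <publicationYear>2013</publicationYear>
</resource>"""


def summarize(xml_text: str) -> dict:
    """Extract the DOI, creators, title and year from a DataCite record."""
    root = ET.fromstring(xml_text)
    return {
        "doi": root.findtext("d:identifier", namespaces=NS),
        "creators": [
            name.text
            for name in root.findall("d:creators/d:creator/d:creatorName", NS)
        ],
        "title": root.findtext("d:titles/d:title", namespaces=NS),
        "year": root.findtext("d:publicationYear", namespaces=NS),
    }
```

Every element sits in the default namespace declared on `<resource>`, which is why each path needs the `d:` prefix.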
79. Hands-on part 2 (DataCite)
• Looking at the DataCite schema
– Description:
• http://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf
• What relationships do these two DataCite
records show?
• ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/example_datacite_100038.xml
• ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/example_datacite_101041.xml
80. Answers
• 100038.xml Is a New Version Of dataset
doi:10.5524/100015
• 100038.xml Is Compiled By dataset
doi:10.5524/100044
• 10.5524/101041 Continues dataset
doi:10.5524/101000
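These relationships are carried in `relatedIdentifier` elements, so they can also be listed by a script. The fragment below is hand-written to mirror the first two answers and is not a copy of the actual example file; the assumption that the record's own DOI is 10.5524/100038 is inferred from the file name. `IsNewVersionOf` and `IsCompiledBy` are relation types defined by the DataCite schema.

```python
import xml.etree.ElementTree as ET

NS = {"d": "http://datacite.org/schema/kernel-4"}

# Hand-written fragment for illustration; not the real example file.
XML = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.5524/100038</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsNewVersionOf">10.5524/100015</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsCompiledBy">10.5524/100044</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""


def relationships(xml_text: str) -> list:
    """Return (relation type, related identifier) pairs from a record."""
    root = ET.fromstring(xml_text)
    return [
        (element.get("relationType"), element.text)
        for element in root.findall(
            "d:relatedIdentifiers/d:relatedIdentifier", NS
        )
    ]
```

The same function can be pointed at the downloaded example files to check your answers.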
81. BioCuration Life Cycle Summary
• As lead BioCurator for GigaDB, I am
involved in the schema design and data
capture of all types of life science data
behind GigaScience publications.
• We receive, appraise and ingest data into
GigaDB
• We preserve and store data
• We provide access for re-use of data
• All the while attempting to maintain
consistency
84. OpenRefine
• According to http://openrefine.org/
“OpenRefine (formerly Google Refine) is a
powerful tool for working with messy data:
cleaning it; transforming it from one format
into another”
• Very useful for Curators to enable
exploration (and cleaning/curation) of vast
tables of metadata
86. OpenRefine
• Download:
– http://openrefine.org/download.html
• Install: on Windows, just unzip it
• Run: open the file “openrefine.exe”
• Download example file:
– ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/sample_attribute_spreadsheet-example.csv
89. Some things to try
• Watch the 7-minute demo video:
– https://www.youtube.com/watch?v=B70J_H_zAWM
• Common transformations
– Cells to numbers
– Remove trailing white space
• Text Facet
– Look for attribute name = “analyte”
• Merge clusters
– Text facet on “attribute_name”
90. Quick test
• Can you find 5 problems in the
“attribute_name” column?
• Put some answers in the backchannel
http://backchannelchat.com/chat/dw131
91. There may be others!
• Alternative name = alternative names
• Height = Height or length = hight = high or
length
• Patient = patient ID
• Pool details = pooling details
• Specimen voucher = specimen_voucher
• Tissue = tissue type
• Life stage = life stageseed
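Some of these variants can be grouped automatically by reducing each name to a normalized key, in the spirit of key-collision clustering. The sketch below is a simplified variant written for this session, not OpenRefine's exact algorithm: it lower-cases the text, turns anything that is not a letter or digit into a space, then sorts and de-duplicates the words.

```python
import re
from collections import defaultdict

# A few attribute names taken from the list above.
NAMES = [
    "Specimen voucher", "specimen_voucher",
    "Pool details", "pooling details",
    "Tissue", "tissue type",
]


def fingerprint(name: str) -> str:
    """Reduce a name to a normalized key for grouping near-duplicates."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", name.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def clusters(names: list) -> dict:
    """Group names by fingerprint and keep groups with several members."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Only the "specimen voucher" pair is caught this way; pairs such as "Pool details" and "pooling details" differ in their words, so they need a fuzzier method or a curator's judgement.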
92. Looking at “value” field
• Problem: >10,000 unique terms
• Solution: first facet on attribute_name
• E.g. attribute_name = sex
– The number of different values
is 21! Can that be refined?
(I got down to 9)
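The same facet-then-refine idea can be sketched outside OpenRefine. In the example below the rows, the mapping table and the function names are all invented for illustration; they are not the values found in the example file.

```python
from collections import Counter

# Hypothetical (attribute_name, value) rows, for illustration only.
ROWS = [
    ("sex", "male"), ("sex", "Male"), ("sex", "M"),
    ("sex", "female"), ("sex", "F"),
    ("age", "12"),
]

# Hypothetical mapping from observed spellings to one preferred term.
SEX_MAP = {"m": "male", "male": "male", "f": "female", "female": "female"}


def facet(rows: list, attribute: str) -> Counter:
    """Count each distinct value recorded for one attribute."""
    return Counter(value for name, value in rows if name == attribute)


def refine_sex(value: str) -> str:
    """Map a spelling to its preferred term, leaving unknown values alone."""
    return SEX_MAP.get(value.strip().lower(), value)


def refined_facet(rows: list) -> Counter:
    """Facet the sex attribute after refining every value."""
    return Counter(refine_sex(value) for name, value in rows if name == "sex")
```

Here five distinct spellings collapse to two terms; values the mapping does not recognize are left untouched so that a curator can review them by hand.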
94. Summary
I’m a BioCurator using a variety of experiences to
help others publish data effectively
GigaScience is a unique publication combining the
traditional manuscript with open access to
underlying data via GigaDB
Biocuration is a broad field from fine details to high
level metadata
The goal of curation is to enable discovery of
knowledge
A variety of tools are available
95. Further reading / useful links
OpenRefine online tutorial
http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial
Excel / spreadsheet do’s and don’ts
http://kbroman.org/dataorg/
GSC MIxS
http://www.doi.org/10.1038/nbt.1823
Casrai – dictionary and standards
http://casrai.org/standards
List of biological standards, checklists and
databases
www.BioSharing.org
97. Reflection: how fair is FAIR?
Read the FAIR principles
paper.
Do you think they are applicable and
feasible for HK? If it is feasible, what
is needed to implement them?
http://www.nature.com/articles/sdata201618
Reminder: Please comment in Moodle Forum. Scott will give feedback on Monday
98. Reminder: Final Project
• For the final project for this course you need to
choose from 3 assignment options (see
Moodle).
• The assignment is due on the 15th May and it
is worth 40% of your grade.
• Time will be set aside for presenting on this
during the final class on the 24th April:
covering why you chose the option, what
discipline/dataset/topic you are covering, and
what work you've done so far (5 mins per
student including any group feedback)
Scott needs your slides by Monday morning for 5 min presentation.
99. Looking ahead…
• Final project due 10th May
– Need to present preliminary version on 26th
April to get feedback before completion. Send
Scott slides by the 25th April so he can get
them ready for the class