HKU Data Curation MLIM7350 Class 9

Data Curation:
A BioCurator’s perspective.
Chris Hunter
21 April 2017
chris@gigasciencejournal.com

Session structure
• Introductions:
– A bit about me, a bit about you, housekeeping,
what is GigaDB
• (Meta)Data Handling
– Curation, BioCuration, Sharing data
• BioCuration Life Cycle and tools
– Dictionaries, CVs, spreadsheets, standards
and checklists
• OpenRefine practical

Communicating in-class
• Chat channel:
http://backchannelchat.com/chat/dw131
• Feel free to ask questions, requests to
speed up/slow down
• The example files & slides available here:
ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/
Also feel free to email: chris@gigasciencejournal.com

This is me
• LinkedIn:
https://hk.linkedin.com/in/chr1shunter
• ORCID ID:
http://orcid.org/0000-0002-1335-0881

My background
• 95-99: Applied Biology Degree (Nottingham, UK)
• 99-03: Genetics/Genomics PhD (Cambridge, UK)
• 03-04: Postdoc – function of small DNA motifs
• 04-07: Postdoc – Cancer Genome Project
• 07-09: EBI – Curator for SRA
• 09-12: EBI – Bioinformatician/Curator on Metagenomics portal
• 13-present: GigaScience Database – Lead BioCurator

Why tell you about me?
• An indication of what qualifies me to be
teaching you about curation!
• The sort of person that you might meet in
the role of BioCurator
• To show that you don’t need to know your
end goal to make a career, just make the
most of opportunities.

Who are you?
• I would like to take a few minutes to hear
from each of you (~30secs each)
• Name
• Background
• Scientific/academic interests
• Any idea what’s next for your career?

GigaScience journal
• GigaScience is an OPEN access publisher
of Life Science articles
• Highly reproducible articles
• Focus on Big data
• Peer reviewed for reliability
• Provide open access free to all
• Run as a not-for-profit to best benefit
researchers

What makes us different?
• GigaDB

What is GigaDB?
• Open access database
• Data organized into datasets
• Datasets associated with GigaScience
articles
• Manually curated
• Indexed and searchable metadata
enabling discoverability and reuse.

• Currently >300 datasets available
• Genomic datasets represent the majority of
data (~55%)
• ~75% of all data from BGI (or
collaborators)
• ~20 different data types represented
• All manually curated

Data types
• Nucleotide:
– Genomic, Transcriptomic, Metagenomic
• Mass spectrometry:
– Proteomics, Metabolomics, MS-Imaging.
• Software & Workflows
• Other
– Imaging, Neuroscience, Network analysis

Anatomy of a GigaDB entry
• All relevant information
is held together in
packets called Datasets
• Each dataset has a
stable DOI page
• If required there can be
a hierarchy of datasets

• Title
• Study type(s)
• Image
• Citation
• Description
• Funders
• Links to Google
scholar and EuroPMC
to see who has cited
this dataset
• Email submitter
• Link to manuscript
• Links to external
resources
Cont.

• Samples used in
the study
• Files listed as part
of the study
• History of dataset
changes
• Social media links
• Links to other
datasets of similar
nature

Downloading the data
FTP
• Conventional/easy to use
• Can pull individually from
web page
• One or multiple files using
the Unix command line
• Speed = up to 1 Mb/sec
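
For scripted downloads, the same FTP access can be sketched in Python with the standard library. This is only a minimal sketch under stated assumptions: the helper names `split_ftp_url` and `download` are hypothetical, anonymous login is assumed to be allowed, and the URL is the course example file listed in these slides.

```python
from ftplib import FTP
from pathlib import PurePosixPath
from urllib.parse import urlparse


def split_ftp_url(url):
    """Split an ftp:// URL into (host, directory, filename)."""
    parts = urlparse(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    path = PurePosixPath(parts.path)
    return parts.hostname, str(path.parent), path.name


def download(url, dest=None):
    """Fetch one file over anonymous FTP (assumes the server allows it)."""
    host, directory, filename = split_ftp_url(url)
    dest = dest or filename
    with FTP(host) as ftp:
        ftp.login()                      # anonymous login
        ftp.cwd(directory)
        with open(dest, "wb") as out:
            ftp.retrbinary(f"RETR {filename}", out.write)
    return dest


EXAMPLE = ("ftp://climb.genomics.cn/pub/10.5524/presentations/"
           "MLIM.dir/sample_attribute_spreadsheet-example.csv")
print(split_ftp_url(EXAMPLE))            # no network access happens here
```

Calling `download(EXAMPLE)` would attempt the actual transfer; looping over a list of URLs covers the "multiple files" case.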

What is data?
• “Data may exist only in the eye of the
beholder: The recognition that an
observation, artifact, or record constitutes
data is itself a scholarly act.” (Borgman,
2012)
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci
Tec, 63: 1059–1078. doi:10.1002/asi.22634

What is data?
• We use the term “data” to be broadly
inclusive. It includes
– digital manifestations of literature
– laboratory data: including spectrographic, genomic
sequencing, and electron microscopy data
– observational data: remote sensing, geospatial, and
socioeconomic data
– other forms of data either generated or
compiled, by humans or machines: software,
scripts, intermediary data, tabular data used to generate
charts

How data is created
• Gathered or produced by researchers
– Observations, experiments, or models
– Survey results
– Records (census, economic, etc.)
– Digitized/born digital text and images

So what is metadata?
• Data ABOUT data
• a set of data that describes and gives
information about other data.
– http://dictionary.casrai.org/Metadata
• It’s not a new concept, think about old
catalog cards
WikiData: Tomwsulcer

Curate the data
• To classify and catalog data
• Metadata is the classification and
cataloging of data to aid discoverability
and reuse.
• Strongly reliant on controlled vocabularies
and ontological terms

Data curation is…
• “the active and ongoing management of
data throughout its entire lifecycle of
interest and usefulness to scholarship”
Cragin et al., 2007
http://hdl.handle.net/2142/3493
• I would also add:
“the cataloging of data to increase its
usefulness”

Data curation…
• Is a dynamic process
– Not a one time, or one step activity
• Happens in a lifecycle
– Creation, management, preservation
• Aims to maintain the utility of the data

What gets curated?
• Data
– At various stages
• Methods (sometimes)
– Algorithms, code
• Metadata
– Information about the data
• Links
– metadata can form networks of linked data to
help knowledge acquisition

Data curation or BioCuration?
• Distinct, but related
• Data curation is broader
• BioCuration is more specific to Biological
data curation
– “Biocuration involves the translation and
integration of information relevant to biology
into a database or resource that enables
integration of the scientific literature as well as
large data sets. ”
http://biocuration.org/dissemination/who-are-we/

BioCuration
• The process of curating biological data
• International Society of BioCuration (ISB)
– Yearly meetings
– Society website (http://biocuration.org/)
– Discussion forum
– Job adverts

Why share data?
• Concepts related to the scientific method
• Reproducibility:
– Experiment can be replicated by the original
researcher or another researcher
• Reliability:
– Similar results can be achieved in other
experiments
• Re-use
– Others can make use of data in other ways
than originally intended

What’s important?
• An attractive, tabular lay-out in a
spreadsheet for presentational purposes?
• An accessible version that is suitable for
re-use with minimal editing?
• Both of the above?
– Consider releasing multiple formats of your
data

Manuscripts
• The traditional publication is a
“presentational” version of the data,
– often lurking in supplemental files as PDFs

Data Journals
• Publication option for datasets
– Often discipline-specific
– Can be peer-reviewed
• Sometimes provide a means of usable
data release, or sometimes just an
independently citable version of
supplemental files.

Data Repositories
• Where data is stored for the long term
• Computer accessible
• Some repositories are discipline-specific
– Genomic data: GenBank / ENA
• Some repositories are built for an
organization
– For a university / institute
– For a funder
– Not-for-profit (Dryad, Figshare, GigaDB, Zenodo)

FYI: GigaScience is…
• Combination of
– Peer reviewed Manuscript publication
linked to a
– Manually curated Data repository

(Primary) BioCuration activities
• Documentation
– Keeping track of how the data was:
• Generated; used; analyzed
• Annotation
– Addition of structured information to
accompany data/files
• Connection
– Linking of files/data to related items both
within dataset and to external items

(Ancillary) BioCuration activities
• Collection and aggregation
– Files in directories; databases
• Storage and archiving
– Saving data (on digital media)
– Providing consistent and permanent
identifiers (DOI)
• Migration
– Active preservation of data to keep it readable
• Repeat the process on an ongoing basis

The BioCurator’s tools
• Ontologies / CVs / Dictionaries
• key:value pairs, RDF/triplestores

Dictionary
• An alphabetical reference list of terms or names
important to a particular subject or activity along
with discussion of their meanings and applications
• Casrai
– Particularly IRIDIUM (Research Data Management):
• http://dictionary.casrai.org/Category:Research_Data_Domain
– Many other dictionaries maintained by Casrai
http://casrai.org/standards

Controlled Vocabularies
• A controlled vocabulary is an organized
arrangement of words and phrases used to
index content
• Can be a subset of a dictionary

Key:value pairs
• A key-value pair (KVP) is a set of two linked
data items: a key, which is a unique identifier for
some item of data, and the value, which is either
the data that is identified or a pointer to the
location of that data.
• Structured pairing of particular terms
• One or both can be from CVs
• Particularly used for computer-readable
metadata
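
As a rough illustration of key:value pairs drawing on controlled vocabularies, the sketch below checks each pair against a CV of allowed keys and, where one exists, a CV of allowed values. The vocabularies, the sample and the function name `check_pair` are all hypothetical and exist only for this example.

```python
# Hypothetical controlled vocabularies, for illustration only.
ATTRIBUTE_CV = {"sex", "tissue", "life stage"}
VALUE_CV = {"sex": {"male", "female", "not determined"}}


def check_pair(key, value):
    """Return a list of problems found in one key:value pair."""
    problems = []
    if key not in ATTRIBUTE_CV:
        problems.append(f"unknown key: {key!r}")
    allowed = VALUE_CV.get(key)
    if allowed is not None and value not in allowed:
        problems.append(f"value {value!r} not in CV for {key!r}")
    return problems


# A sample described as key:value pairs; two of them break the CVs.
sample = {"sex": "Male", "tissue": "liver", "Tissue type": "leaf"}
for key, value in sample.items():
    print(key, "->", check_pair(key, value) or "ok")
```

Keys without a value CV (such as `tissue` here) accept any value, which mirrors the "one or both" point above.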

Ontologies
• a set of concepts and categories in a subject area or
domain that shows their properties and the relations
between them.
• More complex than CVs: includes relationship
information and inherited concepts
• Most ontologies in common use in
BioCuration are in fact hierarchical CVs
• Much work is being done to integrate, merge
and unify many of these into a true ontology
which will enable semantic web applications.
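
A hierarchical CV of the kind described above can be sketched as a mapping from each term to its parent. The terms below are hypothetical placeholders; real ontologies identify terms by stable identifiers rather than bare labels, and usually allow more than one parent and more than one relationship type.

```python
# Hypothetical terms linked by a single "is_a" relationship.
IS_A = {
    "liver": "organ",
    "kidney": "organ",
    "organ": "anatomical structure",
    "anatomical structure": None,   # root term
}


def ancestors(term):
    """Walk the is_a links from a term up to the root."""
    chain = []
    parent = IS_A[term]
    while parent is not None:
        chain.append(parent)
        parent = IS_A[parent]
    return chain


def is_a(term, candidate):
    """True if `candidate` is the term itself or one of its ancestors."""
    return term == candidate or candidate in ancestors(term)


print(ancestors("liver"))
print(is_a("kidney", "organ"))
```

The inherited concepts are what make a search for "organ" also find samples annotated as "liver".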

RDF (Resource Description
Framework)
• a model for encoding semantic relationships
between items of data so that these
relationships can be interpreted computationally.
• A complete extrapolation of all ontologies
to include all CVs with dictionary
definitions and links to all related terms
• Entirely computer readable using URIs
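
To make the idea of computable relationships concrete, here is a minimal sketch that stores statements as (subject, predicate, object) triples and follows links between them. It uses plain Python rather than an RDF library, and every `example.org` URI is a placeholder rather than a term from a real vocabulary.

```python
# Each statement is a (subject, predicate, object) triple of URIs or literals.
# The example.org URIs are placeholders, not real vocabulary terms.
EX = "http://example.org/"
TRIPLES = {
    (EX + "dataset/1", EX + "hasSample", EX + "sample/42"),
    (EX + "sample/42", EX + "hasSpecies", EX + "taxon/9606"),
    (EX + "taxon/9606", EX + "label", "Homo sapiens"),
}


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def follow(start, *predicates):
    """Follow a chain of predicates from a starting node."""
    nodes = {start}
    for pred in predicates:
        nodes = {o for n in nodes for (_, _, o) in match(s=n, p=pred)}
    return nodes


print(follow(EX + "dataset/1", EX + "hasSample", EX + "hasSpecies", EX + "label"))
```

A triplestore does the same kind of pattern matching at scale, typically through a query language such as SPARQL.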

Questions?
Reminder for Chris: It’s probably about time for a break!

The BioCurator’s tools (2)
• key:value pairs, RDF/triplestores,
• tools for handling metadata (Excel, CSV,
OpenRefine)

What’s good about spreadsheets?
• Most people are familiar with them
• No programming skills required
• Can be used to make data look pretty
(highlighting, different fonts, etc)
• Are forgiving of non-data cells (e.g.
comments)

What’s bad about spreadsheets?
• They allow merging of cells & other odd
formatting to appeal to the eye.
• Dates (reformatted)
• Spreadsheet programs are not appropriate
for analysis/statistics.
• Incompatible (native) file formats with
command line software such as R
• Size limitations (requires a lot of RAM to open files
with millions of rows)

• Most people still use spreadsheets to
organize their own data
• Good practices with data collection can aid
downstream processes

Using spreadsheets wisely
• Useful reference http://kbroman.org/dataorg/
– Be consistent
– Write dates as YYYY-MM-DD
– Fill in all of the cells
– Put just one thing in a cell
– Create a data dictionary (like a CV)
– No calculations in the raw data files
– Don’t use font colour or highlighting as data
– Choose good names for things
– Make backups
– Save the data in plain text files
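
Two of these rules, "fill in all of the cells" and "write dates as YYYY-MM-DD", are easy to check automatically once the data is saved as plain text. The sketch below is one possible check; the table, its column names and the function name `audit` are invented for illustration.

```python
import csv
import io
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Hypothetical table; the column names are invented for this sketch.
RAW = """sample_id,collection_date,tissue
S1,2016-03-01,liver
S2,01/03/2016,
S3,2016-3-1,leaf
"""


def audit(text, date_column):
    """Yield (row_number, message) for empty cells and non-ISO dates."""
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=2):  # header is row 1
        for column, value in row.items():
            if not (value or "").strip():
                yield row_number, f"empty cell in {column}"
        date = row[date_column]
        if date and not ISO_DATE.match(date):
            yield row_number, f"date not YYYY-MM-DD: {date}"


for problem in audit(RAW, "collection_date"):
    print(problem)
```

The pattern only checks the shape of the date, not whether the day and month are valid; `datetime.date.fromisoformat` could be added for that.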

Hands-on part 1 (Excel)
• First of three quick practical examples of
BioCuration
– Using Excel wisely
– Exploring the DataCite XML schema
– Rationalising data using OpenRefine

Excel
• Keep in mind: http://kbroman.org/dataorg/
• Using this file as a starting point:
• ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/sample_attribute_spreadsheet-example.csv
• It contains 10,000 rows of the GigaDB
sample attributes table

Questions
• Are the dates affected by being
manipulated via Excel?
• Do the ages all have units?
• What has happened with some of the text
in the first few rows?!
• Are all latitude and longitude values
consistent and appropriate?

Answers
• Some dates appear as serial dates, i.e. the
number of days after (or before) 1900-Jan-01,
e.g. 37074 = 2001-Jul-02
• Null dates have been converted to 0 or
1900-Jan-00
• Only 403 / 928 age values have units
• The hyphen has been converted to â€“,
which is a sign of a UTF-8 encoding problem:
– http://www.i18nqa.com/debug/utf8-debug.html
• Only 2 lat-long values in this subset, and
they are in different formats!
– 29.097221 -83.067351
– 44.000306N, 16.01625E
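
Two of the problems above can be repaired programmatically. The sketch below converts Excel serial day numbers back to ISO dates and reverses the most common cause of garbled dashes, UTF-8 text that was read as Windows-1252. Both function names are hypothetical, and the sketch assumes the spreadsheet used Excel's 1900 date system and that Windows-1252 was the wrong encoding applied.

```python
from datetime import date, timedelta


def excel_serial_to_iso(serial):
    """Convert an Excel 1900-system serial day number to YYYY-MM-DD.

    Returns None for 0 or negative serials (Excel shows 0 as 1900-Jan-00)
    and for 60, which Excel maps to the non-existent 1900-02-29.
    """
    if serial < 1 or serial == 60:
        return None
    # Serials after 60 are offset by one day because of the phantom leap day.
    epoch = date(1899, 12, 30) if serial > 60 else date(1899, 12, 31)
    return (epoch + timedelta(days=serial)).isoformat()


def repair_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage; leave untouched


print(excel_serial_to_iso(37074))
# Simulate the damage on a range written with an en dash, then repair it.
garbled = "2001\u20132002".encode("utf-8").decode("cp1252")
print(repr(garbled), "->", repr(repair_mojibake(garbled)))
```

A serial of 0 is returned as `None` so that null dates stay visibly null instead of becoming a fictitious 1900 date.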

The BioCurator’s tools (3)
• key:value pairs, RDF/triplestores,
• tools for handling metadata (Excel, CSV,
OpenRefine)
• Database (SQL/MySQL etc.)
• Structured computational formats (XML,
JSON)
• Standards

Standards
• Examples:
– Dublin Core
– GSC
• Resources:
– www.BioSharing.org
• Results of the use of standards:
– www.Repositive.io

Dublin Core
• “The Dublin Core metadata standard is a simple
yet effective element set for describing a wide
range of networked resources.”
http://dublincore.org/documents/usageguide/index.shtml
Contributor
Coverage
Creator
Date
Description
Format
Identifier
Language
Publisher
Relation
Rights
Source
Subject
Title
Type
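
Because the element set is small and fixed, a record can be checked against it with a few lines of code. The sketch below holds the 15 elements listed above as a set and reports any keys in a record that fall outside it; the record and all of its values are hypothetical placeholders.

```python
DUBLIN_CORE_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# Hypothetical record; every value is a placeholder.
record = {
    "title": "Example genome assembly",
    "creator": ["Doe, Jane", "Roe, Richard"],
    "date": "2017-04-21",
    "type": "Dataset",
    "identifier": "doi:10.5072/example",
    "keywords": "genomics",     # not a Dublin Core element
}


def unknown_elements(rec):
    """Return the keys that are not among the 15 Dublin Core elements."""
    return sorted(set(rec) - DUBLIN_CORE_ELEMENTS)


print(unknown_elements(record))
```

Here `keywords` would be flagged; in Dublin Core the closest element is `subject`.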

Genomic Standards Consortium
• Minimum Information about any (x) Sequence –
“MIxS” *
• Covers a variety of different
“environmental packages”
• Each recommends terms from a list of
~700 defined attributes
• Each has ~10-20 mandatory attributes
• MIxS is effectively a dictionary of attributes
* Yilmaz, P et al. Nature Biotechnology 29, 415-420 (2011) doi:10.1038/nbt.1823

Example of MIxS compliant sample
• Publication: Standards in Genomic Sciences 2016 11:91,
DOI: 10.1186/s40793-016-0213-3
• Attributes
– Description: Actinoalloteichus hymeniacidonis DSM 45092, an
actinomycete isolated from the marine sponge Hymeniacidon perleve
– BioProject: PRJNA273752
– strain: HPA177(T) (=DSM 45092(T))
– host: Hymeniacidon perleve
– isolation source: intertidal marine sponge from the beach of Dalian
– collection date: 2006
– geographic location: China: beach of Dalian
– sample type: pure culture
– biomaterial provider: DSM 45092
– culture collection: DSM:45092
– environment biome: intertidal zone
– host tissue sampled: washed sponge
– latitude and longitude: 38.8667 N 121.6833 E
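
The sample above writes latitude and longitude as `38.8667 N 121.6833 E`, while the Excel exercise turned up `29.097221 -83.067351` and `44.000306N, 16.01625E`. A curator might bring all three to signed decimal degrees with something like the sketch below. The function name `parse_lat_lon` is hypothetical, and the sketch assumes latitude always comes first and that values are already in decimal degrees.

```python
import re

# A decimal number with an optional sign and an optional N/S/E/W suffix.
COORD = re.compile(r"([+-]?\d+(?:\.\d+)?)\s*([NSEW])?", re.IGNORECASE)


def parse_lat_lon(text):
    """Return (latitude, longitude) in signed decimal degrees."""
    found = COORD.findall(text)
    if len(found) != 2:
        raise ValueError(f"expected two coordinates in {text!r}")
    values = []
    for number, hemisphere in found:
        value = float(number)
        if hemisphere.upper() in ("S", "W"):
            value = -value          # south and west are negative
        values.append(value)
    lat, lon = values
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"out of range: {text!r}")
    return lat, lon


for raw in ("38.8667 N 121.6833 E",
            "29.097221 -83.067351",
            "44.000306N, 16.01625E"):
    print(raw, "->", parse_lat_lon(raw))
```

Degrees-minutes-seconds notation is deliberately not handled; values in that form would raise an error and need separate treatment.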

Effective standards and checklists
• Make extensive use of CVs, Ontologies
and KVPs
• Uptake of new standards is usually slow
and requires incentives for users

Application Programming Interface
• While webpages are human-readable,
machines require structured data
• Application Programming Interface (API)

Schema design
• In order for machines to understand data
and its relationships they need to follow a
set structure (schema).
• GigaDB has a fairly complex structure as a
relational database

partially expressed in 785 lines of XSD schema for beta API

• GigaDB is complex
• DataCite is less complicated, it’s stored in
XML (the comprehensive XSD to describe it is ~500
lines)

DataCite
• The XSD is available here:
– http://schema.datacite.org/meta/kernel-4.0/metadata.xsd
• And described here:
– http://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf
• Examples are provided
– http://schema.datacite.org/meta/kernel-4.0/

A simple DataCite example
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd">
  <identifier identifierType="DOI">10.5072/D3P26Q35R-Test</identifier>
  <creators>
    <creator>
      <creatorName>Fosmire, Michael</creatorName>
    </creator>
    <creator>
      <creatorName>Wertz, Ruth</creatorName>
    </creator>
    <creator>
      <creatorName>Purzer, Senay</creatorName>
    </creator>
  </creators>
  <titles>
    <title>Critical Engineering Literacy Test (CELT)</title>
  </titles>
  <publisher>Purdue University Research Repository (PURR)</publisher>
  <publicationYear>2013</publicationYear>
  <subjects>
    <subject>Assessment</subject>
    <subject>Information Literacy</subject>
    <subject>Engineering</subject>
    <subject>Undergraduate Students</subject>
    <subject>CELT</subject>
    <subject>Purdue University</subject>
  </subjects>
  <language>eng</language>
  <resourceType resourceTypeGeneral="Dataset">Dataset</resourceType>
  <version>1</version>
  <descriptions>
    <description descriptionType="Abstract">
      We developed an instrument, Critical Engineering Literacy Test (CELT), which is a multiple choice instrument designed to measure undergraduate students’ scientific and
      information literacy skills. It requires students to first read a technical memo and, based on the memo’s arguments, answer eight multiple choice and six open-ended response
      questions. We collected data from 143 first-year engineering students and conducted an item analysis. The KR-20 reliability of the instrument was .39. Item difficulties ranged
      between .17 to .83. The results indicate low reliability index but acceptable levels of item difficulties and item discrimination indices. Students were most challenged when
      answering items measuring scientific and mathematical literacy (i.e., identifying incorrect information).
    </description>
  </descriptions>
</resource>
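
Because the record follows a schema, a program can pull out citation fields without any guesswork. The sketch below uses Python's standard `xml.etree.ElementTree` on an abbreviated copy of the record above; the function name `summarise` is hypothetical, and only the kernel-4 namespace shown in the record is assumed.

```python
import xml.etree.ElementTree as ET

# Map a short prefix to the default namespace used by the record.
NS = {"d": "http://datacite.org/schema/kernel-4"}

# Abbreviated copy of the record above.
XML = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.5072/D3P26Q35R-Test</identifier>
  <creators>
    <creator><creatorName>Fosmire, Michael</creatorName></creator>
    <creator><creatorName>Wertz, Ruth</creatorName></creator>
  </creators>
  <titles>
    <title>Critical Engineering Literacy Test (CELT)</title>
  </titles>
  <publicationYear>2013</publicationYear>
</resource>"""


def summarise(xml_text):
    """Pull a few citation fields out of a DataCite kernel-4 record."""
    root = ET.fromstring(xml_text)
    return {
        "doi": root.findtext("d:identifier", namespaces=NS),
        "creators": [e.text for e in root.findall(".//d:creatorName", NS)],
        "title": root.findtext("d:titles/d:title", namespaces=NS),
        "year": root.findtext("d:publicationYear", namespaces=NS),
    }


print(summarise(XML))
```

Without the namespace mapping the lookups return nothing, which is a common stumbling block when reading schema-bound XML.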

Hands-on part 2 (DataCite)
• Looking at the DataCite schema
– Description:
• http://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf
• What relationships do these two DataCite
records show?
• ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/example_datacite_100038.xml
• ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/example_datacite_101041.xml

Answers
• 100038.xml Is a New Version Of dataset
doi:10.5524/100015
• 100038.xml Is Compiled By dataset
doi:10.5524/100044
• 10.5524/101041 Continues dataset
doi:10.5524/101000
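
In DataCite XML these relationships live in `relatedIdentifier` elements, with the relationship named in the `relationType` attribute. The sketch below lists them; the XML fragment is a hypothetical reconstruction from the answers above, not a copy of the real file, and the real records may use a different schema version.

```python
import xml.etree.ElementTree as ET

NS = {"d": "http://datacite.org/schema/kernel-4"}

# Hypothetical fragment rebuilt from the answers above, not the real file.
XML = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI"
        relationType="IsNewVersionOf">10.5524/100015</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI"
        relationType="IsCompiledBy">10.5524/100044</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""


def relations(xml_text):
    """Return (relationType, related identifier) pairs from a record."""
    root = ET.fromstring(xml_text)
    return [
        (el.get("relationType"), el.text.strip())
        for el in root.iterfind(".//d:relatedIdentifier", NS)
    ]


for relation, target in relations(XML):
    print(relation, target)
```

Collecting these pairs across many records is one way the linked networks of datasets mentioned earlier can be built.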

BioCuration Life Cycle Summary
• As lead BioCurator for GigaDB, I am
involved in the schema design and data
capture of all types of life science data
behind GigaScience publications.
• We receive, appraise and ingest data into
GigaDB
• We preserve and store data
• We provide access for re-use of data
• All the while attempting to maintain
consistency

BioCuration Life Cycle Summary
Helping build knowledge from data

OpenRefine
• According to http://openrefine.org/
“OpenRefine (formerly Google Refine) is a
powerful tool for working with messy data:
cleaning it; transforming it from one format
into another”
• Very useful for Curators to enable
exploration (and cleaning/curation) of vast
tables of metadata

PRACTICAL EXAMPLE 3
Rationalizing data using OpenRefine

OpenRefine
• Download:
– http://openrefine.org/download.html
• Install: (on Windows, just unzip it)
• Run: open file “openrefine.exe”
• Download example file:
– ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/sample_attribute_spreadsheet-example.csv

Some things to try
• Watch the 7 minute demo video:
– https://www.youtube.com/watch?v=B70J_H_zAWM
• Common transformations
– Cells to numbers
– Remove trailing white space
• Text Facet
– Look for attribute name = “analyte”
• Merge clusters
– Text facet on “attribute_name”

Quick test
• Can you find 5 problems in the
“attribute_name” column?
• Put some answers in the backchannel
http://backchannelchat.com/chat/dw131

There may be others!
• Alternative name = alternative names
• Height = Height or length = hight = high or
length
• Patient = patient ID
• Pool details = pooling details
• Specimen voucher = specimen_voucher
• Tissue = tissue type
• Life stage = life stageseed
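
OpenRefine's "Merge clusters" feature includes a key collision method that groups labels sharing a normalised "fingerprint". The sketch below is a simplified approximation of that idea, not OpenRefine's exact algorithm: treating underscores as spaces is an assumption added here, and the list of names mixes examples from the list above with an invented trailing-space variant.

```python
import re
from collections import defaultdict


def fingerprint(name):
    """Simplified key-collision fingerprint for a free-text label."""
    text = name.strip().lower().replace("_", " ")
    text = re.sub(r"[^\w\s]", "", text)          # drop punctuation
    tokens = sorted(set(text.split()))           # order-insensitive
    return " ".join(tokens)


def cluster(names):
    """Group names that share a fingerprint; keep groups of 2 or more."""
    groups = defaultdict(set)
    for name in names:
        groups[fingerprint(name)].add(name)
    return [sorted(g) for g in groups.values() if len(g) > 1]


names = ["Specimen voucher", "specimen_voucher", "Tissue", "tissue type",
         "Life stage", "life stage ", "Height", "hight"]
print(cluster(names))
```

Pairs such as "Tissue" / "tissue type" or "Height" / "hight" do not collide on a fingerprint, so they still need a fuzzier clustering method or a curator's eye.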

Looking at “value” field
• Problem is >10,000 unique terms
• Solution: first facet on attribute_name
• E.g. attribute_name = sex
– The number of different values
is 21! Can that be refined?
( I got down to 9)
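
The same facet-then-refine approach can be sketched outside OpenRefine. In the example below the rows and the synonym table are hypothetical; in practice the mapping would be built by a curator after inspecting the facet, and the real file has many more variants.

```python
from collections import Counter

# Hypothetical (attribute_name, value) rows; the real file has far more.
ROWS = [
    ("sex", "male"), ("sex", "Male"), ("sex", "M"), ("sex", "female"),
    ("sex", "F"), ("sex", "Female "), ("sex", "not determined"),
    ("tissue", "liver"),
]

# Curator-maintained mapping from observed spellings to preferred terms.
SEX_SYNONYMS = {"m": "male", "f": "female"}


def facet(rows, attribute):
    """Count each distinct value recorded for one attribute."""
    return Counter(value for name, value in rows if name == attribute)


def normalise_sex(value):
    """Trim, lower-case, then map known abbreviations to the preferred term."""
    cleaned = value.strip().lower()
    return SEX_SYNONYMS.get(cleaned, cleaned)


before = facet(ROWS, "sex")
after = Counter(normalise_sex(v) for v in before.elements())
print(len(before), "distinct values before,", len(after), "after")
```

Values that cannot be mapped with confidence are left as they are, which is why a real clean-up may not get down to a single preferred term per concept.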

Summary
• I’m a BioCurator using a variety of experiences to
help others publish data effectively
• GigaScience is a unique publication combining the
traditional manuscript with open access to
underlying data via GigaDB
• Biocuration is a broad field from fine details to high
level metadata
• The goal of curation is to enable discovery of
knowledge
• A variety of tools are available

Further reading / useful links
• OpenRefine online tutorial
http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial
• Excel / spreadsheet do’s and don’ts
http://kbroman.org/dataorg/
• GSC MIxS
http://www.doi.org/10.1038/nbt.1823
• Casrai – dictionary and standards
http://casrai.org/standards
• List of biological standards, checklists and
databases
www.BioSharing.org

Reflection: how fair is FAIR?
Read the FAIR principles
paper.
Do you think they are applicable and
feasible for HK? If it is feasible, what
is needed to implement them?
http://www.nature.com/articles/sdata201618
Reminder: Please comment in Moodle Forum. Scott will give feedback on Monday

Reminder: Final Project
• For the final project for this course you need to
choose from 3 assignment options (see
Moodle).
• The assignment is due on the 15th May and it
is worth 40% of your grade.
• Time will be set aside for presenting on this
during the final class on the 24th April:
covering why you chose the option, what
discipline/dataset/topic you are covering, and
what work you've done so far (5 mins per
student including any group feedback)
Scott needs your slides by Monday morning for 5 min presentation.

Looking ahead…
• Final project due 10th May
– Need to present preliminary version on 26th
April to get feedback before completion. Send
Scott slides by the 25th April so he can get
them ready for the class
