Dr Bennet McComish, Menzies Institute for Medical Research, University of Tasmania, presented at the Research Integrity Advisors Data Management Workshop in Hobart, 2017.
Big Datasets and Highly Sensitive Data
1. Big Datasets and Highly Sensitive Data
Bennet McComish
31 July 2017
2. Computational Genomics
Study of the structure, function, evolution, and mapping of genomes
Genes control our basic biology, how the body works, and how we respond to drugs
Changes in your genome make you who you are
They can also cause disease (such as cancer) or mean your cancer therapy doesn’t work (or works really well)
We study those changes to understand and improve your health
3. What is the human genome?
The genome is basically a string of letters (A T C G)
1 human genome = 3.2 billion letters or ‘bases’ spread across 23 chromosomes
~3% of the genome (roughly 100 million bases) is ‘coding’, making up ~25,000 genes
A print version of one genome at the Wellcome Collection: 120 books of 1,000 pages each, in 4.5-point text
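As a rough sanity check of that illustration (the per-page figure below is simply derived from the numbers above, not taken from the exhibit itself):

```python
# Back-of-envelope check of the "genome in print" illustration
genome_bases = 3_200_000_000        # ~3.2 billion letters
books, pages_per_book = 120, 1000   # figures quoted for the Wellcome Collection print

letters_per_page = genome_bases / (books * pages_per_book)
print(f"{letters_per_page:,.0f} letters per page")  # ~26,667 letters of 4.5-point text per page
```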
4. Genome sequencing
Technology now allows us to read the code of our genomes
We have a human ‘reference’ genome – made up of the most common sequence at each of its 3.2 billion bases
We compare a person’s genome with the reference to find all the ‘different’ sites (~3 million per person, or 0.1%)
We then focus only on the places where there are differences
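Conceptually, finding variants is a position-by-position comparison of two strings. A toy sketch of that idea (real pipelines work from millions of aligned reads rather than two finished genomes, and must also handle indels, sequencing errors and diploid genotypes):

```python
# Toy illustration: find positions where a sample differs from the reference.
reference = "ATCGGATTACA"
sample    = "ATCGGCTTACA"

variants = [(pos, ref, alt)
            for pos, (ref, alt) in enumerate(zip(reference, sample), start=1)
            if ref != alt]
print(variants)   # [(6, 'A', 'C')] -> one 'different' site
```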
7. "Big data"
Data volumes for one HiSeq 200 Gb run:
Image data: 32 TB – discarded
Intensity data: 2 TB – usually discarded
Raw sequence and quality score data: 250 GB – backed up
Aligned sequence: 100 GB – aligned to reference genome
Variation data: 1–10 GB – used in most analyses
Filtered variants of interest: 50–500 MB – depends on study
8. One study: 254 samples from 5 large families
Data overload? Don’t try to drink from the fire hydrant!
Use smart study design
Filter the data (a sketch follows this list):
changes that alter proteins
changes that run in families…
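A minimal sketch of that filtering step; the records and field names below are hypothetical stand-ins, and real studies filter annotated VCF files with established tools rather than hand-rolled code:

```python
# Hypothetical filtered-variant records; real data would come from an annotated VCF.
variants = [
    {"gene": "GENE_A", "effect": "missense",    "carriers": {"fam1_01", "fam1_02", "fam1_03"}},
    {"gene": "GENE_B", "effect": "synonymous",  "carriers": {"fam2_01"}},
    {"gene": "GENE_C", "effect": "stop_gained", "carriers": {"fam1_02"}},
]

PROTEIN_ALTERING = {"missense", "stop_gained", "frameshift", "splice_site"}

def interesting(v, min_carriers=2):
    """Keep changes that alter proteins and are shared by several family members."""
    return v["effect"] in PROTEIN_ALTERING and len(v["carriers"]) >= min_carriers

print([v["gene"] for v in variants if interesting(v)])   # ['GENE_A']
```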
9. Pipelines
Use fast parallelised analysis pipelines where possible
Even a parallelised pipeline takes several weeks to align 30 samples and call variants
This makes it difficult to use standard HPC queuing systems
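A minimal sketch of the “parallelise across samples” idea using only the Python standard library; the bwa command line and file names are placeholders, and in practice such pipelines are usually written in a workflow manager (e.g. Snakemake or Nextflow):

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

SAMPLES = [f"sample{i:02d}" for i in range(1, 31)]   # e.g. 30 samples

def align(sample: str) -> str:
    """Run one (placeholder) alignment job for a single sample."""
    cmd = ["bwa", "mem", "reference.fa",
           f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"]
    with open(f"{sample}.sam", "w") as out:
        subprocess.run(cmd, stdout=out, check=True)
    return sample

if __name__ == "__main__":
    # Run several samples at once; each individual job may still take days.
    with ProcessPoolExecutor(max_workers=8) as pool:
        for done in pool.map(align, SAMPLES):
            print(f"finished {done}")
```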
11. Data storage requirements
The Australian Code for the Responsible Conduct of Research requires us to keep research data and primary materials
All raw sequence data and final filtered data must be kept
We can discard some intermediate files, but we need a large amount of fast working storage
Data generation is now much cheaper and faster than data analysis
Data storage, transfer and analysis are now critical
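One concrete piece of the storage and transfer problem is verifying file integrity: checksums recorded when raw data is generated can be re-checked after every transfer or archive step. A minimal sketch (the directory and file names are illustrative):

```python
import hashlib
from pathlib import Path

def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """MD5 of a (potentially very large) file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record checksums alongside the raw FASTQ files before they go to archival storage.
for fastq in Path("raw_data").glob("*.fastq.gz"):
    print(f"{md5sum(fastq)}  {fastq.name}")
```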
12. Indigenous genomes
High incidence of vulvar cancer in the East Arnhem Indigenous population
Ten years' work securing appropriate consent
Consent strictly limited to the vulvar cancer study – Indigenous communities are often wary of genetic research
Risk management – damage to public perception and trust is often the biggest risk identified, far worse than losing data
13. Family studies
We infer family relationships from genetic data
These sometimes differ from those reported by the families
We can also infer information about family members not involved in the study
Full pedigrees can't always be published or shared
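A toy illustration of how relatedness falls out of genotype data: genotypes are coded as alternate-allele counts (0/1/2), and related pairs share more genotypes than unrelated pairs. Real studies use many thousands of markers and dedicated kinship estimators (e.g. KING, PLINK); the samples and values below are made up:

```python
# Toy genotypes at a handful of sites, coded as alternate-allele counts (0, 1, 2).
genotypes = {
    "parent":    [0, 1, 2, 1, 0, 2, 1, 1],
    "child":     [1, 1, 2, 0, 0, 2, 1, 2],
    "unrelated": [2, 0, 0, 2, 1, 0, 0, 1],
}

def ibs_similarity(a, b):
    """Mean identity-by-state: 1.0 = identical genotypes, 0.0 = opposite homozygotes."""
    return sum(1 - abs(x - y) / 2 for x, y in zip(a, b)) / len(a)

for pair in [("parent", "child"), ("parent", "unrelated")]:
    print(pair, round(ibs_similarity(genotypes[pair[0]], genotypes[pair[1]]), 2))
# ('parent', 'child') 0.81  vs  ('parent', 'unrelated') 0.38
```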
14. Genomes technically identifiable
Privacy Act 1988 – information is "personal" if identity "can reasonably be ascertained" from it
Identifying someone from their genome sequence is feasible and getting easier
Gymrek et al. (2013) Science 339:321
Shared/cloud resources more challenging to use in terms of data privacy