Big Datasets and Highly
Sensitive Data
Bennet McComish
31 July 2017
Computational Genomics
Study of the structure, function, evolution, and mapping of genomes
Genes control our basic biology, how the body works, how we respond to drugs
Changes in your genome make you who you are
They can also cause disease (such as cancer) or mean your cancer therapy doesn’t work (or works really well)
We study those changes to understand and improve your health
2/14
What is the human genome?
The genome is basically a string of letters (A T C G)
1 human genome = 3.2 billion letters or ‘bases’ spread across 23 chromosomes
About 3% of the genome (roughly 96 million bases) is ‘coding’, for ~25,000 genes
Print version of one genome at the “Wellcome Collection”
120 books, 1000 pages each at 4.5 point text
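As a rough sanity check on these figures, the arithmetic can be sketched in a few lines of Python. The genome size and book counts come from the slide; the 2-bits-per-base storage figure is an assumption added here for illustration (four letters fit in two bits), not something the slide states.

```python
# Back-of-envelope figures for one human genome.
GENOME_BASES = 3_200_000_000   # 3.2 billion bases (from the slide)
BOOKS = 120                    # printed volumes (from the slide)
PAGES_PER_BOOK = 1000          # pages per volume (from the slide)


def bases_per_page(genome_bases=GENOME_BASES, books=BOOKS, pages=PAGES_PER_BOOK):
    """Average number of letters each printed page has to hold."""
    return genome_bases / (books * pages)


def packed_size_mb(genome_bases=GENOME_BASES, bits_per_base=2):
    """Size in megabytes if each base is packed into 2 bits (an assumption)."""
    return genome_bases * bits_per_base / 8 / 1_000_000


print(round(bases_per_page()))   # 26667 letters per page
print(packed_size_mb())          # 800.0 MB
```

Roughly 27,000 letters per page goes some way to explaining the 4.5 point text.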
3/14
Genome sequencing
Technology now allows us to read the code of our genomes
We have a human ‘reference’ genome – made of the most common (3.2 billion base) sequence
We compare a person’s genome with the reference to find all the ‘different’ sites (~3 million per person or 0.1%)
Then only focus on the places where there are differences
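The idea of comparing against a reference and keeping only the differing sites can be illustrated with a toy sketch. It assumes the two sequences are already aligned, have equal length, and differ only by single-letter substitutions; real variant calling works from many short reads and statistical models, which this does not attempt. The function name and example sequences are hypothetical.

```python
def find_differences(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatching site.

    Toy version: assumes pre-aligned, equal-length sequences and
    substitutions only (no insertions or deletions).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (i, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


reference = "ATCGGATCCA"
sample = "ATCGTATCGA"
print(find_differences(reference, sample))  # [(4, 'G', 'T'), (8, 'C', 'G')]
```

Only the two differing positions are carried forward; the eight matching sites are dropped, which is the same reduction the slide describes at genome scale.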
4/14
Genome variation
5/14
Cost of sequencing
Approaching the "$1000 genome"
Exponential increase in the number of genomes being sequenced
Bottleneck has moved from data generation to data analysis
6/14
"Big data"
HiSeq 200G run
Data                                   Size         Notes
Image data                             32 TB        discarded
Intensity data                         2 TB         usually discarded
Raw sequence and quality score data    250 GB       backed up
Aligned sequence                       100 GB       aligned to ref. genome
Variation data                         1-10 GB      used in most analysis
Filtered variants of interest          50-500 MB    depends on study
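To make the scale of the reduction concrete, here is a small sketch that computes how much smaller each stage is than the one before it. It uses the sizes listed above, assumes decimal units (1 TB = 1000 GB), and takes the upper bound of 10 GB for variation data; those choices are assumptions made here for illustration.

```python
# Sizes in GB for one run, taken from the figures listed above
# (1 TB = 1000 GB assumed; upper bound used for variation data).
STAGE_SIZES_GB = [
    ("image data", 32_000),
    ("intensity data", 2_000),
    ("raw sequence and quality scores", 250),
    ("aligned sequence", 100),
    ("variation data (upper bound)", 10),
]


def reduction_factors(stages):
    """Return how many times smaller each stage is than the previous one."""
    return [
        (prev_name, name, prev_size / size)
        for (prev_name, prev_size), (name, size) in zip(stages, stages[1:])
    ]


for prev_name, name, factor in reduction_factors(STAGE_SIZES_GB):
    print(f"{prev_name} -> {name}: {factor:g}x smaller")
```

Under these assumptions the data shrinks by a factor of 3200 between the images and the variation data used in most analyses.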
7/14
Data overload?
One study: 254 samples from 5 large families
Don't try to drink from the fire hydrant!
Use smart study design
Filter the data:
· changes that alter proteins
· changes that run in families…
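The two filters named above could be sketched as follows. This is only an illustration: the `Variant` record, its fields, and the rule used for "runs in families" (carried by every affected relative and by no unaffected one) are assumptions made here, not the study's actual criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    position: int
    alters_protein: bool   # hypothetical flag from an annotation step
    carriers: frozenset    # hypothetical IDs of family members carrying the change


def filter_variants(variants, affected, unaffected):
    """Keep protein-altering variants that track with disease in the family."""
    affected = frozenset(affected)
    unaffected = frozenset(unaffected)
    return [
        v for v in variants
        if v.alters_protein                 # filter 1: changes that alter proteins
        and affected <= v.carriers          # filter 2: present in all affected...
        and not (unaffected & v.carriers)   # ...and absent from the unaffected
    ]


variants = [
    Variant(1001, True, frozenset({"mother", "daughter"})),
    Variant(2002, False, frozenset({"mother", "daughter"})),
    Variant(3003, True, frozenset({"mother", "father"})),
]
kept = filter_variants(variants, affected={"mother", "daughter"}, unaffected={"father"})
print([v.position for v in kept])  # [1001]
```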
8/14
Pipelines
Use fast parallelised analysis pipelines where possible
Even a parallelised pipeline takes several weeks to align 30 samples and call variants
This makes it difficult to use standard HPC queuing systems
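A minimal sketch of per-sample parallelism, using Python's standard `concurrent.futures` module. The `process_sample` function is a hypothetical placeholder for the align-and-call step, which in practice would launch external tools and run for days; the worker count is likewise an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample_id):
    """Placeholder for one sample's align-and-call step (hypothetical)."""
    return f"{sample_id}.variants"


def run_pipeline(sample_ids, max_workers=4):
    """Process samples concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_sample, sample_ids))


samples = [f"sample{n:02d}" for n in range(1, 31)]
results = run_pipeline(samples)
print(len(results), results[0])  # 30 sample01.variants
```

Threads are enough here because the placeholder does no heavy computation itself; a step that did CPU-bound work inside Python would more likely use a process pool.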
9/14
Menzies Computational Genomics Cluster
Sunnydale
· 4 compute nodes
· 250 CPUs
· 2 TB RAM
· 214 TB working data
· 200 TB secure archive storage
10/14
Data storage requirements
The Australian Code for the Responsible Conduct of Research requires us to keep research data and primary materials
All raw sequence data and final filtered data must be kept
Can discard some intermediate files, but need a large amount of fast working storage
Data generation is now much cheaper and faster than data analysis
Data storage, transfer and analysis now critical
11/14
Indigenous genomes
High incidence of vulvar cancer in East Arnhem Indigenous population
Ten years' work securing appropriate consent
Consent strictly limited to vulvar cancer study - Indigenous communities often wary of genetic research
Risk management - loss of public trust is often the biggest risk identified - far worse than losing data
12/14
Family studies
We infer family relationships from genetic data
These sometimes differ from those reported by the families
We can also infer information about family members not involved in the study
Full pedigrees can't always be published or shared
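To give a feel for how relationships show up in genetic data, here is a toy sketch of identity-by-state (IBS) sharing between two people. It assumes genotypes are coded as 0, 1 or 2 copies of the non-reference allele at each site, and the example genotypes are made up. Dedicated relatedness software uses far more sites and more careful statistics; this is not the method used in the studies described.

```python
def ibs_sharing(genotypes_a, genotypes_b):
    """Proportion of alleles shared identical-by-state across sites.

    Each genotype is 0, 1 or 2 (copies of the non-reference allele).
    At one site two people share 2 - |a - b| alleles out of a possible 2.
    """
    if len(genotypes_a) != len(genotypes_b) or not genotypes_a:
        raise ValueError("need two non-empty genotype lists of equal length")
    shared = sum(2 - abs(a - b) for a, b in zip(genotypes_a, genotypes_b))
    return shared / (2 * len(genotypes_a))


# Made-up genotypes at six sites, for illustration only.
parent = [0, 1, 2, 1, 0, 2]
child = [0, 1, 1, 1, 1, 2]
stranger = [2, 0, 0, 2, 2, 0]

print(ibs_sharing(parent, child) > ibs_sharing(parent, stranger))  # True
```

Because a comparison like this needs nothing but the genotypes, it can contradict a reported pedigree, which is one reason full pedigrees cannot always be shared.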
13/14
Genomes technically identifiable
Privacy Act 1988 - information is "personal" if identity "can reasonably be ascertained" from it
Identifying someone from their genome sequence is feasible and getting easier
Gymrek et al. (2013) Science 339:321
Shared/cloud resources more challenging to use in terms of data privacy
14/14
