BIG DATA BIOLOGY FOR PYTHONISTAS:
GETTING IN ON THE GENOMICS REVOLUTION
DARYA VANICHKINA
STRUCTURE OF MY TALK
▸ Whoami, and why now?
▸ The *[meaning] biology* of life
▸ The data
▸ The reality (case studies)
▸ Other areas that need development talent
BIOLOGY 101
WHY BIOLOGY? WHY NOW?
WHY SHOULD *YOU* CARE? - IF YOU’RE A HUMAN BEING IN THE XXI CENTURY
BIOLOGY 101: A VERY SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE/HUMAN
THE CENTRAL DOGMA
DNA [NUCLEUS]
5’ - ATG TCT TAC AAG TGC GTG - 3’
3’ - TAC AGA ATG TTC ACG CAC - 5’
DOUBLE HELIX. ATGC.
~6 BILLION BASE PAIRS/HUMAN CELL.
[37.2 TRILLION CELLS/BODY]
PACKAGED IN 23 PAIRS OF CHROMOSOMES
20K CODING GENES
TRANSCRIPTION -> RNA [NUCLEUS -> CYTOPLASM]
5’ - AUG UCU UAC AAG UGC GUG - 3’
TRANSLATION [GENETIC CODE] -> PROTEIN [CYTOPLASM]
H2N - MET SER TYR LYS CYS VAL - COOH
BIOLOGY 201: A SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE
[A BIT] BEYOND THE CENTRAL DOGMA
DNA [NUCLEUS]
5’ - ATG TCT TAmC AAG TGC GTG - 3’
3’ - TAC AGA ATG TTC ACG CAC - 5’
TRANSCRIPTION -> RNA [NUCLEUS -> CYTOPLASM]
5’ - AUG UCU UAC AAG UGC GUG - 3’
5’ - AUG UCU UIC AAG UGC GUG - 3’
NCRNA
5’ - AUGUCUUUCTTAUGCGUG - 3’
TRANSLATION -> PROTEIN [CYTOPLASM]
H2N - MET SER pTYR LYS CYS VAL - COOH
H2N - MET SER CYS LYS CYS VAL - COOH
WHAT THE DATA LOOKS LIKE
CODIFYING THE CENTRAL DOGMA
DNA [GENOME/EXOME]: ATGC STRING!
5’ - ATG TCT TAC AAG TGC GTG - 3’
3’ - TAC AGA ATG TTC ACG CAC - 5’
TRANSCRIPTION -> RNA [TRANSCRIPTOME]: AUGC STRING!
5’ - AUG UCU UAC AAG UGC GUG - 3’
TRANSLATION [GENETIC CODE] -> PROTEIN [CYTOPLASM]: 21 LETTER STRING!
H2N - MET SER TYR LYS CYS VAL - COOH
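Since all three layers are just strings, the central dogma can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: the function names are made up for this example, and the codon table contains only the six codons shown on the slide plus the three stop codons.

```python
# Sketch only: the codon table holds just the six codons on the slide plus the
# three stop codons; the complete genetic code has 64 entries.
CODON_TABLE = {
    "AUG": "M", "UCU": "S", "UAC": "Y", "AAG": "K", "UGC": "C", "GUG": "V",
    "UAA": "*", "UAG": "*", "UGA": "*",
}
COMPLEMENT = str.maketrans("ATGC", "TACG")


def complement(dna):
    """Base-pair partner of each letter, in the same left-to-right order."""
    return dna.translate(COMPLEMENT)


def transcribe(coding_strand):
    """RNA carries the coding strand's sequence with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(rna):
    """Read the RNA three letters at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGTCTTACAAGTGCGTG"
rna = transcribe(dna)
print(complement(dna))  # the 3'-5' strand as written on the slide
print(rna)
print(translate(rna))   # one-letter amino acid codes
```

The one-letter output corresponds to MET SER TYR LYS CYS VAL on the slide.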
WHAT DO YOU DO WITH THE DATA?
▸ Try to explain/understand diseases
(especially rare/Mendelian ones)
▸ Identify family relationships
▸ Identify ethnic origin
▸ Carrier status
▸ Targeted drug prescription, and rational
prediction of side effects
▸ Identify patients at risk of diseases, and
“catch” them earlier
THE THEORY
EUROPEAN EXAMPLE EXTRA INFO
▸ Taken from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735096/figure/F1/
▸ a, A statistical summary of genetic data from 1,387 Europeans based on
principal component axis one (PC1) and axis two (PC2). Small coloured
labels represent individuals and large coloured points represent median
PC1 and PC2 values for each country. The inset map provides a key to the
labels. The PC axes are rotated to emphasize the similarity to the
geographic map of Europe. AL, Albania; AT, Austria; BA, Bosnia-
Herzegovina; BE, Belgium; BG, Bulgaria; CH, Switzerland; CY, Cyprus; CZ,
Czech Republic; DE, Germany; DK, Denmark; ES, Spain; FI, Finland; FR,
France; GB, United Kingdom; GR, Greece; HR, Croatia; HU, Hungary; IE,
Ireland; IT, Italy; KS, Kosovo; LV, Latvia; MK, Macedonia; NO, Norway; NL,
Netherlands; PL, Poland; PT, Portugal; RO, Romania; RS, Serbia and
Montenegro; RU, Russia; Sct, Scotland; SE, Sweden; SI, Slovenia; SK,
Slovakia; TR, Turkey; UA, Ukraine; YG, Yugoslavia. b, A magnification of
the area around Switzerland from a, showing differentiation within
Switzerland by language. c, Genetic similarity versus geographic distance.
Median genetic correlation between pairs of individuals as a function of
geographic distance between their respective populations.
DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
@ERR030890.1 HWI-BRUNOP16X_0001:3:2:1148:1061#0/1
NNCAATGCTACTCTCAACAAGTTCACAGAGGAACTTAAGAAGTATGGAGTGACGNNTTTGGNTCGNGTTTGTGAT
+
##++**++++FFFFF5::88:=???FFFFFFFFFFFFFFFFF=F<8?############################
“Read”, 10 - 100+ million of these per dataset. Can be paired.
https://en.wikipedia.org/wiki/FASTQ_format
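A read like the one above can be pulled apart with plain Python. The sketch below assumes the simple four-lines-per-record layout and Phred+33 quality encoding (the `#`, `+` and `*` characters in this record fall below the range used by the older +64 encodings); `read_fastq` is a name invented for this example, not part of any library.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Phred+33 (assumed): quality score = ASCII code of the character minus 33
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


example = io.StringIO(
    "@ERR030890.1 HWI-BRUNOP16X_0001:3:2:1148:1061#0/1\n"
    "NNCAATGCTACTCTCAACAAGTTCACAGAGGAACTTAAGAAGTATGGAGTGACGNNTTTGGNTCGNGTTTGTGAT\n"
    "+\n"
    "##++**++++FFFFF5::88:=???FFFFFFFFFFFFFFFFF=F<8?############################\n"
)
records = list(read_fastq(example))
name, sequence, qualities = records[0]
print(name)
print(sequence[:10], qualities[:10])
```

For real datasets with tens of millions of reads you would stream from a (usually gzipped) file rather than build a list, but the record structure is the same.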
DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
ERR030890.15421060 272 chr1 564478 3 75M * 0 0
GTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTANN
1576:<F<FF=::??=5?DDFFFFF<FFF<?=?=;>>??=?=???66?=;FFFFFFFFFF=???6&)(*++**##
AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:73T0T0 YT:Z:UU XS:A:- NH:i:2 CC:Z:chrM CP:i:3929 HI:i:0
Alignment programs (run independently) - bwa, bowtie2
Output: SAM file (sequence alignment/map)
# Example for 1 read:
https://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/Alignment
http://genome.sph.umich.edu/wiki/SAM
Official (obtuse) documentation https://samtools.github.io/hts-specs/SAMv1.pdf
Reference == genome
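A SAM line is tab-separated: eleven mandatory fields followed by optional TAG:TYPE:VALUE entries. The sketch below parses the example read with plain Python. It assumes the read name and flag on the slide are `ERR030890.15421060` and `272` (they appear fused in the slide text), and the helper names are invented for this example; in practice a library such as pysam does this for you.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)

# Three of the flag bits defined in the SAM specification
FLAG_BITS = {0x4: "unmapped", 0x10: "reverse strand", 0x100: "secondary alignment"}


def parse_sam_line(line):
    """Split one alignment line into its 11 mandatory fields plus a tag dict."""
    f = line.rstrip("\n").split("\t")
    tags = {}
    for item in f[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tags)


def describe_flag(flag):
    """List which of the known bits are set in the bitwise FLAG field."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


line = "\t".join([
    "ERR030890.15421060", "272", "chr1", "564478", "3", "75M", "*", "0", "0",
    "GTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTANN",
    "1576:<F<FF=::??=5?DDFFFFF<FFF<?=?=;>>??=?=???66?=;FFFFFFFFFF=???6&)(*++**##",
    "AS:i:-2", "NM:i:2", "MD:Z:73T0T0", "NH:i:2",
])
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, describe_flag(record.flag))
print(record.tags)
```

Under that assumption, flag 272 (256 + 16) marks a secondary alignment on the reverse strand, which fits the `NH:i:2` tag (the read maps to two places).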
DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG chromosome 1
 CTGATGTGCCGCCTCACTTCGGTGGT       read1
  TGATGTGCCGCCTCACTACGGTGGTG      read2
   GATGTGCCGCCTCACTTCGGTGGTGA     read3
GCTGATGTGCCGCCTCACTACGGTG         read4
GCTGATGTGCCGCCTCACTACGGTG         read5
For visualising SAM - use http://software.broadinstitute.org/software/igv/
CACCTCACCACCGAAGTGAGGCGGCACATCAGC chromosome 1
  CCTCACCA-----GTGAGGCGGCACATCA   read1
    TCACCA-----GTGAGGCGGCACATCAGC read2
CACCTCACCA-----GTGAGGCGGCACA      read3
   CTCACCA-----GTGAGGCGGCACAGC    read4
 ACCTCACCA-----GTGAGGCGGCAC       read5
Mismatch Deletion [Insertion]
DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
Mismatch (SNV) Deletion [Insertion]
Find difference to reference
https://usegalaxy.org/
3 - 5 million variants vs reference
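"Find difference to reference" can be illustrated with a toy pileup over the five mismatch-example reads. This is only a sketch: the read offsets were worked out by hand from the slide, the function names and the `min_alt_reads` threshold are invented for the example, and real variant callers model base quality, mapping quality and indels as well.

```python
from collections import Counter

reference = "GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG"

# (0-based start on the reference, read sequence); offsets derived by hand
reads = [
    (1, "CTGATGTGCCGCCTCACTTCGGTGGT"),
    (2, "TGATGTGCCGCCTCACTACGGTGGTG"),
    (3, "GATGTGCCGCCTCACTTCGGTGGTGA"),
    (0, "GCTGATGTGCCGCCTCACTACGGTG"),
    (0, "GCTGATGTGCCGCCTCACTACGGTG"),
]


def pileup(reads, length):
    """Count which bases the reads show at every reference position."""
    columns = [Counter() for _ in range(length)]
    for start, sequence in reads:
        for i, base in enumerate(sequence):
            columns[start + i][base] += 1
    return columns


def call_snvs(reference, reads, min_alt_reads=2):
    """Report (1-based position, ref, alt, alt reads, depth) for mismatches."""
    variants = []
    for pos, counts in enumerate(pileup(reads, len(reference))):
        depth = sum(counts.values())
        for base, n in counts.items():
            if base != reference[pos] and n >= min_alt_reads:
                variants.append((pos + 1, reference[pos], base, n, depth))
    return variants


snvs = call_snvs(reference, reads)
print(snvs)
```

Three of the five reads carry an A where the reference has a T, the pattern you would expect from a heterozygous site.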
BIOLOGY 101: A VERY SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE/HUMAN
CHROMOSOMAL MODE OF INHERITANCE
60 new mutations per generation: a 20-year-old father transmits ~25 mutations to his child, a 40-year-old father around 65
(Kong et al., Nature 2012, DOI:10.1038/nature11396; Francioli et al., Nature Genetics 2015, DOI:10.1038/ng.3292)
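As a back-of-the-envelope check, the two quoted data points imply roughly two extra paternal mutations per year of the father's age. The sketch below simply draws a straight line through those two points; the linearity and the function name are assumptions made for this illustration, not a model taken from the cited papers.

```python
def paternal_mutations(age, age_lo=20, muts_lo=25, age_hi=40, muts_hi=65):
    """Straight-line estimate through the two points quoted on the slide."""
    per_year = (muts_hi - muts_lo) / (age_hi - age_lo)
    return muts_lo + per_year * (age - age_lo)


per_year = (65 - 25) / (40 - 20)
print(per_year)                # extra mutations per year of paternal age
print(paternal_mutations(30))  # estimate for a 30-year-old father
```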
DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
Homozygous/heterozygous
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
VCF file
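The homozygous/heterozygous call lives in the GT field of each sample column. A minimal sketch of reading it from the record above, assuming tab-separated columns as in a real VCF file; `parse_vcf_line` and `zygosity` are names invented for this example, and libraries such as pysam or cyvcf2 handle the full format.

```python
def parse_vcf_line(header_line, record_line):
    """Return the fixed columns plus a per-sample dict keyed by FORMAT fields."""
    columns = header_line.lstrip("#").split("\t")
    values = record_line.split("\t")
    record = dict(zip(columns[:8], values[:8]))
    format_keys = values[8].split(":")
    record["samples"] = {
        sample: dict(zip(format_keys, value.split(":")))
        for sample, value in zip(columns[9:], values[9:])
    }
    return record


def zygosity(genotype):
    """Classify a diploid GT string such as 0|0, 1|0 or 1/1."""
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


header = "\t".join("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
                   "NA00001 NA00002 NA00003".split())
line = "\t".join("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
                 "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.".split())

record = parse_vcf_line(header, line)
calls = {name: zygosity(fields["GT"]) for name, fields in record["samples"].items()}
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
print(calls)
```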
DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
Good tutorial on this (VLSCI)
https://docs.google.com/document/d/1lfDYNzHjfDA1pHTHd-0w3xHhg7L4TipT1gRfzgiV8es/pub
http://vlsci.github.io/lscc_docs/tutorials/variant_calling_galaxy_1/variant_calling_galaxy_1/
http://vlsci.github.io/lscc_docs/tutorials/var_detect_advanced/var_detect_advanced/
samtools pileup, GATK, FreeBayes -> Variant Call Format (VCF)
DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
What do the differences actually mean?
What we currently do:
1. See if any of the observed variants match disease-associated mutations we’ve seen before
(databases like OMIM, dbSNP, ClinVar, SNPedia)
2. Predict whether mutation would “break” protein by introducing a “STOP” earlier in the
sequence, or shift the frame, or change a critical amino acid
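Step 2 can be sketched on the toy gene from the earlier slides. This is a deliberately naive illustration: the function names and category labels are invented here, only the standard genetic code is used, and real annotators (which also handle splicing, transcripts and known-variant databases) are far more involved.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    protein = ""
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODONS[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein += amino_acid
    return protein


def classify(ref_cds, alt_cds):
    """Very rough consequence of replacing ref_cds with alt_cds."""
    if (len(alt_cds) - len(ref_cds)) % 3 != 0:
        return "frameshift"
    ref_protein, alt_protein = translate(ref_cds), translate(alt_cds)
    if alt_protein == ref_protein:
        return "synonymous"
    if len(alt_protein) < len(ref_protein) and ref_protein.startswith(alt_protein):
        return "premature stop"
    return "amino acid change"


ref = "ATGTCTTACAAGTGCGTG"
print(classify(ref, "ATGTCTTAAAAGTGCGTG"))  # TAC -> TAA
print(classify(ref, "ATGTCTACAAGTGCGTG"))   # one base deleted
print(classify(ref, "ATGTCTTGCAAGTGCGTG"))  # TAC -> TGC
print(classify(ref, "ATGTCCTACAAGTGCGTG"))  # TCT -> TCC
```

Whether an amino acid change is "critical" is the hard part, and it is not something a lookup table can answer.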
BIOLOGY 201: BUT …
BUT THERE ARE MANY CHALLENGES THAT NEED TO BE ADDRESSED
CASE STUDIES
CASE STUDIES
UK: GENOMICS ENGLAND
100 000 GENOMES FOR THE NHS
CASE STUDIES
UK: GENOMICS ENGLAND. 100 000 GENOMES FOR THE NHS
JESSICA WRIGHT
▸ Epilepsy, movement disorders, developmental delay
▸ Standard testing: MRI, lumbar puncture, EEGs and other
testing (including invasive tests) did not pinpoint a cause
▸ Genomic sequencing identified a de novo mutation in
Glut1, which codes for a protein responsible for
transporting glucose from the blood into the brain
▸ => Ketogenic diet (low carbohydrate, high fat diet)
CASE STUDIES
23&ME DIRECT TO CONSUMER GENETICS
▸ 23andMe
▸ Illumina HumanOmniExpress-24 array
▸ opt-in research
▸ 36 FDA-approved tests + ancestry, vs. the original kit's 254 diseases/conditions
▸ Manuel Corpas - sample data of himself and his family (23andMe, exome
sequencing)
CASE STUDIES
23&ME DIRECT TO CONSUMER GENETICS
▸ “Genetic information can reveal that someone you thought you were
related to is not your biological relative. This happens most frequently in
the case of paternity.”
▸ “Learning that your genotype is associated with an increased risk of a
particular condition can be difficult, especially if you have seen a friend or
family member struggle with a similar issue.”
▸ “Because genetic information is hereditary, knowing something about
your genetics also tells you something about those closely related to you.
Your family may or may not want to know this information as well, and
relationships with others can be affected by learning about your DNA.”
▸ Link & Siblings and half-siblings & Genome view
CASE STUDIES
VERIFI/HARMONY GENETIC TESTS (AUSTRALIAN PATHOLOGY)
▸ $450 AUD
▸ Tests for chromosome
abnormalities: trisomy 21 (Down
syndrome), trisomy 18 (Edwards
syndrome) and trisomy 13 (Patau
syndrome)
▸ Optional gender, Turner
(Monosomy X) and Klinefelter
(XXY) syndromes
▸ http://www.sonicgenetics.com.au/nipt/patients/how-it-works/
BIOLOGY 201: BUT …
BUT THERE ARE MANY (PRACTICAL) CHALLENGES THAT NEED TO BE ADDRESSED
▸ Speed (of mappers, cleaners, collapsers, annotators) is a *major* problem - in the real world,
outside of the Ivory Tower
▸ Tools are not designed to work together
▸ Technical reproducibility between centres
▸ Data sharing issues, and lack of consistent nomenclature and file format (and chr) horrors
▸ Getting it wrong can have devastating consequences (pathogenic variant later reclassified as
benign in prenatal diagnosis; athletes erroneously deemed to be at risk of cardiac failure)
▸ Differences in interpretation between pathologists/ doctors - and hence different patient
outcomes
BEYOND THE GENOME
THE ONE SLIDE ABOUT WHAT I ACTUALLY DO…
▸ GENCODE 25
▸ hg38
ADDITIONAL RESOURCES
▸ Galaxy tutorials and walk-throughs (for when you’re starting out) https://wiki.galaxyproject.org/Learn/GalaxyNGS101
▸ Broad Institute (Harvard/MIT) Public Lectures
▸ Genomics England YouTube
▸ PyCon talk by Titus Brown, with example of how to run bcbio on Ashkenazi trio dataset
▸ Bcbio sample datasets and analyses, especially the exome and whole genome variant
analysis, tumour vs normal comparisons [Good for trying out variant analysis, not so
good for RNA at the moment]
IF YOU WANT TO TRY THIS AT HOME…
WHERE TO GET DATA, AND HOW TO PROCESS IT
▸ Look for a research study you’re interested in on PubMed, and find where it links to the raw data
(Methods section and supplementary tables, with “weird” identifiers, in fastq)
▸ Data from all research studies *[must be] is usually* deposited in the European Nucleotide Archive
(ENA), where you can download it in fastq format.
▸ First, try to process it to reproduce the authors’ results. Galaxy provides a web interface that runs
many standard command-line tools and allows you to look at the output - good as “leading strings”
▸ Frameworks such as bcbio provide managed environments for analysis
▸ Most biological software runs on linux, and can be chained together using bash. I would go from an
exploratory analysis in Galaxy to an analysis that chains together existing tools via bash or a
complex bioinformatics pipeline management system (Wikipedia)
IF YOU WANT TO TRY THIS AT HOME…
DANGER, WILL ROBINSON! DANGER!
▸ BUT: Because of the latest technologies, you as a programming-literate
individual are in a better position to understand this data than most
▸ Understanding and playing with this data is addictive - and beautiful…
▸ This is coming to a hospital near you
OTHER “BIOLOGY” OF INTEREST…
▸ “Algorithms stuff” (Talk tomorrow!)
▸ Biological image analysis (fMRI, microscopy)
▸ Contribute to projects such as Galaxy and bcbio
▸ Machine learning of patient records
▸ Integrating IoT and wearables with medical data and patient records
▸ Cool stuff in cataloguing the genetic diversity of life, choosing which areas should be
made into national parks based on data, or understanding disease spread (ex. flu
across Asia)
ACKNOWLEDGEMENTS (I.E. THE PEOPLE I WORK WITH, WHO ARE AWESOME)
CURE THE FUTURE
RASKOLAB.GITHUB.IO
QUESTIONS?
@dvanichkina
Slides & Questions
http://daryavanichkina.com/blog/pycon2016.html
Four domains of Big Data in 2025.
In each of the four domains, the projected annual storage and computing needs are presented across the data lifecycle.
Big Data: Astronomical or Genomical? http://dx.doi.org/10.1371/journal.pbio.1002195
IMAGES USED
▸ Genomics England
▸ https://www.genomicsengland.co.uk/wp-content/uploads/2016/05/PhilMynott_004-1024x681.jpg
▸ NHGRI
▸ https://www.genome.gov/sequencingcostsdata/
▸ Lung tumour image: http://edoc.hu-berlin.de/dissertationen/pietas-agnieszka-2004-11-22/HTML/chapter3.html
▸ Open Clip Art
IMAGES USED
▸ Spurious correlations http://www.tylervigen.com/spurious-correlations
▸ http://phys.org/news/2009-11-conquer-social-network-cells.html
▸ http://lobsangstudio.com/nc_pop.cfm?id=291
▸ BBC education - splicing http://www.bbc.co.uk/education/guides/zgrccdm/revision/2
▸ https://www.dnastar.com/arraystar_help/index.html#!Documents/snptable.htm
▸ http://circgenetics.ahajournals.org/content/7/6/911/F2.expansion.html

Más contenido relacionado

La actualidad más candente

Panos_NHLBI_Final Report
Panos_NHLBI_Final ReportPanos_NHLBI_Final Report
Panos_NHLBI_Final ReportJoseph Panos
 
Supporting Genomics in the Practice of Medicine by Heidi Rehm
Supporting Genomics in the Practice of Medicine by Heidi RehmSupporting Genomics in the Practice of Medicine by Heidi Rehm
Supporting Genomics in the Practice of Medicine by Heidi RehmKnome_Inc
 
Preimplantation Genetic Diagnosis using Next Generation Sequencing for Social...
Preimplantation Genetic Diagnosis using Next Generation Sequencing for Social...Preimplantation Genetic Diagnosis using Next Generation Sequencing for Social...
Preimplantation Genetic Diagnosis using Next Generation Sequencing for Social...Maryam Rafati
 
Dr. Ben Hause - Metagenomic Sequencing for Virus Discovery and Characterization
Dr. Ben Hause - Metagenomic Sequencing for Virus Discovery and CharacterizationDr. Ben Hause - Metagenomic Sequencing for Virus Discovery and Characterization
Dr. Ben Hause - Metagenomic Sequencing for Virus Discovery and CharacterizationJohn Blue
 
Justine McKittrick SURB poster 2015 FINAL
Justine McKittrick SURB poster 2015 FINALJustine McKittrick SURB poster 2015 FINAL
Justine McKittrick SURB poster 2015 FINALJustine McKittrick
 
Python meetup 2014
Python meetup 2014Python meetup 2014
Python meetup 2014eilosei
 
Using Citizen Science to organize biomedical knowledge
Using Citizen Science to organize biomedical knowledgeUsing Citizen Science to organize biomedical knowledge
Using Citizen Science to organize biomedical knowledgeAndrew Su
 
Dr. Francesc Palau - 'Neuropatías periféricas hereditarias'
Dr. Francesc Palau - 'Neuropatías periféricas hereditarias'Dr. Francesc Palau - 'Neuropatías periféricas hereditarias'
Dr. Francesc Palau - 'Neuropatías periféricas hereditarias'Fundación Ramón Areces
 
Heimler Syndrome Paper
Heimler Syndrome PaperHeimler Syndrome Paper
Heimler Syndrome PaperNada Alsheqaih
 
Dr. Ben Hause - Pathogen Discovery Using Metagenomic Sequencing
Dr. Ben Hause - Pathogen Discovery Using Metagenomic SequencingDr. Ben Hause - Pathogen Discovery Using Metagenomic Sequencing
Dr. Ben Hause - Pathogen Discovery Using Metagenomic SequencingJohn Blue
 
Phytothreats: WP4 overview
Phytothreats: WP4 overviewPhytothreats: WP4 overview
Phytothreats: WP4 overviewForest Research
 
Current approaches for African swine fever virus vaccine development
Current approaches for African swine fever virus vaccine developmentCurrent approaches for African swine fever virus vaccine development
Current approaches for African swine fever virus vaccine developmentILRI
 
In Vitro Characterization of a Novel Cis-acting Element (NCE) in the Cd4 Locus
In Vitro Characterization of a Novel Cis-acting Element (NCE) in the Cd4 Locus In Vitro Characterization of a Novel Cis-acting Element (NCE) in the Cd4 Locus
In Vitro Characterization of a Novel Cis-acting Element (NCE) in the Cd4 Locus Yordan Penev
 
Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsThomas Keane
 
Korc Poster Final 11 23 10
Korc Poster Final 11 23 10Korc Poster Final 11 23 10
Korc Poster Final 11 23 10Jack Crawford
 

La actualidad más candente (18)

Heimbruch 2015
Heimbruch 2015Heimbruch 2015
Heimbruch 2015
 
Panos_NHLBI_Final Report
Panos_NHLBI_Final ReportPanos_NHLBI_Final Report
Panos_NHLBI_Final Report
 
Supporting Genomics in the Practice of Medicine by Heidi Rehm
Supporting Genomics in the Practice of Medicine by Heidi RehmSupporting Genomics in the Practice of Medicine by Heidi Rehm
Supporting Genomics in the Practice of Medicine by Heidi Rehm
 
Preimplantation Genetic Diagnosis using Next Generation Sequencing for Social...
Preimplantation Genetic Diagnosis using Next Generation Sequencing for Social...Preimplantation Genetic Diagnosis using Next Generation Sequencing for Social...
Preimplantation Genetic Diagnosis using Next Generation Sequencing for Social...
 
Dr. Ben Hause - Metagenomic Sequencing for Virus Discovery and Characterization
Dr. Ben Hause - Metagenomic Sequencing for Virus Discovery and CharacterizationDr. Ben Hause - Metagenomic Sequencing for Virus Discovery and Characterization
Dr. Ben Hause - Metagenomic Sequencing for Virus Discovery and Characterization
 
Justine McKittrick SURB poster 2015 FINAL
Justine McKittrick SURB poster 2015 FINALJustine McKittrick SURB poster 2015 FINAL
Justine McKittrick SURB poster 2015 FINAL
 
Python meetup 2014
Python meetup 2014Python meetup 2014
Python meetup 2014
 
Using Citizen Science to organize biomedical knowledge
Using Citizen Science to organize biomedical knowledgeUsing Citizen Science to organize biomedical knowledge
Using Citizen Science to organize biomedical knowledge
 
Dr. Francesc Palau - 'Neuropatías periféricas hereditarias'
Dr. Francesc Palau - 'Neuropatías periféricas hereditarias'Dr. Francesc Palau - 'Neuropatías periféricas hereditarias'
Dr. Francesc Palau - 'Neuropatías periféricas hereditarias'
 
Biochemistry Poster
Biochemistry PosterBiochemistry Poster
Biochemistry Poster
 
Heimler Syndrome Paper
Heimler Syndrome PaperHeimler Syndrome Paper
Heimler Syndrome Paper
 
Pone.0034901
Pone.0034901Pone.0034901
Pone.0034901
 
Dr. Ben Hause - Pathogen Discovery Using Metagenomic Sequencing
Dr. Ben Hause - Pathogen Discovery Using Metagenomic SequencingDr. Ben Hause - Pathogen Discovery Using Metagenomic Sequencing
Dr. Ben Hause - Pathogen Discovery Using Metagenomic Sequencing
 
Phytothreats: WP4 overview
Phytothreats: WP4 overviewPhytothreats: WP4 overview
Phytothreats: WP4 overview
 
Current approaches for African swine fever virus vaccine development
Current approaches for African swine fever virus vaccine developmentCurrent approaches for African swine fever virus vaccine development
Current approaches for African swine fever virus vaccine development
 
In Vitro Characterization of a Novel Cis-acting Element (NCE) in the Cd4 Locus
In Vitro Characterization of a Novel Cis-acting Element (NCE) in the Cd4 Locus In Vitro Characterization of a Novel Cis-acting Element (NCE) in the Cd4 Locus
In Vitro Characterization of a Novel Cis-acting Element (NCE) in the Cd4 Locus
 
Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotations
 
Korc Poster Final 11 23 10
Korc Poster Final 11 23 10Korc Poster Final 11 23 10
Korc Poster Final 11 23 10
 

Similar a Big data biology for pythonistas: getting in on the genomics revolution

Big data and the exposome, Oregon State 040616
Big data and the exposome, Oregon State 040616Big data and the exposome, Oregon State 040616
Big data and the exposome, Oregon State 040616Chirag Patel
 
Evolutionary Genetics of Complex Genome
Evolutionary Genetics of Complex GenomeEvolutionary Genetics of Complex Genome
Evolutionary Genetics of Complex Genomejrossibarra
 
Human genome project(ibri)
Human genome project(ibri)Human genome project(ibri)
Human genome project(ibri)ajay vishwakrma
 
Monitoring the quality of data in the clinical use of pathogen genomes
Monitoring the quality of data in the clinical use of pathogen genomesMonitoring the quality of data in the clinical use of pathogen genomes
Monitoring the quality of data in the clinical use of pathogen genomesHealth Informatics New Zealand
 
Genomics
GenomicsGenomics
Genomicscphcosu
 
What can your dog teach you about Genetics?
What can your dog teach you about Genetics?What can your dog teach you about Genetics?
What can your dog teach you about Genetics?rlanchantin
 
Open Source Pharma /Genomics and clinical practice / Prof Hosur
Open Source Pharma /Genomics and clinical practice / Prof Hosur Open Source Pharma /Genomics and clinical practice / Prof Hosur
Open Source Pharma /Genomics and clinical practice / Prof Hosur opensourcepharmafound
 
Insilico analysis of pkd genes in polycystic kidney disease patients
Insilico analysis of pkd genes in polycystic kidney disease patientsInsilico analysis of pkd genes in polycystic kidney disease patients
Insilico analysis of pkd genes in polycystic kidney disease patientsVeeramuthumariPandia1
 
YEAR IN REVIEW - Genetics, Genomics, Epigenetics
YEAR IN REVIEW - Genetics, Genomics, EpigeneticsYEAR IN REVIEW - Genetics, Genomics, Epigenetics
YEAR IN REVIEW - Genetics, Genomics, EpigeneticsOARSI
 
Marzillier_09052014.pdf
Marzillier_09052014.pdfMarzillier_09052014.pdf
Marzillier_09052014.pdf7006ASWATHIRR
 
Personalized Medicine and the Omics Revolution by Professor Mike Snyder
Personalized Medicine and the Omics Revolution by Professor Mike SnyderPersonalized Medicine and the Omics Revolution by Professor Mike Snyder
Personalized Medicine and the Omics Revolution by Professor Mike SnyderThe Hive
 
A New Generation Of Mechanism-Based Biomarkers For The Clinic
A New Generation Of Mechanism-Based Biomarkers For The ClinicA New Generation Of Mechanism-Based Biomarkers For The Clinic
A New Generation Of Mechanism-Based Biomarkers For The ClinicJoaquin Dopazo
 
The Yoyo Has Stopped: Reviewing the Evidence for a Low Basal Human Protein...
The Yoyo Has Stopped:  Reviewing the Evidence for a Low Basal Human Protein...The Yoyo Has Stopped:  Reviewing the Evidence for a Low Basal Human Protein...
The Yoyo Has Stopped: Reviewing the Evidence for a Low Basal Human Protein...Chris Southan
 
Clinical molecular diagnostics for drug guidance
Clinical molecular diagnostics for drug guidanceClinical molecular diagnostics for drug guidance
Clinical molecular diagnostics for drug guidanceNikesh Shah
 
Genetic regulation of human brain aging
Genetic regulation of human brain agingGenetic regulation of human brain aging
Genetic regulation of human brain agingAlzforum
 

Similar a Big data biology for pythonistas: getting in on the genomics revolution (20)

2009 09 08 Wiltshire Ipit Seminar Slides
2009 09 08 Wiltshire Ipit Seminar Slides2009 09 08 Wiltshire Ipit Seminar Slides
2009 09 08 Wiltshire Ipit Seminar Slides
 
Big data and the exposome, Oregon State 040616
Big data and the exposome, Oregon State 040616Big data and the exposome, Oregon State 040616
Big data and the exposome, Oregon State 040616
 
Evolutionary Genetics of Complex Genome
Evolutionary Genetics of Complex GenomeEvolutionary Genetics of Complex Genome
Evolutionary Genetics of Complex Genome
 
Human genome project(ibri)
Human genome project(ibri)Human genome project(ibri)
Human genome project(ibri)
 
Monitoring the quality of data in the clinical use of pathogen genomes
Monitoring the quality of data in the clinical use of pathogen genomesMonitoring the quality of data in the clinical use of pathogen genomes
Monitoring the quality of data in the clinical use of pathogen genomes
 
Genomics
GenomicsGenomics
Genomics
 
What can your dog teach you about Genetics?
What can your dog teach you about Genetics?What can your dog teach you about Genetics?
What can your dog teach you about Genetics?
 
Open Source Pharma /Genomics and clinical practice / Prof Hosur
Open Source Pharma /Genomics and clinical practice / Prof Hosur Open Source Pharma /Genomics and clinical practice / Prof Hosur
Open Source Pharma /Genomics and clinical practice / Prof Hosur
 
20150115_JQO_NYAPopulationGenomics
20150115_JQO_NYAPopulationGenomics20150115_JQO_NYAPopulationGenomics
20150115_JQO_NYAPopulationGenomics
 
Insilico analysis of pkd genes in polycystic kidney disease patients
Insilico analysis of pkd genes in polycystic kidney disease patientsInsilico analysis of pkd genes in polycystic kidney disease patients
Insilico analysis of pkd genes in polycystic kidney disease patients
 
YEAR IN REVIEW - Genetics, Genomics, Epigenetics
YEAR IN REVIEW - Genetics, Genomics, EpigeneticsYEAR IN REVIEW - Genetics, Genomics, Epigenetics
YEAR IN REVIEW - Genetics, Genomics, Epigenetics
 
Marzillier_09052014.pdf
Marzillier_09052014.pdfMarzillier_09052014.pdf
Marzillier_09052014.pdf
 
Personalized Medicine and the Omics Revolution by Professor Mike Snyder
Personalized Medicine and the Omics Revolution by Professor Mike SnyderPersonalized Medicine and the Omics Revolution by Professor Mike Snyder
Personalized Medicine and the Omics Revolution by Professor Mike Snyder
 
A New Generation Of Mechanism-Based Biomarkers For The Clinic
A New Generation Of Mechanism-Based Biomarkers For The ClinicA New Generation Of Mechanism-Based Biomarkers For The Clinic
A New Generation Of Mechanism-Based Biomarkers For The Clinic
 
The Yoyo Has Stopped: Reviewing the Evidence for a Low Basal Human Protein...
The Yoyo Has Stopped:  Reviewing the Evidence for a Low Basal Human Protein...The Yoyo Has Stopped:  Reviewing the Evidence for a Low Basal Human Protein...
The Yoyo Has Stopped: Reviewing the Evidence for a Low Basal Human Protein...
 
Clinical molecular diagnostics for drug guidance
Clinical molecular diagnostics for drug guidanceClinical molecular diagnostics for drug guidance
Clinical molecular diagnostics for drug guidance
 
Genetic regulation of human brain aging
Genetic regulation of human brain agingGenetic regulation of human brain aging
Genetic regulation of human brain aging
 
QTLS......pptx
QTLS......pptxQTLS......pptx
QTLS......pptx
 
Dna microarray mehran- u of toronto
Dna microarray  mehran- u of torontoDna microarray  mehran- u of toronto
Dna microarray mehran- u of toronto
 
Biology Literature Review Example
Biology Literature Review ExampleBiology Literature Review Example
Biology Literature Review Example
 

Más de Darya Vanichkina

Sharing & Reusing Training Materials
Sharing & Reusing Training MaterialsSharing & Reusing Training Materials
Sharing & Reusing Training MaterialsDarya Vanichkina
 
Jumping into digital: Lessons learned while moving live coding machine learni...
Jumping into digital: Lessons learned while moving live coding machine learni...Jumping into digital: Lessons learned while moving live coding machine learni...
Jumping into digital: Lessons learned while moving live coding machine learni...Darya Vanichkina
 
Grammar of Graphics - Darya Vanichkina
Grammar of Graphics - Darya VanichkinaGrammar of Graphics - Darya Vanichkina
Grammar of Graphics - Darya VanichkinaDarya Vanichkina
 
Activity-dependent transcriptional dynamics in mouse primary cortical and hum...
Activity-dependent transcriptional dynamics in mouse primary cortical and hum...Activity-dependent transcriptional dynamics in mouse primary cortical and hum...
Activity-dependent transcriptional dynamics in mouse primary cortical and hum...Darya Vanichkina
 
Comparing the early ciRNA papers
Comparing the early ciRNA papers Comparing the early ciRNA papers
Comparing the early ciRNA papers Darya Vanichkina
 

Más de Darya Vanichkina (6)

Sharing & Reusing Training Materials
Sharing & Reusing Training MaterialsSharing & Reusing Training Materials
Sharing & Reusing Training Materials
 
Jumping into digital: Lessons learned while moving live coding machine learni...
Jumping into digital: Lessons learned while moving live coding machine learni...Jumping into digital: Lessons learned while moving live coding machine learni...
Jumping into digital: Lessons learned while moving live coding machine learni...
 
ANDS_TrainingTheTrainer
ANDS_TrainingTheTrainerANDS_TrainingTheTrainer
ANDS_TrainingTheTrainer
 
Grammar of Graphics - Darya Vanichkina
Grammar of Graphics - Darya VanichkinaGrammar of Graphics - Darya Vanichkina
Grammar of Graphics - Darya Vanichkina
 
Activity-dependent transcriptional dynamics in mouse primary cortical and hum...
Activity-dependent transcriptional dynamics in mouse primary cortical and hum...Activity-dependent transcriptional dynamics in mouse primary cortical and hum...
Activity-dependent transcriptional dynamics in mouse primary cortical and hum...
 
Comparing the early ciRNA papers
Comparing the early ciRNA papers Comparing the early ciRNA papers
Comparing the early ciRNA papers
 

Big data biology for pythonistas: getting in on the genomics revolution

  • 1. BIG DATA BIOLOGY FOR PYTHONISTAS: GETTING IN ON THE GENOMICS REVOLUTION DARYA VANICHKINA
  • 2. STRUCTURE OF MY TALK ▸ Whoami, and why now? ▸ The meaning biology of life ▸ The data ▸ The reality (case studies) ▸ Other areas that need development talent
  • 4. WHY SHOULD *YOU* CARE? - IF YOU’RE A HUMAN BEING IN THE XXI CENTURY
  • 5. BIOLOGY 101: A VERY SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE/HUMAN THE CENTRAL DOGMA 5’ - ATG TCT TAC AAG TGC GTG - 3’ 3’ - TAC AGA ATG TTC ACG CAC - 5’ GENETIC CODE NUCLEUS DNA DOUBLE HELIX.
  • 6. BIOLOGY 101: A VERY SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE/HUMAN THE CENTRAL DOGMA 5’ - ATG TCT TAC AAG TGC GTG - 3’ 3’ - TAC AGA ATG TTC ACG CAC - 5’ 5’ - AUG UCU UAC AAG UGC GUG - 3’ 5’ - AUG UCU UAC AAG UGC GUG - 3’ H2N - MET SER TYR LYS CYS VAL - COOH GENETIC CODE NUCLEUS CYTOPLASM DNA RNA PROTEIN TRANSCRIPTION TRANSLATION DOUBLE HELIX. ATGC. ~6 BILLION/HUMAN CELL. [37.2 TRILLION CELLS/BODY] PACKAGED IN 23 PAIRS OF CHROMOSOMES 20K CODING GENES
  • 7. BIOLOGY 201: A SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE [A BIT] BEYOND THE CENTRAL DOGMA 5’ - ATG TCT TAmC AAG TGC GTG - 3’ 3’ - TAC AGA ATG TTC ACG CAC - 5’ 5’ - AUG UCU UAC AAG UGC GUG - 3’ 5’ - AUG UCU UIC AAG UGC GUG - 3’ H2N - MET SER pTYR LYS CYS VAL - COOH NUCLEUS CYTOPLASM DNA RNA PROTEIN TRANSCRIPTION TRANSLATION 5’ - AUGUCUUUCTTAUGCGUG - 3’ NCRNA H2N - MET SER CYS LYS CYS VAL - COOH
  • 8. WHAT THE DATA LOOKS LIKE CODIFYING THE CENTRAL DOGMA 5’ - ATG TCT TAC AAG TGC GTG - 3’ 3’ - TAC AGA ATG TTC ACG CAC - 5’ 5’ - AUG UCU UAC AAG UGC GUG - 3’ 5’ - AUG UCU UAC AAG UGC GUG - 3’ H2N - MET SER TYR LYS CYS VAL - COOH GENETIC CODE CYTOPLASM DNA [GENOME/ EXOME] RNA [TRANSCRIPTOME] PROTEIN TRANSCRIPTION TRANSLATION ATGC STRING! AUGC STRING! 21 LETTER STRING!
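Since the slide boils the central dogma down to three kinds of string, the two steps between them can be sketched in a few lines of Python. This is a minimal illustration, not how production tools do it: the function names are made up for this example, and the codon table is the standard genetic code written out in the usual UCAG order.

```python
from itertools import product

# Standard genetic code, codons enumerated in UCAG order; '*' marks a STOP codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# The coding (5' -> 3') strand shown on the slide.
CODING_STRAND = "ATGTCTTACAAGTGCGTG"


def complement(dna: str) -> str:
    """Return the base-paired strand, read in the same left-to-right order."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))


def transcribe(coding_strand: str) -> str:
    """DNA -> mRNA: the mRNA matches the coding strand with U instead of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """mRNA -> protein (one-letter codes), stopping at the first STOP codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe(CODING_STRAND)
print(complement(CODING_STRAND))  # the 3' -> 5' strand from the slide
print(mrna)
print(translate(mrna))            # MSYKCV = Met Ser Tyr Lys Cys Val
```

The one-letter output corresponds to the slide's MET SER TYR LYS CYS VAL.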
  • 9. WHAT DO YOU DO WITH THE DATA? ▸ Try to explain/understand diseases (especially rare/Mendelian ones) ▸ Identify family relationships ▸ Identify ethnic origin ▸ Carrier status ▸ Targeted drug prescription, and rational prediction of side effects ▸ Identify patients at risk of diseases, and “catch” them earlier THE THEORY
  • 10. EUROPEAN EXAMPLE EXTRA INFO ▸ Taken from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735096/figure/F1/ ▸ a, A statistical summary of genetic data from 1,387 Europeans based on principal component axis one (PC1) and axis two (PC2). Small coloured labels represent individuals and large coloured points represent median PC1 and PC2 values for each country. The inset map provides a key to the labels. The PC axes are rotated to emphasize the similarity to the geographic map of Europe. AL, Albania; AT, Austria; BA, Bosnia-Herzegovina; BE, Belgium; BG, Bulgaria; CH, Switzerland; CY, Cyprus; CZ, Czech Republic; DE, Germany; DK, Denmark; ES, Spain; FI, Finland; FR, France; GB, United Kingdom; GR, Greece; HR, Croatia; HU, Hungary; IE, Ireland; IT, Italy; KS, Kosovo; LV, Latvia; MK, Macedonia; NO, Norway; NL, Netherlands; PL, Poland; PT, Portugal; RO, Romania; RS, Serbia and Montenegro; RU, Russia; Sct, Scotland; SE, Sweden; SI, Slovenia; SK, Slovakia; TR, Turkey; UA, Ukraine; YG, Yugoslavia. b, A magnification of the area around Switzerland from a showing differentiation within Switzerland by language. c, Genetic similarity versus geographic distance. Median genetic correlation between pairs of individuals as a function of geographic distance between their respective populations.
  • 11. DATA ANALYSIS PIPELINE FOR PROCESSING GENOMIC DATA: SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
    “Read”, 10 - 100+ million of these per dataset. Can be paired. https://en.wikipedia.org/wiki/FASTQ_format
      @ERR030890.1 HWI-BRUNOP16X_0001:3:2:1148:1061#0/1
      NNCAATGCTACTCTCAACAAGTTCACAGAGGAACTTAAGAAGTATGGAGTGACGNNTTTGGNTCGNGTTTGTGAT
      +
      ##++**++++FFFFF5::88:=???FFFFFFFFFFFFFFFFF=F<8?############################
    +OR + DNA
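A FASTQ record like the one above is four lines: a header starting with `@`, the bases, a `+` separator, and one quality character per base. The sketch below reads such records from any text handle. It assumes unwrapped four-line records and the Phred+33 quality encoding; the record in the demo is a made-up toy read, not data from the talk.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, optionally repeating the read name
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (Phred+33 is an assumption)."""
    return [ord(char) - offset for char in quality]


# Hypothetical toy record, used only to exercise the parser.
toy_fastq = io.StringIO("@read1\nGATTACA\n+\nIIII##5\n")
records = list(read_fastq(toy_fastq))
for record in records:
    print(record.name, record.sequence, phred_scores(record.quality))
```

Because `read_fastq` is a generator, it never holds more than one read in memory, which matters once a dataset has tens of millions of them.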
  • 12. DATA ANALYSIS PIPELINE FOR PROCESSING GENOMIC DATA: SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
    Reference == genome
    Alignment programs (run independently) - bwa, bowtie2
    Output: SAM file (sequence alignment/map)
    # Example for 1 read:
      ERR030890.15421060 272 chr1 564478 3 75M * 0 0 GTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTANN 1576:<F<FF=::??=5?DDFFFFF<FFF<?=?=;>>??=?=???66?=;FFFFFFFFFF=???6&)(*++**## AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:73T0T0 YT:Z:UU XS:A:- NH:i:2 CC:Z:chrM CP:i:3929 HI:i:0
    https://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/Alignment
    http://genome.sph.umich.edu/wiki/SAM
    Official (obtuse) documentation https://samtools.github.io/hts-specs/SAMv1.pdf
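Each SAM alignment line has eleven mandatory tab-separated fields followed by optional `TAG:TYPE:VALUE` entries, so a first-pass parser is short. The sketch below is for illustration only: the field names follow the SAM specification, but the `SamRecord` class, the helper names, and the demo line are invented here. For real work a library such as pysam is the usual choice.

```python
from typing import Dict, NamedTuple, Union


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence (e.g. chromosome) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. 75M
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Dict[str, Union[int, str]]


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags: Dict[str, Union[int, str]] = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


def is_unmapped(record: SamRecord) -> bool:
    return bool(record.flag & 0x4)


def is_reverse_strand(record: SamRecord) -> bool:
    return bool(record.flag & 0x10)


# Hypothetical toy alignment, loosely modelled on the slide's example.
toy_line = "read1\t16\tchr1\t564478\t3\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:2\tMD:Z:5T0T0"
record = parse_sam_line(toy_line)
print(record.rname, record.pos, record.cigar, record.tags)
print("reverse strand:", is_reverse_strand(record))
```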
  • 13. DATA ANALYSIS PIPELINE FOR PROCESSING GENOMIC DATA: SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
    Mismatch:
      GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG chromosome 1
      CTGATGTGCCGCCTCACTTCGGTGGT read1
      TGATGTGCCGCCTCACTACGGTGGTG read2
      GATGTGCCGCCTCACTTCGGTGGTGA read3
      GCTGATGTGCCGCCTCACTACGGTG read4
      GCTGATGTGCCGCCTCACTACGGTG read5
    Deletion [Insertion]:
      CACCTCACCACCGAAGTGAGGCGGCACATCAGC chromosome 1
      CCTCACCA------GTGAGGCGGCACATCA read1
      TCACCA------GTGAGGCGGCACATCAGC read2
      CACCTCACCA------GTGAGGCGGCACA read3
      CTCACCA------GTGAGGCGGCACAGC read4
      ACCTCACCA------GTGAGGCGGCAC read5
    For visualising SAM - use http://software.broadinstitute.org/software/igv/
  • 14. DATA ANALYSIS PIPELINE FOR PROCESSING GENOMIC DATA: SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
    Find difference to reference. 3 - 5 million variants vs reference. https://usegalaxy.org/
    Mismatch (SNV):
      GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG chromosome 1
      CTGATGTGCCGCCTCACTTCGGTGGT read1
      TGATGTGCCGCCTCACTACGGTGGTG read2
      GATGTGCCGCCTCACTTCGGTGGTGA read3
      GCTGATGTGCCGCCTCACTACGGTG read4
      GCTGATGTGCCGCCTCACTACGGTG read5
    Deletion [Insertion]:
      CACCTCACCACCGAAGTGAGGCGGCACATCAGC chromosome 1
      CCTCACCA------GTGAGGCGGCACATCA read1
      TCACCA------GTGAGGCGGCACATCAGC read2
      CACCTCACCA------GTGAGGCGGCACA read3
      CTCACCA------GTGAGGCGGCACAGC read4
      ACCTCACCA------GTGAGGCGGCAC read5
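"Find difference to reference" can be made concrete with the mismatch example from the slide: stack the reads over the reference, count the bases seen at each position, and report positions where enough reads disagree. The sketch below is a deliberately naive illustration. The start offsets are my own placement of each read against the reference (the slide text does not carry them), the thresholds are arbitrary, and real callers such as GATK or FreeBayes use far more careful statistical models.

```python
from collections import Counter
from typing import List, Tuple

REFERENCE = "GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG"

# (0-based start on the reference, read sequence); offsets are an assumption.
ALIGNED_READS = [
    (1, "CTGATGTGCCGCCTCACTTCGGTGGT"),  # read1
    (2, "TGATGTGCCGCCTCACTACGGTGGTG"),  # read2
    (3, "GATGTGCCGCCTCACTTCGGTGGTGA"),  # read3
    (0, "GCTGATGTGCCGCCTCACTACGGTG"),   # read4
    (0, "GCTGATGTGCCGCCTCACTACGGTG"),   # read5
]


def pileup(reads: List[Tuple[int, str]], reference_length: int) -> List[Counter]:
    """Count the bases observed at every reference position (ungapped reads only)."""
    columns = [Counter() for _ in range(reference_length)]
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            columns[start + offset][base] += 1
    return columns


def call_snvs(reference, reads, min_depth=3, min_alt_fraction=0.2):
    """Return (1-based position, ref base, alt base, alt count, depth) tuples."""
    calls = []
    for position, counts in enumerate(pileup(reads, len(reference))):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        for base, count in counts.items():
            if base != reference[position] and count / depth >= min_alt_fraction:
                calls.append((position + 1, reference[position], base, count, depth))
    return calls


snv_calls = call_snvs(REFERENCE, ALIGNED_READS)
print(snv_calls)  # [(20, 'T', 'A', 3, 5)]
```

Three of the five reads carry an A where the reference has a T, which is the pattern expected for a heterozygous variant at this toy depth.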
  • 15. BIOLOGY 101: A VERY SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE/HUMAN CHROMOSOMAL MODE OF INHERITANCE 60 new mutations per generation, with a 20-year-old father transmitting ~ 25 mutations to his child, a 40-year-old father transmitting around 65 (Kong et al Nature 2012 DOI:10.1038/nature11396; Francioli et al 2015 Nature Genetics DOI:10.1038/ng.3292)
  • 16. DATA ANALYSIS PIPELINE FOR PROCESSING GENOMIC DATA: SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
    Homozygous/heterozygous
    Mismatch (SNV):
      GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG chromosome 1
      CTGATGTGCCGCCTCACTTCGGTGGT read1
      TGATGTGCCGCCTCACTACGGTGGTG read2
      GATGTGCCGCCTCACTTCGGTGGTGA read3
      GCTGATGTGCCGCCTCACTACGGTG read4
      GCTGATGTGCCGCCTCACTACGGTG read5
    Deletion [Insertion]:
      CACCTCACCACCGAAGTGAGGCGGCACATCAGC chromosome 1
      CCTCACCA------GTGAGGCGGCACATCA read1
      TCACCA------GTGAGGCGGCACATCAGC read2
      CACCTCACCA------GTGAGGCGGCACA read3
      CTCACCA------GTGAGGCGGCACAGC read4
      ACCTCACCA------GTGAGGCGGCAC read5
    VCF file:
      #CHROM POS   ID        REF ALT QUAL FILTER INFO                    FORMAT      NA00001        NA00002        NA00003
      20     14370 rs6054257 G   A   29   PASS   NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
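The VCF record on this slide can be unpacked with plain string handling, which also shows where "homozygous/heterozygous" comes from: the GT entry of each sample. This is a rough sketch with function names invented for the example; it handles only the simple single-record case and is no substitute for a real VCF library.

```python
import re
from typing import Dict

VCF_HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003"
VCF_LINE = (
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\t"
    "GT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,."
)


def parse_info(info: str) -> Dict[str, object]:
    """Split the INFO column; entries without '=' are flags and become True."""
    parsed: Dict[str, object] = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        parsed[key] = value if "=" in item else True
    return parsed


def zygosity(genotype: str) -> str:
    """Classify a GT string such as 0|0, 1|0 or 1/1."""
    alleles = re.split(r"[|/]", genotype)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


def parse_vcf_line(header: str, line: str) -> dict:
    columns = header.lstrip("#").split("\t")
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(columns[:8], fields[:8]))
    record["INFO"] = parse_info(fields[7])
    format_keys = fields[8].split(":")
    record["samples"] = {
        sample: dict(zip(format_keys, value.split(":")))
        for sample, value in zip(columns[9:], fields[9:])
    }
    return record


variant = parse_vcf_line(VCF_HEADER, VCF_LINE)
zygosities = {name: zygosity(data["GT"]) for name, data in variant["samples"].items()}
print(variant["CHROM"], variant["POS"], variant["REF"], ">", variant["ALT"])
print(zygosities)
```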
  • 17. DATA ANALYSIS PIPELINE FOR PROCESSING GENOMIC DATA: SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
    samtools pileup, GATK, FreeBayes -> Variant Call Format (VCF)
    Homozygous/heterozygous
    Mismatch (SNV):
      GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG chromosome 1
      CTGATGTGCCGCCTCACTTCGGTGGT read1
      TGATGTGCCGCCTCACTACGGTGGTG read2
      GATGTGCCGCCTCACTTCGGTGGTGA read3
      GCTGATGTGCCGCCTCACTACGGTG read4
      GCTGATGTGCCGCCTCACTACGGTG read5
    Deletion [Insertion]:
      CACCTCACCACCGAAGTGAGGCGGCACATCAGC chromosome 1
      CCTCACCA------GTGAGGCGGCACATCA read1
      TCACCA------GTGAGGCGGCACATCAGC read2
      CACCTCACCA------GTGAGGCGGCACA read3
      CTCACCA------GTGAGGCGGCACAGC read4
      ACCTCACCA------GTGAGGCGGCAC read5
    Good tutorial on this (VLSCI):
      https://docs.google.com/document/d/1lfDYNzHjfDA1pHTHd-0w3xHhg7L4TipT1gRfzgiV8es/pub
      http://vlsci.github.io/lscc_docs/tutorials/variant_calling_galaxy_1/variant_calling_galaxy_1/
      http://vlsci.github.io/lscc_docs/tutorials/var_detect_advanced/var_detect_advanced/
  • 18. DATA ANALYSIS PIPELINE FOR PROCESSING GENOMIC DATA: SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
    What do the differences actually mean? What we currently do:
    1. See if any of the observed variants match disease-associated mutations we’ve seen before (databases like OMIM, dbSNP, ClinVar, SNPedia)
    2. Predict whether mutation would “break” protein by introducing a “STOP” earlier in the sequence, or shift the frame, or change a critical amino acid
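The second approach, predicting whether a mutation would "break" the protein, can be illustrated on the short coding sequence used earlier in the talk. The sketch below is a toy classifier under simplifying assumptions: the variant lies inside a single uninterrupted coding sequence, positions are 1-based, and the function names and labels are invented here. It says nothing about whether a changed amino acid is actually critical, which is the hard part in practice.

```python
from itertools import product

# Standard genetic code on the DNA alphabet, codons in TCAG order; '*' is STOP.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}

CODING_SEQUENCE = "ATGTCTTACAAGTGCGTG"  # Met Ser Tyr Lys Cys Val


def classify_snv(coding_sequence: str, position: int, alt_base: str) -> str:
    """Classify a single-base change at a 1-based position of a coding sequence."""
    index = position - 1
    codon_start = index - index % 3
    ref_codon = coding_sequence[codon_start:codon_start + 3]
    within = index % 3
    alt_codon = ref_codon[:within] + alt_base + ref_codon[within + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid, protein unchanged
    if alt_aa == "*":
        return "nonsense"     # premature STOP truncates the protein
    if ref_aa == "*":
        return "stop_lost"
    return "missense"         # a different amino acid is incorporated


def classify_indel(ref_allele: str, alt_allele: str) -> str:
    """Insertions/deletions shift the reading frame unless they span whole codons."""
    length_change = abs(len(alt_allele) - len(ref_allele))
    return "in_frame" if length_change % 3 == 0 else "frameshift"


print(classify_snv(CODING_SEQUENCE, 9, "A"))  # TAC -> TAA
print(classify_snv(CODING_SEQUENCE, 9, "T"))  # TAC -> TAT
print(classify_snv(CODING_SEQUENCE, 7, "C"))  # TAC -> CAC
print(classify_indel("A", "AT"))
```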
  • 19. BIOLOGY 201: BUT … BUT THERE ARE MANY CHALLENGES THAT NEED TO BE ADDRESSED
  • 21. CASE STUDIES UK: GENOMICS ENGLAND 100 000 GENOMES FOR THE NHS
  • 22. CASE STUDIES UK: GENOMICS ENGLAND 100 000 GENOMES FOR THE NHS JESSICA WRIGHT ▸ Epilepsy, movement disorders, developmental delay ▸ Standard testing: MRI, lumbar puncture, EEGs and other testing (including invasive tests) did not pinpoint a cause ▸ Genomic sequencing identified a de novo mutation in Glut1, which codes for a protein responsible for transporting glucose from the blood into the brain ▸ => Ketogenic diet (low carbohydrate, high fat diet)
  • 23. CASE STUDIES 23&ME DIRECT TO CONSUMER GENETICS ▸ 23andMe ▸ Illumina HumanOmniExpress-24 array ▸ opt-in research ▸ 36 FDA approved tests + ancestry vs original kit: 254 diseases/conditions ▸ Manuel Corpas - sample data of himself and his family (23&Me, Exome sequencing)
  • 24. CASE STUDIES 23&ME DIRECT TO CONSUMER GENETICS ▸ “Genetic information can reveal that someone you thought you were related to is not your biological relative. This happens most frequently in the case of paternity.” ▸ “Learning that your genotype is associated with an increased risk of a particular condition can be difficult, especially if you have seen a friend or family member struggle with a similar issue.” ▸ “Because genetic information is hereditary, knowing something about your genetics also tells you something about those closely related to you. Your family may or may not want to know this information as well, and relationships with others can be affected by learning about your DNA.” ▸ Link & Siblings and half-siblings & Genome view
  • 25. CASE STUDIES VERIFI/HARMONY GENETIC TESTS (AUSTRALIAN PATHOLOGY) ▸ $450 AUD ▸ Tests for chromosome abnormalities: trisomy 21 (Down syndrome), trisomy 18 (Edwards syndrome) and trisomy 13 (Patau syndrome) ▸ Optional gender, Turner (Monosomy X) and Klinefelter (XXY) syndromes ▸ http://www.sonicgenetics.com.au/nipt/patients/how-it-works/
  • 26. BIOLOGY 201: BUT … BUT THERE ARE MANY (PRACTICAL) CHALLENGES THAT NEED TO BE ADDRESSED ▸ Speed (of mappers, cleaners, collapsers, annotators) is a *major* problem - in the real world, outside of the Ivory Tower ▸ Tools are not designed to work together ▸ Technical reproducibility between centres ▸ Data sharing issues, and lack of consistent nomenclature and file format (and chr) horrors ▸ Getting it wrong can have devastating consequences (pathogenic variant later reclassified as benign in prenatal diagnosis; athletes erroneously deemed to be at risk of cardiac failure) ▸ Differences in interpretation between pathologists/doctors - and hence different patient outcomes
  • 28. THE ONE SLIDE ABOUT WHAT I ACTUALLY DO… ▸ GENCODE 25 ▸ hg38
  • 29. ADDITIONAL RESOURCES ▸ Galaxy tutorials and walk-throughs (for when you’re starting out) https://wiki.galaxyproject.org/Learn/GalaxyNGS101 ▸ Broad Institute (Harvard/MIT) Public Lectures ▸ Genomics England Youtube ▸ PyCon talk by Titus Brown, with example of how to run bcbio on Ashkenazi trio dataset ▸ Bcbio sample datasets and analyses, especially the exome and whole genome variant analysis, tumour vs normal comparisons [Good for trying out variant analysis, not so good for RNA at the moment]
  • 30. IF YOU WANT TO TRY THIS AT HOME… WHERE TO GET DATA, AND HOW TO PROCESS IT ▸ Look for a research study you’re interested in on PubMed, and find where they link to the raw data (Methods section and supplementary tables, with “weird” identifiers, in fastq) ▸ Data from all research studies *[must be] is usually* deposited in the European Nucleotide Archive (ENA), where you can download it in fastq format. ▸ First, try to process it to reproduce the authors’ results. Galaxy provides a web interface that runs many standard command-line tools and allows you to look at the output - good as “leading strings” ▸ Frameworks such as bcbio provide managed environments for analysis ▸ Most biological software runs on Linux, and can be chained together using bash. I would go from an exploratory analysis in Galaxy to an analysis that chains together existing tools via bash or a complex bioinformatics pipeline management system (Wikipedia)
  • 31. IF YOU WANT TO TRY THIS AT HOME… DANGER, WILL ROBINSON! DANGER! ▸ BUT: Because of the latest technologies, you as a programming-literate individual are in a better position to understand this data than most ▸ Understanding and playing with this data is addictive - and beautiful… ▸ This is coming to a hospital near you
  • 32. OTHER “BIOLOGY” OF INTEREST… ▸ “Algorithms stuff” (Talk tomorrow!) ▸ Biological image analysis (fMRI, microscopy) ▸ Contribute to projects such as Galaxy and bcbio ▸ Machine learning of patient records ▸ Integrating IoT and wearables with medical data and patient records ▸ Cool stuff in cataloguing the genetic diversity of life, choosing which areas should be made into national parks based on data, or understanding disease spread (e.g. flu across Asia)
  • 33. ACKNOWLEDGEMENTS (I.E. THE PEOPLE I WORK WITH, WHO ARE AWESOME) CURE THE FUTURE RASKOLAB.GITHUB.IO
  • 34. QUESTIONS? @dvanichkina Slides & Questions http://daryavanichkina.com/blog/pycon2016.html Four domains of Big Data in 2025. In each of the four domains, the projected annual storage and computing needs are presented across the data lifecycle. Big Data: Astronomical or Genomical? http://dx.doi.org/10.1371/journal.pbio.1002195
  • 35. IMAGES USED ▸ Genomics England ▸ https://www.genomicsengland.co.uk/wp-content/uploads/2016/05/PhilMynott_004-1024x681.jpg ▸ NHGRI ▸ https://www.genome.gov/sequencingcostsdata/ ▸ Lung tumour image: http://edoc.hu-berlin.de/dissertationen/pietas-agnieszka-2004-11-22/HTML/chapter3.html ▸ Open Clip Art
  • 36. IMAGES USED ▸ Spurious correlations http://www.tylervigen.com/spurious-correlations ▸ http://phys.org/news/2009-11-conquer-social-network-cells.html ▸ http://lobsangstudio.com/nc_pop.cfm?id=291 ▸ BBC education - splicing http://www.bbc.co.uk/education/guides/zgrccdm/revision/2 ▸ https://www.dnastar.com/arraystar_help/index.html#!Documents/snptable.htm ▸ http://circgenetics.ahajournals.org/content/7/6/911/F2.expansion.html