Babelomics
NGS data Preprocessing
             Granada, June 2011


           Javier Santoyo-Lopez
                 jsantoyo@cipf.es
                http://bioinfo.cipf.es
               Genomics Department
   Centro de Investigacion Principe Felipe (CIPF)
                 (Valencia, Spain)
History of DNA Sequencing
   Adapted from Eric Green, NIH; adapted from Messing & Llaca, PNAS (1998)

   [Figure: timeline of sequencing milestones plotted against sequencing efficiency (bp/person/year)]

   Year    Milestone
   1870    Miescher: Discovers DNA
   1940    Avery: Proposes DNA as ‘Genetic Material’
   1953    Watson & Crick: Double Helix Structure of DNA
   1965    Holley: Sequences Yeast tRNA(Ala)
   1970    Wu: Sequences λ Cohesive End DNA
   1977    Sanger: Dideoxy Chain Termination; Gilbert: Chemical Degradation
   1980    Messing: M13 Cloning
   1986    Hood et al.: Partial Automation
   1990    Cycle Sequencing; Improved Sequencing Enzymes; Improved Fluorescent Detection Schemes
   2002
   2008    Next Generation Sequencing; Improved enzymes and chemistry; Improved image processing

   Efficiency axis (bp/person/year): 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000;
   50,000,000; 100,000,000,000. The slide places 1 at 1965, 50,000 at 1990, 50,000,000 at 2002
   and 100,000,000,000 at 2008.
Sequence Databases Trend
   EMBL database growth (March 2011)
From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069
NGS platforms comparison




06/14/11
                        Source http://www.clcngs.com/2008/12/ngs-platforms-overview/
Next-gen sequencers
   Adapted from John McPherson, OICR

   [Figure: bases per machine run (log scale, 1 Mb to 100 Gb) versus read length (log scale, 10 bp to 1,000 bp)]

   • AB SOLiDv3: 200 Gb, 50 bp reads
   • Illumina HiSeq: 150 Gb, 100 bp reads
   • 454 GS FLX Titanium: 0.4-1 Gb, 100-500 bp reads
   • ABI capillary sequencer: 0.04-0.08 Mb, 450-800 bp reads
Many Gbs of Sequences and...
• Data management becomes a challenge.
  – Moving data across file systems takes time (several hundred GB)


• What structure does the data have?
  – Different sequencers output different files, but
  – Some data formats are becoming widely accepted (e.g.
    FastQ format)


• Raw sequence data formats
  –   SFF
  –   Fasta, csfasta
  –   Qual file
  –   Fastq
Fasta & Fastq formats
• FastA format (everybody knows about it)
  – Header line starts with “>” followed by a sequence ID
  – Sequence (string of nt) on the following line(s).


• FastQ format
  – First line is the header (like Fasta, but starting with “@”), followed by
    the sequence line
  – Then a line with “+” and the sequence ID (optional), and on the following
    line the QVs (quality values), encoded as single-byte ASCII codes
    • Several quality encoding variants exist (e.g. Phred+33 and Phred+64)


• Nearly all downstream analyses take FastQ as the input
  sequence format
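The FastQ layout described above can be sketched as a small reader. This is a minimal illustration, not a replacement for an established parser: it assumes exactly four lines per read (no wrapped sequences), and the names `FastqRecord`, `read_fastq` and `decode_quality` are hypothetical helpers introduced here. The offset is the encoding variant: 33 for Phred+33, 64 for Phred+64.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str    # header text without the leading "@"
    sequence: str   # nucleotide string
    quality: str    # one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FastQ stream laid out as four lines per read."""
    while True:
        raw = handle.readline()
        if raw == "":
            return  # end of file
        header = raw.strip()
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().strip()
        plus = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FastQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert ASCII-encoded quality characters to integer scores."""
    return [ord(char) - offset for char in quality]


example = io.StringIO("@read1\nACGT\n+\nIIII\n")
records = list(read_fastq(example))
scores = decode_quality(records[0].quality)  # "I" is ASCII 73, so 73 - 33 = 40
```

Reading record by record keeps memory use flat, which matters once files reach tens of GB.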
Sequence to Variation Workflow
   [Figure: workflow diagram; steps in the order suggested by the slide labels]

   Raw Data → FastQ → FastX Toolkit → FastQ → BWA/BWASW → SAM → SAM Tools → BAM
   → Samtools mPileup → Pileup → BCF Tools → Raw VCF → BCF Tools → Filter VCF

   Other labels on the diagram: IGV, GFF
Why Quality Control and
         Preprocessing?
   Sequencer output:
            Reads + quality
            Is the quality of my sequenced data OK?
                   If something is wrong, can I fix it?
   Problem:
            HUGE files... What do they look like?
                @HWUSI-EAS460:2:1:368:1089#0/1
                TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG
                +HWUSI-EAS460:2:1:368:1089#0/1
                aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB

                @HWUSI-EAS460:2:1:368:528#0/1
                CTATTATAATATGACCGACCAGCTAGATCTACAGTC
                +HWUSI-EAS460:2:1:368:528#0/1
                abbbbaaaabba^aa`Y``aa`aaa``a`a__`[_

   Files are plain-text (flat) files and they are big... tens of GB (please... don't
    use MS Word to view or edit them)
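One way to look inside a very large FastQ file without opening it in an editor is to stream it. The sketch below is only an illustration: `open_text`, `head_reads` and `count_reads` are hypothetical helper names, and it assumes four lines per read with no blank lines between records.

```python
import gzip
import itertools
from typing import List


def open_text(path: str):
    """Open a plain or gzip-compressed file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def head_reads(path: str, n_reads: int = 2) -> List[str]:
    """Return the lines of the first n_reads records; the rest is never read."""
    with open_text(path) as handle:
        first_lines = itertools.islice(handle, 4 * n_reads)
        return [line.rstrip("\n") for line in first_lines]


def count_reads(path: str) -> int:
    """Count records by streaming line by line, so memory use stays constant."""
    with open_text(path) as handle:
        return sum(1 for _ in handle) // 4
```

For example, `head_reads("sample.fastq.gz", 2)` would return the first eight lines of a hypothetical compressed file.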
Sequence Quality
Per base Position

   Good data
      • Consistent
      • High quality along the read

   How to read the box plot:
   * The central red line is the median value
   * The yellow box represents the inter-quartile range (25-75%)
   * The lower and upper whiskers represent the 10% and 90% points
   * The blue line represents the mean quality
Sequence Quality
Per base Position

   Bad data
      • High variance
      • Quality decreases with length
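The box-plot statistics listed for the per-base quality plot (median, 25-75% range, 10% and 90% points, mean) can be computed per read position as in the sketch below. It is not how any particular QC tool implements the plot: `percentile` and `per_position_summary` are hypothetical names, the percentile uses simple linear interpolation, and all scores are kept in memory, so it suits a sample of reads more than a full run.

```python
from statistics import mean, median
from typing import Dict, Iterable, List, Sequence


def percentile(sorted_values: Sequence[int], fraction: float) -> float:
    """Linear-interpolation percentile of an already sorted, non-empty sequence."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def per_position_summary(quality_strings: Iterable[str],
                         offset: int = 33) -> List[Dict[str, float]]:
    """Summarise quality scores column by column (one column per read position)."""
    columns: List[List[int]] = []
    for quality in quality_strings:
        for index, char in enumerate(quality):
            if index == len(columns):
                columns.append([])  # first read long enough to reach this position
            columns[index].append(ord(char) - offset)

    summary = []
    for scores in columns:
        scores.sort()
        summary.append({
            "mean": mean(scores),
            "median": median(scores),
            "q25": percentile(scores, 0.25),
            "q75": percentile(scores, 0.75),
            "p10": percentile(scores, 0.10),
            "p90": percentile(scores, 0.90),
        })
    return summary


# Three toy reads of length 2, Phred+33: "I" = 40, "?" = 30, "5" = 20.
summary = per_position_summary(["II", "5I", "?I"])
```

A good run shows high medians and narrow boxes at every position; a widening box and falling median towards the end of the read matches the "bad data" pattern.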
Per Sequence Quality
     Distribution

   Good data
      • Most are high-quality sequences
Per Sequence Quality
       Distribution

   Bad data
      • Non-uniform distribution
      • Plot annotation: Low Quality Reads
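The per-sequence quality distribution amounts to one mean score per read, collected into a histogram. A minimal sketch, assuming Phred+33 by default; `mean_quality_histogram` and `fraction_below` are hypothetical names:

```python
from collections import Counter
from typing import Iterable


def mean_quality_histogram(quality_strings: Iterable[str], offset: int = 33) -> Counter:
    """Map each rounded mean read quality to the number of reads that have it."""
    histogram: Counter = Counter()
    for quality in quality_strings:
        if not quality:
            continue  # skip empty reads
        scores = [ord(char) - offset for char in quality]
        histogram[round(sum(scores) / len(scores))] += 1
    return histogram


def fraction_below(histogram: Counter, threshold: int) -> float:
    """Fraction of reads whose mean quality is below the threshold."""
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    low = sum(count for score, count in histogram.items() if score < threshold)
    return low / total


# Toy reads: mean qualities 40, 30 and 20 under Phred+33.
histogram = mean_quality_histogram(["IIII", "II55", "5555"])
low_fraction = fraction_below(histogram, 25)
```

A second peak at low scores in this histogram corresponds to the low-quality reads marked on the bad-data plot.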
Nucleotide Content
   per position

   Good data
      • Smooth over length
      • Organism dependent (GC)
Nucleotide Content
   per position

   Bad data
      • Sequence position bias
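Nucleotide content per position can be sketched as a table of base fractions for each read position. The function name `base_composition` is hypothetical; the fraction of N is reported alongside A, C, G and T, and reads of different lengths are allowed.

```python
from collections import Counter
from typing import Dict, Iterable, List


def base_composition(sequences: Iterable[str]) -> List[Dict[str, float]]:
    """Return, for each read position, the fraction of A, C, G, T and N."""
    counts: List[Counter] = []
    for sequence in sequences:
        for index, base in enumerate(sequence.upper()):
            if index == len(counts):
                counts.append(Counter())  # first read reaching this position
            counts[index][base] += 1

    composition = []
    for column in counts:
        total = sum(column.values())
        composition.append({base: column[base] / total for base in "ACGTN"})
    return composition


# Four toy reads; the last one is shorter than the others.
composition = base_composition(["ACGT", "ACGA", "NCGT", "ACG"])
```

In good data the four curves stay roughly flat along the read; a position where one base suddenly dominates is the kind of bias shown on the bad-data slide.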
Per sequence
GC Distribution

   Good data
      • Fits the expected distribution
      • Organism dependent
Per sequence
GC Distribution

   Bad data
      • Does not fit the expected distribution
      • Organism dependent
      • Library contamination?
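The per-sequence GC distribution can be sketched by computing the GC percentage of each read and binning the values. The names `gc_percent` and `gc_histogram` and the 5% bin width are assumptions made for this illustration; N bases are left out of the denominator.

```python
from collections import Counter
from typing import Iterable


def gc_percent(sequence: str) -> float:
    """GC percentage of one read, ignoring anything that is not A, C, G or T."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences: Iterable[str], bin_width: int = 5) -> Counter:
    """Count reads per GC bin; keys are the lower edge of each bin in percent."""
    histogram: Counter = Counter()
    for sequence in sequences:
        bin_start = int(gc_percent(sequence) // bin_width) * bin_width
        # Reads with exactly 100% GC go into the last bin.
        histogram[min(bin_start, 100 - bin_width)] += 1
    return histogram


histogram = gc_histogram(["ATGC", "GGCC", "AATT"])
```

The resulting histogram is then compared with what is expected for the organism; an extra peak away from that expectation is one hint of library contamination.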
Per base
GC Distribution

   Good data
      • No variation across the read sequence
Per base
GC Distribution

   Bad data
      • Variation across the read sequence
Per base
N content

   It is not good if there is an N bias at particular base positions
Duplicated Sequences

   I don't expect a high number of duplicated sequences:
      PCR artifact?
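A rough duplication level can be obtained by counting identical read sequences. This is a minimal sketch with a hypothetical function name, `duplication_summary`; it keeps every distinct sequence in memory, so on very large files one would probably apply it to a sample of reads.

```python
from collections import Counter
from typing import Dict, Iterable


def duplication_summary(sequences: Iterable[str]) -> Dict[str, float]:
    """Summarise how many reads are exact copies of an earlier read."""
    counts = Counter(sequences)
    total = sum(counts.values())
    distinct = len(counts)
    return {
        "total": total,
        "distinct": distinct,
        # Copies beyond the first occurrence of each sequence.
        "duplicate_fraction": (total - distinct) / total if total else 0.0,
    }


summary = duplication_summary(["AAA", "AAA", "CCC", "GGG", "AAA"])
```

Whether a given duplicate fraction is a problem depends on the experiment; a high value in a genomic library is the case where a PCR artifact is suspected.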
Length Distribution

   This is descriptive.
   Some sequencers output sequences of different lengths (e.g. 454)
Overrepresented Sequences
Question:
       If you obtain the exact same sequence too
          many times, do you have a problem?


Answer:
       Sometimes!


Examples: PCR primers (Illumina)
      GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT

      CGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
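Finding overrepresented sequences and checking them against known primers can be sketched as below. The two primer strings are the ones shown on the slide; the names `SLIDE_PRIMERS`, `overrepresented` and `shares_substring`, the 0.1% reporting threshold and the 12-base exact-match rule are assumptions made for this illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Primer sequences copied from the slide.
SLIDE_PRIMERS = [
    "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT",
    "CGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC",
]


def overrepresented(sequences: Iterable[str],
                    min_fraction: float = 0.001) -> List[Tuple[str, int, float]]:
    """List (sequence, count, fraction) for sequences above the given fraction."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return [
        (sequence, count, count / total)
        for sequence, count in counts.most_common()
        if count / total >= min_fraction
    ]


def shares_substring(sequence: str, reference: str, min_len: int = 12) -> bool:
    """True if any window of min_len bases from the sequence occurs in the reference."""
    return any(
        sequence[start:start + min_len] in reference
        for start in range(len(sequence) - min_len + 1)
    )


# First example read shown earlier in the slides: its 3' end runs into a primer.
read = "TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG"
looks_like_primer = any(shares_substring(read, primer) for primer in SLIDE_PRIMERS)
top = overrepresented(["AAAA", "AAAA", "AAAA", "CCCC"], min_fraction=0.5)
```

An overrepresented sequence that matches a primer or adapter points to a library preparation artifact; one that does not match may be genuine biology, hence the answer "Sometimes!".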
K-mer Content

   • Helps to detect problems
   • Adapters?
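K-mer content can be explored by counting every k-mer in the reads and comparing the counts with a simple expectation. In this sketch the expectation is that all 4^k k-mers are equally likely, which is a simplifying assumption chosen here and not the method of any specific tool; `kmer_counts` and `kmer_enrichment` are hypothetical names.

```python
from collections import Counter
from typing import Dict, Iterable


def kmer_counts(sequences: Iterable[str], k: int = 5) -> Counter:
    """Count all overlapping k-mers, skipping any that contain N."""
    counts: Counter = Counter()
    for sequence in sequences:
        seq = sequence.upper()
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def kmer_enrichment(counts: Counter, k: int) -> Dict[str, float]:
    """Observed count divided by the count expected if all k-mers were equally likely."""
    total = sum(counts.values())
    if total == 0:
        return {}
    expected = total / (4 ** k)
    return {kmer: count / expected for kmer, count in counts.items()}


counts = kmer_counts(["ACGTACGT"], k=4)
enrichment = kmer_enrichment(counts, k=4)
```

K-mers with a very high ratio, especially when they match part of a primer or adapter, are worth inspecting.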
Practical:
     FastQC and Fastx-toolkit

   Use FastQC to see your starting state.
   Use Fastx-toolkit to optimize different datasets
    and then visualize the result with FastQC to
    prove your success!

    Hints: Try trimming, clipping and quality filtering.



    Go to the tutorial and try the exercises...
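To make the three hinted operations concrete before running the tools, here is a plain-Python sketch of trimming, clipping and quality filtering. It is a stand-in for illustration and does not reproduce the algorithms or options of Fastx-toolkit: the function names, the exact-match adapter rule and the default thresholds are all assumptions.

```python
from typing import Optional, Tuple


def trim(sequence: str, quality: str, first: int = 1,
         last: Optional[int] = None) -> Tuple[str, str]:
    """Keep bases first..last (1-based, inclusive); last=None keeps the read end."""
    start = first - 1
    return sequence[start:last], quality[start:last]


def clip_adapter(sequence: str, quality: str, adapter: str,
                 min_overlap: int = 8) -> Tuple[str, str]:
    """Cut the read where the adapter starts (exact matches only)."""
    index = sequence.find(adapter)
    if index == -1:
        # Look for an adapter prefix running off the 3' end of the read.
        longest = min(len(adapter), len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                index = len(sequence) - overlap
                break
    if index == -1:
        return sequence, quality  # no adapter found
    return sequence[:index], quality[:index]


def passes_quality(quality: str, min_score: int = 20,
                   min_percent: float = 80.0, offset: int = 33) -> bool:
    """True if at least min_percent of the bases have a score >= min_score."""
    if not quality:
        return False
    good = sum(1 for char in quality if ord(char) - offset >= min_score)
    return 100.0 * good / len(quality) >= min_percent


# First example read from the slides, clipped with the first primer listed there.
read_seq = "TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG"
read_qual = "aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB"
primer = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT"
clipped_seq, clipped_qual = clip_adapter(read_seq, read_qual, primer)
```

The example read's characters suggest an offset of 64, so quality filtering on it would use `passes_quality(clipped_qual, offset=64)`.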

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Protein Databases
Protein DatabasesProtein Databases
Protein Databases
 
Analysis of ChIP-Seq Data
Analysis of ChIP-Seq DataAnalysis of ChIP-Seq Data
Analysis of ChIP-Seq Data
 
NEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCINGNEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCING
 
Proteins databases
Proteins databasesProteins databases
Proteins databases
 
Protein database
Protein databaseProtein database
Protein database
 
New Generation Sequencing Technologies: an overview
New Generation Sequencing Technologies: an overviewNew Generation Sequencing Technologies: an overview
New Generation Sequencing Technologies: an overview
 
Sequence assembly
Sequence assemblySequence assembly
Sequence assembly
 
Whole genome sequencing
Whole genome sequencingWhole genome sequencing
Whole genome sequencing
 
Introduction to databases.pptx
Introduction to databases.pptxIntroduction to databases.pptx
Introduction to databases.pptx
 
Scoring schemes in bioinformatics (blosum)
Scoring schemes in bioinformatics (blosum)Scoring schemes in bioinformatics (blosum)
Scoring schemes in bioinformatics (blosum)
 
DNA Microarray
DNA MicroarrayDNA Microarray
DNA Microarray
 
Illumina infinium sequencing
Illumina infinium sequencingIllumina infinium sequencing
Illumina infinium sequencing
 
EMBL-EBI
EMBL-EBIEMBL-EBI
EMBL-EBI
 
Transcript detection in RNAseq
Transcript detection in RNAseqTranscript detection in RNAseq
Transcript detection in RNAseq
 
SEQUENCE ANALYSIS
SEQUENCE ANALYSISSEQUENCE ANALYSIS
SEQUENCE ANALYSIS
 
RNA-seq Data Analysis Overview
RNA-seq Data Analysis OverviewRNA-seq Data Analysis Overview
RNA-seq Data Analysis Overview
 
Scoring schemes in bioinformatics
Scoring schemes in bioinformaticsScoring schemes in bioinformatics
Scoring schemes in bioinformatics
 
Prosite
PrositeProsite
Prosite
 
Swiss PROT
Swiss PROT Swiss PROT
Swiss PROT
 
Rna seq and chip seq
Rna seq and chip seqRna seq and chip seq
Rna seq and chip seq
 

Destacado

File formats for Next Generation Sequencing
File formats for Next Generation SequencingFile formats for Next Generation Sequencing
File formats for Next Generation SequencingPierre Lindenbaum
 
Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGScursoNGS
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotLi Shen
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data Surya Saha
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analysesrjorton
 
diffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packagediffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packageLi Shen
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsAnnelies Haegeman
 
Next-Generation Sequencing Clinical Research Milestones Infographic
Next-Generation Sequencing Clinical Research Milestones InfographicNext-Generation Sequencing Clinical Research Milestones Infographic
Next-Generation Sequencing Clinical Research Milestones InfographicQIAGEN
 
Genomics in the Cloud
Genomics in the CloudGenomics in the Cloud
Genomics in the CloudMatt Wood
 
Next-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryNext-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryJan Aerts
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...QBiC_Tue
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Manikhandan Mudaliar
 

Destacado (20)

File formats for Next Generation Sequencing
File formats for Next Generation SequencingFile formats for Next Generation Sequencing
File formats for Next Generation Sequencing
 
Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGS
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
 
Introduction to next generation sequencing
Introduction to next generation sequencingIntroduction to next generation sequencing
Introduction to next generation sequencing
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data
 
NGS - QC & Dataformat
NGS - QC & Dataformat NGS - QC & Dataformat
NGS - QC & Dataformat
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
diffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packagediffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis package
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Introduction to Linux
Introduction to LinuxIntroduction to Linux
Introduction to Linux
 
Ngs ppt
Ngs pptNgs ppt
Ngs ppt
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platforms
 
Next-Generation Sequencing Clinical Research Milestones Infographic
Next-Generation Sequencing Clinical Research Milestones InfographicNext-Generation Sequencing Clinical Research Milestones Infographic
Next-Generation Sequencing Clinical Research Milestones Infographic
 
True Single Molecule Sequencing
True Single Molecule SequencingTrue Single Molecule Sequencing
True Single Molecule Sequencing
 
Clinical Application 2.0
Clinical Application 2.0Clinical Application 2.0
Clinical Application 2.0
 
Data analysis pipelines for NGS applications
Data analysis pipelines for NGS applicationsData analysis pipelines for NGS applications
Data analysis pipelines for NGS applications
 
Genomics in the Cloud
Genomics in the CloudGenomics in the Cloud
Genomics in the Cloud
 
Next-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryNext-generation sequencing - variation discovery
Next-generation sequencing - variation discovery
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
 

Similar a NGS Data Preprocessing

Next Gen Sequencing Technologies Overview
Next Gen Sequencing Technologies OverviewNext Gen Sequencing Technologies Overview
Next Gen Sequencing Technologies OverviewPatrick Merel
 
Next generation sequencing in cloud computing era
Next generation sequencing in cloud computing eraNext generation sequencing in cloud computing era
Next generation sequencing in cloud computing eraThomas Keane
 
NGx Sequencing 101-platforms
NGx Sequencing 101-platformsNGx Sequencing 101-platforms
NGx Sequencing 101-platformsAllSeq
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataThomas Keane
 
Nextgentechnologies 124159213386-phpapp01
Nextgentechnologies 124159213386-phpapp01Nextgentechnologies 124159213386-phpapp01
Nextgentechnologies 124159213386-phpapp01Nicolas Gobet
 
Ngs microbiome
Ngs microbiomeNgs microbiome
Ngs microbiomejukais
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pubsesejun
 
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...Lex Nederbragt
 
O connor bosc2010
O connor bosc2010O connor bosc2010
O connor bosc2010BOSC 2010
 
CS Lecture 2017 04-11 from Data to Precision Medicine
CS Lecture 2017 04-11 from Data to Precision MedicineCS Lecture 2017 04-11 from Data to Precision Medicine
CS Lecture 2017 04-11 from Data to Precision MedicineGabe Rudy
 
Creating a SNP calling pipeline
Creating a SNP calling pipelineCreating a SNP calling pipeline
Creating a SNP calling pipelineDan Bolser
 
Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004xlight
 
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...confluent
 
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...Thomas Keane
 
Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...Miten Jain
 
Guy Coates
Guy CoatesGuy Coates
Guy CoatesEduserv
 
New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...
New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...
New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...Lex Nederbragt
 
How to cluster and sequence an ngs library (james hadfield160416)
How to cluster and sequence an ngs library (james hadfield160416)How to cluster and sequence an ngs library (james hadfield160416)
How to cluster and sequence an ngs library (james hadfield160416)James Hadfield
 

Similar a NGS Data Preprocessing (20)

Next Gen Sequencing Technologies Overview
Next Gen Sequencing Technologies OverviewNext Gen Sequencing Technologies Overview
Next Gen Sequencing Technologies Overview
 
Next generation sequencing in cloud computing era
Next generation sequencing in cloud computing eraNext generation sequencing in cloud computing era
Next generation sequencing in cloud computing era
 
NGx Sequencing 101-platforms
NGx Sequencing 101-platformsNGx Sequencing 101-platforms
NGx Sequencing 101-platforms
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
 
Nextgentechnologies 124159213386-phpapp01
Nextgentechnologies 124159213386-phpapp01Nextgentechnologies 124159213386-phpapp01
Nextgentechnologies 124159213386-phpapp01
 
Hong_Celine_ES_workshop.pptx
Hong_Celine_ES_workshop.pptxHong_Celine_ES_workshop.pptx
Hong_Celine_ES_workshop.pptx
 
Ngs microbiome
Ngs microbiomeNgs microbiome
Ngs microbiome
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
 
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
 
O connor bosc2010
O connor bosc2010O connor bosc2010
O connor bosc2010
 
CS Lecture 2017 04-11 from Data to Precision Medicine
CS Lecture 2017 04-11 from Data to Precision MedicineCS Lecture 2017 04-11 from Data to Precision Medicine
CS Lecture 2017 04-11 from Data to Precision Medicine
 
Creating a SNP calling pipeline
Creating a SNP calling pipelineCreating a SNP calling pipeline
Creating a SNP calling pipeline
 
Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004
 
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...
 
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
 
Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...
 
Guy Coates
Guy CoatesGuy Coates
Guy Coates
 
New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...
New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...
New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...
 
How to cluster and sequence an ngs library (james hadfield160416)
How to cluster and sequence an ngs library (james hadfield160416)How to cluster and sequence an ngs library (james hadfield160416)
How to cluster and sequence an ngs library (james hadfield160416)
 
Xin Zhou - Saturday Closing Plenary
Xin Zhou - Saturday Closing PlenaryXin Zhou - Saturday Closing Plenary
Xin Zhou - Saturday Closing Plenary
 

Más de cursoNGS

Towards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systemsTowards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systemscursoNGS
 
Utilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanaUtilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanacursoNGS
 
NGS analysis of micro-RNA
NGS analysis of micro-RNANGS analysis of micro-RNA
NGS analysis of micro-RNAcursoNGS
 
Differential expression in RNA-Seq
Differential expression in RNA-SeqDifferential expression in RNA-Seq
Differential expression in RNA-SeqcursoNGS
 
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysisComputational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysiscursoNGS
 
Introduction to NGS
Introduction to NGSIntroduction to NGS
Introduction to NGScursoNGS
 
Linux for bioinformatics
Linux for bioinformaticsLinux for bioinformatics
Linux for bioinformaticscursoNGS
 
Introduccion a la bioinformatica
Introduccion a la bioinformaticaIntroduccion a la bioinformatica
Introduccion a la bioinformaticacursoNGS
 

Más de cursoNGS (8)

Towards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systemsTowards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systems
 
Utilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanaUtilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humana
 
NGS analysis of micro-RNA
NGS analysis of micro-RNANGS analysis of micro-RNA
NGS analysis of micro-RNA
 
Differential expression in RNA-Seq
Differential expression in RNA-SeqDifferential expression in RNA-Seq
Differential expression in RNA-Seq
 
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysisComputational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
 
Introduction to NGS
Introduction to NGSIntroduction to NGS
Introduction to NGS
 
Linux for bioinformatics
Linux for bioinformaticsLinux for bioinformatics
Linux for bioinformatics
 
Introduccion a la bioinformatica
Introduccion a la bioinformaticaIntroduccion a la bioinformatica
Introduccion a la bioinformatica
 

Último

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
NGS Data Preprocessing

  • 1. Babelomics NGS data Preprocessing Granada, June 2011 Javier Santoyo-Lopez jsantoyo@cipf.es http://bioinfo.cipf.es Genomics Department Centro de Investigacion Principe Felipe (CIPF) (Valencia, Spain)
  • 2. History of DNA Sequencing (adapted from Eric Green, NIH; adapted from Messing & Llaca, PNAS (1998))
    – 1870 Miescher: Discovers DNA
    – 1940 Avery: Proposes DNA as ‘Genetic Material’
    – 1953 Watson & Crick: Double Helix Structure of DNA
    – 1965 Holley: Sequences Yeast tRNA(Ala)
    – 1970 Wu: Sequences λ Cohesive End DNA
    – 1977 Sanger: Dideoxy Chain Termination; Gilbert: Chemical Degradation
    – 1980 Messing: M13 Cloning
    – 1986 Hood et al.: Partial Automation
    – 1990 Cycle Sequencing; Improved Sequencing Enzymes; Improved Fluorescent Detection Schemes
    – 2002 to 2008 Next Generation Sequencing; Improved enzymes and chemistry; Improved image processing
    – Efficiency axis (bp/person/year), rising over the same period: 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000
  • 3. Sequence Databases Trend: EMBL database growth (March 2011)
  • 4. From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069
  • 5. NGS platforms comparison (06/14/11). Source: http://www.clcngs.com/2008/12/ngs-platforms-overview/
  • 6. Next-gen sequencers (adapted from John McPherson, OICR). Plot of bases per machine run (1 Mb to 100 Gb) against read length (10 bp to 1,000 bp):
    – AB SOLiDv3: 200 Gb, 50 bp reads
    – Illumina HiSeq: 150 Gb, 100 bp reads
    – 454 GS FLX Titanium: 0.4-1 Gb, 100-500 bp reads
    – ABI capillary sequencer: 0.04-0.08 Mb, 450-800 bp reads
  • 7. Many Gbs of Sequences and...
    • Data management becomes a challenge.
      – Moving data across file systems takes time (several hundred Gbs)
    • What structure does the data have?
      – Different sequencers output different files, but
      – some data formats are becoming widely accepted (e.g. FastQ format)
    • Raw sequence data formats
      – SFF
      – Fasta, csfasta
      – Qual file
      – Fastq
  • 8. Fasta & Fastq formats
    • FastA format (everybody knows about it)
      – Header line starts with “>” followed by a sequence ID
      – Sequence (string of nt)
    • FastQ format
      – First comes the header line (like Fasta, but starting with “@”), followed by the sequence
      – Then a line with “+” and, optionally, the sequence ID again; the following line holds the QVs (quality values), encoded as single-byte ASCII codes
    • Several quality-encoding variants exist
    • Nearly all downstream analyses take FastQ as input
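The four-line FastQ layout described on this slide can be read with a few lines of Python. The following is a minimal sketch, not a full parser: it assumes each record occupies exactly four lines (no wrapped sequences), and the names `FastqRecord` and `read_fastq` are made up for this illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str   # text after the leading "@"
    sequence: str  # nucleotide string
    quality: str   # one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:].strip(), sequence, quality)


# Tiny in-memory example; a real file would be opened with open(path).
example = io.StringIO(
    "@read1\nACGT\n+\nabba\n"
    "@read2\nTTGA\n+read2\n`a_B\n"
)
records = list(read_fastq(example))
print(records[0].read_id, records[0].sequence)
```

Because the function is a generator, it processes one record at a time and never loads the whole file into memory, which matters for files of tens of Gb.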
  • 9. Sequence to Variation Workflow (diagram). Nodes shown: Raw Data, FastQ, BWA/BWASW, SAM, SAM Tools, BAM, Samtools mPileup, Pileup, BCF, BCF Tools, Raw VCF, Filter, VCF, GFF, IGV
  • 10. Sequence to Variation Workflow (same diagram, with a FastX Toolkit step added between the raw FastQ data and the FastQ passed to BWA/BWASW)
  • 11. Why Quality Control and Preprocessing?
    • Sequencer output:
      – Reads + quality
    • Is the quality of my sequenced data OK?
  • 12. Why Quality Control and Preprocessing?
    • Sequencer output:
      – Reads + quality
    • Is the quality of my sequenced data OK? If something is wrong, can I fix it?
  • 13. Why Quality Control and Preprocessing?
    • Sequencer output:
      – Reads + quality
    • Is the quality of my sequenced data OK? If something is wrong, can I fix it?
    • Problem:
      – HUGE files...
  • 14. Why Quality Control and Preprocessing?
    • Sequencer output:
      – Reads + quality
    • Is the quality of my sequenced data OK? If something is wrong, can I fix it?
    • Problem:
      – HUGE files... How do they look?
        @HWUSI-EAS460:2:1:368:1089#0/1
        TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG
        +HWUSI-EAS460:2:1:368:1089#0/1
        aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB
        @HWUSI-EAS460:2:1:368:528#0/1
        CTATTATAATATGACCGACCAGCTAGATCTACAGTC
        +HWUSI-EAS460:2:1:368:528#0/1
        abbbbaaaabba^aa`Y``aa`aaa``a`a__`[_
    • Files are flat text files and are big... tens of Gbs (please... don't use MS Word to view or edit them)
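The quality line of a read only becomes readable once each character is converted back to a number. The sketch below decodes the first quality string from this slide. It assumes an ASCII offset of 64, which is a guess based on the characters present ("B" up to "b"), as used by Illumina pipelines of that period; Sanger-style files use an offset of 33 instead. The function names are hypothetical.

```python
def decode_quality(quality, offset=64):
    """Convert a FASTQ quality string into a list of integer scores.

    The offset of 64 is an assumption about the encoding variant;
    pass offset=33 for Sanger-encoded files.
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(score):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


quality = "aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB"
scores = decode_quality(quality)
print(scores[:3])    # first three scores
print(scores[-9:])   # the trailing run of "B" characters
```

With this offset, the run of "B" characters at the end of the read decodes to a score of 2, which in this encoding flags a low-quality tail rather than a measured error rate.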
  • 15. Sequence Quality Per base Position: Good data
    • Consistent
    • High quality along the read
    * The central red line is the median value
    * The yellow box represents the inter-quartile range (25-75%)
    * The upper and lower whiskers represent the 10% and 90% points
    * The blue line represents the mean quality
  • 16. Sequence Quality Per base Position: Bad data
    • High variance
    • Quality decreases with length
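The numbers behind a per-base quality box plot (median, inter-quartile range, 10% and 90% points, mean) can be computed per read position as sketched below. This is an illustration of the statistics only, not FastQC's implementation; the linear-interpolation percentile, the offset of 64 and the function names are assumptions made for the example.

```python
from statistics import mean, median


def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of an already sorted, non-empty list."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def per_position_summary(quality_strings, offset=64):
    """Summarise quality scores column by column (one column per read position)."""
    columns = []
    for quality in quality_strings:
        for i, ch in enumerate(quality):
            if i == len(columns):
                columns.append([])  # first read long enough to reach this position
            columns[i].append(ord(ch) - offset)
    summary = []
    for scores in columns:
        scores.sort()
        summary.append({
            "mean": mean(scores),
            "median": median(scores),
            "q25": percentile(scores, 0.25),
            "q75": percentile(scores, 0.75),
            "p10": percentile(scores, 0.10),
            "p90": percentile(scores, 0.90),
        })
    return summary


# Three toy reads whose quality drops towards the 3' end.
summary = per_position_summary(["hhhh", "hhdB", "hd`B"])
for position, stats in enumerate(summary, start=1):
    print(position, stats["median"], round(stats["mean"], 1))
```

In good data the medians stay high and the boxes stay narrow along the read; in the toy input the last position shows the drop and the wide spread typical of bad data.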
  • 17. Per Sequence Quality Distribution: Good data
    • Most are high-quality sequences
  • 18. Per Sequence Quality Distribution: Bad data
    • Distribution is not uniform
    • Low-quality reads
  • 19. Nucleotide Content per position: Good data
    • Smooth over length
    • Organism dependent (GC)
  • 20. Nucleotide Content per position: Bad data
    • Sequence position bias
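Nucleotide content per position is a simple column-wise tally of bases. The sketch below reports the fraction of A, C, G, T and N at each read position; it is an illustration under the assumption that reads are plain strings, and `base_composition` is a hypothetical name. Characters other than A, C, G, T and N count towards the total but are not reported.

```python
from collections import Counter


def base_composition(sequences):
    """Return one dict per read position mapping base -> fraction of reads."""
    counts = []
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    fractions = []
    for counter in counts:
        total = sum(counter.values())
        # Counter returns 0 for bases never seen at this position.
        fractions.append({base: counter[base] / total for base in "ACGTN"})
    return fractions


composition = base_composition(["ACGT", "ACGA", "NCGT", "ACTT"])
for position, fractions in enumerate(composition, start=1):
    print(position, fractions)
```

Flat curves across positions correspond to the "smooth over length" case; a base fraction that jumps at specific positions, as at position 2 of the toy input where every read carries a C, is the kind of position bias to look out for.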
  • 21. GC Distribution: Good data
    • Fits the expected distribution
    • Organism dependent
  • 22. Per sequence GC Distribution: Bad data
    • Does not fit the expected distribution
    • Organism dependent
    • Library contamination?
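A per-sequence GC distribution is obtained by computing the GC percentage of every read and binning the values. The following sketch assumes that N bases are ignored when computing the percentage and uses 10% bins; both choices, and the function names, are assumptions made for the illustration.

```python
from collections import Counter


def gc_percent(sequence):
    """GC percentage of one read, ignoring anything that is not A, C, G or T."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences, bin_width=10):
    """Count reads per GC bin; keys are the lower edge of each bin in percent."""
    histogram = Counter()
    for seq in sequences:
        bin_start = int(gc_percent(seq) // bin_width) * bin_width
        bin_start = min(bin_start, 100 - bin_width)  # put 100% into the last bin
        histogram[bin_start] += 1
    return dict(sorted(histogram.items()))


print(gc_histogram(["GGCC", "ATAT", "ACGT", "ACGN"]))
```

The resulting histogram can then be compared with the GC content expected for the organism; a second peak or a shifted peak is one possible hint of library contamination.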
  • 23. Per base GC Distribution: Good data
    • No variation across the read sequence
  • 24. Per base GC Distribution: Bad data
    • Variation across the read sequence
  • 25. Per base N content: It's not good if there is N bias per base position
  • 26. Duplicated Sequences: I don't expect a high number of duplicated sequences:
    • PCR artifact?
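The duplication level of a read set can be estimated by counting identical sequences. The sketch below does exact counting, which is an assumption that only holds for small inputs: it keeps every distinct read in memory, so for files of tens of Gb one would typically work on a sample. `duplication_summary` is a hypothetical name.

```python
from collections import Counter


def duplication_summary(sequences):
    """Report how many reads are exact copies of an earlier read."""
    counts = Counter(sequences)
    total = sum(counts.values())
    distinct = len(counts)
    duplicated = total - distinct  # copies beyond the first occurrence
    percent = 100.0 * duplicated / total if total else 0.0
    return {"total": total, "distinct": distinct, "percent_duplicated": percent}


result = duplication_summary(["ACGT", "ACGT", "ACGT", "TTTT", "GGGG"])
print(result)
```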
  • 27. Distribution Length
    • This is descriptive.
    • Some sequencers output sequences of different lengths (e.g. 454)
  • 28. Overrepresented Sequences
    Question: If you obtain the exact same sequence too many times, do you have a problem?
    Answer: Sometimes!
    Examples: PCR primers (Illumina)
    • GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT
    • CGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
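Two simple checks follow from this slide: which exact sequences make up an unusually large share of the reads, and how many reads contain the start of a known primer. The sketch below uses the two example reads shown earlier in the deck and the first primer listed here. The 0.1% threshold and the 12-base seed are illustrative parameters, and the function names are hypothetical.

```python
from collections import Counter


def overrepresented(sequences, min_fraction=0.001):
    """List (sequence, count, fraction) for sequences above a share threshold."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]


def adapter_hits(sequences, adapter, min_overlap=12):
    """Count reads containing at least the first min_overlap bases of the adapter."""
    seed = adapter[:min_overlap]
    return sum(1 for seq in sequences if seed in seq)


reads = [
    "TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG",
    "CTATTATAATATGACCGACCAGCTAGATCTACAGTC",
]
primer = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT"
print(adapter_hits(reads, primer))
print(overrepresented(["AAAA"] * 3 + ["CCCC"], min_fraction=0.5))
```

The first example read ends in GATCGGAAGAGCGG, the beginning of the primer, so it is counted as a hit; exact substring matching like this misses adapters with sequencing errors, which dedicated clipping tools handle more carefully.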
  • 29. K-mer Content
    • Helps to detect problems
    • Adapters?
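K-mer content is based on counting every substring of length k across the reads. A minimal sketch, assuming overlapping k-mers and a small k so the table stays manageable; `count_kmers` is a hypothetical name and the enrichment statistics a QC tool would compute on top are left out.

```python
from collections import Counter


def count_kmers(sequences, k=5):
    """Count all overlapping k-mers; reads shorter than k contribute nothing."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts


kmers = count_kmers(["ACGTAC", "CGTACG"], k=3)
print(kmers.most_common(4))
```

K-mers that occur far more often than the base composition would predict, especially at fixed read positions, are candidates for adapter or primer fragments.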
  • 30. Practical: FastQC and Fastx-toolkit
    • Use FastQC to see your starting state.
    • Use Fastx-toolkit to optimize different datasets, then visualize the result with FastQC to prove your success!
    Hints: Try trimming, clipping and quality filtering. Go to the tutorial and try the exercises...
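To make the three hinted operations concrete, the sketch below shows what trimming, clipping and quality filtering do to a single read. It is a simplified illustration in Python, not the Fastx-toolkit implementation: exact-match clipping, the quality offset of 64, the thresholds and all function names are assumptions. The read and primer are the examples shown earlier in the deck.

```python
def trim(seq, qual, first=1, last=None):
    """Keep bases first..last (1-based, inclusive) of the read."""
    end = len(seq) if last is None else last
    return seq[first - 1:end], qual[first - 1:end]


def clip_adapter(seq, qual, adapter, min_overlap=8):
    """Cut the read at the first exact match of the adapter's leading bases."""
    position = seq.find(adapter[:min_overlap])
    if position == -1:
        return seq, qual  # no adapter found, read unchanged
    return seq[:position], qual[:position]


def passes_quality(qual, min_quality=20, min_percent=80.0, offset=64):
    """True if at least min_percent of the bases score min_quality or more."""
    if not qual:
        return False
    good = sum(1 for ch in qual if ord(ch) - offset >= min_quality)
    return 100.0 * good / len(qual) >= min_percent


seq = "TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG"
qual = "aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB"
primer = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT"

clipped_seq, clipped_qual = clip_adapter(seq, qual, primer)
print(clipped_seq)
print(passes_quality(qual), passes_quality(clipped_qual))
```

Under these assumptions the raw read fails the quality filter because of its low-quality tail, while the clipped read passes, which is the kind of before/after difference a second FastQC run should show.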