Surya Saha ss2489@cornell.edu
BTI PGRP Summer Internship Program 2014
Slides: https://bitly.com/BioinfoInternEx2014
Quality Control of NGS Data
1. Evaluation
2. Preprocessing
7/8/2014 BTI PGRP Summer Internship Program 2014
Slide credit (slides 2-5): Aureliano Bombarely
Evaluation
Goal:
Learn to use read evaluation programs, paying attention
to relevant parameters such as the quality score distribution,
the read length distribution and read duplication.
Data:
(Illumina data for two tomato ripening stages)
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
Tools:
tar -zxvf (command line, untar and unzip the files)
head (command line, take a quick look at the files)
mv (command line, rename the files)
grep (command line, find/count patterns in files)
FASTX toolkit (command line, process fasta/fastq)
FastQC (GUI, calculates several statistics for each file)
Evaluation
Exercise 1:
1. Untar and unzip the file:
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
2. The raw data will be in two directories: breaker and
immature_fruit. Print the first 10 lines of the files
SRR404331_ch4.fq, SRR404333_ch4.fq,
SRR404334_ch4.fq and SRR404336_ch4.fq.
Question 1.1: Are these files in fastq format?
3. Change the extension of the .fq files to .fastq.
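Steps 1 to 3 could be carried out along the following lines. This is a minimal sketch, not the course solution: the function names are hypothetical, and it assumes the archive unpacks the breaker and immature_fruit directories into the current working directory.

```shell
# Sketch for Exercise 1, steps 1-3. Function names are illustrative only.

# Step 1: unpack a gzipped tar archive into the current directory.
unpack_dataset() {
    tar -zxvf "$1"
}

# Step 2: print the first 10 lines of every .fq file directly inside a directory.
preview_fq() {
    for f in "$1"/*.fq; do
        [ -e "$f" ] || continue      # skip if the glob matched nothing
        echo "==> $f <=="
        head -n 10 "$f"
    done
}

# Step 3: rename every .fq file directly inside a directory to .fastq.
rename_fq_to_fastq() {
    for f in "$1"/*.fq; do
        [ -e "$f" ] || continue
        mv -- "$f" "${f%.fq}.fastq"
    done
}

# Example usage (paths taken from the exercise):
# unpack_dataset /home/bioinfo/Data/ch4_demo_dataset.tar.gz
# preview_fq breaker; preview_fq immature_fruit
# rename_fq_to_fastq breaker; rename_fq_to_fastq immature_fruit
```

For Question 1.1, a fastq record normally spans four lines: a header starting with '@', the sequence, a separator line starting with '+', and a quality string of the same length as the sequence.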
Evaluation
Exercise 1 (continued):
4. Count the number of sequences in each fastq file using
commands you learned earlier.
5. Convert the fastq files to fasta.
6. Explore other tools in the FASTX toolkit.
7. Now count the number of sequences in each fasta file and see
if the number of sequences has changed.
Tip: Use ‘grep’
Tip: Use ‘fastq_to_fasta -h’ to see help
Use Google if you are stuck
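One possible way to approach steps 4, 5 and 7 is sketched below. The function names are hypothetical, the '@SRR' header prefix is an assumption based on the file names, and the -Q33 option assumes the reads use Phred+33 quality encoding; check 'fastq_to_fasta -h' on your installation before relying on either.

```shell
# Sketch for Exercise 1, steps 4-7. Function names are illustrative only.

# Count fastq records, assuming each record is exactly 4 lines
# (header, sequence, '+', quality) with no line wrapping.
count_fastq() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Count fastq headers with grep, given a prefix shared by all read IDs
# (for example '@SRR', which is an assumption about these files).
# A bare '^@' can overcount, because '@' is also a valid quality
# character and may start a quality line.
count_fastq_grep() {
    grep -c "^$2" "$1"
}

# Count fasta records: every record starts with a '>' header line.
count_fasta() {
    grep -c '^>' "$1"
}

# Convert fastq to fasta with the FASTX toolkit.
# -Q33 (assumption: Phred+33 encoded qualities), -i input, -o output.
convert_to_fasta() {
    fastq_to_fasta -Q33 -i "$1" -o "$2"
}

# Example usage:
# count_fastq breaker/SRR404331_ch4.fastq
# convert_to_fasta breaker/SRR404331_ch4.fastq breaker/SRR404331_ch4.fasta
# count_fasta breaker/SRR404331_ch4.fasta
```

If the fasta count in step 7 is lower than the fastq count, one likely cause is that fastq_to_fasta discards reads containing N by default; its help text lists -n as the option to keep them.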
Evaluation: Sequence Quality
(Figures: good Illumina dataset; good vs. poor Illumina dataset; 454 and Pacific Biosciences)
Evaluation: Sequence Content
(Figures: good Illumina dataset; good vs. poor Illumina dataset)
Evaluation: Duplication
(Figures: good Illumina dataset; good vs. poor Illumina dataset)
Evaluation: Overrepresented Sequences
(Figures: good Illumina dataset; good vs. poor Illumina dataset)
Evaluation: Kmer content
(Figures: good Illumina dataset; good vs. poor Illumina dataset; 454 and Pacific Biosciences)
Evaluation
Exercise 2:
1. Type ‘fastqc’ to start the FastQC program. Load the four
fastq sequence files into the program.
Question 2.2: How many sequences are there per file in FastQC?
Question 2.3: What is the length range of these reads?
Question 2.4: What is the quality score range of these reads? Which
dataset looks best quality-wise?
Question 2.5: Do these datasets have overrepresented reads?
Question 2.6: Looking at the kmer content, do you think that the samples
contain an adaptor?
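The exercise uses the FastQC GUI, but FastQC also runs in batch mode when file names are given on the command line. The sketch below is an optional alternative, with hypothetical function names and output directory; it assumes each extracted report folder contains a tab-separated fastqc_data.txt file.

```shell
# Sketch for Exercise 2 without the GUI. Function names are illustrative only.

# Run FastQC on one or more fastq files and unpack each report
# into the given output directory.
run_fastqc() {
    outdir="$1"
    shift
    mkdir -p "$outdir"
    fastqc --extract -o "$outdir" "$@"
}

# Print the Basic Statistics fields relevant to Questions 2.2 and 2.3
# from one extracted fastqc_data.txt (tab-separated measure/value lines).
basic_stats() {
    awk -F '\t' '$1 == "Total Sequences" || $1 == "Sequence length" { print $1 ": " $2 }' "$1"
}

# Example usage:
# run_fastqc fastqc_raw breaker/*.fastq immature_fruit/*.fastq
# basic_stats fastqc_raw/SRR404331_ch4_fastqc/fastqc_data.txt
```

The quality, duplication, overrepresentation and kmer questions are easier to answer from the plots in the GUI or in the HTML report that FastQC writes next to each archive.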
Preprocessing
Goal:
Trim the low-quality ends of the reads and remove
the short reads.
Data:
(Illumina data for two tomato ripening stages)
ch4_demo_dataset.tar.gz
Tools:
fastq-mcf (command line tool to process reads)
FastQC (GUI, calculates several statistics for each file)
Preprocessing
Exercise 3:
• Download the file adapters1.fa from
ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCorpoica/adapters1.fa
• Run the read processing program over each of the datasets
using
  • Min. qscore of 30
  • Min. length of 40 bp
• Type ‘fastqc’ to start the FastQC program. Load the four
new fastq sequence files. Compare the results with the
previous datasets.
Tip: Use ‘fastqc -h’ to see help
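The trimming step might look roughly like this with fastq-mcf. It is a sketch under assumptions rather than the course solution: the function names and the '.trimmed.fastq' naming are hypothetical, and the input files are assumed to have already been renamed to .fastq.

```shell
# Sketch for Exercise 3. Function names and output naming are illustrative only.

# Derive an output name: foo.fastq -> foo.trimmed.fastq
trimmed_name() {
    echo "${1%.fastq}.trimmed.fastq"
}

# Trim one fastq file with fastq-mcf.
#   -q 30  quality threshold below which bases are removed from read ends
#   -l 40  minimum remaining read length; shorter reads are discarded
#   -o     output file for the processed reads
trim_reads() {
    adapters="$1"
    input="$2"
    output="$3"
    fastq-mcf -q 30 -l 40 -o "$output" "$adapters" "$input"
}

# Trim every file given after the adapter file.
trim_all() {
    adapters="$1"
    shift
    for f in "$@"; do
        trim_reads "$adapters" "$f" "$(trimmed_name "$f")"
    done
}

# Example usage:
# wget ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCorpoica/adapters1.fa
# trim_all adapters1.fa breaker/*.fastq immature_fruit/*.fastq
```

Writing the trimmed reads to new files keeps the raw data untouched, so the FastQC reports before and after trimming can be compared side by side.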
Need Help??
Solutions: https://bitly.com/BioinfoInternExSol2014
