Luca Cozzuto
Sarah Bonnin
Bioinformatics Core Facility
Additional topics (parsing methods) for biologists with a focus on ChIP-seq data
ChIP-Seq experiment
By Jkwchui - Cell diagram adapted from LadyOfHats' Animal Cell diagram. Information based on Illumina data sheet, as well as ChIP and immunoprecipitation articles
& references., CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17890854
Raw data, reads in FASTQ format
@HWI-ST227:389:C4WA2ACXX:7:1204:2272:59979
GGAGGAAGGTCCTCGCTCCTCTTTCATATAAGGGAAATGGCTGAAT
+
FFFFHHHHHHJIJJJJJJJJIJJJIGIGIGGIJJIJIJJJJJJIII
@HWI-ST227:389:C4WA2ACXX:7:1205:15214:42893
GAGGATCCCAGGGAGGAAGGTCCTCGCTCCTCTTTCATCTAAGGGA
+
12BAFB?A:3<AE1@<FF;1*@EG*)?0?DBD>9BF9B*?######
@HWI-ST227:389:C4WA2ACXX:8:2208:2467:44624
AAAGAGGAGAGAGGACCATCCTCCCTGGGATCCTCAGAAGTCTACT
+
BDDA:DB?2AA@FC>F?EEGC<FED>GFD;?GBB?<?F99*/9?9?
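Each read takes four lines: a header starting with "@", the sequence, a separator line ("+"), and the quality string (one character per base).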
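The four-line layout can be read with a few lines of Python. This is a minimal sketch, not part of the original course material: the function name `read_fastq` is hypothetical, the two records are made up for illustration, and it assumes unwrapped FASTQ (exactly four lines per read).

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line, starts with "+"
        quality = handle.readline().rstrip("\n")
        # Basic sanity checks: "@" prefix and one quality character per base.
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


# Two made-up records, only for illustration.
example = io.StringIO(
    "@read1\nGATTACA\n+\nIIIIHHH\n"
    "@read2\nACGT\n+\nFFFF\n"
)
records = list(read_fastq(example))
for name, sequence, quality in records:
    print(name, len(sequence))
```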
Counting FASTQ reads (the slow way)
zcat B7_H3K4me1.fastq.gz | awk '{num++}END{print num/4}'
41103741
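The same count can be obtained in Python. A minimal sketch, assuming a gzip-compressed FASTQ file with four lines per read; `count_fastq_reads` is a hypothetical helper name, and this approach is not expected to be any faster than the awk one-liner.

```python
import gzip


def count_fastq_reads(path: str) -> int:
    """Count reads in a gzip-compressed FASTQ file (four lines per read)."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return n_lines // 4
```

For example, `count_fastq_reads("B7_H3K4me1.fastq.gz")` should return the same number as the awk command, as an integer.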
Phred quality score
• Q = -10 log10(p)
• p = probability that the corresponding base call is incorrect
• Example: p = 0.001 means a quality of 30
Quality characters (Phred+33 encoding) and the scores they represent:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
0........................26...31.........41
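The relation between quality characters, Phred scores and error probabilities can be sketched in Python. The function names are hypothetical, and the offset of 33 is an assumption that matches the character scale shown above ("!" = 0, "J" = 41).

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string into Phred scores (Phred+33 by default)."""
    return [ord(char) - offset for char in quality]


def error_probability(q: float) -> float:
    """Invert Q = -10 * log10(p) to get p, the probability of a wrong base call."""
    return 10 ** (-q / 10)


print(phred_scores("!;@J"))  # [0, 26, 31, 41]
print(error_probability(30))
```

A run of "#" characters at the end of a read, as in the second record of the FASTQ example, corresponds to Q = 2, the lowest quality commonly seen in Illumina output.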
Analyzing the quality (FastQC)
[Figure: example FastQC reports, one GOOD and one BAD]
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Alignment
• Align 20-30 million reads per sample to the reference genome.
• The reference genome can be very long (the human genome is about 3 gigabases).
• We need ultra-fast mappers:
  • Bowtie (http://bowtie-bio.sourceforge.net/index.shtml)
  • BWA (http://bio-bwa.sourceforge.net/)
  • GEM (https://github.com/smarco/gem3-mapper)
  • …
Reference genome (Fasta file)
>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
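Header: the line starting with ">" gives the sequence name and a description. The sequence itself follows; N marks an unknown base.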
Reference genome (Fasta file)
zcat GRCm38.primary_assembly.genome.fa.gz | grep ">"
>chr1 1
>chr2 2
>chr3 3
>chr4 4
>chr5 5
>chr6 6
>chr7 7
>chr8 8
>chr9 9
>chr10 10
>chr11 11
>chr12 12
>chr13 13
>chr14 14
>chr15 15
>chr16 16
>chr17 17
>chr18 18
>chr19 19
>chrX X
>chrY Y
>chrM MT
Annotations (GTF format)
Header:
#!genome-build GRCm38.p5
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.7
#!genebuild-last-updated 2017-01
Annotation record (a single line with nine tab-separated fields):
1 havana gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1";
Fields:
Reference sequence // Source // Feature (gene, transcript, exon, etc.) //
Start // End // Score // Strand // Frame (0, 1, 2) //
Attributes separated by ";"
https://www.ensembl.org/info/website/upload/gff.html
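The nine fields can be split apart in Python as follows. This is a minimal sketch with a hypothetical function name, `parse_gtf_line`; it assumes tab-separated fields and attribute values that contain no ";" characters.

```python
from typing import Any, Dict


def parse_gtf_line(line: str) -> Dict[str, Any]:
    """Split one GTF record into its nine fields and parse the attribute column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, found {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final ";"
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end),  # 1-based, end inclusive
        "score": score, "strand": strand, "frame": frame,
        "attributes": attributes,
    }


# The record from the slide, rebuilt with explicit tabs.
gtf_line = "\t".join([
    "1", "havana", "gene", "3073253", "3074322", ".", "+", ".",
    'gene_id "ENSMUSG00000102693"; gene_version "1"; '
    'gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC";',
])
record = parse_gtf_line(gtf_line)
print(record["feature"], record["attributes"]["gene_name"])
```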
Alignment
• Align 20-30 million reads per sample to the reference genome.
• The reference genome has to be indexed.
• Repetitive sequences are a problem: a read may map equally well to several locations.
• PCR artifacts are a problem: duplicate reads have to be marked.
Alignment (SAM / BAM format)
@HD VN:1.5 SO:coordinate
@SQ SN:1 LN:195471971
@SQ SN:2 LN:182113224
@SQ SN:3 LN:160039680
…
@PG ID:bowtie2 PN:bowtie2 VN:2.3.2
CL:"/usr/local/bin/bowtie2-align-s --wrapper basic-0 --non-deterministic -x
bowtie2genome -p 8 -U B7_H3K4me1.fastq.gz"
NS500454:71:H3TV7BGXY:4:22608:3293:16569 16 1 3000101 7
75M * 0 0
TTTTTTTTTTTTTTTTTTTTTTTGGTTTTGAGACTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTT
E/EEEAAEEEEEEE6E6EAEE/E6EEE//<6/E/EAEE/EE/E/EE66E6E6EEEEEEE/EAAA/E/EE/AAAAA
MD:Z:25G1G0G46 XG:i:0 NM:i:3 XM:i:3 XN:i:0 XO:i:0 AS:i:-11
XS:i:-20 YT:Z:UU PG:Z:MarkDuplicates
Header
@HD: header line // VN: format version // SO: sorting order of alignments
@SQ: reference sequence dictionary // SN: sequence name // LN: sequence length
@PG: program // ID: program record identifier // PN: program name // VN: program version // CL: command line
https://samtools.github.io/hts-specs/SAMv1.pdf
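Alignment (11 mandatory fields, followed by optional tags)
Query name // FLAG // Reference name // Leftmost mapping position (1-based) //
Mapping quality (here 7, i.e. a probability of about 0.2 that the mapping position is wrong) //
CIGAR string // Reference name of the mate read //
Position of the mate // Template length // Sequence // Quality
In this case FLAG 16 means "read being reverse complemented", i.e. the read maps to the reverse strand.
https://samtools.github.io/hts-specs/SAMv1.pdf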
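The FLAG and mapping quality fields can be decoded with a short Python sketch. The bit values follow the SAM specification; the function names and the short labels are hypothetical.

```python
from typing import List

# FLAG bits as defined in the SAM specification (labels shortened here).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "failed quality checks",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> List[str]:
    """List the properties whose bits are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def mapq_to_error(mapq: int) -> float:
    """MAPQ = -10 * log10(p), with p the probability that the position is wrong."""
    return 10 ** (-mapq / 10)


print(decode_flag(16))              # ['reverse strand']
print(round(mapq_to_error(7), 2))   # 0.2
```

A read that is both reverse-strand and marked as a duplicate would carry FLAG 1040 (16 + 1024).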
Alignment (SAM / BAM format)
https://software.broadinstitute.org/software/igv/
Quality control of the enrichment
https://deeptools.readthedocs.io/en/develop/index.html
Distribution of the signal (wiggle format)
https://deeptools.readthedocs.io/en/develop/index.html
variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5
...
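A variableStep wiggle block like the one above can be read with the following Python sketch. The function name `parse_variable_step` is hypothetical, only the variableStep flavour is handled, and positions are kept 1-based as in the wiggle format.

```python
from typing import Iterable, Iterator, Tuple


def parse_variable_step(lines: Iterable[str]) -> Iterator[Tuple[str, int, int, float]]:
    """Yield (chrom, start, span, value) from variableStep wiggle lines."""
    chrom = None
    span = 1  # default span when the declaration line gives none
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("variableStep"):
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        if line.startswith("fixedStep"):
            raise ValueError("fixedStep blocks are not handled in this sketch")
        if chrom is None:
            raise ValueError("data line found before a variableStep declaration")
        position, value = line.split()
        yield chrom, int(position), span, float(value)


wiggle_lines = [
    "variableStep chrom=chr2",
    "300701 12.5",
    "300702 12.5",
    "300703 12.5",
]
signal = list(parse_variable_step(wiggle_lines))
print(signal[0])
```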
Peak calling
https://software.broadinstitute.org/software/igv/
Peak calling
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137.
The fragment size can be inferred from the data and used to extend the reads, which gives more reliable peaks (i.e. binding sites). The peak lies in the middle, between the forward-strand and reverse-strand read pileups.
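The idea of extending a read to the fragment size can be sketched as below. This only illustrates the principle and is not MACS's actual implementation; `extend_read` is a hypothetical helper, coordinates are assumed to be 0-based and end-exclusive, and the fragment size of 200 is an arbitrary example value.

```python
from typing import Tuple


def extend_read(start: int, end: int, strand: str, fragment_size: int) -> Tuple[int, int]:
    """Extend a read towards its 3' end so that it spans the whole fragment."""
    if strand == "+":
        # Forward read: the fragment continues to the right of the read start.
        return start, start + fragment_size
    if strand == "-":
        # Reverse read: the fragment continues to the left of the read end.
        return max(0, end - fragment_size), end
    raise ValueError(f"unknown strand: {strand!r}")


# The SAM example read: POS 3000101 (1-based), CIGAR 75M, FLAG 16 (reverse strand).
print(extend_read(3000100, 3000175, "-", 200))  # (2999975, 3000175)
```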
Peak coordinates (BED format)
https://genome.ucsc.edu/FAQ/FAQformat.html#format1
Chromosome // Start // End (3-field BED)
+ Name // Score // Strand (6-field BED)
+ thickStart // thickEnd // itemRgb
+ blockCount // blockSizes // blockStarts (12-field BED)
track name=chipseq description="IP of Ring1B TF"
1 3444977 3445551 peak_1 31 .
1 4773116 4774454 peak_2 114 .
1 4774530 4777431 peak_3 108 .
1 4786374 4786850 peak_4 80 .
1 4806806 4807288 peak_5 66 .
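A 6-field BED file such as this one can be parsed with a short Python sketch. The names `Peak` and `parse_bed6` are hypothetical. BED start coordinates are 0-based and the end is exclusive, so the peak length is simply end minus start.

```python
from typing import Iterable, Iterator, NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int  # 0-based
    end: int    # exclusive
    name: str
    score: int
    strand: str


def parse_bed6(lines: Iterable[str]) -> Iterator[Peak]:
    """Yield Peak records from 6-field BED lines, skipping track and comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, name, score, strand = line.split()[:6]
        yield Peak(chrom, int(start), int(end), name, int(score), strand)


bed_lines = [
    'track name=chipseq description="IP of Ring1B TF"',
    "1 3444977 3445551 peak_1 31 .",
    "1 4773116 4774454 peak_2 114 .",
]
peaks = list(parse_bed6(bed_lines))
for peak in peaks:
    print(peak.name, peak.end - peak.start)
```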
bigBed and bigWig format
https://genome.ucsc.edu/goldenpath/help/bigWig.html
https://genome.ucsc.edu/goldenpath/help/bigBed.html
Indexed binary formats generated from BED and wiggle files, respectively.
Annotating peaks
https://bedtools.readthedocs.io/en/latest/
Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinformatics. 2014 Sep 8;47:11.12.1-34
Crossing information from GTF files and BED files (BEDTools)
intersectBed -a Peaks/B7_H3K4me1_vs_B7_input-macs-narrow--q_0_peaks.bed \
  -b gencode.vM17.annotation.gtf \
  -wa -wb -nonamecheck | \
  awk '{if ($9 == "gene") print }'
Example output (a single line: the 6 BED columns are followed by the 9 GTF columns, so the GTF feature type ends up in column 9):
chr1 3444977 3445551 peak_15 31 . chr1 HAVANA gene 3205901 3671498 . - . gene_id "ENSMUSG00000051951.5"; gene_type "protein_coding"; gene_name "Xkr4"; level 2; havana_gene "OTTMUSG00000026353.2";
Finding the gene closest to each peak (closestBed; -d reports the distance in an extra column)
awk '{if ($3 == "gene") print }' gencode.vM17.annotation.gtf | \
  closestBed -a Peaks/B7_H3K4me1_vs_B7_input-macs-narrow--q_0_peaks.bed \
  -d -b -
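The logic behind overlapping and closest features can be sketched in plain Python. This is a naive illustration and not how BEDTools is implemented; the function names are hypothetical, all features are assumed to lie on the same chromosome, and the distance convention used here (number of bases between the two intervals) may differ from the one BEDTools reports, for example for directly adjacent features. GTF coordinates (1-based, end inclusive) are converted to the BED convention (0-based, end exclusive) by subtracting 1 from the start.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (name, start, end), 0-based, end exclusive


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if the two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of bases between two intervals; 0 if they overlap or touch."""
    return max(0, b_start - a_end, a_start - b_end)


def closest(peak: Interval, genes: List[Interval]) -> Tuple[str, int]:
    """Return (gene name, distance) for the gene nearest to the peak."""
    _, p_start, p_end = peak
    name, g_start, g_end = min(genes, key=lambda g: gap(p_start, p_end, g[1], g[2]))
    return name, gap(p_start, p_end, g_start, g_end)


peak = ("peak_15", 3444977, 3445551)
genes = [
    ("4933401J01Rik", 3073253 - 1, 3074322),  # GTF start converted to 0-based
    ("Xkr4", 3205901 - 1, 3671498),
]
print(overlaps(peak[1], peak[2], genes[1][1], genes[1][2]))  # True
print(closest(peak, genes))  # ('Xkr4', 0)
```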
