Next Generation Sequencing Nantes, January 14, 2011 Pierre Lindenbaum PhD [email_address] Twitter: @yokofakun Institut du Thorax - INSERM UMR915 About me
This presentation will be posted on
Thank you Biostar (Istvan Albert, Jeremy Leipzig...)
“Next” Generation ? Note: I don't remember where I found this awesome idea of mixing Star-Trek and NGS. 1977
3 Main Technologies SOLiD
Credit: Illumina
The development and impact of 454 sequencing  Jonathan M Rothberg & John H Leamon Nature Biotechnology 26, 1117 - 1124 (2008) Published online: 9 October 2008 doi:10.1038/nbt1485
Genome Biol. 2009; 10(3): R32. Published online 2009 March 27. doi: 10.1186/gb-2009-10-3-r32. Evaluation of next generation sequencing platforms for population targeted sequencing studies
Published online 20 November 2008 | Nature | doi:10.1038/news.2008.1245  Human genomes in minutes? Not yet, but biotechnology company is on track for 2013.
Sequencing technologies — the next generation  Michael L. Metzker Nature Reviews Genetics 11, 31-46 (January 2010) doi:10.1038/nrg2626
Genome Biol. 2010;11(5):207. Epub 2010 May 5.  The case for cloud computing in genome informatics.
Solexa/Illumina Read Format
The syntax of the Solexa/Illumina read format is almost identical to the FASTQ format, but the qualities are scaled differently. Given a character $sq, the following Perl code gives the Phred quality $Q:
$Q = 10 * log(1 + 10 ** ((ord($sq) - 64) / 10.0)) / log(10);
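The Perl expression on this slide can be sketched in Python as follows. This is a minimal illustration, assuming the same offset of 64 given on the slide; the function name `solexa_char_to_phred` is hypothetical.

```python
import math


def solexa_char_to_phred(sq: str) -> float:
    """Convert one Solexa/Illumina quality character to a Phred quality.

    Assumes the offset of 64 used on the slide.
    """
    solexa_quality = ord(sq) - 64
    # Same formula as the Perl one-liner: 10 * log10(1 + 10^(Qsolexa / 10))
    return 10 * math.log10(1 + 10 ** (solexa_quality / 10.0))


if __name__ == "__main__":
    for ch in "@Jh":
        print(ch, round(solexa_char_to_phred(ch), 2))
```

For high qualities the two scales are nearly identical; they diverge only for low-quality bases.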
Mapping the short reads on a reference genome
“ Running these accurate alignment algorithms as a full search of all possible places where the sequence may map is computationally infeasible.” Sense from sequence reads: methods for alignment and assembly  Paul Flicek & Ewan Birney  Nature Methods 6, S6 - S12 (2009) Published online: 15 October 2009 Corrected online: 6 May 2010  doi:10.1038/nmeth.1376
HashTable Sense from sequence reads: methods for alignment and assembly  Paul Flicek & Ewan Birney  Nature Methods 6, S6 - S12 (2009)  doi:10.1038/nmeth.1376
SOAP1 BFAST MOSAIK Hash Reads MAQ Illumina's ELAND Hash Reference
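To make the hash-table idea concrete, here is a toy sketch of the "hash the reference" strategy: every k-mer of the reference is stored with its positions, and the first k-mer of a read is looked up as a seed. It is not how any of the aligners listed above is actually implemented; the function names and the tiny sequences are assumptions for illustration only.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the list of its 0-based positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_hits(index: dict, read: str, k: int) -> list:
    """Return candidate reference positions for the first k-mer of the read."""
    return index.get(read[:k], [])


if __name__ == "__main__":
    idx = build_kmer_index("ACGTACGTTAGC", 4)
    print(seed_hits(idx, "ACGTTA", 4))  # candidate positions to verify by full alignment
```

Each candidate position would then be checked with a slower, more accurate alignment, which is the point made in the Flicek & Birney quote.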
Burrows-Wheeler Sense from sequence reads: methods for alignment and assembly  Paul Flicek & Ewan Birney  Nature Methods 6, S6 - S12 (2009)  doi:10.1038/nmeth.1376
SOAP2 Bowtie BWA
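The Burrows-Wheeler transform itself can be sketched in a few lines. This naive version sorts all rotations of the text, which is fine for a toy string but far too slow and memory-hungry for a genome; the aligners listed above rely on much more efficient index construction. The function name `bwt` and the `$` end marker are choices made for this example.

```python
def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    text = text + "$"  # end marker, sorts before letters in ASCII
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(bwt("banana"))  # annb$aa
```

Identical characters tend to cluster in the output, which is what makes the transformed reference compact and searchable.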
De Bruijn graphs Velvet: Algorithms for de novo short read assembly using de Bruijn graphs doi: 10.1101/gr.074492.107 Genome Res. 2008. 18: 821-829
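A de Bruijn graph can be illustrated with a very small sketch: each k-mer found in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is only the textbook construction, not Velvet's actual data structure; the function name and the example read are assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads: list, k: int) -> dict:
    """Build a de Bruijn graph: (k-1)-mer prefix -> list of (k-1)-mer suffixes."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    print(de_bruijn_graph(["ACGT"], 3))  # {'AC': ['CG'], 'CG': ['GT']}
```

An assembler then looks for paths through this graph to reconstruct contigs.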
Sense from sequence reads: methods for alignment and assembly  Paul Flicek & Ewan Birney  Nature Methods 6, S6 - S12 (2009)  doi:10.1038/nmeth.1376
CNV detection Genome Res. 2009 Sep;19(9):1586-92. Epub 2009 Aug 5. Sensitive and accurate detection of copy number variants using read depth of coverage.
RNA-SEQ gene regulation protein information
Exome Sequencing
SAM A generic nucleotide alignment format Bioinformatics. 2009 Aug 15;25(16):2078-9. Epub 2009 Jun 8. The Sequence Alignment/Map format and SAMtools.
human-readable, scriptable
Field 1: Query name
Field 2: Flag
Field 3: Reference sequence name
Field 4: 1-based leftmost coordinate of the clipped sequence
Field 5: Mapping quality
Field 6: CIGAR string
Field 7: Mate reference sequence name
Field 8: 1-based leftmost coordinate of the mate's clipped sequence
Field 9: Insert size (5' to 5')
Field 10: Query sequence
Field 11: Sequence qualities
1 name: SRR018111.1786
2 flag: 83 (read paired / mapped in a proper pair / reverse strand / first in pair)
3 refseq: chr22
4 position: 31232437
5 qual: 17
6 cigar: 76M
7 mate refseq: = (same reference as the read)
8 mate clipped pos: 31232403
9 insert size: -110
10 GGCCCTTAAAATCACAAACTATGCTCAACTCACTCTCTACAGCTCTCATAATTTCCAAAATCTATTTTCTT
11 41===@B=AA??B?B@A?BAAAABBBA@B@C<B>B@BBACBBBBBBCBBCABABBCCCBBBBCBABBBCBB
12 XT:A:U
13 NM:i:4
14 SM:i:17
15 AM:i:17
16 X0:i:1
17 X1:i:0
18 XM:i:4
19 XO:i:0
20 XG:i:0
21 MD:Z:6A34T0T8C24
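Because SAM is tab-delimited text, a record like the one above can be split with a few lines of code. The sketch below is only an illustration: the field names in `SAM_FIELDS` are labels chosen for this example, and the sequence and quality strings in the usage line are shortened, so they do not match the 76M CIGAR.

```python
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "mrnm", "mpos", "isize", "seq", "qual"]
INT_FIELDS = {"flag", "pos", "mapq", "mpos", "isize"}


def parse_sam_line(line: str) -> dict:
    """Split one SAM alignment line into its 11 mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_FIELDS, columns[:11]):
        record[name] = int(value) if name in INT_FIELDS else value
    record["tags"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record


if __name__ == "__main__":
    # Values taken from the slide; seq and qual are truncated for readability.
    line = "SRR018111.1786\t83\tchr22\t31232437\t17\t76M\t=\t31232403\t-110\tGGCC\t41==\tXT:A:U\tNM:i:4"
    print(parse_sam_line(line))
```

For real work, SAMtools or the Picard library shown later in the deck is the safer choice.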
Text vs. binary format
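// Picard SAM classes (net.sf.samtools package)
import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMFileWriterFactory;
import net.sf.samtools.SAMRecord;

// inputSamOrBamFile and outputSamOrBamFile are java.io.File objects defined elsewhere
SAMFileReader inputSam = new SAMFileReader(inputSamOrBamFile);
SAMFileWriter outputSam = new SAMFileWriterFactory().makeSAMOrBAMWriter(inputSam.getFileHeader(), true, outputSamOrBamFile);
// Copy every record, upper-casing the read name
for (SAMRecord samRecord : inputSam) {
    samRecord.setReadName(samRecord.getReadName().toUpperCase());
    outputSam.addAlignment(samRecord);
}
outputSam.close();
inputSam.close();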
compact, indexed alignments
Is flexible enough to store all the alignment information generated by various alignment programs.
Is simple enough to be easily generated by alignment programs or converted from existing alignment formats.
Is compact in file size.
Allows most operations on the alignment to work on a stream without loading the whole alignment into memory.
Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
CIGAR: Compact Idiosyncratic Gapped Alignment Report format
'M' shows an alignment match (the base itself may be a match or a mismatch)
'I' shows an insertion
'D' shows a deletion
'H' hard clipping
'S' soft clipping
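A CIGAR string is a series of length/operation pairs, so it can be parsed with a regular expression. The sketch below also accepts the 'N' (skipped region) and 'P' (padding) operations from the SAM specification, which are not listed on the slide; the function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered: only M, D and N consume the reference."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN")


if __name__ == "__main__":
    print(parse_cigar("5S30M2I39M"))
    print(reference_span("76M"))
```

Insertions and clipped bases are present in the read but do not advance the position on the reference, which is why they are left out of `reference_span`.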
SAM Flags
0x0001  the read is paired in sequencing, no matter whether it is mapped in a pair
0x0002  the read is mapped in a proper pair
0x0004  the query sequence itself is unmapped
0x0008  the mate is unmapped
0x0010  strand of the query (0 for forward; 1 for reverse strand)
0x0020  strand of the mate
0x0040  the read is the first read in a pair
0x0080  the read is the second read in a pair
0x0100  the alignment is not primary (a read having split hits may have multiple primary alignment records)
0x0200  the read fails platform/vendor quality checks
0x0400  the read is either a PCR duplicate or an optical duplicate
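The flag is a bit field, so decoding it is a matter of testing each bit. A minimal sketch, using the bit values from this slide and short labels that are made up for the example:

```python
SAM_FLAGS = [
    (0x0001, "paired"),
    (0x0002, "proper_pair"),
    (0x0004, "unmapped"),
    (0x0008, "mate_unmapped"),
    (0x0010, "reverse_strand"),
    (0x0020, "mate_reverse_strand"),
    (0x0040, "first_in_pair"),
    (0x0080, "second_in_pair"),
    (0x0100, "not_primary"),
    (0x0200, "fails_qc"),
    (0x0400, "duplicate"),
]


def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in a SAM flag."""
    return [label for bit, label in SAM_FLAGS if flag & bit]


if __name__ == "__main__":
    # 83 = 0x40 + 0x10 + 0x02 + 0x01, the flag of the example record
    print(decode_flag(83))
```

For flag 83 this gives paired, proper pair, reverse strand and first in pair, which matches the annotation on the example record.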
Pileup
Chrom Position Ref Coverage Read bases Qualities
seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
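The "read bases" column packs several things into one string: '.' and ',' are matches to the reference on the forward and reverse strand, letters are mismatches, '^' followed by one character marks the start of a read, '$' marks the end of a read, and '+N'/'-N' introduce indels. A hedged sketch of a counter for that column, assuming well-formed input (the function name is hypothetical):

```python
def count_pileup_bases(read_bases: str) -> dict:
    """Count reference matches and mismatching bases in a pileup read-bases string."""
    counts = {"ref": 0, "A": 0, "C": 0, "G": 0, "T": 0, "N": 0}
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == "^":
            # Read start: the next character encodes the mapping quality, skip both
            i += 2
            continue
        if c in "+-":
            # Indel: skip the length digits and then that many indel bases
            j = i + 1
            while j < len(read_bases) and read_bases[j].isdigit():
                j += 1
            i = j + int(read_bases[i + 1:j])
            continue
        if c in ".,":
            counts["ref"] += 1
        elif c.upper() in counts:
            counts[c.upper()] += 1
        # '$' (read end) and '*' (deleted base) are not counted
        i += 1
    return counts


if __name__ == "__main__":
    # Read bases of position 279 on the slide (coverage 23)
    print(count_pileup_bases("A..T,,.,.,...,,,.,....."))
```

At position 279 the counts add up to the coverage of 23: 21 reference matches, one A and one T.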
Genome  (re)sequencing (why ?)
Map to known sequence
Exome Sequencing: 30,508,378 reads * 55 bp = 1,677,960,790 bp VCF format
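The base count on the exome slide is the number of reads times the read length. The sketch below reproduces that arithmetic and adds a hypothetical `mean_depth` helper; the target size it takes is a parameter that the slide does not give, so no depth is computed for the real data here.

```python
def total_bases(read_count: int, read_length: int) -> int:
    """Total number of sequenced bases."""
    return read_count * read_length


def mean_depth(read_count: int, read_length: int, target_size: int) -> float:
    """Average depth if all bases fell uniformly on a target of the given size."""
    return total_bases(read_count, read_length) / target_size


if __name__ == "__main__":
    print(total_bases(30_508_378, 55))  # 1677960790
```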
Visualizing the alignments
Samtools: TVIEW
Download FASTA sequence for chr22 (hg18)
curl --proxy ${PROXY} "..." | gunzip -c > chr22.fa
What's the length of chr22 ?
Index chr22 with samtools
${sam.bin} faidx chr22.fa
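chr22 49691432 7 50 51
(columns: sequence name, length in bases, byte offset of the first base, bases per line, bytes per line)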
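The .fai line answers the question about the length of chr22 (second column) and also explains how samtools can jump directly to any base of the FASTA file. A small sketch, with hypothetical function names, that parses the line and computes the byte offset of a 0-based position:

```python
def parse_fai_line(line: str) -> dict:
    """Parse one line of a samtools .fai index."""
    name, length, offset, line_bases, line_width = line.split()
    return {
        "name": name,
        "length": int(length),          # sequence length in bases
        "offset": int(offset),          # byte offset of the first base
        "line_bases": int(line_bases),  # bases per FASTA line
        "line_width": int(line_width),  # bytes per FASTA line, newline included
    }


def byte_offset(entry: dict, pos0: int) -> int:
    """Byte offset in the FASTA file of the base at 0-based position pos0."""
    full_lines, remainder = divmod(pos0, entry["line_bases"])
    return entry["offset"] + full_lines * entry["line_width"] + remainder


if __name__ == "__main__":
    chr22 = parse_fai_line("chr22 49691432 7 50 51")
    print(chr22["length"])         # 49691432
    print(byte_offset(chr22, 50))  # 58: first base of the second line
```

The offset of 7 corresponds to the header line `>chr22` plus its newline.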
Get some FastQ files (simulation via samtools)
${sam.dir}/misc/wgsim chr22.fa reads_1.fastq reads_2.fastq > _rand.txt
Index chr22 for BWA
${bwa.bin} index -p chr22db -a bwtsw chr22.fa
5, 4, 3, 2, 1... Align!
${bwa.bin} aln chr22db reads_1.fastq > aln1.sai
${bwa.bin} aln chr22db reads_2.fastq > aln2.sai
Generate alignments in the SAM format given paired-end reads
${bwa.bin} sampe chr22db aln1.sai aln2.sai reads_1.fastq reads_2.fastq > aln.sam
Convert SAM to BAM
${sam.bin} view -b -T chr22.fa  aln.sam > aln.bam
Sort the alignments by position
${sam.bin} sort  aln.bam sorted1
Remove the PCR duplicates
${sam.bin}  rmdup sorted1.bam sorted2.bam
Index the alignment
${sam.bin} index  sorted2.bam
What's the coverage/depth ?
java -jar ${gatk.jar} -T DepthOfCoverage -o file.depth -R chr22.fa -I sorted2.bam
GATK: recalibration
GATK: local realignment
java -jar ${gatk.jar} -T RealignerTargetCreator -R chr22.fa -o outputs.intervals -I sorted2.bam
java -jar ${gatk.jar} -T IndelRealigner -I sorted2.bam -targetIntervals outputs.intervals -o $@ -R chr22.fa ....
Generate a pileup
${sam.bin} pileup -v -c -f chr22.fa realigned.bam > pileup.txt
Filter the pileup
${sam.dir}/misc/ varFilter -d 5 pileup.txt > pileup.filtered.txt
Create a VCF
${sam.dir}/misc/ -r chr22.fa < pileup.filtered.txt > pileup.vcf
View the alignment with tview
$1 Coordinates : 4,99981527,1,G/A
$2 Codons : -
$3 Transcript ID :
$4 Protein ID :
$5 Substitution : NA
$6 Region : NON-GENIC
$7 dbSNP ID : NA
$8 SNP Type : NA
$9 Prediction : Not scored
$10 Score : NA
$11 Median Info : NA
$12 # Seqs at position : NA
$13 Gene ID : !N/A
$14 Gene Name : !N/A
$15 Gene Desc : !N/A
$16 Protein Family ID : !N/A
$17 Protein Family Desc : !N/A
$18 Transcript Status : !N/A
$19 Protein Family Size : !N/A
$20 OMIM Disease : !N/A
$21 Average Allele Freqs : !N/A
$22 CEU Allele Freqs : !N/A
$23 User Comment : !N/A
$1 #o_snp_id : chr19:1779391.TC.uc010dsr.1
$2 snp_id : chr19:1779391.TC.uc010dsr.1
$3 acc : Q05DB0
$4 pos : 87
$5 aa1 : N
$6 aa2 : D
$7 prediction : benign
$8 pph2_prob : 0.001
$9 pph2_FPR : 0.86
$10 pph2_TPR : 0.994
$11 Comments : !N/A
Give Galaxy a try Galaxy: A platform for interactive large-scale genome analysis: Genome Res. 2005. 15: 1451-1455
Use UCSC Table Browser to find the SNPs
Use UCSC mysql server to find the SNPs, the genes,...
Create a UCSC Custom Track
Wig example
browser position chr19:59304200-59310700
browser hide all
track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
59306301 15.0
59306691 12.5
59307871 10.0
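The data part of a variableStep wiggle track is simple enough to generate from a list of (start, value) pairs, for example per-position depth. A minimal sketch, with a hypothetical function name, that emits only the declaration line and the data lines (the `browser` and `track` lines are left out):

```python
def variable_step_wig(chrom: str, span: int, points: list) -> str:
    """Format (start, value) pairs as a variableStep wiggle data block."""
    lines = [f"variableStep chrom={chrom} span={span}"]
    lines += [f"{start} {value}" for start, value in points]
    return "\n".join(lines)


if __name__ == "__main__":
    # First two data points of the slide's example
    print(variable_step_wig("chr19", 150, [(59304701, 10.0), (59304901, 12.5)]))
```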
Create a Ruby on Rails (RoR) database from the VCF file
mkdir -p RAILS
rails RAILS/rails4pileup
awk -F ' ' 'BEGIN {printf(" create table vcfs(id integer primary key,chrom varchar(50), position int, ref varchar(2), alt varchar(50),depth int);");} {printf("insert into vcfs(chrom,position,ref,alt,depth) values(\"%s\",%s,\"%s\",\"%s\",%s);",$$1,$$2,$$3,$$4,$$5);}' pileup.filtered.txt | sqlite3 RAILS/rails4pileup/db/vcf.sqlite3
ruby RAILS/rails4pileup/script/generate scaffold vcf chrom:string position:int ref:string alt:string depth:int
cat RAILS/rails4pileup/config/database.yml | sed 's/testdevelopmentproductionsqlite3/vcf.sqlite3/' > /tmp/tmp.yml
mv /tmp/tmp.yml RAILS/rails4pileup/config/database.yml
echo "http://localhost:3000/vcfs"
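The awk/sqlite3 step of this recipe can also be sketched with Python's standard `sqlite3` module. This is an alternative illustration, not the author's method: it creates the same `vcfs` table and inserts the first five columns of each row, using bound parameters instead of string formatting. The function name and the sample row are made up.

```python
import sqlite3


def load_rows(rows: list, db_path: str = ":memory:") -> sqlite3.Connection:
    """Create the vcfs table and insert the first five columns of each row."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "create table vcfs(id integer primary key, chrom varchar(50), "
        "position int, ref varchar(2), alt varchar(50), depth int)"
    )
    conn.executemany(
        "insert into vcfs(chrom, position, ref, alt, depth) values (?, ?, ?, ?, ?)",
        [(r[0], int(r[1]), r[2], r[3], int(r[4])) for r in rows],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    # Hypothetical row, already split on whitespace
    conn = load_rows(["chr22 100 A G 12".split()])
    print(conn.execute("select chrom, position, ref, alt, depth from vcfs").fetchall())
```

Bound parameters avoid the quoting problems that appear when SQL statements are assembled with `printf`.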
The end.

