CBGP, March 2011



High-throughput transcriptomics for molecular evolution
and population genetics


                      Nicolas Galtier



 UMR 5554 - Institut des Sciences de l'Evolution - Montpellier

                   galtier@univ-montp2.fr
Molecular evolution in the 21st century


We have:

 - an enormous amount of data (genomics)
 - a robust theoretical framework (population genetics)

     ⇒ we should understand molecular variation patterns



Yet we do not really know:

- why some species evolve (much) faster than others, proteome-wise
- why GC-content varies within and between genomes

- by how much population size determines genetic diversity

- etc…
Molecular evolution in the 21st century

 Why so many unsolved, basic questions?

   - lacking theory
   - biased sampling

   [Figure: sampling across genes and species]
PopPhyl goals


   Injecting species biology/ecology into comparative genomics

         Exploring the molecular diversity of nonmodel taxa

 Testing predictions of the population genetic theory genome-wide




  life history traits      population genetic       genomic
                           parameters               variation data

  body mass                mutation rate            within-species
  generation time          population size          between species
  abundance                selection
  mating system            recombination

 Some specific questions we want to address:

- Why are fast-evolving taxa fast? (mutation, selection)
- Are abundant species more polymorphic than scarce ones?
- Is selection less efficient in selfers than outcrossers?
- How does longevity influence mito vs nuclear DNA evolution?
- Who optimizes codon usage, who does gBGC, and why?
- Is the rate of selective sweeps higher in large populations?
How?

- Target = transcriptome
    coding sequences
    expression data

- Sampling scheme (X 30):
    focal species (10 individuals)
    outgroups (1 or 2 individuals)

- Next-Generation Sequencing technology

              For each taxon:
               5×10^5 400 bp reads (454, pooled individuals)
               5×10^7 100 bp reads (Illumina, tagged individuals)
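
To make the sequencing effort concrete, the per-taxon read counts above can be turned into total bases. This is only back-of-the-envelope arithmetic on the figures quoted on the slide; the variable and function names are illustrative.

```python
# Per-taxon sequencing effort quoted on the slide.
READS_454, LEN_454 = 5 * 10**5, 400            # 454: 5x10^5 reads of 400 bp
READS_ILLUMINA, LEN_ILLUMINA = 5 * 10**7, 100  # Illumina: 5x10^7 reads of 100 bp


def total_bases(n_reads: int, read_length: int) -> int:
    """Total sequenced bases, ignoring quality trimming and adapters."""
    return n_reads * read_length


bases_454 = total_bases(READS_454, LEN_454)
bases_illumina = total_bases(READS_ILLUMINA, LEN_ILLUMINA)

print(f"454: {bases_454 / 1e6:.0f} Mb per taxon")
print(f"Illumina: {bases_illumina / 1e9:.0f} Gb per taxon")
print(f"Illumina / 454 ratio: {bases_illumina // bases_454}x")
```

Under these figures the Illumina run contributes 25 times more raw sequence than the 454 run, while the 454 reads are four times longer.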
Species sampling

Sponges
Demosponges
Cnidarians
Ctenophores
Rotifers
Acanthocephalans
Entoprocts
Nemerteans
Flatworms
Annelids
Molluscs
Ectoprocts
Brachiopods
Chaetognaths
Tardigrades
Onychophorans
Arthropods
Loriciferans
Kinorhynchs
Priapulids
Nematodes
Hemichordates
Echinoderms
Cephalochordates
Urochordates
Vertebrates
Why are tunicates fast-evolving, proteome-wise?




[Figure: tree with tips labelled E, C, V and T]

- higher mutation rate?
- more prevalent adaptive evolution?
- relaxed selective constraint on housekeeping genes?
Data analysis pipeline

- assembling: 454 and Solexa reads → reference transcriptome
- coding annot. of the reference transcriptome
- mapping: transcriptome reads onto the reference transcriptome
- SNP calling → SNPs and genotypes
- πN, πS, dN, dS, allele frequencies
Assembling transcriptomes from NGS data:
         a benchmark in Ciona

   454 and Solexa reads → assembling → reference transcriptome
[Figure: assembly strategies compared; c, s and c+s label the outputs of each step]

  A: 454 reads, Celera
  B: 454 reads, Mira
  C: 454 reads, Cap3
  D: Illumina reads, Abyss then Cap3
  E: 454 reads + Illumina reads, merge reads (Abyss, Cap3)
  F: 454 reads + Illumina reads, merge contigs (Abyss, Cap3); refine gives F'
de novo transcriptome assembly: quantitative assessment

    data set     method          contigs   mean     median   N50   assembly      touched
                                           length   length         length (Mb)   genes

A   Ciona_454    Celera          25,669    491      438      491   12.6          7616
B   Ciona_454    Mira            33,196    635      526      650   21.1          7951
C   Ciona_454    Cap3            24,515    671      540      713   16.5          7945
D   Ciona_illu   Abyss+Cap3      27,426    574      380      769   15.8          7704
E   Ciona_mix    merge reads     29,097    571      399      721   16.6          7982
F   Ciona_mix    merge contigs   27,956    726      529      891   20.3          8207
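
For readers less familiar with the N50 column, here is a minimal sketch of how the statistic is usually computed from a list of contig lengths. The toy lengths are invented for illustration and are not PopPhyl data.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy contig lengths (hypothetical).
toy_lengths = [200, 300, 400, 500, 900, 1200]
print(n50(toy_lengths))
```

Unlike the mean or the median, the N50 is weighted by sequence length, which is why it can rank assemblies differently from the other two columns.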
[Figure: two histograms comparing 454 contigs, Illumina contigs and Mix contigs; bins from 200 to 1280 and from 1000 to 2000]
Assembling transcriptomes from NGS data:
  a benchmark using Ciona intestinalis

Predicted contigs are compared with the reference transcriptome by BLAST.
Categories (contigs → reference transcripts):

    no hit
    1→1 : full, or partial
    m→1 : fragments, or alleles
    1→n : chimera, or multi
    m→n : full or partial, multi
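
One way to assign contigs to the five categories above is to count, for each contig, how many reference transcripts it hits and whether any of those transcripts is also hit by another contig. The sketch below assumes the BLAST results have already been reduced to (contig, reference) pairs; that input format, the function name and the exact decision rule are assumptions, not the procedure used in the talk.

```python
from collections import defaultdict


def classify_contigs(contigs, hits):
    """contigs: iterable of contig ids.
    hits: iterable of (contig, reference) pairs from significant BLAST hits."""
    refs_of = defaultdict(set)     # contig -> reference transcripts it hits
    contigs_of = defaultdict(set)  # reference transcript -> contigs hitting it
    for contig, ref in hits:
        refs_of[contig].add(ref)
        contigs_of[ref].add(contig)

    categories = {}
    for contig in contigs:
        refs = refs_of.get(contig, set())
        if not refs:
            categories[contig] = "no hit"
            continue
        # True if at least one of the hit references is shared with another contig.
        shared = any(len(contigs_of[r]) > 1 for r in refs)
        if len(refs) == 1:
            categories[contig] = "m→1" if shared else "1→1"
        else:
            categories[contig] = "m→n" if shared else "1→n"
    return categories


# Hypothetical hits, for illustration only.
toy_contigs = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
toy_hits = [("c1", "t1"), ("c2", "t2"), ("c3", "t2"),
            ("c4", "t3"), ("c4", "t4"),
            ("c5", "t5"), ("c5", "t6"), ("c6", "t5"), ("c6", "t6")]
result = classify_contigs(toy_contigs, toy_hits)
print(result)
```

Distinguishing the sub-cases (full vs partial, fragments vs alleles, chimera vs multi) would need alignment coordinates, which this sketch leaves out.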
de novo transcriptome assembly: qualitative assessment
Average contig length varies between categories
Improving assemblies by filtering according to length + coverage



[Figure: percentage of correct contigs (60% to 80%) against number of contigs (4000 to 12000)]
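
A length + coverage filter of the kind named in the slide title could look like the sketch below. The `Contig` record, the thresholds and the rule of requiring both criteria at once are assumptions made for illustration; the talk does not give the cut-offs used.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int      # contig length in bp
    coverage: float  # mean read depth over the contig


def filter_contigs(contigs, min_length, min_coverage):
    """Keep only contigs that pass both the length and the coverage threshold."""
    return [c for c in contigs
            if c.length >= min_length and c.coverage >= min_coverage]


# Hypothetical contigs and thresholds.
toy = [Contig("c1", 250, 3.0), Contig("c2", 900, 12.5),
       Contig("c3", 1500, 2.0), Contig("c4", 700, 30.0)]
kept = filter_contigs(toy, min_length=500, min_coverage=10.0)
print([c.name for c in kept])
```

Stricter thresholds keep fewer contigs, which is the trade-off the figure plots against the percentage of correct predictions.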
de novo transcriptome assembly from NGS data: conclusions




    - Illumina > 454
     (454 still useful)

    - existing programs differ substantially in performance
     (in PopPhyl we retain Cap3 and Abyss)

    - correct cDNA predictions are a minority in typical assemblies

    - contig length + coverage is a reasonable quality criterion

    - results vary somewhat across species
Calling SNPs and genotypes from transcriptome reads




>contig1
pos      ind1          ind2        ind3
1        5/0/9/0       0/0/8/0     10/0/0/0
2        0/4/0/0       0/7/0/0     0/17/0/0
3        1/0/0/17      0/0/0/6     0/0/0/22
…
>contig2
pos      ind1          ind2        ind3
1        0/0/0/4       0/0/0/8     0/2/0/11
2        34/1/13/0     52/0/45/0   4/0/8/0
…


                     reads (counts of A/C/G/T per individual)
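
The read-count layout shown above is easy to load into a nested dictionary. The parser below treats the slide layout as if it were a text file format, which is an assumption; the function name and the returned structure are illustrative.

```python
def parse_read_counts(text):
    """Parse '>contig' blocks into {contig: {pos: {individual: (A, C, G, T)}}}."""
    table, contig, individuals = {}, None, []
    for raw in text.splitlines():
        fields = raw.split()
        if not fields or fields[0] == "…":
            continue  # skip blank lines and elision marks
        if fields[0].startswith(">"):
            contig = fields[0][1:]
            table[contig] = {}
        elif fields[0] == "pos":
            individuals = fields[1:]  # header row names the individuals
        else:
            counts = [tuple(int(x) for x in f.split("/")) for f in fields[1:]]
            table[contig][int(fields[0])] = dict(zip(individuals, counts))
    return table


example = """>contig1
pos      ind1          ind2        ind3
1        5/0/9/0       0/0/8/0     10/0/0/0
2        0/4/0/0       0/7/0/0     0/17/0/0
"""
counts = parse_read_counts(example)
print(counts["contig1"][1]["ind1"])
```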
Calling SNPs and genotypes from transcriptome reads




>contig1
pos      ind1              ind2          ind3
1        5/0/9/0    AG     0/0/8/0 GG    6/0/0/0 AA
2        0/4/0/0    CC     0/7/0/0 CC    0/17/0/0 CC
3        1/0/0/17   TT     0/0/0/6 TT    0/0/0/5 TT
…
>contig2
pos      ind1              ind2          ind3
1        0/0/0/1    TT     0/0/0/8 TT    0/2/0/11 CT(90%)
2        14/1/9/0   AG     8/0/15/0 AG   12/0/0/0 AA
…


                         genotypes
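
Once genotypes are called, allele frequencies and observed heterozygosity follow by simple counting. The sketch below uses the three genotypes shown for contig1, position 1; the function names are illustrative.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """genotypes: list of two-letter diploid calls such as 'AG'."""
    alleles = Counter(a for g in genotypes for a in g)
    total = sum(alleles.values())
    return {allele: n / total for allele, n in alleles.items()}


def observed_heterozygosity(genotypes):
    """Fraction of individuals called heterozygous at the site."""
    return sum(g[0] != g[1] for g in genotypes) / len(genotypes)


site = ["AG", "GG", "AA"]  # contig1, position 1: ind1, ind2, ind3
freqs = allele_frequencies(site)
het = observed_heterozygosity(site)
print(freqs, het)
```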
Calling SNPs and genotypes from transcriptome reads



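Model M1: sequencing error ε

reads:  A:1   C:0   G:6   T:0

genotype [AG]:  $P(\text{reads} \mid AG) = 7\,(1/2 - \varepsilon/3)^{7}$
genotype [GG]:  $P(\text{reads} \mid GG) = 7\,(\varepsilon/3)\,(1-\varepsilon)^{6}$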
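
The two M1 likelihoods on this slide are special cases of a multinomial likelihood in which each read shows the true base with probability 1 - ε and each of the three other bases with probability ε/3. The sketch below generalises the slide's example to any diploid genotype; the generalisation and the function names are assumptions, and the value of ε is arbitrary.

```python
from math import factorial, prod

BASES = "ACGT"


def base_probs(genotype, eps):
    """Probability of reading each base given a diploid genotype under M1."""
    probs = {}
    for b in BASES:
        # Each allele is sampled with probability 1/2; a read shows the true
        # base with probability 1 - eps and each other base with eps / 3.
        probs[b] = sum(0.5 * ((1 - eps) if a == b else eps / 3) for a in genotype)
    return probs


def likelihood_m1(counts, genotype, eps):
    """Multinomial likelihood of the read counts given the genotype."""
    coef = factorial(sum(counts.values()))
    for c in counts.values():
        coef //= factorial(c)
    p = base_probs(genotype, eps)
    return coef * prod(p[b] ** counts[b] for b in BASES)


reads = {"A": 1, "C": 0, "G": 6, "T": 0}
eps = 0.02  # arbitrary illustrative value
lik_ag = likelihood_m1(reads, "AG", eps)
lik_gg = likelihood_m1(reads, "GG", eps)
print(lik_ag, lik_gg)
```

For a heterozygote AG, the probability of reading A is (1 - ε)/2 + (ε/3)/2 = 1/2 - ε/3, which is where the term in the slide formula comes from.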
Calling SNPs and genotypes from transcriptome reads


Model M2: sequencing error ε and allelic bias α

reads:
  A:0   C:3    G:12   T:0
  A:8   C:0    G:2    T:1
  A:1   C:0    G:6    T:0
  A:0   C:3    G:0    T:16
  A:4   C:0    G:1    T:0
  A:0   C:19   G:2    T:0

For reads A:1 C:0 G:6 T:0:

genotype [AG]:  $7\,\left[\,q'\,q''^{6}/2 + q''\,q'^{6}/2\,\right]$
genotype [GG]:  $7\,\varepsilon\,(1-3\varepsilon)^{6}$
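
The M2 expressions can be evaluated numerically as written. In the sketch below q' and q'' are plain inputs (`q1`, `q2`), because the slide does not spell out how they depend on ε and α; extending the formulas from 1 read out of 7 to k reads out of n is also an assumption, as are the numeric values.

```python
from math import comb


def lik_m2_heterozygote(k, n, q1, q2):
    """Slide formula for a heterozygote: k reads of one allele, n - k of the
    other, averaging over which of the two alleles is the favoured one."""
    return comb(n, k) * (q1**k * q2**(n - k) / 2 + q2**k * q1**(n - k) / 2)


def lik_m2_homozygote(k, n, eps):
    """Slide formula for a homozygote: k reads of one erroneous base,
    n - k reads of the true base (each wrong base has probability eps)."""
    return comb(n, k) * eps**k * (1 - 3 * eps) ** (n - k)


# Reads A:1 C:0 G:6 T:0 with hypothetical parameter values.
q1, q2, eps = 0.30, 0.68, 0.02
lik_ag = lik_m2_heterozygote(1, 7, q1, q2)
lik_gg = lik_m2_homozygote(1, 7, eps)
print(lik_ag, lik_gg)
```

The heterozygote likelihood is symmetric in `q1` and `q2`, reflecting that the direction of the bias is not known in advance.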
Population genomics of a fast-evolver

focal species:     Ciona intestinalis B         (8 individuals)
outgroup:          Ciona intestinalis A         (reference sequence)

1602 contigs (>10X in >5 individuals), of average length 138 codons



                              M1                         M2

       SNPs                  30020                     29544
       error rate            0.021 [0.012-0.038]       0.020 [0.011-0.035]
       allelic bias          0                         [0.08-0.5]
       stop codons           77 (0.26%)                117 (0.39%)
       F_IT                  -0.017                    -0.054
       nb best model         70 (4.6%)                 1532 (95.4%)
average πS: 0.057 per site                  (a highly polymorphic species)

average πN: 0.0026 per site

πN/πS : 0.046                               (strong level of purifying selection)

dN/dS : 0.103                               (high impact of adaptive evolution)



 estimated proportion of adaptive non-synonymous substitutions: 54%
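
As a rough check on these figures, the classical McDonald-Kreitman-style estimator α = 1 - (πN/πS)/(dN/dS) can be applied to the rounded values above. The talk does not say which estimator produced the 54%, so using this formula is an assumption.

```python
def adaptive_proportion(pi_n, pi_s, dn_ds):
    """alpha = 1 - (piN / piS) / (dN / dS)."""
    return 1 - (pi_n / pi_s) / dn_ds


pi_n, pi_s, dn_ds = 0.0026, 0.057, 0.103  # rounded values from the slide
ratio = pi_n / pi_s
alpha = adaptive_proportion(pi_n, pi_s, dn_ds)
print(f"piN/piS = {ratio:.3f}, alpha = {alpha:.2f}")
```

With the rounded inputs this gives α of about 0.56, close to the 54% quoted; the small gap is compatible with rounding or with a different estimator.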
Why are tunicates fast-evolving, proteome-wise?



[Figure: tree with tips labelled E, C, V and T; legend: adaptive, neutral, deleterious]

           - higher mutation rate?                                YES
           - more prevalent adaptive evolution?                   YES
           - relaxed selective constraint on housekeeping genes?  NO

                  → large Ne, large µ (per year)
Conclusions



- de novo population genomics from NGS transcriptome data is doable


- transcriptome assembly is probably the trickiest step


- major population genomic descriptors are robust to error models


- life history traits apparently impact molecular evolution to some extent



- long-lived, small population-sized species are the best choice for phylogenomics
VERTEBRATES                   INSECTS

NEM.                     MOLLUSCS                     NEMATODES

 CRUSTACEANS        ANNELIDS          UROCHORDATES   CNID.   SPONG.
Subprojects we have started


- selfers vs outcrossers in snails and nematodes



- long-lived vs short-lived in insects



- big vs small in amniotes
  phylogeny of turtles


- fast protein evolution in tunicates and nematodes



- extreme longevity
Thanks to:



    Philippe Gayral        CNRS
    Vincent Cahais
    Georgia Tsagkogeorga
    Marion Ballenghien
    Zef Melo Ferreira
    Ylenia Chiari
    Lucy Weinert
                           ISEM
    Sylvain Glémin
    Nico Bierne
    Khalid Belkhir
    Fred Delsuc
    Vincent Ranwez

    Guillaume Dugas
    Sébastien Harispe      ERC
    Caroline Benoist
