Grenoble 2011 galtier

CBGP, March 2011

High-throughput transcriptomics for molecular
evolution and population genetics

Nicolas Galtier

UMR 5554 - Institut des Sciences de l'Evolution - Montpellier

galtier@univ-montp2.fr

Molecular evolution in the 21st century

We have:

- an enormous amount of data (genomics)
- a robust theoretical framework (population genetics)

⇒ we should understand molecular variation patterns

Yet we do not really know:

- why some species evolve (much) faster than others, proteome-wise
- why GC-content varies within and between genomes

- by how much population size determines genetic diversity

- etc…

Molecular evolution in the 21st century

Why so many unsolved, basic questions?

- lacking theory
- biased sampling (genes, species)

PopPhyl goals

Injecting species biology/ecology into comparative genomics

Exploring the molecular diversity of nonmodel taxa

Testing predictions of the population genetic theory genome-wide

life history traits (body mass, generation time, abundance, mating system)
→ population genetic parameters (mutation rate, population size, selection, recombination)
→ genomic variation data (within-species, between species)


Some specific questions we want to address:

- Why are fast-evolving taxa fast? (mutation, selection)
- Are abundant species more polymorphic than scarce ones?
- Is selection less efficient in selfers than outcrossers?
- How does longevity influence mito vs nuclear DNA evolution?
- Who optimizes codon usage, who does gBGC, and why?
- Is the rate of selective sweeps higher in large populations?

How?

- Target = transcriptome (coding sequences, expression data)

- Sampling scheme: focal species (10 individuals) + outgroups (1 or 2 individuals), × 30

- Next-Generation Sequencing technology

For each taxon:
5×10^5 reads of 400 bp (454, pooled individuals)
5×10^7 reads of 100 bp (Illumina, tagged individuals)
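
As a rough order of magnitude, the per-taxon sequencing effort can be converted into total bases. The sketch below assumes the slide's figures read as 5×10^5 reads of 400 bp (454) and 5×10^7 reads of 100 bp (Illumina); the variable names are illustrative only.

```python
# Read counts and read lengths as read from the slide
# (assumption: "5.105" and "5.107" stand for 5x10^5 and 5x10^7).
runs = {
    "454": {"reads": 500_000, "read_length_bp": 400},
    "Illumina": {"reads": 50_000_000, "read_length_bp": 100},
}


def total_bases(run):
    """Total sequenced bases for one run: number of reads x read length."""
    return run["reads"] * run["read_length_bp"]


yields = {name: total_bases(run) for name, run in runs.items()}

for name, bases in yields.items():
    print(f"{name}: {bases / 1e6:.0f} Mb per taxon")
```

Under that reading, the Illumina run contributes 25 times more bases per taxon than the 454 run, while the 454 reads are four times longer.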

Species sampling

Sponges
Demosponges
Cnidarians
Ctenophores
Rotifers
Acanthocephalans
Entoprocts
Nemerteans
Platyhelminthes
Annelids
Molluscs
Ectoprocts
Brachiopods
Chaetognaths
Tardigrades
Onychophorans
Arthropods
Loriciferans
Kinorhynchs
Priapulids
Nematodes
Hemichordates
Echinoderms
Cephalochordates
Urochordates
Vertebrates

Why are tunicates fast-evolving, proteome-wise?

[Figure: phylogenetic tree with tips labeled E, C, V, T]

- higher mutation rate?
- more prevalent adaptive evolution?
- relaxed selective constraint on housekeeping genes?

Data analysis pipeline

[Figure: analysis pipeline. 454 and Solexa reads → transcriptome assembly → reference transcriptome → coding annotation; reads mapped to the reference transcriptome → SNP calling → SNPs and genotypes → πN, πS, dN, dS and allele frequencies]

Assembling transcriptomes from NGS data:
a benchmark in Ciona

[Figure: assembly strategies compared in the benchmark]

A: 454 reads, Celera
B: 454 reads, Mira
C: 454 reads, Cap3
D: Illumina reads, Abyss then Cap3
E: 454 + Illumina, merge reads (Abyss, Cap3)
F: 454 + Illumina, merge contigs (Cap3); F' = F after refinement

de novo transcriptome assembly: quantitative assessment

    data set     method          contigs   mean lg   median lg   N50 lg   assembly (Mb)   touched genes

A   Ciona_454    Celera          25,669    491       438         491      12.6            7616
B   Ciona_454    Mira            33,196    635       526         650      21.1            7951
C   Ciona_454    Cap3            24,515    671       540         713      16.5            7945
D   Ciona_illu   Abyss+Cap3      27,426    574       380         769      15.8            7704
E   Ciona_mix    merge reads     29,097    571       399         721      16.6            7982
F   Ciona_mix    merge contigs   27,956    726       529         891      20.3            8207
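
For readers unfamiliar with the N50 column, here is a minimal sketch of the usual definition: the contig length L such that contigs of length at least L contain at least half of the assembled bases. The function name and the toy lengths are illustrative and are not taken from the Ciona benchmark.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Toy contig lengths (hypothetical values, for illustration only).
toy_lengths = [1000, 800, 600, 400, 200]
print("N50 =", n50(toy_lengths))
```

Because N50 weights contigs by their length, it can move quite differently from the mean and median contig lengths when comparing assemblies.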

[Figures: comparison of the 454 contigs, Illumina contigs and Mix contigs]

Assembling transcriptomes from NGS data:
a benchmark using Ciona intestinalis

Predicted contigs are compared to the reference transcriptome with BLAST and sorted by hit pattern:

- no hit
- 1→1: full or partial
- m→1: fragments or alleles
- 1→n: chimera or multi
- m→n: full or partial, multi
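
The hit-pattern categories can be sketched in code. This is a simplified illustration, not the pipeline used in the benchmark: it assumes BLAST results have already been reduced to (contig, reference transcript) pairs, and it reads "m→1" as several contigs hitting one reference transcript and "1→n" as one contig hitting several. The rule is applied per contig rather than per connected group of hits, and the function name is hypothetical.

```python
from collections import defaultdict


def classify_contigs(hits, all_contigs):
    """Assign each contig to 'no hit', '1→1', 'm→1', '1→n' or 'm→n'.

    hits: iterable of (contig, reference) pairs that passed the BLAST filter.
    all_contigs: every predicted contig, including those without any hit.
    """
    refs_of = defaultdict(set)     # contig -> reference transcripts it hits
    contigs_of = defaultdict(set)  # reference -> contigs that hit it
    for contig, ref in hits:
        refs_of[contig].add(ref)
        contigs_of[ref].add(contig)

    categories = {}
    for contig in all_contigs:
        refs = refs_of.get(contig, set())
        if not refs:
            categories[contig] = "no hit"
            continue
        several_refs = len(refs) > 1
        # True if at least one of the hit references is also hit by another contig.
        shared = any(len(contigs_of[r]) > 1 for r in refs)
        if several_refs:
            categories[contig] = "m→n" if shared else "1→n"
        else:
            categories[contig] = "m→1" if shared else "1→1"
    return categories


# Toy example with made-up identifiers.
toy_hits = [("c1", "r1"), ("c2", "r2"), ("c3", "r2"), ("c4", "r3"), ("c4", "r4")]
toy_categories = classify_contigs(toy_hits, ["c1", "c2", "c3", "c4", "c5"])
print(toy_categories)
```

Telling "full" from "partial", or "fragments" from "alleles", within a category would need alignment coordinates and sequence identity, which this sketch does not model.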

de novo transcriptome assembly: qualitative assessment

Average contig length varies between categories

Improving assemblies by filtering according to length + coverage

[Figure: percentage of correct contigs (60% to 80%) against number of contigs (4000 to 12000)]
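
A length + coverage filter of this kind could look like the sketch below. The `Contig` record, the function name and both thresholds are hypothetical; the slide does not state which cut-offs were used.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int      # contig length in bp
    coverage: float  # mean read depth over the contig


def filter_contigs(contigs, min_length=500, min_coverage=10.0):
    """Keep contigs that are both long enough and deeply enough covered.

    The default thresholds are placeholders, not values from the talk.
    """
    return [
        c for c in contigs
        if c.length >= min_length and c.coverage >= min_coverage
    ]


# Toy contigs (made-up values).
toy_contigs = [
    Contig("c1", 1200, 35.0),
    Contig("c2", 300, 80.0),   # too short
    Contig("c3", 900, 4.0),    # too shallow
]
kept = filter_contigs(toy_contigs)
print([c.name for c in kept])
```

Raising either threshold trades the number of contigs retained against the fraction that are correct, which is the trade-off the figure above displays.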

de novo transcriptome assembly from NGS data: conclusions

- Illumina > 454
(454 still useful)

- existing programs differ substantially in performance
(in PopPhyl we retain Cap3 and Abyss)

- correct cDNA predictions are a minority in typical assemblies

- contig length + coverage is a reasonable quality criterion

- somewhat variable across species

Calling SNPs and genotypes from transcriptome reads

>contig1
pos ind1 ind2 ind3
1 5/0/9/0 0/0/8/0 10/0/0/0
2 0/4/0/0 0/7/0/0 0/17/0/0
3 1/0/0/17 0/0/0/6 0/0/0/22
…
>contig2
pos ind1 ind2 ind3
1 0/0/0/4 0/0/0/8 0/2/0/11
2 34/1/13/0 52/0/45/0 4/0/8/0
…

reads


>contig1
pos ind1 ind2 ind3
1 5/0/9/0 AG 0/0/8/0 GG 6/0/0/0 AA
2 0/4/0/0 CC 0/7/0/0 CC 0/17/0/0 CC
3 1/0/0/17 TT 0/0/0/6 TT 0/0/0/5 TT
…
>contig2
pos ind1 ind2 ind3
1 0/0/0/1 TT 0/0/0/8 TT 0/2/0/11 CT(90%)
2 14/1/9/0 AG 8/0/15/0 AG 12/0/0/0 AA
…

genotypes


Model M1: sequencing error ε

Reads: A:1 C:0 G:6 T:0

Likelihood of these reads under two candidate genotypes:

[AG]: $7\,(1/2 - \varepsilon/3)^{7}$
[GG]: $7\,(\varepsilon/3)\,(1-\varepsilon)^{6}$
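
The two M1 expressions follow from a multinomial read-sampling model in which each read is drawn from one of the two alleles with probability 1/2 and is misread as any given other base with probability ε/3. The sketch below implements that likelihood and a plain maximum-likelihood genotype call. The function names, the flat treatment of all ten genotypes and the value ε = 0.02 are assumptions for illustration, not the author's implementation.

```python
from itertools import combinations_with_replacement
from math import comb

BASES = "ACGT"


def parse_counts(field):
    """Turn an 'A/C/G/T' count string such as '1/0/6/0' into a dict."""
    return dict(zip(BASES, (int(x) for x in field.split("/"))))


def multinomial_coefficient(counts):
    """Number of distinct orderings of reads with the given base counts."""
    total = sum(counts)
    coef = 1
    for c in counts:
        coef *= comb(total, c)
        total -= c
    return coef


def m1_likelihood(counts, genotype, eps):
    """P(read counts | genotype) under sequencing error eps (model M1)."""
    like = multinomial_coefficient([counts[b] for b in BASES])
    for base in BASES:
        # Probability of reading `base`, averaged over the two alleles:
        # 1 - eps if the allele is `base`, eps / 3 otherwise.
        p = sum((1 - eps) if allele == base else eps / 3
                for allele in genotype) / 2
        like *= p ** counts[base]
    return like


def call_genotype(counts, eps):
    """Return the genotype with the highest M1 likelihood (no prior)."""
    genotypes = ["".join(g) for g in combinations_with_replacement(BASES, 2)]
    return max(genotypes, key=lambda g: m1_likelihood(counts, g, eps))


reads = parse_counts("1/0/6/0")
like_ag = m1_likelihood(reads, "AG", eps=0.02)
like_gg = m1_likelihood(reads, "GG", eps=0.02)
print(like_ag, like_gg)
```

For the reads A:1 G:6 the two likelihoods are of the same order of magnitude when ε is a few percent, which is why a single discordant read does not settle the call on its own.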


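Model M2: sequencing error ε and allelic bias α

Example read counts:
A:0 C:3 G:12 T:0
A:8 C:0 G:2 T:1
A:0 C:3 G:0 T:16
A:4 C:0 G:1 T:0
A:0 C:19 G:2 T:0

For the reads A:1 C:0 G:6 T:0, likelihood under two candidate genotypes:

[AG]: $7\,[\,q'\,q''^{6}/2 + q''\,q'^{6}/2\,]$
[GG]: $7\,\varepsilon\,(1-3\varepsilon)^{6}$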

Population genomics of a fast-evolver

focal species: Ciona intestinalis B (8 individuals)
outgroup: Ciona intestinalis A (reference sequence)

1602 contigs (>10X in >5 individuals), of average length 138 codons

                 M1                     M2

SNPs             30020                  29544
error rate       0.021 [0.012-0.038]    0.020 [0.011-0.035]
allelic bias     0                      [0.08-0.5]
stop codons      77 (0.26%)             117 (0.39%)
FIT              -0.017                 -0.054
nb best model    70 (4.6%)              1532 (95.4%)

average πS: 0.057 per site (a highly polymorphic species)

average πN: 0.0026 per site

πN/πS : 0.046 (strong level of purifying selection)

dN/dS : 0.103 (high impact of adaptive evolution)

estimated proportion of adaptive non-synonymous substitutions: 54%
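
One common way to obtain such a proportion from these four statistics is the McDonald-Kreitman-type estimator α = 1 − (πN/πS)/(dN/dS). The slide does not say which estimator was used, so the sketch below is only a consistency check on the rounded values shown above; the function name is illustrative.

```python
def adaptive_proportion(pin_over_pis, dn_over_ds):
    """McDonald-Kreitman-type estimate of the proportion of adaptive
    non-synonymous substitutions: alpha = 1 - (piN/piS) / (dN/dS)."""
    return 1 - pin_over_pis / dn_over_ds


# Rounded values from the slide.
pi_s = 0.057
pi_n = 0.0026
pin_over_pis = 0.046
dn_over_ds = 0.103

# The reported ratio agrees with the two averages once rounded.
print(f"piN/piS recomputed: {pi_n / pi_s:.3f}")

alpha = adaptive_proportion(pin_over_pis, dn_over_ds)
print(f"alpha = {alpha:.2f}")
```

With these rounded inputs the simple estimator gives about 0.55, close to the 54% reported; the small gap may reflect rounding or a different estimation method.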

Why are tunicates fast-evolving, proteome-wise?

[Figure: phylogenetic tree with tips labeled E, C, V, T; substitutions classified as adaptive, neutral or deleterious]

- higher mutation rate? YES
- more prevalent adaptive evolution? YES
- relaxed selective constraint on housekeeping genes? NO

→ large Ne, large µ (per year)

Conclusions

- de novo population genomics from NGS transcriptome data is doable

- transcriptome assembly is probably the trickiest step

- major population genomic descriptors are robust to error models

- life history traits apparently impact molecular evolution to some extent

- long-lived, small population-sized species are the best choice for phylogenomics

[Figure labels: vertebrates, insects, nemerteans, molluscs, nematodes, crustaceans, annelids, urochordates, cnidarians, sponges]

Subprojects we have started

- selfers vs outcrossers in snails and nematodes

- long-lived vs short-lived in insects

- big vs small in amniotes
phylogeny of turtles

- fast protein evolution in tunicates and nematodes

- extreme longevity

Thanks to:

Philippe Gayral
Vincent Cahais
Georgia Tsagkogeorga
Marion Ballenghien
Zef Melo Ferreira
Ylenia Chiari
Lucy Weinert
Sylvain Glémin
Nico Bierne
Khalid Belkhir
Fred Delsuc
Vincent Ranwez
Guillaume Dugas
Sébastien Harispe
Caroline Benoist

CNRS, ISEM, ERC
