Genomics Course - UAT (VHIR) 2012 - Gene expression data analysis

Statistical analysis of
gene expression data

Alex Sánchez
Unitat d'Estadística i Bioinformàtica (VHIR)
Statistics Department (UB)

Outline
• Basic principles of experimental design
• The microarray data analysis process

Basic principles of
Experimental Design

Research

• Researcher’s first goal: to understand a process
(in order to control, modify, reproduce … it)

• To reach this goal researchers perform studies.
• Experiments are a central part of many studies.

What characterizes an experiment?

1. The treatments to be used

2. The experimental units to be used

3. The way that treatment levels are assigned
to experimental units (or vice versa):
The Experimental Design

4. The responses that are measured

How can we obtain a
good experimental design?
• Try to apply some good, general, relatively
overlapping rules
1. Rely on an Experimental Design checklist
2. Follow a good Experimental Design Process
3. Rely on basic principles of Experimental Design
Randomization, replication, local control

• But also
• Plan design and analysis at the same time
• Involve your favourite statistician from the beginning
(or before)

What characterizes a
good experimental design?
• It avoids systematic error – systematic error leads to
bias when estimating differences in responses
between (i.e., comparing) treatments

• It allows for precise estimation – it achieves a
relatively small random error

• It has broad validity
• The experimental units are a sample of the
study population
• The conclusions obtained on the sample can be
extrapolated to the population.

To obtain a good experimental design (1)
Plan the experiments (Checklist)

1. Define the objectives of the experiment
2. Identify all potential sources of variation
3. Select an appropriate Experimental Design.
4. Specify the experimental process
5. Conduct a pilot study
6. Specify the hypothesized model
7. Outline the analyses to be conducted
8. Estimate the required sample size using results
from the pilot study
9. Review your decisions in Steps 1 – 8 and make
necessary revisions

To obtain a good experimental design (2):
Follow the experimental design process

To obtain a good experimental design (3)
Follow Experimental Design Principles

The basic principles of
Experimental Design

• Good experimental designs share common
traits.
• Wishful thinking aside, there is general
agreement that relying on experimental
design principles yields good (if not the best)
experimental designs.
• These are
• Randomization
• Replication
• Blocking or Local control

1. Randomization
• Randomly assigning samples to groups to
eliminate unspecific disturbances
– Randomly assign individuals to treatments.
– Randomize order in which experiments are performed.
• Randomization required to
– Ensure validity of statistical procedures.
– Ensure that no preferential allocation of treatment to
experimental units is made
• E.g.: assigning the strongest treatment to the patients in the worst state of health
– Ensure that the effects of confounding variables are
minimized
• E.g.: assigning the treatment to patients who are older than the controls

Randomization software

• “Randomly assign…” is
sometimes easier said than done,
especially in complex designs.
• Some tools may help
– R, of course
– Research Randomizer
http://www.randomizer.org/
– Interactive Statistical Calculation pages
http://statpages.org/
(look for “Experimental design”)
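As a complement to the tools listed above, a random allocation can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the course: the function name `randomize`, the patient labels and the seed are all hypothetical.

```python
import random

def randomize(units, treatments, seed=None):
    """Randomly allocate units to treatments in (near-)equal group sizes."""
    rng = random.Random(seed)          # a fixed seed makes the allocation reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)
    # Deal the shuffled units to the treatments in turn, like dealing cards.
    allocation = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        allocation[treatments[i % len(treatments)]].append(unit)
    return allocation

# Hypothetical example: 8 patients, 2 treatments.
patients = [f"P{i}" for i in range(1, 9)]
allocation = randomize(patients, ["A", "B"], seed=2012)
for treatment, group in allocation.items():
    print(treatment, sorted(group))
```

The same shuffle can be applied to the list of runs to randomize the order in which the experiments are performed.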

2. Replication
• There is general agreement about the need to apply each
treatment independently to several experimental units.
• Helps to establish reproducibility of results.
• Protects against possible abnormal/unusual results.
• Provides a way to estimate the error variance in the absence of
systematic differences among experimental units. (This is important
because treatment differences are judged against this variance estimate.)
• Provides the capacity to increase the precision for estimates of
treatment means.
• By itself, does not guarantee valid estimates of experimental
error or treatment differences.

Replication, precision and power

• The number of replications $r$ is directly related
to the precision of the experiment:
$1/\operatorname{var}(\text{mean}) = r/\sigma^2$   (*)
• An efficient design has greater power to detect
differences between treatment effects.
• From (*) it follows that
– the greater $r$
– the smaller $\sigma^2$
the greater the power attained by a design.

How many replications?

• Formulae for computing the sample size given:
– effect size,
– significance level (probability of a type I error),
– power (1 - probability of a type II error)
can be derived for most common analyses.

• While the derivation can be laborious, the application
– is straightforward, especially if using calculators
– requires attention to the conditions of applicability.
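To make the three ingredients above concrete, here is a minimal sketch of one such formula: the normal-approximation sample size for a two-sided comparison of two means with a common standard deviation. The function name and the example values are assumptions chosen for illustration; dedicated calculators use exact t-based versions and will give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate replicates per group for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2
    delta: difference in means to detect; sigma: common standard deviation.
    """
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # quantile for the type I error
    z_beta = z.inv_cdf(power)              # quantile for the desired power
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Detecting a difference of one standard deviation (hypothetical effect size):
print(n_per_group(delta=1.0, sigma=1.0))   # 16
# Halving the effect size roughly quadruples the required replicates:
print(n_per_group(delta=0.5, sigma=1.0))   # 63
```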

Sample size calculators

• R: package pwr
– http://www.statmethods.net/stats/power.html

• Statistical calculators
– http://hedwig.mgh.harvard.edu/sample_size/size.html
– http://www.stat.uiowa.edu/~rlenth/Power/

• Interactive Statistical Calculation pages
http://statpages.org
(look for “Power and Sample Size”)

Biological vs Technical
Replicates

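[Figure: sources of variation in a replicated experiment, with variance components $\sigma_B^2$, $\sigma_A^2$ and $\sigma_e^2$]

© Nature Reviews & G. Churchill (2002)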

3. Blocking
• Assume we wish to perform an experiment to
compare two treatments.
• The samples or their processing may not be
homogeneous: There are blocks
• Subjects: Male/Female
• Arrays produced in two lots (February, March)
• If there are systematic differences between blocks,
the effects of interest (e.g. treatment) may be
confounded
• Are the observed differences attributable to the treatment effect or
to confounding factors?
• Local control or blocking is the way to minimize the
effect of existing (unavoidable?) blocks.

Local Control
• Group experimental units (EUs) so that the variability of units
within the groups is less than that among all
units prior to grouping
– Differences among treatments are not confused with
differences among experimental units.
– Experimental error (EE) is reduced by the variability associated with
environmental differences among groups of units.
– Effects of nuisance factors which contribute
systematic variation to the differences among EUs
can be eliminated.
– Analysis is more sensitive.

Confounding block with treatment effects

Awful design                          Balanced design
Sample  Treatment  Sex     Batch      Sample  Treatment  Sex     Batch
1       A          Male    1          1       A          Male    1
2       A          Male    1          2       A          Female  2
3       A          Male    1          3       A          Male    2
4       A          Male    1          4       A          Female  1
5       B          Female  2          5       B          Male    1
6       B          Female  2          6       B          Female  2
7       B          Female  2          7       B          Male    2
8       B          Female  2          8       B          Female  1

• Two alternative designs to investigate treatment effects
– Left: Treatment effects confounded with Sex and Batch effect
– Right: Treatments are balanced between blocks
• Influence of blocks is automatically compensated
• Statistical analysis may separate block from treatment effect
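The difference between the two designs can be checked mechanically by cross-tabulating treatment against each blocking factor. The sketch below encodes the two tables from this slide; the helper name `is_balanced` is hypothetical, and "balanced" is taken here to mean that each level of a blocking factor occurs equally often under every treatment.

```python
from collections import Counter

# (treatment, sex, batch) for samples 1..8, copied from the two tables above.
awful = [("A", "Male", 1)] * 4 + [("B", "Female", 2)] * 4
balanced = [
    ("A", "Male", 1), ("A", "Female", 2), ("A", "Male", 2), ("A", "Female", 1),
    ("B", "Male", 1), ("B", "Female", 2), ("B", "Male", 2), ("B", "Female", 1),
]

def is_balanced(design, column):
    """True if every level of the blocking factor (tuple index `column`)
    occurs the same number of times under each treatment."""
    counts = Counter((row[0], row[column]) for row in design)
    treatments = sorted({row[0] for row in design})
    levels = sorted({row[column] for row in design})
    # Counter returns 0 for combinations that never occur.
    return all(
        len({counts[(t, level)] for t in treatments}) == 1 for level in levels
    )

for name, design in [("awful", awful), ("balanced", balanced)]:
    print(name, "sex:", is_balanced(design, 1), "batch:", is_balanced(design, 2))
```

In the awful design both checks fail, because every Male/batch 1 sample received treatment A and every Female/batch 2 sample received B.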

Allocating samples to treatments
• A key point in any experiment is the way that
experimental units are allocated to treatments
– It must be chosen so that random variability is as
small as possible

– It must be chosen so that the best local control is
achieved.

– It implicitly defines the analysis model, so it must be
chosen so that the analysis can be performed and
validity conditions hold.

In summary
• Good experimental design is essential to
perform good experiments.

• Experimental design means planning
ahead
– Should be done before the experiment starts
– Should consider all the steps: from sampling
to data analysis.

And Fisher said…

To consult the statistician after an
experiment is finished is often merely
to ask him to conduct a post mortem
examination.
He can perhaps say what the experiment
died of.

Sir Ronald A. Fisher
Father of modern Mathematical Statistics
and Developer of Experimental Design
and ANOVA

Introduction to microarray
data analysis

Presentation outline

• Introduction and objectives
• Microarray data analysis
• Types of data and types of studies. Tools.
• The analysis process. Examples
• Criticisms, consensus, advice and “state of the
art”
• Criticisms of microarrays
• Consensus and advice (“dos and don’ts”)
• MAQC-I, MAQC-II
• From microarrays to diagnostics
• Why is it always yet to come?

To learn more …

http://www.ub.es/stat/docencia/bioinformatica/microarrays/ADM/

And many more …
• Time Course
• Expression profiles over time
• Pathway Analysis (Systems Biology)
• Reconstruction of metabolic networks from
expression data
• Whole Genome, CGH, Alternative
Splicing
• Studies with data of different types
• Data fusion or integration

Analysis tools

Data analysis software

• A multitude of tools
• Free / Commercial
• [R, BRB, MeV, dChip…] / [Partek, GeneSpring, Ingenuity]
• Downloadable / Online
• [R, BRB, MeV…] / [Babelomics,…]
• Stand-alone / Part of “suites” or sites
• [BRB, dChip] / [MeV (TM4), OntoTools]
• Review: Tools for managing and analyzing
microarray data
• http://bib.oxfordjournals.org/content/13/1/46.abstract?keytype=ref&ijkey=g74sTv2xGt5kOpU

Analysis of a microarray experiment

(1) Images
(Raw data)

(2) Quality control
(low level)

(3) Preprocessing

(4) Exploration
of the Expression
Matrix

(5) Analysis

(6) Biological
Significance

(0) Experimental design
• Variability
– Systematic
• Calibrate/Normalize
– Random
• Experimental design
• Inference
• Decide about
– Replicates,
– Batches (“batch effect”)
– Pools …

[Table: awful design :-( vs. balanced design :-), as on the slide “Confounding block with treatment effects”]

(1) Image acquisition
• Input: Microarrays
• Output:
– Images (1/chip)
– Image files (1.cel, 1.chp, 2.cel, 2.chp, …)
• Information for
each individual probe
• Data for the low-level
analysis
– Quality control
– Preprocessing
– Summarization

(2) Low-level quality control

• Input:
– Images (.CEL, ...): 1.cel, 1.chp, 2.cel, 2.chp, …
• Process
– Diagnostics and
quality control
– Model-based
analysis (PLM)
• Output:
– Plots
– Quality-control
statistics

(3) Preprocessing

• Input:
– Image files (1.cel, 1.chp, 2.cel, 2.chp, …; scanner data)
• Process
– Noise removal
– Normalization
– Summarization
– Filtering
• Output:
– Expression matrix

              C01-001.CEL  C02-001.CEL  C03-001.CEL
1415670_at       8.954387     9.088924     8.833863
1415671_at      10.700876    10.639307    10.610953
1415672_at      10.377266    10.510106    10.461701
1415673_at       7.320335     7.252635     7.112313
1415674_a_at     8.381129     8.332256     8.393718
1415675_at       8.120937     8.082713     8.051514
1415676_a_at    10.322229    10.287371    10.282812
1415677_at       9.038344     8.979641     8.905711

(4) Exploration

• Input
– Expression matrix (probe sets × arrays, as produced in step 3)
• Process
– PCA, Cluster, MDS
– 2D/3D
representations
– Groupings
• Output
– Detection of batch
effects
– Quality verification
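Clustering and MDS both start from distances between arrays. As a minimal sketch of that first step, the code below computes Euclidean distances between the three arrays using the eight probe sets shown in the expression matrix of step (3). The choice of Euclidean distance is an assumption made here for illustration; the slides do not say which metric was used.

```python
from itertools import combinations
from math import dist

arrays = ["C01-001.CEL", "C02-001.CEL", "C03-001.CEL"]
# Rows: probe sets; columns: arrays (values copied from the slide).
expression = {
    "1415670_at":   [8.954387, 9.088924, 8.833863],
    "1415671_at":   [10.700876, 10.639307, 10.610953],
    "1415672_at":   [10.377266, 10.510106, 10.461701],
    "1415673_at":   [7.320335, 7.252635, 7.112313],
    "1415674_a_at": [8.381129, 8.332256, 8.393718],
    "1415675_at":   [8.120937, 8.082713, 8.051514],
    "1415676_a_at": [10.322229, 10.287371, 10.282812],
    "1415677_at":   [9.038344, 8.979641, 8.905711],
}

# Transpose so that each array becomes one vector of expression values.
columns = list(zip(*expression.values()))

distances = {}
for (i, a), (j, b) in combinations(enumerate(arrays), 2):
    distances[(a, b)] = dist(columns[i], columns[j])
    print(f"{a} vs {b}: {distances[(a, b)]:.3f}")
```

Arrays that sit unusually far from the others of their own group, or that group by processing date rather than by condition, are candidates for quality problems or batch effects.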

(5) Statistical analysis (i):
Selection of differentially expressed genes

• Input:
– Expression matrix
– Analysis model
• Process
– t-tests, ANOVA
• p-value adjustments
• Output
– Gene lists
• Fold change, p-values
– Plots
– Expression profiles

ProbeSet       gene    ID             logFC   t        P.Value   adj.P.Val  B
1450826_a_at   Saa3    1450826_a_at    4.911   63.544  6.21E-14  2.80E-10   22.244
1457644_s_at   Cxcl1   1457644_s_at    4.286   53.015  3.52E-13  7.69E-10   20.791
1415904_at     Lpl     1415904_at     -4.132  -50.455  5.66E-13  7.69E-10   20.373
1449450_at     Ptges   1449450_at      5.164   49.483  6.82E-13  7.69E-10   20.207
1419209_at     Cxcl1   1419209_at      5.037   47.175  1.08E-12  9.71E-10   19.794
1416576_at     Socs3   1416576_at      3.372   42.107  3.19E-12  2.08E-09   18.784
1450330_at     Il10    1450330_at      4.519   42.056  3.23E-12  2.08E-09   18.773
1455899_x_at   Socs3   1455899_x_at    3.648   40.821  4.29E-12  2.12E-09   18.502
1419681_a_at   Prok2   1419681_a_at    3.709   40.645  4.48E-12  2.12E-09   18.463
1436555_at     Slc7a2  1436555_at      3.724   40.081  5.12E-12  2.12E-09   18.335
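The “p-value adjustments” step can be illustrated with the Benjamini-Hochberg procedure, a common choice for controlling the false discovery rate. The sketch below is a plain-Python version under that assumption; the slide does not state which adjustment was applied, and the input p-values are made up for the example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.20]      # hypothetical raw p-values
adj = bh_adjust(raw)
for p, q in zip(raw, adj):
    print(f"p = {p:<6} adjusted = {q:.5f}")
```

Note that the third and fourth genes end up with the same adjusted value: ties of this kind also appear in the adj.P.Val column of the table above.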

(5) Statistical analysis (ii):
Construction & validation of a predictor

• Input:
– Expression matrix
• Process
– Variable selection
– Model fitting
– Validation
• Output
– Predictive models
– Measures of reliability
/reproducibility

(6) Biological significance

• Input
– Gene lists (ProbeSet, gene, ID, logFC, as produced in step 5)
• Process
– GEA, GSEA, …
• Output:
– GO classes /
gene groups /
pathways that are
over-represented
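One simple way to decide whether a gene group is over-represented in a gene list is a one-sided hypergeometric test. The sketch below assumes that approach; it is not necessarily what the tools named on the slide do (GSEA, for instance, works on ranked lists), and all the counts in the example are hypothetical.

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """P(X >= k) for a hypergeometric X.

    N: genes on the array, K: genes annotated to the category,
    n: genes in the selected list, k: selected genes in the category.
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical counts: 3226 genes analysed, 40 in a GO class,
# 50 genes selected, 5 of them in that class.
print(enrichment_pvalue(N=3226, K=40, n=50, k=5))
```

When many categories are tested, these p-values need the same multiplicity adjustment as the gene-level tests.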

Example of data analysis

Comparison of expression profiles
between BRCA1/BRCA2 tumours and
construction of a predictor able
to distinguish between them.

Source of the example
• Gene Expression Profiles in Hereditary
Breast Cancer
• Hedenfalk, I., et al., NEJM, Vol. 344,
No. 8, pp 539-548.
• Objective: to find a predictor based
on expression profiles to differentiate
BRCA1- and BRCA2-associated tumours

Analysis outline
• Experimental design and data for the
analysis
• Preprocessing
• Exploration
• Gene selection
• Construction of several predictors and
selection of the most appropriate one

Experimental design

• RNA extracted from
– 7 BRCA1 patients
– 8 BRCA2 patients
– 7 with “sporadic” cancer
• 6512 probes
– 5361 genes
• 3226 retained for the
analysis
• Reference design
– Each sample compared
against a non-tumorigenic
cell line (MCF-10A)

Array   Patient PID   BRCA1 v BRCA2 v Sporadic
s1321   20            Sporadic
s1996   1             BRCA1
s1822   5             BRCA1
s1714   3             BRCA1
s1224   7             BRCA1
s1252   2             BRCA1
s1510   4             BRCA1
s1900   10            BRCA2
s1787   9             BRCA2
s1721   8             BRCA2
s1486   22            BRCA2
s1572   16            Sporadic
s1324   17            Sporadic
s1649   15            Sporadic
s1320   18            Sporadic
s1542   19            Sporadic
s1281   21            Sporadic
s1905   6             BRCA1
s1816   13            BRCA2
Preprocessing:
Filtering and Normalization

Analysis (1). Gene selection
(class comparison)

• BRCA1 vs notBRCA1
• We use a t-test and
a cutoff of 0.0001
– that is, we declare as
differentially
expressed the genes
whose p-value is
below 0.0001
• No adjustments are made for
– Minimum FC
– Multiple testing

Results (1): Gene list

Order | Parametric p-value | FDR | Fold-change | Unique id | Description | Clone
1 | 1.66e-05 | 0.0198 | 2.24 | HV34H7 | ESTs | 247818
2 | 2.17e-05 | 0.0198 | 2.03 | UG5G3 | minichromosome maintenance deficient (S. cerevisiae) 7 | 46019
3 | 2.3e-05 | 0.0198 | 0.31 | HV17G6 | keratin 8 | 897781
4 | 3.37e-05 | 0.0198 | 1.89 | HV18E8 | SELENOPHOSPHATE SYNTHETASE ; Human selenium donor protein | 840702
5 | 3.63e-05 | 0.0198 | 2.21 | HV32C7 | ESTs | 307843
6 | 4.32e-05 | 0.0198 | 1.57 | UG1F1 | very low density lipoprotein receptor | 26082
7 | 4.5e-05 | 0.0198 | 1.67 | HV24F5 | chromobox homolog 3 (Drosophila HP1 gamma) | 566887
8 | 4.92e-05 | 0.0198 | 2.02 | LO3F1 | butyrate response factor 1 (EGF-response factor 1) | 366647
9 | 9.43e-05 | 0.0338 | 1.85 | HV9E3 | "tumor protein p53-binding protein, 2" | 212198

Analysis (2):
Construction of a predictor

• We build
predictors by 6
different methods.
• Candidate genes come from
class comparison.
• We choose the one
with the lowest
prediction error rate
(estimated by
leave-one-out)
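The leave-one-out estimate mentioned above can be sketched as follows, using a nearest-centroid classifier as a stand-in for any of the methods compared. The two-gene data set and all function names are hypothetical. For a fair estimate the gene selection step would also have to be repeated inside each fold, which this sketch omits for brevity.

```python
from math import dist

def centroid(rows):
    """Column-wise mean of a list of expression vectors."""
    return [sum(col) / len(col) for col in zip(*rows)]

def nearest_centroid_predict(train_x, train_y, x):
    """Assign x to the class whose training centroid is closest."""
    centroids = {
        label: centroid([r for r, y in zip(train_x, train_y) if y == label])
        for label in set(train_y)
    }
    return min(centroids, key=lambda label: dist(centroids[label], x))

def loocv_error(xs, ys):
    """Leave-one-out error rate: each sample is predicted by a model
    trained on all the other samples."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)

# Hypothetical expression values for two genes in six samples.
xs = [[1.0, 0.9], [1.2, 1.1], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
ys = ["BRCA1"] * 3 + ["notBRCA1"] * 3
print(loocv_error(xs, ys))   # 0.0 for this well-separated toy data
```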

Results (2i)

Correct? (per predictor)
Array id  Class label  CCP  DLDA  1-NN  3-NN  NC   SVM  BCCP
s1224     BRCA1        YES  YES   YES   YES   YES  YES  YES
s1252     BRCA1        YES  YES   NO    NO    YES  YES  YES
s1510     BRCA1        NO   YES   NO    NO    NO   NO   NO
s1714     BRCA1        NO   YES   NO    NO    NO   NO   NO
s1996     BRCA1        YES  YES   NO    YES   YES  YES  NA
s1063     notBRCA1     YES  YES   YES   YES   YES  YES  YES
s1281     notBRCA1     YES  YES   YES   YES   YES  YES  NA
s1320     notBRCA1     NO   YES   YES   YES   YES  YES  YES
s1321     notBRCA1     NO   NO    NO    NO    NO   NO   NO
Correctly classified   82%  95%   77%   82%   86%  86%  85%

CCP = Compound Covariate Predictor; DLDA = Diagonal Linear Discriminant Analysis;
1-NN = 1-Nearest Neighbor; 3-NN = 3-Nearest Neighbors; NC = Nearest Centroid;
SVM = Support Vector Machines; BCCP = Bayesian Compound Covariate Predictor

Results (2ii)
Performance of the Diagonal Linear Discriminant Analysis Classifier:

Class Sensitivity Specificity PPV NPV
BRCA1 1 0.933 0.875 1
notBRCA1 0.933 1 1 0.875
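The four measures in the table follow directly from a 2×2 confusion matrix. The counts used below are inferred, not reported on the slide: they are an assumption based on the 7 BRCA1 and 15 non-BRCA1 arrays of the design, chosen because they reproduce the BRCA1 row and the 95% overall accuracy of the DLDA classifier.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positives among actual positives
        "specificity": tn / (tn + fp),   # true negatives among actual negatives
        "ppv": tp / (tp + fp),           # true positives among predicted positives
        "npv": tn / (tn + fn),           # true negatives among predicted negatives
    }

# Inferred counts with BRCA1 as the positive class (assumption, see above).
brca1 = diagnostic_metrics(tp=7, fn=0, fp=1, tn=14)
for name, value in brca1.items():
    print(f"{name}: {value:.3f}")

accuracy = (7 + 14) / 22
print(f"accuracy: {accuracy:.3f}")
```

Swapping the roles of the two classes gives the notBRCA1 row: sensitivity and specificity exchange places, as do PPV and NPV.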

Final classifier: coefficients and criteria

A sample is classified to the class BRCA1 if the sum is greater than the threshold,
that is, $\sum_i w_i x_i > \text{threshold}$, where $w_i$ is the coefficient of gene $i$ and $x_i$ its expression value.
The threshold for the Diagonal Linear Discriminant predictor is 91.124

              1       2       3       ….  51      52
Genes         HK1A11  HV10D8  HV11A6  ….  HV28G8  HV2B1
Coefficients  2.57    3.31    2.79    ….  3.01    5.52
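The decision rule is a weighted sum compared against a cutoff, which can be sketched as below. Only the first three coefficients from the table are used, so the expression values and the `toy_threshold` are hypothetical; the real classifier sums over all 52 genes and uses the threshold 91.124.

```python
def classify(weights, expression, threshold):
    """Return the predicted class and the weighted sum sum_i w_i * x_i."""
    score = sum(w * x for w, x in zip(weights, expression))
    label = "BRCA1" if score > threshold else "notBRCA1"
    return label, score

weights = [2.57, 3.31, 2.79]     # coefficients of HK1A11, HV10D8, HV11A6 (from the table)
expression = [1.0, 2.0, 0.5]     # hypothetical expression values for one sample
toy_threshold = 10.0             # hypothetical cutoff for this 3-gene illustration

label, score = classify(weights, expression, toy_threshold)
print(label, round(score, 3))
```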

Summing up…
• Microarray analysis can be viewed
as a process.

• It is important to know
• the appropriate methods for each problem,
• the parameters, the meaning and the limitations of
each step.

• An adequate application of the process
provides relevant information such as...
• a list of differentially expressed genes
(biomarkers).
• a model with predictive ability (signature)

Limitations of the method

Criticisms, advice, consensus and
“state of the art”

Limitations of microarrays

An array of problems?
• Poor reproducibility between studies
– Little overlap between gene lists
– Predictions not reproduced on
new test sets
• Lack of standards
• Lack of consensus on methods
• The move to the clinic is always yet to come

• Mid-decade: promise or
reality?

Some points of consensus (Allison 2006)

• Design
– Biological replication is essential
– There is strength in numbers: power & sample size
– Pooling biological samples can be useful

• Selection of differentially expressed genes
– Using FC alone as a differential expression test is not valid
– 'Shrinkage' is a good thing
– FDR is a good alternative to conventional multiple-testing approaches

• Classification and prediction
– Unsupervised classification is overused
– Unsupervised classification should be validated using
resampling
– Supervised classification requires independent
cross-validation

Not all studies are done well...

• Dupuy & Simon review 90 publications.
– Detailed analysis of the methods used in 42.

• They find some common errors
– Poorly defined objectives.
– No control of multiplicity
10^4 genes → 10^4 tests → very high P(false positive)
– The reliability of a predictor is not reported properly.
– No independent test set is used.
– Cluster analysis is overused everywhere.
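The multiplicity problem is easy to quantify. The sketch below assumes, for illustration only, that no gene is truly differentially expressed and that the tests are independent; under those assumptions it gives the expected number of false positives and the probability of at least one.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha

def prob_at_least_one(n_tests, alpha):
    """P(at least one false positive) for independent tests, all nulls true."""
    return 1 - (1 - alpha) ** n_tests

# 10^4 genes tested at the conventional 0.05 level.
print(expected_false_positives(10**4, 0.05))    # about 500 genes selected by chance
print(prob_at_least_one(10**4, 0.05))           # essentially 1

# The BRCA1 example: 3226 genes tested with a cutoff of 0.0001.
print(expected_false_positives(3226, 0.0001))   # about 0.32
```

This is why a stringent cutoff, or an explicit FDR adjustment, is needed when thousands of genes are tested at once.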

Although it can be done well if...

• One takes care to... (do’s)
– Define the objectives well.
– Combine the p-value and the FC when selecting
genes.
– Use the FDR for multiplicity
control.
– Validate a predictor with an independent
test set.
– Involve a statistician
• One avoids... (don’ts)
– Basing the selection solely
on “Fold Change”
– Using p-values of 0.05
– Using clustering methods
if the aim is to
classify samples.
– Violating the basic principle
of validation (the test set must not be
used before validation).

... Up to 40 “do’s” and “don’ts” in Table 3 of Dupuy and Simon (JNCI 99 (2): 147-157).

Summing up
• Microarrays have some
limitations, reasonable and intrinsic ones.
• An adequate use of the analysis
methods can generate useful,
reliable and reproducible information.
• Even so, the step from the clinic to
diagnostics is slower than
expected.

Why?

From basic research to
microarray-based diagnostics

When?

But there are very few diagnostic kits...

Some of the difficulties
• Very large studies are needed to establish the
power of a diagnostic (kit) and to validate it in an
independent and sufficiently large cohort.

• Standardization and quality-control systems validated
according to clinical laboratory criteria
are needed.

• Expression-profile tests must comply with the
regulations of the European Medicines Agency and/or the FDA.

• To justify their development, cost-effectiveness studies
are needed that suggest a clear improvement in
patient treatment, plus return on investment and
profits in the medium/long term.

Status of microarray-based
diagnostics

[Figure legend: filled / empty markers]

Summing up
• The growing quality and size of the
studies are expected to generate new expression profiles
that can be carried over to diagnostics.

• Aspects such as standardization and automation
(robotics) to minimize human intervention
keep improving.

• Others, such as regulation by the agencies and
reimbursement policies for investors and
laboratories, still have to be worked out.

• A future in which the “lab-on-a-chip”
is part of the clinician’s toolkit is not unlikely.
