Lab Gene Expression Data Analysis
1. Lab#.
Data manipulation: Biostatistic & Gene expression data analysis
(Microarray, NGS & qRT-PCR)
Theme: Transcriptional Program in the Response of Human
Fibroblasts to Serum.
Etienne Z. Gnimpieba
BRIN WS 2012
Sioux Falls, May 30 2012
Etienne.gnimpieba@usd.edu
2. Data manipulation Gene expression data analysis
OMIC World
[Figure: the OMIC world. Labels recovered from the diagram: Genomics (DNA), Transcription, Transcriptomics (mRNA), Translation, Proteomics (E), Catalyse, Metabolomics (S, P), Degradation, Repression, Functional Gene.]
4. Data manipulation Gene expression data analysis
Excel used in genomics
• How to select columns
• How to use functions
• How to anchor a cell value in a function
• How to copy the function result and not the
function itself
• How to sort data by columns
• How to search and replace
• Frouin, V. & Gidrol, X. (2005) • Transcriptome ENS (France) • CBB group (Berlin)
Etienne Z. Gnimpieba, BRIN WS 2012, Sioux Falls, May 31 2012
5. Data manipulation Gene expression data analysis
Excel used in genomics: Pre-treatment
Centering and scaling data
1. Open the file containing the experiment series (your expression matrix) in Excel, using the
tab character as the column separator. Click on the second sheet, named Fibroblast real, and
look it over quickly: it is a realistic data set from a microarray experiment. Then click back on
the first sheet, named Fibroblast lab. We will be using this condensed version.
2. For one column (corresponding to one DNA microarray experiment), calculate the mean value using the
Excel AVERAGE function. Check whether the value obtained is equal to zero.
3. If it is not, center each experiment (15MIN, 30MIN, 2HR, etc.) by subtracting the column mean from
each log2(Ratio) value:
- Subtract the average value of each column from the corresponding individual values (for the
first example, B2-$B$37). Place these values in the corresponding table on the right (starting at R2).
Drag the fill handle down to quickly finish a column.
- Continue to center the data for each column (each DNA microarray experiment), filling in the
blank table to the right. Again use the AVERAGE function to find the mean value of each column in
the new table. Each average should now be zero.
- Be careful: if there are missing values (empty cells), fill them with the text NULL or NA, so that
Excel does not introduce a zero value for that cell in its calculations. A missing value is
different from a true zero!
- Be careful with the decimal separator in Excel (dot or comma)!
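For readers who want to check their spreadsheet result outside Excel, the centering step above can be sketched in a few lines of Python. This is only an illustration: the function name `center_column` and the sample values are hypothetical and do not come from the Fibroblast lab sheet. Missing values are represented by `None` and are skipped rather than counted as zeros.

```python
from statistics import mean


def center_column(values):
    """Subtract the column mean from every value; None marks a missing value."""
    present = [v for v in values if v is not None]
    if not present:
        # Nothing to average: return the column unchanged
        return list(values)
    col_mean = mean(present)
    # Missing values stay missing instead of silently becoming zeros
    return [None if v is None else v - col_mean for v in values]


# Hypothetical log2(ratio) values for one experiment (one column)
column = [0.5, -0.25, None, 1.25, 0.5]
centered = center_column(column)
print(centered)
```

After centering, the mean of the non-missing values in the column is zero, which is the same check the lab asks you to make with AVERAGE.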
6. Data manipulation Gene expression data analysis
Excel used in genomics : Differential expression analysis (1)
SAM (Significance Analysis of Microarrays) is an Excel macro that searches for differentially expressed
genes using a bootstrapping method. Website: http://www-stat.stanford.edu/~tibs/SAM/
Significance Analysis of Microarrays (SAM):
SAM is an Excel macro freely available to academics on the web. Because it runs inside an Excel
spreadsheet, this tool is easier to use for most microarray users. Using SAM requires several
modifications to your data file:
- The ratio or intensity values in the Excel sheet must use points, not commas, as the decimal
separator.
- The header line depends on the type of analysis you want to perform. You can refer to the SAM manual
for more information. You must highlight your header if you don't want to lose the experiment
information.
- Two annotation columns are available. SAM always references its calculations to the line number in
the original sheet.
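The decimal separator requirement can be illustrated with a small Python sketch. The helper `parse_ratio` is hypothetical and is not part of SAM or Excel; it assumes cells arrive as text, that a comma is only ever used as a decimal separator (no thousands separators), and that missing values are written as an empty cell, NA or NULL.

```python
def parse_ratio(cell):
    """Convert a text cell to float, accepting ',' or '.' as decimal separator."""
    text = cell.strip()
    # Treat empty cells and NA/NULL markers as missing, not as zero
    if text == "" or text.upper() in {"NA", "NULL"}:
        return None
    # Assumption: a comma is a decimal separator, never a thousands separator
    return float(text.replace(",", "."))


print(parse_ratio("1,25"))   # comma-decimal input
print(parse_ratio("-0.5"))   # point-decimal input
print(parse_ratio("NA"))     # missing value
```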
7. Data manipulation Gene expression data analysis
Excel used in genomics : Differential expression analysis (2)
Under the Add-Ins tab, find the “SAM” toolbar command. Highlight cells R2 to AF37, then select
SAM. When the SAM macro is launched from the toolbar, a settings window appears. For further
information on the various options you can choose, it is best to refer to the SAM manual. However,
the first important thing to do is to indicate whether or not the source data have been log2-transformed. In
this case we will select Unlogged. Then, as data bootstrapping uses a random number generator, you need to
initialize it several times by selecting “Generate Random Seed”.
Click “OK”. Once all the chosen iterations have been done, SAM displays a plot representing each
gene's score in the real distribution against its score in the random distributions.
The differentially expressed genes are therefore the ones that move away from the 45° slope line.
The table that appears indicates for each delta value, the number of putative differentially expressed
genes, the significant genes, and the number of false positive genes estimated using the False
Discovery Rate (FDR). The user can change the delta value according to the number of false positive
or significant genes he or she wants to obtain.
Choose a delta value by selecting “Manually Enter Delta”. Enter your own delta value between 0 and
0.25. Then if you select the “List Significant Genes” button, SAM displays the list of differentially
expressed genes in the “SAM output” sheet according to the delta value you chose.
This sheet summarizes the selected parameters and gives you the list of induced and repressed
genes.
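To make the idea behind the plot and the delta threshold more concrete, here is a deliberately simplified Python sketch. It is not the SAM add-in's algorithm: it assumes a one-class design, uses a fixed constant `s0` instead of SAM's data-driven choice, resamples by random sign flips, and applies a plain symmetric cutoff. All names (`d_score`, `sam_like`) and the toy data are hypothetical, and the delta scale of this toy example is unrelated to the 0 to 0.25 range used in the lab.

```python
import random
from math import sqrt
from statistics import mean, stdev


def d_score(values, s0):
    """One-class relative difference: mean / (standard error + s0)."""
    return mean(values) / (stdev(values) / sqrt(len(values)) + s0)


def sam_like(rows, delta, s0=0.1, n_perm=200, seed=0):
    """Return indices of genes whose observed score departs from the
    resampling-based expected score by more than delta."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = sorted((d_score(r, s0), i) for i, r in enumerate(rows))
    expected = [0.0] * len(rows)
    for _ in range(n_perm):
        # Under "no change", a positive or negative log ratio is equally likely
        perm = sorted(
            d_score([v * rng.choice((-1, 1)) for v in r], s0) for r in rows
        )
        for rank, d in enumerate(perm):
            expected[rank] += d / n_perm
    # Rank-by-rank comparison: the distance from the 45 degree line
    return [i for (d, i), e in zip(observed, expected) if abs(d - e) > delta]


# Toy log2(ratio) replicates: gene 0 is consistently induced, the rest are noise
rows = [
    [2.0, 2.1, 1.9, 2.2],
    [0.1, -0.2, 0.15, -0.05],
    [-0.1, 0.05, 0.2, -0.15],
    [0.05, -0.1, -0.05, 0.1],
    [0.2, -0.15, 0.1, -0.1],
]
significant = sam_like(rows, delta=5.0)
print(significant)
```

Raising `delta` shortens the list of called genes and lowering it lengthens the list, which mirrors the trade-off between significant genes and false positives described above.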
8. Data manipulation Gene expression data analysis
GEPAS: Gene Expression Pattern Analysis Suite
Review this section. Become familiar with the suite on your own by reviewing each section listed
under Tools.
Verify that the data file FibroGEPAS.txt is in your folder
Open the file
Open the GEPAS portal at
http://www.transcriptome.ens.fr/gepas/index.html
Click on “Tools”
Preprocessing
Preprocess DNA array data files: log-transformation, replicate
handling, missing value imputation, filtering and
normalization
Filtering
Viewing
Clustering
Differential expression
Classification
Data mining
9. Gene Expression Data Analysis
Context
Statement of problem / Case study:
The temporal program of gene expression during a model physiological response of human cells, the response of fibroblasts to serum, was explored with a
complementary DNA microarray representing about 8600 different human genes. Genes could be clustered into groups on the basis of their temporal patterns of expression in
this program. Many features of the transcriptional program appeared to be related to the physiology of wound repair, suggesting that fibroblasts play a larger and richer role in
this complex multicellular response than had previously been appreciated.
Specification & aims
Aim:
The purpose of this lab is to initiate a gene expression data analysis process.
We simulated the application on “Transcriptional Program in the Response of
Human Fibroblasts to Serum”. Now we can understand how a researcher can
come to identify a significantly expressed gene from microarray datasets.
Resolution process
T1. Gene expression overview
T1.1. Review of the place of genomics in the OMIC world
T1.2. Microarray data techniques and process
T1.3. Data analysis cycle and tools
T2. Excel used in Genomics
Objective: use basic Excel functionalities to solve some gene
expression data analysis needs
T2.1. Column manipulation, functions used, anchor, copy with
function, sort data, search and replace
T2.2. Experiment comparison: Data pre-treatment
T2.3. Differentially expressed genes from replicate experiments (SAM)
T3. GEPAS: Gene Expression Pattern Analysis Suite
Objective: use the GEPAS suite to apply the whole microarray data
analysis process to the fibroblast data.
Preprocessing
Viewing
Clustering
Differential expression
Classification
Data mining
[Figure: microarray workflow. Labels recovered from the diagram: Target preparation, Hybridization, Slide scanning, Expression profile clustering, Data analysis.]
Acquired skills
- Gene expression data overview
- Excel used for genomics
- Microarray data analysis using GEPAS
Conclusion: ?
Vishwanath R. Iyer et al., Science, 1999
During this lab, we have: a brief review, the lab's template, genome exploration practice…
Once you have your normalized data file, open it with Excel. You can filter out weak-intensity spots (eliminate the weakest intensities in both channels) and keep spots with a ratio greater than 1 or lower than -1. Remember that we are working with log2(ratio), so log2(2)=1. This method, called “fold change”, was the one used in the early days of microarray analysis and is still useful if you do not have enough replicates to apply statistical treatments.
The “fold change” method lacks accuracy regarding the significance threshold to be fixed. That is why it is useful to apply a statistical method able to take into account intensity variations and, most of all, the variability among experiments.
Significance Analysis of Microarrays (SAM):
SAM is an Excel macro freely available to academics on the web. Because it runs inside an Excel spreadsheet, this tool is easier to use for most microarray users. Using SAM requires several modifications to your data file:
- The ratio or intensity values in the Excel sheet must use points, not commas, as the decimal separator.
- The header line depends on the type of analysis you want to perform. You can refer to the SAM manual for more information. So you must duplicate your header if you don't want to lose the experiment information (see image below).
- Two annotation columns are available. SAM always references its calculations to the line number in the original sheet.
Before launching the macro, it is necessary to select the data precisely, because SAM rejects lines with too many missing values (such as empty lines).
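The “fold change” filter described in these notes can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `fold_change_filter` and the gene labels are hypothetical, the input is assumed to hold one log2(ratio) per gene, and missing values (`None`) are simply excluded. The weak-intensity filter mentioned in the notes is not shown.

```python
def fold_change_filter(log2_ratios, threshold=1.0):
    """Keep genes whose log2(ratio) is greater than threshold or lower than -threshold."""
    return {
        gene: ratio
        for gene, ratio in log2_ratios.items()
        # Missing values are skipped rather than treated as zero
        if ratio is not None and abs(ratio) > threshold
    }


# Hypothetical spots; a log2(ratio) of 1 corresponds to a two-fold change
spots = {
    "GENE_A": 1.8,
    "GENE_B": -0.4,
    "GENE_C": -1.3,
    "GENE_D": None,
    "GENE_E": 1.0,
}
selected = fold_change_filter(spots)
print(selected)
```

Note that the comparison is strict, following the wording "greater than 1 or lower than -1", so a gene sitting exactly at a two-fold change (GENE_E here) is not kept.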