SlideShare una empresa de Scribd logo
1 de 79
Descargar para leer sin conexión
Bioinformatics and Computational Biosciences Branch  
NIAID Office of Cyber Infrastructure and Computational Biology 
 
NIH Intranet: http://bioinformatics.niaid.nih.gov 
ScienceApps@niaid.nih.gov 
‘
Training Manual
Crash Course:
R & BioConductor
Jeff Skinner, M.S.
Sudhir Varma, Ph.D.
Download a copy of this manual and all related training materials at:
http://collab.niaid.nih.gov/sites/research/SIG/Bioinformatics/
Crash Course: R & BioConductor
Table of Contents
Ch. 1. Introduction to R ......................................................................................................................................... 1 
1.1  What is R? What is BioConductor? ........................................................................................................ 1 
1.2  A Brief History of R................................................................................................................................. 1 
1.3  Download R and BioConductor............................................................................................................... 2 
1.4  Licensing Concerns.................................................................................................................................. 3 
1.5  Helpful Resources .................................................................................................................................... 3 
Ch. 2. Basics of Using R........................................................................................................................................ 4 
2.1  Computing Environments ........................................................................................................................ 4 
2.1  R GUI....................................................................................................................................................... 5 
2.3  Basic Arithmetic....................................................................................................................................... 8 
2.4  Searching the Help Menus ..................................................................................................................... 12 
2.5  Installing R Packages and Source Scripts .............................................................................................. 14 
2.6  Entering and Importing Data.................................................................................................................. 16 
2.7  Data Types.............................................................................................................................................. 22 
2.8  Manipulating Data in R.......................................................................................................................... 26 
2.9  Saving and Exporting Data .................................................................................................................... 31 
2.10  Changing Directories.......................................................................................................................... 32 
2.11  Sample Problems for Students............................................................................................................ 32 
Ch. 3. Graphics and Figures in R......................................................................................................................... 34 
3.1  Basic Types of Graphics and Figures..................................................................................................... 34 
3.2  Custom Titles, Subtitles and Axes Labels.............................................................................................. 40 
3.3  Custom Color and Layout Options......................................................................................................... 44 
3.4  Multi-step Graphics................................................................................................................................ 48 
3.5  Figure Legends and Overlaid Text......................................................................................................... 54 
3.6  Multi-panel Layouts............................................................................................................................... 58 
3.7  Exporting R Graphics............................................................................................................................. 60 
3.8  Sample Problems for Students ............................................................................................................... 60 
Ch. 4. Basic Statistical Tests and Analyses in R.................................................................................................. 62 
4.1  Student’s T-test ...................................................................................................................................... 62 
4.2  Linear Regression and ANOVA ............................................................................................................ 63 
4.3  R Commander ........................................................................................................................................ 68 
4.4  Sample Problems for Students ............................................................................................................... 69 
Ch. 5. Writing Basic Scripts in R......................................................................................................................... 70 
5.1  Text Editors............................................................................................................................................ 70 
5.2  Hello World! .......................................................................................................................................... 71 
5.3  Use Scripts to Automate and Save Workflows...................................................................................... 72 
5.4  Computation and Output Options .......................................................................................................... 73 
5.5  Sample Problems for Students ............................................................................................................... 77 
Literature Cited..................................................................................................................................................... 77 
Crash Course: R & BioConductor
1
Ch. 1. Introduction to R
1.1 What is R? What is BioConductor?
Many biologists and researchers have heard about the powerful analysis and visualization capabilities of
R and BioConductor, even if they have not used R or BioConductor themselves. It can be tempting to think of
R as a typical statistics software package, but that would belie its true power and capabilities. R combines an
open source software platform for statistics and data visualization with a powerful scripting language that can
be used to create new analyses and workflows. Both the software package itself and its scripting language are
called R. While most people use R for statistical analyses and data visualization, R can be used for matrix
algebra computations, data management and enterprise reporting. Advanced users will find that R interacts well
with databases, some commercial statistics software packages and many programming languages (e.g. Perl,
Python, Java, Fortran, C, HTML, TeX and LaTeX), so R can be utilized in many complicated computing
problems.
BioConductor is an open source software development project that creates new tools for the analysis and
comprehension of genomic. The BioConductor project is almost entirely concerned with the development and
distribution of R package libraries for the analysis of microarray and other genomic data. Packages are
available for doing various kinds of annotation, normalization, filtering, statistical analysis and visualization of
the experimental data from microarray and other genomic studies.
1.2 A Brief History of R
The history of R begins at AT&T Bell Laboratories in the 1970’s with the development of the S
statistics package by John Chambers, Richard Becker and others. In the early 1970’s, researchers and
statisticians at Bell labs were using a library of FORTRAN programs called Statistical Computing Subroutines
(SCS) to compute all their statistical analyses (Becker 1994). This FORTRAN library was preferable to the
commercial statistics packages available in the 1970’s, because the statisticians at Bell Labs were constantly
developing new statistical methods and they wanted specialized reports of their statistical results. However, this
SCS library was too cumbersome for many simple statistical analyses and graphs, like Student’s t-tests or linear
regression methods. The S statistics package was created to provide an interactive programming language and
computing environment to simplify the procedures in the SCS FORTRAN library, while still providing a
flexible platform to program and develop new statistical and graphical methods.
To make statistical computing more interactive, the S programming language was designed to have the
most natural grammar and syntax possible. The goal was to create a higher-level programming language that
would be similar to regular English. Most users would write their S code using basic function statements in this
higher-level S language, while more advanced users and developers could still create new code in lower-level
languages like FORTRAN. The original S language featured advanced text editing and powerful graphics, with
the usual statistical tests.
The S software was used internally at Bell labs in the late 1970’s and it was distributed publically by the
early 1980’s. The first textbook about the S language was published in 1984 (Becker and Chambers 1984) and
the S software was made publically available through AT&T’s software sales group. Later, the S software was
rewritten in C and combined with another quantitative computing project at Bell Labs to create New S (Becker
et al. 1988). By the early 1990’s, then S statistics software had found thousands of users and a handful of books
had been published on the S language.
The R statistical software package was initially published and released in 1996 (Ihaka and Gentleman
1996). The goal was to create a flexible statistics software and programming language that utilized the best
features of the S statistics software package and a functional programming language called Scheme (Sussman
and Steele 1975). The name R was chosen to represent the first names of its developers, Ross Ihaka and Robert
Gentleman, and also as a play on the name of the S software and programming language (Hornik 2008).
Crash Course: R & BioConductor
2
Shortly after its release, the Comprehensive R Archive Network (CRAN) was opened and R became an official
part of the GNU Project (http://www.gnu.org). The first stable release of R was offered in February, 2000.
1.3 Download R and BioConductor
If you need to download R, visit the Comprehensive R Archive Network (CRAN) website (http://cran.r-
project.org/) and look for the download links (Figure 1). There are at least three platform-specific download
links for Linux, Mac OSX and Windows operating systems. Click these links to download a ready-to-use
installation of R software. There are additional links to download individual source or binary files, so expert
users can build their own custom installation of R. These links to the source and binary files also include “daily
snapshots” of future versions of R. Remember that R is open source software, so everyone is welcome to
modify its code and contribute to upcoming versions of the software. Most biology researchers should probably
use the platform-specific download links, but remember that the custom installation options are available.
After you have downloaded R, you may want to download the BioConductor packages from their
website (Figure 2). If you follow the installation instructions from the BioConductor website, you need to type
the commands
> source("http://bioconductor.org/biocLite.R")
> biocLite()
into the R command line. These two commands will download and install all of the basic BioConductor
packages on your computer. There are many additional BioConductor packages available for download, but
you will want to download them as needed. You can also download the biocLite() packages one at a time,
but many of the biocLite() packages are required more advanced BioConductor packages therefore it makes
sense to download biocLite() now. Installing R packages will be covered in greater detail in Section 2.5 of
the manual.
Figure 1. The Comprehensive R Archive Network (CRAN) website.
Crash Course: R & BioConductor
3
Figure 2. The BioConductor website.
1.4 Licensing Concerns
It is important to remember that R is open source software, distributed under the GNU Public License
(GPL). It may be a good idea to review the terms of the GPL before delving into a project using R. If you just
intend to use existing R packages to analyze and visualize data in published experiments, then you probably do
not have much to worry about. However, if you want to modify and distributed R software, or more
importantly if you want to use R software as a part of a patented process or product, you should review the
license very carefully.
1.5 Helpful Resources
Because R is a free, open source software program, there is no corporate office to call or email for
technical support. However, there are many resources available to help users learn to use R. Visit the R project
website (http://www.R-project.org) to find free manuals, a FAQ page, a list of published books on R, the R
Wiki and various mailing lists. You can find extensive documentation of individual R functions by using its
help() commands, as demonstrated in section 2.4 of this manual. Some historic books on the S and R software
packages include “the blue book” (Becker et al. 1988), “the white book” (Chambers and Hastie 1992) and “the
green book” (Chambers 1998), but there are now dozens of statistics and programming text books for the R and
S languages. The R-help mailing list is sometimes your best bet for person-to-person help with R and its
functions, but it is important to read their posting guide before posting new messages to the mailing list.
Crash Course: R & BioConductor
4
Ch. 2. Basics of Using R
2.1 Computing Environments
Most biologists or researchers will use R and BioConductor on a Windows PC or Mac computer using
the standard R GUI interface for their platform. However, R and BioConductor can also be run as command
line applications on a PC, Mac, Linux or Unix machine, a server or even a high performance parallel computing
cluster. There is no real advantage or disadvantage to using R from the command line or the R GUI. The
features of R remain the same, no matter how you choose to access R. However, some users with programming
experience may feel more comfortable using R from the command line.
2.1.1 MS Windows Command Prompt
On a Windows PC, you can access R from the command line by opening the MS Windows Command
Prompt (Figure 3). For many Windows PC users, the command line can be found by clicking > Start > All
Programs > Accessories > Command Prompt. At the Command Prompt, type the capital letter “R” and hit
the <Return> key to open the R software package. You should immediately see a message from R, reporting
your version number and the license information. If R does not open at the Command Prompt, you may need to
specify a path within the MS Windows operating system.
Figure 3. The MS Windows Command Prompt.
2.1.2 Mac OSX Terminal
On an Apple Macintosh computer, you can access R from the command line by opening the Terminal
(Figure 4). You will likely find Terminal.app in the Utilities folder within your Applications folder in the OSX
Finder. If you cannot find Terminal.app, try searching for “terminal” in the OSX Spotlight. At the Terminal,
type the capital letter “R” and hit the <Return> key to open the R software package. You should immediately
see a message from R, reporting your version number and the license information.
2.1.3 UNIX shells and SSH clients
Linux and UNIX users can access R from the command line in the Bourne shell (sh), Bourne-Again
shell (bash), C shell (csh) or other command line terminals. Type the capital letter “R” and hit the <Return>
key at the command line to open the R software package, and you should immediately see the message from R
Crash Course: R & BioConductor
5
Figure 4. The Apple Macintosh OSX 10.5.5 Terminal.
to report your version number and license information. Another possible option is to access R from a secure
shell (SSH) client. This option allows you to install R on a powerful UNIX machine or server, then access the
R software on this powerful machine from another machine connected to the internet.
2.1 R GUI
If you are not comfortable accessing R from the command line, you can access R from the R GUI that is
included in the usual Windows or Mac download. The R GUI provides a few point-and-click buttons and
menus to help you open and save files, download new R packages or even edit R scripts. All the features from
these point-and-click buttons and menus can be accessed from the command line, but some users may prefer to
have these commonly used features accessible from a GUI button instead memorizing their specific commands.
Note that specific R GUI features differ slightly between the MS Windows GUI and the Apple Mac OSX GUI.
Both are described below.
2.2.1 Windows PC GUI
The current R GUI in MS Windows features seven clickable menus and eight clickable buttons to help
you access commonly used features (Figure 5). The File menu allows you to Source R code, create a New
script, Open script…, Display file(s)…, Load workspace…, Save workspace…, Load history…, Save history…,
Change dir…, Print…, Save to file… and Exit. Note the menu options Open script…, Load workspace… and
Save workspace… are also available in the first three clickable buttons from the left on the Windows R GUI,
while the Print option is available as the last button on the right of the R GUI. The Edit menu allows users to
Copy, Paste, Paste commands only, Copy and Paste, Select all, Clear console, open the Data editor… or
change the GUI preferences…, if necessary. Note the Copy, Paste and Copy and Paste commands are also
available as the fourth, fifth and sixth clickable buttons from the left. The View menu is used to hide or display
the Toolbar of buttons at the top of the R GUI window and the Statusbar of system messages at the bottom of
the R GUI (Figure 6). The Misc menu is used to Stop current computation, Stop all computations, Buffered
output, Word completion, File name completion, List objects, Remove all objects and List search path. The
Buffered output option prevents R from printing any messages while a command is running. The Word
completion and File name completion options allow you to complete the names of R commands and file names,
Crash Course: R & BioConductor
6
Figure 5. The MS Windows R GUI.
Figure 6. The Toolbar and Statusbar of the Windows R GUI.
respectively, by hitting the <TAB> key after typing the first letters of a command or filename. The List objects,
Remove all objects and List search path options are equivalent to the ls(), rm(list = ls()) and
search()commands, respectively. Note the option to Stop current computation can also be accessed by the
second clickable button from the right.
2.2.2 Mac OSX GUI
The R GUI in Mac OSX includes a menu bar at the top of your screen (Figure 7), and the GUI itself
includes 10 clickable icons to provide access to commonly used features or features specific to the Mac OSX R
GUI (Figure 8). The stop sign icon allows you to stop processing the most recently submitted R command, or
Interupt current R computation. This is a useful feature, because some R procedures can require lengthy
processing times that may stall or freeze some computers. The R icon is used to Source script or load data in R.
If you would like to write your own scripts to automate analysis workflows or create new analyses, this button
will allow you to quickly load the source files for your scripts. It also provides an easy way to load data. The
bar chart icon allows you to Open a new Quartz device window. On a Mac computer, the quartz window is
used to produce all graphics figures, like scatterplots and histograms. Opening a Quartz graphics device will
allow you to view your graphics figures as you build them. The X11 icon allows you to open an X11 window in
Mac OSX. This is a critical feature for the R GUI in Mac OSX, because many important R functions require an
open X11 window to work properly. The lock icon is used to Authorize R to run system commands as root.
This allows you to overwrite protected files and directories, so use this option with extreme caution. The table
Crash Course: R & BioConductor
7
Figure 7. The Mac OSX R GUI Menu Bar.
Figure 8. The Mac OSX R GUI console.
icon is used to Show / Hide R command history (Figure 9). The command history allow all of the commands
recently submitted to R, which can be a useful feature during lengthy R sessions when older commands may run
off the screen. The color wheel icon is used to Set R console colors (Figure 10). These options allow you to
change the color schemes within the R GUI console, the R GUI editor and the R GUI Quartz window. The R
sheet and blank sheet icons are used to Open document in editor and Create new, empty document in the editor,
repectively. The R GUI editor is a text editor environment within the Mac OSX R GUI that is typically used to
edit R source scripts. The R GUI editor for Mac OSX includes some helpful automatic text formatting features
to help you write and edit R code. The print icon is used to Print this document, which will print the R console.
Be careful with the print button, because the R console could contain hundreds of statements and produce a
lengthy printout. The switch icon is used to Quit R, which closes the current R session and the R GUI.
Crash Course: R & BioConductor
8
Figure 9. The R command history window for the R GUI in Mac OSX.
Figure 10. The Set R console colors menu.
2.3 Basic Arithmetic
2.3.1 Addition, subtraction and other basic operations
Before you open your first R data set, it may be useful to explore some basic arithmetic operations in R.
Type a simple addition statement (e.g. 3 + 4) in the R prompt and hit <Return> to view the result (Figure 11).
Crash Course: R & BioConductor
9
Figure 11. An arithmetic operation entered into the R GUI.
From this point forward, I will describe all R commands in boxed Courier New text as shown below:
> 3 + 4
[1] 7
>
Note that user entered commands will be preceded by the “>” character, while output from R will typically be
preceded by an index number (e.g. [1]) or it will be displayed with special formatting.
Some keyboard characters are reserved for special functions in R. One special character in R is the “#”
symbol, which is used to add notes to R commands, scripts and code. In most situations, all text or code
preceded by the the “#” symbol will be ignored in R. Try it for yourself by entering
> # 3 + 4
>
into the R prompt. Notice the addition statement was not evaluated as before, because the sum was not
calculated. You can enter any kind of information after the “#” symbol, without any fear of an error or ruined R
scripts. You can even add these notes after a valid command in the same line of code, as seen below:
> 3 + 4 # This command produces the sum of 3 and 4
[1] 7
Now, enter the following commands into R to explore some basic arithmetic operations:
Crash Course: R & BioConductor
10
> 3 + 4 # Addition
[1] 7
> 3 - 4 # Subtraction
[1] -1
> 3*4 # Multiplication
[1] 12
> 3/4 # Division
[1] 0.75
> 3^4 # Exponents
[1] 81
> 3**4 # Another way to enter exponents
[1] 81
> log(3) # Natural logarithm (i.e. log base e)
[1] 1.098612
> log10(3) # Log base 10
[1] 0.4771213
> log2(3) # Log base 2
[1] 1.584963
> log(81,base=3) # Logarithms computed to any other base
[1] 4
> exp(1) # Base of the natural logarithm
[1] 2.718282
> pi # The constant pi
[1] 3.141593
You can use the equal sign “=” or an left-facing arrow “<-“ to define variables and compute simple
algebraic expressions in R. Note the equal sign is sometimes used inconsistently in R and sometimes creates
problems in lengthy scripts.
> a = 4 # Define a = 4
> b <- 3 # Define b = 3 (alternative coding)
> a*b # Multiply a*b
[1] 12
2.3.2 Inequalities
Beyond these simple mathematical operations, R can be used to evaluate many mathematical
inequalities:
> 3 < 4 # Strictly less than or greater than
[1] TRUE
> 3**4 >= 100 # Greater than or equal to
[1] FALSE
> log(81,3) == 4 # Equal to
[1] TRUE
Note that the double equal sign command “==” is used to evaluate an equality statement, because the single
equal sign command “=” is used to define variables. Soon, you will see that inequalities can provide a powerful
means to identify subsets of data, when used with an index.
2.3.3 Matrix algebra
More advanced users may want to use R for matrix algebra computations, like matrix multiplication or
matrix inversions. If matrix computations interest you, I believe you will find that R is a very powerful
platform for matrix computations that nearly rivals Matlab and similar platforms.
Crash Course: R & BioConductor
11
> aa <- c(1,2,3,4) # Define a column vector “aa”
> aa # Display the column vector “aa”
[1] 1 2 3 4
> t(aa) # Transpose “aa”
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
The command c() is used to create a column vector in R. In the example above, the command aa <-
c(1,2,3,4) was used to define a column vector “aa” with entries 1, 2, 3, and 4. The command t() is used to
transpose vectors and matrices, so t(aa) will convert “aa” from a column vector to a row vector. Notice the
difference between how the column vector aa and the row vector t(aa) are displayed. The column vector aa is
displayed in one line of output, where the vector is preceded by [1] and its entries are only separated by spaces.
The row vector t(aa) is displayed on two separate lines, which report its column and row entries. The output
[1,] denotes an entry on the first row of a matrix, and the output [,1] denotes an entry on the first column of a
matrix.
> aa*t(aa) # Element-wise multiplication
[,1] [,2] [,3] [,4]
[1,] 1 4 9 16
> t(aa)*aa # Element-wise multiplication
[,1] [,2] [,3] [,4]
[1,] 1 4 9 16
> aa%*%t(aa) # Matrix multiplication
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 4 6 8
[3,] 3 6 9 12
[4,] 4 8 12 16
> t(aa)%*%aa # Matrix multiplication
[,1]
[1,] 30
The usual multiplication operator “*” is not used for matrix multiplication. The “*” operator multiplies
vectors and matrices element-wise, while the “%*%” operator is used for matrix multiplications. Both functions
can be useful, but be careful to use the correct operator symbol for your calculations.
># Define a 3 x 3 matrix
> bb <- matrix(c(1,2,3,4,0,5,1,3,1),nrow=3,ncol=3)
>#
># Display the matrix
>#
> bb
[,1] [,2] [,3]
[1,] 1 4 1
[2,] 2 0 3
[3,] 3 5 1
>#
># Invert the matrix
>#
> solve(bb)
[,1] [,2] [,3]
[1,] -0.6521739 0.04347826 0.52173913
[2,] 0.3043478 -0.08695652 -0.04347826
[3,] 0.4347826 0.30434783 -0.34782609
The matrix() command is used to define a matrix from a vector. The matrix command is very robust
and can be used to create simple or complicated matrices with a few keystrokes. E.g. the command
Crash Course: R & BioConductor
12
matrix(4,nrow=3,ncol=5) would create a 3 x 5 matrix with every entry equal to 4. More details on these
techniques will be given in section 2.6. The command solve() will invert most symmetric matrices, but more
complicated Cholesky inverse and generalized inverse methods are also available.
2.4 Searching the Help Menus
There are many books, guides and manuals available to help you learn how to use R, but inevitably
every R user must search the help menus. The R help menus can help you find new functions or provide more
detailed explanations of the inputs and outputs of a function you have already used. There are several useful
help commands in R to find the documentation that you need.
2.4.1 Help documentation with help() and ?
> help(t.test) # Find documentation for the function t.test
> ?t.test # (same as above)
The two functions above are used to find help documentation for a specific function, when you already
know the function’s command. Try entering help(log) or ?log for another example. These help commands
are most useful if you need a detailed explanation of a function you already use or if you would like to
investigate a function you found in a paper or on the internet. The two commands are equivalent. Both produce
an HTML-formatted manual for the specified function (Figure 12). You will notice that most of the help
documentation for R functions follows very strict formatting. Each help page provides you the command
information to call the function, the function’s name, a description of the function purpose, details about its
usage within R, details about its arguments, details about its output and typically a specific example that you
copy-and-paste into R for demonstration purposes.
Figure 12. Help documentation for the function t.test.
Crash Course: R & BioConductor
13
2.4.2 Keyword searches with help.search()
Another type of search is required when you do not know the command of a specific function. Type
help.search(“keyword”) to search for keywords and find all the command names of functions related to your
keyword. For example, the command help.search(“students t test”) will produce a list of all functions
related to the student’s t-test (Figure 13). In this example, only the t.test() function is found by our search,
but other requests may generate many results.
Figure 13. List of help files from help.search keyword search.
2.4.3 Google and other search engines
One final suggestion is to use Google.com or other search engines to find help with R. I recommend
that you always include the keyword “CRAN” or “BioConductor” in your R-related Google searches, because it
can help direct you to search results that are most directly related to R packages and concepts (Figure 14). The
search engine Rseek (http://www.rseek.org/) is a search engine that only queries the R help files and related
websites.
Crash Course: R & BioConductor
14
Figure 14. Searching Google for help with R packages.
2.5 Installing R Packages and Source Scripts
Help searches and basic arithmetic functions are included in the base R software package, but often
researchers need to use specialized research tools that are not included in the base R software. These
specialized tools are often available as free downloadable packages or source scripts in R. Packages are user-
submitted R scripts and functions that have are made been posted online by CRAN or BioConductor. Packages
are downloaded and installed from the R GUI or the command line. The code for these packages is typically
downloaded as a source file, written in the R programming language, or as a binary, written in a compiled
language like C or FORTRAN. Some functions and scripts have not been submitted as packages to CRAN or
BioConductor, but they still may be loaded into your installation of R as a source file.
2.5.1 R packages
Click > Packages > Install Package(s)… to install packages from the Windows R GUI. If you are
installing packages for the first time, you may be prompted to Set CRAN mirror… or Select
Repositories…before you continue (Figure 15). Remember, the CRAN mirror site is a server that contains the
most recent R software downloads and packages. You want to choose one of the CRAN mirrors nearest you for
convenience. The package repositories are specific lists of packages from CRAN and BioConductor. Choose
only the repositories you need, because selecting more repositories will create a longer list of packages for you
to browse.
Crash Course: R & BioConductor
15
Figure 15. Set CRAN mirror…, Select Repositories… and R packages menus in Windows R GUI.
Use the scroll bar to browse through the list of R and BioConductor packages and select the packages
you need for installation (Figure 15). You can hold the Ctrl key to select multiple packages, if necessary. Note
the list of packages can be very long, especially if several repositories were selected. Once you have selected
the R packages you need, click OK to download and install the packages.
Alternatively, if you know the name and repository address of the packages you need to download, you
can download and install a package from the command line using the command install.packages(). There
is no advantage or disadvantage to using the R GUI or the command line to install a package, but the
install.packages() command can be very helpful when scripting. Using the install.packages()
command in your source code will ensure any users of your script will have all the necessary R packages. As
the packages download, you may see some log messages in your R console to keep you informed of the
download progress and any potential errors. After the packages have finished downloading, you will want to
enter a library() or require() command for the package to load the contents of the package into your R
workspace. Both commands have the same function, but the require() is preferred for use within R functions.
> install.packages("gtools",repos="http://cran.r-project.org")
trying URL 'http://cran.r-project.org/bin/windows/contrib/2.6/gtools_2.4.0.zip'
Content type 'application/zip' length 157621 bytes (153 Kb)
opened URL
downloaded 153 Kb
package 'gtools' successfully unpacked and MD5 sums checked
The downloaded packages are in
C:Documents and SettingsskinnerjLocal
SettingsTempRtmpIQ1HTcdownloaded_packages
updating HTML package descriptions
> library(gtools)
Crash Course: R & BioConductor
16
2.5.2 Source scripts
This manual will introduce the idea of R source scripts in Chapter 5, but keep in mind that you can also
upload new functions using R source scripts. If you have found a R source script that you need to upload, click
> File > Source R code… on the Windows R GUI or > File > Source File… on the Mac OSX R GUI. Use the
command source(), if you prefer to load the source script from the command line. Most casual R users will
only use the file parameter of the source() command.
> # Load a source script file (.R extension)
> source("~/example.R")
2.6 Entering and Importing Data
There are dozens of ways to enter data into R. Many famous and historical datasets are already
uploaded and available for use in the base R software or in an R package. There are hundreds of functions that
make it easy to type and enter small data sets manually into R. Other functions can help to generate huge
amounts of random or simulated data. Finally, there are a variety of functions available to help you upload your
own data files, whether they are stored as R workspace data (.Rdata), as plain text files (.txt, .csv, …), in
proprietary data file formats from other popular statistics packages (.sav, .sas7DAT, …), as MS Excel
spreadsheets (.xls, .xlsx, …) or even tables from a database (e.g. MS Access, MySQL, …).
2.6.1 Base and package data in R
Enter the command data() or click > Packages & Data > Data Manager in the Mac OSX R GUI to
view a list of the datasets currently available on your installation of R (Figure 16). Try to find the data set “iris”
in the list. Select the “iris” data set from the list, or enter the help command ?iris, to read the documentation
describing this data set. The “iris” data set in R is the famous “Fisher’s iris data”, originally collected by
biologist Edgar Anderson. This classic data set has been used in countless statistics and biology textbooks, and
I will use it later in this manual.
Figure 16. A list of R data sets.
Crash Course: R & BioConductor
17
Enter the command iris to view the Fisher’s iris data set, already stored in R.
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
>
The iris data set includes data for 150 iris plants, with 50 plants each from the Iris setosa, I. versicolor L.
and I. virginica L. species. Four measurements were taken from each of the 150 plants to record their sepal
length, sepal width, petal length and petal width. This data set is often used as an example in classification and
clustering problems, where statisticians use the data as a training set to predict the species of an iris plant based
on its sepal length, sepal width, petal length and petal width measurements.
2.6.2 Entering data manually
The base and package data available in R can be a useful resource, but most R users need to upload their
own data into R. Most researchers will already have their data stored in a large data file. However, for some
small data sets, it may be easiest to enter the data manually from the command line. In other situations,
researchers may need to simulate large amounts of data using procedures from the command line. Some
common R procedures will be used to generate a small dataset concerning the Alpha-fetoprotein (AFP) levels of
20 medical patients.
> # generate a list of subject IDs, numbered from 1 to 20
> #
> subject <- 1:20
> #
> # create 10 entries for male subjects
> #
> males <- rep("male",10)
> #
> # create 10 entries for female subjects
> #
> females <- rep("female",10)
> #
> # combine male and female entries into one column vector
> #
> gender <- c(males,females)
> #
> # bind subjectID and gender columns together
> #
> afp.data <- data.frame(subjectID,gender)
> afp.data
Recall the command subject <- 1:20 can be used to generate the sequence of integers from 1 to 20.
Alternatively, subject <- seq(from = 1, to = 20, by = 1) could have been used to generate the same
sequence. These numeric sequences could be used to specify a series of patient ID’s or subject ID’s. Here the
Crash Course: R & BioConductor
18
ID’s have been stored as a variable named subject. The command males <- rep("male",10) is used to
generate a vector of 10 replicated string values to label the male patients. A similar replicated vector is created
to identify 10 female patients, then the male and female labels are combined with the column vector command
c() to store a variable named gender. These two variables can be joined together to create the subject ID’s and
gender labels of a new data set named afp.data using the command data.frame(), which will be defined
later.
> # generate 10 male and 10 female random normal heights
> #
> height <- c(rnorm(10,70,2.5),rnorm(10,64,2.2))
> #
> # generate 10 male and 10 female random uniform weights
> #
> weight <- c(runif(10,155,320),runif(10,95,210))
> #
> # compute body mass index (BMI) for 10 men and 10 women
> #
> BMI <- (weight*703)/(height**2)
> #
> # enter five treatment levels of a new drug (ng/mL)
> #
> drug <- rep(x = seq(from = 0, to = 20, by = 5), times = 4)
> #
> # manually enter Alpha-fetoprotein (AFP) levels for 20 patients
> #
> AFP.before <- c(0.8,2.3,1.1,4.8,3.7,12.5,0.3,4.4,4.9,0.0,1.8,
2.4,23.6,8.9,0.7,3.3,3.1,0.5,2.7,4.5)
> #
> afp.data <- data.frame(afp.data,height,weight,BMI,AFP.before)
> afp.data
We can use random data procedures to create additional factors for the afp.data data set in R. Suppose
we want to randomly generate height and weight values for our male and female patients. Furthermore, assume
male heights are normally distributed with mean 70 inches (i.e. 5 foot 10 inches and standard deviation 2.5
inches, and female heights are normally distributed with mean 64 inches and standard deviation 2.2 inches. The
command rnorm(10,70,2.5) is used to randomly generate 10 new observations from a Gaussian normal
distribution with mean 70 and standard deviation 2.5 to represent the heights of our male patients, while
rnorm(10,64,2.2) will generate the data for our female patients. Next, we want to generate weight values for
male and female patients, assuming male weights are uniformly distributed between 155 pounds and 320
pounds (i.e. each mass between 155 lbs. and 320 lbs is equally likely), while female weights are uniformly
distributed between 95 lbs and 210 lbs. The commands runif(10,155,320) and runif(10,95,210) will
generate the male and female data, respectively. We could use the height and weight variables to compute a
new variable to represent body mass index (BMI) from the usual formula BMI = (703 * mass (lbs) ) / (height2
).
Next, suppose we want to add a variable to represent five increasing concentrations of a new drug, from 0
ng/mL to 20 ng/mL. The seq() command can be used to generate the sequence of drug concentrations 0
ng/mL, 5 ng/mL, 10 ng/mL, 15 ng/mL and 20 ng/mL, while the rep() statement will repeat that sequence of
drug concentrations four times to fill out the rest of the column. Finally, we could enter a vector of pre-
treatment AFP values to complete the data set. Join all the new variable columns together with the earlier
afp.data data set using the data.frame() command to finish, then view the results by entering the data set
name afp.data at the command line.
Crash Course: R & BioConductor
19
2.6.3 Importing previously saved R data workspaces (.RData)
Not surprisingly, it is possible to save and load R data sets in a their own file format (file extension
.RData). If you need to import a previously saved .RData file, you can use the load() command. The load()
command only includes two parameters, the file parameter used to specify the filepath of the .RData file that
will be imported and the more complicated envir parameter that specifies an environment for the uploaded R
workspace. Briefly, an R environment is a collection of named objects in R. The user’s R session workspace
is an environment, for example. If during one session of use, an R user defines the variable aa = 42.7, then
any reference to the variable aa will return the value 42.7 until the variable aa is redefined or until the
workspace is closed. If you load or import a previously saved R data workspace, all the variables and objects
defined in that workspace will be retained.
2.6.4 Text file data (.txt, .csv) with read.table() and scan()
The most efficient way to import data into R is to upload a text file. Text files are typically smaller than
proprietary data formats, like MS Excel spreadsheets. Since plain text files are not organized by columns, rows
and cells like a MS Excel spreadsheet, users will need to specify a character to separate the values of different
fields (i.e. a delimiter to separate columns) and a character to mark the end of each line of text (i.e. the end of a
row). Often, the first row of a text data file is used to name the fields (i.e. columns) of a data table. It is also
common to enclose strings of character data with single or double quotation marks to avoid possible conflicts
with delimiter symbols and numeric fields. All these issues will be addressed in the parameters of the text file
import procedures.
The most popular way to import a plain text data file is with the read.table() command. The
read.table() command is the most general method to read table style data from a plain text data file. Suppose
you have two different text file data sets. The first data set is a tab-delimited text file named . The second data
set is a comma separated value (.csv) plain text file named . Both files can be opened with the read.table()
command.
> # Import a tab-delimited text for data set named Expression
> #
> Expression <- read.table(file = “~/expression.txt,
header = TRUE, sep = “t”,
nrows = 40000,
stringsAsFactors = FALSE)
> Expression
> #
> # Import a comma separated value text file data set named ‘AE’
> #
> AE <- read.table(file = "~/AdverseEvents.csv", header = TRUE,
sep = ",", stringsAsFactors = TRUE)
> AE
The read.table() procedure include the parameter file to specify the quoted file path of the data file
that will be imported. The header = TRUE parameter indicates that the data file has a header line to define the
column (i.e. variable) names of the data table. If the header is not specified, column names can be entered using
the parameter col.names. The parameter sep = “t” indicates that the Expression data file should be
imported as a tab-delimited text files (i.e. columns are separated by tabs). Likewise, the parameter sep = “,”
indicates that fields are separated by commas in the AE data. The parameter nrows specifies the number of
rows of data in the text file to help speed up the import process for a large microarray expression data set.
When stringsAsFactors = FALSE, all string variables in the data file will be stored as character data rather
than factor data. Many statistical and graphing procedures require character data to be specified as factors, but
it may be easier to modify data tables if the string variables are stored as character data. The read.table
Crash Course: R & BioConductor
20
command includes many other parameters that could be useful. For example, the parameter dec = “,” could
be used to specify that numbers are recorded with European-style “comma” decimals (e.g. 2.63 = 2,63).
The read.csv() and read.csv2() commands are equivalent to read.table() with default parameters
optimized for comma separated value text files. Specifically, the read.csv() command is optimized for
‘American’ formatted .csv files, where fields are separated by commas and decimals are separated from integers
with a period. ‘European’ formatted .csv files, where fields are separated by semicolons and decimals are
separated from integers with a comma, should be imported with read.csv2(). Similarly, the read.delim()
and read.delim2() commands are equivalent to read.table() with default parameters optimized for
importing tab-delimited text files. Specifically, the read.delim() command is optimized for ‘American’
formatted .txt files, where fields are separated by tabs and decimals are separated from integers with a period.
‘European’ formatted .txt files, where fields are separated by tabs and decimals are separated from integers with
a comma, should be imported with the read.delim2() command.
The command scan() can also be used to import data tables from text files. The primary differences
between the read.table() commands and the scan() command is that the scan command reads data as a
single large vector or list that must be shaped into a data table or data frame later. The scan() command can be
more difficult to use than read.table() and similar commands when the text file data is already formatted in
some way. The command read.fwf() is used to import text data files that are stored in fixed width format,
where fields are not separated by a specific character like tab or comma, but instead each field is read from a
specified number of characters from left to right in the text file (e.g. characters 1-6 store the first field,
characters 7-8 store the second field, characters 9-12 store the third field, …). In other words, fields might be
separated by 0 or more space characters.
2.6.5 MS Excel files and other proprietary formats
Historically, it has been difficult to import data from MS Excel spreadsheets into R. Most people will
convert their MS Excel spreadsheets into tab-delimited text files (.txt) or comma-separated value text files
(.csv), then import these text files into R using the scan(), read.table() or read.csv() commands.
However, converting MS Excel spreadsheets into text files may be tiresome, if dozens of files need to be
converted, and some users may not have access to MS Excel to convert .xls spreadsheets into .txt files. It is
now possible to import MS Excel spreadsheets directly using the read.xls() command from the xlsReadWrite
package library.
First, try uploading MS Excel data by converting the MS Excel spreadsheet into a tab-delimited .txt file
or a .csv file. Start with a simple MS Excel data file (Figure 17).
Figure 17. A MS Excel spreadsheet data set
Crash Course: R & BioConductor
21
Click > File > Save As… to open up the Save As menu (Figure 18) and use the Save as type: drop down
menu to save your file as Text (Tab delimited) or CSV (Comma delimited). This will covert your MS Excel
spreadsheet into a tab-delimited text file or comma separated value text file that can be uploaded easily into R.
Figure 18. Saving a MS Excel spreadsheet as a tab-delimited text file.
Next, open R and import the text file with the read.table() or read.csv() statements seen below:
aa <- read.table(file = “C:sample.txt”, header = TRUE, sep = “t”)
Here, the statement aa <- read.table() implies that we are defining a data set named aa. The parameter
statement file = “H:BCBB tipssample.txt” specifies the file path of our tab-delimited text file
containing the data. The parameter statement header = TRUE specifies that the first row of data contains the
column headings of our data set, while the statement sep = “t” specifies that the different columns (or fields)
of our data table are separated by tab characters (i.e. the file is tab-delimited). You can use a similar command
to upload a .CSV file with the read.csv() command.
Now, open the same file using directly from MS Excel. Click > Packages > Install package(s)… on
the MS Windows R GUI or click > Packages & Data > Package Installer on the Mac OSX R GUI to find and
install the xlsReadWrite package library. Enter the command library(xls.ReadWrite) to load the package
library to your workspace. Upload your MS Excel data using the command:
bb <- read.xls(file=“C: sample.xls”,colNames=TRUE,sheet=1)
Note, the statement bb <- read.xls() implies that we are defining a data set named bb, just like the previous
example. Similarly, the parameter statements file = “H:BCBB tipssample.xls”, colNames = TRUE
and sheet = 1 indicate the file path of the MS Excel spreadsheet, choice to read column names from the first
row of data and the choice to only read data from the first sheet of the MS Excel file, respectively.
Crash Course: R & BioConductor
22
Other R package libraries are available to open data files created or saved using commercial statistics
software packages like SAS or SPSS. For example, SAS datafiles and SAS XPORT format libraries can be
imported with the commands read.ssd() and read.xport() from the foreign package library. It is also
possible to import SAS data sets and SAS Transport files using the sas.get() and sasxport.get()
commands from the Hmisc package library. Similarly, spss.get() and read.spss() from the package
libraries Hmisc and foreign, respectively, can both be used to open SPSS data files (.sav file extension) in R.
The command stata.get() from the Hmisc package library can be used to import Stata datasets into R, etc.
Data sets from most commercial statistics software packages can be imported directly into R.
2.7 Data Types
2.7.1 Simple object types (E.g. numeric, character and logical)
One potentially frustrating problem with R is that you must carefully specify and manage how data is
stored within R. Consider the following R statements:
> a = 4.23 # Define a numeric object “a”
> b = "Fred Flintstone" # Define a character object “b”
> c = TRUE # Define a logical object “c”
The R commands above define three variables: a, b and c. Each of these three variables are stored as an object
within the R framework. All objects in R have a specific storage mode within R. The variable a was defined to
be the real number 4.23, so the variable a will be stored as a numeric object in R. Likewise, the variable b was
defined to be the character string “Fred Flintstone”, therefore it will be stored as character object in R. Note
that character string objects are entered within double quotes in R. Finally, the variable c was defined as the
logical outcome TRUE, so it will be stored as a logical object in R. Note the logical values TRUE and FALSE
can be entered and stored without quote marks in R to create a logical object, but the character strings “TRUE”,
“False” and other variations within double quotes would be stored as character objects in R. More specific
classes of objects exist, like complex() R objects for storing complex numbers, integer() objects for storing
integer numeric data or factor() objects for character string data that will be used as factor effects in statistical
tests and graphs.
Objects from different storage modes have different properties within R. If an R user tried to compute
the sum 4.23 + “Fred Flintstone”, then R would return an error message. This is a good thing, because
obviously the sum 4.23 + “Fred Flintstone” does not make sense. However, the different properties of these
object types can sometimes cause conflicts, especially when data gets entered incorrectly. For example, the
numeric value 4.23 can also be entered as the character string “4.23”. Numeric data sometimes gets into R, or
other software programs, as character string data because of differences in how various software programs
handle missing values or other problems. Incorrectly storing R objects can create unwanted errors in R
statistical and graphing procedures, so it is often helpful to check the storage mode of an R object using the
storage.mode() command. You can test whether an object belongs to a specific class using commands like
is.numeric() or is.character(). You can also find the storage mode of an object using the command
class(). Objects can often be coerced from one class into another using commands like as.numeric() or
as.character().
> a = 4.23 # Define a numeric object “a”
> is.numeric(a) # Test if “a” is a numeric object
[1] TRUE
> a = as.character(a) # Coerce “a” into a character object
> a # View the object “a”
[1] "4.23"
Crash Course: R & BioConductor
23
> is.numeric(a) # Test if “a” is a numeric object
[1] FALSE
> class(a) # Find the storage mode class of “a”
[1] “character”
2.7.2 Larger object types (E.g. data frames, matrices and lists)
Larger entries of multiple values are also considered objects within R. For example, the data sets iris,
afp.data, Expression and AE defined in Section 2.6 are all R objects. Many collections of values can be
stored as vector(), matrix(), array(), table(), data.frame() or list() objects within the R framework.
Again, these object storage modes each have unique properties within R to ensure proper handling of different
types of data within R. You can also identify the storage mode of larger R objects, like arrays or data frames,
with the command class() or with specific tests like is.data.frame(). You can coerce these larger R
objects into different storage modes using commands like as.list() or as.numeric().
A vector is a one-dimensional, mathematical list of values that can be used in linear algebra (or matrix
algebra) operations. The vector() and c() commands in R are used to define column vectors; row vectors
must be created using the transpose command t() or by defining a 1 x n matrix, where n is the length of the
row vector. A matrix is a two-dimensional list of values organized into m rows and n columns. Generally,
vectors and matrices in R should only used for matrix algebra manipulation of numeric data. It is possible to
enter character or logical data into a vector or matrix, but it is not possible to mix different types of data (e.g.
numeric, character, logical,…) in the same vector or matrix. An array is similar to a vector or matrix, except the
array can exist in higher dimensions. The higher dimensions of an n-dimensional array could be described as
panels, pages, chapters, books, etc. For example, a six-dimensional array might have 3 books, 10 chapters, 150
pages, 16 panels, 8 rows and 45 columns.
> # Display a vector with length = 6
> Vector <- c(45,50,53,47,44,52)
> Vector
[1] 45 50 53 47 44 52
> # Report the length of Vector
> length(Vector)
[1] 6
> # Display a matrix with 4 rows and 6 columns
> Matrix <- matrix(x = x, nrow = 4, ncol = 6, byrow = TRUE)
> Matrix
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 49 54 49 47 57 53
[2,] 49 45 50 50 49 46
[3,] 45 56 54 52 46 51
[4,] 47 46 48 48 55 48
> # Report the number of rows and columns of matrix
> nrow(Matrix)
[1] 4
> ncol(Matrix)
[1] 6
> # Display an array with 4 rows, 6 columns and 2 panels
> Array <- array(data = x, dim = c(4,6,2))
> Array
, , 1
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 50 48 52 52 48 44
[2,] 52 46 45 56 44 59
[3,] 45 41 51 51 48 44
[4,] 46 52 53 52 45 51
Crash Course: R & BioConductor
24
, , 2
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 49 40 50 50 55 51
[2,] 44 50 46 50 53 57
[3,] 49 49 50 54 49 51
[4,] 56 55 45 49 51 47
> # Report the dimensions of Array
> dim(Array)
[1] 4 6 2
The combine function c() can be used to create column vectors in R. Even though vectors are
displayed in rows when printed in the R workspace, mathematically they will behave as column vectors. The
size of a vector is reported using the length() command in R. The matrix() command builds a matrix from a
vector x, using the parameters nrow and ncol to specify the number of rows and columns. The parameter
byrow is used to determine whether the matrix will be filled with the vector values row-by-row or column-by-
column. The number of rows and columns in a matrix can be reported with the functions nrow() and ncol(),
respectively. The array() command builds an array from a vector data, where the dim parameter is used to
specify both the number and length of its dimensions. For example, the entry dim = c(4,6,2) indicates the
array should have three dimensions of lengths 4, 6 and 2, respectively. The function dim() is used to report the
dimensions of an array object.
The table() storage mode is used to store n-dimensional tables of frequency data for two or more
categorical variables. The tables might be reported directly to display the frequency distribution among two or
more factors, or the tables could be used as the data format for specific statistical tests and graphical methods.
The table() comma is helpful, because it computes cell frequencies automatically. Unfortunately, the
table() command does not compute statistics (e.g. mean, sum, …) for numeric arguments, so it cannot be used
to compute pivot tables or statistical summaries.
> # Build a two-way table from AE data
> Table <- table(Gender,Severity)
> Table
Mild Moderate Severe
Female 6 16 9
Male 14 15 4
> # Build a three-way table from AE data
> Table <- table(Region,Severity,Gender)
> Table
, , = Female
Mild Moderate Severe
Midwest 1 6 1
Northeast 0 3 7
Northwest 0 4 0
Southeast 0 3 1
Southwest 5 0 0
, , = Male
Mild Moderate Severe
Midwest 0 7 1
Northeast 1 5 1
Northwest 0 1 0
Southeast 0 2 2
Southwest 13 0 0
Crash Course: R & BioConductor
25
The table() command uses two or more vectors of factor() or character() data to generate an n-
way table. The numbers in each cell are counts representing the number of observations with each combination
of factor levels. For example, a two-way table of gender versus severity reveals that there are 6 female patients
with mild AE symptoms. Breaking the data down further into a three-way table with region, severity and
gender, there is only one female patient with mild symptoms from the northeast region and there are 5 female
patients with mild symptoms in the southwest region. No additional parameters were specified. However, one
interesting parameter is exclude, which allows you to hide the results for specified factor levels. For example,
exclude = “Midwest” would hide all the results for the midwest region.
Most R data should be stored with the data.frame() storage mode. The vector, matrix, array and table
objects can only store individual values from one storage type (e.g. all numeric data, all character data, …). A
data frame is very useful, because it can store data sets with multiple columns (i.e. variable) that maintain their
own unique storage modes (e.g. separate numeric, character and logical variables). The only limitation of
the data frame is that each variable, or column, must be the same length. Commands like read.table() and
read.csv() will automatically store imported data as a data frame. Data frames share some of the properties of
the matrix, array and list storage modes, but you want to be careful about using data frames in matrix algebra
calculations and other methods. The properties of data frames will not always work with commands that
require matrix data, and vice versa.
> # Define a data frame from AFP data
> afp.data <- data.frame(subjectID,gender,stringsAsFactors=FALSE)
> gender
[1] "female" "female" "female" "female" "female" "female"
[7] "female" "female" "female" "female" "male" "male” "male"
[14] "male" "male" "male" "male" "male" "male" "male"
> # Define a data frame from AFP data with strings as factors
> afp.data <- data.frame(subjectID,gender,stringsAsFactors=TRUE)
> gender
[1] female female female female female female female female
[9] female female male male male male male male male male male male
Levels: female male
The data.frame() command is used to join several vectors of the same length together to form a single
data set. Again, individual vectors can have different storage modes, such as numeric or character. The only
limitation is that the vectors must share the same length. The stringsAsFactors parameter of the
data.frame() command is used to store any character string variables in the data frame as factor() objects.
The diffence between a vector of character objects and a factor object is shown above. A vector of character
data is simply a collection of character strings. If we wanted to add new rows of data to the character variable
gender, then we could add a new character string to the variable (e.g. “Did not report” or “intersex”). However,
a factor object is a variable of string or numeric data with a fixed number of levels or outcomes. When gender
is stored as a factor, there are only two possible levels, female and male. If a new character string (e.g. “Did not
report”) were added to the factor variable gender, it would produce an error. Many statistical and graphical
procedures in R require factor data.
The list() storage mode is used to store a collection of ordered or named objects in R. A list shares
some properties in common with a vector, except the list can be used to store a collection objects with different
storage modes. For example, a single list could contain one numeric entry, three character entries, one logical
entry and an entry that is itself a vector or matrix. Lists can also have entries that are named, as well as ordered.
Finally, a list can be built one element at a time. For this reason, lists can be a handy way to store the output
from a statistical function, since tests often produce diagnostic results and data that may not be immediately
useful to all users (e.g. lists can store the residuals from a linear regression analysis).
Crash Course: R & BioConductor
26
> # Build a list of four numeric entries
> List <- list(45,13,21,87)
> List
[[1]]
[1] 45
[[2]]
[1] 13
[[3]]
[1] 21
[[4]]
[1] 87
> # Build a list of character, numeric and vector objects
> List <- list(Day = "Tuesday", Temperature = 70, WinningLotto = c(17,23,44,39,7))
> List
$Day
[1] "Tuesday"
$Temperature
[1] 70
$WinningLotto
[1] 17 23 44 39 7
> # Add a new list element to the list
> List[["Time"]] <- "4:30"
> List
$Day
[1] "Tuesday"
$Temperature
[1] 70
$WinningLotto
[1] 17 23 44 39 7
$Time
[1] "4:30"
The list() command allows you enter a series of values or R objects to build the list, similar to the
vector() command. The first example shows a list of four numeric values. When the list is displayed in the R
workspace, each entry of list is identified first in double brackets, then the individual rows of the entry are
displayed on the line below with the entry values on that row. For example, the third entry of the first list is
identified [[3]] with the value [1] 21. Since all of the entries are numeric, this list is equivalent to a numeric
vector of length 4. In the second example, the entries of the list are assigned names using the = operator.
Notice that the entries include a character object, a numeric object and a vector of numbers. Finally, new
entries to the list can be added after it has been defined, by adding a new name and value, e.g. List[["Time"]]
<- "4:30".
2.8 Manipulating Data in R
Once you have a data set imported and stored correctly in R, you may still need to manipulate the data
set to add data, remove data or to meet the formatting requirements of a graph or statistical test. You may want
to remove outlier values, or rename rows and columns of data, or maybe merge two data sets together. The R
command line language includes a wide variety of procedures for these needs. Understanding these commands
Crash Course: R & BioConductor
27
can be crucial, because R data is not stored in a viewable spreadsheet format, with simple copy-cut-and-paste
functions, like MS Excel and other programs.
2.8.1 Indexing
One of the most important concepts in R is the idea of indexing, because it applies to so many types of R
objects. Vectors, matrices, data frames, arrays and lists can all be indexed using similar command notations.
The index of an R object refers to the specific location of a value in a vector, matrix, array, data frame or list.
You can generalize this concept by thinking of the index as the row and column number of any value entry in a
spreadsheet, but remember that some R objects can have more than two dimensions or fewer than two
dimensions. Here are some examples:
> # Report the third entry from a vector of length = 6
> Vector[3]
[1] 53
> # Report the entry from the 2nd row and 5th column of a matrix
> Matrix[2,5]
[1] 49
> # Report the 3rd row, 2nd column and 2nd panel of an array
> Array[3,2,2]
[1] 49
> # Report the 3rd row, 2nd column and 'Female' gender of a table
> Table[3,2,"Female"]
[1] 4
> # Report the 1st entry of the 1st column from afp.data
> afp.data[1,1]
[1] 1
> # Report the 2nd entry from the 'WinningLotto' vector in a list
> List[["WinningLotto"]][2]
[1] 23
Generally, you refer to an indexed entry of an R object by adding square brackets after the objects name
(e.g. Vector[3] refers to the 3rd
entry of the object Vector). The dimensions of an object are separated by
commas (e.g. Matrix[2,5] refers to the 2nd
row and 5th
column of the object Matrix). If the dimensions of an
object are named instead of numbered, then those dimensions can be specified with a quoted character string
(e.g. specify the "Female" of the gender dimension). The examples above use indexing to report single values
from vectors, matrices, arrays, tables, data frames and lists, but an index can be used in more complicated ways.
> # Rows 2-3 and columns 1, 2 and 6 of a matrix
> Matrix[2:3,c(1,2,6)]
[,1] [,2] [,3]
[1,] 49 45 46
[2,] 45 56 51
> # Overwrite one value from a matrix
> Matrix[3,3] <- NA
> Matrix
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 49 54 49 47 57 53
[2,] 49 45 50 50 49 46
[3,] 45 56 NA 52 46 51
[4,] 47 46 48 48 55 48
Crash Course: R & BioConductor
28
> # Identify observations with % Body fat less than 10%
> AE[Percent.Body.Fat <= 10,]
Region Gender Severity Age Weight Percent.Body.Fat
2 Southwest Male Mild 34 148.5672 7
30 Southwest Male Mild 36 155.3823 8
49 Midwest Male Moderate 34 151.3767 9
A sequence of row or column numbers can be entered into an index to view more than one row or
column from a data table. These sequences can be entered using colon symbol notation (e.g. 1:5) or the
combine function (e.g. c(1,4,7)) and other methods. The individual indexed values of a matrix, array or data
frame can be overwritten without affecting any other values in the matrix, array or data frame. Sequences of
row numbers or column numbers can be generated with an inequality or conditional statement to find special
subsets of data. Indexing is a very powerful tool within R.
2.8.2 Column references and attach()
Indexing can be a great way to create, view or modify subsets of your data, but often it might be more
helpful to refer to specific columns, or variables, within a large data frame. We have already seen that the
objects in a list can be called by their names using double square brackets and the quoted name (e.g.
List[[“Time”]] yields the value “4:30”). We can also call a specific column of a data frame using the
reserved dollar sign symbol (e.g. AE$Gender yields the gender column of the AE data set). Column references
and list name references can be simplified using the attach() command to “attach” a specific data frame or list
to the current R workspace. Once the object has been attached to the workspace, individual variables or list
items can be called by name.
2.8.3 Binding rows and columns
Indexing and column references allow you to manipulate smaller subsets of a large data set. The
functions cbind() and rbind() allow you to add columns and rows to data sets, respectively. It is also
possible to add columns to a data frame recursively, by redefining the data frame with its original data and the
new columns of data Frame = data.frame(Frame,NewColumn). Obviously, new elements can be added to
lists at any time, by adding new named elements.
2.8.4 Sort and order data
It is often helpful to sort the results of a vector, array or data frame to reorganize a data set or result for
better insights. For example, you might need to sort the results from several statistical tests by their p-values, so
the most statistically significant results are easy to identify. The sort() function is used to sort the actual
values of a single vector or list in ascending or descending order. The order() command is used to generate an
index of the sorted rows numbers from a data frame sorted by one or more variables in either ascending or
descending order.
> # Sort a single vector of values
> a
[1] 4 4 0 5 4 1 0 1
> sort(a)
[1] 0 0 1 1 4 4 4 5
> sort(a,descending = TRUE)
[1] 5 4 4 4 1 1 0 0
> index = order(a,b,c)
> index
[1] 3 7 6 8 2 5 1 4
Crash Course: R & BioConductor
29
> frame[index,]
a b c
3 0 0.0 89
7 0 0.6 92
6 1 0.2 99
8 1 0.8 83
2 4 0.0 84
5 4 0.4 100
1 4 0.9 81
4 5 0.1 92
> index = order(b,c,a)
> frame[index,]
a b c
2 4 0.0 84
3 0 0.0 89
4 5 0.1 92
6 1 0.2 99
5 4 0.4 100
7 0 0.6 92
8 1 0.8 83
1 4 0.9 81
2.8.5 Replace values
The replace() command is used to replace the values of a matrix, array or data frame according to an
index list, which identifies the values that need to replaces, and a vector, matrix or array of replacement
values. For example, you could use an equality statement to recode the values of a character or factor variable
(e.g. recode “m” and “f” to “male” and “female”). Alternatively, you could use an inequality to identify and
remove outliers from a numeric variable.
> # display a simple data frame
> frame
a b
1 f 3
2 f 7
3 m 1
4 f 1200
5 m 6
> # generate an index to identify “f” values for replacement
> indx.f <- frame == "f"
> # replace “f” values with “female”
> frame <- replace(frame,indx.f,"female")
> # generate an index to identify possible outlier values of b
> indx <- frame$b > 1000
> # replace outliers with NA to remove outliers
> frame$b <- replace(frame$b,indx,NA)
> # view results
> frame
a b
1 female 3
2 female 7
3 m 1
4 female NA
5 m 6
2.8.6 Stack, unstack and reshape data
Often data sets need to be stacked or split to reorganize data for use in statistical tests or to simply the
recording of new observations. For example, suppose you were to record blood pressure, cholesterol and other
Crash Course: R & BioConductor
30
medical results for 6 patients on Monday, Tuesday and Wednesday of one week. It would make sense for the
doctor to record the measurements in three separate columns for Monday, Tuesday and Wednesday. But most
statistical tests would need one column of dates and one single column for each type of measurement (e.g. one
column of blood pressure measurements). The stack() and unstack() commands are used to stack and split
these kinds of data sets. More complex stack and split operations can be performed using reshape().
> # Display a data frame
> frame
Monday Tuesday Wednesday
1 96 76 156
2 100 78 163
3 102 80 163
4 106 82 159
5 105 82 153
6 103 78 162
# Stack the data from Monday, Tuesday and Wednesday
> frame = stack(frame)
> frame
Date Measurement
1 Monday 96
2 Monday 100
3 Monday 102
...
7 Tuesday 76
8 Tuesday 78
9 Tuesday 80
...
16 Wednesday 159
17 Wednesday 153
18 Wednesday 162
> # Unstack the measurement data by dates
> frame = unstack(frame)
> frame
Monday Tuesday Wednesday
1 96 76 156
2 100 78 163
...
2.8.7 Merge data sets
Two data frames can be joined together using the merge() command. The default option is to join the
data frames using any columns that share the same name among all the data frames. However, specific columns
can be matched to one another using the by, by.x or by.y parameters. For example, it might be necessary to
combine the medical records of a general practitioner, a cardiologist, a dentist and a psychologist according to
their patient id numbers or patient names. The merge command can also be used for more complicated join
operations among database tables.
> psych
id ssri therepy
1 1 Y Y
2 1 Y N
3 2 N N
4 2 N Y
5 3 N Y
6 3 Y Y
7 4 N N
8 4 N Y
Crash Course: R & BioConductor
31
> cardio
id exercise lipids
1 1 Y high
2 1 Y high
3 1 Y norm
4 1 Y norm
5 3 Y norm
6 3 Y norm
7 4 Y norm
> records = merge(psych,cardio)
> records
id ssri therepy exercise lipids
1 1 Y Y Y high
2 1 Y Y Y high
3 1 Y Y Y norm
4 1 Y Y Y norm
5 1 Y N Y high
6 1 Y N Y high
7 1 Y N Y norm
8 1 Y N Y norm
9 3 N Y Y norm
10 3 N Y Y norm
11 3 Y Y Y norm
12 3 Y Y Y norm
13 4 N N Y norm
14 4 N Y Y norm
2.9 Saving and Exporting Data
2.9.1 Save workspace data with save()
With any software program, it is important to save your data. There are several options available to save
your data in R. Most of the options should be familiar from the data import commands in section 2.6 of this
manual. Click > File > Save Workspace... on the MS Windows R GUI or click > Workspace > Save
Workspace File... on the Mac OSX R GUI to save the entire R workspace. Alternatively, you could enter the
command save() or save.image() to save the R workspace from the command line. If you use the save()
command, you can specify a list of R objects to be saved (e.g. save(AE) to save only the AE data set),
otherwise the save() command will save all the R objects defined in the R workspace. Use the command ls()
to view the objects in your R workspace, and use the command rm() to remove individual objects from the
workspace. The command rm(list=ls()) will remove all objects from the R workspace.
> ls()
[1] "AFP.after" "AFP.before" "BMI"
[4] "InternetTest" "Monday" "R2HTML.test"
[7] "Tuesday" "Wednesday" "a"
[10] "aa" "afp.data" "b"
[13] "biocLite" "biocinstall" “biocinstallPkgGroups"
[16] "biocinstallRepos" "c" "cardio"
...
> rm(a,aa,b,c)
> save(file="~/workspace.RData")
2.9.2 Save data tables with write.table()
You can save an R data frame with the commands write.table(), write.csv() or write.delim().
The parameter options are similar to the read.table() commands, but here you will choose whether the save
text file should have a header or row names, if the fields should be separated by commas or tabs, etc. The na
Crash Course: R & BioConductor
32
parameter is important, because it controls how missing values will be saved in the text files; you may want to
be carefully choose the symbol for missing data, if you want to open the text file in another software package
like MS Excel, SAS, etc. Another powerful option in write.table is the append parameter, which allows you to
add new data to an existing text file. This can be a useful option when you need to save large amounts of data
from a script or analysis that involves long computations. It is often helpful to save a data file one-piece-at-a-
time to avoid losing data during lengthy computations or to avoid problems when trying to save one gigantic
data file.
> # Save AE data as tab-delimited text optimized for SAS import
> write.table(AE,file="~/ae.txt",sep="t",na=".",row.names=FALSE)
> # Save the AFP data as tab delimited text using write.delim()
> write.delim(afp.data)
2.10 Changing Directories
When opening and saving files from R, it may be helpful to change the working directory. Changing the
directory will often allow you to specify only a file name, rather than a complete file path, when opening and
saving data files or source scripts. Change directories from the MS Windows GUI by clicking > File > Change
dir... on the menu bar; click > Misc > Change Working Directory... to change directories on the Mac OSX R
GUI. The Mac OSX GUI also allows you to click > Misc > Get Working Directory to find the current
working directory. The commands getwd() and setwd() allow you to find the current directory and change
the working directory, respectively.
2.11 Sample Problems for Students
#1. {Fisher’s iris data} Sir Ronald A. Fisher famously used this set of iris flower data as an example to test
his new linear discriminant statistical model. Now, the iris data set is used as a historical example for
new statistical classification models.
A) Search the help menu for the keyword “linear discriminant”, then report the names of the
functions and packages you find.
B) Search the help menus or a search engine for additional classification models that could be tested
with the iris data.
C) The measurements from the iris data set were made in centimeters, but suppose a researcher
wanted to compare the performance of their classifier for measurements in both cm and inches.
Remember 1 cm = 0.3937 inch and create a new iris data set with measurements in inches.
D) Use indexing to verify the 77th
plant (i.e. row 77) has petal length of approximately 1.89 inches.
#2. {AFP data} Suppose alpha-fetoprotein (AFP) is a potential biomarker for liver
cancer and other cancer types. A researcher might be interested in AFP levels
before and after taking a new drug in one of four concentrations.
A) The example in section 2.7.2 of the manual provided a list of 20 AFP levels before drug
treatment. Use your own methods to enter a new column of 20 AFP levels after drug treatment,
then enter another column with the difference between the pre- and post-treatment AFP levels
B) Verify the storage mode of the data set afp.data. Verify the storage mode of the variable drug.
Verify the storage mode of the variable gender. Convert the storage mode of drug to factor.
Crash Course: R & BioConductor
33
C) Create a subset of the AFP data that only includes male patients with BMI > 25.5 or weight >
180 lbs. How many men are included in the data subset?
D) Sort the entire data subset created in part C) by the BMI variable in an descending order. What
is the row ordering of the sorted data subset? Save the data subset as a comma separated value
(.csv) text file, then remove the subset from your R workspace.
#3. {AE data} Doctors, epidemiologists and other researchers look at adverse events to explore the
symptoms and medical conditions affecting patients. A researcher might choose to look for associations
between adverse events and diet.
A) One of the adverse events in the data table is “Malaise”. Recode the AE data table, such that all
entries for “Malaise” read “Discomfort” instead.
B) Look at the results of your recoded adverse events. How many different types of adverse events
are there? Look through their names. Do you see any potential problems? Fix any problems
that you might find.
C) Create an adverse event table to examine relationship between different adverse event symptoms
and their severities. Make sure the “Discomfort” AE shows up in the table, instead of “Malaise”.
D) Search the help menus for the functions rowSums and colSums. Use these functions to count up
the number of patients with each adverse event and the number of patients with mild, moderate
and severe symptoms.
E) Define a new variable AEmatrix by converting the AE table into the matrix storage mode.
Define two new matrix variables using the commands LL = matrix(1,1,17) and RR = c(1,1,1).
Look at all these new matrices. Compute the products of LL by AEmatrix; AEmatrix by RR;
and LL by AEmatrix by RR.
Crash Course: R & BioConductor
34
Ch. 3. Graphics and Figures in R
3.1 Basic Types of Graphics and Figures
You can use R to produce dozens or hundreds of different kinds of graphics and figures. Many popular
types of graphs, like pie charts and histograms, have their own dedicated commands and procedures in the
graphics package library. Other types of graphs, like multifactor XY scatterplots, are most easily produced
using multiple commands from general graphing utilities, like plot() and legend(). Often, specialized
package libraries will include graphics commands that can help streamline the graphing process. Other graphs
can only be produced in the context of the appropriate statistical analysis. Several simple examples are
provided below.
3.1.1 Pie charts
Pie charts are used to quickly display the frequencies of each outcome of a single categorical variable.
The relative size of each slice of the pie chart represents the relative frequency of its respective outcome in the
sample. For example, we could use a pie chart to examine the proportion of samples from each iris species in
Edgar Anderson’s iris data (Figure 19). We could also use a pie chart to explore the frequencies of each
adverse event in our AE data set (Figure 20).
Figure 19. Pie chart of Edgar Anderson’s iris species. Figure 20. Pie chart of the adverse events (AE) data.
># Create the labels for the iris data pie chart
> labels <- levels(iris$Species)
># Create a vector with all three species counts
> counts <- summary(iris$Species)
># Define a vector with three color choices for the pie chart
> colors <- c("red","blue","yellow")
># Define a main title for the pie chart
> main <- "Pie Chart of Iris Species"
># Call the pie() command to produce the pie chart
> pie(x = counts, labels = labels, col = colors, main = main)
Crash Course: R & BioConductor
35
># Create the labels for the adverse events (AE) data pie chart
> labels <- levels(as.factor(AE$Adverse.Event))
># Create a vector with counts for all adverse events
> counts <- summary(as.factor(AE$Adverse.Event))
># Define a main title for the pie chart
> main <- "Adverse Events Pie Chart"
># Call the pie() command to produce the pie chart
> pie(x = counts, labels = labels, main = main)
The pie() command includes the parameter x to define the counts or frequencies in each slice of the pie
chart, the parameter labels to define the text labels in each slice of the pie chart and the parameter col to
define the colors of each slice in the pie chart. You can also add generic graphing parameters, like main and
others, to customize the pie chart with a main title and other features. Notice that the labels and the x (counts
or frequencies) parameters could have been computed and entered manually, but instead the commands
levels() and summary() were used to define labels and x, respectively. The levels() command lists all the
outcomes of a factor variable, while the summary() command adds up all the counts for each outcome of a
factor variable. Note, the species variable of the iris data set was already defined as a factor variable, while the
adverse events variable from the AE data set needed to be converted to a factor variable first, using the
as.factor() command. Also notice, in the second pie chart, that the col parameter was left undefined and R
automatically generated the color choices for each of the 18 adverse event slices.
3.1.2 Histograms
Histograms are used to quickly display the distribution of a single continuous numeric variable. Often
researchers want to determine if a variable might be normally distributed or non-normally distributed. Other
researchers want to estimate descriptive statistics like means, medians, variances or ranges. A key issue in the
construction of a histogram is the choice of the histogram “bins” or groupings. If too many bins are used, the
true shape of the distribution will be lost because the histogram will be too sparse, but if too few bins are used,
the true shape of the distribution will be lost because the bins are too dense to remain informative. The location
of the bin mid points and break points can also be important to the shape of the histogram. The importance of
binning is shown in two histograms of the height measurements from the AFP dataset shown below (Figure 21
and Figure 22).
Figure 21. Histogram of height from the AFP data set
with default number of bins.
Figure 22. Histogram of BMI from the AFP data set
with a larger number of bins.
Crash Course: R & BioConductor
36
> # Define a vector of BMI data
> height <- as.numeric(afp.data[,3])
> # Define a main title for the histogram
> main <- "Histogram of height from AFP data"
> # Call the hist() command to produce the histogram
> hist(x=height,xlab="height (inches)",main=main,col="wheat")
> # Call hist() command with extra breaks for a second histogram
> hist(x=height,breaks=30,xlim=c(15,45),...)
The hist() command includes many parameter options. The parameter x must be specified, to identify
the sample of continuous data displayed in the histogram. The breaks parameter specifies the number of bins
used in the histogram. The number of histogram bins can be specified using one of three automated binning
algorithm choices (i.e. “Sturges”, “Scott” or “Freedman-Diaconis”), a single number (i.e. breaks = 30 will
produce 31 bins), a vector of specific break points or a formula. In the first histogram, the default “Sturges”
method produced a histogram with six bins, which appears to show a normal distribution. In the second
histogram, the command breaks = 30 specified that 31 bins should be used, and the resulting histogram was
sparse and uninformative. The command xlab specifies the label for the x-axis. As before, the commands
main and col specify the main title and the color of the plotted bars, respectively.
3.1.3 Box plots
Box plots are an alternative to the histogram for researchers who want to quickly summarize the
distribution of continuous numeric variables. Box plots were introduced by statistician John Tukey in his
historic book Exploratory Data Analysis (Tukey 1977). The box plot is a graphical representation of the five
number summary, where the central line in the box plot represents the median of a sample, the outer edges of
the box in the box plot represent the 25th
and 75th
percentiles of the sample and the whiskers of the box plot
represent the minimum and maximum of a sample. Alternate versions of the box plot often use dots or asterisks
to identify outliers beyond the whiskers, which might represent the 5th
and 95th
percentile of a distribution or the
smallest and largest “non-outlier” values of a distribution. Generally, a single box plot (Figure 23) provides less
information about the shape of a distribution than an analogous histogram. For example, a box plot cannot be
used to identify a bimodal distribution of female and male heights, while a histogram can. However, box plots
are often more appropriate than histograms when researchers want to compare the distributions of several
samples in the same figure (Figure 24).
> # Define a vector of height data
> height <- as.numeric(afp.data[,3])
> # Define a main title for the boxplot
> main <- "Boxplot of height from AFP data"
> # Call boxplot() command for boxplot of height from AFP data
> boxplot(x=height,main=main,xlab=”height (inches)”,col=”wheat”)
> #
> # Call boxplot() command for boxplot of calories from AE data
> boxplot(formula=Calories~Region,data=AE,range=1.5,...)
The boxplot() function can be used in at least two different ways, with a single vector of continuous
data or with a formula to produce side-by-side box plots. A simple box plot of the height variable from the AFP
data set is produced using the boxplot() command with parameter x = height to specify a single vector of
numeric data for the Calories ~ Region was used to create a graph with side-by-side box plots of the calories
variable for each of the five regions of the categorical region variable. The parameter data = AE is used to
specify that we only want to use variables from the AE data set, which is why we could call the calories and
region variables with out defining them as vectors before the boxplot() command. The range parameter is
Crash Course: R & BioConductor
37
Figure 23. Box plot of patient height from AFP data Figure 24. Box plot of calories among five regions.
used to identify outliers in the box plot. The parameters main, xlab and col were used to specify the main title,
x-axis label and the color of the boxplot, respectively, as seen in the previous examples. In the second box plot,
the parameter formula = on the box plot figure. Here, range = 1.5 implies that any calorie measurement
smaller than Q1 – 1.5*IQR and any measurement larger than Q3 + 1.5*IQR will be identified as an outlier,
where Q1 represents the 25th
percentile, Q3 represents the 75th
percentile and IQR represents the interquartile
range (i.e. Q3 – Q1, the middle 50% of the data). Another parameter ylab = “Calories” was entered to
specify the y-axis label, while the main and col were used to define a main title and the box plot color as above.
3.1.4 Simple bar charts
Researchers and statisticians often use the phrase “bar chart” to describe two subtly different types of
graphs. Sometimes a bar chart is used as an alternative to the pie chart to display the relative frequencies of
different outcomes from a categorical variable (e.g. gender or region), but in other situations a bar chart with
error bars might be used to display the mean response levels of a continuous variable (e.g. weight,
concentration) and its standard error among several categories as an alternative to a box plot. E.g. a bar chart
can be used to plot the frequencies of each adverse event in the AE data set (Figure 25), or a bar chart could be
used to plot the mean BMI levels for male and female patients in the AFP data set (Figure 26).
> # Create a vector of counts for the AE bar chart
> counts <- summary(AE$Adverse.Event)
> # Define a main title for the bar chart
> main <- "Bar chart of Adverse Events"
> # Call the barplot() command for a bar chart of adverse events
> barplot(height=counts,main=main,ylab=”Counts”)
Crash Course: R & BioConductor
38
Figure 25. A bar chart of adverse events from the AE data set
Figure 26. A bar chart of female and male BMI from the AFP data set.
> # compute mean BMI for male and female patients from AFP data
> BMI.females <-mean(AFP[AFP$gender=="female",5])
> BMI.males <-mean(AFP[AFP$gender=="male",5])
> mean.BMI <- c(BMI.females,BMI.males)
> # define labels for female and male bars
> names(mean.BMI) <- levels(as.factor(AFP$gender))
> # Define colors for female and male bars
> colors <- c("pink","sky blue")
> # Specify a main title for the bar chart graph
> main <- "Bar chart of mean BMI by gender for AFP data"
> # Call the barplot() command for a bar chart of BMI responses
> barplot(height = mean.BMI,ylab="BMI", main=main,col=colors)
Crash Course: R & BioConductor
39
Before creating a bar chart of mean BMI levels for male and female patients from the AFP data, we first
need to compute the individual BMI means for male and female patients using subscripting. Names were
assigned to the vector of BMI means using the names() function, so the appropriate gender labels will show up
below the male and female BMI data. Colors and a main title were defined for the chart, and the barplot()
function was called with the appropriate options.
Note that more advanced bar charts of several categorical variables can be easily created with the
barchart() command from the lattice package library. Multiple categorical variables can be summarized
with the table() command, then the table of categorical variables is entered into the barchart() command
for easy clustered, stacked or paneled bar charts. Notice that numeric variables will not work appropriately in
the barchart() command. However, quick and easy bar charts with error bars can be created with the
bargraph.CI() command from the sciplot package library. Other helpful packages may exist to create more
variations on these types of bar charts.
3.1.5 Simple scatter plots and line plots
Scatter plots are used to display the relationship between two continuous variables that might be
analyzed using linear regression or nonlinear regression models. E.g. an XY scatter plot might be used to
examine the relationship between % body fat and weight (lbs) from the AE dataset (Figure 27). Line plots are
often used to plot survival curves, probability density functions (PDFs), cumulative distribution functions
(CDFs) and other continuous functions of interest. E.g. you might need a plot of the standard normal curve for
a class lecture or a statistics textbook (Figure 28).
Figure 27. Scatter plot of % Body Fat vs Weight (lbs) Figure 28. Plot of (Gaussian) normal density function.
> # Define a main title for a scatter plot
> main <- "Simple scatter plot of % Body Fat vs. Weight (lbs)"
> # Simple scatter plot of % Body Fat vs. Weight
> plot(AE$Weight,AE$Percent.Body.Fat,xlab="Weight (lbs)",ylab="% Body
Fat",main=main)
Crash Course: R & BioConductor
40
> # Define a continuous sequence Z ranging from -5 to +5
> Z <- seq(from=-5,to=5,length=8000)
> # Define a sequence representing the density of a normal curve
> fZ <- dnorm(Z)
> # Plot a normal curve
> plot(Z,fZ,type="l",ylab="Density",main="Normal Curve")
A main title was defined for the XY scatter plot of % body fat vs. weight (lbs) from the AE data, before
calling the plot() command with its xlab and ylab options to define the X- and Y-axis labels, respectively.
Since the probability density of a standard normal distribution is really a function f(x), two new variables Z and
fZ were defined to create a line plot of the standard normal density. First, the variable Z was defined as a
sequence of 8000 evenly spaced rational numbers from -5 to +5 using the sequence() command. Second, the
variable fZ was defined as a sequence of 8000 numbers resulting from the function f(x) using the dnorm()
command in R. Finally, a line plot was created from the plot() function by using the parameter type =”l”.
3.2 Custom Titles, Subtitles and Axes Labels
Most graphics procedures (e.g. pie(), hist(), plot(), ...) have some common parameters that allow
users to add specific text for the main titles, subtitles and axes labels. There are additional commands that allow
you to customize the look and feel of these labels for a more professional look. The following sections reveal
some helpful tips about customizing the labels on a graph.
3.2.1 Adding and removing groups from a factor variable
Take a close look at the pie chart (Figure 7) and bar chart (Figure 12) created from the adverse events of
the AE data. You may have noticed a possible typo in the data set, because the data contains two very similar
groups “myalgia” and “mylagia”. The “mylagia” group is a typo, but can it be removed from the plot?
> # Examine the 18 levels of the Adverse.Event variable
> AE$Adverse.Event
[1] Tenderness Arthralgia Mylagia Erythema Erythema Anemia
Anemia
...
[57] Nausea Headache Nodule Anemia Swelling
Leukopenia Elavated CH50
[64] Headache
18 Levels: Anemia Arthralgia Dimpling Ecchymosis Elavated CH50 Erythema Headache
Induration ... Tenderness
> # Store the list of variable names as a new variable
> new.labels <- levels(AE$Adverse.Event)
> # Verify the list still has 18 levels
> length(new.labels)
[1] 18
> # Use indexing to replace the “Mylagia” label with “Myalgia”
> new.labels[12] <- "Myalgia"
> # Assign these new labels to the levels of Adverse.Event
> levels(AE$Adverse.Event) <- new.labels
> # Verify Adverse.Event now has only 17 levels
> AE$Adverse.Event
[1] Tenderness Arthralgia Myalgia Erythema Erythema Anemia
Anemia
...
[57] Nausea Headache Nodule Anemia Swelling
Leukopenia Elavated CH50
[64] Headache
17 Levels: Anemia Arthralgia Dimpling Ecchymosis Elavated CH50 Erythema Headache
Induration ... Tenderness
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor
Crash course in R and BioConductor

Más contenido relacionado

La actualidad más candente

Vaughn aip walkthru_pag2015
Vaughn aip walkthru_pag2015Vaughn aip walkthru_pag2015
Vaughn aip walkthru_pag2015Araport
 
ICAR 2015 Workshop - Blake Meyers
ICAR 2015 Workshop - Blake MeyersICAR 2015 Workshop - Blake Meyers
ICAR 2015 Workshop - Blake MeyersAraport
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data Surya Saha
 
ICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartAraport
 
Webinar : Introduction to R Programming and Machine Learning
Webinar : Introduction to R Programming and Machine LearningWebinar : Introduction to R Programming and Machine Learning
Webinar : Introduction to R Programming and Machine LearningEdureka!
 
From Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleFrom Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleDatabricks
 
R Programming Overview
R Programming Overview R Programming Overview
R Programming Overview dlamb3244
 
Bio-IT 2017 - Session 7: Next-Gen Sequencing Informatics
Bio-IT 2017 - Session 7: Next-Gen Sequencing InformaticsBio-IT 2017 - Session 7: Next-Gen Sequencing Informatics
Bio-IT 2017 - Session 7: Next-Gen Sequencing InformaticsYaoyu Wang
 
Pipeline Scripting for the Parallel Alignment of Genomic Short Sequence Reads
Pipeline Scripting for the Parallel Alignment of Genomic Short Sequence ReadsPipeline Scripting for the Parallel Alignment of Genomic Short Sequence Reads
Pipeline Scripting for the Parallel Alignment of Genomic Short Sequence ReadsAdam Bradley
 
Introduction to R ajay Ohri
Introduction to R ajay OhriIntroduction to R ajay Ohri
Introduction to R ajay OhriAjay Ohri
 
PhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflowsPhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflowsdgarijo
 
OpenPOWER Academia and Research team's webinar - Presentations from Oak Ridg...
OpenPOWER Academia and Research team's webinar  - Presentations from Oak Ridg...OpenPOWER Academia and Research team's webinar  - Presentations from Oak Ridg...
OpenPOWER Academia and Research team's webinar - Presentations from Oak Ridg...Ganesan Narayanasamy
 

La actualidad más candente (20)

LSESU a Taste of R Language Workshop
LSESU a Taste of R Language WorkshopLSESU a Taste of R Language Workshop
LSESU a Taste of R Language Workshop
 
Vaughn aip walkthru_pag2015
Vaughn aip walkthru_pag2015Vaughn aip walkthru_pag2015
Vaughn aip walkthru_pag2015
 
Introduction to statistical software R
Introduction to statistical software RIntroduction to statistical software R
Introduction to statistical software R
 
ICAR 2015 Workshop - Blake Meyers
ICAR 2015 Workshop - Blake MeyersICAR 2015 Workshop - Blake Meyers
ICAR 2015 Workshop - Blake Meyers
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data
 
ICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick Provart
 
Open64 compiler
Open64 compilerOpen64 compiler
Open64 compiler
 
Class ppt intro to r
Class ppt intro to rClass ppt intro to r
Class ppt intro to r
 
4A2B2C-2013
4A2B2C-20134A2B2C-2013
4A2B2C-2013
 
Webinar : Introduction to R Programming and Machine Learning
Webinar : Introduction to R Programming and Machine LearningWebinar : Introduction to R Programming and Machine Learning
Webinar : Introduction to R Programming and Machine Learning
 
From Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleFrom Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at Scale
 
A
AA
A
 
R programming
R programmingR programming
R programming
 
R Programming Overview
R Programming Overview R Programming Overview
R Programming Overview
 
Bio-IT 2017 - Session 7: Next-Gen Sequencing Informatics
Bio-IT 2017 - Session 7: Next-Gen Sequencing InformaticsBio-IT 2017 - Session 7: Next-Gen Sequencing Informatics
Bio-IT 2017 - Session 7: Next-Gen Sequencing Informatics
 
R language
R languageR language
R language
 
Pipeline Scripting for the Parallel Alignment of Genomic Short Sequence Reads
Pipeline Scripting for the Parallel Alignment of Genomic Short Sequence ReadsPipeline Scripting for the Parallel Alignment of Genomic Short Sequence Reads
Pipeline Scripting for the Parallel Alignment of Genomic Short Sequence Reads
 
Introduction to R ajay Ohri
Introduction to R ajay OhriIntroduction to R ajay Ohri
Introduction to R ajay Ohri
 
PhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflowsPhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflows
 
OpenPOWER Academia and Research team's webinar - Presentations from Oak Ridg...
OpenPOWER Academia and Research team's webinar  - Presentations from Oak Ridg...OpenPOWER Academia and Research team's webinar  - Presentations from Oak Ridg...
OpenPOWER Academia and Research team's webinar - Presentations from Oak Ridg...
 

Similar a Crash course in R and BioConductor

statistical computation using R- report
statistical computation using R- reportstatistical computation using R- report
statistical computation using R- reportKamarudheen KV
 
R journal 2011-2
R journal 2011-2R journal 2011-2
R journal 2011-2Ajay Ohri
 
A Handbook Of Statistical Analyses Using R
A Handbook Of Statistical Analyses Using RA Handbook Of Statistical Analyses Using R
A Handbook Of Statistical Analyses Using RNicole Adams
 
SessionOne_KnowingRandRStudio
SessionOne_KnowingRandRStudioSessionOne_KnowingRandRStudio
SessionOne_KnowingRandRStudioHellen Gakuruh
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R ProgrammingIRJET Journal
 
Up your data game: How to use R to wrangle, analyze, and visualize data faste...
Up your data game: How to use R to wrangle, analyze, and visualize data faste...Up your data game: How to use R to wrangle, analyze, and visualize data faste...
Up your data game: How to use R to wrangle, analyze, and visualize data faste...Charles Guedenet
 
Choosing a Data Visualization Tool for Data Scientists_Final
Choosing a Data Visualization Tool for Data Scientists_FinalChoosing a Data Visualization Tool for Data Scientists_Final
Choosing a Data Visualization Tool for Data Scientists_FinalHeather Choi
 
R programming for psychometrics
R programming for psychometricsR programming for psychometrics
R programming for psychometricsDiane Talley
 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studioDerek Kane
 
Twitter analysis by Kaify Rais
Twitter analysis by Kaify RaisTwitter analysis by Kaify Rais
Twitter analysis by Kaify RaisAjay Ohri
 
Statistical Packages SPSS, R, Python - Business Statistics & Research Methods...
Statistical Packages SPSS, R, Python - Business Statistics & Research Methods...Statistical Packages SPSS, R, Python - Business Statistics & Research Methods...
Statistical Packages SPSS, R, Python - Business Statistics & Research Methods...Sundar B N
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for freeAjay Ohri
 
IRJET- Natural Language Query Processing
IRJET- Natural Language Query ProcessingIRJET- Natural Language Query Processing
IRJET- Natural Language Query ProcessingIRJET Journal
 
R programming language
R programming languageR programming language
R programming languageKeerti Verma
 
Show and tell program 04 2014-09-04
Show and tell program 04 2014-09-04Show and tell program 04 2014-09-04
Show and tell program 04 2014-09-04nihshowandtell
 
A snake, a planet, and a bear ditching spreadsheets for quick, reproducible r...
A snake, a planet, and a bear ditching spreadsheets for quick, reproducible r...A snake, a planet, and a bear ditching spreadsheets for quick, reproducible r...
A snake, a planet, and a bear ditching spreadsheets for quick, reproducible r...NASIG
 

Similar a Crash course in R and BioConductor (20)

statistical computation using R- report
statistical computation using R- reportstatistical computation using R- report
statistical computation using R- report
 
R_L1-Aug-2022.pptx
R_L1-Aug-2022.pptxR_L1-Aug-2022.pptx
R_L1-Aug-2022.pptx
 
R journal 2011-2
R journal 2011-2R journal 2011-2
R journal 2011-2
 
A Handbook Of Statistical Analyses Using R
A Handbook Of Statistical Analyses Using RA Handbook Of Statistical Analyses Using R
A Handbook Of Statistical Analyses Using R
 
SessionOne_KnowingRandRStudio
SessionOne_KnowingRandRStudioSessionOne_KnowingRandRStudio
SessionOne_KnowingRandRStudio
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R Programming
 
R for data analytics
R for data analyticsR for data analytics
R for data analytics
 
Up your data game: How to use R to wrangle, analyze, and visualize data faste...
Up your data game: How to use R to wrangle, analyze, and visualize data faste...Up your data game: How to use R to wrangle, analyze, and visualize data faste...
Up your data game: How to use R to wrangle, analyze, and visualize data faste...
 
Choosing a Data Visualization Tool for Data Scientists_Final
Choosing a Data Visualization Tool for Data Scientists_FinalChoosing a Data Visualization Tool for Data Scientists_Final
Choosing a Data Visualization Tool for Data Scientists_Final
 
R programming for psychometrics
R programming for psychometricsR programming for psychometrics
R programming for psychometrics
 
UNIT-1 Start Learning R.pdf
UNIT-1 Start Learning R.pdfUNIT-1 Start Learning R.pdf
UNIT-1 Start Learning R.pdf
 
Toolboxes for data scientists
Toolboxes for data scientistsToolboxes for data scientists
Toolboxes for data scientists
 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studio
 
Twitter analysis by Kaify Rais
Twitter analysis by Kaify RaisTwitter analysis by Kaify Rais
Twitter analysis by Kaify Rais
 
Statistical Packages SPSS, R, Python - Business Statistics & Research Methods...
Statistical Packages SPSS, R, Python - Business Statistics & Research Methods...Statistical Packages SPSS, R, Python - Business Statistics & Research Methods...
Statistical Packages SPSS, R, Python - Business Statistics & Research Methods...
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for free
 
IRJET- Natural Language Query Processing
IRJET- Natural Language Query ProcessingIRJET- Natural Language Query Processing
IRJET- Natural Language Query Processing
 
R programming language
R programming languageR programming language
R programming language
 
Show and tell program 04 2014-09-04
Show and tell program 04 2014-09-04Show and tell program 04 2014-09-04
Show and tell program 04 2014-09-04
 
A snake, a planet, and a bear ditching spreadsheets for quick, reproducible r...
A snake, a planet, and a bear ditching spreadsheets for quick, reproducible r...A snake, a planet, and a bear ditching spreadsheets for quick, reproducible r...
A snake, a planet, and a bear ditching spreadsheets for quick, reproducible r...
 

Más de Bioinformatics and Computational Biosciences Branch

Más de Bioinformatics and Computational Biosciences Branch (20)

Hong_Celine_ES_workshop.pptx
Hong_Celine_ES_workshop.pptxHong_Celine_ES_workshop.pptx
Hong_Celine_ES_workshop.pptx
 
Virus Sequence Alignment and Phylogenetic Analysis 2019
Virus Sequence Alignment and Phylogenetic Analysis 2019Virus Sequence Alignment and Phylogenetic Analysis 2019
Virus Sequence Alignment and Phylogenetic Analysis 2019
 
Nephele 2.0: How to get the most out of your Nephele results
Nephele 2.0: How to get the most out of your Nephele resultsNephele 2.0: How to get the most out of your Nephele results
Nephele 2.0: How to get the most out of your Nephele results
 
Introduction to METAGENOTE
Introduction to METAGENOTE Introduction to METAGENOTE
Introduction to METAGENOTE
 
Intro to homology modeling
Intro to homology modelingIntro to homology modeling
Intro to homology modeling
 
Protein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modelingProtein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modeling
 
Homology modeling: Modeller
Homology modeling: ModellerHomology modeling: Modeller
Homology modeling: Modeller
 
Protein docking
Protein dockingProtein docking
Protein docking
 
Protein function prediction
Protein function predictionProtein function prediction
Protein function prediction
 
Protein structure prediction with a focus on Rosetta
Protein structure prediction with a focus on RosettaProtein structure prediction with a focus on Rosetta
Protein structure prediction with a focus on Rosetta
 
Biological networks
Biological networksBiological networks
Biological networks
 
UNIX Basics and Cluster Computing
UNIX Basics and Cluster ComputingUNIX Basics and Cluster Computing
UNIX Basics and Cluster Computing
 
Statistical applications in GraphPad Prism
Statistical applications in GraphPad PrismStatistical applications in GraphPad Prism
Statistical applications in GraphPad Prism
 
Intro to JMP for statistics
Intro to JMP for statisticsIntro to JMP for statistics
Intro to JMP for statistics
 
Categorical models
Categorical modelsCategorical models
Categorical models
 
Overview of statistical tests: Data handling and data quality (Part II)
Overview of statistical tests: Data handling and data quality (Part II)Overview of statistical tests: Data handling and data quality (Part II)
Overview of statistical tests: Data handling and data quality (Part II)
 
Overview of statistics: Statistical testing (Part I)
Overview of statistics: Statistical testing (Part I)Overview of statistics: Statistical testing (Part I)
Overview of statistics: Statistical testing (Part I)
 
GraphPad Prism: Curve fitting
GraphPad Prism: Curve fittingGraphPad Prism: Curve fitting
GraphPad Prism: Curve fitting
 
Appendix: Crash course in R and BioConductor
Appendix: Crash course in R and BioConductorAppendix: Crash course in R and BioConductor
Appendix: Crash course in R and BioConductor
 
GraphPad Prism: Customizing your graphs
GraphPad Prism: Customizing your graphsGraphPad Prism: Customizing your graphs
GraphPad Prism: Customizing your graphs
 

Último

Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 

Crash course in R and BioConductor

  • 1. Bioinformatics and Computational Biosciences Branch   NIAID Office of Cyber Infrastructure and Computational Biology    NIH Intranet: http://bioinformatics.niaid.nih.gov  ScienceApps@niaid.nih.gov  ‘ Training Manual Crash Course: R & BioConductor Jeff Skinner, M.S. Sudhir Varma, Ph.D. Download a copy of this manual and all related training materials at: http://collab.niaid.nih.gov/sites/research/SIG/Bioinformatics/
  • 2. Crash Course: R & BioConductor Table of Contents Ch. 1. Introduction to R ......................................................................................................................................... 1  1.1  What is R? What is BioConductor? ........................................................................................................ 1  1.2  A Brief History of R................................................................................................................................. 1  1.3  Download R and BioConductor............................................................................................................... 2  1.4  Licensing Concerns.................................................................................................................................. 3  1.5  Helpful Resources .................................................................................................................................... 3  Ch. 2. Basics of Using R........................................................................................................................................ 4  2.1  Computing Environments ........................................................................................................................ 4  2.1  R GUI....................................................................................................................................................... 5  2.3  Basic Arithmetic....................................................................................................................................... 8  2.4  Searching the Help Menus ..................................................................................................................... 12  2.5  Installing R Packages and Source Scripts .............................................................................................. 14  2.6  Entering and Importing Data.................................................................................................................. 16  2.7  Data Types.............................................................................................................................................. 22  2.8  Manipulating Data in R.......................................................................................................................... 26  2.9  Saving and Exporting Data .................................................................................................................... 31  2.10  Changing Directories.......................................................................................................................... 32  2.11  Sample Problems for Students............................................................................................................ 32  Ch. 3. Graphics and Figures in R......................................................................................................................... 34  3.1  Basic Types of Graphics and Figures..................................................................................................... 34  3.2  Custom Titles, Subtitles and Axes Labels.............................................................................................. 40  3.3  Custom Color and Layout Options......................................................................................................... 44  3.4  Multi-step Graphics................................................................................................................................ 48  3.5  Figure Legends and Overlaid Text......................................................................................................... 54  3.6  Multi-panel Layouts............................................................................................................................... 58  3.7  Exporting R Graphics............................................................................................................................. 60  3.8  Sample Problems for Students ............................................................................................................... 60  Ch. 4. Basic Statistical Tests and Analyses in R.................................................................................................. 62  4.1  Student’s T-test ...................................................................................................................................... 62  4.2  Linear Regression and ANOVA ............................................................................................................ 63  4.3  R Commander ........................................................................................................................................ 68  4.4  Sample Problems for Students ............................................................................................................... 69  Ch. 5. Writing Basic Scripts in R......................................................................................................................... 70  5.1  Text Editors............................................................................................................................................ 70  5.2  Hello World! .......................................................................................................................................... 71  5.3  Use Scripts to Automate and Save Workflows...................................................................................... 72  5.4  Computation and Output Options .......................................................................................................... 73  5.5  Sample Problems for Students ............................................................................................................... 77  Literature Cited..................................................................................................................................................... 77 
  • 3. Crash Course: R & BioConductor 1 Ch. 1. Introduction to R 1.1 What is R? What is BioConductor? Many biologists and researchers have heard about the powerful analysis and visualization capabilities of R and BioConductor, even if they have not used R or BioConductor themselves. It can be tempting to think of R as a typical statistics software package, but that would belie its true power and capabilities. R combines an open source software platform for statistics and data visualization with a powerful scripting language that can be used to create new analyses and workflows. Both the software package itself and its scripting language are called R. While most people use R for statistical analyses and data visualization, R can be used for matrix algebra computations, data management and enterprise reporting. Advanced users will find that R interacts well with databases, some commercial statistics software packages and many programming languages (e.g. Perl, Python, Java, Fortran, C, HTML, TeX and LaTeX), so R can be utilized in many complicated computing problems. BioConductor is an open source software development project that creates new tools for the analysis and comprehension of genomic. The BioConductor project is almost entirely concerned with the development and distribution of R package libraries for the analysis of microarray and other genomic data. Packages are available for doing various kinds of annotation, normalization, filtering, statistical analysis and visualization of the experimental data from microarray and other genomic studies. 1.2 A Brief History of R The history of R begins at AT&T Bell Laboratories in the 1970’s with the development of the S statistics package by John Chambers, Richard Becker and others. In the early 1970’s, researchers and statisticians at Bell labs were using a library of FORTRAN programs called Statistical Computing Subroutines (SCS) to compute all their statistical analyses (Becker 1994). This FORTRAN library was preferable to the commercial statistics packages available in the 1970’s, because the statisticians at Bell Labs were constantly developing new statistical methods and they wanted specialized reports of their statistical results. However, this SCS library was too cumbersome for many simple statistical analyses and graphs, like Student’s t-tests or linear regression methods. The S statistics package was created to provide an interactive programming language and computing environment to simplify the procedures in the SCS FORTRAN library, while still providing a flexible platform to program and develop new statistical and graphical methods. To make statistical computing more interactive, the S programming language was designed to have the most natural grammar and syntax possible. The goal was to create a higher-level programming language that would be similar to regular English. Most users would write their S code using basic function statements in this higher-level S language, while more advanced users and developers could still create new code in lower-level languages like FORTRAN. The original S language featured advanced text editing and powerful graphics, with the usual statistical tests. The S software was used internally at Bell labs in the late 1970’s and it was distributed publically by the early 1980’s. The first textbook about the S language was published in 1984 (Becker and Chambers 1984) and the S software was made publically available through AT&T’s software sales group. Later, the S software was rewritten in C and combined with another quantitative computing project at Bell Labs to create New S (Becker et al. 1988). By the early 1990’s, then S statistics software had found thousands of users and a handful of books had been published on the S language. The R statistical software package was initially published and released in 1996 (Ihaka and Gentleman 1996). The goal was to create a flexible statistics software and programming language that utilized the best features of the S statistics software package and a functional programming language called Scheme (Sussman and Steele 1975). The name R was chosen to represent the first names of its developers, Ross Ihaka and Robert Gentleman, and also as a play on the name of the S software and programming language (Hornik 2008).
  • 4. Crash Course: R & BioConductor 2 Shortly after its release, the Comprehensive R Archive Network (CRAN) was opened and R became an official part of the GNU Project (http://www.gnu.org). The first stable release of R was offered in February, 2000. 1.3 Download R and BioConductor If you need to download R, visit the Comprehensive R Archive Network (CRAN) website (http://cran.r- project.org/) and look for the download links (Figure 1). There are at least three platform-specific download links for Linux, Mac OSX and Windows operating systems. Click these links to download a ready-to-use installation of R software. There are additional links to download individual source or binary files, so expert users can build their own custom installation of R. These links to the source and binary files also include “daily snapshots” of future versions of R. Remember that R is open source software, so everyone is welcome to modify its code and contribute to upcoming versions of the software. Most biology researchers should probably use the platform-specific download links, but remember that the custom installation options are available. After you have downloaded R, you may want to download the BioConductor packages from their website (Figure 2). If you follow the installation instructions from the BioConductor website, you need to type the commands > source("http://bioconductor.org/biocLite.R") > biocLite() into the R command line. These two commands will download and install all of the basic BioConductor packages on your computer. There are many additional BioConductor packages available for download, but you will want to download them as needed. You can also download the biocLite() packages one at a time, but many of the biocLite() packages are required more advanced BioConductor packages therefore it makes sense to download biocLite() now. Installing R packages will be covered in greater detail in Section 2.5 of the manual. Figure 1. The Comprehensive R Archive Network (CRAN) website.
  • 5. Crash Course: R & BioConductor 3 Figure 2. The BioConductor website. 1.4 Licensing Concerns It is important to remember that R is open source software, distributed under the GNU Public License (GPL). It may be a good idea to review the terms of the GPL before delving into a project using R. If you just intend to use existing R packages to analyze and visualize data in published experiments, then you probably do not have much to worry about. However, if you want to modify and distributed R software, or more importantly if you want to use R software as a part of a patented process or product, you should review the license very carefully. 1.5 Helpful Resources Because R is a free, open source software program, there is no corporate office to call or email for technical support. However, there are many resources available to help users learn to use R. Visit the R project website (http://www.R-project.org) to find free manuals, a FAQ page, a list of published books on R, the R Wiki and various mailing lists. You can find extensive documentation of individual R functions by using its help() commands, as demonstrated in section 2.4 of this manual. Some historic books on the S and R software packages include “the blue book” (Becker et al. 1988), “the white book” (Chambers and Hastie 1992) and “the green book” (Chambers 1998), but there are now dozens of statistics and programming text books for the R and S languages. The R-help mailing list is sometimes your best bet for person-to-person help with R and its functions, but it is important to read their posting guide before posting new messages to the mailing list.
  • 6. Crash Course: R & BioConductor 4 Ch. 2. Basics of Using R 2.1 Computing Environments Most biologists or researchers will use R and BioConductor on a Windows PC or Mac computer using the standard R GUI interface for their platform. However, R and BioConductor can also be run as command line applications on a PC, Mac, Linux or Unix machine, a server or even a high performance parallel computing cluster. There is no real advantage or disadvantage to using R from the command line or the R GUI. The features of R remain the same, no matter how you choose to access R. However, some users with programming experience may feel more comfortable using R from the command line. 2.1.1 MS Windows Command Prompt On a Windows PC, you can access R from the command line by opening the MS Windows Command Prompt (Figure 3). For many Windows PC users, the command line can be found by clicking > Start > All Programs > Accessories > Command Prompt. At the Command Prompt, type the capital letter “R” and hit the <Return> key to open the R software package. You should immediately see a message from R, reporting your version number and the license information. If R does not open at the Command Prompt, you may need to specify a path within the MS Windows operating system. Figure 3. The MS Windows Command Prompt. 2.1.2 Mac OSX Terminal On an Apple Macintosh computer, you can access R from the command line by opening the Terminal (Figure 4). You will likely find Terminal.app in the Utilities folder within your Applications folder in the OSX Finder. If you cannot find Terminal.app, try searching for “terminal” in the OSX Spotlight. At the Terminal, type the capital letter “R” and hit the <Return> key to open the R software package. You should immediately see a message from R, reporting your version number and the license information. 2.1.3 UNIX shells and SSH clients Linux and UNIX users can access R from the command line in the Bourne shell (sh), Bourne-Again shell (bash), C shell (csh) or other command line terminals. Type the capital letter “R” and hit the <Return> key at the command line to open the R software package, and you should immediately see the message from R
  • 7. Crash Course: R & BioConductor 5 Figure 4. The Apple Macintosh OSX 10.5.5 Terminal. to report your version number and license information. Another possible option is to access R from a secure shell (SSH) client. This option allows you to install R on a powerful UNIX machine or server, then access the R software on this powerful machine from another machine connected to the internet. 2.1 R GUI If you are not comfortable accessing R from the command line, you can access R from the R GUI that is included in the usual Windows or Mac download. The R GUI provides a few point-and-click buttons and menus to help you open and save files, download new R packages or even edit R scripts. All the features from these point-and-click buttons and menus can be accessed from the command line, but some users may prefer to have these commonly used features accessible from a GUI button instead memorizing their specific commands. Note that specific R GUI features differ slightly between the MS Windows GUI and the Apple Mac OSX GUI. Both are described below. 2.2.1 Windows PC GUI The current R GUI in MS Windows features seven clickable menus and eight clickable buttons to help you access commonly used features (Figure 5). The File menu allows you to Source R code, create a New script, Open script…, Display file(s)…, Load workspace…, Save workspace…, Load history…, Save history…, Change dir…, Print…, Save to file… and Exit. Note the menu options Open script…, Load workspace… and Save workspace… are also available in the first three clickable buttons from the left on the Windows R GUI, while the Print option is available as the last button on the right of the R GUI. The Edit menu allows users to Copy, Paste, Paste commands only, Copy and Paste, Select all, Clear console, open the Data editor… or change the GUI preferences…, if necessary. Note the Copy, Paste and Copy and Paste commands are also available as the fourth, fifth and sixth clickable buttons from the left. The View menu is used to hide or display the Toolbar of buttons at the top of the R GUI window and the Statusbar of system messages at the bottom of the R GUI (Figure 6). The Misc menu is used to Stop current computation, Stop all computations, Buffered output, Word completion, File name completion, List objects, Remove all objects and List search path. The Buffered output option prevents R from printing any messages while a command is running. The Word completion and File name completion options allow you to complete the names of R commands and file names,
  • 8. Crash Course: R & BioConductor 6 Figure 5. The MS Windows R GUI. Figure 6. The Toolbar and Statusbar of the Windows R GUI. respectively, by hitting the <TAB> key after typing the first letters of a command or filename. The List objects, Remove all objects and List search path options are equivalent to the ls(), rm(list = ls()) and search()commands, respectively. Note the option to Stop current computation can also be accessed by the second clickable button from the right. 2.2.2 Mac OSX GUI The R GUI in Mac OSX includes a menu bar at the top of your screen (Figure 7), and the GUI itself includes 10 clickable icons to provide access to commonly used features or features specific to the Mac OSX R GUI (Figure 8). The stop sign icon allows you to stop processing the most recently submitted R command, or Interupt current R computation. This is a useful feature, because some R procedures can require lengthy processing times that may stall or freeze some computers. The R icon is used to Source script or load data in R. If you would like to write your own scripts to automate analysis workflows or create new analyses, this button will allow you to quickly load the source files for your scripts. It also provides an easy way to load data. The bar chart icon allows you to Open a new Quartz device window. On a Mac computer, the quartz window is used to produce all graphics figures, like scatterplots and histograms. Opening a Quartz graphics device will allow you to view your graphics figures as you build them. The X11 icon allows you to open an X11 window in Mac OSX. This is a critical feature for the R GUI in Mac OSX, because many important R functions require an open X11 window to work properly. The lock icon is used to Authorize R to run system commands as root. This allows you to overwrite protected files and directories, so use this option with extreme caution. The table
  • 9. Crash Course: R & BioConductor 7 Figure 7. The Mac OSX R GUI Menu Bar. Figure 8. The Mac OSX R GUI console. icon is used to Show / Hide R command history (Figure 9). The command history allow all of the commands recently submitted to R, which can be a useful feature during lengthy R sessions when older commands may run off the screen. The color wheel icon is used to Set R console colors (Figure 10). These options allow you to change the color schemes within the R GUI console, the R GUI editor and the R GUI Quartz window. The R sheet and blank sheet icons are used to Open document in editor and Create new, empty document in the editor, repectively. The R GUI editor is a text editor environment within the Mac OSX R GUI that is typically used to edit R source scripts. The R GUI editor for Mac OSX includes some helpful automatic text formatting features to help you write and edit R code. The print icon is used to Print this document, which will print the R console. Be careful with the print button, because the R console could contain hundreds of statements and produce a lengthy printout. The switch icon is used to Quit R, which closes the current R session and the R GUI.
  • 10. Crash Course: R & BioConductor 8 Figure 9. The R command history window for the R GUI in Mac OSX. Figure 10. The Set R console colors menu. 2.3 Basic Arithmetic 2.3.1 Addition, subtraction and other basic operations Before you open your first R data set, it may be useful to explore some basic arithmetic operations in R. Type a simple addition statement (e.g. 3 + 4) in the R prompt and hit <Return> to view the result (Figure 11).
  • 11. Crash Course: R & BioConductor 9 Figure 11. An arithmetic operation entered into the R GUI. From this point forward, I will describe all R commands in boxed Courier New text as shown below: > 3 + 4 [1] 7 > Note that user entered commands will be preceded by the “>” character, while output from R will typically be preceded by an index number (e.g. [1]) or it will be displayed with special formatting. Some keyboard characters are reserved for special functions in R. One special character in R is the “#” symbol, which is used to add notes to R commands, scripts and code. In most situations, all text or code preceded by the the “#” symbol will be ignored in R. Try it for yourself by entering > # 3 + 4 > into the R prompt. Notice the addition statement was not evaluated as before, because the sum was not calculated. You can enter any kind of information after the “#” symbol, without any fear of an error or ruined R scripts. You can even add these notes after a valid command in the same line of code, as seen below: > 3 + 4 # This command produces the sum of 3 and 4 [1] 7 Now, enter the following commands into R to explore some basic arithmetic operations:
  • 12. Crash Course: R & BioConductor 10 > 3 + 4 # Addition [1] 7 > 3 - 4 # Subtraction [1] -1 > 3*4 # Multiplication [1] 12 > 3/4 # Division [1] 0.75 > 3^4 # Exponents [1] 81 > 3**4 # Another way to enter exponents [1] 81 > log(3) # Natural logarithm (i.e. log base e) [1] 1.098612 > log10(3) # Log base 10 [1] 0.4771213 > log2(3) # Log base 2 [1] 1.584963 > log(81,base=3) # Logarithms computed to any other base [1] 4 > exp(1) # Base of the natural logarithm [1] 2.718282 > pi # The constant pi [1] 3.141593 You can use the equal sign “=” or an left-facing arrow “<-“ to define variables and compute simple algebraic expressions in R. Note the equal sign is sometimes used inconsistently in R and sometimes creates problems in lengthy scripts. > a = 4 # Define a = 4 > b <- 3 # Define b = 3 (alternative coding) > a*b # Multiply a*b [1] 12 2.3.2 Inequalities Beyond these simple mathematical operations, R can be used to evaluate many mathematical inequalities: > 3 < 4 # Strictly less than or greater than [1] TRUE > 3**4 >= 100 # Greater than or equal to [1] FALSE > log(81,3) == 4 # Equal to [1] TRUE Note that the double equal sign command “==” is used to evaluate an equality statement, because the single equal sign command “=” is used to define variables. Soon, you will see that inequalities can provide a powerful means to identify subsets of data, when used with an index. 2.3.3 Matrix algebra More advanced users may want to use R for matrix algebra computations, like matrix multiplication or matrix inversions. If matrix computations interest you, I believe you will find that R is a very powerful platform for matrix computations that nearly rivals Matlab and similar platforms.
  • 13. Crash Course: R & BioConductor 11 > aa <- c(1,2,3,4) # Define a column vector “aa” > aa # Display the column vector “aa” [1] 1 2 3 4 > t(aa) # Transpose “aa” [,1] [,2] [,3] [,4] [1,] 1 2 3 4 The command c() is used to create a column vector in R. In the example above, the command aa <- c(1,2,3,4) was used to define a column vector “aa” with entries 1, 2, 3, and 4. The command t() is used to transpose vectors and matrices, so t(aa) will convert “aa” from a column vector to a row vector. Notice the difference between how the column vector aa and the row vector t(aa) are displayed. The column vector aa is displayed in one line of output, where the vector is preceded by [1] and its entries are only separated by spaces. The row vector t(aa) is displayed on two separate lines, which report its column and row entries. The output [1,] denotes an entry on the first row of a matrix, and the output [,1] denotes an entry on the first column of a matrix. > aa*t(aa) # Element-wise multiplication [,1] [,2] [,3] [,4] [1,] 1 4 9 16 > t(aa)*aa # Element-wise multiplication [,1] [,2] [,3] [,4] [1,] 1 4 9 16 > aa%*%t(aa) # Matrix multiplication [,1] [,2] [,3] [,4] [1,] 1 2 3 4 [2,] 2 4 6 8 [3,] 3 6 9 12 [4,] 4 8 12 16 > t(aa)%*%aa # Matrix multiplication [,1] [1,] 30 The usual multiplication operator “*” is not used for matrix multiplication. The “*” operator multiplies vectors and matrices element-wise, while the “%*%” operator is used for matrix multiplications. Both functions can be useful, but be careful to use the correct operator symbol for your calculations. ># Define a 3 x 3 matrix > bb <- matrix(c(1,2,3,4,0,5,1,3,1),nrow=3,ncol=3) ># ># Display the matrix ># > bb [,1] [,2] [,3] [1,] 1 4 1 [2,] 2 0 3 [3,] 3 5 1 ># ># Invert the matrix ># > solve(bb) [,1] [,2] [,3] [1,] -0.6521739 0.04347826 0.52173913 [2,] 0.3043478 -0.08695652 -0.04347826 [3,] 0.4347826 0.30434783 -0.34782609 The matrix() command is used to define a matrix from a vector. The matrix command is very robust and can be used to create simple or complicated matrices with a few keystrokes. E.g. the command
  • 14. Crash Course: R & BioConductor 12 matrix(4,nrow=3,ncol=5) would create a 3 x 5 matrix with every entry equal to 4. More details on these techniques will be given in section 2.6. The command solve() will invert most symmetric matrices, but more complicated Cholesky inverse and generalized inverse methods are also available. 2.4 Searching the Help Menus There are many books, guides and manuals available to help you learn how to use R, but inevitably every R user must search the help menus. The R help menus can help you find new functions or provide more detailed explanations of the inputs and outputs of a function you have already used. There are several useful help commands in R to find the documentation that you need. 2.4.1 Help documentation with help() and ? > help(t.test) # Find documentation for the function t.test > ?t.test # (same as above) The two functions above are used to find help documentation for a specific function, when you already know the function’s command. Try entering help(log) or ?log for another example. These help commands are most useful if you need a detailed explanation of a function you already use or if you would like to investigate a function you found in a paper or on the internet. The two commands are equivalent. Both produce an HTML-formatted manual for the specified function (Figure 12). You will notice that most of the help documentation for R functions follows very strict formatting. Each help page provides you the command information to call the function, the function’s name, a description of the function purpose, details about its usage within R, details about its arguments, details about its output and typically a specific example that you copy-and-paste into R for demonstration purposes. Figure 12. Help documentation for the function t.test.
  • 15. Crash Course: R & BioConductor 13 2.4.2 Keyword searches with help.search() Another type of search is required when you do not know the command of a specific function. Type help.search(“keyword”) to search for keywords and find all the command names of functions related to your keyword. For example, the command help.search(“students t test”) will produce a list of all functions related to the student’s t-test (Figure 13). In this example, only the t.test() function is found by our search, but other requests may generate many results. Figure 13. List of help files from help.search keyword search. 2.4.3 Google and other search engines One final suggestion is to use Google.com or other search engines to find help with R. I recommend that you always include the keyword “CRAN” or “BioConductor” in your R-related Google searches, because it can help direct you to search results that are most directly related to R packages and concepts (Figure 14). The search engine Rseek (http://www.rseek.org/) is a search engine that only queries the R help files and related websites.
  • 16. Crash Course: R & BioConductor 14 Figure 14. Searching Google for help with R packages. 2.5 Installing R Packages and Source Scripts Help searches and basic arithmetic functions are included in the base R software package, but often researchers need to use specialized research tools that are not included in the base R software. These specialized tools are often available as free downloadable packages or source scripts in R. Packages are user- submitted R scripts and functions that have are made been posted online by CRAN or BioConductor. Packages are downloaded and installed from the R GUI or the command line. The code for these packages is typically downloaded as a source file, written in the R programming language, or as a binary, written in a compiled language like C or FORTRAN. Some functions and scripts have not been submitted as packages to CRAN or BioConductor, but they still may be loaded into your installation of R as a source file. 2.5.1 R packages Click > Packages > Install Package(s)… to install packages from the Windows R GUI. If you are installing packages for the first time, you may be prompted to Set CRAN mirror… or Select Repositories…before you continue (Figure 15). Remember, the CRAN mirror site is a server that contains the most recent R software downloads and packages. You want to choose one of the CRAN mirrors nearest you for convenience. The package repositories are specific lists of packages from CRAN and BioConductor. Choose only the repositories you need, because selecting more repositories will create a longer list of packages for you to browse.
  • 17. Crash Course: R & BioConductor 15 Figure 15. Set CRAN mirror…, Select Repositories… and R packages menus in Windows R GUI. Use the scroll bar to browse through the list of R and BioConductor packages and select the packages you need for installation (Figure 15). You can hold the Ctrl key to select multiple packages, if necessary. Note the list of packages can be very long, especially if several repositories were selected. Once you have selected the R packages you need, click OK to download and install the packages. Alternatively, if you know the name and repository address of the packages you need to download, you can download and install a package from the command line using the command install.packages(). There is no advantage or disadvantage to using the R GUI or the command line to install a package, but the install.packages() command can be very helpful when scripting. Using the install.packages() command in your source code will ensure any users of your script will have all the necessary R packages. As the packages download, you may see some log messages in your R console to keep you informed of the download progress and any potential errors. After the packages have finished downloading, you will want to enter a library() or require() command for the package to load the contents of the package into your R workspace. Both commands have the same function, but the require() is preferred for use within R functions. > install.packages("gtools",repos="http://cran.r-project.org") trying URL 'http://cran.r-project.org/bin/windows/contrib/2.6/gtools_2.4.0.zip' Content type 'application/zip' length 157621 bytes (153 Kb) opened URL downloaded 153 Kb package 'gtools' successfully unpacked and MD5 sums checked The downloaded packages are in C:Documents and SettingsskinnerjLocal SettingsTempRtmpIQ1HTcdownloaded_packages updating HTML package descriptions > library(gtools)
  • 18. Crash Course: R & BioConductor 16 2.5.2 Source scripts This manual will introduce the idea of R source scripts in Chapter 5, but keep in mind that you can also upload new functions using R source scripts. If you have found a R source script that you need to upload, click > File > Source R code… on the Windows R GUI or > File > Source File… on the Mac OSX R GUI. Use the command source(), if you prefer to load the source script from the command line. Most casual R users will only use the file parameter of the source() command. > # Load a source script file (.R extension) > source("~/example.R") 2.6 Entering and Importing Data There are dozens of ways to enter data into R. Many famous and historical datasets are already uploaded and available for use in the base R software or in an R package. There are hundreds of functions that make it easy to type and enter small data sets manually into R. Other functions can help to generate huge amounts of random or simulated data. Finally, there are a variety of functions available to help you upload your own data files, whether they are stored as R workspace data (.Rdata), as plain text files (.txt, .csv, …), in proprietary data file formats from other popular statistics packages (.sav, .sas7DAT, …), as MS Excel spreadsheets (.xls, .xlsx, …) or even tables from a database (e.g. MS Access, MySQL, …). 2.6.1 Base and package data in R Enter the command data() or click > Packages & Data > Data Manager in the Mac OSX R GUI to view a list of the datasets currently available on your installation of R (Figure 16). Try to find the data set “iris” in the list. Select the “iris” data set from the list, or enter the help command ?iris, to read the documentation describing this data set. The “iris” data set in R is the famous “Fisher’s iris data”, originally collected by biologist Edgar Anderson. This classic data set has been used in countless statistics and biology textbooks, and I will use it later in this manual. Figure 16. A list of R data sets.
  • 19. Crash Course: R & BioConductor 17 Enter the command iris to view the Fisher’s iris data set, already stored in R. > iris Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 146 6.7 3.0 5.2 2.3 virginica 147 6.3 2.5 5.0 1.9 virginica 148 6.5 3.0 5.2 2.0 virginica 149 6.2 3.4 5.4 2.3 virginica 150 5.9 3.0 5.1 1.8 virginica > The iris data set includes data for 150 iris plants, with 50 plants each from the Iris setosa, I. versicolor L. and I. virginica L. species. Four measurements were taken from each of the 150 plants to record their sepal length, sepal width, petal length and petal width. This data set is often used as an example in classification and clustering problems, where statisticians use the data as a training set to predict the species of an iris plant based on its sepal length, sepal width, petal length and petal width measurements. 2.6.2 Entering data manually The base and package data available in R can be a useful resource, but most R users need to upload their own data into R. Most researchers will already have their data stored in a large data file. However, for some small data sets, it may be easiest to enter the data manually from the command line. In other situations, researchers may need to simulate large amounts of data using procedures from the command line. Some common R procedures will be used to generate a small dataset concerning the Alpha-fetoprotein (AFP) levels of 20 medical patients. > # generate a list of subject IDs, numbered from 1 to 20 > # > subject <- 1:20 > # > # create 10 entries for male subjects > # > males <- rep("male",10) > # > # create 10 entries for female subjects > # > females <- rep("female",10) > # > # combine male and female entries into one column vector > # > gender <- c(males,females) > # > # bind subjectID and gender columns together > # > afp.data <- data.frame(subjectID,gender) > afp.data Recall the command subject <- 1:20 can be used to generate the sequence of integers from 1 to 20. Alternatively, subject <- seq(from = 1, to = 20, by = 1) could have been used to generate the same sequence. These numeric sequences could be used to specify a series of patient ID’s or subject ID’s. Here the
  • 20. Crash Course: R & BioConductor 18 ID’s have been stored as a variable named subject. The command males <- rep("male",10) is used to generate a vector of 10 replicated string values to label the male patients. A similar replicated vector is created to identify 10 female patients, then the male and female labels are combined with the column vector command c() to store a variable named gender. These two variables can be joined together to create the subject ID’s and gender labels of a new data set named afp.data using the command data.frame(), which will be defined later. > # generate 10 male and 10 female random normal heights > # > height <- c(rnorm(10,70,2.5),rnorm(10,64,2.2)) > # > # generate 10 male and 10 female random uniform weights > # > weight <- c(runif(10,155,320),runif(10,95,210)) > # > # compute body mass index (BMI) for 10 men and 10 women > # > BMI <- (weight*703)/(height**2) > # > # enter five treatment levels of a new drug (ng/mL) > # > drug <- rep(x = seq(from = 0, to = 20, by = 5), times = 4) > # > # manually enter Alpha-fetoprotein (AFP) levels for 20 patients > # > AFP.before <- c(0.8,2.3,1.1,4.8,3.7,12.5,0.3,4.4,4.9,0.0,1.8, 2.4,23.6,8.9,0.7,3.3,3.1,0.5,2.7,4.5) > # > afp.data <- data.frame(afp.data,height,weight,BMI,AFP.before) > afp.data We can use random data procedures to create additional factors for the afp.data data set in R. Suppose we want to randomly generate height and weight values for our male and female patients. Furthermore, assume male heights are normally distributed with mean 70 inches (i.e. 5 foot 10 inches and standard deviation 2.5 inches, and female heights are normally distributed with mean 64 inches and standard deviation 2.2 inches. The command rnorm(10,70,2.5) is used to randomly generate 10 new observations from a Gaussian normal distribution with mean 70 and standard deviation 2.5 to represent the heights of our male patients, while rnorm(10,64,2.2) will generate the data for our female patients. Next, we want to generate weight values for male and female patients, assuming male weights are uniformly distributed between 155 pounds and 320 pounds (i.e. each mass between 155 lbs. and 320 lbs is equally likely), while female weights are uniformly distributed between 95 lbs and 210 lbs. The commands runif(10,155,320) and runif(10,95,210) will generate the male and female data, respectively. We could use the height and weight variables to compute a new variable to represent body mass index (BMI) from the usual formula BMI = (703 * mass (lbs) ) / (height2 ). Next, suppose we want to add a variable to represent five increasing concentrations of a new drug, from 0 ng/mL to 20 ng/mL. The seq() command can be used to generate the sequence of drug concentrations 0 ng/mL, 5 ng/mL, 10 ng/mL, 15 ng/mL and 20 ng/mL, while the rep() statement will repeat that sequence of drug concentrations four times to fill out the rest of the column. Finally, we could enter a vector of pre- treatment AFP values to complete the data set. Join all the new variable columns together with the earlier afp.data data set using the data.frame() command to finish, then view the results by entering the data set name afp.data at the command line.
  • 21. Crash Course: R & BioConductor 19 2.6.3 Importing previously saved R data workspaces (.RData) Not surprisingly, it is possible to save and load R data sets in a their own file format (file extension .RData). If you need to import a previously saved .RData file, you can use the load() command. The load() command only includes two parameters, the file parameter used to specify the filepath of the .RData file that will be imported and the more complicated envir parameter that specifies an environment for the uploaded R workspace. Briefly, an R environment is a collection of named objects in R. The user’s R session workspace is an environment, for example. If during one session of use, an R user defines the variable aa = 42.7, then any reference to the variable aa will return the value 42.7 until the variable aa is redefined or until the workspace is closed. If you load or import a previously saved R data workspace, all the variables and objects defined in that workspace will be retained. 2.6.4 Text file data (.txt, .csv) with read.table() and scan() The most efficient way to import data into R is to upload a text file. Text files are typically smaller than proprietary data formats, like MS Excel spreadsheets. Since plain text files are not organized by columns, rows and cells like a MS Excel spreadsheet, users will need to specify a character to separate the values of different fields (i.e. a delimiter to separate columns) and a character to mark the end of each line of text (i.e. the end of a row). Often, the first row of a text data file is used to name the fields (i.e. columns) of a data table. It is also common to enclose strings of character data with single or double quotation marks to avoid possible conflicts with delimiter symbols and numeric fields. All these issues will be addressed in the parameters of the text file import procedures. The most popular way to import a plain text data file is with the read.table() command. The read.table() command is the most general method to read table style data from a plain text data file. Suppose you have two different text file data sets. The first data set is a tab-delimited text file named . The second data set is a comma separated value (.csv) plain text file named . Both files can be opened with the read.table() command. > # Import a tab-delimited text for data set named Expression > # > Expression <- read.table(file = “~/expression.txt, header = TRUE, sep = “t”, nrows = 40000, stringsAsFactors = FALSE) > Expression > # > # Import a comma separated value text file data set named ‘AE’ > # > AE <- read.table(file = "~/AdverseEvents.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE) > AE The read.table() procedure include the parameter file to specify the quoted file path of the data file that will be imported. The header = TRUE parameter indicates that the data file has a header line to define the column (i.e. variable) names of the data table. If the header is not specified, column names can be entered using the parameter col.names. The parameter sep = “t” indicates that the Expression data file should be imported as a tab-delimited text files (i.e. columns are separated by tabs). Likewise, the parameter sep = “,” indicates that fields are separated by commas in the AE data. The parameter nrows specifies the number of rows of data in the text file to help speed up the import process for a large microarray expression data set. When stringsAsFactors = FALSE, all string variables in the data file will be stored as character data rather than factor data. Many statistical and graphing procedures require character data to be specified as factors, but it may be easier to modify data tables if the string variables are stored as character data. The read.table
  • 22. Crash Course: R & BioConductor 20 command includes many other parameters that could be useful. For example, the parameter dec = “,” could be used to specify that numbers are recorded with European-style “comma” decimals (e.g. 2.63 = 2,63). The read.csv() and read.csv2() commands are equivalent to read.table() with default parameters optimized for comma separated value text files. Specifically, the read.csv() command is optimized for ‘American’ formatted .csv files, where fields are separated by commas and decimals are separated from integers with a period. ‘European’ formatted .csv files, where fields are separated by semicolons and decimals are separated from integers with a comma, should be imported with read.csv2(). Similarly, the read.delim() and read.delim2() commands are equivalent to read.table() with default parameters optimized for importing tab-delimited text files. Specifically, the read.delim() command is optimized for ‘American’ formatted .txt files, where fields are separated by tabs and decimals are separated from integers with a period. ‘European’ formatted .txt files, where fields are separated by tabs and decimals are separated from integers with a comma, should be imported with the read.delim2() command. The command scan() can also be used to import data tables from text files. The primary differences between the read.table() commands and the scan() command is that the scan command reads data as a single large vector or list that must be shaped into a data table or data frame later. The scan() command can be more difficult to use than read.table() and similar commands when the text file data is already formatted in some way. The command read.fwf() is used to import text data files that are stored in fixed width format, where fields are not separated by a specific character like tab or comma, but instead each field is read from a specified number of characters from left to right in the text file (e.g. characters 1-6 store the first field, characters 7-8 store the second field, characters 9-12 store the third field, …). In other words, fields might be separated by 0 or more space characters. 2.6.5 MS Excel files and other proprietary formats Historically, it has been difficult to import data from MS Excel spreadsheets into R. Most people will convert their MS Excel spreadsheets into tab-delimited text files (.txt) or comma-separated value text files (.csv), then import these text files into R using the scan(), read.table() or read.csv() commands. However, converting MS Excel spreadsheets into text files may be tiresome, if dozens of files need to be converted, and some users may not have access to MS Excel to convert .xls spreadsheets into .txt files. It is now possible to import MS Excel spreadsheets directly using the read.xls() command from the xlsReadWrite package library. First, try uploading MS Excel data by converting the MS Excel spreadsheet into a tab-delimited .txt file or a .csv file. Start with a simple MS Excel data file (Figure 17). Figure 17. A MS Excel spreadsheet data set
  • 23. Crash Course: R & BioConductor 21 Click > File > Save As… to open up the Save As menu (Figure 18) and use the Save as type: drop down menu to save your file as Text (Tab delimited) or CSV (Comma delimited). This will covert your MS Excel spreadsheet into a tab-delimited text file or comma separated value text file that can be uploaded easily into R. Figure 18. Saving a MS Excel spreadsheet as a tab-delimited text file. Next, open R and import the text file with the read.table() or read.csv() statements seen below: aa <- read.table(file = “C:sample.txt”, header = TRUE, sep = “t”) Here, the statement aa <- read.table() implies that we are defining a data set named aa. The parameter statement file = “H:BCBB tipssample.txt” specifies the file path of our tab-delimited text file containing the data. The parameter statement header = TRUE specifies that the first row of data contains the column headings of our data set, while the statement sep = “t” specifies that the different columns (or fields) of our data table are separated by tab characters (i.e. the file is tab-delimited). You can use a similar command to upload a .CSV file with the read.csv() command. Now, open the same file using directly from MS Excel. Click > Packages > Install package(s)… on the MS Windows R GUI or click > Packages & Data > Package Installer on the Mac OSX R GUI to find and install the xlsReadWrite package library. Enter the command library(xls.ReadWrite) to load the package library to your workspace. Upload your MS Excel data using the command: bb <- read.xls(file=“C: sample.xls”,colNames=TRUE,sheet=1) Note, the statement bb <- read.xls() implies that we are defining a data set named bb, just like the previous example. Similarly, the parameter statements file = “H:BCBB tipssample.xls”, colNames = TRUE and sheet = 1 indicate the file path of the MS Excel spreadsheet, choice to read column names from the first row of data and the choice to only read data from the first sheet of the MS Excel file, respectively.
  • 24. Crash Course: R & BioConductor 22 Other R package libraries are available to open data files created or saved using commercial statistics software packages like SAS or SPSS. For example, SAS datafiles and SAS XPORT format libraries can be imported with the commands read.ssd() and read.xport() from the foreign package library. It is also possible to import SAS data sets and SAS Transport files using the sas.get() and sasxport.get() commands from the Hmisc package library. Similarly, spss.get() and read.spss() from the package libraries Hmisc and foreign, respectively, can both be used to open SPSS data files (.sav file extension) in R. The command stata.get() from the Hmisc package library can be used to import Stata datasets into R, etc. Data sets from most commercial statistics software packages can be imported directly into R. 2.7 Data Types 2.7.1 Simple object types (E.g. numeric, character and logical) One potentially frustrating problem with R is that you must carefully specify and manage how data is stored within R. Consider the following R statements: > a = 4.23 # Define a numeric object “a” > b = "Fred Flintstone" # Define a character object “b” > c = TRUE # Define a logical object “c” The R commands above define three variables: a, b and c. Each of these three variables are stored as an object within the R framework. All objects in R have a specific storage mode within R. The variable a was defined to be the real number 4.23, so the variable a will be stored as a numeric object in R. Likewise, the variable b was defined to be the character string “Fred Flintstone”, therefore it will be stored as character object in R. Note that character string objects are entered within double quotes in R. Finally, the variable c was defined as the logical outcome TRUE, so it will be stored as a logical object in R. Note the logical values TRUE and FALSE can be entered and stored without quote marks in R to create a logical object, but the character strings “TRUE”, “False” and other variations within double quotes would be stored as character objects in R. More specific classes of objects exist, like complex() R objects for storing complex numbers, integer() objects for storing integer numeric data or factor() objects for character string data that will be used as factor effects in statistical tests and graphs. Objects from different storage modes have different properties within R. If an R user tried to compute the sum 4.23 + “Fred Flintstone”, then R would return an error message. This is a good thing, because obviously the sum 4.23 + “Fred Flintstone” does not make sense. However, the different properties of these object types can sometimes cause conflicts, especially when data gets entered incorrectly. For example, the numeric value 4.23 can also be entered as the character string “4.23”. Numeric data sometimes gets into R, or other software programs, as character string data because of differences in how various software programs handle missing values or other problems. Incorrectly storing R objects can create unwanted errors in R statistical and graphing procedures, so it is often helpful to check the storage mode of an R object using the storage.mode() command. You can test whether an object belongs to a specific class using commands like is.numeric() or is.character(). You can also find the storage mode of an object using the command class(). Objects can often be coerced from one class into another using commands like as.numeric() or as.character(). > a = 4.23 # Define a numeric object “a” > is.numeric(a) # Test if “a” is a numeric object [1] TRUE > a = as.character(a) # Coerce “a” into a character object > a # View the object “a” [1] "4.23"
  • 25. Crash Course: R & BioConductor 23 > is.numeric(a) # Test if “a” is a numeric object [1] FALSE > class(a) # Find the storage mode class of “a” [1] “character” 2.7.2 Larger object types (E.g. data frames, matrices and lists) Larger entries of multiple values are also considered objects within R. For example, the data sets iris, afp.data, Expression and AE defined in Section 2.6 are all R objects. Many collections of values can be stored as vector(), matrix(), array(), table(), data.frame() or list() objects within the R framework. Again, these object storage modes each have unique properties within R to ensure proper handling of different types of data within R. You can also identify the storage mode of larger R objects, like arrays or data frames, with the command class() or with specific tests like is.data.frame(). You can coerce these larger R objects into different storage modes using commands like as.list() or as.numeric(). A vector is a one-dimensional, mathematical list of values that can be used in linear algebra (or matrix algebra) operations. The vector() and c() commands in R are used to define column vectors; row vectors must be created using the transpose command t() or by defining a 1 x n matrix, where n is the length of the row vector. A matrix is a two-dimensional list of values organized into m rows and n columns. Generally, vectors and matrices in R should only used for matrix algebra manipulation of numeric data. It is possible to enter character or logical data into a vector or matrix, but it is not possible to mix different types of data (e.g. numeric, character, logical,…) in the same vector or matrix. An array is similar to a vector or matrix, except the array can exist in higher dimensions. The higher dimensions of an n-dimensional array could be described as panels, pages, chapters, books, etc. For example, a six-dimensional array might have 3 books, 10 chapters, 150 pages, 16 panels, 8 rows and 45 columns. > # Display a vector with length = 6 > Vector <- c(45,50,53,47,44,52) > Vector [1] 45 50 53 47 44 52 > # Report the length of Vector > length(Vector) [1] 6 > # Display a matrix with 4 rows and 6 columns > Matrix <- matrix(x = x, nrow = 4, ncol = 6, byrow = TRUE) > Matrix [,1] [,2] [,3] [,4] [,5] [,6] [1,] 49 54 49 47 57 53 [2,] 49 45 50 50 49 46 [3,] 45 56 54 52 46 51 [4,] 47 46 48 48 55 48 > # Report the number of rows and columns of matrix > nrow(Matrix) [1] 4 > ncol(Matrix) [1] 6 > # Display an array with 4 rows, 6 columns and 2 panels > Array <- array(data = x, dim = c(4,6,2)) > Array , , 1 [,1] [,2] [,3] [,4] [,5] [,6] [1,] 50 48 52 52 48 44 [2,] 52 46 45 56 44 59 [3,] 45 41 51 51 48 44 [4,] 46 52 53 52 45 51
  • 26. Crash Course: R & BioConductor 24 , , 2 [,1] [,2] [,3] [,4] [,5] [,6] [1,] 49 40 50 50 55 51 [2,] 44 50 46 50 53 57 [3,] 49 49 50 54 49 51 [4,] 56 55 45 49 51 47 > # Report the dimensions of Array > dim(Array) [1] 4 6 2 The combine function c() can be used to create column vectors in R. Even though vectors are displayed in rows when printed in the R workspace, mathematically they will behave as column vectors. The size of a vector is reported using the length() command in R. The matrix() command builds a matrix from a vector x, using the parameters nrow and ncol to specify the number of rows and columns. The parameter byrow is used to determine whether the matrix will be filled with the vector values row-by-row or column-by- column. The number of rows and columns in a matrix can be reported with the functions nrow() and ncol(), respectively. The array() command builds an array from a vector data, where the dim parameter is used to specify both the number and length of its dimensions. For example, the entry dim = c(4,6,2) indicates the array should have three dimensions of lengths 4, 6 and 2, respectively. The function dim() is used to report the dimensions of an array object. The table() storage mode is used to store n-dimensional tables of frequency data for two or more categorical variables. The tables might be reported directly to display the frequency distribution among two or more factors, or the tables could be used as the data format for specific statistical tests and graphical methods. The table() comma is helpful, because it computes cell frequencies automatically. Unfortunately, the table() command does not compute statistics (e.g. mean, sum, …) for numeric arguments, so it cannot be used to compute pivot tables or statistical summaries. > # Build a two-way table from AE data > Table <- table(Gender,Severity) > Table Mild Moderate Severe Female 6 16 9 Male 14 15 4 > # Build a three-way table from AE data > Table <- table(Region,Severity,Gender) > Table , , = Female Mild Moderate Severe Midwest 1 6 1 Northeast 0 3 7 Northwest 0 4 0 Southeast 0 3 1 Southwest 5 0 0 , , = Male Mild Moderate Severe Midwest 0 7 1 Northeast 1 5 1 Northwest 0 1 0 Southeast 0 2 2 Southwest 13 0 0
  • 27. Crash Course: R & BioConductor 25 The table() command uses two or more vectors of factor() or character() data to generate an n- way table. The numbers in each cell are counts representing the number of observations with each combination of factor levels. For example, a two-way table of gender versus severity reveals that there are 6 female patients with mild AE symptoms. Breaking the data down further into a three-way table with region, severity and gender, there is only one female patient with mild symptoms from the northeast region and there are 5 female patients with mild symptoms in the southwest region. No additional parameters were specified. However, one interesting parameter is exclude, which allows you to hide the results for specified factor levels. For example, exclude = “Midwest” would hide all the results for the midwest region. Most R data should be stored with the data.frame() storage mode. The vector, matrix, array and table objects can only store individual values from one storage type (e.g. all numeric data, all character data, …). A data frame is very useful, because it can store data sets with multiple columns (i.e. variable) that maintain their own unique storage modes (e.g. separate numeric, character and logical variables). The only limitation of the data frame is that each variable, or column, must be the same length. Commands like read.table() and read.csv() will automatically store imported data as a data frame. Data frames share some of the properties of the matrix, array and list storage modes, but you want to be careful about using data frames in matrix algebra calculations and other methods. The properties of data frames will not always work with commands that require matrix data, and vice versa. > # Define a data frame from AFP data > afp.data <- data.frame(subjectID,gender,stringsAsFactors=FALSE) > gender [1] "female" "female" "female" "female" "female" "female" [7] "female" "female" "female" "female" "male" "male” "male" [14] "male" "male" "male" "male" "male" "male" "male" > # Define a data frame from AFP data with strings as factors > afp.data <- data.frame(subjectID,gender,stringsAsFactors=TRUE) > gender [1] female female female female female female female female [9] female female male male male male male male male male male male Levels: female male The data.frame() command is used to join several vectors of the same length together to form a single data set. Again, individual vectors can have different storage modes, such as numeric or character. The only limitation is that the vectors must share the same length. The stringsAsFactors parameter of the data.frame() command is used to store any character string variables in the data frame as factor() objects. The diffence between a vector of character objects and a factor object is shown above. A vector of character data is simply a collection of character strings. If we wanted to add new rows of data to the character variable gender, then we could add a new character string to the variable (e.g. “Did not report” or “intersex”). However, a factor object is a variable of string or numeric data with a fixed number of levels or outcomes. When gender is stored as a factor, there are only two possible levels, female and male. If a new character string (e.g. “Did not report”) were added to the factor variable gender, it would produce an error. Many statistical and graphical procedures in R require factor data. The list() storage mode is used to store a collection of ordered or named objects in R. A list shares some properties in common with a vector, except the list can be used to store a collection objects with different storage modes. For example, a single list could contain one numeric entry, three character entries, one logical entry and an entry that is itself a vector or matrix. Lists can also have entries that are named, as well as ordered. Finally, a list can be built one element at a time. For this reason, lists can be a handy way to store the output from a statistical function, since tests often produce diagnostic results and data that may not be immediately useful to all users (e.g. lists can store the residuals from a linear regression analysis).
  • 28. Crash Course: R & BioConductor 26 > # Build a list of four numeric entries > List <- list(45,13,21,87) > List [[1]] [1] 45 [[2]] [1] 13 [[3]] [1] 21 [[4]] [1] 87 > # Build a list of character, numeric and vector objects > List <- list(Day = "Tuesday", Temperature = 70, WinningLotto = c(17,23,44,39,7)) > List $Day [1] "Tuesday" $Temperature [1] 70 $WinningLotto [1] 17 23 44 39 7 > # Add a new list element to the list > List[["Time"]] <- "4:30" > List $Day [1] "Tuesday" $Temperature [1] 70 $WinningLotto [1] 17 23 44 39 7 $Time [1] "4:30" The list() command allows you enter a series of values or R objects to build the list, similar to the vector() command. The first example shows a list of four numeric values. When the list is displayed in the R workspace, each entry of list is identified first in double brackets, then the individual rows of the entry are displayed on the line below with the entry values on that row. For example, the third entry of the first list is identified [[3]] with the value [1] 21. Since all of the entries are numeric, this list is equivalent to a numeric vector of length 4. In the second example, the entries of the list are assigned names using the = operator. Notice that the entries include a character object, a numeric object and a vector of numbers. Finally, new entries to the list can be added after it has been defined, by adding a new name and value, e.g. List[["Time"]] <- "4:30". 2.8 Manipulating Data in R Once you have a data set imported and stored correctly in R, you may still need to manipulate the data set to add data, remove data or to meet the formatting requirements of a graph or statistical test. You may want to remove outlier values, or rename rows and columns of data, or maybe merge two data sets together. The R command line language includes a wide variety of procedures for these needs. Understanding these commands
  • 29. Crash Course: R & BioConductor 27 can be crucial, because R data is not stored in a viewable spreadsheet format, with simple copy-cut-and-paste functions, like MS Excel and other programs. 2.8.1 Indexing One of the most important concepts in R is the idea of indexing, because it applies to so many types of R objects. Vectors, matrices, data frames, arrays and lists can all be indexed using similar command notations. The index of an R object refers to the specific location of a value in a vector, matrix, array, data frame or list. You can generalize this concept by thinking of the index as the row and column number of any value entry in a spreadsheet, but remember that some R objects can have more than two dimensions or fewer than two dimensions. Here are some examples: > # Report the third entry from a vector of length = 6 > Vector[3] [1] 53 > # Report the entry from the 2nd row and 5th column of a matrix > Matrix[2,5] [1] 49 > # Report the 3rd row, 2nd column and 2nd panel of an array > Array[3,2,2] [1] 49 > # Report the 3rd row, 2nd column and 'Female' gender of a table > Table[3,2,"Female"] [1] 4 > # Report the 1st entry of the 1st column from afp.data > afp.data[1,1] [1] 1 > # Report the 2nd entry from the 'WinningLotto' vector in a list > List[["WinningLotto"]][2] [1] 23 Generally, you refer to an indexed entry of an R object by adding square brackets after the objects name (e.g. Vector[3] refers to the 3rd entry of the object Vector). The dimensions of an object are separated by commas (e.g. Matrix[2,5] refers to the 2nd row and 5th column of the object Matrix). If the dimensions of an object are named instead of numbered, then those dimensions can be specified with a quoted character string (e.g. specify the "Female" of the gender dimension). The examples above use indexing to report single values from vectors, matrices, arrays, tables, data frames and lists, but an index can be used in more complicated ways. > # Rows 2-3 and columns 1, 2 and 6 of a matrix > Matrix[2:3,c(1,2,6)] [,1] [,2] [,3] [1,] 49 45 46 [2,] 45 56 51 > # Overwrite one value from a matrix > Matrix[3,3] <- NA > Matrix [,1] [,2] [,3] [,4] [,5] [,6] [1,] 49 54 49 47 57 53 [2,] 49 45 50 50 49 46 [3,] 45 56 NA 52 46 51 [4,] 47 46 48 48 55 48
  • 30. Crash Course: R & BioConductor 28 > # Identify observations with % Body fat less than 10% > AE[Percent.Body.Fat <= 10,] Region Gender Severity Age Weight Percent.Body.Fat 2 Southwest Male Mild 34 148.5672 7 30 Southwest Male Mild 36 155.3823 8 49 Midwest Male Moderate 34 151.3767 9 A sequence of row or column numbers can be entered into an index to view more than one row or column from a data table. These sequences can be entered using colon symbol notation (e.g. 1:5) or the combine function (e.g. c(1,4,7)) and other methods. The individual indexed values of a matrix, array or data frame can be overwritten without affecting any other values in the matrix, array or data frame. Sequences of row numbers or column numbers can be generated with an inequality or conditional statement to find special subsets of data. Indexing is a very powerful tool within R. 2.8.2 Column references and attach() Indexing can be a great way to create, view or modify subsets of your data, but often it might be more helpful to refer to specific columns, or variables, within a large data frame. We have already seen that the objects in a list can be called by their names using double square brackets and the quoted name (e.g. List[[“Time”]] yields the value “4:30”). We can also call a specific column of a data frame using the reserved dollar sign symbol (e.g. AE$Gender yields the gender column of the AE data set). Column references and list name references can be simplified using the attach() command to “attach” a specific data frame or list to the current R workspace. Once the object has been attached to the workspace, individual variables or list items can be called by name. 2.8.3 Binding rows and columns Indexing and column references allow you to manipulate smaller subsets of a large data set. The functions cbind() and rbind() allow you to add columns and rows to data sets, respectively. It is also possible to add columns to a data frame recursively, by redefining the data frame with its original data and the new columns of data Frame = data.frame(Frame,NewColumn). Obviously, new elements can be added to lists at any time, by adding new named elements. 2.8.4 Sort and order data It is often helpful to sort the results of a vector, array or data frame to reorganize a data set or result for better insights. For example, you might need to sort the results from several statistical tests by their p-values, so the most statistically significant results are easy to identify. The sort() function is used to sort the actual values of a single vector or list in ascending or descending order. The order() command is used to generate an index of the sorted rows numbers from a data frame sorted by one or more variables in either ascending or descending order. > # Sort a single vector of values > a [1] 4 4 0 5 4 1 0 1 > sort(a) [1] 0 0 1 1 4 4 4 5 > sort(a,descending = TRUE) [1] 5 4 4 4 1 1 0 0 > index = order(a,b,c) > index [1] 3 7 6 8 2 5 1 4
  • 31. Crash Course: R & BioConductor 29 > frame[index,] a b c 3 0 0.0 89 7 0 0.6 92 6 1 0.2 99 8 1 0.8 83 2 4 0.0 84 5 4 0.4 100 1 4 0.9 81 4 5 0.1 92 > index = order(b,c,a) > frame[index,] a b c 2 4 0.0 84 3 0 0.0 89 4 5 0.1 92 6 1 0.2 99 5 4 0.4 100 7 0 0.6 92 8 1 0.8 83 1 4 0.9 81 2.8.5 Replace values The replace() command is used to replace the values of a matrix, array or data frame according to an index list, which identifies the values that need to replaces, and a vector, matrix or array of replacement values. For example, you could use an equality statement to recode the values of a character or factor variable (e.g. recode “m” and “f” to “male” and “female”). Alternatively, you could use an inequality to identify and remove outliers from a numeric variable. > # display a simple data frame > frame a b 1 f 3 2 f 7 3 m 1 4 f 1200 5 m 6 > # generate an index to identify “f” values for replacement > indx.f <- frame == "f" > # replace “f” values with “female” > frame <- replace(frame,indx.f,"female") > # generate an index to identify possible outlier values of b > indx <- frame$b > 1000 > # replace outliers with NA to remove outliers > frame$b <- replace(frame$b,indx,NA) > # view results > frame a b 1 female 3 2 female 7 3 m 1 4 female NA 5 m 6 2.8.6 Stack, unstack and reshape data Often data sets need to be stacked or split to reorganize data for use in statistical tests or to simply the recording of new observations. For example, suppose you were to record blood pressure, cholesterol and other
  • 32. Crash Course: R & BioConductor 30 medical results for 6 patients on Monday, Tuesday and Wednesday of one week. It would make sense for the doctor to record the measurements in three separate columns for Monday, Tuesday and Wednesday. But most statistical tests would need one column of dates and one single column for each type of measurement (e.g. one column of blood pressure measurements). The stack() and unstack() commands are used to stack and split these kinds of data sets. More complex stack and split operations can be performed using reshape(). > # Display a data frame > frame Monday Tuesday Wednesday 1 96 76 156 2 100 78 163 3 102 80 163 4 106 82 159 5 105 82 153 6 103 78 162 # Stack the data from Monday, Tuesday and Wednesday > frame = stack(frame) > frame Date Measurement 1 Monday 96 2 Monday 100 3 Monday 102 ... 7 Tuesday 76 8 Tuesday 78 9 Tuesday 80 ... 16 Wednesday 159 17 Wednesday 153 18 Wednesday 162 > # Unstack the measurement data by dates > frame = unstack(frame) > frame Monday Tuesday Wednesday 1 96 76 156 2 100 78 163 ... 2.8.7 Merge data sets Two data frames can be joined together using the merge() command. The default option is to join the data frames using any columns that share the same name among all the data frames. However, specific columns can be matched to one another using the by, by.x or by.y parameters. For example, it might be necessary to combine the medical records of a general practitioner, a cardiologist, a dentist and a psychologist according to their patient id numbers or patient names. The merge command can also be used for more complicated join operations among database tables. > psych id ssri therepy 1 1 Y Y 2 1 Y N 3 2 N N 4 2 N Y 5 3 N Y 6 3 Y Y 7 4 N N 8 4 N Y
  • 33. Crash Course: R & BioConductor 31 > cardio id exercise lipids 1 1 Y high 2 1 Y high 3 1 Y norm 4 1 Y norm 5 3 Y norm 6 3 Y norm 7 4 Y norm > records = merge(psych,cardio) > records id ssri therepy exercise lipids 1 1 Y Y Y high 2 1 Y Y Y high 3 1 Y Y Y norm 4 1 Y Y Y norm 5 1 Y N Y high 6 1 Y N Y high 7 1 Y N Y norm 8 1 Y N Y norm 9 3 N Y Y norm 10 3 N Y Y norm 11 3 Y Y Y norm 12 3 Y Y Y norm 13 4 N N Y norm 14 4 N Y Y norm 2.9 Saving and Exporting Data 2.9.1 Save workspace data with save() With any software program, it is important to save your data. There are several options available to save your data in R. Most of the options should be familiar from the data import commands in section 2.6 of this manual. Click > File > Save Workspace... on the MS Windows R GUI or click > Workspace > Save Workspace File... on the Mac OSX R GUI to save the entire R workspace. Alternatively, you could enter the command save() or save.image() to save the R workspace from the command line. If you use the save() command, you can specify a list of R objects to be saved (e.g. save(AE) to save only the AE data set), otherwise the save() command will save all the R objects defined in the R workspace. Use the command ls() to view the objects in your R workspace, and use the command rm() to remove individual objects from the workspace. The command rm(list=ls()) will remove all objects from the R workspace. > ls() [1] "AFP.after" "AFP.before" "BMI" [4] "InternetTest" "Monday" "R2HTML.test" [7] "Tuesday" "Wednesday" "a" [10] "aa" "afp.data" "b" [13] "biocLite" "biocinstall" “biocinstallPkgGroups" [16] "biocinstallRepos" "c" "cardio" ... > rm(a,aa,b,c) > save(file="~/workspace.RData") 2.9.2 Save data tables with write.table() You can save an R data frame with the commands write.table(), write.csv() or write.delim(). The parameter options are similar to the read.table() commands, but here you will choose whether the save text file should have a header or row names, if the fields should be separated by commas or tabs, etc. The na
  • 34. Crash Course: R & BioConductor 32 parameter is important, because it controls how missing values will be saved in the text files; you may want to be carefully choose the symbol for missing data, if you want to open the text file in another software package like MS Excel, SAS, etc. Another powerful option in write.table is the append parameter, which allows you to add new data to an existing text file. This can be a useful option when you need to save large amounts of data from a script or analysis that involves long computations. It is often helpful to save a data file one-piece-at-a- time to avoid losing data during lengthy computations or to avoid problems when trying to save one gigantic data file. > # Save AE data as tab-delimited text optimized for SAS import > write.table(AE,file="~/ae.txt",sep="t",na=".",row.names=FALSE) > # Save the AFP data as tab delimited text using write.delim() > write.delim(afp.data) 2.10 Changing Directories When opening and saving files from R, it may be helpful to change the working directory. Changing the directory will often allow you to specify only a file name, rather than a complete file path, when opening and saving data files or source scripts. Change directories from the MS Windows GUI by clicking > File > Change dir... on the menu bar; click > Misc > Change Working Directory... to change directories on the Mac OSX R GUI. The Mac OSX GUI also allows you to click > Misc > Get Working Directory to find the current working directory. The commands getwd() and setwd() allow you to find the current directory and change the working directory, respectively. 2.11 Sample Problems for Students #1. {Fisher’s iris data} Sir Ronald A. Fisher famously used this set of iris flower data as an example to test his new linear discriminant statistical model. Now, the iris data set is used as a historical example for new statistical classification models. A) Search the help menu for the keyword “linear discriminant”, then report the names of the functions and packages you find. B) Search the help menus or a search engine for additional classification models that could be tested with the iris data. C) The measurements from the iris data set were made in centimeters, but suppose a researcher wanted to compare the performance of their classifier for measurements in both cm and inches. Remember 1 cm = 0.3937 inch and create a new iris data set with measurements in inches. D) Use indexing to verify the 77th plant (i.e. row 77) has petal length of approximately 1.89 inches. #2. {AFP data} Suppose alpha-fetoprotein (AFP) is a potential biomarker for liver cancer and other cancer types. A researcher might be interested in AFP levels before and after taking a new drug in one of four concentrations. A) The example in section 2.7.2 of the manual provided a list of 20 AFP levels before drug treatment. Use your own methods to enter a new column of 20 AFP levels after drug treatment, then enter another column with the difference between the pre- and post-treatment AFP levels B) Verify the storage mode of the data set afp.data. Verify the storage mode of the variable drug. Verify the storage mode of the variable gender. Convert the storage mode of drug to factor.
  • 35. Crash Course: R & BioConductor 33 C) Create a subset of the AFP data that only includes male patients with BMI > 25.5 or weight > 180 lbs. How many men are included in the data subset? D) Sort the entire data subset created in part C) by the BMI variable in an descending order. What is the row ordering of the sorted data subset? Save the data subset as a comma separated value (.csv) text file, then remove the subset from your R workspace. #3. {AE data} Doctors, epidemiologists and other researchers look at adverse events to explore the symptoms and medical conditions affecting patients. A researcher might choose to look for associations between adverse events and diet. A) One of the adverse events in the data table is “Malaise”. Recode the AE data table, such that all entries for “Malaise” read “Discomfort” instead. B) Look at the results of your recoded adverse events. How many different types of adverse events are there? Look through their names. Do you see any potential problems? Fix any problems that you might find. C) Create an adverse event table to examine relationship between different adverse event symptoms and their severities. Make sure the “Discomfort” AE shows up in the table, instead of “Malaise”. D) Search the help menus for the functions rowSums and colSums. Use these functions to count up the number of patients with each adverse event and the number of patients with mild, moderate and severe symptoms. E) Define a new variable AEmatrix by converting the AE table into the matrix storage mode. Define two new matrix variables using the commands LL = matrix(1,1,17) and RR = c(1,1,1). Look at all these new matrices. Compute the products of LL by AEmatrix; AEmatrix by RR; and LL by AEmatrix by RR.
  • 36. Crash Course: R & BioConductor 34 Ch. 3. Graphics and Figures in R 3.1 Basic Types of Graphics and Figures You can use R to produce dozens or hundreds of different kinds of graphics and figures. Many popular types of graphs, like pie charts and histograms, have their own dedicated commands and procedures in the graphics package library. Other types of graphs, like multifactor XY scatterplots, are most easily produced using multiple commands from general graphing utilities, like plot() and legend(). Often, specialized package libraries will include graphics commands that can help streamline the graphing process. Other graphs can only be produced in the context of the appropriate statistical analysis. Several simple examples are provided below. 3.1.1 Pie charts Pie charts are used to quickly display the frequencies of each outcome of a single categorical variable. The relative size of each slice of the pie chart represents the relative frequency of its respective outcome in the sample. For example, we could use a pie chart to examine the proportion of samples from each iris species in Edgar Anderson’s iris data (Figure 19). We could also use a pie chart to explore the frequencies of each adverse event in our AE data set (Figure 20). Figure 19. Pie chart of Edgar Anderson’s iris species. Figure 20. Pie chart of the adverse events (AE) data. ># Create the labels for the iris data pie chart > labels <- levels(iris$Species) ># Create a vector with all three species counts > counts <- summary(iris$Species) ># Define a vector with three color choices for the pie chart > colors <- c("red","blue","yellow") ># Define a main title for the pie chart > main <- "Pie Chart of Iris Species" ># Call the pie() command to produce the pie chart > pie(x = counts, labels = labels, col = colors, main = main)
  • 37. Crash Course: R & BioConductor 35 ># Create the labels for the adverse events (AE) data pie chart > labels <- levels(as.factor(AE$Adverse.Event)) ># Create a vector with counts for all adverse events > counts <- summary(as.factor(AE$Adverse.Event)) ># Define a main title for the pie chart > main <- "Adverse Events Pie Chart" ># Call the pie() command to produce the pie chart > pie(x = counts, labels = labels, main = main) The pie() command includes the parameter x to define the counts or frequencies in each slice of the pie chart, the parameter labels to define the text labels in each slice of the pie chart and the parameter col to define the colors of each slice in the pie chart. You can also add generic graphing parameters, like main and others, to customize the pie chart with a main title and other features. Notice that the labels and the x (counts or frequencies) parameters could have been computed and entered manually, but instead the commands levels() and summary() were used to define labels and x, respectively. The levels() command lists all the outcomes of a factor variable, while the summary() command adds up all the counts for each outcome of a factor variable. Note, the species variable of the iris data set was already defined as a factor variable, while the adverse events variable from the AE data set needed to be converted to a factor variable first, using the as.factor() command. Also notice, in the second pie chart, that the col parameter was left undefined and R automatically generated the color choices for each of the 18 adverse event slices. 3.1.2 Histograms Histograms are used to quickly display the distribution of a single continuous numeric variable. Often researchers want to determine if a variable might be normally distributed or non-normally distributed. Other researchers want to estimate descriptive statistics like means, medians, variances or ranges. A key issue in the construction of a histogram is the choice of the histogram “bins” or groupings. If too many bins are used, the true shape of the distribution will be lost because the histogram will be too sparse, but if too few bins are used, the true shape of the distribution will be lost because the bins are too dense to remain informative. The location of the bin mid points and break points can also be important to the shape of the histogram. The importance of binning is shown in two histograms of the height measurements from the AFP dataset shown below (Figure 21 and Figure 22). Figure 21. Histogram of height from the AFP data set with default number of bins. Figure 22. Histogram of BMI from the AFP data set with a larger number of bins.
  • 38. Crash Course: R & BioConductor 36 > # Define a vector of BMI data > height <- as.numeric(afp.data[,3]) > # Define a main title for the histogram > main <- "Histogram of height from AFP data" > # Call the hist() command to produce the histogram > hist(x=height,xlab="height (inches)",main=main,col="wheat") > # Call hist() command with extra breaks for a second histogram > hist(x=height,breaks=30,xlim=c(15,45),...) The hist() command includes many parameter options. The parameter x must be specified, to identify the sample of continuous data displayed in the histogram. The breaks parameter specifies the number of bins used in the histogram. The number of histogram bins can be specified using one of three automated binning algorithm choices (i.e. “Sturges”, “Scott” or “Freedman-Diaconis”), a single number (i.e. breaks = 30 will produce 31 bins), a vector of specific break points or a formula. In the first histogram, the default “Sturges” method produced a histogram with six bins, which appears to show a normal distribution. In the second histogram, the command breaks = 30 specified that 31 bins should be used, and the resulting histogram was sparse and uninformative. The command xlab specifies the label for the x-axis. As before, the commands main and col specify the main title and the color of the plotted bars, respectively. 3.1.3 Box plots Box plots are an alternative to the histogram for researchers who want to quickly summarize the distribution of continuous numeric variables. Box plots were introduced by statistician John Tukey in his historic book Exploratory Data Analysis (Tukey 1977). The box plot is a graphical representation of the five number summary, where the central line in the box plot represents the median of a sample, the outer edges of the box in the box plot represent the 25th and 75th percentiles of the sample and the whiskers of the box plot represent the minimum and maximum of a sample. Alternate versions of the box plot often use dots or asterisks to identify outliers beyond the whiskers, which might represent the 5th and 95th percentile of a distribution or the smallest and largest “non-outlier” values of a distribution. Generally, a single box plot (Figure 23) provides less information about the shape of a distribution than an analogous histogram. For example, a box plot cannot be used to identify a bimodal distribution of female and male heights, while a histogram can. However, box plots are often more appropriate than histograms when researchers want to compare the distributions of several samples in the same figure (Figure 24). > # Define a vector of height data > height <- as.numeric(afp.data[,3]) > # Define a main title for the boxplot > main <- "Boxplot of height from AFP data" > # Call boxplot() command for boxplot of height from AFP data > boxplot(x=height,main=main,xlab=”height (inches)”,col=”wheat”) > # > # Call boxplot() command for boxplot of calories from AE data > boxplot(formula=Calories~Region,data=AE,range=1.5,...) The boxplot() function can be used in at least two different ways, with a single vector of continuous data or with a formula to produce side-by-side box plots. A simple box plot of the height variable from the AFP data set is produced using the boxplot() command with parameter x = height to specify a single vector of numeric data for the Calories ~ Region was used to create a graph with side-by-side box plots of the calories variable for each of the five regions of the categorical region variable. The parameter data = AE is used to specify that we only want to use variables from the AE data set, which is why we could call the calories and region variables with out defining them as vectors before the boxplot() command. The range parameter is
  • 39. Crash Course: R & BioConductor 37 Figure 23. Box plot of patient height from AFP data Figure 24. Box plot of calories among five regions. used to identify outliers in the box plot. The parameters main, xlab and col were used to specify the main title, x-axis label and the color of the boxplot, respectively, as seen in the previous examples. In the second box plot, the parameter formula = on the box plot figure. Here, range = 1.5 implies that any calorie measurement smaller than Q1 – 1.5*IQR and any measurement larger than Q3 + 1.5*IQR will be identified as an outlier, where Q1 represents the 25th percentile, Q3 represents the 75th percentile and IQR represents the interquartile range (i.e. Q3 – Q1, the middle 50% of the data). Another parameter ylab = “Calories” was entered to specify the y-axis label, while the main and col were used to define a main title and the box plot color as above. 3.1.4 Simple bar charts Researchers and statisticians often use the phrase “bar chart” to describe two subtly different types of graphs. Sometimes a bar chart is used as an alternative to the pie chart to display the relative frequencies of different outcomes from a categorical variable (e.g. gender or region), but in other situations a bar chart with error bars might be used to display the mean response levels of a continuous variable (e.g. weight, concentration) and its standard error among several categories as an alternative to a box plot. E.g. a bar chart can be used to plot the frequencies of each adverse event in the AE data set (Figure 25), or a bar chart could be used to plot the mean BMI levels for male and female patients in the AFP data set (Figure 26). > # Create a vector of counts for the AE bar chart > counts <- summary(AE$Adverse.Event) > # Define a main title for the bar chart > main <- "Bar chart of Adverse Events" > # Call the barplot() command for a bar chart of adverse events > barplot(height=counts,main=main,ylab=”Counts”)
  • 40. Crash Course: R & BioConductor 38 Figure 25. A bar chart of adverse events from the AE data set Figure 26. A bar chart of female and male BMI from the AFP data set. > # compute mean BMI for male and female patients from AFP data > BMI.females <-mean(AFP[AFP$gender=="female",5]) > BMI.males <-mean(AFP[AFP$gender=="male",5]) > mean.BMI <- c(BMI.females,BMI.males) > # define labels for female and male bars > names(mean.BMI) <- levels(as.factor(AFP$gender)) > # Define colors for female and male bars > colors <- c("pink","sky blue") > # Specify a main title for the bar chart graph > main <- "Bar chart of mean BMI by gender for AFP data" > # Call the barplot() command for a bar chart of BMI responses > barplot(height = mean.BMI,ylab="BMI", main=main,col=colors)
  • 41. Crash Course: R & BioConductor 39 Before creating a bar chart of mean BMI levels for male and female patients from the AFP data, we first need to compute the individual BMI means for male and female patients using subscripting. Names were assigned to the vector of BMI means using the names() function, so the appropriate gender labels will show up below the male and female BMI data. Colors and a main title were defined for the chart, and the barplot() function was called with the appropriate options. Note that more advanced bar charts of several categorical variables can be easily created with the barchart() command from the lattice package library. Multiple categorical variables can be summarized with the table() command, then the table of categorical variables is entered into the barchart() command for easy clustered, stacked or paneled bar charts. Notice that numeric variables will not work appropriately in the barchart() command. However, quick and easy bar charts with error bars can be created with the bargraph.CI() command from the sciplot package library. Other helpful packages may exist to create more variations on these types of bar charts. 3.1.5 Simple scatter plots and line plots Scatter plots are used to display the relationship between two continuous variables that might be analyzed using linear regression or nonlinear regression models. E.g. an XY scatter plot might be used to examine the relationship between % body fat and weight (lbs) from the AE dataset (Figure 27). Line plots are often used to plot survival curves, probability density functions (PDFs), cumulative distribution functions (CDFs) and other continuous functions of interest. E.g. you might need a plot of the standard normal curve for a class lecture or a statistics textbook (Figure 28). Figure 27. Scatter plot of % Body Fat vs Weight (lbs) Figure 28. Plot of (Gaussian) normal density function. > # Define a main title for a scatter plot > main <- "Simple scatter plot of % Body Fat vs. Weight (lbs)" > # Simple scatter plot of % Body Fat vs. Weight > plot(AE$Weight,AE$Percent.Body.Fat,xlab="Weight (lbs)",ylab="% Body Fat",main=main)
  • 42. Crash Course: R & BioConductor 40 > # Define a continuous sequence Z ranging from -5 to +5 > Z <- seq(from=-5,to=5,length=8000) > # Define a sequence representing the density of a normal curve > fZ <- dnorm(Z) > # Plot a normal curve > plot(Z,fZ,type="l",ylab="Density",main="Normal Curve") A main title was defined for the XY scatter plot of % body fat vs. weight (lbs) from the AE data, before calling the plot() command with its xlab and ylab options to define the X- and Y-axis labels, respectively. Since the probability density of a standard normal distribution is really a function f(x), two new variables Z and fZ were defined to create a line plot of the standard normal density. First, the variable Z was defined as a sequence of 8000 evenly spaced rational numbers from -5 to +5 using the sequence() command. Second, the variable fZ was defined as a sequence of 8000 numbers resulting from the function f(x) using the dnorm() command in R. Finally, a line plot was created from the plot() function by using the parameter type =”l”. 3.2 Custom Titles, Subtitles and Axes Labels Most graphics procedures (e.g. pie(), hist(), plot(), ...) have some common parameters that allow users to add specific text for the main titles, subtitles and axes labels. There are additional commands that allow you to customize the look and feel of these labels for a more professional look. The following sections reveal some helpful tips about customizing the labels on a graph. 3.2.1 Adding and removing groups from a factor variable Take a close look at the pie chart (Figure 7) and bar chart (Figure 12) created from the adverse events of the AE data. You may have noticed a possible typo in the data set, because the data contains two very similar groups “myalgia” and “mylagia”. The “mylagia” group is a typo, but can it be removed from the plot? > # Examine the 18 levels of the Adverse.Event variable > AE$Adverse.Event [1] Tenderness Arthralgia Mylagia Erythema Erythema Anemia Anemia ... [57] Nausea Headache Nodule Anemia Swelling Leukopenia Elavated CH50 [64] Headache 18 Levels: Anemia Arthralgia Dimpling Ecchymosis Elavated CH50 Erythema Headache Induration ... Tenderness > # Store the list of variable names as a new variable > new.labels <- levels(AE$Adverse.Event) > # Verify the list still has 18 levels > length(new.labels) [1] 18 > # Use indexing to replace the “Mylagia” label with “Myalgia” > new.labels[12] <- "Myalgia" > # Assign these new labels to the levels of Adverse.Event > levels(AE$Adverse.Event) <- new.labels > # Verify Adverse.Event now has only 17 levels > AE$Adverse.Event [1] Tenderness Arthralgia Myalgia Erythema Erythema Anemia Anemia ... [57] Nausea Headache Nodule Anemia Swelling Leukopenia Elavated CH50 [64] Headache 17 Levels: Anemia Arthralgia Dimpling Ecchymosis Elavated CH50 Erythema Headache Induration ... Tenderness