Crunching
Molecules and
Numbers in R
Rajarshi Guha
Background
Molecules in R
Chemical Data
Parallel
Paradigms
Crunching Molecules and Numbers in R
Rajarshi Guha
NIH Chemical Genomics Center
238th ACS National Meeting
17th August, 2009
Outline
Some background on R
Doing cheminformatics in R
R History
S
1976: S developed by John Chambers at Bell Labs
1988: S rewritten in C
1993: Licensed to Insightful Corp.
2004: Bought by Insightful Corp. for $2M
2008: Bought by TIBCO for $25M
R
1991: Created by Ihaka & Gentleman
1993: First public release
1995: Released under GPL
2000: R 1.0.0
2009: R 2.9
An overview of R
An environment for statistical computation
Wide variety of standard and state of the art statistical
methods built in or accessible via packages
But also a complete, interpreted programming language
Well suited for manipulating and operating on datasets -
numerical, categorical or a mixture - and of varying
shape
Impressive visualization facilities (but not very
interactive)
An overview of R
Syntax is pretty much S-Plus
Highly cross-platform
Frequent and regular releases, active development by
core group
The developer and user communities are extremely active
r-help is not just for learning R; you can get a decent
statistics education from the list!
Used by many top statisticians; many cutting-edge
techniques first show up in R
Usability
Default mode is a command-line prompt
GUIs are available
But the learning curve is steep
Does force you to think about the analysis
Not a great tool for casual, once-in-a-while usage
R primitives
Numeric, character, list, matrix, data.frame
    > x <- 'Hello World'
    > x <- 1
    > x <- c(1,2,3,4,5,6)
    > x
    [1] 1 2 3 4 5 6
    > x <- data.frame(MW=runif(5, 10, 50),
    +                 hERG=sample(c('active','inactive'),
    +                             5, TRUE))
    > x
            MW     hERG
    1 23.55435   active
    2 42.90365 inactive
    3 49.35149   active
    4 26.85912   active
    5 10.01877   active
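The slide lists list and matrix among the primitives. A minimal sketch of those two follows; the compound name and values are made up for illustration.

```r
# A list can hold elements of different types (contents are illustrative)
cmpd <- list(name = "benzene", mw = 78.11, aromatic = TRUE)
cmpd$mw                      # access an element by name

# A matrix is a vector with dimensions; all elements share one type
m <- matrix(1:6, nrow = 2)   # filled column by column
dim(m)
m[2, 3]                      # element in row 2, column 3
```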
Matrix oriented programming
Similar in style to Matlab
Easily access (multiple) rows, columns
Vector/matrix indexing is very powerful and key to
efficient R code
Perform operations on entire rows or columns
Makes subsetting a trivial operation
Perfect for QSAR type analyses
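A small sketch of the indexing and subsetting described above. The data frame is simulated and its column names (MW, logP, hERG) are only illustrative.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
d <- data.frame(MW   = runif(10, 100, 500),
                logP = runif(10, -2, 6),
                hERG = sample(c("active", "inactive"), 10, TRUE))

d[1:3, ]                          # first three rows, all columns
d[, c("MW", "logP")]              # two columns by name
heavy <- d[d$MW > 300, ]          # rows selected by a logical vector
d$MW / 100                        # arithmetic on a whole column at once
colMeans(d[, c("MW", "logP")])    # one summary per column
```

The logical-vector form is the one that makes subsetting a one-liner: the condition is evaluated for every row at once and the result picks the rows.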
Functional style
R’s functional paradigms are closely tied to matrix
operations
apply, lapply, sapply, tapply allow you to easily
operate on groups of objects
Elements of a list
Rows and/or columns of a matrix
Subsets of data, using a grouping variable
Anonymous functions are supported
Use of these functional forms can lead to speedups
compared to traditional for loops
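The three cases listed above (list elements, matrix rows or columns, grouped subsets) might look like this in practice. All data here are simulated for illustration.

```r
set.seed(42)
# Elements of a list: sapply simplifies the result to a vector
smiles <- list("CC", "CCCC", "CCCNC")
lens <- sapply(smiles, nchar)

# Rows and/or columns of a matrix: 1 = rows, 2 = columns
m <- matrix(runif(20), ncol = 4)
colRange <- apply(m, 2, function(col) max(col) - min(col))  # anonymous function

# Subsets of data, using a grouping variable
logp  <- runif(12)
toxic <- rep(c("yes", "no"), each = 6)
groupMeans <- tapply(logp, toxic, mean)
```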
Non-functional style
    # column std devs
    m <- matrix(runif(100*100), ncol=100)
    sds <- numeric(ncol(m))
    for (i in 1:ncol(m)) sds[i] <- sd(m[,i])

    # mean logP of toxic, non-toxic classes
    m <- data.frame(logp=runif(100),
                    toxic=sample(c('yes','no'),
                                 100, TRUE))
    toxLogP <- 0
    nontoxLogP <- 0
    for (j in 1:nrow(m)) {
      if (m[j,2] == 'yes') toxLogP <- toxLogP + m[j,1]
      else nontoxLogP <- nontoxLogP + m[j,1]
    }
    # divide the running sums by the class sizes to get the means
    toxLogP <- toxLogP / sum(m$toxic == 'yes')
    nontoxLogP <- nontoxLogP / sum(m$toxic == 'no')
Functional style
    # column std devs
    m <- matrix(runif(100*100), ncol=100)
    apply(m, 2, sd)

    # mean logP of toxic, non-toxic classes
    m <- data.frame(logp=runif(100),
                    toxic=sample(c('yes','no'),
                                 100, TRUE))
    by(m, m$toxic, function(x) mean(x$logp))
Object oriented style
R supports multiple object oriented mechanisms
Simplest is S3 classes
Object orientation is in terms of function names
Easy to work with, not always flexible enough
S4 classes are much more powerful, but also more
complex
Many problems can ignore these as R primitives provide
sufficient support for attaching meta-data to objects
(crude encapsulation)
Becomes important/useful when writing packages, not
for day to day code
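A minimal sketch of the two lighter-weight options mentioned above: meta-data attached as attributes, and an S3 class whose method is found by its function name. The class name "molecule" and the values are invented for this example.

```r
# Crude encapsulation: attach meta-data to a plain vector as attributes
ic50 <- c(0.12, 3.4, 0.56)
attr(ic50, "units") <- "uM"
attr(ic50, "units")

# S3: a class is just an attribute, and methods are found by function name
mol <- structure(list(smiles = "c1ccccc1", name = "benzene"),
                 class = "molecule")   # "molecule" is a made-up class name

# generic.class naming: print() dispatches to print.molecule()
print.molecule <- function(x, ...) {
  cat("<molecule", x$name, ":", x$smiles, ">\n")
  invisible(x)
}
print(mol)
```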
Interfacing with C & Fortran
R is interpreted; functional forms help a bit
Very useful to refactor inner loops into C (or Fortran)
Also useful to provide an R interface to pre-existing C/Fortran code
Can lead to dramatic speedups
[Figure: speedup (axis from 0 to 50) by fingerprint bit length (1024, 166, 79) for 5000 pairwise Tanimoto similarity calculations; MacBook Pro, 2 GHz, 1 GB RAM]
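To make the "inner loop" concrete, here is a sketch in plain R of pairwise Tanimoto similarity written two ways: an explicit double loop, which is the kind of code one would consider moving to C, and a vectorized form. The fingerprints are random and the function names are hypothetical; the C version itself is not shown.

```r
set.seed(1)
# 50 random fingerprints of 166 bits each, one per row (illustrative data)
fp <- matrix(runif(50 * 166) < 0.3, nrow = 50)

# Explicit double loop: the kind of inner loop worth moving to C
tanimotoLoop <- function(fp) {
  n <- nrow(fp)
  s <- matrix(1, n, n)
  for (i in 1:n) for (j in 1:n) {
    both   <- sum(fp[i, ] & fp[j, ])
    either <- sum(fp[i, ] | fp[j, ])
    s[i, j] <- both / either
  }
  s
}

# Vectorized form: bits in common via one matrix product
tanimotoVec <- function(fp) {
  x <- fp * 1                      # logical -> numeric 0/1
  common <- tcrossprod(x)          # common[i, j] = bits set in both i and j
  counts <- rowSums(x)
  common / (outer(counts, counts, "+") - common)
}

all.equal(tanimotoLoop(fp), tanimotoVec(fp))
```

Both functions assume every fingerprint has at least one bit set; otherwise the denominator is zero.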
Visualization
R generates publication quality graphics in a variety of
formats
A huge number of statistical visualization methods (2D,
3D, OpenGL)
Extremely powerful display specifications
core commands
lattice (a.k.a. trellis graphics)
Based on sound statistical theories
Standard plots are easy to make, but complex
plots do have a learning curve
Interactivity is limited, though some packages do alleviate
this
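A short sketch of the two display systems named above, writing both plots to a PDF file. The data are simulated and the column names are illustrative.

```r
library(lattice)   # ships with standard R distributions

set.seed(7)
d <- data.frame(MW   = runif(60, 100, 500),
                logP = rnorm(60, 2, 1),
                hERG = sample(c("active", "inactive"), 60, TRUE))

out <- tempfile(fileext = ".pdf")
pdf(out)                                   # open a PDF graphics device
plot(d$MW, d$logP, xlab = "MW", ylab = "logP",
     main = "Base graphics")               # core plotting commands
print(xyplot(logP ~ MW | hERG, data = d))  # lattice: one panel per class
dev.off()
```

Inside a script a lattice object has to be printed explicitly, which is why xyplot is wrapped in print().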
Code quality
It’s not enough just to write code
RUnit is a package that supports unit testing, analogous
to JUnit
R comes with well defined package structure that can be
automatically checked for various errors
Packages can be uploaded to CRAN which allows any R
user to install them directly from R
Extensive documentation format
Sweave is an important feature which allows one to
include R code and associated text in a single document
- literate programming or reproducible research
The downsides of R
Memory bound (but can use as much memory as you
have)
Language inconsistencies
Indexing starts from 1, but no error if you use 0 as an
index
See blog posts by Radford Neal (U Toronto)
Debugging environment not so great (though ESS is
good for Emacs users)
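The zero-index behaviour mentioned above is easy to demonstrate. The loop at the end is a deliberately wrong 0-based loop, included only to show how the mistake passes silently.

```r
x <- c(10, 20, 30)
x[1]          # first element: 10
x[0]          # no error, just an empty vector
length(x[0])  # 0

# An off-by-one loop written in 0-based style therefore fails quietly
total <- 0
for (i in 0:2) total <- total + sum(x[i])
total         # 30: x[0] added nothing and x[3] was never visited
```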
Cheminformatics programming
Fundamental requirement is support for core chemical
concepts
Representation and manipulation of these concepts
Flexibility
Could implement all of this directly in R - lots of wheels
would be reinvented
We also want such functionality to be R-like
Writing Java or C in R is not R-like
The Chemistry Development Kit
Open source Java library for cheminformatics
Wide variety of functionality
Core chemical concepts (atoms, bonds, molecules)
SMARTS, pharmacophores
Molecular descriptors and fingerprints
2D depictions
Used in a variety of tools, applications and services
Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2110–2120
rcdk - CDK from R
[Diagram: packages in the R programming environment: rJava, CDK, Jmol, rcdk, XML, rpubchem, fingerprint]
rcdk Motivations
Have access to cheminformatics functionality from
within R
Support processing of data from chemistry databases
Avoid reimplementing cheminformatics methods
Have access to all of this in idiomatic R
Basic molecular operations - I/O
Read in molecular file formats supported by the CDK
Files can be local or remote
Parse SMILES strings
In contrast to the CDK, rcdk will configure molecules
automatically (unless instructed not to)
The resultant molecule objects are Java references and can
be passed to a variety of rcdk functions
    mols <- load.molecules(c('abc.sdf', 'xyz.smi'))
    mol <- parse.smiles('c1ccccc1CC(=O)')
    mols <- sapply(c('CC', 'CCCC', 'CCCNC'),
                   parse.smiles)
Basic molecular operations
Given a molecule, we can extract or add properties
Get lists of atoms and bonds and then manipulate them
Currently doesn’t support a lot of molecular graph
operations
    # get the atoms from a molecule
    mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
    atoms <- get.atoms(mol)
    # get the coordinate matrix of the molecule
    coords <- do.call('rbind',
                      lapply(atoms, get.point3d))
Working with fingerprints
rcdk will generate a variety of fingerprints via the CDK
Other packages can generate fingerprints
The fingerprint package supports I/O of fingerprint
data and various similarity operations on fingerprints
Provides an S4 class representing binary fingerprints
    m1 <- parse.smiles('c1ccccc1C(COC)N')
    m2 <- parse.smiles('C1CCCCC1C(COC)N')
    # Calculate fingerprints
    fps <- lapply(list(m1, m2),
                  get.fingerprint, type='maccs')
    distance(fps[[1]], fps[[2]], method='tanimoto')
    # Read fingerprints from a file and compute a similarity matrix
    fps <- fp.read('fp.txt', lf=moe.lf,
                   size=166, header=TRUE)
    fpsim <- fp.sim.matrix(fps, method='tanimoto')
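To illustrate what an S4 class for binary fingerprints can look like, here is a sketch using a hypothetical class "BitFP" that stores the positions of the on bits. It is not the class definition used by the fingerprint package, and the tanimoto function is likewise written just for this example.

```r
library(methods)

# "BitFP" is a hypothetical class for this sketch only
setClass("BitFP",
         representation(bits = "numeric",   # positions of the on bits
                        nbit = "numeric"))  # total fingerprint length

# Tanimoto = bits on in both / bits on in either
tanimoto <- function(a, b) {
  length(intersect(a@bits, b@bits)) / length(union(a@bits, b@bits))
}

fp1 <- new("BitFP", bits = c(1, 5, 9, 20), nbit = 166)
fp2 <- new("BitFP", bits = c(1, 5, 21),    nbit = 166)
tanimoto(fp1, fp2)   # 2 shared bits out of 5 distinct on bits = 0.4
```

Storing only the on-bit positions keeps sparse fingerprints small, and the similarity reduces to set operations.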
rcdk and QSAR
[Diagram: Molecular Descriptors, Machine Learning, Property]
rcdk and QSAR
Access to descriptors and fingerprints makes for very
easy QSAR modeling within R
Evaluate the descriptors (individually, by type or all)
Get back a data.frame which can be used as input to
pretty much any modeling method
    mols <- load.molecules('big.sdf')
    dnames <- get.desc.names('topological')
    descs <- eval.desc(mols, dnames)
    str(descs)
    'data.frame': 467 obs. of 180 variables:
     $ ATSc1 : num 0.28 0.279 0.279 0.217 0.479 ...
     $ ATSc2 : num -0.0777 -0.0851 -0.0845 -0.0587 -0.2356 ...
     $ ATSc3 : num -0.05803 -0.04706 -0.04616 -0.0519 0.00129 ..
     $ ATSc4 : num -0.00906 0.00279 -0.01147 0.00241 0.00856 ...
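Once the descriptors are in a data.frame, model building is ordinary R. The sketch below uses simulated descriptors and a simulated property as stand-ins, since no real dataset is given here; the descriptor names D1 to D5 and the train/test split are assumptions made for illustration.

```r
set.seed(123)
# Stand-in for a descriptor data.frame: 100 molecules, 5 descriptors
descs <- as.data.frame(matrix(rnorm(100 * 5), ncol = 5))
names(descs) <- paste0("D", 1:5)            # made-up descriptor names

# Simulated property that depends on two of the descriptors plus noise
activity <- 2 * descs$D1 - descs$D3 + rnorm(100, sd = 0.5)

train <- 1:70
test  <- 71:100
fit  <- lm(activity ~ ., data = cbind(activity = activity, descs)[train, ])
pred <- predict(fit, newdata = descs[test, ])
r2   <- cor(pred, activity[test])^2         # squared correlation on held-out set
```

Any other modeling function that accepts a data.frame could be swapped in for lm without changing the rest.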
Viewing molecules
While numerical modeling is a fundamental task in this
environment, visualization is also important
Either view structures of individual molecules or tables of
structure and data
rcdk supports both (not very well on OS X)
    mol <- parse.smiles('c1ccccc1C(N)CC')
    view.molecule.2d(mol)
    smiles <- c("CCC", "CCN", "CCN(C)(C)",
                "c1ccccc1Cc1ccccc1",
                "C1CCC1CC(CN(C)(C))CC(=O)CC")
    mols <- sapply(smiles, parse.smiles)
    view.molecule.2d(mols)
Downsides to rcdk
Can’t save state of Java objects
Doesn’t take advantage of S4 classes to provide R-side
representations of CDK classes
Incomplete coverage of the CDK API - sometimes need
to go down to rJava to perform an operation
Big datasets are problematic (mainly due to R
limitations)
Access to chemical databases
Useful to be able to transparently access data from
various public data sources
PubChem compounds and assays are supported via
rpubchem
Compound access is primarily by CID, while assay data
can be obtained from keyword searches
End up with a data.frame containing all relevant assay
information (along with meta-data as attributes)
R can also easily access arbitrary RDBMSs (Postgres,
MySQL, Oracle)
Access to PubChem
    > dat <- get.cids(1:30)
    > str(dat)
    'data.frame': 30 obs. of 11 variables:
     $ CID : chr "1" "2" "3" "4" ...
     $ IUPACName : chr "3-acetyloxy-4-(trimethylaz
     $ CanonicalSmile : chr "CC(=O)OC(CC(=O)[O-])C[N+](
     $ MolecularFormula : chr "C9H17NO4" "C9H18NO4+" "C7H
     $ MolecularWeight : num 203.2 204.2 156.1 75.1 169.
    > find.assay.id('LDR')
    [1] 990 1035 1036 1037 1038 1039 1041 1042 1043 1653 1865
    > adat <- get.assay(990)
    > str(adat)
    'data.frame': 51 obs. of 9 variables:
     $ PUBCHEM.SID : int 845800 848472 852502 857608
     $ PUBCHEM.CID : int 648162 6603466 655127 65895
Bioinformatics in R
While the focus is on cheminformatics, many problems
involve bioinformatics to some degree
The Bioconductor project provides a wide variety of
packages
A lot of it focused on gene expression analysis
A number of packages provide access to various
biological databases, annotations etc
Protein structure analysis is supported in R via Bio3d
Never have to leave the comfort of R
http://www.bioconductor.org/
Grant, B. et al, Bioinformatics, 2006, 22, 2695–2696
Long calculations, big data
Many statistical methods require long running
calculations
Bootstrap
Bayesian methods
Many problems involve large datasets
A common feature to both scenarios is that they can be
trivially parallelized
As opposed to requiring a parallel version of the underlying
algorithm
R has good support for both trivial and non-trivial
parallelization methods
See R/parallel for a package that will parallelize
actual R code
Vera, G. et al., BMC Bioinformatics, 2008, 9, 390
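The bootstrap is a convenient example of why such calculations parallelize trivially. In the sketch below (simulated data, hypothetical function name oneReplicate) every replicate is computed independently of the others, so the call that runs them serially could be handed to a parallel apply without touching the statistic itself.

```r
set.seed(2009)
x <- rexp(200)        # illustrative sample
B <- 500              # number of bootstrap replicates

# Each replicate resamples x and recomputes the statistic; no replicate
# depends on any other, which is what makes the loop trivially parallel
oneReplicate <- function(b) median(sample(x, replace = TRUE))
boot <- sapply(1:B, oneReplicate)

sd(boot)                          # bootstrap standard error of the median
quantile(boot, c(0.025, 0.975))   # percentile interval
```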
Simple parallelization
The snow package allows easy use of multiple cores on a
single computer or a cluster of computers
A simple wrapper over other parallel R libraries
Can support PVM, MPI
At the very least you can use all the cores on your own
machine
http://cran.r-project.org/web/packages/snow/
Serial code - Feature Selection
Rather than use GA, SA etc, just look at all
combinations
Inelegant, but no worries about missing the global
optimum
    x <- matrix(runif(500*40), ncol=40)
    y <- runif(500)
    library(gtools)
    combos <- combinations(40, 3)
    apply(combos, 1, function(z) {
      d <- data.frame(y=y, x=x[,z])
      fit <- lm(y~., data=d)
      cor(y, fit$fitted)^2
    })
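A scaled-down variant of the same idea, using only base R (combn in place of gtools) and keeping the result so the best subset can be picked out. The data are constructed so that columns 2 and 5 carry the signal; sizes and coefficients are arbitrary choices for this sketch.

```r
set.seed(10)
x <- matrix(runif(200 * 8), ncol = 8)
y <- 3 * x[, 2] - 2 * x[, 5] + rnorm(200, sd = 0.1)   # columns 2 and 5 matter

combos <- t(combn(8, 2))          # base R: all 28 pairs, one per row
r2 <- apply(combos, 1, function(z) {
  d <- data.frame(y = y, x = x[, z])
  fit <- lm(y ~ ., data = d)
  cor(y, fitted(fit))^2
})

best <- combos[which.max(r2), ]   # the pair with the highest R^2
best
```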
Simple parallelization - Feature Selection
Trivially parallelized
    x <- matrix(runif(500*40), ncol=40)
    y <- runif(500)
    library(gtools)
    combos <- combinations(40, 3)
    library(snow)
    cl <- makeSOCKcluster(2)
    clusterExport(cl, "x")
    clusterExport(cl, "y")
    r2 <- parApply(cl, combos, 1, function(z) {
      d <- data.frame(y=y, x=x[,z])
      fit <- lm(y~., data=d)
      cor(y, fit$fitted)^2
    })
    # shut down the worker processes when done
    stopCluster(cl)
Big data scenarios
The idea behind snow can also be used to handle very
large datasets
Simply chunk the data appropriately and papply over
the list of filenames
Still requires you to perform chunking and keep track of
everything
Hadoop is a nice way to avoid all this
Throw one or more (very) large files at it, let it deal with
chunking and computation
For non-trivial file formats, you need to implement a
chunker
RHIPE provides access to a Hadoop cluster from within R
http://hadoop.apache.org/core/
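The manual chunking workflow described above can be sketched as follows. The "large" dataset is a small simulated data frame, the chunk size of 250 rows is arbitrary, and lapply stands in for whichever parallel apply is used over the list of filenames.

```r
set.seed(5)
big <- data.frame(id = 1:1000, logp = rnorm(1000))   # stand-in for a large file

# 1. Split into chunks of 250 rows and write each chunk to its own file
chunkDir <- tempfile("chunks")
dir.create(chunkDir)
pieces <- split(big, ceiling(seq_len(nrow(big)) / 250))
files <- sapply(names(pieces), function(k) {
  f <- file.path(chunkDir, paste("chunk", k, ".csv", sep = ""))
  write.csv(pieces[[k]], f, row.names = FALSE)
  f
})

# 2. Process each file independently (lapply stands in for a parallel apply)
partial <- lapply(files, function(f) {
  d <- read.csv(f)
  c(n = nrow(d), total = sum(d$logp))
})

# 3. Combine the per-chunk results into the overall answer
partial <- do.call(rbind, partial)
overallMean <- sum(partial[, "total"]) / sum(partial[, "n"])
```

Steps 1 and 3 are the bookkeeping that the slide refers to: the per-chunk function must return something that can be combined correctly, here a count and a sum rather than a mean.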
Summary
rcdk successfully integrates cheminformatics
functionality into the R environment
Related packages provide access to other forms of
chemical data (fingerprints) and data sources
An excellent environment for chemical and biological
data mining

Más contenido relacionado

La actualidad más candente

An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...Kamel Mansouri
 
International Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsInternational Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsKamel Mansouri
 
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Kamel Mansouri
 
Data drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryData drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryAnn-Marie Roche
 
Technology for Drug Discovery Research Productivity
Technology for Drug Discovery Research ProductivityTechnology for Drug Discovery Research Productivity
Technology for Drug Discovery Research ProductivityYogesh Wagh
 
Update on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL ProjectUpdate on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL ProjectIES / IAQM
 
Basics of QSAR Modeling
Basics of QSAR ModelingBasics of QSAR Modeling
Basics of QSAR ModelingPrachi Pradeep
 
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...Andrew McEachran
 
Paper presentation @IPAW'08
Paper presentation @IPAW'08Paper presentation @IPAW'08
Paper presentation @IPAW'08Paolo Missier
 
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...Dr. Haxel Consult
 
In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...Kamel Mansouri
 

La actualidad más candente (20)

An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...
 
International Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsInternational Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology Problems
 
Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...
 
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
 
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
 
The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...
 
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
 
New Approach Methods - What is That?
New Approach Methods - What is That?New Approach Methods - What is That?
New Approach Methods - What is That?
 
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
 
Data drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryData drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistry
 
How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...
 
Technology for Drug Discovery Research Productivity
Technology for Drug Discovery Research ProductivityTechnology for Drug Discovery Research Productivity
Technology for Drug Discovery Research Productivity
 
Chemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet EraChemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet Era
 
Update on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL ProjectUpdate on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL Project
 
Basics of QSAR Modeling
Basics of QSAR ModelingBasics of QSAR Modeling
Basics of QSAR Modeling
 
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
 
Paper presentation @IPAW'08
Paper presentation @IPAW'08Paper presentation @IPAW'08
Paper presentation @IPAW'08
 
Resume Or
Resume OrResume Or
Resume Or
 
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
 
In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...
 

Destacado

R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
R & CDK: A Sturdy Platform in the Oceans of Chemical Data}R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
R & CDK: A Sturdy Platform in the Oceans of Chemical Data}Rajarshi Guha
 
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Rajarshi Guha
 
The Trans-NIH RNAi Initiative : Informatics
The Trans-NIH RNAi Initiative: InformaticsThe Trans-NIH RNAi Initiative: Informatics
The Trans-NIH RNAi Initiative : InformaticsRajarshi Guha
 
The smaller sukhavati vyuha
The smaller sukhavati vyuhaThe smaller sukhavati vyuha
The smaller sukhavati vyuhaLin Zhang Sheng
 
Robots, Small Molecules & R
Robots, Small Molecules & RRobots, Small Molecules & R
Robots, Small Molecules & RRajarshi Guha
 
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research DatabaseRajarshi Guha
 
Uram ecp course
Uram ecp courseUram ecp course
Uram ecp coursetigerron
 
Why are we still doing industrial age drug
Why are we still doing industrial age drugWhy are we still doing industrial age drug
Why are we still doing industrial age drugSean Ekins
 
Haapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminarHaapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminarJüri Kaljundi
 
A Writing Group Strategy for Scientists
A Writing Group Strategy for ScientistsA Writing Group Strategy for Scientists
A Writing Group Strategy for Scientistsgizemk
 
Pintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenesPintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenesOlmeda Orígenes
 
Eit orginal
Eit orginalEit orginal
Eit orginalanamsini
 
Plans for Creative Writing
Plans for Creative WritingPlans for Creative Writing
Plans for Creative WritingFatheha Rahman
 
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)Food Insight
 
ILASCD - Student-Centered Leadership
ILASCD - Student-Centered LeadershipILASCD - Student-Centered Leadership
ILASCD - Student-Centered LeadershipPJ Caposey
 
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...Digiday
 
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsSean Ekins
 
MEMS sensor catalog with I2C
MEMS sensor catalog with I2CMEMS sensor catalog with I2C
MEMS sensor catalog with I2CAkira Sasaki
 

Destacado (20)

R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
R & CDK: A Sturdy Platform in the Oceans of Chemical Data}R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
 
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...
 
The Trans-NIH RNAi Initiative : Informatics
The Trans-NIH RNAi Initiative: InformaticsThe Trans-NIH RNAi Initiative: Informatics
The Trans-NIH RNAi Initiative : Informatics
 
The smaller sukhavati vyuha
The smaller sukhavati vyuhaThe smaller sukhavati vyuha
The smaller sukhavati vyuha
 
Robots, Small Molecules & R
Robots, Small Molecules & RRobots, Small Molecules & R
Robots, Small Molecules & R
 
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research Database
 
Uram ecp course
Uram ecp courseUram ecp course
Uram ecp course
 
Codes & Tiny Houses
Codes & Tiny HousesCodes & Tiny Houses
Codes & Tiny Houses
 
Why are we still doing industrial age drug
Why are we still doing industrial age drugWhy are we still doing industrial age drug
Why are we still doing industrial age drug
 
Haapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminarHaapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminar
 
A Writing Group Strategy for Scientists
A Writing Group Strategy for ScientistsA Writing Group Strategy for Scientists
A Writing Group Strategy for Scientists
 
Pintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenesPintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenes
 
Eit orginal
Eit orginalEit orginal
Eit orginal
 
Plans for Creative Writing
Plans for Creative WritingPlans for Creative Writing
Plans for Creative Writing
 
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
 
ILASCD - Student-Centered Leadership
ILASCD - Student-Centered LeadershipILASCD - Student-Centered Leadership
ILASCD - Student-Centered Leadership
 
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
 
MANEJO DE LA ENFERMEDAD DE CHAGAS EN ATENCIÓN PRIMARIA EN ESPAÑA
MANEJO DE LA ENFERMEDAD DE CHAGAS  EN ATENCIÓN PRIMARIA EN ESPAÑAMANEJO DE LA ENFERMEDAD DE CHAGAS  EN ATENCIÓN PRIMARIA EN ESPAÑA
MANEJO DE LA ENFERMEDAD DE CHAGAS EN ATENCIÓN PRIMARIA EN ESPAÑA
 
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
 
MEMS sensor catalog with I2C
MEMS sensor catalog with I2CMEMS sensor catalog with I2C
MEMS sensor catalog with I2C
 

Crunching Molecules and Numbers in R with Rcdk and Fingerprint Packages

  • 1. Crunching Molecules and Numbers in R Rajarshi Guha Background Molecules in R Chemical Data Parallel Paradigms Crunching Molecules and Numbers in R Rajarshi Guha NIH Chemical Genomics Center 238th ACS National Meeting 17th August, 2009
  • 2. Outline
      - Some background on R
      - Doing cheminformatics in R
  • 3. R History
      S timeline:
      - 1976: S developed by John Chambers at Bell Labs
      - 1988: S rewritten in C
      - 1993: Licensed to Insightful Corp.
      - 2004: Bought by Insightful Corp for $2M
      - 2008: Bought by TIBCO for $25M
      R timeline:
      - 1991: Created by Ihaka & Gentleman
      - 1993: First public release
      - 1995: Released under GPL
      - 2000: R 1.0.0
      - 2009: R 2.9
  • 4. An overview of R
      - An environment for statistical computation
      - Wide variety of standard and state of the art statistical methods built in or accessible via packages
      - But also a complete, interpreted programming language
      - Well suited for manipulating and operating on datasets - numerical, categorical or a mixture - and of varying shape
      - Impressive visualization facilities (but not very interactive)
  • 5. An overview of R
      - Syntax is pretty much S-Plus
      - Highly cross-platform
      - Frequent and regular releases, active development by core group
      - The dev and user community is extremely active
      - r-help is not just for learning R, you can get a decent statistics education from the list!
      - Used by many top statisticians, many cutting edge techniques first show up in R
  • 6. Usability
      - Default mode is a command-line-like prompt
      - GUIs available
      - But learning curve is steep
      - Does force you to think about the analysis
      - Not a great tool for casual, once-in-a-while usage
  • 7. R primitives
      - Numeric, character, list, matrix, data.frame

        > x <- 'Hello World'
        > x <- 1
        > x <- c(1,2,3,4,5,6)
        > x
        [1] 1 2 3 4 5 6
        > x <- data.frame(MW=runif(5, 10, 50),
        +                 hERG=sample(c('active','inactive'), 5, TRUE))
        > x
                MW     hERG
        1 23.55435   active
        2 42.90365 inactive
        3 49.35149   active
        4 26.85912   active
        5 10.01877   active
  • 8. Matrix oriented programming
      - Similar in style to Matlab
      - Easily access (multiple) rows, columns
      - Vector/matrix indexing is very powerful and key to efficient R code
      - Perform operations on entire rows or columns
      - Makes subsetting a trivial operation
      - Perfect for QSAR type analyses
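
The indexing idioms on this slide can be sketched in a few lines of base R. This is only an illustration: the descriptor matrix `desc`, its column names, and the `activity` labels are made-up stand-ins for a QSAR-style dataset, not data from the talk.

```r
# Hypothetical descriptor matrix: 6 molecules x 3 descriptors (random values).
set.seed(1)
desc <- matrix(runif(18), nrow = 6,
               dimnames = list(paste0("mol", 1:6), c("MW", "logP", "TPSA")))
activity <- c(1, 0, 1, 1, 0, 0)        # made-up class labels, one per molecule

first_two_rows <- desc[1:2, ]          # several rows at once
logp_column    <- desc[, "logP"]       # one column, selected by name
actives        <- desc[activity == 1, ]  # logical subsetting by class label

col_means <- colMeans(desc)            # operate on every column in one call
scaled    <- scale(desc)               # center and scale each column
```

Subsetting with a logical vector (`activity == 1`) is usually the most compact way to pull out one class of compounds before modeling.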
  • 9. Functional style
      - R's functional paradigms are closely tied to matrix operations
      - apply, lapply, sapply, tapply allow you to easily operate on groups of objects
          - Elements of a list
          - Rows and/or columns of a matrix
          - Subsets of data, using a grouping variable
      - Anonymous functions are supported
      - Use of these functional forms can lead to speedups compared to traditional for loops
  • 10. Non-functional style

        # column std devs
        m <- matrix(runif(100*100), ncol=100)
        sds <- numeric(ncol(m))
        for (i in 1:ncol(m)) sds[i] <- sd(m[,i])

        # mean logP of toxic, non-toxic classes
        m <- data.frame(logp=runif(100),
                        toxic=sample(c('yes','no'), 100, TRUE))
        toxLogP <- 0
        nontoxLogP <- 0
        for (j in 1:nrow(m)) {
          if (m[j,2] == 'yes') toxLogP <- toxLogP + m[j,1]
          else nontoxLogP <- nontoxLogP + m[j,1]
        }
        # convert the running sums to means
        toxLogP <- toxLogP / sum(m$toxic == 'yes')
        nontoxLogP <- nontoxLogP / sum(m$toxic == 'no')
  • 11. Functional style

        # column std devs
        m <- matrix(runif(100*100), ncol=100)
        apply(m, 2, sd)

        # mean logP of toxic, non-toxic classes
        m <- data.frame(logp=runif(100),
                        toxic=sample(c('yes','no'), 100, TRUE))
        by(m, m$toxic, function(x) mean(x$logp))
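
As a complement to the `by()` call on this slide, the same grouped mean can be written with `tapply`, and `sapply` with an anonymous function handles other per-group summaries. The sketch below uses the same kind of random toy data as the slide; the variable names are illustrative assumptions.

```r
set.seed(42)
m <- data.frame(logp  = runif(100),
                toxic = sample(c("yes", "no"), 100, replace = TRUE))

# Mean logP per class: values, grouping variable, function.
class_means <- tapply(m$logp, m$toxic, mean)

# The same result with an explicit loop, for comparison.
loop_means <- c(no = 0, yes = 0)
for (cls in names(loop_means)) {
  loop_means[cls] <- mean(m$logp[m$toxic == cls])
}

# An anonymous function applied to each group: the range of logP per class.
ranges <- sapply(split(m$logp, m$toxic), function(v) max(v) - min(v))
```

`tapply` returns a named array, so `class_means[["yes"]]` picks out a single class without further bookkeeping.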
  • 12. Object oriented style
      - R supports multiple object oriented mechanisms
      - Simplest is S3 classes
          - Object orientation is in terms of function names
          - Easy to work with, not always flexible enough
      - S4 classes are much more powerful, but also more complex
      - Many problems can ignore these as R primitives provide sufficient support for attaching meta-data to objects (crude encapsulation)
      - Becomes important/useful when writing packages, not for day to day code
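
To make "object orientation in terms of function names" concrete, here is a minimal S3 sketch. The `molecule` class, the `describe` generic, and the values are hypothetical names invented for this illustration; they are not part of rcdk or any other package mentioned in the talk.

```r
# Constructor: an S3 object is an ordinary list with a class attribute.
new_molecule <- function(name, mw) {
  structure(list(name = name, mw = mw), class = "molecule")
}

# A generic plus methods; dispatch is driven by the name describe.<class>.
describe <- function(x, ...) UseMethod("describe")
describe.default  <- function(x, ...) "unknown object"
describe.molecule <- function(x, ...) paste0(x$name, " (MW ", x$mw, ")")

benzene <- new_molecule("benzene", 78.11)
label   <- describe(benzene)   # dispatches to describe.molecule

# The "crude encapsulation" alternative: meta-data attached to a plain vector.
mw <- c(78.11, 46.07)
attr(mw, "units") <- "g/mol"
```

For day to day scripts the attribute approach is often enough; a class with methods pays off once the code is shared as a package.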
  • 13. Interfacing with C & Fortran
      - R is interpreted, functional forms help a bit
      - Very useful to refactor inner loops into C (or Fortran)
      - Also useful to provide an R interface to pre-existing C/Fortran code
      - Can lead to dramatic speedups
      [Figure: speedup (axis from 0 to 50) versus fingerprint bit length (1024, 166, 79) for 5000 pairwise Tanimoto similarity calculations, Macbook Pro, 2GHz, 1GB RAM]
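
The benchmark on this slide concerns Tanimoto similarity, the kind of inner loop worth moving out of interpreted code. The sketch below stays in plain R and only contrasts an element-by-element loop with a vectorized version; it is not the C implementation behind the chart, and the function names, the logical-vector fingerprint representation, and the choice to return 0 for two empty fingerprints are all assumptions made for illustration.

```r
# Tanimoto similarity for binary fingerprints stored as logical vectors:
# (bits set in both) / (bits set in either).
tanimoto_loop <- function(a, b) {
  both <- 0
  either <- 0
  for (i in seq_along(a)) {
    if (a[i] && b[i]) both <- both + 1
    if (a[i] || b[i]) either <- either + 1
  }
  if (either == 0) return(0)   # assumed convention for two empty fingerprints
  both / either
}

# Vectorized version: the element-wise work happens inside compiled R internals.
tanimoto_vec <- function(a, b) {
  either <- sum(a | b)
  if (either == 0) return(0)
  sum(a & b) / either
}

set.seed(7)
fp1 <- runif(1024) < 0.2   # two random 1024-bit fingerprints
fp2 <- runif(1024) < 0.2
sim <- tanimoto_vec(fp1, fp2)
```

Wrapping either function in `system.time()` over many pairs gives a rough feel for the gap on a given machine; a C routine called through R's foreign function interface would be the next step the slide describes.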
  • 14. Visualization
      - R generates publication quality graphics in a variety of formats
      - A huge number of statistical visualization methods (2D, 3D, OpenGL)
      - Extremely powerful display specifications
          - core commands
          - lattice (a.k.a. trellis graphics)
      - Based on sound statistical theories
      - Standard plots are easy to make, but complex plots do have a learning curve
      - Interactivity is limited, though some packages do alleviate this
  • 15. Code quality
      - It's not enough to just write code
      - RUnit is a package that supports unit testing, analogous to JUnit
      - R comes with a well defined package structure that can be automatically checked for various errors
      - Packages can be uploaded to CRAN, which allows any R user to install them directly from R
      - Extensive documentation format
      - Sweave is an important feature which allows one to include R code and associated text in a single document - literate programming or reproducible research
  • 16. The downsides of R
      - Memory bound (but can use as much memory as you have)
      - Language inconsistencies
          - Indexing starts from 1, but no error if you use 0 as an index
          - See blog posts by Radford Neal (U Toronto)
      - Debugging environment not so great (though ESS is good for Emacs users)
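
The zero-index behaviour mentioned on this slide is easy to reproduce. The snippet below is a small sketch; `safe_get` is a hypothetical helper added here to show one way of guarding against silent out-of-range indices, not something from the talk.

```r
x <- c(10, 20, 30)

first  <- x[1]   # 10: indexing starts from 1
zero   <- x[0]   # numeric(0): an empty vector, with no error raised
beyond <- x[5]   # NA: reading past the end is also silent

# Hypothetical guard that turns such mistakes into errors.
safe_get <- function(v, i) {
  stopifnot(length(i) == 1, i >= 1, i <= length(v))
  v[[i]]
}

second <- safe_get(x, 2)   # 20
```

Calling `safe_get(x, 0)` stops with an error instead of quietly returning an empty result, which makes off-by-one bugs ported from zero-based languages easier to spot.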
  • 17. Cheminformatics programming
      - Fundamental requirement is support for core chemical concepts
      - Representation and manipulation of these concepts
      - Flexibility
      - Could implement all of this directly in R - lots of wheels would be reinvented
      - We also want such functionality to be R-like
      - Writing Java or C in R is not R-like
  • 18. The Chemistry Development Kit
      - Open source Java library for cheminformatics
      - Wide variety of functionality
          - Core chemical concepts (atoms, bonds, molecules)
          - SMARTS, pharmacophores
          - Molecular descriptors and fingerprints
          - 2D depictions
      - Used in a variety of tools, applications and services
      - Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2110–2120
  • 19. rcdk - CDK from R
      - [Diagram: the R Programming Environment, with components labelled rJava, CDK, Jmol, rcdk, XML, rpubchem and fingerprint]
  • 20. rcdk Motivations
      - Have access to cheminformatics functionality from within R
      - Support processing of data from chemistry databases
      - Not reimplement cheminformatics methods
      - Have access to all of this in idiomatic R
  • 21. Basic molecular operations - I/O
      - Read in molecular file formats supported by the CDK
      - Files can be local or remote
      - Parse SMILES strings
      - In contrast to the CDK, rcdk will configure molecules automatically (unless instructed not to)
      - The resultant molecule objects are Java references and can be passed to a variety of rcdk functions

          # load molecules from one or more files
          mols <- load.molecules(c('abc.sdf', 'xyz.smi'))
          # parse a single SMILES string
          mol <- parse.smiles('c1ccccc1CC(=O)')
          # parse several SMILES strings at once
          mols <- sapply(c('CC', 'CCCC', 'CCCNC'), parse.smiles)
  • 22. Basic molecular operations
      - Given a molecule, we can extract or add properties
      - Get lists of atoms and bonds and then manipulate them
      - Currently doesn't support a lot of molecular graph operations

          # get the atoms from a molecule
          mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
          atoms <- get.atoms(mol)
          # get the coordinate matrix of the molecule
          coords <- do.call('rbind', lapply(atoms, get.point3d))
  • 23. Working with fingerprints
      - rcdk will generate a variety of fingerprints via the CDK
      - Other packages can generate fingerprints
      - The fingerprint package supports I/O of fingerprint data and various similarity operations on fingerprints
      - Provides an S4 class representing binary fingerprints

          m1 <- parse.smiles('c1ccccc1C(COC)N')
          m2 <- parse.smiles('C1CCCCC1C(COC)N')
          # Calculate fingerprints
          fps <- lapply(list(m1, m2), get.fingerprint, type='maccs')
          # Tanimoto similarity between the two fingerprints
          distance(fps[[1]], fps[[2]], method='tanimoto')
          # Read fingerprints from a file and compute the similarity matrix
          fps <- fp.read('fp.txt', lf=moe.lf, size=166, header=TRUE)
          fpsim <- fp.sim.matrix(fps, method='tanimoto')
  • 24. rcdk and QSAR
      - [Diagram with the labels: Molecular Descriptors, Machine Learning, Property]
  • 25. rcdk and QSAR
      - Access to descriptors and fingerprints makes for very easy QSAR modeling within R
      - Evaluate the descriptors (individually, by type or all)
      - Get back a data.frame which can be used as input to pretty much any modeling method

          mols <- load.molecules('big.sdf')
          # names of all topological descriptors
          dnames <- get.desc.names('topological')
          # evaluate them for every molecule
          descs <- eval.desc(mols, dnames)
          str(descs)
          'data.frame': 467 obs. of 180 variables:
           $ ATSc1 : num 0.28 0.279 0.279 0.217 0.479 ...
           $ ATSc2 : num -0.0777 -0.0851 -0.0845 -0.0587 -0.2356 ...
           $ ATSc3 : num -0.05803 -0.04706 -0.04616 -0.0519 0.00129 ...
           $ ATSc4 : num -0.00906 0.00279 -0.01147 0.00241 0.00856 ...
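Once the descriptors are in a data.frame, the modeling step is ordinary R. The sketch below shows that step with a linear model. Everything in it is synthetic: the descriptor columns `d1` to `d3` stand in for the output of `eval.desc`, and the `activity` values are simulated rather than measured.

```r
set.seed(42)

# Synthetic stand-in for a descriptor data.frame: 50 molecules, 3 descriptors
descs <- data.frame(d1 = rnorm(50), d2 = rnorm(50), d3 = rnorm(50))

# Simulated property values; real work would use measured activities
activity <- 1.5 * descs$d1 - 0.5 * descs$d2 + rnorm(50, sd = 0.1)

# Combine descriptors and property, then fit an ordinary linear model
d <- cbind(descs, activity = activity)
fit <- lm(activity ~ ., data = d)
r2 <- summary(fit)$r.squared
```

Swapping `lm` for another modeling function that accepts a formula and a data.frame would leave the rest of the sketch unchanged.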
  • 26. Viewing molecules
      - While numerical modeling is a fundamental task in this environment, visualization is also important
      - Either view structures of individual molecules or tables of structure and data
      - rcdk supports both (not very well on OS X)

          # view a single molecule
          mol <- parse.smiles('c1ccccc1C(N)CC')
          view.molecule.2d(mol)
          # view several molecules together
          smiles <- c("CCC", "CCN", "CCN(C)(C)", "c1ccccc1Cc1ccccc1",
                      "C1CCC1CC(CN(C)(C))CC(=O)CC")
          mols <- sapply(smiles, parse.smiles)
          view.molecule.2d(mols)
  • 27. Downsides to rcdk
      - Can't save state of Java objects
      - Doesn't take advantage of S4 classes to provide R-side representations of CDK classes
      - Incomplete coverage of the CDK API - sometimes need to go down to rJava to perform an operation
      - Big datasets are problematic (mainly due to R limitations)
  • 28. Access to chemical databases
      - Useful to be able to transparently access data from various public data sources
      - PubChem compounds and assays are supported via rpubchem
      - Compound access is primarily by CID, while assay data can be obtained from keyword searches
      - End up with a data.frame containing all relevant assay information (along with meta-data as attributes)
      - R can also easily access arbitrary RDBMSs (Postgres, MySQL, Oracle)
  • 29. Access to PubChem

          > dat <- get.cids(1:30)
          'data.frame': 30 obs. of 11 variables:
           $ CID : chr "1" "2" "3" "4" ...
           $ IUPACName : chr "3-acetyloxy-4-(trimethylaz
           $ CanonicalSmile : chr "CC(=O)OC(CC(=O)[O-])C[N+](
           $ MolecularFormula : chr "C9H17NO4" "C9H18NO4+" "C7H
           $ MolecularWeight : num 203.2 204.2 156.1 75.1 169.
          > find.assay.id('LDR')
          [1] 990 1035 1036 1037 1038 1039 1041 1042 1043 1653 1865
          > adat <- get.assay(990)
          > str(adat)
          'data.frame': 51 obs. of 9 variables:
           $ PUBCHEM.SID : int 845800 848472 852502 857608
           $ PUBCHEM.CID : int 648162 6603466 655127 65895
  • 30. Bioinformatics in R
      - While the focus is on cheminformatics, many problems involve bioinformatics to some degree
      - The Bioconductor project provides a wide variety of packages
          - A lot of it focused on gene expression analysis
          - A number of packages provide access to various biological databases, annotations etc.
      - Protein structure analysis is supported in R via Bio3d
      - Never have to leave the comfort of R
      - http://www.bioconductor.org/
      - Grant, B. et al., Bioinformatics, 2006, 22, 2695–2696
  • 31. Long calculations, big data
      - Many statistical methods require long running calculations
          - Bootstrap
          - Bayesian methods
      - Many problems involve large datasets
      - A common feature of both scenarios is that they can be trivially parallelized
          - As opposed to requiring a parallel version of the underlying algorithm
      - R has good support for both trivial and non-trivial parallelization methods
      - See R/parallel for a package that will parallelize actual R code
      - Vera, G. et al., BMC Bioinformatics, 2008, 9, 390
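The bootstrap is a good example of why such calculations are trivially parallel: every replicate is independent of the others. A minimal serial sketch in base R follows; the data, the helper name `boot_mean` and the number of replicates are illustrative assumptions.

```r
set.seed(123)
x <- rnorm(100, mean = 5)   # synthetic sample

# One bootstrap replicate: resample with replacement and recompute the statistic.
# The first argument is only the replicate index and is otherwise unused.
boot_mean <- function(i, data) mean(sample(data, replace = TRUE))

# Replicates do not depend on each other, so this sapply is the part
# that could be handed to a parallel apply function instead
reps <- sapply(seq_len(200), boot_mean, data = x)

# Bootstrap estimate of the standard error of the mean
se <- sd(reps)
```

Nothing in `boot_mean` needs to change to run it in parallel; only the function that drives the loop does.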
  • 32. Simple parallelization
      - The snow package allows easy use of multiple cores on a single computer or a cluster of computers
      - A simple wrapper over other parallel R libraries
      - Can support PVM, MPI
      - At the very least you can use all the cores on your own machine
      - http://cran.r-project.org/web/packages/snow/
  • 33. Serial code - Feature Selection
      - Rather than use GA, SA etc., just look at all combinations
      - Inelegant, but no worries about missing the global optimum

          x <- matrix(runif(500*40), ncol=40)
          y <- runif(500)
          library(gtools)
          # all 3-descriptor subsets of the 40 columns
          combos <- combinations(40, 3)
          # fit a linear model for each subset and return its R^2
          apply(combos, 1, function(z) {
            d <- data.frame(y=y, x=x[,z])
            fit <- lm(y~., data=d)
            cor(y, fit$fitted)^2
          })
  • 34. Simple parallelization - Feature Selection
      - Trivially parallelized

          x <- matrix(runif(500*40), ncol=40)
          y <- runif(500)
          library(gtools)
          combos <- combinations(40, 3)
          library(snow)
          # start two worker processes and copy the data to them
          cl <- makeSOCKcluster(2)
          clusterExport(cl, "x")
          clusterExport(cl, "y")
          # same function as the serial version, applied in parallel
          r2 <- parApply(cl, combos, 1, function(z) {
            d <- data.frame(y=y, x=x[,z])
            fit <- lm(y~., data=d)
            cor(y, fit$fitted)^2
          })
          # shut the workers down when finished
          stopCluster(cl)
  • 35. Big data scenarios
      - The idea behind snow can also be used to handle very large datasets
      - Simply chunk the data appropriately and papply over the list of filenames
      - Still requires you to perform chunking and keep track of everything
      - Hadoop is a nice way to avoid all this
          - Throw one or more (very) large files at it, let it deal with chunking and computation
          - For non-trivial file formats, you need to implement a chunker
      - RHIPE provides access to a Hadoop cluster from within R
      - http://hadoop.apache.org/core/
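The manual chunking approach described on the slide above can be sketched as follows, assuming the data fits a simple one-value-per-line text format. The chunk count, the temporary files and the helper `chunk_sum` are hypothetical, and plain `lapply` stands in for whichever parallel apply function is used.

```r
set.seed(7)
values <- runif(1000)   # synthetic data standing in for a large dataset

# Split the data into 4 chunks and write each one to its own temporary file
chunks <- split(values, rep(1:4, each = 250))
files <- vapply(chunks, function(ch) {
  f <- tempfile(fileext = ".txt")
  writeLines(format(ch, digits = 17), f)
  f
}, character(1))

# Per-chunk worker: read one file and return a partial result
chunk_sum <- function(f) sum(as.numeric(readLines(f)))

# Apply the worker over the list of filenames, then combine the partial results
partial <- lapply(files, chunk_sum)
total <- sum(unlist(partial))

# Clean up the temporary chunk files
unlink(files)
```

The bookkeeping here (splitting, naming files, combining results) is exactly what the slide notes you still have to do yourself with this approach.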
  • 36. Summary
      - rcdk successfully integrates cheminformatics functionality into the R environment
      - Related packages provide access to other forms of chemical data (fingerprints) and data sources
      - An excellent environment for chemical and biological data mining