1. Getting started on Amazon cloud
Some concrete applications using Hadoop
About RBelgium
R on Amazon cloud
Jean-Baptiste Poullet (RBelgium Founder and Co-Organizer)
2012
Jean-Baptiste Poullet (RBelgium Founder and Co-Organizer) R on Amazon cloud
Outline
1 Getting started on Amazon cloud
2 Some concrete applications using Hadoop
3 About RBelgium
Basics on AWS
Register for an AWS account with EC2 and S3 (http://aws.amazon.com/)
Account Number, Access Key ID, Secret Access Key, X.509 Certificate
S3, EC2, EMR, ...
Lost, or want more information?
http://aws.amazon.com/documentation/gettingstarted/
http://www.bucketexplorer.com/documentation/amazon-s3--what-is-my-aws-access-and-secret-key.html
http://www.yusufhm.info/content/adding-x509-certificate-aws-iam-user-api-command-line-tools-0
...
Why AWS?
Simple to use: just start up an instance from an AMI
Elastic: Auto Scaling groups (RAM, CPU) + load balancing (I/O) + Elastic IPs
On demand: any time, whatever you need (limited to 20 EC2 instances unless you request more); normal, spot, reserved and EBS-optimized instances (see http://aws.amazon.com/ec2/)
Which AMI(s)? (1/2)
Bioconductor on Amazon cloud: http://bioconductor.org/help/bioconductor-cloud-ami/
MPI cluster on Amazon:
Example
library(Rmpi)
mpi.spawn.Rslaves()
mpi.parLapply(1:mpi.universe.size(), function(x) x + 1)
mpi.close.Rslaves()
mpi.quit()
Listing 1: 'Rmpi' on EC2
Which AMI(s)? (2/2)
Parallel cluster on Amazon:
Example
library(parallel)
# one worker per host (private IPs of the EC2 instances)
cl <- makePSOCKcluster(c('10.68.155.30', '10.68.155.45', '10.68.155.65'))
# call myfunc with the same arguments on every worker
clusterCall(cl, myfunc, arg1, arg2, ...)
Listing 2: 'parallel' on EC2
Hadoop cluster on Amazon with RHadoop:
https://github.com/RevolutionAnalytics/RHadoop/tree/master/rmr2/pkg/tools
Storm cluster on Amazon:
https://github.com/nathanmarz/storm-deploy
SAP Hana (http://aws.amazon.com/sap/), Oracle R Enterprise
(Hadoop for batch + NoSQL for real-time), etc.
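The 'parallel' pattern of Listing 2 can be tried on a single machine before any EC2 hosts are involved. The following is a minimal sketch, assuming two local worker processes in place of the three IP addresses; `square_plus` is a hypothetical stand-in for myfunc and is not part of the original slides.

```r
library(parallel)

# Two local worker processes stand in for the EC2 hosts of Listing 2
# (assumption: no remote machines are needed for this sketch).
cl <- makePSOCKcluster(2)

# Hypothetical stand-in for myfunc.
square_plus <- function(x, offset) x^2 + offset

# clusterCall runs the same call on every worker: one result per worker.
per_worker <- clusterCall(cl, square_plus, 3, offset = 1)

# parLapply distributes a vector of inputs across the workers.
spread <- parLapply(cl, 1:4, square_plus, offset = 1)

stopCluster(cl)

print(unlist(per_worker))
print(unlist(spread))
```

Replacing the worker count with a vector of host names or IP addresses, as in Listing 2, should leave the rest of the code unchanged, provided R and passwordless SSH are set up on each host.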
Using rmr2 in Hadoop framework (1/4)
Toy case
Xβ = y
Least-squares solution: β = solve(t(X) %*% X, t(X) %*% y)
Example
library(rmr2)
X = to.dfs(matrix(rnorm(2000), ncol = 10))
y = as.matrix(rnorm(200))
Listing 6: initializing variables
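Before moving to Hadoop, the toy case can be checked in plain R. This is a minimal sketch, assuming X is kept as an ordinary in-memory matrix (no to.dfs) and using a fixed seed that is not in the original; it compares the normal-equations solution with base R's QR-based least squares.

```r
set.seed(42)  # assumption: fixed seed, only for reproducibility
X <- matrix(rnorm(2000), ncol = 10)  # 200 x 10, same shape as in Listing 6
y <- as.matrix(rnorm(200))

# Normal equations: (X'X) beta = X'y
beta_normal <- solve(t(X) %*% X, t(X) %*% y)

# Reference: least-squares fit via the QR decomposition
beta_qr <- qr.solve(X, y)

# Largest absolute difference between the two solutions
print(max(abs(beta_normal - beta_qr)))
```

The two solutions should agree up to floating-point error, which is why only t(X) %*% X and t(X) %*% y need to be computed on the cluster.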
Using rmr2 in Hadoop framework (2/4)
Example
tXX =
  values(
    from.dfs(
      mapreduce(
        input = X,
        map = function(k, Xi) keyval(1, list(t(Xi) %*% Xi)),
        # reduce = reducerFunction,
        # (left commented out as on the slide; a reducer that sums the
        #  partial products is needed once X spans several map chunks)
        combine = TRUE)))[[1]]
Listing 7: 'rmr2' matrix multiplication
Using rmr2 in Hadoop framework (3/4)
Example
tXy =
  values(
    from.dfs(
      mapreduce(
        input = X,
        map = function(k, Xi)
          keyval(1, list(t(Xi) %*% y)),
        combine = TRUE)))[[1]]
solve(tXX, tXy)
Listing 8: 'rmr2' solving
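The reason t(X) %*% X and t(X) %*% y fit the map/reduce model is that both are sums of per-block products. The sketch below illustrates this in plain base R, without Hadoop or rmr2; the split into 4 row blocks and the seed are arbitrary assumptions, and each block of X is paired with the matching rows of y.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
X <- matrix(rnorm(2000), ncol = 10)
y <- as.matrix(rnorm(200))

# Row indices split into 4 blocks, standing in for the chunks seen by map calls.
chunks <- split(seq_len(nrow(X)), rep(1:4, each = 50))

# "Map" step: every block emits its partial products.
partial_XX <- lapply(chunks, function(idx) {
  Xi <- X[idx, , drop = FALSE]
  t(Xi) %*% Xi
})
partial_Xy <- lapply(chunks, function(idx) {
  Xi <- X[idx, , drop = FALSE]
  yi <- y[idx, , drop = FALSE]  # rows of y matching this block of X
  t(Xi) %*% yi
})

# "Reduce" step: sum the partial products.
tXX <- Reduce(`+`, partial_XX)
tXy <- Reduce(`+`, partial_Xy)

beta <- solve(tXX, tXy)
print(beta)
```

In this sketch the summing step plays the role of the reducer, and because addition is associative the same function could also serve as a combiner.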
How to debug (4/4)
Debugging
rmr.str(varName)
R on EMR with segue package
Example
library(segue)
setCredentials("accessKey", "secretAccessKey")
myCluster <- createCluster(numInstances = 1,
                           masterInstanceType = "m1.small",
                           slaveInstanceType = "m1.small",
                           location = "us-east-1a")
# apply myfunc to every element of dataList on the EMR cluster
ResultList <- emrlapply(myCluster, dataList, myfunc)
stopCluster(myCluster)
Listing 9: R on EMR with 'segue'
R on EMR using the API command (1/3)
Upload the numberList.txt file (the integers from 1 to 100, one per line) and the R scripts "mapper.r" and "reducer.r" to your S3 bucket
Run the following command in bash:
Example
./elastic-mapreduce --create --stream \
  --input s3://yourbucket/numberList.txt \
  --mapper s3://yourbucket/mapper.r \
  --reducer s3://yourbucket/reducer.r \
  --output s3://emroutr1vv/myresults \
  --name EMRexampleR1 --num-instances 1
Listing 10: Running R on EMR
R on EMR using the API command (2/3)
Example
#! /usr/bin/env Rscript
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
# read stdin line by line and emit each number followed by a tab
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)
  cat(as.numeric(line), "\t", "\n", sep = "")
}
Listing 11: Running simple R scripts on EMR - mapper script
R on EMR using the API command (2/3)
Example
#! /usr/bin/env Rscript
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
x <- c()
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  x <- c(x, as.numeric(trimWhiteSpace(line)))
}
cat(mean(x))
Listing 12: Running simple R scripts on EMR - reducer script
How to debug (4/4)
Debugging
First debug your R code locally from the command line:
cat input.txt | R CMD BATCH --slave --no-timing mapper.r out.txt;
cat out.txt | R CMD BATCH --slave --no-timing reducer.r 2>&1
Listing 13: Debugging R code before using EMR
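The mapper and reducer logic can also be exercised inside a single R session, without stdin or EMR. This is a minimal sketch under that assumption: the functions `mapper` and `reducer` are hypothetical in-memory rewrites of the two scripts, and the input vector stands in for numberList.txt.

```r
# Stand-in for numberList.txt: the integers 1 to 100, one per line.
input_lines <- as.character(1:100)

# Mapper logic: emit "<number><TAB>" for every input line.
mapper <- function(lines) {
  paste0(as.numeric(trimws(lines)), "\t")
}

# Reducer logic: parse every mapped line back to a number and average.
reducer <- function(lines) {
  mean(as.numeric(trimws(lines)))
}

mapped <- mapper(input_lines)
result <- reducer(mapped)
print(result)  # 50.5, the mean of 1..100
```

If this in-memory version gives the expected mean, remaining failures on EMR are more likely to come from the job setup (paths, interpreter line, output directory) than from the R logic.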
Tips with EMR
Be careful with s3 versus s3n: use one or the other, but not both. For more information about the differences between s3 and s3n, see http://stackoverflow.com/questions/10569455/difference-between-amazon-s3-and-s3n-in-hadoop (accessed on Nov 6, 2012).
The first line of the script must correctly name the interpreter (such as #! /usr/bin/env Rscript for R or #!/usr/bin/python for Python). This is not necessary for a file that is only called by another one (e.g. when an R script calls an R function from another file, the function file does not need to start with #! /usr/bin/env Rscript).
The output directory must NOT exist before you launch your EMR job; otherwise the job will always FAIL. Use s3://yourProjects/project1 instead of s3://project1.
Projects in RBelgium
http://www.heritagehealthprize.com/c/hhp
Text mining using real “text” data extracted from the database systems of a project partner
RBelgium members (1/3)
RBelgium members (2/3)
Example
mygroup <- "RBelgium"
# libraries for communicating with the Meetup API
library(RJSONIO)
library(RCurl)
# library for plotting
library(ggplot2)
# get member data from meetup.com (mykey: your Meetup API key, defined beforehand)
domain.url <- paste("https://api.meetup.com/2/members?key=", mykey,
                    "&sign=true&group_urlname=", mygroup, sep = "")
domain.get <- getURL(domain.url)
domain.data <- fromJSON(domain.get)
# displaying names
print(unlist(lapply(domain.data$results, function(x) x$name)))
RBelgium members (3/3)
Example
# plotting graph
joins <- unlist(lapply(domain.data$results, function(x) x$joined))
orderedJoins <- joins[order(joins)]
lab = as.POSIXct(orderedJoins / 1000, origin = "1970-01-01")
df <- data.frame(
  x = lab,
  y = 1:length(domain.data$results)
)
png("memberJoined.png")
ggplot(df) +
  geom_point(aes(x = x, y = y)) +
  xlab("Date") +
  ylab("#members")
dev.off()
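The data preparation behind the plot can be tested without calling the Meetup API. The sketch below assumes three made-up 'joined' timestamps in milliseconds since 1970-01-01 UTC (hypothetical values, not real member data) and builds the same date versus cumulative-count table that the plot uses.

```r
# Hypothetical 'joined' values in milliseconds since the Unix epoch,
# deliberately out of order (not real RBelgium member data).
joins <- c(1335830400000, 1325376000000, 1330560000000)

# Sort, then convert milliseconds to seconds before building POSIXct dates.
orderedJoins <- sort(joins)
joined_at <- as.POSIXct(orderedJoins / 1000, origin = "1970-01-01", tz = "UTC")

# Cumulative member count: the i-th member to join brings the total to i.
df <- data.frame(date = joined_at, members = seq_along(orderedJoins))
print(df)
```

The explicit tz = "UTC" is a choice made for this sketch so that the dates do not depend on the local time zone; the original listing uses the session default.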
RBelgium on internet
Meetup: http://www.meetup.com/RBelgium/ (68 members)
Website: http://www.rbelgium.be
Twitter: twitter.com/rbelgium (5 followers)
LinkedIn: http://www.linkedin.com/groups/RBelgium-4223869?gid=4223869&trk=hb_side_g (7 members)
Google group: http://groups.google.com/group/rbelgium, rbelgium@googlegroups.com
Questions?