SlideShare una empresa de Scribd logo
1 de 73
Descargar para leer sin conexión
Distributed Data Analysis with Hadoop and R
Jonathan Seidman and Ramesh Venkataramaiah, Ph. D.
                 StrangeLoop2011
               September 20 | 2011
Flow of this Talk

•  Introductions

•  Hadoop, R and Interfacing the two


•  Our Prototypes

•  A use case for interfacing Hadoop and R

•  Alternatives for Running R on Hadoop

•  Alternatives to Hadoop and R

•  Conclusions

•  References
Who We Are
•  Ramesh Venkataramaiah, Ph. D.
   –  Principal Engineer, TechOps
   –  rvenkataramaiah@orbitz.com
   –  @rvenkatar

•  Jonathan Seidman
   –  Lead Engineer, Business Intelligence/Big Data Team
   –  Co-founder/organizer of Chicago Hadoop User Group (
      http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG) and
      Chicago Big Data (http://www.meetup.com/Chicago-Big-Data/
   –  jseidman@orbitz.com
   –  @jseidman


•  Orbitz Careers
   –  http://careers.orbitz.com/
   –  @OrbitzTalent
Launched in 2001




                   Over 160 million
                   bookings
                   7th Largest seller of
                   travel in the world
                                           page 4
Hadoop and R
as an analytic platform?
What is Hadoop?


Distributed file system (HDFS) and parallel processing
framework.


       Uses   MapReduce programming model as the core.

                Provides   fault tolerant and scalable storage
                 of very large datasets across machines in a cluster.
What is R? When do we need it?


Open-source stat package with visualization
        Vibrant community support.
                 One-line calculations galore!
                          Steep learning curve but worth it!




Insight into statistical properties and trends…
                        or for machine learning purposes…
                                  or Big Data to be understood well.



                                                                  page 7
Our Options

•  Data volume reduction by sampling
   –  Very bad for long-tail data distribution
   –  Approximation lead to bad conclusion
•  Scaling R
   –  Still in-memory
   –  But make it parallel using segue, Rhipe, R-Hive…
•  Use sql-like interfaces
   –  Apache Hive with Hadoop
   –  File sprawl and process issues
•  Regular DBMS
   –  How to fit square peg in a round hole
   –  No in-line R calls from SQL but commercial efforts are underway.


•  This Talk: How to bring Hadoop’s parallel processing
   capability to R environment.
                                                                         page 8
Our prototypes
  User segmentations
        Hotel bookings
  Airline Performance*


                   * Public dataset	





                                   page 9
We have two distinct dataspaces serving different
constituents




      Semi-structure data       Transactional data
        (e.g. searches)           (e.g. bookings)



      Hadoop Cluster           Data Warehouse

                                                     page 10
Our Hadoop infrastructure allows us to record and process
user activity at the individual level




         Detailed Non-          Transactional Data
      Transactional Data          (e.g. bookings)
       (What Each User
       Sees and Clicks)

                               Data Warehouse



          Hadoop

                                                            page 11
Getting a Buy-in
presented a long-term, semi-structured data growth story and
explained how this will help harness long-tail opportunities at
lowest cost.
         - Traditional DW!               - Big Data!
         -  Classical Stats!             -  Specific spikes!
         -  Sampling!                    -  Median is not the message!

                                         - Create a universal key !
                                         - Always keep source data!
                                         - Operationalize the
                                         infrastructure!




                                                                    * From a blog
An example of “median is not the message”

•  Positional Bias during Hotel Searches
Our Customers pick top positions the most…
Safari Users Seem to be Interested in More Expensive Hotels




                                                              page 15
Seasonal variations
•  Customer hotel stay gets longer during summer months
•  Could help in designing search based on seasons.




                                                          page 16
Workload and Resource Partition


     �                                   �                               �           �


             �                                   �                   �           �
                         �

                                     �               �                           �
         �           �                                       �               �
                         �                                       �                       �


                                             �           �                   �
                 �               �                                                       �
                             �




                                                                                         page 17
Airline Performance




                      page 18
Description of Use Case


•  Analyze openly available dataset: Airline on-time performance.
•  Dataset was used in Visualization Poster Competition 2009
   –  Consists of flight arrival/departure details from 1987-2008.
   –  Approximately 120 MM records totaling 12GB.
•  Available at: http://stat-computing.org/dataexpo/2009/




                                                                     page 19
Our dataset




              page 20
Airline Delay Plot: R code



> deptdelays.monthly.full <- read.delim("~/OSCON2011/Delays_by_Month.dat", header=F)
                                                                                   !
> View(deptdelays.monthly.full)!
> names(deptdelays.monthly.full) <- c("Year","Month","Count","Airline","Delay”)!

> Delay_by_month <- deptdelays.monthly.full[order(deptdelays.monthly.full
$Delay,decreasing=TRUE),]


> Top_10_Delay_by_Month <- Delay_by_Month[1:10,]!
> Top_10_Normal <- ((Delay - mean(Delay)) / sd(Delay))!

> symbols( Month, Delay, circles= Top_10_Normal, inches=.3, fg="white”,bg="red”,…)!
> text(Month, Delay, Airline, cex= 0.5)!




                                                                             page 21
Airline delay




                page 22
Multiple Distributions: R code
>   library(lattice)!
>   deptdelays.monthly.full$Year <- as.character(deptdelays.monthly.full$Year)!
>   h <- histogram(~Delay|Year,data=deptdelays.monthly.full,layout=c(5,5))!
>   update(h)!




                                                                            page 23
Running R on Hadoop:
 Hadoop Streaming



                       page 24
Hadoop Streaming – Overview


•  An alternative to the Java MapReduce API which allows you to
   write jobs in any language supporting stdin/stdout.
•  Limited to text data in current versions of Hadoop. Support for
   binary streams added in 0.21.0.
•  Requires installation of R on all DataNodes.




                                                                     page 25
Hadoop Streaming – Dataflow



                   1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI...	

                   1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI…	

Input to map       1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO...	

                   1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO...	

                   1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO…	

                                                                                             *	

                   1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...	





                                 PI|1988|1     	

 17 	

                                 PI|1988|1     	

 0 	

                                 PS|1987|10    	

 11 	

 Output from map                 PS|1987|10    	

 -2 	

                                 PS|1987|10    	

 1 	

                                 DL|1987|10    	

 14 	





* Map function receives input records line-by-line via standard input.


                                                                                                page 26
Hadoop Streaming – Dataflow Continued



                   DL|1987|10   	

 14 	

                   PI|1988|1    	

 0 	

Input to reduce    PI|1988|1    	

 17 	

                   PS|1987|10
                   PS|1987|10
                                	

 1 	

                                	

 11 	

                                                                                          *	

                   PS|1987|10   	

 -2 	





                                1987         	

 10   	

 1   	

 DL   	

 14 	

 Output from reduce             1988
                                1987
                                             	

 1
                                             	

 10
                                                      	

 2
                                                      	

 3
                                                              	

 PI
                                                              	

 PS
                                                                       	

 8.5 	

                                                                       	

 3.333333 	





* Reduce receives map output key/value pairs sorted by key, line-by-line.


                                                                                             page 27
Hadoop Streaming Example – map.R




                                   page 28
Hadoop Streaming Example – reduce.R




                                      page 29
Running R on Hadoop:
 Hadoop Interactive



                       page 30
Hadoop Interactive (hive) – Overview


•  Very unfortunate acronym.
•  Provides an interface to Hadoop from the R environment.
   –  Functions to access HDFS
   –  Control Hadoop
   –  And run streaming jobs directly from R
•  Allows HDFS data, including the output from MapReduce
   processing, to be manipulated and analyzed directly from R.
•  Seems to still have some rough edges.




                                                                 page 31
Hadoop Interactive – Example




                               page 32
Running R on Hadoop:
       RHIPE



                       page 33
RHIPE – Overview

•  Active project with frequent updates and active community.
•  RHIPE is based on Hadoop streaming source, but provides
   some significant enhancements, such as support for binary
   files (sort of).
•  Developed to provide R users with access to same Hadoop
   functionality available to Java developers.
   –  For example, provides rhcounter() and rhstatus(),
      analagous to counters and the reporter interface in the Java
      API.




                                                                     page 34
RHIPE – Overview

•  Can be somewhat confusing and intimidating.
  –  Then again, the same can be said for the Java API.
  –  Worth taking the time to get comfortable with.




                                                          page 35
RHIPE – Overview


•  Allows developers to work directly on data stored in HDFS in
   the R environment.
•  Also allows developers to write MapReduce jobs in R and
   execute them on the Hadoop cluster.
•  RHIPE uses Google protocol buffers to serialize data. Most R
   data types are supported.
   –  Using protocol buffers increases efficiency and provides
      interoperability with other languages.
•  Must be installed on all DataNodes.




                                                                  page 36
RHIPE – MapReduce

map <- expression({}) !
reduce <- expression( !
       pre = {…},!
       reduce = {…}, !
       post = {…}!
 ) !
z <- rhmr(map=map,reduce=reduce,!
             inout=c("text","sequence ), !
             ifolder=INPUT_PATH ,!
             ofolder=OUTPUT_PATH,!
             …)!
rhex(z) !




                                             page 37
RHIPE – Dataflow



                   Keys = […]	

                   Values =	

                    [1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI...	

                     1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI…	

Input to map         1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO...	

                     1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO...	

       *
                     1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO…	

                     1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...]	





                                 PI|1988|1     	

 17 	

                                 PI|1988|1     	

 0 	

                                 PS|1987|10    	

 11 	

 Output from map                 PS|1987|10    	

 -2 	

                                 PS|1987|10    	

 1 	

                                 DL|1987|10    	

 14 	





* Note that Input to map is a vector of keys and a vector of values.


                                                                                                    page 38
RHIPE – Dataflow Continued



                   DL|1987|10    	

 [14] 	


Input to reduce    PI|1988|1    	

 [0, 17] 	

                                                                                                 *
                   PS|1987|10   	

 [1,11,-2] 	





                                1987                	

 10   	

 1   	

 DL   	

 14 	

  Output from reduce            1988
                                1987
                                                    	

 1
                                                    	

 10
                                                             	

 2
                                                             	

 3
                                                                     	

 PI
                                                                     	

 PS
                                                                              	

 8.5 	

                                                                              	

 3.333333 	





* Input to reduce is a key and a vector containing a subset of intermediate
  values associated with that key. The reduce will get called until no more
  values exist for the key.

                                                                                                     page 39
RHIPE – Example




                  page 40
RHIPE – Example




                  page 41
RHIPE – Example




                  page 42
Running R on Hadoop:
        rmr



                       page 43
rmr Overview


•  New project from Revolution Analytics introduced August 2011.
•  Part of RHadoop, a suite of open-source projects which also
   includes:
   –  rhdfs – functions to access and manage HDFS from within
      R.
   –  rhbase – functions providing basic connectivity to HBase.
•  Goals are to provide productive environment for MapReduce
   programming in an R-like way - “…stay true to map reduce and
   true to R …”
•  Reduce gets all intermediate values for each key (yay!).
•  Like RHIPE, based on streaming source.




                                                                   page 44
rmr – Example




                page 45
Running R on Hadoop:
       Segue



                       page 46
Segue – Overview


•  Intended to work around single-threading in R by taking
   advantage of Hadoop streaming to provide simple parallel
   processing.
   –  For example, running multiple simulations in parallel.
•  Suitable for embarrassingly pleasantly parallel problems – big
   CPU, not big data.
•  Runs on Amazon’s Elastic Map Reduce (EMR).
   –  Not intended for internal clusters.
•  Provides emrlapply(), a parallel version of lapply()!




                                                                    page 47
Segue – Example




                  page 48
Performance Comparison:
 Streaming and RHIPE



                      page 49
Performance Testing – Environment and Setup


•  Twenty-eight DataNodes:
   –  Dual hex-core
   –  24GB RAM
   –  Shared cluster.


•  Data
   –  Airline dataset
   –  22 input files
   –  About 12GB uncompressed data




                                              page 50
Performance Comparison


Number of Reducers   Streaming      RHIPE
264                  246 seconds*   96 seconds*




*All numbers are an average of 3 runs.

                                                  page 51
Alternatives


Alternate languages/libraries:


•  Apache Mahout
   –  Scalable machine learning library.
   –  Offers clustering, classification, collaborative filtering on
      Hadoop.
•  Python
   –  Many modules available to support scientific and statistical
      computing.




                                                                      page 52
Alternatives


Alternative parallel processing frameworks:


•  Revolution Analytics
   –  Provides commercial packages to support processing big
      data with R.
•  Other HPC/parallel processing packages for R, e.g. Rmpi or
   snow.




                                                                page 53
Alternatives


Apache Hive + RJDBC?


•  We haven t been able to get it to work yet.
•  You can however wrap calls to the Hive client in R to return R
   objects. See https://github.com/satpreetsingh/rDBwrappers/wiki




                                                                    page 54
Alternatives


Interesting solutions that you can t have:


•  Ricardo
   –  Developed as part of a research project at IBM.
   –  Interesting paper published, but apparently no plans to
      make available.




                                                                page 55
Conclusions


•  If practical, consider using Hadoop to aggregate data for input
   to R analyses.
•  Avoid using R for general purpose MapReduce use.




                                                                     page 56
Conclusions


•  For simple MapReduce jobs, or embarrassingly parallel jobs
   on a local cluster, consider Hadoop streaming.
  –  Limited to processing text only.
  –  But easy to test at the command line outside of Hadoop:
     •  $ cat DATAFILE |./map.R |sort |./reduce.R!
•  To run compute-bound analyses with relatively small amount of
   data on the cloud look at Segue.




                                                                   page 57
Conclusions


•  Otherwise, your best bet is RHIPE, but definitely check out rmr.
•  Also consider alternatives – Mahout, Python, etc.




                                                                      page 58
Conclusions


On an operational note:


•  Make sure your cluster nodes are consistent – same version of
   R installed, required libraries are installed on each node, etc.




                                                                      page 59
Example Code


•  https://github.com/jseidman/hadoop-R




                                          page 60
References


•  Hadoop
   –  Apache Hadoop project: http://hadoop.apache.org/
   –  Hadoop The Definitive Guide, Tom White, O Reilly Press,
      2011
•  R
   –  R Project for Statistical Computing: http://www.r-project.org/
   –  R Cookbook, Paul Teetor, O Reilly Press, 2011
   –  Getting Started With R: Some Resources:
      http://quanttrader.info/public/gettingStartedWithR.html




                                                                       page 61
References


•  Hadoop Streaming
  –  Documentation on Apache Hadoop Wiki:
     http://hadoop.apache.org/mapreduce/docs/current/
     streaming.html
  –  Word count example in R :
     https://forums.aws.amazon.com/thread.jspa?
     messageID=129163




                                                        page 62
References


•  Hadoop InteractiVE
  –  Project page on CRAN:
     http://cran.r-project.org/web/packages/hive/index.html
  –  Simple Parallel Computing in R Using Hadoop:
     http://www.rmetrics.org/Meielisalp2009/Presentations/
     Theussl1.pdf




                                                              page 63
References


•  RHIPE
  –  RHIPE - R and Hadoop Integrated Processing Environment:
     http://www.stat.purdue.edu/~sguha/rhipe/
  –  Code: http://code.google.com/p/rhipe/
  –  Wiki: http://code.google.com/p/rhipe/w/list
  –  Installing RHIPE on CentOS:
     https://groups.google.com/forum/#!topic/brumail/
     qH1wjtBgwYI
  –  Introduction to using RHIPE:
     http://ml.stat.purdue.edu/rhafen/rhipe/
  –  RHIPE combines Hadoop and the R analytics language, SD
     Times: http://www.sdtimes.com/link/34792


                                                               page 64
References


•  RHIPE
  –  Using R and Hadoop to Analyze VoIP Network Data for
     QoS, Hadoop World 2010:
     •  video:
        http://www.cloudera.com/videos/
        hw10_video_using_r_and_hadoop_to_analyze_voip_net
        work_data_for_qos
     •  slides:
        http://www.cloudera.com/resource/
        hw10_voice_over_ip_studying_traffic_characteristics_for
        _quality_of_service
  –  RHIPE examples (k-means, etc.):
     http://groups.google.com/group/brumail/browse_thread/
     thread/e403db404f039e31?pli=1

                                                                  page 65
References


•  RHadoop (including rmr)
  –  Github: https://github.com/RevolutionAnalytics/RHadoop
  –  Advanced Big Data Analytics with R and Hadoop
    whitepaper:
    http://info.revolutionanalytics.com/R-and-Hadoop-Big-Data-
    Analytics-White-Paper.html




                                                                 page 66
References


•  Segue
  –  Project page: http://code.google.com/p/segue/
  –  Google Group:http://groups.google.com/group/segue-r
  –  Abusing Amazon s Elastic MapReduce Hadoop service…
     easily, from R, Jefferey Breen:
     http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-
     amazon-elastic-mapreduce-hadoop/
  –  Presentation at Chicago Hadoop Users Group March 23,
     2011:
     http://files.meetup.com/1634302/segue-presentation-
     RUG.pdf




                                                                page 67
References

•  Sawmill (A framework for integrating a PMML-compliant Scoring
   Engine with Hadoop).
  –  More information:
     •  Open Data Group www.opendatagroup.com
     •  oscon-info@opendatagroup.com
  –  Augustus, an open source system for building & scoring
     statistical models
     •  augustus.googlecode.com
  –  PMML
     •  Data Mining Group: dmg.org
  –  Analytics over Clouds using Hadoop, presentation at Chicago
     Hadoop User Group:
     http://files.meetup.com/1634302/CHUG 20100721 Sawmill.pdf

                                                                   page 68
References


•  Ricardo
   –  Ricardo: Integrating R and Hadoop, paper:
      http://www.cs.ucsb.edu/~sudipto/papers/sigmod2010-
      das.pdf
   –  Ricardo: Integrating R and Hadoop, Powerpoint
      presentation:
      http://www.uweb.ucsb.edu/~sudipto/talks/Ricardo-
      SIGMOD10.pptx




                                                           page 69
References


•  General references on Hadoop and R
  –  Pete Skomoroch s R and Hadoop bookmarks:
     http://www.delicious.com/pskomoroch/R+hadoop
  –  Pigs, Bees, and Elephants: A Comparison of Eight
     MapReduce Languages:
     http://www.dataspora.com/2011/04/pigs-bees-and-
     elephants-a-comparison-of-eight-mapreduce-languages/
  –  Quora – How can R and Hadoop be used together?:
     http://www.quora.com/How-can-R-and-Hadoop-be-used-
     together




                                                            page 70
References


•  Mahout
   –  Mahout project: http://mahout.apache.org/
   –  Mahout in Action, Owen, et. al., Manning Publications, 2011
•  Python
   –  Think Stats, Probability and Statistics for Programmers, Allen
      B. Downey, O Reilly Press, 2011
•  CRAN Task View: High-Performance and Parallel Computing with
   R, a set of resources compiled by Dirk Eddelbuettel:
   http://cran.r-project.org/web/views/
   HighPerformanceComputing.html




                                                                    page 71
References


•  Other examples of airline data analysis with R:
   –  A simple Big Data analysis using the RevoScaleR package
      in Revolution R:
      http://www.r-bloggers.com/a-simple-big-data-analysis-using-
      the-revoscaler-package-in-revolution-r/




                                                                    page 72
And finally…


 Parallel R (working title), Q Ethan McCallum, Stephen
  Weston, O Reilly Press, due autumn 2011


    R meets Big Data - a basket of strategies to help you use R
      for large-scale analysis and computation.




                                                                  page 73

Más contenido relacionado

La actualidad más candente

Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Introduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IIntroduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IEdureka!
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonCaserta
 
BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics? BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics? Datameer
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsSkillspeed
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Edureka!
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveYahoo Developer Network
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course pptNjain85
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Edureka!
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduceRyan Tabora
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Mahantesh Angadi
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop BasicsSonal Tiwari
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Senthil Kumar
 
Hadoop tools with Examples
Hadoop tools with ExamplesHadoop tools with Examples
Hadoop tools with ExamplesJoe McTee
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
VMUGIT UC 2013 - 08a VMware Hadoop
VMUGIT UC 2013 - 08a VMware HadoopVMUGIT UC 2013 - 08a VMware Hadoop
VMUGIT UC 2013 - 08a VMware HadoopVMUG IT
 

La actualidad más candente (20)

Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Introduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IIntroduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -I
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive Comparison
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics? BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics?
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course ppt
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
Hadoop tools with Examples
Hadoop tools with ExamplesHadoop tools with Examples
Hadoop tools with Examples
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
VMUGIT UC 2013 - 08a VMware Hadoop
VMUGIT UC 2013 - 08a VMware HadoopVMUGIT UC 2013 - 08a VMware Hadoop
VMUGIT UC 2013 - 08a VMware Hadoop
 

Destacado

Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013SATOSHI TAGOMORI
 
HW09 Social network analysis with Hadoop
HW09 Social network analysis with HadoopHW09 Social network analysis with Hadoop
HW09 Social network analysis with HadoopCloudera, Inc.
 
Large-scale social media analysis with Hadoop
Large-scale social media analysis with HadoopLarge-scale social media analysis with Hadoop
Large-scale social media analysis with Hadoopjakehofman
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Rohit Agrawal
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemMahabubur Rahaman
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReducefvanvollenhoven
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinerySteve Loughran
 
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop MeetupIntegrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetupgethue
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Cloudera, Inc.
 
Resume of Vimal 4.1
Resume of Vimal 4.1Resume of Vimal 4.1
Resume of Vimal 4.1Vimal Suthar
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and SparkJongwook Woo
 

Destacado (20)

Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013
 
HW09 Social network analysis with Hadoop
HW09 Social network analysis with HadoopHW09 Social network analysis with Hadoop
HW09 Social network analysis with Hadoop
 
Video Analysis in Hadoop
Video Analysis in HadoopVideo Analysis in Hadoop
Video Analysis in Hadoop
 
Large-scale social media analysis with Hadoop
Large-scale social media analysis with HadoopLarge-scale social media analysis with Hadoop
Large-scale social media analysis with Hadoop
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
 
Taller hadoop
Taller hadoopTaller hadoop
Taller hadoop
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Amazon Elastic Computing 2
Amazon Elastic Computing 2Amazon Elastic Computing 2
Amazon Elastic Computing 2
 
Hadoop administration
Hadoop administrationHadoop administration
Hadoop administration
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hadoop Trends
Hadoop TrendsHadoop Trends
Hadoop Trends
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinery
 
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop MeetupIntegrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
Hadoop data analysis
Hadoop data analysisHadoop data analysis
Hadoop data analysis
 
Resume of Vimal 4.1
Resume of Vimal 4.1Resume of Vimal 4.1
Resume of Vimal 4.1
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 

Similar a Distributed Data Analysis with Hadoop and R - Strangeloop 2011

Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringIRJET Journal
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Is Hadoop the Demise of Data Warehousing? The Impact of Hadoop/Big Data on BI...
Is Hadoop the Demise of Data Warehousing? The Impact of Hadoop/Big Data on BI...Is Hadoop the Demise of Data Warehousing? The Impact of Hadoop/Big Data on BI...
Is Hadoop the Demise of Data Warehousing? The Impact of Hadoop/Big Data on BI...Senturus
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Cloudera, Inc.
 
The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)Revolution Analytics
 
Big Data - HDInsight and Power BI
Big Data - HDInsight and Power BIBig Data - HDInsight and Power BI
Big Data - HDInsight and Power BIPrasad Prabhu (PP)
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 
Intro to hadoop ecosystem
Intro to hadoop ecosystemIntro to hadoop ecosystem
Intro to hadoop ecosystemGrzegorz Kolpuc
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 

Similar a Distributed Data Analysis with Hadoop and R - Strangeloop 2011 (20)

Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
מיכאל
מיכאלמיכאל
מיכאל
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Is Hadoop the Demise of Data Warehousing? The Impact of Hadoop/Big Data on BI...
Is Hadoop the Demise of Data Warehousing? The Impact of Hadoop/Big Data on BI...Is Hadoop the Demise of Data Warehousing? The Impact of Hadoop/Big Data on BI...
Is Hadoop the Demise of Data Warehousing? The Impact of Hadoop/Big Data on BI...
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
 
Text mining and Visualizations
Text mining  and VisualizationsText mining  and Visualizations
Text mining and Visualizations
 
Hadoop MapReduce
Hadoop MapReduceHadoop MapReduce
Hadoop MapReduce
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)
 
Big Data - HDInsight and Power BI
Big Data - HDInsight and Power BIBig Data - HDInsight and Power BI
Big Data - HDInsight and Power BI
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
Intro to hadoop ecosystem
Intro to hadoop ecosystemIntro to hadoop ecosystem
Intro to hadoop ecosystem
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 

Más de Jonathan Seidman

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Jonathan Seidman
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_finalJonathan Seidman
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Jonathan Seidman
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Jonathan Seidman
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Jonathan Seidman
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Jonathan Seidman
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Jonathan Seidman
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Jonathan Seidman
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Jonathan Seidman
 

Más de Jonathan Seidman (9)

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_final
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011
 

Último

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Último (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

Distributed Data Analysis with Hadoop and R - Strangeloop 2011

  • 1. Distributed Data Analysis with Hadoop and R Jonathan Seidman and Ramesh Venkataramaiah, Ph. D. StrangeLoop2011 September 20 | 2011
  • 2. Flow of this Talk •  Introductions •  Hadoop, R and Interfacing the two •  Our Prototypes •  A use case for interfacing Hadoop and R •  Alternatives for Running R on Hadoop •  Alternatives to Hadoop and R •  Conclusions •  References
  • 3. Who We Are •  Ramesh Venkataramaiah, Ph. D. –  Principal Engineer, TechOps –  rvenkataramaiah@orbitz.com –  @rvenkatar •  Jonathan Seidman –  Lead Engineer, Business Intelligence/Big Data Team –  Co-founder/organizer of Chicago Hadoop User Group ( http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG) and Chicago Big Data (http://www.meetup.com/Chicago-Big-Data/ –  jseidman@orbitz.com –  @jseidman •  Orbitz Careers –  http://careers.orbitz.com/ –  @OrbitzTalent
  • 4. Launched in 2001 Over 160 million bookings 7th Largest seller of travel in the world page 4
  • 5. Hadoop and R as an analytic platform?
  • 6. What is Hadoop? Distributed file system (HDFS) and parallel processing framework. Uses MapReduce programming model as the core. Provides fault tolerant and scalable storage of very large datasets across machines in a cluster.
  • 7. What is R? When do we need it? Open-source stat package with visualization Vibrant community support. One-line calculations galore! Steep learning curve but worth it! Insight into statistical properties and trends… or for machine learning purposes… or Big Data to be understood well. page 7
  • 8. Our Options •  Data volume reduction by sampling –  Very bad for long-tail data distribution –  Approximation lead to bad conclusion •  Scaling R –  Still in-memory –  But make it parallel using segue, Rhipe, R-Hive… •  Use sql-like interfaces –  Apache Hive with Hadoop –  File sprawl and process issues •  Regular DBMS –  How to fit square peg in a round hole –  No in-line R calls from SQL but commercial efforts are underway. •  This Talk: How to bring Hadoop’s parallel processing capability to R environment. page 8
  • 9. Our prototypes User segmentations Hotel bookings Airline Performance* * Public dataset page 9
  • 10. We have two distinct dataspaces serving different constituents Semi-structure data Transactional data (e.g. searches) (e.g. bookings) Hadoop Cluster Data Warehouse page 10
  • 11. Our Hadoop infrastructure allows us to record and process user activity at the individual level Detailed Non- Transactional Data Transactional Data (e.g. bookings) (What Each User Sees and Clicks) Data Warehouse Hadoop page 11
  • 12. Getting a Buy-in presented a long-term, semi-structured data growth story and explained how this will help harness long-tail opportunities at lowest cost. - Traditional DW! - Big Data! -  Classical Stats! -  Specific spikes! -  Sampling! -  Median is not the message! - Create a universal key ! - Always keep source data! - Operationalize the infrastructure! * From a blog
  • 13. An example of “median is not the message” •  Positional Bias during Hotel Searches
  • 14. Our Customers pick top positions the most…
  • 15. Safari Users Seem to be Interested in More Expensive Hotels page 15
  • 16. Seasonal variations •  Customer hotel stay gets longer during summer months •  Could help in designing search based on seasons. page 16
  • 17. Workload and Resource Partition � � � � � � � � � � � � � � � � � � � � � � � � � � page 17
  • 19. Description of Use Case •  Analyze openly available dataset: Airline on-time performance. •  Dataset was used in Visualization Poster Competition 2009 –  Consists of flight arrival/departure details from 1987-2008. –  Approximately 120 MM records totaling 12GB. •  Available at: http://stat-computing.org/dataexpo/2009/ page 19
  • 20. Our dataset page 20
  • 21. Airline Delay Plot: R code > deptdelays.monthly.full <- read.delim("~/OSCON2011/Delays_by_Month.dat", header=F) ! > View(deptdelays.monthly.full)! > names(deptdelays.monthly.full) <- c("Year","Month","Count","Airline","Delay”)! > Delay_by_month <- deptdelays.monthly.full[order(deptdelays.monthly.full $Delay,decreasing=TRUE),]
 > Top_10_Delay_by_Month <- Delay_by_Month[1:10,]! > Top_10_Normal <- ((Delay - mean(Delay)) / sd(Delay))! > symbols( Month, Delay, circles= Top_10_Normal, inches=.3, fg="white”,bg="red”,…)! > text(Month, Delay, Airline, cex= 0.5)! page 21
  • 22. Airline delay page 22
  • 23. Multiple Distributions: R code > library(lattice)! > deptdelays.monthly.full$Year <- as.character(deptdelays.monthly.full$Year)! > h <- histogram(~Delay|Year,data=deptdelays.monthly.full,layout=c(5,5))! > update(h)! page 23
  • 24. Running R on Hadoop: Hadoop Streaming page 24
  • 25. Hadoop Streaming – Overview •  An alternative to the Java MapReduce API which allows you to write jobs in any language supporting stdin/stdout. •  Limited to text data in current versions of Hadoop. Support for binary streams added in 0.21.0. •  Requires installation of R on all DataNodes. page 25
  • 26. Hadoop Streaming – Dataflow 1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI... 1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI… Input to map 1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO... 1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO... 1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO… * 1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL... PI|1988|1 17 PI|1988|1 0 PS|1987|10 11 Output from map PS|1987|10 -2 PS|1987|10 1 DL|1987|10 14 * Map function receives input records line-by-line via standard input. page 26
  • 27. Hadoop Streaming – Dataflow Continued DL|1987|10 14 PI|1988|1 0 Input to reduce PI|1988|1 17 PS|1987|10 PS|1987|10 1 11 * PS|1987|10 -2 1987 10 1 DL 14 Output from reduce 1988 1987 1 10 2 3 PI PS 8.5 3.333333 * Reduce receives map output key/value pairs sorted by key, line-by-line. page 27
  • 28. Hadoop Streaming Example – map.R page 28
  • 29. Hadoop Streaming Example – reduce.R page 29
  • 30. Running R on Hadoop: Hadoop Interactive page 30
  • 31. Hadoop Interactive (hive) – Overview •  Very unfortunate acronym. •  Provides an interface to Hadoop from the R environment. –  Functions to access HDFS –  Control Hadoop –  And run streaming jobs directly from R •  Allows HDFS data, including the output from MapReduce processing, to be manipulated and analyzed directly from R. •  Seems to still have some rough edges. page 31
  • 32. Hadoop Interactive – Example page 32
  • 33. Running R on Hadoop: RHIPE page 33
  • 34. RHIPE – Overview •  Active project with frequent updates and active community. •  RHIPE is based on Hadoop streaming source, but provides some significant enhancements, such as support for binary files (sort of). •  Developed to provide R users with access to same Hadoop functionality available to Java developers. –  For example, provides rhcounter() and rhstatus(), analagous to counters and the reporter interface in the Java API. page 34
  • 35. RHIPE – Overview •  Can be somewhat confusing and intimidating. –  Then again, the same can be said for the Java API. –  Worth taking the time to get comfortable with. page 35
  • 36. RHIPE – Overview •  Allows developers to work directly on data stored in HDFS in the R environment. •  Also allows developers to write MapReduce jobs in R and execute them on the Hadoop cluster. •  RHIPE uses Google protocol buffers to serialize data. Most R data types are supported. –  Using protocol buffers increases efficiency and provides interoperability with other languages. •  Must be installed on all DataNodes. page 36
  • 37. RHIPE – MapReduce map <- expression({}) ! reduce <- expression( ! pre = {…},! reduce = {…}, ! post = {…}! ) ! z <- rhmr(map=map,reduce=reduce,! inout=c("text","sequence ), ! ifolder=INPUT_PATH ,! ofolder=OUTPUT_PATH,! …)! rhex(z) ! page 37
  • 38. RHIPE – Dataflow Keys = […] Values = [1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI... 1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI… Input to map 1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO... 1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO... * 1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO… 1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...] PI|1988|1 17 PI|1988|1 0 PS|1987|10 11 Output from map PS|1987|10 -2 PS|1987|10 1 DL|1987|10 14 * Note that Input to map is a vector of keys and a vector of values. page 38
  • 39. RHIPE – Dataflow Continued DL|1987|10 [14] Input to reduce PI|1988|1 [0, 17] * PS|1987|10 [1,11,-2] 1987 10 1 DL 14 Output from reduce 1988 1987 1 10 2 3 PI PS 8.5 3.333333 * Input to reduce is a key and a vector containing a subset of intermediate values associated with that key. The reduce will get called until no more values exist for the key. page 39
  • 40. RHIPE – Example page 40
  • 41. RHIPE – Example page 41
  • 42. RHIPE – Example page 42
  • 43. Running R on Hadoop: rmr page 43
  • 44. rmr Overview •  New project from Revolution Analytics introduced August 2011. •  Part of RHadoop, a suite of open-source projects which also includes: –  rhdfs – functions to access and manage HDFS from within R. –  rhbase – functions providing basic connectivity to HBase. •  Goals are to provide productive environment for MapReduce programming in an R-like way - “…stay true to map reduce and true to R …” •  Reduce gets all intermediate values for each key (yay!). •  Like RHIPE, based on streaming source. page 44
  • 45. rmr – Example page 45
  • 46. Running R on Hadoop: Segue page 46
  • 47. Segue – Overview •  Intended to work around single-threading in R by taking advantage of Hadoop streaming to provide simple parallel processing. –  For example, running multiple simulations in parallel. •  Suitable for embarrassingly pleasantly parallel problems – big CPU, not big data. •  Runs on Amazon’s Elastic Map Reduce (EMR). –  Not intended for internal clusters. •  Provides emrlapply(), a parallel version of lapply()! page 47
  • 48. Segue – Example page 48
  • 50. Performance Testing – Environment and Setup •  Twenty-eight DataNodes: –  Dual hex-core –  24GB RAM –  Shared cluster. •  Data –  Airline dataset –  22 input files –  About 12GB uncompressed data page 50
  • 51. Performance Comparison Number of Reducers Streaming RHIPE 264 246 seconds* 96 seconds* *All numbers are an average of 3 runs. page 51
  • 52. Alternatives Alternate languages/libraries: •  Apache Mahout –  Scalable machine learning library. –  Offers clustering, classification, collaborative filtering on Hadoop. •  Python –  Many modules available to support scientific and statistical computing. page 52
  • 53. Alternatives Alternative parallel processing frameworks: •  Revolution Analytics –  Provides commercial packages to support processing big data with R. •  Other HPC/parallel processing packages for R, e.g. Rmpi or snow. page 53
  • 54. Alternatives Apache Hive + RJDBC? •  We haven t been able to get it to work yet. •  You can however wrap calls to the Hive client in R to return R objects. See https://github.com/satpreetsingh/rDBwrappers/wiki page 54
  • 55. Alternatives Interesting solutions that you can t have: •  Ricardo –  Developed as part of a research project at IBM. –  Interesting paper published, but apparently no plans to make available. page 55
  • 56. Conclusions •  If practical, consider using Hadoop to aggregate data for input to R analyses. •  Avoid using R for general purpose MapReduce use. page 56
  • 57. Conclusions •  For simple MapReduce jobs, or embarrassingly parallel jobs on a local cluster, consider Hadoop streaming. –  Limited to processing text only. –  But easy to test at the command line outside of Hadoop: •  $ cat DATAFILE |./map.R |sort |./reduce.R! •  To run compute-bound analyses with relatively small amount of data on the cloud look at Segue. page 57
  • 58. Conclusions •  Otherwise, your best bet is RHIPE, but definitely check out rmr. •  Also consider alternatives – Mahout, Python, etc. page 58
  • 59. Conclusions On an operational note: •  Make sure your cluster nodes are consistent – same version of R installed, required libraries are installed on each node, etc. page 59
  • 61. References •  Hadoop –  Apache Hadoop project: http://hadoop.apache.org/ –  Hadoop The Definitive Guide, Tom White, O Reilly Press, 2011 •  R –  R Project for Statistical Computing: http://www.r-project.org/ –  R Cookbook, Paul Teetor, O Reilly Press, 2011 –  Getting Started With R: Some Resources: http://quanttrader.info/public/gettingStartedWithR.html page 61
  • 62. References •  Hadoop Streaming –  Documentation on Apache Hadoop Wiki: http://hadoop.apache.org/mapreduce/docs/current/ streaming.html –  Word count example in R : https://forums.aws.amazon.com/thread.jspa? messageID=129163 page 62
  • 63. References •  Hadoop InteractiVE –  Project page on CRAN: http://cran.r-project.org/web/packages/hive/index.html –  Simple Parallel Computing in R Using Hadoop: http://www.rmetrics.org/Meielisalp2009/Presentations/ Theussl1.pdf page 63
  • 64. References •  RHIPE –  RHIPE - R and Hadoop Integrated Processing Environment: http://www.stat.purdue.edu/~sguha/rhipe/ –  Code: http://code.google.com/p/rhipe/ –  Wiki: http://code.google.com/p/rhipe/w/list –  Installing RHIPE on CentOS: https://groups.google.com/forum/#!topic/brumail/ qH1wjtBgwYI –  Introduction to using RHIPE: http://ml.stat.purdue.edu/rhafen/rhipe/ –  RHIPE combines Hadoop and the R analytics language, SD Times: http://www.sdtimes.com/link/34792 page 64
  • 65. References •  RHIPE –  Using R and Hadoop to Analyze VoIP Network Data for QoS, Hadoop World 2010: •  video: http://www.cloudera.com/videos/ hw10_video_using_r_and_hadoop_to_analyze_voip_net work_data_for_qos •  slides: http://www.cloudera.com/resource/ hw10_voice_over_ip_studying_traffic_characteristics_for _quality_of_service –  RHIPE examples (k-means, etc.): http://groups.google.com/group/brumail/browse_thread/ thread/e403db404f039e31?pli=1 page 65
  • 66. References •  RHadoop (including rmr) –  Github: https://github.com/RevolutionAnalytics/RHadoop –  Advanced Big Data Analytics with R and Hadoop whitepaper: http://info.revolutionanalytics.com/R-and-Hadoop-Big-Data- Analytics-White-Paper.html page 66
  • 67. References •  Segue –  Project page: http://code.google.com/p/segue/ –  Google Group:http://groups.google.com/group/segue-r –  Abusing Amazon s Elastic MapReduce Hadoop service… easily, from R, Jefferey Breen: http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to- amazon-elastic-mapreduce-hadoop/ –  Presentation at Chicago Hadoop Users Group March 23, 2011: http://files.meetup.com/1634302/segue-presentation- RUG.pdf page 67
  • 68. References •  Sawmill (A framework for integrating a PMML-compliant Scoring Engine with Hadoop). –  More information: •  Open Data Group www.opendatagroup.com •  oscon-info@opendatagroup.com –  Augustus, an open source system for building & scoring statistical models •  augustus.googlecode.com –  PMML •  Data Mining Group: dmg.org –  Analytics over Clouds using Hadoop, presentation at Chicago Hadoop User Group: http://files.meetup.com/1634302/CHUG 20100721 Sawmill.pdf page 68
  • 69. References •  Ricardo –  Ricardo: Integrating R and Hadoop, paper: http://www.cs.ucsb.edu/~sudipto/papers/sigmod2010- das.pdf –  Ricardo: Integrating R and Hadoop, Powerpoint presentation: http://www.uweb.ucsb.edu/~sudipto/talks/Ricardo- SIGMOD10.pptx page 69
  • 70. References •  General references on Hadoop and R –  Pete Skomoroch s R and Hadoop bookmarks: http://www.delicious.com/pskomoroch/R+hadoop –  Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages: http://www.dataspora.com/2011/04/pigs-bees-and- elephants-a-comparison-of-eight-mapreduce-languages/ –  Quora – How can R and Hadoop be used together?: http://www.quora.com/How-can-R-and-Hadoop-be-used- together page 70
  • 71. References •  Mahout –  Mahout project: http://mahout.apache.org/ –  Mahout in Action, Owen, et. al., Manning Publications, 2011 •  Python –  Think Stats, Probability and Statistics for Programmers, Allen B. Downey, O Reilly Press, 2011 •  CRAN Task View: High-Performance and Parallel Computing with R, a set of resources compiled by Dirk Eddelbuettel: http://cran.r-project.org/web/views/ HighPerformanceComputing.html page 71
  • 72. References •  Other examples of airline data analysis with R: –  A simple Big Data analysis using the RevoScaleR package in Revolution R: http://www.r-bloggers.com/a-simple-big-data-analysis-using- the-revoscaler-package-in-revolution-r/ page 72
  • 73. And finally… Parallel R (working title), Q Ethan McCallum, Stephen Weston, O Reilly Press, due autumn 2011 R meets Big Data - a basket of strategies to help you use R for large-scale analysis and computation. page 73