SlideShare una empresa de Scribd logo
1 de 15
Descargar para leer sin conexión
R and Hadoop

    Ram Venkat
   Dawn Analytics
What is Hadoop?
•   Hadoop is an open source Apache software for
    running distributed applications on 'big data'
•   It contains a distributed file system (HDFS) and a
    parallel processing 'batch' framework
•   Hadoop is written in java, runs on unix/linux for
    development and production
•   Windows and Mac can be used as development
    platform
•   Yahoo has > 43000 nodes hadoop cluster and
    Facebook has over 100 PB(PB= 1 M GB) of data in
    hadoop clusters
Hadoop overview (1/2)
     Central Idea: Moving computation to data and
    compute across nodes in parallel
•   Data Loading
Hadoop overview (2/2)
Parallel Computation: MapReduce
Map Reduce : Example 'Hello word'
•   Mathematically, this is what MapReduce is about:
    —map (k1, v1) ➔ list(k2, v2)
    —reduce(k2, list(v2)) ➔ list(<k3, v3>)
•   Implementation of the 'hello word' (word count):
    Mapper: K1 -> file name, v1 -> text of the file
             K2 -> word, V2 -> “1”
    Reducer: Sums up the '1' s and produces a list of
    words and their counts
Word Count (slide from Yahoo)
R libraries to work with Hadoop
•    'Hadoop Streaming' - An alternative to the Java
     MapReduce API
•    Hadoop Streaming allows you to write jobs in any
     language supporting stdin/stdout.
•    R has several libraries/ways that help you to work with
     Hadoop:
      – Write your mapper.R and reducer.R and run a shell
        script
      – 'rmr' and 'rhadoop' from revolution analytics
     – 'rhipe' from Purdue University statistical computing
     – 'RHive' interacts with Hadoop via Hive query
Word Count Demo with R(rmr)
  mapper.wordcount = function(key, val) {

                lapply(
                  strsplit( x = val, split = " ")[[1]],
                  function(w) keyval(w,1)
                )
  }

   reducer.wordcount = function(key, val.list) {
             output.key = key
             output.val = sum(unlist(val.list))
                return( keyval(output.key, output.val) )
   }
More advanced example –
    Sentiment Analysis in R(rmr)
•   One area where Hadoop could help out traders is in
    sentiment analysis
•   Oreilly Strata blog 'Trading on sentiment' :
    http://strata.oreilly.com/2011/05/sentiment-analysis-finance.html

•   Demo2 is modified code from this example from Jeffrey
    Breen on airlines sentiment analysis :
    http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/

•   Jeffrey has been very active in R groups in Chicago
    area, This is another tutorial last month on R and
    Hadoop by Jeffrey : http://www.slideshare.net/jeffreybreen/getting-started-
    with-r-hadoop
Demo2 Sentiment Analysis with rmr
 mapper.score = function(key, val) {

 # clean up tweets with R's regex-driven global substitute, gsub():
   val = gsub('[[:punct:]]', '', val)
   val = gsub('[[:cntrl:]]', '', val)
   val = gsub('d+', '', val)

 # Key is the Airline we added as tag to the tweets
   airline = substr(val,1,2)

 # Run the sentiment analysis
   output.key = c(as.character(airline), score.sentiment(val,pos.words,neg.words))

 # our interest is in computing the counts by airlines and scores, so 'this' count is 1
   output.val = 1
   return( keyval(output.key, output.val) )
 }
Demo3 - Hive

•   Hive is a data warehousing infrastructure for Hadoop
•   Provides a familiar SQL like interface to create tables,
    insert and query data
•   Behind the scene , it implements map-reduce
•    Hive is an alternative to our hadoop streaming we
    covered before
•   Demo3 – stock query with Hive
Use cases for Traders
•    Stock sentiment analysis
•    Stock trading pattern analysis
•    Default prediction
•    Fraud/anomoly detection
•    NextGen data warehousing
Hadoop support - Cloudera
•   Cloudera distribution of hadoop is one of the most
    popular distribution (I used their CDH3v5 in my 2
    demos above)
•   Doug Cutting, the creator of Hadoop is the architect
    with Cloudera
•   Adam Muise , a Cloudera engineer at Toronto is the
    organizer of Toronto Hadoop user Group (TOHUG)
•   Upcoming meeting organized by TOHUG on the 30th
    October - “PIG-fest”
Hadoop Tutorials and Books
•   http://hadoop.apache.org/docs/r0.20.2/quickstart.html
•   Cloudera: http://university.cloudera.com/
•   Book: “Hadoop in Action” – Manning
•   Book: “Hadoop - The Definitive Guide” – Oreilly
•   Hadoop Streaming:
    http://hadoop.apache.org/docs/mapreduce/r0.21.0/streaming.html
•   Google Code University:
    http://code.google.com/edu/parallel/mapreduce-tutorial.html
•   Yahoo's Tutorial :
    http://developer.yahoo.com/hadoop/tutorial/module1.html
Thank You

For any clarification, send e-mail to
ram@dawnanalytics.com

Más contenido relacionado

La actualidad más candente

Hadoop eco system-first class
Hadoop eco system-first classHadoop eco system-first class
Hadoop eco system-first class
alogarg
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
soujavajug
 
Hadoop course curriculm
Hadoop course curriculm Hadoop course curriculm
Hadoop course curriculm
alogarg
 

La actualidad más candente (20)

Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
 
Hadoop eco system-first class
Hadoop eco system-first classHadoop eco system-first class
Hadoop eco system-first class
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop Hive
 
Hadoop course curriculm
Hadoop course curriculm Hadoop course curriculm
Hadoop course curriculm
 
Big data and tools
Big data and tools Big data and tools
Big data and tools
 
Hadoop
Hadoop Hadoop
Hadoop
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop Ecosystem
 
R for data analytics
R for data analyticsR for data analytics
R for data analytics
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache Hadoop
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 

Similar a How to use hadoop and r for big data parallel processing

Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014
rpbrehm
 

Similar a How to use hadoop and r for big data parallel processing (20)

Hive paris
Hive parisHive paris
Hive paris
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User Group
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Data Science
Data ScienceData Science
Data Science
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshop
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Hadoop
HadoopHadoop
Hadoop
 
Webinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionWebinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop Solution
 
Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 

Último

Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
AnaAcapella
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 

Último (20)

Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 

How to use hadoop and r for big data parallel processing

  • 1. R and Hadoop Ram Venkat Dawn Analytics
  • 2. What is Hadoop? • Hadoop is an open source Apache software for running distributed applications on 'big data' • It contains a distributed file system (HDFS) and a parallel processing 'batch' framework • Hadoop is written in java, runs on unix/linux for development and production • Windows and Mac can be used as development platform • Yahoo has > 43000 nodes hadoop cluster and Facebook has over 100 PB(PB= 1 M GB) of data in hadoop clusters
  • 3. Hadoop overview (1/2) Central Idea: Moving computation to data and compute across nodes in parallel • Data Loading
  • 4. Hadoop overview (2/2) Parallel Computation: MapReduce
  • 5. Map Reduce : Example 'Hello word' • Mathematically, this is what MapReduce is about: —map (k1, v1) ➔ list(k2, v2) —reduce(k2, list(v2)) ➔ list(<k3, v3>) • Implementation of the 'hello word' (word count): Mapper: K1 -> file name, v1 -> text of the file K2 -> word, V2 -> “1” Reducer: Sums up the '1' s and produces a list of words and their counts
  • 6. Word Count (slide from Yahoo)
  • 7. R libraries to work with Hadoop • 'Hadoop Streaming' - An alternative to the Java MapReduce API • Hadoop Streaming allows you to write jobs in any language supporting stdin/stdout. • R has several libraries/ways that help you to work with Hadoop: – Write your mapper.R and reducer.R and run a shell script – 'rmr' and 'rhadoop' from revolution analytics – 'rhipe' from Purdue University statistical computing – 'RHive' interacts with Hadoop via Hive query
  • 8. Word Count Demo with R(rmr) mapper.wordcount = function(key, val) { lapply( strsplit( x = val, split = " ")[[1]], function(w) keyval(w,1) ) } reducer.wordcount = function(key, val.list) { output.key = key output.val = sum(unlist(val.list)) return( keyval(output.key, output.val) ) }
  • 9. More advanced example – Sentiment Analysis in R(rmr) • One area where Hadoop could help out traders is in sentiment analysis • Oreilly Strata blog 'Trading on sentiment' : http://strata.oreilly.com/2011/05/sentiment-analysis-finance.html • Demo2 is modified code from this example from Jeffrey Breen on airlines sentiment analysis : http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/ • Jeffrey has been very active in R groups in Chicago area, This is another tutorial last month on R and Hadoop by Jeffrey : http://www.slideshare.net/jeffreybreen/getting-started- with-r-hadoop
  • 10. Demo2 Sentiment Analysis with rmr mapper.score = function(key, val) { # clean up tweets with R's regex-driven global substitute, gsub(): val = gsub('[[:punct:]]', '', val) val = gsub('[[:cntrl:]]', '', val) val = gsub('d+', '', val) # Key is the Airline we added as tag to the tweets airline = substr(val,1,2) # Run the sentiment analysis output.key = c(as.character(airline), score.sentiment(val,pos.words,neg.words)) # our interest is in computing the counts by airlines and scores, so 'this' count is 1 output.val = 1 return( keyval(output.key, output.val) ) }
  • 11. Demo3 - Hive • Hive is a data warehousing infrastructure for Hadoop • Provides a familiar SQL like interface to create tables, insert and query data • Behind the scene , it implements map-reduce • Hive is an alternative to our hadoop streaming we covered before • Demo3 – stock query with Hive
  • 12. Use cases for Traders • Stock sentiment analysis • Stock trading pattern analysis • Default prediction • Fraud/anomoly detection • NextGen data warehousing
  • 13. Hadoop support - Cloudera • Cloudera distribution of hadoop is one of the most popular distribution (I used their CDH3v5 in my 2 demos above) • Doug Cutting, the creator of Hadoop is the architect with Cloudera • Adam Muise , a Cloudera engineer at Toronto is the organizer of Toronto Hadoop user Group (TOHUG) • Upcoming meeting organized by TOHUG on the 30th October - “PIG-fest”
  • 14. Hadoop Tutorials and Books • http://hadoop.apache.org/docs/r0.20.2/quickstart.html • Cloudera: http://university.cloudera.com/ • Book: “Hadoop in Action” – Manning • Book: “Hadoop - The Definitive Guide” – Oreilly • Hadoop Streaming: http://hadoop.apache.org/docs/mapreduce/r0.21.0/streaming.html • Google Code University: http://code.google.com/edu/parallel/mapreduce-tutorial.html • Yahoo's Tutorial : http://developer.yahoo.com/hadoop/tutorial/module1.html
  • 15. Thank You For any clarification, send e-mail to ram@dawnanalytics.com