Amazon-style
shopping cart analysis
  using MapReduce
 on a Hadoop cluster


      Dan Şerban
Agenda
:: Introduction
   - Real-world uses of MapReduce
   - The origins of Hadoop
   - Hadoop facts and architecture
:: Part 1
   - Deploying Hadoop
:: Part 2
   - MapReduce is machine learning
:: Q&A
Why shopping cart analysis
 is useful to amazon.com
LinkedIn and Google Reader
The origins of Hadoop
:: Hadoop got its start in Nutch
:: A few enthusiastic developers were attempting
   to build an open source web search engine
   and having trouble managing computations
   running on even a handful of computers
:: Once Google published their GoogleFS and MapReduce
   whitepapers, the way forward became clear
:: Google had devised systems to solve precisely
   the problems the Nutch project was facing
:: Thus, Hadoop was born
Hadoop facts
:: Hadoop is a distributed computing platform
   for processing extremely large amounts of data
:: Hadoop is divided into two main components:
   - the MapReduce runtime
   - the Hadoop Distributed File System (HDFS)
:: The MapReduce runtime allows the user to submit
   MapReduce jobs
:: HDFS is a distributed file system that provides
   a logical interface for persistent and redundant
   storage of large data sets
:: Hadoop also provides the HadoopStreaming library
   that leverages STDIN and STDOUT so you can write
   mappers and reducers in your programming language
   of choice
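As a rough sketch of how a streaming job is submitted, the hypothetical helper below assembles a Hadoop Streaming command line. The `-input`, `-output`, `-mapper`, `-reducer` and `-file` options belong to Hadoop Streaming; the jar location and the HDFS paths are placeholders that depend on the installation, not values taken from the slides.

```python
# Hypothetical helper: builds the argument list for a Hadoop Streaming job.
# The jar path and HDFS paths passed in below are placeholders (assumptions).
def streaming_command(jar, input_path, output_path, mapper, reducer):
    return [
        "hadoop", "jar", jar,
        "-input", input_path,    # HDFS path holding the input data
        "-output", output_path,  # HDFS directory for results (must not exist yet)
        "-mapper", mapper,       # executable that reads STDIN, writes STDOUT
        "-reducer", reducer,
        "-file", mapper,         # ship the scripts to the worker nodes
        "-file", reducer,
    ]

cmd = streaming_command(
    jar="/path/to/hadoop-streaming.jar",
    input_path="/user/hadoop/carts",
    output_path="/user/hadoop/pairs",
    mapper="mapper.py",
    reducer="reducer.py",
)
print(" ".join(cmd))
```

The printed line can be pasted into a shell on a node where the `hadoop` command is available. Newer Hadoop releases may prefer the generic `-files` option over `-file`.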
Hadoop facts
:: Hadoop is based on the principle
   of moving computation to where the data is
:: Data stored on the Hadoop Distributed File System
   is broken up into chunks and replicated across
   the cluster, providing fault-tolerant parallel processing
   and redundancy for both the data and the jobs
:: Computation takes the form of a job which consists
   of a map phase and a reduce phase
:: Data is initially processed by map functions which run
   in parallel across the cluster
:: Map output is in the form of key-value pairs
:: The reduce phase then aggregates the map results
:: The reduce phase typically happens in multiple
   consecutive waves until the job is complete
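The map and reduce phases described above can be imitated in a single process. This is only an illustrative sketch, not how Hadoop is implemented: the function names are made up for the example, and the grouping step stands in for the sorting and shuffling that the framework performs between the two phases.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and collect the key-value pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Aggregate the grouped values for each key."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Example job: word count.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def sum_reducer(key, values):
    return sum(values)

lines = ["a b a", "b c"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

On a real cluster the mapper calls run in parallel on the nodes that hold the data chunks, and only the key-value pairs travel across the network.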
Hadoop architecture
Part 1:
Configuring and deploying
   the Hadoop cluster
Hands-on with Hadoop
core-site.xml - before
core-site.xml - after
hdfs-site.xml - before
hdfs-site.xml - after
mapred-site.xml - before
mapred-site.xml - after
Setting up SSH
:: hadoop@hadoop-master needs to be able to ssh* into:
   - hadoop@hadoop-master
   - hadoop@chunkserver-a
   - hadoop@chunkserver-b
   - hadoop@chunkserver-c
:: hadoop@job-tracker needs to be able to ssh* into:
   - hadoop@job-tracker
   - hadoop@chunkserver-a
   - hadoop@chunkserver-b
   - hadoop@chunkserver-c

* Without being prompted for a password or passphrase
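One common way to get there is a key pair with an empty passphrase whose public key is copied to every target. The sketch below only prints the commands to run as `hadoop@hadoop-master`; the helper name and the key path are assumptions, while the host names come from the list above. The same steps would be repeated on `job-tracker`.

```python
# Hypothetical helper: lists the shell commands for passwordless SSH.
# key_path is an assumed default; adjust it to the local setup.
def ssh_setup_commands(targets, key_path="~/.ssh/id_rsa"):
    # Generate an RSA key pair with an empty passphrase (-N "").
    commands = ['ssh-keygen -t rsa -N "" -f %s' % key_path]
    for target in targets:
        # Append the public key to the target's authorized_keys file.
        commands.append("ssh-copy-id -i %s.pub %s" % (key_path, target))
    return commands

master_targets = [
    "hadoop@hadoop-master",
    "hadoop@chunkserver-a",
    "hadoop@chunkserver-b",
    "hadoop@chunkserver-c",
]
commands = ssh_setup_commands(master_targets)
for command in commands:
    print(command)
```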
Hands-on with Hadoop
Part 2:
    MapReduce
is machine learning
Rolling your own
self-hosted alternative to ...
Hands-on with MapReduce
mapper.py
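#!/usr/bin/python

import sys

# Each input line is one shopping cart: whitespace-separated product IDs.
for line in sys.stdin:
  IDs = line.strip().split()
  for firstID in IDs:
    for secondID in IDs:
      # String comparison keeps one ordering of each pair and skips self-pairs.
      if secondID > firstID:
        # Emit "<firstID>_<secondID><TAB>1" for the reducer to sum.
        print('%s_%s\t%s' % (firstID, secondID, 1))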
reducer.py
#!/usr/bin/python

import sys

# Sum the counts emitted by the mapper for each product pair.
subTotals = {}
for line in sys.stdin:
  # Each line is "<pair><TAB><count>".
  word, count = line.strip().split('\t')
  subTotals[word] = subTotals.get(word, 0) + int(count)
for k, v in subTotals.items():
  print('%s\t%s' % (k, v))
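Before submitting the job to the cluster, the pair-counting logic can be checked locally. The sketch below re-implements the mapper and reducer logic in a single process and then answers the "customers who bought X also bought ..." question from the totals. The sample carts and the `also_bought` helper are made up for illustration, and product IDs are assumed to contain no underscore.

```python
from collections import Counter

def map_cart(line):
    """Yield ("<first>_<second>", 1) for every unordered pair in one cart."""
    ids = line.split()
    for first in ids:
        for second in ids:
            if second > first:
                yield ("%s_%s" % (first, second), 1)

def run_job(cart_lines):
    """Map every cart, then sum the counts per pair (the reducer's role)."""
    totals = Counter()
    for line in cart_lines:
        for key, count in map_cart(line):
            totals[key] += count
    return totals

def also_bought(totals, product_id):
    """Rank the products that share a cart with product_id, most frequent first."""
    related = Counter()
    for key, count in totals.items():
        first, second = key.split("_")  # assumes IDs contain no underscore
        if first == product_id:
            related[second] += count
        elif second == product_id:
            related[first] += count
    return related.most_common()

# Made-up sample data: one cart per line, product IDs separated by spaces.
carts = ["101 202 303", "101 202", "202 303", "101 404"]
totals = run_job(carts)
print(also_bought(totals, "101"))
```

In this sample, product 202 shares two carts with product 101, so it ranks first among the suggestions for 101.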
Hands-on with MapReduce
Other MapReduce use cases
:: Google Suggest
:: Video recommendations (YouTube)
:: Clickstream analysis (large web properties)
:: Spam filtering and contextual advertising (Yahoo)
:: Fraud detection (eBay, credit card companies)
:: Firewall log analysis to discover exfiltration and
   other undesirable (possibly malware-related) activity
:: Finding patterns in social data, analyzing "likes" and
   building a search engine on top of them (Facebook)
:: Discovering microblogging trends and opinion leaders,
   analyzing who follows whom (Twitter)
:: Plain old supermarket shopping basket analysis
:: The semantic web
Questions / Feedback
Bonus slide: Making of SQLite DB
