Data Computing Division
Hadoop Hands-On Session
Milind Bhandarkar
Greenplum, a Division of EMC
Monday, February 18, 2013
Prerequisites
• Make sure you have VMware Player installed
• VMware Fusion for Mac OS X
• Copy the GPHD (Greenplum Distribution of Hadoop v 1.0) virtual machine to your laptop
• Also copy the exercise.zip file to your laptop, and decompress it
Setting Up
• Start the GPHD virtual machine
• Make sure you can log in to it
• Copy exercise.zip from your laptop to the VM, and unzip it in ~/exercise
Preparation
• Make sure HDFS is running
• Make sure MapReduce is running
• Check the configuration files (*-site.xml)
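
One way to check the *-site.xml files is to list the properties they set. The sketch below assumes the usual Hadoop layout, a <configuration> element holding <property> entries with <name> and <value> children. The function name read_site_xml and the sample property value are illustrative and are not taken from the GPHD VM.

```python
import xml.etree.ElementTree as ET


def read_site_xml(text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Illustrative content only; the real files live in the Hadoop conf directory on the VM.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    for name, value in sorted(read_site_xml(SAMPLE_CORE_SITE).items()):
        print(name, "=", value)
```

To inspect a real file, read its contents and pass the text to read_site_xml; the path depends on how the VM is set up.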
Hands-On
• Objective: Implement Linear Regression using MapReduce, and use it to train a model
• Data Set: from the Marine Resources Division, Department of Primary Industries and Fisheries, Tasmania
• 4177 samples from observations
Data
• Attributes of a type of shellfish
• M/F, Length, Diameter, Height, Weight, Rings on shell
• Problem: Predict the number of rings as a function of the other attributes
Step 1
• Copy the small sample data set to HDFS
• See: Scripts/cp_to_grid.sh
Step 2
• Blow up the data set 1000 times by adding Gaussian noise to most fields
• Output: about 4M sample observations
• Using Hadoop Streaming
• See: Scripts/stream_replicate.sh
• Monitor this job in the JobTracker UI
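
The replication step can be sketched as a Hadoop Streaming mapper written in Python. This is not the content of Scripts/stream_replicate.sh, which the slides do not show. It assumes comma-separated records laid out as sex, numeric measurements, rings. It also assumes a noise standard deviation of 0.01 and that the first and last fields are left unchanged. The names replicate and run_mapper are hypothetical.

```python
import random

COPIES = 1000      # replication factor stated on the slide
NOISE_STD = 0.01   # assumed noise scale; the slides do not give one


def replicate(line, copies=COPIES, noise_std=NOISE_STD, rng=random):
    """Yield `copies` noisy variants of one comma-separated record.

    Assumed layout: sex, numeric measurements..., rings.
    The first and last fields are copied unchanged; Gaussian noise is
    added to every field in between.
    """
    fields = line.strip().split(",")
    sex, measurements, rings = fields[0], fields[1:-1], fields[-1]
    for _ in range(copies):
        noisy = ["%.4f" % (float(v) + rng.gauss(0.0, noise_std)) for v in measurements]
        yield ",".join([sex] + noisy + [rings])


def run_mapper(lines, **kwargs):
    """Apply replicate() to every non-empty input line."""
    for line in lines:
        if line.strip():
            for record in replicate(line, **kwargs):
                yield record


if __name__ == "__main__":
    # Under Hadoop Streaming the mapper would iterate over sys.stdin;
    # a built-in sample record keeps this sketch runnable on its own.
    SAMPLE = ["M,0.455,0.365,0.095,0.514,15"]
    for record in run_mapper(SAMPLE, copies=3):
        print(record)
```

Because every output record depends on only one input record, a job like this needs no reducer, and the mappers can run in parallel across the input splits.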
Step 3
• Train the model based on Linear Regression
• See: Scripts/stream_train_linreg.sh
• Monitor the job
• Copy the model to a local directory
• Check it
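
The slides do not say how Scripts/stream_train_linreg.sh fits the model. One common way to express linear regression in MapReduce, assumed here, is ordinary least squares via the normal equations: each mapper accumulates the partial sums X^T X and X^T y over its input split, and a single reducer adds the partial sums and solves for the weights. The sketch below keeps everything in memory and drops the sex field for brevity. In a real streaming job the mapper would print its sums as text and the reducer would parse them. All function names are hypothetical.

```python
def features(line):
    """Parse 'sex,m1,...,mk,rings' into (x, y); x starts with a bias term of 1.0.

    Assumption: the sex field is ignored to keep the sketch short.
    """
    fields = line.strip().split(",")
    x = [1.0] + [float(v) for v in fields[1:-1]]
    return x, float(fields[-1])


def map_partial(lines):
    """Mapper: accumulate X^T X and X^T y over one input split."""
    xtx, xty = None, None
    for line in lines:
        if not line.strip():
            continue
        x, y = features(line)
        if xtx is None:
            xtx = [[0.0] * len(x) for _ in x]
            xty = [0.0] * len(x)
        for i, xi in enumerate(x):
            xty[i] += xi * y
            for j, xj in enumerate(x):
                xtx[i][j] += xi * xj
    return xtx, xty


def reduce_partials(partials):
    """Reducer: sum the per-split matrices and vectors element-wise."""
    partials = [p for p in partials if p[0] is not None]
    n = len(partials[0][1])
    xtx = [[sum(p[0][i][j] for p in partials) for j in range(n)] for i in range(n)]
    xty = [sum(p[1][i] for p in partials) for i in range(n)]
    return xtx, xty


def solve(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix is singular.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w


if __name__ == "__main__":
    # Toy data generated from rings = 2 + 3*a - 1*b, split across two "mappers".
    split_a = ["M,0,0,2", "F,1,0,5", "M,0,1,1"]
    split_b = ["F,1,1,4", "M,2,1,7", "F,1,2,3"]
    xtx, xty = reduce_partials([map_partial(split_a), map_partial(split_b)])
    print(solve(xtx, xty))
```

The partial sums have a fixed size that depends only on the number of attributes, so the reducer receives a small amount of data no matter how many observations the mappers read.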