Apache Hadoop, MapReduce
           &
     Windows Azure

   Guðmundur Jón Halldórsson
        Five Degrees
          July 2012
Web crawler! "No, this isn't about that"
What is Hadoop?

A system for processing
mind-bogglingly large
amounts of data
Hadoop


Map-Reduce = Computation
  HDFS     = Storage
HDFS
Hadoop Distributed File System

Yes, it is a file system, written in Java,
and you can do normal file system operations
like [ls, mkdir, ...].

It works best with large files. HDFS splits each file into
blocks of 128 MB (configurable).
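
To make the block arithmetic concrete, here is a minimal sketch of how a file might be cut into fixed-size blocks. The helper `splitIntoBlocks` and the file size are hypothetical and only illustrate the idea; this is not a Hadoop API, and real HDFS does this work inside the client and the NameNode.

```javascript
// Hypothetical helper for illustration only; not part of any Hadoop API.
var BLOCK_SIZE = 128 * 1024 * 1024; // 128 MB, the block size quoted on the slide

// Returns one { offset, length } record per block of the file.
function splitIntoBlocks(fileSize, blockSize) {
    var blocks = [];
    for (var offset = 0; offset < fileSize; offset += blockSize) {
        blocks.push({
            offset: offset,
            // The last block only holds whatever is left of the file.
            length: Math.min(blockSize, fileSize - offset)
        });
    }
    return blocks;
}

var oneGiB = 1024 * 1024 * 1024;
var blocks = splitIntoBlocks(oneGiB + 1, BLOCK_SIZE);
console.log(blocks.length);                    // 9
console.log(blocks[blocks.length - 1].length); // 1
```

A file one byte larger than 1 GiB needs a ninth block that holds just that byte, which is one reason many small files are a poor fit: every file costs at least one block entry in the NameNode.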
HDFS
HDFS keeps 3 copies of each block.
The NameNode (NN) tracks blocks and DataNodes (DN).

  DataNodes             NameNode: replica locations, one row per block

  DN1   DN2   DN3         DN1, DN4, DN7
  DN4   DN5   DN6         DN3, DN5, DN8
  DN7   DN8   DN9         DN3, DN4, DN5
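
The bookkeeping described on this slide can be sketched as a toy block map. Everything here is an assumption made for illustration: the names `placeBlocks` and `blocksOn`, the block IDs, and the round-robin placement. Real HDFS uses a rack-aware placement policy and is not exposed through an API like this.

```javascript
// Toy model of NameNode bookkeeping; all names are hypothetical.
var REPLICATION = 3; // copies kept of each block, as stated on the slide

// Assigns each block to `replication` distinct datanodes.
// Assumes replication <= dataNodes.length.
function placeBlocks(blockIds, dataNodes, replication) {
    var blockMap = {};
    blockIds.forEach(function (blockId, i) {
        var replicas = [];
        for (var r = 0; r < replication; r++) {
            // Simple round-robin placement, chosen only to keep the sketch short.
            replicas.push(dataNodes[(i + r) % dataNodes.length]);
        }
        blockMap[blockId] = replicas;
    });
    return blockMap;
}

// Lists the blocks that lose a replica if the given datanode fails.
function blocksOn(blockMap, dataNode) {
    return Object.keys(blockMap).filter(function (blockId) {
        return blockMap[blockId].indexOf(dataNode) !== -1;
    });
}

var dataNodes = ["DN1", "DN2", "DN3", "DN4", "DN5", "DN6", "DN7", "DN8", "DN9"];
var blockMap = placeBlocks(["blk_1", "blk_2", "blk_3"], dataNodes, REPLICATION);

console.log(blockMap["blk_2"].join(", "));       // DN2, DN3, DN4
console.log(blocksOn(blockMap, "DN1").join()); // blk_1
```

Because every block lives on three different nodes, losing one DataNode leaves two copies of each affected block, and a lookup like `blocksOn` tells the NameNode which blocks need a new replica.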
Map-Reduce
• Write a mapper that takes a key and a value and
  emits zero or more new keys and values
• Write a reducer that takes all the values of one key and
  emits zero or more new keys and values
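Map-Reduce JS example (word count)
// Mapper: split each input line into words and emit (word, 1).
var map = function ( key, value, context ) {
    var words = value.split(/[^a-zA-Z]/);
    for ( var i = 0; i < words.length; i++ ) {
        // Skip the empty strings produced by consecutive separators.
        if ( words[i] !== "" ) {
            context.write( words[i].toLowerCase(), 1 );
        }
    }
};

// Reducer: sum all the counts emitted for one word.
var reduce = function ( key, values, context ) {
    var sum = 0;
    while ( values.hasNext() ) {
        sum += parseInt( values.next(), 10 );
    }
    context.write( key, sum );
};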
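
To try the word count without a cluster, the sketch below runs the same two functions against an in-memory stand-in for the runtime. The driver `runLocally` is hypothetical: it only imitates the `context.write`, `values.hasNext()` and `values.next()` calls used on the slide, and the grouping step in the middle plays the role of Hadoop's shuffle.

```javascript
// The mapper and reducer from the slide, repeated so the sketch is self-contained.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical local driver; not a Hadoop API.
function runLocally(lines, mapFn, reduceFn) {
    // Null-prototype objects so words like "constructor" are safe as keys.
    var groups = Object.create(null);
    var output = Object.create(null);

    // Map phase: collect every emitted value under its key (the "shuffle").
    var mapContext = {
        write: function (k, v) {
            (groups[k] = groups[k] || []).push(v);
        }
    };
    lines.forEach(function (line, lineNo) {
        mapFn(lineNo, line, mapContext);
    });

    // Reduce phase: hand each key an iterator over its grouped values.
    var reduceContext = {
        write: function (k, v) { output[k] = v; }
    };
    Object.keys(groups).sort().forEach(function (k) {
        var vals = groups[k];
        var pos = 0;
        reduceFn(k, {
            hasNext: function () { return pos < vals.length; },
            next: function () { return vals[pos++]; }
        }, reduceContext);
    });
    return output;
}

var counts = runLocally(["the quick brown fox", "The fox"], map, reduce);
console.log(counts.fox); // 2
console.log(counts.the); // 2
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the grouping happens across the network, but the contract seen by `map` and `reduce` is the same.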
MapReduce
Data Systems and Their Timeframes
Does Hadoop solve all my DATA
problems, or is there something
         else out there?
•   Pig         High-level MapReduce language
•   Hive        SQL-like high-level MapReduce language
•   HBase       Real-time read/write data store (based on Google
                BigTable)
•   Accumulo    BigTable-style store developed by the NSA (similar to HBase)
•   Avro        Data serialization
•   ZooKeeper   Low-level coordination
•   HCatalog    Storage management and interoperability
                between all systems
•   Oozie       Job scheduler
•   Flume       Log and data aggregation
•   Whirr       Automated cloud clusters on EC2, Rackspace, etc.
•   Sqoop       Relational data importer
•   MRUnit      Unit testing for MapReduce jobs
•   Mahout      Machine learning libraries
•   Bigtop      Interoperability
•   Crunch      MapReduce pipelines in Java and Scala
•   Giraph      Graph processing on huge distributed graphs
