Hadoop - A Very Short Introduction


dewang_mistry

A short introduction to Hadoop and its ecosystem.

Technology

Hadoop
A Distributed Programming Framework

A Very Short Introduction

@Dewang_Mistry

DewangMistry.com

“Big data” is data that becomes large enough that it cannot be processed using conventional methods.
~ O’Reilly Radar

Hadoop

Apache Hadoop is not a database.
Apache Hadoop is not a single program, tool, or application, but a set of projects with a common goal, integrated under the umbrella term Hadoop (Core).

Distributed Systems

Low-end/commodity machines (scale-out)
Huge monolithic servers (scale-up)

Anatomy of a Hadoop Cluster
Distributed Computing (MapReduce)

Distributed storage (HDFS)

Commodity Hardware

Hadoop Architecture
Master node
Job Tracker: the MapReduce master, responsible for organizing where computational work should be scheduled on the slave nodes.
Name Node: the HDFS master, responsible for partitioning the storage across the slave nodes and keeping track of where data is located.

Slave nodes (three shown in the diagram)
Each slave node runs a Task Tracker (MapReduce) and a Data Node (HDFS).

Let the data remain where it is and move the executable code to its hosting machine.
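The idea of moving code to the data can be illustrated with a toy scheduler. This is only a sketch under stated assumptions, not the Job Tracker's actual algorithm: the function `schedule`, the block names, the node names, and the slot counts are all hypothetical. It prefers a node that already stores a block, and falls back to any free node otherwise.

```python
def schedule(block_locations, free_slots):
    """Assign one task per block, preferring nodes that already hold the block.

    block_locations: block id -> list of nodes storing a replica (hypothetical data)
    free_slots:      node -> number of free task slots (hypothetical data)
    Returns block id -> (chosen node, True if the task runs where the data is).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignments = {}
    for block, holders in block_locations.items():
        # Data-local candidates: nodes that store the block and have a free slot.
        local = [node for node in holders if slots.get(node, 0) > 0]
        # Fallback: any node with a free slot (the data must then be copied).
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free task slots")
        node = candidates[0]
        slots[node] -= 1
        assignments[block] = (node, node in holders)
    return assignments


block_locations = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-a", "node-c"],
}
free_slots = {"node-a": 1, "node-b": 1, "node-c": 1}
plan = schedule(block_locations, free_slots)
```

With one slot per node, every task in this small example ends up on a node that already holds its block, so no data has to cross the network.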

Hadoop Ecosystem
High-level languages: Crunch, Cascading, Pig, Hive
Predictive analytics: RHadoop, RHIPE, R, Mahout
Misc.: Sqoop, Hue, Flume, HBase
Hadoop: HDFS, MapReduce

MapReduce
Stated simply, the mapper is meant to filter and
transform the input into something that the reducer can
aggregate over.
MapReduce uses lists and (key/value) pairs as its main
data primitives.
In the example on the next slide, shapes are keys and their colors are values.
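The shapes-and-colors idea can be simulated in plain Python without a cluster. This is a minimal sketch, assuming each input record is a (shape, color) tuple; the functions `mapper`, `shuffle`, `reducer`, and `run_job` are illustrative names and not part of the Hadoop API.

```python
from collections import defaultdict


def mapper(record):
    """Filter and transform one input record into (key, value) pairs."""
    shape, color = record
    if shape is not None:      # filter: drop records without a shape
        yield shape, color     # transform: shape is the key, color the value


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Aggregate all values of one key: count how often each color occurs."""
    counts = defaultdict(int)
    for value in values:
        counts[value] += 1
    return key, dict(counts)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = [pair for record in records for pair in mapper(record)]
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


records = [
    ("circle", "red"),
    ("square", "blue"),
    ("circle", "red"),
    ("circle", "green"),
    (None, "blue"),  # malformed record, filtered out by the mapper
]
result = run_job(records)
```

The mapper only filters and reshapes records, while all counting happens in the reducer, which mirrors the division of labor described above.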

MapReduce
[Diagram: IN (x6) → Map (k1, v1) → Reduce (k2, v2) → OUT (x3)]

Data Logistics
Getting data into HDFS:
Move data from an RDBMS into Hadoop using Sqoop.
Move log files using Flume, Chukwa, or Scribe.

Writing Map/Reduce Jobs
Map/Reduce jobs can be written in several languages:
Python with Hadoop Streaming
  Pros: fast development
  Cons: slower than Java, no access to the Hadoop API
Java
  Pros: fast, access to the Hadoop API
  Cons: verbose language
Pig
  Pros: very small scripts, faster than Streaming
  Cons: yet another language to learn
Hive
  Pros: SQL-like syntax (easy for non-programmers) and a relational data model
  Cons: slower than Pig, more moving parts
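To give a feel for the Python option, here is a word-count sketch in the style of Hadoop Streaming, where the mapper and reducer exchange tab-separated text lines and the reducer receives its input sorted by key. The function names `map_lines` and `reduce_lines` and the sample sentences are assumptions for illustration; in a real Streaming job the two functions would live in scripts that read `sys.stdin` and print to standard output, and the framework would do the sorting. Here the sort is simulated locally with `sorted`.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of each word; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation of map -> sort -> reduce on two sample lines.
mapped = sorted(map_lines(["Hadoop is not a database", "Hadoop is a framework"]))
counts = list(reduce_lines(mapped))
```

Because the reducer relies only on sorted input, `groupby` can process one key at a time without holding the whole data set in memory.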

Use Cases
Where can we use Hadoop?
Reporting
  Granular reports over large data sets spanning 5-7 years
Business analysis
  Risk analysis
  Predictive analysis
Operational analysis
  Root cause analysis
  Latency analysis
  Better capacity planning (servers, people, bandwidth)
Product features
  Recommendations (better than external parties, because of the amount of data)

