Hadoop in a Nutshell
Siva Pandeti
Cloudera Certified Developer for Apache Hadoop (CCDH)
Overview
Why Hadoop?
What is Hadoop?
How to Hadoop?
Examples
Data Growth
What is Big Data?
Hadoop usage
Components
No SQL
Cluster
Vendors
Tool Comparison
Typical Implementation
Data Analysis with Pig & Hive
Opportunities
Map Reduce deep dive
Wordcount
Search index
Recommendation Engine
Why Hadoop?
Data Growth
• OLTP: databases for operations
o Throw away historical data
o Relational (Oracle, DB2)
• OLAP: data warehouses for analytics
o Cheaper centralized storage -> data warehouses (ETL tools)
o Relational/MPP appliances
o < few hundred TB
• Big Data: data explosion (social media, etc.)
o Petabyte scale
o Network speeds haven't increased; need data locality
o Distributed processing on commodity hardware (Hadoop)
o Non-relational
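A rough back-of-the-envelope calculation shows why stagnant network speeds push processing toward the data. The link speed, disk read rate and node count below are illustrative assumptions, not figures from the slides.

```python
# Rough transfer-time arithmetic; every rate below is an illustrative assumption.
PETABYTE = 10**15  # bytes


def transfer_seconds(num_bytes, bytes_per_second):
    """Time needed to move num_bytes at a sustained rate."""
    return num_bytes / bytes_per_second


# Assumption: a single 1 Gbit/s network link into a central server.
network_rate = 1e9 / 8           # bytes per second
# Assumption: 1,000 nodes, each scanning its local disk at 100 MB/s.
local_rate_per_node = 100e6      # bytes per second
nodes = 1000

central_days = transfer_seconds(PETABYTE, network_rate) / 86400
local_hours = transfer_seconds(PETABYTE / nodes, local_rate_per_node) / 3600

print(f"Ship 1 PB over one link: {central_days:.1f} days")
print(f"Scan 1 PB in place on {nodes} nodes: {local_hours:.1f} hours")
```

Under these assumptions, moving the data takes months while scanning it in place takes hours, which is the case for data locality.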
What is Big Data?
• Volume: petabyte scale
• Variety: structured, semi-structured, unstructured
• Velocity: social, sensor, throughput
• Veracity: unclean, imprecise, unclear
Where is Hadoop Used?
Industry: use cases
• Technology: search, "people you may know", movie recommendations
• Banks: fraud detection, regulatory, risk management
• Media, Retail: marketing analytics, customer service, product recommendations
• Manufacturing: preventive maintenance
What is Hadoop?
Open source distributed computing framework for storage and processing
• HDFS: Distributed Storage
o Economical: commodity hardware
o Scalable: rebalances data on new nodes
o Fault Tolerant: detects faults & auto recovers
o Reliable: maintains multiple copies of data
o High throughput: because data is distributed
• MapReduce: Distributed Processing
o Data Locality: process where the data resides
o Fault Tolerant: auto-recover job failures
o Scalable: add nodes to increase parallelism
o Economical: commodity hardware
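The "multiple copies of data" idea can be sketched as follows. This is a toy model only: the 128 MiB block size and replication factor of 3 are assumed values, the round-robin placement is a simplification rather than HDFS's actual placement policy, and `place_blocks` and `surviving_blocks` are hypothetical helper names.

```python
import math


def place_blocks(file_size, nodes, block_size=128 * 1024**2, replication=3):
    """Split a file into blocks and assign each block to `replication` distinct nodes.

    Round-robin placement is a simplification, not HDFS's real policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return [block for block, replicas in placement.items()
            if any(node != failed_node for node in replicas)]


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(1024**3, nodes)   # a 1 GiB file
print(len(placement), "blocks;", "block 0 lives on", placement[0])
print("readable after node2 fails:", len(surviving_blocks(placement, "node2")))
```

Because every block lives on several nodes, losing one node leaves all blocks readable, and different nodes can serve different blocks in parallel.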
NoSQL DBs - HBase
• Unlike RDBMS:
o De-normalized
o No secondary indexes
o No transactions
• Modeled after Google's Bigtable
• Random, real-time read/write access to Big Data
• Billions of rows x millions of columns
• Commodity hardware
• Open source, distributed, versioned, column-oriented
• Integrates with MapReduce; has Java/REST APIs
• Automatic sharding
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
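A minimal in-memory sketch of what "versioned, column-oriented" means for the data model. This is not the HBase client API; `TinyTable` and its methods are hypothetical names used only for illustration.

```python
from collections import defaultdict


class TinyTable:
    """In-memory stand-in for a versioned, column-oriented table (not the HBase API)."""

    def __init__(self):
        # row key -> column ("family:qualifier") -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        """Store a new version of a cell; older versions are kept."""
        self.rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None


table = TinyTable()
table.put("user1", "info:name", "Ann", timestamp=1)
table.put("user1", "info:name", "Anna", timestamp=2)
print(table.get("user1", "info:name"))               # latest version
print(table.get("user1", "info:name", timestamp=1))  # older version
```

Each row stores only the columns it actually has, which is how a table can be billions of rows by millions of columns without being dense.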
How Does Hadoop Work?
Cluster
• Master Node: Job Tracker, Name Node
• Slave Nodes (three shown): each runs a Task Tracker and a Data Node
Vendors
• Hadoop Distributors: Apache Hadoop, Cloudera, HortonWorks, MapR, EMR
• ETL/BI Connectors: Pentaho, Informatica, Talend, Clover, Microstrategy, Tableau, SAS, Ab Initio
Comparison: Traditional ETL/BI vs. Hadoop
• Cost
o Traditional ETL/BI: expensive license, expensive hardware
o Hadoop: open source, cheap commodity hardware
• Volume
o Traditional ETL/BI: < 100 TB, central storage
o Hadoop: petabyte scale, distributed storage
• Speed
o Traditional ETL/BI: quick response for processing small data; not as fast on large data
o Hadoop: even the smallest job takes 15 seconds; super fast on large data
• Throughput
o Traditional ETL/BI: thousands of reads/writes per minute
o Hadoop: millions of reads/writes per minute
How to Hadoop?
Hadoop Implementation
• Ingest: RDBMS, data feeds and files are loaded into HDFS with Flume, Sqoop, Put/Get or ETL tools
• Process: MapReduce, Pig, Hive, Mahout
• Output: reports, machine learning, analytics, visualization; SAS, R
Data Analysis: Pig & Hive
Both are abstractions on top of MapReduce that generate MapReduce jobs in the backend. Useful for analysts who are not programmers.
Pig
• Data flow language
• No schema
• Better with less structured data
• Example: LOAD 'file' USING PigStorage('\t') AS (id, name);
• Other operators: FILTER, FOREACH, GROUP, ORDER, STORE
Hive
• SQL-like language
• Schema, tables, joins are stored in a metastore
• Example:
CREATE TABLE customer (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
SELECT * FROM customer WHERE id < 100 LIMIT 10;
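For readers more at home in a scripting language, the effect of the Hive example can be approximated in plain Python: apply the (id INT, name STRING) schema to tab-delimited lines while reading, filter, and stop at the limit. The sample rows and the `select_customers` helper are made up for illustration, and Hive would of course run this as distributed jobs rather than in one process.

```python
import csv
import io


def select_customers(tsv_text, max_id=100, limit=10):
    """Mimic: SELECT * FROM customer WHERE id < max_id LIMIT limit."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    result = []
    for row in reader:
        record = (int(row[0]), row[1])   # schema applied on read: id INT, name STRING
        if record[0] < max_id:
            result.append(record)
            if len(result) == limit:
                break
    return result


# Made-up sample data in the tab-delimited layout the table definition expects.
sample = "1\talice\n250\tbob\n42\tcarol\n"
print(select_customers(sample))
```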
MapReduce
Source: http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
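The map, shuffle and reduce stages shown in the word count diagram can be simulated on one machine. This is a minimal sketch of the idea, not Hadoop's implementation; the function names and sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Call the reducer once per key with all of that key's values."""
    return [reducer(key, values) for key, values in groups]


def wc_mapper(line):
    """Word count mapper: emit (word, 1) for each word."""
    for word in line.lower().split():
        yield word, 1


def wc_reducer(word, counts):
    """Word count reducer: sum the ones."""
    return word, sum(counts)


lines = ["the quick brown fox", "the lazy dog", "the fox"]
result = reduce_phase(shuffle(map_phase(lines, wc_mapper)), wc_reducer)
print(result)
```

On a cluster the mapper calls run in parallel on the nodes holding the input blocks, and the shuffle moves each key's values to the node running its reducer.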
Examples
Word count - Java
• Copy input files to HDFS
o hadoop fs -put file1.txt input
• Create driver
o Set configuration variables, mapper and reducer class names
• Create mapper
o Read input and emit key value pairs
• Create reducer (optional)
o Aggregate all values for a particular key
• Execute
o hadoop jar WordCount.jar WordCount input output
• Analyze output
o hadoop fs -cat output/* | head
Word count - Streaming
• Hadoop is written in Java. I don't know Java. What do I do?
o Hadoop Streaming (Python, Ruby, R, etc.)
• Copy input files to HDFS
o hadoop fs -put file1.txt input
• Create mapper
o Read input stream (stdin) and emit (print) key value pairs
• Create reducer (optional)
o Aggregate all values for a particular key
• Execute
o hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-stream*.jar -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py -input input -output output
• Analyze output
o hadoop fs -cat output/* | head
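A minimal sketch of what mapper.py and reducer.py might contain for word count. The slides do not show these scripts, so the function bodies are assumptions; both are shown in one block for brevity, and the tab-separated "word, count" line format is the conventional choice for Streaming jobs.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum counts per word; relies on the input arriving sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# In mapper.py the entry point would be roughly:
#     import sys
#     for out in mapper(sys.stdin):
#         print(out)
# and reducer.py would do the same with reducer(sys.stdin).

# Local dry run: sorted() stands in for the shuffle/sort step between the phases.
mapped = list(mapper(["to be or not to be\n"]))
print(list(reducer(sorted(mapped))))
```

The same dry run works from a shell by piping a file through the mapper, `sort`, and the reducer before submitting the job to the cluster.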
Hadoop for R
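# Point R at the local Hadoop installation
Sys.setenv(HADOOP_HOME="/home/istvan/hadoop")
Sys.setenv(HADOOP_CMD="/home/istvan/hadoop/bin/hadoop")
library(rmr2)
library(rhdfs)
# Load the GDP data locally and inspect the first rows
setwd("/home/istvan/rhadoop/blogs/")
gdp <- read.csv("GDP_converted.csv")
head(gdp)
# Initialize rhdfs and copy the data frame into HDFS
hdfs.init()
gdp.values <- to.dfs(gdp)
# AAPL revenue in 2012 in millions USD
aaplRevenue = 156508
# Map: label each row "less" or "greater" by comparing its fourth column with Apple's revenue
gdp.map.fn <- function(k,v) {
key <- ifelse(v[4] < aaplRevenue, "less", "greater")
keyval(key, 1)
}
# Reduce: count the rows emitted under each label
count.reduce.fn <- function(k,v) {
keyval(k, length(v))
}
# Run the job and read the result back from HDFS
count <- mapreduce(input=gdp.values,
map = gdp.map.fn,
reduce = count.reduce.fn)
from.dfs(count)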
• RHadoop packages
o rmr
o rhdfs
o rhbase
• Uses Hadoop Streaming
• The example above counts how many countries have a GDP greater than Apple's 2012 revenue
Source: http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/
Search index example
• Crawl web
o Crawl and save websites to local directory
• Ingest files to HDFS
• Map
o Split the words & associate words with file names
• Reduce
o Build an index with words and files & count of occurrences
• Search
o Pass the word to the index to get the files it shows up in. Display the file
listing in descending order of number of occurrences of the word in a file
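The map, reduce and search steps above can be sketched on a single machine. The document contents and helper names are made up for illustration; a real job would read the crawled files from HDFS and build the index across many reducers.

```python
from collections import Counter, defaultdict


def map_words(file_name, text):
    """Map step: associate every word with the file it came from."""
    for word in text.lower().split():
        yield word, file_name


def build_index(documents):
    """Reduce step: for each word, count occurrences per file."""
    index = defaultdict(Counter)
    for file_name, text in documents.items():
        for word, name in map_words(file_name, text):
            index[word][name] += 1
    return index


def search(index, word):
    """Files containing `word`, most occurrences first."""
    return index.get(word.lower(), Counter()).most_common()


# Made-up stand-ins for crawled pages.
docs = {"a.txt": "hadoop stores data", "b.txt": "hadoop runs hadoop jobs"}
index = build_index(docs)
print(search(index, "Hadoop"))
```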
Recommender example
• Use web server logs with user ratings info for items
• Create Hive tables to build structure on top of this log data
• Generate Mahout specific csv input file (user, item, rating)
• Run Mahout to build item recommendations for users
o mahout recommenditembased \
--input /user/hive/warehouse/mahout_input \
--output recommendations \
-s SIMILARITY_PEARSON_CORRELATION -n 20
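To give a feel for the similarity measure named in the command, here is a small sketch of Pearson correlation between two items, computed over the users who rated both. It is not Mahout's implementation; the (user, item, rating) triples are made up, and `pearson` is a hypothetical helper.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over users who rated both items (dicts: user -> rating)."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    xs = [ratings_a[user] for user in common]
    ys = [ratings_b[user] for user in common]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0   # undefined correlation treated as "no similarity" here
    return cov / math.sqrt(var_x * var_y)


# Made-up (user, item, rating) triples in the same shape as the csv input.
triples = [(1, 101, 5), (2, 101, 3), (3, 101, 1),
           (1, 102, 4), (2, 102, 3), (3, 102, 2),
           (1, 103, 1), (2, 103, 3), (3, 103, 5)]
by_item = {}
for user, item, rating in triples:
    by_item.setdefault(item, {})[user] = rating

print(pearson(by_item[101], by_item[102]))   # rated alike
print(pearson(by_item[101], by_item[103]))   # rated oppositely
```

An item-based recommender then suggests, for each user, items that are most similar to the ones that user already rated highly.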
Recap
Why Hadoop?
What is Hadoop?
How to Hadoop?
Demo
Data Growth
What is Big Data?
Hadoop usage
Components
No SQL
Cluster
Vendors
Tool Comparison
Typical Implementation
Data Analysis with Pig & Hive
Opportunities
Map Reduce deep dive
Wordcount
Search index
Recommendation Engine
Q & A
Contact Siva Pandeti:
Email: siva@pandeti.com
LinkedIn: www.linkedin.com/in/SivaPandeti
Twitter: @SivaPandeti
http://pandeti.com/blog

Hadoop overview

  • 1. Hadoop in a Nutshell Siva Pandeti Cloudera Certified Developer for Apache Hadoop (CCDH)
  • 2. Overview Why Hadoop? What is Hadoop? How to Hadoop? Examples Data Growth What is Big Data? Hadoop usage Components No SQL Cluster Vendors Tool Comparison Typical Implementation Data Analysis with Pig & Hive Opportunities Map Reduce deep dive Wordcount Search index Recommendation Engine
  • 4. Data Growth OLTP Databases for Operations Throw away historical data Relational Oracle, DB2 OLAP Data warehouses for analytics Cheaper centralized storage -> Data warehouses (ETL tools) Relational/MPP appliances < few hundred TB Big Data Data explosion (social media, etc) Petabyte scale Network speeds haven’t increased Need Data Locality Distributed processing on commodity hardware (Hadoop) Non-relational
  • 5. Big Data What is Big Data? Volume Petabyte scale Variety Structured Semi-structured Unstructured Velocity Social Sensor Throughput Veracity Unclean Imprecise Unclear
  • 6. Where is Hadoop Used? Industry Technology Use Cases Search People you may know Movie recommendations Banks Fraud Detection Regulatory Risk management Media Retail Marketing analytics Customer service Product recommendations Manufacturing Preventive maintenance
  • 8. Hadoop HDFS Distributed Storage Economical: commodity hardware Scalable: rebalances data on new nodes Fault Tolerant: detects faults & auto recovers Reliable: maintains multiple copies of data High throughput: because data is distributed Open source distributed computing framework for storage and processing What is Hadoop? MapReduce Distributed Processing Data Locality: process where the data resides Fault Tolerant: auto-recover job failures Scalable: add nodes to increase parallelism Economical: commodity hardware
  • 9. • Unlike RDBMS: o De-normalized o No secondary indexes o No transactions • Modeled after Google’s Big Table • Random real time read/write access to Big Data • Billions of rows x millions of columns • Commodity hardware • Open source, distributed, versioned, column oriented • Integrates with MapReduce; Has Java/REST APIs • Automatic sharding NoSQL DBs - HBase Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
  • 10. How Does Hadoop Work? Cluster
      • Master Node: Job Tracker, Name Node
      • Slave Nodes (three shown): each runs a Task Tracker and a Data Node
  • 11. Vendors
      • Hadoop Distributors: Apache Hadoop, Cloudera, HortonWorks, MapR, EMR
      • ETL/BI Connectors: Pentaho, Informatica, Talend, Clover, Microstrategy, Tableau, SAS, AbInitio
  • 12. Comparison: Traditional ETL/BI vs. Hadoop
      • Cost
          o Traditional ETL/BI: expensive license, expensive hardware
          o Hadoop: open source, cheap commodity hardware
      • Volume
          o Traditional ETL/BI: < 100 TB, central storage
          o Hadoop: petabyte scale, distributed storage
      • Speed
          o Traditional ETL/BI: quick response for processing small data; not as fast on large data
          o Hadoop: even the smallest job takes 15 seconds; super fast on large data
      • Throughput
          o Traditional ETL/BI: thousands of reads/writes per minute
          o Hadoop: millions of reads/writes per minute
  • 14. Hadoop Implementation
      • Sources: RDBMS, data feeds, files
      • Ingest: Flume, Sqoop, Put/Get, ETL tools
      • Hadoop: HDFS
      • Process: MapReduce, Pig, Hive, Mahout
      • Output: reports, machine learning, analytics, visualization, SAS, R
  • 15. Data Analysis: Pig & Hive
      • Both: abstraction on top of MapReduce. Generates MapReduce jobs in the backend. Useful for analysts who are not programmers.
      • Pig
          o Data flow language
          o No schema
          o Better with less structured data
          o Example: LOAD 'file' USING PigStorage('\t') AS (id, name); followed by FILTER, FOREACH, GROUP, ORDER, STORE
      • Hive
          o SQL-like language
          o Schema, tables, joins are stored in a meta-store
          o Example: CREATE TABLE customer (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; SELECT * FROM customer WHERE id < 100 LIMIT 10;
  • 18. Word count - Java
      • Copy input files to HDFS
          o hadoop fs -put file1.txt input
      • Create driver
          o Set configuration variables, mapper and reducer class names
      • Create mapper
          o Read input and emit key-value pairs
      • Create reducer (optional)
          o Aggregate all values for a particular key
      • Execute
          o hadoop jar WordCount.jar WordCount input output
      • Analyze output
          o hadoop fs -cat output/* | head
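The mapper and reducer roles in the word count steps above can be sketched without a cluster. The following is a minimal in-memory illustration in Python, not the Java job from the slide; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical and chosen only for this example. The `shuffle` step stands in for the grouping by key that the framework performs between the map and reduce phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every whitespace-separated word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, standing in for the framework's shuffle."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: aggregate all values for a particular key."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog"]
    for word, count in sorted(word_count(sample).items()):
        print(f"{word}\t{count}")
```

On a real cluster the driver wires the mapper and reducer classes together and the input comes from HDFS rather than a Python list.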
  • 19. Word count - Streaming
      • Hadoop is written in Java. I don’t know Java. What do I do?
          o Hadoop Streaming (Python, Ruby, R, etc.)
      • Copy input files to HDFS
          o hadoop fs -put file1.txt input
      • Create mapper
          o Read input stream (stdin) and emit (print) key-value pairs
      • Create reducer (optional)
          o Aggregate all values for a particular key
      • Execute
          o hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-stream*.jar -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py -input input -output output
      • Analyze output
          o hadoop fs -cat output/* | head
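As a rough sketch of what the streaming mapper and reducer might contain, the Python below reads lines and emits tab-separated key-value pairs. It assumes the usual streaming convention that the reducer receives its input sorted by key. The slide does not show the contents of mapper.py or reducer.py, so the function bodies and the single-file layout with a `map` / `reduce` argument are assumptions made for illustration; in practice each function would live in its own script.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per word; input must already be sorted by word."""
    parsed = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if line.strip()  # skip blank lines
    )
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical single-file layout: pass "map" or "reduce" as the argument.
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same logic can be checked locally with a shell pipeline that sorts the mapper output before handing it to the reducer, which mimics the sort that happens between the two phases.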
  • 20. Hadoop for R
      • RHadoop package
          o rmr
          o rhdfs
          o rhbase
      • Uses Hadoop Streaming
      • The example determines how many countries have a GDP greater than Apple's 2012 revenue:
          Sys.setenv(HADOOP_HOME="/home/istvan/hadoop")
          Sys.setenv(HADOOP_CMD="/home/istvan/hadoop/bin/hadoop")
          library(rmr2)
          library(rhdfs)
          setwd("/home/istvan/rhadoop/blogs/")
          gdp <- read.csv("GDP_converted.csv")
          head(gdp)
          hdfs.init()
          gdp.values <- to.dfs(gdp)
          # AAPL revenue in 2012 in millions USD
          aaplRevenue = 156508
          gdp.map.fn <- function(k,v) {
            key <- ifelse(v[4] < aaplRevenue, "less", "greater")
            keyval(key, 1)
          }
          count.reduce.fn <- function(k,v) {
            keyval(k, length(v))
          }
          count <- mapreduce(input=gdp.values, map = gdp.map.fn, reduce = count.reduce.fn)
          from.dfs(count)
      Source: http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/
  • 21. Search index example
      • Crawl web
          o Crawl and save websites to local directory
      • Ingest files to HDFS
      • Map
          o Split the words & associate words with file names
      • Reduce
          o Build an index with words and files & count of occurrences
      • Search
          o Pass the word to the index to get the files it shows up in. Display the file listing in descending order of number of occurrences of the word in a file
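The map, reduce, and search steps above amount to building an inverted index. A minimal single-process sketch in Python follows; it is not a Hadoop job, and the names `map_document`, `build_index`, and `search` are hypothetical. Tokenization is a plain whitespace split, so punctuation handling is left out.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterator, List, Tuple


def map_document(file_name: str, text: str) -> Iterator[Tuple[str, str]]:
    """Map: split the text into words and associate each word with its file."""
    for word in text.lower().split():
        yield word, file_name


def build_index(documents: Dict[str, str]) -> Dict[str, Counter]:
    """Reduce: for each word, count its occurrences per file."""
    index: Dict[str, Counter] = defaultdict(Counter)
    for file_name, text in documents.items():
        for word, source in map_document(file_name, text):
            index[word][source] += 1
    return index


def search(index: Dict[str, Counter], word: str) -> List[Tuple[str, int]]:
    """Return (file, count) pairs in descending order of occurrences."""
    return index.get(word.lower(), Counter()).most_common()


if __name__ == "__main__":
    docs = {
        "a.txt": "hadoop stores data hadoop processes data",
        "b.txt": "hadoop is open source",
    }
    for file_name, count in search(build_index(docs), "hadoop"):
        print(f"{file_name}\t{count}")
```

In a distributed version the per-word counters would be produced by reducers, one key at a time, and the finished index would be stored where the search front end can read it.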
  • 22. Recommender example
      • Use web server logs with user ratings info for items
      • Create Hive tables to build structure on top of this log data
      • Generate Mahout-specific CSV input file (user, item, rating)
      • Run Mahout to build item recommendations for users
          o mahout recommenditembased --input /user/hive/warehouse/mahout_input --output recommendations -s SIMILARITY_PEARSON_CORRELATION -n 20
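To give a feel for the similarity measure named in the command above, here is a small Python sketch that parses (user, item, rating) CSV lines and computes the Pearson correlation between two items over the users who rated both. It is a textbook formulation written for illustration, not Mahout's implementation, which may treat missing ratings and edge cases differently; `load_ratings` and `pearson` are hypothetical names.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, Iterable


def load_ratings(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Parse 'user,item,rating' lines into {item: {user: rating}}."""
    by_item: Dict[str, Dict[str, float]] = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        user, item, rating = line.strip().split(",")
        by_item[item][user] = float(rating)
    return by_item


def pearson(item_a: Dict[str, float], item_b: Dict[str, float]) -> float:
    """Pearson correlation of two items over the users who rated both."""
    common = sorted(set(item_a) & set(item_b))
    if len(common) < 2:
        return 0.0  # assumption: too little overlap counts as no similarity
    xs = [item_a[u] for u in common]
    ys = [item_b[u] for u in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant ratings carry no correlation signal
    return cov / sqrt(var_x * var_y)
```

An item-based recommender would compute such similarities for item pairs and then suggest, for each user, items similar to the ones that user rated highly.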
  • 23. Recap
      • Why Hadoop?: Data Growth, What is Big Data?, Hadoop usage
      • What is Hadoop?: Components, NoSQL, Cluster, Vendors, Tool Comparison
      • How to Hadoop?: Typical Implementation, Data Analysis with Pig & Hive, Opportunities
      • Demo: MapReduce deep dive, Wordcount, Search index, Recommendation Engine
  • 24. Q & A Contact Siva Pandeti: Email: siva@pandeti.com LinkedIn: www.linkedin.com/in/SivaPandeti Twitter: @SivaPandeti http://pandeti.com/blog