Hadoop in a Nutshell
By T. Anthony Date: 13th June, 2018
Contents
• What is Hadoop
• Why Hadoop?
• When to Use Hadoop & When Not To
• Hadoop Reference Architecture
• Hadoop Infrastructure Requirements
• Comparison of Vendors Providing Hadoop as a Service & IaaS
• Comparison of Distributions Supporting Hadoop on Premise
• Step By Step Approach for Hadoop Deployment
• Hadoop Ecosystem
• Business Value
• Predictive Analytics
• Predictive Analytics Vendor Assessment
What is Hadoop
Hadoop is a distributed framework that makes it easier to process large data sets stored across clusters of computers. Hadoop is made up of four core modules, supported by a large ecosystem of related technologies and products. The modules are:
Hadoop Distributed File System (HDFS) – A distributed file system that provides high-throughput access to application data. Hadoop can also work with other file systems, including FTP, Amazon S3 and Windows Azure Storage Blobs (WASB), among others.
Hadoop YARN – Provides the framework to schedule jobs and manage resources across the cluster that holds the data.
Hadoop MapReduce – A YARN-based parallel processing system for large data sets.
Hadoop Common – A set of utilities that supports the three other core modules.
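To make the MapReduce module more concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases, using word counting as the task. It is only an illustration in plain Python and does not use the Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical. On a real cluster the framework runs the map and reduce steps in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread across many machines, which is the property that lets this model scale to large data sets.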
Why Hadoop
• Hadoop analytics help streamline manufacturing processes.
• Hadoop can significantly reduce employees' workload.
• The Hadoop file system makes analytical processing 10 times faster on 75% as much computing power, even as datasets grow 10 times larger.
• Big data analytics becomes simpler as user-friendly tools become available.
• The Hadoop framework has proved to be effective in cloud manufacturing systems.
• Hadoop is a strong fit for near-real-time predictive analytics at various stages of manufacturing, for example to reduce manufacturing defects and to improve process yield and asset performance.
When not to use Hadoop
# 1. Real Time Analytics
If you need real-time analytics, where results are expected quickly, Hadoop should not be used directly. Hadoop works on batch processing, so its response time is high. With Apache Spark, the processing can be done in near real time.
When not to use Hadoop
# 2. Not a Replacement for Existing Infrastructure
Hadoop is not a replacement for your existing data processing infrastructure. However, you can use Hadoop alongside it.
All the historical big data can be stored in HDFS, where it can be processed and transformed into structured, manageable data. After processing the data in Hadoop, you need to send the output to relational database technologies for BI, decision support, reporting, etc.
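The hand-off from Hadoop to a relational database can be sketched as follows. This is a minimal illustration under stated assumptions: the job output is assumed to be tab-separated "key, count" lines, the table name defects_by_plant and the sample values are hypothetical, and an in-memory SQLite database stands in for the reporting database. In practice a bulk-transfer tool such as Apache Sqoop would usually do this step.

```python
import sqlite3


def parse_job_output(lines):
    """Parse tab-separated 'key<TAB>count' lines into (key, int) rows."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the output file
        key, count = line.split("\t")
        rows.append((key, int(count)))
    return rows


def load_into_database(conn, rows):
    """Create the (hypothetical) summary table and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS defects_by_plant "
        "(plant TEXT PRIMARY KEY, defects INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO defects_by_plant VALUES (?, ?)", rows
    )
    conn.commit()


if __name__ == "__main__":
    # Made-up sample standing in for the output file of a Hadoop job.
    job_output = ["plant_a\t12", "plant_b\t7", ""]
    conn = sqlite3.connect(":memory:")
    load_into_database(conn, parse_job_output(job_output))
    query = "SELECT plant, defects FROM defects_by_plant ORDER BY defects DESC"
    for row in conn.execute(query):
        print(row)
```

Once the summarized rows are in a relational table, existing BI and reporting tools can query them with ordinary SQL.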
When not to use Hadoop
# 3. Multiple Smaller Datasets
The Hadoop framework is not recommended for small, structured datasets, because other tools on the market, such as MS Excel or an RDBMS, can do this work more easily and faster than Hadoop. For small-scale data analytics, Hadoop can be costlier than other tools.
When not to use Hadoop
# 4. Where Security Is the Primary Concern
Many enterprises, especially those in highly regulated industries dealing with sensitive data, aren't able to move as quickly as they would like towards implementing Big Data projects and Hadoop.
One approach is to encrypt data as it moves into Hadoop: write a MapReduce program that encrypts the data with an encryption algorithm of your choice and stores it in HDFS. Then use that data for further MapReduce processing to get relevant insights.
When to use Hadoop
# 1. Data Size and Data Diversity
When we are dealing with huge volumes of data coming from various sources and in a variety of formats, Hadoop is the right technology.
When to use Hadoop
# 2. Future Planning
To implement Hadoop on our data, we should first understand the complexity of the data and the rate at which it is growing.
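As a rough illustration of planning around the growth rate, the sketch below projects how much raw cluster storage a data set would need after a number of months. It assumes compound monthly growth and a replication factor of 3 (the HDFS default); the function name and the sample figures are hypothetical, and the estimate leaves out headroom for intermediate and temporary data.

```python
def projected_raw_storage_tb(current_tb, monthly_growth_rate, months,
                             replication_factor=3):
    """Estimate the raw HDFS capacity needed after `months` of growth.

    current_tb          -- logical data size today, in terabytes
    monthly_growth_rate -- e.g. 0.05 for 5% growth per month (assumed compound)
    replication_factor  -- copies HDFS keeps of each block (default 3)
    """
    logical_tb = current_tb * (1 + monthly_growth_rate) ** months
    return logical_tb * replication_factor


if __name__ == "__main__":
    # Hypothetical figures: 10 TB today, growing 5% per month, over 2 years.
    needed = projected_raw_storage_tb(10, 0.05, 24)
    print(f"Raw storage needed after 24 months: {needed:.1f} TB")
```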
When to use Hadoop
# 3. Multiple Frameworks for Big Data
There are various tools for various purposes. Hadoop can be integrated with multiple analytic tools to get the best out of it: Mahout for machine learning, R and Python for analytics and visualization, Spark for real-time processing, MongoDB and HBase as NoSQL databases, Pentaho for BI, etc.
Hadoop Reference Architecture
Hadoop Infrastructure Requirements
Comparison of Vendors Providing Hadoop as a Service (SaaS)
Vendors and Features
Amazon EMR
• Easy to use: a cluster can be configured within minutes and the Hadoop application is ready to run.
• Save money with Spot Instances: Spot Instances are a way to purchase virtual servers for your cluster at a discount. Excess capacity in Amazon Web Services is offered at a fluctuating price, based on supply and demand.
• Amazon EMR supports the MapR distribution.
Qubole
• Reduced deployment complexity.
• Reduced management complexity.
• Used for short-running analysis jobs.
• Used to realize hybrid cloud setups.
Microsoft Windows Azure HDInsight
• Deployment agility: HDInsight offers agility to meet the changing needs of your organization. With a rich library of PowerShell scripts you can deploy and provision a Hadoop cluster in minutes instead of hours or days.
• Simplicity and ease of management.
• Offers enterprise-class security and scalability.
Comparison of Vendors Providing IaaS for Hadoop
Vendors and Features
Amazon Web Services EC2
• Obtain and configure capacity with minimal friction.
• Complete control of computing resources.
• Economical computing: you pay only for the capacity you actually use.
• Offers infrastructure services such as workflows, message passing, archival storage, in-memory caching, search, and both relational and NoSQL databases.
RackSpace
• Easy-to-use control panel.
• Easy to create basic monitoring checks such as ping or HTTP checks.
• Managed service offering: customers are able to deploy a fully featured and supported Hadoop infrastructure through a single vendor contract.
• Rapid deployment with low operational burden.
• Offers simple pricing options.
MS Azure
• Ready access to virtual networks, service buses, message queues and non-relational storage platforms.
• Compute and storage services are easy to use compared to other IaaS providers.
• Ability to specify an availability zone.
• Provides programmatic interfaces to some of the services.
IBM SmartCloud
• Offers features important to administrators, especially management and add-on services.
• Provisioning storage is straightforward.
• Storage costs are based on a combination of allocated storage size and I/O operations.
• IaaS for use in data reporting and analytics.
Comparison of Distributions Supporting Hadoop on Premise
Distributions and Features
Cloudera
• Enterprise-grade security
• Increased cost savings
• Technical support
• Flexible deployment
• Faster time to insight
HortonWorks
• With YARN, enables multiple workloads, applications and processing engines across single clusters with greater efficiency than ever before.
• Security and high availability.
• Tested at scale on hundreds of production nodes.
MapR
• Proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses.
• Ease of use, instant recovery and continuous low latency.
• Full data protection with snapshots and business continuity with mirroring.
Step By Step Approach for Hadoop Deployment Scenario-1
Business Requirement 1:
• A managed Hadoop cluster that is ready to use with little or no configuration.
• The most commonly used tools and services, such as Hive, Pig and Sqoop, pre-installed and configured.
Our Solution:
• Then Hadoop as a Service is recommended.
Vendors Providing - Hadoop as a Service:
• Amazon Web Services Elastic MapReduce (EMR)
• Qubole Data Services
• Microsoft Windows Azure HDInsight
Step By Step Approach for Hadoop Deployment Scenario-2
Business Requirement 2:
• Choice of your own Hadoop distribution software depending on your business requirements.
• Fast turnaround time for infrastructure.
• High elasticity and scale-out.
Our Solution:
• Then Hadoop on IaaS is recommended.
Vendors Providing - IaaS for Hadoop:
• Amazon Web Services EC2
• RackSpace
• MS Azure
• IBM SmartCloud
Step By Step Approach for Hadoop Deployment Scenario-3
Business Requirement 3:
• Choice of your own Hadoop distribution software depending on your business requirements.
• High performance and low latency.
• Security and privacy challenges in moving data out of the enterprise to the cloud.
Our Solution:
• Then Hadoop On-Premise is recommended.
List of Hadoop distributions supporting Hadoop On-Premise:
• Cloudera
• HortonWorks
• MapR
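The three deployment scenarios can be summarized as a small decision helper. This is a sketch only: the field names are hypothetical, and the order in which the rules are checked (on-premise constraints first, then IaaS, then Hadoop as a Service) is an assumption rather than something the scenarios state.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical flags summarizing the business requirements above."""
    wants_managed_cluster: bool = False      # Scenario 1: ready-to-use cluster
    wants_own_distribution: bool = False     # Scenarios 2 and 3
    needs_low_latency: bool = False          # Scenario 3
    data_must_stay_on_premise: bool = False  # Scenario 3: security and privacy


def recommend_deployment(req):
    """Map a set of requirements to one of the three deployment models."""
    # Assumed priority: constraints that rule out the cloud are checked first.
    if req.data_must_stay_on_premise or req.needs_low_latency:
        return "Hadoop on premise"
    if req.wants_own_distribution:
        return "Hadoop on IaaS"
    if req.wants_managed_cluster:
        return "Hadoop as a Service"
    return "No recommendation"


if __name__ == "__main__":
    print(recommend_deployment(Requirements(wants_managed_cluster=True)))
    print(recommend_deployment(Requirements(wants_own_distribution=True)))
    print(recommend_deployment(Requirements(data_must_stay_on_premise=True)))
```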
Hadoop Ecosystem Components
Pig: Apache Pig allows you to write complex MapReduce
transformations using a simple scripting language.
Hive: Apache Hive is a data warehouse infrastructure built on
top of Apache Hadoop for providing data
summarization, ad-hoc query, and analysis of large
datasets.
HBase: Apache HBase is a non-relational (NoSQL) database that
runs on top of the Hadoop Distributed File System
(HDFS).
Flume: Apache Flume is a distributed, reliable, and available
service for efficiently collecting, aggregating, and
moving large amounts of streaming data into the
Hadoop Distributed File System (HDFS).
Sqoop: Apache Sqoop is a tool designed for efficiently
transferring bulk data between Apache Hadoop and
structured datastores such as relational databases.
Mahout: Apache Mahout is a library of scalable machine-learning
algorithms, implemented on top of Apache Hadoop and
using the MapReduce paradigm.
Business Value
• Humanize 'Big Data' for business decision makers and analysts, thereby extracting real business value.
• Organizations can perform analytics at much lower cost.
• Big data, along with information from traditional data sources, enables skilled business analysts to discover new insights, patterns and trends in the business.
• Speed time to value.
• Blend data to add context.
• Analyze without complexity.
Process of Predictive Modeling
Most processes for creating predictive models incorporate the following steps:
Project Definition / Business Understanding
• Define business objectives and desired outcomes
Exploration / Data Understanding
• Analyze source data to determine the appropriate data, model-building approach and scope
Data Preparation
• Select, extract and transform data to create models
Model Building
• Create, test, validate and evaluate models
Deployment
• Apply model results to business decisions or processes
Model Management
• Manage models to improve performance and accuracy, control access and promote reuse
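The data preparation, model building and deployment steps can be illustrated with a very small example. The sketch below fits a straight line by ordinary least squares, one of the simplest predictive models, using only the Python standard library. The function names and the synthetic data are hypothetical; a real project would normally use a dedicated library and real source data.

```python
import random


def prepare(records):
    """Data preparation: drop incomplete records, return (x, y) float pairs."""
    return [(float(x), float(y)) for x, y in records
            if x is not None and y is not None]


def split(pairs, test_fraction=0.25, seed=0):
    """Hold back part of the data so the model is tested on unseen records."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Model building: least-squares fit of y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Deployment: apply the fitted model to a new input."""
    slope, intercept = model
    return slope * x + intercept


def mean_absolute_error(model, pairs):
    """Validation: average absolute prediction error on held-out data."""
    return sum(abs(y - predict(model, x)) for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    # Synthetic records following y = 2x + 1, plus one incomplete record.
    records = [(x, 2 * x + 1) for x in range(20)] + [(None, 3.0)]
    train, test = split(prepare(records))
    model = fit_line(train)
    print("slope, intercept:", model)
    print("test error:", mean_absolute_error(model, test))
```

Model management then amounts to repeating the validation step as new data arrives and refitting when the error grows.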
Predictive Analytics Vendor Assessment
Thank You

More Related Content

What's hot

Deep Learning Computer Build
Deep Learning Computer BuildDeep Learning Computer Build
Deep Learning Computer BuildPetteriTeikariPhD
 
Web service Introduction
Web service IntroductionWeb service Introduction
Web service IntroductionMadhukar Kumar
 
Instalación pfsense y portal cautivo
Instalación pfsense y portal cautivoInstalación pfsense y portal cautivo
Instalación pfsense y portal cautivo566689
 
Adhoc and Sensor Networks - Chapter 10
Adhoc and Sensor Networks - Chapter 10Adhoc and Sensor Networks - Chapter 10
Adhoc and Sensor Networks - Chapter 10Ali Habeeb
 
BGP Techniques for Network Operators
BGP Techniques for Network OperatorsBGP Techniques for Network Operators
BGP Techniques for Network OperatorsAPNIC
 
CCNP Switching Chapter 4
CCNP Switching Chapter 4CCNP Switching Chapter 4
CCNP Switching Chapter 4Chaing Ravuth
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Edureka!
 
Juniper Srx quickstart-12.1r3
Juniper Srx quickstart-12.1r3Juniper Srx quickstart-12.1r3
Juniper Srx quickstart-12.1r3Mohamed Al-Natour
 
DSpace 4.2 Transmission: Import/Export
DSpace 4.2 Transmission: Import/ExportDSpace 4.2 Transmission: Import/Export
DSpace 4.2 Transmission: Import/ExportDuraSpace
 
BGP Graceful Shutdown - IOS XR
BGP Graceful Shutdown - IOS XR BGP Graceful Shutdown - IOS XR
BGP Graceful Shutdown - IOS XR Bertrand Duvivier
 
Zenith Networks: Jump Start JUNOS
Zenith Networks: Jump Start JUNOSZenith Networks: Jump Start JUNOS
Zenith Networks: Jump Start JUNOSZenith Networks
 
Hadoop and friends : introduction
Hadoop and friends : introductionHadoop and friends : introduction
Hadoop and friends : introductionfredcons
 
OrientDB document or graph? Select the right model (old presentation)
OrientDB document or graph? Select the right model (old presentation)OrientDB document or graph? Select the right model (old presentation)
OrientDB document or graph? Select the right model (old presentation)Luca Garulli
 
Hadoop et son écosystème - v2
Hadoop et son écosystème - v2Hadoop et son écosystème - v2
Hadoop et son écosystème - v2Khanh Maudoux
 

What's hot (20)

Deep Learning Computer Build
Deep Learning Computer BuildDeep Learning Computer Build
Deep Learning Computer Build
 
Web service Introduction
Web service IntroductionWeb service Introduction
Web service Introduction
 
CISCO HSRP VRRP GLBP
CISCO HSRP VRRP GLBPCISCO HSRP VRRP GLBP
CISCO HSRP VRRP GLBP
 
Instalación pfsense y portal cautivo
Instalación pfsense y portal cautivoInstalación pfsense y portal cautivo
Instalación pfsense y portal cautivo
 
Adhoc and Sensor Networks - Chapter 10
Adhoc and Sensor Networks - Chapter 10Adhoc and Sensor Networks - Chapter 10
Adhoc and Sensor Networks - Chapter 10
 
BGP Techniques for Network Operators
BGP Techniques for Network OperatorsBGP Techniques for Network Operators
BGP Techniques for Network Operators
 
Http
HttpHttp
Http
 
MongoDB Sharding Fundamentals
MongoDB Sharding Fundamentals MongoDB Sharding Fundamentals
MongoDB Sharding Fundamentals
 
CCNP Switching Chapter 4
CCNP Switching Chapter 4CCNP Switching Chapter 4
CCNP Switching Chapter 4
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
 
Apache samza
Apache samzaApache samza
Apache samza
 
Juniper Srx quickstart-12.1r3
Juniper Srx quickstart-12.1r3Juniper Srx quickstart-12.1r3
Juniper Srx quickstart-12.1r3
 
Web Services - WSDL
Web Services - WSDLWeb Services - WSDL
Web Services - WSDL
 
Advanced Topics in IP Multicast Deployment
Advanced Topics in IP Multicast DeploymentAdvanced Topics in IP Multicast Deployment
Advanced Topics in IP Multicast Deployment
 
DSpace 4.2 Transmission: Import/Export
DSpace 4.2 Transmission: Import/ExportDSpace 4.2 Transmission: Import/Export
DSpace 4.2 Transmission: Import/Export
 
BGP Graceful Shutdown - IOS XR
BGP Graceful Shutdown - IOS XR BGP Graceful Shutdown - IOS XR
BGP Graceful Shutdown - IOS XR
 
Zenith Networks: Jump Start JUNOS
Zenith Networks: Jump Start JUNOSZenith Networks: Jump Start JUNOS
Zenith Networks: Jump Start JUNOS
 
Hadoop and friends : introduction
Hadoop and friends : introductionHadoop and friends : introduction
Hadoop and friends : introduction
 
OrientDB document or graph? Select the right model (old presentation)
OrientDB document or graph? Select the right model (old presentation)OrientDB document or graph? Select the right model (old presentation)
OrientDB document or graph? Select the right model (old presentation)
 
Hadoop et son écosystème - v2
Hadoop et son écosystème - v2Hadoop et son écosystème - v2
Hadoop et son écosystème - v2
 

Similar to Hadoop in a Nutshell

Hadoop Platforms - Introduction, Importance, Providers
Hadoop Platforms - Introduction, Importance, ProvidersHadoop Platforms - Introduction, Importance, Providers
Hadoop Platforms - Introduction, Importance, ProvidersMrigendra Sharma
 
Hadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceHadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceNeev Technologies
 
Big Data Hadoop Technology
Big Data Hadoop TechnologyBig Data Hadoop Technology
Big Data Hadoop TechnologyRahul Sharma
 
Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoopOmar Jaber
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxDr.Florence Dayana
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeSysfore Technologies
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsCognizant
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkLaxmi8
 
Cisco Big Data Warehouse Expansion Featuring MapR Distribution
Cisco Big Data Warehouse Expansion Featuring MapR DistributionCisco Big Data Warehouse Expansion Featuring MapR Distribution
Cisco Big Data Warehouse Expansion Featuring MapR DistributionAppfluent Technology
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 

Similar to Hadoop in a Nutshell (20)

Hadoop Platforms - Introduction, Importance, Providers
Hadoop Platforms - Introduction, Importance, ProvidersHadoop Platforms - Introduction, Importance, Providers
Hadoop Platforms - Introduction, Importance, Providers
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
finap ppt conference.pptx
finap ppt conference.pptxfinap ppt conference.pptx
finap ppt conference.pptx
 
Hadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceHadoop Ecosystem at a Glance
Hadoop Ecosystem at a Glance
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Big Data Hadoop Technology
Big Data Hadoop TechnologyBig Data Hadoop Technology
Big Data Hadoop Technology
 
Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoop
 
paper
paperpaper
paper
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | Sysfore
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
 
Cisco Big Data Warehouse Expansion Featuring MapR Distribution
Cisco Big Data Warehouse Expansion Featuring MapR DistributionCisco Big Data Warehouse Expansion Featuring MapR Distribution
Cisco Big Data Warehouse Expansion Featuring MapR Distribution
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 

Recently uploaded

Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 

Recently uploaded (20)

Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 

Hadoop in a Nutshell

  • 1. Hadoop in a Nutshell By T. Anthony Date: 13th June, 2018
  • 2. Contents • What is Hadoop • Why Hadoop ? • When to Use Hadoop & When not to • Hadoop Reference Architecture • Hadoop Infrastructure Requirements • Comparison of Vendors Providing Hadoop as a Service & IAAS • Comparison of Distributions Supporting Hadoop on Premise • Step By Step Approach for Hadoop Deployment • Hadoop Ecosystem • Business Value • Predictive Analytics • Predictive Analytics Vendor Assessment
  • 3. What is Hadoop Hadoop is a distributed framework that makes it easier to process large data sets that reside in clusters of computers. Hadoop is made up of four core modules that are supported by a large ecosystem of supporting technologies and products. The modules are: • Hadoop Distributed File System (HDFS) – Provides access to application data. Hadoop can also work with other file systems, including FTP, Amazon S3 and Windows Azure Storage Blobs (WASB), among others. • Hadoop YARN – Provides the framework to schedule jobs and manage resources across the cluster that holds the data. • Hadoop MapReduce – A YARN-based parallel processing system for large data sets. • Hadoop Common – A set of utilities that supports the three other core modules.
  • 4. Why Hadoop • Hadoop analytics help streamline manufacturing processes. • Hadoop can significantly reduce the work of employees. • The Hadoop file system makes analytical processing 10 times faster on 75% as much computing power, even as datasets grow 10 times larger. • Big data analytics becomes simpler as user-friendly tools become available. • The Hadoop framework has proved to be effective in cloud manufacturing systems. • Hadoop is the best solution for near-real-time predictive analytics at various stages of manufacturing, for example to reduce manufacturing defects and to improve process yield and asset performance.
  • 5. When not to use Hadoop # 1. Real Time Analytics If you want to do real-time analytics, where you expect results quickly, Hadoop should not be used directly. Hadoop works on batch processing, so its response time is high. With Spark, the processing can be done in real time.
  • 6. When not to use Hadoop # 2. Not a Replacement for Existing Infrastructure Hadoop is not a replacement for your existing data processing infrastructure. However, you can use Hadoop along with it. All the historical big data can be stored in HDFS, where it can be processed and transformed into structured, manageable data. After processing the data in Hadoop, you need to send the output to relational database technologies for BI, decision support, reporting, etc.
  • 7. When not to use Hadoop # 3. Multiple Smaller Datasets Hadoop is not recommended for small structured datasets, because other tools on the market, such as MS Excel or an RDBMS, can do this work easily and faster than Hadoop. For small-data analytics, Hadoop can be costlier than other tools.
  • 8. When not to use Hadoop # 4. Where Security is the Primary Concern Many enterprises, especially within highly regulated industries dealing with sensitive data, aren’t able to move as quickly as they would like towards implementing Big Data projects and Hadoop. Encrypt data while moving it to Hadoop: write a MapReduce program that uses any encryption algorithm to encrypt the data and store it in HDFS. Finally, use the data for further MapReduce processing to get relevant insights.
  • 9. When to use Hadoop # 1. Data Size and Data Diversity When we are dealing with huge volumes of data coming from various sources and in a variety of formats, Hadoop is the right technology.
  • 10. When to use Hadoop # 2. Future Planning To implement Hadoop on our data, we should first understand the level of complexity of the data and the rate at which it is growing.
  • 11. When to use Hadoop # 3. Multiple Frameworks for Big Data There are various tools for various purposes. Hadoop can be integrated with multiple analytic tools to get the best out of it, such as Mahout for machine learning, R and Python for analytics and visualization, Spark for real-time processing, MongoDB and HBase as NoSQL databases, and Pentaho for BI.
  • 14. Comparison of Vendors Providing Hadoop as a Service (SaaS)
      Vendor: Features
      Amazon EMR: Easy to use; a cluster can be configured within minutes and the Hadoop application is ready to run. Save money with Spot Instances, a way to purchase virtual servers for your cluster at a discount: excess capacity in Amazon Web Services is offered at a fluctuating price, based on supply and demand. Amazon EMR supports the MapR distribution.
      Qubole: Reduced deployment complexity. Reduced management complexity. Used in short-running analysis jobs. Used to realize hybrid cloud setups.
      Microsoft Windows Azure HDInsight: Deployment agility: HDInsight offers agility to meet the changing needs of your organization. With a rich library of PowerShell scripts you can deploy and provision a Hadoop cluster in minutes instead of hours or days. Simplicity. Ease of management. Offers enterprise-class security and scalability.
  • 15. Comparison of Vendors Providing IaaS for Hadoop
      Vendor: Features
      Amazon Web Services EC2: Obtain and configure capacity with minimal friction. Complete control of computing resources. Economical computing: pay only for the capacity you actually use. Offers infrastructure services such as workflows, message passing, archival storage, in-memory caching, search, and both relational and NoSQL databases.
      RackSpace: Easy-to-use control panel. Easy to create basic monitoring checks such as ping or HTTP checks. Managed service offering: customers can deploy a fully featured and supported Hadoop infrastructure through a single vendor contract. Rapid deployment with low operational burden. Offers simple pricing options.
      MS Azure: Ready access to virtual networks, service buses, message queues and non-relational storage platforms. Compute and storage services are easy to use compared with other IaaS providers. Ability to specify an availability zone. Provides programmatic interfaces to some of the services.
      IBM SmartCloud: Offers features important to administrators, especially management and add-on services. Provisioning storage is straightforward. Storage costs are based on a combination of allocated storage size and I/O operations. IaaS for use in data reporting and analytics.
  • 16. Comparison of Distributions Supporting Hadoop on Premise
      Distribution: Features
      Cloudera: Enterprise-grade security. Increased cost savings. Technical support. Flexible deployment. Faster time to insight.
      HortonWorks: With YARN, enables multiple workloads, applications and processing engines across single clusters with greater efficiency than ever before. Security and high availability. Tested at scale on hundreds of production nodes.
      MapR: Proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. Ease of use, instant recovery and continuous low latency. Full data protection with snapshots and business continuity with mirroring.
  • 17. Step By Step Approach for Hadoop Deployment Scenario-1 Business Requirement 1: • Managed Hadoop cluster ready to use with little or no configuration. • Most commonly used tools, such as Hive, Pig and Sqoop, and other services pre-installed and configured. Our Solution: • Hadoop as a Service is recommended. Vendors Providing Hadoop as a Service: • Amazon Web Services Elastic MapReduce (EMR) • Qubole data services • Microsoft Windows Azure HDInsight
  • 18. Step By Step Approach for Hadoop Deployment Scenario-2 Business Requirement 2: • Choice of your own Hadoop distribution software depending on your business requirements. • Fast turnaround time for infrastructure. • High elasticity and scale-out. Our Solution: • Hadoop on IaaS is recommended. Vendors Providing IaaS for Hadoop: • Amazon Web Services EC2 • RackSpace • MS Azure • IBM SmartCloud
  • 19. Step By Step Approach for Hadoop Deployment Scenario-3 Business Requirement 3: • Choice of your own Hadoop distribution software depending on your business requirements. • High performance and low latency. • Security and privacy challenges in moving data out of the enterprise to the cloud. Our Solution: • Hadoop On-Premise is recommended. Hadoop Distributions Supporting Hadoop On-Premise: • Cloudera • HortonWorks • MapR
  • 20. Hadoop Ecosystem Components Pig: Apache Pig allows you to write complex MapReduce transformations using a simple scripting language. Hive: Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop for providing data summarization, ad-hoc query, and analysis of large datasets. HBase: Apache HBase is a non-relational (NoSQL) database that runs on top of the Hadoop Distributed File System (HDFS). Flume: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Mahout: Apache Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.
  • 21. Business Value • Humanize ‘Big Data’ for business decision makers and analysts, thereby extracting real business value. • Organizations can perform analytics at much lower cost. • Big data, along with information from traditional data sources, enables skilled business analysts to discover new insights, patterns and trends in the business. • Speed time to value. • Blend data to add context. • Analyze without complexity.
  • 22. Process of Predictive Modeling Most processes for creating predictive models incorporate the following steps: • Project Definition / Business Understanding: define business objectives and desired outcomes. • Exploration / Data Understanding: analyze source data to determine appropriate data, model building approach and scope. • Data Preparation: select, extract and transform data to create models. • Model Building: create, test, validate and evaluate models. • Deployment: apply model results to business decisions or processes. • Model Management: manage models to improve performance and accuracy, control access, and promote reuse.
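
The data preparation, model building and deployment steps from slide 22 can be illustrated with a very small sketch. This is plain Python, not Hadoop or Mahout code; the function names, the temperature/defect-rate data and the 80/20 split are hypothetical choices made only for illustration, and a one-variable least-squares fit stands in for whatever model a real project would use.

```python
from statistics import mean

def prepare_data(rows, train_fraction=0.8):
    """Data Preparation: drop incomplete records, then split into train and test sets."""
    clean = [(x, y) for x, y in rows if x is not None and y is not None]
    split = int(len(clean) * train_fraction)
    return clean[:split], clean[split:]

def build_model(train):
    """Model Building: fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in train)
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def validate(model, test):
    """Model Building (validation): mean squared error on held-out records."""
    slope, intercept = model
    return mean((slope * x + intercept - y) ** 2 for x, y in test)

def predict(model, x):
    """Deployment: apply the fitted model to a new observation."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical source data: (machine temperature, defect rate) pairs that follow
# y = 2x + 1 exactly, plus one incomplete record that preparation must drop.
rows = [(float(x), 2.0 * x + 1.0) for x in range(10)] + [(None, 3.0)]

train, test = prepare_data(rows)
model = build_model(train)
error = validate(model, test)
print("model:", model, "validation error:", error)
print("prediction for x=20:", predict(model, 20.0))
```

Model management (the last step on the slide) would then amount to re-running `validate` on fresh data and refitting when the error grows.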

Editor's Notes

  1. Amazon Web Services Elastic MapReduce (cloud option). Supported Hadoop distributions: Amazon, MapR. EMR supports powerful and proven Hadoop tools such as Hive, Pig, HBase, and Impala. Pros: easy to set up; can be launched and terminated on demand. Cons: expensive; limited choice of OS, Hadoop distributions and applications. Amazon EC2 instances (cloud option). Hadoop distributions Cloudera, HortonWorks … can be installed. Pros: relatively cheaper; install any OS, Hadoop distributions and applications. Cons: need to manually install and manage the cluster. On-Premises Hadoop Cluster. Pros: cheapest; install any OS and Hadoop distribution (Cloudera, HortonWorks, MapR); utilizes physical hardware. Cons: must manually install and manage the cluster and physical hardware.
  2. A Hadoop as a Service offering provides a managed Hadoop cluster that is ready to use, without the need to configure or install any Hadoop-relevant services such as JobTracker, TaskTracker, NameNode and DataNode on any cluster nodes, and may provide secondary services like ZooKeeper or HBase. Ideally, such a service also provides some of the most commonly used tools, like Hive, Pig, and Sqoop, pre-installed and configured. More advanced services expand this even further and include graphical interfaces and optimizations, which enable a wide user audience to utilize Hadoop transparently. Pros: easy and low cost. Cons: vendor lock-in; data transfer limitations; high latency.
  3. We can leverage public IaaS and set up Hadoop on it. The Hadoop distributions Cloudera, HortonWorks and MapR can be deployed on IaaS. Pros: lowers the cost of innovation; procures large-scale resources quickly; handles variable resource requirements. Cons: data transfer limitations; loading data from on-premises to the cloud over the network will be slow.
  4. All the major distributions (Cloudera, HortonWorks, MapR) support Hadoop On-Premises. Pros: Hadoop runs best on physical servers; high performance and low latency. Cons: procurement lead time; increased cost with lower utilization of dedicated hardware.
  5. Pig: Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort. Pig translates the Pig Latin script into MapReduce so that it can be executed within Hadoop. Pig was designed for performing a long series of data operations, making it ideal for three categories of Big Data operations: standard extract-transform-load (ETL) data pipelines, research on raw data, and iterative processing of data. Hive: It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL). Hive eases integration between Hadoop and tools for business intelligence and visualization. HBase: It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. Flume: It has a simple and flexible architecture based on streaming data flows, and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery. Sqoop: It imports data from external structured datastores into HDFS or related systems like Hive and HBase. Sqoop can also be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB. Mahout: Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed, and it is commonly used to improve future performance based on previous outcomes.
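
Note 5 says that Pig translates a Pig Latin script into MapReduce. As a rough illustration of what that means for a simple aggregate followed by a sort, the sketch below expresses a group-and-count as explicit map, shuffle and reduce phases. It is plain single-process Python, not Pig or Hadoop code; the function names and the sample records are hypothetical, and the jobs a real Pig compiler generates are considerably more involved.

```python
from collections import defaultdict

def map_phase(records, key_field):
    """Map: emit a (key, 1) pair for every input record."""
    for record in records:
        yield record[key_field], 1

def shuffle_phase(pairs):
    """Shuffle: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

def group_count(records, key_field):
    """Aggregate then sort: count records per key, highest count first."""
    counts = reduce_phase(shuffle_phase(map_phase(records, key_field)))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical input: one record per defect event reported by a machine.
events = [
    {"machine": "press-1", "status": "defect"},
    {"machine": "lathe-2", "status": "defect"},
    {"machine": "press-1", "status": "defect"},
    {"machine": "press-3", "status": "defect"},
    {"machine": "press-1", "status": "defect"},
    {"machine": "press-3", "status": "defect"},
]

results = group_count(events, "machine")
for machine, count in results:
    print(machine, count)
```

On a real cluster the map and reduce functions would run in parallel on different nodes and the shuffle would move data across the network; here all three phases run in one process so the data flow is easy to follow.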