Hadoop in a Nutshell
1. Hadoop in a Nutshell
By T. Anthony Date: 13th June, 2018
2. Contents
• What is Hadoop
• Why Hadoop?
• When to Use Hadoop & When Not To
• Hadoop Reference Architecture
• Hadoop Infrastructure Requirements
• Comparison of Vendors Providing Hadoop as a Service & IaaS
• Comparison of Distributions Supporting Hadoop On-Premise
• Step-by-Step Approach for Hadoop Deployment
• Hadoop Ecosystem
• Business Value
• Predictive Analytics
• Predictive Analytics Vendor Assessment
3. What is Hadoop
Hadoop is a distributed framework that makes it easier to process large data sets across clusters of computers. Hadoop is made up of four core modules, supported by a large ecosystem of complementary technologies and products. The modules are:
Hadoop Distributed File System (HDFS) – A distributed file system that provides high-throughput access to application data. Hadoop can also work with other file systems, including FTP, Amazon S3 and Windows Azure Storage Blobs (WASB), among others.
Hadoop YARN – Provides the framework to schedule jobs and manage resources across the cluster that holds the data.
Hadoop MapReduce – A YARN-based system for parallel processing of large data sets.
Hadoop Common – A set of utilities that supports the three other core modules.
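The division of labour between these modules can be seen in the classic word-count example. Below is a minimal sketch in plain Python: `mapper`, `reducer` and the local `run_local` driver are illustrative stand-ins for what MapReduce and YARN do across a cluster (including the shuffle/sort phase), not real Hadoop APIs.

```python
# A minimal word-count sketch in the MapReduce style: the mapper emits
# (word, 1) pairs, the framework groups values by key, and the reducer
# sums the counts. run_local() is a hypothetical stand-in for the
# shuffle/sort phase a real cluster performs between map and reduce.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    yield word, sum(counts)

def run_local(lines):
    groups = defaultdict(list)            # stand-in for shuffle/sort
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(kv for key in groups for kv in reducer(key, groups[key]))

print(run_local(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

On a real cluster the mapper and reducer run in parallel on many nodes, with HDFS holding the input and output.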
4. Why Hadoop
Hadoop analytics help streamline manufacturing processes.
Hadoop can significantly reduce manual effort.
The Hadoop file system makes analytical processing up to 10 times faster on 75% of the computing power, even as datasets grow 10 times larger.
Big data analytics becomes simpler as user-friendly tools become available.
The Hadoop framework has proved effective in cloud manufacturing systems.
Hadoop is well suited to near-real-time predictive analytics in manufacturing, at stages such as reducing manufacturing defects and improving process yield and asset performance.
5. When not to use Hadoop
# 1. Real Time Analytics
If you need real-time analytics, where results are expected quickly, Hadoop should not be used directly: Hadoop works on batch processing, so response times are high. Spark, by contrast, can process the same data in near real time.
6. When not to use Hadoop
# 2. Not a Replacement for Existing Infrastructure
Hadoop is not a replacement for your existing data processing infrastructure; rather, you can use Hadoop alongside it. All the historical big data can be stored in Hadoop HDFS, where it can be processed and transformed into structured, manageable data. After processing the data in Hadoop, you then send the output to relational database technologies for BI, decision support, reporting, etc.
7. When not to use Hadoop
# 3. Multiple Smaller Datasets
The Hadoop framework is not recommended for small structured datasets; other tools on the market, such as MS Excel and RDBMSs, can do this work more easily and faster than Hadoop. For small-scale data analytics, Hadoop can be costlier than these tools.
8. When not to use Hadoop
# 4. Where Security Is the Primary Concern
Many enterprises, especially within highly regulated industries dealing with sensitive data, aren't able to move as quickly as they would like towards implementing Big Data projects and Hadoop.
One mitigation is to encrypt data while moving it into Hadoop: write a MapReduce program that applies an encryption algorithm to the data and stores the result in HDFS, then use that data for further MapReduce processing to get relevant insights.
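The encrypt-before-store idea can be sketched as follows. This is only an illustration: the SHA-256-chained XOR keystream below is a placeholder for a real cipher such as AES, and `keystream`, `mapper` and the key handling are invented for the example. A production system should use a vetted cryptography library and proper key management.

```python
# Sketch of encrypt-on-ingest for a record-oriented mapper. The XOR
# keystream is a toy stand-in for a real algorithm such as AES.
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    # Hypothetical helper: derive a repeatable pseudo-random keystream
    # from the key by chaining SHA-256 digests.
    block, out = key, b""
    while len(out) < length:
        block = hashlib.sha256(block).digest()
        out += block
    return out[:length]

def encrypt(record: bytes, key: bytes) -> bytes:
    # XOR the record with the keystream; applying encrypt() twice with
    # the same key recovers the plaintext.
    ks = keystream(key, len(record))
    return bytes(a ^ b for a, b in zip(record, ks))

def mapper(record: bytes, key: bytes):
    # Emit only ciphertext, so HDFS never stores the raw record.
    yield encrypt(record, key)

key = b"demo-key"
cipher = next(mapper(b"sensitive row", key))
assert encrypt(cipher, key) == b"sensitive row"
```

Downstream MapReduce jobs would then decrypt records inside the mapper before processing, keeping plaintext off disk.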
9. When to use Hadoop
# 1. Data Size and Data Diversity
When we are dealing with huge volumes of data coming from various sources and in a variety of formats, Hadoop is the right technology.
10. When to use Hadoop
# 2. Future Planning
Before implementing Hadoop on our data, we should first understand the complexity of the data and the rate at which it is growing.
11. When to use Hadoop
# 3. Multiple Frameworks for Big Data
There are various tools for various purposes. Hadoop can be integrated with multiple analytic tools to get the best out of it, such as Mahout for machine learning, R and Python for analytics and visualization, Spark for real-time processing, MongoDB and HBase for NoSQL databases, and Pentaho for BI.
14. Comparison of Vendors Providing Hadoop as a Service (SaaS)
<<Internal>> 14
Amazon EMR:
• Easy to use: within minutes a cluster can be configured and a Hadoop application is ready to run.
• Save money with Spot Instances: Spot Instances are a way to purchase virtual servers for your cluster at a discount. Excess capacity in Amazon Web Services is offered at a fluctuating price, based on supply and demand.
• Amazon EMR supports the MapR distribution.

Qubole:
• Reduced deployment complexity.
• Reduced management complexity.
• Well suited to short-running analysis jobs.
• Can be used to realize hybrid cloud setups.

Microsoft Windows Azure HDInsight:
• Deployment agility: HDInsight offers agility to meet the changing needs of your organization. With a rich library of PowerShell scripts you can deploy and provision a Hadoop cluster in minutes instead of hours or days.
• Simplicity and ease of management.
• Offers enterprise-class security and scalability.
15. Comparison of Vendors Providing IaaS for Hadoop
Amazon Web Services EC2:
• Obtain and configure capacity with minimal friction.
• Complete control of computing resources.
• Economical computing: pay only for the capacity you actually use.
• Offers infrastructure services such as workflows, message passing, archival storage, in-memory caching, search, and both relational and NoSQL databases.

RackSpace:
• Easy-to-use control panel.
• Easy to create basic monitoring checks such as ping or HTTP checks.
• Managed service offering: customers can deploy a fully featured and supported Hadoop infrastructure through a single vendor contract.
• Rapid deployment with low operational burden; simple pricing options.

MS Azure:
• Ready access to virtual networks, service buses, message queues and non-relational storage platforms.
• Compute and storage services are easier to use than those of other IaaS providers.
• Ability to specify an availability zone.
• Provides programmatic interfaces to some of the services.

IBM SmartCloud:
• Offers features important to administrators, especially management and add-on services.
• Provisioning storage is simple and straightforward.
• Storage costs are based on a combination of allocated storage size and I/O operations.
• IaaS for use in data reporting and analytics.
16. Comparison of Distributions Supporting Hadoop On-Premise

Cloudera:
• Enterprise-grade security; increased cost savings.
• Technical support; flexible deployment; faster time to insight.

HortonWorks:
• With YARN, enables multiple workloads, applications and processing engines across single clusters with greater efficiency than ever before.
• Security and high availability; tested at scale on hundreds of production nodes.

MapR:
• Proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses.
• Ease of use, instant recovery and continuous low latency.
• Full data protection with snapshots and business continuity with mirroring.
17. Step-by-Step Approach for Hadoop Deployment: Scenario 1

Business Requirement 1:
• Managed Hadoop cluster ready to use with little or no configuration.
• Most commonly used tools pre-installed and configured, such as Hive, Pig and Sqoop, along with other services.

Our Solution:
• Hadoop as a Service is recommended.

Vendors Providing Hadoop as a Service:
• Amazon Web Services Elastic MapReduce (EMR)
• Qubole Data Services
• Microsoft Windows Azure HDInsight
18. Step-by-Step Approach for Hadoop Deployment: Scenario 2

Business Requirement 2:
• Choice of your own Hadoop distribution software depending on your business requirements.
• Fast turnaround time for infrastructure.
• High elasticity and scale-out.

Our Solution:
• Hadoop on IaaS is recommended.

Vendors Providing IaaS for Hadoop:
• Amazon Web Services EC2
• RackSpace
• MS Azure
• IBM SmartCloud
19. Step-by-Step Approach for Hadoop Deployment: Scenario 3

Business Requirement 3:
• Choice of your own Hadoop distribution software depending on your business requirements.
• High performance and low latency.
• Security and privacy concerns prevent moving data out of the enterprise to the cloud.

Our Solution:
• Hadoop On-Premise is recommended.

Hadoop distributions supporting On-Premise deployment:
• Cloudera
• HortonWorks
• MapR
20. Hadoop Ecosystem Components

Pig: Apache Pig allows you to write complex MapReduce transformations using a simple scripting language.
Hive: Apache Hive is data warehouse infrastructure built on top of Apache Hadoop for providing data summarization, ad-hoc query, and analysis of large datasets.
HBase: Apache HBase is a non-relational (NoSQL) database that runs on top of the Hadoop Distributed File System (HDFS).
Flume: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).
Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Mahout: Apache Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.
21. Business Value

• Humanize 'Big Data' for business decision makers and analysts, thereby extracting real business value.
• Organizations can perform analytics at much lower cost.
• Big data, combined with information from traditional data sources, enables skilled business analysts to discover new insights, patterns and trends in the business.
• Speed time to value.
• Blend data to add context.
• Analyze without complexity.
22. Process of Predictive Modeling

Most processes for creating predictive models incorporate the following steps:

Project Definition / Business Understanding
• Define business objectives and desired outcomes.

Exploration / Data Understanding
• Analyze source data to determine the appropriate data, model-building approach and scope.

Data Preparation
• Select, extract and transform data to create models.

Model Building
• Create, test and validate models, and evaluate them.

Deployment
• Apply model results to business decisions or processes.

Model Management
• Manage models to improve performance and accuracy, control access, and promote reuse.
Amazon Web Services Elastic MapReduce (EMR)
Cloud option.
Supported Hadoop distributions: Amazon, MapR.
EMR supports powerful and proven Hadoop tools such as Hive, Pig, HBase, and Impala.
Pros:
• Easy to set up; clusters can be launched and terminated on demand.
Cons:
• Expensive.
• Limited choice of OS, Hadoop distributions and applications.
Amazon EC2 instances
Cloud option.
Hadoop distributions (Cloudera, HortonWorks, …) can be installed.
Pros:
• Relatively cheaper.
• Install any OS, Hadoop distribution and applications.
Cons:
• Need to manually install and manage the cluster.
On-Premises Hadoop Cluster
Pros:
• Cheapest.
• Install any OS and Hadoop distribution (Cloudera, HortonWorks, MapR).
• Fully utilizes physical hardware.
Cons:
• Must manually install and manage the cluster and physical hardware.
A Hadoop as a Service offering provides a managed Hadoop cluster ready to use, without the need to configure or install any Hadoop services (JobTracker, TaskTracker, NameNode, DataNode) on the cluster nodes, and may provide secondary services such as ZooKeeper or HBase.
Ideally, such a service also provides some of the most commonly used tools pre-installed and configured, such as Hive, Pig, and Sqoop.
More advanced services expand this even further and include graphical interfaces and optimizations, which enable a wide user audience to utilize Hadoop transparently.
Pros:
• Easy and low cost.
Cons:
• Vendor lock-in.
• Data transfer limitations.
• High latency.
We can leverage public IaaS and set up Hadoop on it. The Hadoop distributions Cloudera, HortonWorks and MapR can all be deployed on IaaS.
Pros:
• Lowers the cost of innovation.
• Large-scale resources can be procured quickly.
• Handles variable resource requirements.
Cons:
• Data transfer limitations: loading data from on-premises into the cloud over the network will be slow.
All the major distributions (Cloudera, HortonWorks, MapR) support Hadoop on-premises.
Pros:
• Hadoop runs best on physical servers.
• High performance and low latency.
Cons:
• Procurement lead time.
• Increased cost with lower utilization of dedicated hardware.
Pig: Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort. Pig translates the Pig Latin script into MapReduce so that it can be executed within Hadoop. Pig was designed for performing a long series of data operations, making it ideal for three categories of Big Data operations: standard extract-transform-load (ETL) data pipelines, research on raw data, and iterative processing of data.
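As an illustration of the kind of transformation a Pig Latin script expresses, the pure-Python sketch below performs the same group-and-aggregate locally. The Pig script in the comment is hypothetical, and on a real cluster Pig would compile it into MapReduce jobs rather than a local loop.

```python
# Pure-Python sketch of a Pig-style group-and-aggregate. A roughly
# equivalent (hypothetical) Pig Latin script would be:
#   logs   = LOAD 'logs' AS (user, bytes);
#   byUser = GROUP logs BY user;
#   totals = FOREACH byUser GENERATE group, SUM(logs.bytes);
from collections import defaultdict

logs = [("alice", 120), ("bob", 300), ("alice", 80)]

totals = defaultdict(int)
for user, nbytes in logs:          # GROUP BY user, then SUM(bytes)
    totals[user] += nbytes

print(dict(totals))                # {'alice': 200, 'bob': 300}
```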
Hive: It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL). Hive eases integration between Hadoop and tools for business intelligence and visualization.
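Since HiveQL closely resembles standard SQL, the flavour of a Hive query can be illustrated with the standard library's sqlite3 module. The table and data below are invented, and Hive itself would execute such a query over files in HDFS rather than a SQLite database.

```python
# An aggregate query against an in-memory SQLite table; the SQL text
# would read almost identically in HiveQL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("west", 250.0), ("east", 50.0)])

rows = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)   # [('east', 150.0), ('west', 250.0)]
```

This SQL familiarity is exactly what eases integration between Hadoop and BI tools.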
HBase: It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
Flume: It has a simple and flexible architecture based on streaming data flows, and is robust and fault-tolerant, with tunable reliability mechanisms for failover and recovery.
Sqoop: It imports data from external structured datastores into HDFS or related systems like Hive and HBase. Sqoop can also be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
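What Sqoop automates at cluster scale (bulk transfer between relational stores and Hadoop) can be mimicked locally in miniature. In the sketch below, rows are "imported" from an in-memory SQLite table into delimited CSV records using only the standard library; the table and data are invented for illustration.

```python
# Local sketch of a Sqoop-style export: each database row becomes one
# delimited record, the format Hadoop tools then consume from HDFS.
import csv, io, sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
con.executemany("INSERT INTO employees VALUES (?, ?)",
                [(1, "Ada"), (2, "Lin")])

buf = io.StringIO()
writer = csv.writer(buf)
for row in con.execute("SELECT id, name FROM employees ORDER BY id"):
    writer.writerow(row)       # one DB row -> one delimited record

print(buf.getvalue())
```

Sqoop does this in parallel MapReduce tasks, splitting the source table across mappers.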
Mahout: Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed; it is commonly used to improve future performance based on previous outcomes.
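The idea of improving future decisions from previous outcomes can be shown with a toy co-occurrence recommender, in the spirit of (but far simpler than) Mahout's item-based recommenders. The basket data is invented, and Mahout would run such an algorithm at scale via MapReduce.

```python
# Toy item-based recommendation: count how often item pairs co-occur
# in past purchase baskets, then suggest the most frequent companion.
from collections import Counter
from itertools import combinations

baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
]

# "Train": tally pairwise co-occurrences across historical baskets.
cooccur = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooccur[(a, b)] += 1

def recommend(item):
    # Suggest the item most frequently bought alongside `item`.
    scores = Counter()
    for (a, b), n in cooccur.items():
        if item == a: scores[b] += n
        if item == b: scores[a] += n
    return scores.most_common(1)[0][0]

print(recommend("milk"))   # 'bread'
```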