Hadoop in a Nutshell
1. Hadoop in a Nutshell
By T. Anthony Date: 13th June, 2018
2. Contents
• What is Hadoop
• Why Hadoop?
• When to Use Hadoop & When Not To
• Hadoop Reference Architecture
• Hadoop Infrastructure Requirements
• Comparison of Vendors Providing Hadoop as a Service & IaaS
• Comparison of Distributions Supporting Hadoop On-Premise
• Step-by-Step Approach for Hadoop Deployment
• Hadoop Ecosystem
• Business Value
• Predictive Analytics
• Predictive Analytics Vendor Assessment
3. What is Hadoop
Hadoop is a distributed framework that makes it easier to process large data sets across clusters of computers. Hadoop is made up of four core modules, supported by a large ecosystem of complementary technologies and products. The modules are:
Hadoop Distributed File System (HDFS) – A distributed file system that provides high-throughput access to application data. Hadoop can also work with other file systems, including FTP, Amazon S3 and Windows Azure Storage Blobs (WASB), among others.
Hadoop YARN – Provides the framework to schedule jobs and manage resources across the cluster that holds the data.
Hadoop MapReduce – A YARN-based system for parallel processing of large data sets.
Hadoop Common – A set of utilities that supports the three other core modules.
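The division of labour between these modules can be seen in the classic word-count example. Below is a minimal sketch in plain Python: `mapper`, `reducer` and the local `run_local` driver are illustrative stand-ins for what MapReduce and YARN do across a cluster (including the shuffle/sort phase), not real Hadoop APIs.

```python
# A minimal word-count sketch in the MapReduce style: the mapper emits
# (word, 1) pairs, the framework groups values by key, and the reducer
# sums the counts. run_local() is a hypothetical stand-in for the
# shuffle/sort phase a real cluster performs between map and reduce.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    yield word, sum(counts)

def run_local(lines):
    groups = defaultdict(list)            # stand-in for shuffle/sort
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(kv for key in groups for kv in reducer(key, groups[key]))

print(run_local(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

On a real cluster the mapper and reducer run in parallel on many nodes, with HDFS holding the input and output.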
4. Why Hadoop
Hadoop analytics help streamline manufacturing processes.
Hadoop can significantly reduce manual effort.
The Hadoop file system makes analytical processing up to 10 times faster on 75% of the computing power, even as datasets grow 10 times larger.
Big data analytics becomes simpler as user-friendly tools become available.
The Hadoop framework has proved effective in cloud manufacturing systems.
Hadoop is well suited to near-real-time predictive analytics in manufacturing, at stages such as reducing manufacturing defects and improving process yield and asset performance.
5. When not to use Hadoop
# 1. Real Time Analytics
If you need real-time analytics, where results are expected quickly, Hadoop should not be used directly: Hadoop works on batch processing, so response times are high. Spark, by contrast, can process the same data in near real time.
6. When not to use Hadoop
# 2. Not a Replacement for Existing Infrastructure
Hadoop is not a replacement for your existing data processing infrastructure; rather, you can use Hadoop alongside it. All the historical big data can be stored in Hadoop HDFS, where it can be processed and transformed into structured, manageable data. After processing the data in Hadoop, you then send the output to relational database technologies for BI, decision support, reporting, etc.
7. When not to use Hadoop
# 3. Multiple Smaller Datasets
The Hadoop framework is not recommended for small structured datasets; other tools on the market, such as MS Excel and RDBMSs, can do this work more easily and faster than Hadoop. For small-scale data analytics, Hadoop can be costlier than these tools.
8. When not to use Hadoop
# 4. Where Security Is the Primary Concern
Many enterprises, especially within highly regulated industries dealing with sensitive data, aren't able to move as quickly as they would like towards implementing Big Data projects and Hadoop.
One mitigation is to encrypt data while moving it into Hadoop: write a MapReduce program that applies an encryption algorithm to the data and stores the result in HDFS, then use that data for further MapReduce processing to get relevant insights.
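The encrypt-before-store idea can be sketched as follows. This is only an illustration: the SHA-256-chained XOR keystream below is a placeholder for a real cipher such as AES, and `keystream`, `mapper` and the key handling are invented for the example. A production system should use a vetted cryptography library and proper key management.

```python
# Sketch of encrypt-on-ingest for a record-oriented mapper. The XOR
# keystream is a toy stand-in for a real algorithm such as AES.
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    # Hypothetical helper: derive a repeatable pseudo-random keystream
    # from the key by chaining SHA-256 digests.
    block, out = key, b""
    while len(out) < length:
        block = hashlib.sha256(block).digest()
        out += block
    return out[:length]

def encrypt(record: bytes, key: bytes) -> bytes:
    # XOR the record with the keystream; applying encrypt() twice with
    # the same key recovers the plaintext.
    ks = keystream(key, len(record))
    return bytes(a ^ b for a, b in zip(record, ks))

def mapper(record: bytes, key: bytes):
    # Emit only ciphertext, so HDFS never stores the raw record.
    yield encrypt(record, key)

key = b"demo-key"
cipher = next(mapper(b"sensitive row", key))
assert encrypt(cipher, key) == b"sensitive row"
```

Downstream MapReduce jobs would then decrypt records inside the mapper before processing, keeping plaintext off disk.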
9. When to use Hadoop
# 1. Data Size and Data Diversity
When we are dealing with huge volumes of data coming from various sources and in a variety of formats, Hadoop is the right technology.
10. When to use Hadoop
# 2. Future Planning
Before implementing Hadoop on our data, we should first understand the complexity of the data and the rate at which it is growing.
11. When to use Hadoop
# 3. Multiple Frameworks for Big Data
There are various tools for various purposes. Hadoop can be integrated with multiple analytic tools to get the best out of it, such as Mahout for machine learning, R and Python for analytics and visualization, Spark for real-time processing, MongoDB and HBase for NoSQL databases, and Pentaho for BI.
14. Comparison of Vendors Providing Hadoop as a Service (SaaS)
<<Internal>> 14
Amazon EMR:
• Easy to use: within minutes a cluster can be configured and a Hadoop application is ready to run.
• Save money with Spot Instances: Spot Instances are a way to purchase virtual servers for your cluster at a discount. Excess capacity in Amazon Web Services is offered at a fluctuating price, based on supply and demand.
• Amazon EMR supports the MapR distribution.

Qubole:
• Reduced deployment complexity.
• Reduced management complexity.
• Well suited to short-running analysis jobs.
• Can be used to realize hybrid cloud setups.

Microsoft Windows Azure HDInsight:
• Deployment agility: HDInsight offers agility to meet the changing needs of your organization. With a rich library of PowerShell scripts you can deploy and provision a Hadoop cluster in minutes instead of hours or days.
• Simplicity and ease of management.
• Offers enterprise-class security and scalability.
15. Comparison of Vendors Providing IaaS for Hadoop
Amazon Web Services EC2:
• Obtain and configure capacity with minimal friction.
• Complete control of computing resources.
• Economical computing: pay only for the capacity you actually use.
• Offers infrastructure services such as workflows, message passing, archival storage, in-memory caching, search, and both relational and NoSQL databases.

RackSpace:
• Easy-to-use control panel.
• Easy to create basic monitoring checks such as ping or HTTP checks.
• Managed service offering: customers can deploy a fully featured and supported Hadoop infrastructure through a single vendor contract.
• Rapid deployment with low operational burden; simple pricing options.

MS Azure:
• Ready access to virtual networks, service buses, message queues and non-relational storage platforms.
• Compute and storage services are easier to use than those of other IaaS providers.
• Ability to specify an availability zone.
• Provides programmatic interfaces to some of the services.

IBM SmartCloud:
• Offers features important to administrators, especially management and add-on services.
• Provisioning storage is simple and straightforward.
• Storage costs are based on a combination of allocated storage size and I/O operations.
• IaaS for use in data reporting and analytics.
16. Comparison of Distributions Supporting Hadoop On-Premise

Cloudera:
• Enterprise-grade security; increased cost savings.
• Technical support; flexible deployment; faster time to insight.

HortonWorks:
• With YARN, enables multiple workloads, applications and processing engines across single clusters with greater efficiency than ever before.
• Security and high availability; tested at scale on hundreds of production nodes.

MapR:
• Proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses.
• Ease of use, instant recovery and continuous low latency.
• Full data protection with snapshots and business continuity with mirroring.
17. Step-by-Step Approach for Hadoop Deployment: Scenario 1

Business Requirement 1:
• Managed Hadoop cluster ready to use with little or no configuration.
• Most commonly used tools pre-installed and configured, such as Hive, Pig and Sqoop, along with other services.

Our Solution:
• Hadoop as a Service is recommended.

Vendors Providing Hadoop as a Service:
• Amazon Web Services Elastic MapReduce (EMR)
• Qubole Data Services
• Microsoft Windows Azure HDInsight
18. Step-by-Step Approach for Hadoop Deployment: Scenario 2

Business Requirement 2:
• Choice of your own Hadoop distribution software depending on your business requirements.
• Fast turnaround time for infrastructure.
• High elasticity and scale-out.

Our Solution:
• Hadoop on IaaS is recommended.

Vendors Providing IaaS for Hadoop:
• Amazon Web Services EC2
• RackSpace
• MS Azure
• IBM SmartCloud
19. Step-by-Step Approach for Hadoop Deployment: Scenario 3

Business Requirement 3:
• Choice of your own Hadoop distribution software depending on your business requirements.
• High performance and low latency.
• Security and privacy concerns prevent moving data out of the enterprise to the cloud.

Our Solution:
• Hadoop On-Premise is recommended.

Hadoop distributions supporting On-Premise deployment:
• Cloudera
• HortonWorks
• MapR
20. Hadoop Ecosystem Components

Pig: Apache Pig allows you to write complex MapReduce transformations using a simple scripting language.
Hive: Apache Hive is data warehouse infrastructure built on top of Apache Hadoop for providing data summarization, ad-hoc query, and analysis of large datasets.
HBase: Apache HBase is a non-relational (NoSQL) database that runs on top of the Hadoop Distributed File System (HDFS).
Flume: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).
Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Mahout: Apache Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.
21. Business Value

• Humanize 'Big Data' for business decision makers and analysts, thereby extracting real business value.
• Organizations can perform analytics at much lower cost.
• Big data, combined with information from traditional data sources, enables skilled business analysts to discover new insights, patterns and trends in the business.
• Speed time to value.
• Blend data to add context.
• Analyze without complexity.
22. Process of Predictive Modeling

Most processes for creating predictive models incorporate the following steps:

Project Definition / Business Understanding
• Define business objectives and desired outcomes.

Exploration / Data Understanding
• Analyze source data to determine the appropriate data, model-building approach and scope.

Data Preparation
• Select, extract and transform data to create models.

Model Building
• Create, test and validate models, and evaluate them.

Deployment
• Apply model results to business decisions or processes.

Model Management
• Manage models to improve performance and accuracy, control access, and promote reuse.
Amazon Web Services Elastic MapReduce (EMR)
Cloud option.
Supported Hadoop distributions: Amazon, MapR.
EMR supports powerful and proven Hadoop tools such as Hive, Pig, HBase, and Impala.
Pros:
• Easy to set up; clusters can be launched and terminated on demand.
Cons:
• Expensive.
• Limited choice of OS, Hadoop distributions and applications.
Amazon EC2 instances
Cloud option.
Hadoop distributions (Cloudera, HortonWorks, …) can be installed.
Pros:
• Relatively cheaper.
• Install any OS, Hadoop distribution and applications.
Cons:
• Need to manually install and manage the cluster.
On-Premises Hadoop Cluster
Pros:
• Cheapest.
• Install any OS and Hadoop distribution (Cloudera, HortonWorks, MapR).
• Fully utilizes physical hardware.
Cons:
• Must manually install and manage the cluster and physical hardware.
A Hadoop as a Service offering provides a managed Hadoop cluster ready to use, without the need to configure or install any Hadoop services (JobTracker, TaskTracker, NameNode, DataNode) on the cluster nodes, and may provide secondary services such as ZooKeeper or HBase.
Ideally, such a service also provides some of the most commonly used tools pre-installed and configured, such as Hive, Pig, and Sqoop.
More advanced services expand this even further and include graphical interfaces and optimizations, which enable a wide user audience to utilize Hadoop transparently.
Pros:
• Easy and low cost.
Cons:
• Vendor lock-in.
• Data transfer limitations.
• High latency.
We can leverage public IaaS and set up Hadoop on it. The Hadoop distributions Cloudera, HortonWorks and MapR can all be deployed on IaaS.
Pros:
• Lowers the cost of innovation.
• Large-scale resources can be procured quickly.
• Handles variable resource requirements.
Cons:
• Data transfer limitations: loading data from on-premises into the cloud over the network will be slow.
All the major distributions (Cloudera, HortonWorks, MapR) support Hadoop on-premises.
Pros:
• Hadoop runs best on physical servers.
• High performance and low latency.
Cons:
• Procurement lead time.
• Increased cost with lower utilization of dedicated hardware.
Pig: Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort. Pig translates the Pig Latin script into MapReduce so that it can be executed within Hadoop. Pig was designed for performing a long series of data operations, making it ideal for three categories of Big Data operations: standard extract-transform-load (ETL) data pipelines, research on raw data, and iterative processing of data.
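As an illustration of the kind of transformation a Pig Latin script expresses, the pure-Python sketch below performs the same group-and-aggregate locally. The Pig script in the comment is hypothetical, and on a real cluster Pig would compile it into MapReduce jobs rather than a local loop.

```python
# Pure-Python sketch of a Pig-style group-and-aggregate. A roughly
# equivalent (hypothetical) Pig Latin script would be:
#   logs   = LOAD 'logs' AS (user, bytes);
#   byUser = GROUP logs BY user;
#   totals = FOREACH byUser GENERATE group, SUM(logs.bytes);
from collections import defaultdict

logs = [("alice", 120), ("bob", 300), ("alice", 80)]

totals = defaultdict(int)
for user, nbytes in logs:          # GROUP BY user, then SUM(bytes)
    totals[user] += nbytes

print(dict(totals))                # {'alice': 200, 'bob': 300}
```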
Hive: It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL). Hive eases integration between Hadoop and tools for business intelligence and visualization.
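Since HiveQL closely resembles standard SQL, the flavour of a Hive query can be illustrated with the standard library's sqlite3 module. The table and data below are invented, and Hive itself would execute such a query over files in HDFS rather than a SQLite database.

```python
# An aggregate query against an in-memory SQLite table; the SQL text
# would read almost identically in HiveQL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("west", 250.0), ("east", 50.0)])

rows = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)   # [('east', 150.0), ('west', 250.0)]
```

This SQL familiarity is exactly what eases integration between Hadoop and BI tools.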
HBase: It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
Flume: It has a simple and flexible architecture based on streaming data flows, and is robust and fault-tolerant, with tunable reliability mechanisms for failover and recovery.
Sqoop: It imports data from external structured datastores into HDFS or related systems like Hive and HBase. Sqoop can also be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
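What Sqoop automates at cluster scale (bulk transfer between relational stores and Hadoop) can be mimicked locally in miniature. In the sketch below, rows are "imported" from an in-memory SQLite table into delimited CSV records using only the standard library; the table and data are invented for illustration.

```python
# Local sketch of a Sqoop-style export: each database row becomes one
# delimited record, the format Hadoop tools then consume from HDFS.
import csv, io, sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
con.executemany("INSERT INTO employees VALUES (?, ?)",
                [(1, "Ada"), (2, "Lin")])

buf = io.StringIO()
writer = csv.writer(buf)
for row in con.execute("SELECT id, name FROM employees ORDER BY id"):
    writer.writerow(row)       # one DB row -> one delimited record

print(buf.getvalue())
```

Sqoop does this in parallel MapReduce tasks, splitting the source table across mappers.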
Mahout: Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed; it is commonly used to improve future performance based on previous outcomes.
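The idea of improving future decisions from previous outcomes can be shown with a toy co-occurrence recommender, in the spirit of (but far simpler than) Mahout's item-based recommenders. The basket data is invented, and Mahout would run such an algorithm at scale via MapReduce.

```python
# Toy item-based recommendation: count how often item pairs co-occur
# in past purchase baskets, then suggest the most frequent companion.
from collections import Counter
from itertools import combinations

baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
]

# "Train": tally pairwise co-occurrences across historical baskets.
cooccur = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooccur[(a, b)] += 1

def recommend(item):
    # Suggest the item most frequently bought alongside `item`.
    scores = Counter()
    for (a, b), n in cooccur.items():
        if item == a: scores[b] += n
        if item == b: scores[a] += n
    return scores.most_common(1)[0][0]

print(recommend("milk"))   # 'bread'
```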