Here I give a quick introduction to Hadoop, Big Data, Business Intelligence, and the other core services and programs involved in using Hadoop as a successful tool for Big Data analysis.
My understanding of Big Data:
"Data" becomes "information", but Big Data now takes information to "knowledge", "knowledge" becomes "wisdom", and "wisdom" turns into "business" or "revenue", provided you use it in a prompt and timely manner.
2. What is Hadoop?
Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain
insight from massive amounts of structured and unstructured data quickly and without significant investment.
Hadoop is designed to run on commodity hardware and can scale up or down without system interruption. It consists
of three main functions: storage, processing and resource management.
3. Core services on Hadoop
MapReduce:
MapReduce is a framework for writing applications that process large amounts of structured and
unstructured data in parallel across a cluster of machines in a reliable, fault-tolerant manner.
Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of
large data sets on compute clusters of commodity hardware.
The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks.
The Hadoop MapReduce framework sorts the outputs of the maps, which are then input to the
reduce tasks. Typically, both the input and the output of the job are stored in a file system.
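As an illustration, here is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce Java API; the input and output paths are supplied as command-line arguments and are assumed to point at HDFS directories.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: the framework has already sorted map output by key,
  // so each call receives one word together with all of its counts.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles scheduling, monitoring, and re-execution of the map and reduce tasks; the application only supplies the two functions above.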
4. Core services on Hadoop
HDFS:
Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable
data storage across large clusters of commodity servers.
This Apache Software Foundation project provides a fault-tolerant file system designed to run on
commodity hardware.
The primary objective of HDFS is to store data reliably even in the presence of failures
including NameNode failures, DataNode failures and network partitions.
The NameNode is a single point of failure for the HDFS cluster, while DataNodes store the actual
data blocks in the Hadoop file system.
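A minimal sketch of writing and reading a file through the HDFS Java API follows; the file path is hypothetical, and the NameNode address is assumed to come from core-site.xml on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hdfs-demo.txt"); // hypothetical path for the demo

    // Write: the client asks the NameNode for block locations, then streams
    // the bytes to DataNodes, which replicate them for fault tolerance.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back through the same FileSystem abstraction.
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
  }
}
```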
5. Core services on Hadoop
Hadoop Yarn:
YARN is a next-generation framework for Hadoop data processing that extends MapReduce capabilities by
supporting non-MapReduce workloads associated with other programming models.
It is a resource-management platform responsible for managing compute resources in clusters and using
them to schedule users' applications.
All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of
individual machines, or racks of machines) are common and thus should be automatically handled in
software by the framework. Hadoop itself is now commonly considered to consist of a number of related
projects as well.
6. Core services on Hadoop
Apache Tez:
Tez generalizes the MapReduce paradigm into a generic data-processing pipeline engine, envisioned
as a low-level engine for higher abstractions such as Apache Hadoop MapReduce, Apache
Pig, Apache Hive, etc.
It is a data-processing pipeline engine into which one can plug input, processing, and output
implementations to perform arbitrary data processing.
Every task in Tez has an Input to consume key/value pairs from, a Processor to process them, and an
Output to collect the processed key/value pairs. This makes Tez a more powerful framework for executing
a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
7. Hadoop Data Services
Apache Pig:
It is a high-level procedural language platform developed to simplify querying large data sets in
Apache Hadoop and MapReduce.
Apache Pig features a “Pig Latin” language layer that enables SQL-like queries to be performed
on distributed datasets within Hadoop applications.
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very large data sets.
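A small sketch of driving Pig from Java via the PigServer API follows; the input file 'input.txt', its single-column layout, and the output directory are all hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    // Local mode for experimentation; ExecType.MAPREDUCE would run on a cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Pig Latin statements are registered one by one; 'input.txt' is a
    // hypothetical text file with one line per record.
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

    // Triggers execution of the pipeline and writes the result.
    pig.store("counts", "wordcount-out");
  }
}
```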
8. Hadoop Data Services
Apache Hbase:
HBase is the Hadoop database.
It is a distributed, scalable, big data store.
HBase is a sub-project of the Apache Hadoop project and is used to provide real-time read
and write access to your big data.
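A brief sketch of that real-time read/write access using the HBase Java client API follows; the 'users' table and its 'info' column family are hypothetical and assumed to exist already.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath for the ZooKeeper quorum address.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         // 'users' with column family 'info' is a hypothetical, pre-created table.
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row key "row1", column info:name.
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
      table.put(put);

      // Read it back immediately: HBase serves random reads in real time.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```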
9. Hadoop Data Services
Apache Hive:
Hive is data warehouse software that facilitates querying and managing large datasets residing in
distributed storage.
Hive provides a mechanism to project structure onto this data and query the data using a SQL-like
language called HiveQL.
At the same time this language also allows traditional map/reduce programmers to plug in their custom
mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
Hive is an open source volunteer project under the Apache Software Foundation. Previously it was a
subproject of Apache Hadoop, but has now graduated to become a top-level project of its own.
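A minimal sketch of querying Hive through its HiveServer2 JDBC driver follows; the server address, credentials, and the page_views table are placeholders, and the Hive JDBC driver jar is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC URL; host, port, database, and user are placeholders.
    String url = "jdbc:hive2://hive-server:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
         Statement stmt = conn.createStatement()) {

      // Project a table structure onto files in distributed storage (schema-on-read).
      stmt.execute("CREATE TABLE IF NOT EXISTS page_views "
          + "(user_id STRING, url STRING, ts BIGINT) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

      // A HiveQL query; Hive compiles this into distributed jobs under the hood.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
        while (rs.next()) {
          System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
        }
      }
    }
  }
}
```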
10. Hadoop Data Services
Apache flume:
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving
large amounts of log data.
It has a simple and flexible architecture based on streaming data flows.
It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms.
It uses a simple, extensible data model that allows for online analytic applications. Flume's high-level
architecture is focused on delivering a streamlined codebase that is easy to use and easy to extend.
11. Hadoop Data Services
Apache Mahout:
Apache Mahout is an Apache project to produce free implementations of distributed or otherwise
scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering
and classification, often leveraging, but not limited to, the Hadoop platform.
Its core algorithms for clustering, classification, and collaborative filtering are implemented on top of
Apache Hadoop using the map/reduce paradigm.
Classification learns from existing categorized documents what documents of a specific category look
like, and is able to assign unlabelled documents to the (hopefully) correct category.
12. Hadoop Data Services
Apache Accumulo :
Accumulo is a sorted, distributed key/value store and is at the core of Sqrrl Enterprise.
It handles large amounts of structured, semi-structured, and unstructured data as a
robust, scalable, and real-time data storage and retrieval system.
Fine-grained security controls allow organizations to control data at the cell level and promote a data-centric security model without degrading performance.
Accumulo can support a wide variety of real-time analytics, including statistics and graph analytics, via
Accumulo’s server-side programming framework called iterators.
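A short sketch of writing a single cell with a visibility label through the Accumulo Java client (1.x API) follows; the instance name, ZooKeeper address, credentials, and 'records' table are placeholders.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;

public class AccumuloCellWrite {
  public static void main(String[] args) throws Exception {
    // Instance name, ZooKeeper hosts, and credentials are placeholders.
    Instance instance = new ZooKeeperInstance("accumulo", "zk1:2181");
    Connector connector = instance.getConnector("user", new PasswordToken("secret"));

    // 'records' is a hypothetical, pre-created table.
    BatchWriter writer = connector.createBatchWriter("records", new BatchWriterConfig());
    Mutation mutation = new Mutation("row1");
    // Cell-level security: only scanners whose authorizations satisfy the
    // visibility expression "admin|analyst" can read this cell.
    mutation.put("attrs", "ssn", new ColumnVisibility("admin|analyst"),
        new Value("123-45-6789".getBytes()));
    writer.addMutation(mutation);
    writer.close(); // flushes the buffered mutation to the tablet servers
  }
}
```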
13. Hadoop Data Services
Apache Storm:
Storm is a distributed realtime computation system.
Storm provides a set of general primitives for doing realtime computation.
Storm is simple, can be used with any programming language, and is a lot of fun to use!
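A compact sketch of a Storm topology (one spout, one bolt) using the org.apache.storm package names from Storm 1.x follows; it runs in-process via LocalCluster purely for illustration, and the sentence data is made up.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SentenceTopology {

  // A spout is a source of the stream: it emits tuples indefinitely.
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    public void nextTuple() {
      collector.emit(new Values("the quick brown fox"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sentence"));
    }
  }

  // A bolt consumes tuples and emits derived tuples (here, individual words).
  public static class SplitBolt extends BaseBasicBolt {
    public void execute(Tuple input, BasicOutputCollector collector) {
      for (String word : input.getStringByField("sentence").split(" ")) {
        collector.emit(new Values(word));
      }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout(), 1);
    builder.setBolt("words", new SplitBolt(), 2).shuffleGrouping("sentences");

    // Run in-process for testing; StormSubmitter would deploy to a real cluster.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("demo", new Config(), builder.createTopology());
  }
}
```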
14. Hadoop Data Services
Apache Sqoop:
Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational
databases and data warehouses – into Hadoop.
It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from
Oracle, Teradata or other relational databases to the target.
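Sqoop is normally driven from the command line; the sketch below shows an equivalent programmatic invocation via Sqoop.runTool, with the JDBC connection details, table name, and target directory all as placeholders.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
  public static void main(String[] args) {
    // Equivalent to the 'sqoop import' command line; connection details,
    // table name, and target directory are all placeholders.
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://db-host/sales",
        "--username", "etl",
        "--password", "secret",
        "--table", "orders",
        "--target-dir", "/user/etl/orders",
        "--num-mappers", "4"   // Sqoop parallelizes the import as 4 map tasks
    };
    int exitCode = Sqoop.runTool(importArgs);
    System.exit(exitCode);
  }
}
```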
15. Hadoop Data Services
Apache HCatalog:
HCatalog is a table and storage management layer for Hadoop that enables users with different data
processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data
on the grid.
HCatalog is a set of interfaces that open up access to Hive's metastore for tools inside and outside of the
Hadoop grid.
It includes providing a shared schema and data type mechanism for Hadoop tools.
HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File
System (HDFS) and ensures that users need not worry about where or in what format their data is stored.
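A sketch of a map-only MapReduce job that reads a Hive table through HCatInputFormat follows, so the mapper works with schema'd HCatRecord rows instead of raw files; the 'default.page_views' table, its column order, and the output path are assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatalogRead {

  // The mapper sees each row of the Hive table as an HCatRecord;
  // it never needs to know the file format or HDFS location.
  public static class UrlMapper
      extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
    @Override
    protected void map(WritableComparable key, HCatRecord record, Context context)
        throws IOException, InterruptedException {
      // Field 1 is assumed to be the 'url' column of the hypothetical table.
      context.write(new Text(record.get(1).toString()), new IntWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "hcatalog read");
    job.setJarByClass(HCatalogRead.class);

    // Resolve schema and storage location through the shared Hive metastore.
    HCatInputFormat.setInput(job, "default", "page_views");
    job.setInputFormatClass(HCatInputFormat.class);

    job.setMapperClass(UrlMapper.class);
    job.setNumReduceTasks(0); // map-only for brevity
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/hcat-out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```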
16. Hadoop Operational Services
Apache Zookeeper :
ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical
namespace of data registers, known as znodes.
Every znode is identified by a path, with path elements separated by a slash (“/”). Aside from the
root, every znode has a parent, and a znode cannot be deleted if it has children.
The service is replicated over a set of machines, each of which maintains an in-memory image of the data
tree along with transaction logs.
Clients connect to a single ZooKeeper server and maintain a TCP connection through which they send
requests and receive responses.
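A minimal sketch of the ZooKeeper Java client creating and reading back a znode follows; the connect string, session timeout, and the '/app-config' path are placeholders.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeExample {
  public static void main(String[] args) throws Exception {
    // Connect string and session timeout are placeholders; the client keeps
    // a single TCP connection to one server in the ensemble.
    ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, event ->
        System.out.println("event: " + event.getState()));

    // Create a znode under the root; its parent ("/") must already exist.
    String path = zk.create("/app-config", "v1".getBytes(StandardCharsets.UTF_8),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Read it back; the Stat argument may be null if we don't need metadata.
    byte[] data = zk.getData(path, false, null);
    System.out.println(path + " = " + new String(data, StandardCharsets.UTF_8));

    zk.close();
  }
}
```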
17. Hadoop Operational Services
Apache Falcon:
Falcon is a framework for simplifying data management and pipeline processing in Apache Hadoop.
It enables users to automate the movement and processing of datasets for ingest, pipelines, disaster
recovery and data retention use cases.
Instead of hard-coding complex dataset and pipeline processing logic, users can now rely on Apache
Falcon for these functions, maximizing reuse and consistency across Hadoop applications.
Falcon simplifies the development and management of data processing pipelines by introducing a
higher layer of abstraction for users to work with.
18. Hadoop Operational Services
Apache Ambari :
Apache Ambari is a 100-percent open source operational framework for provisioning, managing and
monitoring Apache Hadoop clusters.
Ambari includes an intuitive collection of operator tools and a robust set of APIs that hide the
complexity of Hadoop, simplifying the operation of clusters.
Ambari includes an intuitive Web interface that allows you to easily provision, configure and test all
the Hadoop services and core components.
Ambari provides tools to simplify cluster management. The Web interface allows you to
start/stop/test Hadoop services, change configurations and manage ongoing growth of your cluster.
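Alongside the Web interface, Ambari exposes the same operations through a REST API; a minimal sketch of listing managed clusters via the /api/v1/clusters endpoint follows, with host, port, and admin credentials as placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariClusterList {
  public static void main(String[] args) throws Exception {
    // Host, port, and credentials are placeholders for a real Ambari server.
    URL url = new URL("http://ambari-host:8080/api/v1/clusters");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    String auth = Base64.getEncoder()
        .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
    conn.setRequestProperty("Authorization", "Basic " + auth);

    // Ambari answers with a JSON document describing the managed clusters.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```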
19. Hadoop Operational Services
Apache Knox:
The Knox Gateway (“Knox”) is a system that provides a single point of authentication and access for
Apache™ Hadoop® services in a cluster.
The goal of the project is to simplify Hadoop security for users who access the cluster data and
execute jobs, and for operators who control access and manage the cluster.
Knox runs as a server (or a cluster of servers) that serves one or more Hadoop clusters.
20. Hadoop Operational Services
Apache Oozie :
Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs.
Oozie combines multiple jobs sequentially into one logical unit of work.
It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache
Pig, Apache Hive, and Apache Sqoop.
Apache Oozie allows Hadoop administrators to build complex data transformations out of multiple
component tasks.
Apache Oozie helps administrators derive more value from their Hadoop investment.
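A minimal sketch of submitting a workflow with the Oozie Java client follows; the server URL, the HDFS application path, and the workflow.xml it points to are placeholders.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmit {
  public static void main(String[] args) throws Exception {
    // Oozie server URL and HDFS paths are placeholders.
    OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

    Properties conf = client.createConfiguration();
    // Points at a workflow.xml in HDFS that chains the component actions
    // (e.g. a Sqoop import, then a Pig transform, then a Hive load).
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/workflows/daily");
    conf.setProperty("nameNode", "hdfs://namenode:8020");
    conf.setProperty("jobTracker", "resourcemanager:8032");

    String jobId = client.run(conf);
    System.out.println("workflow started: " + jobId
        + " status: " + client.getJobInfo(jobId).getStatus());
  }
}
```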
21. What Hadoop can, and can't do
What Hadoop can't do
You can't use Hadoop as a replacement for a relational database, i.e. for:
Structured data that needs low-latency, record-level access
Transactional (OLTP) data
What Hadoop can do
You can use Hadoop for:
Big Data: batch analysis of very large structured, semi-structured, and unstructured datasets
22. Support & Partner
To get started with Hadoop, or if you need support, contact:
Muthu Natarajan
muthu.n@msquaresystems.com
www.msquaresystems.com
Phone: 212-941-6000/703-222-5500