2. Apache Hadoop
Apache Hadoop
• is a popular open-source framework for storing and processing large data sets across clusters of computers.
• HDP 2.2 on Sandbox system requirements:
– Now runs on 32-bit and 64-bit OSes (Windows XP, Windows 7, Windows 8 and Mac OS X)
– Minimum 4 GB RAM; 8 GB required to run Ambari and HBase
– Virtualization enabled in the BIOS
– Browser: Chrome 25+, IE 9+ or Safari 6+ recommended (Sandbox will not run on IE 10)
• An ideal way to get started with Enterprise Hadoop. The Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.
• Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials.
• It includes many of the most exciting developments from the latest HDP distribution, packaged up in a virtual environment that you can get up and running in 15 minutes!
3. Hadoop… Getting Started
Terminologies
• Hadoop
• YARN – the Hadoop operating system
– Enables a user to interact with all data in multiple ways simultaneously, making Hadoop a true multi-use data platform and allowing it to take its place in a modern data architecture.
– A framework for job scheduling and cluster resource management.
– This means that many different processing engines can operate simultaneously across a Hadoop cluster, on the same data, at the same time.
• The Hadoop Distributed File System (HDFS)
– A distributed file system that provides high-throughput access to application data.
• MapReduce
– A YARN-based system for parallel processing of large data sets.
• Sqoop
• The Hive ODBC Driver
Hortonworks Data Platform (HDP)
• is a 100% open-source distribution of Apache Hadoop that is truly enterprise grade, having been built, tested and hardened with enterprise rigor.
4. Introducing Apache Hadoop to
Developers
• Apache Hadoop is a community-driven open-source project governed by the Apache Software Foundation.
• It was originally implemented at Yahoo based on papers published by Google in 2003 and 2004.
• Since then Apache Hadoop has matured into a data platform that handles not just batch processing of humongous amounts of data; with the advent of YARN it now supports many diverse workloads, such as interactive queries over large data with Hive on Tez, real-time data processing with Apache Storm, the super-scalable NoSQL datastore HBase, in-memory processing with Spark, and the list goes on.
6. Core of Hadoop
• A set of machines running HDFS and MapReduce is known as a Hadoop cluster. Individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand. For most application scenarios Hadoop is linearly scalable, which means you can expect better performance simply by adding more nodes.
• The Hadoop Distributed File System (HDFS)
• MapReduce
7. MapReduce
• A method for distributing a task across multiple nodes. Each node processes the data stored on that node, to the extent possible.
• A running MapReduce job consists of various phases such as Map -> Sort -> Shuffle -> Reduce.
• Advantages:
– Automatic parallelization and distribution of data in blocks across a distributed, scale-out infrastructure.
– Fault tolerance against failure of storage, compute and network infrastructure.
– Deployment, monitoring and security capability.
– A clean abstraction for programmers.
• Most MapReduce programs are written in Java; they can also be written in any scripting language using Hadoop's Streaming API. A minimal Java sketch follows.
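To make the Map and Reduce phases concrete, here is a minimal word-count sketch against the Hadoop Java MapReduce API; the class names and the whitespace tokenization are illustrative assumptions, not something prescribed by the slides.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: each node emits (word, 1) for the block of data stored locally.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE);   // output pairs are sorted and shuffled to the reducers
          }
        }
      }
    }

    // Reduce phase: all counts for the same word arrive grouped together; sum them.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
          sum += c.get();
        }
        context.write(word, new IntWritable(sum));
      }
    }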
8. The MapReduce Concepts and
Terminology
• MapReduce jobs are controlled by a software daemon known as the JobTracker. The JobTracker resides on a 'master node'. Clients submit MapReduce jobs to the JobTracker. The JobTracker assigns Map and Reduce tasks to other nodes on the cluster.
• These nodes each run a software daemon known as the TaskTracker. The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker.
• A job is a complete program: the full execution of the Mappers and Reducers over a dataset. A task is the execution of a single Mapper or Reducer over a slice of data.
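As a sketch of how a client submits a job, the driver below wires the hypothetical WordCountMapper and WordCountReducer from the previous slide into a Job and submits it; in classic MapReduce the submission goes to the JobTracker (in YARN/MRv2, to the ResourceManager). Input and output paths are taken from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);       // classes from the earlier sketch
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input dataset in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
        // The framework splits the input, schedules one Map task per split, and the
        // tasks report progress back until the whole job completes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

It would be packaged into a jar and launched with something like hadoop jar wordcount.jar WordCountDriver /user/sandbox/in /user/sandbox/out (jar name and paths are made up for the example).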
9. Hadoop Distributed File System
• The foundation of the Hadoop cluster.
• Manages how the datasets are stored in the Hadoop cluster.
• Responsible for distributing the data across the data nodes, managing replication for redundancy, and administrative tasks like adding, removing and recovering data nodes.
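A short sketch of how an application talks to HDFS through the Java FileSystem API; the file path and the text written are made-up examples, and the NameNode address comes from the cluster's standard configuration files.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);               // connects to the configured NameNode
        Path file = new Path("/user/sandbox/hello.txt");    // hypothetical HDFS path
        try (FSDataOutputStream out = fs.create(file)) {    // HDFS splits the file into blocks and
          out.writeUTF("hello HDFS");                       // replicates them across data nodes
        }
        System.out.println("replication = " + fs.getFileStatus(file).getReplication());
        fs.close();
      }
    }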
10. Apache Hive
• Provides a data warehouse view of the data in HDFS.
• Using a SQL-like language, Hive lets you create summarizations of your data and perform ad-hoc queries and analysis of large datasets in the Hadoop cluster.
• The overall approach with Hive is to project a table structure onto the dataset and then manipulate it with HiveQL.
• Since you are using data in HDFS, your operations can be scaled across all the data nodes and you can manipulate huge datasets.
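To illustrate projecting a table structure onto files in HDFS and manipulating it with HiveQL, here is a small sketch over JDBC to HiveServer2; the host name, port, table, columns, delimiter and location are assumptions made for the example.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQLExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 on the Sandbox is typically reachable on port 10000 (an assumption here).
        Connection con = DriverManager.getConnection("jdbc:hive2://sandbox:10000/default", "hive", "");
        try (Statement stmt = con.createStatement()) {
          // Project a table structure onto tab-delimited files already sitting in HDFS.
          stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (ip STRING, url STRING, ts STRING) "
              + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
              + "LOCATION '/user/sandbox/web_logs'");
          // Ad-hoc summarization in HiveQL; Hive turns this into jobs that run across the data nodes.
          ResultSet rs = stmt.executeQuery("SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url");
          while (rs.next()) {
            System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
          }
        }
        con.close();
      }
    }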
11. Apache HCatalog
• Used to hold location and metadata about the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from data location and metadata such as the schema.
• Since it is supported by many tools, like Hive and Pig, the location and metadata can be shared between tools. Using the open APIs of HCatalog, other tools like Teradata Aster can also use the location and metadata in HCatalog.
• How can we reference data by name and inherit the location and metadata? One way is sketched below.
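One way to reference data purely by name, with HCatalog supplying the location and schema, is to load a Hive/HCatalog table from Pig. The sketch below drives that from Java via PigServer; the table name and output path are assumptions, and the HCatLoader class name is the one shipped with current Hive/HCatalog releases.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class HCatalogByName {
      public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // No path and no schema here: HCatalog resolves 'default.web_logs' (a hypothetical
        // table registered by Hive) to its location in HDFS and its column types.
        pig.registerQuery("logs = LOAD 'default.web_logs' USING org.apache.hive.hcatalog.pig.HCatLoader();");
        pig.registerQuery("by_url = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH by_url GENERATE group AS url, COUNT(logs) AS n;");
        pig.store("hits", "/user/sandbox/url_hits");   // hypothetical output directory in HDFS
        pig.shutdown();
      }
    }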
12. Apache Pig
• A language for expressing data analysis and infrastructure processes.
• Pig is translated into a series of MapReduce jobs that are run by the Hadoop cluster.
• Pig is extensible through user-defined functions that can be written in Java and other languages.
• Pig scripts provide a high-level language to create the MapReduce jobs needed to process data in a Hadoop cluster.
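Since the slide notes that Pig is extensible through user-defined functions written in Java, here is a minimal UDF sketch; the package name, class name and the Pig Latin usage mentioned afterwards are illustrative assumptions.

    package myudfs;

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // A trivial user-defined function: upper-cases its first argument.
    public class ToUpper extends EvalFunc<String> {
      @Override
      public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
          return null;
        }
        return input.get(0).toString().toUpperCase();
      }
    }

In a Pig script one would then REGISTER the jar containing this class and call myudfs.ToUpper(url) inside a FOREACH ... GENERATE statement; Pig compiles the whole script, UDF calls included, into MapReduce jobs that run on the cluster.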