2. What Is Hadoop?
• A system for processing mind-bogglingly large amounts
of data
• Inspired by Google's MapReduce and Google File System papers
• The Apache Hadoop software library is a framework
that allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
• It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage.
• Rather than rely on hardware to deliver high availability,
the library itself is designed to detect and handle failures at
the application layer, so delivering a highly-available service
on top of a cluster of computers, each of which may be
prone to failures.
3. Hadoop Core
• Open-source, flexible, and available architecture
for large-scale computation and data processing on
a network of commodity hardware
• Open source software + commodity hardware
• IT cost reduction
• MapReduce: computation
• HDFS: storage
4. Hadoop, Why?
• Need to process Multi Petabyte Datasets
• Expensive to build reliability in each application.
• Nodes fail every day
• Failure is expected, rather than exceptional.
• The number of nodes in a cluster is not constant.
• Need a common infrastructure
• Efficient, reliable, Open Source (Apache License)
• The above goals are the same as Condor's, but
• workloads are IO bound and not CPU bound
5. Hadoop History
• 2004—Initial versions of what is now Hadoop Distributed Filesystem and
MapReduce implemented by Doug Cutting and Mike Cafarella.
• December 2005—Nutch ported to the new framework. Hadoop runs reliably on
20 nodes.
• January 2006—Doug Cutting joins Yahoo!.
• February 2006—Apache Hadoop project officially started to support the
standalone development of MapReduce and HDFS.
• February 2006—Adoption of Hadoop by Yahoo! Grid team.
• April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
• May 2006—Yahoo! set up a Hadoop research cluster—300 nodes.
• May 2006—Sort benchmark run on 500 nodes in 42 hours (better hardware than
April benchmark).
• October 2006—Research cluster reaches 600 nodes.
• December 2006—Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3
hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
• January 2007—Research cluster reaches 900 nodes.
• April 2007—Research clusters—2 clusters of 1000 nodes.
• April 2008—Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
• October 2008—Loading 10 terabytes of data per day onto research clusters.
• March 2009—17 clusters with a total of 24,000 nodes
• April 2009—Won the minute sort by sorting 500 GB in 59 seconds (on 1400
nodes) and the 100 terabyte sort in 173 minutes (on 3400 nodes).
7. How does HDFS Work?
Suppose we have a file of size 300 MB.
(Sidebar: MapReduce has undergone a complete overhaul in
hadoop-0.23, and we now have what we call MapReduce 2.0
(MRv2) or YARN. The fundamental idea of MRv2 is to split up
the two major functionalities of the JobTracker, resource
management and job scheduling/monitoring, into separate
daemons.)
8. How does HDFS Work?
HDFS splits the file into blocks. The size of each block is
128 MB, so the 300 MB file becomes three blocks:
128 MB, 128 MB, and 44 MB.
9. How does HDFS Work?
HDFS will keep 3 copies of each block.
HDFS stores these blocks on datanodes and distributes the
replicas across the DNs.
10. How does HDFS Work?
The Name Node tracks blocks and Data nodes.
[Diagram: a single Name Node tracking a cluster of DataNodes (DN).]
11. How does HDFS Work?
Sometimes a datanode will die. Not a problem: the Name Node
notices the missing replicas and re-replicates those blocks
onto the remaining datanodes.
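To make this concrete, here is a minimal sketch (not from the
original deck) that asks HDFS for a file's block layout through
the public FileSystem API; the path to inspect comes from the
first command-line argument, and cluster settings come from the
usual configuration files on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    // One BlockLocation per block; each lists the datanodes holding a replica.
    for (BlockLocation block :
        fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
  }
}

For the 300 MB file above, this would print three blocks, each
listing the three datanodes that hold its replicas.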
16. MapReduce: Programming Model
Process data using special map() and reduce()
functions
The map() function is called on every item in the
input and emits a series of intermediate key/value
pairs
All values associated with a given key are grouped
together
The reduce() function is called on every unique
key, and its value list, and emits a value that is
added to the output
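To pin down that contract, here is a hypothetical sketch in plain
Java (not the Hadoop API, which slide 40 shows) that simulates the
three phases in memory: map every input item to key/value pairs,
group the values by key, then reduce each group.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

public class MiniMapReduce {
  static <I, K, V> Map<K, V> run(List<I> input,
      Function<I, List<Map.Entry<K, V>>> mapFn,
      BiFunction<K, List<V>, V> reduceFn) {
    // Shuffle phase: group all intermediate values by key.
    Map<K, List<V>> groups = new LinkedHashMap<>();
    for (I item : input)                               // map() on every item
      for (Map.Entry<K, V> kv : mapFn.apply(item))
        groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
              .add(kv.getValue());
    Map<K, V> out = new LinkedHashMap<>();
    for (Map.Entry<K, List<V>> g : groups.entrySet())  // reduce() per unique key
      out.put(g.getKey(), reduceFn.apply(g.getKey(), g.getValue()));
    return out;
  }

  public static void main(String[] args) {
    List<String> lines =
        Arrays.asList("How now brown cow", "How does it work now");
    Map<String, Integer> counts = run(lines,
        line -> {
          // Emit <word, 1> for every whitespace-separated token.
          List<Map.Entry<String, Integer>> kvs = new ArrayList<>();
          for (String w : line.split("\\s+")) kvs.add(new SimpleEntry<>(w, 1));
          return kvs;
        },
        (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum());
    // Prints {How=2, now=2, brown=1, cow=1, does=1, it=1, work=1},
    // the same flow slide 17 draws.
    System.out.println(counts);
  }
}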
17. MapReduce: Programming Model
[Diagram: the MapReduce framework applied to word counting.
The input lines "How now Brown cow" and "How does It work now"
are fed to Map tasks, which emit intermediate pairs such as
<How,1>, <now,1>, <brown,1>, <cow,1>. The framework groups the
values by key (<How,1 1>, <now,1 1>, ...), and Reduce tasks sum
each list, producing the output: brown 1, cow 1, does 1, How 2,
it 1, now 2, work 1.]
19. MapReduce Life Cycle
[Diagram: write a map function and a reduce function, then run
the program as a MapReduce job.]
20. Hadoop Environment
Hadoop has become the kernel of the distributed
operating system for Big Data
The project includes these modules:
Hadoop Common: The common utilities that support
the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A
distributed file system that provides high-throughput
access to application data.
Hadoop YARN: A framework for job scheduling and
cluster resource management.
Hadoop MapReduce: A YARN-based system for
parallel processing of large data sets.
23. What is ZooKeeper?
• A centralized service for maintaining:
• Configuration information and naming
• Distributed synchronization
• Group services
• A set of tools to build distributed applications that can
safely handle partial failures
• ZooKeeper was designed to store coordination data:
• Status information
• Configuration
• Location information
24. ZooKeeper
• ZooKeeper allows distributed processes to coordinate
with each other through a shared hierarchical name
space of data registers (we call these registers
znodes), much like a file system.
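A minimal sketch with the standard ZooKeeper Java client,
assuming an ensemble reachable at localhost:2181; the znode path
and payload here are made up for illustration.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeDemo {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
      if (event.getState() == KeeperState.SyncConnected) connected.countDown();
    });
    connected.await();  // the handshake is asynchronous; wait before first call
    // Znodes live in a hierarchical namespace, much like file system paths.
    // (Throws NodeExists if the znode is already there.)
    zk.create("/demo-config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    byte[] data = zk.getData("/demo-config", false, null);
    System.out.println(new String(data));  // prints "v1"
    zk.close();
  }
}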
26. FLUME
• Flume is a distributed, reliable, and available data
collection service
• It efficiently collects, aggregates, and moves large
amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• One-stop solution for data collection of all formats
• It has a simple and flexible architecture based on
streaming data flows
29. Sqoop
• Apache Sqoop(TM) is a tool designed for efficiently
transferring bulk data between Apache Hadoop and
structured data stores such as relational databases.
• Easy, parallel database import/export
• What can you do with it?
• Import data from an RDBMS into HDFS
• Export data from HDFS back into an RDBMS
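For example (the connection string, database, and table names
below are hypothetical; the flags are standard Sqoop options):

sqoop import \
  --connect jdbc:mysql://db.example.com/shop \
  --username dbuser \
  --table orders \
  --target-dir /data/orders

sqoop export \
  --connect jdbc:mysql://db.example.com/shop \
  --username dbuser \
  --table order_summary \
  --export-dir /data/summary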
32. Why Hive and Pig?
Although MapReduce is very powerful, it can also be
complex to master
Many organizations have business or data analysts who
are skilled at writing SQL queries, but not at writing
Java code
Many organizations have programmers who are skilled
at writing code in scripting languages
Hive and Pig are two projects which evolved separately
to help such people analyze huge amounts of data
via MapReduce
Hive was initially developed at Facebook, Pig at Yahoo!
33. Hive
What is Hive?
An SQL-like interface to Hadoop
Data Warehouse infrastructure that provides easy data
summarization and ad hoc querying and the analysis
of large datasets stored in Hadoop compatible file
systems.
MapReduce for execution
HDFS for storage
Hive Query Language
Basic SQL: Select, From, Join, Group-By
Equi-Join, Multi-Table Insert, Multi-Group-By
Batch query, e.g.:
SELECT storeid, SUM(price) FROM purchases WHERE price > 100 GROUP BY storeid;
35. Pig
Apache Pig is a platform to analyze large data sets.
In simple terms: you have lots and lots of data on which
you need to do some processing or analysis. One way is
to write MapReduce code and then run that processing
on the data.
The other way is to write Pig scripts, which are in turn
converted to MapReduce code that processes your data.
Pig consists of two parts:
• Pig Latin language
• Pig engine
A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';
36. Pig Latin & Pig Engine
• Pig Latin is a scripting language which allows you to
describe how data from one or more inputs should be
read, how it should be processed, and where it should
be stored.
• The flows can be simple or complex, with some
processing applied in between. Data can be picked
from multiple inputs.
• We can say Pig Latin describes a directed acyclic
graph where the edges are data flows and the nodes
are operators that process the data.
• Pig Engine:
• The job of the engine is to execute the data flow written
in Pig Latin in parallel on the Hadoop infrastructure.
37. Why Pig is required when we can code it all in MR
• Pig provides all standard data processing operations
like sort, group, join, filter, order by, and union right
inside Pig Latin
• In MR we have to do lots of manual coding.
• Pig optimizes Pig Latin scripts while compiling them
into MR jobs.
• It creates an optimized version of MapReduce to run on
Hadoop
• It takes much less time to write a Pig Latin script than to
write the corresponding MR code
• Where Pig is useful:
Transactional ETL data pipelines (most common use)
Research on raw data
Iterative processing
39. WordCount Example
• Input
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
• The reduce just sums up the values
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
40. WordCount Example In MapReduce
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Tokenize the line and emit <word, 1> for every token.
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the 1s emitted for this word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
41. WordCount Example By Pig
A = LOAD 'wordcount/input' USING PigStorage() AS
(token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) as count;
DUMP C;
42. WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;
45. HBase
• Apache HBase™ is the Hadoop database, a
distributed, scalable, big data store.
• Apache HBase is an open-source, distributed, versioned, column-
oriented store modeled after Google's Bigtable: A Distributed Storage
System for Structured Data by Chang et al.
• Coordinated by Zookeeper
• Low Latency
• Random Reads And Writes
• Distributed Key/Value Store
• Simple API
– PUT
– GET
– DELETE
– SCAN
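A minimal sketch of those four operations with the HBase Java
client, assuming a table 'users' with a column family 'cf'
already exists (the table, row key, and values are made up for
illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrud {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // PUT: write a cell (row key, column family, qualifier, value)
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
          Bytes.toBytes("Alice"));
      table.put(put);
      // GET: read one row back
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));
      // SCAN: iterate over a range of rows
      try (ResultScanner scanner = table.getScanner(new Scan())) {
        for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
      // DELETE: remove the row
      table.delete(new Delete(Bytes.toBytes("row1")));
    }
  }
}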
46. HBase
HBase is a type of "NoSQL" database. NoSQL?
"NoSQL" is a general term meaning that the database
isn't an RDBMS which supports SQL as its primary
access language, but there are many types of NoSQL
databases: BerkeleyDB is an example of a local
NoSQL database, whereas HBase is very much a
distributed database.
Technically speaking, HBase is really more a "Data
Store" than "Data Base" because it lacks many of the
features you find in an RDBMS, such as typed
columns, secondary indexes, triggers, and advanced
query languages, etc.
49. What is Oozie?
Oozie is a server-based workflow scheduler system to
manage Apache Hadoop jobs (e.g. loading data, storing
data, analyzing data, cleaning data, running MapReduce
jobs, etc.)
A Java web application
Oozie is a workflow scheduler for Hadoop
[Diagram: a workflow of Jobs 1-5, triggered by time or by
data availability.]
51. What is Mahout?
Machine-learning tool
Distributed and scalable machine learning algorithms
on the Hadoop platform
Makes building intelligent applications easier and faster
Its core algorithms for clustering, classification, and
batch-based collaborative filtering are implemented
on top of Apache Hadoop using the map/reduce
paradigm.