Apache Hadoop Talk at QCon

Apache Hadoop,
Big Data, and You
Philip Zeyliger
philip@cloudera.com
@philz42 @cloudera
November 18, 2009

Wednesday, November 18, 2009

Hi there!

Software Engineer
Worked at


I work on stuff...


Outline
Why should you care? (Intro)
Challenging yesteryear’s assumptions
The MapReduce Model
HDFS, Hadoop Map/Reduce
The Hadoop Ecosystem
Questions


Data is everywhere.

Data is important.


“I keep saying that the sexy job
in the next 10 years will be
statisticians, and I’m not kidding.”

Hal Varian
(Google’s chief economist)


So, what’s Hadoop?

The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry


Apache Hadoop is an open-source
system (written in Java!) to store and
process
gobs of data
across many commodity computers.

The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry


Two Big
Components
HDFS Map/Reduce

Self-healing high-
bandwidth Fault-tolerant
clustered storage. distributed computing.


Challenging some of
yesteryear’s
assumptions...


Assumption 1: Machines can be reliable...

Image: MadMan the Mighty CC BY-NC-SA

Hadoop Goal:

Separate distributed
system fault-tolerance
code from application logic.
Systems
Programmers Statisticians


Assumption 2: Machines have identities...
Image:Laughing Squid CC BY-
NC-SA

Hadoop Goal:

Users should interact with
clusters, not machines.


Assumption 3: A data set ﬁts on one machine...
Image: Matthew J. Stinson CC-
BY-NC

Hadoop Goal:

System should scale
linearly (or better) with
data size.


The M/R
Programming Model


You specify map()
and reduce()
functions.

The framework does
the rest.

map()
map: K₁,V₁→list K₂,V₂

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
/**
* Called once for each key/value pair in the input split. Most applications
* should override this, but the default is the identity function.
*/
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException,
InterruptedException {
// context.write() can be called many times
// this is default “identity mapper” implementation
context.write((KEYOUT) key, (VALUEOUT) value);
}
}


(the shufﬂe)

map output is assigned to a “reducer”

map output is sorted by key


reduce()
K₂, iter(V₂)→list(K₃,V₃)

public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
/**
* This method is called once for each key. Most applications will define
* their reduce class by overriding this method. The default implementation
* is an identity function.
*/
@SuppressWarnings("unchecked")
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
) throws IOException, InterruptedException {
for(VALUEIN value: values) {
context.write((KEYOUT) key, (VALUEOUT) value);
}
}
}


Putting it together...
Logical

Physical Flow

Physical


Some samples...
Build an inverted index.
Summarize data grouped by a key.
Build map tiles from geographic data.
OCRing many images.
Learning ML models. (e.g., Naive Bayes
for text classiﬁcation)
Augment traditional BI/DW
technologies (by archiving raw data).

There’s more than the
Java API
Streaming Pig Hive
perl, python, Higher-level SQL interface.
ruby, whatever. dataﬂow language
for easy ad-hoc Great for
stdin/stdout/ analysis. analysts.
stderr
Developed at Developed at
Yahoo! Facebook

Friday,
@10:10

A typical look...
Commodity servers (8-core, 8-16GB
RAM, 4-12 TB, 2x1 gE NIC)
2-level network architecture
20-40 nodes per rack


The cast...
Starring...

NameNode (metadata server and database)

SecondaryNameNode (assistant to NameNode)
JobTracker (scheduler)

The Chorus…
DataNodes TaskTrackers
(block storage) (task execution)

Thanks to Zak Stone for earmuff image!

HDFS
3x64MB file, 3 rep
4x64MB file, 3 rep
Namenode
Small file, 7 rep
Datanodes

One Rack A Different Rack

HDFS Write Path


HDFS Failures?
Datanode crash?
Clients read another copy
Background rebalance
Namenode crash?
uh-oh


M/R
Job on stars
Tasktrackers on the same Different job
machines as datanodes Idle

One Rack A Different Rack

M/R


M/R Failures
Task fails
Try again?
Try again somewhere else?
Report failure
Retries possible because of idempotence


Hadoop in the Wild

Yahoo! Hadoop Clusters: > 82PB, >25k machines
(Eric14, HadoopWorld NYC ’09)

Google: 40 GB/s GFS read/write load (Jeff Dean,
LADIS ’09) [~3,500 TB/day]

Facebook: 4TB new data per day; DW: 4800 cores, 5.5
PB (Dhruba Borthakur, HadoopWorld)


The Hadoop Ecosystem
ETL Tools BI Reporting RDBMS

Pig (Data Flow) Hive (SQL) Sqoop
Zookeepr (Coordination)

Avro (Serialization)
MapReduce (Job Scheduling/Execution System)

HBase (Key-Value store)

HDFS
(Hadoop Distributed File System)


Ok, ﬁne, what next?
Get Hadoop!
http://hadoop.apache.org/
Cloudera Distribution for Hadoop
Try it out! (Locally, or on EC2)


Just one slide...

Software: Cloudera Distribution for
Hadoop, Cloudera Desktop, more…
Training and certiﬁcation…
Free on-line training materials
(including video)
Support & Professional Services
@cloudera, blog, etc.

Questions?

philip@cloudera.com


Backup Slides


Important APIs
→ is 1:many
Input Format data→K₁,V₁
Writable
Mapper K₁,V₁→K₂,V₂
JobClient
M/R Flow

Other
Combiner K₂,iter(V₂)→K₂,V₂

Partitioner K₂,V₂→int *Context

Reducer K₂, iter(V₂)→K₃,V₃ Filesystem
Out. Format K₃,V₃→data

$ cat input.txt
adams dunster kirkland dunster
kirland dudley dunster
adams dunster winthrop

$ bin/hadoop jar hadoop-0.18.3-
examples.jar grep input.txt output1
'dunster|adams'

$ cat output1/part-00000
4 dunster
2 adams


public int run(String[] args)
throws Exception { grepJob.setReducerClass(LongSumRedu FileOutputFormat.setOutputPath(sort
if (args.length < 3) { cer.class); Job, new Path(args[1]));
System.out.println("Grep // sort by decreasing freq
<inDir> <outDir> <regex>
[<group>]"); FileOutputFormat.setOutputPath(grep sortJob.setOutputKeyComparatorClass
Job, tempDir); (LongWritable.DecreasingComparator.
ToolRunner.printGenericCommandUsage class);
(System.out); grepJob.setOutputFormat(SequenceFil
return -1; eOutputFormat.class); JobClient.runJob(sortJob);
} } finally {
Path tempDir = new Path("grep- grepJob.setOutputKeyClass(Text.clas
temp-"+Integer.toString(new s); FileSystem.get(grepJob).delete(temp
Random().nextInt(Integer.MAX_VALUE) Dir, true);
)); grepJob.setOutputValueClass(LongWri }
JobConf grepJob = new table.class); return 0;
JobConf(getConf(), Grep.class); }
try { JobClient.runJob(grepJob);
grepJob.setJobName("grep-
search"); JobConf sortJob = new
JobConf(Grep.class);
sortJob.setJobName("grep-
FileInputFormat.setInputPaths(grepJ sort");

the “grep”
ob, args[0]);

FileInputFormat.setInputPaths(sortJ
grepJob.setMapperClass(RegexMapper. ob, tempDir);
class);

example
sortJob.setInputFormat(SequenceFile
grepJob.set("mapred.mapper.regex", InputFormat.class);
args[2]);
if (args.length == 4)
sortJob.setMapperClass(InverseMappe
grepJob.set("mapred.mapper.regex.gr r.class);
oup", args[3]);
// write a single file
sortJob.setNumReduceTasks(1);
grepJob.setCombinerClass(LongSumRed
ucer.class);


JobConf grepJob = new JobConf(getConf(), Grep.class);
try {
grepJob.setJobName("grep-search");

FileInputFormat.setInputPaths(grepJob, args[0]);

Job
grepJob.setMapperClass(RegexMapper.class);
grepJob.set("mapred.mapper.regex", args[2]);
if (args.length == 4)
grepJob.set("mapred.mapper.regex.group", args[3]);

grepJob.setCombinerClass(LongSumReducer.class);
grepJob.setReducerClass(LongSumReducer.class);
1of 2
FileOutputFormat.setOutputPath(grepJob, tempDir);
grepJob.setOutputFormat(SequenceFileOutputFormat.class);
grepJob.setOutputKeyClass(Text.class);
grepJob.setOutputValueClass(LongWritable.class);

JobClient.runJob(grepJob);
} ...


JobConf sortJob = new JobConf(Grep.class);
sortJob.setJobName("grep-sort");

FileInputFormat.setInputPaths(sortJob, tempDir);

Job
sortJob.setInputFormat(SequenceFileInputFormat.class);

sortJob.setMapperClass(InverseMapper.class);
(implicit identity reducer)
// write a single file
sortJob.setNumReduceTasks(1); 2 of 2
FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
// sort by decreasing freq
sortJob.setOutputKeyComparatorClass(
LongWritable.DecreasingComparator.class);

JobClient.runJob(sortJob);
} finally {
FileSystem.get(grepJob).delete(tempDir, true);
}
return 0;
}


The types there...
?, Text

Text, Long

Text, list(Long)

Text, Long

Long, Text


A Simple Join
Id Last First
People

1 Washington George
2 Lincoln Abraham
Key Entry Log

Location Id Time
Dunster 1 11:00am
Dunster 2 11:02am
Kirkland 2 11:08am

You want to track individuals throughout the day.
How would you do this in M/R, if you had to?

Apache Hadoop Talk at QCon

Recomendados

Recomendados

Más contenido relacionado

Similar a Apache Hadoop Talk at QCon

Similar a Apache Hadoop Talk at QCon (20)

Más de Cloudera, Inc.

Más de Cloudera, Inc. (20)

Último

Último (20)

Apache Hadoop Talk at QCon