5. Notable Hadoop users
Yahoo! LinkedIn
Facebook New York Times
Twitter Rackspace
Baidu eHarmony
eBay Powerset
http://wiki.apache.org/hadoop/PoweredBy
10. Inspired by Google GFS and
MapReduce papers circa 2004
Created by Doug Cutting
Originally built to support distribution
for Nutch search engine
Named after a stuffed elephant
12. An open source...
batch/offline oriented...
data & I/O intensive...
general purpose framework for
creating distributed applications that
process huge amounts of data.
13. One definition of "huge"
25,000 machines
More than 10 clusters
3 petabytes of data (compressed, unreplicated)
700+ users
10,000+ jobs/week
14. Hadoop Major Components:
Distributed File System
(HDFS)
Map/Reduce System
18. Hadoop Relational
Scale-out Scale-up(*)
Key/value pairs Tables
Say how to process the data Say what you want (SQL)
Offline/batch Online/real-time
(*) Sharding attempts to horizontally scale RDBMS, but is difficult at best
20. Data is distributed and replicated
over multiple machines
Designed for large files
(where "large" means GB to TB)
Block oriented
Linux-style commands, e.g. ls, cp,
mv, rm, etc.
34. public class SimpleWordCount
extends Configured implements Tool {
public static class MapClass
extends Mapper<Object, Text, Text, IntWritable> {
...
}
public static class Reduce
extends Reducer<Text, IntWritable, Text, IntWritable> {
...
}
public int run(String[] args) throws Exception { ... }
public static void main(String[] args) { ... }
}
35. public static class MapClass
extends Mapper<Object, Text, Text, IntWritable> {
private static final IntWritable ONE = new IntWritable(1);
private Text word = new Text();
@Override
protected void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer st = new StringTokenizer(value.toString());
while (st.hasMoreTokens()) {
word.set(st.nextToken());
context.write(word, ONE);
}
}
}
36. public static class Reduce
extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable count = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
count.set(sum);
context.write(key, count);
}
}
37. public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = new Job(conf, "Counting Words");
job.setJarByClass(SimpleWordCount.class);
job.setMapperClass(MapClass.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
38. public static void main(String[] args) throws Exception {
int result = ToolRunner.run(new Configuration(),
new SimpleWordCount(),
args);
System.exit(result);
}
39. Map/Reduce Data Flow
(Image from Hadoop in Action...great book!)
40. Partitioning
Deciding which keys go to which reducer
Desire even distribution across reducers
Skewed data can overload a single reducer!
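Hadoop's default partitioner assigns keys to reducers by hashing. A plain-Java sketch of that logic (the class and method names here are illustrative, not Hadoop API):

```java
import java.util.List;

public class PartitionDemo {
    // Default HashPartitioner-style logic: hash the key, mask off the
    // sign bit so the result is non-negative, then take the remainder
    // modulo the number of reducers.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : List.of("the", "quick", "brown", "fox")) {
            System.out.println(key + " -> reducer " + partition(key, reducers));
        }
    }
}
```

Hashing spreads keys evenly only when key frequencies are roughly uniform; a handful of very hot keys still land on one reducer, which is exactly the skew problem above.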
43. Shuffling WordCount data
# k/v pairs shuffled
without combiner ("the", 1) 1000
with combiner ("the", 1000) 1
(looking at one mapper that sees the word "the" 1000 times)
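The shuffle saving in that table can be simulated in plain Java (this is a sketch of the idea, not the Hadoop API; in a real job you would register the word-count reducer as the combiner):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerDemo {
    // One mapper that sees "the" 1000 times emits ("the", 1) 1000 times.
    static List<Map.Entry<String, Integer>> mapperOutput() {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            pairs.add(new SimpleEntry<>("the", 1));
        }
        return pairs;
    }

    // A combiner pre-sums pairs on the map side, so only one
    // ("the", 1000) pair needs to cross the network in the shuffle.
    static List<Map.Entry<String, Integer>> combine(
            List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }

    public static void main(String[] args) {
        System.out.println("without combiner: " + mapperOutput().size());
        System.out.println("with combiner:    " + combine(mapperOutput()).size());
    }
}
```

This works for word count because summing is associative and commutative, so applying the reduce function early on each mapper's output does not change the final totals.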
44. Advanced Map/Reduce
Hadoop Streaming
Chaining Map/Reduce jobs
Joining data
Bloom filters
67. Speculative execution (on by default)
Use a Combiner
Reduce amount of input data
JVM Re-use (be careful)
Refactor code/algorithms
Data compression
71. Simulate structure for data stored in Hadoop
Query language analogous to SQL (Hive QL)
Translates queries into Map/Reduce job(s)...
...so not for real-time processing!
74. create external table patent_citations (citing string, cited string)
row format delimited fields terminated by ','
stored as textfile
location '/user/sleberkn/nber-patent/tables/patent_citation';
create table citation_histogram (num_citations int, count int)
stored as sequencefile;
75. insert overwrite table citation_histogram
select num_citations, count(num_citations) from
(select cited, count(cited) as num_citations
from patent_citations group by cited) citation_counts
group by num_citations
order by num_citations;