Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
Apache Hadoop Talk at QCon
1. Apache Hadoop,
Big Data, and You
Philip Zeyliger
philip@cloudera.com
@philz42 @cloudera
November 18, 2009
Wednesday, November 18, 2009
2. Hi there!
Software Engineer
Worked at
Wednesday, November 18, 2009
3. I work on stuff...
Wednesday, November 18, 2009
4. Outline
Why should you care? (Intro)
Challenging yesteryear’s assumptions
The MapReduce Model
HDFS, Hadoop Map/Reduce
The Hadoop Ecosystem
Questions
Wednesday, November 18, 2009
9. “I keep saying that the sexy job
in the next 10 years will be
statisticians, and I’m not kidding.”
Hal Varian
(Google’s chief economist)
Wednesday, November 18, 2009
10. So, what’s Hadoop?
The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
Wednesday, November 18, 2009
11. Apache Hadoop is an open-source
system (written in Java!) to store and
process
gobs of data
across many commodity computers.
The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
Wednesday, November 18, 2009
12. Two Big
Components
HDFS Map/Reduce
Self-healing high-
bandwidth Fault-tolerant
clustered storage. distributed computing.
Wednesday, November 18, 2009
13. Challenging some of
yesteryear’s
assumptions...
Wednesday, November 18, 2009
14. Assumption 1: Machines can be reliable...
Image: MadMan the Mighty CC BY-NC-SA
Wednesday, November 18, 2009
15. Hadoop Goal:
Separate distributed
system fault-tolerance
code from application logic.
Systems
Programmers Statisticians
Wednesday, November 18, 2009
16. Assumption 2: Machines have identities...
Image:Laughing Squid CC BY-
NC-SA
Wednesday, November 18, 2009
17. Hadoop Goal:
Users should interact with
clusters, not machines.
Wednesday, November 18, 2009
18. Assumption 3: A data set fits on one machine...
Image: Matthew J. Stinson CC-
BY-NC
Wednesday, November 18, 2009
19. Hadoop Goal:
System should scale
linearly (or better) with
data size.
Wednesday, November 18, 2009
20. The M/R
Programming Model
Wednesday, November 18, 2009
21. You specify map()
and reduce()
functions.
The framework does
the rest.
Wednesday, November 18, 2009
22. map()
map: K₁,V₁→list K₂,V₂
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
/**
* Called once for each key/value pair in the input split. Most applications
* should override this, but the default is the identity function.
*/
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException,
InterruptedException {
// context.write() can be called many times
// this is default “identity mapper” implementation
context.write((KEYOUT) key, (VALUEOUT) value);
}
}
Wednesday, November 18, 2009
23. (the shuffle)
map output is assigned to a “reducer”
map output is sorted by key
Wednesday, November 18, 2009
24. reduce()
K₂, iter(V₂)→list(K₃,V₃)
public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
/**
* This method is called once for each key. Most applications will define
* their reduce class by overriding this method. The default implementation
* is an identity function.
*/
@SuppressWarnings("unchecked")
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
) throws IOException, InterruptedException {
for(VALUEIN value: values) {
context.write((KEYOUT) key, (VALUEOUT) value);
}
}
}
Wednesday, November 18, 2009
26. Some samples...
Build an inverted index.
Summarize data grouped by a key.
Build map tiles from geographic data.
OCRing many images.
Learning ML models. (e.g., Naive Bayes
for text classification)
Augment traditional BI/DW
technologies (by archiving raw data).
Wednesday, November 18, 2009
27. There’s more than the
Java API
Streaming Pig Hive
perl, python, Higher-level SQL interface.
ruby, whatever. dataflow language
for easy ad-hoc Great for
stdin/stdout/ analysis. analysts.
stderr
Developed at Developed at
Yahoo! Facebook
Friday,
@10:10
Wednesday, November 18, 2009
28. A typical look...
Commodity servers (8-core, 8-16GB
RAM, 4-12 TB, 2x1 gE NIC)
2-level network architecture
20-40 nodes per rack
Wednesday, November 18, 2009
29. The cast...
Starring...
NameNode (metadata server and database)
SecondaryNameNode (assistant to NameNode)
JobTracker (scheduler)
The Chorus…
DataNodes TaskTrackers
(block storage) (task execution)
Thanks to Zak Stone for earmuff image!
Wednesday, November 18, 2009
30. HDFS
3x64MB file, 3 rep
4x64MB file, 3 rep
Namenode
Small file, 7 rep
Datanodes
Wednesday, November 18, 2009
One Rack A Different Rack
38. Ok, fine, what next?
Get Hadoop!
http://hadoop.apache.org/
Cloudera Distribution for Hadoop
Try it out! (Locally, or on EC2)
Wednesday, November 18, 2009
39. Just one slide...
Software: Cloudera Distribution for
Hadoop, Cloudera Desktop, more…
Training and certification…
Free on-line training materials
(including video)
Support & Professional Services
@cloudera, blog, etc.
Wednesday, November 18, 2009
40. Questions?
philip@cloudera.com
Wednesday, November 18, 2009
46. JobConf sortJob = new JobConf(Grep.class);
sortJob.setJobName("grep-sort");
FileInputFormat.setInputPaths(sortJob, tempDir);
Job
sortJob.setInputFormat(SequenceFileInputFormat.class);
sortJob.setMapperClass(InverseMapper.class);
(implicit identity reducer)
// write a single file
sortJob.setNumReduceTasks(1); 2 of 2
FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
// sort by decreasing freq
sortJob.setOutputKeyComparatorClass(
LongWritable.DecreasingComparator.class);
JobClient.runJob(sortJob);
} finally {
FileSystem.get(grepJob).delete(tempDir, true);
}
return 0;
}
Wednesday, November 18, 2009
47. The types there...
?, Text
Text, Long
Text, list(Long)
Text, Long
Long, Text
Wednesday, November 18, 2009
48. A Simple Join
Id Last First
People
1 Washington George
2 Lincoln Abraham
Key Entry Log
Location Id Time
Dunster 1 11:00am
Dunster 2 11:02am
Kirkland 2 11:08am
You want to track individuals throughout the day.
How would you do this in M/R, if you had to?
Wednesday, November 18, 2009