5. HDFS
Hadoop Distributed File System
Yes, it is a file system written in Java,
and you can do normal file system operations
like [ls, mkdir, ...].
It works best with large files: HDFS splits each file into
blocks of 128 MB (the size can be configured)
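The block splitting above is easy to reason about with a little arithmetic. A minimal sketch (the function name is illustrative, not a Hadoop API; 128 MB is the default block size mentioned on the slide):

```javascript
// Sketch: how many HDFS blocks a file occupies.
// blockSizeMB is configurable in HDFS (dfs.blocksize); 128 MB is the default.
function blockCount(fileSizeMB, blockSizeMB) {
  return Math.ceil(fileSizeMB / blockSizeMB);
}

console.log(blockCount(1024, 128)); // a 1 GB file -> 8 blocks
console.log(blockCount(130, 128));  // 130 MB -> 2 blocks, the last one only partly full
```

Note that the last block of a file is usually smaller than the block size; HDFS does not pad it out.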
6. HDFS
HDFS will keep 3 copies of each block.
The NameNode tracks which blocks live on which DataNodes.
[Diagram: a NameNode (NN) coordinating DataNodes DN1-DN9, with a table of each block's replica locations, e.g. DN1, DN4, DN7.]
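The NameNode's block table can be sketched as a toy in-memory mapping. This is not real HDFS code (the real placement policy is rack-aware); here replicas are simply picked round-robin from the available DataNodes, just to show the shape of the table:

```javascript
// Toy sketch of a NameNode-style table: each block mapped to 3 DataNodes,
// chosen round-robin. Real HDFS placement is rack-aware, not round-robin.
function placeBlocks(numBlocks, datanodes, replication) {
  var table = {};
  for (var b = 0; b < numBlocks; b++) {
    var replicas = [];
    for (var r = 0; r < replication; r++) {
      replicas.push(datanodes[(b * replication + r) % datanodes.length]);
    }
    table['block' + b] = replicas;
  }
  return table;
}

var nodes = ['DN1', 'DN2', 'DN3', 'DN4', 'DN5', 'DN6', 'DN7', 'DN8', 'DN9'];
var table = placeBlocks(3, nodes, 3);
console.log(table);
// { block0: ['DN1','DN2','DN3'], block1: ['DN4','DN5','DN6'], block2: ['DN7','DN8','DN9'] }
```

If any one DataNode dies, every block it held still has two live replicas, and the NameNode schedules re-replication to get back to 3.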
7. Map-Reduce
• Write a mapper that takes a key and a value and
emits zero or more new keys and values
• Write a reducer that takes a key and all the values for
that key and emits zero or more new keys and values
8. Map-Reduce JS example
var map = function (key, value, context) {
  // Split the line into words and emit (word, 1) for each one.
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  // Sum all the counts emitted for this word.
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};
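To see the mapper and reducer actually produce a word count, the framework's shuffle step can be simulated in plain JavaScript. A minimal sketch (`runJob` and its `context` objects are stand-ins for the framework, not Hadoop's API):

```javascript
// Minimal in-memory driver: run the mapper over each record, group the
// emitted pairs by key (the "shuffle"), then run the reducer once per key.
function runJob(records, map, reduce) {
  var grouped = {};
  var mapCtx = {
    write: function (key, value) {
      (grouped[key] = grouped[key] || []).push(value);
    }
  };
  records.forEach(function (record, i) {
    map(i, record, mapCtx);
  });

  var output = {};
  var reduceCtx = {
    write: function (key, value) { output[key] = value; }
  };
  Object.keys(grouped).forEach(function (key) {
    var vals = grouped[key];
    var i = 0;
    // Wrap the array in the hasNext/next iterator the reducer expects.
    var values = {
      hasNext: function () { return i < vals.length; },
      next: function () { return vals[i++]; }
    };
    reduce(key, values, reduceCtx);
  });
  return output;
}

// Word count with a mapper/reducer shaped like the ones on the slide.
var result = runJob(['Hello world hello'],
  function (key, value, context) {
    value.split(/[^a-zA-Z]/).forEach(function (w) {
      if (w !== "") { context.write(w.toLowerCase(), 1); }
    });
  },
  function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) { sum += parseInt(values.next(), 10); }
    context.write(key, sum);
  });

console.log(result); // { hello: 2, world: 1 }
```

In real Hadoop the grouping happens across machines, with map output partitioned, sorted, and shipped to the reducers; the data flow is the same.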
11. Does Hadoop solve all my DATA
problems, or is there something
else out there?
12. • Pig: high-level MapReduce language
• Hive: SQL-like high-level MapReduce language
• HBase: realtime processing (based on Google's
BigTable)
• Accumulo: NSA fork of HBase
• Avro: data serialization
• ZooKeeper: low-level coordination
• HCatalog: storage management and interoperability
between all systems
• Oozie: job scheduler
• Flume: log and data aggregation
• Whirr: automated cloud clusters on EC2, Rackspace, etc.
• Sqoop: relational data importer
• MRUnit: unit testing for MapReduce jobs
• Mahout: machine learning libraries
• Bigtop: packaging and interoperability
• Crunch: MapReduce pipelines in Java and Scala
• Giraph: processing on huge distributed graphs