2012 apache hadoop_map_reduce_windows_azure

Apache Hadoop, MapReduce
&
Windows Azure

Guðmundur Jón Halldórsson
Five Degrees
July 2012

Web crawler! "No, this isn't about that"

What is Hadoop?

System for processing
mind-bogglingly large
amounts of data

Hadoop

Map-Reduce = Computation
HDFS = Storage

HDFS
Hadoop Distributed File System

Yes, it is a file system written in Java,
and you can do normal file system operations
like ls, mkdir, ...

Works best with large files. HDFS splits each file into
blocks of 128 MB (the block size can be configured)
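
To make the block splitting concrete, here is a minimal sketch of the arithmetic, assuming the 128 MB block size quoted above. The helper name splitIntoBlocks is hypothetical and is not part of any Hadoop API; it only shows how many blocks a file of a given size would occupy and that the last block holds just the remainder.

```javascript
// Assumed block size, taken from the slide: 128 MB (configurable in HDFS).
var BLOCK_SIZE_BYTES = 128 * 1024 * 1024;

// Hypothetical helper (not a Hadoop API): returns the length in bytes
// of every block a file of the given size would be split into.
function splitIntoBlocks(fileSizeBytes, blockSizeBytes) {
    var blocks = [];
    var remaining = fileSizeBytes;
    while (remaining > 0) {
        var length = Math.min(remaining, blockSizeBytes);
        blocks.push(length);
        remaining -= length;
    }
    return blocks;
}

var MB = 1024 * 1024;
var blocks = splitIntoBlocks(300 * MB, BLOCK_SIZE_BYTES);

// A 300 MB file needs two full blocks plus one 44 MB block.
console.log(blocks.length);                 // 3
console.log(blocks[blocks.length - 1] / MB); // 44
```

This also hints at why large files suit HDFS: each block is tracked separately, so many tiny files mean a lot of bookkeeping for very little data.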

HDFS
HDFS will keep 3 copies of each block
The NameNode tracks blocks and datanodes

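[Diagram: a NameNode (NN) and a cluster of DataNodes (DN1 to DN9). The NameNode's table maps each block to the three DataNodes that hold its copies, for example: DN1, DN4, DN7; DN3, DN5, DN8; DN3, DN4, DN5.]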
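
The bookkeeping in the diagram can be sketched as a plain lookup table. This is only an illustration, assuming the three block-to-DataNode rows shown on the slide; the block names and the function underReplicatedAfterFailure are hypothetical and do not correspond to real NameNode code. It shows which blocks drop below three copies when one DataNode is lost.

```javascript
// Replication factor from the slide: 3 copies of each block.
var REPLICATION = 3;

// Hypothetical in-memory stand-in for the NameNode's block table.
// The DataNode triplets are the ones shown in the diagram;
// the block ids are made up for illustration.
var blockMap = {
    "block-1": ["DN1", "DN4", "DN7"],
    "block-2": ["DN3", "DN5", "DN8"],
    "block-3": ["DN3", "DN4", "DN5"]
};

// Lists every block that has fewer than REPLICATION live copies
// once the given DataNode is gone.
function underReplicatedAfterFailure(blockMap, failedNode) {
    var result = [];
    Object.keys(blockMap).forEach(function (blockId) {
        var live = blockMap[blockId].filter(function (dn) {
            return dn !== failedNode;
        });
        if (live.length < REPLICATION) {
            result.push({ block: blockId, liveReplicas: live });
        }
    });
    return result;
}

// Losing DN5 leaves block-2 and block-3 with two copies each.
console.log(underReplicatedAfterFailure(blockMap, "DN5"));
```

Because every block still has two live copies after a single failure, the data stays readable while the missing copies are recreated on other DataNodes.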

Map-Reduce
• Write a mapper that takes a key and a value and
emits zero or more new keys and values
• Write a reducer that takes all the values of one key and
emits zero or more new keys and values

Map-Reduce JS example
// Mapper: split each line into words and emit (word, 1).
var map = function ( key, value, context ) {
    var words = value.split(/[^a-zA-Z]/);
    for ( var i = 0; i < words.length; i++ ) {
        if ( words[i] !== "" ) {
            context.write( words[i].toLowerCase(), 1 );
        }
    }
};

// Reducer: add up all the counts emitted for one word.
var reduce = function ( key, values, context ) {
    var sum = 0;
    while ( values.hasNext() ) {
        sum += parseInt( values.next(), 10 );
    }
    context.write( key, sum );
};
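
What happens between map and reduce is the grouping of values by key. The sketch below runs a word-count mapper and reducer of the same shape in a single process. The runLocally driver and its context and iterator objects are hypothetical stand-ins written for this illustration; they are not the Hadoop runtime, which distributes the same steps across many machines.

```javascript
// Word-count mapper and reducer in the same shape as the slide's example.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical local driver (not a Hadoop API): map, group by key, reduce.
function runLocally(lines) {
    // Map phase: collect every emitted value under its key.
    var grouped = Object.create(null);
    lines.forEach(function (line, offset) {
        map(offset, line, {
            write: function (k, v) {
                (grouped[k] = grouped[k] || []).push(v);
            }
        });
    });

    // Reduce phase: hand each key and an iterator over its values to reduce.
    var output = Object.create(null);
    Object.keys(grouped).sort().forEach(function (k) {
        var i = 0;
        var iterator = {
            hasNext: function () { return i < grouped[k].length; },
            next: function () { return grouped[k][i++]; }
        };
        reduce(k, iterator, {
            write: function (outKey, outValue) { output[outKey] = outValue; }
        });
    });
    return output;
}

var counts = runLocally(["Hadoop stores data", "Hadoop processes data"]);
console.log(counts.hadoop, counts.data, counts.stores, counts.processes); // 2 2 1 1
```

Running the pair locally like this is a cheap way to check the logic on a few lines of text before submitting a real job.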

Data Systems and Their Timeframes

Does Hadoop solve all my DATA
problems, or is there something
else out there?

• Pig: High-level MapReduce language
• Hive: SQL-like high-level MapReduce language
• HBase: Real-time read/write data access (based on Google
Bigtable)
• Accumulo: Bigtable-style store originally developed by the NSA
• Avro: Data serialization
• ZooKeeper: Low-level coordination
• HCatalog: Storage management and interoperability
between all systems
• Oozie: Job scheduler
• Flume: Log and data aggregation
• Whirr: Automated cloud clusters on EC2, Rackspace, etc.
• Sqoop: Relational data importer
• MRUnit: Unit testing of MapReduce jobs
• Mahout: Machine learning libraries
• Bigtop: Interoperability
• Crunch: MapReduce pipelines in Java and Scala
• Giraph: Graph processing on huge distributed graphs
