2. “Big data” is data that becomes large enough that it cannot be processed using conventional methods.
~ O’Reilly Radar
3. Hadoop
Apache Hadoop is not a database
Apache Hadoop is not a single program, tool, or application, but a set of projects with a common goal, integrated under the umbrella term Hadoop (Core)
5. Anatomy of a Hadoop Cluster
Distributed Computing (MapReduce)
Distributed storage (HDFS)
Commodity Hardware
6. Hadoop Architecture
Master node: Name Node (HDFS) and Job Tracker (MapReduce)
The MapReduce master is responsible for organizing where computational work should be scheduled on the slave nodes.
The HDFS master is responsible for partitioning the storage across the slave nodes and keeping track of where data is located.
Slave nodes: each runs a Data Node (HDFS) and a Task Tracker (MapReduce)
Let the data remain where it is and move the executable code to its hosting machine.
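The idea of moving code to the data can be illustrated with a minimal scheduling sketch. It is not Hadoop's actual scheduler: the block map, the node names, and the `schedule` function are hypothetical, and the only assumption is that the master knows which nodes hold each block and prefers one of them when placing a task.

```python
# Hypothetical block map, as the HDFS master might track it:
# block id -> nodes that hold a replica of that block.
BLOCK_LOCATIONS = {
    "block-0": ["node-a", "node-b"],
    "block-1": ["node-b", "node-c"],
    "block-2": ["node-a", "node-c"],
}


def schedule(block_locations, free_slots):
    """Assign one task per block, preferring a node that already stores it.

    free_slots maps node name -> number of tasks it can still accept.
    Returns a dict of block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment = {}
    for block, holders in sorted(block_locations.items()):
        # First choice: a node holding the block that still has a free slot.
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fallback: any node with capacity; the data would have to travel.
            remote = [n for n in sorted(slots) if slots[n] > 0]
            if not remote:
                raise RuntimeError("no free task slots in the cluster")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment
```

When every node has a free slot, each task in this sketch lands on a machine that already stores its block, so only the code has to move.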
8. MapReduce
Stated simply, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.
MapReduce uses lists and (key/value) pairs as its main data primitives.
Example: shapes are keys, and their colors are values.
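The map, group, and reduce steps can be sketched in plain Python, with shapes as keys and colors as values. This is an in-memory illustration, not Hadoop code: the record format ("color shape" per line) and the function names `mapper`, `shuffle`, `reducer`, and `run_job` are assumptions made for the example.

```python
from collections import defaultdict


def mapper(record):
    """Filter and transform one raw record into (key, value) pairs."""
    parts = record.strip().lower().split()
    if len(parts) != 2:
        return  # filter out malformed records
    color, shape = parts
    yield shape, color


def shuffle(pairs):
    """Group values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Aggregate all values seen for one key: here, the distinct colors."""
    return key, sorted(set(values))


def run_job(records):
    """Run map, shuffle, and reduce over an in-memory list of records."""
    pairs = (pair for record in records for pair in mapper(record))
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())
```

Calling `run_job(["red square", "blue circle", "green square"])` groups the colors under each shape; a real cluster would run many mappers and reducers in parallel over the same kind of pairs.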
11. Writing Map/Reduce Jobs
We can use multiple languages to write Map/Reduce jobs
Python with Hadoop Streaming
Pros: fast development
Cons: slower than Java, no access to Hadoop API
Java
Pros: fast, access to Hadoop API
Cons: verbose language
Pig
Pros: very small scripts, faster than streaming
Cons: yet another language to learn
Hive
Pros: SQL-like syntax (easy for non-programmers) and relational data model
Cons: slower than Pig, more moving parts
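As a rough illustration of the Python with Hadoop Streaming option, here is a word-count sketch. Hadoop Streaming passes records to the scripts as lines of text and expects tab-separated key/value lines back, with the reducer's input sorted by key. The function names `map_lines` and `reduce_lines` are hypothetical, and the logic is written as plain functions over iterables of lines so it can be tried locally without a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each word.

    The input must already be sorted by key, so that all lines for one
    word are adjacent and groupby sees them as a single group.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

Locally, `reduce_lines(sorted(map_lines(lines)))` imitates the sort that happens between the two phases. In an actual streaming job, each function would be wrapped in a small script that reads `sys.stdin` and prints its output, and the scripts would be passed to the streaming jar through its `-mapper` and `-reducer` options.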
12. Use Cases
Where can we use Hadoop?
Reporting
Granular reports over a large data set spanning 5-7 years
Business analysis
Risk analysis
Predictive analysis
Operational analysis
Root cause analysis
Latency analysis
Better capacity planning (servers, people, bandwidth)
Product features
Recommendations (better than external parties, because of the amount of data)