1. Hadoop Presentation
2012
Presenter : Pham Thai Hoa
Email : thaihoabo@gmail.com
Web : http://mobion.com/hoa
4/14/2012 Pham Thai Hoa
2. Topics
Introduction to Hadoop
Introduction to Hive
Introduction to Logging
Using Hadoop at Mobion
Warehouse at Mobion
Q&A
3. What is Hadoop
It’s a framework for distributed processing
Inspired by Google’s architecture: MapReduce and GFS
A top-level Apache project
Hadoop is open source
Hadoop has two important elements:
+ MapReduce core
+ Hadoop Distributed File System (HDFS)
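To make the MapReduce idea concrete, here is a minimal in-memory sketch of the three steps (map, group by key, reduce) using word count. It is plain Python, not Hadoop code; the function names are made up for illustration, and a real cluster would spread these steps across many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def word_count_reducer(word, counts):
    """Add up the 1s emitted for a word."""
    return sum(counts)


def run_word_count(lines):
    pairs = map_phase(lines, word_count_mapper)
    groups = shuffle_phase(pairs)
    return reduce_phase(groups, word_count_reducer)


if __name__ == "__main__":
    print(run_word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The mapper and reducer are the only parts a user writes; grouping by key is the framework's job.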
4. Why use Hadoop
Fault-tolerant hardware is expensive
Hadoop is designed to run on cheap
commodity hardware
It automatically handles data
replication and node failure
It does the hard work – you can focus
on processing data
It supports three modes: Local, Pseudo-Distributed, and Fully-Distributed
6. Who uses Hadoop
Amazon: builds product search indices using the streaming API and pre-existing C++, Perl, and Python tools
Yahoo: more than 100,000 CPUs in >40,000 computers running Hadoop
Facebook: uses Hadoop to store copies of internal log and dimension data sources, and uses it as a source for reporting/analytics and machine learning
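The streaming API mentioned above lets any program that reads lines from standard input and writes lines to standard output act as a mapper or reducer. Below is a hedged sketch of what such a Python tool could look like for word count, assuming the default tab-separated key/value text format. The `map`/`reduce` command-line argument is a convention invented for this example, not part of Hadoop.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per key. Input must already be sorted by key,
    which is how the framework delivers mapper output to a reducer."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: run as `script.py map` or `script.py reduce`.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for out_line in step(sys.stdin):
        print(out_line)
```

Locally, the pipeline can be imitated by sorting the mapper output before feeding it to the reducer, which is a handy way to test such scripts before submitting a job.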
7. What is Hive
Hive is a data warehouse system for Hadoop
Uses MapReduce for execution
Uses HDFS for storage
Stores metadata in an RDBMS
Scalability and performance
Interoperability
Uses a SQL-like language called HiveQL
9. Hive Data Model
Tables
+ Typed columns (int, float, string, …)
+ Also array/map/struct for JSON-like data
Partitions
+ e.g., to range-partition tables by date
Buckets
+ Hash partitions within ranges (useful for sampling, join optimization)
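As a rough illustration of how partitions and buckets decide where a row lives, the sketch below computes a partition directory and a bucket number for one row. The table path, the `dt` partition key, and the `user_id` bucketing column are all hypothetical. The bucket rule shown (value modulo bucket count) matches how Hive hashes a non-negative int column; other column types use different hash functions.

```python
from datetime import date


def partition_dir(table_dir, dt):
    """Build the directory for one date partition, using key=value naming."""
    return f"{table_dir}/dt={dt.isoformat()}"


def bucket_for(user_id, num_buckets):
    """Pick a bucket for a non-negative integer key: hash modulo bucket count.

    For an int column the hash is the value itself, so this is a plain modulo.
    """
    return user_id % num_buckets


def place_row(table_dir, dt, user_id, num_buckets):
    """Return (partition directory, bucket number) for one row."""
    return partition_dir(table_dir, dt), bucket_for(user_id, num_buckets)


if __name__ == "__main__":
    # Hypothetical table location and values, for illustration only.
    print(place_row("/warehouse/page_views", date(2012, 4, 14), 42, 8))
```

A query filtered on the date only needs to read one partition directory, and sampling or joining on `user_id` can work bucket by bucket instead of scanning everything.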
10. Hive Metastore
Database: namespace containing a set of tables
Holds table/partition definitions (column types, mappings to HDFS directories)
Statistics
Implemented with the DataNucleus ORM
Runs on Derby, MySQL, and many other relational databases
11. Introduction to Logging
A logging system has three broad components:
+ Client Code Interface
+ Distribution System
+ Do Something Usefullizer (whatever puts the collected logs to use)
Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and to be robust to network and node failures
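The three components can be pictured with a toy in-memory model: a client call that takes a category and a message, a forwarder that buffers entries while the next hop is unreachable, and an aggregator that stores what arrives. This is only a sketch of the store-and-forward idea, not Scribe's API; the class names and the `available` flag are invented for illustration.

```python
from collections import defaultdict, deque


class Aggregator:
    """Stand-in for the consuming end: stores messages per category."""

    def __init__(self):
        self.by_category = defaultdict(list)
        self.available = True  # toggled to simulate a network or node failure

    def receive(self, entries):
        if not self.available:
            raise ConnectionError("aggregator unreachable")
        for category, message in entries:
            self.by_category[category].append(message)


class Forwarder:
    """Stand-in for the distribution layer: buffers while downstream is down."""

    def __init__(self, downstream):
        self.downstream = downstream
        self.buffer = deque()

    def log(self, category, message):
        """Client code interface: one call with a category and a message."""
        self.buffer.append((category, message))
        self.flush()

    def flush(self):
        """Try to deliver everything buffered; keep it all on failure."""
        if not self.buffer:
            return
        try:
            self.downstream.receive(list(self.buffer))
        except ConnectionError:
            return  # entries stay buffered and are retried on the next flush
        self.buffer.clear()
```

The point of the buffering step is that application code keeps calling `log` during an outage and nothing is lost once the aggregator comes back.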
12. Why use Scribe
Scalability and performance
Event Notification library
Thrift framework
Hadoop is optional
Client using
Distributed scribe system
Over 1 million messages per second for logging
Hierarchical stores
13. Warehouse at Mobion
Log Collector
Log/Data Transformer
Data Analyzer
Web Reporter
Log definition
Log integration (into applications)
Log/data analysis
Report development (API, Mobion, Music …)
14. Warehouse at Mobion
Data mining
Music Recommendation
Spam Detection
Application performance
Export data and import it into MySQL for web reports
Analytics system
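The export step could look roughly like the sketch below: aggregate log lines into counts, then load the counts into a SQL table that a web report can query. Everything here is an assumption made for illustration: the `user_id<TAB>song_id` log format, the `song_plays` table, and the use of Python's built-in sqlite3 module as a stand-in for MySQL so the example runs without a database server.

```python
import sqlite3
from collections import Counter


def count_plays(log_lines):
    """Count plays per song from 'user_id<TAB>song_id' lines (hypothetical format)."""
    counts = Counter()
    for line in log_lines:
        _user_id, song_id = line.rstrip("\n").split("\t")
        counts[song_id] += 1
    return counts


def export_counts(conn, counts):
    """Write aggregated counts into a report table (sqlite3 stands in for MySQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS song_plays "
        "(song_id TEXT PRIMARY KEY, plays INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO song_plays (song_id, plays) VALUES (?, ?)",
        list(counts.items()),
    )
    conn.commit()


def top_songs(conn, limit):
    """Query the report table the way a web report page might."""
    rows = conn.execute(
        "SELECT song_id, plays FROM song_plays "
        "ORDER BY plays DESC, song_id LIMIT ?",
        (limit,),
    )
    return rows.fetchall()
```

Keeping only small, pre-aggregated tables in the relational database means the web report never has to touch the raw logs.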
15. Q&A
Why use Hadoop?
Why use Hive?
Why do we need a logging system?
What is the warehouse system architecture?
Do we use these systems for voting, chat, messages, and feeds?
How can we use them for recommendations and suggestions?
16. Links
http://facebook.com
http://highscalability.com/product-scribe-facebooks-scalable-logging-system
http://hadoop.apache.org/
http://hive.apache.org/
http://wiki.apache.org/hadoop/PoweredBy
http://www.apache.org/foundation/thanks.html