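3. Background: Large-Scale Data Computing
Rapid advances in electronic information technologies such as communications, networking, storage, and sensors
have caused data volumes to grow enormously: Big Data
Traditional techniques for storing and processing this data have hit a bottleneck
Data is king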
Processing 100 TB datasets
One node: scanning @ 50 MB/s ≈ 35,000 min
1,000 nodes: scanning @ 50 MB/s per node ≈ 35 min
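The scan-time figures above can be reproduced with a short calculation. This is a minimal sketch, assuming binary units (1 TB = 1024 × 1024 MB), a dataset split evenly across nodes, and no coordination overhead; the function name `scan_minutes` is made up for illustration.

```python
def scan_minutes(dataset_tb: float, rate_mb_per_s: float, nodes: int = 1) -> float:
    """Minutes needed to scan a dataset split evenly across `nodes` machines."""
    # Assumption: binary units, 1 TB = 1024 * 1024 MB.
    dataset_mb = dataset_tb * 1024 * 1024
    # Assumption: perfect parallelism, so aggregate throughput scales linearly.
    seconds = dataset_mb / (rate_mb_per_s * nodes)
    return seconds / 60


one_node = scan_minutes(100, 50)             # roughly 35,000 minutes (about 24 days)
cluster = scan_minutes(100, 50, nodes=1000)  # roughly 35 minutes

print(f"1 node:      {one_node:,.0f} min")
print(f"1,000 nodes: {cluster:,.1f} min")
```

The speedup is linear only because the sketch ignores scheduling, network transfer, and node failures; a real cluster would fall somewhat short of it.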
Search Engine
Data Warehousing
Log Processing/User Behavior Analysis
Online/Realtime/Streaming Data Analysis
9. Hadoop
Apache Nutch, 2002
NDFS + MapReduce, 2004
Hadoop, 2006
Apache Hadoop, 2008
Doug Cutting, Chairman of the Apache Software Foundation
http://hadoop.apache.org/
Book:
http://oreilly.com/catalog/9780596521998/index.html
Clone of Google’s GFS and MapReduce
Written in Java
• Also works with other languages
Can process large-scale Web pages
Runs on
• Linux, Windows and more
• Commodity hardware with high failure rate
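To make the MapReduce programming model concrete, here is a minimal single-process sketch of a word count, the usual introductory example. It is plain Python and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and a real job would run the map and reduce steps on many nodes over data stored in the distributed file system.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the framework does this in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big cluster"])
print(counts)
```

Because each map call sees only one record and each reduce call sees only one key, the work can be spread over many machines and a failed task can simply be re-run, which is what makes commodity hardware with a high failure rate usable.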
20. RCFile: Performance Advantages and Adoption
• Compared with SequenceFile, which was the default row-store format in Apache Hive, RCFile can achieve up to 20% space savings without affecting query performance.
• Compared with the column-group technology used in Apache Pig, another big data analysis system, RCFile loads data 23% faster at nearly equal disk space utilization.
• RCFile has become the de facto standard data storage structure in distributed offline data analysis systems such as Apache Hive.