2. • Hadoop Core, our flagship sub-project,
provides a distributed filesystem (HDFS) and
support for the MapReduce distributed
computing model.
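The MapReduce model can be sketched in a few lines of plain Python — a toy illustration of the map, shuffle, and reduce phases, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input record.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 2
```

In a real cluster the map and reduce functions run in parallel on many TaskTracker nodes, and the shuffle moves data over the network; the data flow, however, is the same.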
• Pig is a high-level data-flow language and
execution framework for parallel computation.
It is built on top of Hadoop Core.
3. ZooKeeper
• ZooKeeper is a highly available and reliable
coordination system. Distributed applications
use ZooKeeper to store and mediate updates
for critical shared state.
4. JobTracker
• JobTracker: The JobTracker provides command
and control for job management. It supplies
the primary user interface to a MapReduce
cluster. It also handles the distribution and
management of tasks. There is one instance of
this server running on a cluster. The machine
running the JobTracker server is the
MapReduce master.
5. TaskTracker
• TaskTracker: The TaskTracker provides
execution services for the submitted jobs.
Each TaskTracker manages the execution of
tasks on an individual compute node in the
MapReduce cluster. The JobTracker manages
all of the TaskTracker processes. There is one
instance of this server per compute node.
6. NameNode
• NameNode: The NameNode provides metadata
storage for the shared file system. The
NameNode supplies the primary user interface to
the HDFS. It also manages all of the metadata for
the HDFS. There is one instance of this server
running on a cluster. The metadata includes such
critical information as the file directory structure
and which DataNodes have copies of the data
blocks that contain each file’s data. The machine
running the NameNode server process is the
HDFS master.
7. Secondary NameNode
• Secondary NameNode: The secondary
NameNode performs metadata checkpointing for
the file system. It periodically merges the
metadata change history, the edit log, into the
NameNode's file system image. The resulting
checkpoint also serves as a recent (though not
real-time) copy of the metadata; it is not a hot
standby for the NameNode. There is at least one
instance of this server running on a cluster,
ideally on a separate physical machine from the
one running the NameNode.
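The checkpoint operation can be illustrated with a toy model in Python (the dictionaries and tuples here are hypothetical stand-ins, not the real HDFS on-disk formats):

```python
# Toy model: the fsimage is a snapshot of the namespace; the edit
# log is an ordered list of changes applied since that snapshot.
fsimage = {"/data/a.txt": {"blocks": ["blk_1"]}}
edit_log = [
    ("create", "/data/b.txt", {"blocks": ["blk_2"]}),
    ("delete", "/data/a.txt", None),
]

def checkpoint(fsimage, edit_log):
    # Replay the edit log against the old image to produce a new,
    # compacted image; afterwards the edit log can be truncated.
    image = dict(fsimage)
    for op, path, meta in edit_log:
        if op == "create":
            image[path] = meta
        elif op == "delete":
            image.pop(path, None)
    return image

new_image = checkpoint(fsimage, edit_log)
print(sorted(new_image))  # ['/data/b.txt']
```

Without this periodic merge, the edit log grows without bound and NameNode restarts (which replay the log) become very slow.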
8. Design of HDFS
• HDFS is designed for:
– Very large files
– Streaming data access
– Commodity hardware
• Not a good fit for:
– Low-latency data access
– Lots of small files
– Multiple writers, arbitrary file modifications
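The "lots of small files" limitation follows from the NameNode keeping every file's metadata in RAM. A back-of-the-envelope estimate, assuming the commonly cited figure of roughly 150 bytes of heap per namespace object:

```python
BYTES_PER_OBJECT = 150  # rough, commonly cited per-file/per-block heap cost

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # Each file, plus each of its blocks, is one in-memory object
    # on the NameNode.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 100 million small files vs. a few hundred large multi-block files:
small = namenode_heap_bytes(100_000_000)                # ~30 GB of heap
large = namenode_heap_bytes(800, blocks_per_file=2000)  # ~240 MB of heap
print(small // 10**9, large // 10**6)
```

The numbers are illustrative, but the point holds: NameNode memory scales with the number of files and blocks, not with the volume of data, so millions of tiny files exhaust it long before the disks fill up.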
24. could only be replicated
• java.io.IOException: could only be replicated
to 0 nodes, instead of 1.
• Fix:
– The XML configuration is incorrect; make sure
the addresses in each slave's mapred-site.xml and
core-site.xml match the master's.
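For example, with a master host named master (a placeholder hostname; the port numbers are also just common example values), every node's core-site.xml and mapred-site.xml should point at the same master address:

```xml
<!-- core-site.xml: must name the master on every node -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>

<!-- mapred-site.xml: must name the master on every node -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>
```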
27. error in shuffle in fetcher
• org.apache.hadoop.mapreduce.task.reduce.Sh
uffle$ShuffleError: error in shuffle in fetcher
• Fix:
– The problem lies in the hosts file
configuration; add hostname-to-IP mappings for
the other nodes to /etc/hosts on every node.
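For example, /etc/hosts on every node might contain entries like the following (the IP addresses and hostnames are placeholders):

```
192.168.1.10  master
192.168.1.11  slave1
192.168.1.12  slave2
```

The same mappings must appear on all nodes so that the shuffle fetchers can resolve every TaskTracker's hostname.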