1. Curriculum Seminar (NCS-654) Report
On
“Hadoop”
B.Tech CS (IIIrd Year)
Session 2015-2016
Submitted to
Mr. Amit Karmakar
Department of Computer Science & Engineering
Shri Ram Murti Smarak College Of Engineering & Technology
Dr. A.P.J. Abdul Kalam Technical University (APJAKTU)
April, 2016
Submitted By
Himanshu Soni (1301410040)
2. ACKNOWLEDGEMENT
It gives us great pleasure to present this report of the B. Tech
Curriculum Seminar undertaken during B. Tech 2015-16.
We owe a special debt of gratitude to the Department of Computer
Science and Technology, SRMSCET, Bareilly, for conducting the Big
Data and Hadoop workshop, and for its constant support and
guidance throughout the course of our work. The sincerity,
thoroughness and perseverance of our guides have been a constant
source of inspiration for us; it is only their cognizant efforts
that our endeavours have seen the light of day.
We also take the opportunity to acknowledge the contribution of
Mr. L. S. Maurya, Head, Department of Computer Science and
Technology, SRMSCET, Bareilly, for his full support and
assistance in conducting the seminar. Last but not least, we
acknowledge our friends for their contribution to the completion
of the project.
3. OUTLINES
• 1) Need for New Technology
• 2) History of Origin
• 3) Hadoop & Its Components
• 4) Architecture
• 5) List of Analyses Possible Using Hadoop
• 6) Hadoop Ecosystem
• 7) RDBMS vs MapReduce
• 8) Disadvantages
• 9) Hadoop Supported OS
• 10) Hadoop Alternatives
• 11) Conclusion
• 12) References
4. NEED FOR NEW TECHNOLOGY
Size of Data:
We live in the data age. It’s not easy to measure the total volume
of data stored electronically, but an IDC estimate put the size of
the “digital universe” at 0.18 zettabytes in 2006 and forecast a
tenfold growth to 1.8 zettabytes by 2011. A zettabyte is 10²¹
bytes, or equivalently one thousand exabytes, one million
petabytes, or one billion terabytes. That’s roughly the same order
of magnitude as one disk drive for every person in the world.
This flood of data is coming from many sources. Consider the
following:
• The New York Stock Exchange generates about one terabyte of new
trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one
petabyte of storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of
data.
5. • The Internet Archive stores around 2 petabytes of data, and is
growing at a rate of 20 terabytes per month.
Variety of Data:
1. Structured data (RDBMS)
2. Semi-structured (XML and JSON)
3. Unstructured (videos, logs, audio, binary data, etc.)
Speed:
The problem is simple: while the storage capacities of hard drives
have increased massively over the years, access speeds—the rate
at which data can be read from drives—have not kept up. One
typical drive from 1990 could store 1,370 MB of data and had a
transfer speed of 4.4 MB/s, so you could read all the data from a
full drive in around five minutes. Over 20 years later, one terabyte
drives are the norm, but the transfer speed is around 100 MB/s, so
it takes more than two and a half hours to read all the data off the
disk.
This is a long time to read all data on a single drive—and writing
is even slower. The obvious way to reduce the time is to read from
multiple disks at once. Imagine if we had 100 drives, each holding
one hundredth of the data. Working in parallel, we could read the
data in under two minutes.
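The arithmetic behind these figures can be checked with a few lines of Python (the function name and drive figures below simply restate the numbers quoted above):

```python
# Sanity-check the drive-read times quoted above: reading a full
# drive serially versus striping the data over 100 drives.

def read_time_seconds(capacity_mb, transfer_mb_per_s, drives=1):
    """Time to read all data when it is split evenly across `drives`."""
    return capacity_mb / (transfer_mb_per_s * drives)

# One 1990-era drive: 1,370 MB at 4.4 MB/s -> about five minutes.
t_1990 = read_time_seconds(1370, 4.4)

# One modern 1 TB drive at 100 MB/s -> close to three hours.
t_modern = read_time_seconds(1_000_000, 100)

# The same terabyte striped over 100 drives read in parallel.
t_parallel = read_time_seconds(1_000_000, 100, drives=100)

print(t_1990 / 60)      # ~5.2 minutes
print(t_modern / 3600)  # ~2.8 hours
print(t_parallel / 60)  # ~1.7 minutes
```

This is exactly the argument for Hadoop: parallelizing reads across many cheap disks turns hours into minutes.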
6. HISTORY OF ORIGIN
2004—Initial versions of what is now Hadoop Distributed
Filesystem and Map-Reduce implemented by Doug Cutting and
Mike Cafarella.
December 2005—Nutch ported to the new framework. Hadoop
runs reliably on 20 nodes.
January 2006—Doug Cutting joins Yahoo!.
February 2006—Apache Hadoop project officially started to
support the standalone development of MapReduce and HDFS.
February 2006—Adoption of Hadoop by Yahoo! Grid team.
April 2006—Sort benchmark (10 GB/node) run on 188 nodes in
47.9 hours.
May 2006—Yahoo! set up a Hadoop research cluster—300
nodes.
7. May 2006—Sort benchmark run on 500 nodes in 42 hours (better
hardware than April benchmark).
October 2006—Research cluster reaches 600 nodes.
December 2006—Sort benchmark run on 20 nodes in 1.8 hours,
100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8
hours.
January 2007—Research cluster reaches 900 nodes.
April 2007—Research clusters—2 clusters of 1000 nodes.
April 2008—Won the 1 terabyte sort benchmark in 209 seconds
on 900 nodes.
October 2008—Loading 10 terabytes of data per day on to
research clusters.
March 2009—17 clusters with a total of 24,000 nodes.
April 2009—Won the minute sort by sorting 500 GB in 59
seconds (on 1,400 nodes) and the 100 terabyte sort in 173 minutes
(on 3,400 nodes).
—Owen O’Malley
8. HADOOP AND ITS COMPONENT
Definition:
Hadoop (an Apache Foundation project) is a Big Data
technology that provides a reliable shared storage and analysis
system for large-scale data processing.
Components:
HDFS(Hadoop Distributed File System):
The Hadoop Distributed File System (HDFS) is a distributed
file system designed to run on commodity hardware. It has
many similarities with existing distributed file systems.
9. However, the differences from other distributed file systems
are significant.
• highly fault-tolerant and is designed to be deployed on
low-cost hardware.
• provides high throughput access to application data
and is suitable for applications that have large data
sets.
• relaxes a few POSIX requirements to enable streaming
access to file system data.
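In day-to-day use, HDFS is driven from the command line with the `hdfs dfs` utility. The fragment below is a usage sketch only: it assumes a running Hadoop cluster, and the directory and file names are made up for illustration.

```shell
# Illustrative HDFS session (requires a configured Hadoop cluster;
# paths and file names here are hypothetical examples).
hdfs dfs -mkdir -p /user/student/input          # create a directory in HDFS
hdfs dfs -put access.log /user/student/input    # copy a local file into HDFS
hdfs dfs -ls /user/student/input                # list the directory
hdfs dfs -cat /user/student/input/access.log    # stream the file back out
```

Note the streaming-access design mentioned above: files are written once and read back sequentially, rather than edited in place.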
MapReduce:
• Programming model developed at Google
• Sort/merge based distributed computing
• Initially intended for Google’s internal search/indexing
application, but now used extensively by many other organizations
(e.g., Yahoo!, Amazon.com, IBM, etc.)
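The map, shuffle/sort, and reduce phases described above can be sketched in plain Python with the classic word-count job. This is an illustrative single-machine model only; a real Hadoop job would implement Mapper and Reducer classes (typically in Java) and run distributed across a cluster.

```python
# A minimal single-process model of the MapReduce programming model:
# word count expressed as map -> shuffle/sort -> reduce.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit (word, 1) pairs, like a Mapper's map() calls."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_sort(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [count for _, count in group]

def reduce_phase(grouped):
    """Sum the counts for each word, like a Reducer's reduce()."""
    return {word: sum(counts) for word, counts in grouped}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_sort(map_phase(lines)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

The sort/merge step in the middle is what the slide means by "sort/merge based distributed computing": the framework, not the user code, groups all values for a key before reduction.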
10. ARCHITECTURE
11. LIST OF ANALYSES POSSIBLE USING HADOOP
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk assessment
HADOOP ECOSYSTEM
Although Hadoop is best known for MapReduce and its
distributed filesystem (HDFS, renamed from NDFS), the term is
12. also used for a family of related projects that fall under the
umbrella of infrastructure for distributed computing and large-
scale data processing.
Most of the core projects described here are hosted by the
Apache Software Foundation, which provides support for a
community of open source software projects, including the
original HTTP Server from which it gets its name. As the Hadoop
ecosystem grows, more projects are appearing, not necessarily
hosted at Apache, which provide complementary services to Hadoop
or build on the core to add higher-level abstractions.
The main Hadoop projects are described briefly here:
Common
A set of components and interfaces for distributed filesystems and
general I/O (serialization, Java RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC, and
persistent data storage.
MapReduce
A distributed data processing model and execution environment
that runs on large clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity
machines.
Pig
A data flow language and execution environment for exploring
very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive
13. A distributed data warehouse. Hive manages data stored in HDFS
and provides a query language based on SQL (and which is
translated by the runtime engine to MapReduce jobs) for querying
the data.
HBase
A distributed, column-oriented database. HBase uses HDFS for its
underlying storage, and supports both batch-style computations
using MapReduce and point queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper
provides primitives such as distributed locks that can be used for
building distributed applications.
Sqoop
A tool for efficiently moving data between relational databases
and HDFS.
14. RDBMS VS MAPREDUCE
Why can’t we use databases with lots of disks to do large-scale
batch analysis? Why is MapReduce needed?
The answer to these questions comes from another trend in disk
drives: seek time is improving more slowly than transfer rate.
Seeking is the process of moving the disk’s head to a particular
place on the disk to read or write data. It characterizes the latency
of a disk operation, whereas the transfer rate corresponds to a
disk’s bandwidth.
If the data access pattern is dominated by seeks, it will take longer
to read or write large portions of the dataset than streaming
through it, which operates at the transfer rate.
On the other hand, for updating a small proportion of records in a
database, a traditional B-Tree (the data structure used in relational
databases, which is limited by the rate it can perform seeks) works
well. For updating the majority of a database, a B-Tree is less
efficient than MapReduce, which uses Sort/Merge to rebuild the
database.
In many ways, MapReduce can be seen as a complement to an
RDBMS. (The differences between the two systems are shown in
Table 1-1.) MapReduce is a good fit for problems that need to
analyze the whole dataset, in a batch fashion, particularly for ad
hoc analysis.
An RDBMS is good for point queries or updates, where the
dataset has been indexed to deliver low-latency retrieval and
update times of a relatively small amount of data. MapReduce
suits applications where the data is written once, and read many
times, whereas a relational database is good for datasets that are
continually updated.
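The seek-versus-streaming argument above can be made concrete with a back-of-the-envelope model. The numbers below (10 ms per seek, 100 MB/s transfer) are illustrative assumptions, not measurements:

```python
# A rough model of why seek-dominated (B-Tree-style) updates lose to
# streaming (MapReduce-style) rebuilds once enough records change.
# SEEK_MS and TRANSFER_MB_S are assumed figures for a spinning disk.

SEEK_MS = 10.0          # one random seek, roughly 10 ms
TRANSFER_MB_S = 100.0   # sequential transfer rate

def seek_dominated_ms(records, record_kb):
    """Update records one at a time: one seek per record plus transfer."""
    transfer_ms = records * record_kb / 1024 / TRANSFER_MB_S * 1000
    return records * SEEK_MS + transfer_ms

def streaming_ms(total_mb):
    """Rewrite the whole dataset sequentially at the transfer rate."""
    return total_mb / TRANSFER_MB_S * 1000

# Updating just 1% of a 10 GB dataset (1 KB records) via random seeks...
records = 10 * 1024 * 1024 // 100
random_update = seek_dominated_ms(records, 1)

# ...versus streaming through and rewriting all 10 GB.
full_rewrite = streaming_ms(10 * 1024)

print(random_update > full_rewrite)  # True: seeks already lose at 1%
```

Under these assumptions, updating even 1% of the records by random access takes roughly ten times longer than rewriting the entire dataset sequentially, which is exactly why MapReduce rebuilds with sort/merge instead of seeking.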
15. Table 1-1. RDBMS compared to MapReduce
• Data size: Gigabytes (RDBMS) vs. Petabytes (MapReduce)
• Access: Interactive and batch vs. Batch
• Updates: Read and write many times vs. Write once, read many times
• Structure: Static schema vs. Dynamic schema
• Integrity: High vs. Low
• Scaling: Nonlinear vs. Linear
16. DISADVANTAGE
1. Security Concerns
Just managing a complex application such as Hadoop can be
challenging. A classic example can be seen in the Hadoop security
model, which is disabled by default due to sheer complexity. If
whoever is managing the platform lacks the know-how to enable it,
your data could be at huge risk. Hadoop also lacks encryption at
the storage and network levels, which is a major sticking point
for government agencies and others that prefer to keep their data
under wraps.
2. Vulnerable By Nature
Speaking of security, the very makeup of Hadoop makes running
it a risky proposition. The framework is written almost entirely in
Java, one of the most widely used yet controversial programming
languages in existence. Java has been heavily exploited by
cybercriminals and as a result, implicated in numerous security
breaches. For this reason, several experts have suggested dumping
it in favor of safer, more efficient alternatives.
3. Not Fit for Small Data
While big data isn't exclusively made for big businesses, not all
big data platforms are suited for small data needs. Unfortunately,
Hadoop happens to be one of them. Due to its high capacity
design, the Hadoop Distributed File System or HDFS, lacks the
ability to
17. efficiently support the random reading of small files. As a result,
it is not recommended for organizations with small quantities of
data.
4. Potential Stability Issues
Hadoop is an open source platform. That essentially means it is
created by the contributions of the many developers who continue
to work on the project. While improvements are constantly being
made,
like all open source software, Hadoop has had its fair share of
stability issues. To avoid these issues, organizations are strongly
recommended to make sure they are running the latest stable
version, or to run it under a third-party vendor equipped to
handle such problems.
5. General Limitations
When it comes to making the most of big data, Hadoop may not be
the only answer. Apache Flume, MillWheel, and Google’s own Cloud
Dataflow have emerged as possible alternatives. What each of these
platforms has in common is the ability to improve the efficiency
and reliability of data collection, aggregation, and integration.
18. HADOOP SUPPORTED OS
• Red Hat Enterprise Linux
• CentOS
• Oracle Linux
• Ubuntu
• SUSE Linux Enterprise Server
19. HADOOP ALTERNATIVES
Disco
Disco can be broadly defined as an open-source and lightweight
framework for distributed computing based on the MapReduce
paradigm. Because jobs are written in Python, it is easy and
powerful to use.
Misco
Misco can be broadly defined as a distributed computing
framework designed especially for mobile devices. Highly
portable, Misco is implemented entirely in Python and should run
on any system that supports Python.
Cloud MapReduce
20. Initially developed at Accenture Technology Labs, Cloud
MapReduce can be broadly defined as a MapReduce implementation
on the Amazon cloud OS. Its architecture is completely different
from other open source implementations.
Skynet
Skynet can be broadly defined as an open-source Ruby
implementation of Google’s MapReduce framework. Skynet was
created at Geni and is free for anyone to use.
Sphere
Sphere can be broadly defined as support for distributed storage
and processing of data over many clusters of commodity computers,
across multiple data centers or in a single data center. It can
be further described as a scalable, high-performance and secure
distributed file system.
Storm
Storm, created by Nathan Marz, is a fully distributed realtime
computation system: it offers a set of general primitives for
realtime processing, much as Hadoop does for batch processing.
It is very simple and can be used with any programming language.
MongoDB
MongoDB can be defined as a very popular tool used in cloud
computing. It supports map-reduce operations in a simple and
elegant manner.
21. CONCLUSION
Hadoop has been a very effective solution for companies dealing
with data in petabytes.
It has solved many problems related to huge data management and
distribution.
Being open source, it has been widely adopted by many companies.
22. REFERENCES
1. Hadoop: The Definitive Guide; Tom White; O'Reilly Media, 3rd
Edition (May 6, 2012)
2. www.hadoop.apache.org/
3. www.tutorialspoint.com/hadoop/