Curriculum Seminar(NCS-654) Report
On
“Hadoop”
B.Tech CS (IIIrd Year)
Session 2015-2016
Submitted to
Mr. Amit Karmakar
Department of Computer Science & Engineering
Shri Ram Murti Smarak College Of Engineering & Technology
Dr. A.P.J. Abdul Kalam Technical University (APJAKTU)
April, 2016
Submitted By
Himanshu Soni (1301410040)
ACKNOWLEDGEMENT
It gives us a great sense of pleasure to present the report of the
B. Tech Curriculum Seminar undertaken during B. Tech. 2015-16.
We owe a special debt of gratitude to the Department of Computer
Science and Technology, SRMSCET, Bareilly, for conducting the
Big Data and Hadoop Workshop and for its constant support and
guidance throughout the course of our work. Its sincerity,
thoroughness and perseverance have been a constant source of
inspiration for us. It is only through these efforts that our
endeavors have seen the light of day.
We also take the opportunity to acknowledge the contribution of
Mr. L. S. Maurya, Head, Department of Computer Science and
Technology, SRMSCET, Bareilly, for his full support and
assistance in conducting the seminar. Last but not least, we
thank our friends for their contribution to the completion
of the project.
OUTLINE
• 1) Need for New Technology
• 2) History of Origin
• 3) Hadoop & Its Components
• 4) Architecture
• 5) List of Analysis Possible using Hadoop
• 6) Hadoop Ecosystem
• 7) RDBMS vs MapReduce
• 8) Disadvantages
• 9) Hadoop Supported OS
• 10) Hadoop Alternatives
• 11) Conclusion
• 12) References
NEED FOR NEW TECHNOLOGY
Size of Data:
We live in the data age. It’s not easy to measure the total volume
of data stored electronically, but an IDC estimate put the size of
the “digital universe” at 0.18 zettabytes in 2006 and forecast a
tenfold growth to 1.8 zettabytes by 2011. A zettabyte is 10^21
bytes, or equivalently one thousand exabytes, one million
petabytes, or one billion terabytes. That’s roughly the same order
of magnitude as one disk drive for every person in the world.
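The unit relationships above can be checked with a few lines of
Python. This is only a sketch: the UNITS table and the convert
helper are names introduced here for illustration, and decimal (SI)
units are assumed throughout.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}


def convert(amount, from_unit, to_unit):
    """Convert an amount between two of the decimal storage units above."""
    return amount * UNITS[from_unit] / UNITS[to_unit]


# IDC figures quoted in the text: 0.18 ZB in 2006, tenfold growth by 2011.
size_2006_zb = 0.18
size_2011_zb = size_2006_zb * 10

if __name__ == "__main__":
    for unit in ("exabyte", "petabyte", "terabyte"):
        print(f"1 zettabyte = {convert(1, 'zettabyte', unit):,.0f} {unit}s")
    print(f"Forecast for 2011: {size_2011_zb:.1f} zettabytes")
```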
This flood of data is coming from many sources. Consider the
following:
• The New York Stock Exchange generates about one terabyte of
new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one
petabyte of storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of
data.
• The Internet Archive stores around 2 petabytes of data, and is
growing at a rate of 20 terabytes per month.
Variety of Data:
1. Structured data (RDBMS)
2. Semi-structured data (XML and JSON)
3. Unstructured data (videos, logs, audio, binary data, etc.)
Speed:
The problem is simple: while the storage capacities of hard drives
have increased massively over the years, access speeds (the rate
at which data can be read from drives) have not kept up. One
typical drive from 1990 could store 1,370 MB of data and had a
transfer speed of 4.4 MB/s, so you could read all the data from a
full drive in around five minutes. Over 20 years later, one terabyte
drives are the norm, but the transfer speed is around 100 MB/s, so
it takes more than two and a half hours to read all the data off the
disk.
This is a long time to read all data on a single drive—and writing
is even slower. The obvious way to reduce the time is to read from
multiple disks at once. Imagine if we had 100 drives, each holding
one hundredth of the data. Working in parallel, we could read the
data in under two minutes.
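The read times quoted above follow directly from capacity divided by
transfer speed. The sketch below reproduces the arithmetic; the
function name read_time_seconds is introduced here for illustration,
and one terabyte is taken as 1,000,000 MB (decimal units).

```python
def read_time_seconds(capacity_mb, transfer_mb_per_s, drives=1):
    """Time to stream a dataset split evenly across `drives` disks read in parallel."""
    return capacity_mb / drives / transfer_mb_per_s


# Figures taken from the text above.
drive_1990 = read_time_seconds(1_370, 4.4)                  # 1,370 MB at 4.4 MB/s
drive_1tb = read_time_seconds(1_000_000, 100)               # 1 TB at 100 MB/s
parallel = read_time_seconds(1_000_000, 100, drives=100)    # same data on 100 drives

if __name__ == "__main__":
    print(f"1990 drive:        {drive_1990 / 60:.1f} minutes")
    print(f"1 TB drive:        {drive_1tb / 3600:.2f} hours")
    print(f"100 drives at once: {parallel / 60:.2f} minutes")
```

The parallel case assumes the data is spread perfectly evenly and
that nothing else limits throughput, which a real cluster will not
quite achieve.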
HISTORY OF ORIGIN
2004: Initial versions of what is now Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
January 2006: Doug Cutting joins Yahoo!.
February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
February 2006: Adoption of Hadoop by Yahoo! Grid team.
April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
May 2006: Yahoo! set up a Hadoop research cluster of 300 nodes.
May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark).
October 2006: Research cluster reaches 600 nodes.
December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
January 2007: Research cluster reaches 900 nodes.
April 2007: Research clusters: 2 clusters of 1000 nodes.
April 2008: Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
October 2008: Loading 10 terabytes of data per day on to research clusters.
March 2009: 17 clusters with a total of 24,000 nodes.
April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 terabyte sort in 173 minutes (on 3,400 nodes).
(Source: Owen O’Malley)
HADOOP AND ITS COMPONENTS
Definition:
Hadoop (an Apache Software Foundation project) is a Big Data
technology that provides a reliable shared storage and
analysis system for large-scale data processing.
Components:
HDFS(Hadoop Distributed File System):
The Hadoop Distributed File System (HDFS) is a distributed
file system designed to run on commodity hardware. It has
many similarities with existing distributed file systems.
However, the differences from other distributed file systems
are significant.
• HDFS is highly fault-tolerant and is designed to be
deployed on low-cost hardware.
• It provides high-throughput access to application data
and is suitable for applications that have large data
sets.
• It relaxes a few POSIX requirements to enable streaming
access to file system data.
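HDFS achieves this fault tolerance by splitting each file into large
blocks and keeping several copies of every block on different
machines. The sketch below estimates how a file is laid out. It
assumes a 128 MB block size and a replication factor of 3, which
are common defaults in recent Hadoop releases but are not figures
stated in this report, and hdfs_footprint is a name made up for this
example rather than part of any Hadoop API.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate block count and raw disk usage for one file.

    The defaults are assumptions for illustration (see the note above);
    a real cluster may be configured differently.
    """
    # A file is cut into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times; a partial last block
    # only occupies its actual size, so raw usage scales with file size.
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb


if __name__ == "__main__":
    blocks, raw = hdfs_footprint(1_000)
    print(f"1,000 MB file: {blocks} blocks, {raw:,} MB of raw storage")
```

With three copies of each block, the loss of a single low-cost
machine leaves at least two other copies available.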
MapReduce:
• A programming model developed at Google
• Sort/merge-based distributed computing
• Initially intended for Google's internal search/indexing
application, but now used extensively by many other
organizations (e.g., Yahoo, Amazon.com, IBM, etc.)
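To make the programming model concrete, here is a minimal
single-process sketch of the classic word-count job. It imitates the
three stages (map, sort/merge shuffle, reduce) in plain Python and
does not use the Hadoop API; all function names are chosen for this
example only.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle: sort pairs by key and group the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in order over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


if __name__ == "__main__":
    print(word_count(["big data big", "data hadoop"]))
```

In a real cluster the map and reduce functions run on many machines
at once and the framework performs the sort/merge step between
them; here everything runs in one process purely to show the data
flow.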
ARCHITECTURE
LIST OF ANALYSIS POSSIBLE USING HADOOP
• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk assessment
HADOOP ECOSYSTEM
Although Hadoop is best known for MapReduce and its
distributed filesystem (HDFS, renamed from NDFS), the term is
also used for a family of related projects that fall under the
umbrella of infrastructure for distributed computing and large-
scale data processing.
Most of the core projects covered in this report are hosted by the
Apache Software Foundation, which provides support for a
community of open source software projects, including the
original HTTP Server from which it gets its name. As the Hadoop
ecosystem grows, more projects are appearing, not necessarily
hosted at Apache, which provide complementary services to
Hadoop or build on the core to add higher-level abstractions.
The Hadoop projects that are covered in this report are described
briefly here:
Common
A set of components and interfaces for distributed filesystems and
general I/O (serialization, Java RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC, and
persistent data storage.
MapReduce
A distributed data processing model and execution environment
that runs on large clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity
machines.
Pig
A data flow language and execution environment for exploring
very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS
and provides a query language based on SQL (which the runtime
engine translates into MapReduce jobs) for querying the data.
HBase
A distributed, column-oriented database. HBase uses HDFS for its
underlying storage, and supports both batch-style computations
using MapReduce and point queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper
provides primitives such as distributed locks that can be used for
building distributed applications.
Sqoop
A tool for efficiently moving data between relational databases
and HDFS.
RDBMS VS MAPREDUCE
Why can’t we use databases with lots of disks to do large-scale
batch analysis? Why is MapReduce needed?
The answer to these questions comes from another trend in disk
drives: seek time is improving more slowly than transfer rate.
Seeking is the process of moving the disk’s head to a particular
place on the disk to read or write data. It characterizes the latency
of a disk operation, whereas the transfer rate corresponds to a
disk’s bandwidth.
If the data access pattern is dominated by seeks, it will take longer
to read or write large portions of the dataset than streaming
through it, which operates at the transfer rate.
On the other hand, for updating a small proportion of records in a
database, a traditional B-Tree (the data structure used in relational
databases, which is limited by the rate it can perform seeks) works
well. For updating the majority of a database, a B-Tree is less
efficient than MapReduce, which uses Sort/Merge to rebuild the
database.
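A rough calculation shows why this is so. The sketch below compares
updating records one seek at a time with streaming through the whole
dataset and rewriting it. Every number in it (a 1 TB dataset of
100-byte records, a 10 ms seek, a 100 MB/s transfer rate) is an
assumption chosen for illustration, not a figure from this report,
and "one seek per updated record" is a deliberate simplification of
how a B-Tree behaves.

```python
def seek_bound_update_seconds(records_to_update, seek_time_s):
    """Cost when every updated record needs its own disk seek (simplified)."""
    return records_to_update * seek_time_s


def streaming_rewrite_seconds(dataset_bytes, transfer_bytes_per_s):
    """Cost of reading the whole dataset once and writing it back once."""
    return 2 * dataset_bytes / transfer_bytes_per_s


# Illustrative assumptions only.
DATASET_BYTES = 10**12            # 1 TB
RECORD_BYTES = 100
SEEK_TIME_S = 0.010               # 10 ms per seek
TRANSFER_BYTES_PER_S = 100 * 10**6  # 100 MB/s

total_records = DATASET_BYTES // RECORD_BYTES
streaming = streaming_rewrite_seconds(DATASET_BYTES, TRANSFER_BYTES_PER_S)

if __name__ == "__main__":
    print(f"Streaming rewrite of everything: {streaming / 3600:.1f} hours")
    for label, count in (
        ("0.01%", total_records // 10_000),
        ("1%", total_records // 100),
        ("50%", total_records // 2),
    ):
        seeks = seek_bound_update_seconds(count, SEEK_TIME_S)
        print(f"Seek-based update of {label} of records: {seeks / 3600:.1f} hours")
```

Under these assumptions the seek-based approach wins only while the
fraction of updated records stays very small; once a sizeable part
of the dataset changes, streaming through all of it is faster.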
In many ways, MapReduce can be seen as a complement to an
RDBMS. (The differences between the two systems are shown in
Table 1-1.) MapReduce is a good fit for problems that need to
analyze the whole dataset, in a batch fashion, particularly for ad
hoc analysis.
An RDBMS is good for point queries or updates, where the
dataset has been indexed to deliver low-latency retrieval and
update times of a relatively small amount of data. MapReduce
suits applications where the data is written once, and read many
times, whereas a relational database is good for datasets that are
continually updated.
DISADVANTAGES
1. Security Concerns
Just managing a complex application such as Hadoop can be
challenging. A classic example is the Hadoop security model,
which is disabled by default due to its sheer complexity. If
whoever is managing the platform lacks the know-how to enable
it, your data could be at serious risk. Hadoop also lacks encryption
at the storage and network levels, a capability that is a major
selling point for government agencies and others that prefer to
keep their data under wraps.
2. Vulnerable By Nature
Speaking of security, the very makeup of Hadoop makes running
it a risky proposition. The framework is written almost entirely in
Java, one of the most widely used yet controversial programming
languages in existence. Java has been heavily exploited by
cybercriminals and as a result, implicated in numerous security
breaches. For this reason, several experts have suggested dumping
it in favor of safer, more efficient alternatives.
3. Not Fit for Small Data
While big data isn't exclusively made for big businesses, not all
big data platforms are suited for small data needs. Unfortunately,
Hadoop happens to be one of them. Due to its high-capacity
design, the Hadoop Distributed File System (HDFS) lacks the
ability to efficiently support the random reading of small files.
As a result, it is not recommended for organizations with small
quantities of data.
4. Potential Stability Issues
Hadoop is an open source platform. That essentially means it is
created by the contributions of the many developers who continue
to work on the project. While improvements are constantly being
made, Hadoop, like all open source software, has had its fair
share of stability issues. To avoid these issues, organizations are
strongly advised to make sure they are running the latest stable
version, or to run it under a third-party vendor equipped to
handle such problems.
5. General Limitations
When it comes to making the most of big data, Hadoop may not
be the only answer. Apache Flume, MillWheel, and Google’s own
Cloud Dataflow have been suggested as possible solutions. What
these platforms have in common is the ability to improve the
efficiency and reliability of data collection, aggregation, and
integration.
HADOOP SUPPORTED OS
• Red Hat Enterprise Linux
• CentOS
• Oracle Linux
• Ubuntu
• SUSE Linux Enterprise Server
HADOOP ALTERNATIVES
Disco
Disco is an open-source, lightweight framework for distributed
computing based on the MapReduce paradigm. Because it uses
Python, it is both easy and powerful to use.
Misco
Misco is a distributed computing framework designed
especially for mobile devices. Highly portable, Misco is
implemented entirely in Python and should be able to run on
any system that supports Python.
Cloud MapReduce
Initially developed at Accenture Technology Labs, Cloud
MapReduce is a MapReduce implementation built on the Amazon
cloud OS. Its architecture is completely different from that of
other open source implementations.
Skynet
Skynet is an open-source Ruby implementation of Google's
MapReduce framework. Skynet was created at Geni and is free
for anyone to use.
Sphere
Sphere supports the distributed storage, processing and
distribution of data over many clusters of commodity computers,
either across multiple data centers or within a single data center.
It can be further described as a scalable, high-performance
and secure distributed file system.
Storm
Storm, created by Nathan Marz, is a fully distributed real-time
computation system. Just as Hadoop offers a set of general
primitives for batch processing, Storm offers a set of general
primitives for real-time computation. It is very simple and can be
used with any programming language.
MongoDB
MongoDB is a very popular tool used in cloud computing. It
applies the map-reduce algorithm in a simple and elegant manner.
CONCLUSION
• Hadoop has been a very effective solution for companies
dealing with data in the petabyte range.
• It has solved many problems related to managing and
distributing huge volumes of data.
• Because it is open source, it has been widely adopted by many
companies.
REFERENCES
1. Tom White, Hadoop: The Definitive Guide, 3rd Edition,
O'Reilly Media (May 6, 2012)
2. www.hadoop.apache.org/
3. www.tutorialspoint.com/hadoop/
4. www.en.wikipedia.org/wiki/Apache_Hadoop

Más contenido relacionado

La actualidad más candente

Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceHortonworks
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Cognizant
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce cscpconf
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesDavid Tjahjono,MD,MBA(UK)
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architectureHarikrishnan K
 

La actualidad más candente (17)

Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduce
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
Big data
Big dataBig data
Big data
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 

Similar a Hadoop

How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Big Data - Hadoop Ecosystem
Big Data -  Hadoop Ecosystem Big Data -  Hadoop Ecosystem
Big Data - Hadoop Ecosystem nuriadelasheras
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopIRJET Journal
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook AhmedDoukh
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampSpotle.ai
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 

Similar a Hadoop (20)

Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop
HadoopHadoop
Hadoop
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
HDFS
HDFSHDFS
HDFS
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Big Data - Hadoop Ecosystem
Big Data -  Hadoop Ecosystem Big Data -  Hadoop Ecosystem
Big Data - Hadoop Ecosystem
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
hadoop
hadoophadoop
hadoop
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and Hadoop
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 

Último

Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Comparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization TechniquesComparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization Techniquesugginaramesh
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 

Último (20)

Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 

Hadoop

  • 1. Curriculum Seminar(NCS-654) Report On “Hadoop” B.Tech CS (IIIrd Year) Session 2015-2016 Submitted to Mr. Amit Karmakar Department of Computer Science & Engineering Shri Ram Murti Smarak College Of Engineering & Technology Dr. A.P.J. Abdul Kalam Technical University (APJAKTU) April,2016 Submitted By Himanshu Soni (1301410040)
  • 2. ACKNOWLEDGEMENT It gives us a great sense of pleasure to present the report of the B. Tech Curriculum Seminar undertaken during B. Tech. 2015-16. We owe a special debt of gratitude to the Department of Computer Science and Technology, SRMSCET, Bareilly, for conducting the Big Data and Hadoop Workshop and for its constant support and guidance throughout the course of our work. Its sincerity, thoroughness and perseverance have been a constant source of inspiration for us. It is only through its cognizant efforts that our endeavors have seen the light of day. We also take the opportunity to acknowledge the contribution of Mr. L. S. Maurya, Head, Department of Computer Science and Technology, SRMSCET, Bareilly, for his full support and assistance in conducting the seminar. Last but not least, we acknowledge our friends for their contribution to the completion of the project.
  • 3. OUTLINE
    • 1) Need for New Technology
    • 2) History of Origin
    • 3) Hadoop & Its Components
    • 4) Architecture
    • 5) List of Analysis Possible using Hadoop
    • 6) Hadoop Ecosystem
    • 7) RDBMS vs MapReduce
    • 8) Disadvantages
    • 9) Hadoop Supported OS
    • 10) Hadoop Alternatives
    • 11) Conclusion
    • 12) References
  • 4. NEED FOR NEW TECHNOLOGY
    Size of Data: We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006 and forecast a tenfold growth to 1.8 zettabytes by 2011. A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That’s roughly the same order of magnitude as one disk drive for every person in the world. This flood of data is coming from many sources. Consider the following:
    • The New York Stock Exchange generates about one terabyte of new trade data per day.
    • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
    • Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
  • 5.
    • The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
    Variety of Data:
    1. Structured data (RDBMS)
    2. Semi-structured data (XML and JSON)
    3. Unstructured data (videos, logs, audio, binary data, etc.)
    Speed: The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. This is a long time to read all the data on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
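The read times quoted above follow from dividing capacity by transfer rate. A minimal sketch of that arithmetic is given here; the function name is hypothetical, one terabyte is assumed to be 1,000,000 MB, and the parallel case assumes ideal scaling with no coordination overhead.

```python
def full_read_seconds(capacity_mb, transfer_mb_per_s, drives=1):
    """Seconds needed to stream capacity_mb of data split evenly over
    `drives` disks that are read in parallel (ideal case, no overhead)."""
    return capacity_mb / transfer_mb_per_s / drives


# Figures quoted in the text; 1 TB is taken as 1,000,000 MB (an assumption).
drive_1990 = full_read_seconds(1_370, 4.4)
drive_1tb = full_read_seconds(1_000_000, 100)
parallel_100 = full_read_seconds(1_000_000, 100, drives=100)

if __name__ == "__main__":
    print(f"1990 drive: {drive_1990 / 60:.1f} minutes")
    print(f"1 TB drive: {drive_1tb / 3600:.1f} hours")
    print(f"100 drives: {parallel_100 / 60:.1f} minutes")
```

Under these assumptions the three cases come out at roughly 5.2 minutes, 2.8 hours and 1.7 minutes, which matches the figures in the text.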
  • 6. HISTORY OF ORIGIN
    2004: Initial versions of what is now the Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
    December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
    January 2006: Doug Cutting joins Yahoo!.
    February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
    February 2006: Adoption of Hadoop by Yahoo! Grid team.
    April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
    May 2006: Yahoo! set up a Hadoop research cluster with 300 nodes.
  • 7.
    May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark).
    October 2006: Research cluster reaches 600 nodes.
    December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
    January 2007: Research cluster reaches 900 nodes.
    April 2007: Research clusters: 2 clusters of 1000 nodes.
    April 2008: Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
    October 2008: Loading 10 terabytes of data per day on to research clusters.
    March 2009: 17 clusters with a total of 24,000 nodes.
    April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 terabyte sort in 173 minutes (on 3,400 nodes).
    (Timeline credited to Owen O’Malley.)
  • 8. HADOOP AND ITS COMPONENTS
    Definition: Hadoop, an Apache Software Foundation project, is a Big Data technology that provides a reliable shared storage and analysis system for large-scale data processing.
    Components:
    HDFS (Hadoop Distributed File System): The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems.
  • 9. However, the differences from other distributed file systems are significant. HDFS:
    • is highly fault-tolerant and is designed to be deployed on low-cost hardware;
    • provides high-throughput access to application data and is suitable for applications that have large data sets;
    • relaxes a few POSIX requirements to enable streaming access to file system data.
    MapReduce:
    • A programming model developed at Google
    • Sort/merge-based distributed computing
    • Initially intended for Google's internal search/indexing application, but now used extensively by other organizations (e.g., Yahoo, Amazon.com, IBM)
    ARCHITECTURE
  • 10. LIST OF ANALYSIS POSSIBLE USING HADOOP • Text mining
  • 11.
    • Index building
    • Graph creation and analysis
    • Pattern recognition
    • Collaborative filtering
    • Prediction models
    • Sentiment analysis
    • Risk assessment
  • 12. HADOOP ECOSYSTEM
    Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing. Most of the core projects covered here are hosted by the Apache Software Foundation, which provides support for a community of open source software projects, including the original HTTP Server from which it gets its name. As the Hadoop ecosystem grows, more projects are appearing, not necessarily hosted at Apache, which provide complementary services to Hadoop or build on the core to add higher-level abstractions. The Hadoop projects covered here are described briefly below:
    Common: A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
    Avro: A serialization system for efficient, cross-language RPC and persistent data storage.
    MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
    HDFS: A distributed filesystem that runs on large clusters of commodity machines.
    Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
  • 13.
    Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates to MapReduce jobs, for querying the data.
    HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
    ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
    Sqoop: A tool for efficiently moving data between relational databases and HDFS.
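The MapReduce model mentioned above (a sort/merge-based data processing model) can be illustrated with a word count. The following is a single-process sketch in plain Python, not Hadoop's actual Java API; the function names are hypothetical, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle/sort: group all values by key and return keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The programmer supplies only the map and reduce functions; grouping by key between the two phases is what the framework's sort/merge step provides.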
  • 14. RDBMS VS MAPREDUCE Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed? The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database. In many ways, MapReduce can be seen as a complement to an RDBMS. (The differences between the two systems are shown in Table 1-1.) MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once, and read many times, whereas a relational database is good for datasets that are continually updated.
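The seek-versus-transfer argument above can be made concrete with a rough calculation. Every figure in this sketch is an illustrative assumption rather than a measurement: a 1 TB dataset of 100-byte records, 10 ms per seek, and 100 MB/s sequential transfer. The function names are hypothetical, and the model ignores caching and the transfer time of individual records.

```python
def seek_update_seconds(records_updated, seek_s):
    """Time spent if every updated record costs one disk seek."""
    return records_updated * seek_s


def streaming_rewrite_seconds(dataset_bytes, transfer_bytes_per_s):
    """Time to read the whole dataset once and write it back once."""
    return 2 * dataset_bytes / transfer_bytes_per_s


# All figures below are illustrative assumptions, not measurements.
DATASET_BYTES = 10**12        # 1 TB dataset
RECORD_BYTES = 100            # fixed-size records
SEEK_S = 0.01                 # 10 ms per seek
TRANSFER_BPS = 100 * 10**6    # 100 MB/s sequential transfer

records = DATASET_BYTES // RECORD_BYTES
seek_time = seek_update_seconds(records // 100, SEEK_S)   # update 1% of records
stream_time = streaming_rewrite_seconds(DATASET_BYTES, TRANSFER_BPS)
# Fraction of records above which streaming the whole dataset is faster.
break_even_fraction = stream_time / (records * SEEK_S)

if __name__ == "__main__":
    print(f"seek-based update of 1%: {seek_time / 86400:.1f} days")
    print(f"streaming rewrite:       {stream_time / 3600:.1f} hours")
    print(f"break-even fraction:     {break_even_fraction:.4%}")
```

Under these assumptions, updating even 1% of the records one seek at a time takes far longer than rewriting the entire dataset sequentially, which is the case for a sort/merge approach when most of the data changes; for a handful of records the seek-based B-Tree update remains much faster.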
  • 15.
  • 16. DISADVANTAGES
    1. Security Concerns
    Just managing a complex application such as Hadoop can be challenging. A classic example can be seen in the Hadoop security model, which is disabled by default due to its sheer complexity. If whoever is managing the platform lacks the know-how to enable it, your data could be at huge risk. Older Hadoop releases were also missing encryption at the storage and network levels, a capability that is a major selling point for government agencies and others that prefer to keep their data under wraps.
    2. Vulnerable By Nature
    Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The framework is written almost entirely in Java, one of the most widely used yet controversial programming languages in existence. Java has been heavily exploited by cybercriminals and, as a result, implicated in numerous security breaches. For this reason, several experts have suggested dumping it in favor of safer, more efficient alternatives.
    3. Not Fit for Small Data
    While big data isn't exclusively made for big businesses, not all big data platforms are suited for small data needs. Unfortunately, Hadoop happens to be one of them. Due to its high-capacity design, the Hadoop Distributed File System (HDFS) lacks the ability to efficiently support the random reading of small files. As a result, it is not recommended for organizations with small quantities of data.
  • 17.
    4. Potential Stability Issues
    Hadoop is an open source platform. That essentially means it is created by the contributions of the many developers who continue to work on the project. While improvements are constantly being made, Hadoop, like all open source software, has had its fair share of stability issues. To avoid these issues, organizations are strongly recommended to make sure they are running the latest stable version, or to run it under a third-party vendor equipped to handle such problems.
    5. General Limitations
    When it comes to making the most of big data, Hadoop may not be the only answer. Apache Flume, MillWheel, and Google’s own Cloud Dataflow have been put forward as possible solutions. What these platforms have in common is the ability to improve the efficiency and reliability of data collection, aggregation, and integration.
  • 18. HADOOP SUPPORTED OS
    • Red Hat Enterprise Linux
    • CentOS
    • Oracle Linux
    • Ubuntu
    • SUSE Linux Enterprise Server
  • 19. HADOOP ALTERNATIVES
    Disco
    Disco is an open-source, lightweight framework for distributed computing based on the MapReduce paradigm. Because jobs are written in Python, it is easy yet powerful to use.
    Misco
    Misco is a distributed computing framework designed especially for mobile devices. It is highly portable: it is implemented entirely in Python and should run on any system that supports Python.
  • 20.
    Cloud MapReduce
    Initially developed at Accenture Technology Labs, Cloud MapReduce is a MapReduce implementation built on top of the Amazon cloud OS. Its architecture is completely different from that of other open source implementations.
    Skynet
    Skynet is an open-source Ruby implementation of Google's MapReduce framework. It was created at Geni and is free for anyone to use.
    Sphere
    Sphere supports the distributed storage, distribution and processing of data over many clusters of commodity computers, either in a single data center or across multiple data centers. It can further be described as a scalable, high-performance and secure distributed file system.
    Storm
    Storm, created by Nathan Marz, is a fully distributed realtime computation system. Much as Hadoop offers a set of general primitives for batch processing, Storm offers a set of general primitives for realtime computation. It is very simple and can be used with any programming language.
    MongoDB
    MongoDB is a very popular document-oriented database that is widely used in cloud computing. It offers a map-reduce operation that applies the MapReduce approach in a simple and elegant manner.
  • 21. CONCLUSION
    • Hadoop has been a very effective solution for companies dealing with data at petabyte scale.
    • It has solved many problems related to the management and distribution of huge volumes of data.
    • Because it is open source, it has been widely adopted by many companies.
  • 22. REFERENCES
    1. Tom White, Hadoop: The Definitive Guide, 3rd Edition, O'Reilly Media, May 6, 2012.
    2. www.hadoop.apache.org/
    3. www.tutorialspoint.com/hadoop/