INTRODUCTION TO BIG DATA AND HADOOP
1. IT19741 Cloud and Big Data Analytics
Dr G Geetha
Dean, Innovation and Professor, CSE
Women Scientist and Qualified Patent Agent
Rajalakshmi Engineering College
2. UNIT-III: INTRODUCTION TO BIG DATA AND HADOOP
Introduction to Big Data, Types of Digital Data, Challenges of conventional systems - Web data, Evolution of analytic processes and tools, Analysis vs. reporting - Big Data Analytics, Introduction to Hadoop - Distributed computing challenges - History of Hadoop, Hadoop ecosystem - Use cases of Hadoop - Hadoop distributors - HDFS - Processing data with Hadoop - MapReduce.
4. Introduction to Big Data
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
8. Traditional Data vs. Big Data
• Traditional data is generated at the enterprise level; big data is generated outside the enterprise.
• Traditional data volumes range from gigabytes to terabytes; big data volumes range from petabytes to zettabytes or exabytes.
• Traditional database systems deal with structured data; big data systems deal with structured, semi-structured, and unstructured data.
• Traditional data is generated per hour or per day; big data is generated far more frequently, often every second.
• Traditional data sources are centralized and managed in centralized form; big data sources are distributed and managed in distributed form.
• Data integration is very easy for traditional data; it is very difficult for big data.
• A normal system configuration is capable of processing traditional data; a high-end system configuration is required to process big data.
• The size of traditional data is very small; big data is much larger.
• Ordinary database tools are sufficient for traditional database operations; special kinds of tools are required for big data schema-based operations.
• Normal functions can manipulate traditional data; special kinds of functions are required to manipulate big data.
• The traditional data model is strict-schema based and static; the big data model is flat-schema based and dynamic.
• Traditional data is stable, with known inter-relationships; big data is not stable, with unknown relationships.
• Traditional data comes in manageable volumes; big data comes in huge volumes that become unmanageable.
• Traditional data is easy to manage and manipulate; big data is difficult to manage and manipulate.
• Traditional data sources include ERP transaction data, CRM transaction data, financial data, organizational data, and web transaction data; big data sources include social media, device data, and sensor data.
9. Types of Digital Data
• Structured data: When data follows a pre-defined schema/structure, we say it is structured data. This is data in an organized form (e.g., in rows and columns) that can be easily used by a computer program.
• Relationships exist between entities of data, such as classes and their objects. About 10% of an organization's data is in this format.
• Data stored in databases is an example of structured data.
10. Types of Digital Data
• Unstructured data: This is data which does not conform to a data model, or is not in a form that can be used easily by a computer program. About 80% of an organization's data is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an email, etc.
12. Types of Digital Data
• Semi-structured data: Semi-structured data is also referred to as self-describing data. This is data which does not conform to a data model but has some structure; however, it is not in a form that can be used easily by a computer program. About 10% of an organization's data is in this format; for example, HTML, XML, JSON, email data, etc.
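To make the three categories concrete, here is a small illustrative sketch (the records and field names are invented for the example): the CSV rows are structured, the JSON document is semi-structured, and the free-text memo is unstructured.

```python
import csv
import io
import json

# Structured: fixed schema; every record has the same rows-and-columns shape.
csv_text = "id,name,dept\n1,Asha,CSE\n2,Ravi,ECE\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"])  # fields are addressable via the schema

# Semi-structured: self-describing keys, but no fixed schema is enforced.
record = json.loads('{"id": 1, "name": "Asha", "skills": ["java", "sql"]}')
print(record["skills"])  # structure exists, yet fields can vary per record

# Unstructured: free text; a program cannot address fields directly.
memo = "Met the vendor today; the sensor shipment is delayed again."
```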
14. Web Data
• Data that is sourced and structured from websites is referred to as "web data".
• In the world of big data, data comes from multiple sources and in huge amounts; one such source is the web itself. Web data extraction is one means of collecting data from this source, i.e., the web, as the sketch below illustrates.
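As a minimal sketch of web data extraction, the snippet below fetches one page over HTTP; it assumes the third-party `requests` package is installed, and the URL is illustrative.

```python
import requests

# Fetch one page of raw web data (HTML); the URL is a placeholder.
response = requests.get("https://example.com/products.html", timeout=10)
response.raise_for_status()
html = response.text  # unstructured markup at this point

# In practice you would parse `html` into structured records
# (e.g., with an HTML parser) before feeding it to analytics.
print(html[:200])
```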
30. The Evolution of Business Intelligence
• 1990s: BI reporting with OLAP and data warehouses (Business Objects, SAS, Informatica, Cognos, and other SQL reporting tools).
• 2000s: Interactive business intelligence and in-memory RDBMS (QlikView, Tableau, HANA).
• 2010s: Big data, both batch processing with distributed data stores (Hadoop/Spark; HBase/Cassandra) and real-time, single-view systems (graph databases).
Across these eras, analytics has grown in both scale and speed.
31. Big Data Analytics
• Big data is more real-time in nature than traditional DW applications.
• Traditional DW architectures (e.g., Exadata, Teradata) are not well-suited for big data apps.
• Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps.
39. The "5 Vs" of Big Data
• Volume: With increasing dependence on technology, data is being produced in large volumes. Common examples are data produced by social networking sites, sensors, scanners, airlines, and other organizations.
• Velocity: A huge amount of data is generated every second. It has been estimated that by the end of 2020, every individual would produce about 3 MB of data per second. This large volume of data is generated at great velocity.
• Variety: The data being produced comes in three forms:
  Structured data: relational data stored in the form of rows and columns.
  Unstructured data: text, pictures, videos, etc., which cannot be stored in rows and columns.
  Semi-structured data: log files are a common example of this type.
• Veracity: The term veracity refers to inconsistent or incomplete data, which results in doubtful or uncertain information. Inconsistency often arises from the sheer volume of data: data in bulk can create confusion, whereas too little data can convey only half or incomplete information.
• Value: After taking the four Vs above into account, there comes one more V, which stands for value. Bulk data with no value is of no good to a company unless it is turned into something useful; data in itself is of no use unless it is converted into something valuable from which information can be extracted. Hence value can be called the most important of the 5 Vs.
40. Distributed Computing Challenges
• Lack of performance and scalability.
• Lack of flexible resource management.
• Lack of application deployment support.
• Lack of quality of service.
• Lack of multiple data source support.
53. Hadoop ecosystem
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query-based processing of data
• HBase: NoSQL Database
• Mahout, Spark MLlib: Machine learning algorithm libraries
• Solr, Lucene: Searching and indexing
• Zookeeper: Cluster management
• Oozie: Job scheduling
55. HDFS
HDFS is the primary component of the Hadoop ecosystem. It is responsible for storing large data sets of structured or unstructured data across various nodes, and it maintains the metadata in the form of log files.
HDFS consists of two core components:
• NameNode
• DataNode
The NameNode is the prime node; it holds the metadata (data about data) and requires comparatively fewer resources than the DataNodes, which store the actual data. The DataNodes run on commodity hardware in the distributed environment, which makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and the hardware, and thus works at the heart of the system.
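As a hedged illustration of working with HDFS, the sketch below drives the standard `hdfs dfs` command-line tool from Python; it assumes a running cluster with the `hdfs` CLI on the PATH, and the paths and file name are illustrative.

```python
import subprocess

# Create a directory in HDFS, upload a local file, and list the result.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local.csv", "/user/demo/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
```

Behind these commands, the NameNode records the file's metadata while the blocks themselves land on DataNodes.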
56. Apache HBase
• HBase is a NoSQL database that supports all kinds of data, and it is thus capable of handling anything stored in a Hadoop database. It provides the capabilities of Google's BigTable, so it can work on big data sets effectively.
• At times we need to search for or retrieve a few small occurrences in a huge database, and the request must be processed within a short span of time. At such times HBase comes in handy, as it gives us a fault-tolerant way of storing such sparse data, as the sketch below illustrates.
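As an illustration only, the sketch below uses the third-party `happybase` client (an assumption, not part of HBase itself; it requires HBase's Thrift server to be running). The table and column names are invented.

```python
import happybase  # third-party HBase client: pip install happybase

# Connect to the HBase Thrift server (host is illustrative).
connection = happybase.Connection("localhost")
table = connection.table("users")  # assumes the table already exists

# HBase addresses cells by (row key, column-family:qualifier).
table.put(b"row-1", {b"info:name": b"Asha", b"info:city": b"Chennai"})
print(table.row(b"row-1"))  # fast point lookup by row key
connection.close()
```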
58. YARN
Yet Another Resource Negotiator, as the name implies, helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
YARN consists of three major components:
• Resource Manager
• Node Manager
• Application Manager
The Resource Manager has the privilege of allocating resources to the applications in the system, whereas the Node Managers work on the allocation of resources such as CPU, memory, and bandwidth on each machine and afterwards report back to the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations between the two as required.
59. MapReduce
By making use of distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps to write applications that transform big data sets into manageable ones.
MapReduce uses two functions, Map() and Reduce(), whose tasks are:
• Map() performs sorting and filtering of the data, thereby organizing it into groups. Map() generates key-value pair results, which are later processed by the Reduce() method (see the sketch after this list).
• Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In short, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
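To make Map() and Reduce() concrete, here is a minimal word-count sketch in the Hadoop Streaming style (a common way to run Python on Hadoop; the file name and invocation details are illustrative). The mapper emits (word, 1) pairs on stdout, and the reducer, whose input Hadoop sorts by key, sums the counts per word.

```python
# wordcount_streaming.py -- run with "map" or "reduce" as the argument.
import sys


def mapper(stream):
    # Map(): turn each input line into (word, 1) key-value pairs.
    for line in stream:
        for word in line.split():
            print(f"{word}\t1")


def reducer(stream):
    # Reduce(): input arrives sorted by key; sum the counts per word.
    current, total = None, 0
    for line in stream:
        word, count = line.rstrip("\n").split("\t")
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)(sys.stdin)
```

On a cluster, this script would be passed as the -mapper and -reducer of the Hadoop Streaming jar (its path varies by distribution); the shuffle between the two phases is what delivers sorted keys to the reducer.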
61. PIG
Pig was developed by Yahoo. It works with Pig Latin, a query-based language similar to SQL.
• It is a platform for structuring the data flow, and for processing and analyzing huge data sets.
• Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After processing, Pig stores the result in HDFS.
• The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
• Pig helps to achieve ease of programming and optimization, and hence is a major segment of the Hadoop ecosystem.
62. HIVE
• With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time and batch processing. All the SQL data types are supported by Hive, making query processing easier.
• Like other query-processing frameworks, HIVE comes with two components: JDBC drivers and the HIVE command line.
• The JDBC and ODBC drivers establish data-storage permissions and connections, whereas the HIVE command line helps in the processing of queries, as the sketch below shows.
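As a hedged sketch of querying Hive from Python, the snippet below uses the third-party PyHive package (an assumption, not part of Hive itself); the host, port, and table name are illustrative, and HiveServer2 must be running.

```python
from pyhive import hive  # third-party client: pip install pyhive

# Connect to HiveServer2 (host/port are placeholders).
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# HQL looks like SQL; Hive compiles it into jobs over data in HDFS.
cursor.execute("SELECT category, COUNT(*) FROM sales GROUP BY category")
for row in cursor.fetchall():
    print(row)
```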
63. Mahout
• Mahout brings machine learnability to a system or application. Machine learning, as the name suggests, helps a system to develop itself based on patterns, user/environmental interaction, or algorithms.
• It provides libraries for collaborative filtering, clustering, and classification, which are core concepts of machine learning. It allows algorithms to be invoked as needed with the help of its own libraries.
64. Apache Spark
• Spark is a platform that handles resource-intensive tasks such as batch processing, interactive or iterative real-time processing, graph conversions, and visualization.
• It uses in-memory resources, which makes it faster than the earlier disk-based approaches in terms of optimization.
• Spark is best suited for real-time data, whereas Hadoop is best suited for structured data and batch processing; hence both are used in most companies, often interchangeably.
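The sketch below shows Spark's in-memory style with a PySpark word count; it assumes `pyspark` is installed, and the input path is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read lines, split into words, and count each word in parallel.
lines = spark.sparkContext.textFile("hdfs:///user/demo/input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.cache()          # keep the result in memory for repeated reuse
print(counts.take(5))   # lazily triggers the computation
spark.stop()
```

Caching the RDD is what gives iterative and interactive workloads their speed advantage over disk-based MapReduce.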
66. Data Management
• Solr, Lucene: These two services perform searching and indexing with the help of Java libraries. Lucene is a Java library that also provides a spell-check mechanism; Solr is built on top of Lucene.
• Zookeeper: There used to be huge problems with coordinating and synchronizing the resources and components of Hadoop, which often resulted in inconsistency. Zookeeper overcame these problems by providing synchronization, inter-component communication, grouping, and maintenance.
• Oozie: Oozie performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs need to be executed in a sequential order, whereas Oozie coordinator jobs are triggered when some data or an external stimulus arrives.
68. Main components of Hadoop
• NameNode—controls the operation of the data jobs.
• DataNode—writes data in blocks to local storage and replicates data blocks to other DataNodes. DataNodes are also rack-aware: you would not want to replicate all your data to the same rack of servers, as an outage there would cause you to lose all your data.
• SecondaryNameNode—despite its name, it does not take over if the primary NameNode goes offline; it periodically checkpoints the NameNode's namespace image and edit log.
• JobTracker—sends MapReduce jobs to nodes in the cluster.
• TaskTracker—accepts tasks from the JobTracker.
69. Main components of Hadoop
• Yarn—runs the Yarn components ResourceManager and NodeManager. Yarn is a resource manager that can also run as a stand-alone component to give other applications the ability to run in a distributed architecture. For example, you can use Apache Spark with Yarn. You could also write your own program to use Yarn, but that is complicated.
• Client application—whatever program you have written, or some other client like Apache Pig. Apache Pig is an easy-to-use shell that takes SQL-like commands, translates them to Java MapReduce programs, and runs them on Hadoop.
• Application Master—runs shell commands in a container as directed by Yarn.
73. Cluster versus single node
• When you first install Hadoop, such as to learn it, it runs in single-node mode. But in production you would set it up to run in cluster mode, meaning data nodes are assigned to run on different machines. The whole set of machines is called the cluster. A Hadoop cluster can scale immensely to store petabytes of data.
80. MapReduce: two programs
Map means to take items, such as lines from a CSV file, and run an operation over every line, for example splitting it into a list of fields. Those become (key -> value) pairs.
Reduce groups these (key -> value) pairs by key and runs an operation over each group, for example concatenating the values into one string, or summing them to produce (key -> sum).
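The framework-free sketch below mirrors that description over a few in-memory CSV lines (the region,amount columns are invented): map each line into a (key, value) pair, group by key as the shuffle would, then reduce each group to a sum.

```python
from itertools import groupby
from operator import itemgetter

lines = ["east,10", "west,5", "east,7"]  # stand-in for a CSV file

# Map: split each line into fields and emit a (key, value) pair.
pairs = [(region, int(amount))
         for region, amount in (line.split(",") for line in lines)]

# Shuffle: group the pairs by key (Hadoop sorts between map and reduce).
pairs.sort(key=itemgetter(0))

# Reduce: sum the values for each key, yielding (key -> sum).
for region, group in groupby(pairs, key=itemgetter(0)):
    print(region, sum(value for _, value in group))
```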