SlideShare una empresa de Scribd logo
1 de 96
IT19741
Cloud and Big Data
Analytics
Dr G Geetha
Dean innovation and Professor CSE
Women Scientist, Qualified Patent agent
Rajalakshmi Engineering College
UNIT-III
INTRODUCTION
TO BIG DATA AND
HADOOP
Introduction to Big Data, Types of Digital Data,
Challenges of conventional systems - Web
data, Evolution of analytic processes and
tools, Analysis Vs reporting - Big Data
Analytics, Introduction to Hadoop -
Distributed Computing Challenges - History of
Hadoop, Hadoop Eco System - Use case of
Hadoop – Hadoop Distributors – HDFS –
Processing Data with Hadoop – Map Reduce.
Customer Challenges:The Data Deluge
The Economist, Feb 25, 2010
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
3
Introduction to Big Data
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
Questions from Businesses will Vary
Reporting,
Dashboards
What
happened?
Why did it
happen?
Forensics & Data
Mining
Real-Time
Analytics
What is
happening?
Why is it
happening?
Real-Time
Data Mining
Predictive
Analytics
What is likely to
happen?
What should I do
about it?
Prescriptive
Analytics
Past
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
6
Future
Types of Digital Data
Tradational Data Big Data
Traditional data is generated in enterprise level. Big data is generated outside the enterprise level.
Its volume ranges from Gigabytes to Terabytes. Its volume ranges from Petabytes to Zettabytes or Exabytes.
secondTraditional database system deals with structured data.
Big data system deals with structured, semi-
structured,database, and unstructured data.
Traditional data is generated per hour or per day or more. But big data is generated more frequently mainly per seconds.
Traditional data source is centralized and it is managed in
centralized form.
Big data source is distributed and it is managed in distributed
form.
Data integration is very easy. Data integration is very difficult.
Normal system configuration is capable to process traditional
data.
High system configuration is required to process big data.
The size of the data is very small. The size is more than the traditional data size.
Traditional data base tools are required to perform any data base
operation.
Special kind of data base tools are required to perform any
database schema-based operation.
Normal functions can manipulate data. Special kind of functions can manipulate data.
Its data model is strict schema based and it is static. Its data model is a flat schema based and it is dynamic.
Traditional data is stable and inter relationship. Big data is not stable and unknown relationship.
Traditional data is in manageable volume. Big data is in huge volume which becomes unmanageable.
It is easy to manage and manipulate the data. It is difficult to manage and manipulate the data.
Its data sources includes ERP transaction data, CRM transaction
data, financial data, organizational data, web transaction data
Its data sources includes social media, device data, sensor data,
Types of Digital Data
• Structured data: When data follows a pre-defined
schema/structure we say it is structured data. This is
the data which is in an organized form (e.g., in rows
and columns) and be easily used by a computer
program.
• Relationships exist between entities of data, such
as classes and their objects. About 10% data of an
organization is in this format.
• Data stored in databases is an example of
structured data.
Types of Digital Data
• Unstructured data: This is the data which does not conform to a data
model or is not in a form which can be used easily by a computer program.
About 80% data of an organization is in this format; for example, memos,
chat rooms, PowerPoint presentations, images, videos, letters. researches,
white papers, body of an email, etc.
Types of Digital Data
• Semi-structured data: Semi-structured data is also referred to as self
describing structure. This is the data which does not conform to a
data model but has some structure. However, it is not in a form
which can be used easily by a computer program. About 10% data of
an organization is in this format; for example, HTML, XML, JSON,
email data etc.
Big Data: Different than Business Intelligence
“TRADITIONAL BI”
Repetitive
Structured
Operational
GBs to 10s ofTBs
“BIG DATA A N ALYTIC S”
Experimental,Ad Hoc
Mostly Semi-Structured
External + Operational
10s ofTB to 100’
s of PB’
s
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
13
web data
• Data that is sourced and structured from websites is
referred to as "web data"
• In the world of big data, data comes from multiple sources and in
huge amount. In which one source is web itself. Web data extraction
is one of the medium of collecting data from this source i.e. web
W eb 2.0 is “Data-Driven”
“The future is here,it’s just not evenly distributed yet.”
W illiam Gibson
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
15
The World of Data-Driven Applications
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
16
Attributes of Big Data
Volume
Velocity Variety
Batch
N earTime
RealTime
Streams
Structured
Unstructured
Semistructured
Terabytes
Transactions
Tables
Records
Files
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
18
What is Big Data Analytics?
Big Data:
Batch Processing &
Distributed Data Store
Hadoop/Spark; HBase/Cassandra
BI Reporting
OLAP &
Dataware house
Business Objects, SAS,
Informatica, Cognos other SQL
Reporting Tools
Interactive Business
Intelligence &
In-memory RDBMS
QliqView, Tableau, HANA
Big Data:
Real Time &
Single View
Graph Databases
The Evolution of Business Intelligence
1990’s 2000’s 2010’s
Speed
Scale
Scale
Speed
Big Data Analytics
• Big data is more real-time in
nature than traditional DW
applications
• Traditional DW architectures
(e.g. Exadata, Teradata) are not
well-suited for big data apps
• Shared
parallel
nothing, massively
processing, scale out
architectures are well-suited for
big data apps
23
Big Data Technology
25
Analysis Vs reporting
Ten Common Big Data Problems
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
36
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. Ad targeting
5. PoS transaction analysis
6. Analyzing network data to
predict failure
7. Threat analysis
8. Trade surveillance
9. Search quality
10. Data “sandbox”
The Big Data Opportunity
Financial Services Healthcare
Retail Web/Social/Mobile
Manufacturing Government
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
37
Industries Are Embracing Big Data
Retail
•CRM – Customer Scoring
•Store Siting and Layout
•Fraud Detection /Prevention
•Supply Chain O ptimization
Advertising & Public Relations
•Demand Signaling
•Ad Targeting
•Sentiment Analysis
•Customer Acquisition
Financial Services
•Algorithmic Trading
•Risk Analysis
•Fraud Detection
•Portfolio Analysis
Media & Telecommunications
•Network Optimization
•Customer Scoring
•Churn Prevention
•Fraud Prevention
Manufacturing
•Product Research
•Engineering Analytics
•Process & Q uality Analysis
•Distribution Optimization
Energy
•Smart Grid
•Exploration
Government
•Market Governance
•Counter-Terrorism
•Econometrics
•Health Informatics
Healthcare & Life Sciences
•Pharmaco-Genomics
•Bio-Informatics
•Pharmaceutical Research
•Clinical Outcomes Research
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
38
5Vs” of Big Data
• Volume: With increasing dependence on technology, data is producing at a large volume.
Common examples are data being produced by various social networking sites, sensors,
scanners, airlines and other organizations.
• Velocity: Huge amount of data is generated per second. It is estimated that by the end of
2020, every individual will produce 3mb data per second. This large volume of data is being
generated with a great velocity.
• Variety: The data being produced by different means is of three types:
Structured Data: It is the relational data which is stored in the form of rows and columns.
Unstructured Data: Texts, pictures, videos etc. are the examples of unstructured data
which can’t be stored in the form of rows and columns.
Semi Structured Data: Log files are the examples of this type of data.
• Veracity: The term Veracity is coined for the inconsistent or incomplete data which results in
the generation of doubtful or uncertain Information. Often data inconsistency arises because
of the volume or amount of data e.g. data in bulk could create confusion whereas less
amount of data could convey half or incomplete Information.
• Value: After having the 4 V’s into account there comes one more V which stands for Value!.
Bulk of Data having no Value is of no good to the company, unless you turn it into something
useful. Data in itself is of no use or importance but it needs to be converted into something
valuable to extract Information. Hence, you can state that Value! is the most important V of
all the 5V’s
Distributed Computing
Challenges
• Lack of performance and scalability.
• Lack of flexible resource management.
• Lack of application deployment support.
• Lack of quality of service.
• Lack of multiple data source support.
Storage & Memory B/W lagging CPU
CPU B/W requirements out-pacing memory and storage
Disk & memory getting “further” away from CPU
Large sequential transfers better for both memory & disk
CPU DRAM LAN Disk
Annual bandwidth improvement (all
milestones)
1.5 1.27 1.39 1.28
Annual latency improvement (all milestones) 1.17 1.07 1.12 1.11
MemoryW all Storage Chasm
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
41
Commodity Hardware Economics
For $1000
One computer can
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
42
Process
~32GB
Store
~15TB
99.9%
Of data is Underutilized
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
Enterprise + Big Data = Big Opportunity
16
Why Hadoop?
Big Data analytics and the Apache Hadoop open source
project are rapidly emerging as the preferred solution to
address business and technology trends that are disrupting
traditional data management and processing.
Enterprises can gain a competitive advantage by being
early adopters of big data analytics.
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
44
Evolution of Hadoop
Introduction to Analytics and Big Data - Hadoop
Hadoop Adoption
HDFS
MapReduce
Ecosystem Projects
© 2014 Storage Networking Industry Association. All Rights Reserved.
46
Introduction to Analytics and Big Data - Hadoop
Hadoop Adoption in the Industry
2007 2008 2009 2010
The DatagraphBlog
Source: Hadoop Summit Presentations
© 2014 Storage Networking Industry Association. All Rights Reserved.
47
Use case of Hadoop
Hadoop Distributors
What is Hadoop?
A scalable fault-tolerant distributed system for data storage
and processing
Core Hadoop has two main components
Hadoop Distributed File System (HDFS): self-healing, high-
bandwidth clustered storage
Reliable, redundant, distributed file system optimized for large files
MapReduce: fault-tolerant distributed processing
Programming model for processing sets of data
Mapping inputs to outputs and reducing the output of multiple Mappers to one (or
a few) answer(s)
Operates on unstructured and structured data
A large and active ecosystem
Open source under the friendly Apache License
http://wiki.apache.org/hadoop/
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
51
Hadoop ecosystem
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solar, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is
responsible for storing large data sets of structured or unstructured data
across various nodes and thereby maintaining the metadata in the form of
log files.
HDFS consists of two core components i.e.
• Name node
• Data Node
Name Node is the prime node which contains metadata (data about data)
requiring comparatively fewer resources than the data nodes that stores the
actual data. These data nodes are commodity hardware in the distributed
environment. Undoubtedly, making Hadoop cost effective.
HDFS maintains all the coordination between the clusters and hardware,
thus working at the heart of the system.
Apache HBase
• It’s a NoSQL database which supports all kinds of data and thus
capable of handling anything of Hadoop Database. It provides
capabilities of Google’s BigTable, thus able to work on Big Data sets
effectively.
• At times where we need to search or retrieve the occurrences of
something small in a huge database, the request must be processed
within a short quick span of time. At such times, HBase comes handy
as it gives us a tolerant way of storing limited data
YARN
Yet Another Resource Negotiator, as the name implies, YARN is the one
who helps to manage the resources across the clusters. In short, it
performs scheduling and resource allocation for the Hadoop System.
Consists of three major components i.e.
• Resource Manager
• Nodes Manager
• Application Manager
Resource manager has the privilege of allocating resources for the
applications in a system whereas Node managers work on the
allocation of resources such as CPU, memory, bandwidth per machine
and later on acknowledges the resource manager.
Application manager works as an interface between the resource
manager and node manager and performs negotiations as per the
requirement of the two.
MapReduce
By making the use of distributed and parallel algorithms, MapReduce
makes it possible to carry over the processing’s logic and helps to write
applications which transform big data sets into a manageable one.
MapReduce makes the use of two functions i.e. Map() and Reduce()
whose task is:
• Map() performs sorting and filtering of data and thereby organizing
them in the form of group. Map generates a key-value pair based
result which is lateron processed by the Reduce() method.
• Reduce(), as the name suggests does the summarization by
aggregating the mapped data. In simple, Reduce() takes the output
generated by Map() as input and combines those tuples into smaller
set of tuples.
PIG
Pig was basically developed by Yahoo which works on a pig Latin
language, which is Query based language similar to SQL.
• It is a platform for structuring the data flow, processing and
analyzing huge data sets.
• Pig does the work of executing commands and in the
background, all the activities of MapReduce are taken care of.
After the processing, pig stores the result in HDFS.
• Pig Latin language is specially designed for this framework
which runs on Pig Runtime. Just the way Java runs on the JVM.
• Pig helps to achieve ease of programming and optimization
and hence is a major segment of the Hadoop Ecosystem.
HIVE
• With the help of SQL methodology and interface, HIVE performs
reading and writing of large data sets. However, its query language is
called as HQL (Hive Query Language).
• It is highly scalable as it allows real-time processing and batch
processing both. Also, all the SQL datatypes are supported by Hive
thus, making the query processing easier.
• Similar to the Query Processing frameworks, HIVE too comes with
two components: JDBC Drivers and HIVE Command Line.
• JDBC, along with ODBC drivers work on establishing the data storage
permissions and connection whereas HIVE Command line helps in
the processing of queries.
Mahout
• Mahout, allows Machine Learnability to a system or application.
Machine Learning, as the name suggests helps the system to develop
itself based on some patterns, user/environmental interaction or on
the basis of algorithms.
• It provides various libraries or functionalities such as collaborative
filtering, clustering, and classification which are nothing but concepts
of Machine learning. It allows invoking algorithms as per our need
with the help of its own libraries.
Apache Spark
• It’s a platform that handles all the process consumptive tasks like
batch processing, interactive or iterative real-time processing, graph
conversions, and visualization, etc.
• It consumes in memory resources hence, thus being faster than the
prior in terms of optimization.
• Spark is best suited for real-time data whereas Hadoop is best suited
for structured data or batch processing, hence both are used in most
of the companies interchangeably.
Data Management
• Solr, Lucene: These are the two services that perform the task of
searching and indexing with the help of some java libraries, especially
Lucene is based on Java which allows spell check mechanism, as well.
However, Lucene is driven by Solr.
• Zookeeper: There was a huge issue of management of coordination
and synchronization among the resources or the components of
Hadoop which resulted in inconsistency, often. Zookeeper overcame
all the problems by performing synchronization, inter-component
based communication, grouping, and maintenance.
• Oozie: Oozie simply performs the task of a scheduler, thus scheduling
jobs and binding them together as a single unit. There is two kinds of
jobs .i.e Oozie workflow and Oozie coordinator jobs. Oozie workflow
is the jobs that need to be executed in a sequentially ordered manner
whereas Oozie Coordinator jobs are those that are triggered when
some data or external stimulus is given to it.
High Level Hadoop Architecture
Main components of Hadoop
• Namenode—controls operation of the data jobs.
• Datanode—this writes data in blocks to local storage. And it replicates data
blocks to other datanodes. DataNodes are also rack-aware. You would not
want to replicate all your data to the same rack of servers as an outage
there would cause you to loose all your data.
• SecondaryNameNode—this one take over if the primary Namenode goes
offline.
• JobTracker—sends MapReduce jobs to nodes in the cluster.
• TaskTracker—accepts tasks from the Job Tracker.
Main components of Hadoop
• Yarn—runs the Yarn components ResourceManager and NodeManager.
This is a resource manager that can also run as a stand-alone
component to provide other applications the ability to run in a
distributed architecture. For example you can use Apache Spark with
Yarn. You could also write your own program to use Yarn. But that is
complicated.
• Client Application—this is whatever program you have written or some
other client like Apache Pig. Apache Pig is an easy-to-use shell that
takes SQL-like commands and translates them to Java MapReduce
programs and runs them on Hadoop.
• Application Master—runs shell commands in a container as directed by
Yarn.
HDFS Concepts
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
70
 Sits on top of a native (ext3, xfs, etc..) file system
 Performs best with a ‘modest’ number of large files
 Files in HDFS are ‘write once’
 HDFS is optimized for large, streaming reads of files
H D FS
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
71
 Hadoop Distributed File System
– Data is organized into files & directories
– Files are divided into blocks, distributed across cluster
nodes
– Block placement known at runtime by map-reduce =
computation co-located with data
– Blocks replicated to handle failure
– Checksums used to ensure data integrity
 Replication: one and only strategy for error
handling, recovery and fault tolerance
– Self Healing
– Make multiple copies
Hadoop Server Roles
Slave
Task
T
racker
Data
Node
Slave
T
askT
racker
Data
Node
Slave
T
askT
racker
Data
Node
Master
Name
Node
Master
Secondary
Node
Job
T
racker
Client Client Client Client Client Client Client Client
Slave
Task
T
racker
Data
Node
Slave
T
askT
racker
Data
Node
Slave
Task
T
racker
Data
Node
Up to 4K
Nodes
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
72
Cluster versus single node
•When you first install Hadoop, such as to learn it, it
runs in single node. But in production you would set
it up to run in cluster node, meaning assign data
nodes to run on different machines. The whole set
of machines is called the cluster. A Hadoop cluster
can scale immensely to store petabytes of data.
Hadoop Cluster
DN,TT
Up to 4K
Nodes
DN,TT
DN,TT
DN,TT
DN,TT
DN,TT
NN
1GbE/10GbE
DN,TT
DN,TT
DN,TT
DN,TT
DN,TT
DN,TT
JT
1GbE/10GbE
DN,TT
DN,TT
DN,TT
DN,TT
DN,TT
DN,TT
SNN
1GbE/10GbE
DN,TT
DN,TT
DN,TT
DN,TT
DN,TT
DN,TT
1GbE/10GbE
CORE SWITCH
DN,TT
CORE SWITCH Client
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
74
HDFS File Write Operation
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
75
HDFS File Read Operation
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
76
Functional Programming meets
Distributed Processing
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
77
What is MapReduce?
A method for distributing a task across multiple nodes
Each node processes data stored on that node
Consists of two developer-created phases
1. Map
2. Reduce
In between Map and Reduce is the Shuffle and Sort
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
78
MapReduce Provides
Automatic parallelization and distribution
Fault Tolerance
Status and Monitoring Tools
A clean abstraction for programmers
Google Technology RoundTable: MapReduce
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
79
MapReduce -two programs
Map means to take items like a string from a csv file
and run an operation over every line in the file, like to
split it into a list of fields. Those become (key->value)
pairs.
Reduce groups these (key->value) pairs and runs an
operation to, for example, concatenate them into one
string or sum them like (key->sum)
Key MapReduce Terminology Concepts
A user runs a client program on a client computer
The client program submits a job to Hadoop
The job is sent to the JobTracker process on the Master Node
Each Slave Node runs a process called the TaskTracker
The JobTracker instructs TaskTrackers to run and monitor tasks
A task attempt is an instance of a task running on a slave node
There will be at least as many task attempts as there are tasks which
need to be performed
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
81
MapReduce: Basic Concepts
Each Mapper processes single input split from HDFS
Hadoop passes developer’s Map code one record at a time
Each record has a key and a value
Intermediate data written by the Mapper to local disk
During shuffle and sort phase, all values associated with
same intermediate key are transferred to same Reducer
Reducer is passed each key and a list of all its values
Output from Reducers is written to HDFS
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
82
MapReduce Operation
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
83
W hat was the max/
min temperature for the last century?
Sample Dataset
The requirement:
you need to find out grouped by type of customer how many of
each type are in each country with the name of the country listed in
the countries.dat in the final result (and not the 2 digit country
name). Each record has a key and a value
To do this you need to:
Join the data sets
Key on country
Count type of customer per country
Output the results
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
84
MapReduce Paradigm
Input Map Shuffle and Sort Reduce Output
Map
Reduce
cat grep sort uniq output
Map
Map
Reduce
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
85
MapReduce Example
Problem: Count the number of times that each word appears in the following paragraph:
J
ohn has a red car
,which has no radio.Mary has a red
bicycle.Bill has no car or bicycle.
Map
Server 1: John has a red car, which has no radio.
John: 1
has: 2
a: 1
red: 1
car: 1
which: 1
no: 1
radio: 1
Server 2: Mary has a red bicycle.
Mary: 1
has: 1
a: 1
red: 1
bicycle: 1
Server 3: Bill has no car or bicycle.
Bill: 1
has: 1
no: 1
car: 1
or: 1
biclycle:1
Reduce
John:
1
has 2
has: 1
has: 1
a: 1
a: 1
red: 1
red: 1
car: 1
car: 1
which: 1
no: 1
no: 1
radio: 1
Mary: 1
bicycle: 1
bicycle: 1
Bill: 1
or: 1
John: 1
has 4
a: 2
red: 2
car: 2
which: 1
no: 2
radio: 1
Mary: 1
bicycle: 2
Bill: 1
or: 1
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
86
Server 1 Server 2 Server 3 Server 1 Server 2 Server 3
Putting it all Together: MapReduce and HDFS
Task Tracker
Task Tracker
Task Tracker
Job Tracker
Hadoop Distributed File System (HDFS)
Large Data Set
(Log files,Sensor Data)
Reduce Job
Map Job
Reduce Job
Map Job
Reduce Job
Map Job
Reduce Job
Map Job
Reduce Job
Client/Dev
2
Map Job
1
3
4
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
87
• Hadoop is a‘top-level’Apache project
• Created and managed under the auspices of theApache Software Foundation
• Several other projects exist that rely on some or all of Hadoop
• Typically either both HDFS and MapReduce,or just HDFS
• Ecosystem Projects Include
• Hive
• Pig
• HBase
• Many more… ..
Hadoop Ecosystem Projects
Introductih
on
tttp
o :
A
/n
/h
aly
a
tid
cs
oa
o
nd
pB
.a
igp
D
a
ac
tah
-e
H.
ao
do
ro
g
p
/
© 2014 Storage Networking Industry Association. All Rights Reserved. 88
Hadoop, SQL & MPP Systems
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
89
Hadoop Traditional SQL
Systems
MPP Systems
Scale-Out Scale-Up Scale-Out
Key/Value Pairs Relational Tables Relational Tables
Functional
Programming
Declarative Queries Declarative Queries
Offline Batch
Processing
Online Transactions Online Transactions
Comparing RDBMS and MapReduce
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
90
Traditional RDBMS MapReduce
Data Size Gigabytes (Terabytes) Petabytes (Exabytes)
Access Interactive and Batch Batch
Updates Read / Write many times Write once, Read many times
Structure Static Schema Dynamic Schema
Integrity High (ACID) Low
Scaling Nonlinear Linear
DBA Ratio 1:40 1:3000
Reference: Tom White’s Hadoop: The Definitive Guide
Diagnostics and Customer Churn
Issues
W hat make and model systems are deployed?
Are certain set top boxes in need of replacement based on system
diagnostic data?
Is the acorrelation between make,model or vintageof set top box and
customer churn?
W hat are the most expensive boxes to maintain?
Which systems should we pro-actively replace to keep customers happy?
Big Data Solution
Collect unstructured data from set top boxes—multiple terabytes
Analyze system data in Hadoop in near real time
Pull data in to Hive for interactive query and modeling
Analytics with Hadoop increases customer satisfaction
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
91
Pay PerView Advertising
Issues
Fixed inventory of ad space is provided by national content providers. For
example, 100 ads offered to provider for 1 month of programming
Provider can use this space to advertise its products and services, such as
pay per view
Do we advertise “The Longest Yard” in the middle of afootball gameor in
the middle of a romantic comedy?
10% increase in pay per view movie rentals = $10M in incremental revenue
• Big Data Solution
Collect programming data and viewer rental data in a large data repository
Develop models to correlate proclivity to rent to programming format
Find the most productive time slots and programs to advertise pay per
view inventory
Improve ad placement and pay-per-view conversion with Hadoop
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
92
Risk Modeling
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
93
 Risk Modeling
– Bank had customer data across multiple lines of business and needed to
develop a better risk picture of its customers. i.e,ifdirect deposits stop
coming into checking acct, it’s likely that customer lost his/her job, which
impacts creditworthiness for other products (CC, mortgage, etc.)
– Data existing in silos across multiple LOB’s and acquired bank systems
– Data size approached 1 petabyte
 W hy do this in Hadoop?
– Ability to cost-effectively integrate + 1 PB of data from multiple data
sources: data warehouse,call center
, chat and email
– Platform for more analysis with poly-structured data sources; i.e.,
combining bank data with credit bureau data;Twitter, etc.
– Offload intensive computation from DW
Sentiment Analysis
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
94
 Sentiment Analysis
– Hadoop used frequently to monitor what customers think of company’s
products or services
– Data loaded from social media sources (Twitter, blogs, Facebook, emails,
chats,etc.) into Hadoop cluster
– Map/Reduce jobs run continuously to identifysentiment (i.e.,Acme
Company’s rates are “outrageous” or “rip off”)
– Negative/positive comments can be acted upon (special offer
, coupon, etc.)
 W hy Hadoop
– Social media/
web data is unstructured
– Amount ofdata is immense
– New data sources arise weekly
tutorial is by th SNIA
Resources: The Big Data Conversation
World Economic Forum: “Personal Data: The Emergence of a New
Asset Class” 2011
McKinsey Global Institute: Big Data: The next frontier for innovation,
competition, and productivity
Big Data: Harnessing a game-changing asset
IDC: 2011 Digital Universe Study: Extracting Value from Chaos
The Economist: Data, Data Everywhere
Data Science Revealed: A Data-Driven Glimpse into the Burgeoning
New Field
O’Reilly – What is Data Science?
O’Reilly – Building Data Science Teams?
O’Reilly – Data for the public good
Obama Administration “Big Data Research and Development Initiative.”
Introduction to Analytics and Big Data - Hadoop
© 2014 Storage Networking Industry Association. All Rights Reserved.
96

Más contenido relacionado

La actualidad más candente

Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaEdureka!
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsDataWorks Summit
 
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | EdurekaMapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | EdurekaEdureka!
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology LandscapeShivanandaVSeeri
 
What is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | EdurekaWhat is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | EdurekaEdureka!
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Introduction to HADOOP.pdf
Introduction to HADOOP.pdfIntroduction to HADOOP.pdf
Introduction to HADOOP.pdf8840VinayShelke
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Simplilearn
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Simplilearn
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop IntroductionJayant Mukherjee
 

La actualidad más candente (20)

Hadoop
Hadoop Hadoop
Hadoop
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
 
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | EdurekaMapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
MapReduce Tutorial | What is MapReduce | Hadoop MapReduce Tutorial | Edureka
 
Temporal database
Temporal databaseTemporal database
Temporal database
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
 
What is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | EdurekaWhat is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | Edureka
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Introduction to HADOOP.pdf
Introduction to HADOOP.pdfIntroduction to HADOOP.pdf
Introduction to HADOOP.pdf
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 

Similar a INTRODUCTION TO BIG DATA AND HADOOP

Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupBig Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupScott Mitchell
 
Big data analytics - Introduction to Big Data and Hadoop
Big data analytics - Introduction to Big Data and HadoopBig data analytics - Introduction to Big Data and Hadoop
Big data analytics - Introduction to Big Data and HadoopSamiraChandan
 
Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfrajsharma159890
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond Rajesh Kumar
 
Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.Aditya205306
 
Oh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG DataOh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG DataPrakalp Agarwal
 
Big Data Analytics Materials, Chapter: 1
Big Data Analytics Materials, Chapter: 1Big Data Analytics Materials, Chapter: 1
Big Data Analytics Materials, Chapter: 1RUHULAMINHAZARIKA
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...BigMine
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptalmaraniabwmalk
 
Big data seminor
Big data seminorBig data seminor
Big data seminorberasrujana
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notesMohit Saini
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big dataRaul Chong
 
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...Experfy
 
Big Data Fundamentals
Big Data FundamentalsBig Data Fundamentals
Big Data FundamentalsSmarak Das
 
Big data peresintaion
Big data peresintaion Big data peresintaion
Big data peresintaion ahmed alshikh
 

Similar a INTRODUCTION TO BIG DATA AND HADOOP (20)

Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupBig Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
 
Big data analytics - Introduction to Big Data and Hadoop
Big data analytics - Introduction to Big Data and HadoopBig data analytics - Introduction to Big Data and Hadoop
Big data analytics - Introduction to Big Data and Hadoop
 
Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdf
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond
 
Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.
 
Oh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG DataOh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG Data
 
Big Data Analytics Materials, Chapter: 1
Big Data Analytics Materials, Chapter: 1Big Data Analytics Materials, Chapter: 1
Big Data Analytics Materials, Chapter: 1
 
big_data.ppt
big_data.pptbig_data.ppt
big_data.ppt
 
big_data.ppt
big_data.pptbig_data.ppt
big_data.ppt
 
big_data.ppt
big_data.pptbig_data.ppt
big_data.ppt
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
Big data seminor
Big data seminorBig data seminor
Big data seminor
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notes
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big data
 
Intro big data analytics
Intro big data analyticsIntro big data analytics
Intro big data analytics
 
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
March Towards Big Data - Big Data Implementation, Migration, Ingestion, Manag...
 
Big Data Fundamentals
Big Data FundamentalsBig Data Fundamentals
Big Data Fundamentals
 
Big data peresintaion
Big data peresintaion Big data peresintaion
Big data peresintaion
 

Más de Dr Geetha Mohan

CLOUD ARCHITECTURE AND SERVICES.pptx
CLOUD ARCHITECTURE AND SERVICES.pptxCLOUD ARCHITECTURE AND SERVICES.pptx
CLOUD ARCHITECTURE AND SERVICES.pptxDr Geetha Mohan
 
CLOUD ENABLING TECHNOLOGIES.pptx
 CLOUD ENABLING TECHNOLOGIES.pptx CLOUD ENABLING TECHNOLOGIES.pptx
CLOUD ENABLING TECHNOLOGIES.pptxDr Geetha Mohan
 
IPR in Academic Research:Dr G Geetha
IPR in Academic Research:Dr G GeethaIPR in Academic Research:Dr G Geetha
IPR in Academic Research:Dr G GeethaDr Geetha Mohan
 
Design and analysis of algorithms
Design and analysis of algorithmsDesign and analysis of algorithms
Design and analysis of algorithmsDr Geetha Mohan
 
Resource management techniques
Resource management techniquesResource management techniques
Resource management techniquesDr Geetha Mohan
 
Ge6075 professional ethics in engineering unit 1
Ge6075 professional ethics in engineering  unit 1Ge6075 professional ethics in engineering  unit 1
Ge6075 professional ethics in engineering unit 1Dr Geetha Mohan
 
Cp7101 design and management of computer networks-flow analysis
Cp7101 design and management of computer networks-flow analysisCp7101 design and management of computer networks-flow analysis
Cp7101 design and management of computer networks-flow analysisDr Geetha Mohan
 
Cp7101 design and management of computer networks-requirements analysis 2
Cp7101 design and management of computer networks-requirements analysis 2 Cp7101 design and management of computer networks-requirements analysis 2
Cp7101 design and management of computer networks-requirements analysis 2 Dr Geetha Mohan
 
Cp7101 design and management of computer networks-requirements analysis
Cp7101 design and management of computer networks-requirements analysisCp7101 design and management of computer networks-requirements analysis
Cp7101 design and management of computer networks-requirements analysisDr Geetha Mohan
 
Cp7101 design and management of computer networks-design concepts
Cp7101 design and management of computer networks-design conceptsCp7101 design and management of computer networks-design concepts
Cp7101 design and management of computer networks-design conceptsDr Geetha Mohan
 
Cp7101 design and management of computer networks -network
Cp7101 design and management of computer networks -networkCp7101 design and management of computer networks -network
Cp7101 design and management of computer networks -networkDr Geetha Mohan
 

Más de Dr Geetha Mohan (13)

CLOUD ARCHITECTURE AND SERVICES.pptx
CLOUD ARCHITECTURE AND SERVICES.pptxCLOUD ARCHITECTURE AND SERVICES.pptx
CLOUD ARCHITECTURE AND SERVICES.pptx
 
CLOUD ENABLING TECHNOLOGIES.pptx
 CLOUD ENABLING TECHNOLOGIES.pptx CLOUD ENABLING TECHNOLOGIES.pptx
CLOUD ENABLING TECHNOLOGIES.pptx
 
IPR in Academic Research:Dr G Geetha
IPR in Academic Research:Dr G GeethaIPR in Academic Research:Dr G Geetha
IPR in Academic Research:Dr G Geetha
 
How to file a patent
How to file a patentHow to file a patent
How to file a patent
 
Design and analysis of algorithms
Design and analysis of algorithmsDesign and analysis of algorithms
Design and analysis of algorithms
 
Machine learning
Machine learningMachine learning
Machine learning
 
Resource management techniques
Resource management techniquesResource management techniques
Resource management techniques
 
Ge6075 professional ethics in engineering unit 1
Ge6075 professional ethics in engineering  unit 1Ge6075 professional ethics in engineering  unit 1
Ge6075 professional ethics in engineering unit 1
 
Cp7101 design and management of computer networks-flow analysis
Cp7101 design and management of computer networks-flow analysisCp7101 design and management of computer networks-flow analysis
Cp7101 design and management of computer networks-flow analysis
 
Cp7101 design and management of computer networks-requirements analysis 2
Cp7101 design and management of computer networks-requirements analysis 2 Cp7101 design and management of computer networks-requirements analysis 2
Cp7101 design and management of computer networks-requirements analysis 2
 
Cp7101 design and management of computer networks-requirements analysis
Cp7101 design and management of computer networks-requirements analysisCp7101 design and management of computer networks-requirements analysis
Cp7101 design and management of computer networks-requirements analysis
 
Cp7101 design and management of computer networks-design concepts
Cp7101 design and management of computer networks-design conceptsCp7101 design and management of computer networks-design concepts
Cp7101 design and management of computer networks-design concepts
 
Cp7101 design and management of computer networks -network
Cp7101 design and management of computer networks -networkCp7101 design and management of computer networks -network
Cp7101 design and management of computer networks -network
 

Último

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoordharasingh5698
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptDineshKumar4165
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfRagavanV2
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 

Último (20)

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 

INTRODUCTION TO BIG DATA AND HADOOP

  • 1. IT19741 Cloud and Big Data Analytics Dr G Geetha Dean innovation and Professor CSE Women Scientist, Qualified Patent agent Rajalakshmi Engineering College
  • 2. UNIT-III INTRODUCTION TO BIG DATA AND HADOOP Introduction to Big Data, Types of Digital Data, Challenges of conventional systems - Web data, Evolution of analytic processes and tools, Analysis Vs reporting - Big Data Analytics, Introduction to Hadoop - Distributed Computing Challenges - History of Hadoop, Hadoop Eco System - Use case of Hadoop – Hadoop Distributors – HDFS – Processing Data with Hadoop – Map Reduce.
  • 3. Customer Challenges:The Data Deluge The Economist, Feb 25, 2010 Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 3
  • 4. Introduction to Big Data Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
  • 5.
  • 6. Questions from Businesses will Vary Reporting, Dashboards What happened? Why did it happen? Forensics & Data Mining Real-Time Analytics What is happening? Why is it happening? Real-Time Data Mining Predictive Analytics What is likely to happen? What should I do about it? Prescriptive Analytics Past Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 6 Future
  • 8. Tradational Data Big Data Traditional data is generated in enterprise level. Big data is generated outside the enterprise level. Its volume ranges from Gigabytes to Terabytes. Its volume ranges from Petabytes to Zettabytes or Exabytes. secondTraditional database system deals with structured data. Big data system deals with structured, semi- structured,database, and unstructured data. Traditional data is generated per hour or per day or more. But big data is generated more frequently mainly per seconds. Traditional data source is centralized and it is managed in centralized form. Big data source is distributed and it is managed in distributed form. Data integration is very easy. Data integration is very difficult. Normal system configuration is capable to process traditional data. High system configuration is required to process big data. The size of the data is very small. The size is more than the traditional data size. Traditional data base tools are required to perform any data base operation. Special kind of data base tools are required to perform any database schema-based operation. Normal functions can manipulate data. Special kind of functions can manipulate data. Its data model is strict schema based and it is static. Its data model is a flat schema based and it is dynamic. Traditional data is stable and inter relationship. Big data is not stable and unknown relationship. Traditional data is in manageable volume. Big data is in huge volume which becomes unmanageable. It is easy to manage and manipulate the data. It is difficult to manage and manipulate the data. Its data sources includes ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data Its data sources includes social media, device data, sensor data,
  • 9. Types of Digital Data • Structured data: When data follows a pre-defined schema/structure we say it is structured data. This is the data which is in an organized form (e.g., in rows and columns) and be easily used by a computer program. • Relationships exist between entities of data, such as classes and their objects. About 10% data of an organization is in this format. • Data stored in databases is an example of structured data.
  • 10. Types of Digital Data • Unstructured data: This is the data which does not conform to a data model or is not in a form which can be used easily by a computer program. About 80% data of an organization is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters. researches, white papers, body of an email, etc.
  • 11.
  • 12. Types of Digital Data • Semi-structured data: Semi-structured data is also referred to as self describing structure. This is the data which does not conform to a data model but has some structure. However, it is not in a form which can be used easily by a computer program. About 10% data of an organization is in this format; for example, HTML, XML, JSON, email data etc.
  • 13. Big Data: Different than Business Intelligence “TRADITIONAL BI” Repetitive Structured Operational GBs to 10s ofTBs “BIG DATA A N ALYTIC S” Experimental,Ad Hoc Mostly Semi-Structured External + Operational 10s ofTB to 100’ s of PB’ s Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 13
  • 14. web data • Data that is sourced and structured from websites is referred to as "web data" • In the world of big data, data comes from multiple sources and in huge amount. In which one source is web itself. Web data extraction is one of the medium of collecting data from this source i.e. web
  • 15. W eb 2.0 is “Data-Driven” “The future is here,it’s just not evenly distributed yet.” W illiam Gibson Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 15
  • 16. The World of Data-Driven Applications Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 16
  • 17.
  • 18. Attributes of Big Data Volume Velocity Variety Batch N earTime RealTime Streams Structured Unstructured Semistructured Terabytes Transactions Tables Records Files Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 18
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28. What is Big Data Analytics?
  • 29.
  • 30. Big Data: Batch Processing & Distributed Data Store Hadoop/Spark; HBase/Cassandra BI Reporting OLAP & Dataware house Business Objects, SAS, Informatica, Cognos other SQL Reporting Tools Interactive Business Intelligence & In-memory RDBMS QliqView, Tableau, HANA Big Data: Real Time & Single View Graph Databases The Evolution of Business Intelligence 1990’s 2000’s 2010’s Speed Scale Scale Speed
  • 31. Big Data Analytics • Big data is more real-time in nature than traditional DW applications • Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps • Shared parallel nothing, massively processing, scale out architectures are well-suited for big data apps 23
  • 32.
  • 33.
  • 36. Ten Common Big Data Problems Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 36 1. Modeling true risk 2. Customer churn analysis 3. Recommendation engine 4. Ad targeting 5. PoS transaction analysis 6. Analyzing network data to predict failure 7. Threat analysis 8. Trade surveillance 9. Search quality 10. Data “sandbox”
  • 37. The Big Data Opportunity Financial Services Healthcare Retail Web/Social/Mobile Manufacturing Government Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 37
  • 38. Industries Are Embracing Big Data Retail •CRM – Customer Scoring •Store Siting and Layout •Fraud Detection /Prevention •Supply Chain O ptimization Advertising & Public Relations •Demand Signaling •Ad Targeting •Sentiment Analysis •Customer Acquisition Financial Services •Algorithmic Trading •Risk Analysis •Fraud Detection •Portfolio Analysis Media & Telecommunications •Network Optimization •Customer Scoring •Churn Prevention •Fraud Prevention Manufacturing •Product Research •Engineering Analytics •Process & Q uality Analysis •Distribution Optimization Energy •Smart Grid •Exploration Government •Market Governance •Counter-Terrorism •Econometrics •Health Informatics Healthcare & Life Sciences •Pharmaco-Genomics •Bio-Informatics •Pharmaceutical Research •Clinical Outcomes Research Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 38
  • 39. 5Vs” of Big Data • Volume: With increasing dependence on technology, data is producing at a large volume. Common examples are data being produced by various social networking sites, sensors, scanners, airlines and other organizations. • Velocity: Huge amount of data is generated per second. It is estimated that by the end of 2020, every individual will produce 3mb data per second. This large volume of data is being generated with a great velocity. • Variety: The data being produced by different means is of three types: Structured Data: It is the relational data which is stored in the form of rows and columns. Unstructured Data: Texts, pictures, videos etc. are the examples of unstructured data which can’t be stored in the form of rows and columns. Semi Structured Data: Log files are the examples of this type of data. • Veracity: The term Veracity is coined for the inconsistent or incomplete data which results in the generation of doubtful or uncertain Information. Often data inconsistency arises because of the volume or amount of data e.g. data in bulk could create confusion whereas less amount of data could convey half or incomplete Information. • Value: After having the 4 V’s into account there comes one more V which stands for Value!. Bulk of Data having no Value is of no good to the company, unless you turn it into something useful. Data in itself is of no use or importance but it needs to be converted into something valuable to extract Information. Hence, you can state that Value! is the most important V of all the 5V’s
  • 40. Distributed Computing Challenges • Lack of performance and scalability. • Lack of flexible resource management. • Lack of application deployment support. • Lack of quality of service. • Lack of multiple data source support.
  • 41. Storage & Memory B/W lagging CPU CPU B/W requirements out-pacing memory and storage Disk & memory getting “further” away from CPU Large sequential transfers better for both memory & disk CPU DRAM LAN Disk Annual bandwidth improvement (all milestones) 1.5 1.27 1.39 1.28 Annual latency improvement (all milestones) 1.17 1.07 1.12 1.11 MemoryW all Storage Chasm Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 41
  • 42. Commodity Hardware Economics For $1000 One computer can Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 42 Process ~32GB Store ~15TB 99.9% Of data is Underutilized
  • 43. Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. Enterprise + Big Data = Big Opportunity 16
  • 44. Why Hadoop? Big Data analytics and the Apache Hadoop open source project are rapidly emerging as the preferred solution to address business and technology trends that are disrupting traditional data management and processing. Enterprises can gain a competitive advantage by being early adopters of big data analytics. Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 44
  • 46. Introduction to Analytics and Big Data - Hadoop Hadoop Adoption HDFS MapReduce Ecosystem Projects © 2014 Storage Networking Industry Association. All Rights Reserved. 46
  • 47. Introduction to Analytics and Big Data - Hadoop Hadoop Adoption in the Industry 2007 2008 2009 2010 The DatagraphBlog Source: Hadoop Summit Presentations © 2014 Storage Networking Industry Association. All Rights Reserved. 47
  • 48. Use case of Hadoop
  • 49.
  • 51. What is Hadoop? A scalable fault-tolerant distributed system for data storage and processing Core Hadoop has two main components Hadoop Distributed File System (HDFS): self-healing, high- bandwidth clustered storage Reliable, redundant, distributed file system optimized for large files MapReduce: fault-tolerant distributed processing Programming model for processing sets of data Mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s) Operates on unstructured and structured data A large and active ecosystem Open source under the friendly Apache License http://wiki.apache.org/hadoop/ Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 51
  • 52.
  • 53. Hadoop ecosystem • HDFS: Hadoop Distributed File System • YARN: Yet Another Resource Negotiator • MapReduce: Programming based Data Processing • Spark: In-Memory data processing • PIG, HIVE: Query based processing of data services • HBase: NoSQL Database • Mahout, Spark MLLib: Machine Learning algorithm libraries • Solar, Lucene: Searching and Indexing • Zookeeper: Managing cluster • Oozie: Job Scheduling
  • 54.
  • 55. HDFS: HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes and thereby maintaining the metadata in the form of log files. HDFS consists of two core components i.e. • Name node • Data Node Name Node is the prime node which contains metadata (data about data) requiring comparatively fewer resources than the data nodes that stores the actual data. These data nodes are commodity hardware in the distributed environment. Undoubtedly, making Hadoop cost effective. HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the system.
  • 56. Apache HBase • It’s a NoSQL database which supports all kinds of data and thus capable of handling anything of Hadoop Database. It provides capabilities of Google’s BigTable, thus able to work on Big Data sets effectively. • At times where we need to search or retrieve the occurrences of something small in a huge database, the request must be processed within a short quick span of time. At such times, HBase comes handy as it gives us a tolerant way of storing limited data
  • 57.
  • 58. YARN Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop System. Consists of three major components i.e. • Resource Manager • Nodes Manager • Application Manager Resource manager has the privilege of allocating resources for the applications in a system whereas Node managers work on the allocation of resources such as CPU, memory, bandwidth per machine and later on acknowledges the resource manager. Application manager works as an interface between the resource manager and node manager and performs negotiations as per the requirement of the two.
  • 59. MapReduce By making the use of distributed and parallel algorithms, MapReduce makes it possible to carry over the processing’s logic and helps to write applications which transform big data sets into a manageable one. MapReduce makes the use of two functions i.e. Map() and Reduce() whose task is: • Map() performs sorting and filtering of data and thereby organizing them in the form of group. Map generates a key-value pair based result which is lateron processed by the Reduce() method. • Reduce(), as the name suggests does the summarization by aggregating the mapped data. In simple, Reduce() takes the output generated by Map() as input and combines those tuples into smaller set of tuples.
  • 60.
  • 61. PIG Pig was basically developed by Yahoo which works on a pig Latin language, which is Query based language similar to SQL. • It is a platform for structuring the data flow, processing and analyzing huge data sets. • Pig does the work of executing commands and in the background, all the activities of MapReduce are taken care of. After the processing, pig stores the result in HDFS. • Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just the way Java runs on the JVM. • Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop Ecosystem.
  • 62. HIVE • With the help of SQL methodology and interface, HIVE performs reading and writing of large data sets. However, its query language is called as HQL (Hive Query Language). • It is highly scalable as it allows real-time processing and batch processing both. Also, all the SQL datatypes are supported by Hive thus, making the query processing easier. • Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC Drivers and HIVE Command Line. • JDBC, along with ODBC drivers work on establishing the data storage permissions and connection whereas HIVE Command line helps in the processing of queries.
  • 63. Mahout • Mahout, allows Machine Learnability to a system or application. Machine Learning, as the name suggests helps the system to develop itself based on some patterns, user/environmental interaction or on the basis of algorithms. • It provides various libraries or functionalities such as collaborative filtering, clustering, and classification which are nothing but concepts of Machine learning. It allows invoking algorithms as per our need with the help of its own libraries.
  • 64. Apache Spark • It’s a platform that handles all the process consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization, etc. • It consumes in memory resources hence, thus being faster than the prior in terms of optimization. • Spark is best suited for real-time data whereas Hadoop is best suited for structured data or batch processing, hence both are used in most of the companies interchangeably.
  • 65.
  • 66. Data Management • Solr, Lucene: These are the two services that perform the task of searching and indexing with the help of some java libraries, especially Lucene is based on Java which allows spell check mechanism, as well. However, Lucene is driven by Solr. • Zookeeper: There was a huge issue of management of coordination and synchronization among the resources or the components of Hadoop which resulted in inconsistency, often. Zookeeper overcame all the problems by performing synchronization, inter-component based communication, grouping, and maintenance. • Oozie: Oozie simply performs the task of a scheduler, thus scheduling jobs and binding them together as a single unit. There is two kinds of jobs .i.e Oozie workflow and Oozie coordinator jobs. Oozie workflow is the jobs that need to be executed in a sequentially ordered manner whereas Oozie Coordinator jobs are those that are triggered when some data or external stimulus is given to it.
  • 67. High Level Hadoop Architecture
  • 68. Main components of Hadoop • Namenode—controls operation of the data jobs. • Datanode—this writes data in blocks to local storage. And it replicates data blocks to other datanodes. DataNodes are also rack-aware. You would not want to replicate all your data to the same rack of servers as an outage there would cause you to loose all your data. • SecondaryNameNode—this one take over if the primary Namenode goes offline. • JobTracker—sends MapReduce jobs to nodes in the cluster. • TaskTracker—accepts tasks from the Job Tracker.
  • 69. Main components of Hadoop • Yarn—runs the Yarn components ResourceManager and NodeManager. This is a resource manager that can also run as a stand-alone component to provide other applications the ability to run in a distributed architecture. For example you can use Apache Spark with Yarn. You could also write your own program to use Yarn. But that is complicated. • Client Application—this is whatever program you have written or some other client like Apache Pig. Apache Pig is an easy-to-use shell that takes SQL-like commands and translates them to Java MapReduce programs and runs them on Hadoop. • Application Master—runs shell commands in a container as directed by Yarn.
  • 70. HDFS Concepts Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 70  Sits on top of a native (ext3, xfs, etc..) file system  Performs best with a ‘modest’ number of large files  Files in HDFS are ‘write once’  HDFS is optimized for large, streaming reads of files
  • 71. H D FS Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 71  Hadoop Distributed File System – Data is organized into files & directories – Files are divided into blocks, distributed across cluster nodes – Block placement known at runtime by map-reduce = computation co-located with data – Blocks replicated to handle failure – Checksums used to ensure data integrity  Replication: one and only strategy for error handling, recovery and fault tolerance – Self Healing – Make multiple copies
  • 72. Hadoop Server Roles Slave Task T racker Data Node Slave T askT racker Data Node Slave T askT racker Data Node Master Name Node Master Secondary Node Job T racker Client Client Client Client Client Client Client Client Slave Task T racker Data Node Slave T askT racker Data Node Slave Task T racker Data Node Up to 4K Nodes Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 72
  • 73. Cluster versus single node •When you first install Hadoop, such as to learn it, it runs in single node. But in production you would set it up to run in cluster node, meaning assign data nodes to run on different machines. The whole set of machines is called the cluster. A Hadoop cluster can scale immensely to store petabytes of data.
  • 74. Hadoop Cluster DN,TT Up to 4K Nodes DN,TT DN,TT DN,TT DN,TT DN,TT NN 1GbE/10GbE DN,TT DN,TT DN,TT DN,TT DN,TT DN,TT JT 1GbE/10GbE DN,TT DN,TT DN,TT DN,TT DN,TT DN,TT SNN 1GbE/10GbE DN,TT DN,TT DN,TT DN,TT DN,TT DN,TT 1GbE/10GbE CORE SWITCH DN,TT CORE SWITCH Client Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 74
  • 75. HDFS File Write Operation Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 75
  • 76. HDFS File Read Operation Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 76
  • 77. Functional Programming meets Distributed Processing Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 77
  • 78. What is MapReduce? A method for distributing a task across multiple nodes Each node processes data stored on that node Consists of two developer-created phases 1. Map 2. Reduce In between Map and Reduce is the Shuffle and Sort Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 78
  • 79. MapReduce Provides Automatic parallelization and distribution Fault Tolerance Status and Monitoring Tools A clean abstraction for programmers Google Technology RoundTable: MapReduce Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 79
  • 80. MapReduce -two programs Map means to take items like a string from a csv file and run an operation over every line in the file, like to split it into a list of fields. Those become (key->value) pairs. Reduce groups these (key->value) pairs and runs an operation to, for example, concatenate them into one string or sum them like (key->sum)
  • 81. Key MapReduce Terminology Concepts A user runs a client program on a client computer The client program submits a job to Hadoop The job is sent to the JobTracker process on the Master Node Each Slave Node runs a process called the TaskTracker The JobTracker instructs TaskTrackers to run and monitor tasks A task attempt is an instance of a task running on a slave node There will be at least as many task attempts as there are tasks which need to be performed Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 81
  • 82. MapReduce: Basic Concepts Each Mapper processes single input split from HDFS Hadoop passes developer’s Map code one record at a time Each record has a key and a value Intermediate data written by the Mapper to local disk During shuffle and sort phase, all values associated with same intermediate key are transferred to same Reducer Reducer is passed each key and a list of all its values Output from Reducers is written to HDFS Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 82
  • 83. MapReduce Operation Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 83 W hat was the max/ min temperature for the last century?
  • 84. Sample Dataset The requirement: you need to find out grouped by type of customer how many of each type are in each country with the name of the country listed in the countries.dat in the final result (and not the 2 digit country name). Each record has a key and a value To do this you need to: Join the data sets Key on country Count type of customer per country Output the results Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 84
  • 85. MapReduce Paradigm Input Map Shuffle and Sort Reduce Output Map Reduce cat grep sort uniq output Map Map Reduce Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 85
  • 86. MapReduce Example Problem: Count the number of times that each word appears in the following paragraph: J ohn has a red car ,which has no radio.Mary has a red bicycle.Bill has no car or bicycle. Map Server 1: John has a red car, which has no radio. John: 1 has: 2 a: 1 red: 1 car: 1 which: 1 no: 1 radio: 1 Server 2: Mary has a red bicycle. Mary: 1 has: 1 a: 1 red: 1 bicycle: 1 Server 3: Bill has no car or bicycle. Bill: 1 has: 1 no: 1 car: 1 or: 1 biclycle:1 Reduce John: 1 has 2 has: 1 has: 1 a: 1 a: 1 red: 1 red: 1 car: 1 car: 1 which: 1 no: 1 no: 1 radio: 1 Mary: 1 bicycle: 1 bicycle: 1 Bill: 1 or: 1 John: 1 has 4 a: 2 red: 2 car: 2 which: 1 no: 2 radio: 1 Mary: 1 bicycle: 2 Bill: 1 or: 1 Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 86 Server 1 Server 2 Server 3 Server 1 Server 2 Server 3
  • 87. Putting it all Together: MapReduce and HDFS Task Tracker Task Tracker Task Tracker Job Tracker Hadoop Distributed File System (HDFS) Large Data Set (Log files,Sensor Data) Reduce Job Map Job Reduce Job Map Job Reduce Job Map Job Reduce Job Map Job Reduce Job Client/Dev 2 Map Job 1 3 4 Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 87
  • 88. • Hadoop is a‘top-level’Apache project • Created and managed under the auspices of theApache Software Foundation • Several other projects exist that rely on some or all of Hadoop • Typically either both HDFS and MapReduce,or just HDFS • Ecosystem Projects Include • Hive • Pig • HBase • Many more… .. Hadoop Ecosystem Projects Introductih on tttp o : A /n /h aly a tid cs oa o nd pB .a igp D a ac tah -e H. ao do ro g p / © 2014 Storage Networking Industry Association. All Rights Reserved. 88
  • 89. Hadoop, SQL & MPP Systems Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 89 Hadoop Traditional SQL Systems MPP Systems Scale-Out Scale-Up Scale-Out Key/Value Pairs Relational Tables Relational Tables Functional Programming Declarative Queries Declarative Queries Offline Batch Processing Online Transactions Online Transactions
  • 90. Comparing RDBMS and MapReduce Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 90 Traditional RDBMS MapReduce Data Size Gigabytes (Terabytes) Petabytes (Exabytes) Access Interactive and Batch Batch Updates Read / Write many times Write once, Read many times Structure Static Schema Dynamic Schema Integrity High (ACID) Low Scaling Nonlinear Linear DBA Ratio 1:40 1:3000 Reference: Tom White’s Hadoop: The Definitive Guide
  • 91. Diagnostics and Customer Churn Issues W hat make and model systems are deployed? Are certain set top boxes in need of replacement based on system diagnostic data? Is the acorrelation between make,model or vintageof set top box and customer churn? W hat are the most expensive boxes to maintain? Which systems should we pro-actively replace to keep customers happy? Big Data Solution Collect unstructured data from set top boxes—multiple terabytes Analyze system data in Hadoop in near real time Pull data in to Hive for interactive query and modeling Analytics with Hadoop increases customer satisfaction Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 91
  • 92. Pay PerView Advertising Issues Fixed inventory of ad space is provided by national content providers. For example, 100 ads offered to provider for 1 month of programming Provider can use this space to advertise its products and services, such as pay per view Do we advertise “The Longest Yard” in the middle of afootball gameor in the middle of a romantic comedy? 10% increase in pay per view movie rentals = $10M in incremental revenue • Big Data Solution Collect programming data and viewer rental data in a large data repository Develop models to correlate proclivity to rent to programming format Find the most productive time slots and programs to advertise pay per view inventory Improve ad placement and pay-per-view conversion with Hadoop Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 92
  • 93. Risk Modeling Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 93  Risk Modeling – Bank had customer data across multiple lines of business and needed to develop a better risk picture of its customers. i.e,ifdirect deposits stop coming into checking acct, it’s likely that customer lost his/her job, which impacts creditworthiness for other products (CC, mortgage, etc.) – Data existing in silos across multiple LOB’s and acquired bank systems – Data size approached 1 petabyte  W hy do this in Hadoop? – Ability to cost-effectively integrate + 1 PB of data from multiple data sources: data warehouse,call center , chat and email – Platform for more analysis with poly-structured data sources; i.e., combining bank data with credit bureau data;Twitter, etc. – Offload intensive computation from DW
  • 94. Sentiment Analysis Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 94  Sentiment Analysis – Hadoop used frequently to monitor what customers think of company’s products or services – Data loaded from social media sources (Twitter, blogs, Facebook, emails, chats,etc.) into Hadoop cluster – Map/Reduce jobs run continuously to identifysentiment (i.e.,Acme Company’s rates are “outrageous” or “rip off”) – Negative/positive comments can be acted upon (special offer , coupon, etc.)  W hy Hadoop – Social media/ web data is unstructured – Amount ofdata is immense – New data sources arise weekly
  • 95. tutorial is by th SNIA
  • 96. Resources: The Big Data Conversation World Economic Forum: “Personal Data: The Emergence of a New Asset Class” 2011 McKinsey Global Institute: Big Data: The next frontier for innovation, competition, and productivity Big Data: Harnessing a game-changing asset IDC: 2011 Digital Universe Study: Extracting Value from Chaos The Economist: Data, Data Everywhere Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field O’Reilly – What is Data Science? O’Reilly – Building Data Science Teams? O’Reilly – Data for the public good Obama Administration “Big Data Research and Development Initiative.” Introduction to Analytics and Big Data - Hadoop © 2014 Storage Networking Industry Association. All Rights Reserved. 96