INTRODUCTION TO BIG DATA AND HADOOP
1. IT19741 Cloud and Big Data Analytics
Dr G Geetha
Dean, Innovation and Professor, CSE
Women Scientist and Qualified Patent Agent
Rajalakshmi Engineering College
2. UNIT-III: INTRODUCTION TO BIG DATA AND HADOOP
Introduction to Big Data, Types of Digital Data, Challenges of conventional systems - Web data, Evolution of analytic processes and tools, Analysis vs. reporting - Big Data Analytics, Introduction to Hadoop - Distributed computing challenges - History of Hadoop, Hadoop ecosystem - Use cases of Hadoop - Hadoop distributors - HDFS - Processing data with Hadoop - MapReduce.
4. Introduction to Big Data
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
8. Traditional Data vs. Big Data
• Traditional data is generated at the enterprise level; big data is generated outside the enterprise.
• Traditional data volumes range from gigabytes to terabytes; big data volumes range from petabytes to zettabytes or exabytes.
• Traditional database systems deal with structured data; big data systems deal with structured, semi-structured, and unstructured data.
• Traditional data is generated per hour or per day; big data is generated far more frequently, often every second.
• Traditional data sources are centralized and managed in centralized form; big data sources are distributed and managed in distributed form.
• Data integration is very easy for traditional data; it is very difficult for big data.
• A normal system configuration is capable of processing traditional data; a high-end system configuration is required to process big data.
• The size of traditional data is very small; big data is much larger.
• Ordinary database tools are sufficient for traditional database operations; special kinds of tools are required for big data schema-based operations.
• Normal functions can manipulate traditional data; special kinds of functions are required to manipulate big data.
• The traditional data model is strict-schema based and static; the big data model is flat-schema based and dynamic.
• Traditional data is stable, with known inter-relationships; big data is not stable, with unknown relationships.
• Traditional data comes in manageable volumes; big data comes in huge volumes that become unmanageable.
• Traditional data is easy to manage and manipulate; big data is difficult to manage and manipulate.
• Traditional data sources include ERP transaction data, CRM transaction data, financial data, organizational data, and web transaction data; big data sources include social media, device data, and sensor data.
9. Types of Digital Data
• Structured data: When data follows a pre-defined schema/structure, we say it is structured data. This is data in an organized form (e.g., in rows and columns) that can be easily used by a computer program.
• Relationships exist between entities of data, such as classes and their objects. About 10% of an organization's data is in this format.
• Data stored in databases is an example of structured data.
10. Types of Digital Data
• Unstructured data: This is data which does not conform to a data model, or is not in a form that can be used easily by a computer program. About 80% of an organization's data is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an email, etc.
12. Types of Digital Data
• Semi-structured data: Semi-structured data is also referred to as self-describing data. This is data which does not conform to a data model but has some structure; however, it is not in a form that can be used easily by a computer program. About 10% of an organization's data is in this format; for example, HTML, XML, JSON, email data, etc.
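To make the three categories concrete, here is a small illustrative sketch (the records and field names are invented for the example): the CSV rows are structured, the JSON document is semi-structured, and the free-text memo is unstructured.

```python
import csv
import io
import json

# Structured: fixed schema; every record has the same rows-and-columns shape.
csv_text = "id,name,dept\n1,Asha,CSE\n2,Ravi,ECE\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"])  # fields are addressable via the schema

# Semi-structured: self-describing keys, but no fixed schema is enforced.
record = json.loads('{"id": 1, "name": "Asha", "skills": ["java", "sql"]}')
print(record["skills"])  # structure exists, yet fields can vary per record

# Unstructured: free text; a program cannot address fields directly.
memo = "Met the vendor today; the sensor shipment is delayed again."
```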
14. Web Data
• Data that is sourced and structured from websites is referred to as "web data".
• In the world of big data, data comes from multiple sources and in huge amounts; one such source is the web itself. Web data extraction is one means of collecting data from this source, i.e., the web, as the sketch below illustrates.
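As a minimal sketch of web data extraction, the snippet below fetches one page over HTTP; it assumes the third-party `requests` package is installed, and the URL is illustrative.

```python
import requests

# Fetch one page of raw web data (HTML); the URL is a placeholder.
response = requests.get("https://example.com/products.html", timeout=10)
response.raise_for_status()
html = response.text  # unstructured markup at this point

# In practice you would parse `html` into structured records
# (e.g., with an HTML parser) before feeding it to analytics.
print(html[:200])
```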
30. The Evolution of Business Intelligence
• 1990s: BI reporting with OLAP and data warehouses (Business Objects, SAS, Informatica, Cognos, and other SQL reporting tools).
• 2000s: Interactive business intelligence and in-memory RDBMS (QlikView, Tableau, HANA).
• 2010s: Big data, both batch processing with distributed data stores (Hadoop/Spark; HBase/Cassandra) and real-time, single-view systems (graph databases).
Across these eras, analytics has grown in both scale and speed.
31. Big Data Analytics
• Big data is more real-time in nature than traditional DW applications.
• Traditional DW architectures (e.g., Exadata, Teradata) are not well-suited for big data apps.
• Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps.
39. The "5 Vs" of Big Data
• Volume: With increasing dependence on technology, data is being produced in large volumes. Common examples are data produced by social networking sites, sensors, scanners, airlines, and other organizations.
• Velocity: A huge amount of data is generated every second. It has been estimated that by the end of 2020, every individual would produce about 3 MB of data per second. This large volume of data is generated at great velocity.
• Variety: The data being produced comes in three forms:
  Structured data: relational data stored in the form of rows and columns.
  Unstructured data: text, pictures, videos, etc., which cannot be stored in rows and columns.
  Semi-structured data: log files are a common example of this type.
• Veracity: The term veracity refers to inconsistent or incomplete data, which results in doubtful or uncertain information. Inconsistency often arises from the sheer volume of data: data in bulk can create confusion, whereas too little data can convey only half or incomplete information.
• Value: After taking the four Vs above into account, there comes one more V, which stands for value. Bulk data with no value is of no good to a company unless it is turned into something useful; data in itself is of no use unless it is converted into something valuable from which information can be extracted. Hence value can be called the most important of the 5 Vs.
40. Distributed Computing Challenges
• Lack of performance and scalability.
• Lack of flexible resource management.
• Lack of application deployment support.
• Lack of quality of service.
• Lack of multiple data source support.
53. Hadoop ecosystem
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query-based processing of data
• HBase: NoSQL Database
• Mahout, Spark MLlib: Machine learning algorithm libraries
• Solr, Lucene: Searching and indexing
• Zookeeper: Cluster management
• Oozie: Job scheduling
55. HDFS
HDFS is the primary component of the Hadoop ecosystem. It is responsible for storing large data sets of structured or unstructured data across various nodes, and it maintains the metadata in the form of log files.
HDFS consists of two core components:
• NameNode
• DataNode
The NameNode is the prime node; it holds the metadata (data about data) and requires comparatively fewer resources than the DataNodes, which store the actual data. The DataNodes run on commodity hardware in the distributed environment, which makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and the hardware, and thus works at the heart of the system.
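As a hedged illustration of working with HDFS, the sketch below drives the standard `hdfs dfs` command-line tool from Python; it assumes a running cluster with the `hdfs` CLI on the PATH, and the paths and file name are illustrative.

```python
import subprocess

# Create a directory in HDFS, upload a local file, and list the result.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local.csv", "/user/demo/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
```

Behind these commands, the NameNode records the file's metadata while the blocks themselves land on DataNodes.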
56. Apache HBase
• HBase is a NoSQL database that supports all kinds of data, and it is thus capable of handling anything stored in a Hadoop database. It provides the capabilities of Google's BigTable, so it can work on big data sets effectively.
• At times we need to search for or retrieve a few small occurrences in a huge database, and the request must be processed within a short span of time. At such times HBase comes in handy, as it gives us a fault-tolerant way of storing such sparse data, as the sketch below illustrates.
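As an illustration only, the sketch below uses the third-party `happybase` client (an assumption, not part of HBase itself; it requires HBase's Thrift server to be running). The table and column names are invented.

```python
import happybase  # third-party HBase client: pip install happybase

# Connect to the HBase Thrift server (host is illustrative).
connection = happybase.Connection("localhost")
table = connection.table("users")  # assumes the table already exists

# HBase addresses cells by (row key, column-family:qualifier).
table.put(b"row-1", {b"info:name": b"Asha", b"info:city": b"Chennai"})
print(table.row(b"row-1"))  # fast point lookup by row key
connection.close()
```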
58. YARN
Yet Another Resource Negotiator, as the name implies, helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
YARN consists of three major components:
• Resource Manager
• Node Manager
• Application Manager
The Resource Manager has the privilege of allocating resources to the applications in the system, whereas the Node Managers work on the allocation of resources such as CPU, memory, and bandwidth on each machine and afterwards report back to the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations between the two as required.
59. MapReduce
By making use of distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps to write applications that transform big data sets into manageable ones.
MapReduce uses two functions, Map() and Reduce(), whose tasks are:
• Map() performs sorting and filtering of the data, thereby organizing it into groups. Map() generates key-value pair results, which are later processed by the Reduce() method (see the sketch after this list).
• Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In short, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
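To make Map() and Reduce() concrete, here is a minimal word-count sketch in the Hadoop Streaming style (a common way to run Python on Hadoop; the file name and invocation details are illustrative). The mapper emits (word, 1) pairs on stdout, and the reducer, whose input Hadoop sorts by key, sums the counts per word.

```python
# wordcount_streaming.py -- run with "map" or "reduce" as the argument.
import sys


def mapper(stream):
    # Map(): turn each input line into (word, 1) key-value pairs.
    for line in stream:
        for word in line.split():
            print(f"{word}\t1")


def reducer(stream):
    # Reduce(): input arrives sorted by key; sum the counts per word.
    current, total = None, 0
    for line in stream:
        word, count = line.rstrip("\n").split("\t")
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)(sys.stdin)
```

On a cluster, this script would be passed as the -mapper and -reducer of the Hadoop Streaming jar (its path varies by distribution); the shuffle between the two phases is what delivers sorted keys to the reducer.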
61. PIG
Pig was developed by Yahoo. It works with Pig Latin, a query-based language similar to SQL.
• It is a platform for structuring the data flow, and for processing and analyzing huge data sets.
• Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After processing, Pig stores the result in HDFS.
• The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
• Pig helps to achieve ease of programming and optimization, and hence is a major segment of the Hadoop ecosystem.
62. HIVE
• With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time and batch processing. All the SQL data types are supported by Hive, making query processing easier.
• Like other query-processing frameworks, HIVE comes with two components: JDBC drivers and the HIVE command line.
• The JDBC and ODBC drivers establish data-storage permissions and connections, whereas the HIVE command line helps in the processing of queries, as the sketch below shows.
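As a hedged sketch of querying Hive from Python, the snippet below uses the third-party PyHive package (an assumption, not part of Hive itself); the host, port, and table name are illustrative, and HiveServer2 must be running.

```python
from pyhive import hive  # third-party client: pip install pyhive

# Connect to HiveServer2 (host/port are placeholders).
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# HQL looks like SQL; Hive compiles it into jobs over data in HDFS.
cursor.execute("SELECT category, COUNT(*) FROM sales GROUP BY category")
for row in cursor.fetchall():
    print(row)
```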
63. Mahout
• Mahout brings machine learnability to a system or application. Machine learning, as the name suggests, helps a system to develop itself based on patterns, user/environmental interaction, or algorithms.
• It provides libraries for collaborative filtering, clustering, and classification, which are core concepts of machine learning. It allows algorithms to be invoked as needed with the help of its own libraries.
64. Apache Spark
• Spark is a platform that handles resource-intensive tasks such as batch processing, interactive or iterative real-time processing, graph conversions, and visualization.
• It uses in-memory resources, which makes it faster than the earlier disk-based approaches in terms of optimization.
• Spark is best suited for real-time data, whereas Hadoop is best suited for structured data and batch processing; hence both are used in most companies, often interchangeably.
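The sketch below shows Spark's in-memory style with a PySpark word count; it assumes `pyspark` is installed, and the input path is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read lines, split into words, and count each word in parallel.
lines = spark.sparkContext.textFile("hdfs:///user/demo/input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.cache()          # keep the result in memory for repeated reuse
print(counts.take(5))   # lazily triggers the computation
spark.stop()
```

Caching the RDD is what gives iterative and interactive workloads their speed advantage over disk-based MapReduce.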
66. Data Management
• Solr, Lucene: These two services perform searching and indexing with the help of Java libraries. Lucene is a Java library that also provides a spell-check mechanism; Solr is built on top of Lucene.
• Zookeeper: There used to be huge problems with coordinating and synchronizing the resources and components of Hadoop, which often resulted in inconsistency. Zookeeper overcame these problems by providing synchronization, inter-component communication, grouping, and maintenance.
• Oozie: Oozie performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs need to be executed in a sequential order, whereas Oozie coordinator jobs are triggered when some data or an external stimulus arrives.
68. Main components of Hadoop
• NameNode—controls the operation of the data jobs.
• DataNode—writes data in blocks to local storage and replicates data blocks to other DataNodes. DataNodes are also rack-aware: you would not want to replicate all your data to the same rack of servers, as an outage there would cause you to lose all your data.
• SecondaryNameNode—despite its name, it does not take over if the primary NameNode goes offline; it periodically checkpoints the NameNode's namespace image and edit log.
• JobTracker—sends MapReduce jobs to nodes in the cluster.
• TaskTracker—accepts tasks from the JobTracker.
69. Main components of Hadoop
• Yarn—runs the Yarn components ResourceManager and NodeManager. Yarn is a resource manager that can also run as a stand-alone component to give other applications the ability to run in a distributed architecture. For example, you can use Apache Spark with Yarn. You could also write your own program to use Yarn, but that is complicated.
• Client application—whatever program you have written, or some other client like Apache Pig. Apache Pig is an easy-to-use shell that takes SQL-like commands, translates them to Java MapReduce programs, and runs them on Hadoop.
• Application Master—runs shell commands in a container as directed by Yarn.
73. Cluster versus single node
• When you first install Hadoop, such as to learn it, it runs in single-node mode. But in production you would set it up to run in cluster mode, meaning data nodes are assigned to run on different machines. The whole set of machines is called the cluster. A Hadoop cluster can scale immensely to store petabytes of data.
80. MapReduce: two programs
Map means to take items, such as lines from a CSV file, and run an operation over every line, for example splitting it into a list of fields. Those become (key -> value) pairs.
Reduce groups these (key -> value) pairs by key and runs an operation over each group, for example concatenating the values into one string, or summing them to produce (key -> sum).
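The framework-free sketch below mirrors that description over a few in-memory CSV lines (the region,amount columns are invented): map each line into a (key, value) pair, group by key as the shuffle would, then reduce each group to a sum.

```python
from itertools import groupby
from operator import itemgetter

lines = ["east,10", "west,5", "east,7"]  # stand-in for a CSV file

# Map: split each line into fields and emit a (key, value) pair.
pairs = [(region, int(amount))
         for region, amount in (line.split(",") for line in lines)]

# Shuffle: group the pairs by key (Hadoop sorts between map and reduce).
pairs.sort(key=itemgetter(0))

# Reduce: sum the values for each key, yielding (key -> sum).
for region, group in groupby(pairs, key=itemgetter(0)):
    print(region, sum(value for _, value in group))
```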