2. What is Big Data?
Big Data relates to rapidly growing, structured and unstructured datasets with sizes beyond the ability of conventional database tools to store, manage, and analyze them. In addition to its size and complexity, the term also refers to Big Data's ability to support "evidence-based" decision-making with a high impact on business operations.
[Chart: analytics approaches plotted by size of data (gigabytes, terabytes, petabytes, zettabytes) against speed, accuracy and complexity of intelligence – traditional analytics and advanced analytics operate on small data sets, while Big Data analytics sits at the petabyte-to-zettabyte end of the scale]
Source: CRISIL GR&A analysis
3. Structured Data
– Resides in formal data stores – RDBMS and data warehouses; grouped in the form of rows and columns
– Accounts for ~10% of the total data existing currently
– Examples: RDBMS (e.g., ERP and CRM), data warehousing
Semi-Structured Data
– A form of structured data that does not conform to the formal structure of data models
– Accounts for ~10% of the total data existing currently
– Examples: web logs & clickstreams, text messages, Microsoft Project plan files
Unstructured Data
– Comprises data formats that cannot be stored in row/column format, such as audio, video and clickstream data
– Accounts for ~80% of the total data existing currently
– Examples: audio, video, blogs, weather patterns, location co-ordinates, sensor/M2M data, email, social media, geospatial data
Source: Industry reporting; CRISIL GR&A analysis
6. Evolution of analytics
[Chart: analytics techniques plotted by level of complexity against time, moving from analytics as a separate value chain function (late 1990s) to in-database analytics (2000 onwards), and from descriptive to predictive to prescriptive analytics]
Basic analytics (descriptive analytics) answers: What happened? When did it happen? What was its impact?
– Standard reports, ad hoc reports, query/drill-down, alerts
Advanced analytics (predictive and prescriptive analytics) answers: Why did it happen? When will it happen again? What caused it to happen? What can be done to avoid it?
– Statistical analysis, forecasting, predictive modeling, optimization, stochastic optimization
Big Data analytics is where advanced analytic techniques are applied on Big Data sets; the term came into play in late 2011 – early 2012
– Natural language processing, complex event processing, multivariate statistical analysis, time series analysis, behavioral analytics, data mining, constraint-based BI, social network analytics, semantic analytics, online analytical processing (OLAP), extreme SQL, visualization, analytic database functions
Source: CRISIL GR&A analysis
7. What does the Big Data Ecosystem Constitute?
Four key elements:
1. Big Data management & storage: data storage infrastructure and technologies
2. Big Data analytics: the technologies and tools to analyze the data and generate insight from it
3. Big Data's application & use: enabling the Big Data insights to work in BI and end-user applications
4. IT services: including system integration, consulting, project management and customization
[Diagram: Components of the Big Data Ecosystem]
– Data sources (input data): structured data (stored in MPP, RDBMS and DW*) and unstructured data (text, web pages, social media content, video etc.)
– Data management & storage (operational data): NoSQL, MPP, RDBMS, DW and Hadoop-based stores; data architecture; data administration tools
– Big Data analytics: Hadoop/Big Data technology framework (MapReduce etc.); developer environments (languages such as Java, environments such as Eclipse & NetBeans, programming interfaces such as MapReduce); analytics products (Avro, Apache Thrift); ETL & data integration products; system tools; workflow/scheduler products
– Data analytics, its application and use: BI & visualization tools; applications (mobile, search, web); end users; business analysts
– IT services (SI, customization, consulting, system design) spanning the ecosystem
*MPP – massively parallel processing; RDBMS – relational database management system; DW – data warehouse
Source: CRISIL GR&A analysis
10. Hadoop is a software platform that lets one easily write and run applications that
process vast amounts of data. It includes:
– MapReduce – offline computing engine
– HDFS – Hadoop distributed file system
– HBase (pre-alpha) – online data access
Yahoo! is the biggest contributor
Here's what makes it especially useful:
◦ Scalable: It can reliably store and process petabytes.
◦ Economical: It distributes the data and processing across clusters of
commonly available computers (in thousands).
◦ Efficient: By distributing the data, it can process it in parallel on the
nodes where the data is located.
◦ Reliable: It automatically maintains multiple copies of data and
automatically redeploys computing tasks based on failures.
11. HDFS is written with large clusters of computers in mind and is
built around the following assumptions:
◦ Hardware will fail.
◦ Processing will be run in batches. Thus there is an
emphasis on high throughput as opposed to low latency.
◦ Applications that run on HDFS have large data sets. A
typical file in HDFS is gigabytes to terabytes in size.
◦ It should provide high aggregate data bandwidth and scale
to hundreds of nodes in a single cluster. It should support
tens of millions of files in a single instance.
◦ Applications need a write-once-read-many access model (see the sketch after this list).
◦ Moving Computation is Cheaper than Moving Data.
◦ Portability is important.
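A minimal sketch of this write-once-read-many pattern against the HDFS FileSystem Java API; the NameNode URI and file path below are illustrative, not taken from the slides:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    // Illustrative NameNode address; in a real deployment this comes from
    // the cluster configuration (fs.defaultFS in core-site.xml).
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
    Path file = new Path("/data/logs/events.txt");   // illustrative path

    // Write once: the file is streamed in and then closed; HDFS favours this
    // pattern over in-place updates to existing file contents.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("event-1\nevent-2\nevent-3\n");
    }

    // Read many: any number of readers can later stream the same blocks
    // sequentially, the high-throughput batch access HDFS is optimized for.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
      for (String line = in.readLine(); line != null; line = in.readLine()) {
        System.out.println(line);
      }
    }
  }
}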
12. MapReduce is a programming model developed at Google
Sort/merge-based distributed computing
Initially intended for Google's internal search/indexing application, but now
used extensively by many other organizations (e.g., Yahoo, Amazon.com, IBM, etc.)
It is a functional style of programming that is naturally
parallelizable across a large cluster of workstations or PCs.
The underlying system takes care of partitioning the
input data, scheduling the program's execution
across several machines, handling machine failures, and
managing the required inter-machine communication. (This
is key to Hadoop's success)
13. Hadoop implements Google’s MapReduce, using HDFS
MapReduce divides applications into many small blocks of
work.
HDFS creates multiple replicas of data blocks for reliability,
placing them on compute nodes around the cluster.
MapReduce can then process the data where it is located.
Hadoop's target is to run on clusters on the order of 10,000
nodes.
14. The runtime partitions the input and provides it to different Map instances:
Map(key, value) → (key', value')
The runtime collects the (key', value') pairs and distributes them to several
Reduce functions so that each Reduce function gets the pairs with the same key'.
Each Reduce produces a single (or zero) output file.
Map and Reduce are user-written functions.
15. map(String key, String value):
    // key: document name; value: document contents; map(k1, v1) → list(k2, v2)
    for each word w in value:
        EmitIntermediate(w, "1");
(Example: if the input string is "God is God. I am I", Map produces
{<"God", 1>, <"is", 1>, <"God", 1>, <"I", 1>, <"am", 1>, <"I", 1>})

reduce(String key, Iterator values):
    // key: a word; values: a list of counts; reduce(k2, list(v2)) → list(v2)
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
(Example: reduce("I", <1, 1>) → 2)
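For comparison, a minimal sketch of the same word count written against the Hadoop MapReduce Java API (the standard org.apache.hadoop.mapreduce classes); the driver that configures and submits the job is shown in a later sketch:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // map(k1, v1) -> list(k2, v2): emit <word, 1> for every word in the input line;
  // the key passed in by the framework is the byte offset of the line, unused here.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // e.g. <"God", 1>, <"is", 1>, <"God", 1>, ...
      }
    }
  }

  // reduce(k2, list(v2)) -> v2: sum the counts received for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // e.g. reduce("I", <1, 1>) -> <"I", 2>
    }
  }
}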
16. Can we do word count in parallel?
Input line 1: "see bob throw" → Map output: <see, 1>, <bob, 1>, <throw, 1>
Input line 2: "see spot run"  → Map output: <see, 1>, <spot, 1>, <run, 1>
After grouping by key and reducing: <bob, 1>, <run, 1>, <see, 2>, <spot, 1>, <throw, 1>
19. Worker failure: The master pings every worker periodically. If
no response is received from a worker in a certain amount of
time, the master marks the worker as failed. Any map tasks
completed by the worker are reset back to their initial idle
state, and therefore become eligible for scheduling on other
workers. Similarly, any map task or reduce task in progress
on a failed worker is also reset to idle and becomes eligible
for rescheduling.
Master failure: it is easy to make the master write periodic
checkpoints of its data structures. If the master task dies, a
new copy can be started from the last checkpointed state.
However, in most cases, the user simply restarts the job.
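A schematic Java sketch of the worker-failure handling described above; the class, data structures and timeout are illustrative inventions for this slide, not the actual Google or Hadoop implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Schematic only: the master pings workers, marks silent ones as failed,
// and resets their map tasks to idle so they can be rescheduled elsewhere.
class MasterWorkerFailureSketch {
  enum TaskState { IDLE, IN_PROGRESS, COMPLETED }

  static final long PING_TIMEOUT_MS = 10_000;                            // illustrative timeout
  final Map<String, Long> lastReply = new ConcurrentHashMap<>();         // worker -> time of last ping reply
  final Map<String, TaskState> mapTaskState = new ConcurrentHashMap<>(); // map task id -> state
  final Map<String, String> mapTaskWorker = new ConcurrentHashMap<>();   // map task id -> assigned worker

  // Called periodically by the master: any worker silent for too long is marked failed.
  void checkWorkers(long now) {
    for (Map.Entry<String, Long> e : lastReply.entrySet()) {
      if (now - e.getValue() > PING_TIMEOUT_MS) {
        handleWorkerFailure(e.getKey());
      }
    }
  }

  // Map tasks on the failed worker are reset to IDLE even if completed, because
  // their intermediate output sits on that worker's local disk and is now
  // unreachable; reset tasks become eligible for rescheduling on other workers.
  void handleWorkerFailure(String worker) {
    for (Map.Entry<String, String> e : mapTaskWorker.entrySet()) {
      if (e.getValue().equals(worker)) {
        mapTaskState.put(e.getKey(), TaskState.IDLE);
      }
    }
    lastReply.remove(worker);
  }
}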
20. The input data (on HDFS) is stored on the local disks of the
machines in the cluster. HDFS divides each file into 64 MB
blocks, and stores several copies of each block (typically 3
copies) on different machines.
The MapReduce master takes the location information of the
input files into account and attempts to schedule a map task
on a machine that contains a replica of the corresponding
input data. Failing that, it attempts to schedule a map task
near a replica of that task's input data. When running large
MapReduce operations on a significant fraction of the workers
in a cluster, most input data is read locally and consumes no
network bandwidth.
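A small sketch showing how the block size, replication factor and per-block replica locations described above can be inspected through the HDFS FileSystem Java API (the file path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockPlacement {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/input/part-00000"));  // illustrative path

    System.out.println("Block size : " + status.getBlockSize());   // e.g. 67108864 (64 MB)
    System.out.println("Replication: " + status.getReplication()); // typically 3

    // One entry per block, listing the hosts that hold a replica. The MapReduce
    // master consults exactly this kind of information to schedule each map task
    // on (or near) a machine that already stores the block it will read.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset " + block.getOffset() + " -> " + String.join(", ", block.getHosts()));
    }
  }
}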
21. The map phase has M pieces and the reduce phase has R pieces.
M and R should be much larger than the number of worker machines.
Having each worker perform many different tasks improves dynamic load
balancing, and also speeds up recovery when a worker fails.
The larger M and R are, the more scheduling decisions the master must make.
R is often constrained by users because the output of each reduce task ends
up in a separate output file (see the driver sketch below).
Typically, (at Google), M = 200,000 and R = 5,000, using 2,000 worker
machines.
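A minimal driver sketch showing where a user constrains R, the number of reduce tasks; it reuses the hypothetical WordCount mapper and reducer from the earlier sketch, and M follows from the number of input splits rather than from any explicit setting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // M (the number of map tasks) follows from the number of input splits/blocks.
    // R (the number of reduce tasks) is set by the user; each reduce task writes
    // one output file (part-r-00000, part-r-00001, ...), which is why R is often
    // kept small relative to M.
    job.setNumReduceTasks(4);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}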
22. The Hadoop Distributed File System (HDFS) is a distributed
file system designed to run on commodity hardware. It has
many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. HDFS:
◦ is highly fault-tolerant and designed to be deployed on low-cost hardware
◦ provides high-throughput access to application data and is suitable for
applications that have large data sets
◦ relaxes a few POSIX requirements to enable streaming access to file system data
◦ is part of the Apache Hadoop Core project; the project URL is
http://hadoop.apache.org/core/.
26. Big Data talent: roles and the US demand-supply gap, 2018
Data Scientists
– Requisite educational qualifications: an advanced degree such as an M.S. or Ph.D. in mathematics, statistics, economics, computer science or other decision sciences
– Other expertise: expertise in data analytics skills to extract data and use modeling & simulations; multi-disciplinary knowledge of business to find insights
– Role in ecosystem: Big Data analytics, business intelligence and visualization
Data-savvy Managers
– Requisite educational qualifications: an advanced business degree such as an MBA, M.S. or managerial diploma
– Other expertise: knowledge of statistics and/or machine learning to frame key questions and analyze answers; conceptual knowledge of business to interpret and challenge the insights; ability to make decisions using Big Data insights
– Role in ecosystem: project management across the Big Data ecosystem – consulting services, implementation, infrastructure management and analytics
Technical Engineers
– Requisite educational qualifications: a degree in computer science, information technology, systems engineering or related disciplines
– Other expertise: data management knowledge; IT skills to develop, implement and maintain hardware and software
– Role in ecosystem: technical support in hardware & software across the Big Data ecosystem for data architecture, data administration, developer environments and applications
Demand-supply gap for data scientists* in the US, 2018: 2018E supply of ~300K against 2018E demand of 440K-490K – a gap of 140K-190K, or 50%-60% relative to supply
Demand-supply gap for data-savvy managers** in the US, 2018: 2018E supply of ~2.5 million against 2018E demand of ~4.0 million – a gap of ~1.5 million, or ~60% relative to supply
*Analysts with deep analytical training; **Managers able to analyze Big Data and make decisions based on their findings
Source: McKinsey Global Institute; CRISIL GR&A analysis
27. Global Big Data Market Size, 2011 – 2015F
[Bar chart, US$ billion: 2011E: 5.3-5.6; 2012E: 8.0-8.5; 2015F: 25.0-26.0]
The global Big Data market is expected to grow at a CAGR of about 46% over 2012-2015
IT & ITES, including analytics, is expected to grow the fastest, at a rate of more than 60%
– Its share of the total Big Data market is expected to increase to ~45% in 2015 from ~31% in 2011
The US$25 billion opportunity represents only the initial wave; it is set to expand even more rapidly after 2015, given the pace at which data is being generated
[Pie chart: Global Big Data Market Size, 2015F, ~US$25 billion – Big Data analytics & IT & IT-enabled services: US$10-11 billion; software: US$7-7.5 billion; hardware: US$6-6.5 billion]
The lion's share of the Big Data hardware and software market is expected to be occupied by IT giants like IBM, HP, Microsoft, SAP, SAS, Oracle, etc.
The opportunity for India lies in capturing the slice of IT services that includes Big Data analytics and IT & IT-enabled services
Source: Industry reporting; CRISIL GR&A analysis