The Next Frontier for Innovation, Competition and Productivity
What is Big Data?

Big Data relates to rapidly growing structured and unstructured datasets with sizes beyond the ability of conventional database tools to store, manage, and analyze them. In addition to its size and complexity, the term refers to the data's ability to help in "evidence-based" decision-making, having a high impact on business operations.

[Chart: analytics approaches plotted by size of data (gigabytes, terabytes, petabytes, zettabytes) against speed, accuracy and complexity of intelligence. Traditional and advanced analytics address small data sets; Big Data analytics applies advanced techniques to Big Data.]

Source: CRISIL GR&A analysis
Structured data
 Resides in formal data stores – RDBMS and data warehouses – grouped in the form of rows and columns
 Accounts for ~10% of the total data existing currently

Semi-structured data
 A form of structured data that does not conform to the formal structure of data models
 Accounts for ~10% of the total data existing currently

Unstructured data
 Comprises data formats which cannot be stored in row/column form, like audio files, video and clickstream data
 Accounts for ~80% of the total data existing currently

Examples across these categories include RDBMS (e.g., ERP and CRM), data warehousing, Microsoft Project plan files, web logs & clickstreams, audio, video, text messages, email, blogs, social media, sensor data/M2M, geospatial data, location co-ordinates and weather patterns.

Source: Industry reporting; CRISIL GR&A analysis
4
Volume
•Data
quantity
Velocity
•Data
Speed
Variety
•Data
Types
Veracity
•Data
Quality
Evolution of analytics

Analytics has grown steadily in complexity over time, moving from analytics as a separate value-chain function (late 1990s) to in-database analytics (2000 onwards):

 Descriptive (basic) analytics – standard reports, ad hoc reports, alerts, query drill-down. It answers: What happened? When did it happen? What was its impact?
 Advanced analytics – predictive analytics (statistical analysis, forecasting, predictive modeling) and prescriptive analytics (optimization, stochastic optimization). It answers: Why did it happen? When will it happen again? What caused it to happen? What can be done to avoid it?
 Big Data analytics – natural language processing, complex event processing, multivariate statistical analysis, time series analysis, behavioral analytics, data mining, constraint-based BI, social network analytics, semantic analytics, online analytical processing (OLAP), extreme SQL, visualization, and analytic database functions.

 Big Data analytics is where advanced analytic techniques are applied on Big Data sets
 The term came into play late 2011 – early 2012

Source: CRISIL GR&A analysis
What does the Big Data Ecosystem constitute?

The ecosystem takes input data from two kinds of sources – unstructured data (text, web pages, social media content, video, etc.) and structured operational data (stored in MPP, RDBMS and DW*) – and comprises four key elements:

1. Big Data management & storage: data storage infrastructure and technologies – Hadoop/Big Data technology frameworks (MapReduce etc.), NoSQL, MPP, RDBMS and DW*, along with data architecture and data administration tools
2. Big Data analytics: the technologies and tools to analyze the data and generate insight from it – analytics products (Avro, Apache Thrift), developer environments (languages such as Java, environments such as Eclipse and NetBeans, programming interfaces such as MapReduce), ETL & data integration products, system tools, and workflow/scheduler products
3. Big Data's application & use: enabling Big Data insights to work in BI and end-user applications – BI & visualization tools and applications (mobile, search, web) serving end users and business analysts
4. IT services: system integration, consulting, system design, project management and customization

*MPP – massively parallel processing; RDBMS – relational database management system; DW – data warehouse
Source: CRISIL GR&A analysis
 Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. It includes:
– MapReduce – offline computing engine
– HDFS – Hadoop Distributed File System
– HBase (pre-alpha) – online data access
 Yahoo! is the biggest contributor
 Here's what makes it especially useful:
◦ Scalable: it can reliably store and process petabytes.
◦ Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
◦ Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
◦ Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
HDFS is written with large clusters of computers in mind and is built around the following assumptions:
◦ Hardware will fail.
◦ Processing will be run in batches; thus there is an emphasis on high throughput as opposed to low latency.
◦ Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
◦ It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.
◦ Applications need a write-once-read-many access model.
◦ Moving computation is cheaper than moving data.
◦ Portability is important.
 MapReduce is a programming model developed at Google
 Sort/merge-based distributed computing
 Initially intended for Google's internal search/indexing application, but now used extensively by other organizations (e.g., Yahoo, Amazon.com, IBM)
 A functional-style programming model that is naturally parallelizable across a large cluster of workstations or PCs
 The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing required inter-machine communication. (This is the key to Hadoop's success.)
 Hadoop implements Google's MapReduce, using HDFS
 MapReduce divides applications into many small blocks of work
 HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster
 MapReduce can then process the data where it is located
 Hadoop's target is to run on clusters on the order of 10,000 nodes
 The runtime partitions the input and provides it to different Map instances
 Map: (key, value) → (key', value')
 The runtime collects the (key', value') pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key'
 Each Reduce produces a single (or zero) file output
 Map and Reduce are user-written functions
map(String key, String value):
  // key: document name; value: document contents
  // map(k1, v1) → list(k2, v2)
  for each word w in value:
    EmitIntermediate(w, "1");

// Example: if the input string is "God is God. I am I", Map produces
// {<"God", 1>, <"is", 1>, <"God", 1>, <"I", 1>, <"am", 1>, <"I", 1>}

reduce(String key, Iterator values):
  // key: a word; values: a list of counts
  // reduce(k2, list(v2)) → list(v2)
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

// Example: reduce("I", <1, 1>) → 2
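The pseudocode above can be sketched as a runnable single-process program. This is an illustrative sketch, not Hadoop's actual API: the tokenization rule (whitespace split with trailing punctuation stripped) and the in-memory shuffle are assumptions made here to match the slide's example.

```python
from collections import defaultdict

def map_func(key, value):
    # key: document name; value: document contents
    # Emit (word, 1) for every word; punctuation stripping is an assumption
    return [(w.strip('.,!?'), 1) for w in value.split()]

def reduce_func(key, values):
    # key: a word; values: the list of partial counts for that word
    return sum(values)

def word_count(documents):
    # Shuffle step: group intermediate pairs so each reduce call
    # sees every value emitted for one key
    groups = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_func(name, contents):
            groups[word].append(count)
    return {word: reduce_func(word, counts) for word, counts in groups.items()}

counts = word_count({"doc1": "God is God. I am I"})
# As in the example above: counts["God"] == 2 and counts["I"] == 2
```

In a real cluster the shuffle is performed by the framework across machines; here a dictionary plays that role.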
Can we do word count in parallel? Two input lines are mapped independently, and the intermediate counts are merged in the reduce step:

Input: "see bob throw" → map output: (see, 1), (bob, 1), (throw, 1)
Input: "see spot run" → map output: (see, 1), (spot, 1), (run, 1)
Reduced output: (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)
A MapReduce job passes through the following stages:
 InputFormat
 Map function (1:many)
 Partitioner
 Sorting & merging
 Combiner
 Shuffling
 Merging
 Reduce function
 OutputFormat
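To illustrate the Partitioner stage from the list above: a hash-based rule routes every intermediate pair with a given key to the same reduce task. The hash-mod rule below is an assumption for illustration (Hadoop's default HashPartitioner behaves similarly, but the exact function varies by implementation).

```python
def partition(key, num_reducers):
    # Same key -> same partition, so one reducer sees
    # all intermediate values emitted for that key
    return hash(key) % num_reducers

R = 4  # number of reduce tasks (illustrative)
intermediate = [("see", 1), ("bob", 1), ("see", 1), ("run", 1)]
partitions = {r: [] for r in range(R)}
for key, value in intermediate:
    partitions[partition(key, R)].append((key, value))

# Both ("see", 1) pairs land in the same partition,
# which is what lets the reduce step total them
```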
 Worker failure: The master pings every worker periodically. If
no response is received from a worker in a certain amount of
time, the master marks the worker as failed. Any map tasks
completed by the worker are reset back to their initial idle
state, and therefore become eligible for scheduling on other
workers. Similarly, any map task or reduce task in progress
on a failed worker is also reset to idle and becomes eligible
for rescheduling.
 Master failure: It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, in most cases, the user simply restarts the job.
 The input data (on HDFS) is stored on the local disks of the
machines in the cluster. HDFS divides each file into 64 MB
blocks, and stores several copies of each block (typically 3
copies) on different machines.
 The MapReduce master takes the location information of the
input files into account and attempts to schedule a map task
on a machine that contains a replica of the corresponding
input data. Failing that, it attempts to schedule a map task
near a replica of that task's input data. When running large
MapReduce operations on a significant fraction of the workers
in a cluster, most input data is read locally and consumes no
network bandwidth.
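A quick back-of-envelope check of the storage figures above (64 MB blocks, typically 3 replicas). The helper name is made up for illustration:

```python
import math

BLOCK_MB = 64   # HDFS block size stated above
REPLICAS = 3    # typical replication factor stated above

def hdfs_blocks(file_mb):
    # Number of blocks a file occupies, and the total block
    # replicas stored across the cluster (the last block may
    # be smaller than BLOCK_MB)
    blocks = math.ceil(file_mb / BLOCK_MB)
    return blocks, blocks * REPLICAS

blocks, replicas = hdfs_blocks(1024)  # a 1 GB file
# 16 blocks, stored as 48 replicas spread over different machines
```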
 The map phase has M pieces and the reduce phase has R pieces.
 M and R should be much larger than the number of worker machines.
 Having each worker perform many different tasks improves dynamic load
balancing, and also speeds up recovery when a worker fails.
 The larger M and R are, the more scheduling decisions the master must make
 R is often constrained by users because the output of each reduce task ends
up in a separate output file.
 Typically, (at Google), M = 200,000 and R = 5,000, using 2,000 worker
machines.
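Plugging in the figures quoted above shows why M and R should be much larger than the worker count: each worker handles many tasks over the job's lifetime, which improves load balancing and speeds up recovery.

```python
# Figures quoted above for a typical configuration at Google
M, R, WORKERS = 200_000, 5_000, 2_000

map_tasks_per_worker = M / WORKERS       # 100 map tasks per worker
reduce_tasks_per_worker = R / WORKERS    # 2.5 reduce tasks per worker
```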
 The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS:
◦ is highly fault-tolerant and is designed to be deployed on low-cost hardware;
◦ provides high-throughput access to application data and is suitable for applications that have large data sets;
◦ relaxes a few POSIX requirements to enable streaming access to file system data;
◦ is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.
Three key talent roles in the Big Data ecosystem:

Data scientists
 Role in ecosystem: Big Data analytics, business intelligence and visualization
 Requisite educational qualifications: advanced degree, such as an M.S. or Ph.D., in mathematics, statistics, economics, computer science or any decision science
 Other expertise: expertise in data analytics skills to extract data and use modeling & simulations; multi-disciplinary knowledge of business to find insights

Data-savvy managers
 Role in ecosystem: project management across the Big Data ecosystem – consulting services, implementation, infrastructure management and analytics
 Requisite educational qualifications: advanced business degree such as an MBA, M.S. or managerial diploma
 Other expertise: knowledge of statistics and/or machine learning to frame key questions and analyze answers; conceptual knowledge of business to interpret and challenge the insights; ability to make decisions using Big Data insights

Technical engineers
 Role in ecosystem: technical support in hardware & software across the Big Data ecosystem for data architecture, data administration, developer environments and applications
 Requisite educational qualifications: degree in computer science, information technology, systems engineering or related disciplines
 Other expertise: data management knowledge; IT skills to develop, implement and maintain hardware and software

Demand-supply gap in the US, 2018 (estimated):
 Data scientists*: supply of ~300K against demand of 440K–490K, a gap of 140K–190K (50%–60% relative to supply)
 Data-savvy managers**: supply of 2.5 million against demand of 4.0 million, a gap of 1.5 million (60% relative to supply)

*Analysts with deep analytical training; **Managers able to analyze Big Data and make decisions based on their findings
Source: McKinsey Global Institute; CRISIL GR&A analysis
Global Big Data market size, 2011–2015E (US$ billion): 5.3–5.6 (2011E), 8.0–8.5 (2012E), 25.0–26.0 (2015F)

 The global Big Data market is expected to grow at a CAGR of about 46% over 2012–2015
 IT & ITES, including analytics, is expected to grow the fastest, at a rate of more than 60%
– Its share of the total Big Data market is expected to increase to ~45% in 2015 from ~31% in 2011
 The USD 25 billion opportunity represents only the initial wave; it is set to expand even more rapidly after 2015, given the pace at which data is being generated

Global Big Data market size, 2015F (~US$25 billion): Big Data analytics plus IT & IT-enabled services account for roughly US$10–11 billion (consistent with the ~45% share above), with the remainder split between software and hardware at US$6–6.5 billion and US$7–7.5 billion

 The lion's share of the Big Data hardware and software market is expected to be occupied by IT giants like IBM, HP, Microsoft, SAP, SAS, Oracle, etc.
 The opportunity for India lies in capturing the slice of IT services that includes Big Data analytics and IT & IT-enabled services

Source: Industry reporting; CRISIL GR&A analysis
28

More Related Content

What's hot

Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1Abbas Maazallahi
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Mahantesh Angadi
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataWANdisco Plc
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyonddatasalt
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Bio bigdata
Bio bigdata Bio bigdata
Bio bigdata Mk Kim
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystemnallagangus
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP vinoth kumar
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big dataYukti Kaura
 

What's hot (20)

Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Big data hadoop rdbms
Big data hadoop rdbmsBig data hadoop rdbms
Big data hadoop rdbms
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Bio bigdata
Bio bigdata Bio bigdata
Bio bigdata
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Hadoop technology doc
Hadoop technology docHadoop technology doc
Hadoop technology doc
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 

Similar to Big data

Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoopAnusha sweety
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System cscpconf
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce cscpconf
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniquesijsrd.com
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Modeinventionjournals
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introductionrajsandhu1989
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training reportSarvesh Meena
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringIRJET Journal
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesDavid Tjahjono,MD,MBA(UK)
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopIOSR Journals
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & HadoopAhmed Gamil
 

Similar to Big data (20)

Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introduction
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
G017143640
G017143640G017143640
G017143640
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 

Recently uploaded

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Recently uploaded (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)

Big data

  • 1. The Next Frontier for Innovation, Competition and Productivity
  • 2. What is Big Data? Big Data refers to rapidly growing structured and unstructured datasets whose size exceeds the ability of conventional database tools to store, manage, and analyze them. Beyond sheer size and complexity, the term also refers to the data's ability to support "evidence-based" decision-making with a high impact on business operations. [Chart: analytics approaches plotted by size of data (gigabytes, terabytes, petabytes, zettabytes) against speed, accuracy, and complexity of intelligence: traditional analytics on small data sets, advanced analytics, and Big Data analytics on Big Data.] Source: CRISIL GR&A analysis
  • 3. Structured data resides in formal data stores, i.e., RDBMS (e.g., ERP and CRM) and data warehouses, grouped into rows and columns; it accounts for ~10% of the total data existing currently. Semi-structured data is a form of structured data that does not conform to the formal structure of data models (e.g., web logs and clickstreams, text messages, blogs, Microsoft Project plan files); it also accounts for ~10%. Unstructured data comprises formats that cannot be stored in rows and columns, such as audio, video, weather patterns, location co-ordinates, sensor/M2M data, email, social media, and geospatial data; it accounts for ~80%. Source: Industry reporting; CRISIL GR&A analysis
  • 4. The four V's of Big Data: Volume (data quantity), Velocity (data speed), Variety (data types), and Veracity (data quality)
  • 6. Evolution of analytics (level of complexity over time). Analytics evolved from a separate value-chain function (late 1990s) to in-database analytics (2000 onwards), progressing from descriptive analytics (standard reports, ad hoc reports, alerts, query drill-down) through predictive analytics (statistical analysis, forecasting, predictive modeling) to prescriptive analytics (optimization, stochastic optimization). Basic analytics asks: What happened? When did it happen? What was its impact? Advanced analytics asks: Why did it happen? When will it happen again? What caused it to happen? What can be done to avoid it? Techniques include multivariate statistical analysis, time-series analysis, behavioral analytics, data mining, constraint-based BI, social network analytics, semantic analytics, online analytical processing (OLAP), extreme SQL, visualization, analytic database functions, natural language processing, and complex event processing. Big Data analytics is where advanced analytic techniques are applied to Big Data sets; the term came into play in late 2011 to early 2012. Source: CRISIL GR&A analysis
  • 7. What does the Big Data ecosystem constitute? Four key elements: (1) Big Data management and storage: data storage infrastructure and technologies (Hadoop-based, NoSQL, MPP, RDBMS, DW*) for structured and unstructured input data (text, web pages, social media content, video, etc.), plus data administration tools, ETL and data integration products, system tools, and workflow/scheduler products. (2) Big Data analytics: the technologies and tools to analyze the data and generate insight from it, including developer environments (languages such as Java, environments such as Eclipse and NetBeans, programming interfaces such as MapReduce) and analytics products (Avro, Apache Thrift). (3) Big Data's application and use: enabling Big Data insights to work in BI and visualization tools and end-user applications (mobile, search, web) for business analysts and end users. (4) IT services, including system integration, consulting, customization, system design, and project management. *MPP: massively parallel processing; RDBMS: relational database management system; DW: data warehouse. Source: CRISIL GR&A analysis
  • 10. Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. It includes: MapReduce (offline computing engine), HDFS (the Hadoop distributed file system), and HBase (pre-alpha; online data access). Yahoo! is the biggest contributor. Here's what makes it especially useful: Scalable: it can reliably store and process petabytes. Economical: it distributes the data and processing across clusters of commonly available computers, in the thousands. Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located. Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks on failure.
  • 11. HDFS is written with large clusters of computers in mind and is built around the following assumptions: Hardware will fail. Processing will be run in batches; thus there is an emphasis on high throughput as opposed to low latency. Applications that run on HDFS have large data sets; a typical file in HDFS is gigabytes to terabytes in size. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance. Applications need a write-once-read-many access model. Moving computation is cheaper than moving data. Portability is important.
  • 12. MapReduce is a programming model developed at Google for sort/merge-based distributed computing. Initially intended for Google's internal search/indexing application, it is now used extensively by many other organizations (e.g., Yahoo, Amazon.com, IBM). It is a functional-style programming model that is naturally parallelizable across a large cluster of workstations or PCs. The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.)
  • 13. Hadoop implements Google's MapReduce, using HDFS. MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located. Hadoop's target is to run on clusters on the order of 10,000 nodes.
  • 14. The runtime partitions the input and provides it to different Map instances: Map(key, value) → (key', value'). The runtime then collects the (key', value') pairs and distributes them to several Reduce functions, so that each Reduce function gets the pairs with the same key'. Each Reduce produces a single (or zero) file output. Map and Reduce are user-written functions.
  • 15. Word count in MapReduce pseudocode:
    map(String key, String value):
      // key: document name; value: document contents
      // map(k1, v1) → list(k2, v2)
      for each word w in value:
        EmitIntermediate(w, "1");
    Example: for the input string "God is God. I am I", Map produces {<"God", 1>, <"is", 1>, <"God", 1>, <"I", 1>, <"am", 1>, <"I", 1>}
    reduce(String key, Iterator values):
      // key: a word; values: a list of counts
      // reduce(k2, list(v2)) → list(v2)
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
    Example: reduce("I", <1, 1>) → 2
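The pseudocode above can be sketched as a minimal, runnable word count in plain Python; this is a simulation of the MapReduce model, not the Hadoop API, and the `map_fn`/`reduce_fn`/`mapreduce` names are illustrative:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document contents.
    # Emit one intermediate (word, 1) pair per word; punctuation is
    # stripped so "God." and "God" count together, as on the slide.
    for word in value.split():
        yield (word.strip(".,;!?"), 1)

def reduce_fn(key, values):
    # key: a word; values: the list of counts collected for it.
    return key, sum(values)

def mapreduce(documents):
    # Shuffle phase: group intermediate pairs by key.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            groups[k].append(v)
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(mapreduce({"doc1": "God is God. I am I"}))
# {'God': 2, 'is': 1, 'I': 2, 'am': 1}
```

In the real framework, the shuffle runs across machines and each reduce call sees only its own key's values; here a single dictionary plays both roles.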
  • 16. Can we do word count in parallel? Two input splits, "see bob throw" and "see spot run", are mapped independently to (see, 1) (bob, 1) (throw, 1) and (see, 1) (spot, 1) (run, 1); after the pairs are shuffled by key and reduced, the merged counts are bob 1, run 1, see 2, spot 1, throw 1.
  • 18. Stages of a MapReduce job: InputFormat, Map function, Partitioner, sorting & merging, Combiner, shuffling, merging, Reduce function, OutputFormat (1:many).
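As a rough sketch of how these stages fit together, here is a toy, runnable walk through map, combiner, partitioner, shuffle, sort/merge, and reduce for a word count. The stage order is simplified, and the two-reducer setup, function names, and hash partitioning are illustrative assumptions, not Hadoop's API:

```python
from collections import defaultdict

NUM_REDUCERS = 2

def partitioner(key):
    # Decides which reduce task receives a given key.
    return hash(key) % NUM_REDUCERS

def run_job(splits):
    shuffle = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for split in splits:
        # Map: emit one (word, 1) pair per word in this input split.
        pairs = [(w, 1) for w in split.split()]
        # Combiner: pre-aggregate locally to cut shuffle traffic.
        combined = defaultdict(int)
        for k, v in pairs:
            combined[k] += v
        # Partition + shuffle: route each key to one reducer's bucket.
        for k, v in combined.items():
            shuffle[partitioner(k)][k].append(v)
    # Sort/merge + reduce: each reducer sums its keys in sorted order.
    result = {}
    for bucket in shuffle:
        for k in sorted(bucket):
            result[k] = sum(bucket[k])
    return result

print(run_job(["see bob throw", "see spot run"]))
```

The combiner is a map-side mini-reduce: it shrinks the data each mapper sends over the network, which is why it sits before the shuffle in this sketch.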
  • 19. Worker failure: the master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset to their initial idle state and therefore become eligible for scheduling on other workers. Similarly, any map or reduce task in progress on a failed worker is reset to idle and becomes eligible for rescheduling. Master failure: it is easy to make the master write periodic checkpoints of its data structures; if the master task dies, a new copy can be started from the last checkpointed state. In most cases, however, the user simply restarts the job.
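The worker-failure policy above can be simulated in a few lines; the class and field names here are illustrative, not taken from Hadoop's or Google's code, and the timeout value is an assumption:

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative value

class Master:
    def __init__(self, workers):
        # Last heartbeat time per worker.
        self.last_seen = {w: time.monotonic() for w in workers}
        # task id -> {"worker", "kind" ("map"/"reduce"), "state"}
        self.tasks = {}

    def heartbeat(self, worker):
        self.last_seen[worker] = time.monotonic()

    def check_workers(self, now=None):
        now = time.monotonic() if now is None else now
        for w, seen in self.last_seen.items():
            if now - seen > HEARTBEAT_TIMEOUT:
                self.mark_failed(w)

    def mark_failed(self, worker):
        for t in self.tasks.values():
            if t["worker"] != worker:
                continue
            if t["state"] == "in_progress":
                # In-progress map or reduce work is lost: reschedule.
                t.update(state="idle", worker=None)
            elif t["state"] == "completed" and t["kind"] == "map":
                # A completed map's output lived on the failed worker's
                # local disk, so it must be re-run; a completed reduce's
                # output is already in the global file system.
                t.update(state="idle", worker=None)

master = Master(["w1", "w2"])
master.tasks = {
    "m1": {"worker": "w1", "kind": "map", "state": "completed"},
    "r1": {"worker": "w1", "kind": "reduce", "state": "in_progress"},
    "r2": {"worker": "w2", "kind": "reduce", "state": "completed"},
}
# Simulate no heartbeats for longer than the timeout.
master.check_workers(now=time.monotonic() + 60)
print({tid: t["state"] for tid, t in master.tasks.items()})
# {'m1': 'idle', 'r1': 'idle', 'r2': 'completed'}
```

Note the asymmetry the slide implies: completed map tasks are re-run after a failure, while completed reduce tasks are not.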
  • 20.  The input data (on HDFS) is stored on the local disks of the machines in the cluster. HDFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines.  The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data. When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth. 20
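The locality preference described above (schedule on a node holding a replica; failing that, near one) can be sketched under stated assumptions: we know which nodes hold a replica of the task's input block and which rack each node is in. All names and the rack layout are illustrative:

```python
def pick_node(block_replicas, free_nodes, rack_of):
    # 1. Data-local: the task reads its input from local disk.
    for n in free_nodes:
        if n in block_replicas:
            return n, "data-local"
    # 2. Rack-local: "near a replica", i.e. same rack, one cheap hop.
    replica_racks = {rack_of[n] for n in block_replicas}
    for n in free_nodes:
        if rack_of[n] in replica_racks:
            return n, "rack-local"
    # 3. Off-rack: any free node; the input crosses the core network.
    return free_nodes[0], "off-rack"

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
replicas = {"n1", "n3"}  # nodes holding copies of the input block
print(pick_node(replicas, ["n1", "n4"], rack_of))  # ('n1', 'data-local')
print(pick_node(replicas, ["n2", "n4"], rack_of))  # ('n2', 'rack-local')
```

With three replicas per 64 MB block spread across racks, most map tasks land in case 1, which is why large jobs read mostly from local disk and consume little network bandwidth.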
  • 21. The map phase has M pieces and the reduce phase has R pieces. M and R should be much larger than the number of worker machines: having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails. However, the larger M and R are, the more scheduling decisions the master must make. R is often constrained by users because the output of each reduce task ends up in a separate output file. Typically (at Google), M = 200,000 and R = 5,000, using 2,000 worker machines.
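A back-of-the-envelope check on the figures quoted above, sketched in Python:

```python
M, R, WORKERS = 200_000, 5_000, 2_000  # figures quoted above

# Fine-grained tasks: each worker handles ~100 map tasks and a few
# reduce tasks over the job's lifetime, which enables dynamic load
# balancing and spreads a failed worker's re-executed work thinly.
map_tasks_per_worker = M / WORKERS      # 100.0
reduce_tasks_per_worker = R / WORKERS   # 2.5

# R also fixes the number of output files, one per reduce task.
output_files = R

print(map_tasks_per_worker, reduce_tasks_per_worker, output_files)
# 100.0 2.5 5000
```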
  • 22. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant: it is highly fault-tolerant and designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is suitable for applications that have large data sets; it relaxes a few POSIX requirements to enable streaming access to file system data; and it is part of the Apache Hadoop Core project (http://hadoop.apache.org/core/).
  • 26. Demand-supply gaps for Big Data talent in the US, 2018 estimates: for data scientists*, a supply of ~300K against demand of 440K-490K, a gap of 140K-190K, or 50%-60% relative to supply; for data-savvy managers**, a supply of ~2.5 million against demand of ~4.0 million, a gap of ~1.5 million, or 60% relative to supply. Three roles in the ecosystem:
    Data scientists: work on Big Data analytics, business intelligence, and visualization; requisite qualification is an advanced degree such as an M.S. or Ph.D. in mathematics, statistics, economics, computer science, or other decision sciences; other expertise includes data analytics skills to extract data and use modeling & simulations, plus multi-disciplinary knowledge of business to find insights.
    Data-savvy managers: handle project management across the Big Data ecosystem (consulting services, implementation, infrastructure management, analytics); requisite qualification is an advanced business degree such as an MBA, M.S., or managerial diploma; other expertise includes knowledge of statistics and/or machine learning to frame key questions and analyze answers, conceptual knowledge of business to interpret and challenge the insights, and the ability to make decisions using Big Data insights.
    Technical engineers: provide technical support in hardware & software across the Big Data ecosystem (data architecture, data administration, developer environment, applications); requisite qualification is a degree in computer science, information technology, systems engineering, or related disciplines; other expertise includes data management knowledge and IT skills to develop, implement, and maintain hardware and software.
    *Analysts with deep analytical training; **managers able to analyze Big Data and make decisions based on their findings. Source: McKinsey Global Institute; CRISIL GR&A analysis
  • 27. Global Big Data market size: US$5.3-5.6 billion in 2011E, US$8.0-8.5 billion in 2012E, and US$25.0-26.0 billion in 2015F. The global Big Data market is expected to grow at a CAGR of about 46% over 2012-2015. IT & IT-enabled services, including analytics, is expected to grow the fastest, at a rate of more than 60%; its share of the total Big Data market is expected to increase from ~31% in 2011 to ~45% in 2015. Of the ~US$25 billion 2015F market, Big Data analytics and IT & IT-enabled services account for US$10-11 billion, with the remainder split between software and hardware (US$6-6.5 billion and US$7-7.5 billion). The US$25 billion opportunity represents only the initial wave; it is set to expand even more rapidly after 2015, given the pace at which data is being generated. The lion's share of the Big Data hardware and software market is expected to be captured by IT giants such as IBM, HP, Microsoft, SAP, SAS, and Oracle; the opportunity for India lies in capturing the slice of IT services that includes Big Data analytics and IT & IT-enabled services. Source: Industry reporting; CRISIL GR&A analysis