BIGDATA
How much data? How does it matter?
 Data size matters…
 How does it matter?
 Like this…
bit (b)         0 or 1        1/8 of a byte
byte (B)        8 bits        1 byte
kilobyte (KB)   1000^1 bytes  1,000 bytes
megabyte (MB)   1000^2 bytes  1,000,000 bytes
gigabyte (GB)   1000^3 bytes  1,000,000,000 bytes
terabyte (TB)   1000^4 bytes  1,000,000,000,000 bytes
petabyte (PB)   1000^5 bytes  1,000,000,000,000,000 bytes
exabyte (EB)    1000^6 bytes  1,000,000,000,000,000,000 bytes
zettabyte (ZB)  1000^7 bytes  1,000,000,000,000,000,000,000 bytes
yottabyte (YB)  1000^8 bytes  1,000,000,000,000,000,000,000,000 bytes
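The table uses decimal (1000-based) units. A minimal Python sketch (not from the slides; the helper name is my own) that converts a raw byte count into these units:

UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Format a byte count using the 1000-based units from the table above."""
    for unit in UNITS:
        if num_bytes < 1000:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:.1f} YB"  # anything larger stays expressed in yottabytes

print(human_readable(1_500_000))     # 1.5 MB
print(human_readable(40 * 1000**7))  # 40.0 ZB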
 But where, and in which companies?
 Everywhere… like in…
 Where in real time…? (a series of image slides)
Asia's largest and the world's third-largest data centre is in Bengaluru.
Simple to start
What is the maximum file size you have dealt with so far?
What movies, files, or streaming video have you used? What have you observed?
What is the maximum download speed you get?
A simple computation: how much time does it take just to transfer the data?
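A quick sketch of that computation in Python (the file sizes and link speed below are assumed example numbers, not from the slides):

def transfer_time_seconds(size_bytes: float, speed_mbps: float) -> float:
    """Time to transfer size_bytes over a link rated in megabits per second."""
    bits = size_bytes * 8
    return bits / (speed_mbps * 1_000_000)

# A 4 GB movie on a 50 Mbps connection:
print(transfer_time_seconds(4 * 1000**3, 50) / 60)      # ~10.7 minutes

# 1 PB of "big data" on the same link:
print(transfer_time_seconds(1 * 1000**5, 50) / 86_400)  # ~1,852 days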
Introduction to Big Data
Big Data is a term used for collections of data sets that are so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications.
 The challenge includes capturing, curating, storing, searching, sharing, transferring, analyzing, and visualizing this data.
Big Data Driving Factors
Dimensions of Big Data
Big data spans three dimensions: Volume, Velocity, and Variety.
 Volume: Volume refers to the amount of data, which is growing day by day at a very fast pace.
 The size of data generated by humans, machines, and their interactions on social media alone is massive.
 Researchers have predicted that 40 zettabytes (40,000 exabytes) will be generated by 2020, an increase of 300 times over 2005.
VELOCITY
 Velocity is the pace at which different sources generate data every day.
 This flow of data is massive and continuous. There are 1.03 billion Daily Active Users (Facebook DAU) on mobile as of now, an increase of 22% year-over-year.
 This shows how fast the number of users is growing on social media and how fast data is being generated daily.
Examples of Big Data
 Every day we upload millions of bytes of data. 90% of the world's data has been created in the last two years.
Examples of Big Data
 Walmart handles more than 1 million customer transactions every hour.
 Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
 230+ million tweets are created every day.
 More than 5 billion people are calling, texting, tweeting, and browsing on mobile phones worldwide.
 YouTube users upload 48 hours of new video every minute of the day.
Contd.
 Amazon handles 15 million customer clickstream records per day to recommend products.
 294 billion emails are sent every day; email services analyze this data to filter out spam.
 Modern cars have close to 100 sensors that monitor fuel level, tire pressure, etc.; each vehicle generates a lot of sensor data.
Traits of Big Data
The eight (8) 'V' dimension characteristics of Big Data:
Part One: Volume, Velocity, Variety
Part Two: Variability (unpredictability), Veracity (reliability), Virality (circulated rapidly), Visualization, and Value.
 Veracity
 Big Data veracity refers to the biases, noise, and abnormality in data: is the data being stored and mined meaningful to the problem being analyzed? Inderpal feels veracity in data analysis is the biggest challenge when compared to things like volume and velocity.
 Validity
 Like veracity, validity asks whether the data is correct and accurate for the intended use. Clearly, valid data is key to making the right decisions.
 Volatility
Big data volatility refers to how long data is valid and how long it should be stored.
In this world of real-time data, you need to determine at what point data is no longer relevant to the current analysis.
Challenges of Conventional Systems
• Conventional analytical tools and techniques are
inadequate to handle data that is unstructured
(like text data), that is too large in size, or that is
growing rapidly like social media data.
• A cluster analysis on a 200MB file with 1 million
customer records is manageable, but the same
cluster analysis on 1000GB of Facebook customer
profile information will take a considerable
amount of time if conventional tools and
techniques are used.
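The scaling arithmetic behind that claim can be made concrete. A rough sketch in Python (the one-minute baseline is an assumption for illustration):

small_mb = 200           # 200 MB file with 1 million customer records
big_mb = 1000 * 1000     # 1000 GB expressed in MB

scale = big_mb / small_mb
print(scale)             # 5000.0 -- the big job is 5,000x the data

# Even if the small clustering job took only 1 minute and the algorithm
# scaled perfectly linearly (many clustering algorithms scale worse),
# the big job would need:
print(scale * 1 / 60)    # ~83 hours on a single machine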
Challenges of Conventional Systems
• Facebook, as well as entities like Google and Walmart, generates data in petabytes every day.
• Traditional analytics operates on a known data environment, and only on data that is already well understood. It cannot work on unstructured data efficiently.
• Traditional analytics is built on top of the relational data model: relationships between the subjects of interest are created inside the system, and the analysis is done based on them. This approach is not adequate for big data analytics.
Challenges of Conventional Systems
• Traditional analytics is batch oriented: we need to wait for nightly ETL (extract, transform, and load) and transformation jobs to complete before the required insight is obtained.
• Parallelism in a traditional analytics system is achieved through costly hardware like MPP (Massively Parallel Processing) systems.
Other Challenges of Conventional
Systems
• Data challenges
• Volume, velocity, veracity, variety
• Data discovery and comprehensiveness
• Scalability
• Process challenges
• Capturing Data
• Aligning data from different sources
• Transforming data into suitable form for data analysis
• Modeling data (Mathematically, simulation)
• Understanding output, visualizing results, and display issues on mobile devices.
Characteristics of Big Data
The original three 'V' dimension characteristics of Big Data, identified in 2001, are: Volume, Velocity, and Variety.
Features of Big Data: Security, Compliance, Auditing, and Protection
 The sheer size of a Big Data repository brings with it a major
security challenge, generating the age-old question presented to
IT: How can the data be protected?
 Steps to Securing Big Data
 Classifying Data
 Protecting Big Data Analytics
 Big Data and Compliance
 The Intellectual Property Challenge
Security, Compliance, Auditing
and Protection
 Data Access:
Data can be easily protected, but only if you eliminate access to the data.
That’s not a pragmatic solution, to say the least. The key is to control access,
but even then, knowing the who, what, when, and where of data access is only
a start.
 Data availability:
Controlling where the data are stored and how the data are distributed. The
more control you have, the better you are positioned to protect the data.
Security, Compliance,
Auditing and Protection
 Performance:
Higher levels of encryption, complex security methodologies, and
additional security layers can all improve security. However, these
security techniques all carry a processing burden that can severely
affect performance.
 Liability:
Accessible data carry with them liability, such as the sensitivity of the
data, the legal requirements connected to the data, privacy issues, and
intellectual property concerns.
PRAGMATIC STEPS TO SECURING BIG DATA
 First, get rid of data that are no longer needed. If you do not need certain information, it should be destroyed, because it represents a risk to the organization.
 Some information cannot legally be destroyed; in that case, the information should be securely archived by an offline method.
 The real challenge is to decide which data are needed, since value can be found in unexpected places. For example, getting rid of activity logs may seem a smart move from a security standpoint, yet those logs may hold unexpected analytic value.
Classifying data
 Protecting data becomes much easier if the data are classified—that is, the
data should be divided into appropriate groupings for management
purposes.
 For example, internal e-mails between two colleagues should not be secured or treated the same way as financial reports, human resources (HR) information, or customer data.
 Classification can become a powerful tool for determining the sensitivity of
data.
 A simple approach may just include classifications such as financial, HR, sales,
inventory, and communications, each of which is self-explanatory and offers
insight into the sensitivity of the data.
o Once organizations better understand their data, they can take important
steps to segregate the information, which will make the deployment of
security measures like encryption and monitoring more manageable.
PROTECTING BIG DATA ANALYTICS
 The real cause of concern is the fact that Big Data contains all of the
things you don’t want to see when you are trying to protect data.
 Big Data can contain unique sample sets—for example, data from devices that monitor physical elements (e.g., traffic, movement, soil pH, rain, wind) on a frequent schedule, accumulated frequently and in real time.
 All of the data are unique to the moment, and if they are lost, they are
impossible to recreate.
 That uniqueness also means you cannot leverage time-saving backup
preparation and security technologies, such as deduplication.
 This greatly increases the capacity requirements for backup
subsystems, slows down security scanning, makes it harder to detect
data corruption, and complicates archiving.
 There is also the issue of the large size and number of files often
found in Big Data analytic environments.
 In order for a backup application and associated appliances or
hardware to churn through a large number of files, bandwidth to the
backup systems and/or the backup appliance must be large, and the
receiving devices must be able to ingest data at the rate that the data
can be delivered.
BIG DATA AND COMPLIANCE
 Compliance has a major effect on how Big Data is protected, stored,
accessed, and archived.
 Big Data is not easily handled by an RDBMS; this means it is harder to understand how compliance affects the data.
 Big Data is transforming the storage and access paradigms to an
emerging new world of horizontally scaling, unstructured databases,
which are better at solving some old business problems through
analytics.
 New data types and methodologies are still expected to meet the
legislative requirements expected by compliance laws.
 Health care probably provides the best example for those charged
with compliance as they examine how Big Data creation, storage,
and flow work in their organizations.
 Electronic health record systems, driven by the Health Insurance Portability and Accountability Act (HIPAA), store personal information.
 Unfortunately, most of the data stores in use today—including
Hadoop, Cassandra, and MongoDB—do not incorporate sufficient
data security tools to provide enterprises with the peace of mind that
confidential data will remain safe and secure at all times.
THE INTELLECTUAL PROPERTY CHALLENGE
 One of the biggest issues around Big Data is the concept of
intellectual property (IP).
 IP refers to creations of the human mind, such as inventions, literary
and artistic works, and symbols, names, images, and designs used in
commerce.
 Between 1985 and 2010, the number of patents granted worldwide rose from slightly less than 400,000 to more than 900,000, an increase of more than 125 percent over one generation (25 years).
 The same concepts just have to be expanded into the realm of Big
Data. Some basic rules are as follows:
 Understand what IP is and know what you have to protect:
 Know what needs protection, how to protect it, and whom to protect it from. To do so, IP security in IT (usually a computer security officer, or CSO) must communicate on an ongoing basis with the executives who oversee intellectual capital, meeting at least quarterly. Corporate leaders will be the foundation for protecting IP.
o Prioritize protection:
 CSOs with extensive experience normally recommend doing a risk and cost-benefit analysis.
 This requires you to create a map of your company's assets and determine what information, if lost, would hurt your company the most.
 This helps you figure out where to best allocate your protective efforts.
 Label:
 Confidential information should be labeled appropriately. If company
data are proprietary, note that on every log-in screen.
o Lock it up:
 Physical as well as digital protection schemes are a must. Rooms that
store sensitive data should be locked. This applies to everything from the
server farm to the file room. Keep track of who has the keys, always use
complex passwords, and limit employee access to important databases.
o Educate employees.
o Know your tools:
 Those tools can locate sensitive documents and keep track of how they
are being used and by whom.
o Use a counterintelligence mind-set:
 If you were spying on your own company, how would you do it?
 These guidelines can be applied to almost any information
security paradigm that is geared toward protecting IP. The
same guidelines can be used when designing IP protection for a
Big Data platform.
Analysis vs Reporting
Where does "Reporting" stop and "Analytics" kick in? Let us try to understand the differences first.
 While Reporting provides data, Analytics is supposed to provide answers.
 Reporting is typically standardized, while Analytics is customized.
 Reporting has a stringent format, while Analytics is flexible.
 Reporting provides what is typically asked for, while Analytics caters to the underlying need.
 The output of Reporting is in the form of canned reports, dashboards, and alerts, while Analytics produces presentations comprising insights, recommended actions, and a forecast of their impact on the company.
 Reporting includes building, configuring, consolidating, organizing, formatting, and summarizing data, while Analytics consists of questioning, examining, interpreting, predicting, and prescribing.
Analysis vs Reporting
Both reporting and analysis play their roles in influencing and driving the actions in an organization, with the ultimate goal of value maximization.
Analysis vs Reporting
• Canned reports:
• These are the out-of-the-box and custom reports that
you can access within the analytics tool.
• In general, some canned reports are more valuable
than others, and a report’s value may depend on how
relevant it is to an individual’s role (e.g., SME or
specialist vs. web producer).
• Dashboards:
• These custom-made reports combine different KPIs
and reports to provide a comprehensive, high-level
view of business performance for specific audiences.
Dashboards may include data from various data
sources and are also usually fairly static.
Analysis vs Reporting
• Alerts:
• These conditional reports are triggered when data
falls outside of expected ranges or some other pre-
defined criteria is met. Once people are notified of
what happened, they can take appropriate action as
necessary.
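A minimal sketch of such an alert rule in Python (the metric name and thresholds are hypothetical):

def check_alert(metric: str, value: float, low: float, high: float) -> None:
    """Notify when a metric falls outside its expected range."""
    if value < low or value > high:
        print(f"ALERT: {metric}={value} outside expected range [{low}, {high}]")

# Example: daily page views expected between 10,000 and 50,000
check_alert("daily_page_views", 7421, 10_000, 50_000)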
Analysis vs Reporting
The first hurdle is to overcome the initial confusion between Reporting and Analytics, and then take a leap towards deriving the real benefits of analysis.
Analysis vs Reporting
Four stages of Analytics maturity model
 Descriptive (Pure play Reporting)
 Diagnostic
 Predictive
 Prescriptive
Analytic Processes and Tools
• Big Data Analytics is the process of examining large data sets containing a variety of data types – i.e., Big Data – to uncover market trends, customer preferences, and other useful information.
• Companies and enterprises that implement Big Data Analytics often gain several business benefits, such as more effective marketing campaigns, new revenue opportunities, improved customer service delivery, more efficient operations, and competitive advantages.
• Companies implement Big Data Analytics
because they want to make more informed
business decisions.
• Big Data Analytics gives analytics professionals, such as data scientists and predictive modellers, the ability to analyze Big Data from multiple and varied sources, including transactional data and other structured data.
Analytic Processes and Tools
Types of Big Data Analytics Tools
Big Data Analytics tools are important for companies and
enterprises because of the huge volume of Big Data now
generated and managed by modern organizations.
Big Data Analytics tools also help businesses save time and
money in gaining insights to inform data-driven decisions.
The different types of Big Data Analytics tools are: data storage and management, data cleaning, data mining, data analysis, data visualization, data integration, and data collection.
Types of Big Data Analytics Tools and Environment
Modern Data Analytic Tools
• Whenever analysts or journalists assemble lists of the top
trends for this year, "big data" is almost certain to be on
the list.
• Big data isn't really a new concept. Computers have
always worked with large and growing sets of data, and
we've had databases and data warehouses for years.
• What is new is how much bigger that data is, how quickly it is growing, and how complicated it is. Enterprises understand that the data in their systems represents a wealth of insights that could help them improve their processes and their performance.
• But they need tools that will allow them to collect and analyze that data.
Modern Data Analytic Tools
• Interestingly, many of the best and best known
big data tools available are open source
projects. The very best known of these is
Hadoop, which is spawning an entire industry
of related services and products.
• Alongside Hadoop there are at least 49 other big data projects: many Apache projects related to Hadoop, as well as open source NoSQL databases, business intelligence tools, development tools, and much more.
• Here they are…
Modern Data Analytic Tools
Big Data Companies: The Leaders
Tableau
Tableau started out by offering visualization techniques for
exploring and analyzing relational databases and data cubes
and has expanded to include Big Data research.
It offers visualization of data from any source, from Hadoop to
Excel files.
New Relic
New Relic uses a SaaS model for monitoring Web and mobile
applications in real-time that run in the cloud, on-premises, or
in a hybrid mix.
Its plug-ins cover PaaS/cloud services, caching, databases, Web servers, and queuing.
Modern Data Analytic Tools
IBM
IBM offers cloud services for massive compute scale through its SoftLayer subsidiary. On the software side, its DB2, Informix, and InfoSphere support Big Data analytics, and its Cognos and SPSS analytics software specialize in BI. IBM also offers InfoSphere, the data integration and data warehousing platform used in a Big Data scenario.
VMware
VMware has incorporated Big Data into its flagship
virtualization product, called VMware vSphere Big Data
Extensions. BDE is a virtual appliance that enables
administrators to deploy and manage the Hadoop clusters
under vSphere. It supports a number of Hadoop
distributions, including Apache, Cloudera, Hortonworks,
MapR and Pivotal.
Modern Data Analytic Tools
SAP
SAP's main Big Data tool is HANA. It can run analytics on
80 terabytes of data and integrates with Hadoop. It can also
perform advanced analytics, like predictive analytics, spatial
data processing, text analytics, text search, streaming
analytics, and graph data processing and has ETL (Extract,
Transform, and Load) capabilities.
Oracle
Oracle offers its Big Data Appliance with a number of software products, including Oracle NoSQL Database, Apache Hadoop, Oracle Data Integrator Application Adapter for Hadoop, Oracle Loader for Hadoop, the Oracle R Enterprise tool for the R programming language, Oracle Linux, and so on.
Modern Data Analytic Tools
Pentaho
Pentaho is a suite of open source-based tools for business analytics that has expanded to cover Big Data. The suite offers data integration, OLAP services, reporting, a dashboard, data mining, and ETL capabilities. Pentaho for Big Data is a data integration tool specifically designed for executing ETL jobs in and out of Big Data environments such as Apache Hadoop or Hadoop distributions on Amazon, Cloudera, and others.
Thoughtworks
Thoughtworks incorporates Agile software development principles into building Big Data applications through its Agile Analytics product. It builds applications for data warehousing and business intelligence using the fast-paced Agile process for quick and continuous delivery of new applications that extract insight from data.
Modern Data Analytic Tools
Amazon Web Services
Amazon has a number of enterprise Big Data platforms, including the Hadoop-based Elastic MapReduce, Kinesis Firehose for streaming massive amounts of data into AWS, Kinesis Analytics to analyze that data, the DynamoDB NoSQL big data database, and HBase. All of these services work within its greater Amazon Web Services offerings.
Microsoft
Microsoft has a partnership with Hortonworks and offers the HDInsight tool for analyzing structured and unstructured data on Hortonworks. SQL Server 2016 comes with a connector to Hadoop for Big Data processing, and Microsoft recently acquired Revolution Analytics, which made the only Big Data analytics platform written in R, a programming language for building Big Data apps without requiring the skills of a data scientist.
Modern Data Analytic Tools
Some other Big Data Companies are……..
• Tibco Jaspersoft
• Google
• Mu Sigma
• HP Enterprise
• Big Panda
• Cogito
• Alation
• Splunk
• ……
Modern Data Analytic Tools
Open Source Big Data Analysis Platforms and Tools
Hadoop
You simply can't talk about big data without
mentioning Hadoop. The Apache distributed data processing
software is so pervasive that often the terms "Hadoop" and "big
data" are used synonymously. The Apache Foundation also
sponsors a number of related projects that extend the
capabilities of Hadoop. In addition, numerous vendors offer
supported versions of Hadoop and related technologies.
Operating System: Windows, Linux, OS X.
Modern Data Analytic Tools
MapReduce
Originally developed by Google,
the MapReduce website describes it as "a programming
model and software framework for writing applications
that rapidly process vast amounts of data in parallel on
large clusters of compute nodes." It's used by Hadoop,
as well as many other data processing applications.
Operating System: OS Independent.
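To make the model concrete, here is a toy word count, the classic MapReduce example, sketched in plain Python (no Hadoop; the grouping step that the framework normally performs is simulated with a dictionary):

from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Reduce: aggregate all values seen for one key."""
    return word, sum(counts)

lines = ["big data is big", "data about data"]

groups = defaultdict(list)          # shuffle/sort: group values by key
for line in lines:
    for word, count in map_phase(line):
        groups[word].append(count)

print(sorted(reduce_phase(w, c) for w, c in groups.items()))
# [('about', 1), ('big', 2), ('data', 3), ('is', 1)]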
Open Source Big Data Analysis Platforms and Tools
• GridGain
• It offers an alternative to Hadoop's MapReduce that is
compatible with the Hadoop Distributed File System.
• It offers in-memory processing for fast analysis of real-time data. You can download the open source version from GitHub or purchase a commercially supported version.
• Operating System: Windows, Linux, OS X.
• HPCC Systems
Developed by LexisNexis Risk Solutions, HPCC
Systems is short for "high performance computing cluster."
It claims to offer superior performance to Hadoop.
• Both free community versions and paid enterprise
versions are available. Operating System: Linux.
Open Source Big Data Analysis Platforms and Tools
• Storm
• Now owned by Twitter, Storm offers distributed real-time computation capabilities and is often described as the "Hadoop of real-time." It's highly scalable, robust, fault-tolerant, and works with nearly all programming languages.
• Operating System: Linux.
Open Source Big Data Analysis Platforms and Tools
Open Source Big Data Business Intelligence Tools
• Talend
• Talend Open Studio for Big Data, which is a set of data
integration tools that support Hadoop, HDFS, Hive, Hbase
and Pig.
• The company also sells an enterprise edition and other
commercial products and services.
• Operating System: Windows, Linux, OS X.
• Jaspersoft
• Jaspersoft claims to make "the most flexible, cost effective and widely deployed business intelligence software in the world."
• Jedox
• The open source Palo Suite includes an OLAP Server,
Palo Web, Palo ETL Server and Palo for Excel.
• Jedox offers commercial software based on the same
tools.
• Operating System: OS Independent.
• SpagoBI
• It claims to be "the only entirely open source business
intelligence suite." Commercial support, training and
services are available.
• Operating System: OS Independent.
Statistical Concepts: Sampling Distributions
 The sampling distribution is a distribution of a sample statistic. While the concept of a distribution of a set of numbers is intuitive for most students, the concept of a distribution of a set of statistics is not.
 It is a model of a distribution of scores, like the population distribution, except that the scores are not raw scores but statistics. It is a thought experiment: "What would the world be like if a person repeatedly took samples of size N from the population distribution and computed a particular statistic each time?" The resulting distribution of statistics is called the sampling distribution of that statistic.
 For example, suppose that a sample of size sixteen (N=16) is taken from some population. The mean of the sixteen numbers is computed. Next a new sample of sixteen is taken, and the mean is again computed. If this process were repeated an infinite number of times, the distribution of the now infinite number of sample means would be called the sampling distribution of the mean.
 Every statistic has a sampling distribution. For example, suppose that instead of the mean, medians were computed for each sample. The infinite number of medians would be called the sampling distribution of the median.
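The thought experiment can be simulated directly. A sketch in Python, assuming a hypothetical uniform population on [0, 100] (any population would do), with the "infinite" repetition cut off at 10,000:

import random
import statistics

random.seed(42)
N = 16             # sample size, as in the example above
REPEATS = 10_000   # stands in for the infinite repetition

sample_means = [
    statistics.mean(random.uniform(0, 100) for _ in range(N))
    for _ in range(REPEATS)
]

print(statistics.mean(sample_means))   # ~50, the population mean
print(statistics.stdev(sample_means))  # ~7.2, the standard error
# For uniform(0, 100), sigma = 100/sqrt(12) ~ 28.87, so sigma/sqrt(16) ~ 7.22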
Re-Sampling
 In statistics, resampling is any of a variety of methods for doing one of the following:
 Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping).
 Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests).
 Validating models by using random subsets (bootstrapping, cross-validation).
 Common resampling techniques include bootstrapping, jackknifing, and permutation tests.
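A minimal bootstrap sketch in Python (the data values are made up for illustration): estimate the precision of the sample median by repeatedly drawing with replacement.

import random
import statistics

random.seed(0)
data = [12, 15, 9, 22, 17, 30, 11, 14, 19, 25]   # hypothetical observations

boot_medians = []
for _ in range(5_000):
    resample = random.choices(data, k=len(data))  # draw with replacement
    boot_medians.append(statistics.median(resample))

print(statistics.median(data))         # the sample median itself
print(statistics.stdev(boot_medians))  # bootstrap standard error of the median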
Statistical Inference
 Statistical Inference, Model & Estimation
 Recall, statistical inference aims at learning characteristics of the population from a sample; the population characteristics are parameters, and the sample characteristics are statistics.
 A statistical model is a representation of the complex phenomenon that generated the data.
 It has mathematical formulations that describe relationships between random variables and parameters.
 It makes assumptions about the random variables, and sometimes about the parameters.
 A general form: data = model + residuals
 The model should explain most of the variation in the data.
 Residuals are a representation of lack-of-fit, that is, of the portion of the data unexplained by the model.
 Estimation is a process of learning and determining the population parameter based on the model fitted to the data.
 Point estimation, interval estimation, and hypothesis testing are three main ways of learning about the population parameter from the sample statistic.
 An estimator is a particular example of a statistic; it becomes an estimate when the formula is replaced with actual observed sample values.
 Point estimation = a single value that estimates the parameter. Point estimates are single values calculated from the sample.
 Confidence intervals = a range of values for the parameter. Interval estimates are intervals within which the parameter is expected to fall, with a certain degree of confidence.
 Hypothesis tests = tests for specific value(s) of the parameter.
 In order to perform these inferential tasks, i.e., make inference about the unknown population parameter from the sample statistic, we need to know the likely values of the sample statistic. What would happen if we did the sampling many times?
 We need the sampling distribution of the statistic. It depends on the model assumptions about the population distribution, and/or on the sample size.
 Standard error: the standard deviation of a sampling distribution.
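Tying these ideas together, a short sketch (hypothetical data) that computes a point estimate and a roughly 95% confidence interval from the standard error, using the normal approximation:

import math
import statistics

data = [4.1, 5.3, 3.8, 4.9, 5.6, 4.4, 5.0, 4.7, 5.2, 4.5]

mean = statistics.mean(data)                        # point estimate
se = statistics.stdev(data) / math.sqrt(len(data))  # standard error

low, high = mean - 1.96 * se, mean + 1.96 * se      # interval estimate
print(f"point estimate: {mean:.2f}")
print(f"~95% CI: ({low:.2f}, {high:.2f})")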
Prediction Error
Prediction error is a discontinuity attribute that
removes the predictable image components and
reveals the unpredictable.
To use prediction error as a discontinuity attribute -
the original goal and starting point of my project - one
has to devise a prediction-error computation that
predicts and removes the plane-wave volumes of
sedimentary layers but that is incapable of predicting
the discontinuities.
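The general idea can be illustrated outside the seismic setting. A toy sketch (not the project's actual computation): predict each sample from its neighbors, subtract the prediction, and the residual, the prediction error, flags the unpredictable part.

signal = [1, 2, 3, 4, 5, 20, 7, 8, 9, 10]  # smooth trend with one discontinuity

# Predict each interior sample as the average of its two neighbors
prediction_error = [
    signal[i] - (signal[i - 1] + signal[i + 1]) / 2
    for i in range(1, len(signal) - 1)
]
print(prediction_error)
# [0.0, 0.0, 0.0, -7.0, 14.0, -7.0, 0.0, 0.0]
# Smooth samples are predicted well (error ~0); the spike stands out.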

Más contenido relacionado

La actualidad más candente

Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
Shanthi Mukkavilli
 

La actualidad más candente (20)

Data analytics
Data analyticsData analytics
Data analytics
 
Ppt for Application of big data
Ppt for Application of big dataPpt for Application of big data
Ppt for Application of big data
 
Data cleansing
Data cleansingData cleansing
Data cleansing
 
Big Data Analytics Powerpoint Presentation Slide
Big Data Analytics Powerpoint Presentation SlideBig Data Analytics Powerpoint Presentation Slide
Big Data Analytics Powerpoint Presentation Slide
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notes
 
Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
 
Our big data
Our big dataOur big data
Our big data
 
Data analytics
Data analyticsData analytics
Data analytics
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Data Warehousing
Data WarehousingData Warehousing
Data Warehousing
 
Big Data
Big DataBig Data
Big Data
 
BIG DATA and USE CASES
BIG DATA and USE CASESBIG DATA and USE CASES
BIG DATA and USE CASES
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
 
Big data and analytics
Big data and analyticsBig data and analytics
Big data and analytics
 
Data Mining and Data Warehouse
Data Mining and Data WarehouseData Mining and Data Warehouse
Data Mining and Data Warehouse
 

Similar a Introduction to big data

Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
saranya270513
 
big data Big Things
big data Big Thingsbig data Big Things
big data Big Things
pateelhs
 
big-datagroup6-150317090053-conversion-gate01.pdf
big-datagroup6-150317090053-conversion-gate01.pdfbig-datagroup6-150317090053-conversion-gate01.pdf
big-datagroup6-150317090053-conversion-gate01.pdf
VirajSaud
 

Similar a Introduction to big data (20)

Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
 
Unit III.pdf
Unit III.pdfUnit III.pdf
Unit III.pdf
 
Unit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdfUnit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdf
 
Unit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdfUnit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdf
 
big data Big Things
big data Big Thingsbig data Big Things
big data Big Things
 
IRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth Enhancement
 
Big Data Analytics_Unit1.pptx
Big Data Analytics_Unit1.pptxBig Data Analytics_Unit1.pptx
Big Data Analytics_Unit1.pptx
 
An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...
An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...
An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
BIG DATA AND HADOOP.pdf
BIG DATA AND HADOOP.pdfBIG DATA AND HADOOP.pdf
BIG DATA AND HADOOP.pdf
 
sybca-bigdata-ppt.pptx
sybca-bigdata-ppt.pptxsybca-bigdata-ppt.pptx
sybca-bigdata-ppt.pptx
 
Bigdata Hadoop introduction
Bigdata Hadoop introductionBigdata Hadoop introduction
Bigdata Hadoop introduction
 
Data mining with big data implementation
Data mining with big data implementationData mining with big data implementation
Data mining with big data implementation
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 
big-datagroup6-150317090053-conversion-gate01.pdf
big-datagroup6-150317090053-conversion-gate01.pdfbig-datagroup6-150317090053-conversion-gate01.pdf
big-datagroup6-150317090053-conversion-gate01.pdf
 
Know The What, Why, and How of Big Data_.pdf
Know The What, Why, and How of Big Data_.pdfKnow The What, Why, and How of Big Data_.pdf
Know The What, Why, and How of Big Data_.pdf
 
big-data.pdf
big-data.pdfbig-data.pdf
big-data.pdf
 
Big data
Big dataBig data
Big data
 
Analysis of Big Data
Analysis of Big DataAnalysis of Big Data
Analysis of Big Data
 
Data Mining With Big Data
Data Mining With Big DataData Mining With Big Data
Data Mining With Big Data
 

Último

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Último (20)

Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 

Introduction to big data

  • 2. 2
  • 3. How much data ? How it matters?3  Data Size matters……  How it matters……………….  Like this………..
  • 4. 4 bit (b) 0 or 1 1/8 of a byte byte (B) 8 bits 1 byte kilobyte (KB) 10001 bytes 1,000 bytes megabyte (MB) 10002 bytes 1,000,000 bytes gigabyte (GB) 10003 bytes 1,000,000,000 bytes terabyte (TB) 10004 bytes 1,000,000,000,000 bytes petabyte (PB) 10005 bytes 1,000,000,000,000,000 bytes exabyte (EB) 10006 bytes 1,000,000,000,000,000 ,000 bytes zettabyte (ZB) 10007 bytes 1,000,000,000,000,000 ,000,000 bytes yottabyte (YB) 10008 bytes 1,000,000,000,000,000 ,000,000,000 bytes
  • 5. 5  But Where and in which companies…….?  Every Where……. Like in……
  • 6. 6
  • 7. 7  Where in real time…..?
  • 8. 8  Where in real time…..?
  • 9. 9  Where in real time…..?
  • 10. 10  Where in real time…..?
  • 11. 11  Where in real time…..? Asia’s largest and world’s third largest data centre in Bengaluru
  • 12. 12  Where in real time…..?
  • 13. Simple to startWhat is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation How much time to just transfer.
  • 14. Introduction to Big Data Big Data is a term used for a collection of data sets that are large and complex, which is difficult to store and process using available database management tools or traditional data processing applications.  The challenge includes capturing, curating, storing, searching, sharing, transferring, analyzing and visualization of this data. 14
  • 15. Introduction to Big Data Big Data is a term used for a collection of data sets that are large and complex, which is difficult to store and process using available database management tools or traditional data processing applications.  The challenge includes capturing, curating, storing, searching, sharing, transferring, analyzing and visualization of this data. 15
  • 16. Big Data Driving Factors 16
  • 17. Dimensions of Big Data 17
  • 18. Big data spans three dimensions: Volume, velocity and Variety  Volume: Volume refers to the ‘amount of data’, which is growing day by day at a very fast pace.  The size of data generated by humans, machines and their interactions on social media itself is massive.  Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will be generated by 2020, which is an increase of 300 times from 2005. 18
  • 19. 19
  • 20. VELOCITY  Velocity is defined as the pace at which different sources generate the data every day.  This flow of data is massive and continuous. There are 1.03 billion Daily Active Users (Facebook DAU) on Mobile as of now, which is an increase of 22% year-over-year.  This shows how fast the number of users are growing on social media and how fast the data is getting generated daily. 20
  • 21. Examples of Big Data  Daily we upload millions of bytes of data. 90 % of the world’s data has been created in last two years. 21
  • 22. Examples of Big Data  Walmart handles more than 1 million customer transactions every hour.  Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.  230+ millions of tweets are created every day.  More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.  YouTube users upload 48 hours of new video every minute of the day. 22
  • 23. Contn..  Amazon handles 15 million customer click stream user data per day to recommend products.  294 billion emails are sent every day. Services analyses this data to find the spams.  Modern cars have close to 100 sensors which monitors fuel level, tire pressure etc. , each vehicle generates a lot of sensor data. 23
  • 24. Traits of Big data The eight (8) ‘V’ Dimension Characteristics of Big Data: Part One: Volume, Velocity, Variety Part Two: Variability (Unpredictability), Veracity (Reliability), Virality (Circulated rapidly), Visualization and Value. 24
  • 25.  Veracity  Big Data Veracity refers to the biases, noise and abnormality in data. Is the data that is being stored, and mined meaningful to the problem being analyzed. Inderpal feel veracity in data analysis is the biggest challenge when compares to things like volume and velocity.  Validity  Like big data veracity is the issue of validity meaning is the data correct and accurate for the intended use. Clearly valid data is key to making the right decisions. 25
  • 26.  Volatility Big data volatility refers to how long is data valid and how long should it be stored. In this world of real time data you need to determine at what point is data no longer relevant to the current analysis. 26
  • 27. Challenges of Conventional Systems • Conventional analytical tools and techniques are inadequate to handle data that is unstructured (like text data), that is too large in size, or that is growing rapidly like social media data. • A cluster analysis on a 200MB file with 1 million customer records is manageable, but the same cluster analysis on 1000GB of Facebook customer profile information will take a considerable amount of time if conventional tools and techniques are used.
  • 28. Challenges of Conventional Systems • Facebook as well as entities like Google and Walmart generate data in petabytes every day. • Traditional Analytics analyzes on the known data environment that too the data that is well understood. It cannot work on unstructured data efficiently. • Traditional Analytics is built on top of the relational data model, relationships between the subjects of interests have been created inside the system and the analysis is done based on them. This approach will not adequate for big data analytics.
  • 29. Challenges of Conventional Systems • Traditional analytics is batch oriented and we need to wait for nightly ETL (extract,transform and load) and transformation jobs to complete before the required insight is obtained. • Parallelism in a traditional analytics system is achieved through costly hardware like MPP • (Massively Parallel Processing) systems.
  • 30. Other Challenges of Conventional Systems • Data challenges • Volume, velocity, veracity, variety • Data discovery and comprehensiveness • Scalability • Process challenges • Capturing Data • Aligning data from different sources • Transforming data into suitable form for data analysis • Modeling data (Mathematically, simulation) • Understating output, visualizing results and display issues on mobile devices.
  • 31. Traits or Characteristics of Big Data The eight (8) ‘V’ Dimension Characteristics of Big Data: Part One: Volume, Velocity, Variety Part Two: Variability (Unpredictability), Veracity (Reliability), Virality (Circulated rapidly), Visualization and Value.
  • 32. Characteristics of Big Data The original three ‘V’ Dimension Characteristics of Big Data identified in 2001 are:
  • 34. Features of Big Data - Security, Compliance, Auditing and Protection  The sheer size of a Big Data repository brings with it a major security challenge, generating the age-old question presented to IT: How can the data be protected?  Steps to Securing Big Data  Classifying Data  Protecting Big Data Analytics  Big Data and Compliance  The Intellectual Property Challenge
  • 35. Security, Compliance, Auditing and Protection  Data Access: Data can be easily protected, but only if you eliminate access to the data. That’s not a pragmatic solution, to say the least. The key is to control access, but even then, knowing the who, what, when, and where of data access is only a start.  Data availability: Controlling where the data are stored and how the data are distributed. The more control you have, the better you are positioned to protect the data.
  • 36. Security, Compliance, Auditing and Protection  Performance: Higher levels of encryption, complex security methodologies, and additional security layers can all improve security. However, these security techniques all carry a processing burden that can severely affect performance.  Liability: Accessible data carry with them liability, such as the sensitivity of the data, the legal requirements connected to the data, privacy issues, and intellectual property concerns.
  • 37. PRAGMATIC STEPS TO SECURING BIG DATA  First, get rid of data that are no longer needed. If you do not need certain information, it should be destroyed, because it represents a risk to the organization.  Information cannot legally be destroyed; in that case, the information should be securely archived by an offline method.  Real challenge is to decide which data is needed? As values can be found in unexpected places. For example, getting rid of activity logs may be a smart move from a security standpoint.
  • 38. Classifying data  Protecting data becomes much easier if the data are classified—that is, the data should be divided into appropriate groupings for management purposes.  For example, Internal e-mails between two colleagues should not be secured or treated the same way as financial reports, human resources (HR)information, or customer data.  Classification can become a powerful tool for determining the sensitivity of data.  A simple approach may just include classifications such as financial, HR, sales, inventory, and communications, each of which is self-explanatory and offers insight into the sensitivity of the data. o Once organizations better understand their data, they can take important steps to segregate the information, which will make the deployment of security measures like encryption and monitoring more manageable.
  • 39. PROTECTING BIG DATA ANALYTICS  The real cause of concern is the fact that Big Data contains all of the things you don’t want to see when you are trying to protect data.  Big Data can contain very unique sample sets—for example, data from devices that monitor physical elements (e.g., traffic, movement, soil pH, rain, wind) on a frequent schedule, that are accumulated frequently and in real time.  All of the data are unique to the moment, and if they are lost, they are impossible to recreate.  That uniqueness also means you cannot leverage time-saving backup preparation and security technologies, such as deduplication.
  • 40.  This greatly increases the capacity requirements for backup subsystems, slows down security scanning, makes it harder to detect data corruption, and complicates archiving.  There is also the issue of the large size and number of files often found in Big Data analytic environments.  In order for a backup application and associated appliances or hardware to churn through a large number of files, bandwidth to the backup systems and/or the backup appliance must be large, and the receiving devices must be able to ingest data at the rate that the data can be delivered.
  • 41. BIG DATA AND COMPLIANCE  Compliance has a major effect on how Big Data is protected, stored, accessed, and archived.  Big Data is not easily handled by the RDBMS; This means it is harder to understand how compliance affects the data.  Big Data is transforming the storage and access paradigms to an emerging new world of horizontally scaling, unstructured databases, which are better at solving some old business problems through analytics.  New data types and methodologies are still expected to meet the legislative requirements expected by compliance laws.
  • 42.  Health care probably provides the best example for those charged with compliance as they examine how Big Data creation, storage, and flow work in their organizations.  Electronic health record systems, driven by the Health Insurance Portability and Accountability Act (HIPAA).-Storing Personal information.  Unfortunately, most of the data stores in use today—including Hadoop, Cassandra, and MongoDB—do not incorporate sufficient data security tools to provide enterprises with the peace of mind that confidential data will remain safe and secure at all times.
  • 43. THE INTELLECTUAL PROPERTY CHALLENGE  One of the biggest issues around Big Data is the concept of intellectual property (IP).  IP refers to creations of the human mind, such as inventions, literary and artistic works, and symbols, names, images, and designs used in commerce.  Between 1985 and 2010, the number of patents granted worldwide rose from slightly less than 400,000 to more than 900,000. Increase of more than 125 percent over one generation (25 years).  The same concepts just have to be expanded into the realm of Big Data. Some basic rules are as follows:
  • 44.  Understand what IP is and know what you have to protect:  What needs to protect it, how to protect it and whom to protect it from.TO do so, IP security in IT (usually a computer security officer, or CSO) must communicate on an ongoing basis with the executives who oversee intellectual capital. Meeting at least quarterly. Corporate leaders will be the foundation for protecting IP. o Prioritize protection:  CSOs with extensive experience normally recommend doing a risk and cost-benefit analysis.  require you to create a map of your company’s assets and determine what information, if lost, would hurt your company the most.  This help you figure out where to best allocate your protective efforts.
  • 45.  Understand what IP is and know what you have to protect:  What needs to protect it, how to protect it and whom to protect it from.TO do so, IP security in IT (usually a computer security officer, or CSO) must communicate on an ongoing basis with the executives who oversee intellectual capital. Meeting at least quarterly. Corporate leaders will be the foundation for protecting IP. o Prioritize protection:  CSOs with extensive experience normally recommend doing a risk and cost-benefit analysis.  require you to create a map of your company’s assets and determine what information, if lost, would hurt your company the most.  This help you figure out where to best allocate your protective efforts.
  • 46.  Label:  Confidential information should be labeled appropriately. If company data are proprietary, note that on every log-in screen. o Lock it up:  Physical as well as digital protection schemes are a must. Rooms that store sensitive data should be locked. This applies to everything from the server farm to the file room. Keep track of who has the keys, always use complex passwords, and limit employee access to important databases. o Educate employees. o Know your tools:  Those tools can locate sensitive documents and keep track of how they are being used and by whom. o Use a counterintelligence mind-set:  If you were spying on your own company, how would you do it?
  • 47.  These guidelines can be applied to almost any information security paradigm that is geared toward protecting IP. The same guidelines can be used when designing IP protection for a Big Data platform.
  • 48. Analysis vs Reporting Where does "Reporting" stop and "Analytics" kick in? Let us try to understand the differences first.  While Reporting provides data, Analytics is supposed to provide answers.  Reporting is typical Standardized while Analytics is customized.  Reporting has a stringent format while Analytics is flexible.  Reporting provides what is typically asked for, while Analytics caters to the underlying need
  • 49.  The output of Reporting is in the form of canned reports, dashboards and alerts while Analytics has presentations comprising of insights, recommended actions, and a forecast of its impact on the company.  Reporting includes building, configuring, consolidating, organizing, formatting, and summarizing data while Analytics consist of questioning, examining, interpreting, predicting and prescribing. Analysis vs Reporting
  • 50. Both reporting and analysis play their roles in influencing and driving the actions in an organization with the ultimate goal of value maximization Analysis vs Reporting
  • 51. • Canned reports: • These are the out-of-the-box and custom reports that you can access within the analytics tool. • In general, some canned reports are more valuable than others, and a report’s value may depend on how relevant it is to an individual’s role (e.g., SME or specialist vs. web producer). • Dashboards: • These custom-made reports combine different KPIs and reports to provide a comprehensive, high-level view of business performance for specific audiences. Dashboards may include data from various data sources and are also usually fairly static. Analysis vs Reporting
  • 52. • Alerts: • These conditional reports are triggered when data falls outside of expected ranges or some other pre-defined criterion is met. Once people are notified of what happened, they can take appropriate action as necessary. Analysis vs Reporting
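The mechanics of an alert are easy to illustrate in code. Below is a minimal sketch, not tied to any particular analytics tool; the metric name and the expected range are purely illustrative assumptions.

```python
# A minimal sketch of a conditional alert; the metric name and the
# expected range below are illustrative assumptions, not real values.

def check_alert(metric_name, value, lower, upper):
    """Return an alert message if a value falls outside its expected range."""
    if not (lower <= value <= upper):
        return f"ALERT: {metric_name}={value} outside expected range [{lower}, {upper}]"
    return None

# Hypothetical example: hourly transactions dipped below the expected floor.
message = check_alert("hourly_transactions", 412, lower=500, upper=5000)
if message:
    print(message)  # the notified team can then take appropriate action
```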
  • 53. Overcoming this first hurdle, the initial confusion between Reporting and Analytics, is the leap towards deriving the real benefits of analysis. Analysis vs Reporting
  • 54. Analysis vs Reporting Four stages of Analytics maturity model  Descriptive (Pure play Reporting)  Diagnostic  Predictive  Prescriptive
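To make the first and third stages of this maturity model concrete, here is a minimal sketch on toy monthly sales figures; the numbers and the naive trend model are illustrative assumptions only.

```python
# Toy monthly sales figures (illustrative assumption, not real data).
sales = [100, 110, 125, 140, 150, 165]

# Descriptive (pure-play reporting): summarize what already happened.
print("total:", sum(sales), "average:", sum(sales) / len(sales))

# Predictive: extrapolate what may happen next with a naive linear trend.
avg_growth = (sales[-1] - sales[0]) / (len(sales) - 1)
print("naive next-month forecast:", sales[-1] + avg_growth)
```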
  • 55. Analytic Processes and Tools • The process of examining large data sets containing a variety of data types – i.e., Big Data – to uncover market trends, customer preferences, and other useful information. • Companies and enterprises that implement Big Data Analytics often gain several business benefits, such as more effective marketing campaigns, new revenue opportunities, improved customer service delivery, more efficient operations, and competitive advantages.
  • 56. • Companies implement Big Data Analytics because they want to make more informed business decisions. • Big Data Analytics gives analytics professionals, such as data scientists and predictive modellers, the ability to analyze Big Data from multiple and varied sources, including transactional data and other structured data. Analytic Processes and Tools
  • 57. Analytic Processes and Tools Types of Big Data Analytics Tools Big Data Analytics tools are important for companies and enterprises because of the huge volume of Big Data now generated and managed by modern organizations. Big Data Analytics tools also help businesses save time and money in gaining insights to inform data-driven decisions. The different types of Big Data Analytics tools are: Data storage and management, Data cleaning, Data mining, Data analysis, Data visualization, Data Integration, and Data collection.
  • 58. Types of Big Data Analytics Tools and Environment Analytic Processes and Tools
  • 59. Modern Data Analytic Tools • Whenever analysts or journalists assemble lists of the top trends for this year, "big data" is almost certain to be on the list. • Big data isn't really a new concept. Computers have always worked with large and growing sets of data, and we've had databases and data warehouses for years. • What is new is… how much bigger that data is, how quickly it is growing and how complicated it is. Enterprises understand that the data in their systems represents a wealth of insights that could help them improve their processes and their performance. • But they need tools that will allow them to collect and analyze that data.
  • 60. Modern Data Analytic Tools • Interestingly, many of the best and best-known big data tools available are open source projects. The very best known of these is Hadoop, which is spawning an entire industry of related services and products. • Alongside it are 49 other big data projects: a lot of Apache projects related to Hadoop, as well as open source NoSQL databases, business intelligence tools, development tools and much more. • Here they are ……
  • 61. Modern Data Analytic Tools Big Data Companies: The Leaders Tableau Tableau started out by offering visualization techniques for exploring and analyzing relational databases and data cubes and has expanded to include Big Data research. It offers visualization of data from any source, from Hadoop to Excel files. New Relic New Relic uses a SaaS model for monitoring Web and mobile applications in real time that run in the cloud, on-premises, or in a hybrid mix. Its plug-ins cover PaaS/cloud services, caching, databases, Web servers and queuing.
  • 62. Modern Data Analytic Tools IBM IBM offers cloud services for massive compute scale through its SoftLayer subsidiary. On the software side, its DB2, Informix and InfoSphere databases support Big Data analytics, while its Cognos and SPSS analytics software specialize in BI. IBM also offers InfoSphere, the data integration and data warehousing platform used in a Big Data scenario. VMware VMware has incorporated Big Data into its flagship virtualization product through VMware vSphere Big Data Extensions. BDE is a virtual appliance that enables administrators to deploy and manage Hadoop clusters under vSphere. It supports a number of Hadoop distributions, including Apache, Cloudera, Hortonworks, MapR and Pivotal. Big Data Companies
  • 63. Modern Data Analytic Tools SAP SAP's main Big Data tool is HANA. It can run analytics on 80 terabytes of data and integrates with Hadoop. It can also perform advanced analytics, like predictive analytics, spatial data processing, text analytics, text search, streaming analytics, and graph data processing, and has ETL (Extract, Transform, and Load) capabilities. Oracle Oracle has its Big Data Appliance with a number of software products. They include Oracle NoSQL, Apache Hadoop, Oracle Data Integrator Application Adapter for Hadoop, Oracle Loader for Hadoop, Oracle R Enterprise (a tool for the R programming language), Oracle Linux and so on… Big Data Companies
  • 64. Modern Data Analytic Tools Pentaho Pentaho is a suite of open source-based tools for business analytics that has expanded to cover Big Data. The suite offers data integration, OLAP services, reporting, a dashboard, data mining and ETL capabilities. Pentaho for Big Data is a data integration tool specifically designed for executing ETL jobs in and out of Big Data environments such as Apache Hadoop or Hadoop distributions on Amazon, Cloudera and others. Thoughtworks Thoughtworks incorporates Agile software development principles into building Big Data applications through its Agile Analytics product. It builds applications for data warehousing and business intelligence using the fast-paced Agile process for quick and continuous delivery of newer applications to extract insight from data. Big Data Companies
  • 65. Modern Data Analytic Tools Amazon Web Services Amazon has a number of enterprise Big Data platforms, including the Hadoop-based Elastic MapReduce, Kinesis Firehose for streaming massive amounts of data into AWS, Kinesis Analytics to analyze that data, the DynamoDB NoSQL big data database, and HBase. All of these services work within its greater Amazon Web Services offerings. Microsoft Microsoft has a partnership with Hortonworks and offers the HDInsight tool for analyzing structured and unstructured data on Hortonworks. SQL Server 2016 comes with a connector to Hadoop for Big Data processing, and Microsoft recently acquired Revolution Analytics, which made the only Big Data analytics platform written in R, a programming language for building Big Data apps without requiring the skills of a data scientist. Big Data Companies
  • 66. Modern Data Analytic Tools Some other Big Data Companies are…….. • Tibco Jaspersoft • Google • Mu Sigma • HP Enterprise • Big Panda • Cogito • Alation • Splunk • ……
  • 67. Modern Data Analytic Tools Open Source Big Data Analysis Platforms and Tools Hadoop You simply can't talk about big data without mentioning Hadoop. The Apache distributed data processing software is so pervasive that often the terms "Hadoop" and "big data" are used synonymously. The Apache Foundation also sponsors a number of related projects that extend the capabilities of Hadoop. In addition, numerous vendors offer supported versions of Hadoop and related technologies. Operating System: Windows, Linux, OS X.
  • 68. Modern Data Analytic Tools MapReduce Originally developed by Google, the MapReduce website describes it as "a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes." It's used by Hadoop, as well as many other data processing applications. Operating System: OS Independent. Open Source Big Data Analysis Platforms and Tools
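The model itself is simple enough to sketch on a single machine. The Python fragment below mimics the map, shuffle and reduce phases with a word count, the canonical example; a real framework such as Hadoop would distribute each phase across a cluster, and the input documents here are illustrative.

```python
# A minimal, single-machine sketch of the MapReduce programming model
# (word count). Real frameworks run these phases in parallel on a cluster.
from collections import defaultdict

documents = ["big data tools", "big data analytics"]  # illustrative input

# Map phase: emit (key, value) pairs from each input record.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'tools': 1, 'analytics': 1}
```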
  • 69. • GridGain • It offers an alternative to Hadoop's MapReduce that is compatible with the Hadoop Distributed File System. • It offers in-memory processing for fast analysis of real-time data. You can download the open source version from GitHub or purchase a commercially supported version from the link above. • Operating System: Windows, Linux, OS X. • HPCC Systems • Developed by LexisNexis Risk Solutions, HPCC Systems is short for "high performance computing cluster." It claims to offer superior performance to Hadoop. • Both free community versions and paid enterprise versions are available. Operating System: Linux. Open Source Big Data Analysis Platforms and Tools
  • 70. • Storm • Recently acquired by Twitter, Storm offers distributed real-time computation capabilities and is often described as the "Hadoop of realtime." It's highly scalable, robust, fault-tolerant and works with nearly all programming languages. • Operating System: Linux. Open Source Big Data Analysis Platforms and Tools
  • 71. Open Source Big Data Business Intelligence Tools • Talend • Talend offers Open Studio for Big Data, a set of data integration tools that support Hadoop, HDFS, Hive, HBase and Pig. • The company also sells an enterprise edition and other commercial products and services. • Operating System: Windows, Linux, OS X. • Jaspersoft • It claims to make "the most flexible, cost effective and widely deployed business intelligence software in the world."
  • 72. • Jedox • The open source Palo Suite includes an OLAP Server, Palo Web, Palo ETL Server and Palo for Excel. • Jedox offers commercial software based on the same tools. • Operating System: OS Independent. • SpagoBI • It claims to be "the only entirely open source business intelligence suite." Commercial support, training and services are available. • Operating System: OS Independent. Open Source Big Data Business Intelligence Tools
  • 73. Statistical Concepts: Sampling Distributions  The sampling distribution is a distribution of a sample statistic. While the concept of a distribution of a set of numbers is intuitive for most students, the concept of a distribution of a set of statistics is less familiar and deserves attention.  It is a model of a distribution of scores, like the population distribution, except that the scores are not raw scores, but statistics. It is a thought experiment: "what would the world be like if a person repeatedly took samples of size N from the population distribution and computed a particular statistic each time?" The resulting distribution of statistics is called the sampling distribution of that statistic.  For example, suppose that a sample of size sixteen (N=16) is taken from some population. The mean of the sixteen numbers is computed. Next a new sample of sixteen is taken, and the mean is again computed. If this process were repeated an infinite number of times, the distribution of the now infinite number of sample means would be called the sampling distribution of the mean.  Every statistic has a sampling distribution. For example, suppose that instead of the mean, medians were computed for each sample. The infinite number of medians would be called the sampling distribution of the median.
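The N=16 thought experiment above can be approximated by simulation. The sketch below assumes a standard normal population and 10,000 repetitions, both arbitrary illustrative choices standing in for the infinite number of samples.

```python
import random
import statistics

N = 16          # sample size from the example above
REPS = 10_000   # a finite stand-in for the "infinite" repetitions

# Draw many samples from an assumed standard normal population and
# record the mean of each one.
sample_means = [
    statistics.mean(random.gauss(0, 1) for _ in range(N))
    for _ in range(REPS)
]

# The spread of these means approximates sigma / sqrt(N) = 1/4.
print("simulated standard error of the mean:", statistics.stdev(sample_means))
```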
  • 74. Re-Sampling  In statistics, resampling is any of a variety of methods for doing one of the following:  Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping).  Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests).  Validating models by using random subsets (bootstrapping, cross-validation).  Common resampling techniques include bootstrapping, jackknifing and permutation tests.
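As a concrete illustration of bootstrapping, the sketch below estimates the precision of a sample median by drawing randomly with replacement; the data values and the number of resamples are illustrative assumptions.

```python
import random
import statistics

data = [4.1, 5.3, 2.8, 6.0, 4.7, 5.1, 3.9, 4.4]  # illustrative values

# Each bootstrap resample draws len(data) points with replacement,
# and we record the median of each resample.
boot_medians = [
    statistics.median(random.choices(data, k=len(data)))
    for _ in range(5000)
]

# The spread of the resampled medians estimates the median's precision.
print("bootstrap standard error of the median:", statistics.stdev(boot_medians))
```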
  • 75. Statistical Inference  Statistical Inference, Model & Estimation  Recall, statistical inference aims at learning characteristics of the population from a sample; the population characteristics are parameters and the sample characteristics are statistics.  A statistical model is a representation of a complex phenomenon that generated the data.  It has mathematical formulations that describe relationships between random variables and parameters.  It makes assumptions about the random variables, and sometimes about the parameters.  A general form: data = model + residuals.  The model should explain most of the variation in the data.  Residuals are a representation of lack-of-fit, that is, of the portion of the data unexplained by the model.
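The decomposition data = model + residuals can be shown in a few lines. The sketch below fits a straight-line model by least squares; the x/y values are illustrative, chosen to lie roughly on y = 2x.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly y = 2x (illustrative)

# The model: a least-squares straight line.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept

# The residuals: the portion of the data the model leaves unexplained.
residuals = y - fitted
print("residuals (should look like small noise):", residuals)
```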
  • 76.  Estimation is the process of learning and determining the population parameter based on the model fitted to the data.  Point estimation, interval estimation, and hypothesis testing are three main ways of learning about the population parameter from the sample statistic.  An estimator is a particular example of a statistic; it becomes an estimate when the formula is replaced with actual observed sample values.  Point estimation = a single value that estimates the parameter. Point estimates are single values calculated from the sample.  Confidence intervals = a range of values for the parameter. Interval estimates are intervals within which the parameter is expected to fall, with a certain degree of confidence.  Hypothesis tests = tests for a specific value(s) of the parameter.  In order to perform these inferential tasks, i.e., make inference about the unknown population parameter from the sample statistic, we need to know the likely values of the sample statistic. What would happen if we did the sampling many times?  We need the sampling distribution of the statistic. It depends on the model assumptions about the population distribution, and/or on the sample size.  Standard error: the standard deviation of a sampling distribution.
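Tying these ideas together, the sketch below computes a point estimate, its standard error, and a 95% confidence interval for a mean. The data are illustrative, and the 1.96 multiplier assumes a normal approximation; for a sample this small a t-distribution would be more precise.

```python
import math
import statistics

sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]  # illustrative

point_estimate = statistics.mean(sample)                # point estimation
se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error

# Interval estimation: normal-approximation 95% confidence interval.
low, high = point_estimate - 1.96 * se, point_estimate + 1.96 * se
print(f"mean = {point_estimate:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```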
  • 77. Prediction Error Prediction error is a discontinuity attribute that removes the predictable image components and reveals the unpredictable. To use prediction error as a discontinuity attribute, one has to devise a prediction-error computation that predicts and removes the plane-wave volumes of sedimentary layers but that is incapable of predicting the discontinuities.
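A one-dimensional toy version of this idea fits in a few lines: predict each sample by continuing the local trend, so the smooth (layer-like) ramp is removed and only the jump, the discontinuity, survives in the error. The signal and the naive predictor below are illustrative assumptions, not the seismic computation itself.

```python
import numpy as np

# A smooth ramp with one jump in the middle (illustrative signal).
signal = np.concatenate([np.linspace(0, 1, 50),
                         np.linspace(5, 6, 50)])

# Naive predictor: continue the constant slope of the ramp.
step = signal[1] - signal[0]
prediction = signal[:-1] + step

# Prediction error is ~0 on the predictable ramp and large at the jump.
error = np.abs(signal[1:] - prediction)
print("discontinuity detected near sample:", int(np.argmax(error)) + 1)
```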