Statistics: Unlocking the Power of Data Lock5
Financial DATA ANALYTICS
Dr. M.Vijayalakshmi, VESIT
4th Jan 2018, SAKEC Mumbai
Financial Data
The financial industry has always been driven by data.
Today, Big Data is prevalent at various levels of this field, ranging from
the financial services sector to capital markets.
The availability of Big Data in this domain has opened up new avenues
for innovation and has offered immense opportunities for growth and
sustainability.
At the same time, it has presented several new challenges that must be
overcome to gain the maximum value out of it.
Financial Data Analytics in a Nutshell
Motivation
There has been an explosion in the velocity, variety and volume of financial
data. Social media activity, mobile interactions, server logs, real-time market
feeds, customer service records, transaction details, information from existing
databases – there’s no end to the flood.
To make sense of these giant data sets, companies are increasingly turning to
data scientists for answers. These numbers gurus are:
 Capturing and analyzing new sources of data, building predictive models and running
live simulations of market events
 Using technologies such as Hadoop, NoSQL and Storm to tap into non-traditional data
sets (e.g., geolocation, sentiment data) and integrate them with more traditional
numbers (e.g., trade data)
 Finding and storing increasingly diverse data in its raw form for future analysis
They’ve been aided in this quest by the development of cloud-based data
storage and the surge of sophisticated (and sometimes free or open-source)
analytics tools.
Important Applications of Financial
Data Analytics
1. Predictive Analytics / Trading
2. Sentiment Analysis
3. Financial Fraud
4. Credit Scoring Ratings
5. Pricing
6. Customer Segmentation
7. Know Your Customer
Sentiment Analysis
Sentiment analysis (aka opinion mining) applies natural-language
processing, text analysis and computational linguistics to source material
to discover what folks really think.
Several large businesses, such as MarketPsy Capital, Think Big Analytics and
MarketPsych Data, are using it to:
Build algorithms around market sentiment data (e.g., Twitter feeds) that
can short the market when disasters (e.g., storms, terrorist attacks) occur
Track trends, monitor the launch of new products, respond to issues and
improve overall brand perception
Analyze unstructured voice recordings from call centers and recommend
ways to reduce customer churn, up-sell and cross-sell products and detect
fraud
Some data companies are even acting as intermediaries, collecting and
selling sentiment indicators to retail investors.
Automated Risk Credit Management
Internet finance companies are finding ways to approve loans and manage risk.
Aliloan (from Alibaba) is an automated online system that provides flexible
micro-loans to entrepreneurial online vendors.
To gauge whether a vendor is creditworthy, Alibaba collects data from its
e-commerce and payment platforms and analyzes transaction records, customer
ratings, shipping records and a host of other information.
These findings are confirmed by third-party verification and cross-checked
against external data sets (e.g., customs, tax data, electricity records, etc.).
Once the loan is granted, Alibaba continues to monitor the use of funds and
assess the business’s strategic development.
Entrepreneurs in emerging markets are also reaping the benefits. Like Aliloan,
companies such as Kreditech and Lenddo provide automated small loans based
on innovative credit scoring techniques. In these cases, much of the score is
calculated from applicants’ online social networking data.
Real Time Analytics
In days of yore, financial institutions were hampered by the lag-time between data
collection and data analysis. Real-time analytics short-circuits this problem and provides
the industry with new ways to:
Fight Financial Fraud: Banks and credit card companies routinely analyze account
balances, spending patterns, credit history, employment details, location and many
other data points to determine whether transactions are above board. If suspicious
activity is detected, they can immediately suspend the account and alert the owner.
Improve Credit Ratings: A continuous feed of online data means credit ratings can
be updated in real time. This provides lenders with a more accurate picture of a
customer’s assets, business operations and transaction history.
Provide More Accurate Pricing: Progressive Insurance already tailors its policies to
account for a customer’s changing financial situation. In the Internet of Things, data
from automobile sensors will also help insurance companies issue their policyholders
with warnings about accidents, traffic jams and weather conditions. That makes for
safer drivers and fewer payouts.
Customer Segmentation
Like every other industry on the planet, banks and financial
institutions are hungry to know more about the people using their
products and services. And though they already store a ton of data
– from credit scores to day-to-day transactions – they’re not too
proud to look for it elsewhere.
 This kind of customer segmentation allows them to:
 Offer customized product offerings and services
 Improve existing profitable relationships and avoid customer churn
 Create better marketing campaigns and more attractive product offerings
 Tailor product development to specific customer segments
Predictive Analytics
By combining segmentation with predictive analytics, companies can also cut down on
risk. For example, to decide whether certain customers are likely to pay off their credit
cards, some major banks use technology developed by the company Sqrrl. This analysis
takes into account the demographic characteristics of customers’ neighborhoods and
makes calculated predictions.
Similar strides have been made in forecasting market behavior. Once upon a time (e.g.,
2009), high-frequency trading – the speedy exchange of securities – was hugely
lucrative. With competition came a drop in profits and the need for a new strategy.
HFT traders adapted by employing strategic sequential trading, using big data analytics
to identify specific market participants and anticipate their future actions. In a field of
breakneck speed, this gives HFT traders an unmistakable advantage.
By studying search volume data provided by Google Trends, researchers were able to
identify online precursors of stock market moves. Their results suggest that increases
in search volume for financially relevant search terms usually precede big losses in
financial markets.
Analytics of Financial Time Series
A vast majority of financial data occurs in the form of a time series
 Stock prices (ticker data)
 Asset prices
 Customer Numbers
 Etc
So financial data analytics places a lot of importance on financial time
series analytics
Examples of financial time series
Daily log returns of Apple stock: 2007 to 2016 (10 years)
BSE index
Quarterly earnings of Coca-Cola Company: 1983-2009 Seasonal time
series useful in
 earning forecasts
 pricing weather related derivatives (e.g. energy)
 modeling intraday behavior of asset returns
Exchange rate between the US dollar and the Indian rupee
Size of insurance claims
High-frequency financial data: Tick-by-tick data of stock, etc
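The "daily log returns" in the first example can be computed from a price series as r_t = ln(P_t / P_{t-1}). The sketch below is illustrative only; the function name and the prices are hypothetical, not real market data.

```python
import math

def log_returns(prices):
    """Return r_t = ln(P_t / P_{t-1}) for consecutive prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

# Hypothetical closing prices (not real market data).
prices = [100.0, 110.0, 99.0]
returns = log_returns(prices)
```

Log returns add up over time: the sum of the daily values equals the log return over the whole period.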
Mining Time-Series Data
A time series is a sequence of data points, measured typically at
successive times, spaced at (often uniform) time intervals
Time series analysis: A subfield of statistics, comprises methods that
attempt to understand such time series, often either to understand the
underlying context of the data points or to make forecasts (or
predictions)
Methods for time series analyses
 Frequency-domain methods: Model-free analyses, well-suited to
exploratory investigations
 spectral analysis vs. wavelet analysis
 Time-domain methods: Auto-correlation and cross-correlation
analysis
 Motif-based time-series analysis
Applications
 Financial: stock price, inflation
 Industry: power consumption
 Scientific: experiment results
 Meteorological: precipitation
Time-Series Data Analysis: Prediction &
Regression Analysis
(Numerical) prediction is similar to classification
 construct a model
 use model to predict continuous or ordered value for a given input
Prediction is different from classification
• Classification predicts categorical class labels
• Prediction models continuous-valued functions
Major method for prediction: regression
 model the relationship between one or more independent or
predictor variables and a dependent or response variable
Regression analysis
 Linear and multiple regression
 Non-linear regression
 Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
What is Regression?
Modeling the relationship between one response variable and one or
more predictor variables
Analyzing the confidence of the model
E.g., height vs. weight
Regression Yields Analytical Model
Discrete data points →Analytical model
 General relationship
 Easy calculation
 Further analysis
Application - Prediction
Application - Detrending
Obtain the trend for irregular data series
Subtract trend
Reveal oscillations
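A minimal sketch of detrending, assuming the trend is a straight line fitted by least squares over the time index. The helper names and the sample series are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sxy / sxx
    return y_bar - w1 * x_bar, w1

def detrend(series):
    """Return (trend, residual), where residual = series - fitted linear trend."""
    ts = list(range(len(series)))
    w0, w1 = fit_line(ts, series)
    trend = [w0 + w1 * t for t in ts]
    residual = [y - f for y, f in zip(series, trend)]
    return trend, residual

# Hypothetical series: a rise of 2 per step plus an alternating oscillation.
series = [2 * t + (1 if t % 2 == 0 else -1) for t in range(6)]
trend, residual = detrend(series)
```

After subtracting the trend, the residual alternates in sign, which is the oscillation the raw series hides.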
Linear Regression - Single Predictor
Model is linear
y = w0 + w1 x
where w0 (y-intercept) and w1
(slope) are regression coefficients
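Method of least squares: estimates the coefficients from the |D| training pairs (x_i, y_i)
y: response variable
x: predictor variable
$$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}$$
$$w_0 = \bar{y} - w_1 \bar{x}$$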
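The least-squares estimates for the single-predictor model y = w0 + w1 x can be sketched in a few lines. The function name and the data are hypothetical; the sample points lie exactly on y = 1 + 2x so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Hypothetical data lying exactly on y = 1 + 2x.
w0, w1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```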
Linear Regression – Multiple Predictor
Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
E.g., for 2-D data: y = w0 + w1 x1 + w2 x2
Solvable by
• Extension of the least-squares method (normal equations):
(X^T X) W = X^T Y → W = (X^T X)^{-1} X^T Y
• Commercial software (SAS, S-Plus)
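A minimal sketch of the least-squares extension to several predictors, assuming the standard normal equations (X^T X) W = X^T Y and a small Gaussian-elimination solver. All names and the data are hypothetical; production code would normally rely on a numerical library instead.

```python
def solve(a, b):
    """Solve the square system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w

def fit_multiple(rows, ys):
    """Return [w0, w1, ..., wk] solving (X^T X) W = X^T Y."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept w0
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical 2-D data generated from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
w = fit_multiple(rows, ys)
```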
Nonlinear Regression with Linear Method
Polynomial regression model
• E.g., y = w0 + w1 x + w2 x^2 + w3 x^3
Let x2 = x^2, x3 = x^3
y = w0 + w1 x + w2 x2 + w3 x3
Log-linear regression model
• E.g., y = exp(w0 + w1 x + w2 x2 + w3 x3)
Let y' = log(y)
y' = w0 + w1 x + w2 x2 + w3 x3
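The log-linear trick can be sketched as follows: take the logarithm of the response, then fit an ordinary linear model. For brevity this sketch assumes a single predictor, y = exp(w0 + w1 x); the names and data are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sxy / sxx
    return y_bar - w1 * x_bar, w1

def fit_exponential(xs, ys):
    """Fit y = exp(w0 + w1 * x) by regressing y' = log(y) on x (requires y > 0)."""
    return fit_line(xs, [math.log(y) for y in ys])

# Hypothetical data generated from y = exp(0.5 + 0.3 x).
xs = [0, 1, 2, 3]
ys = [math.exp(0.5 + 0.3 * x) for x in xs]
w0, w1 = fit_exponential(xs, ys)
```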
Generalized Linear Regression
Response y
 Distribution function in the exponential family
 Variance of y depends on E( y), not a constant
E(y) = g^{-1}(w0 + w1 x + w2 x2 + w3 x3)
Examples
 Logistic regression (binomial regression): probability of some
event occurring
 Poisson regression: number of customers
 …
References: Nelder and Wedderburn, 1972; McCullagh and
Nelder, 1989
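To illustrate the role of the inverse link g^{-1}, the sketch below evaluates E(y) for the two examples named above, assuming the usual logit link for logistic regression and the log link for Poisson regression. The coefficients are hypothetical and are not fitted here.

```python
import math

def linear_predictor(w, x):
    """w[0] is the intercept; w[1:] pair up with the predictor values in x."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def logistic_mean(w, x):
    """Inverse logit link: E(y) is a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(w, x)))

def poisson_mean(w, x):
    """Inverse log link: E(y) is a positive expected count."""
    return math.exp(linear_predictor(w, x))

# Hypothetical coefficients, chosen only to show the shape of the output.
p = logistic_mean([0.0, 1.0], [0.0])
count = poisson_mean([0.0, 1.0], [0.0])
```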
Regression Tree (Breiman et al., 1984)
Partition the domain space
Leaf: (1) a continuous-valued
prediction; (2) average value
Model Tree
Leaf – a linear equation
More general than regression tree
Figure source: http://datamining.ihe.nl/research/model-trees.htm
Regression Trees and Model Trees
Regression tree: proposed in CART system (Breiman et al. 1984)
 CART: Classification And Regression Trees
 Each leaf stores a continuous-valued prediction
 It is the average value of the predicted attribute for the training tuples that
reach the leaf
Model tree: proposed by Quinlan (1992)
 Each leaf holds a regression model—a multivariate linear equation for the
predicted attribute
 A more general case than regression tree
Regression and model trees tend to be more accurate than linear
regression when the data cannot be represented well by a simple
linear model
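A minimal sketch of the regression-tree idea, reduced to a single split (a "stump") on one predictor: the split point is chosen to minimize the squared error, and each leaf stores the average response of the training tuples that reach it. This is a simplification for illustration, not the full CART procedure; names and data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing total squared error.

    Assumes xs contains at least two distinct values.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best[1:]

def predict(x, stump):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Hypothetical data with two clearly separated regimes.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
stump = best_split(xs, ys)
```

A model tree would store a fitted linear equation in each leaf instead of the two averages.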
A time series can be illustrated as a time-series graph
which describes a point moving with the passage of time
Categories of Time-Series Movements
 Long-term or trend movements (trend curve): general direction in
which a time series is moving over a long interval of time
 Cyclic movements or cycle variations: long term oscillations about a
trend line or curve
e.g., business cycles, may or may not be periodic
 Seasonal movements or seasonal variations
i.e., almost identical patterns that a time series appears to follow
during corresponding months of successive years
 Irregular or random movements
Time series analysis: decomposition of a time series into these four
basic movements
• Additive model: TS = T + C + S + I
• Multiplicative model: TS = T × C × S × I
Estimation of Trend Curve
The freehand method
 Fit the curve by looking at the graph
 Costly and barely reliable for large-scaled data mining
The least-square method
 Find the curve minimizing the sum of the squares of the deviation of points on
the curve from the corresponding data points
The moving-average method
27
Statistics: Unlocking the Power of Data Lock5 28
Moving Average
Moving average of order n
 Smoothes the data
 Eliminates cyclic, seasonal and irregular movements
 Loses the data at the beginning or end of a series
 Sensitive to outliers (can be reduced by weighted moving
average)
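A short sketch of a moving average of order n and of a weighted variant. The function names, the sample series and the weights (1, 4, 1) are illustrative assumptions. Note that the output is shorter than the input, which is the loss of data at the ends of the series mentioned above.

```python
def moving_average(series, n):
    """Simple moving average of order n; yields len(series) - n + 1 values."""
    return [sum(series[i:i + n]) / n for i in range(len(series) - n + 1)]

def weighted_moving_average(series, weights):
    """Weighted moving average; the order equals len(weights)."""
    n = len(weights)
    total = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, series[i:i + n])) / total
        for i in range(len(series) - n + 1)
    ]

# Hypothetical series of nine observations.
series = [3, 7, 2, 0, 4, 5, 9, 7, 2]
ma3 = moving_average(series, 3)
wma = weighted_moving_average(series, [1, 4, 1])
```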
Trend Discovery in Time-Series (1):
Estimation of Seasonal Variations
Seasonal index
 Set of numbers showing the relative values of a variable during the
months of the year
 E.g., if the sales during October, November, and December are 80%,
120%, and 140% of the average monthly sales for the whole year,
respectively, then 80, 120, and 140 are seasonal index numbers for
these months
Deseasonalized data
 Data adjusted for seasonal variations for better trend and cyclic
analysis
 Divide the original monthly data by the seasonal index numbers for
the corresponding months
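The seasonal index and the deseasonalizing step can be sketched as below. The sketch assumes the index of a period is its average value across years, expressed as a percentage of the overall average; it uses quarters rather than months to keep the data short. Names and figures are hypothetical.

```python
def seasonal_indexes(years):
    """years: list of equal-length lists, one value per period (month, quarter, ...)."""
    periods = len(years[0])
    overall = sum(sum(y) for y in years) / (len(years) * periods)
    per_period = [sum(y[p] for y in years) / len(years) for p in range(periods)]
    # Index 100 means "equal to the average period".
    return [100.0 * m / overall for m in per_period]

def deseasonalize(values, indexes):
    """Divide each value by its seasonal index (expressed as a percentage)."""
    return [v * 100.0 / idx for v, idx in zip(values, indexes)]

# Hypothetical quarterly sales for two years.
years = [[80, 100, 120, 100], [160, 200, 240, 200]]
idx = seasonal_indexes(years)
adjusted = deseasonalize(years[0], idx)
```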
Seasonal Index
[Figure: seasonal index (vertical axis, 0 to 160) plotted by month (horizontal axis, 1 to 12)]
Raw data from http://www.bbk.ac.uk/manop/man/docs/QII_2_2003%20Time%20series.pdf
Trend Discovery in Time-Series (2)
Estimation of cyclic variations
 If (approximate) periodicity of cycles occurs, cyclic index can be constructed in
much the same manner as seasonal indexes
Estimation of irregular variations
 By adjusting the data for trend, seasonal and cyclic variations
With the systematic analysis of the trend, cyclic, seasonal, and irregular
components, it is possible to make long- or short-term predictions with
reasonable quality
31
Statistics: Unlocking the Power of Data Lock5 32
Similarity Search in Time-Series Analysis
Normal database query finds exact match
Similarity search finds data sequences that differ only
slightly from the given query sequence
Two categories of similarity queries
• Whole matching: find sequences that are similar to the query
sequence as a whole
• Subsequence matching: find sequences that contain subsequences similar to the query sequence
Typical Applications
 Financial market
 Market basket data analysis
 Scientific databases
 Medical diagnosis
Data Transformation
Many techniques for signal analysis require the data to be
in the frequency domain
Usually data-independent transformations are used
 The transformation matrix is determined a priori
 discrete Fourier transform (DFT)
 discrete wavelet transform (DWT)
The distance between two signals in the time domain is
the same as their Euclidean distance in the frequency
domain
Discrete Fourier Transform
DFT does a good job of concentrating energy in the first
few coefficients
If we keep only the first few coefficients of the DFT, we can
compute a lower bound of the actual distance
Feature extraction: keep the first few coefficients (F-index)
as representative of the sequence
DFT (continued)
Parseval’s Theorem
The Euclidean distance between two signals in the time
domain is the same as their distance in the frequency
domain
Keeping only the first few (say, 3) coefficients underestimates the
distance, so there will be no false dismissals!
$$\sum_{t=0}^{n-1} |x_t|^2 = \sum_{f=0}^{n-1} |X_f|^2$$
$$\sum_{f=0}^{3} |F(S)[f] - F(Q)[f]|^2 \le \sum_{t=0}^{n-1} |S[t] - Q[t]|^2$$
Multidimensional Indexing in Time-Series
Multidimensional index construction
 Constructed for efficient accessing using the first few Fourier coefficients
Similarity search
 Use the index to retrieve the sequences that are at most a certain small distance
away from the query sequence
 Perform post-processing by computing the actual distance between sequences in
the time domain and discard any false matches
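The filter-and-refine search can be sketched as follows, assuming a DFT scaled by 1/sqrt(n) so that Euclidean distances are preserved. A real system would precompute the truncated coefficients and store them in a multidimensional index; this sketch recomputes them in a loop. All names, the data and the threshold are hypothetical.

```python
import cmath
import math

def dft(x):
    """DFT scaled by 1/sqrt(n), which preserves Euclidean distance."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]

def dist(a, b):
    """Euclidean distance; works for real and complex sequences."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

def similar(query, database, eps, k=3):
    """Return names of sequences within distance eps of the query."""
    fq = dft(query)[:k]
    hits = []
    for name, seq in database.items():
        # Filter: the distance over k coefficients never exceeds the true
        # distance, so rejecting here cannot cause a false dismissal.
        if dist(fq, dft(seq)[:k]) > eps:
            continue
        # Post-processing: confirm in the time domain to drop false matches.
        if dist(query, seq) <= eps:
            hits.append(name)
    return hits

# Hypothetical sequences of length 8.
query = [1, 2, 3, 4, 3, 2, 1, 0]
database = {"near": [1, 2, 3, 4, 3, 2, 1, 0.5], "far": [4, 3, 2, 1, 0, 1, 2, 3]}
hits = similar(query, database, eps=1.0)
```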
Subsequence Matching
Break each sequence into a set of windows of length w
Extract the features of the subsequence inside the window
Map each sequence to a “trail” in the feature space
Divide the trail of each sequence into “subtrails” and represent each of
them with minimum bounding rectangle
Use a multi-piece assembly algorithm to search for longer sequence
matches
37
Statistics: Unlocking the Power of Data Lock5 38
Analysis of Similar Time Series
Enhanced Similarity Search Methods
Allow for gaps within a sequence or differences in offsets or amplitudes
Normalize sequences with amplitude scaling and offset translation
Two subsequences are considered similar if one lies within an envelope of
ε width around the other, ignoring outliers
Two sequences are said to be similar if they have enough non-
overlapping time-ordered pairs of similar subsequences
Parameters specified by a user or expert: sliding window size, width of an
envelope for similarity, maximum gap, and matching fraction
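A rough sketch of the normalization and envelope test, under two assumptions that the slide does not spell out: normalization is taken to be z-normalization (offset translation by the mean, amplitude scaling by the standard deviation), and "within an envelope of ε width" is read as a pointwise difference of at most ε. Function names and data are hypothetical.

```python
import statistics

def normalize(seq):
    """Offset translation (subtract the mean) and amplitude scaling (divide by the std)."""
    mean = sum(seq) / len(seq)
    sd = statistics.pstdev(seq)
    return [(v - mean) / sd for v in seq]

def within_envelope(a, b, eps, max_outliers=0):
    """True if b stays within eps of a, except for at most max_outliers points."""
    outside = sum(1 for p, q in zip(a, b) if abs(p - q) > eps)
    return outside <= max_outliers

# Hypothetical windows: same shape at different scales, and a reversed shape.
low = [1, 2, 3, 4, 5]
high = [10, 20, 30, 40, 50]
down = [5, 4, 3, 2, 1]
same_shape = within_envelope(normalize(low), normalize(high), eps=0.01)
opposite = within_envelope(normalize(low), normalize(down), eps=0.01)
```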
39
Statistics: Unlocking the Power of Data Lock5 40
Steps for Performing a Similarity Search
Atomic matching
 Find all pairs of gap-free windows of a small length that are
similar
Window stitching
 Stitch similar windows to form pairs of large similar
subsequences allowing gaps between atomic matches
Subsequence Ordering
 Linearly order the subsequence matches to determine whether
enough similar pieces exist
Similar Time Series Analysis
[Figure: VanEck International Fund and Fidelity Selective Precious Metal and Mineral Fund]
Two similar mutual funds in different fund groups
Sequence Distance
A function that measures the dissimilarity of two
sequences (of possibly unequal length)
Example: Euclidean distance between time series Q and C
$$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$
Motif: Basic Concepts
What is a motif? A previously unknown, frequently
occurring sequential pattern
Match: given subsequences Q, C ⊆ T,
C is a match for Q iff D(Q, C) ≤ R for some range R
Trivial match: C = T[p..*], Q = T[q..*] and C matches Q.
If p = q or ∄ non-match N = T[s..*] such that s is between p and q,
then the match is trivial.
(i.e., in a non-trivial match C and Q must be separated by a non-match)
1-Motif: the subsequence with the most non-trivial matches
(least variance decides ties)
k-Motif: C_k such that D(C_k, C_i) > 2R ∀ i ∈ [1, k)
SAX: Symbolic Aggregate approXimation
Dim. Reduction/Compression
“Symbolic Aggregate approXimation”
SAX: ℝ → Σ
SAX: (time series, shown as a plot on the slide) ↦ ccbaabbbabcbcb
Essentially an alphabet over the Piecewise Aggregate
Approximation (PAA) rank
Faster, simpler, more compression, yet on par with DFT,
DWT and other dim. reductions
SAX Illustration
SAX Algorithm
Parameters: alphabet size, word (segment) length (or output
rate)
1. Select a probability distribution for the TS
2. z-normalize the TS
3. PAA: within each time interval, calculate the aggregated value
(mean) of the segment
4. Partition the TS range by equal-area partitioning of the PDF into
n partitions (equal-frequency binning)
5. Label each segment with a_rank ∈ Σ for the aggregate's
corresponding partition rank
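The five steps can be sketched as below, assuming the standard normal distribution for step 1 and a series length that is a multiple of the word length. The function name, the default alphabet and the data are hypothetical.

```python
import statistics
from bisect import bisect_right
from statistics import NormalDist

def sax(series, word_length, alphabet="abc"):
    """Map a numeric series to a SAX word of word_length symbols."""
    # Step 2: z-normalize.
    mean = statistics.fmean(series)
    sd = statistics.pstdev(series)
    z = [(v - mean) / sd for v in series]
    # Step 3: PAA, the mean of each equal-length segment.
    seg = len(z) // word_length
    paa = [sum(z[i * seg:(i + 1) * seg]) / seg for i in range(word_length)]
    # Steps 1 and 4: breakpoints that cut the standard normal PDF
    # into equal-probability regions, one region per symbol.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    # Step 5: label each segment by the rank of its region.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)

# Hypothetical series with a low, a middle and a high segment.
word = sax([0, 0, 0, 0, 5, 5, 5, 5, 10, 10, 10, 10], word_length=3)
```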
Finding Motifs in a Time Series
EMMA algorithm: finds the 1-motif (or k-motif) of fixed length n
SAX compression (dim. reduction)
• Possible to store D(i, j) ∀ (i, j) ∈ Σ × Σ
• Allows use of various distance measures (Minkowski, Dynamic Time
Warping)
Multiple tiers
• Tier 1: uses a sliding window to hash length-w SAX subsequences
(a^w addresses, total size O(m)).
The bucket B with the most collisions and the buckets with
MINDIST(B) < R form the neighborhood of B.
• Tier 2: the neighborhood is pruned using the more precise ADM
algorithm. The N_i with the most matches is the 1-motif. Early stop if |ADM
matches| > max_{k>i}(|neighborhood_k|)
Hashing
[Figure: SAX words such as "c e c a b b c b a c" and "c c c c b b c c d c" are converted to digit strings ("2 4 2 0 1 1 2 1 0 2" and "2 2 2 2 1 1 2 2 3 2") and hashed into buckets; the figure labels are w and n]
Classification in Time Series
Application: finance
1-Nearest Neighbor
 Pros: accurate, robust, simple
 Cons: time and space complexity (lazy learning); results are not
interpretable
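A minimal sketch of 1-nearest-neighbor classification of time series, assuming equal-length series and Euclidean distance. The names, labels and data are hypothetical. The classifier stores every training series and scans all of them for each query, which is the time and space cost noted above.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def classify_1nn(query, labeled_series):
    """labeled_series: list of (series, label); returns the label of the closest series."""
    _, label = min(labeled_series, key=lambda item: euclidean(query, item[0]))
    return label

# Hypothetical training set with two labeled price patterns.
train = [([1, 2, 3, 4], "up"), ([4, 3, 2, 1], "down")]
label = classify_1nn([2, 3, 4, 5], train)
```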
Financial Data Applications
Fraud Detection - Anomaly Analysis
What are Anomalies?
Anomaly is a pattern in the data that does not conform to
the expected behavior
Also referred to as outliers, exceptions, peculiarities,
surprise, etc.
Anomalies translate to significant (often critical) real life
entities
 Cyber intrusions
 Credit card fraud
Real World Anomalies
Credit Card Fraud
 An abnormally high purchase made on a
credit card
Cyber Intrusions
 A web server involved in ftp traffic
Simple Example
N1 and N2 are regions of
normal behavior
Points o1 and o2 are
anomalies
Points in region O3 are
anomalies
Related problems
Rare Class Mining
Chance discovery
Novelty Detection
Exception Mining
Noise Removal
Black Swan*
Key Challenges
Defining a representative normal region is
challenging
The boundary between normal and outlying
behavior is often not precise
The exact notion of an outlier is different for
different application domains
Availability of labeled data for training/validation
Malicious adversaries
Data might contain noise
Normal behavior keeps evolving
Data Labels
Supervised Anomaly Detection
 Labels available for both normal data and anomalies
 Similar to rare class mining
Semi-supervised Anomaly Detection
 Labels available only for normal data
Unsupervised Anomaly Detection
 No labels assumed
 Based on the assumption that anomalies are very rare compared to normal data
Applications of Anomaly Detection
Insurance / Credit card fraud detection
Anti-Money Laundering (AML)
Fraud
Identity Theft and Fake Account Registration
Risk Modeling
Account Takeover
Promotion Credit Abuse
Customer Behavior Analytics
Cyber Security
Fraud Detection
Fraud detection refers to detection of criminal activities
occurring in commercial organizations
 Malicious users might be the actual customers of the organization
or might be posing as a customer (also known as identity theft).
Types of fraud
 Credit card fraud
 Insurance claim fraud
 Mobile / cell phone fraud
 Insider trading
Challenges
 Fast and accurate real-time detection
 Misclassification cost is very high
Classification Based Techniques
Main idea: build a classification model for normal (and anomalous (rare))
events based on labeled training data, and use it to classify each new
unseen event
Classification models must be able to handle skewed (imbalanced) class
distributions
Categories:
 Supervised classification techniques
 Require knowledge of both normal and anomaly class
 Build classifier to distinguish between normal and known anomalies
 Semi-supervised classification techniques
 Require knowledge of normal class only!
 Use modified classification model to learn the normal behavior and then detect any
deviations from normal behavior as anomalous
Classification Based Techniques
Advantages:
 Supervised classification techniques
 Models that can be easily understood
 High accuracy in detecting many kinds of known anomalies
 Semi-supervised classification techniques
 Models that can be easily understood
 Normal behavior can be accurately learned
Drawbacks:
• Supervised classification techniques
  – Require labels from both the normal and the anomaly class
  – Cannot detect unknown and emerging anomalies
• Semi-supervised classification techniques
  – Require labels from the normal class
  – Possibly high false alarm rate: previously unseen (yet legitimate) data records
may be recognized as anomalies
Supervised Classification Techniques
Manipulating data records (oversampling /
undersampling / generating artificial examples)
Rule based techniques
Model based techniques
 Neural network based approaches
 Support Vector machines (SVM) based approaches
 Bayesian networks based approaches
Cost-sensitive classification techniques
Ensemble based algorithms (SMOTEBoost,
RareBoost)
Semi-supervised Classification Techniques
Use modified classification model to learn the
normal behavior and then detect any deviations
from normal behavior as anomalous
Recent approaches:
 Neural network based approaches
 Support Vector machines (SVM) based approaches
 Markov model based approaches
 Rule-based approaches
Nearest Neighbor Based Techniques
Key assumption: normal points have close neighbors
while anomalies are located far from other points
General two-step approach
1. Compute neighborhood for each data record
2. Analyze the neighborhood to determine whether data
record is anomaly or not
Categories:
 Distance based methods
 Anomalies are data points most distant from other points
 Density based methods
 Anomalies are data points in low density regions
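A minimal sketch of a distance-based nearest-neighbor method: each record is scored by the distance to its k-th nearest neighbor, so isolated records receive high scores. The function name, k and the points are hypothetical, and the brute-force loop compares every pair of records.

```python
import math

def knn_scores(points, k):
    """Anomaly score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Step 1: compute the neighborhood (distances to all other records).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        # Step 2: analyze it; a large k-th distance means an isolated record.
        scores.append(dists[k - 1])
    return scores

# Hypothetical records: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_scores(points, k=2)
most_anomalous = scores.index(max(scores))
```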
Clustering Based Techniques
Key assumption: normal data records belong to large and
dense clusters, while anomalies do not belong to any of
the clusters or form very small clusters
Categorization according to labels
• Semi-supervised: cluster normal data to create modes of normal
behavior. If a new instance does not belong to any of the clusters, or is
not close to any cluster, it is an anomaly
• Unsupervised: post-processing is needed after the clustering step to
determine the size of the clusters and the distance from the clusters
required for a point to be an anomaly
Anomalies detected using clustering based methods can be:
 Data records that do not fit into any cluster (residuals from clustering)
 Small clusters
 Low density clusters or local anomalies (far from other points within the
same cluster)
Clustering Based Techniques
Advantages:
 No need to be supervised
 Easily adaptable to on-line / incremental mode suitable for
anomaly detection from temporal data
Drawbacks
 Computationally expensive
Using indexing structures (k-d tree, R* tree) may alleviate this
problem
 If normal points do not create any clusters the techniques
may fail
 In high dimensional spaces, data is sparse and distances
between any two data records may become quite similar.
Clustering algorithms may not give any meaningful clusters
Statistics Based Techniques
Data points are modeled using a stochastic distribution;
points are then determined to be outliers depending on their
relationship with this model
Advantage
 Utilize existing statistical modeling techniques to model various type
of distributions
Challenges
 With high dimensions, difficult to estimate distributions
 Parametric assumptions often do not hold for real data sets
Types of Statistical Techniques
Parametric Techniques
 Assume that the normal (and possibly anomalous) data is generated
from an underlying parametric distribution
 Learn the parameters from the normal sample
 Determine the likelihood of a test instance to be generated from this
distribution to detect anomalies
Non-parametric Techniques
 Do not assume any knowledge of parameters
 Use non-parametric techniques to learn a distribution – e.g. parzen
window estimation
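A minimal sketch of the parametric approach, assuming the normal data follows a Gaussian distribution: the mean and standard deviation are learned from a normal sample, and a test instance is flagged when it lies more than a chosen number of standard deviations from the mean. The threshold of 3 and the data are hypothetical.

```python
import statistics

def fit_gaussian(normal_sample):
    """Learn the parameters (mean, standard deviation) from normal data only."""
    return statistics.mean(normal_sample), statistics.stdev(normal_sample)

def is_anomaly(x, mean, sd, z_threshold=3.0):
    """Flag x when it is unlikely under the fitted Gaussian."""
    return abs(x - mean) / sd > z_threshold

# Hypothetical transaction amounts considered normal.
sample = [100, 102, 98, 101, 99, 100, 103, 97]
mean, sd = fit_gaussian(sample)
flag_small = is_anomaly(101, mean, sd)
flag_large = is_anomaly(150, mean, sd)
```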
Information Theory Based Techniques
Compute information content in data using information
theoretic measures, e.g., entropy, relative entropy, etc.
Key idea: Outliers significantly alter the information content
in a dataset
Approach: Detect data instances that significantly alter the
information content
 Require an information theoretic measure
Advantage
 Operate in an unsupervised mode
Challenges
 Require an information theoretic measure sensitive enough to detect
irregularity induced by very few outliers
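One way to sketch the idea, assuming Shannon entropy over a single categorical attribute as the information-theoretic measure: remove each instance in turn and record how much the entropy of the dataset drops. The instance whose removal lowers the entropy most is the strongest outlier candidate. Names and data are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def entropy_drop(values):
    """Entropy reduction obtained by removing each instance in turn."""
    base = entropy(values)
    return [base - entropy(values[:i] + values[i + 1:]) for i in range(len(values))]

# Hypothetical transaction channels: nine ATM records and a single wire transfer.
channels = ["ATM"] * 9 + ["WIRE"]
drops = entropy_drop(channels)
candidate = drops.index(max(drops))
```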
Visualization Based Techniques
Use visualization tools to observe the data
Provide alternate views of data for manual
inspection
Anomalies are detected visually
Advantages
 Keeps a human in the loop
Disadvantages
• Works well only for low-dimensional data
• Can provide only aggregated or partial views for high-dimensional
data
Visual Data Mining*
Detecting telecommunication fraud
Display telephone call patterns as a graph
Use colors to identify fraudulent telephone calls (anomalies)
Contextual Anomaly Detection
Detect context anomalies
General Approach
 Identify a context around a data instance (using a set of
contextual attributes)
 Determine if the data instance is anomalous w.r.t. the context
(using a set of behavioral attributes)
Assumption
 All normal instances within a context will be similar (in terms of
behavioral attributes), while the anomalies will be different
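The general approach can be sketched as below, assuming one contextual attribute (a customer profile) and one behavioral attribute (a transaction amount), with a z-score computed inside each context. The threshold of 2 and the records are hypothetical.

```python
import statistics
from collections import defaultdict

def contextual_anomalies(records, z_threshold=2.0):
    """records: list of (context, value); flag values unusual within their own context."""
    by_context = defaultdict(list)
    for ctx, value in records:
        by_context[ctx].append(value)
    flagged = []
    for ctx, value in records:
        group = by_context[ctx]
        mean = statistics.mean(group)
        sd = statistics.pstdev(group)
        if sd > 0 and abs(value - mean) / sd > z_threshold:
            flagged.append((ctx, value))
    return flagged

# Hypothetical spending: 500 is ordinary for a business account
# but unusual for a student account.
records = [("student", v) for v in [20, 22, 18, 21, 19, 20, 23, 17, 20, 500]]
records += [("business", v) for v in [480, 500, 520, 510, 490]]
flagged = contextual_anomalies(records)
```

The same amount is flagged in one context and accepted in the other, which a single global threshold could not do.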
Contextual Attributes
Contextual attributes define a neighborhood
(context) for each instance
For example:
 Spatial Context
Latitude, Longitude
 Graph Context
Edges, Weights
 Sequential Context
Position, Time
 Profile Context
User demographics
Sequential Anomaly Detection
Detect anomalous sequences in a database of
sequences, or
Detect anomalous subsequence within a sequence
Data is presented as a set of symbolic sequences
 System call intrusion detection
 Proteomics
 Climate data
Motivation for On-line Anomaly Detection
In many rare-event applications, data arrives continuously at an enormous pace
Analyzing such data is a significant challenge
Examples of such rare-event applications:
• Video analysis
• Network traffic monitoring
• Fraudulent credit card transactions
Sentiment Analysis for Finance
Sentiment analysis is an emerging area in which structured and
unstructured data are analyzed to generate useful insights that lead to
improved performance.
Information obtained from multiple sources, including news wires,
macroeconomic announcements, social media, microblogs such as Twitter, and
online (search) information such as Google Trends and Wikipedia, influences
both business intelligence and performance evaluation.
This sentiment data can help investors and finance professionals to
exploit the market and manage their risk exposure.
• Stock market prediction
• New product reviews
• Stock trading
• Customer brand building
Sentiment Analysis in Finance
Thank You

FDA_SAKEC2018.pptx

  • 1. Statistics: Unlocking the Power of Data Lock5 Financial DATA ANALYTICS Dr. M.Vijayalakshmi, VESIT 4th jan 2018, SAKEC Mumbai
  • 2. Statistics: Unlocking the Power of Data Lock5 Financial Data The financial industry has always been driven by data. Today, Big Data is prevalent at various levels of this field, ranging from the financial services sector to capital markets. The availability of Big Data in this domain has opened up new avenues for innovation and has offered immense opportunities for growth and sustainability. At the same time, it has presented several new challenges that must be overcome to gain the maximum value out of it.
  • 3. Statistics: Unlocking the Power of Data Lock5 Financial Data Analytics in a Nut Shell
  • 4. Statistics: Unlocking the Power of Data Lock5 Motivation There has been an explosion in the velocity, variety and volume of financial data. Social media activity, mobile interactions, server logs, real-time market feeds, customer service records, transaction details, information from existing databases – there’s no end to the flood. To make sense of these giant data sets, companies are increasingly turning to data scientists for answers. These numbers gurus are:  Capturing and analyzing new sources of data, building predictive models and running live simulations of market events  Using technologies such as Hadoop, NoSQL and Storm to tap into non-traditional data sets (e.g., geolocation, sentiment data) and integrate them with more traditional numbers (e.g., trade data)  Finding and storing increasingly diverse data in its raw form for future analysis They’ve been aided in this quest by the development of cloud-based data storage and the surge of sophisticated (and sometimes free or open-source) analytics tools.
  • 5. Statistics: Unlocking the Power of Data Lock5 Important Applications of Financial Data Analytics 1. Predictive Analytics / Trading 2. Sentiment Analysis 3. Financial Fraud 4. Credit Scoring Ratings 5. Pricing 6. Customer Segmentation 7. Know Your Customer
  • 6. Statistics: Unlocking the Power of Data Lock5 Sentiment Analysis Sentiment analysis (aka opinion mining) applies natural-language processing, text analysis and computational linguistics to source material to discover what folks really think. Several big Businesses like MarketPsy Capital, Think Big Analytics and MarketPsych Data are using it to: Build algorithms around market sentiment data (e.g., Twitter feeds) that can short the market when disasters (e.g., storms, terrorist attacks) occur Track trends, monitor the launch of new products, respond to issues and improve overall brand perception Analyze unstructured voice recordings from call centers and recommend ways to reduce customer churn, up-sell and cross-sell products and detect fraud Some data companies are even acting as intermediaries, collecting and selling sentiment indicators to retail investors.
  • 7. Statistics: Unlocking the Power of Data Lock5 Automated Risk Credit Management Internet finance companies are finding ways to approve loans and manage risk. Aliloan (from AliBaba) is an automated online system that provides flexible micro-loans to entrepreneurial online vendors. To gauge whether a vendor is creditworthy, Alibaba collects data from its e- commerce and payment platforms and analyzes transaction records, customer ratings, shipping records and a host of other info. These findings are confirmed by third-party verification and cross-checked against external data sets (e.g., customs, tax data, electricity records, etc.). Once the loan is granted, Alibaba continues to monitor the use of funds and assess the business’s strategic development. Entrepreneurs in emerging markets are also reaping the benefits. Like Aliloan, companies such as Kreditech and Lenddo provide automated small loans based on innovative credit scoring techniques. In these cases, much of the score is calculated from applicants’ online social networking data.
  • 8. Statistics: Unlocking the Power of Data Lock5 Real Time Analytics In days of yore, financial institutions were hampered by the lag-time between data collection and data analysis. Real-time analytics short-circuits this problem and provides the industry with new ways to: Fight Financial Fraud: Banks and credit card companies routinely analyze account balances, spending patterns, credit history, employment details, location and a load of other data points to determine whether transactions are above aboard. If suspicious activity is detected, they can immediately suspend the account and alert the owner. Improve Credit Ratings: A continuous feed of online data means credit ratings can be updated in real time. This provides lenders with a more accurate picture of a customer’s assets, business operations and transaction history. Provide More Accurate Pricing: Progressive Insurance already tailors its policies to account for a customer’s changing financial situation. In the Internet of Things, data from automobile sensors will also help insurance companies issues its policy holders with warnings about accidents, traffic jams and weather conditions. That makes for safer drivers and fewer payouts
  • 9. Statistics: Unlocking the Power of Data Lock5 Customer Segmentation Like every other industry on the planet, banks and financial institutions are hungry to know more about the people using their products and services. And though they already store a ton of data – from credit scores to day-to-day transactions – they’re not too proud to look for it elsewhere.  This kind of customer segmentation allows them to:  Offer customized product offerings and services  Improve existing profitable relationships and avoid customer churn  Create better marketing campaigns and more attractive product offerings  Tailor product development to specific customer segments
  • 10. Statistics: Unlocking the Power of Data Lock5 Predictive Analytics By combining segmentation with predictive analytics, companies can also cut down on risk. For example, to decide whether certain customers are likely to pay off their credit cards, some major banks use technology developed by the company Sqrrl. This analysis takes into account the demographic characteristics of customers’ neighborhoods and makes calculated predictions. Similar strides have been made in forecasting market behavior. Once upon a time (e.g., 2009), high-frequency trading – the speedy exchange of securities – was hugely lucrative. With competition came a drop in profits and the need for a new strategy. HFT traders adapted by employing strategic sequential trading, using big data analytics to identify specific market participants and anticipate their future actions. In a field of breakneck speed, this gives HFT traders an unmistakable advantage. By studying search volume data provided by Google Trends, they were able to identify online precursors for stock market moves. Their results suggest that increases in search volume for financially relevant search terms usually precede big losses in financial markets.
  • 11. Statistics: Unlocking the Power of Data Lock5 Analytics of Financial Times Series A vast majority of Financial data occurs in the form of a times series  Stock prices (ticker data)  Asset prices  Customer Numbers  Etc So Financial Data Analytics places a lot of importance on Financial times series analytics
  • 12. Statistics: Unlocking the Power of Data Lock5 Examples of financial time series Daily log returns of Apple stock: 2007 to 2016 (10 years) BSE index Quarterly earnings of Coca-Cola Company: 1983-2009 Seasonal time series useful in  earning forecasts  pricing weather related derivatives (e.g. energy)  modeling intraday behavior of asset returns Exchange rate between US Dollar vs Re Size of insurance claims Values High-frequency financial data: Tick-by-tick data of stock, etc
  • 13. 13 Mining Time-Series Data A time series is a sequence of data points, measured typically at successive times, spaced at (often uniform) time intervals Time series analysis: A subfield of statistics, comprises methods that attempt to understand such time series, often either to understand the underlying context of the data points or to make forecasts (or predictions) Methods for time series analyses  Frequency-domain methods: Model-free analyses, well-suited to exploratory investigations  spectral analysis vs. wavelet analysis  Time-domain methods: Auto-correlation and cross-correlation analysis  Motif-based time-series analysis Applications  Financial: stock price, inflation  Industry: power consumption  Scientific: experiment results  Meteorological: precipitation
  • 14. Statistics: Unlocking the Power of Data Lock5 14 Time-Series Data Analysis: Prediction & Regression Analysis (Numerical) prediction is similar to classification  construct a model  use model to predict continuous or ordered value for a given input Prediction is different from classification  Classification refers to predict categorical class label  Prediction models continuous-valued functions Major method for prediction: regression  model the relationship between one or more independent or predictor variables and a dependent or response variable Regression analysis  Linear and multiple regression  Non-linear regression  Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
  • 15. Statistics: Unlocking the Power of Data Lock5 15 What is Regression? Modeling the relationship between one response variable and one or more predictor variables Analyzing the confidence of the model E.g, height v.s weight
  • 16. Statistics: Unlocking the Power of Data Lock5 16 Regression Yields Analytical Model Discrete data points →Analytical model  General relationship  Easy calculation  Further analysis Application - Prediction
  • 17. Statistics: Unlocking the Power of Data Lock5 17 Application - Detrending Obtain the trend for irregular data series Subtract trend Reveal oscillations trend
  • 18. Statistics: Unlocking the Power of Data Lock5 18 Linear Regression - Single Predictor Model is linear y = w0 + w1 x where w0 (y-intercept) and w1 (slope) are regression coefficients Method of least squares: y: response variable x: predictor variable w1 w0 | | 1 | | 2 1 ( )( ) 1 ( ) D i i i D i i x x y y x x w         x w y w 1 0  
  • 19. Statistics: Unlocking the Power of Data Lock5 19 Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|) E.g., for 2-D data or y = w0 + w1 x1+ w2 x2 Solvable by  Extension of least square method (XTX ) W=Y →W = (XTX ) -1Y  Commercial software (SAS, S-Plus) x1 x2 y Linear Regression – Multiple Predictor
  • 20. Statistics: Unlocking the Power of Data Lock5 20 Nonlinear Regression with Linear Method Polynomial regression model  E.g., y = w0 + w1 x + w2 x2 + w3 x3 Let x2 = x2, x3= x3 y = w0 + w1 x + w2 x2 + w3 x3 Log-linear regression model  E. g., y = exp(w0 + w1 x + w2 x2 + w3 x3 ) Let y’=log(y) y’= w0 + w1 x + w2 x2 + w3 x3
  • 21. Statistics: Unlocking the Power of Data Lock5 21 Generalized Linear Regression Response y  Distribution function in the exponential family  Variance of y depends on E( y), not a constant E( y) = g-1( w0 + w1 x + w2 x2 + w3 x3 ) Examples  Logistic regression (binomial regression): probability of some event occurring  Poisson regression: number of customers  … References: Nelder and Wedderburn, 1972; McCullagh and Nelder, 1989
  • 22. 22 Regression Tree (Breiman et al., 1984) Partition the domain space Leaf: (1) a continuous-valued prediction; (2) average value
  • 23. Statistics: Unlocking the Power of Data Lock5 23 Model Tree Leaf – a linear equation More general than regression tree Figure source: http://datamining.ihe.nl/research/model-trees.htm
  • 24. Statistics: Unlocking the Power of Data Lock5 24 Regression Trees and Model Trees Regression tree: proposed in CART system (Breiman et al. 1984)  CART: Classification And Regression Trees  Each leaf stores a continuous-valued prediction  It is the average value of the predicted attribute for the training tuples that reach the leaf Model tree: proposed by Quinlan (1992)  Each leaf holds a regression model—a multivariate linear equation for the predicted attribute  A more general case than regression tree Regression and model trees tend to be more accurate than linear regression when the data cannot be represented well by a simple linear model
  • 25. Statistics: Unlocking the Power of Data Lock5 25 A time series can be illustrated as a time-series graph which describes a point moving with the passage of time
  • 26. Statistics: Unlocking the Power of Data Lock5 26 Categories of Time-Series Movements Categories of Time-Series Movements  Long-term or trend movements (trend curve): general direction in which a time series is moving over a long interval of time  Cyclic movements or cycle variations: long term oscillations about a trend line or curve e.g., business cycles, may or may not be periodic  Seasonal movements or seasonal variations i.e, almost identical patterns that a time series appears to follow during corresponding months of successive years.  Irregular or random movements Time series analysis: decomposition of a time series into these four basic movements  Additive Modal: TS = T + C + S + I  Multiplicative Modal: TS = T  C  S  I
  • 27. Statistics: Unlocking the Power of Data Lock5 Estimation of Trend Curve The freehand method  Fit the curve by looking at the graph  Costly and barely reliable for large-scaled data mining The least-square method  Find the curve minimizing the sum of the squares of the deviation of points on the curve from the corresponding data points The moving-average method 27
  • 28. Statistics: Unlocking the Power of Data Lock5 28 Moving Average Moving average of order n  Smoothes the data  Eliminates cyclic, seasonal and irregular movements  Loses the data at the beginning or end of a series  Sensitive to outliers (can be reduced by weighted moving average)
  • 29. Statistics: Unlocking the Power of Data Lock5 29 Trend Discovery in Time-Series (1): Estimation of Seasonal Variations Seasonal index  Set of numbers showing the relative values of a variable during the months of the year  E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are seasonal index numbers for these months Deseasonalized data  Data adjusted for seasonal variations for better trend and cyclic analysis  Divide the original monthly data by the seasonal index numbers for the corresponding months
  • 30. Seasonal Index
      - [Figure: seasonal index (vertical axis, 0 to 160) by month (horizontal axis, 1 to 12)]
      - Raw data from http://www.bbk.ac.uk/manop/man/docs/QII_2_2003%20Time%20series.pdf
  • 31. Trend Discovery in Time-Series (2)
      - Estimation of cyclic variations
          - If cycles are (approximately) periodic, a cyclic index can be constructed in much the same manner as seasonal indexes
      - Estimation of irregular variations
          - By adjusting the data for trend, seasonal and cyclic variations
      - With systematic analysis of the trend, cyclic, seasonal, and irregular components, it is possible to make long- or short-term predictions with reasonable quality
  • 32. Similarity Search in Time-Series Analysis
      - A normal database query finds exact matches
      - A similarity search finds data sequences that differ only slightly from the given query sequence
      - Two categories of similarity queries
          - Whole matching: find sequences that are similar, as a whole, to the query sequence
          - Subsequence matching: find sequences that contain subsequences similar to the query sequence
      - Typical applications
          - Financial market
          - Market basket data analysis
          - Scientific databases
          - Medical diagnosis
  • 33. Data Transformation
      - Many techniques for signal analysis require the data to be in the frequency domain
      - Usually data-independent transformations are used
          - The transformation matrix is determined a priori
          - Discrete Fourier transform (DFT)
          - Discrete wavelet transform (DWT)
      - The Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain
  • 34. Discrete Fourier Transform
      - DFT does a good job of concentrating energy in the first few coefficients
      - If we keep only the first few DFT coefficients, we can compute a lower bound on the actual distance
      - Feature extraction: keep the first few coefficients (F-index) as representative of the sequence
  • 35. DFT (continued)
      - Parseval’s Theorem
        $$\sum_{t=0}^{n-1} |x_t|^2 = \sum_{f=0}^{n-1} |X_f|^2$$
      - The Euclidean distance between two signals in the time domain is the same as their distance in the frequency domain
      - Keeping only the first few (say, 3) coefficients underestimates the distance, so there will be no false dismissals:
        $$\sum_{t=0}^{n-1} |S[t] - Q[t]|^2 \le \varepsilon \;\Rightarrow\; \sum_{f=0}^{2} |F(S)[f] - F(Q)[f]|^2 \le \varepsilon$$
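The distance-preservation property and the lower bound can be checked numerically with a small sketch. It assumes the orthonormal DFT (scaled by 1/sqrt(n)), which is the normalization under which Parseval's theorem holds without an extra factor; the naive O(n²) transform and the two sample sequences are illustrative only.

```python
import cmath
import math


def dft(x):
    """Naive orthonormal DFT: X_f = (1/sqrt(n)) * sum_t x_t * exp(-2*pi*i*f*t/n)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]


def euclidean(a, b):
    """Euclidean distance; works for real or complex coordinates."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


# Two made-up sequences of equal length.
s = [1.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 7.0]
q = [1.5, 2.5, 3.0, 3.5, 4.0, 5.0, 6.5, 6.0]
S, Q = dft(s), dft(q)

d_time = euclidean(s, q)            # distance in the time domain
d_freq = euclidean(S, Q)            # same distance in the frequency domain
d_first3 = euclidean(S[:3], Q[:3])  # lower bound from the first 3 coefficients
print(d_time, d_freq, d_first3)
```

Because `d_first3` never exceeds `d_time`, filtering candidates on the truncated distance can produce false matches (removed in post-processing) but no false dismissals.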
  • 36. Multidimensional Indexing in Time-Series
      - Multidimensional index construction
          - Constructed for efficient accessing using the first few Fourier coefficients
      - Similarity search
          - Use the index to retrieve the sequences that are at most a certain small distance away from the query sequence
          - Perform post-processing by computing the actual distance between sequences in the time domain and discard any false matches
  • 37. Subsequence Matching
      - Break each sequence into a set of windows of length w
      - Extract the features of the subsequence inside each window
      - Map each sequence to a “trail” in the feature space
      - Divide the trail of each sequence into “subtrails” and represent each of them with a minimum bounding rectangle
      - Use a multi-piece assembly algorithm to search for longer sequence matches
  • 38. Analysis of Similar Time Series
  • 39. Enhanced Similarity Search Methods
      - Allow for gaps within a sequence or differences in offsets or amplitudes
      - Normalize sequences with amplitude scaling and offset translation
      - Two subsequences are considered similar if one lies within an envelope of ε width around the other, ignoring outliers
      - Two sequences are said to be similar if they have enough non-overlapping time-ordered pairs of similar subsequences
      - Parameters specified by a user or expert: sliding window size, width of an envelope for similarity, maximum gap, and matching fraction
  • 40. Steps for Performing a Similarity Search
      - Atomic matching
          - Find all pairs of gap-free windows of a small length that are similar
      - Window stitching
          - Stitch similar windows to form pairs of large similar subsequences, allowing gaps between atomic matches
      - Subsequence ordering
          - Linearly order the subsequence matches to determine whether enough similar pieces exist
  • 41. Similar Time Series Analysis
      - [Figure: VanEck International Fund and Fidelity Selective Precious Metal and Mineral Fund]
      - Two similar mutual funds in different fund groups
  • 42. Sequence Distance
      - A function that measures the dissimilarity of two sequences (of possibly unequal length)
      - Example: Euclidean distance between two time series Q and C of equal length n
        $$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$
  • 43. Motif: Basic Concepts
      - What is a motif? A previously unknown, frequently occurring sequential pattern
      - Match: given subsequences Q, C ⊆ T, C is a match for Q iff D(Q, C) ≤ R for a given range R > 0
      - Trivial match: let C = T[p..*] and Q = T[q..*], where C matches Q. If p = q, or there is no non-match N = T[s..*] with s between p and q, the match is trivial. A non-trivial match therefore requires C and Q to be separated by a non-match
      - 1-Motif: the subsequence with the most non-trivial matches (least variance decides ties)
      - k-Motif: the subsequence C_k with the most non-trivial matches such that D(C_k, C_i) > 2R ∀ i ∈ [1, k)
  • 44. SAX: Symbolic Aggregate approXimation
      - Dimensionality reduction / compression
      - SAX: ℝ → Σ, i.e. SAX maps a real-valued time series to a string over an alphabet Σ, e.g. [time-series plot] ↦ ccbaabbbabcbcb
      - Essentially an alphabet over the Piecewise Aggregate Approximation (PAA) rank
      - Faster, simpler, more compression, yet on par with DFT, DWT and other dimensionality reduction methods
  • 45. SAX Illustration
  • 46. SAX Algorithm
      - Parameters: alphabet size, word (segment) length (or output rate)
      1. Select a probability distribution for the time series (TS)
      2. z-normalize the TS
      3. PAA: within each time interval, calculate the aggregated value (mean) of the segment
      4. Partition the TS range by equal-area partitioning of the PDF into n partitions (equal-frequency binning)
      5. Label each segment with a_rank ∈ Σ for the aggregate’s corresponding partition rank
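The five steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the standard normal distribution in step 1 (the usual choice after z-normalization), requires the series length to be a multiple of the word length, and uses lowercase letters as the alphabet. The function name `sax` and the sample series are made up.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, word_length, alphabet_size):
    """Convert a numeric series into a SAX word of `word_length` symbols."""
    if len(series) % word_length != 0:
        raise ValueError("series length must be a multiple of word_length")
    if not 2 <= alphabet_size <= 26:
        raise ValueError("alphabet_size must be between 2 and 26")
    sigma = pstdev(series)
    if sigma == 0:
        raise ValueError("cannot z-normalize a constant series")

    # Step 2: z-normalize.
    mu = mean(series)
    z = [(v - mu) / sigma for v in series]

    # Step 3: PAA, i.e. the mean of each equal-length segment.
    seg = len(z) // word_length
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(word_length)]

    # Steps 1 and 4: equal-area breakpoints under the standard normal PDF.
    normal = NormalDist()
    breakpoints = [normal.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]

    # Step 5: label each segment by the rank of the partition it falls into.
    return "".join(chr(ord("a") + bisect_right(breakpoints, v)) for v in paa)


print(sax([1, 2, 3, 4, 5, 6, 7, 8], word_length=4, alphabet_size=4))  # abcd
```

A steadily rising series maps to an ascending word, which is the kind of compact symbolic summary that the hashing step of motif discovery works on.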
  • 47. Finding Motifs in a Time Series
      - EMMA algorithm: finds the 1-motif (or k-motif) of fixed length n
      - SAX compression (dimensionality reduction)
          - Possible to store D(i,j) ∀(i,j) ∈ Σ × Σ
          - Allows use of various distance measures (Minkowski, Dynamic Time Warping)
      - Multiple tiers
          - Tier 1: uses a sliding window to hash length-w SAX subsequences (a^w addresses, total size O(m)). The bucket B with the most collisions, together with the buckets with MINDIST(B) < R, forms the neighborhood of B.
          - Tier 2: the neighborhood is pruned using the more precise ADM algorithm. The N_i with the most matches is the 1-motif. Early stop if |ADM matches| > max_{k>i}(|neighborhood_k|)
  • 48. Hashing
      - [Figure: sliding-window SAX words of length w taken from a series of length n (e.g., cecabbcbac, ccccbbccdc), converted to integer symbol codes (a = 0, b = 1, c = 2, ...) for hashing]
  • 49. Classification in Time Series
      - Application: finance
      - 1-Nearest Neighbor
          - Pros: accurate, robust, simple
          - Cons: time and space complexity (lazy learning); results are not interpretable
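A 1-nearest-neighbor classifier for time series can be sketched as follows. The sketch assumes Euclidean distance and series of equal length; the labels and training series are invented for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def classify_1nn(query, labeled_series):
    """Return the label of the training series closest to `query`."""
    best_label, best_dist = None, math.inf
    for series, label in labeled_series:
        d = euclidean(query, series)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


# Hypothetical labeled price patterns.
training = [
    ((1, 2, 3, 4, 5), "uptrend"),
    ((5, 4, 3, 2, 1), "downtrend"),
    ((3, 3, 3, 3, 3), "flat"),
]
print(classify_1nn((2, 3, 3, 5, 6), training))  # uptrend
```

Every query scans the whole training set, which is the time and space cost of lazy learning noted on the slide.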
  • 50. Financial Data Applications Fraud Detection - Anomaly Analysis
  • 51. What are Anomalies?
      - An anomaly is a pattern in the data that does not conform to the expected behavior
      - Also referred to as outliers, exceptions, peculiarities, surprises, etc.
      - Anomalies translate to significant (often critical) real-life entities
          - Cyber intrusions
          - Credit card fraud
  • 52. Real World Anomalies
      - Credit card fraud
          - An abnormally high purchase made on a credit card
      - Cyber intrusions
          - A web server involved in FTP traffic
  • 53. Simple Example
      - N1 and N2 are regions of normal behavior
      - Points o1 and o2 are anomalies
      - Points in region O3 are anomalies
      - [Figure: scatter plot in the X-Y plane showing regions N1 and N2, points o1 and o2, and region O3]
  • 54. Related problems
      - Rare Class Mining
      - Chance discovery
      - Novelty Detection
      - Exception Mining
      - Noise Removal
      - Black Swan*
  • 55. Key Challenges
      - Defining a representative normal region is challenging
      - The boundary between normal and outlying behavior is often not precise
      - The exact notion of an outlier is different for different application domains
      - Availability of labeled data for training/validation
      - Malicious adversaries
      - Data might contain noise
      - Normal behavior keeps evolving
  • 56. Data Labels
      - Supervised Anomaly Detection
          - Labels available for both normal data and anomalies
          - Similar to rare class mining
      - Semi-supervised Anomaly Detection
          - Labels available only for normal data
      - Unsupervised Anomaly Detection
          - No labels assumed
          - Based on the assumption that anomalies are very rare compared to normal data
  • 57. Applications of Anomaly Detection Insurance / Credit card fraud detection Anti-Money Laundering (AML) Fraud Identity Theft and Fake Account Registration Risk Modeling Account Takeover Promotion Credit Abuse Customer Behavior Analytics Cyber Security
  • 58. Fraud Detection
      - Fraud detection refers to the detection of criminal activities occurring in commercial organizations
          - Malicious users might be actual customers of the organization or might be posing as customers (also known as identity theft)
      - Types of fraud
          - Credit card fraud
          - Insurance claim fraud
          - Mobile / cell phone fraud
          - Insider trading
      - Challenges
          - Fast and accurate real-time detection
          - Misclassification cost is very high
  • 59. Classification Based Techniques
      - Main idea: build a classification model for normal (and anomalous, i.e. rare) events based on labeled training data, and use it to classify each new unseen event
      - Classification models must be able to handle skewed (imbalanced) class distributions
      - Categories:
          - Supervised classification techniques
              - Require knowledge of both the normal and the anomaly class
              - Build a classifier to distinguish between normal and known anomalies
          - Semi-supervised classification techniques
              - Require knowledge of the normal class only
              - Use a modified classification model to learn the normal behavior and then detect any deviations from normal behavior as anomalous
  • 60. Classification Based Techniques
      - Advantages:
          - Supervised classification techniques
              - Models that can be easily understood
              - High accuracy in detecting many kinds of known anomalies
          - Semi-supervised classification techniques
              - Models that can be easily understood
              - Normal behavior can be accurately learned
      - Drawbacks:
          - Supervised classification techniques
              - Require labels from both the normal and the anomaly class
              - Cannot detect unknown and emerging anomalies
          - Semi-supervised classification techniques
              - Require labels from the normal class
              - Possible high false alarm rate: previously unseen (yet legitimate) data records may be recognized as anomalies
  • 61. Supervised Classification Techniques
      - Manipulating data records (oversampling / undersampling / generating artificial examples)
      - Rule based techniques
      - Model based techniques
          - Neural network based approaches
          - Support Vector Machine (SVM) based approaches
          - Bayesian network based approaches
      - Cost-sensitive classification techniques
      - Ensemble based algorithms (SMOTEBoost, RareBoost)
  • 62. Semi-supervised Classification Techniques
      - Use a modified classification model to learn the normal behavior and then detect any deviations from normal behavior as anomalous
      - Recent approaches:
          - Neural network based approaches
          - Support Vector Machine (SVM) based approaches
          - Markov model based approaches
          - Rule-based approaches
  • 63. Nearest Neighbor Based Techniques
      - Key assumption: normal points have close neighbors while anomalies are located far from other points
      - General two-step approach
          1. Compute the neighborhood for each data record
          2. Analyze the neighborhood to determine whether the data record is an anomaly or not
      - Categories:
          - Distance based methods
              - Anomalies are data points most distant from other points
          - Density based methods
              - Anomalies are data points in low density regions
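A distance based variant of the two-step approach can be sketched as below. It assumes the anomaly score of a record is the distance to its k-th nearest neighbor, which is one common choice among several; the brute-force search, the function names and the sample points are illustrative only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_scores(points, k):
    """Anomaly score per point: distance to its k-th nearest neighbor."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Step 1: compute the neighborhood (distances to all other points).
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        # Step 2: analyze it; a large k-th distance means an isolated point.
        scores.append(dists[k - 1])
    return scores


# Four made-up points close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = knn_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(most_anomalous, points[most_anomalous])
```

The brute-force neighbor search is quadratic in the number of records; larger data sets would normally use a spatial index.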
  • 64. Clustering Based Techniques
      - Key assumption: normal data records belong to large and dense clusters, while anomalies do not belong to any of the clusters or form very small clusters
      - Categorization according to labels
          - Semi-supervised: cluster normal data to create modes of normal behavior. If a new instance does not belong to any of the clusters, or is not close to any cluster, it is an anomaly
          - Unsupervised: post-processing is needed after the clustering step to determine the size of the clusters and the distance from the clusters that is required for a point to be an anomaly
      - Anomalies detected using clustering based methods can be:
          - Data records that do not fit into any cluster (residuals from clustering)
          - Small clusters
          - Low density clusters or local anomalies (far from other points within the same cluster)
  • 65. Clustering Based Techniques
      - Advantages:
          - No need to be supervised
          - Easily adaptable to on-line / incremental mode, suitable for anomaly detection from temporal data
      - Drawbacks:
          - Computationally expensive
              - Using indexing structures (k-d tree, R* tree) may alleviate this problem
          - If normal points do not create any clusters, the techniques may fail
          - In high-dimensional spaces, data is sparse and distances between any two data records may become quite similar
              - Clustering algorithms may not give any meaningful clusters
  • 66. Statistics Based Techniques
      - Data points are modeled using a stochastic distribution
          - Points are determined to be outliers depending on their relationship with this model
      - Advantage
          - Utilize existing statistical modeling techniques to model various types of distributions
      - Challenges
          - With high dimensions, it is difficult to estimate distributions
          - Parametric assumptions often do not hold for real data sets
  • 67. Types of Statistical Techniques
      - Parametric techniques
          - Assume that the normal (and possibly anomalous) data is generated from an underlying parametric distribution
          - Learn the parameters from the normal sample
          - Determine the likelihood of a test instance being generated from this distribution to detect anomalies
      - Non-parametric techniques
          - Do not assume any knowledge of parameters
          - Use non-parametric techniques to learn a distribution, e.g. Parzen window estimation
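The parametric approach can be sketched for a single numeric attribute. The sketch assumes the normal data follows a Gaussian distribution and flags a test instance when it lies more than three standard deviations from the mean, which is equivalent to thresholding the Gaussian likelihood. The threshold, function names and transaction amounts are illustrative assumptions.

```python
from statistics import NormalDist


def fit_normal_model(normal_sample):
    """Learn the Gaussian parameters (mean, stdev) from normal data only."""
    return NormalDist.from_samples(normal_sample)


def is_anomaly(model, x, threshold=3.0):
    """Flag x when it lies more than `threshold` standard deviations from the mean."""
    return abs(x - model.mean) / model.stdev > threshold


# Hypothetical transaction amounts known to be legitimate.
normal_amounts = [42.0, 38.5, 51.0, 47.2, 40.3, 44.8, 49.9, 45.1]
model = fit_normal_model(normal_amounts)

print(is_anomaly(model, 46.0))   # False
print(is_anomaly(model, 400.0))  # True
```

If the Gaussian assumption does not hold for the data, the same structure applies with a different distribution or a non-parametric density estimate.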
  • 68. Information Theory Based Techniques
      - Compute the information content in data using information theoretic measures, e.g., entropy, relative entropy, etc.
      - Key idea: outliers significantly alter the information content in a dataset
      - Approach: detect data instances that significantly alter the information content
          - Requires an information theoretic measure
      - Advantage
          - Operate in an unsupervised mode
      - Challenges
          - Require an information theoretic measure sensitive enough to detect irregularity induced by very few outliers
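One way to make the key idea concrete is a leave-one-out entropy check on a categorical attribute. This is a sketch under stated assumptions: it uses Shannon entropy as the measure and scores each instance by how much the entropy of the data set drops when that instance is removed. The function names and the transaction channels are made up.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def entropy_drops(values):
    """For each instance, the entropy reduction obtained by removing it."""
    base = entropy(values)
    return [base - entropy(values[:i] + values[i + 1:]) for i in range(len(values))]


# Nine point-of-sale transactions and one wire transfer (hypothetical).
channels = ["POS"] * 9 + ["WIRE"]
drops = entropy_drops(channels)
suspect = drops.index(max(drops))
print(suspect, channels[suspect])  # 9 WIRE
```

Removing the single wire transfer makes the data set perfectly uniform, so it produces the largest drop; removing any of the common values slightly increases the entropy instead.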
  • 69. Visualization Based Techniques
      - Use visualization tools to observe the data
      - Provide alternate views of data for manual inspection
      - Anomalies are detected visually
      - Advantages
          - Keeps a human in the loop
      - Disadvantages
          - Works well only for low-dimensional data
          - Can provide only aggregated or partial views for high-dimensional data
  • 70. Visual Data Mining*
      - Detecting telecommunication fraud
      - Display telephone call patterns as a graph
      - Use colors to identify fraudulent telephone calls (anomalies)
  • 71. Contextual Anomaly Detection
      - Detect contextual anomalies
      - General approach
          - Identify a context around a data instance (using a set of contextual attributes)
          - Determine if the data instance is anomalous w.r.t. the context (using a set of behavioral attributes)
      - Assumption
          - All normal instances within a context will be similar (in terms of behavioral attributes), while the anomalies will be different
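The general approach can be sketched with one contextual attribute and one behavioral attribute. The sketch assumes the context is a categorical value (here a month) and that "similar" means within two standard deviations of the context mean; the threshold, the function name and the spending figures are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def contextual_anomalies(records, threshold=2.0):
    """records: (context, value) pairs. Flag values far from their own context's mean."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        values = by_context[context]
        mu, sigma = mean(values), pstdev(values)
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged


# Hypothetical card spending: high amounts are usual in December, not in July.
december = [900, 950, 1000, 920, 980]
july = [200, 220, 210, 190, 205, 215, 195, 205, 950]
records = [("dec", v) for v in december] + [("jul", v) for v in july]

print(contextual_anomalies(records))  # [('jul', 950)]
```

The amount 950 is unremarkable among the December values but stands out in July, which is what distinguishes a contextual anomaly from a global one.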
  • 72. Contextual Attributes
      - Contextual attributes define a neighborhood (context) for each instance
      - For example:
          - Spatial context: latitude, longitude
          - Graph context: edges, weights
          - Sequential context: position, time
          - Profile context: user demographics
  • 73. Sequential Anomaly Detection
      - Detect anomalous sequences in a database of sequences, or
      - Detect anomalous subsequences within a sequence
      - Data is presented as a set of symbolic sequences
          - System call intrusion detection
          - Proteomics
          - Climate data
  • 74. Motivation for On-line Anomaly Detection
      - Data in many rare-event applications arrives continuously at an enormous pace
      - Analyzing such data as it arrives is a significant challenge
      - Examples of such rare-event applications:
          - Video analysis
          - Network traffic monitoring
          - Credit card fraudulent transactions
  • 75. Sentiment Analysis for Finance
      - Sentiment analysis is an emerging area where structured and unstructured data is analyzed to generate useful insights leading to improved performance.
      - Information obtained from multiple sources, including news wires, macroeconomic announcements, social media, microblogs (e.g., Twitter), and online search information such as Google Trends and Wikipedia, influences both business intelligence and performance evaluation.
      - This sentiment data can help investors and finance professionals to exploit the market and manage their risk exposure.
          - Stock market prediction
          - New product review
          - Stock trading
          - Customer brand building
  • 76. Sentiment Analysis in Finance
  • 78. Thank You