FDA_SAKEC2018.pptx

  1. Statistics: Unlocking the Power of Data (Lock5). Financial Data Analytics. Dr. M. Vijayalakshmi, VESIT. 4 January 2018, SAKEC, Mumbai.
  2. Financial Data. The financial industry has always been driven by data. Today, Big Data is prevalent at various levels of this field, ranging from the financial services sector to capital markets. The availability of Big Data in this domain has opened up new avenues for innovation and has offered immense opportunities for growth and sustainability. At the same time, it has presented several new challenges that must be overcome to gain the maximum value out of it.
  3. Financial Data Analytics in a Nutshell
  4. Motivation. There has been an explosion in the velocity, variety and volume of financial data. Social media activity, mobile interactions, server logs, real-time market feeds, customer service records, transaction details, information from existing databases: there is no end to the flood. To make sense of these giant data sets, companies are increasingly turning to data scientists, who are: (a) capturing and analyzing new sources of data, building predictive models and running live simulations of market events; (b) using technologies such as Hadoop, NoSQL and Storm to tap into non-traditional data sets (e.g., geolocation, sentiment data) and integrate them with more traditional numbers (e.g., trade data); (c) finding and storing increasingly diverse data in its raw form for future analysis. They have been aided by the development of cloud-based data storage and the surge of sophisticated (and sometimes free or open-source) analytics tools.
  5. Important Applications of Financial Data Analytics. (1) Predictive analytics / trading; (2) sentiment analysis; (3) financial fraud; (4) credit scoring and ratings; (5) pricing; (6) customer segmentation; (7) Know Your Customer.
  6. Sentiment Analysis. Sentiment analysis (also known as opinion mining) applies natural-language processing, text analysis and computational linguistics to source material to discover what people really think. Businesses such as MarketPsy Capital, Think Big Analytics and MarketPsych Data are using it to: (a) build algorithms around market sentiment data (e.g., Twitter feeds) that can short the market when disasters (e.g., storms, terrorist attacks) occur; (b) track trends, monitor the launch of new products, respond to issues and improve overall brand perception; (c) analyze unstructured voice recordings from call centers and recommend ways to reduce customer churn, up-sell and cross-sell products, and detect fraud. Some data companies are even acting as intermediaries, collecting and selling sentiment indicators to retail investors.
  7. Automated Credit Risk Management. Internet finance companies are finding ways to approve loans and manage risk. Aliloan (from Alibaba) is an automated online system that provides flexible micro-loans to entrepreneurial online vendors. To gauge whether a vendor is creditworthy, Alibaba collects data from its e-commerce and payment platforms and analyzes transaction records, customer ratings, shipping records and a host of other information. These findings are confirmed by third-party verification and cross-checked against external data sets (e.g., customs, tax data, electricity records). Once the loan is granted, Alibaba continues to monitor the use of funds and assess the business's strategic development. Entrepreneurs in emerging markets are also reaping the benefits. Like Aliloan, companies such as Kreditech and Lenddo provide automated small loans based on innovative credit scoring techniques. In these cases, much of the score is calculated from applicants' online social networking data.
  8. Real-Time Analytics. In the past, financial institutions were hampered by the lag between data collection and data analysis. Real-time analytics removes this lag and gives the industry new ways to: (a) Fight financial fraud: banks and credit card companies routinely analyze account balances, spending patterns, credit history, employment details, location and many other data points to determine whether transactions are above board. If suspicious activity is detected, they can immediately suspend the account and alert the owner. (b) Improve credit ratings: a continuous feed of online data means credit ratings can be updated in real time. This provides lenders with a more accurate picture of a customer's assets, business operations and transaction history. (c) Provide more accurate pricing: Progressive Insurance already tailors its policies to account for a customer's changing financial situation. In the Internet of Things, data from automobile sensors will also help insurance companies issue their policyholders with warnings about accidents, traffic jams and weather conditions. That makes for safer drivers and fewer payouts.
  9. Customer Segmentation. Like every other industry, banks and financial institutions are eager to know more about the people using their products and services. Although they already store a great deal of data, from credit scores to day-to-day transactions, they also look for it elsewhere. This kind of customer segmentation allows them to: offer customized products and services; improve existing profitable relationships and avoid customer churn; create better marketing campaigns and more attractive product offerings; tailor product development to specific customer segments.
  10. Predictive Analytics. By combining segmentation with predictive analytics, companies can also cut down on risk. For example, to decide whether certain customers are likely to pay off their credit cards, some major banks use technology developed by the company Sqrrl. This analysis takes into account the demographic characteristics of customers' neighborhoods and makes calculated predictions. Similar strides have been made in forecasting market behavior. Around 2009, high-frequency trading (HFT), the speedy exchange of securities, was hugely lucrative. With competition came a drop in profits and the need for a new strategy. HFT traders adapted by employing strategic sequential trading, using big data analytics to identify specific market participants and anticipate their future actions. In a field of breakneck speed, this gives HFT traders a clear advantage. Separately, researchers studying search volume data provided by Google Trends were able to identify online precursors of stock market moves. Their results suggest that increases in search volume for financially relevant search terms usually precede big losses in financial markets.
  11. Analytics of Financial Time Series. The vast majority of financial data occurs in the form of a time series: stock prices (ticker data), asset prices, customer numbers, etc. Financial data analytics therefore places a lot of importance on financial time series analytics.
  12. Examples of Financial Time Series. Daily log returns of Apple stock, 2007 to 2016 (10 years); BSE index; quarterly earnings of the Coca-Cola Company, 1983-2009 (a seasonal time series, useful in earnings forecasts, pricing weather-related derivatives (e.g., energy), and modeling intraday behavior of asset returns); exchange rate between the US dollar and the rupee (Re); size of insurance claims (values); high-frequency financial data: tick-by-tick stock data, etc.
  13. Mining Time-Series Data. A time series is a sequence of data points, typically measured at successive times and spaced at (often uniform) time intervals. Time series analysis, a subfield of statistics, comprises methods that attempt to understand such time series, either to understand the underlying context of the data points or to make forecasts (predictions). Methods for time series analysis: frequency-domain methods (model-free analyses, well suited to exploratory investigations; spectral analysis vs. wavelet analysis); time-domain methods (auto-correlation and cross-correlation analysis); motif-based time-series analysis. Applications: financial (stock price, inflation); industry (power consumption); scientific (experiment results); meteorological (precipitation).
  14. Time-Series Data Analysis: Prediction and Regression Analysis. (Numerical) prediction is similar to classification: construct a model, then use the model to predict a continuous or ordered value for a given input. Prediction is different from classification: classification refers to predicting a categorical class label, whereas prediction models continuous-valued functions. Major method for prediction: regression, which models the relationship between one or more independent (predictor) variables and a dependent (response) variable. Regression analysis: linear and multiple regression; non-linear regression; other regression methods (generalized linear model, Poisson regression, log-linear models, regression trees).
  15. What is Regression? Modeling the relationship between one response variable and one or more predictor variables, and analyzing the confidence of the model. E.g., height vs. weight.
  16. Regression Yields Analytical Model. Discrete data points → analytical model: general relationship; easy calculation; further analysis. Application: prediction.
  17. Application: Detrending. Obtain the trend for an irregular data series; subtract the trend; reveal the oscillations.
  18. Linear Regression: Single Predictor. The model is linear: $y = w_0 + w_1 x$, where $y$ is the response variable, $x$ is the predictor variable, and $w_0$ (y-intercept) and $w_1$ (slope) are regression coefficients. Method of least squares: $w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}$, $w_0 = \bar{y} - w_1 \bar{x}$.
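
As a minimal sketch of the least-squares formulas for a single predictor, the following computes the slope and intercept directly from the sums of deviations. The function name `fit_line` and the sample data are illustrative assumptions, not part of the original slides.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (w0, w1) minimizing the squared error of y = w0 + w1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Numerator and denominator of the least-squares slope
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    w1 = sxy / sxx
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Hypothetical data lying exactly on y = 1 + 2x
w0, w1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

On data that lies exactly on a line, the fit recovers that line; on noisy data it returns the line with the smallest sum of squared vertical errors.
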
  19. Linear Regression: Multiple Predictors. Training data is of the form $(X_1, y_1), (X_2, y_2), \ldots, (X_{|D|}, y_{|D|})$. E.g., for 2-D data, $y = w_0 + w_1 x_1 + w_2 x_2$. Solvable by: an extension of the least-squares method, $(X^T X) W = X^T Y \Rightarrow W = (X^T X)^{-1} X^T Y$; or commercial software (SAS, S-Plus).
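
The multiple-predictor case can be sketched by building the normal equations $(X^T X) W = X^T Y$ and solving them. This is a small pure-Python illustration under stated assumptions: the helper names `solve` and `fit_multiple` and the data are hypothetical, and a numerical library would normally be used instead of hand-written elimination.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_multiple(rows, ys):
    """Least-squares weights [w0, w1, ..., wk] for y = w0 + w1*x1 + ... + wk*xk."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, 4, 6, 8]
weights = fit_multiple(rows, ys)
```

Solving the linear system directly avoids forming the explicit inverse $(X^T X)^{-1}$, which is usually the more stable choice.
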
  20. Nonlinear Regression with Linear Method. Polynomial regression model: e.g., $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$. Let $x_2 = x^2$ and $x_3 = x^3$; then $y = w_0 + w_1 x + w_2 x_2 + w_3 x_3$, which is linear in the new variables. Log-linear regression model: e.g., $y = \exp(w_0 + w_1 x + w_2 x_2 + w_3 x_3)$. Let $y' = \log(y)$; then $y' = w_0 + w_1 x + w_2 x_2 + w_3 x_3$.
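
To illustrate the log transform, here is a hedged sketch that fits the one-predictor log-linear model $y = \exp(w_0 + w_1 x)$ by running ordinary least squares on $y' = \log(y)$. The function name `fit_log_linear` and the data are assumptions for illustration. Note that this minimizes squared error in log space, not in the original units.

```python
import math
from statistics import mean

def fit_log_linear(xs, ys):
    """Fit y = exp(w0 + w1 * x) by least squares on y' = log(y)."""
    if any(y <= 0 for y in ys):
        raise ValueError("log transform requires strictly positive y")
    log_ys = [math.log(y) for y in ys]
    x_bar, ly_bar = mean(xs), mean(log_ys)
    w1 = (sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(xs, log_ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = ly_bar - w1 * x_bar
    return w0, w1

# Hypothetical data generated from y = exp(0.5 + 0.25 * x)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 + 0.25 * x) for x in xs]
w0, w1 = fit_log_linear(xs, ys)
```
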
  21. Generalized Linear Regression. Response $y$: its distribution function is in the exponential family; the variance of $y$ depends on $E(y)$ and is not constant. $E(y) = g^{-1}(w_0 + w_1 x + w_2 x_2 + w_3 x_3)$. Examples: logistic regression (binomial regression), for the probability of some event occurring; Poisson regression, e.g., for the number of customers. References: Nelder and Wedderburn, 1972; McCullagh and Nelder, 1989.
  22. Regression Tree (Breiman et al., 1984). Partition the domain space. Leaf: (1) a continuous-valued prediction; (2) the average value.
  23. Model Tree. Leaf: a linear equation. More general than a regression tree. Figure source: http://datamining.ihe.nl/research/model-trees.htm
  24. Regression Trees and Model Trees. Regression tree: proposed in the CART system (Breiman et al., 1984). CART: Classification And Regression Trees. Each leaf stores a continuous-valued prediction, which is the average value of the predicted attribute for the training tuples that reach the leaf. Model tree: proposed by Quinlan (1992). Each leaf holds a regression model, a multivariate linear equation for the predicted attribute; this is a more general case than the regression tree. Regression and model trees tend to be more accurate than linear regression when the data cannot be represented well by a simple linear model.
  25. A time series can be illustrated as a time-series graph, which describes a point moving with the passage of time.
  26. Categories of Time-Series Movements. Long-term or trend movements (trend curve): the general direction in which a time series is moving over a long interval of time. Cyclic movements or cycle variations: long-term oscillations about a trend line or curve, e.g., business cycles; these may or may not be periodic. Seasonal movements or seasonal variations: almost identical patterns that a time series appears to follow during corresponding months of successive years. Irregular or random movements. Time series analysis: decomposition of a time series into these four basic movements. Additive model: $TS = T + C + S + I$. Multiplicative model: $TS = T \times C \times S \times I$.
  27. Estimation of Trend Curve. The freehand method: fit the curve by looking at the graph; costly and barely reliable for large-scale data mining. The least-squares method: find the curve minimizing the sum of the squares of the deviations of points on the curve from the corresponding data points. The moving-average method.
  28. Moving Average. A moving average of order n: smooths the data; eliminates cyclic, seasonal and irregular movements; loses the data at the beginning or end of a series; is sensitive to outliers (this can be reduced by a weighted moving average).
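
A small sketch of a moving average of order n and its weighted variant follows. The function names and the sample series are illustrative assumptions; the weights `[1, 4, 1]` are just one possible choice that emphasizes the center of each window.

```python
def moving_average(series, n):
    """Simple moving average of order n; returns len(series) - n + 1 values."""
    if not 1 <= n <= len(series):
        raise ValueError("order n must be between 1 and len(series)")
    return [sum(series[i:i + n]) / n for i in range(len(series) - n + 1)]

def weighted_moving_average(series, weights):
    """Weighted moving average; weights apply left to right within each window."""
    n, total = len(weights), sum(weights)
    if not 1 <= n <= len(series) or total == 0:
        raise ValueError("need 1 <= len(weights) <= len(series) and nonzero weight sum")
    return [sum(w * v for w, v in zip(weights, series[i:i + n])) / total
            for i in range(len(series) - n + 1)]

data = [3, 7, 2, 0, 4, 5, 9, 7, 2]             # hypothetical series
ma3 = moving_average(data, 3)                   # [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
wma3 = weighted_moving_average(data, [1, 4, 1]) # [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]
```

Both outputs are shorter than the input by n - 1 values, which is the loss of data at the ends of the series that the slide mentions.
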
  29. Trend Discovery in Time-Series (1): Estimation of Seasonal Variations. Seasonal index: a set of numbers showing the relative values of a variable during the months of the year. E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are the seasonal index numbers for these months. Deseasonalized data: data adjusted for seasonal variations for better trend and cyclic analysis; divide the original monthly data by the seasonal index numbers for the corresponding months.
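
The seasonal index and the deseasonalizing step can be sketched as below. This assumes the simple "percentage of the average period" definition described on the slide; the function names and the four-period toy data are hypothetical (a real monthly series would have twelve keys).

```python
from statistics import mean

def seasonal_indexes(by_period):
    """Seasonal index per period: 100 * (period average) / (average of period averages).

    `by_period` maps a period label (e.g., a month number) to the list of
    observations for that period across years.
    """
    period_avg = {p: mean(vals) for p, vals in by_period.items()}
    overall = mean(list(period_avg.values()))
    return {p: 100 * avg / overall for p, avg in period_avg.items()}

def deseasonalize(value, period, indexes):
    """Divide an observation by its period's index (index 100 means 'average')."""
    return 100 * value / indexes[period]

# Hypothetical sales for four periods observed over two years
sales = {1: [80, 80], 2: [120, 120], 3: [100, 100], 4: [100, 100]}
indexes = seasonal_indexes(sales)
adjusted = deseasonalize(120, 2, indexes)
```

Here period 2 has index 120, so an observation of 120 in that period deseasonalizes to the overall average level of 100.
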
  30. Seasonal Index [figure: seasonal index by month, months 1 to 12; slide footer: Data Mining: Concepts and Techniques, February 2, 2023]. Raw data from http://www.bbk.ac.uk/manop/man/docs/QII_2_2003%20Time%20series.pdf
  31. Trend Discovery in Time-Series (2). Estimation of cyclic variations: if (approximate) periodicity of cycles occurs, a cyclic index can be constructed in much the same manner as seasonal indexes. Estimation of irregular variations: by adjusting the data for trend, seasonal and cyclic variations. With systematic analysis of the trend, cyclic, seasonal, and irregular components, it is possible to make long- or short-term predictions of reasonable quality.
  32. Similarity Search in Time-Series Analysis. A normal database query finds exact matches. Similarity search finds data sequences that differ only slightly from the given query sequence. Two categories of similarity queries: whole matching (find a sequence that is similar to the query sequence); subsequence matching (find all pairs of similar sequences). Typical applications: financial markets; market basket data analysis; scientific databases; medical diagnosis.
  33. Data Transformation. Many techniques for signal analysis require the data to be in the frequency domain. Usually data-independent transformations are used, where the transformation matrix is determined a priori: discrete Fourier transform (DFT); discrete wavelet transform (DWT). The distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain.
  34. Discrete Fourier Transform. DFT does a good job of concentrating energy in the first few coefficients. If we keep only the first few coefficients of the DFT, we can compute lower bounds of the actual distance. Feature extraction: keep the first few coefficients (F-index) as representative of the sequence.
  35. DFT (continued). Parseval's Theorem: $\sum_{t=0}^{n-1} |x_t|^2 = \sum_{f=0}^{n-1} |X_f|^2$. The Euclidean distance between two signals in the time domain is the same as their distance in the frequency domain. Keeping only the first few (say, 3) coefficients underestimates the distance, so there will be no false dismissals: $\sum_{f=0}^{3} |F(S)[f] - F(Q)[f]|^2 \le \sum_{t=0}^{n} |S[t] - Q[t]|^2$.
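
The distance-preservation property and the lower bound from truncation can be checked numerically. The sketch below assumes a unitary DFT (scaled by $1/\sqrt{n}$), since Parseval's identity holds without extra constants only under that scaling; the sequences and function names are made up for illustration.

```python
import cmath
import math

def dft(x):
    """Unitary DFT (scaled by 1/sqrt(n)) so that Euclidean distances are preserved."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)) / math.sqrt(n)
            for f in range(n)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length (possibly complex) sequences."""
    return sum(abs(u - v) ** 2 for u, v in zip(a, b))

s = [1.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 7.0]  # hypothetical stored sequence
q = [1.5, 2.5, 3.0, 3.5, 4.0, 5.5, 6.5, 6.0]  # hypothetical query sequence
S, Q = dft(s), dft(q)

time_dist = sq_dist(s, q)
freq_dist = sq_dist(S, Q)          # matches time_dist up to rounding (Parseval)
truncated = sq_dist(S[:3], Q[:3])  # first 3 coefficients only: never exceeds time_dist
```

Because the truncated distance only drops non-negative terms, any sequence rejected by it is truly farther than the threshold, which is why an index on the first few coefficients produces false matches but no false dismissals.
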
  36. Multidimensional Indexing in Time-Series. Multidimensional index construction: constructed for efficient access using the first few Fourier coefficients. Similarity search: use the index to retrieve the sequences that are at most a certain small distance away from the query sequence; then post-process by computing the actual distance between sequences in the time domain and discard any false matches.
  37. Subsequence Matching. Break each sequence into a set of windows of length w. Extract the features of the subsequence inside each window. Map each sequence to a "trail" in the feature space. Divide the trail of each sequence into "subtrails" and represent each of them with a minimum bounding rectangle. Use a multi-piece assembly algorithm to search for longer sequence matches.
  38. Analysis of Similar Time Series
  39. Enhanced Similarity Search Methods. Allow for gaps within a sequence or differences in offsets or amplitudes. Normalize sequences with amplitude scaling and offset translation. Two subsequences are considered similar if one lies within an envelope of $\epsilon$ width around the other, ignoring outliers. Two sequences are said to be similar if they have enough non-overlapping time-ordered pairs of similar subsequences. Parameters specified by a user or expert: sliding window size, width of the envelope for similarity, maximum gap, and matching fraction.
  40. Steps for Performing a Similarity Search. Atomic matching: find all pairs of gap-free windows of a small length that are similar. Window stitching: stitch similar windows to form pairs of large similar subsequences, allowing gaps between atomic matches. Subsequence ordering: linearly order the subsequence matches to determine whether enough similar pieces exist.
  41. Similar Time Series Analysis. VanEck International Fund; Fidelity Selective Precious Metal and Mineral Fund. Two similar mutual funds in different fund groups.
  42. Sequence Distance. A function that measures how different two sequences are (possibly of unequal length). Example: Euclidean distance between time series Q and C: $D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$.
  43. Motif: Basic Concepts. What is a motif? A previously unknown, frequently occurring sequential pattern. Match: given subsequences Q, C ⊆ T, C is a match for Q iff $D(Q, C) \le R$ for some range R. Non-trivial match: let C = T[p..*] and Q = T[q..*], where C matches Q. If p = q, or there is no non-match N = T[s..*] with s between p and q, the match is trivial; a non-trivial match requires C and Q to be separated by a non-match. 1-motif: the subsequence with the most non-trivial matches (least variance decides ties). k-motif: $C_k$ such that $D(C_k, C_i) > 2R$ for all $i \in [1, k)$.
  44. SAX: Symbolic Aggregate approXimation. Dimensionality reduction / compression. SAX: ℝ → Σ; SAX: [time series] ↦ ccbaabbbabcbcb. Essentially an alphabet over the Piecewise Aggregate Approximation (PAA) rank. Faster, simpler, more compression, yet on par with DFT, DWT and other dimensionality reductions.
  45. SAX Illustration
  46. SAX Algorithm. Parameters: alphabet size, word (segment) length (or output rate). (1) Select a probability distribution for the time series (TS). (2) z-normalize the TS. (3) PAA: within each time interval, calculate the aggregated value (mean) of the segment. (4) Partition the TS range by equal-area partitioning of the PDF into n partitions (equal-frequency binning). (5) Label each segment with $a_{rank} \in \Sigma$ for the aggregate's corresponding partition rank.
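
The five SAX steps can be sketched as follows. This is a minimal illustration assuming a standard normal distribution for step 1 and a series length that is a multiple of the word length; the function name `sax` and the sample series are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev

def sax(series, word_length, alphabet="abcd"):
    """Convert a numeric series to a SAX word, assuming a standard normal distribution."""
    n = len(series)
    if n % word_length != 0:
        raise ValueError("this sketch assumes len(series) is a multiple of word_length")
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("constant series cannot be z-normalized")
    z = [(v - mu) / sigma for v in series]                # step 2: z-normalize
    seg = n // word_length
    paa = [mean(z[i:i + seg]) for i in range(0, n, seg)]  # step 3: PAA segment means
    a = len(alphabet)
    # step 4: breakpoints cutting N(0, 1) into `a` equal-probability regions
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    # step 5: label each segment by the region its mean falls into
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)

word = sax([1, 1, 2, 2, 9, 9, 10, 10], word_length=4, alphabet="abcd")  # "aadd"
```

The eight-point series is compressed to a four-letter word: the two low segments map to the lowest symbol and the two high segments to the highest.
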
  47. Finding Motifs in a Time Series. EMMA algorithm: finds the 1-(k-)motif of fixed length n. SAX compression (dimensionality reduction): possible to store D(i,j) ∀(i,j) ∈ Σ × Σ; allows use of various distance measures (Minkowski, Dynamic Time Warping). Multiple tiers. Tier 1: uses a sliding window to hash length-w SAX subsequences ($a^w$ addresses, total size O(m)). Bucket B with the most collisions and buckets with MINDIST(B) < R form the neighborhood of B. Tier 2: the neighborhood is pruned using the more precise ADM algorithm. The $N_i$ with the most matches is the 1-motif. Early stop if |ADM matches| > $\max_{k>i}(|\mathrm{neighborhood}_k|)$.
  48. Hashing [figure: SAX words such as "cecabbcbac" and their numeric encodings, with labels w and n]
  49. Classification in Time Series. Application: finance. 1-nearest neighbor. Pros: accurate, robust, simple. Cons: time and space complexity (lazy learning); results are not interpretable.
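
A 1-nearest-neighbor classifier for time series can be sketched in a few lines. The choice of Euclidean distance, the labels, and the training series are assumptions made for illustration; other distance measures could be substituted.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_1nn(query, training):
    """Return the label of the training series closest to `query`.

    `training` is a list of (series, label) pairs. Every query scans all of
    them, which is the time and space cost of lazy learning.
    """
    _, label = min(training, key=lambda pair: euclidean(query, pair[0]))
    return label

# Hypothetical labeled series
training = [
    ([1, 2, 3, 4, 5], "uptrend"),
    ([5, 4, 3, 2, 1], "downtrend"),
    ([3, 3, 3, 3, 3], "flat"),
]
predicted = classify_1nn([2, 2, 3, 5, 5], training)  # "uptrend"
```
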
  50. Financial Data Applications. Fraud Detection: Anomaly Analysis
  51. What are Anomalies? An anomaly is a pattern in the data that does not conform to the expected behavior. Also referred to as outliers, exceptions, peculiarities, surprises, etc. Anomalies translate to significant (often critical) real-life entities: cyber intrusions; credit card fraud.
  52. Real-World Anomalies. Credit card fraud: an abnormally high purchase made on a credit card. Cyber intrusions: a web server involved in FTP traffic.
  53. Simple Example. N1 and N2 are regions of normal behavior. Points o1 and o2 are anomalies. Points in region O3 are anomalies. [figure: scatter plot in the X-Y plane showing N1, N2, o1, o2 and O3]
  54. Related Problems. Rare class mining; chance discovery; novelty detection; exception mining; noise removal; Black Swan*.
  55. Key Challenges. Defining a representative normal region is challenging. The boundary between normal and outlying behavior is often not precise. The exact notion of an outlier is different for different application domains. Availability of labeled data for training/validation. Malicious adversaries. Data might contain noise. Normal behavior keeps evolving.
  56. Data Labels. Supervised anomaly detection: labels are available for both normal data and anomalies; similar to rare class mining. Semi-supervised anomaly detection: labels are available only for normal data. Unsupervised anomaly detection: no labels are assumed; based on the assumption that anomalies are very rare compared to normal data.
  57. Applications of Anomaly Detection. Insurance / credit card fraud detection; anti-money laundering (AML) fraud; identity theft and fake account registration; risk modeling; account takeover; promotion credit abuse; customer behavior analytics; cyber security.
  58. Fraud Detection. Fraud detection refers to the detection of criminal activities occurring in commercial organizations. Malicious users might be actual customers of the organization or might be posing as customers (also known as identity theft). Types of fraud: credit card fraud; insurance claim fraud; mobile / cell phone fraud; insider trading. Challenges: fast and accurate real-time detection; misclassification cost is very high.
  59. Classification Based Techniques. Main idea: build a classification model for normal (and anomalous (rare)) events based on labeled training data, and use it to classify each new unseen event. Classification models must be able to handle skewed (imbalanced) class distributions. Categories: supervised classification techniques (require knowledge of both the normal and the anomaly class; build a classifier to distinguish between normal and known anomalies); semi-supervised classification techniques (require knowledge of the normal class only; use a modified classification model to learn the normal behavior and then detect any deviations from it as anomalous).
  60. Classification Based Techniques. Advantages: supervised classification techniques (models that can be easily understood; high accuracy in detecting many kinds of known anomalies); semi-supervised classification techniques (models that can be easily understood; normal behavior can be accurately learned). Drawbacks: supervised classification techniques (require labels from both the normal and the anomaly class; cannot detect unknown and emerging anomalies); semi-supervised classification techniques (require labels from the normal class; possible high false alarm rate, since previously unseen yet legitimate data records may be recognized as anomalies).
  61. Supervised Classification Techniques. Manipulating data records (oversampling / undersampling / generating artificial examples). Rule-based techniques. Model-based techniques: neural network based approaches; support vector machine (SVM) based approaches; Bayesian network based approaches. Cost-sensitive classification techniques. Ensemble-based algorithms (SMOTEBoost, RareBoost).
  62. Semi-supervised Classification Techniques. Use a modified classification model to learn the normal behavior and then detect any deviations from normal behavior as anomalous. Recent approaches: neural network based approaches; support vector machine (SVM) based approaches; Markov model based approaches; rule-based approaches.
  63. Nearest Neighbor Based Techniques. Key assumption: normal points have close neighbors, while anomalies are located far from other points. General two-step approach: (1) compute the neighborhood of each data record; (2) analyze the neighborhood to determine whether the data record is an anomaly or not. Categories: distance-based methods (anomalies are the data points most distant from other points); density-based methods (anomalies are data points in low-density regions).
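
A distance-based variant of the two-step approach can be sketched as below: the anomaly score of a record is its distance to its k-th nearest neighbor. The scoring rule, the value of k, and the sample points are illustrative assumptions; this brute-force version is quadratic in the number of points.

```python
import math

def knn_scores(points, k):
    """Anomaly score of each point = distance to its k-th nearest neighbor."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    scores = []
    for i, p in enumerate(points):
        # Step 1: neighborhood = sorted distances to all other points
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        # Step 2: a large k-th neighbor distance suggests an isolated point
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D records: a tight group plus one distant point
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (8.0, 8.0)]
scores = knn_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)  # index 4
```
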
  64. Clustering Based Techniques. Key assumption: normal data records belong to large and dense clusters, while anomalies do not belong to any of the clusters or form very small clusters. Categorization according to labels: semi-supervised (cluster normal data to create modes of normal behavior; if a new instance does not belong to any of the clusters, or is not close to any cluster, it is an anomaly); unsupervised (post-processing is needed after the clustering step to determine the size of the clusters and the distance from the clusters that a point must exceed to be considered an anomaly). Anomalies detected using clustering based methods can be: data records that do not fit into any cluster (residuals from clustering); small clusters; low-density clusters or local anomalies (far from other points within the same cluster).
  65. Clustering Based Techniques. Advantages: no need to be supervised; easily adaptable to on-line / incremental mode, suitable for anomaly detection from temporal data. Drawbacks: computationally expensive (using indexing structures such as a k-d tree or R* tree may alleviate this problem); if normal points do not create any clusters, the techniques may fail; in high-dimensional spaces, data is sparse and distances between any two data records may become quite similar, so clustering algorithms may not give any meaningful clusters.
  66. Statistics Based Techniques. Data points are modeled using a stochastic distribution; points are determined to be outliers depending on their relationship with this model. Advantage: utilize existing statistical modeling techniques to model various types of distributions. Challenges: with high dimensions, it is difficult to estimate distributions; parametric assumptions often do not hold for real data sets.
  67. Types of Statistical Techniques. Parametric techniques: assume that the normal (and possibly anomalous) data is generated from an underlying parametric distribution; learn the parameters from the normal sample; determine the likelihood of a test instance being generated from this distribution to detect anomalies. Non-parametric techniques: do not assume any knowledge of parameters; use non-parametric techniques to learn a distribution, e.g., Parzen window estimation.
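
The parametric route can be sketched with a single Gaussian: learn the mean and standard deviation from a sample of normal behavior, then flag test values that are improbably far from the mean. The Gaussian assumption, the three-standard-deviation threshold, and the transaction amounts are illustrative choices, not something the slides prescribe.

```python
from statistics import NormalDist

def fit_normal(normal_sample):
    """Learn the parameters (mean, standard deviation) from normal-only data."""
    return NormalDist.from_samples(normal_sample)

def is_anomaly(value, model, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    return abs(value - model.mean) / model.stdev > threshold

# Hypothetical transaction amounts known to be legitimate
amounts = [42.0, 38.5, 45.0, 40.0, 41.5, 39.0, 43.5, 44.0]
model = fit_normal(amounts)
flags = [is_anomaly(v, model) for v in (41.0, 250.0)]  # [False, True]
```

If real data is skewed or multi-modal, a single Gaussian is a poor fit, which is the challenge the slides raise about parametric assumptions.
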
  68. Information Theory Based Techniques. Compute the information content of the data using information-theoretic measures, e.g., entropy, relative entropy, etc. Key idea: outliers significantly alter the information content of a dataset. Approach: detect data instances that significantly alter the information content; this requires an information-theoretic measure. Advantage: operates in an unsupervised mode. Challenge: requires an information-theoretic measure sensitive enough to detect the irregularity induced by very few outliers.
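
One way to make the key idea concrete is to measure how much the Shannon entropy of a categorical attribute drops when a single record is removed. This leave-one-out scoring is an assumption chosen for illustration, as are the function names and the toy channel labels.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def entropy_reduction(values):
    """For each index, how much entropy drops when that one record is removed."""
    base = entropy(values)
    return [base - entropy(values[:i] + values[i + 1:]) for i in range(len(values))]

# Hypothetical transaction channels: seven point-of-sale records, one ATM record
records = ["pos", "pos", "pos", "pos", "pos", "pos", "pos", "atm"]
gains = entropy_reduction(records)
suspect = max(range(len(records)), key=gains.__getitem__)  # index 7
```

Removing the lone "atm" record makes the data perfectly uniform, so it yields the largest entropy reduction, while removing any "pos" record slightly increases entropy.
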
  69. Visualization Based Techniques. Use visualization tools to observe the data. Provide alternate views of the data for manual inspection. Anomalies are detected visually. Advantage: keeps a human in the loop. Disadvantages: works well only for low-dimensional data; can provide only aggregated or partial views of high-dimensional data.
  70. Visual Data Mining*. Detecting telecommunication fraud. Display telephone call patterns as a graph. Use colors to identify fraudulent telephone calls (anomalies).
  71. Contextual Anomaly Detection. Detect contextual anomalies. General approach: identify a context around a data instance (using a set of contextual attributes); determine whether the data instance is anomalous with respect to the context (using a set of behavioral attributes). Assumption: all normal instances within a context will be similar (in terms of behavioral attributes), while the anomalies will be different.
  72. Contextual Attributes. Contextual attributes define a neighborhood (context) for each instance. For example: spatial context (latitude, longitude); graph context (edges, weights); sequential context (position, time); profile context (user demographics).
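
The general approach can be sketched by grouping records on a contextual attribute and scoring the behavioral attribute only against records that share the same context. The z-score rule, the threshold of 2.0, and the monthly spending figures below are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def contextual_anomalies(records, threshold=2.0):
    """Return records whose behavioral value deviates strongly within their own context.

    `records` is a list of (context, value) pairs: `context` is the contextual
    attribute (here a month), `value` the behavioral attribute (here spending).
    """
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)
    flagged = []
    for context, value in records:
        vals = by_context[context]
        sigma = pstdev(vals)
        if sigma > 0 and abs(value - mean(vals)) / sigma > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical spending: 400 is routine in December but unusual in July
records = [("Jul", v) for v in (100, 110, 90, 105, 95, 100, 400)]
records += [("Dec", v) for v in (400, 420, 380, 410, 390, 400)]
flagged = contextual_anomalies(records)  # [("Jul", 400)]
```

The same value of 400 is flagged in one context and accepted in the other, which is what distinguishes a contextual anomaly from a global outlier.
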
  73. Sequential Anomaly Detection. Detect anomalous sequences in a database of sequences, or detect an anomalous subsequence within a sequence. Data is presented as a set of symbolic sequences: system call intrusion detection; proteomics; climate data.
  74. Motivation for On-line Anomaly Detection. Data in many rare-event applications arrives continuously at an enormous pace, and analyzing such data is a significant challenge. Examples of such rare-event applications: video analysis; network traffic monitoring; fraudulent credit card transactions.
  75. Sentiment Analysis for Finance. Sentiment analysis is an emerging area where structured and unstructured data is analyzed to generate useful insights leading to improved performance. Information obtained from multiple sources, including news wires, macroeconomic announcements, social media, microblogs / Twitter, and online (search) information such as Google Trends and Wikipedia, influences both business intelligence and performance evaluation. This sentiment data can help investors and finance professionals to exploit the market and manage their risk exposure. Applications: stock market prediction; new product review; stock trading; customer brand building.
  76. Sentiment Analysis in Finance
  78. Thank You
