StreamAnalytix sponsored a meetup on “Anomaly Detection Techniques and Implementation using Apache Spark” on Tuesday, December 5, 2017, at the Larkspur Landing Milpitas Hotel in Milpitas, CA. The meetup was led by Maxim Shkarayev, Lead Data Scientist, Impetus Technologies, along with Punit Shah, Solution Architect, StreamAnalytix, and Anand Venugopal, Product Head & AVP, StreamAnalytix. They introduced and summarized the broad field of anomaly detection and its applications to industry problems. The speakers also offered a structured approach to choosing the right anomaly detection technique based on the specific use case and data characteristics, followed by a demonstration of real-world anomaly detection use cases on an Apache Spark-based analytics platform.
5. • Real-time C360 and Churn
• Next Best Offer or Action
• Streaming ETL
• IoT and Log Analytics
• Fraud, Risk Anomaly detection
• Anomaly detection
• Predictive Maintenance
Enabling the Real-time Enterprise
Delightful Customer Experiences
Maximizing operational efficiency with real-time insights
6. Build and Deploy use-cases fast
Pre-built ETL, Analytics, Read-write operators
Drag and Drop visual development and DevOps
Fast Data and Big Data; On-premise and Cloud
“I could do my 1.5 month Spark app in 1.5 days with this product”
- Analytics Lead at Tier 1 US Telco
7. Impetus Data Science Practice – Relevant Use-cases
Banking and Finance
• Data Analytics & Modeling: Finding fraudulent travel and expenses
• Text Mining & NLP: Intent to Fraud Detection in e-coms
• Graph Analytics: Business impact of customer loss
Insurance
• Data Analytics & Modeling: Insurance premium determination using Catastrophe Modeling
• Text Mining & NLP: Detecting intent to commit fraud in e-communications (AML, Dodd-Frank, etc.)
Communication and Media
• Data Analytics & Modeling: Finding root cause of No Dial Tone; Self-learning Anomaly Detection System
• Marketing Analytics: Lead generation and Multi-touch Attribution for increasing conversion rates
Manufacturing and Logistics
• Data Analytics & Modeling: Lowering rejection rate of silicon wafers for a semiconductor company; early detection of paint defects for a leading auto manufacturer; correlating multiple data sources to identify factors related to warranty issues
Energy & Utilities
• Data Analytics & Modeling: Reinforcement Learning model to enable bidding of electricity (price and quantity)
• Information Extraction: Extract label information from P&IDs and make them searchable; create a Bill of Materials for budgeting
Healthcare
• Data Analytics & Modeling: Predicting Patient Readmission
• Text Mining & NLP: Competitive analysis of medicines
• Graph Analytics: Drug-disease co-occurrence with Medline
8. Anomaly Definition
Anomaly: an observation that deviates greatly from most of the other observations, i.e., a data point, behavior, or pattern that appears statistically unusual
Basic qualities of anomaly:
1. Rare
2. Significantly different from others
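One hedged way to make "rare" and "significantly different" concrete for a single numeric feature is the modified z-score, which measures distance from the median in units of the median absolute deviation (MAD). This is a minimal sketch and not a method shown in the talk; the function names, the sample readings, and the 3.5 cutoff (a commonly quoted rule of thumb) are assumptions.

```python
from statistics import median

def modified_z_scores(values):
    """Return the modified z-score (median/MAD based) of each value."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; scores are undefined for this data")
    # 0.6745 rescales the MAD so scores are comparable to ordinary
    # z-scores when the data is roughly normal.
    return [0.6745 * (v - med) / mad for v in values]

def flag_anomalies(values, cutoff=3.5):
    """Return the indices whose absolute modified z-score exceeds the cutoff."""
    scores = modified_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > cutoff]

# Hypothetical sensor readings: one value is far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]
flagged = flag_anomalies(readings)
print(flagged)
```

The median and MAD are used instead of the mean and standard deviation because the anomaly itself would otherwise inflate the statistics it is compared against.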
9. Impetus DSP – Some Applications of Anomaly Detection
The problem of finding patterns in data that do not conform to expected behaviour
• Manufacturing: Detect abnormal machine behavior to prevent cost overruns
• Finance, Insurance: Detect and prevent out-of-pattern or fraudulent spend and travel expenses
• Healthcare: Detect fraud in claims and payments; events from RFID and mobiles
• Banking: Flag abnormally high purchases or deposits, detect cyber intrusions
• Networking: Detect intrusion into networks, prevent theft of source code or IP
• Social Media: Detect compromised accounts and bots that generate fake reviews
• Video Surveillance: Detect or track objects and persons of interest in monotonous footage
• Smart Homes: Detect energy leakage, standardize smart sensor datasets
• Telecom: Detect roaming abuse, revenue fraud, service disruptions, etc.
• Transportation: Ensure external communications to the vehicle are not intrusions
10. Deep Dive on Anomaly Detection
Thought Leadership / Advisory
11. Anomaly Detection Algorithms Across Disciplines
Host-based IDS
• Statistical Profiling using histograms
• Mixture of Models, Neural Networks
• SVM, Rule-based systems
Network Intrusion Detection
• Statistical Profiling using histograms
• Parametric Statistical Modeling
• Non-parametric Statistical Modeling
• Bayesian Networks, Neural Networks
• SVM, Rule-based systems
• Clustering based, Nearest Neighbor
• Spectral, Information Theoretic
Credit Card Fraud Detection
• Neural Networks
• Rule-based systems
• Clustering, Self-Organizing Map
• Artificial Immune System
• Decision Trees, SVM
Mobile Phone Fraud Detection
• Statistical Profiling using Histograms
• Parametric Statistical Modeling
• Neural networks, Rule-based systems
Insider Trading Detection
• Statistical Profiling using Histograms
• Information Theoretic
Medical and Public Health
• Parametric Statistical Modeling
• Neural Networks, Bayesian Networks
• Rule-based systems
• Nearest Neighbor Techniques
Fault Detection in Mechanical Units
• Parametric Statistical Modeling
• Non-Parametric Statistical Modeling
• Neural Networks, Spectral Methods
• Rule-based Systems
Structural Damage Detection
• Statistical Profiling using histograms
• Parametric Statistical Modeling
• Mixture of Models, Neural Networks
Image Processing, Surveillance
• Mixture of Models, Regression, SVM
• Bayesian Networks, Neural Networks
• Clustering, Nearest Neighbor Methods
Anomalous Topic Detection
• Mixture of Models, Neural Networks
• Statistical Profiling using Histograms
• Clustering, SVM
Anomaly Detection in Sensor Networks
• Parametric Statistical Modeling
• Bayesian Networks, Nearest Neighbor
• Rule-based Systems, Spectral
Source: Chandola, V. et al. (2009). Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3), 15.
12. Taxonomy for Anomaly Detection Algorithms
Anomaly Detection
• Point Anomaly Detection: a data instance is anomalous with respect to the rest of the data (e.g. a large transaction)
• Contextual Anomaly Detection: a data instance is anomalous in a specific context (e.g. a large power spike at night)
• Collective Anomaly Detection: a collection of related data instances is anomalous with respect to the entire data set
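The difference between a point anomaly and a contextual anomaly can be sketched with the power-spike example: the same reading may be normal during the day and anomalous at night. The sketch below fits a mean and standard deviation per context from historical data and scores new readings against their own context only. The function names, the "day"/"night" labels, the readings, and the threshold of 3 standard deviations are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def fit_context_stats(history):
    """history: iterable of (context, value). Returns {context: (mean, std)}."""
    by_context = defaultdict(list)
    for context, value in history:
        by_context[context].append(value)
    return {c: (mean(v), pstdev(v)) for c, v in by_context.items()}

def is_contextual_anomaly(stats, context, value, threshold=3.0):
    """True if value is more than `threshold` std devs from its context's mean."""
    mu, sigma = stats[context]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical power readings (kW), assumed to be normal behavior.
history = [("day", 5.0), ("day", 5.4), ("day", 4.8), ("day", 5.2),
           ("night", 1.0), ("night", 1.2), ("night", 0.9), ("night", 1.1)]
stats = fit_context_stats(history)

day_flag = is_contextual_anomaly(stats, "day", 5.0)      # typical daytime load
night_flag = is_contextual_anomaly(stats, "night", 5.0)  # same value at night
print(day_flag, night_flag)
```

A detector that ignores the context would compare 5.0 kW against all readings pooled together, where it does not stand out.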
13. Data – Types of Attributes
Categorical
• Nominal: named categories
• Ordinal: categories with an implied order
• Binary: variables with only two options (Yes/No)
Numerical
• Discrete: only particular numbers
• Continuous: any numerical value
14. Anomaly Detection Approaches
Supervised (Classification): data skewness, lack of counter-examples
Unsupervised (Clustering): faces the curse of dimensionality
Semi-supervised (Novelty detection): requires a “normal” training dataset
• Anomalies are often a handful among millions of normal data points
• Given training data, this is a class imbalance problem
• There are methods to address this, including SVM, Random Forests and ensemble learning
• If the data is auto-correlated, it may be necessary to use time-series classification or Recurrent Neural Network based approaches
• When there is no training data, unsupervised or semi-supervised methods can be used
Source: https://iwringer.wordpress.com/2015/11/17/anomaly-detection-concepts-and-techniques/
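One common way to handle the class imbalance mentioned above, when labels are available, is to weight each class inversely to its frequency so that the few anomalies count as much as the many normal points during training. The sketch below computes such weights with the heuristic n_samples / (n_classes * class_count), the same formula scikit-learn documents for its "balanced" class weights. The function name and the 990/10 label split are assumptions; the slide does not say which rebalancing method the speakers used.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical training labels: 10 anomalies among 1000 examples.
labels = ["normal"] * 990 + ["anomaly"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight (500 here), so a classifier that accepts per-class or per-sample weights is no longer rewarded for always predicting "normal".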
15. Unsupervised Anomaly Detection Algorithms
Unsupervised AD Algorithms
• k-NN Global Anomaly Detection (uses average distance to k neighbors)
• kth-NN (uses distance to kth neighbor)
• LOF – Local Outlier Factor
• COF – Connectivity based OF
• LoOP – Local Outlier Probability
• LOCI – Local Correlation Integral
• aLOCI – approximate LOCI
• INFLO – Influenced Outlierness
• CBLOF / uCBLOF - Cluster-Based LOF
• LDCOF - Local Density Cluster-based OF
• CMGOS - Clustering-based Multivariate Gaussian Outlier Score
• HBOS - Histogram-based Outlier Score
• One-class Support Vector Machine
• rPCA - Robust PCA
[Figure: LOF performance. Global anomalies (x1, x2), a local anomaly x3 and a micro-cluster c3. k-NN underperforms on local anomalies.]
Source: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152173
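The first two entries in the list above differ only in how the neighbor distances are summarized. A brute-force sketch of both scores follows: the k-NN score is the average distance to the k nearest neighbors, and the kth-NN score is the distance to the kth neighbor alone. The function name, the `use_kth_only` flag, and the sample points are assumptions; a production version would use a spatial index or a distributed join rather than the O(n²) loop shown here.

```python
import math

def knn_scores(points, k=2, use_kth_only=False):
    """Anomaly score per point: mean distance to its k nearest neighbors,
    or the distance to the kth neighbor when use_kth_only is True."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Sorted distances from p to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        scores.append(nearest[-1] if use_kth_only else sum(nearest) / k)
    return scores

# Hypothetical data: a tight cluster plus one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(most_anomalous, scores)
```

Both scores are global: they compare raw distances across the whole data set, which is consistent with the slide's note that k-NN underperforms on local anomalies.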
16. Some Anomaly Detection Methods
Data has a mix of Categorical and Numeric attributes
Robust SVM: kernel-based approach that identifies regions in which data resides in an alternate feature space
• Makes the standard SVM robust, since it can be affected by outliers
• Retains the strengths of SVM: fast computation, handling high-dimensional data, and kernels
Generic Mixture Models: extends the framework of Gaussian Mixture Models (GMMs)
• Is based on GMMs, which are latent variable models
• A latent variable model is a probability model where some variables are never observed
K-modes: uses Hamming distance to measure distance for categorical features
• K-Means cannot handle non-numeric data
• K-Modes applies a dissimilarity measure for categorical items
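The Hamming dissimilarity that K-modes relies on can be sketched in a few lines: count the attributes on which two records disagree, and use the per-column mode as the cluster centre. The example below is a single-cluster simplification (it does not run the full K-modes assignment and update loop); the function names and the transaction records are assumptions.

```python
from collections import Counter

def hamming(a, b):
    """Number of attribute positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(records):
    """Most frequent category per column, used here as the cluster centre."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

# Hypothetical categorical records: (card type, channel, country).
records = [
    ("visa", "online", "US"),
    ("visa", "online", "US"),
    ("visa", "in-store", "US"),
    ("amex", "atm", "RU"),
]
centre = column_modes(records)
distances = [hamming(r, centre) for r in records]
print(centre, distances)
```

Records with a large dissimilarity to every cluster centre are candidates for anomalies; in the full algorithm there would be several centres and the minimum distance would be used.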
17. Some Anomaly Detection Methods
Data has a sequential nature (timestamps, or sequences)
State Space Models: model the evolution of data in time to enable forecasting, and flag an anomaly if the forecast error exceeds a threshold
• Residual error between the model and the real system is used to identify anomalous events
• This works with streaming data
Graph-based Methods: graphs capture interdependencies and allow discovery of relational associations, such as in fraud
• Network intrusion graph grows dynamically as events occur
• An activity vector obtained from the graph can detect anomalies
Hidden Markov Models: Markov chains and HMMs measure the probability of different events happening in some sequence
• Markov chains can be built from historical data
• The chain can be used to find the probability of an anomalous sequence of events
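The Markov chain approach on this slide can be sketched as follows: estimate transition probabilities from historical sequences, then score a new sequence by its average log-probability per transition, where a low score suggests an unusual sequence. This is a plain first-order Markov chain, not a hidden Markov model; the state names, the add-one smoothing, and the function names are assumptions.

```python
import math
from collections import defaultdict

def fit_transitions(sequences, states, alpha=1.0):
    """Estimate P(next | current) from sequences, with add-alpha smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for s in states:
        total = sum(counts[s].values()) + alpha * len(states)
        probs[s] = {t: (counts[s][t] + alpha) / total for t in states}
    return probs

def avg_log_likelihood(seq, probs):
    """Mean log-probability per transition; lower means more unusual."""
    steps = list(zip(seq, seq[1:]))
    return sum(math.log(probs[a][b]) for a, b in steps) / len(steps)

# Hypothetical session events used as historical "normal" behavior.
states = ["login", "browse", "pay", "logout"]
history = ([["login", "browse", "pay", "logout"]] * 20
           + [["login", "browse", "logout"]] * 10)
probs = fit_transitions(history, states)

normal = avg_log_likelihood(["login", "browse", "pay", "logout"], probs)
odd = avg_log_likelihood(["login", "pay", "pay", "pay"], probs)
print(normal, odd)
```

Smoothing keeps unseen transitions from having zero probability, so an unfamiliar sequence gets a very low score instead of causing a math error.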
[Figure: model-based anomaly detection. Labels: System, Behavior model, Observed behavior, Expected behavior; stages: Observation, Model Formation, Simulation, Anomaly Detection.]
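The comparison of observed and expected behavior can be sketched for streaming data with a one-step-ahead forecast and a threshold on the residual. Here simple exponential smoothing stands in for a full state space model (such as a Kalman filter), which is a deliberate simplification. The function name, the smoothing factor, the warm-up length, the threshold, and the sample stream are assumptions.

```python
def streaming_residual_detector(stream, alpha=0.3, threshold=3.0, warmup=5):
    """Yield (index, value, is_anomaly), processing one observation at a time."""
    level = None    # current one-step-ahead forecast (expected behavior)
    err_var = 0.0   # running estimate of the squared forecast error
    for i, x in enumerate(stream):
        if level is None:
            level = x  # initialise the forecast with the first observation
            yield i, x, False
            continue
        residual = x - level  # observed minus expected
        is_anomaly = (i >= warmup and err_var > 0
                      and abs(residual) > threshold * err_var ** 0.5)
        if not is_anomaly:
            # Update the model only with points considered normal, so a
            # spike does not distort the forecast or the error estimate.
            err_var = (1 - alpha) * err_var + alpha * residual ** 2
            level = level + alpha * residual
        yield i, x, is_anomaly

# Hypothetical metric stream with a single spike.
stream = [10.0, 10.2, 9.9, 10.1, 10.0, 9.9, 10.1, 20.0, 10.0, 9.9]
flagged = [i for i, _, bad in streaming_residual_detector(stream) if bad]
print(flagged)
```

Because the detector is a generator that keeps only two numbers of state, it can be applied to an unbounded stream.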
18. Some Anomaly Detection Methods
Other Methods
Generative Adversarial Nets: GANs combine two neural networks, a generator and a discriminator, and can be used to find anomalies
• Deep Convolutional GANs are being used to learn a manifold of normal variability
• This allows high accuracy in anomaly detection
Deep Learning (RNN-based): RNN-based architectures enable sequence prediction, so the network can flag an anomaly when an observation deviates from its prediction
• RNN-based models can detect anomalies in time series data
• More capable architectures such as LSTM are also possible
Deep Learning (AutoEncoder): AutoEncoders can learn the latent representation of the data by using an encoder and a decoder together
• The output of the AutoEncoder is compared to the input to detect and flag anomalies
• Anomalies are more likely to have a high reconstruction error
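The reconstruction-error idea can be illustrated without a neural network. The sketch below uses a linear, one-dimensional stand-in for an autoencoder: the "encoder" projects a centred point onto the top principal direction of the normal data, and the "decoder" maps that single latent value back to the original space. This is PCA rather than a deep AutoEncoder, and it is named as such; the function names and the sample data are assumptions.

```python
import math

def fit_linear_autoencoder(data, iterations=100):
    """Return (mean, direction): the data mean and its top principal
    direction, found by power iteration on the covariance matrix."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - mean[j] for j in range(d)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centred) / n for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def reconstruction_error(x, mean, direction):
    """Squared distance between x and its encode-then-decode reconstruction."""
    centred = [xi - m for xi, m in zip(x, mean)]
    code = sum(c * u for c, u in zip(centred, direction))    # encoder: 1-D latent
    recon = [m + code * u for m, u in zip(mean, direction)]  # decoder
    return sum((xi - ri) ** 2 for xi, ri in zip(x, recon))

# Hypothetical "normal" training data lying close to the line y = 2x.
normal = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.1), (5.0, 9.9)]
mean, direction = fit_linear_autoencoder(normal)

err_normal = reconstruction_error((2.5, 5.0), mean, direction)
err_odd = reconstruction_error((2.5, 9.0), mean, direction)
print(err_normal, err_odd)
```

A point that follows the learned structure is reconstructed almost exactly, while a point off that structure loses information in the bottleneck and shows a much larger error, which is the signal a threshold would act on.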
19. Impetus DSP - Out of Pattern Transaction Detection
The Challenge
• A major credit card company has several thousand corporate customers
• Customers have unique compliance policies around acceptable spend
• Build a scalable product to identify out-of-pattern spend behavior at card level
Benefits Realized
• Value-added service led to an increase in charge volumes of corporate customers
• Demonstrated the value of external-facing product launches that leverage machine learning
• Extending to fraud in travel expenses
Impetus Contribution
• Spend behavior of the card accounts was analyzed to identify normal spend
• Implemented an algorithm to determine out-of-pattern transactions and scaled it to ~2M card accounts
• Launched the product in < 3 months
20. Case Study – “Out of Pattern” Financial Transactions
Two possible reasons:
1) Customer’s situation may have really changed
2) Fraudulent usage
22. i. Introduction to web user interface for StreamAnalytix
ii. Multi-tenancy feature support
iii. Introduction to Data360 in StreamAnalytix
• Data pipelines
• Deploying the jobs
• Real-time dashboards and monitoring in StreamAnalytix
iv. Data Science in StreamAnalytix:
• Network anomaly use case
• Customer transaction anomaly detection use case
• A-B testing use case
v. Enterprise level features in StreamAnalytix
• Versioning
• Import & export data pipelines
• Register entities
• Data pipeline inspect