Anomaly Detection &
Spark Implementation
Presenters:
Maxim Shkarayev
Anand Venugopal
Punit Shah
DECEMBER 5, 2017
Meetup:
Stream Processing and Machine
Learning Platform for the Enterprise
Thought Leadership / Advisory
Impetus Introduction
Mission critical
technology solutions
since 1996
Global leaders are
our Big Data clients
1700 people: US,
India, global reach
Unique mix of
Big Data products
and services
• Real-time C360 and Churn
• Next Best Offer or Action
• Streaming ETL
• IoT and Log Analytics
• Fraud, Risk Anomaly detection
• Anomaly detection
• Predictive Maintenance
Enabling the Real-time Enterprise
Delightful Customer Experiences
Maximizing operational efficiency
with real-time insights
Build and Deploy use-cases fast
Pre-built ETL, Analytics, Read-write operators
Drag and Drop visual development and DevOps
Fast Data and Big Data; On-premise and Cloud
Enabling the Real-time Enterprise
“I could do my 1.5 month Spark app
in 1.5 days with this product”
- Analytics Lead at Tier 1 US Telco
Impetus Data Science Practice – Relevant Use-cases
Banking and Finance
Data Analytics & Modeling
Finding fraudulent travel and expenses
Text Mining & NLP
Detecting intent to commit fraud in e-communications
Graph Analytics
Business impact of customer loss
Insurance
Data Analytics & Modeling
Insurance premium determination using
Catastrophe Modeling
Text Mining & NLP
Detecting intent to commit fraud in e-communications
(AML, Dodd-Frank, etc.)
Communication and Media
Data Analytics & Modeling
Finding root cause of No Dial Tone;
Self-learning Anomaly Detection System
Marketing Analytics
Lead generation and Multi-touch
Attribution for increasing conversion rates
Manufacturing and Logistics
Data Analytics & Modeling
Lowering rejection rate of silicon wafers
for a semiconductor company
Early detection of paint defects for
leading auto manufacturer
Correlating multiple data sources to
identify factors related to warranty issues
Energy & Utilities
Data Analytics & Modeling
Reinforcement Learning model to enable
bidding of electricity (price and quantity)
Information Extraction
Extract label information from P&IDs and
make them searchable
Create a Bill of Materials for Budgeting
Healthcare
Data Analytics & Modeling
Predicting Patient Readmission
Text Mining & NLP
Competitive analysis of medicines
Graph Analytics
Drug-disease co-occurrence with Medline
Anomaly Definition
Anomaly: an observation that deviates greatly from most of the other observations, i.e., a
data point, behavior, or pattern that appears statistically unusual
Basic qualities of anomaly:
1. Rare
2. Significantly different from others
Impetus DSP – Some Applications of Anomaly Detection
The problem of finding patterns in data that do not conform to expected behaviour
Manufacturing
Detect abnormal
machine behavior to
prevent cost overruns
Finance, Insurance
Detect and prevent Out
of Pattern or Fraudulent
spend, travel expenses
Healthcare
Detect fraud in claims
and payments; Events
from RFID and mobiles
Banking
Flag abnormally high
purchases or deposits,
detect cyber intrusions
Networking
Detect intrusion into
networks, prevent theft of
source code or IP
Social Media
Detect compromised
accounts, bots that
generate fake reviews
Video Surveillance
Detect or track objects
and persons of interest in
monotonous footage
Smart Homes
Detect energy leakage,
Standardize smart
sensor datasets
Telecom
Detect roaming abuse,
Revenue fraud, Service
disruptions etc.
Transportation
Ensure external
communications to the
vehicle are not intrusions
Deep Dive on Anomaly Detection
Anomaly Detection Algorithms Across Disciplines
Host-based IDS
• Statistical Profiling using histograms
• Mixture of Models, Neural Networks
• SVM, Rule-based systems
Network Intrusion Detection
• Statistical Profiling using histograms
• Parametric Statistical Modeling
• Non-parametric Statistical Modeling
• Bayesian Networks, Neural Networks
• SVM, Rule-based systems
• Clustering based, Nearest Neighbor
• Spectral, Information Theoretic
Credit Card Fraud Detection
• Neural Networks
• Rule-based systems
• Clustering, Self-Organizing Map
• Artificial Immune System
• Decision Trees, SVM
Mobile Phone Fraud Detection
• Statistical Profiling using Histograms
• Parametric Statistical Modeling
• Neural networks, Rule-based systems
Insider Trading Detection
• Statistical Profiling using Histograms
• Information Theoretic
Medical and Public Health
• Parametric Statistical Modeling
• Neural Networks, Bayesian Networks
• Rule-based systems
• Nearest Neighbor Techniques
Fault Detection in Mechanical Units
• Parametric Statistical Modeling
• Non-Parametric Statistical Modeling
• Neural Networks, Spectral Methods
• Rule-based Systems
Structural Damage Detection
• Statistical Profiling using histograms
• Parametric Statistical Modeling
• Mixture of Models, Neural Networks
Image Processing, Surveillance
• Mixture of Models, Regression, SVM
• Bayesian Networks, Neural Networks
• Clustering, Nearest Neighbor Methods
Anomalous Topic Detection
• Mixture of Models, Neural Networks
• Statistical Profiling using Histograms
• Clustering, SVM
Anomaly Detection in Sensor Networks
• Parametric Statistical Modeling
• Bayesian Networks, Nearest Neighbor
• Rule-based Systems, Spectral
Source: Chandola, V. et al. (2009). Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3), 15.
Taxonomy for Anomaly Detection Algorithms
Anomaly Detection
• Point Anomaly Detection: a data instance is anomalous with respect to the rest of the data (e.g. a large transaction)
• Contextual Anomaly Detection: a data instance is anomalous in a specific context (e.g. a large power spike at night)
• Collective Anomaly Detection: a collection of related data instances is anomalous with respect to the entire data set
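The difference between a point anomaly and a contextual anomaly can be illustrated with a small sketch. It is only an assumption-laden example: the power readings, the "day"/"night" context labels, and the use of a z-score are all hypothetical choices, not something taken from the slides. The same reading is scored once against all data and once against readings that share its context.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the z-score of every value relative to the whole list."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def contextual_z_scores(readings):
    """Score each (context, value) pair against readings that share its context."""
    by_context = {}
    for context, value in readings:
        by_context.setdefault(context, []).append(value)
    stats = {c: (mean(vs), pstdev(vs)) for c, vs in by_context.items()}
    scores = []
    for context, value in readings:
        mu, sigma = stats[context]
        scores.append(0.0 if sigma == 0 else (value - mu) / sigma)
    return scores

# Hypothetical power readings in kW, labelled by time of day.
readings = [
    ("day", 5.0), ("day", 5.5), ("day", 6.0), ("day", 5.2), ("day", 5.8),
    ("night", 1.0), ("night", 1.1), ("night", 0.9), ("night", 1.0), ("night", 1.2),
    ("night", 5.0),  # ordinary for daytime, unusual at night
]

global_scores = z_scores([value for _, value in readings])
context_scores = contextual_z_scores(readings)
```

The last reading sits well inside the overall range, so its global score stays small, while its score within the "night" context is clearly larger. That is the sense in which it is a contextual anomaly rather than a point anomaly.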
Data – Types of Attributes
Data
• Categorical
  • Nominal: named categories
  • Ordinal: categories with an implied order
• Numerical
  • Discrete: only particular numbers
  • Continuous: any numerical value
• Binary: variables with only two options (Yes/No)
Anomaly Detection Approaches
Supervised (Classification): data skewness, lack of counter examples
Unsupervised (Clustering): faces curse of dimensionality
Semi-supervised (Novelty detection): requires a “normal” training dataset
• Anomalies are often a handful among millions of
normal data points
• Given training data, this is a class imbalance problem
• Methods that address it include SVM,
Random Forests and ensemble learning
• If the data is auto-correlated, it may be necessary to use
time-series classification or Recurrent Neural Network
based approaches
• When there is no training data, unsupervised or
semi-supervised methods can be used
Source: https://iwringer.wordpress.com/2015/11/17/anomaly-detection-concepts-and-techniques/
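One common way to handle the class imbalance described above is to weight each class inversely to its frequency. The sketch below is a minimal illustration under assumptions: the label names and counts are made up, and the weighting formula n_samples / (n_classes * class_count) is the heuristic scikit-learn documents for class_weight="balanced". The slides do not say which weighting, if any, the presenters used.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def weighted_error(y_true, y_pred, weights):
    """Misclassification rate where each sample counts by its class weight."""
    total = sum(weights[y] for y in y_true)
    wrong = sum(weights[y] for y, p in zip(y_true, y_pred) if y != p)
    return wrong / total

# Hypothetical labels: 10 anomalies among 1000 records.
labels = ["normal"] * 990 + ["anomaly"] * 10
weights = balanced_class_weights(labels)

# A useless model that never predicts an anomaly.
always_normal = ["normal"] * len(labels)
plain_error = sum(y != p for y, p in zip(labels, always_normal)) / len(labels)
balanced_error = weighted_error(labels, always_normal, weights)
```

With unweighted error the "always normal" model looks almost perfect, because it is wrong on only the rare class. With the class weights applied, missing every anomaly costs as much as missing every normal record, which is why weighted training or evaluation is usually preferred on skewed data.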
Unsupervised Anomaly Detection Algorithms
Unsupervised AD Algorithms
• k-NN Global Anomaly Detection (uses average
distance to k neighbors)
• kth-NN (uses distance to kth neighbor)
• LOF – Local Outlier Factor
• COF – Connectivity based OF
• LoOP – Local Outlier Probability
• LOCI – Local Correlation Integral
• aLOCI – approximate LOCI
• INFLO – Influenced Outlierness
• CBLOF/ uCBLOF - Cluster-Based LOF
• LDCOF - Local Density Cluster-based OF
• CMGOS - Clustering-based Multivariate
Gaussian Outlier Score
• HBOS - Histogram-based Outlier Score
• One-class Support Vector Machine
• rPCA - Robust PCA
Figure: LOF performance. Global anomalies (x1, x2), a local anomaly x3 and a micro-cluster c3.
K-NN underperforms on
local anomalies
Source: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152173
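The first two entries in the list above, k-NN (average distance to the k nearest neighbors) and kth-NN (distance to the k-th neighbor), can be sketched in a few lines. This is a brute-force illustration on made-up 2-D points, assuming Euclidean distance. It is not how a scalable Spark implementation would compute neighbors.

```python
import math

def knn_scores(points, k):
    """Return two score lists: mean distance to the k nearest
    neighbours, and distance to the k-th nearest neighbour."""
    avg_scores, kth_scores = [], []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        avg_scores.append(sum(nearest) / k)
        kth_scores.append(nearest[-1])
    return avg_scores, kth_scores

# Hypothetical data: a tight group near the origin and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
avg_scores, kth_scores = knn_scores(points, k=2)
most_anomalous = max(range(len(points)), key=lambda i: avg_scores[i])
```

Both scores are global: they compare raw distances across the whole data set, so a point that is only unusual relative to a dense neighborhood can be missed. Density-based scores such as LOF are meant to address that case.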
Some Anomaly Detection Methods
Data has a mix of Categorical and Numeric attributes
Robust SVM
Kernel based approach that identifies regions in which data resides in alternate feature space
• Makes standard SVM robust, since standard SVM can be affected by outliers
• Retains strengths of SVM: fast computation, handling high-dimensional data and kernels
Generic Mixture Models
Extends the framework of Gaussian Mixture Models
• Is based on GMMs which are latent variable models
• A latent variable model is a probability model where some variables are never observed
K-modes
Uses Hamming distance to measure distance for Categorical Features
• K-Means cannot handle data that is non-numeric
• K-Modes applies a dissimilarity measure for categorical items
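For the K-modes entry above, the Hamming dissimilarity and the "mode" that replaces the K-Means centroid can be sketched as follows. The records and attribute names are hypothetical, and the example uses a single cluster for brevity. A full K-modes run would keep k modes and reassign records iteratively.

```python
from collections import Counter

def hamming(a, b):
    """Number of attribute positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(records):
    """Per-attribute most frequent value, the K-modes analogue of a centroid."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Hypothetical card transactions: (channel, merchant type, country)
records = [
    ("online", "grocery", "US"),
    ("online", "grocery", "US"),
    ("store", "grocery", "US"),
    ("online", "fuel", "US"),
    ("store", "jewelry", "FR"),
]

center = mode_of(records)
scores = [hamming(record, center) for record in records]
```

Records that differ from the mode in most attributes get the highest score, so the score can serve as a simple anomaly measure for purely categorical data.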
Some Anomaly Detection Methods
Data has a sequential nature (timestamps, or sequences)
Graph based Methods
Graphs capture interdependencies, and allow discovery of relational associations such as in fraud
• Network intrusion graph grows dynamically as events occur
• An activity vector obtained from the graph can detect anomalies
Hidden Markov Models
Markov Chains and HMMs measure the probability of different events happening in some sequence
• Markov chains can be built from historical data
• This chain can be used to find the probability of an anomalous sequence of events
State Space Models
Model the evolution of data in time to enable forecasting, and flag an anomaly when the deviation from the forecast exceeds a threshold
• Residual error between model and the real system is used to identify anomalous events
• This works with streaming data
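The residual idea in the state space bullets above can be sketched with a very simple forecaster. The example is an assumption, not the presenters' model: it uses exponential smoothing as a stand-in for a full state space model, and the parameters alpha, threshold and warmup are illustrative values. Each new reading is compared with the current forecast and flagged when the residual is large relative to recent residuals.

```python
def detect_stream(values, alpha=0.3, threshold=4.0, warmup=5):
    """Return (index, value, residual) for readings far from the forecast."""
    level = None   # current forecast (exponentially weighted mean)
    scale = 0.0    # exponentially weighted mean absolute residual
    flagged = []
    for i, x in enumerate(values):
        if level is None:
            level = x  # first reading initialises the forecast
            continue
        residual = x - level
        is_anomaly = (
            i >= warmup and scale > 0 and abs(residual) > threshold * scale
        )
        if is_anomaly:
            flagged.append((i, x, residual))
        else:
            # Only normal readings update the model, so a spike
            # does not shift the baseline.
            scale = (1 - alpha) * scale + alpha * abs(residual)
            level = level + alpha * residual
    return flagged

# Hypothetical sensor stream with one spike.
stream = [10.0, 10.2, 9.9, 10.1, 10.0, 9.9, 10.1, 15.0, 10.0, 9.9]
flagged = detect_stream(stream)
```

Because the state is just two numbers updated per reading, the same loop can run over an unbounded stream, which matches the streaming point made above.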
[Figure: block diagram with System, Behavior model, Observed behavior and Expected behavior; stages labelled Observation, Model Formation, Simulation and Anomaly Detection]
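The Markov chain bullets on this slide can be sketched as follows: estimate transition probabilities from historical sequences, then score a new sequence by its average log-probability per transition. The event names, the history, and the add-alpha smoothing are assumptions made for illustration only.

```python
import math
from collections import defaultdict

def fit_transitions(sequences, states, alpha=1.0):
    """Estimate transition probabilities with add-alpha smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1.0
    probs = {}
    for s in states:
        total = sum(counts[s].values()) + alpha * len(states)
        probs[s] = {t: (counts[s][t] + alpha) / total for t in states}
    return probs

def avg_log_likelihood(seq, probs):
    """Mean log-probability per transition; lower means more unusual."""
    steps = list(zip(seq, seq[1:]))
    return sum(math.log(probs[a][b]) for a, b in steps) / len(steps)

# Hypothetical session events and history.
states = ["login", "browse", "pay", "logout"]
history = (
    [["login", "browse", "browse", "pay", "logout"]] * 20
    + [["login", "browse", "logout"]] * 10
)

probs = fit_transitions(history, states)
usual = avg_log_likelihood(["login", "browse", "pay", "logout"], probs)
odd = avg_log_likelihood(["login", "pay", "pay", "pay"], probs)
```

A sequence made of transitions rarely or never seen in the history gets a much lower score than a typical one. Smoothing keeps unseen transitions from producing a probability of zero.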
Some Anomaly Detection Methods
Other Methods
Generative Adversarial Nets
GANs combine two neural networks, a generator and a discriminator, and can be used to find anomalies
• Deep Convolutional GANs are being used to learn a manifold of normal variability
• This allows high accuracy in anomaly detection
Deep Learning (RNN-based)
RNN-based architectures enable sequence prediction. The network can flag an anomaly when an observation deviates from its prediction
• RNN based models can detect anomalies in Time Series Data
• More capable architectures such as LSTM are also possible
Deep Learning (AutoEncoder)
AutoEncoders can learn the latent representation of the data by using an encoder and a decoder together
• The output of the AutoEncoder is compared to the input to detect and flag anomalies
• Anomalies are more likely to have a high reconstruction error
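The reconstruction-error idea in the AutoEncoder bullets can be shown with a deliberately tiny model. The sketch assumes a linear autoencoder with a single hidden unit and tied encoder/decoder weights, trained by plain stochastic gradient descent on made-up 2-D data. A real deployment would use a deep, non-linear network and a proper framework.

```python
def encode(w, x):
    """Project a 2-D point onto the single hidden unit."""
    return w[0] * x[0] + w[1] * x[1]

def reconstruct(w, x):
    """Decode with the same (tied) weights used for encoding."""
    z = encode(w, x)
    return (z * w[0], z * w[1])

def reconstruction_error(w, x):
    """Squared distance between the input and its reconstruction."""
    rx, ry = reconstruct(w, x)
    return (x[0] - rx) ** 2 + (x[1] - ry) ** 2

def train(data, lr=0.01, epochs=300):
    """Fit the tied weights by gradient descent on reconstruction error."""
    w = [1.0, 0.0]  # arbitrary starting weights
    for _ in range(epochs):
        for x in data:
            z = encode(w, x)
            r = (x[0] - z * w[0], x[1] - z * w[1])  # residual
            rw = r[0] * w[0] + r[1] * w[1]
            # Gradient of ||x - (w.x) w||^2 w.r.t. w is -2 * (rw * x + z * r).
            w[0] += 2 * lr * (rw * x[0] + z * r[0])
            w[1] += 2 * lr * (rw * x[1] + z * r[1])
    return w

# Hypothetical "normal" data: points on the line y = 2x.
normal = [(t / 10, 2 * t / 10) for t in range(-10, 11)]
w = train(normal)

normal_errors = [reconstruction_error(w, x) for x in normal]
anomaly_error = reconstruction_error(w, (1.0, -2.0))
```

The model learns to reproduce points that follow the normal pattern, so their reconstruction error is close to zero. A point off that pattern cannot be reproduced through the one-unit bottleneck and ends up with a much larger error, which is the quantity that gets thresholded.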
Impetus DSP - Out of Pattern Transaction Detection
The Challenge
• Major credit card company has
several thousand corporate
customers
• Customers have unique compliance
policies around acceptable spend
• Build a scalable product to identify out
of pattern spend behavior at card
level
Benefits Realized
• Value added service led to increase in
charge volumes of corporate
customers
• Demonstrated the value of external
facing product launches that leverage
machine learning
• Extending to fraud in travel expenses
Impetus Contribution
• Spend behavior of the card accounts
was analyzed to identify normal
spend
• Implemented algorithm to determine
out of pattern transactions and
scaled it to ~ 2M card accounts
• Launched the product in < 3 months
Case Study – “Out of Pattern” Financial Transactions
2 possible reasons
1) Customer’s situation may have really changed
2) Fraudulent usage
Product Demo
i. Introduction to web user interface for StreamAnalytix
ii. Multi-tenancy feature support
iii. Introduction to Data360 in StreamAnalytix
• Data pipelines
• Deploying the jobs
• Real-time dashboards and monitoring in StreamAnalytix
iv. Data Science in StreamAnalytix:
• Network anomaly use case
• Customer transaction anomaly detection use case
• A-B testing use case
v. Enterprise level features in StreamAnalytix
• Versioning
• Import & export data pipelines
• Register entities
• Data pipeline inspect
Thank you.
Questions?
© 2017 Impetus Technologies
Email: inquiry@streamanalytix.com Twitter : @StreamAnalytix

The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 

Último (20)

Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 

Anomaly Detection and Spark Implementation - Meetup Presentation.pptx

  • 1. Anomaly Detection & Spark Implementation Presenters:- Maxim Shkarayev Anand Venugopal Punit Shah DECEMBER 5, 2017 Meetup:
  • 3. Stream Processing and Machine Learning Platform for the Enterprise Thought Leadership / Advisory
  • 4. Impetus Introduction Mission critical technology solutions since 1996 Global leaders are our Big Data clients 1700 people: US, India, global reach Unique mix of Big Data products and services
  • 5. • Real-time C360 and Churn • Next Best Offer or Action • Streaming ETL • IoT and Log Analytics • Fraud, Risk Anomaly detection • Anomaly detection • Predictive Maintenance Enabling the Real-time Enterprise Delightful Customer Experiences Maximizing operational efficiency with real-time insights
  • 6. Build and Deploy use-cases fast Pre-built ETL, Analytics, Read-write operators Drag and Drop visual development and DevOps Fast Data and Big Data; On-premise and Cloud Enabling the Real-time Enterprise “I could do my 1.5 month Spark app in 1.5 days with this product” - Analytics Lead at Tier 1 US Telco
  • 7. Impetus Data Science Practice – Relevant Use-cases
      • Banking and Finance. Data Analytics & Modeling: finding fraudulent travel and expenses. Text Mining & NLP: intent-to-fraud detection in e-communications. Graph Analytics: business impact of customer loss.
      • Insurance. Data Analytics & Modeling: insurance premium determination using Catastrophe Modeling. Text Mining & NLP: detecting intent to commit fraud in e-communications (AML, Dodd Frank etc.).
      • Communication and Media. Data Analytics & Modeling: finding root cause of No Dial Tone; Self-learning Anomaly Detection System. Marketing Analytics: lead generation and Multi-touch Attribution for increasing conversion rates.
      • Manufacturing and Logistics. Data Analytics & Modeling: lowering rejection rate of silicon wafers for a semiconductor company; early detection of paint defects for a leading auto manufacturer; correlating multiple data sources to identify factors related to warranty issues.
      • Energy & Utilities. Data Analytics & Modeling: Reinforcement Learning model to enable bidding of electricity (price and quantity). Information Extraction: extract label information from P&IDs and make them searchable; create a Bill of Materials for budgeting.
      • Healthcare. Data Analytics & Modeling: predicting patient readmission. Text Mining & NLP: competitive analysis of medicines. Graph Analytics: drug-disease co-occurrence with Medline.
  • 8. Anomaly Definition. An anomaly is an observation that deviates greatly from most other observations, i.e., a data point, behavior or pattern that appears statistically unusual. Basic qualities of an anomaly: 1. Rare 2. Significantly different from the others
  • 9. Impetus DSP – Some Applications of Anomaly Detection. The problem of finding patterns in data that do not conform to expected behaviour.
      • Manufacturing: detect abnormal machine behavior to prevent cost overruns
      • Finance, Insurance: detect and prevent out-of-pattern or fraudulent spend and travel expenses
      • Healthcare: detect fraud in claims and payments; events from RFID and mobiles
      • Banking: flag abnormally high purchases or deposits, detect cyber intrusions
      • Networking: detect intrusion into networks, prevent theft of source code or IP
      • Social Media: detect compromised accounts and bots that generate fake reviews
      • Video Surveillance: detect or track objects and persons of interest in monotonous footage
      • Smart Homes: detect energy leakage, standardize smart sensor datasets
      • Telecom: detect roaming abuse, revenue fraud, service disruptions etc.
      • Transportation: ensure external communications to the vehicle are not intrusions
  • 10. Deep Dive on Anomaly Detection Thought Leadership / Advisory
  • 11. Anomaly Detection Algorithms Across Disciplines
      • Host-based IDS: Statistical Profiling using Histograms; Mixture of Models; Neural Networks; SVM; Rule-based Systems
      • Network Intrusion Detection: Statistical Profiling using Histograms; Parametric Statistical Modeling; Non-parametric Statistical Modeling; Bayesian Networks; Neural Networks; SVM; Rule-based Systems; Clustering-based; Nearest Neighbor; Spectral; Information Theoretic
      • Credit Card Fraud Detection: Neural Networks; Rule-based Systems; Clustering; Self-Organizing Map; Artificial Immune System; Decision Trees; SVM
      • Mobile Phone Fraud Detection: Statistical Profiling using Histograms; Parametric Statistical Modeling; Neural Networks; Rule-based Systems
      • Insider Trading Detection: Statistical Profiling using Histograms; Information Theoretic
      • Medical and Public Health: Parametric Statistical Modeling; Neural Networks; Bayesian Networks; Rule-based Systems; Nearest Neighbor Techniques
      • Fault Detection in Mechanical Units: Parametric Statistical Modeling; Non-parametric Statistical Modeling; Neural Networks; Spectral Methods; Rule-based Systems
      • Structural Damage Detection: Statistical Profiling using Histograms; Parametric Statistical Modeling; Mixture of Models; Neural Networks
      • Image Processing, Surveillance: Mixture of Models; Regression; SVM; Bayesian Networks; Neural Networks; Clustering; Nearest Neighbor Methods
      • Anomalous Topic Detection: Mixture of Models; Neural Networks; Statistical Profiling using Histograms; Clustering; SVM
      • Anomaly Detection in Sensor Networks: Parametric Statistical Modeling; Bayesian Networks; Nearest Neighbor; Rule-based Systems; Spectral
      Source: Chandola, V. et al. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), 15.
  • 12. Taxonomy for Anomaly Detection Algorithms
      • Point Anomaly Detection: a data instance is anomalous with respect to the rest of the data (e.g. a large transaction)
      • Contextual Anomaly Detection: a data instance is anomalous in a specific context (e.g. a large power spike at night)
      • Collective Anomaly Detection: a collection of related data instances is anomalous with respect to the entire data set
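
The difference between a point anomaly and a contextual anomaly on slide 12 can be illustrated with a small sketch. It is not taken from the presentation: the function name `contextual_anomalies`, the day/night readings, and the choice of a median/MAD score (with the conventional 0.6745 scaling and 3.5 cutoff of the modified z-score) are all assumptions made for illustration. Each reading is scored only against other readings from the same context, so a value that looks ordinary overall can still be flagged.

```python
from collections import defaultdict
from statistics import median

def contextual_anomalies(readings, threshold=3.5):
    """Flag (context, value) pairs that are unusual within their own context.

    Uses the modified z-score 0.6745 * |x - median| / MAD per context.
    The threshold of 3.5 is a common convention, not a value from the slides.
    """
    by_context = defaultdict(list)
    for context, value in readings:
        by_context[context].append(value)

    # Per-context median and median absolute deviation (MAD).
    stats = {}
    for context, values in by_context.items():
        med = median(values)
        mad = median(abs(v - med) for v in values)
        stats[context] = (med, mad)

    flagged = []
    for context, value in readings:
        med, mad = stats[context]
        if mad > 0 and 0.6745 * abs(value - med) / mad > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical power readings: 5.0 is normal by day but a spike at night.
readings = [
    ("day", 5.0), ("day", 5.2), ("day", 4.8), ("day", 5.1), ("day", 4.9),
    ("night", 1.0), ("night", 1.1), ("night", 0.9), ("night", 1.2), ("night", 5.0),
]
flagged = contextual_anomalies(readings)
print(flagged)
```

A detector that ignored the context would see 5.0 as a typical value, since the same reading occurs during the day; only the per-context comparison singles out the night-time reading.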
  • 13. Data – Types of Attributes
      • Categorical: Nominal (named categories); Ordinal (categories with an implied order)
      • Numerical: Discrete (only particular numbers); Continuous (any numerical value)
      • Binary: variables with only two options (Yes/No)
  • 14. Anomaly Detection Approaches
      • Supervised (Classification): data skewness, lack of counter examples
      • Unsupervised (Clustering): faces the curse of dimensionality
      • Semi-supervised (Novelty detection): requires a “normal” training dataset
      • Anomalies are often a handful among millions of normal data points
      • Given training data, this is a class imbalance problem
      • There are methods to address this, such as SVM, Random Forests and ensemble learning
      • If the data is auto-correlated, it may be necessary to use time-series classification or Recurrent Neural Network based approaches
      • When there is no training data, unsupervised or semi-supervised methods can be used
      Source: https://iwringer.wordpress.com/2015/11/17/anomaly-detection-concepts-and-techniques/
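
Slide 14 notes that supervised anomaly detection is a class imbalance problem but does not say which remedy the presenters used. One common remedy, shown here purely as an assumed illustration, is to weight each class inversely to its frequency using n_samples / (n_classes * class_count), so that the rare class contributes as much to the training loss as the common one. The helper name `balanced_class_weights` and the 98/2 label split are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical training labels: 2 anomalies among 100 records.
labels = ["normal"] * 98 + ["anomaly"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

The resulting weights would typically be passed to a classifier that accepts per-class or per-sample weights; which classifier to use is outside what the slide specifies.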
  • 15. Unsupervised Anomaly Detection Algorithms
      • k-NN Global Anomaly Detection (uses average distance to k neighbors)
      • kth-NN (uses distance to kth neighbor)
      • LOF – Local Outlier Factor
      • COF – Connectivity-based Outlier Factor
      • LoOP – Local Outlier Probability
      • LOCI – Local Correlation Integral
      • aLOCI – approximate LOCI
      • INFLO – Influenced Outlierness
      • CBLOF / uCBLOF – Cluster-Based LOF
      • LDCOF – Local Density Cluster-based Outlier Factor
      • CMGOS – Clustering-based Multivariate Gaussian Outlier Score
      • HBOS – Histogram-based Outlier Score
      • One-class Support Vector Machine
      • rPCA – Robust PCA
      [Figure: LOF performance. Global anomalies (x1, x2), a local anomaly x3 and a micro-cluster c3. k-NN underperforms on local anomalies.]
      Source: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152173
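
The first two entries on slide 15 differ only in how they turn neighbor distances into a score. The brute-force sketch below shows both variants under stated assumptions: the function name `knn_scores`, the `use_kth_only` switch and the sample points are hypothetical, Euclidean distance is assumed, and the O(n²) pairwise loop is only suitable for small data sets rather than a Spark-scale implementation.

```python
import math

def knn_scores(points, k=2, use_kth_only=False):
    """Score each point by its distance to its k nearest neighbors.

    use_kth_only=False: average distance to the k nearest neighbors (k-NN).
    use_kth_only=True:  distance to the k-th nearest neighbor (kth-NN).
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        scores.append(nearest[-1] if use_kth_only else sum(nearest) / len(nearest))
    return scores

# Hypothetical data: a tight cluster plus one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(most_anomalous, scores)
```

Because the score is a raw distance, it is global: a point that is only slightly detached from a very dense cluster gets a low score, which is the weakness on local anomalies that the slide attributes to k-NN and that density-ratio methods such as LOF are designed to address.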
  • 16. Some Anomaly Detection Methods. Data has a mix of Categorical and Numeric attributes.
      • K-modes: uses Hamming distance to measure distance for categorical features. K-Means cannot handle data that is non-numeric; K-Modes applies a dissimilarity measure for categorical items.
      • Generic Mixture Models: extends the framework of Gaussian Mixture Models. Is based on GMMs, which are latent variable models; a latent variable model is a probability model where some variables are never observed.
      • Robust SVM: kernel-based approach that identifies regions in which data resides in an alternate feature space. Makes standard SVM robust, as it can be affected by outliers; retains the strengths of SVM (fast computation, handling high-dimensional data, and kernels).
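
The Hamming dissimilarity that slide 16 attributes to K-modes can be sketched in a few lines. This is a deliberately reduced illustration, not the full K-modes algorithm: it assumes a single cluster, computes that cluster's mode (the most frequent value in each column), and scores each record by how many attributes disagree with the mode. The record fields and the helper names `hamming` and `column_modes` are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(records):
    """Most frequent value per column; plays the role of a cluster centre."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

# Hypothetical transactions: (card type, channel, country).
records = [
    ("visa", "online", "US"),
    ("visa", "online", "US"),
    ("visa", "store", "US"),
    ("amex", "online", "US"),
    ("amex", "atm", "RU"),
]
mode = column_modes(records)
distances = [hamming(r, mode) for r in records]
print(mode, distances)
```

In full K-modes the same two steps (assign each record to the nearest mode, then recompute the modes) are repeated over several clusters until the assignments stop changing; records that remain far from every mode are candidates for anomalies.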
  • 17. Some Anomaly Detection Methods. Data has a sequential nature (timestamps or sequences).
      • State Space Models: model the evolution of data in time to enable forecasting, and flag an anomaly if the deviation exceeds a threshold. The residual error between the model and the real system is used to identify anomalous events; this works with streaming data.
      • Hidden Markov Models: Markov chains and HMMs measure the probability of different events happening in some sequence. Markov chains can be built from historical data; the chain can then be used to find the probability of an anomalous sequence of events.
      • Graph-based Methods: graphs capture interdependencies and allow discovery of relational associations, such as in fraud. A network intrusion graph grows dynamically as events occur; an activity vector obtained from the graph can detect anomalies.
      [Diagram: System, Behavior model, Observed behavior, Expected behavior; stages: Observation, Model Formation, Simulation, Anomaly Detection]
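
The Markov-chain idea on slide 17 (build a chain from historical data, then ask how probable a new sequence is) can be sketched as follows. Everything specific here is an assumption for illustration: the event names, the helper names `fit_transitions` and `avg_log_likelihood`, the add-one smoothing, and the use of the average log-probability per transition as the score. The sketch also assumes every event in a scored sequence was seen in the history.

```python
from collections import defaultdict
import math

def fit_transitions(sequences, smoothing=1.0):
    """Estimate first-order transition probabilities with additive smoothing."""
    states = sorted({s for seq in sequences for s in seq})
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev in states:
        total = sum(counts[prev].values()) + smoothing * len(states)
        probs[prev] = {nxt: (counts[prev][nxt] + smoothing) / total for nxt in states}
    return probs

def avg_log_likelihood(seq, probs):
    """Mean log-probability per transition; lower means more unusual."""
    steps = list(zip(seq, seq[1:]))
    return sum(math.log(probs[a][b]) for a, b in steps) / len(steps)

# Hypothetical historical sessions used to build the chain.
history = [
    ["login", "browse", "pay", "logout"],
    ["login", "browse", "browse", "pay", "logout"],
    ["login", "browse", "logout"],
]
probs = fit_transitions(history)
normal = avg_log_likelihood(["login", "browse", "pay", "logout"], probs)
odd = avg_log_likelihood(["login", "pay", "pay", "pay"], probs)
print(normal, odd)
```

A sequence would be flagged when its score falls below a threshold chosen from the scores of known-normal sequences; averaging per transition keeps long and short sequences comparable.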
  • 18. Some Anomaly Detection Methods. Other Methods.
      • Deep Learning (AutoEncoder): AutoEncoders can learn the latent representation of the data by using an encoder and a decoder together. The output of the AutoEncoder is compared to the input to detect and flag anomalies; anomalies are more likely to have a high reconstruction error.
      • Deep Learning (RNN-based): RNN-based architectures enable sequence prediction, and the network can flag an anomaly when needed. RNN-based models can detect anomalies in time series data; more capable architectures such as LSTM are also possible.
      • Generative Adversarial Nets: GANs combine two neural networks, a generator and a discriminator, and can be used to find anomalies. Deep Convolutional GANs are being used to learn a manifold of normal variability; this allows high accuracy in anomaly detection.
  • 19. Impetus DSP - Out of Pattern Transaction Detection
      • The Challenge: A major credit card company has several thousand corporate customers. Customers have unique compliance policies around acceptable spend. Build a scalable product to identify out-of-pattern spend behavior at card level.
      • Benefits Realized: The value-added service led to an increase in charge volumes of corporate customers. Demonstrated the value of external-facing product launches that leverage machine learning. Extending to fraud in travel expenses.
      • Impetus Contribution: Spend behavior of the card accounts was analyzed to identify normal spend. Implemented an algorithm to determine out-of-pattern transactions and scaled it to ~2M card accounts. Launched the product in < 3 months.
  • 20. Case Study – “Out of Pattern” Financial Transactions. Two possible reasons: 1) the customer’s situation may really have changed; 2) fraudulent usage.
  • 22. i. Introduction to web user interface for StreamAnalytix ii. Multi-tenancy feature support iii. Introduction to Data360 in StreamAnalytix • Data pipelines • Deploying the jobs • Real-time dashboards and monitoring in StreamAnalytix iv. Data Science in StreamAnalytix : • Network anomaly use case • Customer transaction anomaly detection use case • A-B testing use case v. Enterprise level features in StreamAnalytix • Versioning • Import & export data pipelines • Register entities • Data pipeline inspect
  • 23. Thank you. Questions? © 2017 Impetus Technologies Email: inquiry@streamanalytix.com Twitter : @StreamAnalytix