2. BigML, Inc 2
Association Discovery
Rule-Based Machine Learning
Greg Antell, Ph.D.
Product Manager and Machine Learning Scientist
BigML, Inc.
3. BigML, Inc 3Association Discovery
Association Discovery
An unsupervised learning technique
• No labels necessary
• Useful for data discovery
Finds "significant" correlations/associations/relations
• Shopping cart: Coffee and sugar
• Medical: High plasma glucose and diabetes
Expresses them as "if then rules"
• If "antecedent" then "consequent"
4. BigML, Inc 4Association Discovery
Review of methods: clustering
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
similar: clustering groups rows that resemble each other
6. BigML, Inc 6Association Discovery
Review: anomaly detection
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
anomaly: the Wed "tech" purchase of 2459 in zip 21350 is unlike every other row
8. BigML, Inc 8Association Discovery
Association Discovery
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
Rules:
Antecedent                          Consequent
{customer = Bob, account = 3421}    zip = 46140
{class = gas}                       amount < 100
14. BigML, Inc 14Association Discovery
Use Cases
• Data Discovery: how do instances relate?
• Market Basket Analysis: Items that go together
• Behaviors that occur together
• Web usage patterns
• Intrusion detection
• Fraud detection
• Medical risk factors
15. BigML, Inc 15Association Discovery
Association Metrics
• Coverage
• Support
• Confidence
• Lift
• Leverage
Associations between grocery items
16. BigML, Inc 16Association Discovery
Association Metrics: coverage
Coverage
Percentage of instances which match antecedent “A”
(Venn diagram: sets A and C within Instances)
17. BigML, Inc 17Association Discovery
Association Metrics: support
Support
Percentage of instances which match antecedent “A” and consequent “C”
(Venn diagram: sets A and C within Instances)
18. BigML, Inc 18Association Discovery
Association Metrics: confidence
Confidence
Percentage of instances in the antecedent which also contain the consequent.
Confidence = Support / Coverage
(Venn diagram: sets A and C within Instances, with Coverage and Support marked)
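As a rough illustration of the three definitions so far, the sketch below computes coverage, support, and confidence on the seven example transactions. It assumes the rule pairing {customer = Bob, account = 3421} -> zip = 46140 suggested by the slide layout; the helper name `rule_metrics` is hypothetical and is not part of any BigML API.

```python
# Example transactions from the slides (column names as shown there).
COLUMNS = ["date", "customer", "account", "auth", "class", "zip", "amount"]
ROWS = [
    ("Mon", "Bob", 3421, "pin", "clothes", 46140, 135),
    ("Tue", "Bob", 3421, "sign", "food", 46140, 401),
    ("Tue", "Alice", 2456, "pin", "food", 12222, 234),
    ("Wed", "Sally", 6788, "pin", "gas", 26339, 94),
    ("Wed", "Bob", 3421, "pin", "tech", 21350, 2459),
    ("Wed", "Bob", 3421, "pin", "gas", 46140, 83),
    ("Thr", "Sally", 6788, "sign", "food", 26339, 51),
]
RECORDS = [dict(zip(COLUMNS, row)) for row in ROWS]


def antecedent(record):
    """A: {customer = Bob, account = 3421} (assumed pairing)."""
    return record["customer"] == "Bob" and record["account"] == 3421


def consequent(record):
    """C: zip = 46140 (assumed pairing)."""
    return record["zip"] == 46140


def rule_metrics(records, antecedent, consequent):
    """Return (coverage, support, confidence) for the rule A -> C."""
    n = len(records)
    n_a = sum(1 for r in records if antecedent(r))
    n_ac = sum(1 for r in records if antecedent(r) and consequent(r))
    coverage = n_a / n                        # share of instances matching A
    support = n_ac / n                        # share matching both A and C
    confidence = n_ac / n_a if n_a else 0.0   # support / coverage
    return coverage, support, confidence


coverage, support, confidence = rule_metrics(RECORDS, antecedent, consequent)
print(f"coverage={coverage:.3f} support={support:.3f} confidence={confidence:.3f}")
# coverage=0.571 support=0.429 confidence=0.750
```

Four of the seven rows match the antecedent and three of those also match the consequent, which gives the 75% confidence.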
19. BigML, Inc 19Association Discovery
Association Metrics: confidence
Confidence ranges from 0% to 100%:
• 0%: A never implies C
• in between: A sometimes implies C
• 100%: A always implies C
(Venn diagrams of A and C within Instances for each case; panel labels: A >> C, A = C, A << C)
20. BigML, Inc 20Association Discovery
Association Metrics: lift
Lift
Ratio of observed support to support if A and C were statistically independent.
Lift = Support / [ p(A) * p(C) ] = Confidence / p(C)
(Venn diagrams: Independent vs. Observed overlap of A and C)
Problem: if p(C) is "small" then… lift may be large.
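A small worked example of the lift formula, using raw counts rather than probabilities. The first call uses counts taken from the example transaction table (7 rows; A = {customer = Bob, account = 3421} matches 4, C = zip 46140 matches 3, both match 3); the second call uses made-up counts to illustrate the "small p(C)" problem. The function name `lift` is only illustrative.

```python
def lift(n, n_a, n_c, n_ac):
    """Lift = support / (p(A) * p(C)), which equals confidence / p(C)."""
    p_a = n_a / n
    p_c = n_c / n
    support = n_ac / n
    return support / (p_a * p_c)


# Counts from the example table.
table_lift = lift(n=7, n_a=4, n_c=3, n_ac=3)
print(round(table_lift, 2))  # 1.75

# Hypothetical counts: C occurs once in 1000 instances, together with A.
# A single co-occurrence is enough to produce a very large lift.
rare_lift = lift(n=1000, n_a=1, n_c=1, n_ac=1)
print(round(rare_lift))  # 1000
```

The second result is why lift is usually read alongside support: a rule backed by a single instance can still score a very high lift.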
21. BigML, Inc 21Association Discovery
Association Metrics: lift
(Venn diagrams comparing the Independent and Observed overlap of A and C)
• Lift < 1: Negative Correlation
• Lift = 1: No Correlation
• Lift > 1: Positive Correlation
22. BigML, Inc 22Association Discovery
Association Metrics: leverage
Leverage
Difference of observed support and support if A and C were statistically independent.
Leverage = Support - [ p(A) * p(C) ]
(Venn diagrams: Independent vs. Observed overlap of A and C)
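The leverage formula can be checked by hand with exact fractions. The sketch below reuses counts from the example transaction table (7 rows; A matches 4, C matches 3, both match 3) and adds one made-up independent case; the function name `leverage` is only illustrative.

```python
from fractions import Fraction


def leverage(n, n_a, n_c, n_ac):
    """Leverage = support - p(A) * p(C), computed exactly."""
    support = Fraction(n_ac, n)
    return support - Fraction(n_a, n) * Fraction(n_c, n)


# Example table: 3/7 - (4/7)(3/7) = 9/49
table_leverage = leverage(n=7, n_a=4, n_c=3, n_ac=3)
print(table_leverage, round(float(table_leverage), 4))  # 9/49 0.1837

# Hypothetical independent case: p(A) = 0.5, p(C) = 0.4, support = 0.2
independent_leverage = leverage(n=100, n_a=50, n_c=40, n_ac=20)
print(independent_leverage)  # 0
```

Unlike lift, leverage is an absolute difference in support, so rules that cover very few instances cannot score highly.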
23. BigML, Inc 23Association Discovery
Association Metrics: leverage
(Venn diagrams comparing the Independent and Observed overlap of A and C)
• Leverage < 0: Negative Correlation
• Leverage = 0: No Correlation
• Leverage > 0: Positive Correlation
(Scale label: -1…)
24. BigML, Inc 24Association Discovery
Basic AD Configuration
Search Strategy: Support/Coverage/Confidence/Lift/Leverage
Max Number of Associations: 1 to 500 (default 100)
Max Items in Antecedent: 1 to 10 (default 4)
Complement Items: True / False
False: Coffee and…
True: Not Coffee and…
Minimum Significance: lower values reduce spurious rules
Consequent: Restrict rules to a specific consequent
25. BigML, Inc 25Association Discovery
Items Type
items
coffee, sugar, milk, honey, dish soap, bread
• Canonical example: shopping cart contents
• Single feature describing a list of items
• Each item separated by a comma (default)
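To make the items format concrete, here is a minimal sketch of splitting a comma-separated field into a set of items and counting which pairs occur together. Only the first cart comes from the slide; the other two carts and the helper name `parse_items` are assumptions for illustration and do not reflect how BigML parses the field internally.

```python
from collections import Counter
from itertools import combinations


def parse_items(field, separator=","):
    """Split one items field into a set of trimmed, non-empty items."""
    return {item.strip() for item in field.split(separator) if item.strip()}


carts = [
    "coffee, sugar, milk, honey, dish soap, bread",  # row shown on the slide
    "coffee, sugar",                                 # hypothetical cart
    "bread, milk",                                   # hypothetical cart
]
itemsets = [parse_items(cart) for cart in carts]

# Count co-occurring pairs; sorting gives each pair a single canonical order.
pair_counts = Counter(
    pair
    for items in itemsets
    for pair in combinations(sorted(items), 2)
)

print(pair_counts[("coffee", "sugar")])  # 2
print(pair_counts[("bread", "milk")])    # 2
```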
26. BigML, Inc 26Association Discovery
Use Cases
GOAL: Discover “interesting” rules about what store items
are typically purchased together.
• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at checkout
28. BigML, Inc 28Association Discovery
Summary
• Unsupervised learning technique for discovering
interesting associations
• Outputs antecedent/consequent rules
• Metrics: Support / Coverage / Confidence / Lift / Leverage
• Useful for “items” type and market basket analysis
• Applicable to understanding clusters and anomaly detectors
29. BigML, Inc 29
Anomaly Detection
Identifying Outliers in Data
Greg Antell, Ph.D.
Product Manager and Machine Learning Scientist
BigML, Inc.
30. BigML, Inc 30Anomaly Detection
Anomaly Detection?
• An unsupervised learning technique
• No labels necessary
• Useful for finding unusual instances
• Defines each unusual instance by an “anomaly score”
• In BigML: 0 = normal, 1 = unusual, and 0.7 ≫ 0.6 > 0.5
• Distribution of scores is returned
31. BigML, Inc 31Anomaly Detection
Recall: clusters
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
similar: clustering groups rows that resemble each other
33. BigML, Inc 33Anomaly Detection
Anomaly Detection
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
anomaly: Wed Bob 3421 pin tech 21350 2459
• Amount $2,459 is higher than all other transactions
• It is the only transaction
• in zip 21350
• for the purchase class “tech”
35. BigML, Inc 35Anomaly Detection
Use Cases
• Identify Incorrect Data - "looking for mistakes"
• Remove Outliers - "improve model quality"
• Intrusion Detection - "looking for unusual usage patterns"
• Fraud - "looking for unusual behavior"
• Model Competence / Input Data Drift
36. BigML, Inc 36Anomaly Detection
Removing Outliers
• Models need to generalize
• Outliers negatively impact generalization
GOAL: Use anomaly detector to identify most anomalous
points and then remove them before modeling.
(Workflow diagram: DATASET -> ANOMALY DETECTOR -> FILTERED DATASET -> CLEAN MODEL)
37. BigML, Inc 37Anomaly Detection
Removing Outliers
WARNING: never remove data just because it is convenient.
It is better to check first that it is actually an error.
38. BigML, Inc 38Anomaly Detection
Diabetes Anomalies
(Workflow diagram. Nodes: DIABETES SOURCE, DIABETES DATASET, TRAIN SET, TEST SET, ANOMALY DETECTOR, FILTER, CLEAN DATASET, MODEL (ALL), MODEL (CLEAN), EVALUATION (ALL), EVALUATION (CLEAN), COMPARE EVALUATIONS)
39. BigML, Inc 39Anomaly Detection
Fraud detection
• Dataset of credit card transactions
• Additional user profile information
GOAL: Cluster users by profile and use multiple anomaly
scores to detect transactions that are anomalous on multiple
levels.
(Figure: anomaly scores at three levels: Card Level, User Level, Similar User Level)
40. BigML, Inc 40Anomaly Detection
Model Competence
• After putting a model into production, the data being
predicted can become statistically different from the
training data.
• Train an anomaly detector at the same time as the model.
GOAL: For every prediction, compute an anomaly score. If the
anomaly score is high, then the model may not be competent
and should not be trusted.
At Training Time: DATASET -> MODEL and ANOMALY DETECTOR
At Prediction Time:
Prediction       T         T
Confidence       86 %      84 %
Anomaly Score    0.5367    0.7124
Competent?       Y         N
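The competence check can be sketched as a simple gate on the anomaly score. The cutoff of 0.6 below is a hypothetical value chosen only so that the two example predictions from the slide come out as Y and N; the slides do not state a threshold, and the class name `ScoredPrediction` is likewise illustrative.

```python
from dataclasses import dataclass

# Hypothetical cutoff, not a BigML default; tune it on your own data.
ANOMALY_THRESHOLD = 0.6


@dataclass
class ScoredPrediction:
    prediction: str
    confidence: float      # model confidence, as a fraction
    anomaly_score: float   # score of the input from the anomaly detector

    @property
    def competent(self) -> bool:
        """Trust the model only when the input looks like training data."""
        return self.anomaly_score < ANOMALY_THRESHOLD


# The two predictions shown on the slide.
rows = [
    ScoredPrediction("T", 0.86, 0.5367),
    ScoredPrediction("T", 0.84, 0.7124),
]
for row in rows:
    flag = "Y" if row.competent else "N"
    print(row.prediction, row.confidence, row.anomaly_score, flag)
```

Note that the two predictions have nearly the same confidence; only the anomaly score separates them, which is the point of training the detector alongside the model.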
41. BigML, Inc 41Anomaly Detection
Univariate Approach
• Single variable: heights, test scores, etc
• Assume the value is distributed “normally”
• Compute standard deviation
• a measure of how “spread out” the numbers are
• the square root of the variance (The average of the squared
differences from the Mean.)
• Depending on the number of instances, choose a “multiple”
of standard deviations to indicate an anomaly. A multiple of 3
for 1000 instances removes ~ 3 outliers.
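The univariate approach above can be sketched with the standard library alone. The data here is synthetic (1000 simulated heights plus one injected extreme value), and the function name `find_outliers` is an assumption for illustration; it is not the BigML API call.

```python
import random
import statistics


def find_outliers(values, multiple=3.0):
    """Return values more than `multiple` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # square root of the variance
    if std == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) > multiple * std]


random.seed(0)
heights = [random.gauss(170, 10) for _ in range(1000)]  # synthetic heights
heights.append(260.0)                                   # injected extreme value

outliers = find_outliers(heights, multiple=3.0)
print(len(outliers), 260.0 in outliers)
```

For normally distributed data roughly 0.3% of values fall beyond three standard deviations, which is where the "about 3 in 1000" figure comes from. The method relies on the normality assumption and handles only one variable at a time.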
42. BigML, Inc 42Anomaly Detection
Univariate Approach
(Histogram: frequency vs. measurement, with outliers in both tails)
• Available in BigML API
46. BigML, Inc 46Anomaly Detection
Human Expert
Groups: “Round”, “Skinny”, “Corners”
Odd ones out: “Skinny” but not “smooth”; No “Corners”; Not “Round”
Key Insight
The “most unusual” object is different in some way from every partition of the features.
(Figure label: Most unusual)
47. BigML, Inc 47Anomaly Detection
Human Expert
• Human used prior knowledge to select possible features
that separated the objects.
• “round”, “skinny”, “smooth”, “corners”
• Items were then separated based on the chosen features
• Each cluster was then examined to see which object fit
the least well in its cluster and did not fit any other cluster
48. BigML, Inc 48Anomaly Detection
Human Expert
• Length/Width
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require “edges” which have corners
• easier to count
• Smooth - true or false
Create features that capture these object differences
50. BigML, Inc 50Anomaly Detection
Random Splits
Know that “splits” matter - don’t know the order
(Decision tree diagram with True/False branches)
Splits: length/width > 5; smooth?; num surfaces = 6; length/width = 1; length/width < 2
Leaves: box, block, eraser, knob, penny/dime, bead, key, battery, screw
51. BigML, Inc 51Anomaly Detection
Isolation Forest
Grow a random decision tree until
each instance is in its own leaf
• Small Depth: “easy” to isolate
• Large Depth: “hard” to isolate
Now repeat the process several times and
use average Depth to compute anomaly
score: 0 (similar) -> 1 (dissimilar)
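The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not BigML's implementation: it follows only the branch containing the point (which gives the same depth as growing the full random tree along that path), it handles numeric features only, and it maps average depth to a score with the normalisation from the original isolation forest paper, 2^(-avg_depth / c(n)), which is an assumption here since the slides do not specify the mapping. All function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Number of random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only features whose values still vary can be split on.
    ranges = []
    for f in range(len(point)):
        column = [row[f] for row in data]
        low, high = min(column), max(column)
        if low < high:
            ranges.append((f, low, high))
    if not ranges:  # remaining rows are identical to the point
        return depth
    feature, low, high = rng.choice(ranges)
    threshold = rng.uniform(low, high)
    side = point[feature] < threshold
    kept = [row for row in data if (row[feature] < threshold) == side]
    return isolation_depth(point, kept, rng, depth + 1, max_depth)


def average_path_length(n):
    """Expected depth of an unsuccessful binary search tree lookup, c(n)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def anomaly_score(point, data, trees=200, seed=0):
    """Average depth over many random trees, mapped to a score in (0, 1)."""
    rng = random.Random(seed)
    mean_depth = sum(isolation_depth(point, data, rng) for _ in range(trees)) / trees
    return 2.0 ** (-mean_depth / average_path_length(len(data)))


# Synthetic data: a tight cluster plus one far-away point.
rng = random.Random(1)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
outlier = (8.0, 8.0)
data = cluster + [outlier]

score_out = anomaly_score(outlier, data)
score_in = anomaly_score(cluster[0], data)
print(score_out > score_in)  # True
```

The far-away point is usually cut off within the first few splits, so its average depth is small and its score is closer to 1; points inside the cluster need many more splits.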
53. BigML, Inc 53Anomaly Detection
Isolation Forest Scoring
      f1     f2    f3
i1    red    cat   ball
i2    red    cat   ball
i3    red    cat   box
i4    blue   dog   pen
For the instance i2, find the depth in each tree:
D = 3
D = 6
D = 2
Map avg depth to final score: S=0.45
55. BigML, Inc 55Anomaly Detection
Summary
• Anomaly detection is the process of finding unusual instances
• Some techniques and how they work:
• Univariate: standard deviation
• Benford’s law
• Isolation Forest
• Applications
• Filtering to improve models
• Finding mistakes, fraud, and intruders
• Knowing when to retrain a model (competence)
• In general… unsupervised learning techniques:
• Require more finesse and interpretation
• Are more commonly part of a multistep workflow