DutchMLSchool. Association Discovery and Topic Modeling (Unsupervised II) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
4. BigML, Inc #DutchMLSchool
Association Discovery
An unsupervised learning technique
• No labels necessary
• Useful for data discovery
Finds "significant" correlations/associations/relations
• Shopping cart: Coffee and sugar
• Medical: High plasma glucose and diabetes
Expresses them as "if-then" rules
• If "antecedent" then "consequent"
6. BigML, Inc #DutchMLSchool
Review of methods: clustering
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
[Annotation on the table: similar]
8. BigML, Inc #DutchMLSchool
Review: anomaly detection
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
[Annotation on the table: anomaly]
14. BigML, Inc #DutchMLSchool
Association Discovery
date customer account auth class zip amount
Mon Bob 3421 pin clothes 46140 135
Tue Bob 3421 sign food 46140 401
Tue Alice 2456 pin food 12222 234
Wed Sally 6788 pin gas 26339 94
Wed Bob 3421 pin tech 21350 2459
Wed Bob 3421 pin gas 46140 83
Thr Sally 6788 sign food 26339 51
Rules:
Antecedent                          Consequent
{customer = Bob, account = 3421}    zip = 46140
{class = gas}                       amount < 100
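As a rough check, the two rules can be tested against the rows of the table. The sketch below assumes the slide pairs {customer = Bob, account = 3421} with zip = 46140 and {class = gas} with amount < 100; the names `RULES` and `rule_counts` are illustrative and not part of any BigML API.

```python
# Rows copied from the slide's transaction table.
COLUMNS = ["date", "customer", "account", "auth", "class", "zip", "amount"]
ROWS = [
    ("Mon", "Bob", 3421, "pin", "clothes", 46140, 135),
    ("Tue", "Bob", 3421, "sign", "food", 46140, 401),
    ("Tue", "Alice", 2456, "pin", "food", 12222, 234),
    ("Wed", "Sally", 6788, "pin", "gas", 26339, 94),
    ("Wed", "Bob", 3421, "pin", "tech", 21350, 2459),
    ("Wed", "Bob", 3421, "pin", "gas", 46140, 83),
    ("Thr", "Sally", 6788, "sign", "food", 26339, 51),
]
records = [dict(zip(COLUMNS, row)) for row in ROWS]

# Each rule: (label, antecedent predicate, consequent predicate).
# The antecedent/consequent pairing is an assumption read off the slide.
RULES = [
    ("{customer = Bob, account = 3421} -> zip = 46140",
     lambda r: r["customer"] == "Bob" and r["account"] == 3421,
     lambda r: r["zip"] == 46140),
    ("{class = gas} -> amount < 100",
     lambda r: r["class"] == "gas",
     lambda r: r["amount"] < 100),
]


def rule_counts(rows, antecedent, consequent):
    """Return (rows matching the antecedent, rows matching both sides)."""
    matched = [r for r in rows if antecedent(r)]
    both = [r for r in matched if consequent(r)]
    return len(matched), len(both)


results = {label: rule_counts(records, a, c) for label, a, c in RULES}
for label, (n_a, n_both) in results.items():
    print(f"{label}: {n_a} rows match the antecedent, {n_both} also match the consequent")
```

On this table the Bob rule holds for 3 of the 4 rows that match its antecedent, and the gas rule holds for both matching rows, so a rule does not have to be true for every instance to be reported.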
15. BigML, Inc #DutchMLSchool
Use Cases
• Data Discovery: how do instances relate?
• Market Basket Analysis: Items that go together
• Behaviors that occur together
• Web usage patterns
• Intrusion detection
• Fraud detection
• Medical risk factors
18. BigML, Inc #DutchMLSchool
Association Metrics: support
Support
Percentage of instances that match both the antecedent “A” and the consequent “C”.
[Diagram: regions A and C within the set of all instances]
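A minimal sketch of this definition, assuming each instance is a set of items and that a rule side "matches" when all of its items are present. The shopping carts are invented for illustration.

```python
def support(instances, antecedent, consequent):
    """Fraction of instances that contain every item of both A and C."""
    wanted = set(antecedent) | set(consequent)
    hits = sum(1 for inst in instances if wanted <= set(inst))
    return hits / len(instances)


# Hypothetical shopping carts (not from the slides).
carts = [
    {"coffee", "sugar", "milk"},
    {"coffee", "sugar"},
    {"coffee", "bread"},
    {"bread", "milk"},
]

s = support(carts, {"coffee"}, {"sugar"})
print(f"support(coffee -> sugar) = {s:.0%}")
```

Two of the four carts contain both coffee and sugar, so the support of the rule is 50%.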
20. BigML, Inc #DutchMLSchool
Association Metrics: confidence
Confidence
Scale from 0% to 100%:
• 0%: A never implies C
• Between 0% and 100%: A sometimes implies C
• 100%: A always implies C
[Diagrams: A and C drawn within the instances for each case; panels labeled A >> C, A = C, A << C]
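The three points on the scale can be reproduced with a small sketch. It uses the standard definition of confidence (among instances that match A, the fraction that also match C), which is an assumption here since this slide only shows the scale; the tiny datasets are invented.

```python
def confidence(instances, antecedent, consequent):
    """Among instances matching A, the fraction that also match C."""
    a = set(antecedent)
    both = a | set(consequent)
    n_a = sum(1 for inst in instances if a <= set(inst))
    n_both = sum(1 for inst in instances if both <= set(inst))
    if n_a == 0:
        raise ValueError("the antecedent matches no instances")
    return n_both / n_a


# Hypothetical datasets, one per case on the scale.
never = [{"A"}, {"A"}, {"C"}]
sometimes = [{"A", "C"}, {"A"}, {"C"}]
always = [{"A", "C"}, {"A", "C"}, {"C"}]

for name, data in [("never", never), ("sometimes", sometimes), ("always", always)]:
    print(f"A {name} implies C: confidence = {confidence(data, {'A'}, {'C'}):.0%}")
```

Note that in the "always" dataset C also occurs without A: confidence only looks at the instances where A is present.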
21. BigML, Inc #DutchMLSchool
Association Metrics: lift
Lift
Ratio of observed support to the support expected if A and C were statistically independent.
Lift = Support / [ p(A) * p(C) ] = Confidence / p(C)
[Diagrams: A and C when independent vs. as observed]
Problem: if p(C) is "small" then lift may be large.
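The following sketch computes lift both ways (from support and from confidence) and illustrates the problem with a rare consequent. The datasets are invented, and exact fractions are used only to avoid floating-point noise in the comparison.

```python
from fractions import Fraction


def p(instances, items):
    """Exact fraction of instances that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for inst in instances if wanted <= inst)
    return Fraction(hits, len(instances))


def lift(instances, antecedent, consequent):
    """Observed support divided by the support expected under independence.

    Raises ZeroDivisionError if A or C never occurs.
    """
    support = p(instances, set(antecedent) | set(consequent))
    return support / (p(instances, antecedent) * p(instances, consequent))


def lift_via_confidence(instances, antecedent, consequent):
    """Same quantity, written as Confidence / p(C)."""
    support = p(instances, set(antecedent) | set(consequent))
    conf = support / p(instances, antecedent)
    return conf / p(instances, consequent)


# Hypothetical data: C is rare (1 instance in 100) and happens to occur with A.
rare_c = [{"A", "C"}] + [{"A"}] * 9 + [set()] * 90
# Hypothetical data: A and C occur together exactly as often as chance predicts.
independent = [{"A", "C"}] * 5 + [{"A"}] * 5 + [{"C"}] * 45 + [set()] * 45

print("rare C:     ", lift(rare_c, {"A"}, {"C"}))
print("independent:", lift(independent, {"A"}, {"C"}))
```

In the rare-C dataset a single co-occurrence is enough to produce a lift of 10, even though the rule's confidence is only 10%.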
22. BigML, Inc #DutchMLSchool
Association Metrics: lift
• Lift < 1: negative correlation
• Lift = 1: no correlation
• Lift > 1: positive correlation
[Diagrams: independent vs. observed overlap of A and C for each case]
23. BigML, Inc #DutchMLSchool
Association Metrics: leverage
Leverage
Difference between the observed support and the support expected if A and C were statistically independent.
Leverage = Support - [ p(A) * p(C) ]
[Diagrams: A and C when independent vs. as observed]
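A short sketch of the leverage formula, evaluated on three invented datasets chosen so that A and C co-occur less often than, exactly as often as, and more often than independence would predict. Exact fractions keep the zero case exactly zero.

```python
from fractions import Fraction


def p(instances, items):
    """Exact fraction of instances that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for inst in instances if wanted <= inst)
    return Fraction(hits, len(instances))


def leverage(instances, antecedent, consequent):
    """Observed support minus the support expected under independence."""
    support = p(instances, set(antecedent) | set(consequent))
    return support - p(instances, antecedent) * p(instances, consequent)


# Hypothetical datasets (not from the slides).
negative = [{"A"}, {"A"}, {"C"}, {"C"}]        # A and C never co-occur
independent = [{"A", "C"}, {"A"}, {"C"}, set()]  # co-occur as chance predicts
positive = [{"A", "C"}, {"A", "C"}, set(), set()]  # always co-occur

for name, data in [("negative", negative), ("independent", independent), ("positive", positive)]:
    print(f"{name}: leverage = {leverage(data, {'A'}, {'C'})}")
```

Because leverage is a difference rather than a ratio, its sign indicates the direction of the association and its size is bounded by the support itself.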
24. BigML, Inc #DutchMLSchool
Association Metrics: leverage
• Leverage < 0: negative correlation
• Leverage = 0: no correlation
• Leverage > 0: positive correlation
[Diagrams: independent vs. observed overlap of A and C for each case; scale: -1…]
25. BigML, Inc #DutchMLSchool
Items Type
Field: items
Value: coffee, sugar, milk, honey, dish soap, bread
• Canonical example: shopping cart contents
• Single feature describing a list of items
• Each item separated by a comma (default)
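A minimal sketch of how such a field might be split into individual items. The function name `parse_items` and its `separator` parameter are hypothetical, chosen only to mirror the "comma by default" behavior described above.

```python
def parse_items(field, separator=","):
    """Split an items field into a list of trimmed, non-empty items."""
    return [item.strip() for item in field.split(separator) if item.strip()]


cart = parse_items("coffee, sugar, milk, honey, dish soap, bread")
print(cart)
```

Splitting on the separator rather than on whitespace keeps multi-word items such as "dish soap" intact.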
26. BigML, Inc #DutchMLSchool
Use Cases
GOAL: Discover “interesting” rules about what store items
are typically purchased together.
• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at checkout
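As a first step toward such rules, one can simply count how often each pair of items shares a cart. The sketch below uses a handful of invented carts, not the 9,834-transaction dataset, and plain pair counting rather than a full association-discovery algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical carts; each row lists the items in one cart at checkout.
carts = [
    ["coffee", "sugar", "milk"],
    ["coffee", "sugar"],
    ["bread", "milk"],
    ["coffee", "bread", "sugar"],
]

pair_counts = Counter()
for cart in carts:
    # Sort so that each pair has one canonical ordering.
    for pair in combinations(sorted(set(cart)), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
print(f"most frequent pair: {top_pair} in {top_count} of {len(carts)} carts")
```

Frequent pairs are only candidates: metrics such as confidence, lift, and leverage are still needed to decide which ones are "interesting".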
30. BigML, Inc #DutchMLSchool
What is Topic Modeling?
• Unsupervised algorithm
• Learns only from text fields
• Finds hidden topics that model the text
[Diagram: Text Fields]
Questions:
• How is this different from the Text Analysis that BigML already offers?
• What does it output and how do we use it?
31. BigML, Inc #DutchMLSchool
What is Topic Modeling?
• Finds topics in your text fields
• A topic is a distribution over terms
• Terms with high probability in the same topic often occur
together in the same document
• Topics often correspond to real-world things that the
document may be “about” (e.g., sports, cooking,
technology)
• Each document is “about” one or more topics
• Usually each document is only about one or two topics
• But in practice we assign a probability to every topic for
every document
32. BigML, Inc #DutchMLSchool
Text Analysis
Be not afraid of greatness:
some are born great, some
achieve greatness, and
some have greatness
thrust upon 'em.
great: appears 4 times
1. Stem Words -> Tokens
2. Remove tokens that
occur too often
3. Remove tokens that do
not occur often enough
4. Count occurrences of
remaining “interesting”
tokens
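The steps above can be sketched on the quoted sentence. This is a toy version under stated assumptions: `naive_stem` only strips a trailing "ness" and stands in for a real stemmer, and a hand-picked stopword set stands in for the corpus-wide "occurs too often" filter (step 3 needs more than one document, so it is omitted).

```python
import re
from collections import Counter

# Hand-picked stand-in for "tokens that occur too often" (assumption).
STOPWORDS = frozenset({"be", "not", "of", "some", "are", "and", "have", "upon", "'em"})


def naive_stem(word):
    """Toy stemmer: strip a trailing 'ness' from longer words."""
    return word[:-4] if word.endswith("ness") and len(word) > 6 else word


def token_counts(text, stopwords=STOPWORDS):
    """Lowercase, split into words, stem, drop stopwords, and count."""
    words = re.findall(r"[a-z']+", text.lower())
    tokens = [naive_stem(w) for w in words]
    return Counter(t for t in tokens if t not in stopwords)


text = ("Be not afraid of greatness: some are born great, some achieve "
        "greatness, and some have greatness thrust upon 'em.")
counts = token_counts(text)
print(counts.most_common())
```

Stemming is what makes "great" appear 4 times: three occurrences of "greatness" and one of "great" collapse into the same token.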
33. BigML, Inc #DutchMLSchool
Text Analysis
Be not afraid of greatness:
some are born great, some achieve
greatness, and some have greatness
thrust upon 'em.
… great afraid born achieve … …
… 4 1 1 1 … …
… … … … … … …
Model
The token “great”
occurs more than 3 times
The token “afraid”
occurs no more than once
36. BigML, Inc #DutchMLSchool
Text Analysis vs. Topic Modeling
Text Analysis:
• Creates thousands of hidden token counts
• Token counts are independently uninteresting
• No semantic importance
• Co-occurrence limited to consecutive n-grams
Topic Model:
• Creates tens of topics that model the text
• Topics are independently interesting
• Semantic meaning extracted
• Topics indicate broader co-occurrences
37. BigML, Inc #DutchMLSchool
Generating Documents
• "Machine" that generates a random word with equal probability with each pull.
• Pull a random number of times to generate a document.
• All documents can be generated, but most are nonsense.
Machine vocabulary: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…
Generated documents:
• shoe asteroid flashlight pizza…
• plate giraffe purple jump…
• Be not afraid of greatness: some are born great, some achieve greatness…
word probability
shoe ϵ
asteroid ϵ
flashlight ϵ
pizza ϵ
… ϵ
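A minimal sketch of this "machine", assuming the vocabulary consists only of the words visible on the slide and that a document has between 1 and `max_pulls` words (both are illustrative choices).

```python
import random

# Words visible on the slide; the real vocabulary is elided there.
VOCABULARY = [
    "cat", "shoe", "zebra", "ball", "tree", "jump", "pen", "asteroid",
    "cable", "box", "step", "cabinet", "yellow", "plate", "flashlight",
]


def generate_document(rng, vocabulary, max_pulls=10):
    """Pull the machine a random number of times; every word is equally likely."""
    n_pulls = rng.randint(1, max_pulls)
    return [rng.choice(vocabulary) for _ in range(n_pulls)]


rng = random.Random(0)  # seeded so repeated runs give the same documents
for _ in range(3):
    print(" ".join(generate_document(rng, VOCABULARY)))
```

Every word sequence has a nonzero chance of appearing, but a uniform machine has no reason to prefer meaningful text over nonsense.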
38. BigML, Inc #DutchMLSchool
Topic Model
Intuition:
• Written documents have meaning - one way to describe meaning is to assign a topic.
• For our random machine, the topic can be thought of as increasing the probability of certain words.
Machine vocabulary: cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…
Topic: travel
Generated document: airplane passport pizza …
word probability
travel 23.55%
airplane 2.33%
mars 0.003%
mantle ϵ
… ϵ
Topic: space
Generated document: mars quasar lightyear soda
word probability
space 38.94%
airplane ϵ
mars 13.43%
mantle 0.05%
… ϵ
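The effect of a topic on the machine can be sketched as weighted sampling. The weights below are the percentages shown for each topic; words listed as ϵ or elided are left out rather than guessed, so the tables are partial and are treated as relative weights.

```python
import random

# Partial term weights per topic, taken from the slide (ϵ entries omitted).
TOPICS = {
    "travel": {"travel": 23.55, "airplane": 2.33, "mars": 0.003},
    "space": {"space": 38.94, "mars": 13.43, "mantle": 0.05},
}


def sample_words(rng, topic, n):
    """Draw n words, each with probability proportional to its topic weight."""
    words = list(TOPICS[topic])
    weights = [TOPICS[topic][w] for w in words]
    return rng.choices(words, weights=weights, k=n)


rng = random.Random(0)
travel_doc = sample_words(rng, "travel", 1000)
space_doc = sample_words(rng, "space", 1000)
print("travel topic:", {w: travel_doc.count(w) for w in TOPICS["travel"]})
print("space topic: ", {w: space_doc.count(w) for w in TOPICS["space"]})
```

The same word ("mars") can appear in both topics, but with very different probabilities.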
39. BigML, Inc #DutchMLSchool
Topic Model
• Each text field in a row is concatenated into a document
• The documents are analyzed to generate "k" related topics
• Each topic is represented by a distribution of term probabilities
Machine vocabulary (same for every topic): cat shoe zebra ball tree jump pen asteroid cable box step cabinet yellow plate flashlight…
Topic: "1"
word probability
travel 23.55%
airplane 2.33%
mars 0.003%
mantle ϵ
… ϵ
Topic: "2"
word probability
space 38.94%
airplane ϵ
mars 13.43%
mantle 0.05%
… ϵ
…
Topic: "k"
word probability
shoe 12.12%
coffee 3.39%
telephone 13.43%
paper 4.11%
… ϵ
Example documents:
• airplane passport pizza …
• plate giraffe purple jump…
41. BigML, Inc #DutchMLSchool
Topic Distribution
Intuition:
• Any given document is likely a mixture of the modeled topics…
• This can be represented as a distribution of topic probabilities
Document: Will 2020 be the year that humans will embrace space exploration and finally travel to Mars?
Topic: travel (11%)
word probability
travel 23.55%
airplane 2.33%
mars 0.003%
mantle ϵ
… ϵ
Topic: space (89%)
word probability
space 38.94%
airplane ϵ
mars 13.43%
mantle 0.05%
… ϵ
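One way to read the 11% / 89% split is as a recipe for generating the document: for each word, first pick a topic according to the document's topic distribution, then pick a word from that topic. The sketch below follows that reading; it is a simplified illustration with the slide's partial term weights, not BigML's actual inference procedure.

```python
import random

# Partial term weights per topic, taken from the slide (ϵ entries omitted).
TOPIC_WORDS = {
    "travel": {"travel": 23.55, "airplane": 2.33, "mars": 0.003},
    "space": {"space": 38.94, "mars": 13.43, "mantle": 0.05},
}
# Topic distribution of the example document.
DOC_TOPICS = {"travel": 0.11, "space": 0.89}


def generate(rng, doc_topics, n_words):
    """Return (topic, word) pairs drawn from a mixture of topics."""
    names = list(doc_topics)
    mix = [doc_topics[t] for t in names]
    out = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=mix)[0]
        words = list(TOPIC_WORDS[topic])
        weights = [TOPIC_WORDS[topic][w] for w in words]
        out.append((topic, rng.choices(words, weights=weights)[0]))
    return out


rng = random.Random(0)
pairs = generate(rng, DOC_TOPICS, 2000)
space_share = sum(1 for topic, _ in pairs if topic == "space") / len(pairs)
print(f"share of words drawn from 'space': {space_share:.1%}")
```

Topic modeling works in the opposite direction: given only the words, it estimates both the topics and each document's topic probabilities.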
44. BigML, Inc #DutchMLSchool
Topic Model Use Cases
• As a preprocessor for other techniques
• Building better models
• Bootstrapping categories for classification
• Recommendation
• Discovery in large, heterogeneous text datasets