Slides supporting the book "Process Mining: Discovery, Conformance, and Enhancement of Business Processes" by Wil van der Aalst. See also http://springer.com/978-3-642-19344-6 (ISBN 978-3-642-19344-6) and the website http://www.processmining.org/book/start, which provides sample logs.
2. Overview
Chapter 1: Introduction
Part I: Preliminaries
− Chapter 2: Process Modeling and Analysis
− Chapter 3: Data Mining
Part II: From Event Logs to Process Models
− Chapter 4: Getting the Data
− Chapter 5: Process Discovery: An Introduction
− Chapter 6: Advanced Process Discovery Techniques
Part III: Beyond Process Discovery
− Chapter 7: Conformance Checking
− Chapter 8: Mining Additional Perspectives
− Chapter 9: Operational Support
Part IV: Putting Process Mining to Work
− Chapter 10: Tool Support
− Chapter 11: Analyzing “Lasagna Processes”
− Chapter 12: Analyzing “Spaghetti Processes”
Part V: Reflection
− Chapter 13: Cartography and Navigation
− Chapter 14: Epilogue
3. Data mining
• The growth of the “digital universe” is the main
driver for the popularity of data mining.
• Initially, the term “data mining” had a negative
connotation (“data snooping”, “fishing”, and “data
dredging”).
• Now a mature discipline.
• Data-centric, not process-centric.
4. Data set 1
Data about 860 recently deceased persons, used to study the effects of drinking, smoking, and body weight on life expectancy.
Questions:
- What is the effect of smoking and drinking on a person’s body weight?
- Do people who smoke also drink?
- What factors influence a person’s life expectancy the most?
- Can one identify groups of people having a similar lifestyle?
5. Data set 2
Data about 420 students, used to investigate relationships among course grades and the students’ overall performance in the Bachelor program.
Questions:
- Are the marks of certain courses highly correlated?
- Which electives do excellent (cum laude) students take?
- Which courses significantly delay the moment of graduation?
- Why do students drop out?
- Can one identify groups of students having similar study behavior?
6. Data set 3
Data on 240 customer orders in a coffee bar, recorded by the cash register.
Questions:
- Which products are frequently purchased together?
- When do people buy a particular product?
- Is it possible to characterize typical customer groups?
- How can the sales of products with a higher margin be promoted?
7. Variables
• A data set (sample or table) consists of instances (individuals, entities, cases, objects, or records).
• Variables are often referred to as attributes, features, or data elements.
• Two types:
− categorical variables: either ordinal (high-med-low, cum laude-passed-failed) or nominal (true-false, red-pink-green)
− numerical variables (ordered, cannot be enumerated easily)
8. Supervised Learning
• Labeled data, i.e., there is a response variable that
labels each instance.
• Goal: explain response variable (dependent variable)
in terms of predictor variables (independent
variables).
• Classification techniques (e.g., decision tree
learning) assume a categorical response variable
and the goal is to classify instances based on the
predictor variables.
• Regression techniques assume a numerical
response variable. The goal is to find a function that
fits the data with the least error.
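As a sketch of both flavors, assuming scikit-learn and a made-up miniature version of data set 1 (all names and values invented for illustration):

```python
# Classification (categorical response) vs. regression (numerical
# response); the tiny in-line data set is invented for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Predictor variables: [drinker (0/1), smoker (0/1), weight (kg)]
X = [[1, 1, 80], [0, 1, 95], [0, 0, 70], [1, 0, 100], [0, 0, 60]]

# Classification: categorical response ("young"/"old" at time of death)
y_class = ["young", "young", "old", "old", "old"]
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[1, 1, 85]]))       # e.g., ['young']

# Regression: numerical response (age at time of death), fitted with
# least error over the training data
y_num = [67, 71, 82, 75, 88]
reg = LinearRegression().fit(X, y_num)
print(reg.predict([[1, 1, 85]]))       # a numerical estimate
```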
9. Unsupervised Learning
• Unsupervised learning assumes unlabeled data, i.e.,
the variables are not split into response and
predictor variables.
• Examples: clustering (e.g., k-means clustering and
agglomerative hierarchical clustering) and pattern
discovery (association rules)
10. Decision tree learning: data set 1
[Figure: decision tree learned for data set 1; each leaf shows (number of instances/classification errors)]
- smoker = yes: young (195/11)
- smoker = no, drinker = no: old (65/2)
- smoker = no, drinker = yes, weight < 90: old (219/34)
- smoker = no, drinker = yes, weight ≥ 90: young (381/55)
11. Decision tree learning: data set 2
[Figure: decision tree learned for data set 2; internal nodes test the grades for logic, programming, linear algebra, and operations research, and the leaves classify students as failed, passed, or cum laude, e.g., failed (79/10), cum laude (20/2), passed (87/11), passed (82/7), and failed (101/8).]
12. Decision tree learning: data set 3
[Figure: decision tree learned for data set 3; each leaf shows (number of instances/classification errors)]
- tea ≥ 1: muffin (30/1)
- tea = 0, latte = 0: no muffin (189/10)
- tea = 0, latte = 1, espresso = 0: muffin (6/2)
- tea = 0, latte = 1, espresso ≥ 1: no muffin (11/3)
- tea = 0, latte ≥ 2: muffin (4/0)
13. Basic idea
• Split the set of instances into subsets such that the variation within each subset becomes smaller.
• Based on the notion of entropy or similar measures.
• Minimize average entropy, i.e., maximize information gain, per step.
[Figure: candidate splits for data set 1. The root holds 546 young and 314 old instances with entropy E = 0.946848. Splitting on attribute smoker yields smoker = yes: young (195/11) with 184 young and 11 old, E = 0.313027, and smoker = no: young (665/303) with 362 young and 303 old, E = 0.994314; the weighted overall entropy is 0.839836, so the information gain is 0.107012. Splitting the smoker = no subset further on attribute drinker yields drinker = yes: young (600/240) with 360 young and 240 old, E = 0.970951, and drinker = no: old (65/2) with 2 young and 63 old, E = 0.198234; the overall entropy drops to 0.763368, an additional information gain of 0.076468.]
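As a sketch, the fragment below recomputes the entropies and information gains quoted above, using only the class counts from the figure (plain Python, no libraries assumed):

```python
from math import log2

def entropy(counts):
    # E = -sum(p_i * log2(p_i)) over the class distribution
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, subsets):
    # gain = E(parent) - weighted average entropy of the subsets
    n = sum(parent)
    return entropy(parent) - sum(sum(s) / n * entropy(s) for s in subsets)

root = [546, 314]                                 # young, old
print(entropy(root))                              # 0.946848...
# Split on smoker: yes -> (184 young, 11 old), no -> (362, 303)
print(information_gain(root, [[184, 11], [362, 303]]))           # 0.107012...
# After also splitting smoker=no on drinker, the leaves are
# (184, 11), (360, 240), and (2, 63); overall E drops to 0.763368,
# so the gain relative to the root is 0.107012 + 0.076468:
print(information_gain(root, [[184, 11], [360, 240], [2, 63]]))  # 0.183480...
```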
14. Clustering
[Figure: two scatter plots of instances by weight (x-axis) and age (y-axis); clustering partitions the instances into clusters A, B, and C, each marked by its centroid (+).]
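A minimal k-means sketch matching this picture, assuming scikit-learn and invented (weight, age) points:

```python
# k-means clustering on (weight, age) pairs; data invented.
from sklearn.cluster import KMeans

points = [[60, 25], [65, 30], [62, 28],    # a light/young group
          [90, 45], [95, 50], [92, 48],    # a heavy/middle-aged group
          [70, 70], [75, 75], [72, 72]]    # an older group

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(km.labels_)            # cluster index (0..2) per instance
print(km.cluster_centers_)   # the three centroids, the "+" marks above
```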
17. Levels introduced by agglomerative hierarchical clustering
[Figure: (a) nested clusters over the instances a–j, merged stepwise: {a,b}, {c,d}, {f,g}, {h,i}, then {a,b,c,d}, {e,f,g}, {h,i,j}, then {e,f,g,h,i,j}, until a single cluster {a,...,j} remains; (b) the corresponding dendrogram over a–j.]
Any horizontal line in the dendrogram corresponds to a concrete clustering at a particular level of abstraction.
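A sketch of this bottom-up merging, assuming SciPy; the 2-D coordinates for items a–j are invented for illustration:

```python
# Agglomerative hierarchical clustering; coordinates are invented.
from scipy.cluster.hierarchy import linkage, fcluster

items = list("abcdefghij")
coords = [[1, 1], [1, 2], [3, 1], [3, 2], [6, 5],
          [7, 6], [7, 7], [9, 6], [9, 7], [10, 8]]

Z = linkage(coords, method="average")   # bottom-up (agglomerative) merges

# "Cutting" the dendrogram with a horizontal line at some distance
# threshold yields a concrete clustering at that abstraction level.
for threshold in (1.5, 4.0):
    labels = fcluster(Z, t=threshold, criterion="distance")
    print(threshold, dict(zip(items, labels)))
```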
20. Example
Rule: people who order tea and latte also order muffins ({tea, latte} ⇒ {muffin}).
• Support should be as high as possible (but it will be low when rules involve many items).
• Confidence should be close to 1.
• High lift values suggest a positive correlation (lift = 1 if antecedent and consequent are independent).
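As a sketch, the fragment below computes the three metrics for this rule over a transaction list using the standard definitions (support = N_X∪Y/N, confidence = N_X∪Y/N_X, lift = confidence/(N_Y/N)); the transactions themselves are invented:

```python
# Support, confidence, and lift for {tea, latte} => {muffin};
# the transactions are invented for illustration.
transactions = [
    {"tea", "latte", "muffin"},
    {"tea", "latte", "muffin"},
    {"tea", "espresso"},
    {"latte"},
    {"espresso", "muffin"},
    {"tea", "latte"},
]

def rule_metrics(transactions, X, Y):
    N = len(transactions)
    n_x  = sum(1 for t in transactions if X <= t)        # X subset of t
    n_y  = sum(1 for t in transactions if Y <= t)
    n_xy = sum(1 for t in transactions if (X | Y) <= t)
    support    = n_xy / N                # how often the rule applies
    confidence = n_xy / n_x              # should be close to 1
    lift       = confidence / (n_y / N)  # 1 if X and Y are independent
    return support, confidence, lift

print(rule_metrics(transactions, {"tea", "latte"}, {"muffin"}))
# -> (0.333..., 0.666..., 1.333...)
```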
24. Episode mining
[Figure: an event sequence a c b e d f c b b c a e b e c d c b, with time stamps between 10 and 37, is scanned with 32 sliding time windows of length 5; three candidate episodes E1, E2, and E3 are shown, each a partial order over the event types a, b, c, and d.]
25. Occurrences
[Figure: the same event sequence with occurrences marked; episode E2 occurs in 16 of the 32 time windows, and individual occurrences of E1 and E3 are indicated on the time axis.]
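A simplified sketch of window-based occurrence counting follows; it reduces an episode to a serial sequence of event types (real episodes are partial orders) and uses invented timestamps. Counting all windows that overlap the sequence is what yields 37 − 10 + 5 = 32 windows on the slide (first event at time 10, last at 37, window length 5).

```python
# Simplified episode-occurrence counting in sliding time windows;
# episodes are treated as serial sequences, timestamps are invented.
events = list(zip("acbedfcbbcaebecdcb", range(10, 28)))  # (type, time)

def occurs(episode, window):
    """True if the episode's events occur in order within the window."""
    it = iter(e for e, _ in window)
    return all(sym in it for sym in episode)   # ordered subsequence test

def count_windows(episode, events, length=5):
    lo, hi = events[0][1], events[-1][1]
    count = 0
    # Slide a window over the time axis; windows partially overlapping
    # the sequence are counted as well, giving hi - lo + length windows.
    for start in range(lo - length + 1, hi + 1):
        window = [(e, t) for e, t in events if start <= t < start + length]
        count += occurs(episode, window)
    return count

print(count_windows(("a", "b"), events))   # windows with a before b
```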
26. Hidden Markov models
• Given an observation sequence, how to compute the probability of the sequence given a hidden Markov model?
• Given an observation sequence and a hidden Markov model, how to compute the most likely “hidden path” in the model?
• Given a set of observation sequences, how to derive the hidden Markov model that maximizes the probability of producing these sequences?
[Figure: an example hidden Markov model with hidden states s1, s2, s3 (legend: s = state, x = observation), arcs labeled with transition probabilities (0.7, 0.3, 0.2, 0.8, 1.0), and observations a–e emitted with observation probabilities such as 0.5/0.5, 0.6/0.4, and 0.8/0.2.]
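These three questions are classically answered by the forward algorithm, the Viterbi algorithm, and the Baum-Welch algorithm, respectively. Below is a minimal sketch of the forward algorithm for the first question; the model parameters are made up, since the figure's probabilities are only partially recoverable:

```python
# Forward algorithm: P(observation sequence | HMM). The parameters
# below are invented; they are not the ones in the figure.
states = ["s1", "s2", "s3"]
init   = {"s1": 1.0, "s2": 0.0, "s3": 0.0}
trans  = {"s1": {"s2": 0.7, "s3": 0.3},
          "s2": {"s1": 0.2, "s3": 0.8},
          "s3": {"s3": 1.0}}
emit   = {"s1": {"a": 0.5, "b": 0.5},
          "s2": {"c": 0.6, "d": 0.4},
          "s3": {"d": 0.8, "e": 0.2}}

def forward(obs):
    """Sum the probability of obs over all hidden paths."""
    alpha = {s: init[s] * emit[s].get(obs[0], 0.0) for s in states}
    for x in obs[1:]:
        alpha = {s: sum(alpha[r] * trans[r].get(s, 0.0) for r in states)
                    * emit[s].get(x, 0.0) for s in states}
    return sum(alpha.values())

print(forward(["a", "c", "d"]))   # 0.1344 for these parameters
```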
27. Relation between data mining and
process mining
• Process mining: about end-to-end processes.
• Data mining: data-centric and not process-centric.
• Judging the quality of data mining and process
mining: many similarities, but also some differences.
• Clearly, process mining techniques can benefit from
experiences in the data mining field.
• Let us now focus on the quality of mining results.
28. Confusion matrix
[Figure: the decision tree learned for data set 2 (see slide 11) shown next to its confusion matrix.]
Confusion matrix (rows: actual class, columns: predicted class):
            failed  passed  cum laude
failed         178      22          0
passed          21     175          2
cum laude        1       3         18
29. Confusion matrix: metrics
(a) Confusion matrix for a two-class (+/−) problem (rows: actual class, columns: predicted class):
       +    −
+     tp   fn   | p
−     fp   tn   | n
      p’   n’   | N
(b) Metrics:
name       formula
error      (fp+fn)/N
accuracy   (tp+tn)/N
tp-rate    tp/p
fp-rate    fp/n
precision  tp/p’
recall     tp/p
tp is the number of true positives, i.e., instances that are correctly classified as positive.
fn is the number of false negatives, i.e., instances that are predicted to be negative but should have been classified as positive.
fp is the number of false positives, i.e., instances that are predicted to be positive but should have been classified as negative.
tn is the number of true negatives, i.e., instances that are correctly classified as negative.
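A sketch computing these metrics from the four counts; the counts themselves are arbitrary example values, not taken from the slides:

```python
# Metrics derived from a two-class confusion matrix; counts invented.
tp, fn, fp, tn = 40, 10, 5, 45
p, n = tp + fn, fp + tn        # actual positives / negatives
p_, n_ = tp + fp, fn + tn      # predicted positives / negatives (p', n')
N = p + n

print("error    ", (fp + fn) / N)
print("accuracy ", (tp + tn) / N)
print("tp-rate  ", tp / p)     # a.k.a. recall
print("fp-rate  ", fp / n)
print("precision", tp / p_)
print("recall   ", tp / p)
```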
30. Example
[Figure: the candidate splits from slide 13 (splitting on smoker, then drinker, with the corresponding entropies and information gains) shown together with two confusion matrices.]
(a) Without any split, every instance is classified as young (rows: actual class, columns: predicted class):
        young   old
young     546     0
old       314     0
(b) After splitting on smoker and drinker:
        young   old
young     544     2
old       251    63
The splits raise the accuracy from 546/860 ≈ 0.635 to (544+63)/860 ≈ 0.706.
31. Cross-validation
[Figure: the data set is split into a training set and a test set; the learning algorithm derives a model from the training set, and testing the model on the test set yields a performance indicator.]
32. k-fold cross-validation
[Figure: the data set is split into k data sets; in rotation, each of the k parts serves once as the test set while the remaining parts form the training set, yielding a performance indicator per rotation.]
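A minimal sketch, assuming scikit-learn and a synthetic data set generated only for illustration:

```python
# k-fold cross-validation; the data set is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split into k=10 folds; each fold is used once as the test set while
# the other 9 folds form the training set.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(scores)          # one performance indicator (accuracy) per fold
print(scores.mean())   # combined estimate
```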
33. Occam’s Razor
• Principle attributed to the 14th-century English logician William of Ockham.
• The principle states that “one should not increase, beyond what is necessary, the number of entities required to explain anything”, i.e., one should look for the “simplest model” that can explain what is observed in the data set.
• The Minimal Description Length (MDL) principle tries to operationalize Occam’s razor. In MDL, performance is judged on the training data alone and not measured against new, unseen instances. The basic idea is that the “best” model is the one that minimizes the encoding of both model and data set.
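In its common two-part form (not spelled out on the slide), MDL selects the model minimizing the summed code lengths of the model itself and of the data encoded with its help:

```latex
\hat{M} \;=\; \operatorname*{arg\,min}_{M}\; \bigl( L(M) + L(D \mid M) \bigr)
```

Here L(M) is the number of bits needed to encode the model and L(D|M) the number of bits needed to encode the data set given the model: a trivial model is cheap to encode but leaves many exceptions to describe, while an overfitted model encodes the data cheaply but is itself expensive to describe.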