Slides supporting the book "Process Mining: Discovery, Conformance, and Enhancement of Business Processes" by Wil van der Aalst. See also http://springer.com/978-3-642-19344-6 (ISBN 978-3-642-19344-6) and the website http://www.processmining.org/book/start, which provides sample logs.
2. Overview
Chapter 1: Introduction
Part I: Preliminaries
− Chapter 2: Process Modeling and Analysis
− Chapter 3: Data Mining
Part II: From Event Logs to Process Models
− Chapter 4: Getting the Data
− Chapter 5: Process Discovery: An Introduction
− Chapter 6: Advanced Process Discovery Techniques
Part III: Beyond Process Discovery
− Chapter 7: Conformance Checking
− Chapter 8: Mining Additional Perspectives
− Chapter 9: Operational Support
Part IV: Putting Process Mining to Work
− Chapter 10: Tool Support
− Chapter 11: Analyzing “Lasagna Processes”
− Chapter 12: Analyzing “Spaghetti Processes”
Part V: Reflection
− Chapter 13: Cartography and Navigation
− Chapter 14: Epilogue
3. Data mining
• The growth of the “digital universe” is the main
driver for the popularity of data mining.
• Initially, the term “data mining” had a negative
connotation (“data snooping”, “fishing”, and “data
dredging”).
• Now a mature discipline.
• Data-centric, not process-centric.
4. Data set 1
Data about 860 recently deceased persons, used to study the effects of drinking, smoking, and body weight on life expectancy.
Questions:
- What is the effect of smoking and drinking on a person’s body weight?
- Do people who smoke also drink?
- What factors influence a person’s life expectancy the most?
- Can one identify groups of people having a similar lifestyle?
5. Data set 2
Data about 420 students, used to investigate relationships among course grades and the students’ overall performance in the Bachelor program.
Questions:
- Are the marks of certain courses highly correlated?
- Which electives do excellent (cum laude) students take?
- Which courses significantly delay the moment of graduation?
- Why do students drop out?
- Can one identify groups of students having similar study behavior?
6. Data set 3
Data on 240 customer orders in a coffee bar, recorded by the cash register.
Questions:
- Which products are frequently purchased together?
- When do people buy a particular product?
- Is it possible to characterize typical customer groups?
- How can the sales of products with a higher margin be promoted?
7. Variables
• A data set (sample or table) consists of instances (individuals, entities, cases, objects, or records).
• Variables are often referred to as attributes, features, or data elements.
• Two types:
− categorical variables: either ordinal (high-med-low, cum laude-passed-failed) or nominal (true-false, red-pink-green)
− numerical variables (ordered, cannot be enumerated easily)
8. Supervised Learning
• Labeled data, i.e., there is a response variable that
labels each instance.
• Goal: explain response variable (dependent variable)
in terms of predictor variables (independent
variables).
• Classification techniques (e.g., decision tree
learning) assume a categorical response variable
and the goal is to classify instances based on the
predictor variables.
• Regression techniques assume a numerical
response variable. The goal is to find a function that
fits the data with the least error.
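As a sketch of both flavors, assuming scikit-learn and a made-up miniature version of data set 1 (all names and values invented for illustration):

```python
# Classification (categorical response) vs. regression (numerical
# response); the tiny in-line data set is invented for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Predictor variables: [drinker (0/1), smoker (0/1), weight (kg)]
X = [[1, 1, 80], [0, 1, 95], [0, 0, 70], [1, 0, 100], [0, 0, 60]]

# Classification: categorical response ("young"/"old" at time of death)
y_class = ["young", "young", "old", "old", "old"]
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[1, 1, 85]]))       # e.g., ['young']

# Regression: numerical response (age at time of death), fitted with
# least error over the training data
y_num = [67, 71, 82, 75, 88]
reg = LinearRegression().fit(X, y_num)
print(reg.predict([[1, 1, 85]]))       # a numerical estimate
```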
9. Unsupervised Learning
• Unsupervised learning assumes unlabeled data, i.e.,
the variables are not split into response and
predictor variables.
• Examples: clustering (e.g., k-means clustering and
agglomerative hierarchical clustering) and pattern
discovery (association rules)
10. Decision tree learning: data set 1
[Figure: decision tree learned for data set 1; each leaf shows (number of instances/classification errors)]
- smoker = yes: young (195/11)
- smoker = no, drinker = no: old (65/2)
- smoker = no, drinker = yes, weight < 90: old (219/34)
- smoker = no, drinker = yes, weight ≥ 90: young (381/55)
11. Decision tree learning: data set 2
[Figure: decision tree learned for data set 2; internal nodes test the grades for logic, programming, linear algebra, and operations research, and the leaves classify students as failed, passed, or cum laude, e.g., failed (79/10), cum laude (20/2), passed (87/11), passed (82/7), and failed (101/8).]
12. Decision tree learning: data set 3
[Figure: decision tree learned for data set 3; each leaf shows (number of instances/classification errors)]
- tea ≥ 1: muffin (30/1)
- tea = 0, latte = 0: no muffin (189/10)
- tea = 0, latte = 1, espresso = 0: muffin (6/2)
- tea = 0, latte = 1, espresso ≥ 1: no muffin (11/3)
- tea = 0, latte ≥ 2: muffin (4/0)
13. Basic idea
• Split the set of instances into subsets such that the variation within each subset becomes smaller.
• Based on the notion of entropy or similar measures.
• Minimize average entropy, i.e., maximize information gain, per step.
[Figure: candidate splits for data set 1. The root holds 546 young and 314 old instances with entropy E = 0.946848. Splitting on attribute smoker yields smoker = yes: young (195/11) with 184 young and 11 old, E = 0.313027, and smoker = no: young (665/303) with 362 young and 303 old, E = 0.994314; the weighted overall entropy is 0.839836, so the information gain is 0.107012. Splitting the smoker = no subset further on attribute drinker yields drinker = yes: young (600/240) with 360 young and 240 old, E = 0.970951, and drinker = no: old (65/2) with 2 young and 63 old, E = 0.198234; the overall entropy drops to 0.763368, an additional information gain of 0.076468.]
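As a sketch, the fragment below recomputes the entropies and information gains quoted above, using only the class counts from the figure (plain Python, no libraries assumed):

```python
from math import log2

def entropy(counts):
    # E = -sum(p_i * log2(p_i)) over the class distribution
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, subsets):
    # gain = E(parent) - weighted average entropy of the subsets
    n = sum(parent)
    return entropy(parent) - sum(sum(s) / n * entropy(s) for s in subsets)

root = [546, 314]                                 # young, old
print(entropy(root))                              # 0.946848...
# Split on smoker: yes -> (184 young, 11 old), no -> (362, 303)
print(information_gain(root, [[184, 11], [362, 303]]))           # 0.107012...
# After also splitting smoker=no on drinker, the leaves are
# (184, 11), (360, 240), and (2, 63); overall E drops to 0.763368,
# so the gain relative to the root is 0.107012 + 0.076468:
print(information_gain(root, [[184, 11], [360, 240], [2, 63]]))  # 0.183480...
```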
14. Clustering
[Figure: two scatter plots of instances by weight (x-axis) and age (y-axis); clustering partitions the instances into clusters A, B, and C, each marked by its centroid (+).]
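A minimal k-means sketch matching this picture, assuming scikit-learn and invented (weight, age) points:

```python
# k-means clustering on (weight, age) pairs; data invented.
from sklearn.cluster import KMeans

points = [[60, 25], [65, 30], [62, 28],    # a light/young group
          [90, 45], [95, 50], [92, 48],    # a heavy/middle-aged group
          [70, 70], [75, 75], [72, 72]]    # an older group

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(km.labels_)            # cluster index (0..2) per instance
print(km.cluster_centers_)   # the three centroids, the "+" marks above
```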
17. Levels introduced by agglomerative hierarchical clustering
[Figure: (a) nested clusters over the instances a–j, merged stepwise: {a,b}, {c,d}, {f,g}, {h,i}, then {a,b,c,d}, {e,f,g}, {h,i,j}, then {e,f,g,h,i,j}, until a single cluster {a,...,j} remains; (b) the corresponding dendrogram over a–j.]
Any horizontal line in the dendrogram corresponds to a concrete clustering at a particular level of abstraction.
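A sketch of this bottom-up merging, assuming SciPy; the 2-D coordinates for items a–j are invented for illustration:

```python
# Agglomerative hierarchical clustering; coordinates are invented.
from scipy.cluster.hierarchy import linkage, fcluster

items = list("abcdefghij")
coords = [[1, 1], [1, 2], [3, 1], [3, 2], [6, 5],
          [7, 6], [7, 7], [9, 6], [9, 7], [10, 8]]

Z = linkage(coords, method="average")   # bottom-up (agglomerative) merges

# "Cutting" the dendrogram with a horizontal line at some distance
# threshold yields a concrete clustering at that abstraction level.
for threshold in (1.5, 4.0):
    labels = fcluster(Z, t=threshold, criterion="distance")
    print(threshold, dict(zip(items, labels)))
```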
20. Example
Rule: people who order tea and latte also order muffins ({tea, latte} ⇒ {muffin}).
• Support should be as high as possible (but it will be low when rules involve many items).
• Confidence should be close to 1.
• High lift values suggest a positive correlation (lift = 1 if antecedent and consequent are independent).
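As a sketch, the fragment below computes the three metrics for this rule over a transaction list using the standard definitions (support = N_X∪Y/N, confidence = N_X∪Y/N_X, lift = confidence/(N_Y/N)); the transactions themselves are invented:

```python
# Support, confidence, and lift for {tea, latte} => {muffin};
# the transactions are invented for illustration.
transactions = [
    {"tea", "latte", "muffin"},
    {"tea", "latte", "muffin"},
    {"tea", "espresso"},
    {"latte"},
    {"espresso", "muffin"},
    {"tea", "latte"},
]

def rule_metrics(transactions, X, Y):
    N = len(transactions)
    n_x  = sum(1 for t in transactions if X <= t)        # X subset of t
    n_y  = sum(1 for t in transactions if Y <= t)
    n_xy = sum(1 for t in transactions if (X | Y) <= t)
    support    = n_xy / N                # how often the rule applies
    confidence = n_xy / n_x              # should be close to 1
    lift       = confidence / (n_y / N)  # 1 if X and Y are independent
    return support, confidence, lift

print(rule_metrics(transactions, {"tea", "latte"}, {"muffin"}))
# -> (0.333..., 0.666..., 1.333...)
```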
24. Episode mining
[Figure: an event sequence a c b e d f c b b c a e b e c d c b, with time stamps between 10 and 37, is scanned with 32 sliding time windows of length 5; three candidate episodes E1, E2, and E3 are shown, each a partial order over the event types a, b, c, and d.]
25. Occurrences
[Figure: the same event sequence with occurrences marked; episode E2 occurs in 16 of the 32 time windows, and individual occurrences of E1 and E3 are indicated on the time axis.]
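A simplified sketch of window-based occurrence counting follows; it reduces an episode to a serial sequence of event types (real episodes are partial orders) and uses invented timestamps. Counting all windows that overlap the sequence is what yields 37 − 10 + 5 = 32 windows on the slide (first event at time 10, last at 37, window length 5).

```python
# Simplified episode-occurrence counting in sliding time windows;
# episodes are treated as serial sequences, timestamps are invented.
events = list(zip("acbedfcbbcaebecdcb", range(10, 28)))  # (type, time)

def occurs(episode, window):
    """True if the episode's events occur in order within the window."""
    it = iter(e for e, _ in window)
    return all(sym in it for sym in episode)   # ordered subsequence test

def count_windows(episode, events, length=5):
    lo, hi = events[0][1], events[-1][1]
    count = 0
    # Slide a window over the time axis; windows partially overlapping
    # the sequence are counted as well, giving hi - lo + length windows.
    for start in range(lo - length + 1, hi + 1):
        window = [(e, t) for e, t in events if start <= t < start + length]
        count += occurs(episode, window)
    return count

print(count_windows(("a", "b"), events))   # windows with a before b
```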
26. Hidden Markov models
• Given an observation sequence, how to compute the probability of the sequence given a hidden Markov model?
• Given an observation sequence and a hidden Markov model, how to compute the most likely “hidden path” in the model?
• Given a set of observation sequences, how to derive the hidden Markov model that maximizes the probability of producing these sequences?
[Figure: an example hidden Markov model with hidden states s1, s2, s3 (legend: s = state, x = observation), arcs labeled with transition probabilities (0.7, 0.3, 0.2, 0.8, 1.0), and observations a–e emitted with observation probabilities such as 0.5/0.5, 0.6/0.4, and 0.8/0.2.]
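These three questions are classically answered by the forward algorithm, the Viterbi algorithm, and the Baum-Welch algorithm, respectively. Below is a minimal sketch of the forward algorithm for the first question; the model parameters are made up, since the figure's probabilities are only partially recoverable:

```python
# Forward algorithm: P(observation sequence | HMM). The parameters
# below are invented; they are not the ones in the figure.
states = ["s1", "s2", "s3"]
init   = {"s1": 1.0, "s2": 0.0, "s3": 0.0}
trans  = {"s1": {"s2": 0.7, "s3": 0.3},
          "s2": {"s1": 0.2, "s3": 0.8},
          "s3": {"s3": 1.0}}
emit   = {"s1": {"a": 0.5, "b": 0.5},
          "s2": {"c": 0.6, "d": 0.4},
          "s3": {"d": 0.8, "e": 0.2}}

def forward(obs):
    """Sum the probability of obs over all hidden paths."""
    alpha = {s: init[s] * emit[s].get(obs[0], 0.0) for s in states}
    for x in obs[1:]:
        alpha = {s: sum(alpha[r] * trans[r].get(s, 0.0) for r in states)
                    * emit[s].get(x, 0.0) for s in states}
    return sum(alpha.values())

print(forward(["a", "c", "d"]))   # 0.1344 for these parameters
```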
27. Relation between data mining and
process mining
• Process mining: about end-to-end processes.
• Data mining: data-centric and not process-centric.
• Judging the quality of data mining and process
mining: many similarities, but also some differences.
• Clearly, process mining techniques can benefit from
experiences in the data mining field.
• Let us now focus on the quality of mining results.
28. Confusion matrix
[Figure: the decision tree learned for data set 2 (see slide 11) shown next to its confusion matrix.]
Confusion matrix (rows: actual class, columns: predicted class):
            failed  passed  cum laude
failed         178      22          0
passed          21     175          2
cum laude        1       3         18
29. Confusion matrix: metrics
(a) Confusion matrix for a two-class (+/−) problem (rows: actual class, columns: predicted class):
       +    −
+     tp   fn   | p
−     fp   tn   | n
      p’   n’   | N
(b) Metrics:
name       formula
error      (fp+fn)/N
accuracy   (tp+tn)/N
tp-rate    tp/p
fp-rate    fp/n
precision  tp/p’
recall     tp/p
tp is the number of true positives, i.e., instances that are correctly classified as positive.
fn is the number of false negatives, i.e., instances that are predicted to be negative but should have been classified as positive.
fp is the number of false positives, i.e., instances that are predicted to be positive but should have been classified as negative.
tn is the number of true negatives, i.e., instances that are correctly classified as negative.
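A sketch computing these metrics from the four counts; the counts themselves are arbitrary example values, not taken from the slides:

```python
# Metrics derived from a two-class confusion matrix; counts invented.
tp, fn, fp, tn = 40, 10, 5, 45
p, n = tp + fn, fp + tn        # actual positives / negatives
p_, n_ = tp + fp, fn + tn      # predicted positives / negatives (p', n')
N = p + n

print("error    ", (fp + fn) / N)
print("accuracy ", (tp + tn) / N)
print("tp-rate  ", tp / p)     # a.k.a. recall
print("fp-rate  ", fp / n)
print("precision", tp / p_)
print("recall   ", tp / p)
```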
30. Example
[Figure: the candidate splits from slide 13 (splitting on smoker, then drinker, with the corresponding entropies and information gains) shown together with two confusion matrices.]
(a) Without any split, every instance is classified as young (rows: actual class, columns: predicted class):
        young   old
young     546     0
old       314     0
(b) After splitting on smoker and drinker:
        young   old
young     544     2
old       251    63
The splits raise the accuracy from 546/860 ≈ 0.635 to (544+63)/860 ≈ 0.706.
31. Cross-validation
[Figure: the data set is split into a training set and a test set; the learning algorithm derives a model from the training set, and testing the model on the test set yields a performance indicator.]
32. k-fold cross-validation
[Figure: the data set is split into k data sets; in rotation, each of the k parts serves once as the test set while the remaining parts form the training set, yielding a performance indicator per rotation.]
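A minimal sketch, assuming scikit-learn and a synthetic data set generated only for illustration:

```python
# k-fold cross-validation; the data set is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split into k=10 folds; each fold is used once as the test set while
# the other 9 folds form the training set.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(scores)          # one performance indicator (accuracy) per fold
print(scores.mean())   # combined estimate
```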
33. Occam’s Razor
• Principle attributed to the 14th-century English logician William of Ockham.
• The principle states that “one should not increase, beyond what is necessary, the number of entities required to explain anything”, i.e., one should look for the “simplest model” that can explain what is observed in the data set.
• The Minimal Description Length (MDL) principle tries to operationalize Occam’s razor. In MDL, performance is judged on the training data alone and not measured against new, unseen instances. The basic idea is that the “best” model is the one that minimizes the encoding of both model and data set.
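In its common two-part form (not spelled out on the slide), MDL selects the model minimizing the summed code lengths of the model itself and of the data encoded with its help:

```latex
\hat{M} \;=\; \operatorname*{arg\,min}_{M}\; \bigl( L(M) + L(D \mid M) \bigr)
```

Here L(M) is the number of bits needed to encode the model and L(D|M) the number of bits needed to encode the data set given the model: a trivial model is cheap to encode but leaves many exceptions to describe, while an overfitted model encodes the data cheaply but is itself expensive to describe.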