Chapter 3
Data Mining

prof.dr.ir. Wil van der Aalst
www.processmining.org
Overview

Chapter 1: Introduction

Part I: Preliminaries
• Chapter 2: Process Modeling and Analysis
• Chapter 3: Data Mining

Part II: From Event Logs to Process Models
• Chapter 4: Getting the Data
• Chapter 5: Process Discovery: An Introduction
• Chapter 6: Advanced Process Discovery Techniques

Part III: Beyond Process Discovery
• Chapter 7: Conformance Checking
• Chapter 8: Mining Additional Perspectives
• Chapter 9: Operational Support

Part IV: Putting Process Mining to Work
• Chapter 10: Tool Support
• Chapter 11: Analyzing “Lasagna Processes”
• Chapter 12: Analyzing “Spaghetti Processes”

Part V: Reflection
• Chapter 13: Cartography and Navigation
• Chapter 14: Epilogue
                                                                          PAGE 1
Data mining

• The growth of the “digital universe” is the main
  driver for the popularity of data mining.
• Initially, the term “data mining” had a negative
  connotation (“data snooping”, “fishing”, and “data
  dredging”).
• Now a mature discipline.
• Data-centric, not process-centric.




                                                       PAGE 2
Data set 1

Data about 860 recently deceased persons to study the effects of
drinking, smoking, and body weight on the life expectancy.




Questions:
- What is the effect of smoking and drinking on a person’s body weight?
- Do people who smoke also drink?
- What factors influence a person’s life expectancy the most?
- Can one identify groups of people having a similar lifestyle?
                                                                         PAGE 3
Data set 2

Data about 420 students to investigate relationships among course grades
and the student’s overall performance in the Bachelor program.




Questions:
- Are the marks of certain courses highly correlated?
- Which electives do excellent students (cum laude) take?
- Which courses significantly delay the moment of graduation?
- Why do students drop out?
- Can one identify groups of students having a similar study behavior?
                                                                       PAGE 4
Data set 3

Data on 240 customer orders in a coffee bar recorded by the cash register.




Questions:
- Which products are frequently purchased together?
- When do people buy a particular product?
- Is it possible to characterize typical customer groups?
- How to promote the sales of products with a higher margin?
                                                                     PAGE 5
Variables

• Data set (sample or table) consists of instances
  (individuals, entities, cases, objects, or records).
• Variables are often referred to as attributes, features,
  or data elements.
• Two types:
   − categorical variables:
     − ordinal (high-med-low, cum laude-passed-failed) or
     − nominal (true-false, red-pink-green)
   − numerical variables (ordered, cannot be enumerated
     easily)



                                                       PAGE 6
Supervised Learning

• Labeled data, i.e., there is a response variable that
  labels each instance.
• Goal: explain response variable (dependent variable)
  in terms of predictor variables (independent
  variables).
• Classification techniques (e.g., decision tree
  learning) assume a categorical response variable
  and the goal is to classify instances based on the
  predictor variables.
• Regression techniques assume a numerical
  response variable. The goal is to find a function that
  fits the data with the least error.
                                                      PAGE 7
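The regression bullet above ("find a function that fits the data with the least error") can be illustrated with a minimal sketch. It assumes a single numerical predictor, a straight-line model, and squared error as the error measure; the function names and the sample data are hypothetical and not taken from the slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b for one predictor variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept
    return a, b


def sum_squared_error(xs, ys, a, b):
    """Error of the fitted function on the given instances."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical instances: predictor values and a numerical response.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line(xs, ys)
print(a, b, sum_squared_error(xs, ys, a, b))
```

Classification differs only in the response variable: it is categorical, so the learned function returns a class label rather than a number.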
Unsupervised Learning

• Unsupervised learning assumes unlabeled data, i.e.,
  the variables are not split into response and
  predictor variables.
• Examples: clustering (e.g., k-means clustering and
  agglomerative hierarchical clustering) and pattern
  discovery (association rules)




                                                   PAGE 8
Decision tree learning: data set 1

smoker
   − yes: young (195/11)
   − no: drinker
      − yes: weight
         − <90: old (219/34)
         − ≥90: young (381/55)
      − no: old (65/2)

                                        PAGE 9
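A learned decision tree is just a nested set of tests on the predictor variables. The sketch below writes the tree for data set 1 as a function; the function name, the parameter types, and the reading of each leaf label (n/m) as "n instances reach the leaf, m of them are misclassified" are assumptions made for illustration.

```python
def classify(smoker: bool, drinker: bool, weight: float) -> str:
    """Follow the tree for data set 1 from the root to a leaf."""
    if smoker:
        return "young"                            # leaf young (195/11)
    if not drinker:
        return "old"                              # leaf old (65/2)
    return "old" if weight < 90 else "young"      # leaves (219/34) and (381/55)


# Leaf labels copied from the slide: (instances, misclassified).
leaves = [(195, 11), (65, 2), (219, 34), (381, 55)]
total = sum(n for n, _ in leaves)
errors = sum(m for _, m in leaves)
training_accuracy = (total - errors) / total

print(classify(smoker=False, drinker=True, weight=95))
print(total, errors, round(training_accuracy, 4))
```

Under that reading of the leaf labels, the four leaves cover all 860 instances and 102 of them end up in a leaf with the wrong label.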
Decision tree learning: data set 2

logic
   − -: failed (79/10)
   − <8: linear algebra
      − <6: operat. research
         − <6: failed (101/8)
         − ≥6: passed (82/7)
      − ≥6: passed (87/11)
   − ≥8: programming
      − <7: linear algebra
         − <6: failed (20/4)
         − ≥6: passed (31/7)
      − ≥7: cum laude (20/2)

                                                                PAGE 10
Decision tree learning: data set 3

tea
   − 0: latte
      − 0: no muffin (189/10)
      − 1: espresso
         − 0: muffin (6/2)
         − ≥1: no muffin (11/3)
      − ≥2: muffin (4/0)
   − ≥1: muffin (30/1)

                                            PAGE 11
Basic idea

• Split the set of instances into subsets such that the variation within
  each subset becomes smaller.
• Based on notion of entropy or similar.
• Minimize average entropy; maximize information gain per step.

[Figure: three steps in building the decision tree for data set 1]

Initial tree (overall E = 0.946848):
   − root: young (860/314); #young=546, #old=314, E=0.946848

After split on attribute smoker (overall E = 0.839836; information gain is 0.107012):
   − smoker = yes: young (195/11); #young=184, #old=11, E=0.313027
   − smoker = no: young (665/303); #young=362, #old=303, E=0.994314

After split on attribute drinker (overall E = 0.763368; information gain is 0.076468):
   − smoker = yes: young (195/11); #young=184, #old=11, E=0.313027
   − smoker = no, drinker = yes: young (600/240); #young=360, #old=240, E=0.970951
   − smoker = no, drinker = no: old (65/2); #young=2, #old=63, E=0.198234

                                                                                                    PAGE 12
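The entropy values on the slide above can be reproduced with a few lines of code. This is a sketch under two assumptions: entropy is the usual E = -Σ p_i log2 p_i over the class fractions in a leaf, and the overall entropy is the average over the leaves weighted by leaf size. The function names are illustrative; the counts are the #young/#old figures from the slide.

```python
from math import log2


def entropy(counts):
    """Entropy of one leaf, given the number of instances per class."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def overall_entropy(leaves):
    """Average entropy of a tree, weighted by the size of each leaf."""
    total = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / total * entropy(leaf) for leaf in leaves)


# (#young, #old) per leaf, as shown on the slide.
initial = [(546, 314)]
after_smoker = [(184, 11), (362, 303)]
after_drinker = [(184, 11), (360, 240), (2, 63)]

e0 = overall_entropy(initial)
e1 = overall_entropy(after_smoker)
e2 = overall_entropy(after_drinker)

print(round(e0, 6), round(e1, 6), round(e2, 6))
print("gain of split on smoker: ", round(e0 - e1, 6))
print("gain of split on drinker:", round(e1 - e2, 6))
```

A tree learner following this idea would compute the gain for every candidate attribute at every leaf and pick the split with the largest gain.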
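Clustering

[Figure: instances plotted by weight (horizontal axis) and age (vertical
axis), first unclustered and then grouped into cluster A, cluster B, and
cluster C; each + marks a cluster centroid.]
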
                                                 PAGE 13
k-means clustering

[Figure: three panels (a), (b), and (c) showing successive steps of
k-means clustering; each + marks a centroid.]

                                         PAGE 14
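The steps in the k-means figure can be sketched as a loop that alternates between assigning instances to the nearest centroid and moving each centroid to the mean of its cluster. This is a minimal sketch, assuming two-dimensional points, squared Euclidean distance, and caller-supplied initial centroids; the function name and the sample points are hypothetical.

```python
def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else old
            for c, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical instances and starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

The result depends on the initial centroids, which is why k-means is often run several times with different starting points.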
Agglomerative hierarchical clustering

(a) Ten instances a, b, c, d, e, f, g, h, i, j plotted as points.
(b) Dendrogram:

abcdefghij
   − abcd
      − ab: a, b
      − cd: c, d
   − efghij
      − efg: e, fg (f, g)
      − hij: hi (h, i), j

                                                                                       PAGE 15
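Levels introduced by agglomerative hierarchical clustering

[Figure: (a) the ten instances a to j plotted as points; (b) the
dendrogram with root abcdefghij, inner clusters abcd, efghij, ab, cd,
efg, hij, fg, hi, and leaves a to j.]

Any horizontal line in the dendrogram corresponds to a concrete clustering
at a particular level of abstraction. For example, a line just below the
root yields the two clusters abcd and efghij.
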
                                                                                           PAGE 16
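The way a dendrogram induces one clustering per level can be sketched as follows. The slides do not say which distance between clusters is used, so this sketch assumes single linkage (distance between the two closest members) on one-dimensional points; the function name and the data are hypothetical.

```python
def agglomerate(points):
    """Merge the two closest clusters until one is left.

    `points` maps an instance name to a 1-D position. The result lists
    the clustering at every level, from singletons up to one cluster.
    """
    clusters = [frozenset([name]) for name in points]
    levels = [list(clusters)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        levels.append(list(clusters))
    return levels


# Hypothetical instances on a line.
levels = agglomerate({"a": 0.0, "b": 1.0, "c": 5.0, "d": 7.0})
for level in levels:
    print(sorted("".join(sorted(c)) for c in level))
```

Each printed line plays the role of one horizontal cut through the dendrogram: the lower the cut, the more and smaller the clusters.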
Association rule learning

• Rules of form “IF X THEN Y”




                                PAGE 17
Special case: market basket analysis




                                       PAGE 18
Example
 (people that order tea and latte also order muffins)




• Support should be as high as possible (but will be low in case of many items).
• Confidence should be close to 1.
• High lift values suggest a positive correlation (1 if independent).
                                                                               PAGE 19
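The three metrics named on the slide above can be computed directly from a list of orders. The sketch assumes the common definitions for a rule IF X THEN Y over N transactions: support = N(X∪Y)/N, confidence = N(X∪Y)/N(X), and lift = N(X∪Y)·N / (N(X)·N(Y)). The orders below are hypothetical and are not data set 3.

```python
def rule_metrics(transactions, x, y):
    """Support, confidence, and lift of the rule IF x THEN y.

    Assumes x and y each occur in at least one transaction.
    """
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)
    n_y = sum(1 for t in transactions if y <= t)
    n_xy = sum(1 for t in transactions if (x | y) <= t)
    support = n_xy / n
    confidence = n_xy / n_x
    lift = (n_xy * n) / (n_x * n_y)
    return support, confidence, lift


# Hypothetical coffee-bar orders, one set of products per order.
orders = [
    {"tea", "latte", "muffin"},
    {"tea", "latte", "muffin"},
    {"tea", "muffin"},
    {"latte"},
    {"espresso"},
    {"espresso", "muffin"},
    {"cappuccino"},
    {"tea", "latte"},
]

support, confidence, lift = rule_metrics(orders, {"tea", "latte"}, {"muffin"})
print(support, round(confidence, 3), round(lift, 3))
```

A lift above 1 means that X and Y occur together more often than they would if they were independent.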
Brute force algorithm




                        PAGE 20
Apriori (optimization based on two
observations)




                                     PAGE 21
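The slide above does not spell out the two observations. The sketch below assumes the way they are commonly stated: every subset of a frequent item set is itself frequent, and once no frequent item set of size k exists, none of a larger size exists either. Candidates are therefore built level by level from the frequent sets of the previous level. The function name, the support threshold, and the orders are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return all frequent item sets with their support, level by level."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    level = [s for s in level if support(s) >= min_support]
    frequent = {}
    k = 1
    while level:  # Stop at the first size without any frequent item set.
        for s in level:
            frequent[s] = support(s)
        k += 1
        previous = set(level)
        # Candidates of size k are unions of frequent sets of size k-1 ...
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # ... and are kept only if every subset of size k-1 is frequent.
        candidates = [
            c for c in candidates
            if all(frozenset(sub) in previous for sub in combinations(c, k - 1))
        ]
        level = [c for c in candidates if support(c) >= min_support]
    return frequent


# Hypothetical coffee-bar orders.
orders = [
    {"tea", "latte", "muffin"},
    {"tea", "latte", "muffin"},
    {"tea", "muffin"},
    {"latte"},
    {"espresso"},
    {"espresso", "muffin"},
    {"cappuccino"},
    {"tea", "latte"},
]

frequent = apriori(orders, min_support=0.25)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Compared with the brute-force approach, which would test every possible item set, only candidates whose subsets are all frequent are ever counted.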
Sequence mining




           PAGE 22
Episode mining
(32 time windows of length 5)

[Figure: a timed event sequence on a time axis from 10 to 37, with the
events, in order, a c b e d f c b b c a e b e c d c b; below it three
episodes: E1 (over a, b, c, d), E2 (over b, c), and E3 (over a, b, c, d).]

                                                                                            PAGE 23
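The number of time windows in the slide title can be checked with a short sketch. It assumes that a window of length 5 may start before time 10 or end after time 37, as long as it overlaps the sequence in at least one time point; this is one reading that yields 32 windows. The function name is illustrative.

```python
def sliding_windows(first_time, last_time, length):
    """All windows (start, end) of the given length that overlap the
    interval [first_time, last_time] in at least one time point."""
    return [
        (start, start + length - 1)
        for start in range(first_time - length + 1, last_time + 1)
    ]


windows = sliding_windows(10, 37, 5)
print(len(windows), windows[0], windows[-1])
```

The frequency of an episode is then the fraction of these windows in which the episode occurs.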
Occurrences

[Figure: the three episodes E1, E2, and E3 and the timed event sequence
on the time axis from 10 to 37, with occurrences marked below the
sequence: E1, E1, and E3. Annotation: E2 (16x).]

                                                                                     PAGE 24
Hidden Markov models

• Given an observation sequence, how to compute the probability of
  the sequence given a hidden Markov model?
• Given an observation sequence and a hidden Markov model, how
  to compute the most likely “hidden path” in the model?
• Given a set of observation sequences, how to derive the
  hidden Markov model that maximizes the probability of
  producing these sequences?

[Figure: hidden Markov model with states s1, s2, s3 and observations
a, b, c, d, e. Legend: s = state; x = observation; arc labeled 0.7 =
transition with probability; arc labeled 0.5 = observation probability.
Transition labels shown: 1.0, 0.7, 0.3, 0.8, 0.2. Observation
probability labels shown: 0.5, 0.5, 0.6, 0.4, 0.8, 0.2.]

                                                                                               PAGE 25
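The first two questions on the slide above have standard answers: the forward algorithm for the probability of an observation sequence, and the Viterbi algorithm for the most likely hidden path. The sketch below shows both. The model in the code is hypothetical: it reuses the state and observation names from the figure, but the wiring and probabilities are assumed for illustration and are not a reconstruction of the figure.

```python
states = ["s1", "s2", "s3"]

# Hypothetical model: all three tables are assumptions.
initial = {"s1": 1.0}
transition = {
    "s1": {"s1": 0.7, "s2": 0.3},
    "s2": {"s2": 0.8, "s3": 0.2},
    "s3": {"s1": 1.0},
}
emission = {
    "s1": {"a": 0.5, "b": 0.5},
    "s2": {"b": 0.6, "c": 0.4},
    "s3": {"d": 0.8, "e": 0.2},
}


def forward(observations):
    """Probability of the observation sequence under the model."""
    alpha = {s: initial.get(s, 0.0) * emission[s].get(observations[0], 0.0)
             for s in states}
    for o in observations[1:]:
        alpha = {
            s: sum(alpha[r] * transition[r].get(s, 0.0) for r in states)
            * emission[s].get(o, 0.0)
            for s in states
        }
    return sum(alpha.values())


def viterbi(observations):
    """Most likely hidden path and its probability."""
    best = {s: (initial.get(s, 0.0) * emission[s].get(observations[0], 0.0), [s])
            for s in states}
    for o in observations[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * transition[r].get(s, 0.0), best[r][1]) for r in states),
                key=lambda t: t[0],
            )
            new_best[s] = (prob * emission[s].get(o, 0.0), path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])


print(forward(["a", "b"]))
print(viterbi(["a", "b", "c"]))
```

The third question, learning the model from a set of observation sequences, is usually approached with an iterative procedure such as Baum-Welch and is not covered by this sketch.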
Relation between data mining and
process mining

• Process mining: about end-to-end processes.
• Data mining: data-centric and not process-centric.
• Judging the quality of data mining and process
  mining: many similarities, but also some differences.
• Clearly, process mining techniques can benefit from
  experiences in the data mining field.
• Let us now focus on the quality of mining results.




                                                    PAGE 26
Confusion matrix

[Figure: the decision tree for data set 2 (root logic; leaves
failed (79/10), failed (101/8), passed (82/7), passed (87/11),
cum laude (20/2), passed (31/7), failed (20/4)) next to its
confusion matrix.]

                             predicted class
                         failed   passed   cum laude
actual    failed           178       22        0
class     passed            21      175        2
          cum laude          1        3       18

                                                                                                             PAGE 27
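The confusion matrix above can be summarized in a single number by dividing the diagonal (correctly classified instances) by the total. A minimal sketch, with the matrix copied from the slide and variable names chosen for illustration:

```python
labels = ["failed", "passed", "cum laude"]

# Rows are actual classes, columns are predicted classes.
matrix = [
    [178, 22, 0],
    [21, 175, 2],
    [1, 3, 18],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total
error = 1 - accuracy

print(total, correct, round(accuracy, 4))
```

The 49 instances off the diagonal match the sum of the misclassification counts in the leaves of the tree (10 + 8 + 7 + 11 + 2 + 7 + 4).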
Confusion matrix: metrics

(a) Confusion matrix for two classes

                      predicted class
                        +        -
actual class    +       tp       fn       p
                -       fp       tn       n
                        p’       n’       N

(b) Metrics

name          formula
error         (fp+fn)/N
accuracy      (tp+tn)/N
tp-rate       tp/p
fp-rate       fp/n
precision     tp/p’
recall        tp/p

tp is the number of true positives, i.e., instances that are correctly classified as positive.
fn is the number of false negatives, i.e., instances that are predicted to be negative but
should have been classified as positive.
fp is the number of false positives, i.e., instances that are predicted to be positive but should
have been classified as negative.
tn is the number of true negatives, i.e., instances that are correctly classified as negative.
                                                                                                                  PAGE 28
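Example

[Figure: the three steps in building the decision tree for data set 1,
next to two confusion matrices (a) and (b).]

Initial tree (overall E = 0.946848):
   − root: young (860/314); #young=546, #old=314, E=0.946848

After split on attribute smoker (overall E = 0.839836; information gain is 0.107012):
   − smoker = yes: young (195/11); #young=184, #old=11, E=0.313027
   − smoker = no: young (665/303); #young=362, #old=303, E=0.994314

After split on attribute drinker (overall E = 0.763368; information gain is 0.076468):
   − smoker = yes: young (195/11); #young=184, #old=11, E=0.313027
   − smoker = no, drinker = yes: young (600/240); #young=360, #old=240, E=0.970951
   − smoker = no, drinker = no: old (65/2); #young=2, #old=63, E=0.198234

(a) Before the split on attribute drinker (every instance is classified as young)

                     predicted class
                      young     old
actual    young        546       0
class     old          314       0

(b) After the split on attribute drinker

                     predicted class
                      young     old
actual    young        544       2
class     old          251      63
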
                                                                                                                                                   PAGE 29
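The formulas from the metrics slide can be applied to confusion matrix (b) above. The sketch treats young as the positive class, which is an assumption made here for illustration; the function name is hypothetical.

```python
def metrics(tp, fn, fp, tn):
    """Metrics from the two-class confusion matrix, using the slide's formulas."""
    p = tp + fn          # actual positives
    n = fp + tn          # actual negatives
    p_pred = tp + fp     # predicted positives (p' on the slide)
    total = p + n        # N
    return {
        "error": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "tp-rate": tp / p,
        "fp-rate": fp / n,
        "precision": tp / p_pred,
        "recall": tp / p,
    }


# Matrix (b): young is taken as the positive class (assumption).
m = metrics(tp=544, fn=2, fp=251, tn=63)
for name, value in m.items():
    print(f"{name:10s} {value:.4f}")
```

The high tp-rate combined with a high fp-rate shows that this tree still classifies most old persons as young, which accuracy alone would not reveal.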
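Cross-validation

[Figure: the data set is split into a training set and a test set; the
learning algorithm produces a model from the training set; the model is
tested on the test set, which yields a performance indicator.]
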
                                                 PAGE 30
k-fold cross-validation

[Figure: the data set is split into k data sets; the learning algorithm
produces a model, the model is tested, and a performance indicator is
obtained; the k data sets are rotated.]

                                                         PAGE 31
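The rotation in the k-fold figure can be sketched as follows: each of the k data sets serves once as the test set while the remaining k-1 are used for training, and the performance indicators are averaged. The function names, the majority-class "learning algorithm", and the labels are hypothetical; the sketch also assumes the instances are already in random order.

```python
from collections import Counter


def k_fold_splits(instances, k):
    """Yield (training set, test set) pairs, rotating the test fold."""
    folds = [instances[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(instances, k, learn, evaluate):
    """Average performance indicator over the k rotations."""
    scores = [evaluate(learn(train), test)
              for train, test in k_fold_splits(instances, k)]
    return sum(scores) / len(scores)


# Hypothetical learner: always predict the most frequent class label.
def learn(train):
    return Counter(train).most_common(1)[0][0]


# Hypothetical performance indicator: accuracy on the test fold.
def evaluate(model, test):
    return sum(1 for label in test if label == model) / len(test)


labels = ["young"] * 7 + ["old"] * 3
score = cross_validate(labels, 5, learn, evaluate)
print(round(score, 3))
```

Because every instance is used for testing exactly once, the averaged indicator reflects performance on unseen instances better than a score computed on the training data.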
Occam’s Razor

• Principle attributed to the 14th-century English logician
  William of Ockham.
• The principle states that “one should not increase,
  beyond what is necessary, the number of entities
  required to explain anything”, i.e., one should look for
  the “simplest model” that can explain what is observed
  in the data set.
• The Minimum Description Length (MDL) principle tries to
  operationalize Occam’s Razor. In MDL, performance is judged
  on the training data alone and not measured against
  new, unseen instances. The basic idea is that the
  “best” model is the one that minimizes the encoding of
  both model and data set.
                                                       PAGE 32

Más contenido relacionado

La actualidad más candente

Process Mining: Data Science in Action - Wil van der Aalst, TU/e, DSC/e, HSE
 Process Mining: Data Science in Action - Wil van der Aalst, TU/e, DSC/e, HSE Process Mining: Data Science in Action - Wil van der Aalst, TU/e, DSC/e, HSE
Process Mining: Data Science in Action - Wil van der Aalst, TU/e, DSC/e, HSE
Yandex
 
Data Warehouse Architecture
Data Warehouse ArchitectureData Warehouse Architecture
Data Warehouse Architecture
pcherukumalla
 

La actualidad más candente (20)

Process Mining - Chapter 11 - Analyzing Lasagna Processes
Process Mining - Chapter 11 - Analyzing Lasagna ProcessesProcess Mining - Chapter 11 - Analyzing Lasagna Processes
Process Mining - Chapter 11 - Analyzing Lasagna Processes
 
Process Mining - Chapter 7 - Conformance Checking
Process Mining - Chapter 7 - Conformance CheckingProcess Mining - Chapter 7 - Conformance Checking
Process Mining - Chapter 7 - Conformance Checking
 
Process mining
Process miningProcess mining
Process mining
 
Process Mining and Predictive Process Monitoring
Process Mining and Predictive Process MonitoringProcess Mining and Predictive Process Monitoring
Process Mining and Predictive Process Monitoring
 
Process Mining: Data Science in Action - Wil van der Aalst, TU/e, DSC/e, HSE
 Process Mining: Data Science in Action - Wil van der Aalst, TU/e, DSC/e, HSE Process Mining: Data Science in Action - Wil van der Aalst, TU/e, DSC/e, HSE
Process Mining: Data Science in Action - Wil van der Aalst, TU/e, DSC/e, HSE
 
Introduction to Business Process Monitoring and Process Mining
Introduction to Business Process Monitoring and Process MiningIntroduction to Business Process Monitoring and Process Mining
Introduction to Business Process Monitoring and Process Mining
 
Event Logs: What kind of data does process mining require?
Event Logs: What kind of data does process mining require?Event Logs: What kind of data does process mining require?
Event Logs: What kind of data does process mining require?
 
Process Mining - a new governance approach
Process Mining - a new governance approachProcess Mining - a new governance approach
Process Mining - a new governance approach
 
Process mining in business process management
Process mining in business process managementProcess mining in business process management
Process mining in business process management
 
Business intelligence overview
Business intelligence overviewBusiness intelligence overview
Business intelligence overview
 
Business Process Modeling
Business Process ModelingBusiness Process Modeling
Business Process Modeling
 
Process Mining - Chapter 9 - Operational Support
Process Mining - Chapter 9 - Operational SupportProcess Mining - Chapter 9 - Operational Support
Process Mining - Chapter 9 - Operational Support
 
Latency and Consistency Tradeoffs in Modern Distributed Databases
Latency and Consistency Tradeoffs in Modern Distributed DatabasesLatency and Consistency Tradeoffs in Modern Distributed Databases
Latency and Consistency Tradeoffs in Modern Distributed Databases
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
 
Process Mining Book
Process Mining BookProcess Mining Book
Process Mining Book
 
OLAP in Data Warehouse
OLAP in Data WarehouseOLAP in Data Warehouse
OLAP in Data Warehouse
 
Adatbázis kezelés
Adatbázis kezelésAdatbázis kezelés
Adatbázis kezelés
 
BPMN 2.0 overview
BPMN 2.0 overviewBPMN 2.0 overview
BPMN 2.0 overview
 
Data Warehouse Architecture
Data Warehouse ArchitectureData Warehouse Architecture
Data Warehouse Architecture
 
BPMN Introduction
BPMN IntroductionBPMN Introduction
BPMN Introduction
 

Destacado

Distributed Process Discovery and Conformance Checking
Distributed Process Discovery and Conformance CheckingDistributed Process Discovery and Conformance Checking
Distributed Process Discovery and Conformance Checking
Wil van der Aalst
 

Destacado (9)

Process Mining Introduction
Process Mining IntroductionProcess Mining Introduction
Process Mining Introduction
 
Process Mining for ERP Systems
Process Mining for ERP SystemsProcess Mining for ERP Systems
Process Mining for ERP Systems
 
Process Mining - Chapter 13 - Cartography and Navigation
Process Mining - Chapter 13 - Cartography and NavigationProcess Mining - Chapter 13 - Cartography and Navigation
Process Mining - Chapter 13 - Cartography and Navigation
 
Process Mining - Chapter 10 - Tool Support
Process Mining - Chapter 10 - Tool SupportProcess Mining - Chapter 10 - Tool Support
Process Mining - Chapter 10 - Tool Support
 
Process Mining - Chapter 14 - Epilogue
Process Mining - Chapter 14 - EpilogueProcess Mining - Chapter 14 - Epilogue
Process Mining - Chapter 14 - Epilogue
 
Process Mining: Understanding and Improving Desire Lines in Big Data
Process Mining: Understanding and Improving Desire Lines in Big DataProcess Mining: Understanding and Improving Desire Lines in Big Data
Process Mining: Understanding and Improving Desire Lines in Big Data
 
Distributed Process Discovery and Conformance Checking
Distributed Process Discovery and Conformance CheckingDistributed Process Discovery and Conformance Checking
Distributed Process Discovery and Conformance Checking
 
Process Mining - Chapter 12 - Analyzing Spaghetti Processes
Process Mining - Chapter 12 - Analyzing Spaghetti ProcessesProcess Mining - Chapter 12 - Analyzing Spaghetti Processes
Process Mining - Chapter 12 - Analyzing Spaghetti Processes
 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
 

Similar a Process Mining - Chapter 3 - Data Mining

Decision tree for data mining and computer
Decision tree for data mining and computerDecision tree for data mining and computer
Decision tree for data mining and computer
tttiba
 
wekapresentation-130107115704-phpapp02.pdf
wekapresentation-130107115704-phpapp02.pdfwekapresentation-130107115704-phpapp02.pdf
wekapresentation-130107115704-phpapp02.pdf
Dr. Rajesh P Barnwal
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Research
butest
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
kevinlan
 
CSA 3702 machine learning module 2
CSA 3702 machine learning module 2CSA 3702 machine learning module 2
CSA 3702 machine learning module 2
Nandhini S
 
6MODULE 2Module 2 Problem SetEXAMPLEGrand .docx
6MODULE 2Module 2 Problem SetEXAMPLEGrand .docx6MODULE 2Module 2 Problem SetEXAMPLEGrand .docx
6MODULE 2Module 2 Problem SetEXAMPLEGrand .docx
blondellchancy
 

Process Mining - Chapter 3 - Data Mining

  • 1. Chapter 3 Data Mining prof.dr.ir. Wil van der Aalst www.processmining.org
  • 2. Overview. Chapter 1: Introduction. Part I: Preliminaries (Chapter 2: Process Modeling and Analysis; Chapter 3: Data Mining). Part II: From Event Logs to Process Models (Chapter 4: Getting the Data; Chapter 5: Process Discovery: An Introduction; Chapter 6: Advanced Process Discovery Techniques). Part III: Beyond Process Discovery (Chapter 7: Conformance Checking; Chapter 8: Mining Additional Perspectives; Chapter 9: Operational Support). Part IV: Putting Process Mining to Work (Chapter 10: Tool Support; Chapter 11: Analyzing “Lasagna Processes”; Chapter 12: Analyzing “Spaghetti Processes”). Part V: Reflection (Chapter 13: Cartography and Navigation; Chapter 14: Epilogue).
  • 3. Data mining • The growth of the “digital universe” is the main driver for the popularity of data mining. • Initially, the term “data mining” had a negative connotation (“data snooping”, “fishing”, and “data dredging”). • Now a mature discipline. • Data-centric, not process-centric.
  • 4. Data set 1: data about 860 recently deceased persons to study the effects of drinking, smoking, and body weight on life expectancy. Questions: - What is the effect of smoking and drinking on a person’s body weight? - Do people who smoke also drink? - Which factors influence a person’s life expectancy the most? - Can one identify groups of people having a similar lifestyle?
  • 5. Data set 2: data about 420 students to investigate relationships among course grades and the student’s overall performance in the Bachelor program. Questions: - Are the marks of certain courses highly correlated? - Which electives do excellent students (cum laude) take? - Which courses significantly delay the moment of graduation? - Why do students drop out? - Can one identify groups of students having a similar study behavior?
  • 6. Data set 3: data on 240 customer orders in a coffee bar recorded by the cash register. Questions: - Which products are frequently purchased together? - When do people buy a particular product? - Is it possible to characterize typical customer groups? - How to promote the sales of products with a higher margin?
  • 7. Variables • A data set (sample or table) consists of instances (individuals, entities, cases, objects, or records). • Variables are often referred to as attributes, features, or data elements. • Two types: − categorical variables: − ordinal (high-med-low, cum laude-passed-failed) or − nominal (true-false, red-pink-green) − numerical variables (ordered, cannot be enumerated easily)
  • 8. Supervised Learning • Labeled data, i.e., there is a response variable that labels each instance. • Goal: explain the response variable (dependent variable) in terms of predictor variables (independent variables). • Classification techniques (e.g., decision tree learning) assume a categorical response variable and the goal is to classify instances based on the predictor variables. • Regression techniques assume a numerical response variable. The goal is to find a function that fits the data with the least error.
  • 9. Unsupervised Learning • Unsupervised learning assumes unlabeled data, i.e., the variables are not split into response and predictor variables. • Examples: clustering (e.g., k-means clustering and agglomerative hierarchical clustering) and pattern discovery (association rules)
  • 10. Decision tree learning: data set 1 [Figure: decision tree. smoker = yes: young (195/11). smoker = no and drinker = no: old (65/2). smoker = no, drinker = yes, and weight <90: old (219/34). smoker = no, drinker = yes, and weight ≥90: young (381/55).]
  • 11. Decision tree learning: data set 2 [Figure: decision tree that classifies students as failed, passed, or cum laude based on the grades for logic, programming, linear algebra, and operations research. Leaf counts: (79/10), (20/2), (87/11), (31/7), (20/4), (101/8), (82/7).]
  • 12. Decision tree learning: data set 3 [Figure: decision tree that predicts muffin or no muffin from the number of tea, latte, and espresso items in an order. Leaf counts: (30/1), (189/10), (4/0), (6/2), (11/3).]
  • 13. Basic idea • Split the set of instances in subsets such that the variation within each subset becomes smaller. • Based on notion of entropy or similar. • Minimize average entropy; maximize information gain per step. [Figure: three stages of the tree for data set 1. Root: young (860/303), #young=546, #old=314, E=0.946848. After the split on attribute smoker: yes: young (195/11), #young=184, #old=11, E=0.313027; no: young (665/303), #young=362, #old=303, E=0.994314; overall E=0.839836, information gain is 0.107012. After the split on attribute drinker (non-smokers only): yes: young (600/240), #young=360, #old=240, E=0.970951; no: old (65/2), #young=2, #old=63, E=0.198234; overall E=0.763368, information gain is 0.076468.]
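The entropy and information-gain values on the "Basic idea" slide can be recomputed from the class counts alone. The following is a minimal sketch that assumes Shannon entropy with base-2 logarithms and a size-weighted average over the leaves; the function names `entropy` and `weighted_entropy` are illustrative and do not come from the slides.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def weighted_entropy(leaves):
    """Average entropy over all leaves, weighted by the number of instances per leaf."""
    total = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / total * entropy(leaf) for leaf in leaves)

# Class counts (#young, #old) as given on the slide.
root = [546, 314]
after_smoker = [[184, 11], [362, 303]]            # smoker = yes, smoker = no
after_drinker = [[184, 11], [360, 240], [2, 63]]  # smokers, drinking non-smokers, other non-smokers

e_root = entropy(root)
e_smoker = weighted_entropy(after_smoker)
e_drinker = weighted_entropy(after_drinker)
gain_smoker = e_root - e_smoker      # gain of the first split
gain_drinker = e_smoker - e_drinker  # gain of the second split

print(round(e_root, 6), round(e_smoker, 6), round(e_drinker, 6))
print(round(gain_smoker, 6), round(gain_drinker, 6))
```

Each split is worthwhile only if it lowers the weighted entropy, which is why the gain of every step is measured against the overall entropy of the previous tree.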
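  • 14. Clustering [Figure: two scatter plots of age against weight; in the second plot the instances are grouped into clusters A, B, and C, each with a centroid marked +.]
  • 15. k-means clustering [Figure: three panels (a), (b), and (c) showing successive steps of k-means clustering, with centroids marked +.]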
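The slide shows k-means only as a picture. A minimal sketch of the usual iteration (assign every instance to the nearest centroid, then move each centroid to the mean of its cluster, repeat until nothing changes) could look as follows. The (age, weight) points and the initial centroids are hypothetical, and squared Euclidean distance is an assumption.

```python
def kmeans(points, centroids, max_iter=100):
    """Cluster 2-D points around the given initial centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical (age, weight) instances and initial centroids.
points = [(20, 60), (22, 65), (25, 62), (60, 90), (62, 95), (65, 92)]
centroids, clusters = kmeans(points, [(20, 60), (65, 92)])
print(centroids)
print(clusters)
```

The result depends on the initial centroids, so in practice the procedure is often repeated with several starting positions.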
  • 16. Agglomerative hierarchical clustering [Figure: (a) the instances a to j with nested clusters; (b) the corresponding dendrogram. Clusters shown: ab, cd, fg, hi, abcd, efg, hij, efghij, abcdefghij.]
  • 17. Levels introduced by agglomerative hierarchical clustering [Figure: the same nested clusters and dendrogram as on the previous slide.] Any horizontal line in the dendrogram corresponds to a concrete clustering at a particular level of abstraction.
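The idea that every level of the dendrogram is a clustering in its own right can be sketched by recording the clusters after each merge. The example below assumes single linkage (distance between the closest members) on hypothetical one-dimensional values; the slides do not say which linkage criterion is used.

```python
def agglomerative(values):
    """Return all levels of the hierarchy, from singletons to one cluster.

    `values` maps an instance name to a number; clusters are lists of names.
    """
    clusters = [[name] for name in values]
    levels = [[sorted(c) for c in clusters]]

    def distance(a, b):
        # Single linkage (assumption): distance of the two closest members.
        return min(abs(values[x] - values[y]) for x in a for y in b)

    while len(clusters) > 1:
        # Find the pair of clusters that are closest to each other.
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append([sorted(c) for c in clusters])
    return levels

# Hypothetical instances on a single numerical attribute.
values = {"a": 1.0, "b": 1.5, "c": 5.0, "d": 5.4, "e": 12.0}
levels = agglomerative(values)
for level in levels:
    print(level)
```

Each printed line plays the role of one horizontal cut through the dendrogram: the first line has every instance in its own cluster, the last line has a single cluster.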
  • 18. Association rule learning • Rules of the form “IF X THEN Y”
  • 19. Special case: market basket analysis
  • 20. Example (people that order tea and latte also order muffins) • Support should be as high as possible (but will be low in case of many items). • Confidence should be close to 1. • High lift values suggest a positive correlation (1 if independent).
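The formulas for the three metrics are not part of the transcript. The sketch below assumes the standard definitions for a rule X ⇒ Y over N transactions: support = N(X∪Y)/N, confidence = N(X∪Y)/N(X), and lift = N(X∪Y)·N / (N(X)·N(Y)). The eight coffee-bar orders are hypothetical.

```python
def rule_metrics(transactions, x, y):
    """Support, confidence, and lift of the rule x => y.

    Assumes that x and y each occur in at least one transaction.
    """
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)         # orders containing all of x
    n_y = sum(1 for t in transactions if y <= t)         # orders containing all of y
    n_xy = sum(1 for t in transactions if (x | y) <= t)  # orders containing both
    support = n_xy / n
    confidence = n_xy / n_x
    lift = (n_xy * n) / (n_x * n_y)
    return support, confidence, lift

# Hypothetical orders; each order is the set of products bought.
orders = [
    {"tea", "latte", "muffin"},
    {"tea", "latte", "muffin"},
    {"tea", "muffin"},
    {"latte"},
    {"espresso"},
    {"espresso", "muffin"},
    {"tea"},
    {"cappuccino"},
]

metrics = rule_metrics(orders, {"tea", "latte"}, {"muffin"})
print(metrics)  # (0.25, 1.0, 2.0)
```

In this toy data the rule holds in every order that contains tea and latte (confidence 1), although only a quarter of all orders support it; the lift of 2 says muffins are twice as likely in such orders as in general.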
  • 22. Apriori (optimization based on two observations)
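The transcript does not spell out the two observations. The sketch below assumes the commonly cited ones: rules can be generated from frequent item-sets, and every subset of a frequent item-set is itself frequent, so candidates with an infrequent subset can be pruned without counting them. The function name, the orders, and the threshold are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {item-set: count} for all item-sets occurring in at least min_count transactions."""
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep the frequent ones of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-sets into candidate k-sets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical orders; each order is the set of products bought.
orders = [
    {"tea", "latte", "muffin"},
    {"tea", "latte", "muffin"},
    {"tea", "muffin"},
    {"latte"},
    {"espresso"},
    {"espresso", "muffin"},
    {"tea"},
    {"cappuccino"},
]

frequent = apriori(orders, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```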
  • 23. Sequence mining
  • 24. Episode mining (32 time windows of length 5) [Figure: the event sequence a c b e d f c b b c a e b e c d c b on a time axis running from 10 to 37, and three episodes E1, E2, and E3.]
  • 25. Occurrences [Figure: occurrences of the episodes E1, E2, and E3 in the event sequence; E2 is annotated with “16x”.]
  • 26. Hidden Markov models • Given an observation sequence, how to compute the probability of the sequence given a hidden Markov model? • Given an observation sequence and a hidden Markov model, how to compute the most likely “hidden path” in the model? • Given a set of observation sequences, how to derive the hidden Markov model that maximizes the probability of producing these sequences? [Figure: hidden Markov model with states s1, s2, s3 and observations a, b, c, d, e; the arcs are labeled with transition probabilities and observation probabilities.]
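The first question, the probability of an observation sequence given a model, is usually answered with the forward algorithm. The sketch below is an illustration under stated assumptions: the two-state model and all of its probabilities are hypothetical and are not the model drawn on the slide.

```python
def forward(observations, states, start, trans, emit):
    """Probability of the observation sequence under the given hidden Markov model."""
    # alpha[s]: probability of the observations seen so far, ending in state s.
    alpha = {s: start[s] * emit[s].get(observations[0], 0.0) for s in states}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[r] * trans[r].get(s, 0.0) for r in states) * emit[s].get(obs, 0.0)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical model (not taken from the slide).
states = ["s1", "s2"]
start = {"s1": 1.0, "s2": 0.0}                         # initial state probabilities
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s2": 1.0}}  # transition probabilities
emit = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"b": 0.4, "c": 0.6}}  # observation probabilities

p_ab = forward(["a", "b"], states, start, trans, emit)
print(p_ab)
```

The forward pass sums over all hidden paths without enumerating them; replacing the sum by a maximum (and remembering the best predecessor) gives the usual approach to the second question, the most likely hidden path.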
  • 27. Relation between data mining and process mining • Process mining: about end-to-end processes. • Data mining: data-centric and not process-centric. • Judging the quality of data mining and process mining: many similarities, but also some differences. • Clearly, process mining techniques can benefit from experiences in the data mining field. • Let us now focus on the quality of mining results.
  • 28. Confusion matrix [Figure: the decision tree for data set 2.] Confusion matrix with rows as actual class and columns as predicted class (failed, passed, cum laude): failed: 178, 22, 0; passed: 21, 175, 2; cum laude: 1, 3, 18.
  • 29. Confusion matrix: metrics (a) Confusion matrix for two classes, with rows as actual class and columns as predicted class (+, -): actual +: tp, fn (row total p); actual -: fp, tn (row total n); column totals p’ and n’; grand total N. (b) Metrics: error = (fp+fn)/N; accuracy = (tp+tn)/N; tp-rate = tp/p; fp-rate = fp/n; precision = tp/p’; recall = tp/p. tp is the number of true positives, i.e., instances that are correctly classified as positive. fn is the number of false negatives, i.e., instances that are predicted to be negative but should have been classified as positive. fp is the number of false positives, i.e., instances that are predicted to be positive but should have been classified as negative. tn is the number of true negatives, i.e., instances that are correctly classified as negative.
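The metrics follow directly from the four cells of the matrix. As a small sketch, the function below applies the formulas from this slide to the counts of confusion matrix (b) on the "Example" slide, treating "young" as the positive class (that choice is an assumption made for illustration).

```python
def confusion_metrics(tp, fn, fp, tn):
    """Metrics from the slide, computed from the four cells of a 2x2 confusion matrix."""
    p = tp + fn         # actual positives
    n = fp + tn         # actual negatives
    p_pred = tp + fp    # predicted positives (p' on the slide)
    total = p + n       # N on the slide
    return {
        "error": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "tp_rate": tp / p,
        "fp_rate": fp / n,
        "precision": tp / p_pred,
        "recall": tp / p,  # identical to the tp-rate
    }

# Counts from matrix (b) of the Example slide, with "young" as the positive class.
metrics = confusion_metrics(tp=544, fn=2, fp=251, tn=63)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The high tp-rate combined with a high fp-rate shows why a single number is rarely enough: the tree finds almost all "young" instances, but mostly by predicting "young" for nearly everyone.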
  • 30. Example [Figure: the three stages of the decision tree for data set 1, with the same counts and entropy values as on the “Basic idea” slide.] Confusion matrices with rows as actual class and columns as predicted class (young, old): (a) all instances classified as young: young: 546, 0; old: 314, 0. (b) after the split on drinker: young: 544, 2; old: 251, 63.
  • 31. Cross-validation [Figure: the data set is split into a training set and a test set; the learning algorithm builds a model from the training set, and testing the model on the test set yields a performance indicator.]
  • 32. k-fold cross-validation [Figure: the data set is split into k data sets that rotate through the roles of training and test data; the learning algorithm builds a model, and testing it yields a performance indicator.]
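The rotation in k-fold cross-validation can be sketched without any library. In the example below, the round-robin assignment of instances to folds, the majority-class "learning algorithm", and the labels are all hypothetical choices made to keep the sketch short; accuracy serves as the performance indicator.

```python
from collections import Counter

def k_fold_splits(n, k):
    """Yield (train, test) index lists; every fold is the test set exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def cross_validate(labels, k):
    """Mean accuracy of a majority-class predictor over k folds."""
    accuracies = []
    for train, test in k_fold_splits(len(labels), k):
        # "Learning": pick the most frequent class in the training folds.
        majority = Counter(labels[j] for j in train).most_common(1)[0][0]
        # Testing: fraction of test instances with that class.
        correct = sum(1 for j in test if labels[j] == majority)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# Hypothetical labels for ten instances.
labels = ["young"] * 7 + ["old"] * 3
mean_accuracy = cross_validate(labels, k=5)
print(mean_accuracy)
```

Because each instance is used for testing exactly once, the averaged indicator uses all data while never testing a model on instances it was trained on.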
  • 33. Occam’s Razor • Principle attributed to the 14th-century English logician William of Ockham. • The principle states that “one should not increase, beyond what is necessary, the number of entities required to explain anything”, i.e., one should look for the “simplest model” that can explain what is observed in the data set. • The Minimum Description Length (MDL) principle tries to operationalize Occam’s Razor. In MDL, performance is judged on the training data alone and not measured against new, unseen instances. The basic idea is that the “best” model is the one that minimizes the encoding of both model and data set.