Presented and Contributed by:
                         Ahmet Selman Bozkır
                    Hacettepe University Ph.D. Student



November 29, 2011                                        1
   What is data mining?
   Motivation: Why data mining?
   Classification of data mining systems
   Architecture: Typical Data Mining System
   Data mining functionality

   Data mining (knowledge discovery from data)
     Extraction of interesting (non-trivial, implicit, previously unknown
        and potentially useful) patterns or knowledge from huge amounts of
        data
     Data mining: a misnomer?
   Alternative names
     Knowledge discovery (mining) in databases (KDD), knowledge
        extraction, data/pattern analysis, data archeology, data
        dredging, information harvesting, business intelligence, etc.
   Watch out: Is everything “data mining”? No, the following are not:
     (Deductive) query processing
     Expert systems or small ML/statistical programs


   Data explosion problem
     Automated data collection tools and mature database technology lead
        to tremendous amounts of data accumulated and/or to be analyzed in
        databases, data warehouses, and other information repositories
   We are drowning in data, but starving for knowledge!
   Solution: Data warehousing and data mining
     Data warehousing and on-line analytical processing
     Mining interesting knowledge (rules, regularities, patterns, constraints)
        from data in large databases



   Data analysis and decision support
     Market analysis and management
        ▪ Target marketing, customer relationship management
          (CRM), market basket analysis, cross selling, market segmentation
     Risk analysis and management
        ▪ Forecasting, customer retention, improved underwriting, quality
          control, competitive analysis
     Fraud detection and detection of unusual patterns (outliers)
   Other Applications
     Text mining (news group, email, documents) and Web mining
     Bioinformatics and bio-data analysis


   Target marketing
     Find clusters of “model” customers who share the same characteristics:
      interest, income level, spending habits, etc.
     Determine customer purchasing patterns over time

   Cross-market analysis: find associations/correlations between product
    sales, and predict based on such associations

   Customer profiling: what types of customers buy what products
    (clustering or classification)

   Customer requirement analysis
     Identify the best products for different groups of customers
     Predict what factors will attract new customers


   Finance planning and asset evaluation
     Cash flow analysis and prediction
     Cross-sectional and time series analysis (financial-ratio, trend
        analysis, etc.)
   Resource planning
     Summarize and compare the resources and spending
   Competition
     Monitor competitors and market directions
     Group customers into classes and apply a class-based pricing procedure
     Set pricing strategy in a highly competitive market




   Approaches: Clustering & model construction for frauds, outlier analysis
   Applications: Health care, retail, credit card services, telecommunications
     Auto insurance: ring of collisions
     Money laundering: suspicious monetary transactions
     Medical insurance
        ▪ Professional patients, ring of doctors, and ring of references
        ▪ Unnecessary or correlated screening tests
     Telecommunications: phone-call fraud
        ▪ Phone call model: destination of the call, duration, time of day or week.
          Analyze patterns that deviate from an expected norm
     Retail industry
        ▪ Analysts estimate that 38% of retail shrink is due to dishonest employees
     Anti-terrorism
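
The phone-call model above (destination, duration, time of day) flags patterns that deviate from an expected norm. A minimal sketch of that idea for a single feature follows. The call durations are made-up values, and using the median with the median absolute deviation (MAD) and a cutoff of 5 is an illustrative assumption, not a method named in the slides.

```python
from statistics import median

# Hypothetical call durations (minutes) for one subscriber.
durations = [3.1, 2.8, 4.0, 3.5, 2.9, 3.3, 41.0, 3.6, 3.0, 2.7]

def flag_deviations(values, cutoff=5.0):
    """Return values whose distance from the median exceeds cutoff * MAD."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:          # all values (nearly) identical: nothing to flag
        return []
    return [v for v in values if abs(v - center) / mad > cutoff]

suspicious = flag_deviations(durations)
print(suspicious)
```

The median and MAD are used here instead of the mean and standard deviation because a single extreme call would otherwise inflate the very norm it is being compared against.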


[Figure: Data mining as the core of the knowledge discovery process.
 Flow, bottom to top: Databases -> Data Cleaning and Data Integration ->
 Data Warehouse -> Selection -> Task-relevant Data -> Data Mining ->
 Pattern Evaluation]
[Figure: Pyramid of layers. The potential to support business decisions
 increases from bottom to top; the role shown beside each layer is given
 in parentheses.]
   Making Decisions (End User)
   Data Presentation: Visualization Techniques (Business Analyst)
   Data Mining: Information Discovery (Data Analyst)
   Data Exploration: Statistical Analysis, Querying and Reporting
   Data Warehouses / Data Marts: OLAP, MDA (DBA)
   Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
   Learning the application domain
     relevant prior knowledge and goals of application
   Creating a target data set: data selection
   Data cleaning and preprocessing: (may take 70% of effort!)
   Data reduction and transformation
     Find useful features, dimensionality/variable reduction, invariant
        representation.
   Choosing functions of data mining
       summarization, classification, regression, association, clustering.
   Choosing the mining algorithm(s)
   Data mining: search for patterns of interest
   Pattern evaluation and knowledge presentation
     visualization, transformation, removing redundant patterns, etc.
   Use of discovered knowledge
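
The early steps above (cleaning, selection, transformation) can be sketched on a toy data set. The records, field names, and helper functions below are all hypothetical, and min-max scaling stands in for "transformation" only as one common choice.

```python
# Hypothetical raw customer records with one missing value and one duplicate.
raw = [
    {"age": 34, "income": 52000, "segment": "A"},
    {"age": None, "income": 61000, "segment": "B"},   # missing value
    {"age": 34, "income": 52000, "segment": "A"},     # exact duplicate
    {"age": 51, "income": 87000, "segment": "B"},
    {"age": 23, "income": 31000, "segment": "A"},
]

def clean(records):
    """Data cleaning: drop records with missing fields and exact duplicates."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if None in r.values() or key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def min_max_scale(records, field):
    """Transformation: rescale one numeric field to the range [0, 1]."""
    lo = min(r[field] for r in records)
    hi = max(r[field] for r in records)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in records]

target = min_max_scale(select(clean(raw), ["income", "segment"]), "income")
print(target)
```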

[Figure: Architecture of a typical data mining system, top to bottom:
 Graphical user interface; Pattern evaluation and Data mining engine
 (both connected to the Knowledge-base); Database or data warehouse
 server; Data cleaning & data integration, Filtering; Databases and
 Data Warehouse]
   General functionality
       Descriptive data mining
       Predictive data mining
     Different views, different classifications
       Kinds of databases to be mined
       Kinds of knowledge to be discovered
       Kinds of techniques utilized
       Kinds of applications adapted

   Concept description: Characterization and discrimination
     Generalize, summarize, and contrast data characteristics, e.g., dry vs.
        wet regions
   Association (correlation and causality)
     Diaper → Beer [support 0.5%, confidence 75%]
   Classification and Prediction
     Construct models (functions) that describe and distinguish classes or
        concepts for future prediction
        ▪ E.g., classify countries based on climate, or classify cars based on gas
          mileage
     Presentation: decision-tree, classification rule, neural network
     Predict some unknown or missing numerical values
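
The bracketed pair on the Diaper/Beer association rule above is conventionally read as [support, confidence]. A minimal sketch of how the two numbers are computed, using a made-up list of transactions (so the resulting percentages differ from the ones on the slide):

```python
# Hypothetical market baskets; only the item names echo the slide.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"diaper", "beer", "chips"},
    {"milk", "bread"},
    {"diaper", "milk", "beer"},
    {"bread", "chips"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
print(f"Diaper -> Beer [{rule_support:.1%}, {rule_confidence:.1%}]")
```

Here 4 of 8 baskets contain both items (support 50%), and 4 of the 5 baskets with diapers also contain beer (confidence 80%).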
   Cluster analysis
     Class label is unknown: Group data to form new
      classes, e.g., cluster houses to find distribution patterns
     Maximizing intra-class similarity & minimizing interclass
      similarity
   Outlier analysis
     Outlier: a data object that does not comply with the general
      behavior of the data
     Noise or exception? Not necessarily: outliers are useful in fraud
      detection and rare-event analysis
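
Grouping unlabeled data so that members of a group are similar to each other and dissimilar to other groups can be sketched with k-means. The slides do not name a clustering algorithm, so k-means is an assumption here, as are the house coordinates and the choice of starting centroids.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    # clusters holds the assignment made in the final iteration
    return centroids, clusters

# Hypothetical house locations (x, y) forming two visible groups.
houses = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9), (8.0, 8.1), (7.9, 8.4), (8.3, 7.8)]
centroids, clusters = k_means(houses, centroids=[houses[0], houses[3]])
print(centroids)
print(clusters)
```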



   Data mining: discovering interesting patterns from large amounts of data
   A natural evolution of database technology, in great demand, with wide
    applications
   A KDD process includes data cleaning, data integration, data
    selection, transformation, data mining, pattern evaluation, and knowledge
    presentation
   Mining can be performed in a variety of information repositories
   Data mining functionalities: characterization, discrimination,
    association, classification, clustering, outlier and trend analysis, etc.
   Data mining systems and architectures
   Major issues in data mining


   R. Agrawal, J. Han, and H. Mannila, Readings in Data Mining: A
    Database Perspective, Morgan Kaufmann (in preparation)
   J. Han and M. Kamber. Data Mining: Concepts and Techniques.
    Morgan Kaufmann, 2001




Thank you!
   • A decision tree (DT) is a hierarchical classification
    and prediction model

    • It is organized as a rooted tree with 2 types of
    nodes: decision (internal) nodes and terminal (leaf) nodes

    • It is a supervised data mining model used for
    classification or prediction
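
A rooted tree with decision nodes and terminal nodes can be represented directly as a data structure. The sketch below is one possible layout; the class names `Leaf` and `Decision`, the `classify` helper, and the weather-style attributes are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class Leaf:
    label: str                      # class assigned at this terminal node

@dataclass
class Decision:
    attribute: str                  # attribute tested at this decision node
    branches: Dict[str, "Node"] = field(default_factory=dict)

Node = Union[Leaf, Decision]

def classify(node, instance):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, Decision):
        node = node.branches[instance[node.attribute]]
    return node.label

# Hand-built example tree (not learned from data).
tree = Decision("outlook", {
    "sunny": Decision("humidity", {"high": Leaf("no"), "normal": Leaf("yes")}),
    "overcast": Leaf("yes"),
    "rainy": Decision("windy", {"true": Leaf("no"), "false": Leaf("yes")}),
})

prediction = classify(tree, {"outlook": "sunny", "humidity": "normal", "windy": "false"})
print(prediction)
```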


   Chance and Terminal Nodes

    •Each internal node of a DT is a decision point, where some
    condition is tested
    •The result of this condition determines which branch of the
    tree is to be taken next
    •Internal nodes are therefore also called decision nodes, chance
    nodes or non-terminal nodes
    •Chance nodes partition the available data at that point to
    maximize differences in the dependent variable


   Terminal nodes

    •The leaf nodes of a DT are called terminal nodes
    •They indicate the class into which a data instance will
    be classified
    •They have exactly one incoming branch (a single parent node)
    •They do not have child nodes (no outgoing branches)
    •There are no conditions tested at terminal nodes
    •Tree traversal from the root to the leaf produces the
    production rule for that class
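
Reading one production rule off each root-to-leaf path can be sketched as a short recursive walk. The tuple-and-dict tree encoding and the IF/THEN rule wording below are illustrative assumptions.

```python
# Hypothetical tree: a decision node is (attribute, {value: subtree});
# a terminal node is simply the class label as a string.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})

def production_rules(node, conditions=()):
    """Yield one IF ... THEN rule per root-to-leaf path."""
    if isinstance(node, str):                      # terminal node: emit the rule
        premise = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        yield f"IF {premise} THEN class = {node}"
        return
    attribute, branches = node
    for value, child in branches.items():
        yield from production_rules(child, conditions + ((attribute, value),))

rules = list(production_rules(tree))
for rule in rules:
    print(rule)
```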

   Advantages of DT

    • Easy to understand and interpret
    • Works for categorical and continuous data
    • High performance classification (generally)
    • DT can grow to any depth
    • On-the-fly prediction
    • Pruning a DT is very easy
    • Can handle missing or null values

   Advantages contd.

    • Can be used to identify outliers
    • Production rules can be obtained directly from the built DT
    • They are relatively fast compared with other classification models
    • DT can be used even when domain experts are absent
    • Provide clear indication of which field is important for
    prediction and classification




   Disadvantages

    •Class-overlap problem (due to the curse of
    dimensionality)
    •Complex production rules
    •A DT can be sub-optimal (for this reason ensemble
    methods were developed)
    •Some decision tree algorithms can handle only binary-valued target classes



      •Training set: used to derive the classifier
        (generally 70-80% of the data)

      •Test set: used to measure accuracy
        (generally 20-30% of the data)
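
A minimal sketch of such a split, using a 70/30 ratio from the ranges above. The function name `train_test_split`, the fixed seed, and the integer rows are illustrative choices.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and cut it into (train, test) parts."""
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))             # stand-in for 100 labeled records
train, test = train_test_split(rows, test_fraction=0.3)
print(len(train), len(test))
```

Shuffling before cutting matters when the records are stored in some order (for example by date or by class), since an unshuffled cut would give the test set a different distribution from the training set.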




   Construction phase: the initial decision tree is
    constructed in this phase
    Q: How to split nodes?
    A: Different algorithms use different splitting criteria

   Pruning phase: in this stage lower branches
    are removed to improve performance on unseen data
    Q: Why?
    A: To avoid overfitting/overtraining
   Decision tree algorithms (with tools that provide them):
   ID3 (available everywhere)
   C4.5 / C5.0 (Weka / SPSS Clementine)
   CART (SPSS Clementine)
   CHAID (SPSS Clementine, etc.)
   Microsoft Decision Trees (MS Analysis Services)
   Random Forests (Statistica)




   ID3 induction algorithm

    •ID3 (Iterative Dichotomiser 3)
    •Introduced in 1986 by Quinlan
    •Designed for classification only
    •Works on categorical attributes only
    •Uses an entropy-based measure (information gain) as the splitting criterion
    •Missing value handling is absent
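
The entropy-based criterion used by ID3 can be sketched as follows: compute the entropy of the class labels, then choose the attribute whose split reduces it the most (information gain). The rows and attribute names are made-up examples.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical categorical training rows.
rows = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "true", "play": "yes"},
]
gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
best = max(gains, key=gains.get)    # attribute ID3 would split on first
print(gains, best)
```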

   C4.5 induction algorithm

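    •Introduced by Quinlan in 1993
    •Is an extension of the ID3 algorithm
    •Designed for classification only
    •Accepts numerical attributes as input
    •Uses an entropy-based measure (gain ratio) as the splitting criterion
    •Uses multi-way splits on categorical attributes (binary threshold splits on numerical ones)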
    •Missing value handling is provided
    •Tree pruning is also provided
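
Two of the C4.5 extensions listed above can be sketched briefly: gain ratio, which divides information gain by the entropy of the split itself so that many-valued attributes are penalised, and a threshold search for numerical attributes. This is a simplified illustration with made-up data, not Quinlan's exact procedure; in particular, taking midpoints between sorted values as candidate thresholds is an assumption.

```python
from collections import Counter
from math import log2

def entropy(values):
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())

def gain_ratio(attribute_values, labels):
    """Information gain divided by split information (entropy of the split)."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == value]
        remainder += len(subset) / total * entropy(subset)
    gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    return gain / split_info if split_info > 0 else 0.0

def best_threshold(numbers, labels):
    """Try midpoints between consecutive distinct values as binary cut points."""
    distinct = sorted(set(numbers))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: gain_ratio([x <= t for x in numbers], labels))

# Hypothetical data: record_id is unique per row, so plain information gain
# would rate it as highly as outlook even though it cannot generalise.
labels = ["no", "no", "yes", "yes", "yes", "yes"]
record_id = ["r1", "r2", "r3", "r4", "r5", "r6"]
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy"]
humidity = [90, 85, 70, 65, 75, 80]

print(gain_ratio(outlook, labels), gain_ratio(record_id, labels))
print(best_threshold(humidity, labels))
```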
   Classification and Regression Trees (CART)

    •Introduced by Breiman et al. in 1984
    •Uses binary recursive partitioning
    •Designed for both classification and regression
    •Works on both categorical & numerical attributes
    •Uses the Gini measure as the splitting criterion for classification
    •Uses two-way splits
    •Missing value handling is provided
    •Tree pruning is also provided
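
The Gini measure and a two-way split on a numerical attribute can be sketched as below. The data are made up, and the candidate thresholds (midpoints between consecutive distinct values) are an illustrative assumption rather than a description of any particular CART implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini impurity of a two-way split."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

def best_binary_split(values, labels):
    """Return (threshold, impurity) of the two-way split with the lowest impurity."""
    best = None
    distinct = sorted(set(values))
    for a, b in zip(distinct, distinct[1:]):
        t = (a + b) / 2
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = split_gini(left, right)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Hypothetical customers: age and whether they bought the product.
ages = [22, 25, 31, 38, 45, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_binary_split(ages, buys)
print(threshold, impurity)
```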
   Chi-squared Automatic Interaction Detection (CHAID)

    •Introduced by Kass in 1980
    •Designed for both classification and regression
    •Works on both categorical & numerical attributes
    •Uses Pearson's χ² test as the splitting criterion
    •Uses multi-way splits
    •Missing value handling is provided
    •Avoids tree pruning
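
The statistic behind the χ² splitting criterion can be sketched as below: build the contingency table of a predictor against the class and compare observed with expected counts. This covers only the test statistic. CHAID's category merging and p-value adjustment are left out, and the data are made up.

```python
from collections import Counter

def chi_squared(attribute_values, labels):
    """Pearson chi-squared statistic for the attribute-by-class contingency table."""
    n = len(labels)
    row_totals = Counter(attribute_values)
    col_totals = Counter(labels)
    observed = Counter(zip(attribute_values, labels))
    stat = 0.0
    for a, row_n in row_totals.items():
        for c, col_n in col_totals.items():
            expected = row_n * col_n / n        # count expected under independence
            stat += (observed[(a, c)] - expected) ** 2 / expected
    return stat

# Hypothetical predictors and class labels for eight customers.
region = ["north", "north", "north", "north", "south", "south", "south", "south"]
gender = ["m", "f", "m", "f", "m", "f", "m", "f"]
bought = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]

print(chi_squared(region, bought))   # stronger association with the class
print(chi_squared(gender, bought))   # no association in this sample
```

Both tables here have the same degrees of freedom, so the larger statistic corresponds to the smaller p-value and region would be the preferred split.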

   Microsoft Decision Trees

    •Introduced by Microsoft in 1999
    •Designed for both classification and regression
    •Works on both categorical & numerical attributes
    •Offers entropy, Bayesian K2, and Bayesian
    Dirichlet Equivalent with Uniform prior as choices for the
    splitting criterion
    •Uses multi-way splits and also supports binary splitting
    •Missing value handling is provided
    •Avoids tree pruning
   Overfitting: An induced tree may overfit the training data
     Too many branches, some may reflect anomalies due to noise or outliers
     Poor accuracy for unseen samples




   Two approaches to avoid overfitting
     Prepruning: Halt tree construction early—do not split a node if this
       would result in the goodness measure falling below a threshold
       ▪ Difficult to choose an appropriate threshold
     Postpruning: Remove branches from a “fully grown” tree—get a
       sequence of progressively pruned trees
       ▪ Use a set of data different from the training data to decide which is
         the “best pruned tree”
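
Postpruning with a separate data set can be sketched as reduced-error pruning: working bottom-up, replace a subtree with a leaf whenever that does not increase the error on held-out rows. The slides do not name a specific pruning method, so this choice is an assumption, as are the dictionary tree layout (with a stored `majority` label per node) and the validation rows.

```python
def predict(node, row):
    """Walk the tree; unseen attribute values fall back to the node's majority class."""
    while not isinstance(node, str):
        node = node["branches"].get(row[node["attr"]], node["majority"])
    return node

def errors(node, rows):
    return sum(predict(node, r) != r["label"] for r in rows)

def prune(node, rows):
    """Reduced-error pruning using held-out rows that reach this node."""
    if isinstance(node, str):
        return node
    for value, child in list(node["branches"].items()):
        subset = [r for r in rows if r[node["attr"]] == value]
        node["branches"][value] = prune(child, subset)
    if errors(node["majority"], rows) <= errors(node, rows):
        return node["majority"]     # the leaf is at least as good: cut the subtree
    return node

# Hypothetical fully grown tree; the "temp" test under "overcast" fits noise.
tree = {"attr": "outlook", "majority": "yes", "branches": {
    "sunny": "no",
    "rainy": {"attr": "windy", "majority": "yes",
              "branches": {"true": "no", "false": "yes"}},
    "overcast": {"attr": "temp", "majority": "yes",
                 "branches": {"hot": "yes", "cool": "no"}},
}}
validation = [
    {"outlook": "sunny", "windy": "false", "temp": "hot", "label": "no"},
    {"outlook": "sunny", "windy": "true", "temp": "cool", "label": "no"},
    {"outlook": "rainy", "windy": "true", "temp": "cool", "label": "no"},
    {"outlook": "rainy", "windy": "false", "temp": "hot", "label": "yes"},
    {"outlook": "overcast", "windy": "false", "temp": "cool", "label": "yes"},
    {"outlook": "overcast", "windy": "true", "temp": "hot", "label": "yes"},
]

before = errors(tree, validation)
pruned = prune(tree, validation)    # note: modifies the tree in place
after = errors(pruned, validation)
print(before, after)
```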




[Figure: validation error and training error plotted against time]

Más contenido relacionado

La actualidad más candente

Anomaly detection Workshop slides
Anomaly detection Workshop slidesAnomaly detection Workshop slides
Anomaly detection Workshop slidesQuantUniversity
 
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for TelecomFraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for TelecomSudarson Roy Pratihar
 
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live ImplementationAnomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live ImplementationImpetus Technologies
 
Data mining Part 1
Data mining Part 1Data mining Part 1
Data mining Part 1Gautam Kumar
 
Evaluation of multilabel multi class classification
Evaluation of multilabel multi class classificationEvaluation of multilabel multi class classification
Evaluation of multilabel multi class classificationSridhar Nomula
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data MiningValerii Klymchuk
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
03. Data Exploration.pptx
03. Data Exploration.pptx03. Data Exploration.pptx
03. Data Exploration.pptxSarojkumari55
 
decision tree regression
decision tree regressiondecision tree regression
decision tree regressionAkhilesh Joshi
 
Inductive analytical approaches to learning
Inductive analytical approaches to learningInductive analytical approaches to learning
Inductive analytical approaches to learningswapnac12
 
Feature enginnering and selection
Feature enginnering and selectionFeature enginnering and selection
Feature enginnering and selectionDavis David
 
Anomaly Detection and Spark Implementation - Meetup Presentation.pptx
Anomaly Detection and Spark Implementation - Meetup Presentation.pptxAnomaly Detection and Spark Implementation - Meetup Presentation.pptx
Anomaly Detection and Spark Implementation - Meetup Presentation.pptxImpetus Technologies
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data MiningDHIVYADEVAKI
 
Data mining approaches and methods
Data mining approaches and methodsData mining approaches and methods
Data mining approaches and methodssonangrai
 
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning AlgorithmsDezyreAcademy
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsManojit Nandi
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 

La actualidad más candente (20)

Anomaly detection Workshop slides
Anomaly detection Workshop slidesAnomaly detection Workshop slides
Anomaly detection Workshop slides
 
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for TelecomFraud Analytics with Machine Learning and Big Data Engineering for Telecom
Fraud Analytics with Machine Learning and Big Data Engineering for Telecom
 
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live ImplementationAnomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation
 
Data mining Part 1
Data mining Part 1Data mining Part 1
Data mining Part 1
 
Evaluation of multilabel multi class classification
Evaluation of multilabel multi class classificationEvaluation of multilabel multi class classification
Evaluation of multilabel multi class classification
 
Decision tree
Decision treeDecision tree
Decision tree
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
03. Data Exploration.pptx
03. Data Exploration.pptx03. Data Exploration.pptx
03. Data Exploration.pptx
 
decision tree regression
decision tree regressiondecision tree regression
decision tree regression
 
Inductive analytical approaches to learning
Inductive analytical approaches to learningInductive analytical approaches to learning
Inductive analytical approaches to learning
 
Feature enginnering and selection
Feature enginnering and selectionFeature enginnering and selection
Feature enginnering and selection
 
Anomaly Detection and Spark Implementation - Meetup Presentation.pptx
Anomaly Detection and Spark Implementation - Meetup Presentation.pptxAnomaly Detection and Spark Implementation - Meetup Presentation.pptx
Anomaly Detection and Spark Implementation - Meetup Presentation.pptx
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
Data mining approaches and methods
Data mining approaches and methodsData mining approaches and methods
Data mining approaches and methods
 
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning Algorithms
 
Anomaly detection
Anomaly detectionAnomaly detection
Anomaly detection
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World Systems
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Ensemble Learning.pptx
Ensemble Learning.pptxEnsemble Learning.pptx
Ensemble Learning.pptx
 

Destacado

Destacado (8)

Hopfield Ağı
Hopfield AğıHopfield Ağı
Hopfield Ağı
 
ADEM: An Online Decision Tree Based Menu Demand Prediction Tool for Food Courts
ADEM: An Online Decision Tree Based Menu Demand Prediction Tool for Food CourtsADEM: An Online Decision Tree Based Menu Demand Prediction Tool for Food Courts
ADEM: An Online Decision Tree Based Menu Demand Prediction Tool for Food Courts
 
Yapay sinir agları
Yapay sinir aglarıYapay sinir agları
Yapay sinir agları
 
002.decision trees
002.decision trees002.decision trees
002.decision trees
 
hopfield neural network
hopfield neural networkhopfield neural network
hopfield neural network
 
Id3,c4.5 algorithim
Id3,c4.5 algorithimId3,c4.5 algorithim
Id3,c4.5 algorithim
 
Hopfield Networks
Hopfield NetworksHopfield Networks
Hopfield Networks
 
HOPFIELD NETWORK
HOPFIELD NETWORKHOPFIELD NETWORK
HOPFIELD NETWORK
 

Similar a Data mining & Decison Trees

What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)Pratik Tambekar
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
6months industrial training in data mining, jalandhar
6months industrial training in data mining, jalandhar6months industrial training in data mining, jalandhar
6months industrial training in data mining, jalandhardeepikakaler1
 
6 weeks summer training in data mining,ludhiana
6 weeks summer training in data mining,ludhiana6 weeks summer training in data mining,ludhiana
6 weeks summer training in data mining,ludhianadeepikakaler1
 
6 weeks summer training in data mining,jalandhar
6 weeks summer training in data mining,jalandhar6 weeks summer training in data mining,jalandhar
6 weeks summer training in data mining,jalandhardeepikakaler1
 
6months industrial training in data mining,ludhiana
6months industrial training in data mining,ludhiana6months industrial training in data mining,ludhiana
6months industrial training in data mining,ludhianadeepikakaler1
 
Introduction.ppt
Introduction.pptIntroduction.ppt
Introduction.pptbommaiah
 
Introduction To Data Mining
Introduction To Data MiningIntroduction To Data Mining
Introduction To Data Miningdataminers.ir
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1DanWooster1
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesDeepaR42
 

Similar a Data mining & Decison Trees (20)

Data mining
Data miningData mining
Data mining
 
Introduction data mining
Introduction data miningIntroduction data mining
Introduction data mining
 
D
DD
D
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)
 
Data mining 1
Data mining 1Data mining 1
Data mining 1
 
Introduction
IntroductionIntroduction
Introduction
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
isd314-01
isd314-01isd314-01
isd314-01
 
6months industrial training in data mining, jalandhar
6months industrial training in data mining, jalandhar6months industrial training in data mining, jalandhar
6months industrial training in data mining, jalandhar
 
6 weeks summer training in data mining,ludhiana
6 weeks summer training in data mining,ludhiana6 weeks summer training in data mining,ludhiana
6 weeks summer training in data mining,ludhiana
 
6 weeks summer training in data mining,jalandhar
6 weeks summer training in data mining,jalandhar6 weeks summer training in data mining,jalandhar
6 weeks summer training in data mining,jalandhar
 
6months industrial training in data mining,ludhiana
6months industrial training in data mining,ludhiana6months industrial training in data mining,ludhiana
6months industrial training in data mining,ludhiana
 
Introduction to data warehouse
Introduction to data warehouseIntroduction to data warehouse
Introduction to data warehouse
 
Introduction.ppt
Introduction.pptIntroduction.ppt
Introduction.ppt
 
Introduction To Data Mining
Introduction To Data MiningIntroduction To Data Mining
Introduction To Data Mining
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
Chapter 1. Introduction.ppt
Chapter 1. Introduction.pptChapter 1. Introduction.ppt
Chapter 1. Introduction.ppt
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and Techniques
 
Data mining
Data miningData mining
Data mining
 

Más de Selman Bozkır

23--Web-Design-Principles
23--Web-Design-Principles23--Web-Design-Principles
23--Web-Design-PrinciplesSelman Bozkır
 
Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promisin...
Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promisin...Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promisin...
Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promisin...Selman Bozkır
 
Kötücül Yazılımların Tanınmasında Evrişimsel Sinir Ağlarının Kullanımı ve Kar...
Kötücül Yazılımların Tanınmasında Evrişimsel Sinir Ağlarının Kullanımı ve Kar...Kötücül Yazılımların Tanınmasında Evrişimsel Sinir Ağlarının Kullanımı ve Kar...
Kötücül Yazılımların Tanınmasında Evrişimsel Sinir Ağlarının Kullanımı ve Kar...Selman Bozkır
 
Use of hog descriptors in phishing detection
Use of hog descriptors in phishing detectionUse of hog descriptors in phishing detection
Use of hog descriptors in phishing detectionSelman Bozkır
 
Measurement and metrics in model driven software development
Measurement and metrics in model driven software developmentMeasurement and metrics in model driven software development
Measurement and metrics in model driven software developmentSelman Bozkır
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
 
SHOE (simple html ontology extensions)
SHOE (simple html ontology extensions)SHOE (simple html ontology extensions)
SHOE (simple html ontology extensions)Selman Bozkır
 
Predicting food demand in food courts by decision tree approaches
Predicting food demand in food courts by decision tree approachesPredicting food demand in food courts by decision tree approaches
Predicting food demand in food courts by decision tree approachesSelman Bozkır
 
Identification of User Patterns in Social Networks by Data Mining Techniques:...
Identification of User Patterns in Social Networks by Data Mining Techniques:...Identification of User Patterns in Social Networks by Data Mining Techniques:...
Identification of User Patterns in Social Networks by Data Mining Techniques:...Selman Bozkır
 
FUAT – A Fuzzy Clustering Analysis Tool
FUAT – A Fuzzy Clustering Analysis ToolFUAT – A Fuzzy Clustering Analysis Tool
FUAT – A Fuzzy Clustering Analysis ToolSelman Bozkır
 

Más de Selman Bozkır (12)

lecture_07.pptx
lecture_07.pptxlecture_07.pptx
lecture_07.pptx
 
23--Web-Design-Principles
23--Web-Design-Principles23--Web-Design-Principles
23--Web-Design-Principles
 
Data Mining & Decision Trees

  • 1. Presented and Contributed by: Ahmet Selman Bozkır Hacettepe University Ph.D. Student November 29, 2011 1
  • 2. What is data mining? • Motivation: Why data mining? • Classification of data mining systems • Architecture: Typical Data Mining System • Data mining functionality
  • 3. Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data • Data mining: a misnomer? • Alternative names • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. • Watch out: Is everything “data mining”? • (Deductive) query processing • Expert systems or small ML/statistical programs
  • 4. Data explosion problem • Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories • We are drowning in data, but starving for knowledge! • Solution: Data warehousing and data mining • Data warehousing and on-line analytical processing • Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
  • 5. Data analysis and decision support • Market analysis and management ▪ Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation • Risk analysis and management ▪ Forecasting, customer retention, improved underwriting, quality control, competitive analysis • Fraud detection and detection of unusual patterns (outliers) • Other applications • Text mining (news groups, email, documents) and Web mining • Bioinformatics and bio-data analysis
  • 6. Target marketing • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. • Determine customer purchasing patterns over time • Cross-market analysis: find associations/correlations between product sales, and predict based on such associations • Customer profiling: what types of customers buy what products (clustering or classification) • Customer requirement analysis • Identify the best products for different groups of customers • Predict what factors will attract new customers
  • 7. Finance planning and asset evaluation • Cash flow analysis and prediction • Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) • Resource planning • Summarize and compare the resources and spending • Competition • Monitor competitors and market directions • Group customers into classes and a class-based pricing procedure • Set pricing strategy in a highly competitive market
  • 8. Approaches: Clustering & model construction for frauds, outlier analysis • Applications: Health care, retail, credit card service, telecomm. • Auto insurance: ring of collisions • Money laundering: suspicious monetary transactions • Medical insurance ▪ Professional patients, ring of doctors, and ring of references ▪ Unnecessary or correlated screening tests • Telecommunications: phone-call fraud ▪ Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm • Retail industry ▪ Analysts estimate that 38% of retail shrink is due to dishonest employees • Anti-terrorism
  • 9. Data mining: the core of the knowledge discovery process. [Figure: KDD pipeline, in order: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation]
  • 10. [Figure: pyramid titled "Increasing potential to support business decisions". Layers from bottom to top: Data Sources (paper, files, information providers, database systems, OLTP); Data Warehouses / Data Marts (OLAP, MDA); Data Exploration (statistical analysis, querying and reporting); Data Mining (information discovery); Data Presentation (visualization techniques); Making Decisions. Roles shown alongside the layers: DBA, Data Analyst, Business Analyst, End User]
  • 11. Learning the application domain • Relevant prior knowledge and goals of application • Creating a target data set: data selection • Data cleaning and preprocessing (may take 70% of effort!) • Data reduction and transformation • Find useful features, dimensionality/variable reduction, invariant representation • Choosing functions of data mining • Summarization, classification, regression, association, clustering • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Pattern evaluation and knowledge presentation • Visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge
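  • 12. [Figure: architecture of a typical data mining system. Layers from top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning, data integration and filtering; Databases and Data Warehouse. A Knowledge-base sits alongside the upper layers]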
  • 13. General functionality • Descriptive data mining • Predictive data mining • Different views, different classifications • Kinds of databases to be mined • Kinds of knowledge to be discovered • Kinds of techniques utilized • Kinds of applications adapted
  • 14. Concept description: Characterization and discrimination • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions • Association (correlation and causality) • Diaper → Beer [0.5%, 75%] • Classification and Prediction • Construct models (functions) that describe and distinguish classes or concepts for future prediction ▪ E.g., classify countries based on climate, or classify cars based on gas mileage • Presentation: decision-tree, classification rule, neural network • Predict some unknown or missing numerical values
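The bracketed pair in the Diaper → Beer rule is conventionally read as [support, confidence]; the slide does not spell this out, so treat that reading as an assumption. A minimal Python sketch of how the two numbers could be computed, using made-up baskets (the function name and data are illustrative only):

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions containing the left-hand side of the rule.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions containing both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical market baskets, not data from the slides.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]
support, confidence = support_and_confidence(baskets, {"diaper"}, {"beer"})
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Support is the share of all baskets containing both items; confidence is the share of diaper baskets that also contain beer.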
  • 15. Cluster analysis • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns • Maximizing intra-class similarity & minimizing inter-class similarity • Outlier analysis • Outlier: a data object that does not comply with the general behavior of the data • Noise or exception? No! Useful in fraud detection and rare events analysis
  • 16. Data mining: discovering interesting patterns from large amounts of data • A natural evolution of database technology, in great demand, with wide applications • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation • Mining can be performed in a variety of information repositories • Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. • Data mining systems and architectures • Major issues in data mining
  • 17. R. Agrawal, J. Han, and H. Mannila. Readings in Data Mining: A Database Perspective. Morgan Kaufmann (in preparation) • J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001
  • 18. Thank you!
  • 19. • A decision tree (DT) is a hierarchical classification and prediction model • It is organized as a rooted tree with 2 types of nodes: decision (internal) nodes and terminal (leaf) nodes • It is a supervised data mining model used for classification or prediction
  • 21. Chance and Terminal Nodes • Each internal node of a DT is a decision point, where some condition is tested • The result of this condition determines which branch of the tree is taken next • Such nodes are therefore called decision nodes, chance nodes or non-terminal nodes • Chance nodes partition the available data at that point to maximize differences in the dependent variable
  • 22. Terminal nodes • The leaf nodes of a DT are called terminal nodes • They indicate the class into which a data instance will be classified • They have just one incoming edge (from their parent node) • They do not have child nodes (outgoing edges) • No conditions are tested at terminal nodes • Traversing the tree from the root to a leaf produces the production rule for that class
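The node descriptions above can be sketched in Python. This is a minimal illustration, assuming a hypothetical toy tree in which decision nodes are dicts and terminal nodes are plain class labels; the attribute names and the representation are not taken from the slides.

```python
# Hypothetical toy tree: decision nodes are dicts, terminal nodes are class labels.
toy_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": "no",
    },
}


def classify(node, instance):
    """Follow the branch chosen by each tested condition until a leaf is reached."""
    while not isinstance(node, str):
        node = node["branches"][instance[node["attribute"]]]
    return node


def extract_rules(node, conditions=()):
    """Walk every root-to-leaf path and return one IF-THEN rule per terminal node."""
    if isinstance(node, str):  # terminal node: emit the accumulated rule
        premise = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {premise} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules += extract_rules(child, conditions + ((node["attribute"], value),))
    return rules


for rule in extract_rules(toy_tree):
    print(rule)
```

Each printed line corresponds to one root-to-leaf traversal, which is the sense in which a built tree yields production rules directly.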
  • 24. Advantages of DT • Easy to understand and interpret • Works for categorical and continuous data • High-performance classification (generally) • A DT can grow to any depth • On-the-fly prediction • Pruning a DT is very easy • Works with missing or null values
  • 25. Advantages contd. • Can be used to identify outliers • Production rules can be obtained directly from the built DT • They are relatively fast compared with other classification models • A DT can be used even when domain experts are absent • Provides a clear indication of which fields are important for prediction and classification
  • 26. Disadvantages • Class-overlap problem (due to the curse of dimensionality) • Complex production rules • A DT can be sub-optimal (for this reason ensemble methods were developed) • Some decision tree algorithms can deal only with binary-valued data
  • 28. • Training set: used to derive the classifier (generally 70%-80% of the data) • Test set: used to measure accuracy (generally 20%-30% of the data)
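A minimal Python sketch of such a split, assuming a simple random shuffle with a fixed seed; the function name, the 30% default and the seed are illustrative choices, not something the slides prescribe.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(10))  # stand-in for ten labelled records
train, test = train_test_split(records, test_fraction=0.3)
print(len(train), len(test))
```

The classifier is derived from `train` only; `test` is kept aside so that accuracy is measured on records the tree has not seen.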
  • 29. Construction Phase: The initial decision tree is constructed in this phase. Q: How to split nodes? A: Different approaches, depending on the algorithm • Pruning Phase: In this stage lower branches are removed to improve performance. Q: Why? A: To avoid overfitting/overtraining
  • 30. ID3 (available everywhere) • C4.5 / C5.0 (Weka / SPSS Clementine) • CART (SPSS Clementine) • CHAID (SPSS Clementine, etc.) • Microsoft Decision Trees (MS Analysis Services) • Random Forests (Statistica)
  • 31. ID3 induction algorithm • ID3 (Iterative Dichotomiser 3) • Introduced in 1986 by Quinlan • Designed for classification only • Works on categorical attributes only • Uses an entropy-based measure (information gain) as the splitting criterion • Missing value handling is absent
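The entropy-based splitting criterion can be illustrated with a short Python sketch. It computes Shannon entropy and the information gain of one categorical attribute; the toy rows and attribute names are hypothetical, and this shows only the split measure, not a full ID3 implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical training data: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rainy", "windy": True},
    {"outlook": "rainy", "windy": False},
]
labels = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")
print(gain_outlook, gain_windy)
```

At each node the attribute with the highest gain would be chosen for the split, so "outlook" wins over "windy" in this toy example.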
  • 32. C4.5 induction algorithm • Invented by Quinlan in 1993 • An extension of the ID3 algorithm • Designed for classification only • Numerical attributes can be used as input • Uses an entropy-based measure (gain ratio) as the splitting criterion • Uses multi-way splits • Missing value handling is provided • Tree pruning is also provided
  • 33. Classification and Regression Trees (CART) • Invented by Breiman et al. in 1984 • Uses a binary recursive partitioning method • Designed for both classification and regression • Works on both categorical & numerical attributes • Uses the Gini measure as the splitting criterion • Uses two-way splits • Missing value handling is provided • Tree pruning is also provided
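A minimal Python sketch of a Gini-based two-way split on one numerical attribute. The candidate thresholds (midpoints between consecutive sorted values) and the toy data are assumptions made for illustration; real CART implementations add many details not shown here.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    total = len(pairs)
    best = (None, gini(labels))  # no split unless it lowers the impurity
    for i in range(1, total):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = len(left) / total * gini(left) + len(right) / total * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: customer age versus whether a product was bought.
ages = [22, 25, 30, 41, 52, 60]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_binary_split(ages, bought)
print(threshold, impurity)
```

Applying the same search recursively to each resulting half is what the binary recursive partitioning bullet refers to.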
  • 34. Chi-squared Automatic Interaction Detection (CHAID) • Invented by Kass in 1980 • Designed for both classification and regression • Works on both categorical & numerical attributes • Uses Karl Pearson's χ² test as the splitting criterion • Uses multi-way splits • Missing value handling is provided • Avoids tree pruning
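As a rough illustration of the χ² criterion, the Python sketch below computes Pearson's chi-squared statistic for a contingency table of attribute value (rows) against class (columns). The table values are invented, every expected count is assumed to be non-zero, and the category merging and p-value adjustment that CHAID performs are not shown.

```python
def chi_squared(table):
    """Pearson's chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if attribute and class were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows = attribute values, columns = classes.
associated = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]
print(chi_squared(associated), chi_squared(independent))
```

A larger statistic indicates a stronger association between the attribute and the class, which makes that attribute a better split candidate.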
  • 35. Microsoft Decision Trees • Invented by Microsoft in 1999 • Designed for both classification and regression • Works on both categorical & numerical attributes • Offers entropy, Bayesian K2, and Bayesian Dirichlet Equivalent with Uniform prior as choices of splitting criterion • Uses multi-way splits and supports binary splitting • Missing value handling is provided • Avoids tree pruning
  • 36. Overfitting: An induced tree may overfit the training data • Too many branches, some of which may reflect anomalies due to noise or outliers • Poor accuracy for unseen samples
  • 37. Two approaches to avoid overfitting • Prepruning: Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold ▪ Difficult to choose an appropriate threshold • Postpruning: Remove branches from a “fully grown” tree to get a sequence of progressively pruned trees ▪ Use a set of data different from the training data to decide which is the “best pruned tree”
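The two decisions can be sketched in Python as follows. This is only a schematic under stated assumptions: the goodness value, the threshold, and the (leaf count, validation error) pairs are hypothetical numbers, and the tie-breaking rule in favor of the smaller tree is an illustrative choice.

```python
def should_split(goodness, threshold):
    """Prepruning: split a node only if the goodness measure reaches the threshold."""
    return goodness >= threshold


def best_pruned_tree(candidates):
    """Postpruning: choose the pruned tree with the lowest validation error.

    candidates is a list of (n_leaves, validation_error) pairs, one per tree in
    the pruning sequence; ties are broken in favor of the smaller tree.
    """
    return min(candidates, key=lambda c: (c[1], c[0]))


# Hypothetical pruning sequence, evaluated on data held out from training.
sequence = [(12, 0.21), (8, 0.18), (5, 0.18), (2, 0.25)]
chosen = best_pruned_tree(sequence)
print(should_split(0.02, threshold=0.05), chosen)
```

The validation errors must come from data not used to grow the tree; otherwise the fully grown tree would always look best.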
  • 38. [Figure: validation error and training error plotted against time]