What is Data Mining?
Agenda
• What Data Mining IS and IS NOT
• Steps in the Data Mining Process
  – CRISP-DM
  – Explanation of Models
  – Examples of Data Mining
    Applications
• Questions
The Evolution of Data Analysis
Evolutionary Step   Business Question     Enabling Technologies   Product Providers    Characteristics

Data Collection     "What was my total    Computers, tapes,       IBM, CDC             Retrospective,
(1960s)             revenue in the last   disks                                        static data delivery
                    five years?"

Data Access         "What were unit       Relational databases    Oracle, Sybase,      Retrospective,
(1980s)             sales in New          (RDBMS), Structured     Informix, IBM,       dynamic data delivery
                    England last          Query Language (SQL),   Microsoft            at record level
                    March?"               ODBC

Data Warehousing    "What were unit       On-line analytic        SPSS, Comshare,      Retrospective,
& Decision          sales in New          processing (OLAP),      Arbor, Cognos,       dynamic data delivery
Support             England last          multidimensional        Microstrategy, NCR   at multiple levels
(1990s)             March? Drill down     databases, data
                    to Boston."           warehouses

Data Mining         "What’s likely to     Advanced algorithms,    SPSS/Clementine,     Prospective,
(Emerging Today)    happen to Boston      multiprocessor          Lockheed, IBM,       proactive
                    unit sales next       computers, massive      SGI, SAS, NCR,       information
                    month? Why?"          databases               Oracle, numerous     delivery
                                                                  startups
Results of Data Mining
       Include:
  • Forecasting what may happen in
    the future
  • Classifying people or things into
    groups by recognizing patterns
  • Clustering people or things into
    groups based on their attributes
  • Associating what events are likely
    to occur together
  • Sequencing what events are likely
    to lead to later events
Data Mining Is Not
•Brute-force crunching of bulk data
•“Blind” application of algorithms
•A way to find relationships where none exist
•Presenting data in different ways
•A database-intensive task
•A difficult-to-understand technology requiring an advanced degree in computer science
Data Mining Is
        •A hot buzzword for a class of
        techniques that find patterns in data
        •A user-centric, interactive process
        which leverages analysis
        technologies and computing power
        •A group of techniques that find
        relationships that have not
        previously been discovered
        •Not reliant on an existing database
        •A relatively easy task that requires
        knowledge of the business problem/
        subject matter expertise
Data Mining versus OLAP
•OLAP - On-line Analytical Processing
   – Provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening
Data Mining Versus Statistical
             Analysis
•Data Mining
    – Originally developed to act as expert systems to solve problems
    – Less interested in the mechanics of the technique
    – If it makes sense then let’s use it
    – Does not require assumptions to be made about data
    – Can find patterns in very large amounts of data
    – Requires understanding of data and business problem
•Data Analysis
    – Tests for statistical correctness of models
       • Are statistical assumptions of models correct?
           – E.g., is the R-square good?
    – Hypothesis testing
       • Is the relationship significant?
           – Use a t-test to validate significance
    – Tends to rely on sampling
    – Techniques are not optimised for large amounts of data
    – Requires strong statistical skills
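As an illustration of the "use a t-test" bullet, here is a minimal sketch of Welch's t statistic for two independent samples. The function name `welch_t` and the sales figures are made up for illustration and do not come from the slides; turning the statistic into a p-value would additionally need a t distribution, which is left out here.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances (n - 1)
    standard_error = math.sqrt(va / na + vb / nb)
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Hypothetical weekly unit sales for two regions (illustrative data only).
region_a = [12, 15, 14, 10, 13]
region_b = [8, 9, 11, 7, 10]
t = welch_t(region_a, region_b)
```

A larger absolute value of `t` suggests that the difference between the two means is less likely to be due to sampling noise alone.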
Examples of What People
    are Doing with Data Mining:
•Fraud/Non-Compliance Anomaly detection
   – Isolate the factors that lead to fraud, waste and abuse
   – Target auditing and investigative efforts more effectively
•Credit/Risk Scoring
•Intrusion detection
•Parts failure prediction
•Recruiting/Attracting customers
•Maximizing profitability (cross selling, identifying profitable customers)
•Service Delivery and Customer Retention
   – Build profiles of customers likely to use which services
•Web Mining
How Can We Do Data Mining?
By Utilizing the CRISP-DM Methodology
  – a standard process
  – existing data
  – software technologies
  – situational expertise
Why Should There be a
Standard Process?
The data mining process must be reliable and repeatable by people with little data mining background.
•Framework for recording experience
    – Allows projects to be replicated
•Aid to project planning and management
•“Comfort factor” for new adopters
    – Demonstrates maturity of Data Mining
    – Reduces dependency on “stars”
Process
    Standardization
CRISP-DM:
•   CRoss Industry Standard Process for Data Mining
•   Initiative launched Sept. 1996
•   SPSS/ISL, NCR, Daimler-Benz, OHRA
•   Funding from the European Commission
•   Over 200 members of the CRISP-DM SIG worldwide
    – DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries,
      Syllogic, Magnify, ..
    – System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte
      & Touche, …
    – End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...
CRISP-DM
•Non-proprietary
•Application/Industry neutral
•Tool neutral
•Focus on business issues
    – As well as technical analysis
•Framework for guidance
•Experience base
    – Templates for Analysis
The CRISP-DM Process Model
Why CRISP-DM?
•The data mining process must be reliable and repeatable by
people with limited data mining skills

•CRISP-DM provides a uniform framework for
   –guidelines
   –experience documentation

•CRISP-DM is flexible to account for differences
   –Different business/agency problems
   –Different data
Phases and Tasks
•Business Understanding
   – Determine Business Objectives
      • Background
      • Business Objectives
      • Business Success Criteria
   – Situation Assessment
      • Inventory of Resources
      • Requirements, Assumptions, and Constraints
      • Risks and Contingencies
      • Terminology
      • Costs and Benefits
   – Determine Data Mining Goal
      • Data Mining Goals
      • Data Mining Success Criteria
   – Produce Project Plan
      • Project Plan
      • Initial Assessment of Tools and Techniques
•Data Understanding
   – Collect Initial Data
      • Initial Data Collection Report
   – Describe Data
      • Data Description Report
   – Explore Data
      • Data Exploration Report
   – Verify Data Quality
      • Data Quality Report
•Data Preparation
   – Data Set
      • Data Set Description
   – Select Data
      • Rationale for Inclusion / Exclusion
   – Clean Data
      • Data Cleaning Report
   – Construct Data
      • Derived Attributes
      • Generated Records
   – Integrate Data
      • Merged Data
   – Format Data
      • Reformatted Data
•Modeling
   – Select Modeling Technique
      • Modeling Technique
      • Modeling Assumptions
   – Generate Test Design
      • Test Design
   – Build Model
      • Parameter Settings
      • Models
      • Model Description
   – Assess Model
      • Model Assessment
      • Revised Parameter Settings
•Evaluation
   – Evaluate Results
      • Assessment of Data Mining Results w.r.t. Business Success Criteria
      • Approved Models
   – Review Process
      • Review of Process
   – Determine Next Steps
      • List of Possible Actions
      • Decision
•Deployment
   – Plan Deployment
      • Deployment Plan
   – Plan Monitoring and Maintenance
      • Monitoring and Maintenance Plan
   – Produce Final Report
      • Final Report
      • Final Presentation
   – Review Project
      • Experience Documentation
Phases in the DM Process:
CRISP-DM
Phases in the DM
    Process (1 & 2)
•Business Understanding:
   – Statement of Business Objective
   – Statement of Data Mining objective
   – Statement of Success Criteria
•Data Understanding
   – Explore the data and verify the quality
   – Find outliers
Phases in the DM
  Process (3)
• Data preparation:
   – Usually takes over 90% of our time
      • Collection
      • Assessment
      • Consolidation and Cleaning
          – table links, aggregation level,
            missing values, etc
      • Data selection
          – active role in ignoring non-
            contributory data?
          – outliers?
          – Use of samples
          – visualization tools
      • Transformations - create new
        variables
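The cleaning and transformation steps above can be sketched as follows. This is a minimal illustration, assuming median imputation for missing values and a derived `debt_ratio` variable; the records, field names, and both helper functions are hypothetical and not taken from the slides.

```python
from statistics import median

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "income": 52000, "debt": 13000},
    {"id": 2, "income": None, "debt": 4000},
    {"id": 3, "income": 38000, "debt": 19000},
    {"id": 4, "income": 61000, "debt": None},
]

def impute_median(rows, field):
    """Cleaning step: replace missing values of `field` with the median of the known values."""
    known = [r[field] for r in rows if r[field] is not None]
    fill = median(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def add_debt_ratio(rows):
    """Transformation step: derive a new variable from two existing ones."""
    return [{**r, "debt_ratio": r["debt"] / r["income"]} for r in rows]

cleaned = impute_median(impute_median(records, "income"), "debt")
prepared = add_debt_ratio(cleaned)
```

Median imputation is only one of several reasonable choices for missing values; dropping incomplete records or flagging them with an indicator variable are common alternatives.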
Phases in the DM Process
            (4)
 • Model building
   – Selection of the modeling
     techniques is based upon
     the data mining objective
   – Modeling is an iterative
     process - different for
     supervised and
     unsupervised learning
      • May model for either
        description or prediction
Types of Models
•Prediction Models for Predicting and Classifying
   – Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)
   – Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)
•Descriptive Models for Grouping and Finding Associations
   – Clustering/Grouping algorithms: K-means, Kohonen
   – Association algorithms: apriori, GRI
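To make the descriptive side concrete, here is a minimal sketch of K-means, one of the clustering algorithms named on this slide. It assumes 2-D points, squared Euclidean distance, and the first k points as starting centroids; the function name `kmeans` and the sample points are illustrative only.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
```

No target variable is supplied: the groups emerge from the attributes alone, which is what distinguishes clustering from classification.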
Neural Network
[Figure: network diagram with an input layer, a hidden layer, and an output]
Neural Networks
• Description
  – Difficult interpretation
  – Tends to ‘overfit’ the data
  – Extensive amount of training time
  – A lot of data preparation
  – Works with all data types
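The input, hidden, and output layers can be illustrated with a single forward pass. This is a minimal sketch, assuming sigmoid activations and a single output; the weights are hand-picked and untrained, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: input layer -> hidden layer -> single output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + b)
        for weights, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical, untrained weights: 2 inputs, 2 hidden units, 1 output.
score = forward(
    inputs=[0.5, -1.0],
    hidden_weights=[[0.0, 0.0], [1.0, 0.5]],
    hidden_biases=[0.0, 0.0],
    output_weights=[1.0, -1.0],
    output_bias=0.0,
)
```

Training consists of adjusting these weights to reduce prediction error, which is where the long training times and the risk of overfitting come from. The weights themselves carry no direct business meaning, hence the difficult interpretation.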
Rule Induction
•Description
   – Produces decision trees:
      • income < $40K
          – job > 5 yrs then good risk
          – job < 5 yrs then bad risk
      • income > $40K
          – high debt then bad risk
          – low debt then good risk
   – Or Rule Sets:
       • Rule #1 for good risk:
           – if income > $40K
           – if low debt
       • Rule #2 for good risk:
           – if income < $40K
           – if job > 5 years

[Figure: decision tree for Credit ranking (1=default)]
Credit ranking: Bad 52.01% (n=168), Good 47.99% (n=155), Total 100.00% (n=323)
  Split on Paid Weekly/Monthly (P-value=0.0000, Chi-square=179.6665, df=1)
  Weekly pay: Bad 86.67% (n=143), Good 13.33% (n=22), Total 51.08% (n=165)
    Split on Age Categorical (P-value=0.0000, Chi-square=30.1113, df=1)
    Young (< 25);Middle (25-35): Bad 90.51% (n=143), Good 9.49% (n=15), Total 48.92% (n=158)
    Old ( > 35): Bad 0.00% (n=0), Good 100.00% (n=7), Total 2.17% (n=7)
  Monthly salary: Bad 15.82% (n=25), Good 84.18% (n=133), Total 48.92% (n=158)
    Split on Age Categorical (P-value=0.0000, Chi-square=58.7255, df=1)
    Young (< 25): Bad 48.98% (n=24), Good 51.02% (n=25), Total 15.17% (n=49)
      Split on Social Class (P-value=0.0016, Chi-square=12.0388, df=1)
      Management;Clerical: Bad 0.00% (n=0), Good 100.00% (n=8), Total 2.48% (n=8)
      Professional: Bad 58.54% (n=24), Good 41.46% (n=17), Total 12.69% (n=41)
    Middle (25-35);Old ( > 35): Bad 0.92% (n=1), Good 99.08% (n=108), Total 33.75% (n=109)
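The two "good risk" rules on this slide translate almost directly into code. A minimal sketch, assuming income in dollars, debt encoded as the strings "low" and "high", and job tenure in years; the function name and the encodings are illustrative. The slide does not say what happens at exactly $40K, so that case matches neither rule here.

```python
def is_good_risk(income, debt_level, years_in_job):
    """Apply the two 'good risk' rules from the slide."""
    rule_1 = income > 40_000 and debt_level == "low"   # Rule #1
    rule_2 = income < 40_000 and years_in_job > 5      # Rule #2
    return rule_1 or rule_2

# Hypothetical applicants.
applicant_a = is_good_risk(income=55_000, debt_level="low", years_in_job=1)
applicant_b = is_good_risk(income=30_000, debt_level="high", years_in_job=2)
```

Because each rule reads as a plain condition, a business user can check it against their own experience, which is what makes this kind of output intuitive.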
Rule Induction
Description
• Intuitive output
• Handles all forms of numeric data, as well
  as non-numeric (symbolic) data

C5 algorithm: a special case of rule
  induction
• Target variable must be symbolic
Apriori
Description
• Seeks association rules in
  dataset
• ‘Market basket’ analysis
• Sequence discovery
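A minimal sketch of the level-wise search behind Apriori-style market basket analysis is shown below. It assumes transactions are sets of item names and uses a minimum support threshold; the baskets, the threshold, and the function name `frequent_itemsets` are illustrative and not taken from any particular product.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for itemsets meeting min_support."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {}
    size = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        size += 1
        # Candidates: unions of frequent sets that contain exactly `size` items.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Apriori pruning: every (size - 1)-subset must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(sub) in result for sub in combinations(c, size - 1))
            and support(c) >= min_support
        ]
    return result

# Hypothetical shopping baskets.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
freq = frequent_itemsets(baskets, min_support=0.5)

# Confidence of the association rule "bread -> milk".
confidence = freq[frozenset({"bread", "milk"})] / freq[frozenset({"bread"})]
```

Support measures how often items occur together; confidence measures how often the right-hand side appears given the left-hand side.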
Kohonen Network
Description
• unsupervised
• seeks to describe dataset in terms of natural clusters of cases
Phases in the DM
      Process (5)
• Model Evaluation
  – Evaluation of model: how well it
    performed on test data
  – Methods and criteria depend on
    model type:
     • e.g., coincidence matrix with
       classification models, mean
       error rate with regression
       models
  – Interpretation of model:
    may or may not be important; how easy
    or hard it is depends on the algorithm
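The two evaluation criteria mentioned above can be sketched as follows. The coincidence matrix (also called a confusion matrix) counts each pairing of actual and predicted class. For regression, mean absolute error is used here as one possible reading of "mean error rate". All labels, values, and function names are illustrative.

```python
from collections import Counter

def coincidence_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a classification model."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Share of test cases where the predicted class matches the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average size of the prediction error for a regression model."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical held-out test data for a classification model.
actual_class = ["good", "good", "bad", "bad", "good"]
predicted_class = ["good", "bad", "bad", "bad", "good"]
matrix = coincidence_matrix(actual_class, predicted_class)
acc = accuracy(actual_class, predicted_class)

# Hypothetical held-out test data for a regression model.
mae = mean_absolute_error([10, 20, 30], [12, 18, 33])
```

Both measures should be computed on test data that the model did not see during training, otherwise they overstate how well the model will perform.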
Phases in the DM
    Process (6)
•Deployment
   – Determine how the results need to be
     utilized
   – Who needs to use them?
   – How often do they need to be used?
•Deploy Data Mining results by:
   – Scoring a database
   – Utilizing results as business rules
   – Interactive scoring on-line
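"Scoring a database" with results expressed as business rules can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, assuming an in-memory table named `customers` with made-up columns and rows; the rules in the CASE expression are the decision-tree rules from the Rule Induction slide, and cases the slide leaves open (for example, exactly 5 years in the job) stay unscored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, income REAL, debt TEXT, years_in_job INTEGER, risk TEXT)"
)
# Hypothetical customer rows (illustrative data only).
conn.executemany(
    "INSERT INTO customers (id, income, debt, years_in_job) VALUES (?, ?, ?, ?)",
    [(1, 55000, "low", 2), (2, 62000, "high", 8), (3, 30000, "low", 7), (4, 28000, "high", 1)],
)

# Score every row in place using the rules from the Rule Induction slide.
conn.execute(
    """
    UPDATE customers SET risk = CASE
        WHEN income > 40000 AND debt = 'low'     THEN 'good'
        WHEN income > 40000 AND debt = 'high'    THEN 'bad'
        WHEN income < 40000 AND years_in_job > 5 THEN 'good'
        WHEN income < 40000 AND years_in_job < 5 THEN 'bad'
    END
    """
)
scores = dict(conn.execute("SELECT id, risk FROM customers ORDER BY id"))
conn.close()
```

Scoring inside the database keeps the results next to the operational data, so downstream reports and applications can use the new `risk` column without a separate export step.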
Specific Data Mining
Applications:
What data mining has
done for...
         The US Internal Revenue Service
         needed to improve customer
         service and...

    Scheduled its workforce
to provide faster, more accurate
     answers to questions.
What data mining has done
for...
          The US Drug Enforcement
          Administration needed to be more
          effective in its drug “busts”
          and...

   analyzed suspects’ cell phone
   usage to focus investigations.
What data mining has done
for...
    HSBC needed to cross-sell more
    effectively by identifying profiles
    that would be interested in higher
    yielding investments and...

 Reduced direct mail costs by 30%
    while garnering 95% of the
      campaign’s revenue.
Final Comments
 • Data Mining can be utilized in any
   organization that needs to find
   patterns or relationships in their
   data.
 • By using the CRISP-DM
   methodology, analysts can have a
   reasonable level of assurance that
   their Data Mining efforts will
   render useful, repeatable, and
   valid results.
Questions?

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Data analytics
Data analyticsData analytics
Data analytics
 
Big_data_ppt
Big_data_ppt Big_data_ppt
Big_data_ppt
 
Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
 
Big data
Big dataBig data
Big data
 
Data Mining & Applications
Data Mining & ApplicationsData Mining & Applications
Data Mining & Applications
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Big Data
Big DataBig Data
Big Data
 
Presentation big data and social media final_video
Presentation big data and social media final_videoPresentation big data and social media final_video
Presentation big data and social media final_video
 
Big data
Big dataBig data
Big data
 
Big implications of Big Data in healthcare
Big implications of Big Data in healthcareBig implications of Big Data in healthcare
Big implications of Big Data in healthcare
 
Big Data
Big DataBig Data
Big Data
 
Datamining - On What Kind of Data
Datamining - On What Kind of DataDatamining - On What Kind of Data
Datamining - On What Kind of Data
 
Big Data Ppt PowerPoint Presentation Slides
Big Data Ppt PowerPoint Presentation Slides Big Data Ppt PowerPoint Presentation Slides
Big Data Ppt PowerPoint Presentation Slides
 
Bio-Electronics, Bio-Sensors, Smart Phones, and Health Care
Bio-Electronics, Bio-Sensors, Smart Phones, and Health CareBio-Electronics, Bio-Sensors, Smart Phones, and Health Care
Bio-Electronics, Bio-Sensors, Smart Phones, and Health Care
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 
Big data analytics in healthcare industry
Big data analytics in healthcare industryBig data analytics in healthcare industry
Big data analytics in healthcare industry
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Data Analytics
Data AnalyticsData Analytics
Data Analytics
 
wireless usb ppt
wireless usb pptwireless usb ppt
wireless usb ppt
 

Destacado

Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data miningDatamining Tools
 
An Introduction to Data Mining
An Introduction to Data MiningAn Introduction to Data Mining
An Introduction to Data Miningbutest
 
Concept description characterization and comparison
Concept description characterization and comparisonConcept description characterization and comparison
Concept description characterization and comparisonric_biet
 
Business IT Alignment Heuristic
Business IT Alignment HeuristicBusiness IT Alignment Heuristic
Business IT Alignment HeuristicKodok Ngorex
 
Data Mining : Concepts
Data Mining : ConceptsData Mining : Concepts
Data Mining : ConceptsPragya Pandey
 
Seyedjamal Zolhavarieh - A model of knowledge quality assessment in clinical ...
Seyedjamal Zolhavarieh - A model of knowledge quality assessment in clinical ...Seyedjamal Zolhavarieh - A model of knowledge quality assessment in clinical ...
Seyedjamal Zolhavarieh - A model of knowledge quality assessment in clinical ...Health Informatics New Zealand
 
Basic research
Basic researchBasic research
Basic researchManu Alias
 
Stack using Linked List
Stack using Linked ListStack using Linked List
Stack using Linked ListSayantan Sur
 
Organising skills
Organising skillsOrganising skills
Organising skillsNijaz N
 
1.3 applications, issues
1.3 applications, issues1.3 applications, issues
1.3 applications, issuesKrish_ver2
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data MiningValerii Klymchuk
 

Destacado (20)

Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
 
Ymag56 hr
Ymag56 hrYmag56 hr
Ymag56 hr
 
An Introduction to Data Mining
An Introduction to Data MiningAn Introduction to Data Mining
An Introduction to Data Mining
 
Concept description characterization and comparison
Concept description characterization and comparisonConcept description characterization and comparison
Concept description characterization and comparison
 
Data mining and its applications!
Data mining and its applications!Data mining and its applications!
Data mining and its applications!
 
Ch01
Ch01Ch01
Ch01
 
Ch02
Ch02Ch02
Ch02
 
Tax DSS
Tax DSSTax DSS
Tax DSS
 
Business IT Alignment Heuristic
Business IT Alignment HeuristicBusiness IT Alignment Heuristic
Business IT Alignment Heuristic
 
Data Mining : Concepts
Data Mining : ConceptsData Mining : Concepts
Data Mining : Concepts
 
Clinical decision support systems
Clinical decision support systemsClinical decision support systems
Clinical decision support systems
 
Seyedjamal Zolhavarieh - A model of knowledge quality assessment in clinical ...
Seyedjamal Zolhavarieh - A model of knowledge quality assessment in clinical ...Seyedjamal Zolhavarieh - A model of knowledge quality assessment in clinical ...
Seyedjamal Zolhavarieh - A model of knowledge quality assessment in clinical ...
 
Basic research
Basic researchBasic research
Basic research
 
Stack using Linked List
Stack using Linked ListStack using Linked List
Stack using Linked List
 
Data mining notes
Data mining notesData mining notes
Data mining notes
 
Organising skills
Organising skillsOrganising skills
Organising skills
 
1.3 applications, issues
1.3 applications, issues1.3 applications, issues
1.3 applications, issues
 
Human Resource Management : The Importance of Effective Strategy and Planning
Human Resource Management : The Importance of Effective Strategy and PlanningHuman Resource Management : The Importance of Effective Strategy and Planning
Human Resource Management : The Importance of Effective Strategy and Planning
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data Mining
 
5. computer networks u5 ver 1.0
5. computer networks u5 ver 1.05. computer networks u5 ver 1.0
5. computer networks u5 ver 1.0
 

Similar a What is Data Mining? The Evolution and Process

351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptxXanGwaps
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...BigMine
 
The Bigger They Are The Harder They Fall
The Bigger They Are The Harder They FallThe Bigger They Are The Harder They Fall
The Bigger They Are The Harder They FallTrillium Software
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
 
Deliveinrg explainable AI
Deliveinrg explainable AIDeliveinrg explainable AI
Deliveinrg explainable AIGary Allemann
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedcedrinemadera
 
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Precisely
 
Data Science Training in Chandigarh h
Data Science Training in Chandigarh    hData Science Training in Chandigarh    h
Data Science Training in Chandigarh hasmeerana605
 
Think Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial IntelligenceThink Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial IntelligenceData Science Milan
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Crowdsourcing Approaches to Big Data Curation - Rio Big Data MeetupCrowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Crowdsourcing Approaches to Big Data Curation - Rio Big Data MeetupEdward Curry
 
Patterns for Successful Data Science Projects (Spark AI Summit)
Patterns for Successful Data Science Projects (Spark AI Summit)Patterns for Successful Data Science Projects (Spark AI Summit)
Patterns for Successful Data Science Projects (Spark AI Summit)Bill Chambers
 
Data Scientist By: Professor Lili Saghafi
Data Scientist By: Professor Lili SaghafiData Scientist By: Professor Lili Saghafi
Data Scientist By: Professor Lili SaghafiProfessor Lili Saghafi
 
DataScienceIntroduction.pptx
DataScienceIntroduction.pptxDataScienceIntroduction.pptx
DataScienceIntroduction.pptxKannanThangavelu2
 
Data Analytics and Big Data on IoT
Data Analytics and Big Data on IoTData Analytics and Big Data on IoT
Data Analytics and Big Data on IoTShivam Singh
 
Which institute is best for data science?
Which institute is best for data science?Which institute is best for data science?
Which institute is best for data science?DIGITALSAI1
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification courseKumarNaik21
 


What is Data Mining? The Evolution and Process

  • 1. What is Data Mining?
  • 2. Agenda • What Data Mining IS and IS NOT • Steps in the Data Mining Process – CRISP-DM – Explanation of Models – Examples of Data Mining Applications • Questions
  • 3. The Evolution of Data Analysis
      – Data Collection (1960s). Business question: "What was my total revenue in the last five years?" Enabling technologies: computers, tapes, disks. Product providers: IBM, CDC. Characteristics: retrospective, static data delivery.
      – Data Access (1980s). Business question: "What were unit sales in New England last March?" Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC. Product providers: Oracle, Sybase, Informix, IBM, Microsoft. Characteristics: retrospective, dynamic data delivery at record level.
      – Data Warehousing & Decision Support (1990s). Business question: "What were unit sales in New England last March? Drill down to Boston." Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses. Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR. Characteristics: retrospective, dynamic data delivery at multiple levels.
      – Data Mining (Emerging Today). Business question: "What’s likely to happen to Boston unit sales next month? Why?" Enabling technologies: advanced algorithms, multiprocessor computers, massive databases. Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups. Characteristics: prospective, proactive information delivery.
  • 4. Results of Data Mining Include: • Forecasting what may happen in the future • Classifying people or things into groups by recognizing patterns • Clustering people or things into groups based on their attributes • Associating what events are likely to occur together • Sequencing what events are likely to lead to later events
  • 5. Data mining is not •Brute-force crunching of bulk data •“Blind” application of algorithms •Going to find relationships where none exist •Presenting data in different ways •A database intensive task •A difficult to understand technology requiring an advanced degree in computer science
  • 6. Data Mining Is •A hot buzzword for a class of techniques that find patterns in data •A user-centric, interactive process which leverages analysis technologies and computing power •A group of techniques that find relationships that have not previously been discovered •Not reliant on an existing database •A relatively easy task that requires knowledge of the business problem/ subject matter expertise
  • 7. Data Mining versus OLAP •OLAP - On-line Analytical Processing – Provides you with a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening
  • 8. Data Mining Versus Statistical Analysis
      •Data Mining – Originally developed to act as expert systems to solve problems – Less interested in the mechanics of the technique – If it makes sense then let’s use it – Does not require assumptions to be made about data – Can find patterns in very large amounts of data – Requires understanding of data and business problem
      •Data Analysis – Tests for statistical correctness of models • Are statistical assumptions of models correct? – E.g. is the R-Square good? – Hypothesis testing • Is the relationship significant? – Use a t-test to validate significance – Tends to rely on sampling – Techniques are not optimised for large amounts of data – Requires strong statistical skills
  • 9. Examples of What People are Doing with Data Mining: •Fraud/Non-Compliance Anomaly detection – Isolate the factors that lead to fraud, waste and abuse – Target auditing and investigative efforts more effectively •Credit/Risk Scoring •Intrusion detection •Parts failure prediction •Recruiting/Attracting customers •Maximizing profitability (cross selling, identifying profitable customers) •Service Delivery and Customer Retention – Build profiles of customers likely to use which services •Web Mining
  • 10. How Can We Do Data Mining? By Utilizing the CRISP-DM Methodology – a standard process – existing data – software technologies – situational expertise
  • 11. Why Should There be a Standard Process? The data mining process must be reliable and repeatable by people with little data mining background. •Framework for recording experience – Allows projects to be replicated •Aid to project planning and management •“Comfort factor” for new adopters – Demonstrates maturity of Data Mining – Reduces dependency on “stars”
  • 12. Process Standardization CRISP-DM: • CRoss Industry Standard Process for Data Mining • Initiative launched Sept.1996 • SPSS/ISL, NCR, Daimler-Benz, OHRA • Funding from European commission • Over 200 members of the CRISP-DM SIG worldwide – DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, .. – System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte & Touche, … – End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...
  • 13. CRISP-DM •Non-proprietary •Application/Industry neutral •Tool neutral •Focus on business issues – As well as technical analysis •Framework for guidance •Experience base – Templates for Analysis
  • 15. Why CRISP-DM? •The data mining process must be reliable and repeatable by people with few data mining skills •CRISP-DM provides a uniform framework for –guidelines –experience documentation •CRISP-DM is flexible to account for differences –Different business/agency problems –Different data
  • 16. Phases and Tasks (each task is listed with its outputs)
      – Business Understanding
          • Determine Business Objectives: Background; Business Objectives; Business Success Criteria
          • Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
          • Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
          • Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
      – Data Understanding
          • Collect Initial Data: Initial Data Collection Report
          • Describe Data: Data Description Report
          • Explore Data: Data Exploration Report
          • Verify Data Quality: Data Quality Report
      – Data Preparation (phase outputs: Data Set; Data Set Description)
          • Select Data: Rationale for Inclusion / Exclusion
          • Clean Data: Data Cleaning Report
          • Construct Data: Derived Attributes; Generated Records
          • Integrate Data: Merged Data
          • Format Data: Reformatted Data
      – Modeling
          • Select Modeling Technique: Modeling Technique; Modeling Assumptions
          • Generate Test Design: Test Design
          • Build Model: Parameter Settings; Models; Model Description
          • Assess Model: Model Assessment; Revised Parameter Settings
      – Evaluation
          • Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
          • Review Process: Review of Process
          • Determine Next Steps: List of Possible Actions; Decision
      – Deployment
          • Plan Deployment: Deployment Plan
          • Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
          • Produce Final Report: Final Report; Final Presentation
          • Review Project: Experience Documentation
  • 17. Phases in the DM Process: CRISP-DM
  • 18. Phases in the DM Process (1 & 2) •Business Understanding: – Statement of Business Objective – Statement of Data Mining objective – Statement of Success Criteria •Data Understanding – Explore the data and verify the quality – Find outliers
  • 19. Phases in the DM Process (3) • Data preparation: – Usually takes over 90% of our time • Collection • Assessment • Consolidation and Cleaning – table links, aggregation level, missing values, etc. • Data selection – active role in ignoring non-contributory data? – outliers? – Use of samples – visualization tools • Transformations - create new variables
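
The cleaning and transformation steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `records` list, the field names `income` and `debt`, the choice of median imputation for missing values, and the derived `debt_ratio` variable are all hypothetical and do not come from the slides.

```python
from statistics import median

# Hypothetical customer records; None marks a missing value.
records = [
    {"income": 32000, "debt": 4000},
    {"income": None, "debt": 12000},
    {"income": 58000, "debt": 29000},
    {"income": 45000, "debt": None},
]

def impute_median(rows, field):
    """Cleaning: fill missing values of `field` with the median of observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def add_debt_ratio(rows):
    """Transformation: create a new variable from two existing ones."""
    return [{**r, "debt_ratio": r["debt"] / r["income"]} for r in rows]

# Clean both fields first, then derive the new variable.
cleaned = impute_median(impute_median(records, "income"), "debt")
prepared = add_debt_ratio(cleaned)
```

Median imputation is just one possible treatment of missing values; dropping incomplete rows or flagging them with an indicator variable are equally common choices.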
  • 20. Phases in the DM Process (4) • Model building – Selection of the modeling techniques is based upon the data mining objective – Modeling is an iterative process - different for supervised and unsupervised learning • May model for either description or prediction
  • 21. Types of Models •Prediction Models for Predicting and Classifying – Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM) – Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression) •Descriptive Models for Grouping and Finding Associations – Clustering/Grouping algorithms: K-means, Kohonen – Association algorithms: apriori, GRI
  • 22. Neural Network Input layer Hidden layer Output
  • 23. Neural Networks • Description – Difficult interpretation – Tends to ‘overfit’ the data – Extensive amount of training time – A lot of data preparation – Works with all data types
  • 24. Rule Induction •Description – Produces decision trees: • income < $40K – job > 5 yrs then good risk – job < 5 yrs then bad risk • income > $40K – high debt then bad risk – low debt then good risk – Or Rule Sets: • Rule #1 for good risk: – if income > $40K – if low debt • Rule #2 for good risk: – if income < $40K – if job > 5 years [Figure: decision tree for "Credit ranking (1=default)" over 323 cases (Bad 52.01%, Good 47.99%), split first on Paid Weekly/Monthly, then on Age Categorical and Social Class]
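
The example tree on this slide translates directly into nested conditions, which is what makes rule-induction output easy to read and deploy. The sketch below is an illustration only: the function name `credit_risk` and its parameters are hypothetical, and because the slide does not say what happens at exactly $40K income or exactly 5 years in the job, the boundary handling here is an assumption.

```python
def credit_risk(income, years_in_job, high_debt):
    """Classify an applicant as 'good' or 'bad' risk using the slide's example tree.

    Assumption: income of exactly 40,000 falls in the higher-income branch,
    and exactly 5 years in the job counts as 'bad' (the slide leaves both open).
    """
    if income < 40_000:
        # Lower-income branch: time in the job decides.
        return "good" if years_in_job > 5 else "bad"
    # Higher-income branch: debt level decides.
    return "bad" if high_debt else "good"


# Hypothetical applicants scored with the rules.
applicants = [(30_000, 6, False), (30_000, 2, False), (50_000, 1, True), (50_000, 1, False)]
labels = [credit_risk(*a) for a in applicants]
```

The two "good risk" rules listed under Rule Sets correspond to the two paths through this function that return "good".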
  • 25. Rule Induction Description • Intuitive output • Handles all forms of numeric data, as well as non-numeric (symbolic) data • C5 Algorithm: a special case of rule induction – Target variable must be symbolic
  • 26. Apriori Description • Seeks association rules in dataset • ‘Market basket’ analysis • Sequence discovery
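
Association rules of the 'market basket' kind are usually scored by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The following is a minimal sketch of those two measures plus the pruning idea behind Apriori, restricted to item pairs. The `baskets` data and all function names are hypothetical, and this is not the implementation used by any particular product.

```python
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of `consequent` given `antecedent`."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def frequent_pairs(transactions, min_support):
    """Apriori pruning for pairs: only items frequent on their own are combined."""
    items = sorted({item for t in transactions for item in t})
    frequent_items = [i for i in items if support({i}, transactions) >= min_support]
    return [
        set(pair)
        for pair in combinations(frequent_items, 2)
        if support(set(pair), transactions) >= min_support
    ]

pairs = frequent_pairs(baskets, min_support=0.5)
butter_implies_bread = confidence({"butter"}, {"bread"}, baskets)
```

A full Apriori run would repeat the pruning step for itemsets of size three and larger; the pair case is enough to show why infrequent items can be discarded early.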
  • 27. Kohonen Network Description • unsupervised • seeks to describe dataset in terms of natural clusters of cases
  • 28. Phases in the DM Process (5) • Model Evaluation – Evaluation of model: how well it performed on test data – Methods and criteria depend on model type: • e.g., coincidence matrix with classification models, mean error rate with regression models – Interpretation of model: important or not, easy or hard depends on algorithm
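
The two evaluation criteria named on this slide can be sketched as follows. A coincidence matrix (also called a confusion matrix) counts actual versus predicted labels for a classification model. For regression, mean absolute error is used here as one plausible reading of "mean error rate", which is an assumption. The test labels and values are hypothetical.

```python
from collections import Counter

def coincidence_matrix(actual, predicted):
    """Count each (actual, predicted) label pair on the test data."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Share of test cases where the predicted label matches the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted numeric values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical held-out test results for a classification model.
actual = ["good", "good", "bad", "bad", "good"]
predicted = ["good", "bad", "bad", "bad", "good"]
matrix = coincidence_matrix(actual, predicted)

# Hypothetical held-out test results for a regression model.
mae = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```

Reading the off-diagonal cells of the matrix, such as `matrix[("good", "bad")]`, shows which kind of mistake the model makes, which a single accuracy figure hides.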
  • 29. Phases in the DM Process (6) •Deployment – Determine how the results need to be utilized – Who needs to use them? – How often do they need to be used? •Deploy Data Mining results by: – Scoring a database – Utilizing results as business rules – Interactive scoring on-line
  • 31. What data mining has done for... The US Internal Revenue Service needed to improve customer service and... Scheduled its workforce to provide faster, more accurate answers to questions.
  • 32. What data mining has done for... The US Drug Enforcement Agency needed to be more effective in their drug “busts” and analyzed suspects’ cell phone usage to focus investigations.
  • 33. What data mining has done for... HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher yielding investments and... Reduced direct mail costs by 30% while garnering 95% of the campaign’s revenue.
  • 34. Final Comments • Data Mining can be utilized in any organization that needs to find patterns or relationships in their data. • By using the CRISP-DM methodology, analysts can have a reasonable level of assurance that their Data Mining efforts will render useful, repeatable, and valid results.

Editor's notes

  1. The US Internal Revenue Service is using data mining to improve customer service. [Click] By analyzing incoming requests for help and information, the IRS hopes to schedule its workforce to provide faster, more accurate answers to questions.
  2. The US DFAS needs to search through 2.5 million financial transactions that may indicate inaccurate charges. Instead of relying on tips to point out fraud, the DFAS is mining the data to identify suspicious transactions. [Click] Using Clementine, the agency examined credit card transactions and was able to identify purchases that did not match past patterns. Using this information, DFAS could focus investigations, finding fraud more cost-effectively.
  3. Retail banking is a highly competitive business. In addition to competition from other banks, banks also see intense competition from financial services companies of all kinds, from stockbrokers to mortgage companies. With so many organizations working the same customer base, the value of customer retention is greater than ever before. As a result, HSBC Bank USA looks to entice existing customers to "roll over" maturing products, or to cross-sell new ones. [Click] Using SPSS products, HSBC found that it could reduce direct mail costs by 30% while still bringing in 95% of the campaign’s revenue. Because HSBC is sending out fewer mail pieces, customers are likely to be more loyal because they don’t receive junk mail from the bank.