Data Mining
                    UMUC CSMN 667

                         Lecture #2




By Dr. Borne 2005      UMUC Data Mining Lecture 2   1
Term Paper - Data Mining Case Analysis
• Refer to Project Descriptions section of WebTycho course
  Syllabus for detailed information.
• 1-page Summary (Abstract+Outline) due: April 4, 2005
• Final Paper Due Date: 12 midnight, April 18, 2005
• Submit both in your WebTycho Assignments Folder
• Term Paper Page Restrictions: 5-8 pages
• I will submit your paper to TurnItIn.com for verification of
  originality – per UMUC Graduate School policies.
• Format/Style: Use the SPIE Conference Proceedings Style,
  which is available at:
          http://www.spie.org/app/Publications/index.cfm?fuseaction=authinfo&type=manspecs
            [ONLY USE THIS FOR STYLE FILES AND FORMATTING INSTRUCTIONS]


Case Analysis Instructions (1)
The goal of the paper assignment is to complete an in-depth study of
a data mining application. Examples of applications include
financial, scientific, medical, intrusion detection, and web mining.
Describe the data types, data volumes, technical challenges, end goals,
user community, and most relevant data mining algorithms. Explain why
data mining is needed, how it is used, and the current status of data
mining usage in this field. Possible case topics include:
   • A direct mailing application looking to maximize cross-selling opportunities (e.g., Doubleclick).
   • A bank determining the creditworthiness of a potential customer (e.g., American Express, Bank
     of America).
   • A medical insurer looking to detect medical fraud.
   • Gene detection in bioinformatics (e.g., Celera).
   • Glitch or anomaly detection in scientific time series data.
   • Abnormal network access behavior for detection of computer system intrusion and security
     violation.

Case Analysis Instructions (2)

• You may choose to go in depth in either one of
  these two areas:
  – A data mining application domain: Evaluate the application area
    in detail, as explained on the previous slide, including a review and analysis
    of the different data mining techniques employed there.
  Or

  – A data mining technique: Research in depth the different application
    domains where this technique has been used. Answer the questions on the
    previous slide when evaluating this technique's different application areas.



Case Analysis Paper - Instructions (3)
• Please e-mail me your suggested topic (application area to
  be researched) so that I may verify that it is okay.




Case Analysis Paper - Instructions (4)

• Submit your completed paper in WebTycho.
• You may submit your paper in any of these
  formats: PDF, Microsoft Word, or PostScript
  (PS).
• You must submit it no later than midnight on
  April 18. WebTycho will not allow submissions
  after that time.
• Submit the paper in your "Assignments
  Folder" (on the left menu bar within the
  WebTycho course website).
Lecture 2:
                    “Data Mining Roots”
                    (Chapter 2 of Dunham textbook)




Lecture 2 Outline
•   Summary of “What is Data Mining?” Tutorial
•   Foundations of Data Mining
•   Database Systems
•   Data Warehousing and OLAP
•   Statistics and Data Mining
•   Information Retrieval
•   Data Mining as “Rule Induction”
•   Fuzzy Sets and Logic
•   Machine Learning
•   Steps in the Data Mining Process
•   Major Issues in Data Mining
•   A Case Study: The NASA Mars Rover

“What is Data Mining?”


                     From online reading assignment --
                         Data Mining Tutorial at:
                    http://www.megaputer.com/dm/dm101.php3




Summary of “What is Data Mining?” Tutorial
•   What is data mining?
•   Why use data mining?
•   What can Data Mining do for you?
•   Reasons for the growing popularity of Data Mining
•   Tasks Solved by Data Mining
•   Different DM Technologies and Systems
                    Subject-oriented analytical systems
                    Statistical packages
                    Neural Networks
                    Evolutionary Programming
                    Memory Based Reasoning
                    Decision Trees
                    Genetic Algorithms
                    Nonlinear Regression Methods
What can Data Mining do for you?
                    (business-focused list)
   • Identify your best prospects and then
     retain them as customers.
   • Predict cross-sell opportunities and make
     recommendations.
   • Learn parameters influencing trends in
     sales and margins.
   • Segment markets and personalize
     communications.


Reasons for the Growing Popularity of Data Mining
• Growing Data Volumes
• Limitations of Human Analysis
• Low Cost of Machine Learning

                    Tasks Solved by Data Mining
• Prediction
• Classification
• Detection of Relations
• Deviation Detection
• Explicit Modeling
• Clustering
• Market Basket Analysis

Foundations of Data Mining




Foundations of Data Mining: Databases,
    Statistics, and Machine Learning
• David Hand (1998, “Data Mining: Statistics and
  More?”, The American Statistician, 52, pp. 112–118)
  used the following definition:
    – “Data mining is a new discipline lying at the interface of
      statistics, database technology, pattern recognition, machine
      learning, and other areas. It is concerned with the secondary
      analysis of large databases in order to find previously
      unsuspected relationships which are of interest or value to
      the database owners.”
    – Why “secondary”? … Because the data were typically
      collected for other purposes (such as billing, accounting,
      customer addresses, etc.). Primary analysis of large
      databases is generally the domain of STATISTICS.

Slide from Lecture 1
                         Evolution of Data Mining
                  <http://www.thearling.com/text/dmwhite/dmwhite.htm>

Evolutionary Step        Business Question             Enabling Technologies          Characteristics

Data Collection          "What was my total revenue    Computers, tapes, disks        Retrospective, static
(1960s)                  in the last five years?"                                     data delivery

Data Access              "What were unit sales in      Relational databases (RDBMS),  Retrospective, dynamic
(1980s)                  New England last March?"      Structured Query Language      data delivery at record
                                                       (SQL), ODBC                    level

Data Warehousing &       "What were unit sales in      On-line analytic processing    Retrospective, dynamic
Decision Support         New England last March?       (OLAP), multidimensional       data delivery at multiple
(1990s)                  Drill down to Boston."        databases, data warehouses     levels

Data Mining              "What’s likely to happen to   Advanced algorithms,           Prospective, proactive
(Emerging Today)         Boston unit sales next        multiprocessor computers,      information delivery
                         month? Why?"                  massive databases

Foundation for Data Mining Techniques
• 1960s:
     – Data collection, database creation, IMS, and hierarchical DBMS
• 1970s:
     – Relational data model, relational DBMS implementation
• 1980s:
     – RDBMS, advanced data models (extended-relational, OO,
       deductive, etc.) and application-oriented DBMS (spatial, scientific,
       engineering, financial, manufacturing, sales, etc.)
• 1990s—2000s:
     – Data mining and data warehousing, multimedia databases, and
       Web databases

History of Data Mining
  • Dates for specific events were imprecise in the
    preceding slides. This might be a little better:
    [Figure: data mining history timeline, not reproduced]




Data Mining: Confluence of
            Multiple Disciplines
[Diagram: Data Mining at the center, surrounded by Database Technology,
Statistics, Machine Learning, Visualization, Information Science, and
Other Disciplines]

Data Mining Stepping Stones
                     http://www.cs.sfu.ca/~han/DM_Book.html



[Diagram: pyramid, listed here from top to bottom. Potential to support
business decisions increases toward the top. The roles shown beside the
pyramid are given in parentheses.]

  • Making Decisions (End User)
  • Data Presentation: Visualization Techniques (Business Analyst)
  • Data Mining: Information Discovery (Data Analyst)
  • Data Exploration: Statistical Analysis, Querying and Reporting (Data Analyst)
  • Data Warehouses / Data Marts: OLAP, MDA (DBA)
  • Data Sources: Paper, Files, Information Providers, Database Systems, OLTP (DBA)

Database Systems




Database Systems
• DBMS joins “AI and statistics” to become Data Mining
• Data mining usually asks complex statistical questions
  that are difficult to answer via traditional SQL queries
• Data mining relies on special algorithms outside of the
  standard DBMS/SQL family of tools
• Data mining is used to extract knowledge from DBMS,
  not just the data bits (i.e., KDD)
• Data mining applies familiar statistical concepts to
  large DBMS (e.g., outlier detection; cluster analysis;
  data modeling; evolutionary analysis; prediction)
Data Mining is a core database function
          • Data Mining has many names / aliases :
                –   Knowledge Discovery in Databases (KDD)
                –   Machine Learning (ML)
                –   Exploratory Data Analysis (EDA)
                –   Intelligent Data Analysis (IDA)
                –   On-Line Analytical Processing (OLAP)
                –   Business Intelligence (BI)
                –   Customer Relationship Management (CRM)
                –   Business Analytics
                –   Target Marketing
                –   Cross-Selling
                –   Market Basket Analysis
                –   Credit Scoring
                –   Case-Based Reasoning (CBR)
                –   Connecting the Dots
                –   Intrusion Detection Systems (IDS)
                –   Recommendation / Personalization Systems!


Database Systems and Data Mining
• Data mining brings novel non-traditional concepts to
  large DBMS (e.g., association mining; neural nets;
  decision trees; link analysis; pattern recognition;
  classification; regression; SOMs). For example:
   – Clustering Analysis = group together similar items and
     separate dissimilar items
   – Classification Prediction = predict the class label
   – Regression = predict a numeric attribute value
   – Association Analysis = detect attribute-value conditions that
     occur frequently together (e.g., Beer & Diapers example)
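
Association analysis is usually quantified with support and confidence. The following is a minimal sketch of those two measures, assuming a hypothetical toy list of market baskets (the item names and counts are invented for illustration and are not from the Beer & Diapers study):

```python
# Hypothetical toy transactions; item names and counts are illustrative only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


beer_diapers_support = support({"beer", "diapers"}, transactions)
beer_to_diapers_conf = confidence({"beer"}, {"diapers"}, transactions)
print(beer_diapers_support, beer_to_diapers_conf)
```

A rule such as "beer implies diapers" is normally reported only when both values exceed thresholds chosen by the analyst.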


Types of Databases to be Mined
•   Relational databases
•   Data warehouses
•   Transactional databases
•   Advanced DB and information repositories:
     –   Object-oriented and object-relational databases
     –   Spatial databases
     –   Time-series data and temporal data
     –   Text databases and multimedia databases
     –   Heterogeneous and legacy databases
     –   WWW, and eventually the Semantic Web
Data Warehousing and OLAP




Data Warehousing
• Data warehouse = Materialized view
• Integrated view of data from distributed sources
• If transformation process can be represented via SQL,
  then data warehouse can be seen as a DB view:
   – CREATE VIEW warehouse_table AS
      SELECT …
      FROM source_table1, source_table2, …
      WHERE …
   – except that the view is materialized = result is stored
     and needs to be maintained when source data change
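
The materialized-view idea can be sketched with Python's built-in sqlite3 module. SQLite has no materialized views, so this sketch assumes a plain table built with CREATE TABLE ... AS SELECT stands in for one. The table names, columns, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two hypothetical source tables standing in for distributed data sources.
cur.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "East"), (2, "West")])
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0)],
)


def refresh_warehouse(cur):
    """Re-run the defining query and store its result (full refresh)."""
    cur.execute("DROP TABLE IF EXISTS warehouse_table")
    cur.execute(
        """
        CREATE TABLE warehouse_table AS
        SELECT c.region AS region, SUM(o.amount) AS total_sales
        FROM orders o, customers c
        WHERE o.customer_id = c.customer_id
        GROUP BY c.region
        """
    )


def sales_by_region(cur):
    """Read the stored result without touching the source tables."""
    return dict(cur.execute("SELECT region, total_sales FROM warehouse_table"))


refresh_warehouse(cur)
before = sales_by_region(cur)

# A source table changes; the stored result is stale until it is refreshed.
cur.execute("INSERT INTO orders VALUES (13, 2, 10.0)")
stale = sales_by_region(cur)
refresh_warehouse(cur)
fresh = sales_by_region(cur)
print(before, stale, fresh)
```

A full refresh is the simplest maintenance policy; production warehouses typically apply incremental changes instead.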

Order of Database Operations (1)

• When building a DW, pay attention to the
  order of operations in the SQL command
    – particularly if large data need to be selected,
      grouped, and ordered
    – perhaps build intermediate views to cull data
      down to manageable size
• Order of operations . . .


Order of Database Operations (2)
Operational
order    Clause          Purpose

(4)      select ...      specifies attributes and computations to appear in answer
(1)      from ...        indicates Cartesian product of source tables
(2)      where ...       provides boolean to filter Cartesian product
(3)      group by ...    specifies attributes necessary to cluster the results of the where-filter
(5)      order by ...    indicates attributes on which to order any visual display or sequential tuple returns
(6)      into ...        specifies a temporary table to hold the answer

The clauses are listed in the order they are written; the numbers give the operational order.
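
To make the operational order concrete, here is a small sketch that carries out the six steps by hand on two hypothetical in-memory tables. It mirrors the logical order only; a real query optimizer is free to avoid building the full Cartesian product. All table and column names are invented for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical source tables.
stores = [
    {"store_id": 1, "city": "Boston"},
    {"store_id": 2, "city": "Hartford"},
]
sales = [
    {"sale_store_id": 1, "units": 5},
    {"sale_store_id": 1, "units": 7},
    {"sale_store_id": 2, "units": 3},
    {"sale_store_id": 2, "units": 0},
]

# (1) FROM: Cartesian product of the source tables.
rows = [{**store, **sale} for store, sale in product(stores, sales)]
cartesian_size = len(rows)

# (2) WHERE: boolean filter applied to the product.
rows = [r for r in rows if r["store_id"] == r["sale_store_id"] and r["units"] > 0]

# (3) GROUP BY: cluster the surviving rows by city.
groups = defaultdict(list)
for r in rows:
    groups[r["city"]].append(r)

# (4) SELECT: compute the output attributes for each group.
answer = [(city, sum(r["units"] for r in members)) for city, members in groups.items()]

# (5) ORDER BY: sort the answer, largest total first.
answer.sort(key=lambda pair: pair[1], reverse=True)

# (6) INTO: hold the answer in a temporary table.
temp_table = list(answer)
print(cartesian_size, temp_table)
```

The product has 8 rows before filtering, which is why culling data early (or through intermediate views) matters for large tables.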

Maintaining the Data Warehouse
The key concept is ETL :
  – Extraction: extract relevant
    data and/or changes from the
    DB sources
  – Transformation: transform
    the data to match the
    warehouse schema
  – Loading: integrate data (and
    subsequent changes to data)
    into the warehouse
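
The three ETL steps can be sketched as three small functions. This is an illustrative outline only, assuming a hypothetical source record layout (id, cust_name, sales_cents) and a hypothetical warehouse schema (customer_id, name, sales_usd); it is not tied to any particular ETL tool.

```python
def extract(source_rows, last_loaded_id):
    """Extraction: pull only the rows added since the previous load."""
    return [row for row in source_rows if row["id"] > last_loaded_id]


def transform(row):
    """Transformation: rename and clean fields to match the warehouse schema."""
    return {
        "customer_id": row["id"],
        "name": row["cust_name"].strip().title(),
        "sales_usd": round(row["sales_cents"] / 100, 2),
    }


def load(warehouse, rows):
    """Loading: insert or overwrite warehouse records keyed by customer_id."""
    for row in rows:
        warehouse[row["customer_id"]] = row
    return warehouse


source = [
    {"id": 1, "cust_name": " alice SMITH ", "sales_cents": 1250},
    {"id": 2, "cust_name": "bob jones", "sales_cents": 990},
]
warehouse = {}

# Initial load.
load(warehouse, [transform(r) for r in extract(source, last_loaded_id=0)])

# Later, the source changes and only the new row is propagated.
source.append({"id": 3, "cust_name": "carol wu", "sales_cents": 4000})
load(warehouse, [transform(r) for r in extract(source, last_loaded_id=2)])
print(warehouse)
```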
Data Warehousing “features”
• Data are integrated into the DW in advance,
  prior to queries being formulated
   – Caution: Query results could therefore be stale
• Data are copied from distributed sources
   – Care must be exercised to maintain consistency
   – Query processing is local to the DW:
         • faster
         • can operate even when data sources are unavailable

Selecting views to materialize
• Factors that affect what to materialize:
   –   Storage cost
   –   Update cost
   –   Which queries will benefit from it
   –   How much will those queries benefit from it
• Examples:
   – GROUP BY A1 is small, but not useful for most
     queries
   – GROUP BY A1, B2, C3 is useful for most
     queries, but too large to be of much benefit
Data Warehousing and OLAP
     (On-Line Analytical Processing)
• OLAP as Data Mining:
  – Read data from integrated view of data sources
  – Complex queries of DW for Data Analysis
  – Data Analysis for Knowledge Discovery
    (KDD = Data Mining)
  – Knowledge Discovery for Decision Making
  – Goal: optimize reads and data warehouse
    queries for data exploration, mining, analysis
OLTP versus OLAP
(On-Line Transaction Processing vs. On-Line Analytical Processing)

 • OLTP
    – Mostly updates
    – Short, simple transactions
    – DBA, clerical users
    – Goal: transaction throughput
    – Local sources: heterogeneous DBs
 • OLAP
    – Mostly reads
    – Long, complex queries
    – Analysts, decision makers
    – Goal: fast queries
    – Distributed sources: single integrated view (data warehouse)
OLAP Operations in the Warehouse
• Slice (select one dimensional view)
• Dice (select multi-dimensional view;
  aids in the search for trends and
  patterns)
• Roll-up (consolidation; dimension
  reduction; aggregation; using simple
  or complex expressions)
• Drill-down (querying specific items)
• Visualize (“see” the results; allows
  for intuitive data understanding)
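
A rough sketch of slice, dice, roll-up, and drill-down on a tiny fact table follows. It assumes the common textbook reading of these operations (slice fixes one dimension value, dice restricts several dimensions, roll-up aggregates to a coarser level, drill-down returns to a finer level). The sales figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: (region, city, month, units).
facts = [
    ("New England", "Boston", "Feb", 90),
    ("New England", "Boston", "Mar", 120),
    ("New England", "Hartford", "Mar", 80),
    ("Mid-Atlantic", "Baltimore", "Mar", 100),
]
REGION, CITY, MONTH, UNITS = range(4)


def slice_month(rows, month):
    """Slice: fix a single value on one dimension."""
    return [r for r in rows if r[MONTH] == month]


def dice(rows, regions, months):
    """Dice: restrict several dimensions at once to a sub-cube."""
    return [r for r in rows if r[REGION] in regions and r[MONTH] in months]


def roll_up(rows, level):
    """Roll-up: aggregate units at the chosen level (REGION or CITY)."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[level]] += r[UNITS]
    return dict(totals)


march = slice_month(facts, "Mar")
by_region = roll_up(march, REGION)

# Drill-down: move from the New England total to its individual cities.
new_england_march = dice(facts, {"New England"}, {"Mar"})
by_city = roll_up(new_england_march, CITY)
print(by_region, by_city)
```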
From Lecture #1




 The Data Warehouse as the Source
      for the Mining Process




From “DataMines for DataWarehouses” article
       (available in Webliography)


                                             Data Mining external
                                             to the Data Warehouse




Data Mining within
the Data Warehouse


Statistics and Data Mining




Data Mining = Statistical Analysis?
•   "Data mining … is the exploration and analysis, by automatic and
    semi-automatic means, of large quantities of data in order to
    discover meaningful patterns and rules." (Berry, M. J. A. & Linoff, G.
    [1997]. Data Mining Techniques for Marketing, Sales and Customer
    Support, John Wiley & Sons, Inc., New York, p. 5,
    http://www.data-miners.com/books/order.html )
•   "Data mining is the process of selecting, exploring, and modeling
    large amounts of data to uncover previously unknown patterns of
    data for business advantage." (SAS Institute Inc.,
    http://www.sas.com/technologies/analytics/datamining/index.html )
•   "Data mining simply means finding patterns in your business data
    which you can use to do your business better" (SPSS Inc.,
    http://www.statistical.com.au/dm.htm )
•   "Data mining is the use of statistical analysis and machine learning
    techniques, in a semiautomatic fashion, on large collections of
    data." (Jorgensen, M. & Gentleman, R. [1998]. Data Mining. Chance
    11, 34–42.)

Statistics and Data Mining
• Data mining got a bad name at first because it was
  viewed as “statistical dredging” or a “fishing
  expedition”.
• Data mining became an acceptable practice because
  its users exercised statistical rigor in their analyses.
• Challenges and concerns:
    –   Data volumes are huge. Techniques often don't scale.
    –   Contaminated or corrupt data values (6-sigma effect)
    –   Selection bias; non-independent observations
    –   Fishing expedition = if you look hard enough, you will
        find something. But is it really useful or not?
        This is the “Interestingness” Problem:
          • Are the data mining results interesting to anyone?

Quality Management and Data Mining
• The focus of TQM (Total Quality Management) is total customer
  satisfaction.
• This can be realized through CRM (Customer Relationship
  Management) systems = a data mining technology :
   – Gather data
   – Analyze data
   – Make decisions based upon results
• Related to this are 6-Sigma quality control processes : customer
  satisfaction maximized through minimizing defects in products
  and services delivered.
• Some references:
   – http://www.sbaer.uca.edu/newsletter/2002/012202.pdf
   – http://www.qualitydigest.com/apr99/html/body_spcguide.html



Information Retrieval




Information Retrieval (IR)
• IR is a combination of data discovery and
  data mining in digital libraries or other
  information repositories.
• An IR system operates on a collection of
  documents (e.g., the WWW)
• IR is sometimes called Text Mining or Web
  Mining
• Effectiveness of an IR project is measured by
  precision and recall
Information Retrieval Metrics
Precision = (relevant & retrieved) / (retrieved)
  – “Am I interested in the documents retrieved?”
  – High Precision means most of the retrieved
    documents are relevant to my query

Recall = (relevant & retrieved) / (relevant)
  – “Have all relevant documents been retrieved?”
  – High Recall means that most of the relevant
    documents have been retrieved.
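
Both metrics follow directly from set intersections. A minimal sketch, assuming hypothetical document identifiers and a convention (chosen here, not stated on the slide) of returning 0.0 when a denominator is empty:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |relevant & retrieved| / |retrieved|
    recall    = |relevant & retrieved| / |relevant|
    An empty denominator yields 0.0 (a convention assumed for this sketch).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 documents truly relevant.
precision, recall = precision_recall(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d2", "d5"},
)
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).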
IR and Text/Web Mining
• Semantic markup of Web or other text documents using
  XML (eXtensible Markup Language)
• XML enables metadata / keyword harvesting from
  document collections (e.g., Web screen-scraping)
• Harvested metadata can be stored in a Data Warehouse for
  mining -- this is clearly an example of a materialized view
  of distributed data sources
• Other metrics: “similarity” to other documents
     (e.g., common keywords, common keyphrases)
• Application area: Automated Recommendation System
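
One simple way to score keyword-based similarity is the Jaccard coefficient (shared keywords divided by all distinct keywords). The slide does not name a specific metric, so Jaccard is an assumption made for this sketch, and the keyword sets are hypothetical.

```python
def jaccard(keywords_a, keywords_b):
    """Jaccard similarity: |A & B| / |A | B|, or 0.0 if both sets are empty."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(target, collection):
    """Return the name of the document whose keywords best match target."""
    return max(collection, key=lambda name: jaccard(target, collection[name]))


# Hypothetical harvested keywords for three documents.
docs = {
    "doc_stats": {"data", "mining", "statistics"},
    "doc_olap": {"warehouse", "olap", "cube"},
    "doc_fuzzy": {"fuzzy", "logic", "control"},
}
query_keywords = {"data", "mining", "warehouse", "olap"}

score = jaccard(query_keywords, docs["doc_stats"])
recommended = most_similar(query_keywords, docs)
print(score, recommended)
```

A recommendation system could rank candidate documents by such a score and suggest the top few.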
Information Retrieval Issues
• Semantic content of documents
• Unstructured versus structured content
• Multi-modal content (image, text, numeric)
• Reliability of sources
• Quality of sources
• Indexing for efficient & effective access
• Similarity metrics (e.g., how do you do a
  Groupby or a Roll-up ?)
• Privacy, Copyright, Intellectual Property
IR and Image Mining
• Image Mining is a form of IR and data mining
• Techniques:
   – Wavelet analysis and summarization
   – Pixel value (color) histograms and vectorization
   – Scene pattern recognition and indexing
   – Event/anomaly detection and cataloguing
      (e.g., forest fires seen in satellite photos)
   – Edge detection (unsharp masking) and graphs
• The data to be mined are the information databases
  extracted from the images (not the raw image data
  themselves)
Data Mining as “Rule Induction”




From Lecture #1



              Decision Tree Classification:
       based on rules at each node of the tree

Should I play
tennis today?




Intelligent actions (decision support) are
    often represented by a set of rules…

 IF age = “<=30” AND student = “no”                THEN buys_computer = “no”
 IF age = “<=30” AND student = “yes”               THEN buys_computer = “yes”
 IF age = “31…40”                                  THEN buys_computer = “yes”
 IF age = “>40” AND credit_rating = “excellent”    THEN buys_computer = “yes”
 IF age = “>40” AND credit_rating = “fair”         THEN buys_computer = “no”

                    (example of Decision Tree rules)
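
The five rules above translate directly into nested conditionals. This sketch assumes a numeric age and the string values "yes"/"no" and "excellent"/"fair"; as a simplification, any student value other than "yes" is treated as "no" and any credit rating other than "excellent" is treated as "fair".

```python
def buys_computer(age, student, credit_rating):
    """Apply the five decision-tree rules from the slide to one customer."""
    if age <= 30:
        # Rules 1 and 2: customers aged 30 or under are split on student status.
        return "yes" if student == "yes" else "no"
    if age <= 40:
        # Rule 3: ages 31 to 40 always map to "yes".
        return "yes"
    # Rules 4 and 5: customers over 40 are split on credit rating.
    return "yes" if credit_rating == "excellent" else "no"


# One hypothetical customer per rule.
examples = [
    (25, "no", "fair"),
    (25, "yes", "fair"),
    (35, "no", "fair"),
    (50, "no", "excellent"),
    (50, "yes", "fair"),
]
predictions = [buys_computer(*customer) for customer in examples]
print(predictions)
```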




Rule-Based Algorithms (RBA)
• RBA = Decision Support via “if-then rules”
• Can generate the rules from a Decision Tree (DT).
• But, rules do not need to be derived from a DT.
• Rules have no order, unlike Decision Trees.
• Trees are built by examining all cases; whereas
  rules are generated one case at a time.
• Rule Induction is the method for deriving rules.
• Case-Based Reasoning (CBR) is a related
  application of rule-based algorithms.
Sometimes the rules are fuzzy…




                    (example of Fuzzy Rule Induction)


Fuzzy Sets and Logic




Fuzzy Sets and Logic
• Data mining does not always yield absolute answers, but
  statistical answers that indicate the probability or frequency
  of occurrence of patterns or classes, or the likelihood that
  an object in the database belongs to a given class.
• In predictive data mining, the result is fuzzy (e.g.,
  predicting loan default through bank account analysis
  does not guarantee that the customer will indeed default
  on their loan).
• Fuzzy Logic is a method for handling uncertainty in
  data, in decision-making, and in control systems.
Sets and Logic - Classical (Boolean)




Sets and Logic - Fuzzy




Classical versus Fuzzy




Fuzzy Logic, Control Systems, and Data Mining

• Suppose you have a R/T (real-time) data monitoring
  (data mining) control system attached to machinery in a
  large manufacturing plant.
• Temperature sensor on a machine says that it is running
  very hot (... what is “hot”? -- that's fuzzy).
• Motion sensor within machine says that it is running at
  high RPM, very fast (… what is “fast”? -- that's fuzzy).
• The machine is not technically over-heating, which you
  know because of past experience and common sense.
• Control System responds to data and knowledge-base by
  invoking a rule to slow down the motor speed a little bit.
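
The control scenario above can be sketched with two fuzzy membership functions and one rule. All thresholds (60 to 100 degrees C for "hot", 2000 to 4000 RPM for "fast") and the 500 RPM maximum reduction are hypothetical values chosen for illustration. The fuzzy AND is taken as the minimum of the two memberships, which is one common choice.

```python
def ramp(x, low, high):
    """Membership that rises linearly from 0 at low to 1 at high."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def hot(temp_c):
    """Degree (0 to 1) to which a temperature counts as 'hot' (assumed range)."""
    return ramp(temp_c, 60.0, 100.0)


def fast(rpm):
    """Degree (0 to 1) to which a motor speed counts as 'fast' (assumed range)."""
    return ramp(rpm, 2000.0, 4000.0)


def speed_reduction(temp_c, rpm, max_reduction_rpm=500.0):
    """Rule: IF hot AND fast THEN slow the motor in proportion to rule strength."""
    strength = min(hot(temp_c), fast(rpm))  # fuzzy AND as minimum
    return strength * max_reduction_rpm


# A machine at 80 C and 3500 RPM is "somewhat hot" and "fairly fast".
reduction = speed_reduction(80.0, 3500.0)
print(hot(80.0), fast(3500.0), reduction)
```

Because the memberships are graded rather than 0 or 1, the controller slows the motor "a little bit" instead of switching abruptly at a hard threshold.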
Application of Fuzzy Logic to Data Mining - 1
<http://www.cs.uah.edu/~thinke/CS687/Fall97/Tech/rahul_dbase_paper.html>
Direct Mailing System
• The problem is to identify customers in a customer database who can be
  targeted for a sale, under the assumption that these customers responded
  positively to advertisements mailed to them. An additional constraint is that
  the mailing budget is limited, so the number of advertisements mailed
  must be controlled to increase profit.
• The first step is to analyze the database for attributes such as "frequency
  of visits to the store" and "sum of purchases". Analysis and plots of the
  data then determine the cluster of good customers.
• Next, one finds the attribute relationships that define a query condition,
  which is represented by a pair of attributes and a fuzzy linguistic value.
  One then verifies and refines the query condition using another customer
  database.
• The customer database is then ranked and sorted by degree values based on
  the given fuzzy query condition. The customers retrieved by the query form
  the list of potential good customers.



Application of Fuzzy Logic to Data Mining - 2
<http://www.cs.uah.edu/~thinke/CS687/Fall97/Tech/rahul_dbase_paper.html>

Vibration Sensor
• A product used to sense vibrations and predict their causes
  (e.g., earthquakes) was improved by utilizing fuzzy rules. The
  original sensor was based on a simple threshold rule, with an error rate
  of around 12%. The fuzzy rules were created by analyzing
  the actual data in specified cases (earthquakes, automobiles, etc.). Feature
  extraction was done on the data set to identify each kind of cause.
  Relationships between the feature parameters and the kind of vibration were
  discovered to develop the fuzzy rules. These rules were then tested and
  refined. The accuracy of the sensor's prediction improved dramatically, with
  the error rate falling to within 1%.




Non-Fuzzy Logic System




Adaptive Fuzzy Logic System
                                           This example is related
                                           to air conditioner settings
                                           in a warm room, but the
                                           adaptive fuzzy logic system
                                           may be applied to activate
                                           other “thinking machines”.




Machine Learning – a tool for
       Data Mining and Intelligent
           Decision Support




Machine Learning
• What is Machine Learning? -- “ML is the application of
  computer algorithms that improve automatically
  through experience.”
• Why is ML applicable to Data Mining? --
   – Refer to earlier slide “Reasons for the growing popularity of
     data mining” :
        • Growing Data Volume -- ML enables the intelligent analysis of
          overwhelmingly large data/knowledge repositories
        • Limitations of Human Analysis -- ML enables automated searches for
          complex multifactor dependencies in data
        • Low Cost of Machine Learning -- machines and software are cheaper
          than people; the ML process is repeatable, consistent, and robust in
          handling very large data analysis tasks; adaptive ML algorithms can
          scale with the problem.
Machine Learning and Data Mining
• ML Techniques for DM (to be covered later):
   –   Decision Trees
   –   Rule Mining and Rule Learning
   –   Case-Based Reasoning (CBR)
   –   Neural Nets (NN)
   –   Supervised and Unsupervised Learning
   –   Support Vector Machines (SVM)
   –   Bayesian Networks
   –   Genetic Algorithms (GA)

Neural Nets
• “Neural networks are the second best way of
  doing just about anything.” (John Denker)

       [Diagram: Data | Neural Network | Fuzzy Rules]


• The best way “is to apply all available domain
  knowledge and spend a considerable amount
  of time, money and effort in building a rule
  system that will give the right answer. The
  second best way of doing anything is to learn
  from experience.” (Burbidge & Buxton)
Supervised vs. Unsupervised Learning
• In Supervised Learning algorithms, a training
  set is provided (data with correct answers),
  which is used to mine for known patterns.
• In Unsupervised Learning algorithms, data are
  provided with no a priori knowledge of the
  hidden patterns (knowledge) that they contain.
  The goal is to discover (learn) these patterns.
• A class known as Semi-Supervised Learning
  also exists, where knowledge (e.g., class labels)
  known for one data collection is applied in order
  to mine, analyze, classify, and interpret a related
  data collection.
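
As a rough illustration of the distinction above, the sketch below uses made-up one-dimensional data. The supervised half learns one centroid per label from a training set with known answers; the unsupervised half runs a simple 2-means clustering on data that carry no labels. The data values, the function names, and the choice of nearest-centroid and 2-means are all assumptions made for illustration, not methods named in the lecture.

```python
# Toy 1-D measurements (hypothetical values chosen for illustration).
labeled = [(1.0, "low"), (1.2, "low"), (0.8, "low"),
           (5.0, "high"), (5.3, "high"), (4.7, "high")]
unlabeled = [0.9, 1.1, 5.1, 4.9, 1.3, 5.2]


def mean(values):
    return sum(values) / len(values)


def train_supervised(examples):
    """Supervised: learn one centroid per known label from the training set."""
    by_label = {}
    for x, label in examples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def classify(centroids, x):
    """Assign x to the label whose centroid is nearest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def cluster_unsupervised(points, iterations=10):
    """Unsupervised: 2-means clustering; no labels are given."""
    c0, c1 = min(points), max(points)  # simple deterministic initialization
    for _ in range(iterations):
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0, c1 = mean(g0), mean(g1)
    return sorted(g0), sorted(g1)


centroids = train_supervised(labeled)
predicted = [classify(centroids, x) for x in unlabeled]
groups = cluster_unsupervised(unlabeled)
print(predicted)
print(groups)
```

Applying the centroids learned from the labeled collection to the separate unlabeled collection is also, loosely, the situation the semi-supervised bullet describes. The clustering finds the same two groups but cannot name them, which is the practical difference between the two approaches.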
Machine Learning, Data Mining, and
     Support Vector Machines (SVM)
• SVM is the tool of choice for the application of
  ML to the data mining classification problem.
• So what is an SVM? … "a statistical learning
  system for predictive data mining -- for
  estimating regression functions."
• Loads of information available here:
              http://www.cs.rpi.edu/~bij2/svm.html
          http://www.kernel-machines.org/tutorial.html


SVM Process Overview
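  [Figure: SVM process flow. An initial classification and the data feed
   SVM training, which produces weights; the weights and the data feed
   SVM classification, which sorts elements into those in the
   classification and those out of the classification.]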
SVM Classification
• SVM attempts to find an optimal separating
  hyperplane between members of the two
  initial classifications.
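
One common reading of "optimal" here is "largest margin": among all hyperplanes that separate the two classes, prefer the one whose nearest training point is as far away as possible. The sketch below only scores a short, hand-picked list of candidate lines on made-up 2-D points; a real SVM solves an optimization problem over all hyperplanes. The point coordinates, the candidate list, and the function name are assumptions for illustration.

```python
import math

# Hypothetical 2-D training points: label -1 for class "A", +1 for class "B".
points = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((1.0, 2.0), -1),
          ((4.0, 4.0), +1), ((5.0, 4.0), +1), ((4.0, 5.0), +1)]


def geometric_margin(w, b, data):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A negative result means at least one point lies on the wrong side.
    """
    norm = math.hypot(w[0], w[1])
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, y in data)


# Hand-picked candidate hyperplanes (w, b); an SVM solver would search
# over all hyperplanes instead of a short list.
candidates = {
    "vertical line x1 = 3": ((1.0, 0.0), -3.0),
    "diagonal x1 + x2 = 5.5": ((1.0, 1.0), -5.5),
    "diagonal x1 + x2 = 3.5": ((1.0, 1.0), -3.5),
}

margins = {name: geometric_margin(w, b, points)
           for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(best, round(margins[best], 3))
```

All three candidates separate the toy classes, but they differ in how much room they leave on either side; the margin is the quantity that tells them apart.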

  [Figure: a separating hyperplane between Class "A" and Class "B"]
SVM Class Separation Problem
• An optimal hyperplane partitions the initial
  classification correctly and maximizes distance
  from the plane to elements on either 'side':
  positive and negative examples.
• When the training examples (initial classification)
  consist of very diverse expression patterns, then
  finding an optimal hyperplane can be impossible.
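
The two bullets above correspond to what textbooks usually call the hard-margin and soft-margin formulations. They are stated here in common textbook notation; the symbols (weight vector w, offset b, labels y_i = ±1, slack variables ξ_i, penalty C) do not appear in the slides and are introduced only for this sketch.

```latex
\begin{aligned}
\text{Hard margin:}\quad
  & \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w}\rVert^{2}
  \quad \text{s.t.}\quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,
  \quad i = 1,\dots,n \\[4pt]
\text{Soft margin:}\quad
  & \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \tfrac{1}{2}\lVert \mathbf{w}\rVert^{2}
    + C\sum_{i=1}^{n}\xi_i
  \quad \text{s.t.}\quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,
  \quad \xi_i \ge 0
\end{aligned}
```

Under the hard-margin constraints the gap between the two classes has width 2/‖w‖, so minimizing ‖w‖ maximizes the distance from the plane to the nearest examples. When the training examples cannot be separated, the hard-margin problem has no feasible solution; the slack variables in the soft-margin form allow some examples to violate the margin at a cost controlled by C.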



SVM Kernel Construction
  The expression data can be transformed to a higher
  dimensional space (feature space) by applying a
  kernel function. This transformation can have the
  effect of allowing a separating hyperplane to be
  found.
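
A minimal sketch of this idea, using made-up 1-D data rather than expression data: the positive examples sit between the negative ones, so no single threshold separates them, but after mapping each value x to the pair (x, x^2) a straight line in the new space does. The data, the feature map, and the hyperplane x^2 = 2.5 are all assumptions chosen to make the example small.

```python
# Hypothetical 1-D data: positives sit between the negatives, so no single
# threshold on x can separate the two classes.
data = [(-3.0, -1), (-2.0, -1), (-1.0, +1), (0.0, +1),
        (1.0, +1), (2.0, -1), (3.0, -1)]


def separable_by_threshold(samples):
    """True if some cut point puts all of one class on each side."""
    ordered = [label for _, label in sorted(samples)]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


def feature_map(x):
    """Explicit map into a 2-D feature space: phi(x) = (x, x^2)."""
    return (x, x * x)


def kernel(x, z):
    """Kernel equal to the dot product phi(x) . phi(z), without building phi."""
    return x * z + (x * x) * (z * z)


def classify_in_feature_space(x):
    """Hyperplane x^2 = 2.5 in feature space: w = (0, -1), b = 2.5."""
    f = feature_map(x)
    score = 0.0 * f[0] - 1.0 * f[1] + 2.5
    return +1 if score > 0 else -1


print(separable_by_threshold(data))
print(all(classify_in_feature_space(x) == y for x, y in data))
```

The `kernel` function shows why the transformation is cheap in practice: it returns the same number as the dot product of the two mapped points, so an algorithm that only needs dot products never has to construct the higher-dimensional vectors.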




Practical SVM Issues
• Results depend heavily on the input
  parameters.
• Using a high-degree kernel function risks
  artificial separation of the data (overfitting).
• An iterative approach to increasing the
  kernel power is advisable.
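
One way to picture the iterative approach is to start with the lowest kernel power and raise it one step at a time, stopping at the first power that separates the training data. The sketch below does this with a polynomial kernel (1 + x*z)^d. To keep it short and dependency-free it trains a kernel perceptron, not an SVM; the perceptron is a stand-in that shares the kernel idea but does not maximize the margin. The data, the function names, and the stopping rule are assumptions for illustration.

```python
# Hypothetical 1-D data that no single threshold can separate.
train = [(-3.0, -1), (-2.0, -1), (-1.0, +1), (0.0, +1),
         (1.0, +1), (2.0, -1), (3.0, -1)]


def poly_kernel(x, z, degree):
    """Polynomial kernel (1 + x*z)^degree for scalar inputs."""
    return (1.0 + x * z) ** degree


def fit_kernel_perceptron(samples, degree, max_epochs=1000):
    """Train a kernel perceptron; return (alphas, separated_flag)."""
    alphas = [0] * len(samples)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x_i, y_i) in enumerate(samples):
            score = sum(a * y_j * poly_kernel(x_j, x_i, degree)
                        for a, (x_j, y_j) in zip(alphas, samples))
            if y_i * score <= 0:      # misclassified (or on the boundary)
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:             # a full pass with no errors
            return alphas, True
    return alphas, False


def lowest_separating_degree(samples, max_degree=5):
    """Raise the kernel power step by step; stop at the first that works."""
    for degree in range(1, max_degree + 1):
        _, separated = fit_kernel_perceptron(samples, degree)
        if separated:
            return degree
    return None


chosen_degree = lowest_separating_degree(train)
print(chosen_degree)
```

Stopping at the lowest power that works is a simple guard against the artificial separation mentioned above. In practice the check would normally be made on held-out data rather than on the training set alone.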


SVM Results
• Two classes are produced:
   – Positive Class: contains elements with expression
     patterns similar to those in the positive examples in the
     training set.
   – Negative Class: contains all other members of the input
     set.
• Each of these classes has elements that fall in two groups:
   – Those initially in the class (true positives and true
     negatives)
   – Those recruited into the class (false positives and false
     negatives)
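
The four groups above can be counted directly once the initial labels and the SVM output are lined up side by side. The sketch below uses made-up labels (+1 for positive, -1 for negative); the function name and the example values are assumptions for illustration.

```python
def confusion_counts(initial, predicted):
    """Count TP, FP, TN, FN from initial labels and classifier output (+1/-1)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, guess in zip(initial, predicted):
        if guess == +1:
            # In the positive class: either there initially or recruited.
            counts["TP" if truth == +1 else "FP"] += 1
        else:
            # In the negative class: either there initially or recruited.
            counts["TN" if truth == -1 else "FN"] += 1
    return counts


# Hypothetical labels for eight elements.
initial = [+1, +1, +1, -1, -1, -1, -1, -1]
predicted = [+1, +1, -1, +1, -1, -1, -1, -1]

counts = confusion_counts(initial, predicted)
precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])
print(counts, precision, recall)
```

Precision and recall are not named on the slide; they are included only because they are the usual summaries computed from these four counts.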

Machine Learning Resources
• 1. Massive compilation of ML resources at:
     http://home.earthlink.net/~dwaha/research/machine-learning.html
• 2. Excellent Reference Book: Tom Mitchell's
  "Machine Learning" (1997; McGraw-Hill):
     http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
• 3. Machine Learning & Data Mining Resources
  (my favorite ML site; click on Software):
     http://www.mlnet.org/
   … a site dedicated to "machine learning,
   knowledge discovery, case-based reasoning,
   knowledge acquisition, and data mining."


Recap of ML and DM
• DM requires machine assistance in the search and analysis of very
  large (often distributed, heterogeneous) databases
• Intelligent analysis of complex multi-dimensional multiple-
  dependency data also demands machine assistance
• Algorithms for DM are most efficient when they are adaptable to
  the type and content of the data (i.e., the system "learns")
• Machines are less expensive than humans
• Machines are usually scalable as the problem size grows
• Actionable data (the end-goal of DM) depends in many cases on an
  embedded ML algorithm to take appropriate action (in control
  systems; decision-support systems; robotics; autonomous systems)
• ML and DM are historically, technically, and functionally
  intertwined (e.g., some data mining research groups call themselves
  Machine Learning Groups)
Steps in the Data Mining Process




Steps in the Data Mining Process
                      http://www.cs.sfu.ca/~han/DM_Book.html
• Learning the application domain:
   – relevant prior knowledge and goals of DM application
• Creating a target data set: Data selection
• Data cleaning and preprocessing: (may take 40-60% of effort!)
• Data reduction and transformation:
   – Find useful features, dimensionality/variable reduction, invariant
     representation.
• Choosing data mining functions
   – summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining & KDD: search for patterns of interest
• Pattern evaluation and knowledge presentation
   – visualization, transformation, removing redundant patterns, etc.
• Using the discovered knowledge = Actionable Data!
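
The steps listed above can be pictured as a chain of small functions, each consuming the output of the previous one. The sketch below is deliberately tiny: the records, the field names, the choice of summarization as the mining function, and the 0.5 threshold are all hypothetical and stand in for the much larger decisions a real project would make at each step.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "region": "east", "sales": 120.0},
    {"id": 2, "region": "west", "sales": None},
    {"id": 3, "region": "east", "sales": 80.0},
    {"id": 4, "region": "west", "sales": 200.0},
    {"id": 5, "region": "east", "sales": 100.0},
]


def select(records):
    """Create the target data set: keep only the fields of interest."""
    return [{"region": r["region"], "sales": r["sales"]} for r in records]


def clean(records):
    """Data cleaning: here, simply drop records with missing sales."""
    return [r for r in records if r["sales"] is not None]


def transform(records):
    """Data transformation: rescale sales to the range [0, 1]."""
    lo = min(r["sales"] for r in records)
    hi = max(r["sales"] for r in records)
    return [{**r, "sales": (r["sales"] - lo) / (hi - lo)} for r in records]


def mine(records):
    """Mining function (summarization): mean scaled sales per region."""
    totals = {}
    for r in records:
        totals.setdefault(r["region"], []).append(r["sales"])
    return {region: sum(v) / len(v) for region, v in totals.items()}


def evaluate(patterns, threshold=0.5):
    """Pattern evaluation: keep only regions above a chosen threshold."""
    return {k: v for k, v in patterns.items() if v > threshold}


patterns = mine(transform(clean(select(raw))))
interesting = evaluate(patterns)
print(patterns)
print(interesting)
```

Note that the cleaning step silently discards one of the two "west" records, so the surviving pattern rests on a single observation. That is a small-scale reminder of why cleaning and preprocessing deserve so much of the effort.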
Steps in the Data Mining Process - Pictorial View




Cleaning the "Dirty Data"
• Excellent reference: Dorian Pyle's book "Data Preparation
  for Data Mining" (1999, Morgan Kaufmann; 540pp)
• Frequent problem: missing (NULL) values
• Empty value ≠ Missing value (must treat each case
  differently)
• Various options for NULLs (may introduce bias):
     –   use "fill value" (e.g., -999)
     –   use estimated value (prediction from data model)
     –   use interpolated value (from surrounding entries)
     –   ignore any records with nulls
• November 2003 Workshop on Data Cleaning:
              http://dimacs.rutgers.edu/Workshops/DataCleaning/
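
The four options for NULLs listed above can be compared on a small made-up series. In the sketch below, `None` plays the role of NULL, the column mean stands in for the "estimated value" (the simplest possible data model), and interpolation is linear between the nearest known neighbors. The series and the function names are assumptions for illustration.

```python
# Hypothetical series of readings; None marks a NULL.
readings = [10.0, 12.0, None, 16.0, None, 20.0]


def fill_with_sentinel(values, fill=-999.0):
    """Option 1: replace each NULL with a fixed fill value."""
    return [fill if v is None else v for v in values]


def fill_with_mean(values):
    """Option 2: replace each NULL with an estimate (here, the mean)."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def fill_by_interpolation(values):
    """Option 3: linear interpolation between the nearest known neighbors.

    Assumes the first and last entries are known.
    """
    result = list(values)
    for i, v in enumerate(values):
        if v is None:
            left = max(j for j in range(i) if values[j] is not None)
            right = min(j for j in range(i + 1, len(values))
                        if values[j] is not None)
            frac = (i - left) / (right - left)
            result[i] = values[left] + frac * (values[right] - values[left])
    return result


def drop_nulls(values):
    """Option 4: ignore any records with NULLs."""
    return [v for v in values if v is not None]


for option in (fill_with_sentinel, fill_with_mean,
               fill_by_interpolation, drop_nulls):
    filled = option(readings)
    print(option.__name__, filled, sum(filled) / len(filled))
```

Printing the average next to each result makes the bias warning concrete: the fill value drags the average far below every real reading if it is later treated as data, while the other three options keep it close to the known values.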


Data Preprocessing (Laundering the Data)
   (may take 40-80% of the total data mining project effort!)
       (Reference: "Data Scrubbing" article in Computerworld 2003)




"Data Scrubbing by the Numbers"
(http://www.computerworld.com/printthis/2003/0,4814,78260,00.html)

Here are some of the findings:
   • Data cleansing accounts for up to 70% of the cost and effort of
     implementing most data warehouse projects, according to analysts.
   • In 2001, The Data Warehousing Institute estimated that dirty data
     costs U.S. businesses $600 billion per year.
   • Data cleanliness and quality was the No. 2 problem -- right behind
     budget cuts -- cited in a 2003 IDC survey of 1,648 companies
     implementing business analytics software enterprise-wide.
   • Only 23% of 130 companies surveyed by Cutter Consortium on their
     data warehousing and business-intelligence practices use specialized
     data cleansing tools.
   • Of those companies in the Cutter Consortium study using specialized
     data scrubbing software, 31% are using tools that were built in-house.

Major Issues in Data Mining




Major Issues in Data Mining (1)
• Mining methodology and user interaction
   – Mining different kinds of knowledge in databases
   – Interactive mining of knowledge at multiple levels of abstraction
   – Incorporation of background knowledge
   – Data mining query languages and ad-hoc data mining
   – Expression and visualization of data mining results
   – Handling of noise and incomplete data
   – Pattern evaluation: the interestingness problem
• Performance and scalability
   – Handling very large data volumes (the "data flood")
   – Efficiency and scalability of data mining algorithms
   – Parallel, distributed, and incremental mining methods

Major Issues in Data Mining (2)
• Issues relating to the diversity of data types
   – Handling relational and complex types of data
   – Mining information from heterogeneous databases and global
     information systems (WWW)
• Issues related to applications and social impacts
   – Application of discovered knowledge
        • Domain-specific data mining tools
        • Intelligent query answering
        • Process control and decision making
   – Integration of the discovered knowledge with existing knowledge:
       A knowledge fusion problem
   – Protection of data security, integrity, and privacy
• Dirty data (60% of the effort, or more)
   – Preparing the data for mining (transformation, cleaning, processing)
Case Study - The Mars Rover




http://mars.jpl.nasa.gov/mer/mission/spacecraft_surface_rover.html




Data Mining in Action

• Data Mining facilitates
  Intelligent Data
  Understanding

• Data Mining enables
  Decision Support and
  Active Control Systems

What is Intelligent Data Understanding?
• IDU refers to the application of techniques for
  transforming data into understanding.
      … (sound familiar?)
Data → Information → Knowledge → Understanding / Wisdom!

• Web reference: http://is.arc.nasa.gov/IDU/index.html
• IDU specifically refers to automating the following
  techniques for machine-assisted data analysis:
   – Data Mining (e.g., http://is.arc.nasa.gov/IDU/tasks/NVODDM.html)
   – Knowledge Discovery
   – Machine Learning
Intelligent Data System Applications (1)

• Rove around the surface of Mars and take samples of
  rocks (mass spectroscopy = a data histogram)
• Supervised Learning (search for rocks with known
  compositions)
• Unsupervised Learning (discover what types of rocks
  are present, without preconceived biases)
• Association Mining (find unusual associations)
• Clustering (find the set of unique classes of rocks)
• Classification (assign rocks to known classes)
• Deviation/Outlier Detection (one-of-a-kind; interesting?)
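
As one concrete instance from the list above, deviation/outlier detection can be sketched with a simple z-score test: flag any sample whose measurement lies more than a chosen number of standard deviations from the mean. The measurements below are invented, and reducing a rock's spectrum to a single number is a simplification made only to keep the example short; the threshold of 2 is likewise an assumption.

```python
import statistics

# Hypothetical iron fractions measured in seven rock samples.
iron_fraction = [0.30, 0.32, 0.31, 0.29, 0.33, 0.30, 0.90]


def find_outliers(values, z_threshold=2.0):
    """Return (index, value) pairs lying more than z_threshold
    population standard deviations away from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mean) / spread > z_threshold]


outliers = find_outliers(iron_fraction)
print(outliers)
```

With so few samples the extreme value itself inflates the standard deviation, which is why the threshold here is lower than the value of 3 often quoted for large data sets. A flagged rock is only a candidate; whether it is interesting or a bad reading still has to be decided.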
Intelligent Data System Applications (2)
• On-board Intelligent Data Understanding & Decision
  Support Systems (Fuzzy Logic & Decision Trees &
  Case-Based Reasoning) – Science Goal Monitoring:
   – “stay here and do more”; or else “move on to another rock”
   – “send results to Earth immediately”; or “send results later”
• Learn as it goes (Machine Learning & Neural Nets)
• Relate the results to other factors, such as dust storms
  (XML & Information Retrieval & Information Fusion
  with other data from orbiting satellite "mother ship")
• Predict where to go in order to find interesting rocks
  (Logistic Regression & Case-Based Reasoning)
Mars Rover as an
         Adaptive Fuzzy Logic System



• Decisions are based on data mined, prior
  experience, new knowledge, and fuzzy logic
• Rover acts autonomously, without human
  intervention, in Deep Space environment
• Actions are driven by mining actionable
  data from all sensors
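
A minimal sketch of how a fuzzy rule could drive one such decision, assuming two invented sensor-derived inputs scaled to the range 0 to 1: a novelty score for the current rock and the battery level. Membership in "interesting" and "charged" rises linearly between two breakpoints, fuzzy AND is taken as the minimum, and fuzzy OR as the maximum. The inputs, breakpoints, and rule are all hypothetical and are not a description of any actual rover software.

```python
def ramp_up(x, low, high):
    """Membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def decide(novelty, battery):
    """Return (action, degree to which the 'stay' rule fires)."""
    interesting = ramp_up(novelty, 0.2, 0.8)
    charged = ramp_up(battery, 0.3, 0.7)
    # Rule 1: IF rock is interesting AND battery is charged THEN stay.
    stay = min(interesting, charged)
    # Rule 2: IF rock is NOT interesting OR battery is NOT charged THEN move on.
    move_on = max(1.0 - interesting, 1.0 - charged)
    action = ("stay here and do more" if stay > move_on
              else "move on to another rock")
    return action, stay


print(decide(0.9, 0.9))
print(decide(0.9, 0.4))
```

Unlike a hard cutoff, the rule strengths change gradually with the inputs, so a slightly less novel rock or a slightly lower battery shifts the balance between the two actions instead of flipping it abruptly.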
Summary




Summary of Topics Covered
•   Summary of "What is Data Mining?" Tutorial
•   Foundations of Data Mining
•   Database Systems
•   Data Warehousing and OLAP
•   Statistics and Data Mining
•   Information Retrieval
•   Data Mining as "Rule Induction"
•   Fuzzy Sets and Logic
•   Machine Learning
•   Steps in the Data Mining Process
•   Major Issues in Data Mining
•   A Case Study: The NASA Mars Rover

Lecture 2

  • 1. Data Mining UMUC CSMN 667 Lecture #2 By Dr. Borne 2005 UMUC Data Mining Lecture 2 1
  • 2. Term Paper - Data Mining Case Analysis • Refer to Project Descriptions section of WebTycho course Syllabus for detailed information. • 1-page Summary (Abstract+Outline) due: April 4, 2005 • Final Paper Due Date: 12midnight, April 18, 2005 • Submit both in your WebTycho Assignments Folder • Term Paper Page Restrictions: 5-8 pages • I will submit your paper to TurnItIn.com for verification of originality – per UMUC Graduate School policies. • Format/Style: Use the SPIE Conference Proceedings Style, which is available at: http://www.spie.org/app/Publications/index.cfm?fuseaction=authinfo&type=manspecs [ONLY USE THIS FOR STYLE FILES AND FORMATTING INSTRUCTIONS] By Dr. Borne 2005 UMUC Data Mining Lecture 2 2
  • 3. Case Analysis Instructions (1) The goal of the paper assignment is to complete an in-depth study of a data mining application. Examples of applications include financial, scientific, medical, intrusion detection, and web mining. Describe data types, data volumes, technical challenges, end-goals, who is the user community, which data mining algorithms are most relevant, why data mining, how is it used, what is the current status of data mining usage in this field? --- Possible case topics include: A direct mailing application looking to maximize cross-selling opportunities (e.g., Doubleclick). A bank determining the credit worthiness of a potential customer (e.g., American Express, Bank of America). A medical insurer looking to detect medical fraud. Gene detection in BioInformatics (e.g., Celera). Glitch or anomaly detection in scientific time series data. Abnormal network access behavior for detection of computer system intrusion and security violation. By Dr. Borne 2005 UMUC Data Mining Lecture 2 3
  • 4. Case Analysis Instructions (2) • You may choose to go in depth in either one of these two areas: – A data mining application domain: Evaluate the application area in detail, as explained on the previous slide, including a review and analysis of the different data mining techniques employed there. Or – A data mining technique: Research in depth the different application domains where this technique has been used. Answer the questions on the previous slide when evaluating this technique‘s different application areas. By Dr. Borne 2005 UMUC Data Mining Lecture 2 4
  • 5. Case Analysis Paper - Instructions (3) • Please e-mail me your suggested topic (application area to be researched) so that I may verify that it is okay. By Dr. Borne 2005 UMUC Data Mining Lecture 2 5
  • 6. Case Analysis Paper - Instructions (4) • Submit your completed paper in WebTycho. • You may submit your paper in any of these formats: PDF, or Microsoft WORD, or postscript (PS). • You must submit it no later than midnight on April 18. WebTycho will not allow submissions after that time. • Submit the paper in your "Assignments Folder" (on the left menu bar within the WebTycho course website). By Dr. Borne 2005 UMUC Data Mining Lecture 2 6
  • 7. Lecture 2: “Data Mining Roots” (Chapter 2 of Dunham textbook)
  • 8. Lecture 2 Outline • Summary of “What is Data Mining?” Tutorial • Foundations of Data Mining • Database Systems • Data Warehousing and OLAP • Statistics and Data Mining • Information Retrieval • Data Mining as “Rule Induction” • Fuzzy Sets and Logic • Machine Learning • Steps in the Data Mining Process • Major Issues in Data Mining • A Case Study: The NASA Mars Rover
  • 9. “What is Data Mining?” From online reading assignment -- Data Mining Tutorial at: http://www.megaputer.com/dm/dm101.php3
  • 10. Summary of “What is Data Mining?” Tutorial • What is data mining? • Why use data mining? • What can Data Mining do for you? • Reasons for the growing popularity of Data Mining • Tasks Solved by Data Mining • Different DM Technologies and Systems – Subject-oriented analytical systems – Statistical packages – Neural Networks – Evolutionary Programming – Memory Based Reasoning – Decision Trees – Genetic Algorithms – Nonlinear Regression Methods
  • 11. What can Data Mining do for you? (business-focused list) • Identify your best prospects and then retain them as customers. • Predict cross-sell opportunities and make recommendations. • Learn parameters influencing trends in sales and margins. • Segment markets and personalize communications. By Dr. Borne 2005 UMUC Data Mining Lecture 2 11
  • 12. Reasons for the Growing Popularity of Data Mining • Growing Data Volumes • Limitations of Human Analysis • Low Cost of Machine Learning Tasks Solved by Data Mining • Prediction • Explicit Modeling • Classification • Clustering • Detection of Relations • Market Basket Analysis • Deviation Detection By Dr. Borne 2005 UMUC Data Mining Lecture 2 12
  • 13. Foundations of Data Mining By Dr. Borne 2005 UMUC Data Mining Lecture 2 13
  • 14. Foundations of Data Mining: Databases, Statistics, and Machine Learning • David Hand (1998, “Data Mining: Statistics and More?”, The American Statistician, 52, pp. 112–118) used the following definition. – “Data mining is a new discipline lying at the interface of statistics, database technology, pattern recognition, machine learning, and other areas. It is concerned with the secondary analysis of large databases in order to find previously unsuspected relationships which are of interest or value to the database owners.” – Why “secondary”? … Because the data were typically collected for other purposes (such as billing, accounting, customer addresses, etc.). Primary analysis of large databases is generally the domain of STATISTICS.
  • 15. Slide from Lecture 1: Evolution of Data Mining <http://www.thearling.com/text/dmwhite/dmwhite.htm> Table columns: Evolutionary Step; Business Question; Enabling Technologies; Characteristics.
    – Data Collection (1960s); "What was my total revenue in the last five years?"; Computers, tapes, disks; Retrospective, static data delivery
    – Data Access (1980s); "What were unit sales in New England last March?"; Relational databases (RDBMS), Structured Query Language (SQL), ODBC; Retrospective, dynamic data delivery at record level
    – Data Warehousing & Decision Support (1990s); "What were unit sales in New England last March? Drill down to Boston."; On-line analytic processing (OLAP), multidimensional databases, data warehouses; Retrospective, dynamic data delivery at multiple levels
    – Data Mining (Emerging Today); "What's likely to happen to Boston unit sales next month? Why?"; Advanced algorithms, multiprocessor computers, massive databases; Prospective, proactive information delivery
  • 16. Foundation for Data Mining Techniques • 1960s: – Data collection, database creation, IMS, and hierarchical DBMS • 1970s: – Relational data model, relational DBMS implementation • 1980s: – RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, financial, manufacturing, sales, etc.) • 1990s—2000s: – Data mining and data warehousing, multimedia databases, and Web databases By Dr. Borne 2005 UMUC Data Mining Lecture 2 16
  • 17. History of Data Mining • Dates for specific events were imprecise in the preceding slides. This might be a little better : By Dr. Borne 2005 UMUC Data Mining Lecture 2 17
  • 18. Data Mining: Confluence of Multiple Disciplines [Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
  • 19. Data Mining Stepping Stones http://www.cs.sfu.ca/~han/DM_Book.html [Figure: pyramid; the potential to support business decisions increases toward the top. Levels from top to bottom:] – Making Decisions – Data Presentation: Visualization Techniques – Data Mining: Information Discovery – Data Exploration: Statistical Analysis, Querying and Reporting – Data Warehouses / Data Marts: OLAP, MDA – Data Sources: Paper, Files, Information Providers, Database Systems, OLTP [Roles shown alongside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA]
  • 20. Database Systems By Dr. Borne 2005 UMUC Data Mining Lecture 2 20
  • 21. Database Systems • DBMS joins “AI and statistics” to become Data Mining • Data mining usually asks complex statistical questions that are difficult to answer via traditional SQL queries • Data mining relies on special algorithms outside of the standard DBMS/SQL family of tools • Data mining is used to extract knowledge from DBMS, not just the data bits (i.e., KDD) • Data mining applies familiar statistical concepts to large DBMS (e.g., outlier detection; cluster analysis; data modeling; evolutionary analysis; prediction)
  • 22. Data Mining is a core database function • Data Mining has many names / aliases : – Knowledge Discovery in Databases (KDD) – Machine Learning (ML) – Exploratory Data Analysis (EDA) – Intelligent Data Analysis (IDA) – On-Line Analytical Processing (OLAP) – Business Intelligence (BI) – Customer Relationship Management (CRM) – Business Analytics – Target Marketing – Cross-Selling – Market Basket Analysis – Credit Scoring – Case-Based Reasoning (CBR) – Connecting the Dots – Intrusion Detection Systems (IDS) – Recommendation / Personalization Systems! By Dr. Borne 2005 UMUC Data Mining Lecture 2 22
  • 23. Database Systems and Data Mining • Data mining brings novel non-traditional concepts to large DBMS (e.g., association mining; neural nets; decision trees; link analysis; pattern recognition; classification; regression; SOMs). For example: – Clustering Analysis = group together similar items and separate dissimilar items – Classification Prediction = predict the class label – Regression = predict a numeric attribute value – Association Analysis = detect attribute-value conditions that occur frequently together (e.g., Beer & Diapers example) By Dr. Borne 2005 UMUC Data Mining Lecture 2 23
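The association analysis idea on this slide can be made concrete with a small sketch. The baskets below are invented for illustration (they are not data from the lecture), and `pair_support` and `confidence` are hypothetical helper names. Support is the fraction of transactions containing an item pair; confidence estimates how often the second item appears when the first one does.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets (illustrative data, not from the lecture).
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]


def pair_support(transactions):
    """Return the fraction of transactions that contain each item pair."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    return sum(consequent in t for t in with_antecedent) / len(with_antecedent)


support = pair_support(baskets)
beer_diapers_support = support[("beer", "diapers")]
beer_to_diapers = confidence(baskets, "beer", "diapers")
print(beer_diapers_support, beer_to_diapers)
```

Production association miners (for example Apriori-style algorithms) prune the search over larger itemsets; this sketch only counts pairs.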
  • 24. Types of Databases to be Mined • Relational databases • Data warehouses • Transactional databases • Advanced DB and information repositories: – Object-oriented and object-relational databases – Spatial databases – Time-series data and temporal data – Text databases and multimedia databases – Heterogeneous and legacy databases – WWW, and eventually the Semantic Web By Dr. Borne 2005 UMUC Data Mining Lecture 2 24
  • 25. Data Warehousing and OLAP By Dr. Borne 2005 UMUC Data Mining Lecture 2 25
  • 26. Data Warehousing • Data warehouse = Materialized view • Integrated view of data from distributed sources • If transformation process can be represented via SQL, then data warehouse can be seen as a DB view: – CREATE VIEW warehouse_table AS SELECT … FROM source_table1, source_table2, … WHERE … – except that the view is materialized = result is stored and needs to be maintained when source data change By Dr. Borne 2005 UMUC Data Mining Lecture 2 26
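The difference between an ordinary view and a materialized one can be sketched with Python's built-in `sqlite3` module. SQLite has no native materialized views, so this sketch emulates one with `CREATE TABLE ... AS SELECT`; the table and column names (`sales`, `v_totals`, `m_totals`) are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100.0), ("East", 50.0), ("West", 70.0)],
)

# Ordinary view: recomputed from the source table on every query.
cur.execute(
    "CREATE VIEW v_totals AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)
# Emulated materialized view: the result is stored as its own table.
cur.execute("CREATE TABLE m_totals AS SELECT * FROM v_totals")

# The source data change after materialization.
cur.execute("INSERT INTO sales VALUES ('West', 30.0)")

view_rows = dict(cur.execute("SELECT region, total FROM v_totals").fetchall())
stale_rows = dict(cur.execute("SELECT region, total FROM m_totals").fetchall())

# Maintenance step: refresh the stored result from the sources.
cur.execute("DELETE FROM m_totals")
cur.execute("INSERT INTO m_totals SELECT * FROM v_totals")
fresh_rows = dict(cur.execute("SELECT region, total FROM m_totals").fetchall())
conn.close()
```

Before the refresh, the stored copy still holds the old West total, which is the maintenance burden the slide refers to.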
  • 27. Order of Database Operations (1) • When building a DW, pay attention to the order of operations in the SQL command – particularly if large data need to be selected, grouped, and ordered – perhaps build intermediate views to cull data down to manageable size • Order of operations . . . By Dr. Borne 2005 UMUC Data Mining Lecture 2 27
  • 28. Order of Database Operations (2) Clauses are listed in the order they are written; the number in parentheses is the operational order:
    – select (4): specifies attributes and computations to appear in answer
    – from (1): indicates Cartesian product of source tables
    – where (2): provides boolean to filter Cartesian product
    – groupby (3): specifies attributes necessary to cluster the results of the where-filter
    – orderby (5): indicates attributes on which to order any visual display or sequential tuple returns
    – into (6): specifies a temporary table to hold the answer
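The operational order can be traced by hand in plain Python. This is only a sketch of the logical order of evaluation, using two invented tables (`customers`, `orders`); a real query optimizer is free to reorder work and will normally avoid building the full Cartesian product.

```python
from collections import defaultdict
from itertools import product

# Hypothetical source tables.
customers = [{"cid": 1, "city": "Boston"}, {"cid": 2, "city": "Albany"}]
orders = [
    {"cid": 1, "amt": 20},
    {"cid": 1, "amt": 5},
    {"cid": 2, "amt": 9},
    {"cid": 2, "amt": 40},
]

# (1) FROM: Cartesian product of the source tables.
rows = [
    {**c, "o_cid": o["cid"], "amt": o["amt"]}
    for c, o in product(customers, orders)
]
product_size = len(rows)

# (2) WHERE: boolean filter on the product (join condition plus a cutoff).
rows = [r for r in rows if r["cid"] == r["o_cid"] and r["amt"] >= 9]

# (3) GROUP BY city: cluster the filtered rows.
groups = defaultdict(list)
for r in rows:
    groups[r["city"]].append(r)

# (4) SELECT city, SUM(amt): compute the output attributes per group.
answer = [(city, sum(r["amt"] for r in grp)) for city, grp in groups.items()]

# (5) ORDER BY total, descending.
answer.sort(key=lambda pair: pair[1], reverse=True)

# (6) INTO: hold the answer in a temporary table.
temp_table = list(answer)
print(temp_table)
```

Because the filter in step (2) runs before grouping, culling rows early (for example through an intermediate view) shrinks everything that follows.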
  • 29. Maintaining the Data Warehouse The key concept is ETL : – Extraction: extract relevant data and/or changes from the DB sources – Transformation: transform the data to match the warehouse schema – Loading: integrate data (and subsequent changes to data) into the warehouse By Dr. Borne 2005 UMUC Data Mining Lecture 2 29
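A minimal ETL sketch, assuming two invented source systems with mismatched schemas (`billing_db`, `web_db`) and a dictionary standing in for the warehouse. The function names mirror the three steps but are otherwise hypothetical.

```python
# Hypothetical source systems with different schemas and key formats.
billing_db = [
    {"cust": "A17", "total_usd": "120.50"},
    {"cust": "B02", "total_usd": "80.00"},
]
web_db = [{"customer_id": "a17", "clicks": 14}]


def extract():
    """Extraction: pull the relevant records from each source."""
    return billing_db, web_db


def transform(billing, web):
    """Transformation: map both sources onto one warehouse schema."""
    clicks = {r["customer_id"].upper(): r["clicks"] for r in web}
    return [
        {
            "customer_id": r["cust"].upper(),
            "revenue": float(r["total_usd"]),
            "clicks": clicks.get(r["cust"].upper(), 0),
        }
        for r in billing
    ]


def load(warehouse, rows):
    """Loading: insert new rows or overwrite changed ones, keyed by customer."""
    for row in rows:
        warehouse[row["customer_id"]] = row
    return warehouse


warehouse = load({}, transform(*extract()))
print(warehouse)
```

Rerunning `load` with changed source rows overwrites the matching keys, which is one simple way to integrate subsequent changes.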
  • 30. Data Warehousing “features” • Data are integrated into the DW in advance, prior to queries being formulated – Caution: Query results could therefore be stale • Data are copied from distributed sources – Care must be exercised to maintain consistency – Query processing is local to the DW: • faster • can operate even when data sources are unavailable
  • 31. Selecting views to materialize • Factors that affect what to materialize: – Storage cost – Update cost – Which queries will benefit from it – How much will those queries benefit from it • Examples: – GROUP BY A1 is small, but not useful for most queries – GROUP BY A1, B2, C3 is useful for most queries, but too large to be of much benefit By Dr. Borne 2005 UMUC Data Mining Lecture 2 31
  • 32. Data Warehousing and OLAP (On-Line Analytical Processing) • OLAP as Data Mining: – Read data from integrated view of data sources – Complex queries of DW for Data Analysis – Data Analysis for Knowledge Discovery (KDD = Data Mining) – Knowledge Discovery for Decision Making – Goal: optimize reads and data warehouse queries for data exploration, mining, analysis By Dr. Borne 2005 UMUC Data Mining Lecture 2 32
  • 33. OLTP versus OLAP (On-Line Transaction Processing vs. On-Line Analytical Processing)
    • OLTP – Mostly updates – Short, simple transactions – DBA, clerical users – Goal: transaction throughput – Local sources: heterogeneous DBs
    • OLAP – Mostly reads – Long, complex queries – Analysts, decision makers – Goal: fast queries – Distributed sources: single integrated view (data warehouse)
  • 34. OLAP Operations in the Warehouse • Slice (select one dimensional view) • Dice (select multi-dimensional view; aids in the search for trends and patterns) • Roll-up (consolidation; dimension reduction; aggregation; using simple or complex expressions) • Drill-down (querying specific items) • Visualize (“see” the results; allows for intuitive data understanding)
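These operations can be sketched on a tiny in-memory cube. The data and the function names are invented for illustration, and the sketch follows one common convention in which a slice fixes a single dimension to one value and a dice restricts several dimensions at once.

```python
from collections import defaultdict

# Hypothetical fact table: (region, city, month) -> unit sales.
cube = {
    ("New England", "Boston", "Mar"): 120,
    ("New England", "Boston", "Apr"): 90,
    ("New England", "Hartford", "Mar"): 40,
    ("Mid-Atlantic", "Baltimore", "Mar"): 75,
}


def slice_month(cube, month):
    """Slice: fix the month dimension to a single value."""
    return {(r, c): v for (r, c, m), v in cube.items() if m == month}


def dice(cube, regions, months):
    """Dice: keep a sub-cube restricted on several dimensions."""
    return {k: v for k, v in cube.items() if k[0] in regions and k[2] in months}


def roll_up(cube):
    """Roll-up: aggregate away the city level, leaving (region, month)."""
    totals = defaultdict(int)
    for (region, _city, month), units in cube.items():
        totals[(region, month)] += units
    return dict(totals)


def drill_down(cube, region, month):
    """Drill-down: expand one (region, month) total back into its cities."""
    return {c: v for (r, c, m), v in cube.items() if r == region and m == month}


march_by_region = roll_up(cube)
new_england_march = drill_down(cube, "New England", "Mar")
print(march_by_region[("New England", "Mar")], new_england_march)
```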
  • 35. From Lecture #1 The Data Warehouse as the Source for the Mining Process By Dr. Borne 2005 UMUC Data Mining Lecture 2 35
  • 36. From “DataMines for DataWarehouses” article (available in Webliography) [Figures: Data Mining external to the Data Warehouse; Data Mining within the Data Warehouse]
  • 37. Statistics and Data Mining By Dr. Borne 2005 UMUC Data Mining Lecture 2 37
  • 38. Data Mining = Statistical Analysis? • "Data mining … is the exploration and analysis, by automatic and semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules." (Berry, M. J. A. & Linoff, G. [1997]. Data Mining Techniques for Marketing, Sales and Customer Support, John Wiley & Sons, Inc. New York, p.5, http://www.data-miners.com/books/order.html ) • "Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns of data for business advantage." (SAS Institute Inc., http://www.sas.com/technologies/analytics/datamining/index.html ) • "Data mining simply means finding patterns in your business data which you can use to do your business better" (SPSS Inc., http://www.statistical.com.au/dm.htm ) • "Data mining is the use of statistical analysis and machine learning techniques, in a semiautomatic fashion, on large collections of data." (Jorgensen, M. & Gentleman, R. [1998]. Data Mining. Chance 11, 34–42.)
  • 39. Statistics and Data Mining • Data mining got a bad name initially because it was viewed as “statistical dredging” or a “fishing expedition”. • Data mining became an acceptable practice because its users exercised statistical rigor in their analyses. • Challenges and concerns: – Data volumes are huge. Techniques often don't scale. – Contaminated or corrupt data values (6-sigma effect) – Selection bias; non-independent observations – Fishing expedition = if you look hard enough, you will find something. But, is it really useful or not? … this is the “Interestingness” Problem … • Are the data mining results interesting to anyone?
  • 40. Quality Management and Data Mining • The focus of TQM (Total Quality Management) is total customer satisfaction. • This can be realized through CRM (Customer Relationship Management) systems = a data mining technology : – Gather data – Analyze data – Make decisions based upon results • Related to this are 6-Sigma quality control processes : customer satisfaction maximized through minimizing defects in products and services delivered. • Some references: – http://www.sbaer.uca.edu/newsletter/2002/012202.pdf – http://www.qualitydigest.com/apr99/html/body_spcguide.html By Dr. Borne 2005 UMUC Data Mining Lecture 2 40
  • 41. Information Retrieval By Dr. Borne 2005 UMUC Data Mining Lecture 2 41
  • 42. Information Retrieval (IR) • IR is a combination of data discovery and data mining in digital libraries or other information repositories. • An IR system operates on a collection of documents (e.g., the WWW) • IR is sometimes called Text Mining or Web Mining • Effectiveness of an IR project is measured by precision and recall By Dr. Borne 2005 UMUC Data Mining Lecture 2 42
  • 43. Information Retrieval Metrics Precision = (relevant & retrieved) / (retrieved) – “Am I interested in the documents retrieved?” – High Precision means most of the retrieved documents are relevant to my query Recall = (relevant & retrieved) / (relevant) – “Have all relevant documents been retrieved?” – High Recall means that most of the relevant documents have been retrieved. By Dr. Borne 2005 UMUC Data Mining Lecture 2 43
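The two ratios translate directly into set operations. A small sketch, with invented document identifiers and a hypothetical helper name:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 documents actually relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7", "d8", "d9"],
)
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were retrieved (recall 0.4). Returning 0.0 for an empty set is a choice made for this sketch; some tools treat that case as undefined.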
  • 44. IR and Text/Web Mining • Semantic markup of Web or other text documents using XML (eXtensible Markup Language) • XML enables metadata / keyword harvesting from document collections (e.g., Web screen-scraping) • Harvested metadata can be stored in a Data Warehouse for mining -- this is clearly an example of a materialized view of distributed data sources • Other metrics: “similarity” to other documents (e.g., common keywords, common keyphrases) • Application area: Automated Recommendation System
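One simple way to score similarity by common keywords is the Jaccard coefficient (shared keywords divided by all distinct keywords). The slide does not name a specific metric, so this choice is an assumption, and the keyword sets are invented.

```python
def jaccard(keywords_a, keywords_b):
    """Jaccard similarity of two keyword collections, in the range [0, 1]."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


# Hypothetical keywords harvested from two documents.
doc1 = {"mars", "rover", "rock"}
doc2 = {"mars", "rock", "dust", "storm"}
similarity = jaccard(doc1, doc2)
print(similarity)
```

A recommendation system could rank candidate documents by this score against a document the user already liked.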
  • 45. Information Retrieval Issues • Semantic content of documents • Unstructured versus structured content • Multi-modal content (image, text, numeric) • Reliability of sources • Quality of sources • Indexing for efficient & effective access • Similarity metrics (e.g., how do you do a Groupby or a Roll-up ?) • Privacy, Copyright, Intellectual Property By Dr. Borne 2005 UMUC Data Mining Lecture 2 45
  • 46. IR and Image Mining • Image Mining is a form of IR and data mining • Techniques: – Wavelet analysis and summarization – Pixel value (color) histograms and vectorization – Scene pattern recognition and indexing – Event/anomaly detection and cataloguing (e.g., forest fires seen in satellite photos) – Edge detection (unsharp masking) and graphs • The data to be mined are the information databases extracted from the images (not the raw image data themselves)
  • 47. Data Mining as “Rule Induction” By Dr. Borne 2005 UMUC Data Mining Lecture 2 47
  • 48. From Lecture #1 Decision Tree Classification: based on rules at each node of the tree Should I play tennis today? By Dr. Borne 2005 UMUC Data Mining Lecture 2 48
  • 49. Intelligent actions (decision support) are often represented by a set of rules… (example of Decision Tree rules)
    – IF age = “<=30” AND student = “no” THEN buys_computer = “no”
    – IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
    – IF age = “31…40” THEN buys_computer = “yes”
    – IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
    – IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
  • 50. Rule-Based Algorithms (RBA) • RBA = Decision Support via “if-then rules” • Can generate the rules from a Decision Tree (DT). • But, rules do not need to be derived from a DT. • Rules have no order, unlike Decision Trees. • Trees are built by examining all cases; whereas rules are generated one case at a time. • Rule Induction is the method for deriving rules. • Case-Based Reasoning (CBR) is a related application of rule-based algorithms.
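The five rules on this slide can be written as data and applied by a small matcher. This is a sketch: it assumes ages are integers that are first mapped onto the slide's three bands, and `age_band`, `RULES`, and `buys_computer` are hypothetical names.

```python
def age_band(age):
    """Map an integer age onto the three bands used by the rules."""
    if age <= 30:
        return "<=30"
    if age <= 40:
        return "31...40"
    return ">40"


# Each rule: (conditions that must all hold, conclusion for buys_computer).
RULES = [
    ({"age": "<=30", "student": "no"}, "no"),
    ({"age": "<=30", "student": "yes"}, "yes"),
    ({"age": "31...40"}, "yes"),
    ({"age": ">40", "credit_rating": "excellent"}, "yes"),
    ({"age": ">40", "credit_rating": "fair"}, "no"),
]


def buys_computer(record):
    """Return the conclusion of the first matching rule, or None."""
    facts = dict(record, age=age_band(record["age"]))
    for conditions, conclusion in RULES:
        if all(facts.get(key) == value for key, value in conditions.items()):
            return conclusion
    return None  # no rule covers this record


print(buys_computer({"age": 25, "student": "yes"}))
print(buys_computer({"age": 50, "credit_rating": "fair"}))
```

The rule conditions are mutually exclusive, so the order of `RULES` does not affect the result; a record that no rule covers yields `None` instead of a guess.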
  • 51. Sometimes the rules are fuzzy… (example of Fuzzy Rule Induction) By Dr. Borne 2005 UMUC Data Mining Lecture 2 51
  • 52. Fuzzy Sets and Logic By Dr. Borne 2005 UMUC Data Mining Lecture 2 52
  • 53. Fuzzy Sets and Logic • Data mining does not always yield absolute answers, but statistical answers that indicate the probability or frequency of occurrence of patterns or classes, or the likelihood that an object in the database belongs to a given class. • In predictive data mining, the result is fuzzy (e.g., predicting loan default through bank account analysis does not guarantee that the customer will indeed default on their loan). • Fuzzy Logic is a method for handling uncertainty in data, in decision-making, and in control systems.
  • 54. Sets and Logic - Classical (Boolean) By Dr. Borne 2005 UMUC Data Mining Lecture 2 54
  • 55. Sets and Logic - Fuzzy By Dr. Borne 2005 UMUC Data Mining Lecture 2 55
  • 56. Classical versus Fuzzy By Dr. Borne 2005 UMUC Data Mining Lecture 2 56
  • 57. Fuzzy Logic, Control Systems, and Data Mining • Suppose you have a R/T (real-time) data monitoring (data mining) control system attached to machinery in a large manufacturing plant. • Temperature sensor on a machine says that it is running very hot (... what is “hot”? -- that's fuzzy). • Motion sensor within machine says that it is running at high RPM, very fast (… what is “fast”? -- that's fuzzy). • The machine is not technically over-heating, which you know because of past experience and common sense. • Control System responds to data and knowledge-base by invoking a rule to slow down the motor speed a little bit.
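The scenario can be sketched with two fuzzy membership functions and one rule. All thresholds (60 to 90 degrees C for “hot”, 2000 to 4000 RPM for “fast”) and the 20% maximum slowdown are invented for illustration, and the minimum operator is used as the fuzzy AND, which is a common but not the only choice.

```python
def ramp(x, low, high):
    """Membership that rises linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def speed_reduction(temp_c, rpm):
    """Rule: IF temperature is hot AND speed is fast THEN slow the motor.

    Returns the fraction by which to reduce motor speed (0.0 to 0.20).
    """
    hot = ramp(temp_c, 60.0, 90.0)     # degree to which the machine is "hot"
    fast = ramp(rpm, 2000.0, 4000.0)   # degree to which it is running "fast"
    firing_strength = min(hot, fast)   # fuzzy AND
    return 0.20 * firing_strength      # scale the (hypothetical) max slowdown


reduction = speed_reduction(temp_c=75.0, rpm=3500.0)
print(reduction)
```

At 75 degrees C and 3500 RPM the machine is “hot” to degree 0.5 and “fast” to degree 0.75, so the rule fires at strength 0.5 and the motor is slowed a little (10%) instead of being shut down.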
  • 58. Application of Fuzzy Logic to Data Mining - 1 <http://www.cs.uah.edu/~thinke/CS687/Fall97/Tech/rahul_dbase_paper.html> Direct Mailing System • The problem is to identify customers from a customer database who can be targeted for a sale under the assumption that these customers responded positively to advertisements mailed to them. The additional constraint is that the mailing list budget is limited and number of advertisements to be mailed are to be controlled to increase profit. The first step involves analyzing the database for attributes like "frequency of visits to the store", "sum of purchases", etc. Analysis and plots of the data then determine the cluster of good customers. Next, one has to find the attribute relationships to define a query condition which is represented by a pair of attributes and a fuzzy linguistic value. One then verifies and refines the query condition by using another customer database. Thus the customer database is ranked and sorted by degree values based on a given fuzzy query condition. The customers retrieved by the query determine the list of the potential of good customers. By Dr. Borne 2005 UMUC Data Mining Lecture 2 58
  • 59. Application of Fuzzy Logic to Data Mining - 2 <http://www.cs.uah.edu/~thinke/CS687/Fall97/Tech/rahul_dbase_paper.html> Vibration Sensor • A product that was used to sense vibrations and predict their causes (e.g., earthquakes) was improved by utilizing fuzzy rules. The original sensor was based on a simple threshold rule. The error rate for this sensor was around 12%. The fuzzy rules were created by analyzing the actual data in specified cases of earthquakes, automobiles, etc. A feature extraction was done on the data set to identify each kind of cause. Relationships between the feature parameters and the kind of vibration were discovered to develop the fuzzy rules. These rules were then tested and refined. The accuracy of the sensor's prediction improved dramatically, with the error rate falling to within 1%.
  • 60. Non-Fuzzy Logic System By Dr. Borne 2005 UMUC Data Mining Lecture 2 60
  • 61. Adaptive Fuzzy Logic System This example is related to air conditioner settings in a warm room, but the adaptive fuzzy logic system may be applied to activate other “thinking machines”.
  • 62. Machine Learning – a tool for Data Mining and Intelligent Decision Support By Dr. Borne 2005 UMUC Data Mining Lecture 2 62
  • 63. Machine Learning • What is Machine Learning? -- “ML is the application of computer algorithms that improve automatically through experience.” • Why is ML applicable to Data Mining? -- – Refer to earlier slide “Reasons for the growing popularity of data mining” : • Growing Data Volume -- ML enables the intelligent analysis of overwhelmingly large data/knowledge repositories • Limitations of Human Analysis -- ML enables automated searches for complex multifactor dependencies in data • Low Cost of Machine Learning -- machines and software are cheaper than people; the ML process is repeatable, consistent, and robust in handling very large data analysis tasks; adaptive ML algorithms can scale with the problem. By Dr. Borne 2005 UMUC Data Mining Lecture 2 63
  • 64. Machine Learning and Data Mining • ML Techniques for DM (to be covered later): – Decision Trees – Rule Mining and Rule Learning – Case-Based Reasoning (CBR) – Neural Nets (NN) – Supervised and Unsupervised Learning – Support Vector Machines (SVM) – Bayesian Networks – Genetic Algorithms (GA) By Dr. Borne 2005 UMUC Data Mining Lecture 2 64
  • 65. Neural Nets • “Neural networks are the second best way of doing just about anything.” (John Denker) [Figure labels: Data, Neural Network, Fuzzy Rules] • The best way “is to apply all available domain knowledge and spend a considerable amount of time, money and effort in building a rule system that will give the right answer. The second best way of doing anything is to learn from experience.” (Burbidge & Buxton)
  • 66. Supervised vs. Unsupervised Learning • In Supervised Learning algorithms, a training set is provided (data with correct answers), which is used to mine for known patterns. • In Unsupervised Learning algorithms, data are provided with no a priori knowledge of the hidden patterns (knowledge) that they contain. The goal is to discover (learn) these patterns. • A class known as Semi-Supervised Learning also exists, where knowledge is known and applied from one data collection in order to mine, analyze, classify, and interpret a related data collection. By Dr. Borne 2005 UMUC Data Mining Lecture 2 66
  • 67. Machine Learning, Data Mining, and Support Vector Machines (SVM) • SVM is the tool of choice for the application of ML to the data mining classification problem. • So what are they? … “a statistical learning system for predictive data mining -- for estimating regression functions.” • Loads of information available here: http://www.cs.rpi.edu/~bij2/svm.html http://www.kernel-machines.org/tutorial.html
  • 68. SVM Process Overview [Flow diagram labels: Data; Initial Classification; SVM Training; Weights; SVM Classification; Elements In Classification; Elements Out of Classification]
  • 69. SVM Classification • SVM attempts to find an optimal separating hyperplane between members of the two initial classifications. [Figure: separating hyperplane between Class “A” and Class “B”]
  • 70. SVM Class Separation Problem • An optimal hyperplane partitions the initial classification correctly and maximizes distance from the plane to elements on either “side”: positive and negative examples. • When the training examples (initial classification) consist of very diverse expression patterns, then finding an optimal hyperplane can be impossible.
  • 71. SVM Kernel Construction The expression data can be transformed to a higher dimensional space (feature space) by applying a kernel function. This transformation can have the effect of allowing a separating hyperplane to be found. By Dr. Borne 2005 UMUC Data Mining Lecture 2 71
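A toy example shows why moving to a higher dimensional space can help. The one-dimensional points below are invented, and the sketch uses an explicit feature map x → (x, x²) instead of a kernel function; a kernel achieves the same effect implicitly by computing inner products in the feature space. No SVM is trained here; the code only checks whether a separating threshold exists.

```python
# Hypothetical 1-D data: the positive class lies on both sides of the negative class.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [+1, +1, -1, -1, -1, +1, +1]


def separable_by_threshold(values, labels):
    """True if one threshold puts all +1 values on one side and all -1 on the other."""
    pos = [v for v, y in zip(values, labels) if y > 0]
    neg = [v for v, y in zip(values, labels) if y < 0]
    return max(pos) < min(neg) or max(neg) < min(pos)


def feature_map(x):
    """Explicit map into a 2-D feature space: x -> (x, x^2)."""
    return (x, x * x)


mapped = [feature_map(x) for x in points]

# In the original space no single cut point separates the classes.
separable_in_input_space = separable_by_threshold(points, labels)

# In feature space, a horizontal line on the x^2 coordinate separates them.
separable_in_feature_space = separable_by_threshold(
    [z[1] for z in mapped], labels
)
print(separable_in_input_space, separable_in_feature_space)
```

The horizontal line in the (x, x²) plane is a separating hyperplane in feature space that corresponds to a nonlinear boundary (two cut points) in the original space.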
  • 72. Practical SVM Issues • Results depend heavily on the input parameters. • Using a high degree kernel function risks artificial separation of the data. • An iterative approach to increasing the kernel power is advisable. By Dr. Borne 2005 UMUC Data Mining Lecture 2 72
  • 73. SVM Results • Two classes are produced: – Positive Class: contains elements with expression patterns similar to those in the positive examples in the training set. – Negative Class: contains all other members of the input set. • Each of these classes has elements that fall in two groups: – Those initially in the class (true positives and true negatives) – Those recruited into the class (false positives and false negatives) By Dr. Borne 2005 UMUC Data Mining Lecture 2 73
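The four groups can be counted by comparing each element's initial classification with the class the SVM assigns it. A sketch with invented labels and a hypothetical helper; the abbreviations are the standard confusion-matrix terms, where FP means predicted positive but initially negative and FN means predicted negative but initially positive.

```python
def confusion_counts(initial, predicted):
    """Count TP, FP, TN, FN from paired booleans (True = positive class)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for was_positive, is_positive in zip(initial, predicted):
        if is_positive:
            counts["TP" if was_positive else "FP"] += 1
        else:
            counts["FN" if was_positive else "TN"] += 1
    return counts


# Hypothetical initial classification and SVM output for five elements.
initial = [True, True, False, False, False]
predicted = [True, False, True, False, False]
counts = confusion_counts(initial, predicted)
print(counts)
```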
  • 74. Machine Learning Resources • 1. Massive compilation of ML resources at : http://home.earthlink.net/~dwaha/research/machine-learning.html • 2. Excellent Reference Book: Tom Mitchell's “Machine Learning” (1997; McGraw-Hill) : http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html • 3. Machine Learning & Data Mining Resources : My favorite ML site … http://www.mlnet.org/ Click on Software … a site dedicated to “machine learning, knowledge discovery, case-based reasoning, knowledge acquisition, and data mining.”
  • 75. Recap of ML and DM • DM requires machine assistance in the search and analysis of very large (often distributed, heterogeneous) databases • Intelligent analysis of complex multi-dimensional multiple-dependency data also demands machine assistance • Algorithms for DM are most efficient when they are adaptable to the type and content of the data (i.e., the system “learns”) • Machines are less expensive than humans • Machines are usually scalable as the problem size grows • Actionable data (the end-goal of DM) depends in many cases on an embedded ML algorithm to take appropriate action (in control systems; decision-support systems; robotics; autonomous systems) • ML and DM are historically, technically, and functionally intertwined (e.g., some data mining research groups call themselves Machine Learning Groups)
  • 76. Steps in the Data Mining Process By Dr. Borne 2005 UMUC Data Mining Lecture 2 76
  • 77. Steps in the Data Mining Process http://www.cs.sfu.ca/~han/DM_Book.html • Learning the application domain: – relevant prior knowledge and goals of DM application • Creating a target data set: Data selection • Data cleaning and preprocessing: (may take 40-60% of effort!) • Data reduction and transformation: – Find useful features, dimensionality/variable reduction, invariant representation. • Choosing data mining functions – summarization, classification, regression, association, clustering • Choosing the mining algorithm(s) • Data mining & KDD: search for patterns of interest • Pattern evaluation and knowledge presentation – visualization, transformation, removing redundant patterns, etc. • Using the discovered knowledge = Actionable Data! By Dr. Borne 2005 UMUC Data Mining Lecture 2 77
  • 78. Steps in the Data Mining Process - Pictorial View By Dr. Borne 2005 UMUC Data Mining Lecture 2 78
  • 79. Cleaning the “Dirty Data” • Excellent reference: Dorian Pyle's book “Data Preparation for Data Mining” (1999, Morgan Kaufmann; 540pp) • Frequent problem: missing (NULL) values • Empty value ≠ Missing value (must treat each case differently) • Various options for NULLs (may introduce bias): – use “fill value” (e.g., -999) – use estimated value (prediction from data model) – use interpolated value (from surrounding entries) – ignore any records with nulls • November 2003 Workshop on Data Cleaning: http://dimacs.rutgers.edu/Workshops/DataCleaning/
  • 80. Data Preprocessing (Laundering the Data) (may take 40-80% of the total data mining project effort!) (Reference: “Data Scrubbing” article in Computerworld 2003)
  • 81. “Data Scrubbing by the Numbers” (http://www.computerworld.com/printthis/2003/0,4814,78260,00.html) Here are some of the findings: – Data cleansing accounts for up to 70% of the cost and effort of implementing most data warehouse projects, according to analysts. – In 2001, The Data Warehousing Institute estimated that dirty data costs U.S. businesses $600 billion per year. – Data cleanliness and quality was the No. 2 problem -- right behind budget cuts -- cited in a 2003 IDC survey of 1,648 companies implementing business analytics software enterprise-wide. – Only 23% of 130 companies surveyed by Cutter Consortium on their data warehousing and business-intelligence practices use specialized data cleansing tools. – Of those companies in the Cutter Consortium study using specialized data scrubbing software, 31% are using tools that were built in-house.
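Three of the NULL-handling options can be compared on a short series. The data are invented and `None` stands in for a missing value; the model-based estimate is omitted because it depends on the data model chosen.

```python
FILL = -999.0
series = [10.0, None, 14.0, None, None, 20.0]  # hypothetical measurements


def fill_value(values, fill=FILL):
    """Replace every missing entry with a sentinel fill value."""
    return [fill if v is None else v for v in values]


def drop_nulls(values):
    """Ignore records that contain a missing value."""
    return [v for v in values if v is not None]


def interpolate(values):
    """Linearly interpolate gaps between known neighbors; edge gaps stay None."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


def mean(values):
    return sum(values) / len(values)


filled = fill_value(series)
dropped = drop_nulls(series)
interpolated = interpolate(series)
print(mean(filled), mean(dropped), mean(interpolated))
```

The mean of the sentinel-filled series is wildly negative, which illustrates the bias warning: a fill value must be excluded or flagged before any statistics are computed.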
  • 82. Major Issues in Data Mining By Dr. Borne 2005 UMUC Data Mining Lecture 2 82
  • 83. Major Issues in Data Mining (1) • Mining methodology and user interaction – Mining different kinds of knowledge in databases – Interactive mining of knowledge at multiple levels of abstraction – Incorporation of background knowledge – Data mining query languages and ad-hoc data mining – Expression and visualization of data mining results – Handling of noise and incomplete data – Pattern evaluation: the interestingness problem • Performance and scalability – Handling very large data volumes (the “data flood”) – Efficiency and scalability of data mining algorithms – Parallel, distributed, and incremental mining methods
  • 84. Major Issues in Data Mining (2) • Issues relating to the diversity of data types – Handling relational and complex types of data – Mining information from heterogeneous databases and global information systems (WWW) • Issues related to applications and social impacts – Application of discovered knowledge • Domain-specific data mining tools • Intelligent query answering • Process control and decision making – Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem – Protection of data security, integrity, and privacy • Dirty data (60% of the effort, or more) – Preparing the data for mining (transformation, cleaning, processing) By Dr. Borne 2005 UMUC Data Mining Lecture 2 84
  • 85. Case Study - The Mars Rover http://mars.jpl.nasa.gov/mer/mission/spacecraft_surface_rover.html
  • 86. Data Mining in Action • Data Mining facilitates Intelligent Data Understanding • Data Mining enables Decision Support and Active Control Systems
  • 87. What is Intelligent Data Understanding? • IDU refers to the application of techniques for transforming data into understanding. … (sound familiar?) Data → Information → Knowledge → Understanding / Wisdom! • Web reference: http://is.arc.nasa.gov/IDU/index.html • IDU specifically refers to automating the following techniques for machine-assisted data analysis: – Data Mining (e.g., http://is.arc.nasa.gov/IDU/tasks/NVODDM.html) – Knowledge Discovery – Machine Learning
  • 88. Intelligent Data System Applications (1) • Rove around the surface of Mars and take samples of rocks (mass spectroscopy = a data histogram) • Supervised Learning (search for rocks with known compositions) • Unsupervised Learning (discover what types of rocks are present, without preconceived biases) • Association Mining (find unusual associations) • Clustering (find the set of unique classes of rocks) • Classification (assign rocks to known classes) • Deviation/Outlier Detection (one-of-a-kind; interesting?)
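The classification and outlier-detection bullets above can be illustrated with a minimal sketch. It assumes each rock sample is a normalized histogram and assigns it to the nearest known class, flagging it as an outlier when no class is close enough. The class names, reference spectra, and threshold are hypothetical values chosen for illustration; nothing here describes the rover's actual software.

```python
import math

# Hypothetical reference spectra (normalized histograms) for known rock classes.
KNOWN_CLASSES = {
    "basalt": [0.50, 0.30, 0.15, 0.05],
    "hematite": [0.10, 0.20, 0.30, 0.40],
}


def distance(a, b):
    """Euclidean distance between two equal-length histograms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(spectrum, known=KNOWN_CLASSES, outlier_threshold=0.25):
    """Return (label, distance) for the nearest known class.

    The label is "outlier" when even the nearest class is farther away
    than outlier_threshold (an assumed, illustrative cutoff).
    """
    label, best = min(
        ((name, distance(spectrum, ref)) for name, ref in known.items()),
        key=lambda pair: pair[1],
    )
    if best > outlier_threshold:
        return "outlier", best
    return label, best


if __name__ == "__main__":
    print(classify([0.48, 0.32, 0.15, 0.05]))  # close to the basalt reference
    print(classify([0.05, 0.05, 0.85, 0.05]))  # far from every known class
```

A sample near a reference spectrum is assigned to that class (classification); a sample far from all of them is the "one-of-a-kind" case that may deserve a closer look (deviation/outlier detection).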
  • 89. Intelligent Data System Applications (2) • On-board Intelligent Data Understanding & Decision Support Systems (Fuzzy Logic & Decision Trees & Case-Based Reasoning) – Science Goal Monitoring: – “stay here and do more”; or else “move on to another rock” – “send results to Earth immediately”; or “send results later” • Learn as it goes (Machine Learning & Neural Nets) • Relate the results to other factors, such as dust storms (XML & Information Retrieval & Information Fusion with other data from orbiting satellite "mother ship") • Predict where to go in order to find interesting rocks (Logistic Regression & Case-Based Reasoning)
  • 90. Mars Rover as an Adaptive Fuzzy Logic System • Decisions are based on data mined, prior experience, new knowledge, and fuzzy logic • Rover acts autonomously, without human intervention, in a Deep Space environment • Actions are driven by mining actionable data from all sensors
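To make the fuzzy-logic decision idea concrete, here is a minimal sketch of a two-rule fuzzy controller that chooses between "stay here and do more" and "move on to another rock". The input names (novelty, battery), the membership ramps, and the rules are all assumptions made for illustration, not the rover's actual control logic. It uses the standard fuzzy operators: AND = min, OR = max, NOT = 1 - x.

```python
def ramp(x, low, high):
    """Membership that rises linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def decide(novelty, battery):
    """Choose an action from two inputs in [0, 1] using two fuzzy rules.

    Both inputs and all breakpoints are hypothetical.
    """
    interesting = ramp(novelty, 0.3, 0.7)  # degree to which the rock is "interesting"
    charged = ramp(battery, 0.2, 0.5)      # degree to which the battery is "adequate"

    # Rule 1: IF interesting AND charged THEN stay (fuzzy AND = min).
    stay = min(interesting, charged)
    # Rule 2: IF NOT interesting OR NOT charged THEN move on
    # (fuzzy OR = max, fuzzy NOT = 1 - x).
    move = max(1.0 - interesting, 1.0 - charged)

    # Pick the action whose rule fires more strongly.
    if stay > move:
        return "stay here and do more"
    return "move on to another rock"


if __name__ == "__main__":
    print(decide(novelty=0.9, battery=0.8))
    print(decide(novelty=0.9, battery=0.25))
```

Unlike a hard threshold, each input contributes a degree of truth between 0 and 1, so a very interesting rock can still lose out when the battery is only marginally adequate.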
  • 91. Summary
  • 92. Summary of Topics Covered • Summary of "What is Data Mining?" Tutorial • Foundations of Data Mining • Database Systems • Data Warehousing and OLAP • Statistics and Data Mining • Information Retrieval • Data Mining as "Rule Induction" • Fuzzy Sets and Logic • Machine Learning • Steps in the Data Mining Process • Major Issues in Data Mining • A Case Study: The NASA Mars Rover