Lecture 2

Data Mining
UMUC CSMN 667

Lecture #2

By Dr. Borne, 2005

Term Paper - Data Mining Case Analysis
• Refer to Project Descriptions section of WebTycho course
Syllabus for detailed information.
• 1-page Summary (Abstract+Outline) due: April 4, 2005
• Final Paper Due Date: 12 midnight, April 18, 2005
• Submit both in your WebTycho Assignments Folder
• Term Paper Page Restrictions: 5-8 pages
• I will submit your paper to TurnItIn.com for verification of
originality – per UMUC Graduate School policies.
• Format/Style: Use the SPIE Conference Proceedings Style,
which is available at:
http://www.spie.org/app/Publications/index.cfm?fuseaction=authinfo&type=manspecs
[ONLY USE THIS FOR STYLE FILES AND FORMATTING INSTRUCTIONS]


Case Analysis Instructions (1)
The goal of the paper assignment is to complete an in-depth study of
a data mining application. Examples of applications include
financial, scientific, medical, intrusion detection, and web mining.
Describe data types, data volumes, technical challenges, end-goals,
who is the user community, which data mining algorithms are most
relevant, why data mining, how is it used, what is the current status
of data mining usage in this field?
Possible case topics include:
• A direct mailing application looking to maximize cross-selling opportunities (e.g., Doubleclick).
• A bank determining the creditworthiness of a potential customer (e.g., American Express, Bank of America).
• A medical insurer looking to detect medical fraud.
• Gene detection in BioInformatics (e.g., Celera).
• Glitch or anomaly detection in scientific time series data.
• Abnormal network access behavior for detection of computer system intrusion and security violation.


Case Analysis Instructions (2)

• You may choose to go in depth in either one of
these two areas:
– A data mining application domain: Evaluate the application area
in detail, as explained on the previous slide, including a review and analysis
of the different data mining techniques employed there.
Or

– A data mining technique: Research in depth the different application
domains where this technique has been used. Answer the questions on the
previous slide when evaluating this technique's different application areas.


Case Analysis Paper - Instructions (3)
• Please e-mail me your suggested topic (application area to
be researched) so that I may verify that it is okay.


Case Analysis Paper - Instructions (4)

• Submit your completed paper in WebTycho.
• You may submit your paper in any of these
formats: PDF, or Microsoft WORD, or postscript
(PS).
• You must submit it no later than midnight on
April 18. WebTycho will not allow submissions
after that time.
• Submit the paper in your "Assignments
Folder" (on the left menu bar within the
WebTycho course website).

Lecture 2:
"Data Mining Roots"
(Chapter 2 of Dunham textbook)


Lecture 2 Outline
• Summary of "What is Data Mining?" Tutorial
• Foundations of Data Mining
• Database Systems
• Data Warehousing and OLAP
• Statistics and Data Mining
• Information Retrieval
• Data Mining as "Rule Induction"
• Fuzzy Sets and Logic
• Machine Learning
• Steps in the Data Mining Process
• Major Issues in Data Mining
• A Case Study: The NASA Mars Rover


“What is Data Mining?”

From online reading assignment:
Data Mining Tutorial at:
http://www.megaputer.com/dm/dm101.php3


Summary of "What is Data Mining?" Tutorial
• What is data mining?
• Why use data mining?
• What can Data Mining do for you?
• Reasons for the growing popularity of Data Mining
• Tasks Solved by Data Mining
• Different DM Technologies and Systems
– Subject-oriented analytical systems
– Statistical packages
– Neural Networks
– Evolutionary Programming
– Memory Based Reasoning
– Decision Trees
– Genetic Algorithms
– Nonlinear Regression Methods

What can Data Mining do for you?
(business-focused list)
• Identify your best prospects and then
retain them as customers.
• Predict cross-sell opportunities and make
recommendations.
• Learn parameters influencing trends in
sales and margins.
• Segment markets and personalize
communications.


Reasons for the Growing Popularity of Data Mining
• Growing Data Volumes
• Limitations of Human Analysis
• Low Cost of Machine Learning

Tasks Solved by Data Mining
• Prediction
• Explicit Modeling
• Classification
• Clustering
• Detection of Relations
• Market Basket Analysis
• Deviation Detection


Foundations of Data Mining


Foundations of Data Mining: Databases,
Statistics, and Machine Learning
• David Hand (1998, "Data Mining: Statistics and
More?", The American Statistician, 52, pp. 112–118)
used the following definition.
– "Data mining is a new discipline lying at the interface of
statistics, database technology, pattern recognition, machine
learning, and other areas. It is concerned with the secondary
analysis of large databases in order to find previously
unsuspected relationships which are of interest or value to
the database owners.”
– Why “secondary”? … Because the data were typically
collected for other purposes (such as billing, accounting,
customer addresses, etc.). Primary analysis of large
databases is generally the domain of STATISTICS.


Slide from Lecture 1
Evolution of Data Mining
<http://www.thearling.com/text/dmwhite/dmwhite.htm>

• Data Collection (1960s)
– Business question: "What was my total revenue in the last five years?"
– Enabling technologies: computers, tapes, disks
– Characteristics: retrospective, static data delivery
• Data Access (1980s)
– Business question: "What were unit sales in New England last March?"
– Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
– Characteristics: retrospective, dynamic data delivery at record level
• Data Warehousing & Decision Support (1990s)
– Business question: "What were unit sales in New England last March? Drill down to Boston."
– Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
– Characteristics: retrospective, dynamic data delivery at multiple levels
• Data Mining (Emerging Today)
– Business question: "What's likely to happen to Boston unit sales next month? Why?"
– Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
– Characteristics: prospective, proactive information delivery


Foundation for Data Mining Techniques
• 1960s:
– Data collection, database creation, IMS, and hierarchical DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial, scientific,
engineering, financial, manufacturing, sales, etc.)
• 1990s—2000s:
– Data mining and data warehousing, multimedia databases, and
Web databases


History of Data Mining
• Dates for specific events were imprecise in the
preceding slides. This might be a little better :


Data Mining: Confluence of
Multiple Disciplines
[Figure: Data Mining at the center, drawing on the following disciplines]
• Database Technology
• Statistics
• Machine Learning
• Visualization
• Information Science
• Other Disciplines


Data Mining Stepping Stones
http://www.cs.sfu.ca/~han/DM_Book.html

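[Figure: pyramid of layers, listed here from top to bottom; the potential to support business decisions increases toward the top. The role shown beside each layer is given in parentheses.]
• Making Decisions (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Analysis, Querying and Reporting (Data Analyst)
• Data Warehouses / Data Marts: OLAP, MDA (DBA)
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP (DBA)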


Database Systems


Database Systems
• DBMS joins "AI and statistics" to become Data Mining
• Data mining usually asks complex statistical questions
that are difficult to answer via traditional SQL queries
• Data mining relies on special algorithms outside of the
standard DBMS/SQL family of tools
• Data mining is used to extract knowledge from DBMS,
not just the data bits (i.e., KDD)
• Data mining applies familiar statistical concepts to
large DBMS (e.g., outlier detection; cluster analysis;
data modeling; evolutionary analysis; prediction)

Data Mining is a core database function
• Data Mining has many names / aliases :
– Knowledge Discovery in Databases (KDD)
– Machine Learning (ML)
– Exploratory Data Analysis (EDA)
– Intelligent Data Analysis (IDA)
– On-Line Analytical Processing (OLAP)
– Business Intelligence (BI)
– Customer Relationship Management (CRM)
– Business Analytics
– Target Marketing
– Cross-Selling
– Market Basket Analysis
– Credit Scoring
– Case-Based Reasoning (CBR)
– Connecting the Dots
– Intrusion Detection Systems (IDS)
– Recommendation / Personalization Systems!


Database Systems and Data Mining
• Data mining brings novel non-traditional concepts to
large DBMS (e.g., association mining; neural nets;
decision trees; link analysis; pattern recognition;
classification; regression; SOMs). For example:
– Clustering Analysis = group together similar items and
separate dissimilar items
– Classification Prediction = predict the class label
– Regression = predict a numeric attribute value
– Association Analysis = detect attribute-value conditions that
occur frequently together (e.g., Beer & Diapers example)
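
As a rough illustration of association analysis, the sketch below counts how often pairs of items occur together in a handful of made-up market baskets. The basket contents, the function name `pair_support`, and the 0.5 support threshold are assumptions chosen for illustration, not data or code from the lecture.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions (illustrative data only).
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
]

def pair_support(transactions):
    """Return {pair: fraction of transactions containing both items}."""
    counts = Counter()
    for items in transactions:
        # Sort so each pair has one canonical ordering.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}

support = pair_support(baskets)

# Keep only pairs that co-occur in at least half of the baskets
# (the threshold is an arbitrary choice for this sketch).
frequent = {p: s for p, s in support.items() if s >= 0.5}
print(frequent)
```

Practical association mining algorithms also handle larger itemsets and rule confidence; this sketch covers only pairwise support.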


Types of Databases to be Mined
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories:
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW, and eventually the Semantic Web

Data Warehousing and OLAP


Data Warehousing
• Data warehouse = Materialized view
• Integrated view of data from distributed sources
• If transformation process can be represented via SQL,
then data warehouse can be seen as a DB view:
– CREATE VIEW warehouse_table AS
SELECT …
FROM source_table1, source_table2, …
WHERE …
– except that the view is materialized = result is stored
and needs to be maintained when source data change
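
The difference between an ordinary view and a materialized one can be sketched with Python's built-in `sqlite3` module. SQLite has no materialized views, so the stored copy is emulated here with `CREATE TABLE ... AS SELECT`; the table names, columns, and values are all hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales_east (item TEXT, units INTEGER);
    CREATE TABLE sales_west (item TEXT, units INTEGER);
    INSERT INTO sales_east VALUES ('widget', 10), ('gadget', 5);
    INSERT INTO sales_west VALUES ('widget', 7);

    -- Ordinary view: recomputed from the sources on every query.
    CREATE VIEW v_sales AS
        SELECT item, units FROM sales_east
        UNION ALL
        SELECT item, units FROM sales_west;

    -- "Materialized" copy: the result is stored as a table.
    CREATE TABLE warehouse_sales AS SELECT * FROM v_sales;
""")

def total(relation):
    # `relation` is one of the fixed names above, never user input.
    return con.execute(f"SELECT SUM(units) FROM {relation}").fetchone()[0]

# A source changes after the warehouse was loaded.
con.execute("INSERT INTO sales_west VALUES ('gadget', 3)")

live_total = total("v_sales")            # sees the new row
stale_total = total("warehouse_sales")   # does not, until refreshed

# Maintenance step: rebuild the stored result from the sources.
con.executescript("""
    DELETE FROM warehouse_sales;
    INSERT INTO warehouse_sales SELECT * FROM v_sales;
""")
refreshed_total = total("warehouse_sales")
```

The full rebuild is the simplest possible maintenance strategy; it is used here only to keep the sketch short.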


Order of Database Operations (1)

• When building a DW, pay attention to the
order of operations in the SQL command
– particularly if large data need to be selected,
grouped, and ordered
– perhaps build intermediate views to cull data
down to manageable size
• Order of operations . . .


Order of Database Operations (2)
The clauses are listed in the order they are written; the number in parentheses is the operational order in which each one is evaluated.

(4) select ..... specifies attributes and computations to appear in answer
(1) from ..... indicates Cartesian product of source tables
(2) where ..... provides boolean to filter Cartesian product
(3) groupby ..... specifies attributes necessary to cluster the results of the where-filter
(5) orderby ..... indicates attributes on which to order any visual display or sequential tuple returns
(6) into ..... specifies a temporary table to hold the answer
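
The operational order can be mimicked step by step in plain Python on small in-memory lists. This is only a conceptual sketch: the two tables and their columns are invented, and a real database engine optimizes the plan rather than literally building the Cartesian product.

```python
from itertools import product
from collections import defaultdict

# Hypothetical source tables (illustrative only).
stores = [{"store": "Boston", "region": "NE"},
          {"store": "Austin", "region": "SW"}]
sales = [{"store": "Boston", "units": 4},
         {"store": "Boston", "units": 6},
         {"store": "Austin", "units": 9}]

# (1) FROM: Cartesian product of the source tables.
rows = [{"s": s, "t": t} for s, t in product(stores, sales)]

# (2) WHERE: boolean filter on the product (here, the join condition).
rows = [r for r in rows if r["s"]["store"] == r["t"]["store"]]

# (3) GROUP BY: cluster the surviving rows by region.
groups = defaultdict(list)
for r in rows:
    groups[r["s"]["region"]].append(r["t"]["units"])

# (4) SELECT: choose attributes and compute aggregates per group.
answer = [{"region": g, "total_units": sum(u)} for g, u in groups.items()]

# (5) ORDER BY: sort the answer for display.
answer.sort(key=lambda row: row["total_units"], reverse=True)

# (6) INTO: hold the answer in a temporary table (a plain variable here).
temp_table = answer
```

Filtering in step (2) before grouping in step (3) is what keeps the grouped data small, which is the point of paying attention to the order.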


Maintaining the Data Warehouse
The key concept is ETL :
– Extraction: extract relevant
data and/or changes from the
DB sources
– Transformation: transform
the data to match the
warehouse schema
– Loading: integrate data (and
subsequent changes to data)
into the warehouse
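
A minimal sketch of the three ETL stages, assuming a made-up source format and a made-up warehouse schema (`customer`, `amount_cents`, `sale_date`). The function names are illustrative and do not come from any particular ETL tool.

```python
# Hypothetical source records with inconsistent formats (illustrative only).
source_rows = [
    {"cust": " Alice ", "amount": "12.50", "date": "2005-03-01"},
    {"cust": "BOB", "amount": "7", "date": "2005-03-02"},
    {"cust": "", "amount": "3.00", "date": "2005-03-02"},  # unusable row
]

def extract(rows):
    """Extraction: keep only the rows relevant to the warehouse."""
    return [r for r in rows if r["cust"].strip()]

def transform(rows):
    """Transformation: match the (assumed) warehouse schema."""
    return [
        {"customer": r["cust"].strip().title(),
         "amount_cents": round(float(r["amount"]) * 100),
         "sale_date": r["date"]}
        for r in rows
    ]

def load(warehouse, rows):
    """Loading: integrate the new rows into the warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse = load([], transform(extract(source_rows)))
```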

Data Warehousing "features"
• Data are integrated into the DW in advance,
prior to queries being formulated
– Caution: Query results could therefore be stale
• Data are copied from distributed sources
– Care must be exercised to maintain consistency
– Query processing is local to the DW:
• faster
• can operate even when data sources are unavailable


Selecting views to materialize
• Factors that affect what to materialize:
– Storage cost
– Update cost
– Which queries will benefit from it
– How much will those queries benefit from it
• Examples:
– GROUP BY A1 is small, but not useful for most
queries
– GROUP BY A1, B2, C3 is useful for most
queries, but too large to be of much benefit

Data Warehousing and OLAP
(On-Line Analytical Processing)
• OLAP as Data Mining:
– Read data from integrated view of data sources
– Complex queries of DW for Data Analysis
– Data Analysis for Knowledge Discovery
(KDD = Data Mining)
– Knowledge Discovery for Decision Making
– Goal: optimize reads and data warehouse
queries for data exploration, mining, analysis

OLTP versus OLAP
(On-Line Transaction Processing vs. On-Line Analytical Processing)

• OLTP
– Mostly updates
– Short, simple transactions
– DBA, clerical users
– Goal: transaction throughput
– Local sources: heterogeneous DBs
• OLAP
– Mostly reads
– Long, complex queries
– Analysts, decision makers
– Goal: fast queries
– Distributed sources: single integrated view (data warehouse)

OLAP Operations in the Warehouse
• Slice (select one dimensional view)
• Dice (select multi-dimensional view;
aids in the search for trends and
patterns)
• Roll-up (consolidation; dimension
reduction; aggregation; using simple
or complex expressions)
• Drill-down (querying specific items)
• Visualize ("see" the results; allows
for intuitive data understanding)
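
The first four operations can be sketched on a tiny three-dimensional cube held in a Python dictionary. The dimensions (region, city, month), the figures, and the function names are assumptions made for illustration; OLAP servers expose these operations through their own query interfaces.

```python
from collections import defaultdict

# Hypothetical fact table: (region, city, month) -> units sold.
cube = {
    ("New England", "Boston", "Mar"): 120,
    ("New England", "Boston", "Apr"): 90,
    ("New England", "Hartford", "Mar"): 40,
    ("Southwest", "Austin", "Mar"): 75,
}

def slice_(cube, month):
    """Slice: fix one dimension (month) to a single value."""
    return {k: v for k, v in cube.items() if k[2] == month}

def dice(cube, regions, months):
    """Dice: keep a sub-cube defined by values on several dimensions."""
    return {k: v for k, v in cube.items()
            if k[0] in regions and k[2] in months}

def roll_up(cube):
    """Roll-up: aggregate away city and month, leaving region totals."""
    totals = defaultdict(int)
    for (region, _city, _month), units in cube.items():
        totals[region] += units
    return dict(totals)

def drill_down(cube, region):
    """Drill-down: break one region's total out by city."""
    totals = defaultdict(int)
    for (r, city, _month), units in cube.items():
        if r == region:
            totals[city] += units
    return dict(totals)
```

For example, `roll_up(cube)` gives the regional totals, and `drill_down(cube, "New England")` answers the "drill down to Boston" style of question from the earlier evolution slide.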

From Lecture #1

The Data Warehouse as the Source
for the Mining Process


From "DataMines for DataWarehouses" article
(available in Webliography)

Data Mining external
to the Data Warehouse

Data Mining within
the Data Warehouse


Statistics and Data Mining


Data Mining = Statistical Analysis?
• "Data mining … is the exploration and analysis, by automatic and
semi-automatic means, of large quantities of data in order to
discover meaningful patterns and rules." (Berry, J. A. & Linoff, G.
[1997]. Data mining Techniques For Marketing, Sales and Customer
Support, John Wiley & Sons, Inc. New York, p.5,
http://www.data-miners.com/books/order.html )
• "Data mining is the process of selecting, exploring, and modeling
large amounts of data to uncover previously unknown patterns of
data for business advantage." (SAS Institute Inc.,
http://www.sas.com/technologies/analytics/datamining/index.html )
• "Data mining simply means finding patterns in your business data
which you can use to do your business better" (SPSS Inc.,
http://www.statistical.com.au/dm.htm )
• "Data mining is the use of statistical analysis and machine learning
techniques, in a semiautomatic fashion, on large collections of
data." (Jorgensen, M. & Gentleman, R. [1998]. Data Mining. Chance
11, 34–42.)


Statistics and Data Mining
• Data mining got a bad name early on because it was
viewed as "statistical dredging" or a "fishing
expedition".
• Data mining became an acceptable practice because
its users exercised statistical rigor in their analyses.
• Challenges and concerns:
– Data volumes are huge. Techniques often don't scale.
– Contaminated or corrupt data values (6-sigma effect)
– Selection bias; non-independent observations
– Fishing expedition = if you look hard enough, you will
find something. But, is it really useful or not? … …
this is the “Interestingness” Problem …
• Are the data mining results interesting to anyone?


Quality Management and Data Mining
• The focus of TQM (Total Quality Management) is total customer
satisfaction.
• This can be realized through CRM (Customer Relationship
Management) systems = a data mining technology :
– Gather data
– Analyze data
– Make decisions based upon results
• Related to this are 6-Sigma quality control processes : customer
satisfaction maximized through minimizing defects in products
and services delivered.
• Some references:
– http://www.sbaer.uca.edu/newsletter/2002/012202.pdf
– http://www.qualitydigest.com/apr99/html/body_spcguide.html


Information Retrieval


Information Retrieval (IR)
• IR is a combination of data discovery and
data mining in digital libraries or other
information repositories.
• An IR system operates on a collection of
documents (e.g., the WWW)
• IR is sometimes called Text Mining or Web
Mining
• Effectiveness of an IR project is measured by
precision and recall

Information Retrieval Metrics
Precision = (relevant & retrieved) / (retrieved)
– “Am I interested in the documents retrieved?”
– High Precision means most of the retrieved
documents are relevant to my query

Recall = (relevant & retrieved) / (relevant)
– “Have all relevant documents been retrieved?”
– High Recall means that most of the relevant
documents have been retrieved.
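
The two formulas above translate directly into set arithmetic. In the sketch below the document ids and the query result are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall(retrieved={1, 2, 3, 9}, relevant={1, 2, 3, 4, 5, 6})
print(p, r)
```

Here three of the four retrieved documents are relevant (precision 0.75), but only three of the six relevant documents were found (recall 0.5).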

IR and Text/Web Mining
• Semantic markup of Web or other text documents using
XML (eXtensible Markup Language)
• XML enables metadata / keyword harvesting from
document collections (e.g., Web screen-scraping)
• Harvested metadata can be stored in a Data Warehouse for
mining -- this is clearly an example of a materialized view
of distributed data sources
• Other metrics: "similarity" to other documents
(e.g., common keywords, common keyphrases)
• Application area: Automated Recommendation System

Information Retrieval Issues
• Semantic content of documents
• Unstructured versus structured content
• Multi-modal content (image, text, numeric)
• Reliability of sources
• Quality of sources
• Indexing for efficient & effective access
• Similarity metrics (e.g., how do you do a
Groupby or a Roll-up ?)
• Privacy, Copyright, Intellectual Property

IR and Image Mining
• Image Mining is a form of IR and data mining
• Techniques:
– Wavelet analysis and summarization
– Pixel value (color) histograms and vectorization
– Scene pattern recognition and indexing
– Event/anomaly detection and cataloguing
(e.g., forest fires seen in satellite photos)
– Edge detection (unsharp masking) and graphs
• The data to be mined are the information databases
extracted from the images (not the raw image data
themselves)

Data Mining as “Rule Induction”


From Lecture #1

Decision Tree Classification:
based on rules at each node of the tree

Should I play
tennis today?


Intelligent actions (decision support) are
often represented by a set of rules…

IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"

(example of Decision Tree rules)
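
The five rules above can be held as data and applied by a small matcher. This is a sketch: the `RULES` layout and the function name are illustrative, and the age band is written with three ASCII dots ("31...40") for ease of typing.

```python
# Each rule: (conditions that must all hold, predicted buys_computer value).
RULES = [
    ({"age": "<=30", "student": "no"}, "no"),
    ({"age": "<=30", "student": "yes"}, "yes"),
    ({"age": "31...40"}, "yes"),
    ({"age": ">40", "credit_rating": "excellent"}, "yes"),
    ({"age": ">40", "credit_rating": "fair"}, "no"),
]

def buys_computer(record):
    """Return the outcome of the first matching rule, or None if none match."""
    for conditions, outcome in RULES:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return outcome
    return None
```

Because these particular rules are mutually exclusive, the order in which they are tried does not change the answer.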


Rule-Based Algorithms (RBA)
• RBA = Decision Support via "if-then rules"
• Can generate the rules from a Decision Tree (DT).
• But, rules do not need to be derived from a DT.
• Rules have no order, unlike Decision Trees.
• Trees are built by examining all cases; whereas
rules are generated one case at a time.
• Rule Induction is the method for deriving rules.
• Case-Based Reasoning (CBR) is a related
application of rule-based algorithms.

Sometimes the rules are fuzzy…

(example of Fuzzy Rule Induction)


Fuzzy Sets and Logic


Fuzzy Sets and Logic
• Data mining does not always yield absolute answers, but
statistical answers that indicate the probability or frequency
of occurrence of patterns or classes, or the likelihood that
an object in the database belongs to a given class.
• In predictive data mining, the result is fuzzy (e.g.,
predicting loan default through bank account analysis
does not guarantee that the customer will indeed default
on their loan).
• Fuzzy Logic is a method for handling uncertainty in
data, in decision-making, and in control systems.

Sets and Logic - Classical (Boolean)


Sets and Logic - Fuzzy


Classical versus Fuzzy


Fuzzy Logic, Control Systems, and Data Mining

• Suppose you have a R/T (real-time) data monitoring
(data mining) control system attached to machinery in a
large manufacturing plant.
• Temperature sensor on a machine says that it is running
very hot (... what is "hot"? -- that's fuzzy).
• Motion sensor within machine says that it is running at
high RPM, very fast (… what is "fast"? -- that's fuzzy).
• The machine is not technically over-heating, which you
know because of past experience and common sense.
• Control System responds to data and knowledge-base by
invoking a rule to slow down the motor speed a little bit.
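
A minimal sketch of how such a rule might be evaluated. The membership thresholds (60 to 90 degrees C for "hot", 3000 to 5000 RPM for "fast"), the maximum speed cut of 20%, and the use of the minimum as the fuzzy AND are all assumptions for illustration; a real controller would tune these from data and experience.

```python
def ramp(x, low, high):
    """Membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def hot(temp_c):
    return ramp(temp_c, 60.0, 90.0)      # assumed thresholds

def fast(rpm):
    return ramp(rpm, 3000.0, 5000.0)     # assumed thresholds

def slowdown_fraction(temp_c, rpm, max_cut=0.2):
    """Rule: IF hot AND fast THEN slow the motor a little.

    Fuzzy AND is taken as the minimum of the two memberships; the
    rule strength scales the speed reduction up to `max_cut`.
    """
    strength = min(hot(temp_c), fast(rpm))
    return max_cut * strength
```

With these assumed thresholds, a machine at 75 degrees C and 4500 RPM is "hot" to degree 0.5 and "fast" to degree 0.75, so the rule fires at strength 0.5 and asks for a 10% speed reduction.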

Application of Fuzzy Logic to Data Mining - 1
<http://www.cs.uah.edu/~thinke/CS687/Fall97/Tech/rahul_dbase_paper.html>
Direct Mailing System
• The problem is to identify customers in a customer database who can be
targeted for a sale, under the assumption that these customers responded
positively to advertisements mailed to them. An additional constraint is that
the mailing budget is limited, so the number of advertisements mailed
must be controlled to increase profit. The first step involves analyzing the
database for attributes like "frequency of visits to the store", "sum of
purchases", etc. Analysis and plots of the data then determine the cluster of
good customers. Next, one has to find the attribute relationships to define a
query condition, which is represented by a pair of attributes and a fuzzy
linguistic value. One then verifies and refines the query condition by using
another customer database. The customer database is then ranked and sorted
by degree values based on the given fuzzy query condition. The customers
retrieved by the query form the list of potential good customers.


Application of Fuzzy Logic to Data Mining - 2
<http://www.cs.uah.edu/~thinke/CS687/Fall97/Tech/rahul_dbase_paper.html>

Vibration Sensor
• A product used to sense vibrations and predict the causes of
these vibrations (e.g., earthquakes) was improved by utilizing fuzzy
rules. The original sensor was based on a simple threshold rule. The error rate
for this sensor was around 12%. The fuzzy rules were created by analyzing
the actual data in specified cases of earthquakes, automobiles, etc. Feature
extraction was done on the data set to identify each kind of cause.
Relationships between the feature parameters and the kind of vibration were
discovered to develop the fuzzy rules. These rules were then tested and
refined. The accuracy of the sensor's prediction improved dramatically, with
the error rate falling to within 1%.


Non-Fuzzy Logic System


Adaptive Fuzzy Logic System
This example is related
to air conditioner settings
in a warm room, but the
adaptive fuzzy logic system
may be applied to activate
other "thinking machines".


Machine Learning – a tool for
Data Mining and Intelligent
Decision Support


Machine Learning
• What is Machine Learning? -- “ML is the application of
computer algorithms that improve automatically
through experience.”
• Why is ML applicable to Data Mining? --
– Refer to earlier slide “Reasons for the growing popularity of
data mining” :
• Growing Data Volume -- ML enables the intelligent analysis of
overwhelmingly large data/knowledge repositories
• Limitations of Human Analysis -- ML enables automated searches for
complex multifactor dependencies in data
• Low Cost of Machine Learning -- machines and software are cheaper
than people; the ML process is repeatable, consistent, and robust in
handling very large data analysis tasks; adaptive ML algorithms can
scale with the problem.

Machine Learning and Data Mining
• ML Techniques for DM (to be covered later):
– Decision Trees
– Rule Mining and Rule Learning
– Case-Based Reasoning (CBR)
– Neural Nets (NN)
– Supervised and Unsupervised Learning
– Support Vector Machines (SVM)
– Bayesian Networks
– Genetic Algorithms (GA)


Neural Nets
• “Neural networks are the second best way of
doing just about anything.” (John Denker)

[Figure labels: Data, Neural Network, Fuzzy Rules]

• The best way “is to apply all available domain
knowledge and spend a considerable amount
of time, money and effort in building a rule
system that will give the right answer. The
second best way of doing anything is to learn
from experience.” (Burbidge & Buxton)

Supervised vs. Unsupervised Learning
• In Supervised Learning algorithms, a training
set is provided (data with correct answers),
which is used to mine for known patterns.
• In Unsupervised Learning algorithms, data are
provided with no a priori knowledge of the
hidden patterns (knowledge) that they contain.
The goal is to discover (learn) these patterns.
• A class known as Semi-Supervised Learning
also exists, where knowledge is known and
applied from one data collection in order to
mine, analyze, classify, and interpret a related
data collection.
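
The contrast between the first two modes can be sketched on one-dimensional data. The supervised part learns one mean per known class from labeled examples (a nearest-centroid classifier); the unsupervised part splits unlabeled values into two groups with a bare-bones k-means. Both algorithms, the class names, and the numbers are choices made for this sketch rather than methods prescribed in the lecture.

```python
def train_centroids(labeled):
    """Supervised: learn one mean per known class from (value, label) pairs."""
    sums, counts = {}, {}
    for value, label in labeled:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(value, centroids):
    """Assign the class whose learned mean is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two clusters (1-D k-means).

    Assumes the input contains at least two distinct values.
    """
    c0, c1 = min(values), max(values)          # simple initial guesses
    for _ in range(iterations):
        g0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        g1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return sorted(g0), sorted(g1)

# Training set with correct answers (hypothetical measurements).
training = [(1.0, "class_a"), (1.2, "class_a"), (3.0, "class_b"), (3.4, "class_b")]
centroids = train_centroids(training)

# The same kind of data with no labels at all.
clusters = two_means([1.0, 1.2, 3.0, 3.4, 1.1])
```

The supervised model can name the class of a new value, for example `classify(2.9, centroids)`, while the unsupervised result only says which values belong together.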

Machine Learning, Data Mining, and
Support Vector Machines (SVM)
• SVM is the tool of choice for the application of
ML to the data mining classification problem.
• So what are they? … "a statistical learning
system for predictive data mining -- for
estimating regression functions."
• Loads of information available here:
http://www.cs.rpi.edu/~bij2/svm.html
http://www.kernel-machines.org/tutorial.html


SVM Process Overview
[Figure: flow diagram with boxes labeled Data, Initial Classification, SVM Training, Weights, SVM Classification, Elements In Classification, and Elements Out of Classification]


SVM Classification
• SVM attempts to find an optimal separating
hyperplane between members of the two
initial classifications.

[Figure: a separating hyperplane between Class "A" and Class "B"]


SVM Class Separation Problem
• An optimal hyperplane partitions the initial
classification correctly and maximizes distance
from the plane to elements on either 'side':
positive and negative examples.
• When the training examples (initial classification)
consist of very diverse expression patterns, then
finding an optimal hyperplane can be impossible.
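
The "distance from the plane" idea can be made concrete by computing the margin of a candidate hyperplane w·x + b = 0, that is, the smallest signed distance from the plane to any training example. The sketch below only compares two hand-picked candidates; it does not perform the optimization an SVM trainer carries out, and the points and weights are invented.

```python
import math

def margin(w, b, examples):
    """Smallest signed distance from the hyperplane w.x + b = 0 to any example.

    examples: list of (point, label) with label +1 or -1.
    A negative result means at least one example is on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in examples)

# Hypothetical positive and negative examples in two dimensions.
examples = [((2.0, 2.0), +1), ((3.0, 3.0), +1),
            ((0.0, 0.0), -1), ((1.0, 0.0), -1)]

m_a = margin((1.0, 1.0), -2.5, examples)   # candidate hyperplane A
m_b = margin((1.0, 0.0), -1.5, examples)   # candidate hyperplane B
```

Both candidates classify every example correctly, but A leaves a wider gap, so A is the better of the two by the maximum-margin criterion.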


SVM Kernel Construction
The expression data can be transformed to a higher
dimensional space (feature space) by applying a
kernel function. This transformation can have the
effect of allowing a separating hyperplane to be
found.
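
A small sketch of why moving to a higher-dimensional space can help. The one-dimensional points below cannot be split by any single threshold, but after the assumed mapping x -> (x, x^2) a horizontal line separates them. The explicit `feature_map` is used here for clarity; a kernel function obtains the same effect by computing inner products in feature space without building the mapped points.

```python
# 1-D points: class +1 sits on both sides of class -1, so no single
# threshold on x separates them. (Illustrative data only.)
points = [(-2.0, +1), (-1.5, +1), (-0.5, -1), (0.0, -1), (0.4, -1), (1.8, +1)]

def separable_by_threshold(values_and_labels):
    """True if some threshold on a 1-D value splits the two classes."""
    pos = [v for v, y in values_and_labels if y > 0]
    neg = [v for v, y in values_and_labels if y < 0]
    return max(pos) < min(neg) or max(neg) < min(pos)

def feature_map(x):
    """Assumed mapping to a higher-dimensional feature space: x -> (x, x^2)."""
    return (x, x * x)

# In the original space the classes interleave.
original_ok = separable_by_threshold(points)

# In feature space the second coordinate (x^2) alone separates them,
# i.e. a line of constant x^2 is a separating hyperplane.
mapped = [(feature_map(x)[1], y) for x, y in points]
mapped_ok = separable_by_threshold(mapped)
```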


Practical SVM Issues
• Results depend heavily on the input
parameters.
• Using a high degree kernel function risks
artificial separation of the data.
• An iterative approach to increasing the
kernel power is advisable.


SVM Results
• Two classes are produced:
– Positive Class: contains elements with expression
patterns similar to those in the positive examples in the
training set.
– Negative Class: contains all other members of the input
set.
• Each of these classes has elements that fall in two groups:
– Those initially in the class (true positives and true
negatives)
– Those recruited into the class (false positives and false
negatives)
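
The four groups can be counted by comparing the known labels with the classifier's output. The label lists below are invented; `True` stands for the positive class.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted and truth:
            counts["TP"] += 1      # in the positive class, correctly
        elif predicted and not truth:
            counts["FP"] += 1      # recruited into the positive class
        elif not predicted and not truth:
            counts["TN"] += 1      # in the negative class, correctly
        else:
            counts["FN"] += 1      # recruited into the negative class
    return counts

# Hypothetical ground truth and classifier output.
truth = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
counts = confusion_counts(truth, predicted)
```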


Machine Learning Resources
• 1. Massive compilation of ML resources at :
http://home.earthlink.net/~dwaha/research/machine-learning.html
• 2. Excellent Reference Book: Tom Mitchell's
"Machine Learning" (1997; McGraw-Hill) :
http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
• 3. Machine Learning & Data Mining Resources :
My favorite ML site …
http://www.mlnet.org/ Click on Software
… a site dedicated to "machine learning,
knowledge discovery, case-based reasoning,
knowledge acquisition, and data mining."


Recap of ML and DM
• DM requires machine assistance in the search and analysis of very
large (often distributed, heterogeneous) databases
• Intelligent analysis of complex multi-dimensional multiple-
dependency data also demands machine assistance
• Algorithms for DM are most efficient when they are adaptable to
the type and content of the data (i.e., the system "learns")
• Machines are less expensive than humans
• Machines are usually scalable as the problem size grows
• Actionable data (the end-goal of DM) depends in many cases on an
embedded ML algorithm to take appropriate action (in control
systems; decision-support systems; robotics; autonomous systems)
• ML and DM are historically, technically, and functionally
intertwined (e.g., some data mining research groups call themselves
Machine Learning Groups)

Steps in the Data Mining Process


Steps in the Data Mining Process
http://www.cs.sfu.ca/~han/DM_Book.html
• Learning the application domain:
– relevant prior knowledge and goals of DM application
• Creating a target data set: Data selection
• Data cleaning and preprocessing: (may take 40-60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing data mining functions
– summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining & KDD: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Using the discovered knowledge = Actionable Data!

Steps in the Data Mining Process - Pictorial View


Cleaning the "Dirty Data"
• Excellent reference: Dorian Pyle's book "Data Preparation
for Data Mining" (1999, Morgan Kaufmann; 540pp)
• Frequent problem: missing (NULL) values
• Empty value ≠ Missing value (must treat each case
differently)
• Various options for NULLs (may introduce bias):
– use "fill value" (e.g., -999)
– use estimated value (prediction from data model)
– use interpolated value (from surrounding entries)
– ignore any records with nulls
• November 2003 Workshop on Data Cleaning:
http://dimacs.rutgers.edu/Workshops/DataCleaning/
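
The four options for NULLs listed above can be sketched on a short list of readings, with `None` standing in for NULL. The readings are invented, the "data model" in option 2 is reduced to a plain mean, and the interpolation assumes the first and last entries are known.

```python
readings = [10.0, None, 14.0, None, None, 20.0]   # None marks a NULL

def fill_constant(values, fill=-999):
    """Option 1: replace NULLs with a sentinel fill value."""
    return [fill if v is None else v for v in values]

def fill_mean(values):
    """Option 2: a very simple data model, the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def fill_interpolated(values):
    """Option 3: linear interpolation between the nearest known neighbours.

    Assumes the first and last entries are known.
    """
    out = list(values)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while out[j] is None:      # find the next known value
                j += 1
            left, right = out[i - 1], out[j]
            step = (right - left) / (j - i + 1)
            for k in range(i, j):
                out[k] = left + step * (k - i + 1)
            i = j
        else:
            i += 1
    return out

def drop_missing(values):
    """Option 4: ignore entries with NULLs."""
    return [v for v in values if v is not None]
```

Each option changes the data in a different way: the sentinel distorts any average computed later, the mean flattens variation, interpolation assumes smoothness, and dropping records shrinks the sample. This is the bias the slide warns about.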


Data Preprocessing (Laundering the Data)
(may take 40-80% of the total data mining project effort!)
(Reference: "Data Scrubbing" article in Computerworld 2003)


"Data Scrubbing by the Numbers"
(http://www.computerworld.com/printthis/2003/0,4814,78260,00.html)

Here are some of the findings:
• Data cleansing accounts for up to 70% of the cost and effort of
implementing most data warehouse projects, according to analysts.
• In 2001, The Data Warehousing Institute estimated that dirty data
costs U.S. businesses $600 billion per year.
• Data cleanliness and quality was the No. 2 problem -- right behind
budget cuts -- cited in a 2003 IDC survey of 1,648 companies
implementing business analytics software enterprise-wide.
• Only 23% of 130 companies surveyed by Cutter Consortium on their
data warehousing and business-intelligence practices use specialized
data cleansing tools.
• Of those companies in the Cutter Consortium study using specialized
data scrubbing software, 31% are using tools that were built in-house.


Major Issues in Data Mining


Major Issues in Data Mining (1)
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling of noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Handling very large data volumes (the "data flood")
– Efficiency and scalability of data mining algorithms
– Parallel, distributed, and incremental mining methods


Major Issues in Data Mining (2)
• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and global
information systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
– Integration of the discovered knowledge with existing knowledge:
A knowledge fusion problem
– Protection of data security, integrity, and privacy
• Dirty data (60% of the effort, or more)
– Preparing the data for mining (transformation, cleaning, processing)

Case Study - The Mars Rover

http://mars.jpl.nasa.gov/mer/mission/spacecraft_surface_rover.html


Data Mining in Action

• Data Mining facilitates
Intelligent Data
Understanding

• Data Mining enables
Decision Support and
Active Control Systems


What is Intelligent Data Understanding?
• IDU refers to the application of techniques for
transforming data into understanding.
… (sound familiar?)
Data → Information → Knowledge → Understanding / Wisdom!

• Web reference: http://is.arc.nasa.gov/IDU/index.html
• IDU specifically refers to automating the following
techniques for machine-assisted data analysis:
– Data Mining (e.g., http://is.arc.nasa.gov/IDU/tasks/NVODDM.html)
– Knowledge Discovery
– Machine Learning

Intelligent Data System Applications (1)

• Rove around the surface of Mars and take samples of
rocks (mass spectroscopy = a data histogram)
• Supervised Learning (search for rocks with known
compositions)
• Unsupervised Learning (discover what types of rocks
are present, without preconceived biases)
• Association Mining (find unusual associations)
• Clustering (find the set of unique classes of rocks)
• Classification (assign rocks to known classes)
• Deviation/Outlier Detection (one-of-a-kind; interesting?)

Intelligent Data System Applications (2)
• On-board Intelligent Data Understanding & Decision
Support Systems (Fuzzy Logic & Decision Trees &
Case-Based Reasoning) – Science Goal Monitoring:
– “stay here and do more”; or else “move on to another rock”
– “send results to Earth immediately”; or “send results later”
• Learn as it goes (Machine Learning & Neural Nets)
• Relate the results to other factors, such as dust storms
(XML & Information Retrieval & Information Fusion
with other data from orbiting satellite "mother ship")
• Predict where to go in order to find interesting rocks
(Logistic Regression & Case-Based Reasoning)

Mars Rover as an
Adaptive Fuzzy Logic System

• Decisions are based on data mined, prior
experience, new knowledge, and fuzzy logic
• Rover acts autonomously, without human
intervention, in Deep Space environment
• Actions are driven by mining actionable
data from all sensors

Summary


Summary of Topics Covered
• Summary of "What is Data Mining?" Tutorial
• Foundations of Data Mining
• Database Systems
• Data Warehousing and OLAP
• Statistics and Data Mining
• Information Retrieval
• Data Mining as "Rule Induction"
• Fuzzy Sets and Logic
• Machine Learning
• Steps in the Data Mining Process
• Major Issues in Data Mining
• A Case Study: The NASA Mars Rover
