3. INTRODUCTION
Data mining: the analysis step of the "Knowledge Discovery in Databases" (KDD) process
Extracts hidden information from data
An interdisciplinary subfield of computer science
The computational process of discovering patterns in large data sets
Involves methods at the intersection of artificial intelligence, machine learning, statistics, and database systems
4. INTRODUCTION (CONTD.)
The overall goal of the data mining process is to
extract information from a data set and transform
it into an understandable structure for further use.
Aside from the raw analysis step, it involves:
• database and data management aspects
• data pre-processing
• model and inference considerations
• complexity considerations
• post-processing of discovered structures
• visualization
• online updating
5. Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
E.g., global backbone telecommunication networks carry tens of petabytes every day
(1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte)
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras,…
7. What Motivated Data Mining?
We are drowning in data, but starving for
knowledge!
8. Evolution of Database Technology
Data mining can be viewed as a result of the natural evolution of IT
1960s:
Data collection, database creation and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
9. Evolution of Database Technology
1990s:
Data mining, data warehousing, multimedia
databases, and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global
information systems
11. What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Watch out: Is everything "data mining"? No, the following are not:
Simple search and query processing
(Deductive) expert systems
12. Data Mining: Confluence of Multiple Disciplines
[Figure: Data mining at the confluence of database technology, machine learning, pattern recognition, statistics, algorithms, visualization, and other disciplines]
13. Knowledge Discovery (KDD) Process
Data mining: core of the knowledge discovery process
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
14. Knowledge Process
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: to combine multiple data sources
3. Data selection: to retrieve the data relevant to the analysis
4. Data transformation: to transform data into a form appropriate for data mining
5. Data mining: an essential process where intelligent methods are applied to extract data patterns
6. Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on interestingness measures
7. Knowledge presentation: visualization and representation techniques
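The seven steps can be sketched as a chain of small functions. This is only a minimal illustration, not a prescribed implementation: the records, the function names, and the choice of item counting as the "mining" step are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
store_a = [
    {"cust": "c1", "items": ["milk", "bread"]},
    {"cust": "c2", "items": ["milk", None]},   # noisy entry
    {"cust": "c3", "items": []},               # empty transaction
]
store_b = [
    {"cust": "c4", "items": ["Bread", "milk", "eggs"]},
    {"cust": "c5", "items": ["eggs"]},
]

def clean(records):
    """Step 1: drop missing item values and empty transactions."""
    cleaned = []
    for r in records:
        items = [i for i in r["items"] if i]
        if items:
            cleaned.append({"cust": r["cust"], "items": items})
    return cleaned

def integrate(*sources):
    """Step 2: combine multiple sources into one list."""
    return [r for src in sources for r in src]

def select(records):
    """Step 3: keep only the field relevant to the analysis."""
    return [r["items"] for r in records]

def transform(transactions):
    """Step 4: normalise case and convert each transaction to a set."""
    return [frozenset(i.lower() for i in t) for t in transactions]

def mine(transactions):
    """Step 5: count in how many transactions each item occurs."""
    return Counter(item for t in transactions for item in t)

def evaluate(counts, n, min_support):
    """Step 6: keep items whose support meets the threshold."""
    return {item: c / n for item, c in counts.items() if c / n >= min_support}

def present(patterns):
    """Step 7: render the surviving patterns as text lines."""
    return [f"{item}: support={s:.2f}" for item, s in sorted(patterns.items())]

transactions = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(transactions), len(transactions), min_support=0.5)
report = present(patterns)
```

Each stage takes the previous stage's output, which mirrors the ordering of the list; in practice the steps are often iterated rather than run once.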
15. Example: A Web Mining Framework
Web mining usually involves
Data cleaning
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Data mining
Presentation of the mining results
Patterns and knowledge to be used or stored in a knowledge base
16. Data Mining in Business Intelligence
[Figure: pyramid of layers; potential to support business decisions increases from bottom to top]
Layers, top to bottom:
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
17. KDD Process: A Typical View from ML and Statistics
Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing
Data Pre-Processing:
Data integration
Normalization
Feature selection
Dimension reduction
Data Mining:
Pattern discovery
Association & correlation
Classification
Clustering
Outlier analysis
…
Post-Processing:
Pattern evaluation
Pattern selection
Pattern interpretation
Pattern visualization
This is a view from typical machine learning and statistics communities
18. Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data
Multimedia database
Text databases
The World-Wide Web
19. RDBMS
A relational database is a collection of tables of data items, all of
which are formally described and organized according to the
relational model.
Data in a single table represents a relation.
Each table schema must identify a column or group of
columns, called the primary key, to uniquely identify each row.
A relationship can then be established between each row in
the table and a row in another table by creating a foreign key,
a column or group of columns in one table that points to the
primary key of another table.
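As a small illustration of primary and foreign keys, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table and column names (customer, purchase, cust_id, trans_id) are assumptions chosen for the example, and the helper try_insert is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE customer (
        cust_id INTEGER PRIMARY KEY,   -- uniquely identifies each row
        name    TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE purchase (
        trans_id INTEGER PRIMARY KEY,
        cust_id  INTEGER NOT NULL REFERENCES customer(cust_id),  -- foreign key
        amount   REAL NOT NULL
    )
""")

conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO purchase VALUES (100, 1, 59.90)")

def try_insert(sql):
    """Return True if the statement succeeds, False on a constraint violation."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# A second row with the same primary key is rejected.
duplicate_key_ok = try_insert("INSERT INTO customer VALUES (1, 'Bob')")
# A purchase pointing at a customer that does not exist is rejected.
dangling_ref_ok = try_insert("INSERT INTO purchase VALUES (101, 99, 10.0)")
```

Both inserts fail, so each flag ends up False: the primary key keeps rows unique, and the foreign key guarantees that every purchase row refers to an existing customer row.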
20. RDBMS
• Database normalization: the relational model offers various levels of refinement of table organization and reorganization.
• The DBMS of a relational database is called an RDBMS; it is the software that manages a relational database.
• The relational database was first defined in June 1970 by Edgar Codd, of IBM's San Jose Research Laboratory.
• Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules.
• The relational database has become the predominant choice for storing data.
22. RDBMS (contd.)
Example: AllElectronics (a company described by the relation tables customer, item, employee, and branch)
Relation: customer is a group of entities describing customer information (cust_id, cust_name, age, occupation, annual income, credit information, and category)
Tables: also used to represent the relationships between or among multiple entities
Database queries (SQL): access data using relational operations such as join, selection, and projection
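The three relational operations can be shown with a few SQL queries run through Python's sqlite3 module. This is a sketch only: the purchases table, the sample rows, and the reduced set of customer columns are assumptions for illustration, not the AllElectronics schema itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, age INTEGER, income INTEGER);
    CREATE TABLE purchases (cust_id INTEGER, item TEXT, price REAL);
    INSERT INTO customer VALUES (1, 'Ann', 25, 45000), (2, 'Raj', 41, 82000), (3, 'Li', 33, 30000);
    INSERT INTO purchases VALUES (1, 'laptop', 900.0), (2, 'printer', 650.0), (2, 'camera', 300.0);
""")

# Selection: keep only the rows that satisfy a condition (WHERE).
selection = conn.execute(
    "SELECT * FROM customer WHERE income > 40000 ORDER BY cust_id"
).fetchall()

# Projection: keep only some of the columns.
projection = conn.execute(
    "SELECT name, age FROM customer ORDER BY cust_id"
).fetchall()

# Join: combine rows of two tables that share a cust_id value.
join = conn.execute("""
    SELECT c.name, p.item
    FROM customer AS c
    JOIN purchases AS p ON p.cust_id = c.cust_id
    ORDER BY c.cust_id, p.item
""").fetchall()
```

Selection filters rows, projection filters columns, and the join pairs each customer with their purchases; customers without purchases (Li here) do not appear in an inner join.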
23. Mining Relational databases
Data mining can go further than querying by searching for trends or data patterns
Examples
Analyze customer data to predict the risk of customers based on their income and age
Detect deviations: e.g., compare sales with the previous year
Relational databases are among the most commonly available and richest information repositories for data mining
24. What is a Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from
the organization’s operational database
Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
Data warehousing:
The process of constructing and using data warehouses
25. DATA WAREHOUSES
A data warehouse is a repository of information collected from
multiple sources, stored under a unified
schema.
Constructed via
Data cleaning
Data integration
Data transformation
Data Loading and periodic data refreshing
27. DATA WAREHOUSES(contd…)
A data warehouse is modeled by a multidimensional data structure
Data cube: precomputation and fast access of summarized data
Each dimension corresponds to an attribute or a set of attributes in a schema
Each cell stores the value of some aggregate measure (count, sum, etc.)
Example:
In AllElectronics the cube has three dimensions:
• Address (with values U.S.A., Canada, Mexico)
• Time (with quarter values Q1, Q2, Q3, Q4)
• Item (with item type values)
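To make "precomputation and fast access of summarized data" concrete, here is a minimal pure-Python sketch of a three-dimensional cube whose cells hold the sum of sales. The fact rows, the ALL marker, and the function name build_cube are assumptions made for this example; real warehouses use far more efficient cube materialization.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (country, quarter, item type, sales amount).
facts = [
    ("U.S.A.", "Q1", "TV", 100),
    ("U.S.A.", "Q2", "TV", 150),
    ("U.S.A.", "Q1", "PC", 200),
    ("Canada", "Q1", "TV", 80),
    ("Mexico", "Q2", "PC", 60),
]

ALL = "*"  # marker meaning "summed over this dimension"

def build_cube(rows):
    """Precompute the sum of sales for every combination of dimension
    values, including the ALL marker on any subset of dimensions."""
    cube = defaultdict(int)
    for country, quarter, item, amount in rows:
        # Each fact contributes to 2^3 = 8 cells: every dimension is
        # either kept at its value or rolled up to ALL.
        for cell in product((country, ALL), (quarter, ALL), (item, ALL)):
            cube[cell] += amount
    return dict(cube)

cube = build_cube(facts)

tv_usa_year = cube[("U.S.A.", ALL, "TV")]   # total annual TV sales in U.S.A.
q1_total = cube[(ALL, "Q1", ALL)]           # all sales in the first quarter
grand_total = cube[(ALL, ALL, ALL)]
```

Once the cube is built, any summarized value is a single dictionary lookup, which is the trade the data cube makes: extra storage and build time in exchange for fast aggregate queries.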
28. Multidimensional Data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product: Industry -> Category -> Product
Location: Region -> Country -> City -> Office
Time: Year -> Quarter -> Month or Week -> Day
29. A Sample Data Cube
[Figure: data cube with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum); the highlighted cell is the total annual sales of TVs in U.S.A.]
30. Data mining functionalities
Tasks can be classified as:
Predictive (make predictions about values of data using known results found from different data)
Descriptive (characterize properties of a target data set)
Explore the properties of the data examined
Data mining functionalities are used to specify the kinds of patterns to be found:
Characterization and Discrimination
The mining of frequent patterns, associations and correlations
Classification and regression
Cluster analysis
Outlier analysis
31. Characterization and Discrimination
Data characterization is a summarization of the general
characteristics or features of a target class of data
Output of characterization can be presented in various forms
Pie charts
Bar charts
Curves
Multidimensional data cubes
Multidimensional tables
Descriptions can also be presented as generalized relations or as characteristic rules
Example: In AllElectronics, summarize the characteristics of customers who spend more than $5000 a year at AllElectronics
This can be viewed along any dimension, such as occupation, to view these customers according to their type of employment.
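The characterization example can be sketched in a few lines of Python. The customer records, field names, and the function characterize are hypothetical; the point is only to show the two moves involved: select the target class, then summarize it along a chosen dimension.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical customer records (illustrative values only).
customers = [
    {"occupation": "engineer", "age": 34, "yearly_spend": 7200},
    {"occupation": "engineer", "age": 46, "yearly_spend": 5600},
    {"occupation": "teacher",  "age": 29, "yearly_spend": 6100},
    {"occupation": "teacher",  "age": 51, "yearly_spend": 1200},
    {"occupation": "student",  "age": 21, "yearly_spend": 800},
]

def characterize(records, min_spend, dimension):
    """Summarize the target class (spend > min_spend) along one dimension."""
    groups = defaultdict(list)
    for r in records:
        if r["yearly_spend"] > min_spend:       # select the target class
            groups[r[dimension]].append(r)
    return {
        value: {"count": len(rs), "avg_age": mean(r["age"] for r in rs)}
        for value, rs in groups.items()
    }

summary = characterize(customers, min_spend=5000, dimension="occupation")
```

Passing a different dimension name (for example "age" after binning) would give the same target class viewed from another angle, which is what "viewed along any dimension" refers to.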
32. Data Discrimination
Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or more contrasting classes
Output representation is similar to that of characterization descriptions
Discrimination descriptions expressed in the form of rules are called discriminant rules
Target and contrasting classes are specified by the user
Example:
A user wants to compare the general features of software products whose sales increased by 10% against those whose sales decreased by 30% during the same period
33. Mining Frequent Patterns, Associations, Correlations
Frequent pattern
Frequent itemsets (milk, bread)
Frequent subsequences (laptop, digital camera, memory card)
Frequent substructures (graphs, trees)
Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
34. Association analysis (example)
Items frequently purchased together
buys(X, "computer") => buys(X, "software")
[support = 1%, confidence = 50%]
X: a variable representing a customer
Confidence (certainty) of 50%: if a customer buys a computer, there is a 50% chance that they will buy software as well
Support of 1%: 1% of all transactions under analysis show computer and software purchased together
Single-dimensional association rule (one predicate, buys):
"computer => software [1%, 50%]"
age(X, "20..29") ^ income(X, "40K..49K") => buys(X, "laptop")
[support = 2%, confidence = 60%] (multidimensional association rule)
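Support and confidence can be computed directly from a list of transactions. The sketch below uses the usual definitions (support is the fraction of transactions containing all items of the rule; confidence is that support divided by the support of the antecedent). The transactions and function names are assumptions for illustration, and the numbers differ from the 1% and 50% of the slide.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions.
    Assumes the antecedent occurs in at least one transaction."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical transactions (illustrative data only).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "printer"},
]

# Rule: computer => software
rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
```

Here two of five transactions contain both items (support 0.4), and two of the four computer transactions also contain software (confidence 0.5). A rule is usually reported only if both values clear user-chosen minimum thresholds.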
35. Classification and Regression for Predictive Analysis
Classification: the process of finding a model (function) that describes and distinguishes data classes or concepts
Model derived from analysis of a set of training data
Models are represented as
Classification rules (IF-THEN rules)
Decision trees
Mathematical formulae or neural networks
Regression: a statistical methodology for numeric prediction
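As a rough illustration of both ideas, the sketch below learns a single IF-THEN rule (a one-level decision tree, often called a decision stump) from labelled training data, and fits a straight line by ordinary least squares for numeric prediction. These are deliberately tiny stand-ins chosen for the example, not the methods any particular tool uses; the data and the names fit_stump and fit_line are hypothetical.

```python
def fit_stump(xs, labels):
    """Learn a rule of the form IF x <= t THEN a ELSE b by trying each
    midpoint between sorted distinct values and keeping the threshold
    with the fewest training errors. Needs at least two distinct xs."""
    classes = sorted(set(labels))
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        for left in classes:
            for right in classes:
                errors = sum(
                    (left if x <= t else right) != y
                    for x, y in zip(xs, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

# Hypothetical training data: annual income (in $1000s) and a risk label.
incomes = [20, 25, 30, 60, 70, 80]
risk = ["high", "high", "high", "low", "low", "low"]
classify = fit_stump(incomes, risk)

# Hypothetical numeric data lying on the line y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

The stump predicts a class label for a new income value (for example classify(28)), while the fitted line predicts a number, which is the distinction the slide draws between classification and regression.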
36. Cluster Analysis and Outlier Analysis
Cluster Analysis:
Determines similarity among data on predefined attributes
The most similar data are grouped into clusters
Outlier Analysis:
Outliers: objects in the data set that do not comply with the general behavior or model of the data
Analysis of outlier data is referred to as outlier analysis or anomaly mining
Detected using statistical tests
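Both tasks can be sketched on one-dimensional data. The clustering part below is a plain k-means loop, and the outlier part uses a z-score test as one simple example of a statistical test; the slide does not name specific algorithms, so both choices, the data, and the threshold of 2.0 are assumptions for illustration.

```python
from statistics import mean, pstdev

def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest
    center, then move each center to the mean of its cluster."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters, centers

def zscore_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if sigma and abs(v - mu) / sigma > threshold]

# Hypothetical purchase amounts (illustrative data only).
amounts = [10, 12, 11, 50, 52, 51]
clusters, centers = kmeans_1d(amounts, centers=[0, 100])

# Hypothetical sensor readings with one value far from the rest.
readings = [10, 11, 9, 10, 12, 10, 11, 95]
outliers = zscore_outliers(readings)
```

The clustering groups the small and large amounts separately without being told any labels, and the z-score test flags only the reading that lies far from the bulk of the data. A single extreme value also inflates the mean and standard deviation, so more robust tests are often preferred in practice.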
37. Which Technologies Are Used?
[Figure: Data mining draws on machine learning, applications, algorithms, pattern recognition, statistics, visualization, database technology, and high-performance computing]
38. Potential Applications of Data Mining
Where there are data, there are data mining applications
Data analysis and decision support (Business Intelligence)
Market analysis and management
Risk analysis and management
Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (newsgroups, email, documents) and Web mining
Stream data mining
Bioinformatics and bio-data analysis
39. Major Issues in Data Mining (1)
Mining Methodology
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Mining various and new kinds of knowledge
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
40. Major Issues in Data Mining (2)
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining