Introduction to Data Mining


A young, fast-growing, and promising field

INTRODUCTION

• Data mining is the analysis step of the "Knowledge Discovery in Databases" (KDD) process
• It extracts hidden information from data
• It is an interdisciplinary subfield of computer science
• It is the computational process of discovering patterns in large data sets
• It involves methods at the intersection of artificial intelligence, machine learning, statistics, and database systems

INTRODUCTION (CONTD.)

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
Aside from the raw analysis step, it involves:
• database and data management aspects
• data pre-processing
• model and inference considerations
• complexity considerations, post-processing of discovered structures, visualization, and online updating

Why Data Mining?

• The explosive growth of data: from terabytes to petabytes
  • E.g., global backbone telecommunication networks carry tens of petabytes every day
    (1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte)
• Data collection and data availability
  • Automated data collection tools, database systems, the Web, computerized society
• Major sources of abundant data
  • Business: Web, e-commerce, transactions, stocks, …
  • Science: remote sensing, bioinformatics, scientific simulation, …
  • Society and everyone: news, digital cameras, …

Why Data Mining?

“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets.

What Motivated Data Mining?

We are drowning in data, but starving for knowledge!

Evolution of Database Technology

Data mining can be viewed as a result of the natural evolution of IT.

• 1960s:
  • Data collection, database creation, and network DBMS
• 1970s:
  • Relational data model, relational DBMS implementation
• 1980s:
  • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific, engineering, etc.)

Evolution of Database Technology (contd.)

• 1990s:
  • Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global information systems

What Is Data Mining?

• Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: is everything “data mining”? No. The following are not data mining:
  • Simple search and query processing
  • (Deductive) expert systems

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the center, drawing on database technology, machine learning, pattern recognition, statistics, algorithms, visualization, and other disciplines]

Knowledge Discovery (KDD) Process

Data mining is the core of the knowledge discovery process.

[Figure: KDD pipeline, from bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Knowledge Process

1. Data cleaning: remove noise and inconsistent data
2. Data integration: combine multiple data sources
3. Data selection: retrieve the data relevant to the analysis
4. Data transformation: transform data into a form appropriate for data mining
5. Data mining: the essential step, where intelligent methods are applied to extract data patterns
6. Pattern evaluation: identify the truly interesting patterns representing knowledge, based on interestingness measures
7. Knowledge presentation: present the mined knowledge using visualization and representation techniques
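
The seven steps can be pictured as a chain of small functions. The following is only a toy sketch in Python: the two record sources, the field names, and the frequency-count "mining" step are all hypothetical and chosen purely to make each step visible. Real systems use far richer methods at every stage.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "item": "milk"}, {"customer": "c2", "item": None}]
store_b = [{"customer": "c3", "item": "Milk "}, {"customer": "c4", "item": "bread"}]

def clean(records):
    """Step 1: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine multiple sources into one collection."""
    return [r for source in sources for r in source]

def select(records, field):
    """Step 3: keep only the attribute relevant to the analysis."""
    return [r[field] for r in records]

def transform(values):
    """Step 4: bring values into one consistent form."""
    return [v.strip().lower() for v in values]

def mine(values):
    """Step 5: extract a very simple pattern, here item frequencies."""
    return Counter(values)

def evaluate(counts, min_count):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return {item: n for item, n in counts.items() if n >= min_count}

def present(patterns):
    """Step 7: render the surviving patterns for a human reader."""
    return [f"{item}: {n}" for item, n in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(records, "item"))), min_count=2)
report = present(patterns)
print(report)
```

Each function maps onto one numbered step, so any single stage can be swapped for a more realistic one without touching the others.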

Example: A Web Mining Framework

Web mining usually involves:
• Data cleaning
• Data integration from multiple sources
• Warehousing the data
• Data cube construction
• Data selection for data mining
• Data mining
• Presentation of the mining results
• Patterns and knowledge to be used or stored in a knowledge base

Data Mining in Business Intelligence

[Figure: a pyramid whose potential to support business decisions increases toward the top. Layers, from top to bottom:]
• Decision Making
• Data Presentation: visualization techniques
• Data Mining: information discovery
• Data Exploration: statistical summary, querying, and reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: paper, files, Web documents, scientific experiments, database systems

Roles shown alongside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA.

KDD Process: A Typical View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

• Data Pre-Processing
  • Data integration
  • Normalization
  • Feature selection
  • Dimension reduction
• Data Mining
  • Pattern discovery
  • Association & correlation
  • Classification
  • Clustering
  • Outlier analysis
  • …
• Post-Processing
  • Pattern evaluation
  • Pattern selection
  • Pattern interpretation
  • Pattern visualization

This is the typical view of the machine learning and statistics communities.

Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications
  • Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data (incl. bio-sequences)
  • Structured data, graphs, social networks, and multi-linked data
  • Object-relational databases
  • Heterogeneous databases and legacy databases
  • Spatial data
  • Multimedia databases
  • Text databases
  • The World-Wide Web

RDBMS

• A relational database is a collection of tables of data items, all of which are formally described and organized according to the relational model.
• The data in a single table represents a relation.
• Each table schema must identify a column or group of columns, called the primary key, that uniquely identifies each row.
• A relationship can then be established between a row in one table and a row in another table by creating a foreign key: a column or group of columns in one table that points to the primary key of another table.

RDBMS (contd.)

• Database normalization: the relational model offers various levels of refinement of table organization and reorganization.
• The DBMS of a relational database is called an RDBMS; it is the software that manages a relational database.
• The relational database was first defined in June 1970 by Edgar Codd, of IBM's San Jose Research Laboratory.
• Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules.
• Relational databases have become the predominant choice for storing data.

Relational database terminology

A relation is defined as a set of tuples that have the same attributes.

RDBMS (contd.)

• Example: AllElectronics, a company described by the relation tables customer, item, employee, and branch
• Relation: customer is a group of entities describing the customer information (cust_ID, cust_name, age, occupation, annual income, credit information, and category)
• Tables: can also be used to represent the relationships between or among multiple entities
• Database queries (SQL): access the data using relational operations such as join, selection, and projection
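
To make primary keys, foreign keys, and the three relational operations concrete, here is a small sketch using Python's built-in sqlite3 module. The customer table loosely follows the AllElectronics example; the purchase table, its columns, and all the row values are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE customer (
    cust_id   INTEGER PRIMARY KEY,                    -- primary key
    cust_name TEXT,
    age       INTEGER
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    cust_id     INTEGER REFERENCES customer(cust_id), -- foreign key
    amount      REAL
);
INSERT INTO customer VALUES (1, 'Ann', 34), (2, 'Bob', 52);
INSERT INTO purchase VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 40.0);
""")

rows = conn.execute("""
    SELECT c.cust_name, p.amount                      -- projection: keep two columns
    FROM customer AS c
    JOIN purchase AS p ON p.cust_id = c.cust_id       -- join through the foreign key
    WHERE p.amount > 50                               -- selection: filter rows
    ORDER BY p.amount
""").fetchall()
print(rows)
```

The query returns only Ann's two purchases above 50; Bob's purchase is filtered out by the selection.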

Mining Relational Databases

• Data mining can go further than querying by searching for trends or data patterns
• Examples:
  • Analyze customer data to predict the risk of customers based on their income and age
  • Detect deviations, e.g., by comparing sales with the previous year
• RDBMSs are among the most commonly available and richest information repositories for data mining

What is a Data Warehouse?

• Defined in many different ways, but not rigorously.
  • A decision support database that is maintained separately from the organization’s operational database
  • Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
  • The process of constructing and using data warehouses

DATA WAREHOUSES

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema.
It is constructed via:
• Data cleaning
• Data integration
• Data transformation
• Data loading and periodic data refreshing

DATA WAREHOUSES (contd.)

• A data warehouse is modeled by a multidimensional data structure
• Data cube: precomputation of, and fast access to, summarized data
  • Each dimension corresponds to an attribute or a set of attributes in a schema
  • Each cell stores the value of some aggregate measure (count, sum, etc.)
• Example: in AllElectronics the cube has three dimensions:
  • Address (with country values USA, Canada, Mexico)
  • Time (with quarter values Q1, Q2, Q3, Q4)
  • Item (with item type values)
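
A data cube can be sketched as a dictionary that stores a precomputed sum for every combination of dimension values, including "all" along any dimension. The code below is a minimal illustration under assumptions: the sales figures are invented, and the `ALL` marker and `build_cube` helper are hypothetical names, not part of any warehouse product.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sales facts: (country, quarter, item type, amount).
facts = [
    ("USA", "Q1", "TV", 100), ("USA", "Q2", "TV", 150),
    ("USA", "Q1", "PC", 200), ("Canada", "Q1", "TV", 50),
    ("Mexico", "Q2", "PC", 70),
]

ALL = "*"  # marker meaning "summed over this dimension"

def build_cube(facts):
    """Precompute the sum measure for every kept/rolled-up combination of dimensions."""
    cube = defaultdict(int)
    for country, quarter, item, amount in facts:
        # Each fact contributes to 2**3 = 8 cells: itself and all of its roll-ups.
        for c, q, i in product((country, ALL), (quarter, ALL), (item, ALL)):
            cube[(c, q, i)] += amount
    return cube

cube = build_cube(facts)
print(cube[("USA", ALL, "TV")])  # total annual TV sales in the USA
print(cube[(ALL, "Q1", ALL)])    # all sales in Q1
print(cube[(ALL, ALL, ALL)])     # grand total
```

Because every aggregate is computed once up front, each later lookup is a single dictionary access. This is the "precomputation and fast access" trade-off: more storage in exchange for quick answers.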

Multidimensional Data

Sales volume as a function of product, month, and region.

Dimensions: Product, Location, Time

Hierarchical summarization paths:
• Product: Industry → Category → Product
• Location: Region → Country → City → Office
• Time: Year → Quarter → Month or Week → Day

A Sample Data Cube

[Figure: a data cube with three dimensions: Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum). One highlighted cell holds the total annual sales of TVs in the U.S.A.]

Data mining functionalities

• Tasks can be classified as:
  • Predictive: make predictions about values of data using known results found from different data
  • Descriptive: characterize the properties of a target data set
    • Explore the properties of the data examined
• Data mining functionalities are used to specify the kinds of patterns to be found:
  • Characterization and discrimination
  • The mining of frequent patterns, associations, and correlations
  • Classification and regression
  • Cluster analysis
  • Outlier analysis

Characterization and Discrimination

• Data characterization is a summarization of the general characteristics or features of a target class of data
• The output of characterization can be presented in various forms:
  • Pie charts
  • Bar charts
  • Curves
  • Multidimensional data cubes
  • Multidimensional tables
• Descriptions can also be presented as generalized relations or as characteristic rules
• Example: in AllElectronics, summarize the characteristics of customers who spend more than $5000 a year at AllElectronics. The result can be viewed along any dimension, such as occupation, to view these customers according to their type of employment.
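
As a rough illustration of the AllElectronics example, the sketch below selects the target class (customers spending more than $5000 a year) and summarizes it along the occupation dimension. The customer records are invented for this purpose.

```python
from collections import Counter

# Hypothetical customer records: (name, occupation, yearly spend in dollars).
customers = [
    ("Ann", "engineer", 7200), ("Bob", "teacher", 1500),
    ("Cid", "engineer", 5400), ("Dee", "lawyer", 9100),
    ("Eve", "teacher", 6000),
]

# Target class: customers who spend more than $5000 a year.
target = [c for c in customers if c[2] > 5000]

# Summarize the target class along the occupation dimension.
by_occupation = Counter(occupation for _, occupation, _ in target)
share = {occ: n / len(target) for occ, n in by_occupation.items()}
print(share)
```

The resulting shares are the kind of figures a pie chart or bar chart of the characterization would display.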

Data Discrimination

• Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or more contrasting classes
• The output is represented in forms similar to those of characterization descriptions
• Discrimination descriptions expressed in the form of rules are called discriminant rules
• The target and contrasting classes are specified by the user
• Example: a user wants to compare the general features of software products whose sales increased by 10% with those of products whose sales decreased by 30% during the same period

Mining Frequent Patterns, Associations, Correlations

• Frequent patterns:
  • Frequent itemsets (e.g., milk and bread)
  • Frequent subsequences (e.g., laptop, then digital camera, then memory card)
  • Frequent substructures (graphs, trees)
• Mining frequent patterns leads to the discovery of interesting associations and correlations within data.


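Association analysis (example)

Items frequently purchased together:
  buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
• X is a variable representing a customer
• A confidence (certainty) of 50% means that if a customer buys a computer, there is a 50% chance that the customer will also buy software
• A support of 1% means that 1% of all the transactions under analysis show computer and software purchased together
• A rule that involves a single predicate (here, buys) is a single-dimensional association rule; it can be written simply as "computer => software [1%, 50%]"
• A rule that involves more than one attribute is a multidimensional association rule:
  age(X, "20..29") ^ income(X, "40K..49K") => buys(X, "laptop") [support = 2%, confidence = 60%]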
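
Support and confidence can be computed directly from a list of transactions. In the sketch below, support(A => B) is the fraction of transactions containing both A and B, and confidence(A => B) is support(A and B) divided by support(A). The five transactions are hypothetical, so the percentages differ from the 1% and 50% quoted above.

```python
# Hypothetical transactions, one set of items per purchase.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"computer => software [support={s:.0%}, confidence={c:.0%}]")
```

Here two of the five transactions contain both items (support 40%), and two of the three computer purchases also include software (confidence about 67%).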

Classification and Regression for Predictive Analysis

• Classification: the process of finding a model (function) that describes and distinguishes data classes or concepts
• The model is derived from the analysis of a set of training data
• Models can be represented as:
  • Classification rules (IF-THEN rules)
  • Decision trees
  • Mathematical formulae or neural networks
• Regression: a statistical methodology for numeric prediction
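
Two tiny sketches may help separate the two ideas. The first learns a single IF-THEN rule (a one-level decision tree, often called a decision stump) from labeled training data. The second fits a straight line by ordinary least squares for numeric prediction. The training data, labels, and function names are hypothetical, and both methods are deliberately the simplest representatives of their families.

```python
# Hypothetical training data: (age, class label).
training = [(22, "low"), (25, "low"), (31, "low"),
            (40, "high"), (47, "high"), (55, "high")]

def fit_stump(data):
    """Learn the rule IF x > t THEN 'high' ELSE 'low' with the fewest training errors."""
    xs = sorted(x for x, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]  # midpoints between neighbours

    def errors(t):
        return sum(("high" if x > t else "low") != y for x, y in data)

    return min(candidates, key=errors)

threshold = fit_stump(training)

def classify(x):
    """Apply the learned IF-THEN rule to a new value."""
    return "high" if x > threshold else "low"

def fit_line(points):
    """Ordinary least squares for the model y = a + b * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line([(1, 3), (2, 5), (3, 7), (4, 9)])  # points lying on y = 1 + 2x
print(threshold, classify(30), classify(50), a, b)
```

Classification returns a class label for a new object, whereas regression returns a number; both models are derived from training data.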

Cluster Analysis and Outlier Analysis

Cluster analysis:
• Determines the similarity among data objects on predefined attributes
• The most similar data objects are grouped into clusters
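
One common way to group similar objects is k-means, shown here as a minimal sketch. The slides do not name a specific algorithm, so k-means is an illustrative choice. The customer points are hypothetical, and starting from the first k points is an assumption made to keep the example deterministic.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centre, then recompute the centres."""
    centres = list(points[:k])  # deterministic start: the first k points (an assumption)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centres[j]))
            clusters[nearest].append(p)
        centres = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centres[j]
            for j, cluster in enumerate(clusters)
        ]
    return clusters

# Hypothetical customers described by (age, yearly income in thousands).
points = [(25, 30), (27, 32), (26, 29), (55, 90), (58, 95), (54, 88)]
clusters = kmeans(points, k=2)
print(clusters)
```

Similarity is measured here by Euclidean distance on the two predefined attributes. In practice the attributes are usually rescaled first so that no single one dominates the distance.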

Outlier analysis:
• Outliers: a data set may contain objects that do not comply with the general behavior or model of the data
• The analysis of outlier data is referred to as outlier analysis or anomaly mining
• Outliers can be detected using statistical tests
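
A simple statistical check is the z-score: flag values that lie far from the mean, measured in standard deviations. The sketch below uses Python's statistics module. The sales figures and the threshold of 2.0 are assumptions chosen for illustration, not recommended settings.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Flag values lying more than `threshold` sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily sales figures with one unusual day.
sales = [100, 102, 98, 101, 99, 103, 97, 100, 250]
outliers = zscore_outliers(sales)
print(outliers)
```

In a sample this small, one extreme value inflates the standard deviation, which is why a fairly low threshold is used here. More robust tests, for example ones based on the median, are often preferred in practice.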

Which Technologies Are Used?

[Figure: data mining at the center, surrounded by machine learning, pattern recognition, statistics, visualization, high-performance computing, database technology, algorithms, and applications]

Potential Applications of Data Mining

Where there are data, there are data mining applications.

• Data analysis and decision support (business intelligence)
  • Market analysis and management
    • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  • Risk analysis and management
    • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  • Fraud detection and detection of unusual patterns (outliers)
• Other applications
  • Text mining (newsgroups, email, documents) and Web mining
  • Stream data mining
  • Bioinformatics and bio-data analysis

Major Issues in Data Mining (1)

• Mining methodology
  • Mining various and new kinds of knowledge
  • Mining knowledge in multi-dimensional space
  • Data mining: an interdisciplinary effort
  • Boosting the power of discovery in a networked environment
  • Handling noise, uncertainty, and incompleteness of data
  • Pattern evaluation and pattern- or constraint-guided mining
• User interaction
  • Interactive mining
  • Incorporation of background knowledge
  • Presentation and visualization of data mining results

Major Issues in Data Mining (2)

• Efficiency and scalability
  • Efficiency and scalability of data mining algorithms
  • Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
  • Handling complex types of data
  • Mining dynamic, networked, and global data repositories
• Data mining and society
  • Social impacts of data mining
  • Privacy-preserving data mining
  • Invisible data mining
