This document provides an introduction to data mining. It discusses the evolution of data mining technology, defines what data mining is, and outlines common data mining tasks like classification, clustering, and association rule discovery. The document also examines the KDD process, different types of data that can be mined, and major issues in data mining like scalability, handling diverse data types, and integrating discovered knowledge.
Data Mining : Concepts and Techniques
1. By
R. Deepa (IT),
Batch: 2016-2019
Department of (CS&IT),
Nadar Saraswathi College of Arts Science, Theni.
2. CHAPTER : 1
INTRODUCTION
Evolution of Data Mining technology
What is Data Mining?
Data Mining Tasks
Data Mining : On what kind of data?
Are all the patterns interesting?
Data Mining : A KDD Process
Classification of Data Mining systems
Major Issues in Data mining
Summary
3. EVOLUTION OF DATA MINING TECHNOLOGY
| Evolution Step | Business Question | Enabling Technologies |
|---|---|---|
| Data Collection (1960s) | “What was my total revenue in the last five years?” | Computers, tapes, disks |
| Data Access (1980s) | “What were unit sales in New England last March?” | Relational databases (RDBMS), Structured Query Language (SQL), ODBC |
| Data Warehousing & Decision Support (1990s) | “What were unit sales in New England last March? Drill down to Boston.” | On-line analytic processing (OLAP), multidimensional databases, data warehouses |
| Data Mining (Emerging Today) | “What’s likely to happen to Boston unit sales next month? Why?” | Advanced algorithms, multiprocessor computers, massive databases |
R.Deepa ITData Mining: Concepts and techniques
4. WHAT IS DATA MINING ?
Data Mining:
Data Mining refers to extracting or “Mining” knowledge
from large amounts of data.
• Extraction of interesting (non-trivial, implicit, previously
unknown, and potentially useful) information or patterns from data
in large databases.
Alternative Names:
Knowledge Discovery from Data (KDD),
Knowledge extraction,
Data / pattern analysis,
Data archeology,
Data dredging,
Information harvesting,
Business intelligence, etc.
5. DECISIONS IN DATA MINING :
Databases to be mined
Relational, transactional, object-oriented, object-relational,
spatial, time-series, text, legacy, multimedia,
heterogeneous, WWW, etc.
Knowledge to be mined
Association, classification, clustering, etc.
Techniques utilized
Database-oriented, Data warehouse (OLAP), Machine
learning, Statistics, Visualization, Neural Networks, etc.
Applications adapted
Retail, Telecommunication, Banking, Fraud analysis, DNA
mining, Stock market analysis, Web mining, Weblog analysis, etc.
6. DATA MINING TASKS
Prediction Tasks:
Use some variables to predict unknown or future
values of other variables.
Description Tasks:
Find human-interpretable patterns that describe
the data.
Common Data Mining Tasks:
• Classification (predictive)
• Clustering (descriptive)
• Association Rule Discovery (descriptive)
• Sequential Pattern Discovery (descriptive)
• Regression (predictive)
• Deviation Detection (predictive)
7. DATA MINING - A KDD PROCESS
[Figure: the KDD process. Databases and flat files -> cleaning and integration -> data warehouse -> selection and transformation -> data mining -> patterns.]
8. SEVEN STEPS IN THE KDD PROCESS
1. Data Cleaning:
to remove noise and inconsistent data
2. Data Integration:
where multiple data sources may be combined
3. Data Selection:
where data relevant to the analysis task are retrieved from the database
4. Data Transformation:
where data are transformed and consolidated into forms appropriate for
mining, for example by performing summary or aggregation operations
5. Data Mining:
an essential process where intelligent methods are applied to extract
data patterns
6. Pattern Evaluation:
to identify the truly interesting patterns representing knowledge, based
on interestingness measures
7. Knowledge Presentation:
where visualization and knowledge representation techniques are used to
present mined knowledge to users
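The seven steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, the function names, and the "mining" step (a plain summary of unit sales) are all hypothetical and stand in for whatever a real system would use.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [
    {"city": "Boston", "month": "Jan", "units": 120},
    {"city": "Boston", "month": "Feb", "units": None},  # incomplete record
    {"city": "Hartford", "month": "Jan", "units": 80},
]
store_b = [
    {"city": "boston", "month": "Mar", "units": 150},   # inconsistent casing
    {"city": "Hartford", "month": "Feb", "units": 95},
]

def clean(records):
    """Step 1: drop incomplete records and normalize city names."""
    return [
        {**r, "city": r["city"].title()}
        for r in records
        if r["units"] is not None
    ]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, city):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["city"] == city]

def transform(records):
    """Step 4: consolidate records into the form the mining step expects."""
    return [r["units"] for r in records]

def mine(values):
    """Step 5: a stand-in mining step that summarizes the values."""
    return {"mean_units": mean(values), "max_units": max(values)}

def evaluate(pattern, min_mean):
    """Step 6: keep the pattern only if it passes a simple threshold."""
    return pattern if pattern["mean_units"] >= min_mean else None

def present(city, pattern):
    """Step 7: render the result for a user."""
    return f"{city}: mean units {pattern['mean_units']}, max units {pattern['max_units']}"

combined = integrate(clean(store_a), clean(store_b))
values = transform(select(combined, "Boston"))
pattern = evaluate(mine(values), min_mean=100)
if pattern is not None:
    print(present("Boston", pattern))
```

Each function maps to one step, so a step can be swapped out (for instance, a real mining algorithm in place of `mine`) without touching the others.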
9. ARCHITECTURE OF A TYPICAL DATA MINING SYSTEM
[Figure: layered architecture, top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleaning, data integration, and filtering; databases and data warehouse. A knowledge base is shown alongside the layers.]
10. DATA MINING - ON WHAT KIND OF DATA?
1. Relational Databases
2. Data Warehouses
3. Transactional Databases
4. Advanced Data and Information Systems
• Object-Oriented and Object-Relational Databases
• Spatial and Spatiotemporal Databases
• Heterogeneous and Legacy Databases
• Text Databases
• Multimedia Databases
• Data Streams
• WWW
11. DATA MINING FUNCTIONALITIES
Concept/Class Description: Characterization and
Discrimination
Generalize, summarize and contrast data
characteristics, e.g., dry vs. wet regions.
Data Characterization is a summarization of the
general characteristics or features in a target class of
data.
Data Discrimination is a comparison of the general
features of target class data objects with the general
features of objects from one or a set of contrasting
classes.
12. Association (Correlations and Causality)
* Multi-dimensional vs. single-dimensional association
* age(X, “20..29”) ∧ income(X, “20..29K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%]
* contains(T, “computer”) ⇒ contains(T, “software”) [support = 1%, confidence = 75%]
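To make the support and confidence figures concrete, here is a minimal sketch that computes both for the rule computer ⇒ software. The transactions are invented for illustration, so the resulting numbers will not match the percentages quoted above; the function names are likewise assumptions.

```python
# Hypothetical transactions: each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"software"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Support measures how often the whole rule applies across all transactions, while confidence measures how reliable the rule is once the left-hand side holds.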
13. Classification and Prediction
* Finding models (functions) that describe and
distinguish classes or concepts for future prediction
* E.g., classify countries based on climate, or
classify cars based on gas mileage.
Presentation: decision tree, classification rule, neural network
Prediction: predict some unknown or missing numerical values
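As a small illustration of the "classify cars based on gas mileage" example, the sketch below learns a one-level decision tree (a single threshold) from labeled cars and uses it to classify new ones. The training data, the labels "low" and "high", and the function names are all hypothetical.

```python
# Hypothetical training data: (miles per gallon, class label).
training = [
    (15, "low"), (18, "low"), (22, "low"),
    (30, "high"), (35, "high"), (41, "high"),
]

def learn_threshold(examples):
    """Pick the mpg cut point that misclassifies the fewest training examples."""
    values = sorted(mpg for mpg, _ in examples)
    # Candidate cut points lie midway between consecutive observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(cut):
        # An example is wrong when "mpg above the cut" disagrees with its label.
        return sum((mpg > cut) != (label == "high") for mpg, label in examples)

    return min(candidates, key=errors)

def classify(mpg, cut):
    """Apply the learned rule: above the cut is 'high', otherwise 'low'."""
    return "high" if mpg > cut else "low"

cut = learn_threshold(training)
print(cut, classify(28, cut), classify(20, cut))
```

The learned model is the classification rule "mpg > cut implies high", which is the kind of result a decision tree or rule-based presentation would show.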
14. Cluster Analysis
Class label is unknown: group data to form new
classes, e.g., cluster houses to find distribution patterns.
Clustering is based on the principle of maximizing
the intra-class similarity and minimizing the inter-class similarity.
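One common way to put this principle into practice is k-means, which the slides do not name; it is used here only as an example technique. The sketch groups hypothetical house locations into two clusters, using the first k points as a naive, deterministic starting guess.

```python
from math import dist

# Hypothetical house locations as (x, y) coordinates.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point to
    its nearest centroid and moving each centroid to its cluster's mean."""
    centroids = list(points[:k])  # naive initialization, an assumption
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = k_means(houses, k=2)
for center, members in zip(centroids, clusters):
    print(center, members)
```

Points that end up in the same cluster are close to their shared centroid (high intra-class similarity), while the two centroids are far apart (low inter-class similarity).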
15. Outlier Analysis
Outlier: A data object that does not comply with the
general behavior of the data
It can be considered noise or an exception, but it
is quite useful in fraud detection and rare-event analysis.
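A simple way to flag such objects is a z-score test: mark any value that lies more than a chosen number of standard deviations from the mean. This is one technique among many and is not prescribed by the slides; the amounts and the threshold of 2.0 below are assumptions for illustration.

```python
from statistics import mean, pstdev

# Hypothetical transaction amounts; one value departs sharply from the rest.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

outliers = z_score_outliers(amounts)
print(outliers)
```

In a fraud-detection setting the flagged value would be passed on for review rather than discarded as noise.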
Trend and Evolution Analysis
Trend and Deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity based Analysis
Other pattern-directed or statistical
analyses
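For trend analysis by regression, a minimal sketch is an ordinary least-squares line fitted to past values and extended one step ahead. The monthly unit sales below are invented, and the function name is hypothetical.

```python
# Hypothetical monthly unit sales for months 1 through 5.
months = [1, 2, 3, 4, 5]
units = [102, 108, 121, 129, 140]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(months, units)
forecast = slope * 6 + intercept  # extrapolate the trend to month 6
print(f"trend: {slope:.1f} units/month, month 6 forecast: {forecast:.1f}")
```

The slope summarizes the trend, and the extrapolated value is a rough answer to a question like "what is likely to happen to unit sales next month?", assuming the linear trend continues.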
16. ARE ALL THE “DISCOVERED”
PATTERNS INTERESTING?
A data mining system/query may generate thousands of
patterns; not all of them are interesting.
Suggested approach: Human-oriented, query-based,
focused mining
Interestingness measures : A pattern is interesting if it is
easily understood by humans, valid on new or test data with
some degree of certainty, potentially useful, novel, or validates
some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures:
Objective: based on statistics and structures of patterns,
e.g., support, confidence, etc.
Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
17. CAN WE FIND ALL AND ONLY INTERESTING PATTERNS?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns?
Association vs. Classification vs. Clustering
Search for only interesting patterns: Optimization
Can a data mining system find only the interesting patterns?
Approaches
First generate all the patterns and then filter out the
uninteresting ones.
Generate only the interesting patterns: mining query
optimization
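The first approach, generate everything and then filter, can be sketched as follows. The transactions, the restriction to single-item rules, and the thresholds are assumptions chosen to keep the example small; support and confidence serve as the objective interestingness measures.

```python
from itertools import permutations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def generate_rules():
    """Generate every single-item rule a => b, interesting or not,
    as (a, b, support, confidence)."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        both = count({a, b})
        rules.append((a, b, both / len(transactions), both / count({a})))
    return rules

def keep_interesting(rules, min_support, min_confidence):
    """Filter step: discard rules below either threshold."""
    return [
        (a, b) for a, b, sup, conf in rules
        if sup >= min_support and conf >= min_confidence
    ]

all_rules = generate_rules()
interesting = keep_interesting(all_rules, min_support=0.5, min_confidence=0.9)
print(len(all_rules), interesting)
```

Generating every rule first is simple but wasteful, since most candidates are discarded; the second approach aims to avoid producing uninteresting patterns in the first place.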
18. DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES
[Figure: data mining at the center, drawing on database technology, statistics, machine learning, visualization, information science, and other disciplines.]
19. DATA MINING: CLASSIFICATION SCHEMES
General Functionality
Descriptive data mining
Predictive data mining
Different views, different classifications
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
20. MAJOR ISSUES IN DATA MINING
Mining methodology and user interaction
Mining different kinds of knowledge in databases
Interactive mining of Knowledge at multiple levels of
abstraction
Incorporation of background knowledge
Data mining query languages and ad-hoc data mining
Expression and visualization of data mining results
Handling noise and incomplete data
Pattern evaluation: the interestingness problem
Performance and scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed and incremental mining methods
21. Issues relating to the diversity of data types
Handling relational and complex types of data
Mining information from heterogeneous databases and
global information systems (WWW)
Issues related to applications and social impacts
Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
Integration of the discovered knowledge with existing
knowledge: A knowledge fusion problem
Protection of data security, integrity, and privacy.
22. SUMMARY
Data Mining: discovering interesting patterns from large
amounts of data
A natural evolution of database technology, in great demand,
with wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed in a variety of information
repositories
Data mining functionalities:
Characterization, Discrimination, Association,
Classification, Clustering, Outlier and Trend analysis, etc.
Classification of Data mining systems
Major issues in data mining