INTRODUCTION

A young, fast-growing, and promising field

• Data mining is the analysis step of the "Knowledge Discovery in Databases" (KDD) process.
• It extracts hidden information from data.
• It is an interdisciplinary subfield of computer science.
• It is the computational process of discovering patterns in large data sets.
• It involves methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

INTRODUCTION (CONTD.)

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves:
• database and data management aspects
• data pre-processing
• model and inference considerations
• complexity considerations, post-processing of discovered structures, visualization, and online updating

Why Data Mining?

• The explosive growth of data: from terabytes to petabytes
    • E.g., global backbone telecommunication networks carry tens of petabytes every day
      (1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte)
• Data collection and data availability
    • Automated data collection tools, database systems, the Web, computerized society
• Major sources of abundant data
    • Business: Web, e-commerce, transactions, stocks, …
    • Science: remote sensing, bioinformatics, scientific simulation, …
    • Society and everyone: news, digital cameras, …

Why Data Mining? (contd.)

"Necessity is the mother of invention": data mining is the automated analysis of massive data sets.

What Motivated Data Mining?

We are drowning in data, but starving for knowledge!

Evolution of Database Technology

Data mining can be viewed as a result of the natural evolution of information technology.

• 1960s:
    • Data collection, database creation, and network DBMS
• 1970s:
    • Relational data model, relational DBMS implementation
• 1980s:
    • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
    • Application-oriented DBMS (spatial, scientific, engineering, etc.)

Evolution of Database Technology (contd.)

• 1990s:
    • Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
    • Stream data management and mining
    • Data mining and its applications
    • Web technology (XML, data integration) and global information systems

What Is Data Mining?

• Data mining (knowledge discovery from data)
    • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
    • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: is everything "data mining"? No. The following are not data mining:
    • Simple search and query processing
    • (Deductive) expert systems

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the center, drawing on database technology, machine learning, pattern recognition, statistics, algorithms, visualization, and other disciplines]

Knowledge Discovery (KDD) Process

Data mining is the core of the knowledge discovery process.

[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Knowledge Process

1. Data cleaning: remove noise and inconsistent data.
2. Data integration: combine multiple data sources.
3. Data selection: retrieve the data relevant to the analysis.
4. Data transformation: transform data into a form appropriate for data mining.
5. Data mining: the essential step, where intelligent methods are applied to extract data patterns.
6. Pattern evaluation: identify the truly interesting patterns representing knowledge, based on interestingness measures.
7. Knowledge presentation: present the mined knowledge using visualization and representation techniques.
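
The seven steps above can be read as a pipeline of small functions. The following is a minimal Python sketch and is not part of the original slides. The records, field names, the toy "pattern" (mean age of high versus low spenders), and the interestingness threshold are all hypothetical, chosen only to make each step visible.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"id": 1, "age": 25, "spend": 1200.0},
            {"id": 2, "age": None, "spend": 300.0}]
source_b = [{"id": 3, "age": 41, "spend": 5400.0},
            {"id": 4, "age": 38, "spend": 6100.0}]

def clean(rows):
    """Step 1: drop records that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine multiple sources into one list."""
    return [r for src in sources for r in src]

def select(rows, fields):
    """Step 3: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in rows]

def transform(rows):
    """Step 4: add a min-max normalized spend in [0, 1]."""
    lo = min(r["spend"] for r in rows)
    hi = max(r["spend"] for r in rows)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [{**r, "spend_norm": (r["spend"] - lo) / span} for r in rows]

def mine(rows):
    """Step 5: a toy pattern, the mean age of high vs. low spenders."""
    high = [r["age"] for r in rows if r["spend_norm"] >= 0.5]
    low = [r["age"] for r in rows if r["spend_norm"] < 0.5]
    return {"high_spender_mean_age": mean(high),
            "low_spender_mean_age": mean(low)}

def evaluate(pattern, min_gap=5):
    """Step 6: keep the pattern only if the age gap is large enough."""
    gap = abs(pattern["high_spender_mean_age"] - pattern["low_spender_mean_age"])
    return pattern if gap >= min_gap else None

def present(pattern):
    """Step 7: render the result for a reader."""
    if pattern is None:
        return "no interesting pattern"
    return ", ".join(f"{k}={v:.1f}" for k, v in pattern.items())

rows = transform(select(integrate(clean(source_a), clean(source_b)),
                        ["age", "spend"]))
report = present(evaluate(mine(rows)))
print(report)
```

In practice each step is far more involved (and the steps are often revisited iteratively), but the order of the calls mirrors the order of the list.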

Example: A Web Mining Framework

Web mining usually involves:
• Data cleaning
• Data integration from multiple sources
• Warehousing the data
• Data cube construction
• Data selection for data mining
• Data mining
• Presentation of the mining results
• Patterns and knowledge to be used or stored in a knowledge base

Data Mining in Business Intelligence

[Figure: pyramid of layers with increasing potential to support business decisions, listed here from bottom to top, with the typical role in parentheses]
• Data Sources: paper, files, Web documents, scientific experiments, database systems (DBA)
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Exploration: statistical summary, querying, and reporting (Data Analyst)
• Data Mining: information discovery (Data Analyst)
• Data Presentation: visualization techniques (Business Analyst)
• Decision Making (End User)

KDD Process: A Typical View from ML and Statistics

[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
• Data pre-processing: data integration, normalization, feature selection, dimension reduction
• Data mining: pattern discovery, association and correlation, classification, clustering, outlier analysis, …
• Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

This is a view from the typical machine learning and statistics communities.

Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications
    • Relational databases, data warehouses, transactional databases
• Advanced data sets and advanced applications
    • Data streams and sensor data
    • Time-series data, temporal data, sequence data (incl. bio-sequences)
    • Structure data, graphs, social networks, and multi-linked data
    • Object-relational databases
    • Heterogeneous databases and legacy databases
    • Spatial data
    • Multimedia databases
    • Text databases
    • The World-Wide Web

RDBMS

• A relational database is a collection of tables of data items, all of which are formally described and organized according to the relational model.
• The data in a single table represents a relation.
• Each table schema must identify a column or group of columns, called the primary key, that uniquely identifies each row.
• A relationship can then be established between a row in one table and a row in another table by creating a foreign key: a column or group of columns in one table that points to the primary key of another table.

RDBMS (contd.)

• Database normalization: the relational model offers various levels of refinement of table organization and reorganization.
• The DBMS of a relational database is called an RDBMS; it is the software that manages a relational database.
• The relational database was first defined in June 1970 by Edgar Codd of IBM's San Jose Research Laboratory.
• Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules.
• The relational database has become the predominant choice for storing data.

Relational database terminology

A relation is defined as a set of tuples that have the same attributes.

RDBMS (contd.)

• Example: AllElectronics, a company described by the relation tables customer, item, employee, and branch.
• Relation: customer is a group of entities describing customer information (cust_id, cust_name, age, occupation, annual income, credit information, and category).
• Tables can also be used to represent the relationships between or among multiple entities.
• Database queries (SQL): data is accessed using relational operations such as join, selection, and projection.
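
To make primary keys, foreign keys, and the three relational operations concrete, here is a small sketch using Python's built-in sqlite3 module. It is an illustration only: the purchase table, its columns, and all the rows are assumptions invented for this example, and only a few of the customer attributes named above are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    cust_id    INTEGER PRIMARY KEY,   -- primary key: identifies each row
    cust_name  TEXT NOT NULL,
    age        INTEGER,
    occupation TEXT
);
CREATE TABLE purchase (               -- hypothetical table for this sketch
    purchase_id INTEGER PRIMARY KEY,
    cust_id     INTEGER NOT NULL REFERENCES customer(cust_id),  -- foreign key
    item        TEXT NOT NULL,
    amount      REAL NOT NULL
);
""")

conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (1, "Ann", 34, "engineer"),
    (2, "Bob", 52, "teacher"),
])
conn.executemany("INSERT INTO purchase VALUES (?, ?, ?, ?)", [
    (10, 1, "computer", 1200.0),
    (11, 1, "software", 150.0),
    (12, 2, "printer", 200.0),
])

rows = conn.execute("""
    SELECT c.cust_name, p.item                     -- projection: two columns
    FROM customer AS c
    JOIN purchase AS p ON p.cust_id = c.cust_id    -- join on the foreign key
    WHERE p.amount > 180                           -- selection: filter rows
    ORDER BY p.purchase_id
""").fetchall()
print(rows)

# A purchase that points at a non-existent customer violates the foreign key.
try:
    conn.execute("INSERT INTO purchase VALUES (13, 99, 'camera', 300.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
```

The query returns only the purchases above the amount threshold, each paired with the name of the customer it references.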

Mining Relational Databases

• Data mining can go further than querying by searching for trends or data patterns.
• Examples:
    • Analyze customer data to predict customer risk based on income and age.
    • Detect deviations, such as sales that differ from those of the previous year.
• Relational databases are among the most commonly available and richest information repositories for data mining.

What is a Data Warehouse?

• Defined in many different ways, but not rigorously.
    • A decision support database that is maintained separately from the organization's operational database
    • Supports information processing by providing a solid platform of consolidated, historical data for analysis
• "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
• Data warehousing:
    • The process of constructing and using data warehouses

DATA WAREHOUSES

• A data warehouse is a repository of information collected from multiple sources and stored under a unified schema.
• It is constructed via:
    • Data cleaning
    • Data integration
    • Data transformation
    • Data loading and periodic data refreshing

DATA WAREHOUSES (contd.)

• A data warehouse is modeled by a multidimensional data structure.
• Data cube: precomputation of, and fast access to, summarized data
    • Each dimension corresponds to an attribute or a set of attributes in the schema.
    • Each cell stores the value of some aggregate measure (count, sum, etc.).
• Example: in AllElectronics, the cube has three dimensions:
    • Address (with country values USA, Canada, Mexico)
    • Time (with quarter values Q1, Q2, Q3, Q4)
    • Item (with item type values)
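
A data cube can be pictured as a lookup table keyed by one value per dimension, with a special marker for "all values" of a dimension. The sketch below precomputes every such cell for a three-dimensional sales cube. The fact rows, the sales figures, and the `ALL` marker are assumptions made up for illustration; real warehouses use far more efficient cube-computation strategies.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (country, quarter, item_type, sales)
facts = [
    ("USA", "Q1", "TV", 100),
    ("USA", "Q2", "TV", 150),
    ("USA", "Q1", "PC", 200),
    ("Canada", "Q1", "TV", 80),
    ("Mexico", "Q2", "PC", 60),
]

ALL = "*"  # marks a dimension that has been aggregated away

def build_cube(rows):
    """Precompute the sum of sales for every combination of dimension values."""
    cube = defaultdict(int)
    for country, quarter, item, sales in rows:
        # Each fact contributes to 2**3 = 8 cells: every dimension is either
        # kept at its own value or rolled up to ALL.
        for c, q, i in product((country, ALL), (quarter, ALL), (item, ALL)):
            cube[(c, q, i)] += sales
    return dict(cube)

cube = build_cube(facts)
print(cube[("USA", ALL, "TV")])   # total annual TV sales in the USA
print(cube[("USA", "Q1", ALL)])   # all items sold in the USA in Q1
print(cube[(ALL, ALL, ALL)])      # grand total
```

Because every aggregate is computed once up front, answering a summary query is a single dictionary lookup, which is the "fast access" the slide refers to.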

Multidimensional Data

Sales volume as a function of product, month, and region.

Dimensions: Product, Location, Time

Hierarchical summarization paths:
• Product: Industry → Category → Product
• Location: Region → Country → City → Office
• Time: Year → Quarter → Month → Day, with Week as an alternative level above Day
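
A Sample Data Cube

[Figure: a three-dimensional sales cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum); the highlighted cell is the total annual sales of TVs in the U.S.A.]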

Data Mining Functionalities

• Data mining tasks can be classified as:
    • Predictive: make predictions about the values of data using known results found from different data
    • Descriptive: characterize the properties of a target data set by exploring the properties of the data examined
• Data mining functionalities are used to specify the kinds of patterns to be found:
    • Characterization and discrimination
    • Mining of frequent patterns, associations, and correlations
    • Classification and regression
    • Cluster analysis
    • Outlier analysis

Characterization and Discrimination

• Data characterization is a summarization of the general characteristics or features of a target class of data.
• The output of characterization can be presented in various forms:
    • Pie charts
    • Bar charts
    • Curves
    • Multidimensional data cubes
    • Multidimensional tables
• Descriptions can also be presented as generalized relations or in rule form, called characteristic rules.
• Example: in AllElectronics, summarize the characteristics of customers who spend more than $5000 a year at AllElectronics. The result can be viewed along any dimension, such as occupation, to view these customers according to their type of employment.
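
The AllElectronics example can be sketched in a few lines of Python: pick the target class (customers spending more than $5000 a year) and summarize it along one dimension. The customer records and the field names `occupation` and `annual_spend` are hypothetical, and real characterization typically involves attribute generalization that this sketch leaves out.

```python
from collections import Counter

# Hypothetical customer records.
customers = [
    {"name": "A", "occupation": "engineer", "annual_spend": 7200},
    {"name": "B", "occupation": "teacher", "annual_spend": 1800},
    {"name": "C", "occupation": "engineer", "annual_spend": 5600},
    {"name": "D", "occupation": "doctor", "annual_spend": 9100},
    {"name": "E", "occupation": "teacher", "annual_spend": 5000},
]

def characterize(rows, threshold=5000, dimension="occupation"):
    """Share of the target class (spend > threshold) per value of `dimension`."""
    target = [r for r in rows if r["annual_spend"] > threshold]
    if not target:
        return {}
    counts = Counter(r[dimension] for r in target)
    return {value: n / len(target) for value, n in counts.items()}

profile = characterize(customers)
for occupation, share in sorted(profile.items()):
    print(f"{occupation}: {share:.0%} of big spenders")
```

The resulting shares are exactly what a pie chart or bar chart of the characterization would display.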

Data Discrimination

• Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or more contrasting classes.
• The output is represented in forms similar to those used for characterization descriptions.
• Discrimination descriptions expressed in the form of rules are called discrimination rules.
• The target and contrasting classes are specified by the user.
• Example: a user wants to compare the general features of software products whose sales increased by 10% against those whose sales decreased by 30% during the same period.

Mining Frequent Patterns, Associations, Correlations

• Frequent patterns:
    • Frequent itemsets (e.g., milk and bread)
    • Frequent subsequences (e.g., laptop, then digital camera, then memory card)
    • Frequent substructures (graphs, trees)
• Mining frequent patterns leads to the discovery of interesting associations and correlations within data.

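Association Analysis (example)

• Items frequently purchased together:
    buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
• X is a variable representing a customer.
• A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that the customer will buy software as well.
• A support of 1% means that 1% of all the transactions under analysis show computer and software purchased together.
• This rule involves a single attribute (buys), so it is a single-dimensional association rule. It can be written as "computer => software [1%, 50%]".
• Multidimensional association rule:
    age(X, "20..29") ^ income(X, "40K..49K") => buys(X, "laptop") [support = 2%, confidence = 60%]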
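
Support and confidence are simple ratios over the transaction set, as the following sketch shows. The five transactions are made up for illustration, so the resulting numbers differ from the 1% and 50% quoted on the slide.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]

def support(itemset, txns):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in txns if itemset <= t) / len(txns)

def confidence(antecedent, consequent, txns):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent: support(A and B) / support(A)."""
    return support(antecedent | consequent, txns) / support(antecedent, txns)

s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"computer => software [support={s:.0%}, confidence={c:.0%}]")
```

Here two of the five transactions contain both items (support 40%), and two of the three computer purchases also include software (confidence about 67%).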

Classification and Regression for Predictive Analysis

• Classification: the process of finding a model (function) that describes and distinguishes data classes or concepts.
• The model is derived from the analysis of a set of training data.
• Models can be represented as:
    • Classification rules (IF-THEN rules)
    • Decision trees
    • Mathematical formulae or neural networks
• Regression: a statistical methodology for numeric prediction.
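
The difference between the two tasks shows up in the output type: classification returns a class label, regression returns a number. The sketch below pairs a hand-written IF-THEN rule set with an ordinary least-squares line fit. Both are illustrative assumptions: the rule thresholds are invented rather than learned from training data, and least squares is only one of many regression methods.

```python
def classify(age, income):
    """Hypothetical IF-THEN rules that assign a risk class label."""
    if income < 30_000 and age < 25:
        return "high"
    if income < 30_000:
        return "medium"
    return "low"

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Classification: the prediction is a categorical label.
print(classify(age=22, income=20_000))

# Regression: the prediction is a numeric value.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope * 5 + intercept)
```

In a real system the rules (or a decision tree or neural network) would be derived from training data rather than written by hand.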

Cluster Analysis and Outlier Analysis

• Cluster analysis:
    • Determines the similarity among data objects based on predefined attributes
    • The most similar data objects are grouped into clusters
• Outlier analysis:
    • Outliers: data objects in the data set that do not comply with the general behavior or model of the data
    • The analysis of outlier data is referred to as outlier analysis or anomaly mining
    • Outliers can be detected using statistical tests
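
Both ideas can be sketched on one-dimensional data. The code below groups points with a tiny k-means loop and flags outliers with a z-score test. These are assumptions chosen for brevity: the slides do not name a specific clustering algorithm or statistical test, and the data, initial centroids, and threshold are invented.

```python
from statistics import mean, pstdev

def kmeans_1d(points, centroids, iterations=10):
    """Group 1-D points around the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

def zscore_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

clusters, centroids = kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[1, 12])
print(clusters, centroids)

amounts = [20, 22, 19, 21, 23, 20, 22, 21, 250]
outliers = zscore_outliers(amounts)
print(outliers)
```

Note that a single extreme value inflates the standard deviation it is measured against, so robust alternatives (for example, tests based on the median) are often preferred in practice.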

Which Technologies Are Used?

[Figure: data mining at the center, drawing on machine learning, pattern recognition, statistics, visualization, database technology, high-performance computing, algorithms, and applications]

Potential Applications of Data Mining

Where there are data, there are data mining applications.

• Data analysis and decision support (business intelligence)
    • Market analysis and management
        • Target marketing, customer relationship management (CRM), market basket analysis, cross-selling, market segmentation
    • Risk analysis and management
        • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
    • Fraud detection and detection of unusual patterns (outliers)
• Other applications
    • Text mining (newsgroups, email, documents) and Web mining
    • Stream data mining
    • Bioinformatics and bio-data analysis

Major Issues in Data Mining (1)

• Mining methodology
    • Mining various and new kinds of knowledge
    • Mining knowledge in multi-dimensional space
    • Data mining: an interdisciplinary effort
    • Boosting the power of discovery in a networked environment
    • Handling noise, uncertainty, and incompleteness of data
    • Pattern evaluation and pattern- or constraint-guided mining
• User interaction
    • Interactive mining
    • Incorporation of background knowledge
    • Presentation and visualization of data mining results

Major Issues in Data Mining (2)

• Efficiency and scalability
    • Efficiency and scalability of data mining algorithms
    • Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
    • Handling complex types of data
    • Mining dynamic, networked, and global data repositories
• Data mining and society
    • Social impacts of data mining
    • Privacy-preserving data mining
    • Invisible data mining

Introduction to DataMining

  • 2. 2 A young, fast growing and promising field
  • 3. INTRODUCTION 3      Data mining (the analysis step of the "Knowledge Discovery and Data Mining" process, or KDD) Extracting hidden information An interdisciplinary subfield of computer science The computational process of discovering patterns in large data sets Involving methods at the intersection of Artificial intelligence, Machine learning, Statistics, and Database systems.
  • 4. INTRODUCTION (contd.): The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, complexity considerations, post-processing of discovered structures, visualization, and online updating.
  • 5. Why Data Mining? The explosive growth of data: from terabytes to petabytes. E.g., global backbone telecommunication networks carry tens of petabytes every day (1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte). Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society. Major sources of abundant data: business (Web, e-commerce, transactions, stocks, …); science (remote sensing, bioinformatics, scientific simulation, …); society and everyone (news, digital cameras, …).
  • 6. Why Data Mining? “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets.
  • 7. What Motivated Data Mining? We are drowning in data, but starving for knowledge!
  • 8. Evolution of Database Technology: Data mining can be viewed as a result of the natural evolution of IT. 1960s: data collection, database creation, and network DBMS. 1970s: relational data model, relational DBMS implementation. 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), and application-oriented DBMS (spatial, scientific, engineering, etc.).
  • 9. Evolution of Database Technology: 1990s: data mining, data warehousing, multimedia databases, and Web databases. 2000s: stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems.
  • 11. What Is Data Mining? Data mining (knowledge discovery from data): the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data. Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Watch out: is everything “data mining”? Not simple search and query processing, nor (deductive) expert systems.
  • 12. Data Mining: Confluence of Multiple Disciplines [Diagram: data mining at the center, drawing on database technology, machine learning, pattern recognition, statistics, algorithms, visualization, and other disciplines]
  • 13. Knowledge Discovery (KDD) Process: Data mining is the core of the knowledge discovery process. [Diagram: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
  • 14. Knowledge Process: 1. Data cleaning: remove noise and inconsistent data. 2. Data integration: combine multiple data sources. 3. Data selection: retrieve the data relevant to the analysis. 4. Data transformation: transform the data into a form appropriate for data mining. 5. Data mining: an essential process in which intelligent methods are applied to extract data patterns. 6. Pattern evaluation: identify the truly interesting patterns representing knowledge, based on interestingness measures. 7. Knowledge presentation: present the mined knowledge using visualization and representation techniques.
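The seven steps can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the two record lists, the field names `cust` and `item`, and the count-based "mining" and threshold-based "evaluation" steps are all hypothetical stand-ins, not methods named in the slides.

```python
from collections import Counter

# Hypothetical raw records from two sources (all values are made up).
store_a = [{"cust": 1, "item": "milk"}, {"cust": 1, "item": "bread"},
           {"cust": 2, "item": None}]            # None stands for a noisy record
store_b = [{"cust": 3, "item": "MILK"}, {"cust": 3, "item": "bread"}]

def clean(records):
    """Step 1, data cleaning: drop records with a missing item value."""
    return [r for r in records if r["item"] is not None]

def integrate(*sources):
    """Step 2, data integration: combine multiple sources into one list."""
    return [r for src in sources for r in src]

def select(records, items_of_interest):
    """Step 3, data selection: keep only task-relevant records."""
    return [r for r in records if r["item"].lower() in items_of_interest]

def transform(records):
    """Step 4, data transformation: normalize item names to lower case."""
    return [{**r, "item": r["item"].lower()} for r in records]

def mine(records):
    """Step 5, data mining (toy version): count how often each item occurs."""
    return Counter(r["item"] for r in records)

def evaluate(patterns, min_count):
    """Step 6, pattern evaluation: keep patterns that meet a count threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}

data = integrate(clean(store_a), clean(store_b))
data = transform(select(data, {"milk", "bread"}))
patterns = evaluate(mine(data), min_count=2)
print(patterns)  # Step 7, knowledge presentation (here: plain printing)
```

In a real project each function would be far more involved; the point is only the order in which the stages hand data to one another.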
  • 15. Example: A Web Mining Framework: Web mining usually involves data cleaning; data integration from multiple sources; warehousing the data; data cube construction; data selection for data mining; data mining; presentation of the mining results; and patterns and knowledge to be used or stored in a knowledge base.
  • 16. Data Mining in Business Intelligence [Pyramid diagram, top to bottom, with increasing potential to support business decisions toward the top: Decision Making (End User); Data Presentation: Visualization Techniques (Business Analyst); Data Mining: Information Discovery, and Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst); Data Preprocessing/Integration, Data Warehouses, and Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)]
  • 17. KDD Process: A Typical View from ML and Statistics: Input Data → Data Pre-Processing (data integration, normalization, feature selection, dimension reduction) → Data Mining (pattern discovery, association and correlation, classification, clustering, outlier analysis, …) → Post-Processing (pattern evaluation, pattern selection, pattern interpretation, pattern visualization). This is a view from typical machine learning and statistics communities.
  • 18. Data Mining: On What Kinds of Data? Database-oriented data sets and applications: relational databases, data warehouses, transactional databases. Advanced data sets and advanced applications: data streams and sensor data; time-series data, temporal data, sequence data (incl. bio-sequences); structured data, graphs, social networks, and multi-linked data; object-relational databases; heterogeneous databases and legacy databases; spatial data; multimedia databases; text databases; the World-Wide Web.
  • 19. RDBMS: A relational database is a collection of tables of data items, all of which are formally described and organized according to the relational model. The data in a single table represents a relation. Each table schema must identify a column or group of columns, called the primary key, that uniquely identifies each row. A relationship can then be established between a row in one table and a row in another table by creating a foreign key: a column or group of columns in one table that points to the primary key of another table.
  • 20. RDBMS: Database normalization: the relational model offers various levels of refinement of table organization and reorganization. The DBMS of a relational database is called an RDBMS; it is the software that manages a relational database. The relational database was first defined in June 1970 by Edgar Codd of IBM's San Jose Research Laboratory. Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules. Relational databases have become the predominant choice for storing data.
  • 21. Relational database terminology: a relation is defined as a set of tuples that have the same attributes.
  • 22. RDBMS (contd.): Example: AllElectronics, a company described by the relation tables customer, item, employee, and branch. Relation: customer is a group of entities describing customer information (Cust_id, cust_name, age, occupation, annual income, credit information, and category). Tables are also used to represent the relationships between or among multiple entities. Database queries (SQL): data is accessed using relational operations such as join, selection, and projection.
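The primary key, foreign key, and the three relational operations from the RDBMS slides can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch: the `customer` table borrows the `cust_id` and `cust_name` column names from the slide, while the `purchase` table, its columns, and every row value are hypothetical and exist only for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE customer (
    cust_id   INTEGER PRIMARY KEY,            -- primary key: identifies each row
    cust_name TEXT NOT NULL,
    age       INTEGER
);
CREATE TABLE purchase (                        -- hypothetical table for illustration
    purchase_id INTEGER PRIMARY KEY,
    cust_id     INTEGER NOT NULL REFERENCES customer(cust_id),  -- foreign key
    item_name   TEXT NOT NULL
);
INSERT INTO customer VALUES (1, 'Ann', 34), (2, 'Bob', 27);
INSERT INTO purchase VALUES (10, 1, 'laptop'), (11, 2, 'camera'), (12, 1, 'software');
""")

rows = conn.execute("""
    SELECT c.cust_name, p.item_name                 -- projection: keep two columns
    FROM customer AS c
    JOIN purchase AS p ON p.cust_id = c.cust_id     -- join: match rows via the key
    WHERE c.age > 30                                -- selection: filter rows
    ORDER BY p.purchase_id
""").fetchall()
print(rows)

# A purchase that points at a non-existent customer violates the foreign key.
try:
    conn.execute("INSERT INTO purchase VALUES (13, 99, 'phone')")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)
```

The query combines all three operations in one statement, which is how they usually appear in practice.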
  • 23. Mining Relational Databases: Mining can go further than querying by searching for trends or data patterns. Examples: analyze customer data to predict the risk of customers based on their income and age; detect deviations, such as sales compared with the previous year. RDBMSs are among the most commonly available and richest information repositories for data mining.
  • 24. What is a Data Warehouse? It is defined in many different ways, but not rigorously. It is a decision support database that is maintained separately from the organization’s operational database. It supports information processing by providing a solid platform of consolidated, historical data for analysis. “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon) Data warehousing: the process of constructing and using data warehouses.
  • 25. DATA WAREHOUSES: A data warehouse is a repository of information collected from multiple sources, stored under a unified schema. It is constructed via data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
  • 27. DATA WAREHOUSES (contd.): A data warehouse is modeled by a multidimensional data structure. Data cube: precomputation of and fast access to summarized data. Each dimension corresponds to an attribute or a set of attributes in a schema. Each cell stores the value of some aggregate measure (count, sum, etc.). Example: in AllElectronics the cube has three dimensions: Address (with country values U.S.A., Canada, Mexico), Time (with quarter values Q1, Q2, Q3, Q4), and Item (with item-type values).
  • 28. Multidimensional Data: Sales volume as a function of product, month, and region. Dimensions: Product, Location, Time. Hierarchical summarization paths: Industry → Category → Product; Region → Country → City → Office; Year → Quarter → Month or Week → Day. [Figure: cube with axes Product, Region, and Month]
  • 29. A Sample Data Cube [Figure: a three-dimensional cube with axes Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum); the highlighted cell holds the total annual sales of TVs in the U.S.A.]
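The idea of a data cube, where every cell is a precomputed aggregate and "sum" marks a rolled-up dimension, can be sketched as follows. The fact rows and sales figures are hypothetical, and a plain dictionary is used as a stand-in for a real warehouse cube; the dimension values follow the sample cube in the slides.

```python
from collections import defaultdict
from itertools import product as cartesian

# Hypothetical fact rows: (product, quarter, country, sales amount).
facts = [
    ("TV", "1Qtr", "U.S.A.", 100), ("TV", "2Qtr", "U.S.A.", 120),
    ("TV", "1Qtr", "Canada", 40),  ("PC", "1Qtr", "U.S.A.", 200),
    ("PC", "3Qtr", "Mexico", 70),  ("VCR", "4Qtr", "Canada", 30),
]

ALL = "sum"  # marker meaning "aggregated over this dimension"

def build_cube(rows):
    """Precompute every cell: each dimension is either kept or rolled up to ALL."""
    cube = defaultdict(int)
    for prod, qtr, country, amount in rows:
        # Each fact contributes to 2 * 2 * 2 = 8 cells of the cube.
        for key in cartesian((prod, ALL), (qtr, ALL), (country, ALL)):
            cube[key] += amount
    return cube

cube = build_cube(facts)

# Total annual sales of TVs in the U.S.A.: roll up the date dimension only.
tv_usa_annual = cube[("TV", ALL, "U.S.A.")]
# Grand total: roll up all three dimensions.
grand_total = cube[(ALL, ALL, ALL)]
print(tv_usa_annual, grand_total)
```

Because all aggregates are computed once up front, each lookup is a single dictionary access, which mirrors the "precomputation and fast access" point made about data cubes.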
  • 30. Data Mining Functionalities: Tasks can be classified as predictive (making predictions about values of data using known results found from different data) or descriptive (characterizing the properties of a target data set, i.e., exploring the properties of the data examined). Data mining functionalities are used to specify the kinds of patterns to be found: characterization and discrimination; the mining of frequent patterns, associations, and correlations; classification and regression; cluster analysis; outlier analysis.
  • 31. Characterization and Discrimination: Data characterization is a summarization of the general characteristics or features of a target class of data. The output of characterization can be presented in various forms: pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables. Descriptions can also be presented as generalized relations or as characteristic rules. Example: in AllElectronics, summarize the characteristics of customers who spend more than $5000 a year at AllElectronics; the result can be viewed along any dimension, such as occupation, to view these customers according to their type of employment.
  • 32. Data Discrimination: Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or more contrasting classes. The output representation is similar to that of characterization descriptions. Discrimination descriptions expressed in the form of rules are called discrimination rules. The target and contrasting classes are specified by the user. Example: a user wants to compare the general features of software products whose sales increased by 10% with those of products whose sales decreased by 30% during the same period.
  • 33. Mining Frequent Patterns, Associations, Correlations: Kinds of frequent patterns: frequent itemsets (e.g., milk and bread); frequent subsequences (e.g., laptop, then digital camera, then memory card); frequent substructures (graphs, trees). Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
  • 34. Association Analysis (example): Items frequently purchased together: buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%], where X is a variable representing a customer. A confidence (certainty) of 50% means that a customer who buys a computer has a 50% chance of buying software as well; a support of 1% means that 1% of all transactions under analysis contain both computer and software. This is a single-dimensional association rule, also written "computer => software [1%, 50%]". A multidimensional association rule: age(X, "20..29") ^ income(X, "40K..49K") => buys(X, "laptop") [support = 2%, confidence = 60%].
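Support and confidence for a rule such as "computer => software" can be computed directly from a list of transactions. The sketch below uses five made-up transactions, so the resulting numbers differ from the 1% and 50% quoted on the slide; the function names `support` and `confidence` are chosen here for illustration.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]

def support(itemset, txns):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in txns) / len(txns)

def confidence(antecedent, consequent, txns):
    """support(antecedent and consequent together) / support(antecedent).

    Assumes the antecedent occurs in at least one transaction;
    otherwise the division below raises ZeroDivisionError.
    """
    both = set(antecedent) | set(consequent)
    return support(both, txns) / support(antecedent, txns)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support={rule_support:.0%}, confidence={rule_confidence:.0%}")
```

Here two of the five transactions contain both items, and two of the three computer purchases also include software, which is where the two ratios come from.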
  • 35. Classification and Regression for Predictive Analysis: Classification is the process of finding a model (function) that describes and distinguishes data classes or concepts. The model is derived from the analysis of a set of training data. Models can be represented as classification rules (IF-THEN rules), decision trees, mathematical formulae, or neural networks. Regression is a statistical methodology for numeric prediction.
  • 36. Cluster Analysis and Outlier Analysis: Cluster analysis: determine the similarity among data objects on predefined attributes; the most similar objects are grouped into clusters. Outlier analysis: a data set may contain outliers, objects that do not comply with the general behavior or model of the data; the analysis of outlier data is referred to as outlier analysis or anomaly mining; outliers can be detected using statistical tests.
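As one simple instance of detecting outliers with a statistical test, the sketch below flags values whose z-score exceeds a threshold. The z-score rule, the threshold of 2.5, and the sample values are assumptions made for illustration; the slides do not say which test is intended.

```python
import statistics

# Hypothetical measurements; 95 is deliberately far from the rest.
values = [10, 12, 11, 13, 12, 11, 10, 12, 95]

def zscore_outliers(data, threshold=2.5):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(data)
    stdev = statistics.pstdev(data)  # population standard deviation
    if stdev == 0:
        return []  # all values identical: nothing can be an outlier
    return [x for x in data if abs(x - mean) / stdev > threshold]

outliers = zscore_outliers(values)
print(outliers)
```

A single extreme value inflates both the mean and the standard deviation, so on small samples this rule can miss outliers; robust alternatives based on the median are often preferred in practice.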
  • 37. Which Technologies Are Used? [Diagram: data mining at the center, drawing on machine learning, applications, algorithms, pattern recognition, statistics, visualization, database technology, and high-performance computing]
  • 38. Potential Applications of Data Mining: Where there are data, there are data mining applications. Data analysis and decision support (business intelligence): market analysis and management (target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation); risk analysis and management (forecasting, customer retention, improved underwriting, quality control, competitive analysis); fraud detection and detection of unusual patterns (outliers). Other applications: text mining (newsgroups, email, documents) and Web mining; stream data mining; bioinformatics and bio-data analysis.
  • 39. Major Issues in Data Mining (1): Mining methodology: mining various and new kinds of knowledge; mining knowledge in multi-dimensional space; data mining as an interdisciplinary effort; boosting the power of discovery in a networked environment; handling noise, uncertainty, and incompleteness of data; pattern evaluation and pattern- or constraint-guided mining. User interaction: interactive mining; incorporation of background knowledge; presentation and visualization of data mining results.
  • 40. Major Issues in Data Mining (2): Efficiency and scalability: efficiency and scalability of data mining algorithms; parallel, distributed, stream, and incremental mining methods. Diversity of data types: handling complex types of data; mining dynamic, networked, and global data repositories. Data mining and society: social impacts of data mining; privacy-preserving data mining; invisible data mining.