01 Introduction to Data Mining
Chapter 1
Introduction
Data mining is often defined as finding hidden information in a database. It has
also been called exploratory data analysis, data-driven discovery, and deductive
learning. Data mining access of a database differs from traditional access in:
• Query: The query might not be well formed or precisely stated. The data
miner might not even be exactly sure of what he wants to see.
• Data: The data accessed is usually a different version from that of the
original operational database. The data have been cleansed and modified
to better support the mining process.
• Output: The output of the data mining query probably is not a subset of
the database. Instead, it is the output of some analysis of the contents of
the database.
Data Mining Algorithms
Data mining (DM) algorithms attempt to fit a model to the data. They examine the
data and determine the model that is closest to the characteristics of the
data being examined. Such algorithms can be characterized as
consisting of three parts:
• Model: The purpose of the algorithm is to fit a model to the data.
For example, which attributes should be used to define which class structure?
• Preference: Some criterion must be used to prefer one model over another.
Preference is given to the model that fits the data best.
• Search: All algorithms require some technique to search the data. The
criteria needed to fit the data to the classes must be properly defined.
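The three parts can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not something taken from the slides: the toy data, the choice of a line through the origin as the model, the sum of squared errors as the preference criterion, and a simple grid of candidate slopes as the search.

```python
# Hypothetical toy data: y is roughly 2 * x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

def model(slope, x):
    """Model: a line through the origin, y = slope * x."""
    return slope * x

def squared_error(slope, points):
    """Preference: a lower sum of squared errors means a better fit."""
    return sum((y - model(slope, x)) ** 2 for x, y in points)

def search(points, candidates):
    """Search: try every candidate slope and keep the preferred one."""
    return min(candidates, key=lambda s: squared_error(s, points))

# Candidate slopes 0.0, 0.1, ..., 5.0 (an arbitrary illustrative grid).
candidate_slopes = [s / 10 for s in range(0, 51)]
best_slope = search(data, candidate_slopes)
print(best_slope)
```

Real algorithms use richer models and smarter search strategies, but the same three roles can usually be identified.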
1.1 Basic Data Mining Models and Tasks
• A predictive model makes a prediction about values of data using known results
found from other (historical) data.
• A descriptive model identifies patterns or relationships in data. It serves as a way
to explore the properties of the data examined, not to predict new properties.
Predictive Models
• Classification maps data into predefined groups or classes. It is often referred to as supervised
learning because the classes are determined before examining the data.
• Regression is used to map a data item to a real-valued prediction variable. Regression assumes
that the target data fit some known type of function (e.g., linear or logistic) and
determines the best function of this type that models the given data. In effect, regression
involves learning the function that does this mapping.
• Time series analysis examines the value of an attribute as it varies over time (usually obtained at evenly
spaced points). Three basic functions are performed in time series analysis: 1) similarity
between different time series is determined using distance measures; 2) the structure of the line
is examined to determine (perhaps classify) its behavior; 3) future values are predicted from the
historical time series plot.
• Prediction predicts future data states based on past and current data. Prediction can also be
viewed as a type of classification.
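As a small illustration of regression, the sketch below fits a linear function to data by ordinary least squares and uses it to predict a new value. The data points and the function names fit_line and predict are hypothetical. Least squares is one common way to pick the "best" function of a given type; the slides do not say which criterion is used.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Map a data item x to a real-valued prediction."""
    return slope * x + intercept

# Hypothetical historical data that lies exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(slope, intercept, 6))
```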
Descriptive Models
• Clustering is similar to classification except that the groups are not predefined
but rather defined by the data alone. Clustering is usually accomplished by
determining the similarity among the data on predefined attributes. The most
similar data are grouped into clusters.
• Summarization extracts or derives representative information about the
database. It maps data into subsets with associated simple descriptions. It is also
called characterization or generalization.
• Association rules: Link analysis (also called affinity analysis or association) refers to
uncovering relationships among data. An association rule is a model that
identifies specific types of data associations. These are not causal relationships,
and there is no guarantee that an association will apply in the future.
• Sequence discovery is used to determine sequential patterns in data. These
patterns are based on time (a sequence of actions). Temporal association rules
fall into this category.
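To make association rules concrete, the sketch below computes two measures that are widely used to evaluate a rule such as "bread => milk": support (the fraction of transactions containing all the items) and confidence (how often the right-hand side appears when the left-hand side does). These measures are not named in the slides, and the transactions are hypothetical.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Support of lhs and rhs together, divided by the support of lhs alone."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A high confidence describes co-occurrence in the observed data only. It says nothing about cause and may not hold for future transactions.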
Knowledge Discovery Steps
Data Mining Issues
• Human interaction: Experts are needed to formulate the queries and to identify the data and desired results.
• Overfitting: It occurs when a model fits the training data closely but does not fit future states. This may be caused by
assumptions that are made about the data or may simply be caused by the small size of the
training database.
• Outliers: Data entries that do not fit the derived model.
• Interpretation of results: Output may require an expert to interpret the results correctly.
• Large databases: Sampling and parallelization are effective tools to attack the scalability problem.
• High dimensionality: One solution to this problem is to reduce the number of attributes, which is
known as dimensionality reduction.
• Multimedia data, missing data, irrelevant data, noisy data, changing data.
• Integration and application: Business practices may have to be modified to determine how to
effectively use the information uncovered.
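Overfitting can be exposed by comparing accuracy on the training data with accuracy on data held out from training. The sketch below is a contrived illustration: the data sets, the two noisy training labels, and both models (a 1-nearest-neighbor rule that memorizes the training set and a fixed threshold rule) are hypothetical choices made for the example.

```python
# Hypothetical (x, label) pairs; the true pattern is label = 1 when x >= 5.
# The points (3, 1) and (8, 0) are deliberately noisy training labels.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
holdout = [(1.1, 0), (2.9, 0), (3.2, 0), (5.5, 1), (7.9, 1), (8.3, 1)]

def nearest_neighbor(x):
    """Memorizing model: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def threshold_rule(x):
    """Simpler model: a single cut at x = 5."""
    return 1 if x >= 5 else 0

def accuracy(predict, points):
    """Fraction of points whose predicted label matches the true label."""
    return sum(1 for x, label in points if predict(x) == label) / len(points)

for name, predict in [("nearest neighbor", nearest_neighbor),
                      ("threshold rule", threshold_rule)]:
    print(name, accuracy(predict, train), accuracy(predict, holdout))
```

The memorizing model is perfect on the training set because it reproduces the noisy labels, and that is exactly what hurts it on the held-out points.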
Data Mining Metrics
• From an overall business perspective, a measure such as the return
on investment (ROI) could be used. ROI examines the difference
between what the data mining technique costs and what the savings
or benefits from its use are. It could be measured as increased
sales, reduced advertising expenditure, or both.
• The metrics used also include the traditional metrics of space and time
based on complexity analysis. In some cases, such as accuracy in
classification, more specific metrics targeted to the data mining task may
be used.
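Both kinds of metric are easy to compute. The sketch below assumes the common convention of expressing ROI as net benefit divided by cost (the slides only describe it as a difference between costs and benefits) and defines classification accuracy as the fraction of correct predictions. All figures and labels are hypothetical.

```python
def roi(cost, benefit):
    """Return on investment as a ratio: net benefit divided by cost."""
    return (benefit - cost) / cost

def classification_accuracy(predicted, actual):
    """Fraction of items whose predicted class equals the actual class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical project: costs 50,000 and yields 65,000 in savings and extra sales.
project_roi = roi(cost=50_000, benefit=65_000)

# Hypothetical classifier output compared with the true classes.
acc = classification_accuracy(predicted=[1, 0, 1, 1], actual=[1, 0, 0, 1])
print(project_roi, acc)
```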
Cross-Industry Standard Process Model for
Data Mining (CRISP-DM)
The process lifecycle consists of six phases:
• business understanding,
• data understanding,
• data preparation,
• modeling,
• evaluation,
• deployment.
ETL, Online Analytic Processing (OLAP), BI
Examples of Data Mining Applications
• Healthcare. Mining healthcare data can identify best practices that improve care and reduce costs. It can be used to predict the volume
of patients in every category and to find best practices for diagnosis and the most effective treatments.
• Market basket analysis may allow a retailer to understand the purchase behavior of a buyer.
• Education. Learning patterns of students can be captured and used to develop techniques to teach them.
• Manufacturing engineering. Discovering patterns in product architecture, product portfolio, and customer needs data;
predicting product development span time, cost, or dependencies among tasks.
• Customer relationship management (CRM) and customer segmentation are used for implementing customer-focused
strategies for acquiring and retaining customers and improving customer loyalty.
• Fraud detection, image analysis, facial and speech recognition.
• Finance and banking. Finding patterns, causalities, and correlations in business information and market prices.
• Research in bioinformatics, biology, medicine, and neuroscience: gene finding, protein function inference, protein and gene
interaction network reconstruction, data cleansing, and protein sub-cellular location prediction.
• The Human Genome Project. Scientists use microarray data to look at gene expression, and sophisticated data analysis
techniques are employed to account for background noise and to normalize the data.
Information Flow Diagram
References:
Dunham, Margaret H. “Data Mining: Introductory and Advanced
Topics”. Pearson Education, Inc., 2003.
