2. Introduction
Data mining is often defined as finding hidden information in a database, or as
exploratory data analysis, data-driven discovery, or deductive learning. Data
mining access to a database differs from traditional access in:
• Query: The query might not be well formed or precisely stated. The data
miner might not even be sure exactly what to look for.
• Data: The data accessed are usually a modified version of the original
operational database. The data have been cleansed and transformed
to better support the mining process.
• Output: The output of a data mining query is usually not a subset of
the database. Instead, it is the result of some analysis of the contents of
the database.
3. Data Mining Algorithms
DM algorithms attempt to fit a model to the data: they examine the
data and determine the model that best matches the characteristics of the
data being examined. Such algorithms can be characterized as
consisting of three parts:
• Model: The purpose of the algorithm is to fit a model to the data
(e.g., which attributes should be used to define which class structure).
• Preference: Some criterion must be used to prefer one model over another.
Preference is given to the model that fits the data best.
• Search: All algorithms require some technique to search the data. The
criteria used to fit the data to the classes must be properly defined.
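The three parts above can be sketched concretely. In this minimal example (all names and numbers invented for illustration), the model is a family of functions y = a·x, the preference criterion is total squared error, and the search is an exhaustive scan over the candidates:

```python
# Sketch of the three parts of a DM algorithm: model, preference, search.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]  # (x, y) pairs, invented

# Model: a family of simple functions y = a * x, one per value of a.
candidates = [lambda x, a=a: a * x for a in (1.0, 1.5, 2.0, 2.5)]

# Preference: total squared error of a model on the data (lower is better).
def error(model):
    return sum((model(x) - y) ** 2 for x, y in data)

# Search: exhaustively try each candidate and keep the best-fitting one.
best = min(candidates, key=error)
print(best(5))  # prediction of the preferred model at x = 5
```

A real algorithm differs mainly in scale: the model family is larger and the search is heuristic rather than exhaustive.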
4. • A predictive model makes a prediction about values of data using known results
found from other (historical) data.
• A descriptive model identifies patterns or relationships in data. It serves as a way
to explore the properties of the data examined, not to predict new properties.
1.1 Basic Data Mining Models and Tasks
5. Predictive Models
• Classification maps data into predefined groups or classes. It is often referred to as supervised
learning because the classes are determined before the data are examined.
• Regression is used to map a data item to a real-valued prediction variable. Regression assumes
that the target data fit some known type of function (e.g., linear, logistic) and
determines the best function of this type that models the given data. In actuality, regression
involves learning the function that performs this mapping.
• Time series analysis examines the value of an attribute as it varies over time (obtained at evenly
spaced points). There are three basic functions performed in time series analysis: 1) similarity
between different time series is determined using distance measures; 2) the structure of the line
is examined to determine (and perhaps classify) its behavior; 3) future values are predicted from
the historical time series.
• Prediction predicts future data states based on past and current data. Prediction can also be
viewed as a type of classification.
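As a sketch of the regression task, the following fits the best function of a known type (here linear, y = a·x + b) to toy data using the closed-form least-squares estimates; the data points are invented for illustration:

```python
# Minimal least-squares linear regression sketch: learn y = a*x + b.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # invented data, exactly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates for slope and intercept.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(a, b)       # slope and intercept of the best-fitting line
print(a * 5 + b)  # real-valued prediction at x = 5
```

This illustrates the definition above: the function type is assumed in advance, and only its parameters are learned from the data.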
6. Descriptive Models
• Clustering is similar to classification except that the groups are not predefined
but rather defined by the data alone. Clustering is usually accomplished by
determining the similarity among the data on predefined attributes. The most
similar data are grouped into clusters.
• Summarization extracts or derives representative information about the
database. It maps data into subsets with associated simple descriptions. It is also
called characterization or generalization.
• Association rules (link analysis, affinity analysis, or association) refer to
uncovering relationships among data. An association rule is a model that
identifies specific types of data associations. These are not causal relationships,
and there is no guarantee that an association will apply in the future.
• Sequence discovery is used to determine sequential patterns in data. These
patterns are based on time (a sequence of actions). Temporal association rules
fall into this category.
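The association-rule idea can be made concrete with the two standard measures, support and confidence, computed over a few market-basket transactions (the transactions and item names are invented for illustration):

```python
# Toy association-rule sketch: support and confidence for a rule X => Y.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # Of the transactions containing lhs, the fraction that also contain rhs.
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "milk"}))       # how often bread and milk co-occur
print(confidence({"bread"}, {"milk"}))  # strength of the rule bread => milk
```

Note how this matches the caveat above: a high-confidence rule describes the historical data only and carries no guarantee about future purchases.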
8. Data Mining Issues
• Human interaction. Experts are needed to formulate the queries and to identify the data and desired results.
• Overfitting: The model fits the training data but does not fit future states. This may be caused by
assumptions made about the data, or simply by the small size of the
training database.
• Outliers: data entries that do not fit the model well may distort the results or be discarded when they are in fact significant.
• Interpretation of results. The output may require an expert to interpret it correctly.
• Large databases: Sampling and parallelization are effective tools to attack the scalability problem.
• High dimensionality. One solution to this problem is to reduce the number of attributes, which is
known as dimensionality reduction.
• Multimedia data, missing data, irrelevant data, noisy data, changing data.
• Integration and application: Business practices may have to be modified to determine how to
effectively use the information uncovered.
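One crude form of the dimensionality reduction mentioned above can be sketched as dropping attributes whose variance across the records is negligible; the records and the threshold are invented for illustration:

```python
# Sketch of variance-based dimensionality reduction: drop near-constant
# attributes (columns), keeping only those that vary across the records.
records = [
    [5.0, 1.0, 0.30],
    [5.0, 2.0, 0.31],
    [5.0, 3.0, 0.29],
]

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

columns = list(zip(*records))  # transpose: one tuple per attribute
keep = [i for i, col in enumerate(columns) if variance(col) > 0.01]
reduced = [[row[i] for i in keep] for row in records]
print(keep)     # indices of retained attributes
print(reduced)  # the records restricted to those attributes
```

Practical techniques (e.g., principal component analysis) are more sophisticated, but the goal is the same: fewer attributes, most of the information retained.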
9. Data Mining Metrics
• From an overall business perspective, a measure such as the return
on investment (ROI) could be used. ROI examines the difference
between what the data mining technique costs and what the savings
or benefits from its use are. These could be measured as increased
sales, reduced advertising expenditure, or both.
• The metrics used include the traditional metrics of space and time,
based on complexity analysis. In some cases, such as accuracy in
classification, more specific metrics targeted to the data mining task may
be used.
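Both kinds of metric are simple to compute; the following sketch shows classification accuracy on a labelled test set and a basic ROI figure (all labels and monetary amounts invented for illustration):

```python
# Two data mining metrics: task-specific accuracy and business-level ROI.
actual    = ["spam", "ham", "spam", "ham", "spam"]
predicted = ["spam", "ham", "ham",  "ham", "spam"]

# Accuracy: fraction of test items whose predicted class matches the actual one.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)

cost_of_mining = 10_000.0  # what the technique costs (invented)
benefit        = 14_000.0  # savings/extra sales attributed to it (invented)
roi = (benefit - cost_of_mining) / cost_of_mining
print(roi)  # fractional return on the mining investment
```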
10. Cross-Industry Standard Process Model for
Data Mining (CRISP-DM)
The process lifecycle consists of:
• business understanding,
• data understanding,
• data preparation,
• modeling,
• evaluation, and
• deployment.
12. Examples of Data Mining Applications
• Healthcare. Mining healthcare data can identify best practices that improve care and reduce costs, e.g., predicting the volume
of patients in each category, finding best practices for diagnosis, and identifying the most effective treatments.
• Market Basket Analysis may allow the retailer to understand the purchase behavior of a buyer.
• Education. The learning patterns of students can be captured and used to develop techniques for teaching them.
• Manufacturing Engineering. Discovering patterns in product architecture, product portfolio, and customer needs data.
Predicting product development span time, cost, or dependencies among tasks.
• Customer Relationship Management (CRM) and customer segmentation are used for implementing customer focused
strategies in acquiring and retaining customers, improving customers’ loyalty.
• Fraud Detection, image analysis, facial and speech recognition.
• Financial Banking. Finding patterns, causalities, and correlations in business information and market prices.
• Research in bioinformatics, biology, medicine, and neuroscience: gene finding, protein function inference, protein and gene
interaction network reconstruction, data cleansing, and protein subcellular location prediction.
• The Human Genome Project. Scientists use microarray data to look at gene expression, and sophisticated data analysis
techniques are employed to account for background noise and to normalize the data.