Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
2. Data mining
Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and useful to
the data owner.
Data mining Techniques
In additiontousinga particulardata miningtool, internal auditors can choose from a variety of
data miningtechniques.The mostcommonlyusedtechniquesincludeartificial neuralnetworks,decision
trees, and the nearest-neighbor method. Each of these techniques analyzes data in different ways:
Artificial neural networks are non-linear, predictive models that learn through training.
Although they are powerful predictive modeling techniques, some of the power comes at
the expense of ease of use and deployment.
One area where auditors can easily use them is when reviewing records to
identify fraud and fraud-like actions. Because of their complexity, they are better
employed in situations where they can be used and reused, such as reviewing credit card
transactions every month to check for anomalies.
Decision trees are tree-shaped structures that represent decision sets. These decisions
generate rules, which then are used to classify data. Decision trees are the favored
technique for building understandable models. Auditors can use them to assess, for
example, whether the organization is using an appropriate cost-effective marketing
strategy that is based on the assigned value of the customer, such as profit.
The nearest-neighbor method classifies dataset records based on similar data in a
historical dataset. Auditors can use this approach to define a document that is interesting
to them and ask the system to search for similar items.
3. The most commonly used techniques include artificial neural networks, decision trees,
and the nearest-neighbor method. Each of these techniques analyzes data in different ways:
Association
Association (or relation) is probably the better known and most familiar and
straightforward data mining technique. Here, it make a simple correlation between two or
more items, often of the same type to identify patterns. For example, when tracking
people's buying habits, you might identify that a customer always buys cream when they
buy strawberries, and therefore suggest that the next time that they buy strawberries they
might also want to buy cream.
Classification
It use to build up an idea of the type of customer, item, or object by describing
multiple attributes to identify a particular class. For example, you can easily classify cars
into different types (sedan, 4x4, convertible) by identifying different attributes (number
of seats, car shape, driven wheels). Given a new car, you might apply it into a particular
class by comparing the attributes with our known definition. You can apply the same
principles to customers, for example by classifying them by age and social group.
Clustering
Clustering is the task of segmenting a diverse group into a number of similar sub groups
or clusters.The distingiush clustering from classification is that clustering is not rely
on predefined classes.
The records are grouped together on the basis of self similarity.clustering is often done as
a prelude to some other form of datamining.
Prediction
4. Prediction is a wide topic and runs from predicting the failure of components or
machinery, to identifying fraud and even the prediction of company profits. Used in combination
with the other data mining techniques, prediction involves analyzing trends, classification,
pattern matching, and relation. By analyzing past events or instances, you can make a prediction
about an event.
Using the credit card authorization, for example, you might combine decision tree analysis of
individual past transactions with classification and historical pattern matches to identify whether
a transaction is fraudulent. Making a match between the purchase of flights to the US and
transactions in the US, it is likely that the transaction is valid.
Sequential patterns
Oftern used over longer-term data, sequential patterns are a useful method for identifying
trends, or regular occurrences of similar events. For example, with customer data you can
identify that customers buy a particular collection of products together at different times of the
year. In a shopping basket application, you can use this information to automatically suggest that
certain items be added to a basket based on their frequency and past purchasing history.
Decision trees
Related to most of the other techniques (primarily classification and prediction), the
decision tree can be used either as a part of the selection criteria, or to support the use and
selection of specific data within the overall structure. Within the decision tree, you start with a
simple question that has two (or sometimes more) answers. Each answer leads to a further
question to help classify or identify the data so that it can be categorized, or so that a prediction
can be made based on each answer.
Each of these approaches brings different advantages and disadvantages that need to be
considered prior to their use. Neural networks, which are difficult to implement, require all input
and resultant output to be expressed numerically, thus needing some sort of interpretation
depending on the nature of the data-mining exercise
5. The decisiontree techniqueisthe mostcommonlyused methodology, because it is simple and
straightforwardtoimplement.Finally,the nearest-neighbormethodreliesmore onlinkingsimilar items
and, therefore, works better for extrapolation rather than predictive enquiries.
A goodway to applyadvanceddataminingtechniquesis to have a flexible and interactive data
mining tool that is fully integrated with a database or data warehouse. Using a tool that operates
outside of the database or data warehouse is not as efficient
Regardlessof the technique used,the real value behind data mining is modeling — the process
of buildingamodel basedonuser-specifiedcriteriafromalreadycaptureddata.Once a model is built, it
can be used in similar situations where an answer is not known.
For example,an organization looking to acquire new customers can create a model of its ideal
customer that is based on existing data captured from people who previously purchased the product.
The model then is used to query data on prospective customers to see if they match the profile.
Modeling also can be used in audit departments to predict the number of auditors required to
undertake an audit plan based on previous attempts and similar work.