2. Data Mining and Warehousing
Mr. Sagar Pandya
Information Technology Department
sagar.pandya@medicaps.ac.in
Course Code: IT3ED02
Course Name: Data Mining and Warehousing
Hours Per Week (L-T-P): 3-0-0
Total Credits: 3
3. IT3ED02 Data Mining and Warehousing 3-0-0
Unit 1. Introduction
Unit 2. Data Mining
Unit 3. Association and Classification
Unit 4. Clustering
Unit 5. Business Analysis
4. Reference Books
Text Books
Han, Kamber and Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann,
India, 2012.
Mohammed Zaki and Wagner Meira Jr., Data Mining and Analysis:
Fundamental Concepts and Algorithms, Cambridge University Press.
Z. Markov and Daniel T. Larose, Data Mining the Web, John Wiley & Sons, USA.
Reference Books
Sam Anahory and Dennis Murray, Data Warehousing in the Real World,
Pearson Education Asia.
W. H. Inmon, Building the Data Warehouse, 4th Edition, Wiley India.
and many others
5. Unit-2 Data Mining
Basics of Data Mining,
Data mining techniques,
KDP (Knowledge Discovery Process),
Applications and Challenges of Data Mining,
Data Pre-processing: Overview,
Data cleaning, Data integration, Data reduction, Data transformation and
discretization.
6. Data Mining
There is a huge amount of data available in the Information Industry.
This data is of no use until it is converted into useful information.
It is necessary to analyze this huge amount of data and extract useful
information from it.
Extraction of information is not the only process we need to perform;
data mining also involves other processes such as Data Cleaning,
Data Integration, Data Transformation, Data Mining, Pattern
Evaluation and Data Presentation.
Data mining is also called knowledge discovery, knowledge
extraction, data/pattern analysis, information harvesting, etc.
7. Data Mining
Definition:- “The process of extracting previously unknown, valid
and actionable information from large databases and then using the
information to make crucial business decisions.”
“The Science of extracting useful information from large datasets or
databases.”
8. Data Mining
Data mining is looking for hidden, valid, and potentially useful
patterns in huge data sets.
In other words, we can say that data mining is the procedure of
mining knowledge from data.
Data Mining is all about discovering unsuspected/ previously
unknown relationships amongst the data.
It is a multi-disciplinary skill that uses machine learning, statistics,
AI and database technology.
The insights derived via Data Mining can be used for Market
Analysis, Fraud Detection, Customer Retention, Production Control,
Science Exploration etc.
9. Types of Data in Data Mining
Data mining can be performed on the following types of data:
1. Relational databases
2. Data warehouses
3. Advanced DB and information repositories
4. Object-oriented and object-relational databases
5. Transactional and Spatial databases
6. Heterogeneous and legacy databases
7. Multimedia and streaming database
8. Text databases
9. Text mining and Web mining
10. History of Data Mining
In the 1990s, the term "Data Mining" was introduced, but data
mining is the evolution of a sector with an extensive history.
Early techniques of identifying patterns in data include Bayes'
theorem (1700s) and the evolution of regression (1800s).
The growing power of computer technology has boosted data
collection, storage, and manipulation as data sets have grown in size
and complexity. Explicit, hands-on data investigation has
progressively been augmented with indirect, automatic data
processing and other computer science discoveries
such as neural networks, clustering, genetic algorithms (1950s),
decision trees (1960s), and support vector machines (1990s).
Data mining origins are traced back to three family lines: Classical
statistics, Artificial intelligence, and Machine learning.
11. History of Data Mining
13. Evolution of Data Mining
14. Applications of Data Mining
Application — Usage
Insurance — Data mining helps insurance companies price their
products profitably and promote new offers to their new or
existing customers.
Education — Data mining benefits educators: they can access student data,
predict achievement levels and find students or groups of
students who need extra attention, for example students
who are weak in mathematics.
Communications — Data mining techniques are used in the communication sector to
predict customer behavior and offer highly targeted and
relevant campaigns.
15. Applications of Data Mining
Application — Usage
Banking — Data mining helps the finance sector get a view of market risks
and manage regulatory compliance. It helps banks identify
probable defaulters when deciding whether to issue credit cards,
loans, etc.
Retail — Data mining techniques help retail malls and grocery stores
identify and arrange the most sellable items in the most attention-grabbing
positions. They help store owners come up with offers
that encourage customers to increase their spending.
Crime Investigation — Data mining helps crime investigation agencies deploy the police
workforce (where is a crime most likely to happen, and when?),
decide whom to search at a border crossing, etc.
16. Applications of Data Mining
Application — Usage
Bioinformatics — Data mining helps to mine biological data from the massive
datasets gathered in biology and medicine.
Service Providers — Service providers such as mobile phone and utility companies use
data mining to predict why and when a customer might leave
the company. They analyze billing details, customer service
interactions and complaints made to the company to assign each
customer a probability score and offer incentives.
E-Commerce — E-commerce websites use data mining to offer cross-sells and
up-sells through their websites. One of the most famous
names is Amazon, which uses data mining techniques to draw
more customers into its eCommerce store.
17. Basic Data Mining Task
The data mining tasks can be classified generally into two types
based on what a specific task tries to achieve. Those two categories
are descriptive tasks and predictive tasks.
The two “High Level” primary goals of data mining are prediction
and description.
Prediction involves using some variables or fields in the database to
predict unknown or future values of other variables of interest.
Descriptive tasks focus on finding human-interpretable patterns
that describe the data.
19. Basic Data Mining Task
a) Classification
Classification derives a model to determine the class of an object
based on its attributes.
A collection of records will be available, each record with a set of
attributes.
Classification can be used in direct marketing, that is to reduce
marketing costs by targeting a set of customers who are likely to buy
a new product.
Using the available data, it is possible to know which customers
purchased similar products and who did not purchase in the past.
Hence, {purchase, don’t purchase} decision forms the class attribute
in this case.
Once the class attribute is assigned, demographic and lifestyle
information of customers who purchased similar products can be
collected and promotion mails can be sent to them directly.
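As a hedged illustration of this task (the customer records, attribute name and class labels below are hypothetical, not course data), a minimal one-attribute classifier can be sketched in Python:

```python
from collections import Counter

def train_stump(records, attr, label="class"):
    """Learn, for each value of `attr`, the majority class seen in training."""
    by_value = {}
    for r in records:
        by_value.setdefault(r[attr], []).append(r[label])
    return {v: Counter(labels).most_common(1)[0][0]
            for v, labels in by_value.items()}

def classify(stump, record, attr, default="don't purchase"):
    """Assign the learned class; fall back to a default for unseen values."""
    return stump.get(record[attr], default)

# Hypothetical training records with a {purchase, don't purchase} class attribute
training = [
    {"age_group": "young",  "class": "purchase"},
    {"age_group": "young",  "class": "purchase"},
    {"age_group": "young",  "class": "don't purchase"},
    {"age_group": "senior", "class": "don't purchase"},
    {"age_group": "senior", "class": "don't purchase"},
]
model = train_stump(training, "age_group")
```

A real system would use a proper algorithm (decision trees, naive Bayes, etc.) over many attributes; the sketch only shows how a class attribute is learned from labeled records and then applied to new ones.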
20. Basic Data Mining Task
b) Prediction
Prediction task predicts the possible values of missing or future data.
Prediction involves developing a model based on the available data
and this model is used in predicting future values of a new data set of
interest.
For example, a model can predict the income of an employee based
on education, experience and other demographic factors like place of
stay, gender etc.
Also prediction analysis is used in different areas including medical
diagnosis, fraud detection etc.
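To make the income example concrete, here is a minimal least-squares sketch in Python; the experience and income figures are invented for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Hypothetical data: years of experience vs annual income (in thousands)
experience = [1, 2, 3, 4, 5]
income     = [30, 35, 40, 45, 50]

a, b = fit_line(experience, income)
predicted = a * 6 + b   # predict income for 6 years of experience
```

The model is deliberately a single-variable line; a realistic predictor would add the other demographic factors mentioned above as further inputs.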
21. Basic Data Mining Task
c) Time-Series Analysis
Time series is a sequence of events where the next event is
determined by one or more of the preceding events.
Time series reflects the process being measured and there are certain
components that affect the behavior of a process.
Time series analysis includes methods to analyze time-series data in
order to extract useful patterns, trends, rules and statistics.
Stock market prediction is an important application of time-series
analysis.
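One of the simplest ways to extract a trend from a time series is a moving average; the price series below is hypothetical:

```python
def moving_average(series, window):
    """Smooth a time series by averaging each run of `window` values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Hypothetical daily closing prices
prices = [10, 12, 11, 13, 15, 14, 16]
trend = moving_average(prices, window=3)  # [11.0, 12.0, 13.0, 14.0, 15.0]
```

Smoothing exposes the upward trend hidden under day-to-day noise; real time-series methods (ARIMA, exponential smoothing, etc.) build on this same idea of separating trend from noise.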
22. Basic Data Mining Task
d) Association
Association discovers the association or connection among a set of
items.
Association identifies the relationships between objects.
Association analysis is used for commodity management,
advertising, catalog design, direct marketing etc.
A retailer can identify the products that normally customers purchase
together or even find the customers who respond to the promotion of
same kind of products.
If a retailer finds that surf (detergent) and soap are mostly bought
together, the soap can be put on sale to promote the sale of surf.
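The surf-and-soap example can be quantified with the standard support and confidence measures; the baskets below are hypothetical:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Hypothetical market-basket transactions
baskets = [
    {"surf", "soap", "bread"},
    {"surf", "soap"},
    {"surf", "milk"},
    {"bread", "milk"},
]
s = support(baskets, {"surf", "soap"})       # {surf, soap} appears in 2 of 4 baskets
c = confidence(baskets, {"surf"}, {"soap"})  # of baskets with surf, how many have soap
```

Algorithms like Apriori scale these same two measures to millions of baskets by pruning itemsets whose support is already too low.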
23. Basic Data Mining Task
e) Clustering
Clustering is used to identify data objects that are similar to one
another.
The similarity can be decided based on a number of factors like
purchase behavior, responsiveness to certain actions, geographical
locations and so on.
For example, an insurance company can cluster its customers based
on age, residence, income etc.
This group information will be helpful to understand the customers
better and hence provide better customized services.
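A minimal sketch of the clustering idea, assuming a single numeric attribute (customer age) and invented data, is a tiny one-dimensional k-means:

```python
def kmeans_1d(points, centers, iters=10):
    """Minimal 1-D k-means: assign each point to its nearest center, recenter, repeat."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[j].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical customer ages forming two natural groups
ages = [21, 23, 25, 60, 62, 64]
centers, clusters = kmeans_1d(ages, centers=[20, 70])
```

An insurance company would run the same procedure over several attributes at once (age, income, residence encoded numerically), but the assign-then-recenter loop is identical.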
24. Basic Data Mining Task
f) Summarization
Summarization is the generalization of data.
A set of relevant data is summarized, resulting in a smaller set that
gives aggregated information about the data.
For example, the shopping done by a customer can be summarized
into total products, total spending, offers used, etc.
Such high-level summarized information can be useful to sales or
customer relationship teams for detailed analysis of customers and
purchase behavior.
Data can be summarized in different abstraction levels and from
different angles.
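A small Python sketch of this kind of per-customer summarization, using an invented purchase log:

```python
from collections import defaultdict

# Hypothetical purchase log: (customer, amount spent)
purchases = [("alice", 20.0), ("bob", 15.0), ("alice", 30.0), ("alice", 5.0)]

# Roll the raw transactions up into one aggregate record per customer
summary = defaultdict(lambda: {"total_products": 0, "total_spending": 0.0})
for customer, amount in purchases:
    summary[customer]["total_products"] += 1
    summary[customer]["total_spending"] += amount
```

Changing the grouping key (month, store, product category instead of customer) gives the "different abstraction levels and different angles" described above.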
25. Data Mining Architecture
Data Mining refers to the detection and extraction of new patterns
from the already collected data.
Data mining architecture has many elements like Data Mining
Engine, Pattern evaluation, Data Warehouse, User Interface and
Knowledge Base.
Each and every component of the data mining technique and
architecture has its own way of performing responsibilities and also
in completing data mining efficiently.
The different modules need to interact correctly so as to
produce a valuable result and complete the complex procedure of
data mining successfully by providing the right set of information to
the business.
27. Data Mining Architecture
1. Data Sources
The actual data sources are a huge variety of existing repositories
such as databases, data warehouses, and the World Wide Web (WWW).
Often the data is not present in any of these golden sources but only
in the form of text files, plain files, sequence files or spreadsheets;
it then needs to be processed in much the same way as data
received from the golden sources.
A major chunk of data today is received from the internet, as
everything present on the internet is data in some form or another,
forming information repository units.
28. Data Mining Architecture
Before the data is processed further, it goes through data cleansing,
integration, and selection, and is finally passed on to the database or
an EDW (enterprise data warehouse) server.
A major challenge with this data is the different levels of sources and
the wide array of data formats involved. Therefore the data cannot be
directly used for processing in its naive state, but must be processed,
transformed and crafted into a much more usable form.
This way, the reliability and completeness of the data are also
ensured. So, the primary step involves data collection, cleaning and
integration, and only after that is the relevant data passed forward.
All this activity is handled by a separate set of tools and techniques.
29. Data Mining Architecture
2. Data Warehouse Server or Database
The database server is the actual space where the data is contained
once it is received from the various number of data sources.
The server contains the actual set of data which becomes ready to be
processed and therefore the server manages the data retrieval.
All this activity is based on the user's request for data mining.
30. Data Mining Architecture
3. Data Mining Engine
The Data Mining Engine is the core component of the data mining process.
It is the most vital part, the driving force that handles all the
requests, manages them, and contains a number of modules.
The modules cover mining tasks such as classification, association,
regression, characterization, prediction and clustering, time-series
analysis, naive Bayes, support vector machines, ensemble methods,
boosting and bagging techniques, random forests, decision trees, etc.
In other words, the data mining engine is the root of the data mining
architecture.
31. Data Mining Architecture
4. Pattern Evaluation Modules
They are responsible for finding interesting patterns in the data and
sometimes they also interact with the database servers for producing
the result of the user requests.
All in all, the main purpose of this component is to search for all
the interesting and usable patterns that could make the data of
comparatively better quality.
Pattern Evaluation is responsible for finding various patterns with the
help of Data Mining Engine.
32. Data Mining Architecture
5. Graphical User Interface
The graphical user interface (GUI) module communicates between
the data mining system and the user.
This module helps the user to easily and efficiently use the system
without knowing the complexity of the process.
This module cooperates with the data mining system when the user
specifies a query or a task and displays the results.
33. Data Mining Architecture
6. Knowledge Base
The knowledge base is helpful in the entire process of data mining.
It may help to guide the search or to evaluate the interestingness of
the resulting patterns.
The knowledge base may even contain user views and data from user
experiences that might be helpful in the data mining process.
The data mining engine may receive inputs from the knowledge base
to make the result more accurate and reliable.
The pattern assessment module regularly interacts with the
knowledge base to get inputs, and also update it.
34. Types of Data Mining Architecture
1. No Coupling:
The no-coupling data mining architecture retrieves data from
particular data sources.
It does not use the database for retrieving the data, which would
otherwise be quite an efficient and accurate way to do the same.
The no-coupling architecture is poor and only used for performing
very simple data mining processes.
2. Loose Coupling:
In the loose coupling architecture, the data mining system retrieves
data from the database and stores the data in its own systems.
This architecture suits memory-based data mining.
35. Types of Data Mining Architecture
3. Semi Tight Coupling:
It tends to use various advantageous features of the data warehouse
systems.
It includes sorting, indexing, aggregation.
In this architecture, an intermediate result can be stored in the
database for better performance.
4. Tight Coupling:
In this architecture, the data warehouse is considered one of its
most important components, whose features are employed for
performing data mining tasks.
This architecture provides scalability, performance, and integrated
information.
36. Advantages of Data Mining
Assists in preventing future adversities by accurately predicting
future trends.
Contributes to the making of important decisions.
Compresses data into valuable information.
Provides new trends and unexpected patterns.
Helps to analyze huge data sets.
Aids companies to find, attract and retain customers.
Helps the company to improve its relationship with the customers.
Assists companies to optimize their production according to the
likability of a certain product, thus saving costs for the company.
37. Disadvantages of Data Mining
Excessive work intensity requires high-performance teams and staff
training.
The requirement of large investments can also be considered as a
problem as sometimes data collection consumes many resources that
suppose a high cost.
Lack of security could also put the data at huge risk, as the data may
contain private customer details.
Inaccurate data may lead to the wrong output.
Huge databases are quite difficult to manage.
38. DIFFERENT TYPES OF KNOWLEDGE
Knowledge is a collection of interesting and useful patterns in a
database. The key issue in Knowledge Discovery in Databases is to
realize that there is more information hidden in your data than you
are able to distinguish at first sight. In data mining we distinguish
four different types of knowledge.
Shallow Knowledge: This is information that can be easily retrieved
from the database using a query tool such as Structured Query Language
(SQL).
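To illustrate, shallow knowledge is exactly what a single SQL query can return. The table and rows below are hypothetical; Python's built-in sqlite3 module is used here only as a convenient SQL engine:

```python
import sqlite3

# Build a tiny hypothetical sales table in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("alice", 120.0), ("bob", 80.0), ("carol", 200.0)])

# Shallow knowledge: a fact directly retrievable with one query
big_spenders = [row[0] for row in conn.execute(
    "SELECT customer FROM sales WHERE amount > 100 ORDER BY customer")]
```

The query only retrieves what is explicitly stored; the deeper types of knowledge below are precisely those that no single SELECT statement can surface.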
Multi-Dimensional Knowledge: This is information that can be
analyzed using online analytical processing (OLAP) tools. With OLAP
tools you have the ability to rapidly explore all sorts of clusterings
and different orderings of the data, but it is important to realize that
most of the things you can do with an OLAP tool can also be done
using SQL.
39. DIFFERENT TYPES OF KNOWLEDGE
The advantage of OLAP tools is that they are optimized for these
kinds of search and analysis operations.
However, OLAP is not as powerful as data mining; it cannot search
for optimal solutions.
Hidden Knowledge: This is data that can be found relatively easily by
using pattern recognition or machine learning algorithms. Again, one
could use SQL to find these patterns, but this would probably prove
extremely time-consuming.
A pattern recognition algorithm could find regularities in a database
in minutes or at most a couple of hours, whereas you would have to
spend months using SQL to achieve the same result. This is
information that can be obtained through data mining techniques.
40. DIFFERENT TYPES OF KNOWLEDGE
Deep Knowledge This is information that is stored in the database
but can only be located if we have a clue that tells us where to look.
Different Types of Knowledge and Techniques:
41. Knowledge Discovery Process (KDP)
Data mining is the core part of the knowledge discovery process.
KDP is the process of finding knowledge in data; it does this by using data mining
methods (algorithms) to extract the desired knowledge from large
amounts of data.
Data mining is also known as Knowledge Discovery in Databases (KDD).
Here is the list of steps involved in the knowledge discovery process:
1.) Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data
from the collection. It covers cleaning in the case of missing values,
cleaning noisy data (where noise is a random or variance error), and
cleaning with data discrepancy detection and data transformation tools.
A parser decides whether a given string of data is acceptable within the data
specification.
42. Knowledge Discovery Process (KDP)
2.) Data Integration: Data integration is defined as combining heterogeneous data
from multiple sources into a common source (data warehouse),
using data migration tools,
using data synchronization tools, or
using the ETL (Extract-Transform-Load) process.
3.) Data Selection: Data selection is defined as the process where data
relevant to the analysis is decided upon and retrieved from the data
collection, for example
using decision trees,
using naive Bayes,
using neural networks, or
using clustering, regression, etc.
44. Knowledge Discovery Process (KDP)
4.) Data Transformation:
Data transformation is defined as the process of transforming data
into the appropriate form required by the mining procedure.
Data transformation is a two-step process:
Data mapping: assigning elements from the source base to the destination
to capture transformations.
Code generation: creation of the actual transformation program.
5.) Data Mining:
Data mining is defined as the application of clever techniques to
extract potentially useful patterns.
It transforms task-relevant data into patterns, and
decides the purpose of the model, using classification or characterization.
45. Knowledge Discovery Process (KDP)
6.) Pattern Evaluation: Pattern evaluation is defined as identifying
strictly increasing patterns representing knowledge based on given
measures. It
finds an interestingness score for each pattern, and
uses summarization and visualization to make the data
understandable by the user.
7.) Knowledge Representation: Knowledge representation is defined as
the technique which utilizes visualization tools to represent data mining
results. It
generates reports,
generates tables, and
generates discriminant rules, classification
rules, characterization rules, etc.
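The steps above can be sketched end-to-end on a toy record set. All names and figures below are invented, and each step is a deliberately trivial stand-in for the real techniques:

```python
# Raw records as they might arrive from two sources
raw = [
    {"name": "alice", "age": 34,   "spend": 120.0},
    {"name": "alice", "age": 34,   "spend": 120.0},   # duplicate from a second source
    {"name": "bob",   "age": 45,   "spend": 80.0},
    {"name": "carol", "age": None, "spend": 50.0},    # incomplete record
]

# 1-2. Cleaning + integration: drop incomplete records, then deduplicate
cleaned = [r for r in raw if all(v is not None for v in r.values())]
integrated = [dict(t) for t in {tuple(sorted(r.items())) for r in cleaned}]

# 3-4. Selection + transformation: keep the spend attribute and discretize it
spends = [r["spend"] for r in integrated]
labels = ["high" if s > 100 else "low" for s in spends]

# 5-6. Mining + evaluation: a trivial "pattern" (share of high spenders)
pattern = labels.count("high") / len(labels)

# 7. Knowledge representation: a one-line report
report = f"{pattern:.0%} of customers are high spenders"
```

Each stage here is one line where a real pipeline would use dedicated tools (ETL, parsers, mining algorithms), but the order of operations mirrors the seven KDP steps.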
47. Data Mining Issues
Nowadays, data mining and knowledge discovery are evolving into a
crucial technology for business and researchers in many domains.
Data mining is not an easy task, as the algorithms used can get very
complex and data is not always available at one place.
It needs to be integrated from various heterogeneous data sources.
These factors also create some issues.
Here, we will discuss the major issues regarding −
1. Mining Methodology and User Interaction
2. Performance Issues
3. Diverse Data Types Issues
49. Mining Methodology and User Interaction Issues
Mining different kinds of knowledge in databases −
Different users may be interested in different kinds of knowledge.
Therefore it is necessary for data mining to cover a broad range of
knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of
abstraction −
The data mining process needs to be interactive because it allows
users to focus the search for patterns, providing and refining data
mining requests based on the returned results.
50. Mining Methodology and User Interaction Issues
Incorporation of background knowledge −
To guide discovery process and to express the discovered patterns,
the background knowledge can be used.
Background knowledge may be used to express the discovered
patterns not only in concise terms but at multiple levels of
abstraction.
Data mining query languages and ad hoc data mining −
Data Mining Query language that allows the user to describe ad hoc
mining tasks, should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
51. Mining Methodology and User Interaction Issues
Presentation and visualization of data mining results −
Once the patterns are discovered, they need to be expressed in
high-level languages and visual representations. These representations
should be easily understandable.
Handling noisy or incomplete data −
Data cleaning methods are required to handle the noise and
incomplete objects while mining the data regularities. If data
cleaning methods are not used, the accuracy of the discovered
patterns will be poor.
Pattern evaluation −
The patterns discovered should be interesting; patterns that merely
represent common knowledge or lack novelty are of little value.
52. Performance Issues
Efficiency and scalability of data mining algorithms − In order to
effectively extract information from the huge amounts of data in
databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − The
factors such as huge size of databases, wide distribution of data, and
complexity of data mining methods motivate the development of
parallel and distributed data mining algorithms.
These algorithms divide the data into partitions, which are further
processed in a parallel fashion. Then the results from the partitions
are merged. Incremental algorithms update databases without
mining the data again from scratch.
53. Diverse Data Types Issues
Handling of relational and complex types of data −
The database may contain complex data objects, multimedia data
objects, spatial data, temporal data, etc. It is not possible for one
system to mine all these kinds of data.
Mining information from heterogeneous databases and global
information systems −
The data is available at different data sources on a LAN or WAN.
These data sources may be structured, semi-structured or unstructured.
Therefore mining knowledge from them adds challenges to data
mining.
55. Data Mining Challenges
Although data mining is very powerful, it faces many challenges
during its execution. Various challenges could be related to
performance, data, methods, and techniques, etc. The process of data
mining becomes effective when the challenges or problems are
correctly recognized and adequately resolved.
Incomplete and noisy data
Data Distribution
Complex Data
Performance
Data Privacy and Security
Data Visualization
56. Data Mining Challenges
1.) Incomplete and noisy data:
The process of extracting useful data from large volumes of data is
data mining.
The data in the real-world is heterogeneous, incomplete, and noisy.
Data in huge quantities will usually be inaccurate or unreliable.
These problems may occur due to faulty measuring instruments or
human errors.
Suppose a retail chain collects the phone numbers of customers who
spend more than $500, and the accounting employees put the
information into their system.
A person may make a digit mistake when entering a phone number,
which results in incorrect data.
57. Data Mining Challenges
Even some customers may not be willing to disclose their phone
numbers, which results in incomplete data.
The data could also get changed due to human or system error.
All these consequences (noisy and incomplete data) make data
mining challenging.
2.) Data Distribution:
Real-world data is usually stored on various platforms in a
distributed computing environment.
It might be in databases, individual systems, or even on the internet.
Practically, it is quite a tough task to bring all the data into a
centralized data repository, mainly due to organizational and
technical concerns.
58. Data Mining Challenges
For example, various regional offices may have their own servers to
store their data, and it is not feasible to store all the data from all the
offices on a central server. Therefore, data mining requires the
development of tools and algorithms that allow the mining of
distributed data.
3.) Complex Data:
Real-world data is heterogeneous, and it could be multimedia data,
including audio and video, images, complex data, spatial data, time
series, and so on.
Managing these various types of data and extracting useful
information is a tough task.
Most of the time, new technologies, new tools, and methodologies
would have to be refined to obtain specific information.
59. Data Mining Challenges
4.) Performance:
The data mining system's performance relies primarily on the
efficiency of algorithms and techniques used.
If the designed algorithm and techniques are not up to the mark, then
the efficiency of the data mining process will be affected adversely.
5.) Data Privacy and Security:
Data mining usually leads to serious problems in terms of data
security, governance, and privacy.
For example, if a retailer analyzes the details of the purchased items,
then it reveals data about buying habits and preferences of the
customers without their permission.
60. Data Mining Challenges
6.) Data Visualization:
In data mining, data visualization is a very important process because
it is the primary method that shows the output to the user in a
presentable way.
The extracted data should convey the exact meaning of what it
intends to express.
But many times, representing the information to the end-user in a
precise and easy way is difficult.
Because the input data and the output information are complicated,
very efficient and successful data visualization processes need to be
implemented for the presentation to work.
61. Data Mining vs Data Warehousing
S.no. | Data Mining | Data Warehousing
1 | Data mining is a method of comparing large amounts of data to find the right patterns. | Data warehousing is a method of centralizing data from different sources into one common repository.
2 | Data mining is the process of determining data patterns. | A data warehouse is a database system designed for analytics.
3 | In data mining, data is analyzed repeatedly. | In data warehousing, data is stored periodically.
4 | Data mining uses pattern recognition techniques to identify patterns. | Data warehousing is the process of extracting and storing data to allow easier reporting.
62. Data Mining vs Data Warehousing
S.no. | Data Mining | Data Warehousing
5 | Data mining helps to create suggestive patterns of important factors, like the buying habits of customers and products. | A data warehouse adds extra value to operational business systems such as CRM systems when the warehouse is integrated.
6 | After successful initial queries, users may ask more complicated queries, which would increase the workload. | A data warehouse is complicated to implement and maintain.
7 | Data mining techniques are never 100% accurate and may cause serious consequences in certain conditions. | In the data warehouse, there is a great chance that the data required for analysis by the organization may not be integrated into the warehouse. This can easily lead to loss of information.
63. Alternative names for Data Mining:
1. Knowledge discovery (mining) in databases (KDD)
2. Knowledge extraction
3. Data/pattern analysis
4. Data archeology
5. Data dredging
6. Information harvesting
7. Business intelligence
64. Data Mining Implementation Process
Many different sectors are taking advantage of data mining to boost
their business efficiency, including manufacturing, chemical,
marketing, aerospace, etc.
Therefore, the need for a standard, effective data mining process has grown.
Data mining is described as a process of finding hidden precious data
by evaluating the huge quantity of information stored in data
warehouses, using multiple data mining techniques such as Artificial
Intelligence (AI), Machine learning and statistics.
65. The Cross-Industry Standard Process for Data
Mining (CRISP-DM)
The Cross-Industry Standard Process for Data Mining (CRISP-DM) comprises six phases, designed as a cyclical method as shown in the given figure:
67. Data Mining Implementation Process
1. Business understanding:
It focuses on understanding the project goals and requirements from a business point of view, then converting this information into a data mining problem definition, and afterwards designing a preliminary plan to accomplish the targets.
Tasks:
• Determine business objectives
• Assess situation
• Determine data mining goals
• Produce a project plan
68. Data Mining Implementation Process
First, you need to understand business and client objectives.
You need to define what your client wants (which many times even
they do not know themselves)
Take stock of the current data mining scenario.
Factor in resources, assumption, constraints, and other significant
factors into your assessment.
Using business objectives and current scenario, define your data
mining goals.
A good data mining plan is very detailed and should be developed to
accomplish both business and data mining goals.
69. Data Mining Implementation Process
Determine business objectives:
• Understand the project targets and prerequisites from a business point of view.
• Carefully understand what the customer wants to achieve.
• Reveal significant factors at the start that can impact the outcome of the project.
Assess situation:
• This requires a more detailed analysis of facts about all the resources, constraints, assumptions, and other factors that ought to be considered.
70. Data Mining Implementation Process
Determine data mining goals:
• A business goal states the target in business terminology. For example, increase catalog sales to existing customers.
• A data mining goal describes the project objectives in technical terms. For example, predict how many items a customer will buy, given their demographic details (age, salary, and city) and the price of the item over the past three years.
Produce a project plan:
• It states the intended plan to accomplish the business and data mining goals.
• The project plan should define the expected set of steps to be performed during the rest of the project, including the initial selection of techniques and tools.
71. Data Mining Implementation Process
2. Data Understanding:
Data understanding starts with the initial data collection and proceeds with activities to get familiar with the data, identify data quality issues, discover first insights into the data, or detect interesting subsets that suggest hypotheses about hidden information.
Tasks:
• Collects initial data
• Describe data
• Explore data
• Verify data quality
72. Data Mining Implementation Process
First, data is collected from multiple data sources available in the
organization.
These data sources may include multiple databases, flat files, or data cubes.
Issues like object matching and schema integration can arise during the data integration process.
It is a quite complex and tricky process, as data from various sources are unlikely to match easily.
For example, table A contains an attribute named cust_no whereas another table B contains an attribute named cust-id.
73. Data Mining Implementation Process
Therefore, it is quite difficult to ensure that both of these given objects refer to the same entity.
Here, Metadata should be used to reduce errors in the data
integration process.
The next step is to explore the properties of the acquired data.
A good way to explore the data is to answer the data mining
questions (decided in business phase) using the query, reporting, and
visualization tools.
Based on the results of the queries, the data quality should be ascertained.
Any missing data should be acquired.
74. Data Mining Implementation Process
Collect initial data:
• Acquire the data listed in the project resources.
• This includes data loading if necessary for data understanding.
• It may lead to initial data preparation steps.
• If multiple data sources are acquired, integration is an additional issue, either here or at the subsequent data preparation stage.
Describe data:
• Examine the "gross" or "surface" characteristics of the acquired data.
• Report on the results.
Verify data quality:
• Examine the quality of the data and address quality questions.
75. Data Mining Implementation Process
Explore data:
• Address data mining questions that can be resolved by querying, visualizing, and reporting, including:
• distribution of key attributes and results of simple aggregations;
• relationships between small numbers of attributes;
• properties of significant sub-populations and simple statistical analyses.
• This may refine the data mining objectives.
• It may contribute to or refine the data description and quality reports.
• It may feed into the transformation and other necessary data preparation steps.
76. Data Mining Implementation Process
3. Data Preparation:
• It usually takes more than 90 percent of the project time.
• It covers all activities needed to construct the final data set from the initial raw data.
• Data preparation is likely to be performed several times and not in any prescribed order.
Tasks:
• Select data
• Clean data
• Construct data
• Integrate data
• Format data
77. Data Mining Implementation Process
Data transformation operations contribute toward the success of the mining process.
Smoothing: helps to remove noise from the data.
Aggregation: summary or aggregation operations are applied to the data, e.g., weekly sales data is aggregated to calculate monthly and yearly totals.
Generalization: low-level data is replaced by higher-level concepts with the help of concept hierarchies. For example, city is replaced by country.
Normalization: performed when the attribute data are scaled up or down. Example: data should fall in the range -2.0 to 2.0 after normalization.
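The smoothing, aggregation, and normalization operations above can be sketched in plain Python. The weekly sales figures here are invented for illustration, and min-max scaling is used as one common way to reach the -2.0 to 2.0 range:

```python
# Illustrative sketch of three data transformation operations;
# the sales figures are made-up sample data.

def moving_average(values, window=3):
    """Smoothing: replace each point with the mean of a sliding window."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def min_max_normalize(values, new_min=-2.0, new_max=2.0):
    """Normalization: rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

weekly_sales = [120, 135, 128, 150, 142, 160]

# Aggregation: four weekly figures summed into a monthly total.
monthly_total = sum(weekly_sales[:4])

print(moving_average(weekly_sales))      # smoothed series
print(min_max_normalize(weekly_sales))   # values in [-2.0, 2.0]
print(monthly_total)                     # 533
```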
78. Data Mining Implementation Process
Select data:
• Decide which data will be used for the analysis.
• Data selection criteria include relevance to the data mining objectives, quality, and technical constraints such as limits on data volume or data types.
• It covers the selection of attributes (columns) as well as records (rows) in a table.
Clean data:
• This may involve selecting clean subsets of the data, inserting suitable defaults, or more ambitious methods such as estimating missing data by modeling.
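The two cleaning strategies mentioned, inserting defaults versus estimating missing values, can be sketched as follows; the age column and the choice of the column mean as the estimate are illustrative assumptions:

```python
# Two cleaning strategies for missing values (marked here as None).
ages = [25, None, 31, 40, None, 28]

# Strategy 1: insert an appropriate default (here, 0).
with_default = [a if a is not None else 0 for a in ages]

# Strategy 2: estimate missing values from the rest of the column,
# using the mean of the known entries as a simple model.
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)
with_estimate = [a if a is not None else mean_age for a in ages]

print(with_default)
print(with_estimate)
```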
79. Data Mining Implementation Process
Construct data:
• This comprises constructive data preparation operations, such as generating derived attributes, entire new records, or transformed values for existing attributes.
Integrate data:
• This refers to the methods whereby data is combined from multiple tables or records to create new records or values.
Format data:
• Formatting refers mainly to syntactic modifications made to the data that do not change its meaning but may be required by the modeling tool.
80. Data Mining Implementation Process
4. Modeling:
In modeling, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values.
Some techniques have specific requirements on the form of the data.
Therefore, stepping back to the data preparation phase may be necessary.
Tasks:
• Select modeling technique
• Generate test design
• Build model
• Assess model
81. Data Mining Implementation Process
Select modeling technique:
• Select the actual modeling technique to be used, for example, decision trees or neural networks.
• If multiple techniques are applied, perform this task separately for each technique.
Generate test Design:
• Generate a procedure or mechanism for testing the model's validity and quality before the model is built.
• For example, in classification, error rates are commonly used as quality measures for data mining models. Therefore, the data set is typically separated into train and test sets; the model is built on the train set and its quality assessed on the separate test set.
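The train/test design above can be sketched with a deliberately trivial "model" (predicting the majority class), so the focus stays on the split and the error-rate measurement; the data, the 70/30 split, and the model are all illustrative assumptions:

```python
import random

# Sketch of a test design: split labeled data, "train" a trivial
# majority-class model, and measure its error rate on the test set.
random.seed(42)
data = [(x, "high" if x > 50 else "low") for x in range(100)]
random.shuffle(data)

split = int(0.7 * len(data))          # 70% train, 30% test
train, test = data[:split], data[split:]

# "Model": predict the most frequent class seen in training.
labels = [label for _, label in train]
majority = max(set(labels), key=labels.count)

errors = sum(1 for _, label in test if label != majority)
error_rate = errors / len(test)
print(f"majority class: {majority}, test error rate: {error_rate:.2f}")
```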
82. Data Mining Implementation Process
Build model:
• To create one or more models, we need to run the modeling tool on
the prepared data set.
Assess model:
• Interpret the models according to domain expertise, the data mining success criteria, and the desired design.
• Assess the success of the modeling and discovery on technical grounds.
• Contact business analysts and domain specialists afterwards to discuss the data mining outcomes in the business context.
83. Data Mining Implementation Process
5. Evaluation:
• This phase evaluates the model thoroughly and reviews the steps executed to build it, to ensure that the business objectives are properly achieved. The main objective of the evaluation is to determine whether any significant business issue has not been considered adequately. At the end of this phase, a decision on the use of the data mining results should be reached.
Tasks:
• Evaluate results
• Review process
• Determine next steps
84. Data Mining Implementation Process
Evaluate results:
• Assess the degree to which the model meets the organization's business objectives.
• Test the model on trial applications in the actual deployment when time and budget constraints permit, and assess any other data mining results produced.
• This unveils additional challenges, suggestions, or information for future directions.
Review process:
Review process:
• The review process performs a more detailed evaluation of the data mining engagement to determine whether any significant factor or task has somehow been overlooked. It also reviews quality assurance issues.
85. Data Mining Implementation Process
Determine next steps:
• Decide how to proceed at this stage.
• Decide whether to finish the project and move on to deployment, to initiate further iterations, or to set up new data mining projects. This includes analysis of the remaining resources and budget, which influences the decision.
• A go or no-go decision is taken to move the model into the deployment phase.
86. Data Mining Implementation Process
6. Deployment:
• Deployment refers to how the outcomes need to be utilized.
Deploy data mining results by:
• scoring a database, utilizing the results as company guidelines, or interactive internet scoring.
• The knowledge acquired must be organized and presented in a way that the client can use. Depending on the requirements, the deployment phase may be as simple as generating a report or as complex as implementing a repeatable data mining process across the organization.
87. Data Mining Implementation Process
A final project report is created with lessons learned and key
experiences during the project. This helps to improve the
organization's business policy.
Tasks:
• Plan deployment
• Plan monitoring and maintenance
• Produce final report
• Review project
Plan deployment:
• To deploy the data mining results into the business, take the evaluation results and develop a deployment strategy.
• This includes documenting the process for later deployment.
88. Data Mining Implementation Process
Plan monitoring and maintenance:
• It is important when the data mining results become part of the day-to-day business and its environment.
• It helps to avoid unnecessarily long periods of misuse of data mining results, and it needs a detailed analysis of the monitoring process.
Produce final report:
• A final report is drawn up by the project leader and the team.
• It may be only a summary of the project and its experiences.
• It may be a final and comprehensive presentation of the data mining results.
Review project:
• Review what went right and what went wrong, what was done well, and what needs to be improved.
89. Data Mining Techniques
One of the most important tasks in data mining is to select the correct data mining technique.
The technique has to be chosen based on the type of business and the type of problem your business faces.
A generalized approach has to be used to improve the accuracy and cost-effectiveness of data mining.
There are basically seven main data mining techniques, which are discussed below.
There are also many other data mining techniques, but these seven are the ones used most frequently by business people:
• Statistics, Clustering, Visualization, Decision Trees, Association Rules, Neural Networks, Classification.
91. Data Mining Techniques
1. Classification:
This technique is used to obtain important and relevant information about data and metadata.
It helps to classify data into different classes.
Classification is the most commonly used data mining technique; it uses a set of pre-classified samples to create a model that can classify a large set of data.
There are two main processes involved in this technique:
• Learning – the data are analyzed by the classification algorithm.
• Classification – test data are used to measure the precision of the classification rules.
92. Data Mining Techniques
Data mining techniques can be classified by different criteria, as follows:
i. Classification of data mining frameworks by the type of data sources mined:
This classification is according to the type of data handled, for example, multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks by the database involved:
This classification is based on the data model involved, for example, object-oriented, transactional, or relational databases.
93. Data Mining Techniques
iii. Classification of data mining frameworks by the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities, for example, discrimination, classification, clustering, characterization, etc.
Some frameworks are comprehensive, offering several data mining functionalities together.
iv. Classification of data mining frameworks by the data mining techniques used:
This classification is according to the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented approaches, etc.
94. Data Mining Techniques
There are different types of classification models. They are as
follows
• Classification by decision tree induction
• Bayesian Classification
• Neural Networks
• Support Vector Machines (SVM)
• Classification Based on Associations
95. Data Mining Techniques
• Learning step (training phase): a classification algorithm builds the classifier by analyzing a training set.
• Classification step: test data are used to estimate the accuracy or precision of the classification rules.
For example, a bank may use classification to identify loan applicants at low, medium, or high credit risk. Similarly, a medical researcher may analyze cancer data to predict which medicine to prescribe to a patient.
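The two steps can be sketched with a toy threshold rule learned from invented credit-score data; the scores, labels, and the midpoint-between-class-means rule are illustrative assumptions, not a real credit model:

```python
# Learning step: derive a threshold rule from labeled training tuples.
# Classification step: measure accuracy on held-out test tuples.
train = [(620, "high"), (650, "high"), (700, "low"), (720, "low")]
test = [(640, "high"), (710, "low"), (690, "low")]

high = [s for s, c in train if c == "high"]
low = [s for s, c in train if c == "low"]
threshold = (sum(high) / len(high) + sum(low) / len(low)) / 2

def classify(score):
    """Scores below the learned threshold are classed as high risk."""
    return "high" if score < threshold else "low"

correct = sum(1 for s, c in test if classify(s) == c)
print(f"threshold={threshold}, accuracy={correct / len(test):.2f}")
```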
96. Data Mining Techniques
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database tuples and their
associated class labels.
97. Data Mining Techniques
In this step, the classifier is used for classification. Here the test data is used to
estimate the accuracy of classification rules. The classification rules can be applied to
the new data tuples if the accuracy is considered acceptable.
98. Data Mining Techniques
2. Clustering Technique
Clustering is one of the oldest techniques used in data mining.
Cluster analysis is the process of identifying data items that are similar to each other.
This helps to understand the differences and similarities between the data.
Clustering is sometimes called segmentation and helps users understand what is going on within the database.
For example, an insurance company can group its customers based on their income, age, nature of policy, and type of claims.
Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
99. Data Mining Techniques
There are different types of clustering methods. They are as follows
• Partitioning Methods
• Hierarchical Agglomerative methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Methods
The most popular clustering algorithm is the Nearest Neighbour. In
business, the Nearest Neighbour technique is most often used in the
process of Text Retrieval.
They are used to find documents that share important characteristics with a main document that has been marked as interesting.
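The income-based customer grouping mentioned above can be sketched with a minimal k-means-style loop (k = 2); the income values, the naive initial centroids, and the fixed number of passes are illustrative assumptions:

```python
# Minimal k-means sketch (k=2) grouping customers by income.
incomes = [21, 23, 25, 70, 72, 75]
centroids = [incomes[0], incomes[-1]]   # naive initial centroids

for _ in range(10):                     # a few refinement passes
    clusters = {0: [], 1: []}
    for x in incomes:
        # Assign each customer to the nearest centroid.
        nearest = min((0, 1), key=lambda c: abs(x - centroids[c]))
        clusters[nearest].append(x)
    # Recompute each centroid as the mean of its cluster.
    centroids = [sum(clusters[c]) / len(clusters[c]) for c in (0, 1)]

print(centroids)   # one centroid per income segment
print(clusters)
```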
100. Data Mining Techniques
A similar example of loan applicants can be considered here also. There
are some differences that are depicted in the figure below.
101. Data Mining Techniques
3. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship between variables.
It is used to identify the likelihood of a specific variable, given the presence of other variables.
Regression is primarily a form of planning and modeling.
102. Data Mining Techniques
For example, we might use it to project certain costs, depending on
other factors such as availability, consumer demand, and
competition.
Primarily it gives the exact relationship between two or more
variables in the given data set.
A good example of regression analysis is the use of this data mining
technique in matching people on dating portals.
Many websites use variables to match people according to their likes,
interest, and hobbies.
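A simple linear regression fitted by least squares illustrates the relationship estimation described above; the data points (e.g. demand versus cost) are invented for the example:

```python
# Simple linear regression by least squares: fit y = slope*x + intercept.
xs = [1, 2, 3, 4, 5]          # e.g. consumer demand
ys = [3, 5, 7, 9, 11]         # e.g. projected cost

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(f"y = {slope:.1f}x + {intercept:.1f}")   # exact fit here: y = 2.0x + 1.0
```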
103. Data Mining Techniques
4. Association Rule Technique
This technique helps to find the association between two or more
items.
It helps to know the relations between the different variables in
databases.
It discovers hidden patterns in data sets, which are used to identify variables and the combinations of variables that appear together with the highest frequencies.
There are three types of association rule. They are
1. Multilevel Association Rule
2. Multidimensional Association Rule
3. Quantitative Association Rule
104. Data Mining Techniques
This technique is most often used in the retail industry to find
patterns in sales. This will help increase the conversion rate and thus
increases profit.
Association rules are if-then statements that help show the probability of interactions between data items within large data sets in different types of databases.
Association rule mining has several applications and is commonly used to discover sales correlations in transactional data or in medical data sets.
The way the algorithm works is that you have various data, for example, a list of grocery items that you have been buying for the last six months.
It calculates the percentage of items being purchased together.
105. Data Mining Techniques
These are the three major measurements:
• Support:
This measures how often items A and B are purchased together, compared to the overall dataset.
Support = (transactions with Item A and Item B) / (entire dataset)
• Confidence:
This measures how often item B is purchased when item A is purchased as well.
Confidence = (transactions with Item A and Item B) / (transactions with Item A)
• Lift:
This measures the strength of the confidence relative to how often item B is purchased overall.
Lift = Confidence / ((transactions with Item B) / (entire dataset))
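The three measures can be computed directly from a list of transactions; the toy grocery baskets below are invented for illustration:

```python
# Support, confidence, and lift for the rule beer -> chips,
# computed over a small set of invented transactions.
transactions = [
    {"beer", "chips"}, {"beer", "chips", "bread"},
    {"beer"}, {"bread", "butter"}, {"chips"},
]
n = len(transactions)

both = sum(1 for t in transactions if {"beer", "chips"} <= t)
item_a = sum(1 for t in transactions if "beer" in t)
item_b = sum(1 for t in transactions if "chips" in t)

support = both / n                 # (A and B) / all transactions
confidence = both / item_a         # (A and B) / A
lift = confidence / (item_b / n)   # confidence / support of B

print(support, confidence, lift)
```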
106. Data Mining Techniques
Suppose, the marketing manager of a supermarket wants to
determine which products are frequently purchased together.
As an example,
Buys (x,”beer”) -> buys(x, “chips”) [support = 1%, confidence =
50%]
• Here x represents a customer buying beer and chips together.
• Confidence shows certainty that if a customer buys a beer, there is a
50% chance that he/she will buy the chips also.
• Support means that 1% of all the transactions under analysis showed
that beer and chips were bought together.
• Many similar examples like bread and butter or computer and
software can be considered.
107. Data Mining Techniques
There are two types of Association Rules:
• Single dimensional association rule: These rules contain a single
attribute that is repeated.
• Multidimensional association rule: These rules contain multiple
attributes that are repeated.
108. Data Mining Techniques
5. Outlier detection:
This type of data mining technique relates to the observation of data items in a data set that do not match an expected pattern or expected behavior.
This technique may be used in various domains like intrusion detection, fraud detection, etc.
It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field.
Outlier detection is valuable in numerous fields like network intrusion identification, credit or debit card fraud detection, detecting outliers in wireless sensor network data, etc.
109. Data Mining Techniques
For example, let’s assume the graph below is plotted using some data
sets in our database.
So the best-fit line is drawn. The points lying near the line show expected behavior, while points far from the line are outliers.
This helps to detect anomalies and take appropriate action accordingly.
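A minimal outlier-detection sketch flags points far from the mean; the z-score cutoff of 2 and the sample values are illustrative choices, not a universal rule:

```python
import statistics

# Flag values more than 2 standard deviations from the mean.
values = [10, 12, 11, 13, 12, 11, 95]

mean = statistics.mean(values)
stdev = statistics.pstdev(values)   # population standard deviation

outliers = [v for v in values if abs(v - mean) / stdev > 2]
print(outliers)   # the 95 diverges too much from the rest
```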
110. Data Mining Techniques
6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in transaction data over a certain period.
Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns.
It involves finding interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of different criteria like length, occurrence frequency, etc.
111. Data Mining Techniques
This method is used to identify patterns that occur frequently over a
certain period of time.
For example, the sales manager of a clothing company sees that sales of jackets increase just before the winter season, or that bakery sales increase during Christmas or New Year's Eve.
112. Data Mining Techniques
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc.
It analyzes past events or instances in the right sequence to predict a future event.
Prediction is one of the most valuable data mining techniques, since
it’s used to project the types of data you’ll see in the future.
In many cases, just recognizing and understanding historical trends is
enough to chart a somewhat accurate prediction of what will happen
in the future.
For example, you might review consumers’ credit histories and past
purchases to predict whether they’ll be a credit risk in the future.
113. Data Mining Techniques
For example, the sales manager of a supermarket might predict the amount of revenue that each item will generate based on past sales data. Prediction models a continuous-valued function that estimates missing numeric data values.
Regression Analysis is the best choice to perform prediction. It can
be used to set a relationship between independent variables and
dependent variables.
114. Data Mining Techniques
Sequence Prediction:
Before defining the problem of sequence prediction, it is necessary to first explain what a sequence is. A sequence is an ordered list of symbols. For example, here are some common types of sequences:
• A sequence of webpages visited by a user, ordered by the time of
access.
• A sequence of words or characters typed on a cellphone by a user, or
in a text such as a book.
• A sequence of products bought by a customer in a retail store
• A sequence of proteins in bioinformatics
• A sequence of symptoms observed on a patient at a hospital
115. Data Mining Techniques
The task of sequence prediction consists of predicting the next symbol of a
sequence based on the previously observed symbols. For example, if a user has
visited some webpages A, B, C, in that order, one may want to predict what is
the next webpage that will be visited by that user to prefetch the webpage.
First, one must train a sequence prediction model using some previously
seen sequences called the training sequences. This process is illustrated below:
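Training and then predicting the next symbol can be sketched with a first-order (bigram) model that counts which symbol most often follows each symbol; the webpage sequences and this particular model choice are illustrative assumptions, and real sequence predictors are usually more sophisticated:

```python
from collections import Counter, defaultdict

# Train on example webpage-visit sequences: count successors of each symbol.
training_sequences = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "E"],
    ["A", "B", "C", "D"],
]

follows = defaultdict(Counter)
for seq in training_sequences:
    for cur, nxt in zip(seq, seq[1:]):
        follows[cur][nxt] += 1

def predict_next(symbol):
    """Predict the most frequent successor of `symbol` seen in training."""
    return follows[symbol].most_common(1)[0][0]

print(predict_next("C"))   # "D" follows "C" twice, "E" once
```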
116. Data Mining Techniques
Some Other Data Mining Techniques:-
Statistical Techniques:
Statistics is a branch of mathematics that relates to the collection and description of data.
The statistical technique is not considered a data mining technique by many analysts.
But still, it helps to discover patterns and build predictive models.
For this reason, data analysts should possess some knowledge of the different statistical techniques.
In today’s world, people have to deal with a large amount of data and
derive important patterns from it.
117. Data Mining Techniques
Statistics can help to a great extent in answering questions about the data, such as:
• What are the patterns in the database?
• What is the probability of an event occurring?
• Which patterns are more useful to the business?
• What high-level summary can give a detailed view of what is in the database?
Statistics not only answers these questions but also helps in summarizing and counting the data.
It also helps in providing information about the data with ease.
Through statistical reports, people can make smart decisions.
118. Data Mining Techniques
There are different forms of statistics, but the most important and useful techniques are those for summarizing and describing data. Commonly used measures and methods include:
• Histogram
• Mean
• Median
• Mode
• Variance
• Max
• Min
• Linear Regression
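Most of the measures listed above are available in Python's standard-library statistics module; the data values here are invented for illustration:

```python
import statistics

# Summary statistics for a small invented data set.
data = [4, 8, 6, 5, 3, 8, 9]

print(statistics.mean(data))       # arithmetic mean
print(statistics.median(data))     # middle value: 6
print(statistics.mode(data))       # most frequent value: 8
print(statistics.pvariance(data))  # population variance
print(max(data), min(data))        # extremes: 9 3
```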
119. Data Mining Techniques
Decision Trees
A decision tree is a tree structure (as its name suggests), where
• Each internal node represents a test on the attribute.
• Branch denotes the result of the test.
• Terminal nodes hold the class label.
• The topmost node is the root node, which poses a simple question that has two or more answers. Accordingly, the tree grows and a flowchart-like structure is generated.
120. Data Mining Techniques
In this decision tree, the government classifies citizens as below age 18 or above age 18. This helps to decide whether a license should be issued to a particular citizen or not.
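The age-based tree above has a single root test and two leaf labels, so it reduces to one branching rule; the label strings are an illustrative choice:

```python
# The one-node decision tree encoded directly: the root tests age,
# and each branch ends in a class label.
def license_decision(age):
    if age < 18:
        return "no license"
    return "eligible for license"

print(license_decision(16))   # no license
print(license_decision(21))   # eligible for license
```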
121. Data Mining Tools
In today’s world, a large amount of data is generated within seconds.
To handle this data, we should have some knowledge of different
techniques and tools.
Data mining tools are nothing but a set of methodologies that are
used for analyzing this large amount of data and the relationship
between different data.
Data Mining tools have the objective of discovering
patterns/trends/groupings among large sets of data and transforming
data into more refined information.
123. Data Mining Tools
1. Orange Data Mining Tool:
Orange is open-source software written in the Python language and is excellent software for data analysis and machine learning.
Its components are called widgets.
These widgets are used for reading data, analyzing components, selecting features, and visualizing the data.
With Orange, formatting data and moving it between widgets becomes fast and easy.
Besides, Orange brings a more interactive and enjoyable atmosphere to otherwise dull analytical tools. It is quite engaging to operate.
124. Data Mining Tools
Widgets deliver significant functionalities such as:
• Displaying data tables and allowing users to select features
• Data reading
• Training predictors and comparison of learning algorithms
• Data element visualization, etc.
125. Data Mining Tools
2. SAS Data Mining:
SAS stands for Statistical Analysis System.
It is a product of the SAS Institute created for analytics and data
management.
SAS can mine data, change it, manage information from various
sources, and analyze statistics.
It offers a graphical UI for non-technical users.
SAS Data Miner allows users to analyze big data and provides accurate insight for timely decision-making.
SAS has a distributed memory processing architecture that is highly scalable.
It is suitable for data mining, optimization, and text mining purposes.
126. Data Mining Tools
3. DataMelt Data Mining:
DataMelt is a computation and visualization environment which
offers an interactive structure for data analysis and visualization.
It is primarily designed for students, engineers, and scientists.
It is also known as DMelt.
DMelt is a multi-platform utility written in Java.
It can run on any operating system that is compatible with the JVM
(Java Virtual Machine).
It includes science and mathematics libraries.
127. Data Mining Tools
Scientific libraries:
Scientific libraries are used for drawing the 2D/3D plots.
Mathematical libraries:
Mathematical libraries are used for random number generation,
algorithms, curve fitting, etc.
DMelt can be used for the analysis of large volumes of data, data
mining, and statistical analysis.
It is extensively used in the natural sciences, financial markets, and
engineering.
128. Data Mining Tools
4. Rattle:
Rattle is a GUI-based data mining tool.
It uses the R statistical programming language.
Rattle exposes the statistical power of R by offering significant data
mining features.
While Rattle has a comprehensive and well-developed user interface,
it also has an integrated log code tab that reproduces the R code for
any GUI operation.
The data set produced by Rattle can be viewed and edited.
Rattle also lets users review the code, use it for many purposes, and
extend the code without any restriction.
129. Data Mining Tools
5. RapidMiner:
RapidMiner is written in the Java programming language.
It is one of the most popular predictive analysis systems, created by
the company of the same name.
It offers an integrated environment for text mining, deep learning,
machine learning, and predictive analysis.
The tool can be used for a wide range of applications, including
business and commercial applications, research, education, training,
application development, and machine learning.
RapidMiner provides its server on-site as well as in public or
private cloud infrastructure.
130. Data Mining Process
Data Mining refers to extracting or mining knowledge from large
amounts of data.
It is also defined as extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) patterns or knowledge
from a huge amount of data.
Data mining is a rapidly growing field that is concerned with
developing techniques to assist managers and decision-makers in
making intelligent use of huge data repositories.
It is the computational process of discovering patterns in large data
sets, involving methods at the intersection of artificial intelligence,
machine learning, statistics, and database systems.
The goal of the data mining process is to extract information from a
data set and transform it into an understandable structure for further use.
131. Data Mining Process
Major tasks of data pre-processing:
Data Cleaning
Data cleaning is a process to clean the data in such a way that data
can be easily integrated.
Data Integration
Data integration is a process to integrate/combine all the data.
Data Reduction
Data reduction is a process to reduce large data into smaller
data sets in such a way that the data can be easily transformed further.
Data Transformation
Data transformation is a process to transform the data into a form
suitable for further processing.
132. Data Mining Process
Data Discretization
Data discretization converts a large number of data values into
smaller ones, so that data evaluation and data management become
very easy.
After the completion of these tasks, the data is ready for mining.
134. Data Mining Process
1.) Data Cleansing:-
Data cleansing or data cleaning is the process of identifying and
removing (or correcting) inaccurate records from a dataset, table, or
database.
It involves recognizing unfinished, unreliable, inaccurate, or
non-relevant parts of the data and then restoring, remodeling, or
removing the dirty or crude data.
To perform the data analytics properly we need various data cleaning
techniques so that our data is ready for analysis.
Data cleaning techniques may be performed as batch processing
through scripting or interactively with data cleansing tools.
After cleaning, a dataset should be uniform with other related
datasets in the operation.
135. Data Mining Process
Data cleaning techniques are not only an essential part of the data
science process – they are also the most time-consuming part.
With the rise of big data, data cleaning methods have become more
important than ever before. Every industry – banking, healthcare,
retail, hospitality, education – is now navigating a large ocean of
data.
“Data scientists spend 80% of their time cleaning and manipulating
data and only 20% of their time actually analyzing it.”
Data cleaning is a process to clean dirty data. Data is mostly not
clean: much of it can be incorrect for a large number of reasons, such
as hardware error or failure, network error, or human error. So it is
compulsory to clean the data before mining.
136. Data Mining Process
Sources of Missing Values
There are many sources of missing data. Some major sources are:
1. The user forgot to fill in a field.
2. A programming error.
3. Data lost while manually transferring it from a legacy database.
137. Data Mining Process
Data Cleaning Techniques – Get Rid of Extra Spaces
Here we have the text "welcome to Medicaps University" written in
four different ways:
1. "welcome to Medicaps University"
2. "welcome   to   Medicaps   University"
3. "   welcome to  Medicaps University"
4. "welcome to Medicaps University   "
The first is the regular way, with only one space between words; in
the second case there is more than one space between words; in the
third there are leading spaces along with extra spaces between words;
and in the fourth there are trailing spaces – a couple of spaces after
the last word.
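As a quick illustration (not part of the original slides), a short Python sketch that collapses all four variants to the same cleaned string:

```python
def normalize_spaces(text):
    # split() with no arguments breaks on runs of whitespace and
    # drops leading/trailing spaces; join rebuilds with single spaces
    return " ".join(text.split())

variants = [
    "welcome to Medicaps University",        # regular: single spaces
    "welcome   to   Medicaps   University",  # extra spaces between words
    "   welcome to  Medicaps University",    # leading spaces
    "welcome to Medicaps University   ",     # trailing spaces
]

cleaned = {normalize_spaces(v) for v in variants}
print(cleaned)  # all four collapse to one value
```

Because `str.split()` with no separator treats any run of whitespace as one delimiter, this one-liner handles all three problems (extra, leading, and trailing spaces) at once.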
138. Data Mining Process
Fixing Structural errors
The errors that arise during measurement, transfer of data or other
similar situations are called structural errors.
Structural errors include typos in the names of features, the same
attribute appearing under different names, mislabeled classes
(separate classes that should really be the same), or inconsistent
capitalization.
For example, a model will treat America and america as different
classes or values, though they represent the same value; or red,
yellow, and red-yellow as different classes, though one class can be
included in the other two. These structural errors make our model
inefficient and give poor quality results.
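A minimal Python sketch (with hypothetical raw values) of repairing such inconsistent capitalization and spacing so each class appears exactly once:

```python
raw = ["America", "america", " AMERICA", "red", "Red ", "yellow"]

def standardize(values):
    # lowercase and strip surrounding spaces so 'America',
    # 'america', and ' AMERICA' all map to one class label
    return [v.strip().lower() for v in values]

cleaned = standardize(raw)
print(sorted(set(cleaned)))  # ['america', 'red', 'yellow']
```

After standardizing, the six raw strings reduce to three distinct class values, which is what the model should actually see.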
139. Data Mining Process
Some techniques of Data Cleaning Process
1. Parsing
2. Correcting
3. Standardizing
4. Matching
5. Consolidation
6. Dealing with missing data
7. Dealing with incorrect and noisy data
140. Data Mining Process
Some data cleaning methods:-
1. Ignore the tuple. This is usually done when the class label is
missing. This method is not very effective unless the tuple contains
several attributes with missing values.
2. Fill in the missing value manually. This approach is effective on
small data sets with few missing values.
3. Replace all missing attribute values with a global constant, such
as a label like “Unknown” or minus infinity.
4. Use the attribute mean to fill in the missing value. For example,
if average customer income is 25000, you can use this value to
replace a missing income value.
5. Use the most probable value to fill in the missing value.
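Method 4 above (mean imputation) can be sketched in a few lines of Python; the income list is hypothetical, with None marking missing values:

```python
incomes = [25000, None, 30000, None, 20000]

# compute the attribute mean over the known values only
known = [x for x in incomes if x is not None]
mean_income = sum(known) / len(known)

# replace each missing value with the attribute mean
filled = [x if x is not None else mean_income for x in incomes]
print(filled)  # [25000, 25000.0, 30000, 25000.0, 20000]
```

Replacing with a global constant (method 3) is the same pattern with a fixed label instead of `mean_income`.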
141. Data Mining Process
Noisy Data
Noise is a random error or variance in a measured variable. Noisy
data may be due to faulty data collection instruments, data entry
problems, or technology limitations.
How to Handle Noisy Data?
Binning:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins.
For example
Price = 4, 8, 15, 21, 21, 24, 25, 28, 34
142. Data Mining Process
Partition into (equal-frequency) bins:
Bin a: 4, 8, 15
Bin b: 21, 21, 24
Bin c: 25, 28, 34
In this example, the data for price are first sorted and then partitioned
into equal-frequency bins of size 3.
Smoothing by bin means:
Bin a: 9, 9, 9
Bin b: 22, 22, 22
Bin c: 29, 29, 29
In smoothing by bin means, each value in a bin is replaced by the
mean value of the bin.
143. Data Mining Process
Smoothing by bin boundaries:
Bin a: 4, 4, 15
Bin b: 21, 21, 24
Bin c: 25, 25, 34
In smoothing by bin boundaries, each bin value is replaced by the
closest boundary value.
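The binning results above can be reproduced with a short Python sketch; the equal-frequency partition of size 3 and the rounding to whole numbers are choices made to match the slide's figures:

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

# partition into equal-frequency bins of size 3
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# smoothing by bin boundaries: every value snaps to the
# nearer of the bin's minimum and maximum
def snap(b):
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

by_boundaries = [snap(b) for b in bins]
print(by_means)       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Both outputs match the bins shown on the slides: means 9, 22, 29 and boundary-smoothed bins [4, 4, 15], [21, 21, 24], [25, 25, 34].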
Regression
Data can be smoothed by fitting the data to a regression function.
Clustering:
Outliers may be detected by clustering, where similar values are
organized into groups, or “clusters.” Values that fall outside of the
set of clusters may be considered outliers.
144. Data Mining Process
2.) Data Integration In Data Mining
Here comes the second step in the data mining process: data from
various zones is incorporated into a single zone.
Data in your computer system is stored in different formats in
different locations – saved spreadsheets, text files, images,
documents, etc.
Data integration can be a real struggle if the data is poorly
organized to begin with. It frees the data from repetition without
affecting its reliability.
Data Integration is a data preprocessing technique that combines data
from multiple sources and provides users a unified view of these
data.
146. Data Mining Process
There are two major approaches to data integration:-
1 Tight Coupling
In tight coupling, data is combined from different sources into a
single physical location through the process of ETL (Extraction,
Transformation, and Loading). Here, a data warehouse is treated as an
information retrieval component.
2 Loose Coupling
In loose coupling, the data remains only in the actual source
databases. In this approach, an interface is provided that takes a
query from the user, transforms it into a form the source databases
can understand, and then sends the query directly to the source
databases to obtain the result.
147. Data Mining Process
Issues in Data Integration:
There are a number of issues to consider during data integration:
schema integration, redundancy, and detection and resolution of data
value conflicts. These are explained briefly below.
1. Schema Integration:
• Integrate metadata from different sources.
• Matching real-world entities from multiple sources is referred to
as the entity identification problem.
For example, how can the data analyst or computer be sure that
customer id in one database and customer number in another refer to
the same attribute?
148. Data Mining Process
2. Redundancy:
• An attribute may be redundant if it can be derived or obtained from
another attribute or set of attributes.
• Inconsistencies in attribute naming can also cause redundancies in
the resulting data set.
• Some redundancies can be detected by correlation analysis.
3. Detection and resolution of data value conflicts:
• This is the third important issue in data integration.
• Attribute values from different sources may differ for the same
real-world entity.
• An attribute in one system may be recorded at a lower level of
abstraction than the “same” attribute in another.
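A small pure-Python sketch of correlation analysis for redundancy detection; the two income attributes are hypothetical, chosen so one is exactly derivable from the other:

```python
def pearson(xs, ys):
    # Pearson correlation coefficient; values near +1 or -1 suggest
    # one attribute can be derived from the other (redundancy)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# hypothetical attributes: annual income, and the same income stored monthly
annual = [24000, 30000, 36000, 48000]
monthly = [2000, 2500, 3000, 4000]
r = pearson(annual, monthly)
print(round(r, 3))  # 1.0 -- a perfect correlation flags redundancy
```

A correlation this close to 1 tells the analyst that keeping both attributes adds no information, so one can be dropped during integration.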
149. Data Mining Process
3.) Data Transformation In Data Mining
In the data transformation process, data are transformed from one
format to another format that is more appropriate for data mining.
Some Data Transformation Strategies:-
1 Smoothing
Smoothing is a process of removing noise from the data.
2 Aggregation
Aggregation is a process where summary or aggregation operations
are applied to the data.
3 Generalization
In generalization, low-level data are replaced with high-level data
by climbing concept hierarchies.
150. Data Mining Process
4 Normalization
Normalization scales attribute data so that it falls within a small
specified range, such as 0.0 to 1.0.
5 Attribute Construction
In Attribute construction, new attributes are constructed from the
given set of attributes.
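A minimal sketch of min-max normalization (strategy 4) in Python; the income values and the target range of 0.0 to 1.0 are illustrative:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    # rescale each value linearly into [new_min, new_max]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [20000, 25000, 30000, 40000]
print(min_max(incomes))  # [0.0, 0.25, 0.5, 1.0]
```

After scaling, attributes measured on very different ranges (income in the tens of thousands, age in the tens) become directly comparable, which many mining algorithms require.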
151. Data Mining Process
Discrete vs Continuous Data
If you have quantitative data, like the number of workers in a
company, could you divide each worker into 2 parts? Absolutely not,
because the number of workers is discrete data.
Discrete data is a count that involves integers.
Only a limited number of values is possible.
The discrete values cannot be subdivided into parts.
For example, the number of children in a school is discrete data. You
can count whole individuals. You can’t count 1.5 kids.
So, discrete data can take only certain values. The data variables
cannot be divided into smaller parts.
152. Data Mining Process
As we mentioned above the two types of quantitative data (numerical
data) are discrete and continuous data.
Continuous data is considered as the opposite of discrete data.
Continuous data is information that could be meaningfully divided
into finer levels.
It can be measured on a scale and can have almost any numeric
value.
For example, you can measure your height at very precise scales –
meters, centimeters, millimeters, etc.
You can record continuous data at many different measurements –
width, temperature, time, etc.
This is where the key difference with discrete data lies.
154. Data Mining Process
4.) Data discretization:
Data discretization converts a large number of data values into
smaller ones, so that data evaluation and data management become
very easy.
Data discretization techniques can be used to divide the range of a
continuous attribute into intervals. Numerous continuous attribute
values are replaced by small interval labels.
This leads to a concise, easy-to-use, knowledge-level representation
of mining results.
Data discretization example
we have an attribute of age with the following values. (Before
Discretization)
Age 10,11,13,14,17,19,30, 31, 32, 38, 40, 42,70 , 72, 73, 75
155. Data Mining Process
Attribute: Age (after discretization)
• 10, 11, 13, 14, 17, 19 → Young
• 30, 31, 32, 38, 40, 42 → Mature
• 70, 72, 73, 75 → Old
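A sketch of this age discretization in Python; the interval boundaries (under 20 is Young, 20 to 59 is Mature, 60 and over is Old) are assumptions chosen to match the example values:

```python
ages = [10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75]

def discretize(age):
    # replace a continuous age value with an interval label
    if age < 20:
        return "Young"
    elif age < 60:
        return "Mature"
    return "Old"

labels = [discretize(a) for a in ages]
print(labels)  # six 'Young', six 'Mature', four 'Old'
```

The sixteen raw ages collapse into just three interval labels, giving the concise, knowledge-level representation the slides describe.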
156. Data Mining Process - Discretization and Concept
Hierarchy Generation for Numerical Data
Typical methods:
1 Binning
Binning is a top-down splitting technique based on a specified
number of bins. Binning is an unsupervised discretization technique.
2 Histogram Analysis
Because histogram analysis does not use class information, it is an
unsupervised discretization technique. Histograms partition the
values of an attribute into disjoint ranges called buckets.
3 Cluster Analysis
Cluster analysis is a popular data discretization method. A clustering
algorithm can be applied to discretize a numerical attribute A by
partitioning the values of A into clusters or groups.
157. Data Mining Process
Top-down discretization
If the process starts by first finding one or a few points (called split
points or cut points) to split the entire attribute range, and then repeats
this recursively on the resulting intervals, then it is called top-down
discretization or splitting.
Bottom-up discretization
If the process starts by considering all of the continuous values as
potential split-points, removes some by merging neighborhood values
to form intervals, then it is called bottom-up discretization or merging.
Discretization can be performed rapidly on an attribute to provide a
hierarchical partitioning of the attribute values, known as a concept
hierarchy.
159. Data Mining Process
Concept hierarchies
Concept hierarchies can be used to reduce the data by collecting and
replacing low-level concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple
dimensions, and each dimension contains multiple levels of
abstraction defined by concept hierarchies. This organization
provides users with the flexibility to view data from different
perspectives.
Data mining on a reduced data set means fewer input/output
operations and is more efficient than mining on a larger data set.
Because of these benefits, discretization techniques and concept
hierarchies are typically applied before data mining, rather than
during mining.
161. Summary
Data Mining is all about explaining the past and predicting the future
for analysis.
Data mining helps to extract information from huge sets of data. It is
the procedure of mining knowledge from data.
The data mining process includes Business Understanding, Data
Understanding, Data Preparation, Modelling, Evaluation, and
Deployment.
Important data mining techniques are classification, clustering,
regression, association rules, outlier detection, sequential patterns,
and prediction.
R-language and Oracle Data mining are prominent data mining tools.
Data mining technique helps companies to get knowledge-based
information.
162. Summary
• The main drawback of data mining is that much analytics software is
difficult to operate and requires advanced training to work with.
• Data mining is used in diverse industries such as communications,
insurance, education, manufacturing, banking, retail, service
providers, eCommerce, supermarkets, and bioinformatics.
163. Unit – 2
Assignment: attempt any 5 questions. Marks: 20
Q.1 What is Data Mining? Explain the different stages for Data
Mining Process.
Q.2 Describe challenges to Data Mining regarding data mining
methodology and user interaction issues.
Q.3 Describe the various techniques of Data Mining. Write tools for
Data Mining.
Q.4 What is Data Cleaning? Describe the approaches to fill missing
values and noisy data.
Q.5 Explain Knowledge Discovery Process.
Q.6 Define Support and Confidence in Association rule mining.
Q.7 Explain Data mining Architecture. Write Some Application of
Data mining.
165. Thank You
Great God, Medi-Caps, All the attendees
Mr. Sagar Pandya
sagar.pandya@medicaps.ac.in
www.sagarpandya.tk
LinkedIn: /in/seapandya
Twitter: @seapandya
Facebook: /seapandya