MEDI-CAPS UNIVERSITY
Faculty of Engineering
Mr. Sagar Pandya
Information Technology Department
sagar.pandya@medicaps.ac.in
Data Mining and Warehousing
Course Code | Course Name | Hours Per Week (L-T-P) | Total Credits
IT3ED02 | Data Mining and Warehousing | 3-0-0 | 3
 Unit 1. Introduction
 Unit 2. Data Mining
 Unit 3. Association and Classification
 Unit 4. Clustering
 Unit 5. Business Analysis
Reference Books
Text Books
 Han, Kamber and Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann,
India, 2012.
 Mohammed Zaki and Wagner Meira Jr., Data Mining and Analysis:
Fundamental Concepts and Algorithms, Cambridge University Press.
 Z. Markov and Daniel T. Larose, Data Mining the Web, John Wiley & Sons, USA.
Reference Books
 Sam Anahory and Dennis Murray, Data Warehousing in the Real World,
Pearson Education Asia.
 W. H. Inmon, Building the Data Warehouse, 4th Ed Wiley India.
and many others
Unit-2 Data Mining
 Basics of Data Mining,
 Data mining techniques,
 KDP (Knowledge Discovery Process),
 Application and Challenges of Data Mining,
 Data Pre-processing: Overview,
 Data cleaning, Data integration, Data reduction, Data transformation and
discretization.
Data Mining
 There is a huge amount of data available in the Information Industry.
This data is of no use until it is converted into useful information.
 It is necessary to analyze this huge amount of data and extract useful
information from it.
 Extraction of information is not the only process we need to perform;
data mining also involves other processes such as Data Cleaning,
Data Integration, Data Transformation, Data Mining, Pattern
Evaluation and Data Presentation.
 Data mining is also called knowledge discovery, knowledge
extraction, data/pattern analysis, information harvesting, etc.
 Definition:- “The process of extracting previously unknown, valid
and actionable information from large databases and then using the
information to make crucial business decisions.”
 “The Science of extracting useful information from large datasets or
databases.”
 Data mining is looking for hidden, valid, and potentially useful
patterns in huge data sets.
 In other words, we can say that data mining is the procedure of
mining knowledge from data.
 Data Mining is all about discovering unsuspected/ previously
unknown relationships amongst the data.
 It is a multi-disciplinary skill that uses machine learning, statistics,
AI and database technology.
 The insights derived via Data Mining can be used for Market
Analysis, Fraud Detection, Customer Retention, Production Control,
Science Exploration etc.
Types of Data in Data Mining
 Data mining can be performed on the following types of data:
1. Relational databases
2. Data warehouses
3. Advanced DB and information repositories
4. Object-oriented and object-relational databases
5. Transactional and Spatial databases
6. Heterogeneous and legacy databases
7. Multimedia and streaming databases
8. Text databases
9. Text and Web data (text mining and Web mining)
History of Data Mining
 The term "Data Mining" was introduced in the 1990s, but data
mining is the evolution of a field with a long history.
 Early techniques for identifying patterns in data include Bayes'
theorem (1700s) and regression analysis (1800s).
 The spread and growing power of computing have boosted data
collection, storage, and manipulation as data sets have grown in
size and complexity. Direct hands-on data analysis has
progressively been augmented with indirect, automatic data
processing and other computer science discoveries such as neural
networks, clustering, genetic algorithms (1950s), decision trees
(1960s), and support vector machines (1990s).
 The origins of data mining are traced back to three family lines:
classical statistics, artificial intelligence, and machine learning.
[Figure: Evolution of Data Mining]
Applications of Data Mining
Application | Usage
Insurance | Data mining helps insurance companies price their products profitably and promote new offers to new or existing customers.
Education | Data mining helps educators access student data, predict achievement levels, and find students or groups of students who need extra attention, for example students who are weak in mathematics.
Communications | Data mining techniques are used in the communications sector to predict customer behavior and offer highly targeted, relevant campaigns.
Banking | Data mining helps the finance sector get a view of market risks and manage regulatory compliance. It helps banks identify probable defaulters when deciding whether to issue credit cards, loans, etc.
Retail | Data mining techniques help retail malls and grocery stores identify their best-selling items and place them in the most prominent positions. They help store owners come up with offers that encourage customers to increase their spending.
Crime Investigation | Data mining helps crime investigation agencies decide where to deploy the police workforce (where and when is a crime most likely to happen?), whom to search at a border crossing, etc.
Bioinformatics | Data mining helps to mine biological data from the massive datasets gathered in biology and medicine.
Service Providers | Service providers such as mobile phone and utility companies use data mining to predict why customers leave. They analyze billing details, customer service interactions, and complaints made to the company to assign each customer a probability score and offer incentives.
E-Commerce | E-commerce websites use data mining to offer cross-sells and up-sells through their websites. One of the most famous names is Amazon, which uses data mining techniques to bring more customers into its e-commerce store.
Basic Data Mining Task
 The data mining tasks can be classified generally into two types
based on what a specific task tries to achieve. Those two categories
are descriptive tasks and predictive tasks.
 The two “High Level” primary goals of data mining are prediction
and description.
 Prediction involves using some variables or fields in the database to
predict unknown or future values of other variables of interest.
 Description focuses on finding human-interpretable patterns
describing the data.
a) Classification
 Classification derives a model to determine the class of an object
based on its attributes.
 A collection of records will be available, each record with a set of
attributes.
 Classification can be used in direct marketing, that is to reduce
marketing costs by targeting a set of customers who are likely to buy
a new product.
 Using the available data, it is possible to know which customers
purchased similar products and who did not purchase in the past.
Hence, {purchase, don’t purchase} decision forms the class attribute
in this case.
 Once the class attribute is assigned, demographic and lifestyle
information of customers who purchased similar products can be
collected and promotion mails can be sent to them directly.
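The direct-marketing example above can be sketched in a few lines. This is only an illustration, assuming a k-nearest-neighbour classifier and a made-up training set of (age, income) records; the attribute choice, the data and the function name `classify` are all hypothetical and not part of the course material.

```python
from collections import Counter
from math import dist

# Hypothetical training records: (age, income in thousands) -> class label.
training = [
    ((25, 30), "don't purchase"),
    ((30, 40), "don't purchase"),
    ((45, 90), "purchase"),
    ((50, 85), "purchase"),
    ((35, 80), "purchase"),
]

def classify(record, k=3):
    """Return the majority class among the k training records nearest to `record`."""
    nearest = sorted(training, key=lambda item: dist(item[0], record))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# A new customer whose attributes resemble those of past buyers.
prediction = classify((48, 88))
print(prediction)
```

Customers predicted as "purchase" would be the ones targeted by the promotion.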
b) Prediction
 Prediction task predicts the possible values of missing or future data.
 Prediction involves developing a model based on the available data
and this model is used in predicting future values of a new data set of
interest.
 For example, a model can predict the income of an employee based
on education, experience and other demographic factors like place of
stay, gender etc.
 Also prediction analysis is used in different areas including medical
diagnosis, fraud detection etc.
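As a rough sketch of the income example, a prediction model can be as simple as a straight line fitted by least squares. The data below (years of experience against income in thousands) is invented for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical available data: years of experience and income (thousands).
experience = [1, 2, 3, 4, 5]
income = [32, 34, 36, 38, 40]

slope, intercept = fit_line(experience, income)

# Use the model to predict the income of an employee with 6 years of experience.
predicted = slope * 6 + intercept
print(predicted)
```

A real model would use several attributes (education, place of stay, and so on) rather than one.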
c) Time - Series Analysis
 Time series is a sequence of events where the next event is
determined by one or more of the preceding events.
 Time series reflects the process being measured and there are certain
components that affect the behavior of a process.
 Time series analysis includes methods to analyze time-series data in
order to extract useful patterns, trends, rules and statistics.
 Stock market prediction is an important application of time-series
analysis.
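One simple way to extract a trend from time-series data is a moving average. The sketch below assumes a short, made-up price series and a window of three observations; it is one of many possible trend methods, not a stock-prediction technique in itself.

```python
def moving_average(series, window):
    """Return the mean of every run of `window` consecutive observations."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical daily closing prices.
prices = [10, 12, 11, 13, 15, 14, 16]

# The smoothed series rises steadily even though the raw prices fluctuate.
trend = moving_average(prices, 3)
print(trend)
```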
d) Association
 Association discovers the association or connection among a set of
items.
 Association identifies the relationships between objects.
 Association analysis is used for commodity management,
advertising, catalog design, direct marketing etc.
 A retailer can identify the products that customers normally purchase
together, or find the customers who respond to promotions of the
same kind of products.
 If a retailer finds that Surf (detergent) and soap are mostly bought
together, it can put soap on sale to promote sales of Surf.
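How strongly two items are associated is commonly measured with support and confidence. The sketch below computes both over a small, invented set of shopping baskets; the item names and the helper functions `support` and `confidence` are illustrative assumptions.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"detergent", "soap", "milk"},
    {"detergent", "soap"},
    {"soap", "bread"},
    {"detergent", "soap", "bread"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"detergent", "soap"}))       # how often the pair is bought together
print(confidence({"detergent"}, {"soap"}))  # how often detergent buyers also buy soap
```

A rule such as detergent -> soap is usually reported only when both values exceed thresholds chosen by the analyst.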
e) Clustering
 Clustering is used to identify data objects that are similar to one
another.
 The similarity can be decided based on a number of factors like
purchase behavior, responsiveness to certain actions, geographical
locations and so on.
 For example, an insurance company can cluster its customers based
on age, residence, income etc.
 This group information will be helpful to understand the customers
better and hence provide better customized services.
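The insurance example can be sketched with k-means, one common clustering algorithm (the slides do not prescribe a particular one). The customer data (age, income in thousands) and the starting centroids are made up, and `k_means` is a hypothetical helper.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid unchanged
            for i, cluster in enumerate(clusters)
        ]
    # `clusters` holds the assignment made in the last iteration.
    return centroids, clusters

# Hypothetical customers described by (age, income in thousands).
customers = [(22, 25), (25, 30), (27, 28), (55, 80), (60, 85), (58, 90)]

centroids, clusters = k_means(customers, centroids=[(22, 25), (60, 85)])
print(clusters)
```

Each resulting group could then be offered services tailored to its profile.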
f) Summarization
 Summarization is the generalization of data.
 A set of relevant data is summarized, which results in a smaller set that
gives aggregated information about the data.
 For example, the shopping done by a customer can be summarized
into total products, total spending, offers used, etc.
 Such high level summarized information can be useful for sales or
customer relationship team for detailed customer and purchase
behavior analysis.
 Data can be summarized in different abstraction levels and from
different angles.
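A minimal sketch of the shopping example: a customer's individual purchases are reduced to a few aggregate figures. The purchase records and field names below are invented for illustration.

```python
# Hypothetical purchase records for one customer.
purchases = [
    {"product": "soap", "price": 2.5, "offer_used": True},
    {"product": "milk", "price": 1.0, "offer_used": False},
    {"product": "bread", "price": 1.5, "offer_used": False},
    {"product": "detergent", "price": 5.0, "offer_used": True},
]

def summarize(records):
    """Collapse detailed purchase records into aggregated information."""
    return {
        "total_products": len(records),
        "total_spending": sum(r["price"] for r in records),
        "offers_used": sum(r["offer_used"] for r in records),
    }

summary = summarize(purchases)
print(summary)
```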
Data Mining Architecture
 Data Mining refers to the detection and extraction of new patterns
from the already collected data.
 Data mining architecture has many elements like Data Mining
Engine, Pattern evaluation, Data Warehouse, User Interface and
Knowledge Base.
 Each component of the data mining architecture has its own
responsibilities and contributes to completing the data mining task
efficiently.
 The different modules need to interact correctly to produce a
valuable result and to complete the complex procedure of data
mining successfully, providing the right set of information to
the business.
1. Data Sources
 The actual data sources include databases, data warehouses, the
World Wide Web (WWW), and a wide variety of other documents.
 Often the data is not held in any of these primary ("golden") sources
but only in text files, plain files, sequence files or spreadsheets.
Such data needs to be processed in much the same way as data
received from the golden sources.
 A large share of today's data comes from the internet, since
everything on the internet is data in some form and acts as an
information repository.
 Before the data is processed further, it goes through data cleansing,
integration, and selection, and only then is it passed on to the
database or the enterprise data warehouse (EDW) server.
 The major challenge with this data is that it comes from many
different sources and in a wide array of formats. The data therefore
cannot be used for processing directly in its raw state; it must first
be processed and transformed into a more usable form.
 This also ensures the reliability and completeness of the data. So
the first step involves data collection, cleaning and integration,
after which only the relevant data is passed forward. All this
activity is handled by a separate set of tools and techniques.
2. Data Warehouse Server or Database
 The database server is where the data is held once it has been
received from the various data sources.
 The server contains the actual data that is ready to be processed,
and therefore the server manages data retrieval.
 Data is retrieved based on the user's data mining request.
3. Data Mining Engine
 The data mining engine is the core component of the data mining
process.
 It is the most vital part, the driving force that handles and manages
all requests, and it contains a number of modules.
 These modules perform mining tasks such as classification,
association, regression, characterization, prediction, clustering and
time series analysis, using techniques such as naive Bayes, support
vector machines, ensemble methods, boosting and bagging, random
forests, decision trees, etc.
 In other words, the data mining engine is the root of the data mining
architecture.
4. Pattern Evaluation Modules
 They are responsible for finding interesting patterns in the data, and
sometimes they also interact with the database servers to produce
the results of user requests.
 The main purpose of this component is to search for interesting and
usable patterns, which improves the quality of the results.
 Pattern evaluation finds these patterns with the help of the data
mining engine.
5. Graphical User Interface
 The graphical user interface (GUI) module communicates between
the data mining system and the user.
 This module helps the user to easily and efficiently use the system
without knowing the complexity of the process.
 This module cooperates with the data mining system when the user
specifies a query or a task and displays the results.
6. Knowledge Base
 The knowledge base is helpful in the entire process of data mining.
 It can be used to guide the search or to evaluate the interestingness
of the resulting patterns.
 The knowledge base may even contain user views and data from user
experiences that might be helpful in the data mining process.
 The data mining engine may receive inputs from the knowledge base
to make the result more accurate and reliable.
 The pattern evaluation module regularly interacts with the
knowledge base to get inputs and also to update it.
Types of Data Mining Architecture
1. No Coupling:
 The no-coupling data mining architecture retrieves data directly
from particular data sources.
 It does not use a database to retrieve the data, even though a
database would be an efficient and accurate way of doing so.
 The no-coupling architecture is a poor design and is used only
for very simple data mining processes.
2. Loose Coupling:
 In a loosely coupled architecture, the data mining system retrieves
data from the database or data warehouse and stores its results back
in those systems.
 This architecture is mainly suited to memory-based data mining.
3. Semi Tight Coupling:
 Semi-tight coupling uses various advantageous features of the data
warehouse system,
 such as sorting, indexing, and aggregation.
 In this architecture, intermediate results can be stored in the
database for better performance.
4. Tight coupling:
 In this architecture, the data warehouse is considered one of the
most important components, and its features are employed for
performing data mining tasks.
 This architecture provides scalability, performance, and integrated
information.
Advantages of Data Mining
 Assists in preventing future adversities by accurately predicting
future trends.
 Contributes to the making of important decisions.
 Compresses data into valuable information.
 Provides new trends and unexpected patterns.
 Helps to analyze huge data sets.
 Aids companies to find, attract and retain customers.
 Helps the company to improve its relationship with the customers.
 Helps companies optimize their production according to the
popularity of a certain product, thus saving costs.
Disadvantages of Data Mining
 The work is intensive and requires high-performance teams and staff
training.
 The need for large investments can also be a problem, as data
collection sometimes consumes many resources and entails a high
cost.
 Lack of security could also put the data at huge risk, as the data may
contain private customer details.
 Inaccurate data may lead to the wrong output.
 Huge databases are quite difficult to manage.
DIFFERENT TYPES OF KNOWLEDGE
 Knowledge is a collection of interesting and useful patterns in a
database. The key issue in Knowledge Discovery in Databases is to
realize that there is more information hidden in your data than you
are able to distinguish at first sight. In data mining we distinguish
four different types of knowledge.
 Shallow Knowledge: This is information that can be easily retrieved
from a database using a query tool such as Structured Query Language
(SQL).
 Multi-Dimensional Knowledge: This is information that can be
analyzed using online analytical processing (OLAP) tools. With
OLAP tools you have the ability to rapidly explore all sorts of
clusterings and different orderings of the data, but it is important to
realize that most of the things you can do with an OLAP tool can also
be done using SQL.
 The advantage of OLAP tools is that they are optimized for this kind
of search and analysis operation.
 However, OLAP is not as powerful as data mining; it cannot search
for optimal solutions.
 Hidden Knowledge: This is information that can be found relatively
easily by using pattern recognition or machine learning algorithms.
Again, one could use SQL to find these patterns, but this would
probably prove extremely time-consuming.
 A pattern recognition algorithm could find regularities in a database
in minutes or at most a couple of hours, whereas you would have to
spend months using SQL to achieve the same result. This is the
information that can be obtained through data mining techniques.
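To make the first two knowledge types concrete, the sketch below uses Python's built-in sqlite3 module with a small, invented `sales` table. A direct lookup stands in for shallow knowledge, and GROUP BY aggregation along one dimension at a time stands in for the kind of multi-dimensional view an OLAP tool would provide. The table and column names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "soap", 100.0),
        ("North", "milk", 50.0),
        ("South", "soap", 70.0),
        ("South", "milk", 30.0),
    ],
)

# Shallow knowledge: a fact retrieved directly with a simple query.
north_soap = conn.execute(
    "SELECT amount FROM sales WHERE region = 'North' AND product = 'soap'"
).fetchone()[0]

# Multi-dimensional knowledge: the same data aggregated along each dimension.
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
by_product = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()

print(north_soap, by_region, by_product)
```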
 Deep Knowledge: This is information that is stored in the database
but can only be located if we have a clue that tells us where to look.
[Table: Different Types of Knowledge and Techniques]
Knowledge Discovery Process (KDP)
 Data mining is the core part of the knowledge discovery process.
 KDP is the process of finding knowledge in data; it does this by using data
mining methods (algorithms) to extract the required knowledge from large
amounts of data.
 Data mining is also known as Knowledge Discovery in Databases (KDD).
 Here is the list of steps involved in the knowledge discovery process:
1.) Data Cleaning: Data cleaning is defined as the removal of noisy and
irrelevant data from the collection.
 Cleaning in case of missing values.
 Cleaning noisy data, where noise is a random error or variance.
 Cleaning with data discrepancy detection and data transformation tools.
 A parser decides whether a given string of data is acceptable within the data
specification.
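Two of the cleaning tasks listed above can be sketched briefly: filling missing values with the attribute mean, and smoothing noisy values by replacing each value with the mean of its bin. Both are common choices rather than the only ones, and the data and function names are made up for illustration.

```python
def fill_missing(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-size bins, and replace
    every value with the mean of its bin to reduce noise."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical customer ages with one missing entry.
filled = fill_missing([23, None, 31, 27])

# Hypothetical noisy price data, smoothed in bins of three.
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)

print(filled)
print(smoothed)
```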
2.) Data Integration: Data integration is defined as combining
heterogeneous data from multiple sources into a common source
(data warehouse).
 Data integration using data migration tools.
 Data integration using data synchronization tools.
 Data integration using the ETL (Extract-Transform-Load) process.
3.) Data Selection: Data selection is defined as the process where data
relevant to the analysis is decided and retrieved from the data
collection.
 Data selection using decision trees.
 Data selection using naive Bayes.
 Data selection using neural networks.
 Data selection using clustering, regression, etc.
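Steps 2 and 3 can be pictured with a toy example: two sources that describe the same customers under different field names are combined into one store, after which only the attribute relevant to the analysis is retrieved. The source names, field names and records are all hypothetical, and a dictionary stands in for the data warehouse.

```python
# Two hypothetical sources with different schemas for the same customers.
crm_rows = [{"cust_id": 1, "name": "Asha"}, {"cust_id": 2, "name": "Ravi"}]
billing_rows = [
    {"customer": 1, "total": 250.0},
    {"customer": 2, "total": 90.0},
    {"customer": 1, "total": 50.0},
]

def integrate(crm, billing):
    """Combine both sources into one record per customer (a stand-in warehouse)."""
    warehouse = {
        row["cust_id"]: {"name": row["name"], "total_billed": 0.0} for row in crm
    }
    for row in billing:
        customer_id = row["customer"]
        if customer_id in warehouse:
            warehouse[customer_id]["total_billed"] += row["total"]
    return warehouse

def select(warehouse, attribute):
    """Retrieve only the attribute relevant to the analysis."""
    return {cid: record[attribute] for cid, record in warehouse.items()}

warehouse = integrate(crm_rows, billing_rows)
selected = select(warehouse, "total_billed")
print(selected)
```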
4.) Data Transformation:
 Data transformation is defined as the process of transforming data
into the appropriate form required by the mining procedure.
 Data transformation is a two-step process:
 Data mapping: assigning elements from the source base to the
destination to capture transformations.
 Code generation: creation of the actual transformation program.
5.) Data Mining:
 Data mining is defined as the application of intelligent techniques
to extract potentially useful patterns.
 Transforms task-relevant data into patterns.
 Decides the purpose of the model using classification or characterization.
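The two parts of step 4 can be illustrated loosely: a mapping table records which source field feeds which destination field, and a small function plays the role of the generated transformation program. The field names, the derived `age` attribute and the reference year are assumptions made for this sketch.

```python
# Data mapping: hypothetical source field names -> destination field names.
FIELD_MAP = {"cust_nm": "name", "yr_birth": "birth_year"}

def transform(source_row, field_map, reference_year):
    """Apply the mapping, then derive an 'age' attribute in the form
    the mining step is assumed to need."""
    row = {dest: source_row[src] for src, dest in field_map.items()}
    row["age"] = reference_year - row["birth_year"]
    return row

transformed = transform({"cust_nm": "Asha", "yr_birth": 1990}, FIELD_MAP, 2020)
print(transformed)
```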
6.) Pattern Evaluation: Pattern evaluation is defined as identifying
the truly interesting patterns representing knowledge, based on given
measures.
 Finds the interestingness score of each pattern.
 Uses summarization and visualization to make the data
understandable to the user.
7.) Knowledge Representation: Knowledge representation is defined as
a technique that uses visualization tools to represent data mining
results.
 Generate reports.
 Generate tables.
 Generate discriminant rules, classification
rules, characterization rules, etc.
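For step 6, one widely used interestingness score for association patterns is lift, which compares how often two items occur together with how often they would if they were independent. The slides do not name a specific measure, so lift is an assumption here, as are the counts and the threshold of 1.2.

```python
def lift(n_total, n_a, n_b, n_ab):
    """Lift of the pattern A -> B: P(A and B) / (P(A) * P(B)).
    Values above 1 suggest A and B occur together more than by chance."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_ab = n_ab / n_total
    return p_ab / (p_a * p_b)

# Hypothetical counts out of 100 transactions: (total, A, B, A and B).
patterns = {
    "detergent -> soap": lift(100, 40, 50, 30),
    "milk -> bread": lift(100, 50, 40, 20),
}

# Keep only patterns whose score exceeds a hypothetical threshold.
interesting = [name for name, score in patterns.items() if score > 1.2]
print(interesting)
```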
Data Mining Issues
 Nowadays data mining and knowledge discovery are evolving into a
crucial technology for businesses and researchers in many domains.
 Data mining is not an easy task, as the algorithms used can get very
complex and data is not always available at one place.
 It needs to be integrated from various heterogeneous data sources.
These factors also create some issues.
 Here, we will discuss the major issues regarding −
1. Mining Methodology and User Interaction
2. Performance Issues
3. Diverse Data Types Issues
Mining Methodology and User Interaction Issues
 Mining different kinds of knowledge in databases −
 Different users may be interested in different kinds of knowledge.
Therefore it is necessary for data mining to cover a broad range of
knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of
abstraction −
 The data mining process needs to be interactive because it allows
users to focus the search for patterns, providing and refining data
mining requests based on the returned results.
 Incorporation of background knowledge −
 To guide discovery process and to express the discovered patterns,
the background knowledge can be used.
 Background knowledge may be used to express the discovered
patterns not only in concise terms but at multiple levels of
abstraction.
 Data mining query languages and ad hoc data mining −
 A data mining query language, which allows the user to describe ad
hoc mining tasks, should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results −
 Once the patterns are discovered, they need to be expressed in
high-level languages and visual representations. These representations
should be easily understandable.
 Handling noisy or incomplete data −
 The data cleaning methods are required to handle the noise and
incomplete objects while mining the data regularities. If the data
cleaning methods are not there then the accuracy of the discovered
patterns will be poor.
 Pattern evaluation −
 The patterns discovered should be interesting; patterns are
uninteresting if they merely represent common knowledge or lack novelty.
Performance Issues
 Efficiency and scalability of data mining algorithms − To
effectively extract information from the huge amounts of data in
databases, data mining algorithms must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms −
Factors such as the huge size of databases, the wide distribution of
data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms.
 These algorithms divide the data into partitions, which are then
processed in parallel. The results from the partitions are then
merged. Incremental algorithms incorporate database updates without
mining the data again from scratch.
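The partition, process, merge and incremental-update pattern described above can be sketched with item counting as the mining step. A thread pool is used only to show the structure (a CPU-bound workload would more likely use separate processes or machines), and the transactions and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mining step for one partition: count how often each item occurs."""
    return Counter(item for transaction in partition for item in transaction)

def mine_in_partitions(transactions, n_partitions):
    """Split the data, process the partitions concurrently, merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_items, partitions))
    return sum(partial_counts, Counter())

# Hypothetical transaction data.
transactions = [["soap", "milk"], ["soap"], ["bread", "milk"], ["soap", "bread"]]
counts = mine_in_partitions(transactions, 2)

# Incremental step: fold in newly arrived transactions without recounting the old ones.
counts.update(count_items([["milk"]]))
print(counts)
```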
Diverse Data Types Issues
 Handling of relational and complex types of data −
 The database may contain complex data objects, multimedia data
objects, spatial data, temporal data etc. It is not possible for one
system to mine all these kinds of data.
 Mining information from heterogeneous databases and global
information systems −
 The data is available at different data sources on a LAN or WAN.
These data sources may be structured, semi-structured or unstructured.
Mining knowledge from them therefore adds challenges to data
mining.
Data Mining Challenges
 Although data mining is very powerful, it faces many challenges
during its execution. Various challenges could be related to
performance, data, methods, and techniques, etc. The process of data
mining becomes effective when the challenges or problems are
correctly recognized and adequately resolved.
 Incomplete and noisy data
 Data Distribution
 Complex Data
 Performance
 Data Privacy and Security
 Data Visualization
1.) Incomplete and noisy data:
 The process of extracting useful data from large volumes of data is
data mining.
 The data in the real-world is heterogeneous, incomplete, and noisy.
 Data in huge quantities will usually be inaccurate or unreliable.
These problems may occur due to faulty measuring instruments or
because of human errors.
 Suppose a retail chain collects the phone numbers of customers who
spend more than $500, and the accounting employees enter the
information into their system.
 The person may make a digit mistake when entering the phone
number, which results in incorrect data.
 Some customers may not even be willing to disclose their phone
numbers, which results in incomplete data.
 The data could get changed due to human or system error.
 All these problems (noisy and incomplete data) make data
mining challenging.
2.) Data Distribution:
 Real-world data is usually stored on various platforms in a
distributed computing environment.
 It might be in a database, individual systems, or even on the internet.
 In practice, it is quite difficult to bring all the data into a
centralized data repository, mainly due to organizational and technical
concerns.
 For example, various regional offices may have their own servers to
store their data. It is not feasible to store all the data from all the
offices on a central server. Therefore, data mining requires the
development of tools and algorithms that allow the mining of
distributed data.
3.) Complex Data:
 Real-world data is heterogeneous, and it could be multimedia data,
including audio and video, images, complex data, spatial data, time
series, and so on.
 Managing these various types of data and extracting useful
information is a tough task.
 Most of the time, new technologies, new tools, and methodologies
would have to be refined to obtain specific information.
4.) Performance:
 The data mining system's performance relies primarily on the
efficiency of algorithms and techniques used.
 If the designed algorithm and techniques are not up to the mark, then
the efficiency of the data mining process will be affected adversely.
5.) Data Privacy and Security:
 Data mining usually leads to serious problems in terms of data
security, governance, and privacy.
 For example, if a retailer analyzes the details of the purchased items,
then it reveals data about buying habits and preferences of the
customers without their permission.
6.) Data Visualization:
 In data mining, data visualization is a very important process because
it is the primary method that shows the output to the user in a
presentable way.
 The extracted data should convey the exact meaning of what it
intends to express.
 But many times, representing the information to the end-user in a
precise and easy way is difficult.
 Because the input data and the output information are complicated,
very efficient data visualization processes need to be implemented
for the presentation to succeed.
Data Mining vs Data Warehousing
S.no. | Data Mining | Data Warehousing
1 | Data mining is a method of comparing large amounts of data to find the right patterns. | Data warehousing is a method of centralizing data from different sources into one common repository.
2 | Data mining is the process of determining data patterns. | A data warehouse is a database system designed for analytics.
3 | In data mining, data is analyzed repeatedly. | In data warehousing, data is stored periodically.
4 | Data mining uses pattern recognition techniques to identify patterns. | Data warehousing is the process of extracting and storing data in a way that allows easier reporting.
5 | Data mining helps to create suggestive patterns of important factors, such as the buying habits of customers and products. | A data warehouse adds extra value to operational business systems such as CRM systems when the warehouse is integrated.
6 | After successful initial queries, users may ask more complicated queries, which would increase the workload. | A data warehouse is complicated to implement and maintain.
7 | Data mining techniques are never 100% accurate and may cause serious consequences in certain conditions. | In a data warehouse, there is a great chance that data required for analysis by the organization has not been integrated into the warehouse, which can easily lead to loss of information.
Alternative names for Data Mining :
1. Knowledge discovery (mining) in databases (KDD)
2. Knowledge extraction
3. Data/pattern analysis
4. Data archeology
5. Data dredging
6. Information harvesting
7. Business intelligence
Data Mining Implementation Process
 Many different sectors are taking advantage of data mining to boost
their business efficiency, including manufacturing, chemical,
marketing, aerospace, etc.
 Therefore, the need for a standard data mining process has grown
considerably.
 Data mining is described as a process of finding hidden, valuable
information by evaluating the huge quantity of data stored in data
warehouses, using multiple data mining techniques such as Artificial
Intelligence (AI), machine learning, and statistics.
The Cross-Industry Standard Process for Data
Mining (CRISP-DM)
 The Cross-Industry Standard Process for Data Mining (CRISP-DM)
comprises six phases arranged as a cycle, as shown in the given
figure:
1. Business understanding:
 It focuses on understanding the project goals and requirements from
a business point of view, then converting this knowledge into a data
mining problem definition and a preliminary plan designed to achieve
the objectives.
 Tasks:
• Determine business objectives
• Assess situation
• Determine data mining goals
• Produce a project plan
 First, you need to understand business and client objectives.
 You need to define what your client wants (which, many times, even
they do not know themselves).
 Take stock of the current data mining scenario.
 Factor resources, assumptions, constraints, and other significant
factors into your assessment.
 Using business objectives and current scenario, define your data
mining goals.
 A good data mining plan is very detailed and should be developed to
accomplish both business and data mining goals.
 Determine business objectives:
• Understand the project targets and prerequisites from a business
point of view.
• Carefully understand what the customer wants to achieve.
• Uncover, at the outset, significant factors that can affect the outcome
of the project.
 Assess situation:
• It requires a more detailed analysis of all the resources,
constraints, assumptions, and other factors that ought to be considered.
 Determine data mining goals:
• A business goal states the target in business terminology. For example,
increase catalog sales to existing customers.
• A data mining goal states the project objectives in technical terms. For
example, predict how many items a customer will buy, given their
demographic details (age, salary, and city) and the price of the item over
the past three years.
 Produce a project plan:
• It states the intended plan for achieving the business and data mining goals.
• The project plan should define the expected set of steps to be performed
during the rest of the project, including an initial selection of techniques
and tools.
2. Data Understanding:
 Data understanding starts with an initial data collection and
proceeds with activities to become familiar with the data, identify data
quality issues, gain first insights into the data, or detect interesting
subsets to form hypotheses about hidden information.
 Tasks:
• Collect initial data
• Describe data
• Explore data
• Verify data quality
 First, data is collected from multiple data sources available in the
organization.
 These data sources may include multiple databases, flat files, or data
cubes.
 Issues such as object matching and schema integration can arise
during the data integration process.
 It is quite a complex and tricky process, as data from various sources
are unlikely to match easily.
 For example, table A contains an attribute named cust_no whereas
another table B contains an attribute named cust-id.
 Therefore, it is quite difficult to determine whether both of these
attributes refer to the same value.
 Here, metadata should be used to reduce errors in the data
integration process.
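As a minimal sketch of how metadata can reduce such integration errors, the Python snippet below maps source-specific column names onto one canonical name before combining records. The mapping COLUMN_METADATA, the canonical name customer_id, and the sample rows are hypothetical and introduced only for illustration.

```python
# Hypothetical metadata: maps each source column name to a canonical name.
COLUMN_METADATA = {
    "cust_no": "customer_id",   # name used in table A
    "cust-id": "customer_id",   # name used in table B
}

def to_canonical(row):
    """Rename the keys of one record according to the metadata mapping."""
    return {COLUMN_METADATA.get(key, key): value for key, value in row.items()}

def integrate(table_a, table_b):
    """Combine two tables into one dict keyed by the canonical customer id."""
    merged = {}
    for row in list(table_a) + list(table_b):
        record = to_canonical(row)
        merged.setdefault(record["customer_id"], {}).update(record)
    return merged

# Hypothetical sample rows from the two sources.
table_a = [{"cust_no": 101, "name": "Asha"}]
table_b = [{"cust-id": 101, "city": "Indore"}]
combined = integrate(table_a, table_b)
```

Without the mapping, the two rows would be treated as describing different attributes even though they refer to the same customer.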
 The next step is to examine the properties of the acquired data.
 A good way to explore the data is to answer the data mining
questions (decided in the business understanding phase) using query,
reporting, and visualization tools.
 Based on the results of the queries, the data quality should be ascertained.
Missing data, if any, should be acquired.
 Collect initial data:
• It acquires the data listed in the project resources.
• It includes data loading, if needed for data understanding.
• It may lead to initial data preparation steps.
• If multiple data sources are acquired, then integration is an
additional issue, either here or in the later data preparation phase.
 Describe data:
• It examines the "gross" or "surface" properties of the acquired data.
• It reports on the results.
 Verify data quality:
• It examines the quality of the data, addressing questions such as
whether the data is complete and correct.
 Explore data:
• It addresses data mining questions that can be resolved by querying,
visualizing, and reporting, including:
• Distribution of important attributes, results of simple
aggregations.
• Relationships between pairs or small numbers of attributes.
• Properties of important sub-populations, simple statistical analyses.
• It may refine the data mining objectives.
• It may contribute to or refine the data description and quality
reports.
• It may feed into the transformation and other necessary data
preparation.
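The exploration and quality checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the attribute names age and city, and the use of None to mark a missing value are all hypothetical.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 25, "city": "Indore"},
    {"age": None, "city": "Bhopal"},
    {"age": 41, "city": "Indore"},
    {"age": 33, "city": None},
]

def missing_counts(rows):
    """Verify data quality: count missing (None) values per attribute."""
    counts = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] += 1
    return dict(counts)

def distribution(rows, attribute):
    """Explore data: frequency of each observed value of one attribute."""
    return dict(Counter(r[attribute] for r in rows if r[attribute] is not None))

missing = missing_counts(records)
city_distribution = distribution(records, "city")
```

Here missing reports one missing value for each attribute, and city_distribution shows how the records are spread across cities.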
3. Data Preparation:
• It is usually the most time-consuming phase; by some estimates it
takes up to 90 percent of the project time.
• It covers all operations to build the final data set from the original
raw data.
• Data preparation is likely to be performed several times and not in any
prescribed order.
 Tasks:
• Select data
• Clean data
• Construct data
• Integrate data
• Format data
 Data transformation operations contribute toward the success
of the mining process.
 Smoothing: It helps to remove noise from the data.
 Aggregation: Summary or aggregation operations are applied to the
data. For example, weekly sales data is aggregated to calculate the
monthly and yearly totals.
 Generalization: In this step, low-level data is replaced by higher-
level concepts with the help of concept hierarchies. For example, the
city is replaced by the county.
 Normalization: Normalization is performed when the attribute data are
scaled up or scaled down. Example: data should fall in the range -2.0
to 2.0 post-normalization.
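Two of the transformations above, aggregation and normalization, can be sketched as follows. This is a simplified illustration, not a prescribed method: the function names are hypothetical, a month is assumed to be exactly four weeks, and min-max scaling is used as one common way to bring values into the range -2.0 to 2.0.

```python
def aggregate_weekly_to_monthly(weekly_sales, weeks_per_month=4):
    """Sum consecutive weekly totals into monthly totals (assumes 4 weeks = 1 month)."""
    return [sum(weekly_sales[i:i + weeks_per_month])
            for i in range(0, len(weekly_sales), weeks_per_month)]

def min_max_normalize(values, new_min=-2.0, new_max=2.0):
    """Linearly rescale values so they fall in [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values identical: map everything to the middle of the target range.
        return [(new_min + new_max) / 2 for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

weekly = [10, 20, 30, 40, 5, 5, 5, 5]          # hypothetical weekly sales
monthly = aggregate_weekly_to_monthly(weekly)   # two monthly totals
scaled = min_max_normalize([0, 50, 100])        # smallest -> -2.0, largest -> 2.0
```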
 Select data:
• It decides which data will be used for analysis.
• Selection criteria include relevance to the data mining
objectives, quality, and technical constraints such as limits on data
volume or data types.
• It covers the selection of attributes (columns) as well as the
selection of records (rows) in a table.
 Clean data:
• It may involve the selection of clean subsets of data, inserting
appropriate defaults, or more ambitious methods, such as estimating
missing data by modeling.
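The cleaning options above can be illustrated with a short Python sketch. It shows inserting a fixed default and, as the simplest stand-in for estimating missing values, filling gaps with the mean of the observed values. The salary figures and the use of None for a missing entry are assumptions made for the example.

```python
from statistics import mean

def fill_with_default(values, default):
    """Replace missing entries (None) with a fixed default value."""
    return [default if v is None else v for v in values]

def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    estimate = mean(observed)
    return [estimate if v is None else v for v in values]

salaries = [30000, None, 50000, 40000]   # hypothetical data with one gap
with_default = fill_with_default(salaries, 0)
with_mean = fill_with_mean(salaries)
```

A model-based estimate would replace the mean with a prediction from the other attributes of the same record.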
 Construct data:
• It comprises constructive data preparation operations, such as
producing derived attributes, entirely new records, or
transformed values of existing attributes.
 Integrate data:
• Integrate data refers to the methods whereby data is combined from
multiple tables or records to create new records or values.
 Format data:
• Formatting data refers mainly to syntactic changes made to the
data that do not alter its meaning but may be required by the
modeling tool.
4. Modeling:
 In modeling, various modeling methods are selected and applied, and
their parameters are calibrated to optimal values.
 Some methods have particular requirements on the form of the data.
 Therefore, stepping back to the data preparation phase is often necessary.
 Tasks:
• Select modeling technique
• Generate test design
• Build model
• Assess model
 Select modeling technique:
• It selects the actual modeling method to be used, for example, a
decision tree or a neural network.
• If several methods are applied, this task is performed separately
for each method.
 Generate test design:
• Before building a model, generate a procedure or mechanism for
testing the model's validity and quality.
• For example, in classification, error rates are commonly used as
quality measures for data mining models. Therefore, the data set is
typically separated into a train set and a test set; the model is built
on the train set and its quality is assessed on the separate test set.
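The train/test design described above can be sketched as follows. To keep the focus on the test design itself, the "model" is only a trivial majority-class baseline; that choice, the function names, the 25 percent test fraction, and the sample rows are all assumptions made for illustration.

```python
import random
from collections import Counter

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows reproducibly and split them into train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority_class(train):
    """Build a trivial baseline model: always predict the most common label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def error_rate(predicted_label, test):
    """Fraction of test rows whose true label differs from the prediction."""
    wrong = sum(1 for _, label in test if label != predicted_label)
    return wrong / len(test)

# Hypothetical rows of (record id, class label): six "low" and two "high".
rows = [(i, "low" if i % 4 else "high") for i in range(8)]
train, test = train_test_split(rows)
model = fit_majority_class(train)      # built on the train set only
test_error = error_rate(model, test)   # assessed on the separate test set
```

Any real classifier can be substituted for the baseline; the key point is that the error rate is measured on rows the model never saw during training.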
 Build model:
• To create one or more models, we need to run the modeling tool on
the prepared data set.
 Assess model:
• The models are interpreted according to domain knowledge, the data
mining success criteria, and the desired test design.
• It judges the success of the application of modeling and discovery
techniques from a technical point of view.
• Business analysts and domain specialists are contacted later to discuss
the outcomes of data mining in the business context.
5. Evaluation:
• It evaluates the model thoroughly and reviews the steps executed to
build it, to ensure that the business objectives are properly achieved.
A key objective of the evaluation is to determine whether some
significant business issue has not been considered adequately. At the
end of this phase, a decision on the use of the data mining results
should be reached.
 Tasks:
• Evaluate results
• Review process
• Determine next steps
 Evaluate results:
• It assesses the degree to which the model meets the organization's
business objectives.
• It tests the model on test applications in the real setting when
time and budget constraints permit, and also assesses other data
mining results produced.
• It unveils additional challenges, suggestions, or information for
future directions.
 Review process:
• The review process does a more detailed evaluation of the data
mining engagement to determine whether any significant factor or
task has somehow been overlooked. It also reviews quality assurance
issues.
 Determine next steps:
• It decides how to proceed at this stage.
• It decides whether to finish the project and move on to
deployment, where appropriate, or whether to initiate further iterations
or set up new data mining projects. This includes an analysis of the
remaining resources and budget, which influences the decision.
• A go or no-go decision is taken to move the model in the
deployment phase.
6. Deployment:
 Determine:
• Deployment refers to how the outcomes need to be utilized.
 Deploy data mining results by:
• Scoring a database, utilizing results as company
guidelines, or interactive internet scoring.
• The knowledge gained needs to be organized and presented in
a way that the client can use. Depending on the requirements,
the deployment phase may be as simple as generating a report or as
complicated as implementing a repeatable data mining process across the
organization.
 A final project report is created with lessons learned and key
experiences during the project. This helps to improve the
organization's business policy.
 Tasks:
• Plan deployment
• Plan monitoring and maintenance
• Produce final report
• Review project
 Plan deployment:
• To deploy the data mining outcomes into the business, this task takes
the evaluation results and determines a strategy for deployment.
• It also documents the procedure for later deployment.
 Plan monitoring and maintenance:
• It is important when the data mining results become part of the day-to-day
business and its environment.
• It helps to avoid unnecessarily long periods of incorrect use of data
mining results. It needs a detailed plan for the monitoring process.
 Produce final report:
• A final report is drawn up by the project leader and the team.
• It may only be a summary of the project and its experience.
• It may be a final and comprehensive presentation of the data mining results.
 Review project:
• The project review evaluates what went right, what went wrong, what was
done well, and what needs to be improved.
Data Mining Techniques
 One of the most important tasks in Data Mining is to select the
correct data mining technique.
 Data Mining technique has to be chosen based on the type of
business and the type of problem your business faces.
 A generalized approach has to be used to improve the accuracy and
cost-effectiveness of using data mining techniques.
 Seven main data mining techniques are discussed here.
 There are many other data mining techniques, but these seven
are the ones most frequently used in business.
• Statistics, Clustering, Visualization, Decision Tree, Association
Rules, Neural Networks, Classification.
1. Classification:
 This technique is used to obtain important and relevant information
about data and metadata.
 This data mining technique helps to classify data in different classes.
 Classification is one of the most commonly used data mining
techniques. It uses a set of pre-classified samples to build a model
that can classify a larger set of data.
 There are two main processes involved in this technique:
• Learning – In this process, the training data are analyzed by the
classification algorithm.
• Classification – In this process, test data are used to measure the
accuracy of the classification rules.
 Data mining systems can themselves be classified by different
criteria, as follows:
i. Classification of data mining frameworks as per the type of data
sources mined:
This classification is as per the type of data handled, for example,
multimedia, spatial data, text data, time-series data, World Wide
Web, and so on.
ii. Classification of data mining frameworks as per the database
involved:
This classification is based on the data model involved, for example,
object-oriented database, transactional database, relational database,
and so on.
iii. Classification of data mining frameworks as per the kind of
knowledge discovered:
This classification depends on the types of knowledge discovered or
data mining functionalities, for example, discrimination,
classification, clustering, characterization, etc. Some frameworks are
comprehensive and offer several data mining functionalities together.
iv. Classification of data mining frameworks according to data
mining techniques used:
This classification is as per the data analysis approach utilized, such as
neural networks, machine learning, genetic algorithms, visualization,
statistics, data warehouse-oriented or database-oriented, etc.
 There are different types of classification models. They are as
follows
• Classification by decision tree induction
• Bayesian Classification
• Neural Networks
• Support Vector Machines (SVM)
• Classification Based on Associations
• Learning step (training phase): In this, a classification algorithm
builds the classifier by analyzing a training set.
• Classification step: Test data are used to estimate the accuracy or
precision of the classification rules.
 For example, a bank uses classification to identify loan applicants
as low, medium, or high credit risks. Similarly, a medical researcher
analyzes cancer data to predict which medicine to prescribe to the
patient.
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database tuples and their
associated class labels.
 In this step, the classifier is used for classification. Here the test data is used to
estimate the accuracy of classification rules. The classification rules can be applied to
the new data tuples if the accuracy is considered acceptable.
2. Clustering Technique
 Clustering is one of the oldest techniques used in Data Mining.
 Clustering analysis is the process of identifying data that are similar
to each other.
 This will help to understand the differences and similarities between
the data.
 This is sometimes called segmentation and helps the users to
understand what is going on within the database.
 For example, an insurance company can group its customers based
on their income, age, nature of policy and type of claims.
 Clustering is similar to classification, but it groups chunks of data
together based on their similarities, without predefined class labels.
 There are different types of clustering methods. They are as follows
• Partitioning Methods
• Hierarchical Agglomerative methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Methods
 A popular related technique is Nearest Neighbour. In business, the
Nearest Neighbour technique is most often used in the process of
text retrieval.
 It is used to find the documents that share important
characteristics with a main document that has been marked as
interesting.
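As an illustration of a partitioning method, the sketch below implements plain k-means in Python. It is a minimal version under stated assumptions: the customer tuples (age, income in thousands) are hypothetical, the first k points are used as initial centroids, and squared Euclidean distance on unscaled attributes is used as the similarity measure.

```python
def k_means(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [points[i] for i in range(k)]   # naive initialization: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:          # assignments have stabilized
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical insurance customers as (age, income in thousands).
customers = [(25, 30), (60, 90), (27, 32), (62, 95), (24, 28), (58, 88)]
centroids, clusters = k_means(customers, k=2)
```

On this data the algorithm separates the younger, lower-income customers from the older, higher-income ones. In practice, attributes are usually normalized first so that no single attribute dominates the distance.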
 A similar example of loan applicants can be considered here also. There
are some differences that are depicted in the figure below.
3. Regression:
 Regression analysis is the data mining method of identifying and
analyzing the relationship between variables.
 It is used to estimate the likely value of a specific variable, given
the presence of other variables.
 Regression is primarily a form of planning and modeling.
 For example, we might use it to project certain costs, depending on
other factors such as availability, consumer demand, and
competition.
 Primarily, it estimates the relationship between two or more
variables in the given data set.
 Another commonly cited example of this data mining technique is
matching people on dating portals.
 Many websites use variables to match people according to their likes,
interests, and hobbies.
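A minimal sketch of regression with one independent variable is shown below, using ordinary least squares to fit a straight line. The variable names demand and cost and the sample values are hypothetical; real cost projections would normally involve several factors.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

demand = [1, 2, 3, 4, 5]     # hypothetical independent variable
cost = [3, 5, 7, 9, 11]      # hypothetical dependent variable (cost = 2 * demand + 1)
slope, intercept = fit_line(demand, cost)
predicted_cost = intercept + slope * 6   # project the cost for a new demand value
```

Because the sample data lie exactly on a line, the fit recovers a slope of 2 and an intercept of 1, and projects a cost of 13 for a demand of 6.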
4. Association Rule Technique
 This technique helps to find the association between two or more
items.
 It helps to know the relations between the different variables in
databases.
 It discovers hidden patterns in the data sets, identifying
variables and combinations of variables that occur together with
high frequency.
 There are three types of association rule. They are:
1. Multilevel Association Rule
2. Multidimensional Association Rule
3. Quantitative Association Rule
 This technique is most often used in the retail industry to find
patterns in sales. This can help increase the conversion rate and thus
increase profit.
 Association rules are if-then statements that help to show the
probability of relationships between data items within large data sets
in different types of databases.
 Association rule mining has several applications and is commonly
used to find sales correlations in transactional data or in medical
data sets.
 The algorithm works on a collection of data, for example, a list of
grocery items that you have been buying for the last six months.
 It calculates the percentage of items being purchased together.
 There are three major measures:
• Support:
This measure shows how often items A and B are purchased together,
compared to the overall dataset.
Support = (transactions with item A and item B) / (all transactions)
• Confidence:
This measure shows how often item B is purchased when item A is
purchased as well.
Confidence = (transactions with item A and item B) / (transactions with item A)
• Lift:
This measure compares the confidence with how often item B is
purchased overall.
Lift = Confidence / ((transactions with item B) / (all transactions))
 Suppose, the marketing manager of a supermarket wants to
determine which products are frequently purchased together.
 As an example,
 buys(x, "beer") -> buys(x, "chips") [support = 1%, confidence =
50%]
• Here x represents a customer.
• Confidence means that if a customer buys beer, there is a
50% chance that he/she will buy chips also.
• Support means that 1% of all the transactions under analysis showed
that beer and chips were bought together.
• Many similar examples, like bread and butter or computer and
software, can be considered.
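The support and confidence of a rule like the one above, together with lift, can be computed directly from a list of transactions. The sketch below uses a small hypothetical set of five transactions, so its percentages differ from the 1% and 50% quoted in the example; the function names are also illustrative.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "bread"},
    {"beer"},
    {"chips", "bread"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b, transactions):
    """How often b is bought in transactions that also contain a."""
    return support(a | b, transactions) / support(a, transactions)

def lift(a, b, transactions):
    """Confidence of a -> b relative to how often b is bought overall."""
    return confidence(a, b, transactions) / support(b, transactions)

beer, chips = {"beer"}, {"chips"}
rule_support = support(beer | chips, transactions)     # 2 of 5 transactions
rule_confidence = confidence(beer, chips, transactions)
rule_lift = lift(beer, chips, transactions)
```

A lift above 1 suggests that the two items are bought together more often than would be expected if they were independent.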
 By the number of dimensions involved, there are two types of
association rules:
• Single-dimensional association rule: These rules contain a single
predicate (attribute), such as buys, which may occur more than once.
• Multidimensional association rule: These rules involve two or more
predicates (attributes), such as age, income, and buys.
5. Outlier detection:
 This type of data mining technique relates to the observation of data
items in the data set which do not match an expected pattern or
expected behavior.
 This technique may be used in various domains such as intrusion
detection, fraud detection, etc.
 It is also known as outlier analysis or outlier mining. An outlier is
a data point that diverges too much from the rest of the dataset. The
majority of real-world datasets contain outliers. Outlier detection
plays a significant role in the data mining field.
 Outlier detection is valuable in numerous fields such as network
intrusion detection, credit or debit card fraud detection,
detecting outlying readings in wireless sensor network data, etc.
 For example, let’s assume the graph below is plotted using some data
sets in our database.
 A best-fit line is drawn. The points lying near the line show
expected behavior, while a point far from the line is an outlier.
 This helps to detect anomalies and take appropriate action
accordingly.
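The idea of flagging points that lie far from a best-fit line can be sketched as follows. This is one simple approach among many, under stated assumptions: the line is fitted by least squares, a point counts as an outlier when its residual exceeds two standard deviations of all residuals, and the sample points and threshold are hypothetical.

```python
from statistics import pstdev

def fit_line(points):
    """Least-squares best-fit line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def find_outliers(points, threshold=2.0):
    """Return points whose distance from the line exceeds threshold * residual spread."""
    slope, intercept = fit_line(points)
    residuals = [y - (intercept + slope * x) for x, y in points]
    spread = pstdev(residuals)
    if spread == 0:
        return []                      # every point lies exactly on the line
    return [p for p, r in zip(points, residuals) if abs(r) > threshold * spread]

# Hypothetical data: y is roughly 2 * x, except for one anomalous reading.
points = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 30), (6, 12), (7, 14), (8, 16)]
outliers = find_outliers(points)
```

Note that the outlier itself pulls the fitted line toward it, so more robust fitting methods are often preferred on real data.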
6. Sequential Patterns:
 Sequential pattern mining is a data mining technique specialized
for evaluating sequential data to discover sequential patterns.
 It consists of finding interesting subsequences in a set of
sequences, where the interestingness of a subsequence can be measured
in terms of different criteria such as length, occurrence frequency, etc.
 In other words, this technique helps to discover or recognize
similar patterns or trends in transaction data over a certain period.
 This method is used to identify patterns that occur frequently over a
certain period of time.
 For example, the sales manager of a clothing company sees that sales
of jackets seem to increase just before the winter season, or sales in
a bakery increase during Christmas or New Year’s Eve.
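Measuring how frequently an ordered pattern occurs across a set of sequences can be sketched as below. The purchase sequences and item names are hypothetical, and the pattern is allowed to have gaps between its items, which is one common convention and an assumption here.

```python
def contains_subsequence(sequence, pattern):
    """True if the items of pattern occur in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # "item in remaining" advances the iterator, so order is respected.
    return all(item in remaining for item in pattern)

def pattern_support(sequences, pattern):
    """Fraction of sequences that contain the pattern as a subsequence."""
    matches = sum(1 for s in sequences if contains_subsequence(s, pattern))
    return matches / len(sequences)

# Hypothetical purchase histories, each ordered by time.
purchases = [
    ["jacket", "gloves", "scarf"],
    ["jacket", "scarf"],
    ["scarf", "jacket"],
    ["gloves", "boots"],
]
support_jacket_then_scarf = pattern_support(purchases, ["jacket", "scarf"])
```

Only the first two histories contain a jacket followed later by a scarf, so the pattern's support is 0.5. The third history contains both items but in the opposite order.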
7. Prediction:
 Prediction uses a combination of other data mining techniques such
as trend analysis, clustering, classification, etc.
 It analyzes past events or instances in the right sequence to predict a
future event.
 Prediction is one of the most valuable data mining techniques, since
it’s used to project the types of data you’ll see in the future.
 In many cases, just recognizing and understanding historical trends is
enough to chart a somewhat accurate prediction of what will happen
in the future.
 For example, you might review consumers’ credit histories and past
purchases to predict whether they’ll be a credit risk in the future.
 For example, the sales manager of a supermarket may want to
predict the amount of revenue that each item will generate, based
on past sales data. Numeric prediction models a continuous-valued
function that predicts missing or unavailable numeric data values.
 Regression analysis is a common choice for numeric prediction. It can
be used to model the relationship between independent variables and
dependent variables.
 Sequence Prediction:
 Before defining the problem of sequence prediction, it is necessary to
first explain what a sequence is. A sequence is an ordered list of
symbols. For example, here are some common types of sequences:
• A sequence of webpages visited by a user, ordered by the time of
access.
• A sequence of words or characters typed on a cellphone by a user, or
in a text such as a book.
• A sequence of products bought by a customer in a retail store
• A sequence of proteins in bioinformatics
• A sequence of symptoms observed on a patient at a hospital
 The task of sequence prediction consists of predicting the next symbol of a
sequence based on the previously observed symbols. For example, if a user has
visited some webpages A, B, C, in that order, one may want to predict what is
the next webpage that will be visited by that user to prefetch the webpage.
 First, one must train a sequence prediction model using some previously
seen sequences called the training sequences. This process is illustrated below:
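A minimal sketch of training and using a sequence prediction model is given below. It assumes a first-order Markov model, which predicts the next symbol only from the last observed symbol; this model choice, the function names, and the training sequences are assumptions for illustration, and practical sequence predictors are usually more elaborate.

```python
from collections import Counter, defaultdict

def train(sequences):
    """Count, for every symbol, which symbols followed it in the training sequences."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            transitions[current][following] += 1
    return transitions

def predict_next(transitions, observed):
    """Predict the most frequent follower of the last observed symbol, or None if unseen."""
    followers = transitions.get(observed[-1])
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical training sequences of visited webpages.
training_sequences = [
    ["A", "B", "C", "D"],
    ["B", "C", "D"],
    ["A", "C", "E"],
]
model = train(training_sequences)
prediction = predict_next(model, ["A", "B", "C"])   # page to prefetch after A, B, C
```

In the training data, page C was followed by D twice and by E once, so D is predicted as the next page.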
 Some Other Data Mining Techniques:-
 Statistical Techniques:
 Statistics is a branch of mathematics which relates to the
collection and description of data.
 Statistical techniques are not considered data mining techniques
by many analysts.
 Still, they help to discover patterns and build predictive models.
 For this reason, a data analyst should possess some knowledge of
the different statistical techniques.
 In today’s world, people have to deal with a large amount of data and
derive important patterns from it.
 Statistics can help you, to a great extent, to answer questions
about your data such as:
• What are the patterns in the database?
• What is the probability that an event will occur?
• Which patterns are more useful to the business?
• What is the high-level summary that can give you an overview of
what is in the database?
 Statistics not only answers these questions; it also helps in
summarizing and counting the data.
 It also helps in providing information about the data with ease.
Through statistical reports, people can make smart decisions.
 There are different forms of statistics, but the most important and
useful technique is the collection and counting of data. There are
many ways to summarize the collected data, such as:
• Histogram
• Mean
• Median
• Mode
• Variance
• Max
• Min
• Linear Regression
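Most of the summary measures listed above are available in Python's standard statistics module. The snippet below is a small illustration on a hypothetical list of daily sales counts; the population variance is used, which is an assumption about which variance is intended.

```python
import statistics

sales = [2, 4, 4, 5, 10]   # hypothetical daily sales counts

summary = {
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "mode": statistics.mode(sales),
    "variance": statistics.pvariance(sales),   # population variance
    "max": max(sales),
    "min": min(sales),
}
```

For a sample drawn from a larger population, statistics.variance gives the sample variance instead.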
 Decision Trees
 A decision tree is a tree structure (as its name suggests), where
• Each internal node represents a test on an attribute.
• Each branch denotes an outcome of the test.
• Terminal (leaf) nodes hold the class label.
• The topmost node is the root node, which asks a simple question that
has two or more answers. The tree grows accordingly, and a
flowchart-like structure is generated.
 In this decision tree, the government classifies citizens as below
age 18 or above age 18. This helps to decide whether a license should
be issued to a particular citizen or not.
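A decision tree of this kind can be represented as a small nested structure and walked from the root to a leaf. In the sketch below, the age test comes from the example above, while the second test on passed_test, the class labels, and the dictionary layout are hypothetical additions made only to show more than one level.

```python
def classify(node, record):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while isinstance(node, dict):          # dict = internal node, str = leaf (class label)
        passed = record[node["attribute"]] >= node["threshold"]
        node = node["yes"] if passed else node["no"]
    return node

license_tree = {
    "attribute": "age", "threshold": 18,          # root node: is the citizen 18 or older?
    "no": "do not issue license",
    "yes": {
        "attribute": "passed_test", "threshold": 1,   # hypothetical second test (1 = passed)
        "no": "do not issue license",
        "yes": "issue license",
    },
}

minor = classify(license_tree, {"age": 16, "passed_test": 1})
adult = classify(license_tree, {"age": 30, "passed_test": 1})
```

In a learned decision tree, the attributes and thresholds are chosen by the induction algorithm from training data rather than written by hand.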
Data Mining Tools
 In today’s world, a large amount of data is generated within seconds.
To handle this data, we should have some knowledge of different
techniques and tools.
 Data mining tools are software packages that implement a set of
methodologies for analyzing this large amount of data and the
relationships between different data.
 Data Mining tools have the objective of discovering
patterns/trends/groupings among large sets of data and transforming
data into more refined information.
1. Orange Data Mining Tool:
 It is open-source software written in Python.
 Orange is a component-based software suite for data analysis and
machine learning. Its components are called widgets.
 These widgets are used for reading data, analyzing it, selecting
features, and displaying the data.
 With Orange, formatting data and moving it between widgets
becomes fast and easy.
 Orange also brings a more interactive and enjoyable feel to
otherwise dull analytical tools.
 Widgets deliver significant functionalities such as:
• Displaying a data table and selecting features
• Data reading
• Training predictors and comparison of learning algorithms
• Data element visualization, etc.
2. SAS Data Mining:
 SAS stands for Statistical Analysis System.
 It is a product of the SAS Institute created for analytics and data
management.
 SAS can mine data, change it, manage information from various
sources, and analyze statistics.
 It offers a graphical UI for non-technical users.
 SAS data miner allows users to analyze big data and provides
accurate insight for timely decision-making.
 SAS has a distributed memory processing architecture that is highly
scalable.
 It is suitable for data mining, optimization, and text mining purposes.
3. DataMelt Data Mining:
 DataMelt is a computation and visualization environment which
offers an interactive structure for data analysis and visualization.
 It is primarily designed for students, engineers, and scientists.
 It is also known as DMelt.
 DMelt is a multi-platform utility written in Java.
 It can run on any operating system that supports the JVM
(Java Virtual Machine).
 It includes scientific and mathematical libraries.
 Scientific libraries:
Scientific libraries are used for drawing the 2D/3D plots.
 Mathematical libraries:
Mathematical libraries are used for random number generation,
algorithms, curve fitting, etc.
 DMelt can be used for the analysis of large volumes of data, data
mining, and statistical analysis.
 It is extensively used in natural sciences, financial markets, and
engineering.
4. Rattle:
 Rattle is a GUI-based data mining tool.
 It uses the R statistical programming language.
 Rattle exposes the statistical power of R by offering significant data
mining features.
 Rattle has a comprehensive, well-developed user interface, along with
an integrated log tab that reproduces the code for every GUI
operation.
 The data set produced by Rattle can be viewed and edited.
 Rattle also lets users review the code, reuse it for other purposes,
and extend it without restriction.
5. Rapid Miner:
 It is written in the Java programming language.
 Rapid Miner is one of the most popular predictive analysis systems,
created by the company of the same name.
 It offers an integrated environment for text mining, deep learning,
machine learning, and predictive analysis.
 The tool can be used for a wide range of applications, including
business and commercial applications, research, education, training,
application development, and machine learning.
 Rapid Miner offers its server on-site as well as in public or private
cloud infrastructure.
Data Mining Process
 Data Mining refers to extracting or mining knowledge from large
amounts of data.
 It is also defined as extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) patterns or knowledge
from a huge amount of data.
 Data mining is a rapidly growing field concerned with developing
techniques that help managers and decision-makers make intelligent
use of huge data repositories.
 It is the computational process of discovering patterns in large data
sets, involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems.
 The goal of the data mining process is to extract information from a
data set and transform it into an understandable structure for
further use.
 Major tasks of data pre-processing:
 Data Cleaning
 Data cleaning is a process to clean the data in such a way that data
can be easily integrated.
 Data Integration
 Data integration is a process to integrate/combine all the data.
 Data Reduction
 Data reduction is a process to reduce large data into smaller data,
in such a way that it can be easily transformed further.
 Data Transformation
 Data transformation is a process to transform the data into a form
suitable for mining.
 Data Discretization
 Data discretization converts a large number of data values into a
smaller number, so that data evaluation and data management become
very easy.
 After the completion of these tasks, the data is ready for mining.
 1.) Data Cleansing:-
 Data cleansing, or data cleaning, is the process of identifying and
removing (or correcting) inaccurate records in a dataset, table, or
database. It involves recognizing incomplete, unreliable, inaccurate,
or irrelevant parts of the data and then replacing, modifying, or
deleting the dirty or coarse data.
 To perform data analytics properly, we need various data cleaning
techniques so that the data is ready for analysis.
 Data cleaning may be performed as batch processing through scripting
or interactively with data cleansing tools.
 After cleaning, a dataset should be consistent with other related
datasets in the system.
 Data cleaning is not only an essential part of the data science
process; it is also the most time-consuming part.
 With the rise of big data, data cleaning methods have become more
important than ever. Every industry (banking, healthcare, retail,
hospitality, education) is now navigating a large ocean of data.
 “Data scientists spend 80% of their time cleaning and manipulating
data and only 20% of their time actually analyzing it.”
 Data is rarely clean. It can be incorrect for many reasons, such as
hardware errors or failures, network errors, or human error, so it
must be cleaned before mining.
 Sources of Missing Values
 There are many sources of missing data. Some major ones are:
1. The user forgot to fill in a field.
2. A programming error occurred.
3. Data was lost while being transferred manually from a legacy
database.
 Data Cleaning Techniques-Get Rid of Extra Spaces
 Here we have the text Welcome To Medicaps University written in
four different ways (quotation marks added to make the spaces visible):
 "welcome to Medicaps University"
 "welcome   to   Medicaps   University"
 "   welcome  to  Medicaps  University"
 "welcome to Medicaps University   "
 The first is the regular form, with one space between words. The
second has more than one space between words. The third has leading
spaces along with a couple of spaces between words. The fourth has
trailing spaces after the last word.
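One way to remove such extra spaces, sketched in Python. The helper name normalize_spaces is hypothetical, and the number of spaces in each variant is assumed from the description above.

```python
def normalize_spaces(text: str) -> str:
    # str.split() with no argument splits on runs of whitespace and drops
    # leading and trailing whitespace; join() puts single spaces back.
    return " ".join(text.split())

variants = [
    "welcome to Medicaps University",         # regular form
    "welcome   to   Medicaps   University",   # extra spaces between words
    "   welcome  to  Medicaps  University",   # leading spaces
    "welcome to Medicaps University   ",      # trailing spaces
]

cleaned = [normalize_spaces(v) for v in variants]
print(len(set(cleaned)))  # number of distinct strings after cleaning
```

After cleaning, all four variants collapse to the same string, so they are counted as one value.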
Fixing Structural errors
 The errors that arise during measurement, transfer of data or other
similar situations are called structural errors.
 Structural errors include typos in feature names, the same attribute
under different names, mislabeled classes (separate classes that
should really be the same), and inconsistent capitalization.
 For example, a model will treat America and america as different
classes or values, though they represent the same value. Likewise, it
will treat red, yellow and red-yellow as different classes, though
one class can be covered by the other two. Structural errors like
these make the model inefficient and give poor-quality results.
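A small sketch of fixing inconsistent capitalization and duplicate names in Python. The alias table and the function name canonical_label are hypothetical; a real project would build the alias table from its own data.

```python
# Hypothetical alias table: known variants mapped to one canonical label.
ALIASES = {"usa": "america", "united states": "america"}

def canonical_label(raw: str) -> str:
    # Collapse whitespace and ignore case before looking up aliases.
    label = " ".join(raw.split()).casefold()
    return ALIASES.get(label, label)

labels = ["America", "america", " AMERICA ", "USA"]
classes = {canonical_label(x) for x in labels}
print(classes)
```

All four raw labels end up in a single class, so the model no longer sees them as different values.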
 Some techniques of Data Cleaning Process
1. Parsing
2. Correcting
3. Standardizing
4. Matching
5. Consolidation
6. Dealing with missing data
7. Dealing with incorrect and noisy data
 Some methods for handling missing values:
 1 Ignore the tuple. This is usually done when the class label is
missing. The method is not very effective unless the tuple contains
several attributes with missing values.
 2 Fill in the missing value manually. This approach works on a small
data set with few missing values.
 3 Replace all missing attribute values with a global constant, such
as a label like “Unknown” or minus infinity.
 4 Use the attribute mean to fill in the missing value. For example,
if the average customer income is 25000, use this value to replace a
missing value for income.
 5 Use the most probable value to fill in the missing value.
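Methods 3 and 4 can be sketched in a few lines of Python. The income values are made up so that the mean of the known values is 25000, matching the example above; None is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical income column; None marks a missing value.
incomes = [20000, None, 30000, 25000, None]

known = [v for v in incomes if v is not None]
attribute_mean = mean(known)

# Method 3: replace missing values with a global constant.
filled_with_constant = ["Unknown" if v is None else v for v in incomes]

# Method 4: replace missing values with the attribute mean.
filled_with_mean = [attribute_mean if v is None else v for v in incomes]

print(filled_with_constant)
print(filled_with_mean)
```

Filling with a constant keeps the gap visible to later steps, while filling with the mean keeps the column numeric.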
 Noisy Data
 Noise is a random error or variance in a measured variable. Noisy
data may be due to faulty data collection instruments, data entry
problems and technology limitations.
 How to Handle Noisy Data?
 Binning:
 Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins.
 For example
 Price = 4, 8, 15, 21, 21, 24, 25, 28, 34
 Partition into (equal-frequency) bins:
 Bin a: 4, 8, 15
 Bin b: 21, 21, 24
 Bin c: 25, 28, 34
 In this example, the data for price are first sorted and then partitioned
into equal-frequency bins of size 3.
 Smoothing by bin means:
 Bin a: 9, 9, 9
 Bin b: 22, 22, 22
 Bin c: 29, 29, 29
 In smoothing by bin means, each value in a bin is replaced by the
mean value of the bin.
 Smoothing by bin boundaries:
 Bin a: 4, 4, 15
 Bin b: 21, 21, 24
 Bin c: 25, 25, 34
 In smoothing by bin boundaries, each bin value is replaced by the
closest boundary value.
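The binning example above can be reproduced with a short Python sketch. The function names are illustrative, and sending a value that lies exactly midway between the two boundaries to the lower boundary is an assumption (no such tie occurs in this data).

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
BIN_SIZE = 3

def equal_frequency_bins(values, size):
    """Sort the values and cut them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Assumption: an exact tie goes to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

bins = equal_frequency_bins(prices, BIN_SIZE)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```

The means come out as floating-point numbers (9.0, 22.0, 29.0) because of the division.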
 Regression
 Data can be smoothed by fitting the data to a regression function.
 Clustering:
 Outliers may be detected by clustering, where similar values are
organized into groups, or “clusters.” Values that fall outside of the
set of clusters may be considered outliers.
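Smoothing by regression can be sketched with an ordinary least-squares line fit in plain Python. The x and y values are hypothetical measurements, and a straight line is only one possible choice of regression function.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical noisy measurements

intercept, slope = fit_line(xs, ys)
# Each noisy y value is replaced by the value on the fitted line.
smoothed = [intercept + slope * x for x in xs]
print(round(intercept, 2), round(slope, 2))
print([round(v, 2) for v in smoothed])
```

The smoothed values lie exactly on the fitted line, so the random scatter around it is removed.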
2.) Data Integration In Data Mining
 This is the second step in the data mining process. Data from
various sources is combined into a single location.
 Data in your computer system is stored in different formats in
different locations: saved spreadsheets, text files, images,
documents, etc.
 Data integration can be difficult if the data is poorly organized
beforehand. Done well, it removes repetition from the data without
affecting its reliability.
 Data Integration is a data preprocessing technique that combines data
from multiple sources and provides users a unified view of these
data.
 There are two major approaches to data integration:
1 Tight Coupling
 In tight coupling, data is combined from different sources into a
single physical location through the process of ETL (Extraction,
Transformation and Loading). Here, a data warehouse is treated as an
information retrieval component.
2 Loose Coupling
 In loose coupling, the data remains only in the actual source
databases. An interface takes a query from the user, transforms it
into a form the source databases can understand, and sends it
directly to the source databases to obtain the result.
 Issues in Data Integration:
There are a number of issues to consider during data integration:
schema integration, redundancy, and detection and resolution of data
value conflicts. Each is explained briefly below.
 1. Schema Integration:
• Integrate metadata from different sources.
• Matching real-world entities from multiple sources is referred to
as the entity identification problem.
 For example, how can the data analyst or the computer be sure that
customer id in one database and customer number in another refer to
the same attribute?
 2. Redundancy:
• An attribute may be redundant if it can be derived from another
attribute or set of attributes.
• Inconsistencies in attribute naming can also cause redundancies in
the resulting data set.
• Some redundancies can be detected by correlation analysis.
 3. Detection and resolution of data value conflicts:
• This is the third important issue in data integration.
• Attribute values from different sources may differ for the same
real-world entity.
• An attribute in one system may be recorded at a lower level of
abstraction than the “same” attribute in another.
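The correlation analysis mentioned under Redundancy can be sketched with the Pearson correlation coefficient in plain Python. The two salary attributes are hypothetical, and the 0.95 threshold is an arbitrary assumption, not a standard value.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical attributes from two sources; one is derivable from the other.
annual_salary = [24000, 36000, 48000, 60000]
monthly_salary = [2000, 3000, 4000, 5000]

r = pearson(annual_salary, monthly_salary)
REDUNDANCY_THRESHOLD = 0.95   # assumption chosen for illustration
likely_redundant = abs(r) >= REDUNDANCY_THRESHOLD
print(round(r, 3), likely_redundant)
```

A coefficient close to +1 or -1 suggests that one attribute carries almost the same information as the other, so one of them may be dropped after review.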
3.) Data Transformation In Data Mining
 In the data transformation process, data are transformed from one
format into another that is more appropriate for data mining.
 Some Data Transformation Strategies:-
 1 Smoothing
 Smoothing is a process of removing noise from the data.
 2 Aggregation
 Aggregation is a process where summary or aggregation operations
are applied to the data.
 3 Generalization
 In generalization, low-level data are replaced with higher-level
concepts by climbing concept hierarchies.
 4 Normalization
 Normalization scales attribute data so that it falls within a small
specified range, such as 0.0 to 1.0.
 5 Attribute Construction
 In Attribute construction, new attributes are constructed from the
given set of attributes.
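As one illustration of normalization, min-max scaling to the range 0.0 to 1.0 can be sketched in Python. Min-max is only one of several normalization methods; the function name and the handling of a constant attribute are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_min == old_max:
        # Assumption: a constant attribute maps to the lower bound.
        return [new_min for _ in values]
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
normalized = min_max_normalize(prices)
print([round(v, 3) for v in normalized])
```

The smallest price maps to 0.0 and the largest to 1.0, with the other values spread proportionally in between.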
 Discrete vs Continuous Data
 If you have quantitative data, like the number of workers in a
company, could you divide any one of the workers into 2 parts? The
answer is no, because the number of workers is discrete data.
 Discrete data is a count that involves integers.
 Only a limited number of values is possible.
 The discrete values cannot be subdivided into parts.
 For example, the number of children in a school is discrete data. You
can count whole individuals. You can’t count 1.5 kids.
 So, discrete data can take only certain values. The data variables
cannot be divided into smaller parts.
 As mentioned above, the two types of quantitative (numerical) data
are discrete and continuous data.
 Continuous data is considered the opposite of discrete data.
 Continuous data is information that can be meaningfully divided into
finer levels.
 It can be measured on a scale and can have almost any numeric
value.
 For example, you can measure your height at very precise scales:
meters, centimeters, millimeters, etc.
 Many other measurements also yield continuous data: width,
temperature, time, etc.
 This is where the key difference from discrete data lies.
4.) Data discretization:
Data discretization converts a large number of data values into a smaller
number, so that data evaluation and data management become very easy.
 Data discretization techniques can be used to divide the range of a
continuous attribute into intervals. Numerous continuous attribute
values are replaced by a small number of interval labels.
 This leads to a concise, easy-to-use, knowledge-level representation
of mining results.
 Data discretization example
 We have an attribute age with the following values (before
discretization):
Age 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75
Attribute              Age                      Age                      Age
                       10, 11, 13, 14, 17, 19   30, 31, 32, 38, 40, 42   70, 72, 73, 75
After Discretization   Young                    Mature                   Old
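The age example can be sketched in Python. The cut points 20 and 60 are assumptions chosen so that the result matches the groups shown; the slide gives only the groups, not the boundaries.

```python
ages = [10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75]

def age_label(age):
    # Hypothetical cut points: below 20 is Young, below 60 is Mature.
    if age < 20:
        return "Young"
    if age < 60:
        return "Mature"
    return "Old"

# Group the original values under their interval label.
groups = {}
for age in ages:
    groups.setdefault(age_label(age), []).append(age)

for label, members in groups.items():
    print(label, members)
```

Sixteen distinct age values are replaced by three labels, which is the reduction that discretization aims for.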
Data Mining Process - Discretization and Concept
Hierarchy Generation for Numerical Data
 Typical methods:
 1 Binning
 Binning is a top-down splitting technique based on a specified
number of bins. Binning is an unsupervised discretization technique.
 2 Histogram Analysis
 Because histogram analysis does not use class information, it is an
unsupervised discretization technique. Histograms partition the
values for an attribute into disjoint ranges called buckets.
 3 Cluster Analysis
 Cluster analysis is a popular data discretization method. A clustering
algorithm can be applied to discretize a numerical attribute A by
partitioning the values of A into clusters or groups.
 Top-down discretization
 If the process starts by first finding one or a few points (called split
points or cut points) to split the entire attribute range, and then repeats
this recursively on the resulting intervals, then it is called top-down
discretization or splitting.
 Bottom-up discretization
 If the process starts by considering all of the continuous values as
potential split-points and removes some by merging neighboring values
to form intervals, then it is called bottom-up discretization or merging.
 Discretization can be performed recursively on an attribute to provide
a hierarchical partitioning of the attribute values, known as a concept
hierarchy.
 Concept hierarchies
 Concept hierarchies can be used to reduce the data by collecting and
replacing low-level concepts with higher-level concepts.
 In the multidimensional model, data are organized into multiple
dimensions, and each dimension contains multiple levels of
abstraction defined by concept hierarchies. This organization
provides users with the flexibility to view data from different
perspectives.
 Data mining on a reduced data set means fewer input/output
operations and is more efficient than mining on a larger data set.
 Because of these benefits, discretization techniques and concept
hierarchies are typically applied before data mining, rather than
during mining.
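Replacing low-level concepts with higher-level ones can be sketched as a roll-up in Python. The city-to-state hierarchy and the sales figures are hypothetical and serve only to show the mechanics.

```python
from collections import Counter

# Hypothetical concept hierarchy: city (low level) -> state (higher level).
CITY_TO_STATE = {
    "Indore": "Madhya Pradesh",
    "Bhopal": "Madhya Pradesh",
    "Mumbai": "Maharashtra",
}

# Hypothetical sales records at the city level.
sales = [("Indore", 120), ("Bhopal", 80), ("Mumbai", 200), ("Indore", 30)]

# Roll up: replace each city by its state and add up the amounts.
sales_by_state = Counter()
for city, amount in sales:
    sales_by_state[CITY_TO_STATE[city]] += amount

print(dict(sales_by_state))
```

Four city-level records shrink to two state-level totals, so later mining steps work on less data.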
Summary
 Data mining is all about explaining the past and predicting the future
through analysis.
 Data mining helps to extract information from huge sets of data. It is
the procedure of mining knowledge from data.
 The data mining process includes Business Understanding, Data
Understanding, Data Preparation, Modelling, Evaluation, and
Deployment.
 Important data mining techniques are classification, clustering,
regression, association rules, outlier detection, sequential patterns,
and prediction.
 R and Oracle Data Mining are prominent data mining tools.
 Data mining techniques help companies obtain knowledge-based
information.
• The main drawback of data mining is that much analytics software is
difficult to operate and requires advanced training to use.
• Data mining is used in diverse industries such as Communications,
Insurance, Education, Manufacturing, Banking, Retail, Service
providers, eCommerce, Supermarkets, and Bioinformatics.
Unit – 2
Any - 5 Assignment Questions Marks:-20
 Q.1 What is Data Mining? Explain the different stages for Data
Mining Process.
 Q.2 Describe challenges to Data Mining regarding data mining
methodology and user interaction issues.
 Q.3 Describe the various techniques of Data Mining. Write tools for
Data Mining.
 Q.4 What is Data Cleaning? Describe the approaches to fill missing
values and noisy data.
 Q.5 Explain Knowledge Discovery Process.
 Q.6 Define Support and Confidence in Association rule mining.
 Q.7 Explain Data mining Architecture. Write Some Application of
Data mining.
Questions
Thank You
Great God, Medi-Caps, All the attendees
Mr. Sagar Pandya
sagar.pandya@medicaps.ac.in
www.sagarpandya.tk
LinkedIn: /in/seapandya
Twitter: @seapandya
Facebook: /seapandya

 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 

Recently uploaded (20)

Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 

Data Mining

  • 1. MEDI-CAPS UNIVERSITY Faculty of Engineering Mr. Sagar Pandya Information Technology Department sagar.pandya@medicaps.ac.in
  • 2. Data Mining and Warehousing: Course Code: IT3ED02; Course Name: Data Mining and Warehousing; Hours Per Week (L-T-P): 3-0-0; Total Credits: 3.
  • 3. IT3ED02 Data Mining and Warehousing (3-0-0): Unit 1. Introduction; Unit 2. Data Mining; Unit 3. Association and Classification; Unit 4. Clustering; Unit 5. Business Analysis.
  • 4. Text Books: Han, Kamber and Pei, Data Mining: Concepts & Techniques, Morgan Kaufmann, India, 2012; Mohammed Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press; Z. Markov and Daniel T. Larose, Data Mining the Web, John Wiley & Sons, USA. Reference Books: Sam Anahory and Dennis Murray, Data Warehousing in the Real World, Pearson Education Asia; W. H. Inmon, Building the Data Warehouse, 4th Ed., Wiley India; and many others.
  • 5. Unit-2 Data Mining: Basics of Data Mining; Data mining techniques; KDP (Knowledge Discovery Process); Applications and Challenges of Data Mining; Data Pre-processing: Overview; Data cleaning, Data integration, Data reduction, Data transformation and discretization.
  • 6. Data Mining: There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information. It is necessary to analyze this huge amount of data and extract useful information from it. Extraction of information is not the only process we need to perform; data mining also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Data mining is also called knowledge discovery, knowledge extraction, data/pattern analysis, information harvesting, etc.
  • 7. Data Mining: Definition: “The process of extracting previously unknown, valid and actionable information from large databases and then using the information to make crucial business decisions.” “The science of extracting useful information from large datasets or databases.”
  • 8. Data Mining: Data mining is the search for hidden, valid, and potentially useful patterns in huge data sets. In other words, data mining is the procedure of mining knowledge from data. It is about discovering unsuspected or previously unknown relationships in the data. It is a multi-disciplinary skill that draws on machine learning, statistics, AI and database technology. The insights derived from data mining can be used for market analysis, fraud detection, customer retention, production control, scientific exploration, etc.
  • 9. Types of Data in Data Mining: Data mining can be performed on the following types of data: 1. Relational databases 2. Data warehouses 3. Advanced databases and information repositories 4. Object-oriented and object-relational databases 5. Transactional and spatial databases 6. Heterogeneous and legacy databases 7. Multimedia and streaming databases 8. Text databases 9. Text mining and Web mining
  • 10. History of Data Mining: The term "Data Mining" was introduced in the 1990s, but data mining is the evolution of a field with an extensive history. Early techniques for identifying patterns in data include Bayes' theorem (1700s) and the development of regression (1800s). The spread and growing power of computing have boosted data collection, storage, and manipulation as data sets have grown in size and complexity. Direct hands-on data investigation has progressively been augmented with indirect, automatic data processing and other computer science discoveries such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s), and support vector machines (1990s). The origins of data mining are traced back to three family lines: classical statistics, artificial intelligence, and machine learning.
  • 11. History of Data Mining
  • 12. Data Mining
  • 13. Evolution of Data Mining
  • 14. Applications of Data Mining (Application: Usage)
      Insurance: Data mining helps insurance companies price their products profitably and promote new offers to new or existing customers.
      Education: Data mining helps educators access student data, predict achievement levels and find students or groups of students who need extra attention, for example students who are weak in maths.
      Communications: Data mining techniques are used in the communications sector to predict customer behavior and offer highly targeted and relevant campaigns.
  • 15. Applications of Data Mining (Application: Usage)
      Banking: Data mining helps the finance sector get a view of market risks and manage regulatory compliance. It helps banks identify probable defaulters when deciding whether to issue credit cards, loans, etc.
      Retail: Data mining techniques help retail malls and grocery stores identify the best-selling items and arrange them in the most prominent positions. It helps store owners come up with offers that encourage customers to increase their spending.
      Crime Investigation: Data mining helps crime investigation agencies decide how to deploy the police workforce (where is a crime most likely to happen, and when?), whom to search at a border crossing, etc.
  • 16. Applications of Data Mining (Application: Usage)
      Bioinformatics: Data mining helps to mine biological data from massive datasets gathered in biology and medicine.
      Service Providers: Service providers such as mobile phone and utility companies use data mining to predict when and why a customer is likely to leave. They analyze billing details, customer service interactions and complaints made to the company to assign each customer a probability score, and then offer incentives.
      E-Commerce: E-commerce websites use data mining to offer cross-sells and up-sells through their websites. One of the most famous names is Amazon, which uses data mining techniques to bring more customers into its e-commerce store.
  • 17. Basic Data Mining Task: Data mining tasks can generally be classified into two types based on what a specific task tries to achieve: descriptive tasks and predictive tasks. These correspond to the two high-level primary goals of data mining, prediction and description. Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest. Descriptive tasks focus on finding human-interpretable patterns that describe the data.
  • 18. Basic Data Mining Task
  • 19. Basic Data Mining Task: a) Classification. Classification derives a model to determine the class of an object based on its attributes. A collection of records is available, each with a set of attributes, one of which is the class attribute. Classification can be used in direct marketing to reduce marketing costs by targeting the customers who are likely to buy a new product. Using the available data, it is possible to know which customers purchased similar products in the past and which did not. Hence the {purchase, don’t purchase} decision forms the class attribute in this case. Once the class attribute is assigned, demographic and lifestyle information about customers who purchased similar products can be collected, and promotional mail can be sent to them directly.
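The direct-marketing example above can be sketched with a very small classifier. This is only an illustration: the 1-nearest-neighbour technique, the attribute choice (age, income) and all data values are assumptions, not something the slides prescribe.

```python
import math

# Hypothetical training records: (age, annual income in thousands) -> class attribute
training = [
    ((25, 30), "don't purchase"),
    ((30, 35), "don't purchase"),
    ((45, 80), "purchase"),
    ((50, 90), "purchase"),
]

def classify(record, training_set):
    """Return the class label of the closest training record (1-nearest neighbour)."""
    nearest = min(training_set, key=lambda item: math.dist(record, item[0]))
    return nearest[1]

# Classify a new, unlabeled customer
prediction = classify((48, 85), training)
print(prediction)
```

Customers predicted as "purchase" would be the ones targeted by the promotion.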
  • 20. Basic Data Mining Task: b) Prediction. The prediction task predicts the possible values of missing or future data. It involves developing a model from the available data and then using that model to predict values for a new data set of interest. For example, a model can predict the income of an employee based on education, experience and other demographic factors such as place of residence, gender, etc. Prediction analysis is also used in other areas, including medical diagnosis and fraud detection.
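As a rough sketch of the income example, a straight line can be fitted to past data and then used on a new record. Least-squares linear regression is one possible model, chosen here for brevity; the figures for experience and income are invented.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience vs. income (in thousands)
experience = [1, 2, 3, 4, 5]
income = [30, 35, 40, 45, 50]

slope, intercept = fit_line(experience, income)
predicted_income = slope * 6 + intercept  # prediction for 6 years of experience
print(predicted_income)
```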
  • 21. Basic Data Mining Task: c) Time-Series Analysis. A time series is a sequence of events in which the next event is determined by one or more of the preceding events. A time series reflects the process being measured, and certain components affect the behavior of that process. Time-series analysis includes methods for analyzing time-series data in order to extract useful patterns, trends, rules and statistics. Stock market prediction is an important application of time-series analysis.
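One simple way to extract a trend from time-series data is a moving average. The sketch below assumes a hypothetical list of daily closing prices; it shows smoothing only and is not a stock prediction method.

```python
def moving_average(series, window):
    """Return the simple moving average of `series` over `window` consecutive points."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical daily closing prices
closing_prices = [10, 12, 14, 16, 18]
trend = moving_average(closing_prices, window=3)
print(trend)
```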
  • 22. Basic Data Mining Task: d) Association. Association discovers the connections among a set of items, that is, it identifies the relationships between objects. Association analysis is used for commodity management, advertising, catalog design, direct marketing, etc. A retailer can identify the products that customers normally purchase together, or find the customers who respond to promotions for the same kind of products. If a retailer finds that Surf and soap are mostly bought together, it can put soap on sale to promote the sale of Surf.
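How often two items are bought together is commonly quantified with support and confidence. The slides do not name these measures, so treat them, and the basket data below, as illustrative assumptions.

```python
# Hypothetical market-basket transactions
transactions = [
    {"detergent", "soap"},
    {"detergent", "soap", "bread"},
    {"bread", "milk"},
    {"detergent", "milk"},
    {"soap", "detergent", "milk"},
]

def count_containing(items, baskets):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)

def support(items, baskets):
    """Fraction of baskets containing all of `items`."""
    return count_containing(items, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    return (count_containing(antecedent | consequent, baskets)
            / count_containing(antecedent, baskets))

rule_support = support({"detergent", "soap"}, transactions)
rule_confidence = confidence({"detergent"}, {"soap"}, transactions)
print(rule_support, rule_confidence)
```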
  • 23. Basic Data Mining Task: e) Clustering. Clustering is used to identify data objects that are similar to one another. Similarity can be decided on a number of factors such as purchase behavior, responsiveness to certain actions, geographical location and so on. For example, an insurance company can cluster its customers based on age, residence, income, etc. This group information helps the company understand its customers better and hence provide better customized services.
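The insurance example can be sketched with k-means on a single attribute. K-means is only one of many clustering techniques, and the ages and starting centroids below are made up for illustration.

```python
def k_means_1d(values, centroids, max_iterations=10):
    """Cluster numbers around the given starting centroids using k-means."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: put each value in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return clusters, centroids

# Hypothetical customer ages and starting centroids
customer_ages = [22, 25, 27, 58, 60, 63]
clusters, centroids = k_means_1d(customer_ages, centroids=[22, 63])
print(clusters)
```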
  • 24. Basic Data Mining Task: f) Summarization. Summarization is the generalization of data. A set of relevant data is summarized, which results in a smaller set that gives aggregated information about the data. For example, the shopping done by a customer can be summarized into total products, total spending, offers used, etc. Such high-level summarized information can be useful to sales or customer relationship teams for detailed analysis of customer and purchase behavior. Data can be summarized at different abstraction levels and from different angles.
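The shopping example above amounts to a few aggregations over purchase records. A minimal sketch, assuming a hypothetical record layout with `item`, `price` and `offer_used` fields:

```python
# Hypothetical purchase records for one customer
purchases = [
    {"item": "soap", "price": 40, "offer_used": True},
    {"item": "detergent", "price": 120, "offer_used": False},
    {"item": "milk", "price": 60, "offer_used": True},
]

def summarize(records):
    """Collapse detailed purchase records into a small aggregated summary."""
    return {
        "total_products": len(records),
        "total_spending": sum(r["price"] for r in records),
        "offers_used": sum(1 for r in records if r["offer_used"]),
    }

summary = summarize(purchases)
print(summary)
```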
  • 25. Data Mining Architecture: Data mining refers to the detection and extraction of new patterns from already collected data. The data mining architecture has several elements: the Data Mining Engine, Pattern Evaluation, the Data Warehouse, the User Interface and the Knowledge Base. Each component has its own responsibilities and contributes to completing data mining efficiently. The modules need to interact correctly in order to produce a valuable result and complete the complex data mining procedure, providing the right set of information to the business.
  • 26. Data Mining Architecture
  • 27. Data Mining Architecture: 1. Data Sources. A wide variety of existing sources, such as data warehouses, databases and the World Wide Web (WWW), serve as the actual data sources. Often the data is not present in any of these golden sources but only in text files, plain files, sequence files or spreadsheets; such data then needs to be processed in much the same way as data received from the golden sources. A major share of data today comes from the internet, since everything on the internet is data in some form and so acts as an information repository.
  • 28. Data Mining Architecture: Before the data is processed further, it goes through cleansing, integration and selection, and is then passed on to the database or the enterprise data warehouse (EDW) server. The major challenge with this data is the variety of sources and the wide array of data formats involved. The data therefore cannot be used directly in its raw state; it must be processed and transformed into a more usable form. This also helps ensure the reliability and completeness of the data. So the primary step involves data collection, cleaning and integration, after which only the relevant data is passed forward. All this activity is handled by a separate set of tools and techniques.
  • 29. Data Mining Architecture: 2. Data Warehouse Server or Database. The database server is where the data is held once it has been received from the various data sources. The server contains the data that is ready to be processed, and it manages data retrieval. Retrieval is driven by the user's data mining request.
  • 30. Data Mining Architecture: 3. Data Mining Engine. The data mining engine is the core component of the data mining process. It is the driving force that handles and manages all requests, and it contains a number of modules. These modules cover mining tasks and techniques such as classification, association, regression, characterization, prediction, clustering, time-series analysis, naive Bayes, support vector machines, ensemble methods, boosting and bagging, random forests, decision trees, etc. In other words, the data mining engine is the root of the data mining architecture.
  • 31. Data Mining Architecture: 4. Pattern Evaluation Modules. These modules are responsible for finding interesting patterns in the data, and sometimes they also interact with the database servers to produce the result of a user request. The main purpose of this component is to search for the interesting and usable patterns, which improves the quality of the results. Pattern evaluation finds these patterns with the help of the data mining engine.
  • 32. Data Mining Architecture: 5. Graphical User Interface. The graphical user interface (GUI) module communicates between the data mining system and the user. It helps the user work with the system easily and efficiently without knowing the complexity of the process. When the user specifies a query or a task, this module cooperates with the data mining system and displays the results.
  • 33. Data Mining Architecture: 6. Knowledge Base. The knowledge base is helpful throughout the data mining process. It can be used to guide the search or to evaluate the interestingness of the resulting patterns. It may even contain user views and data from user experiences that are helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the result more accurate and reliable. The pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update it.
  • 34. Types of Data Mining Architecture: 1. No Coupling: The no-coupling data mining architecture retrieves data from particular data sources. It does not use the database for retrieving the data, even though that would be an efficient and accurate way to do so. The no-coupling architecture is poor and is used only for very simple data mining processes. 2. Loose Coupling: In a loose-coupling architecture, the data mining system retrieves data from the database and stores the data back in those systems. This architecture is intended for memory-based data mining.
  • 35. Types of Data Mining Architecture: 3. Semi-Tight Coupling: This architecture uses several advantageous features of data warehouse systems, including sorting, indexing and aggregation. An intermediate result can be stored in the database for better performance. 4. Tight Coupling: In this architecture, the data warehouse is considered one of the most important components, and its features are employed for performing data mining tasks. This architecture provides scalability, performance and integrated information.
  • 36. Advantages of Data Mining: Assists in preventing future adversities by accurately predicting future trends. Contributes to the making of important decisions. Compresses data into valuable information. Reveals new trends and unexpected patterns. Helps to analyze huge data sets. Helps companies find, attract and retain customers. Helps a company improve its relationship with its customers. Assists companies in optimizing production according to the popularity of a given product, thus saving costs.
  • 37. Disadvantages of Data Mining: The intensity of the work requires high-performance teams and staff training. The need for large investments can also be a problem, since data collection sometimes consumes many resources and so entails a high cost. Lack of security can put the data at great risk, as the data may contain private customer details. Inaccurate data may lead to wrong output. Huge databases are quite difficult to manage.
  • 38. DIFFERENT TYPES OF KNOWLEDGE: Knowledge is a collection of interesting and useful patterns in a database. The key issue in Knowledge Discovery in Databases is to realize that there is more information hidden in your data than you are able to distinguish at first sight. In data mining we distinguish four different types of knowledge. Shallow Knowledge: This is information that can be easily retrieved from a database using a query tool such as Structured Query Language (SQL). Multi-Dimensional Knowledge: This is information that can be analyzed using online analytical processing (OLAP) tools. With OLAP tools you have the ability to rapidly explore all sorts of clusterings and different orderings of the data, but it is important to realize that most of the things you can do with an OLAP tool can also be done using SQL.
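The distinction between shallow and multi-dimensional knowledge can be illustrated with two SQL queries run through Python's built-in `sqlite3` module. The `sales` table and its rows are hypothetical; the second query shows the kind of roll-up an OLAP tool would present, written in plain SQL.

```python
import sqlite3

# Build a small in-memory database with a hypothetical sales table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "soap", 100), ("north", "detergent", 200), ("south", "soap", 150)],
)

# Shallow knowledge: a direct lookup with a simple query
north_rows = conn.execute(
    "SELECT product, amount FROM sales WHERE region = ? ORDER BY product",
    ("north",),
).fetchall()

# Multi-dimensional style roll-up: total sales per region, still plain SQL
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

conn.close()
print(north_rows)
print(totals)
```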
  • 39. DIFFERENT TYPES OF KNOWLEDGE: The advantage of OLAP tools is that they are optimized for this kind of search and analysis operation. However, OLAP is not as powerful as data mining; it cannot search for optimal solutions. Hidden Knowledge: This is information that can be found relatively easily by using pattern recognition or machine learning algorithms, that is, information that can be obtained through data mining techniques. Again, one could use SQL to find these patterns, but this would probably prove extremely time-consuming. A pattern recognition algorithm could find regularities in a database in minutes or at most a couple of hours, whereas you would have to spend months using SQL to achieve the same result.
  • 40. DIFFERENT TYPES OF KNOWLEDGE: Deep Knowledge: This is information that is stored in the database but can only be located if we have a clue that tells us where to look. Different Types of Knowledge and Techniques:
  • 41. Knowledge Discovery Process (KDP): Data mining is the core part of the knowledge discovery process. KDP is the process of finding knowledge in data; it uses data mining methods (algorithms) to extract useful knowledge from large amounts of data. Data mining is also known as Knowledge Discovery in Databases (KDD). The steps involved in the knowledge discovery process are: 1.) Data Cleaning: the removal of noisy and irrelevant data from the collection. It covers cleaning in the case of missing values; cleaning noisy data, where noise is a random error or variance; and cleaning with data discrepancy detection and data transformation tools. A parser decides whether a given string of data is acceptable within the data specification.
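Cleaning in the case of missing values can be as simple as filling each gap with the mean of the known values. Mean imputation is just one common choice, assumed here for illustration, and the attribute values are invented.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate the mean from
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

# Hypothetical attribute column with two missing entries
ages = [10, None, 30, None, 20]
cleaned = fill_missing_with_mean(ages)
print(cleaned)
```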
  • 42. Knowledge Discovery Process (KDP) Mr. Sagar Pandya sagar.pandya@medicaps.ac.in 2.) Data Integration: Data integration is defined as heterogeneous data from multiple sources combined in a common source(DataWarehouse).Data integration using Data Migration tools.  Data integration using Data Synchronization tools.  Data integration using ETL(Extract-Load-Transformation) process. 3.) Data Selection: Data selection is defined as the process where data relevant to the analysis is decided and retrieved from the data collection.  Data selection using Decision Trees.  Data selection using Naive bayes.  Data selection using Neural network.  Data selection using Clustering, Regression, etc.
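A toy version of the integration step: records from two sources with different field names are mapped onto one common schema. The source layouts, field names and the in-memory "warehouse" list are all hypothetical stand-ins for real systems.

```python
# Hypothetical source 1: a CRM export keyed by "cust_id"
crm_records = [
    {"cust_id": 1, "full_name": "Asha Rao"},
    {"cust_id": 2, "full_name": "Ravi Jain"},
]
# Hypothetical source 2: a billing system keyed by "customer", amounts stored as text
billing_records = [
    {"customer": 1, "amount": "250"},
    {"customer": 1, "amount": "100"},
    {"customer": 2, "amount": "75"},
]

def integrate(crm, billing):
    """Extract both sources, transform them to one schema, and load the merged rows."""
    merged = {}
    for row in crm:
        merged[row["cust_id"]] = {
            "customer_id": row["cust_id"],
            "name": row["full_name"],
            "total_billed": 0,
        }
    for row in billing:
        target = merged.get(row["customer"])
        if target is not None:  # skip billing rows with no matching customer
            target["total_billed"] += int(row["amount"])  # text -> number
    return list(merged.values())

warehouse = integrate(crm_records, billing_records)
print(warehouse)
```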
  • 43. Knowledge Discovery Process (KDP)
  • 44. Knowledge Discovery Process (KDP) 4.) Data Transformation:  Data transformation is defined as the process of transforming data into the appropriate form required by the mining procedure.  Data transformation is a two-step process:  Data mapping: Assigning elements from the source base to the destination to capture transformations.  Code generation: Creation of the actual transformation program. 5.) Data Mining: Data mining is defined as the application of clever techniques to extract potentially useful patterns.  Transforms task-relevant data into patterns.  Decides the purpose of the model using classification or characterization.
  • 45. Knowledge Discovery Process (KDP) 6.) Pattern Evaluation: Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge, based on given measures.  Finds the interestingness score of each pattern.  Uses summarization and visualization to make the data understandable to the user. 7.) Knowledge Representation: Knowledge representation is defined as a technique which uses visualization tools to represent data mining results.  Generate reports.  Generate tables.  Generate discriminant rules, classification rules, characterization rules, etc.
  • 46. Knowledge Discovery Process (KDP)
  • 47. Data Mining Issues  Nowadays data mining and knowledge discovery are evolving into a crucial technology for businesses and researchers in many domains.  Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place.  It needs to be integrated from various heterogeneous data sources. These factors also create some issues.  Here, we will discuss the major issues regarding: 1. Mining Methodology and User Interaction 2. Performance Issues 3. Diverse Data Types Issues
  • 48. Data Mining Issues
  • 49. Mining Methodology and User Interaction Issues  Mining different kinds of knowledge in databases:  Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.  Interactive mining of knowledge at multiple levels of abstraction:  The data mining process needs to be interactive because interaction allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
  • 50. Mining Methodology and User Interaction Issues  Incorporation of background knowledge:  Background knowledge can be used to guide the discovery process and to express the discovered patterns.  It allows the discovered patterns to be expressed not only in concise terms but also at multiple levels of abstraction.  Data mining query languages and ad hoc data mining:  A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
  • 51. Mining Methodology and User Interaction Issues  Presentation and visualization of data mining results:  Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.  Handling noisy or incomplete data:  Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without data cleaning methods, the accuracy of the discovered patterns will be poor.  Pattern evaluation:  The patterns discovered should be interesting. Many discovered patterns are uninteresting because they either represent common knowledge or lack novelty.
  • 52. Performance Issues  Efficiency and scalability of data mining algorithms: In order to effectively extract information from huge amounts of data in databases, data mining algorithms must be efficient and scalable.  Parallel, distributed, and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms.  These algorithms divide the data into partitions, which are then processed in parallel. The results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the data again from scratch.
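The partition, process-in-parallel, merge pattern and the incremental update can be sketched as follows. This is an illustrative sketch, not a production design: the mining task is reduced to counting how many transactions contain each item, threads stand in for whatever parallel or distributed workers a real system would use, and all function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Count, for one partition, how many transactions contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))  # count each item once per transaction
    return counts

def mine_partitions(partitions):
    """Process each partition in parallel, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(count_items, partitions))
    total = Counter()
    for partial in partials:
        total += partial
    return total

def incremental_update(total, new_transactions):
    """Fold new transactions into existing counts without re-mining old data."""
    return total + count_items(new_transactions)

partitions = [
    [["bread", "milk"], ["bread"]],
    [["milk", "eggs"], ["bread", "eggs"]],
]
counts = mine_partitions(partitions)
counts = incremental_update(counts, [["milk"]])
```

Counting merges cleanly because partial counts simply add up; many mining algorithms need a more careful merge step, which is part of what makes parallel mining hard.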
  • 53. Diverse Data Types Issues  Handling of relational and complex types of data:  The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.  Mining information from heterogeneous databases and global information systems:  The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured or unstructured. Therefore mining knowledge from them adds challenges to data mining.
  • 54. Data Mining Challenges
  • 55. Data Mining Challenges  Although data mining is very powerful, it faces many challenges during its execution. These challenges may relate to performance, data, methods, techniques, and so on. The process of data mining becomes effective when the challenges or problems are correctly recognized and adequately resolved.  Incomplete and noisy data  Data Distribution  Complex Data  Performance  Data Privacy and Security  Data Visualization
  • 56. Data Mining Challenges 1.) Incomplete and noisy data:  Data mining is the process of extracting useful data from large volumes of data.  Data in the real world is heterogeneous, incomplete, and noisy.  Data in huge quantities will usually contain inaccurate or unreliable values. These problems may occur because of faulty measuring instruments or human error.  Suppose a retail chain collects the phone numbers of customers who spend more than $500, and the accounting employees enter the information into their system.  An employee may mistype a digit when entering a phone number, which results in incorrect data.
  • 57. Data Mining Challenges  Some customers may not be willing to disclose their phone numbers, which results in incomplete data.  The data could also get changed due to human or system error.  All these problems (noisy and incomplete data) make data mining challenging. 2.) Data Distribution:  Real-world data is usually stored on various platforms in a distributed computing environment.  It might be in a database, on individual systems, or even on the internet.  In practice, it is quite a tough task to bring all the data into a centralized data repository, mainly due to organizational and technical concerns.
  • 58. Data Mining Challenges  For example, various regional offices may have their own servers to store their data. It may not be feasible to store all the data from all the offices on a central server. Therefore, data mining requires the development of tools and algorithms that allow the mining of distributed data. 3.) Complex Data:  Real-world data is heterogeneous, and it could be multimedia data (including audio, video, and images), spatial data, time series, and so on.  Managing these various types of data and extracting useful information from them is a tough task.  Most of the time, new technologies, tools, and methodologies have to be developed or refined to obtain specific information.
  • 59. Data Mining Challenges 4.) Performance:  The data mining system's performance relies primarily on the efficiency of the algorithms and techniques used.  If the designed algorithms and techniques are not up to the mark, the efficiency of the data mining process will be adversely affected. 5.) Data Privacy and Security:  Data mining can lead to serious problems in terms of data security, governance, and privacy.  For example, if a retailer analyzes the details of purchased items, it reveals data about the buying habits and preferences of customers without their permission.
  • 60. Data Mining Challenges 6.) Data Visualization:  In data mining, data visualization is a very important process because it is the primary method of showing the output to the user in a presentable way.  The extracted data should convey the exact meaning of what it intends to express.  But many times, presenting the information to the end user in a precise and easy way is difficult.  Because both the input data and the output information are complicated, efficient and effective data visualization processes need to be implemented.
  • 61. Data Mining vs Data Warehousing
    S.no. | Data Mining | Data Warehousing
    1 | Data mining is a method of comparing large amounts of data to find the right patterns. | Data warehousing is a method of centralizing data from different sources into one common repository.
    2 | Data mining is the process of determining data patterns. | A data warehouse is a database system designed for analytics.
    3 | In data mining, data is analyzed repeatedly. | In data warehousing, data is stored periodically.
    4 | Data mining uses pattern recognition techniques to identify patterns. | Data warehousing is the process of extracting and storing data to allow easier reporting.
  • 62. Data Mining vs Data Warehousing
    S.no. | Data Mining | Data Warehousing
    5 | Data mining helps to create suggestive patterns of important factors, such as the buying habits of customers and products. | A data warehouse adds extra value to operational business systems like CRM systems when the warehouse is integrated.
    6 | After successful initial queries, users may ask more complicated queries, which would increase the workload. | A data warehouse is complicated to implement and maintain.
    7 | Data mining techniques are never 100% accurate and may cause serious consequences in certain conditions. | In a data warehouse, there is a real chance that data required for analysis by the organization has not been integrated into the warehouse. This can easily lead to loss of information.
  • 63. Alternative names for Data Mining: 1. Knowledge discovery (mining) in databases (KDD) 2. Knowledge extraction 3. Data/pattern analysis 4. Data archeology 5. Data dredging 6. Information harvesting 7. Business intelligence
  • 64. Data Mining Implementation Process  Many different sectors are taking advantage of data mining to boost their business efficiency, including manufacturing, chemical, marketing, aerospace, etc.  Therefore, the need for a standard data mining process has grown considerably.  Data mining is described as a process of finding hidden, valuable information by evaluating the huge quantity of data stored in data warehouses, using multiple data mining techniques drawn from Artificial Intelligence (AI), machine learning and statistics.
  • 65. The Cross-Industry Standard Process for Data Mining (CRISP-DM)  The Cross-Industry Standard Process for Data Mining (CRISP-DM) comprises six phases designed as a cyclical method, as shown in the given figure:
  • 66. Data Mining Implementation Process
  • 67. Data Mining Implementation Process 1. Business understanding:  It focuses on understanding the project goals and requirements from a business point of view, then converting this information into a data mining problem definition, and afterward designing a preliminary plan to accomplish the target.  Tasks: • Determine business objectives • Assess situation • Determine data mining goals • Produce a project plan
  • 68. Data Mining Implementation Process  First, you need to understand the business and client objectives.  You need to define what your client wants (which many times even they do not know themselves).  Take stock of the current data mining scenario.  Factor resources, assumptions, constraints, and other significant factors into your assessment.  Using the business objectives and the current scenario, define your data mining goals.  A good data mining plan is very detailed and should be developed to accomplish both business and data mining goals.
  • 69. Data Mining Implementation Process  Determine business objectives: • Understand the project targets and prerequisites from a business point of view. • Carefully understand what the customer wants to achieve. • Uncover, at the start, significant factors that can impact the result of the project.  Assess situation: • It requires a more detailed analysis of facts about all the resources, constraints, assumptions, and other factors that ought to be considered.
  • 70. Data Mining Implementation Process  Determine data mining goals: • A business goal states the target in business terminology. For example, increase catalog sales to existing customers. • A data mining goal describes the project objectives in technical terms. For example, predict how many items a customer will buy, given their purchases over the past three years, their demographic details (age, salary, and city) and the price of the item.  Produce a project plan: • It states the intended plan to accomplish the business and data mining goals. • The project plan should define the expected set of steps to be performed during the rest of the project, including an initial selection of techniques and tools.
  • 71. Data Mining Implementation Process 2. Data Understanding:  Data understanding starts with an initial data collection and proceeds with operations to get familiar with the data, to identify data quality issues, to gain first insights into the data, or to detect interesting subsets for forming hypotheses about hidden information.  Tasks: • Collect initial data • Describe data • Explore data • Verify data quality
  • 72. Data Mining Implementation Process  First, data is collected from the multiple data sources available in the organization.  These data sources may include multiple databases, flat files or data cubes.  Issues like object matching and schema integration can arise during the data integration process.  It is quite a complex and tricky process, as data from various sources is unlikely to match easily.  For example, table A contains an entity named cust_no whereas another table B contains an entity named cust-id.
  • 73. Data Mining Implementation Process  Therefore, it is quite difficult to determine whether both of these objects refer to the same value or not.  Here, metadata should be used to reduce errors in the data integration process.  The next step is to search for properties of the acquired data.  A good way to explore the data is to answer the data mining questions (decided in the business phase) using query, reporting, and visualization tools.  Based on the results of the queries, the data quality should be ascertained. Missing data, if any, should be acquired.
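The cust_no / cust-id mismatch can be handled with a metadata-driven column mapping. The sketch below is a minimal illustration under assumptions of my own: the metadata is a plain dictionary, the shared name `customer_id` is invented for the example, and rows are joined in memory on that key.

```python
# Hypothetical metadata: maps each source's column name to one shared name.
COLUMN_METADATA = {
    "table_a": {"cust_no": "customer_id"},
    "table_b": {"cust-id": "customer_id"},
}

def to_common_schema(source, row):
    """Rename a row's columns according to the metadata for its source."""
    mapping = COLUMN_METADATA[source]
    return {mapping.get(col, col): val for col, val in row.items()}

def integrate(rows_a, rows_b):
    """Merge rows from both tables that share the same customer_id."""
    merged = {}
    for source, rows in (("table_a", rows_a), ("table_b", rows_b)):
        for row in rows:
            common = to_common_schema(source, row)
            merged.setdefault(common["customer_id"], {}).update(common)
    return merged

result = integrate(
    [{"cust_no": 7, "name": "Asha"}],
    [{"cust-id": 7, "city": "Indore"}],
)
```

The mapping only resolves naming differences; deciding that two differently named columns really hold the same values still requires someone to check the metadata.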
  • 74. Data Mining Implementation Process  Collect initial data: • It acquires the data listed in the project resources. • It includes data loading if needed for data understanding. • It may lead to initial data preparation steps. • If multiple data sources are acquired, then integration is an additional issue, either here or at the subsequent data preparation stage.  Describe data: • It examines the "gross" or "surface" characteristics of the data obtained. • It reports on the outcomes.  Verify data quality: • It examines the quality of the data and addresses questions about it, for example whether the data is complete.
  • 75. Data Mining Implementation Process  Explore data: • It addresses data mining questions that can be resolved by querying, visualizing, and reporting, including: • Distribution of important attributes, results of simple aggregations. • Relationships between pairs or small numbers of attributes. • Characteristics of important sub-populations, simple statistical analysis. • It may refine the data mining objectives. • It may contribute to or refine the data description and quality reports. • It may feed into the transformation and other necessary data preparation steps.
  • 76. Data Mining Implementation Process 3. Data Preparation: • It is usually the most time-consuming phase; figures as high as 90 percent of the project time are sometimes quoted. • It covers all operations to build the final data set from the original raw data. • Data preparation is likely to be done several times and not in any prescribed order.  Tasks: • Select data • Clean data • Construct data • Integrate data • Format data
  • 77. Data Mining Implementation Process  Data transformation operations contribute toward the success of the mining process.  Smoothing: It helps to remove noise from the data.  Aggregation: Summary or aggregation operations are applied to the data. For example, weekly sales data is aggregated to calculate the monthly and yearly totals.  Generalization: In this step, low-level data is replaced by higher-level concepts with the help of concept hierarchies. For example, the city is replaced by the county.  Normalization: Normalization is performed when the attribute data are scaled up or scaled down. Example: data should fall in the range -2.0 to 2.0 after normalization.
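Two of these transformations are easy to show in code. The sketch below assumes min-max scaling as the normalization method (the slides do not say which method is meant) and uses the -2.0 to 2.0 range from the example; the aggregation helper sums weekly figures into monthly totals. Function names and sample values are illustrative only.

```python
def min_max_normalize(values, new_min=-2.0, new_max=2.0):
    """Linearly rescale values so they fall in [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are equal: map them to the middle of the target range.
        return [(new_min + new_max) / 2 for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

def aggregate_by_month(weekly_sales):
    """Sum (month, amount) pairs into a total per month."""
    totals = {}
    for month, amount in weekly_sales:
        totals[month] = totals.get(month, 0) + amount
    return totals

normalized = min_max_normalize([10, 20, 30])
monthly = aggregate_by_month([("Jan", 100), ("Jan", 150), ("Feb", 80)])
```

Here the smallest input maps to -2.0, the largest to 2.0, and the two January weeks add up to a single monthly total.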
  • 78. Data Mining Implementation Process  Select data: • It decides which data is to be used for analysis. • The data selection criteria include relevance to the data mining objectives, quality, and technical limitations such as limits on data volume or data types. • It covers the selection of attributes (columns) and the choice of records (rows) in a table.  Clean data: • It may involve the selection of clean subsets of data, inserting appropriate defaults, or more ambitious methods such as estimating missing data by modeling.
  • 79. Data Mining Implementation Process  Construct data: • It comprises constructive data preparation operations, such as generating derived attributes, complete new records, or transformed values of existing attributes.  Integrate data: • Integrating data refers to the methods whereby data is combined from various tables or records to create new records or values.  Format data: • Formatting data refers mainly to syntactic changes made to the data that do not alter its meaning but may be required by the modeling tool.
  • 80. Data Mining Implementation Process 4. Modeling:  In modeling, various modeling methods are selected and applied, and their parameters are calibrated to optimum values.  Some methods have particular requirements on the form of the data.  Therefore, stepping back to the data preparation phase is often necessary.  Tasks: • Select modeling technique • Generate test design • Build model • Assess model
  • 81. Data Mining Implementation Process  Select modeling technique: • It selects the actual modeling method that is to be used. For example, decision tree, neural network. • If various methods are applied, then this task is performed individually for each method.  Generate test design: • Generate a procedure or mechanism for testing the validity and quality of the model before constructing it. • For example, in classification, error rates are commonly used as quality measures for data mining models. Therefore, the data set is typically separated into a train set and a test set; the model is built on the train set and its quality is assessed on the separate test set.
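The train/test design can be sketched in a few lines. Everything specific here is an assumption made for illustration: a 70/30 split, a fixed random seed, and a deliberately trivial model (a threshold halfway between the two class means) standing in for a real classifier such as a decision tree.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """Toy model: a threshold halfway between the means of the two classes."""
    pos = [x for x, label in train if label == 1]
    neg = [x for x, label in train if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def error_rate(threshold, test):
    """Fraction of test rows whose predicted class differs from the label."""
    wrong = sum(1 for x, label in test if (x > threshold) != (label == 1))
    return wrong / len(test)

# Two well-separated classes: values 0..9 labelled 0, values 20..29 labelled 1.
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(20, 30)]
train, test = train_test_split(data)
threshold = fit_threshold(train)
err = error_rate(threshold, test)
```

The key point is that `error_rate` only ever sees rows that `fit_threshold` did not, so the measured error reflects behaviour on unseen data.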
  • 82. Data Mining Implementation Process  Build model: • To create one or more models, we need to run the modeling tool on the prepared data set.  Assess model: • The models are interpreted according to domain knowledge, the data mining success criteria, and the desired test design. • It assesses the success of the application of modeling and discovery techniques from a technical point of view. • Business analysts and domain specialists are contacted later to discuss the outcomes of data mining in the business context.
  • 83. Data Mining Implementation Process 5. Evaluation: • It evaluates the model thoroughly and reviews the steps executed to build the model, to ensure that the business objectives are properly achieved. • The main objective of the evaluation is to determine whether there is some significant business issue that has not been considered adequately. • At the end of this phase, a decision on the use of the data mining results should be reached.  Tasks: • Evaluate results • Review process • Determine next steps
  • 84. Data Mining Implementation Process  Evaluate results: • It assesses the degree to which the model meets the organization's business objectives. • It tests the model on test applications in the actual implementation, when time and budget limitations permit, and also assesses other data mining results produced. • It unveils additional challenges, suggestions, or information for future directions.  Review process: • The review process does a more detailed evaluation of the data mining engagement to determine whether there is a significant factor or task that has somehow been overlooked. It also reviews quality assurance issues.
  • 85. Data Mining Implementation Process  Determine next steps: • It decides how to proceed at this stage. • It decides whether to complete the project and move on to deployment, or whether to initiate further iterations or set up new data mining projects. It includes an analysis of the remaining resources and budget, which influence the decision. • A go or no-go decision is taken on moving the model to the deployment phase.
  • 86. Data Mining Implementation Process 6. Deployment:  Determine: • Deployment refers to how the outcomes need to be utilized.  Deploy data mining results by: • Scoring a database, utilizing results as company guidelines, or interactive internet scoring. • The knowledge acquired will need to be organized and presented in a way that can be used by the client. Depending on the requirements, the deployment phase may be as simple as generating a report or as complicated as applying a repeatable data mining process across the organization.
  • 87. Data Mining Implementation Process  A final project report is created with lessons learned and key experiences during the project. This helps to improve the organization's business policy.  Tasks: • Plan deployment • Plan monitoring and maintenance • Produce final report • Review project  Plan deployment: • To deploy the data mining outcomes into the business, it takes the evaluation results and decides on a strategy for deployment. • It includes documentation of the process for later deployment.
  • 88. Data Mining Implementation Process  Plan monitoring and maintenance: • It is important when the data mining results become part of the day-to-day business and its environment. • It helps to avoid unnecessarily long periods of incorrect use of data mining results. It needs a detailed analysis of the monitoring process.  Produce final report: • A final report can be drawn up by the project leader and the team. • It may only be a summary of the project and its experiences. • It may be a final and comprehensive presentation of the data mining results.  Review project: • The project review evaluates what went right and what went wrong, what was done well, and what needs to be improved.
  • 89. Data Mining Techniques  One of the most important tasks in data mining is to select the correct data mining technique.  The data mining technique has to be chosen based on the type of business and the type of problem your business faces.  A generalized approach has to be used to improve the accuracy and cost-effectiveness of using data mining techniques.  There are basically seven main data mining techniques, which are discussed here.  There are also many other data mining techniques, but these seven are considered the ones most frequently used by business people. • Statistics, Clustering, Visualization, Decision Tree, Association Rules, Neural Networks, Classification.
  • 90. Data Mining Techniques
  • 91. Data Mining Techniques 1. Classification:  This technique is used to obtain important and relevant information about data and metadata.  This data mining technique helps to classify data into different classes.  Classification is the most commonly used data mining technique; it uses a set of pre-classified samples to create a model that can classify a large set of data.  There are two main processes involved in this technique: • Learning: In this process the training data are analyzed by the classification algorithm. • Classification: In this process, test data are used to measure the precision of the classification rules.
  • 92. Data Mining Techniques  Data mining techniques can be classified by different criteria, as follows: i. Classification of data mining frameworks as per the type of data sources mined: This classification is as per the type of data handled. For example, multimedia, spatial data, text data, time-series data, World Wide Web, and so on. ii. Classification of data mining frameworks as per the database involved: This classification is based on the data model involved. For example, object-oriented database, transactional database, relational database, and so on.
  • 93. Data Mining Techniques iii. Classification of data mining frameworks as per the kind of knowledge discovered: This classification depends on the types of knowledge discovered or the data mining functionalities. For example, discrimination, classification, clustering, characterization, etc. Some frameworks are comprehensive and offer several data mining functionalities together. iv. Classification of data mining frameworks according to the data mining techniques used: This classification is as per the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented, etc.
  • 94. Data Mining Techniques  There are different types of classification models. They are as follows: • Classification by decision tree induction • Bayesian Classification • Neural Networks • Support Vector Machines (SVM) • Classification Based on Associations
  • 95. Data Mining Techniques • Learning step (training phase): In this step, a classification algorithm builds the classifier by analyzing a training set. • Classification step: Test data are used to estimate the accuracy or precision of the classification rules.  For example, a banking company uses classification to identify loan applicants as low, medium or high credit risks. Similarly, a medical researcher analyzes cancer data to predict which medicine to prescribe to the patient.
  • 96. Data Mining Techniques • This step is the learning step or the learning phase. • In this step the classification algorithms build the classifier. • The classifier is built from the training set made up of database tuples and their associated class labels.
  • 97. Data Mining Techniques  In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. The classification rules can be applied to new data tuples if the accuracy is considered acceptable.
  • 98. Data Mining Techniques 2. Clustering Technique  Clustering is one of the oldest techniques used in data mining.  Clustering analysis is the process of identifying data that are similar to each other.  This helps to understand the differences and similarities between the data.  This is sometimes called segmentation and helps users to understand what is going on within the database.  For example, an insurance company can group its customers based on their income, age, nature of policy and type of claims.  Clustering is very similar to classification, but it groups chunks of data together based on their similarities rather than on predefined class labels.
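The insurance example can be sketched with k-means, one common partitioning method. The slides do not prescribe an algorithm, so k-means is my choice for illustration; the customer data (age, income in thousands) and the starting centroids are invented, and the initial centroids are passed in explicitly so the result is reproducible.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers as (age, income in thousands).
customers = [(25, 30), (27, 32), (30, 35), (55, 90), (60, 95), (58, 88)]
centroids, clusters = k_means(customers, [(25, 30), (60, 95)])
```

No class labels are supplied: the two groups (younger, lower income and older, higher income) emerge from the distances alone, which is the difference from classification noted above.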
  • 99. Data Mining Techniques  There are different types of clustering methods. They are as follows: • Partitioning Methods • Hierarchical Agglomerative Methods • Density-Based Methods • Grid-Based Methods • Model-Based Methods  Nearest Neighbour is a closely related similarity-based technique (strictly a classification and retrieval method rather than a clustering algorithm). In business, the Nearest Neighbour technique is most often used in text retrieval.  It is used to find the documents that share important characteristics with a main document that has been marked as interesting.
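Nearest-neighbour text retrieval can be sketched as ranking documents by similarity to the one marked as interesting. The similarity measure below (Jaccard overlap of lower-cased words) is an assumption chosen for brevity; real retrieval systems typically use weighted term vectors. Function names and documents are made up for the example.

```python
def jaccard(a, b):
    """Word-set overlap between two texts: |intersection| / |union|."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def nearest_documents(query, documents, k=2):
    """Return the k documents most similar to the query text."""
    ranked = sorted(documents, key=lambda d: jaccard(query, d), reverse=True)
    return ranked[:k]

docs = [
    "data mining finds patterns",
    "warehouse stores data",
    "cooking pasta at home",
]
top = nearest_documents("mining patterns in data", docs, k=1)
```

The first document shares three of its four words with the query, so it ranks highest.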
  • 100. Data Mining Techniques  A similar example of loan applicants can be considered here also. There are some differences that are depicted in the figure below.
  • 101. Data Mining Techniques 3. Regression:  Regression analysis is the data mining method of identifying and analyzing the relationship between variables.  It is used to estimate the likely value of a specific variable, given the values of other variables.  Regression is primarily a form of planning and modeling.
  • 102. Data Mining Techniques  For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition.  Primarily, it gives an estimate of the relationship between two or more variables in the given data set.  A commonly cited example of regression analysis is the use of this data mining technique in matching people on dating portals.  Many websites use variables to match people according to their likes, interests, and hobbies.
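The cost projection example can be sketched with simple linear regression, fitted by ordinary least squares. The data below is invented so that cost depends on a single factor (demand); a real projection would involve several factors and noisy data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: cost rises with consumer demand.
demand = [10, 20, 30, 40]
cost = [25, 45, 65, 85]
slope, intercept = fit_line(demand, cost)
predicted_cost = slope * 50 + intercept  # projected cost at demand = 50
```

With this perfectly linear data the fit recovers a slope of 2 and an intercept of 5, so the projected cost at a demand of 50 is 105.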
  • 103. Data Mining Techniques 4. Association Rule Technique  This technique helps to find the association between two or more items.  It helps to discover the relations between different variables in databases.  It discovers hidden patterns in the data sets, which are used to identify the variables and the combinations of variables that occur together most frequently.  There are three types of association rule. They are: 1. Multilevel Association Rule 2. Multidimensional Association Rule 3. Quantitative Association Rule
  • 104. Data Mining Techniques Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  This technique is most often used in the retail industry to find patterns in sales. This will help increase the conversion rate and thus increases profit.  Association rules are if-then statements that support to show the probability of interactions between data items within large data sets in different types of databases.  Association rule mining has several applications and is commonly used to help sales correlations in data or medical data sets.  The way the algorithm works is that you have various data, For example, a list of grocery items that you have been buying for the last six months.  It calculates a percentage of items being purchased together.
  • 105. Data Mining Techniques  There are three major measures: • Support: measures how often items A and B are purchased together, relative to the overall dataset. (Transactions containing both Item A and Item B) / (Entire dataset) • Confidence: measures how often item B is purchased when item A is purchased as well. (Transactions containing both Item A and Item B) / (Transactions containing Item A) • Lift: measures how much more often item B is purchased with item A than item B is purchased overall. (Confidence) / ((Transactions containing Item B) / (Entire dataset))
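The three measures can be sketched in a few lines of Python. This is a minimal illustration only: the `transactions` list and the function names are assumptions made for this example and do not come from the slides.

```python
# Hypothetical transactions; each set is one shopping basket.
transactions = [
    {"beer", "chips"},
    {"beer", "bread"},
    {"chips", "bread"},
    {"beer", "chips", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(a, b, baskets):
    """Of the baskets containing a, the fraction that also contain b."""
    return support(a | b, baskets) / support(a, baskets)


def lift(a, b, baskets):
    """Confidence of a -> b divided by the overall frequency of b."""
    return confidence(a, b, baskets) / support(b, baskets)


beer, chips = {"beer"}, {"chips"}
print(support(beer | chips, transactions))
print(confidence(beer, chips, transactions))
print(lift(beer, chips, transactions))
```

A lift above 1 suggests that the two items are bought together more often than their individual frequencies alone would predict; a lift below 1 suggests the opposite.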
  • 106. Data Mining Techniques  Suppose the marketing manager of a supermarket wants to determine which products are frequently purchased together.  As an example,  buys(x, “beer”) -> buys(x, “chips”) [support = 1%, confidence = 50%] • Here x represents a customer buying beer and chips together. • Confidence means that if a customer buys beer, there is a 50% chance that he/she will buy chips as well. • Support means that 1% of all the transactions under analysis showed that beer and chips were bought together. • Many similar examples, like bread and butter or computer and software, can be considered.
  • 107. Data Mining Techniques  Based on the number of dimensions (predicates) involved, association rules are of two types: • Single-dimensional association rule: These rules contain a single predicate (for example, buys) that occurs more than once. • Multidimensional association rule: These rules involve two or more predicates (for example, age, income, and buys).
  • 108. Data Mining Techniques 5. Outlier detection:  This type of data mining technique relates to the observation of data items in the data set which do not match an expected pattern or expected behavior.  This technique may be used in various domains like intrusion detection, fraud detection, etc.  It is also known as Outlier Analysis or Outlier Mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field.  Outlier detection is valuable in numerous fields like network intrusion detection, credit or debit card fraud detection, detecting outlying values in wireless sensor network data, etc.
  • 109. Data Mining Techniques  For example, let’s assume the graph below is plotted using some data sets in our database.  A best-fit line is drawn. The points lying near the line show expected behavior, while a point far from the line is an outlier.  This helps to detect anomalies and take appropriate action.
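The best-fit-line idea above can be sketched as follows. This is only one possible approach under stated assumptions: the line is fitted by ordinary least squares, a point counts as an outlier when its distance from the line exceeds two standard deviations of all residuals, and the data values, the threshold, and the function names are hypothetical.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def find_outliers(xs, ys, threshold=2.0):
    """Return the points whose residual exceeds threshold * residual spread."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    return [
        (x, y)
        for x, y, r in zip(xs, ys, residuals)
        if abs(r) > threshold * spread
    ]


# Hypothetical data: y is roughly 2x, except for one anomalous point.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 30, 12, 14, 16]
outliers = find_outliers(xs, ys)
print(outliers)
```

Note that the outlier itself pulls the fitted line toward it, so in practice more robust methods are often preferred; the sketch only illustrates the "far from the line" idea.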
  • 110. Data Mining Techniques 6. Sequential Patterns:  Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns.  It consists of finding interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of different criteria like length, occurrence frequency, etc.  In other words, this technique helps to discover or recognize similar patterns or trends in transaction data over a certain period.
  • 111. Data Mining Techniques  This method is used to identify patterns that occur frequently over a certain period of time.  For example, the sales manager of a clothing company sees that sales of jackets seem to increase just before the winter season, or a bakery sees sales increase around Christmas or New Year’s Eve.
  • 112. Data Mining Techniques 7. Prediction:  Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc.  It analyzes past events or instances in the right sequence to predict a future event.  Prediction is one of the most valuable data mining techniques, since it’s used to project the types of data you’ll see in the future.  In many cases, just recognizing and understanding historical trends is enough to chart a somewhat accurate prediction of what will happen in the future.  For example, you might review consumers’ credit histories and past purchases to predict whether they’ll be a credit risk in the future.
  • 113. Data Mining Techniques  For example, the sales manager of a supermarket may want to predict the amount of revenue that each item will generate based on past sales data. Prediction models a continuous-valued function that predicts missing numeric data values.  Regression analysis is a common choice for prediction. It can be used to model the relationship between independent variables and dependent variables.
  • 114. Data Mining Techniques  Sequence Prediction:  Before defining the problem of sequence prediction, it is necessary to first explain what a sequence is. A sequence is an ordered list of symbols. For example, here are some common types of sequences: • A sequence of webpages visited by a user, ordered by the time of access. • A sequence of words or characters typed on a cellphone by a user, or in a text such as a book. • A sequence of products bought by a customer in a retail store. • A sequence of proteins in bioinformatics. • A sequence of symptoms observed in a patient at a hospital.
  • 115. Data Mining Techniques Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  The task of sequence prediction consists of predicting the next symbol of a sequence based on the previously observed symbols. For example, if a user has visited some webpages A, B, C, in that order, one may want to predict what is the next webpage that will be visited by that user to prefetch the webpage.  First, one must train a sequence prediction model using some previously seen sequences called the training sequences. This process is illustrated below:
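The train-then-predict idea can be sketched with a very simple model. The slides do not name a specific algorithm, so this example assumes a first-order model that predicts the symbol most often seen directly after the last observed symbol; the training sequences and function names are hypothetical.

```python
from collections import Counter, defaultdict


def train(sequences):
    """Count, for each symbol, which symbols followed it in training."""
    followers = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            followers[current][nxt] += 1
    return followers


def predict_next(followers, sequence):
    """Return the most frequent follower of the last symbol, or None."""
    if not sequence or sequence[-1] not in followers:
        return None
    return followers[sequence[-1]].most_common(1)[0][0]


# Hypothetical training sequences of visited webpages.
training_sequences = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "D"],
    ["B", "C", "E"],
]
model = train(training_sequences)
prediction = predict_next(model, ["A", "B", "C"])
print(prediction)
```

In the training data, page C was followed by D twice and by E once, so D is predicted as the page to prefetch.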
  • 116. Data Mining Techniques  Some Other Data Mining Techniques:-  Statistical Techniques:  Statistics is a branch of mathematics which relates to the collection and description of data.  Statistical techniques are not considered data mining techniques by many analysts.  Still, they help to discover patterns and build predictive models.  For this reason, a data analyst should possess some knowledge of the different statistical techniques.  In today’s world, people have to deal with a large amount of data and derive important patterns from it.
  • 117. Data Mining Techniques  Statistics can help you answer questions about your data, such as • What are the patterns in the database? • What is the probability of an event occurring? • Which patterns are more useful to the business? • What is the high-level summary that can give you a view of what is in the database?  Statistics not only answers these questions; it also helps in summarizing the data and counting it.  It also helps in providing information about the data with ease. Through statistical reports, people can make smart decisions.
  • 118. Data Mining Techniques  There are different forms of statistics, but the most important and useful technique is the collection and counting of data. There are many ways to summarize collected data, such as • Histogram • Mean • Median • Mode • Variance • Max • Min • Linear Regression
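Several of the summaries listed above are available in Python's standard `statistics` module. The sketch below uses a small list of sample values chosen for illustration; the variable names are assumptions.

```python
import statistics

# Sample values chosen only for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

summary = {
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
    "mode": statistics.mode(prices),
    "variance": statistics.pvariance(prices),  # population variance
    "min": min(prices),
    "max": max(prices),
}

for name, value in summary.items():
    print(name, value)
```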
  • 119. Data Mining Techniques  Decision Trees  A decision tree is a tree structure (as its name suggests), where • Each internal node represents a test on an attribute. • Each branch denotes a result of the test. • Terminal nodes hold the class label. • The topmost node is the root node, which poses a simple question that has two or more answers. The tree grows accordingly, and a flowchart-like structure is generated.
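The structure described above can be sketched as a nested dictionary. The attributes, thresholds, and class labels here are hypothetical and serve only to show internal nodes, branches, and terminal nodes; the tree is written by hand rather than learned from data.

```python
# Internal nodes are dicts holding a test; leaves are plain class labels.
tree = {
    "attribute": "age",
    "threshold": 18,
    "below": "not eligible",
    "at_or_above": {
        "attribute": "passed_test",
        "threshold": 1,
        "below": "not eligible",
        "at_or_above": "eligible",
    },
}


def classify(node, record):
    """Walk from the root to a leaf, following the branch each test selects."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        branch = "below" if value < node["threshold"] else "at_or_above"
        node = node[branch]
    return node


print(classify(tree, {"age": 16, "passed_test": 1}))
print(classify(tree, {"age": 30, "passed_test": 1}))
```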
  • 120. Data Mining Techniques  In this decision tree, the government classifies citizens as below age 18 or above age 18. This helps to decide whether a license should be issued to a particular citizen or not.
  • 121. Data Mining Tools Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  In today’s world, a large amount of data is generated within seconds. To handle this data, we should have some knowledge of different techniques and tools.  Data mining tools are nothing but a set of methodologies that are used for analyzing this large amount of data and the relationship between different data.  Data Mining tools have the objective of discovering patterns/trends/groupings among large sets of data and transforming data into more refined information.
  • 123. Data Mining Tools 1. Orange Data Mining Tool:  It is open-source software written in the Python language.  Orange is a component-based software suite for data analysis and machine learning. Its components are called widgets.  These widgets are used for reading data, analyzing it, selecting features, and displaying the data.  With Orange, formatting data and moving it between widgets becomes fast and easy.  Besides, Orange brings a more interactive and enjoyable feel to otherwise dull analytical tools.
  • 124. Data Mining Tools Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Widgets deliver significant functionalities such as: • Displaying data table and allowing to select features • Data reading • Training predictors and comparison of learning algorithms • Data element visualization, etc.
  • 125. Data Mining Tools 2. SAS Data Mining:  SAS stands for Statistical Analysis System.  It is a product of the SAS Institute created for analytics and data management.  SAS can mine data, change it, manage information from various sources, and analyze statistics.  It offers a graphical UI for non-technical users.  SAS data mining allows users to analyze big data and provides accurate insight for timely decision-making.  SAS has a distributed memory processing architecture that is highly scalable.  It is suitable for data mining, optimization, and text mining purposes.
  • 126. Data Mining Tools 3. DataMelt Data Mining:  DataMelt is a computation and visualization environment which offers an interactive structure for data analysis and visualization.  It is primarily designed for students, engineers, and scientists.  It is also known as DMelt.  DMelt is a multi-platform utility written in Java.  It can run on any operating system which is compatible with the JVM (Java Virtual Machine).  It consists of scientific and mathematical libraries.
  • 127. Data Mining Tools Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Scientific libraries: Scientific libraries are used for drawing the 2D/3D plots.  Mathematical libraries: Mathematical libraries are used for random number generation, algorithms, curve fitting, etc.  DMelt can be used for the analysis of the large volume of data, data mining, and statistical analysis.  It is extensively used in natural sciences, financial markets, and engineering.
  • 128. Data Mining Tools 4. Rattle:  Rattle is a GUI-based data mining tool.  It uses the R statistical programming language.  Rattle exposes the statistical power of R by offering significant data mining features.  Rattle has a comprehensive and well-developed user interface, and it has an integrated log tab that records the corresponding R code for any GUI operation.  The data set produced by Rattle can be viewed and edited.  Rattle also lets users review the code, use it for many purposes, and extend it without any restriction.
  • 129. Data Mining Tools 5. Rapid Miner:  It is written in the Java programming language.  Rapid Miner is one of the most popular predictive analysis systems, created by the company of the same name.  It offers an integrated environment for text mining, deep learning, machine learning, and predictive analysis.  The tool can be used for a wide range of applications, including business and commercial applications, research, education, training, application development, and machine learning.  Rapid Miner provides its server on-site as well as in public or private cloud infrastructure.
  • 130. Data Mining Process  Data Mining refers to extracting or mining knowledge from large amounts of data.  It is also defined as the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from a huge amount of data.  Data mining is a rapidly growing field that is concerned with developing techniques to assist managers and decision-makers in making intelligent use of huge data repositories.  It is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.  The goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
  • 131. Data Mining Process  Major tasks of data pre-processing:  Data Cleaning  Data cleaning is a process to clean the data in such a way that data can be easily integrated.  Data Integration  Data integration is a process to integrate/combine all the data.  Data Reduction  Data reduction is a process to reduce large data into smaller data in such a way that it can be easily transformed further.  Data Transformation  Data transformation is a process to transform the data into a form suitable for mining.
  • 132. Data Mining Process  Data Discretization  Data discretization converts a large number of data values into a smaller number of values, so that data evaluation and data management become very easy.  After the completion of these tasks, the data is ready for mining.
  • 134. Data Mining Process  1.) Data Cleansing:-  Data cleansing, or data cleaning, is the process of identifying and removing (or correcting) inaccurate records in a dataset, table, or database. It involves recognizing incomplete, unreliable, inaccurate, or irrelevant parts of the data and then correcting, replacing, or removing the dirty data.  To perform data analytics properly, we need various data cleaning techniques so that our data is ready for analysis.  Data cleaning may be performed as batch processing through scripting or interactively with data cleansing tools.  After cleaning, a dataset should be consistent with other related datasets in the system.
  • 135. Data Mining Process  Data cleaning is not only an essential part of the data science process; it is also the most time-consuming part.  With the rise of big data, data cleaning methods have become more important than ever before. Every industry (banking, healthcare, retail, hospitality, education) is now navigating a large ocean of data.  “Data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time actually analyzing it.”  Data cleaning is a process to clean dirty data. Data is mostly not clean: it can be incorrect for many reasons, such as hardware error/failure, network error, or human error. So it is necessary to clean the data before mining.
  • 136. Data Mining Process  Sources of Missing Values  There are many sources of missing data. Some major sources are: 1. The user forgot to fill in a field. 2. A programming error occurred. 3. Data was lost while being transferred manually from a legacy database.
  • 137. Data Mining Process  Data Cleaning Techniques-Get Rid of Extra Spaces  Here we have the text Welcome To Medicaps University written in four different ways (quotation marks are added to make the spaces visible).  “welcome to Medicaps University”  “welcome   to   Medicaps   University”  “   welcome  to  Medicaps  University”  “welcome to Medicaps University   ”  The first is the regular way, with only one space between words. In the second case we have more than one space between words. In the third case we have some leading spaces along with a couple of spaces between words. In the fourth case we have trailing spaces: there are a couple of spaces after the last word.
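A minimal sketch of removing such extra spaces in Python. The function name `clean_spaces` is an assumption for this example; `str.split()` without arguments splits on any run of whitespace and drops leading and trailing whitespace, so joining the parts with a single space normalizes all four variants.

```python
def clean_spaces(text):
    """Collapse runs of whitespace and trim both ends of the string."""
    return " ".join(text.split())


variants = [
    "welcome to Medicaps University",
    "welcome   to   Medicaps   University",
    "   welcome  to  Medicaps  University",
    "welcome to Medicaps University   ",
]
cleaned = [clean_spaces(v) for v in variants]
print(cleaned)
```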
  • 138. Data Mining Process Fixing Structural errors  The errors that arise during measurement, transfer of data, or other similar situations are called structural errors.  Structural errors include typos in feature names, the same attribute appearing under different names, mislabeled classes (separate classes that should really be the same), and inconsistent capitalization.  For example, the model will treat America and america as different classes or values, though they represent the same value. Similarly, it will treat red, yellow and red-yellow as different classes, though one class can be included in the other two.  Such structural errors make the model inefficient and give poor-quality results.
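One way to repair inconsistent capitalization is to normalize every label before counting or modeling. The sketch below is an illustration under assumptions: the sample values and the function name `normalize_label` are hypothetical, and lowercasing is only one possible normalization rule.

```python
from collections import Counter


def normalize_label(label):
    """Trim surrounding spaces and fold case so variants compare equal."""
    return label.strip().casefold()


# Hypothetical raw values for a country attribute.
raw_values = ["America", "america", " AMERICA ", "India"]
counts = Counter(normalize_label(v) for v in raw_values)
print(counts)
```

After normalization, the three spellings of America are counted as one class instead of three.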
  • 139. Data Mining Process Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Some techniques of Data Cleaning Process 1. Parsing 2. Correcting 3. Standardizing 4. Matching 5. Consolidation 6. Dealing with missing data 7. Dealing with incorrect and noisy data
  • 140. Data Mining Process  Some data cleaning methods for missing values:-  1 You can ignore the tuple. This is usually done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values.  2 You can fill in the missing value manually. This approach is effective on a small data set with few missing values.  3 You can replace all missing attribute values with a global constant, such as a label like “Unknown” or minus infinity.  4 You can use the attribute mean to fill in the missing value. For example, if the average customer income is 25000, you can use this value to replace a missing value for income.  5 You can use the most probable value to fill in the missing value.
  • 141. Data Mining Process  Noisy Data  Noise is a random error or variance in a measured variable. Noisy data may be due to faulty data collection instruments, data entry problems, and technology limitations.  How to Handle Noisy Data?  Binning:  Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins.  For example  Price = 4, 8, 15, 21, 21, 24, 25, 28, 34
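Methods 3 and 4 above can be sketched as follows. The example assumes missing values are represented by `None`; the function names and the sample incomes are hypothetical.

```python
import statistics


def fill_with_constant(values, constant="Unknown"):
    """Replace every missing value with a global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace every missing value with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = statistics.mean(known)
    return [mean if v is None else v for v in values]


# Hypothetical income column with two missing entries.
incomes = [20000, None, 30000, 25000, None]
print(fill_with_constant(incomes))
print(fill_with_mean(incomes))
```

Here the known incomes average 25000, so that value replaces both missing entries when the mean is used.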
  • 142. Data Mining Process Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Partition into (equal-frequency) bins:  Bin a: 4, 8, 15  Bin b: 21, 21, 24  Bin c: 25, 28, 34  In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3.  Smoothing by bin means:  Bin a: 9, 9, 9  Bin b: 22, 22, 22  Bin c: 29, 29, 29  In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
  • 143. Data Mining Process  Smoothing by bin boundaries:  Bin a: 4, 4, 15  Bin b: 21, 21, 24  Bin c: 25, 25, 34  In smoothing by bin boundaries, the minimum and maximum values of a bin are its boundaries, and each bin value is replaced by the closest boundary value.  Regression  Data can be smoothed by fitting the data to a regression function.  Clustering:  Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.” Values that fall outside of the set of clusters may be considered outliers.
  • 144. Data Mining Process 2.) Data Integration In Data Mining  This is the second step in the data mining process. Data from various sources is brought together into a single place.  Data in your computer systems is stored in different formats in different locations, such as saved spreadsheets, text files, images, documents, etc.  Data integration can be difficult if the data in your organization is poorly organized. Done well, it removes duplication without affecting the reliability of the data.  Data integration is a data preprocessing technique that combines data from multiple sources and provides users a unified view of these data.
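The binning example above (equal-frequency bins, smoothing by bin means, and smoothing by bin boundaries) can be reproduced with a short sketch. The function names are assumptions for this illustration; ties between the two boundaries are resolved toward the lower boundary, which is a choice made here and is not stated in the slides.

```python
import statistics


def equal_frequency_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def smooth_by_means(bins):
    """Replace each value with the (rounded) mean of its bin."""
    return [[round(statistics.mean(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace each value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```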
  • 146. Data Mining Process  There are two major approaches to data integration:- 1 Tight Coupling  In tight coupling, data is combined from different sources into a single physical location through the process of ETL (Extraction, Transformation and Loading). Here, a data warehouse is treated as an information retrieval component. 2 Loose Coupling  In loose coupling, the data remains only in the actual source databases. An interface takes a query from the user, transforms it into a form the source databases can understand, and sends it directly to the source databases to obtain the result.
  • 147. Data Mining Process  Issues in Data Integration: There are a number of issues to consider during data integration: schema integration, redundancy, and detection and resolution of data value conflicts. These are explained in brief below.  1. Schema Integration: • Integrate metadata from different sources. • Matching real-world entities from multiple sources is referred to as the entity identification problem.  For example, how can the data analyst or the computer be sure that customer id in one database and customer number in another refer to the same attribute?
  • 148. Data Mining Process  2. Redundancy: • An attribute may be redundant if it can be derived from another attribute or set of attributes. • Inconsistencies in attribute naming can also cause redundancies in the resulting data set. • Some redundancies can be detected by correlation analysis.  3. Detection and resolution of data value conflicts: • This is the third important issue in data integration. • Attribute values from different sources may differ for the same real-world entity. • An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
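Correlation analysis for redundancy can be sketched with the Pearson correlation coefficient. This is a minimal illustration under assumptions: the two salary attributes and the function name are hypothetical, and the slides do not say which correlation measure is intended.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical attributes: annual salary is derived from monthly salary.
monthly_salary = [2000, 2500, 3000, 4000]
annual_salary = [12 * s for s in monthly_salary]
r = pearson(monthly_salary, annual_salary)
print(r)
```

A coefficient close to 1 or -1 suggests that one attribute may be derivable from the other, so one of them is a candidate for removal.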
  • 149. Data Mining Process 3.) Data Transformation In Data Mining  In the data transformation process, data are transformed from one format to another format that is more appropriate for data mining.  Some Data Transformation Strategies:-  1 Smoothing  Smoothing is a process of removing noise from the data.  2 Aggregation  Aggregation is a process where summary or aggregation operations are applied to the data.  3 Generalization  In generalization, low-level data are replaced with higher-level data by climbing concept hierarchies.
  • 150. Data Mining Process  4 Normalization  Normalization scales attribute data so that it falls within a small specified range, such as 0.0 to 1.0.  5 Attribute Construction  In attribute construction, new attributes are constructed from the given set of attributes.
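Normalization to the range 0.0 to 1.0 can be sketched with min-max scaling. The slides do not name a specific normalization method, so min-max is an assumption here, as are the function name and the sample values.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical; map them all to the lower bound.
        return [new_min for _ in values]
    return [
        (v - low) / (high - low) * (new_max - new_min) + new_min
        for v in values
    ]


# Hypothetical attribute values.
scores = [10, 20, 30]
print(min_max_normalize(scores))
```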
  • 151. Data Mining Process Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Discrete vs Continuous Data  If you have quantitative data, like a number of workers in a company, could you divide every one of the workers into 2 parts? The answer is absolutely NOT. Because the number of workers is discrete data.  Discrete data is a count that involves integers.  Only a limited number of values is possible.  The discrete values cannot be subdivided into parts.  For example, the number of children in a school is discrete data. You can count whole individuals. You can’t count 1.5 kids.  So, discrete data can take only certain values. The data variables cannot be divided into smaller parts.
  • 152. Data Mining Process  As mentioned above, the two types of quantitative data (numerical data) are discrete and continuous data.  Continuous data is considered the opposite of discrete data.  Continuous data is information that can be meaningfully divided into finer levels.  It can be measured on a scale and can have almost any numeric value.  For example, you can measure your height at very precise scales: meters, centimeters, millimeters, etc.  You can record continuous data for many different measurements, such as width, temperature, and time.  This is where the key difference from discrete data lies.
  • 154. Data Mining Process 4.) Data discretization: Data discretization converts a large number of data values into a smaller number of values, so that data evaluation and data management become very easy.  Data discretization techniques can be used to divide the range of a continuous attribute into intervals. Numerous continuous attribute values are replaced by a small number of interval labels.  This leads to a concise, easy-to-use, knowledge-level representation of mining results.  Data discretization example  We have an attribute age with the following values. (Before Discretization) Age: 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75
  • 155. Data Mining Process  After Discretization: • Age 10, 11, 13, 14, 17, 19: Young • Age 30, 31, 32, 38, 40, 42: Mature • Age 70, 72, 73, 75: Old
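The age example can be sketched as a simple mapping from values to interval labels. The slide shows only the resulting groups, so the cut points used here (30 and 70) are assumptions chosen to reproduce those groups; the function name is also hypothetical.

```python
from collections import Counter


def discretize_age(age):
    """Map a numeric age to an interval label (cut points are assumed)."""
    if age < 30:
        return "Young"
    if age < 70:
        return "Mature"
    return "Old"


ages = [10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75]
labels = [discretize_age(a) for a in ages]
print(Counter(labels))
```

Sixteen distinct numeric values are reduced to three labels, which is the kind of concise representation discretization aims for.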
  • 156. Data Mining Process - Discretization and Concept Hierarchy Generation for Numerical Data  Typical methods: 1 Binning  Binning is a top-down splitting technique based on a specified number of bins. Binning is an unsupervised discretization technique. 2 Histogram Analysis  Because histogram analysis does not use class information, it is an unsupervised discretization technique. Histograms partition the values for an attribute into disjoint ranges called buckets.  3 Cluster Analysis  Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups.
  • 157. Data Mining Process  Top-down discretization  If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, then it is called top-down discretization or splitting.  Bottom-up discretization  If the process starts by considering all of the continuous values as potential split points and removes some by merging neighboring values to form intervals, then it is called bottom-up discretization or merging.  Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.
  • 159. Data Mining Process Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Concept hierarchies  Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level concepts.  In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives.  Data mining on a reduced data set means fewer input/output operations and is more efficient than mining on a larger data set.  Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining, rather than during mining.
  • 161. Summary  Data mining is all about explaining the past and predicting the future through analysis.  Data mining helps to extract information from huge sets of data. It is the procedure of mining knowledge from data.  The data mining process includes Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, and Deployment.  Important data mining techniques are Classification, Clustering, Regression, Association Rules, Outlier Detection, Sequential Patterns, and Prediction.  R-language and Oracle Data Mining are prominent data mining tools.  Data mining techniques help companies to get knowledge-based information.
  • 162. Summary • The main drawback of data mining is that much analytics software is difficult to operate and requires advanced training to work with. • Data mining is used in diverse industries such as Communications, Insurance, Education, Manufacturing, Banking, Retail, Service Providers, eCommerce, Supermarkets, and Bioinformatics.
  • 163. Unit – 2 Any - 5 Assignment Questions Marks:-20 Mr. Sagar Pandya sagar.pandya@medicaps.ac.in  Q.1 What is Data Mining? Explain the different stages for Data Mining Process.  Q.2 Describe challenges to Data Mining regarding data mining methodology and user interaction issues.  Q.3 Describe the various techniques of Data Mining. Write tools for Data Mining.  Q.4 What is Data Cleaning? Describe the approaches to fill missing values and noisy data.  Q.5 Explain Knowledge Discovery Process.  Q.6 Define Support and Confidence in Association rule mining.  Q.7 Explain Data mining Architecture. Write Some Application of Data mining.
  • 165. Thank You Great God, Medi-Caps, All the attendees Mr. Sagar Pandya sagar.pandya@medicaps.ac.in www.sagarpandya.tk LinkedIn: /in/seapandya Twitter: @seapandya Facebook: /seapandya

  121. Sample Slide
  122. Sample Slide
  123. Sample Slide
  124. Sample Slide
  125. Sample Slide
  126. Sample Slide
  127. Sample Slide
  128. Sample Slide
  129. Sample Slide
  130. Sample Slide
  131. Sample Slide
  132. Sample Slide
  133. Sample Slide
  134. Sample Slide
  135. Sample Slide
  136. Sample Slide
  137. Sample Slide
  138. Sample Slide
  139. Sample Slide
  140. Sample Slide
  141. Sample Slide
  142. Sample Slide
  143. Sample Slide
  144. Sample Slide
  145. Sample Slide
  146. Sample Slide
  147. Sample Slide
  148. Sample Slide
  149. Sample Slide
  150. Sample Slide
  151. Sample Slide
  152. Sample Slide
  153. Sample Slide
  154. Sample Slide
  155. Sample Slide
  156. Sample Slide
  157. Sample Slide
  158. Sample Slide
  159. Sample Slide
  160. Sample Slide