MC0088 - DATA WAREHOUSING & DATA MINING
Que.1 Differentiate between Data Mining and Data Warehousing?
Ans: -
Data Mining: -
Data mining refers to a class of database applications that look for hidden patterns in a
group of data. For example, data mining software can help retail companies find customers with
common interests.
The term is commonly misused to describe software that presents data in new ways. True data
mining software doesn't just change the presentation, but actually discovers previously unknown
relationships among the data.
Data mining consists of many up-to-date techniques such as classification (decision trees, naive
Bayes classifier, k-nearest neighbor, and neural networks), clustering (k-means, hierarchical
clustering, and density-based clustering), and association (one-dimensional, multidimensional,
multilevel, and constraint-based association). Many years of practice show that data mining
is a process, and its successful application requires data preprocessing (dimensionality reduction,
cleaning, noise/outlier removal), post-processing (understandability, summary, presentation),
a good understanding of the problem domain, and domain expertise.
Data Warehousing: -
The construction of data warehouse, which involves data cleaning and data integration, can be
viewed as an important preprocessing step for data mining. Moreover, data warehouses provide
on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of
varied granularities, which facilitate effective data mining. Furthermore, many other data mining
functions such as classification, prediction, association and clustering can be integrated with OLAP
operation to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the
data warehouse has become an increasingly important platform for data analysis and online
analytical processing and will provide an effective platform for data mining. An overview of data
warehouse technology is therefore essential for understanding data mining technology.
Data warehouses have been defined in many ways, making it difficult to formulate a rigorous
definition. A data warehouse refers to a database that is maintained separately from an
organization’s operational databases. Data warehouse systems allow for the integration of a variety
of application systems.
Data warehousing is defined as a process of centralized data management and retrieval. Data
warehousing, like data mining, is a relatively new term although the concept itself has been around
for years.
Que.2 Describe the key features of a Data Warehouse?
Ans: -
According to W. H. Inmon, a leading architect in the construction of data warehouse systems, “A
data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision making process”.
Key features of a Data Warehouse
1) Subject-oriented
2) Integrated
3) Time-variant
4) Nonvolatile
Subject-oriented: -
A data warehouse is organized around major subjects, such as customer, supplier, product, and
sales. Rather than concentrating on the day-to-day operation and transaction processing of an
organization, a data warehouse focuses on the modeling and analysis of data for decision makers.
Hence, data warehouses typically provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision support process.
Integrated: -
A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as
relational databases, flat files, and online transaction records. Data cleaning and data integration
techniques are applied to ensure consistency in naming conventions, encoding structures, attribute
measures, and so on.
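The consistency step described above can be sketched as follows. This is only an illustration: the two sources, their field names, the gender codes, and the currency units are all hypothetical and are not taken from any particular system.

```python
# Hypothetical records from two sources that follow different conventions.
crm_rows = [{"cust_id": 1, "gender": "M", "balance_usd": 120.0}]
billing_rows = [{"CustomerID": 2, "sex": "female", "balance_cents": 4550}]

# One agreed encoding for the warehouse (an assumption for this sketch).
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def from_crm(row):
    """Rename CRM fields to the warehouse naming convention."""
    return {
        "customer_id": row["cust_id"],
        "gender": GENDER_CODES[row["gender"].lower()],
        "balance": row["balance_usd"],
    }


def from_billing(row):
    """Rename billing fields and convert cents to the warehouse unit."""
    return {
        "customer_id": row["CustomerID"],
        "gender": GENDER_CODES[row["sex"].lower()],
        "balance": row["balance_cents"] / 100,
    }


def integrate(crm, billing):
    """Combine both sources into one consistently encoded list of rows."""
    return [from_crm(r) for r in crm] + [from_billing(r) for r in billing]


warehouse_rows = integrate(crm_rows, billing_rows)
```

After this step every row uses the same attribute names, the same gender encoding, and the same unit of measure, whichever source it came from.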
Time-variant: -
Data are stored to provide information from a historical perspective (e.g., the past 5-10 years).
Every key structure in the data warehouse contains, either implicitly or explicitly, an element of
time.
Nonvolatile: -
A data warehouse is always a physically separate store of data transformed from the application
data found in the operational environment. Due to this separation, a data warehouse does not
require transaction processing, recovery, and concurrency control mechanisms. It usually requires
only two operations in data accessing: initial loading of data and access of data.
The traditional database approach to heterogeneous database integration is to build wrappers and
integrators (or mediators) on top of multiple, heterogeneous databases (examples include IBM
DataJoiner and Informix DataBlade). When a query is posed to a client site, a metadata dictionary is
used to translate the query into queries appropriate for the individual heterogeneous sites
involved. These queries are then mapped and sent to local query processors. The results returned
from the different sites are integrated into a global answer set.
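The query-driven approach can be illustrated with a small sketch. The two sites, their schemas, and the metadata dictionary below are hypothetical. Real mediators are far more elaborate, so this only shows the translate, dispatch, and integrate flow.

```python
# Hypothetical local sites, each with its own schema.
SITE_A = [{"name": "Ann", "city": "Pune"}, {"name": "Raj", "city": "Delhi"}]
SITE_B = [{"cust_name": "Mia", "town": "Pune"}]
SITES = {"site_a": SITE_A, "site_b": SITE_B}

# Metadata dictionary: global attribute name -> local attribute name per site.
METADATA = {
    "site_a": {"customer": "name", "city": "city"},
    "site_b": {"customer": "cust_name", "city": "town"},
}


def local_query(site, field, value):
    """Wrapper: translate the global field, query one site, and return
    the matching rows expressed in global attribute names."""
    mapping = METADATA[site]
    local_field = mapping[field]
    results = []
    for row in SITES[site]:
        if row[local_field] == value:
            results.append({g: row[l] for g, l in mapping.items()})
    return results


def global_query(field, value):
    """Mediator: send the query to every site and merge the answers."""
    answer = []
    for site in SITES:
        answer.extend(local_query(site, field, value))
    return answer


pune_customers = global_query("city", "Pune")
```

A data warehouse avoids this per-query translation by integrating the data in advance.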
Que. 3 Differentiate between Data Integration and Transformation?
Ans: -
Data Integration: -
Data Integration is one of the steps of Data Preprocessing that involves combining data residing in
different sources and providing users with a unified view of these data. It merges data from
multiple data stores (data sources). Related areas are as under: -
1) Data Migration
2) Data Synchronization
3) ETL
4) Business Intelligence
5) Master Data Management
Data Migration: -
Data Migration is the process of transferring data from one system to another while changing the
storage, database or application.
Data Synchronization: -
Data Synchronization is a process of establishing consistency among systems and subsequent
continuous updates to maintain consistency.
ETL: -
ETL comes from Data Warehousing and stands for Extract-Transform-Load. ETL covers a process of
how the data are loaded from the source system to the data warehouse.
Business Intelligence: -
Business Intelligence (BI) is a set of tools supporting the transformation of raw data into useful
information which can support decision making.
Master Data Management: -
Master Data Management (MDM) represents a set of tools and processes used by an enterprise to
consistently manage their non-transactional data.
Data Transformation: -
Data transformation is the process of converting data from one format (e.g. a database file, XML
document, or Excel sheet) to another. Because data often resides in different locations and formats
across the enterprise, data transformation is necessary to ensure data from one application or
database is intelligible to other applications and databases, a critical feature for applications
integration.
In a typical scenario where information needs to be shared, data is extracted from the source
application or data warehouse, transformed into another format, and then loaded into the target
location. Extraction, transformation, and loading (together known as ETL) are the central processes
of data integration. Depending on the nature of the integration scenario, data may need to be
merged, aggregated, enriched, summarized, or filtered.
The first step of data transformation is data mapping. Data mapping determines the relationship
between the data elements of two applications and establishes instructions for how the data from
the source application is transformed before it is loaded into the target application. In other words,
data mapping produces the critical metadata that is needed before the actual data conversion takes
place.
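As a minimal sketch of the extract, transform, and load steps driven by a data mapping, the following uses only the Python standard library. The source CSV, the column names, and the target table `sales` are hypothetical and serve only to illustrate the idea.

```python
import csv
import io
import sqlite3

# Hypothetical source data, as it might be exported by a source application.
SOURCE_CSV = "id,full_name,amt\n1,ann lee,10.50\n2,raj rao,7.25\n"

# Data mapping: source column -> (target column, conversion function).
MAPPING = {
    "id": ("customer_id", int),
    "full_name": ("name", str.title),
    "amt": ("amount", float),
}


def extract(text):
    """Extract: read the source rows as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: rename and convert every field according to MAPPING."""
    return [
        {target: convert(row[source])
         for source, (target, convert) in MAPPING.items()}
        for row in rows
    ]


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE sales (customer_id INTEGER, name TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (:customer_id, :name, :amount)", rows
    )


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT customer_id, name, amount FROM sales ORDER BY customer_id"
).fetchall()
```

Here `MAPPING` plays the role of the metadata produced by data mapping: it is defined before any conversion takes place, and the transform step simply applies it.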
Que. 4 Differentiate between database management systems (DBMS) and data mining?
Ans: -
Database Management System (DBMS) is the software that manages data on physical storage
devices.
Data Mining: - Data mining is the process of discovering relationships among data in the database.
Area | DBMS | Data mining
Task | Extraction of detailed and summary data | Knowledge discovery of hidden patterns and insights
Type of result | Information | Insight and Prediction
Method | Deduction (Ask the question, verify the data) | Induction (Build the model, apply it to new data, get the result)
Example question | Who purchased mutual funds in the last 3 years? | Who will buy a mutual fund in the next 6 months and why?
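The contrast between deduction and induction can be sketched with a toy example. The customer history, the age bands, and the 0.5 threshold are hypothetical, and the "model" is deliberately trivial (a purchase rate per age band). It stands in for a real data mining model only to show the build, apply, and get-result sequence.

```python
# Hypothetical customer history: (age_band, bought_fund).
history = [
    ("30-45", True), ("30-45", True), ("30-45", False),
    ("18-29", False), ("18-29", False), ("18-29", True),
]

# Deduction (DBMS style): ask a question and read the answer from stored data.
buyers = [i for i, (_, bought) in enumerate(history) if bought]


# Induction (data mining style): build a model from the history...
def build_model(rows):
    """Estimate the purchase rate for each age band."""
    counts = {}
    for band, bought in rows:
        total, yes = counts.get(band, (0, 0))
        counts[band] = (total + 1, yes + int(bought))
    return {band: yes / total for band, (total, yes) in counts.items()}


model = build_model(history)


# ...then apply it to new data to get a prediction.
def predict(band, threshold=0.5):
    """Predict a purchase when the learned rate reaches the threshold."""
    return model.get(band, 0.0) >= threshold
```

The list `buyers` answers a question about the past, while `predict` makes a statement about a customer who is not yet in the data, and the learned rate gives a simple reason why.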
Data mining is concerned with finding hidden relationships present in business data to allow
businesses to make predictions for future use. It is the process of data-driven extraction of not so
obvious but useful information from large databases.
The aim of data mining is to extract implicit, previously unknown and potentially useful (or
actionable) patterns from data. Data mining consists of many up-to-date techniques such as
classification (decision trees, naive Bayes classifier, k-nearest neighbor, and neural networks),
clustering (k-means, hierarchical clustering, and density-based clustering), and association
(one-dimensional, multidimensional, multilevel, and constraint-based association).
Data warehousing is defined as a process of centralized data management and retrieval.
A data warehouse is a relational database system designed to support very large databases
(VLDB) at a significantly higher level of performance and manageability.
A data warehouse is an environment, not a product. It is an architectural construct that provides
information that is hard to access or present in traditional operational data stores.
Que. 5 Differentiate between K-means and Hierarchical clustering?
Ans: -
K-means clustering
The k-means algorithm assigns each point to the cluster whose center (also called centroid) is
nearest. The center is the average of all the points in the cluster — that is, its coordinates are the
arithmetic mean for each dimension separately over all the points in the cluster.
Example: The data set has three dimensions and the cluster has two points: X = (x1,x2,x3) and Y =
(y1,y2,y3). Then the centroid Z becomes Z = (z1,z2,z3), where z1 = (x1 + y1)/2, z2 = (x2 + y2)/2
and z3 = (x3 + y3)/2.
The algorithm steps are as under: -
1) Choose the number of clusters, k.
2) Randomly generate k clusters and determine the cluster centers, or directly generate k random
points as cluster centers.
3) Assign each point to the nearest cluster center, where "nearest" is defined with respect to a
chosen distance measure (for example, Euclidean distance).
4) Recompute the new cluster centers.
5) Repeat the two previous steps until some convergence criterion is met (usually that the
assignment has not changed).
The main advantages of this algorithm are its simplicity and speed, which allow it to run on large
datasets. Its disadvantage is that it may not yield the same result on each run, since the resulting
clusters depend on the initial random assignments.
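The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: it uses squared Euclidean distance, picks k of the data points as the initial centers (one common way of generating the starting centers), and fixes the random seed so that the run is repeatable. The sample points are invented.

```python
import random


def centroid(points):
    """Arithmetic mean of the points, computed separately per dimension."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k random data points
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Convergence criterion: the assignment has not changed.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute the cluster centers (an empty cluster keeps its center).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = centroid(members)
    return centers, assignment


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = k_means(points, k=2)
```

On this well-separated sample the first three points end up in one cluster and the last three in the other. Which cluster receives label 0 depends on the initial centers, which reflects the run-to-run variation mentioned above.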
Hierarchical clustering: -
Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree structure
called a dendrogram. The root of the tree consists of a single cluster containing all observations,
and the leaves correspond to individual observations.
Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the
leaves and successively merges clusters together; or divisive, in which one starts at the root and
recursively splits the clusters.
Any non-negative-valued function may be used as a measure of similarity between pairs of
observations. The choice of which clusters to merge or split is determined by a linkage criterion,
which is a function of the pairwise distances between observations.
Cutting the tree at a given height will give a clustering at a selected precision. For example, in
a dendrogram over six elements a, b, c, d, e and f, cutting after the second row will yield clusters
{a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a
coarser clustering, with a smaller number of larger clusters.
The agglomerative method builds the hierarchy from the individual elements by progressively merging
clusters. In this example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to
determine which elements to merge into a cluster.
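The agglomerative procedure can be sketched as follows. The text does not give the distances between the six elements or the linkage criterion, so both are assumptions here: the positions on a line are hypothetical values chosen so that the merges reproduce the clusters quoted above, and single linkage (the smallest pairwise distance) is used as the linkage criterion.

```python
# Hypothetical one-dimensional positions for the six elements.
POSITIONS = {"a": 0.0, "b": 10.0, "c": 11.0, "d": 20.0, "e": 21.5, "f": 24.0}


def distance(p, q):
    """Distance between two elements (absolute difference of positions)."""
    return abs(POSITIONS[p] - POSITIONS[q])


def single_linkage(cluster_a, cluster_b):
    """Linkage criterion: the smallest pairwise distance between clusters."""
    return min(distance(p, q) for p in cluster_a for q in cluster_b)


def agglomerate(names, target_clusters):
    """Start from the leaves and merge the two closest clusters
    until only target_clusters clusters remain."""
    clusters = [[n] for n in names]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


four = agglomerate(list("abcdef"), 4)   # like cutting after the second row
three = agglomerate(list("abcdef"), 3)  # like cutting after the third row
```

Stopping the merging at a given number of clusters corresponds to cutting the dendrogram at a given height. A divisive algorithm would reach the same kind of hierarchy from the opposite direction, by splitting from the root.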
Que. 6 Differentiate between Web content mining and Web usage mining?
Ans: -
Web Content Mining: -
Web content mining targets knowledge discovery in which the main objects are the traditional
collections of multimedia documents such as images, video, and audio, which are embedded in or
linked to the web pages. It is quite different from Data mining because Web data are mainly
semi-structured and/or unstructured, while Data mining deals primarily with structured data. Web
content mining is also different from Text mining because of the semi-structured nature of the Web,
while Text mining focuses on unstructured texts. Web content mining thus requires creative
applications of Data mining and/or Text mining techniques and also its own unique approaches. In
the past few years, there was a rapid expansion of activities in the Web content mining area. This is
not surprising because of the phenomenal growth of Web content and the significant economic
benefit of such mining. However, due to the heterogeneity and the lack of structure of Web data,
automated discovery of targeted or unexpected knowledge still presents many
challenging research problems. Web content mining could be differentiated from two points of
view:
1) Agent-based approach
2) Database approach.
The first approach aims at improving information finding and filtering.
The second approach aims at modeling the data on the Web into a more structured form in order to
apply standard database querying mechanisms and data mining applications to analyze it.
Web Usage Mining: -
Web Usage Mining focuses on techniques that could predict the behavior of users while they are
interacting with the WWW. It discovers user navigation patterns from web data and tries to
extract useful information from the secondary data derived from the interactions of
the users while surfing on the Web.
There are several available research projects and commercial tools that analyze those patterns for
different purposes. The resulting knowledge could be utilized in personalization, system
improvement, site modification, business intelligence and usage characterization. The only
information left behind by many users visiting a Web site is the path through the pages they have
accessed. Most of the Web information retrieval tools only use the textual information, while they
ignore the link information that could be very valuable. In general, there are mainly four kinds of
data mining techniques applied to the web mining domain to discover the user navigation pattern:
1) Association Rule mining
2) Sequential pattern
3) Clustering
4) Classification
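As an illustration of the first technique, the sketch below applies a very small form of association rule mining to navigation paths. The sessions and page names are hypothetical, and the code only counts page pairs and computes support and confidence. It is not a full association rule algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: the path of pages each visitor accessed.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]


def frequent_pairs(sessions, min_support):
    """Return page pairs that occur together in at least min_support
    (a fraction) of all sessions, with their support."""
    counts = Counter()
    for pages in sessions:
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    n = len(sessions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


def rule_stats(sessions, antecedent, consequent):
    """Support and confidence of the rule: antecedent -> consequent."""
    with_a = [s for s in sessions if antecedent in s]
    with_both = [s for s in with_a if consequent in s]
    support = len(with_both) / len(sessions)
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence


pairs = frequent_pairs(sessions, min_support=0.5)
support, confidence = rule_stats(sessions, "/products", "/cart")
```

A rule such as "/products -> /cart" with high confidence is the kind of navigation pattern that could feed personalization or site modification.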