A.V.C.COLLEGE OF ENGINEERING
MANNAMPANDAL, MAYILADUTHURAI-609 305
COURSE MATERIAL
FOR THE SUBJECT OF
SUB NAME : CS1011 DATA WAREHOUSING AND DATA MINING
SEM : VII
DEPARTMENT : COMPUTER SCIENCE AND ENGINEERING
ACADEMIC YEAR : 2013-2013
NAME OF THE FACULTY : PARVATHI.M
DESIGNATION : Asst.Professor
A.V.C College of Engineering
Department of Computer Science & Engineering
2013 Odd Semester
Lesson Plan
SYLLABUS
ELECTIVE II
CS1011 – DATA WAREHOUSING AND DATA MINING (L T P : 3 0 0)
UNIT I BASICS OF DATA WAREHOUSING 8
Introduction − Data warehouse − Multidimensional data model − Data warehouse
architecture −Implementation − Further development − Data warehousing to data mining.
UNIT II DATA PREPROCESSING, LANGUAGE, ARCHITECTURES, CONCEPT DESCRIPTION 8
Why preprocessing − Cleaning − Integration − Transformation − Reduction −
Discretization – Concept hierarchy generation − Data mining primitives − Query
language − Graphical user interfaces − Architectures − Concept description − Data
generalization − Characterizations − Class comparisons − Descriptive statistical
measures.
UNIT III ASSOCIATION RULES 9
Association rule mining − Single-dimensional boolean association rules from
transactional databases − Multi level association rules from transaction databases
UNIT IV CLASSIFICATION AND CLUSTERING 12
Classification and prediction − Issues − Decision tree induction − Bayesian classification
– Association rule based − Other classification methods − Prediction − Classifier
accuracy − Cluster analysis – Types of data − Categorization of methods − Partitioning
methods − Outlier analysis.
UNIT V RECENT TRENDS 8
Multidimensional analysis and descriptive mining of complex data objects − Spatial
databases −
Multimedia databases − Time series and sequence data − Text databases − World Wide
Web −
Applications and trends in data mining.
Total: 45
TEXT BOOKS
1. Han, J. and Kamber, M., “Data Mining: Concepts and Techniques”, Harcourt India / Morgan Kaufmann, 2001.
2. Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Pearson Education, 2004.
REFERENCES
1. Sam Anahory and Dennis Murray, “Data Warehousing in the Real World”, Pearson Education, 2003.
2. David Hand, Heikki Mannila and Padhraic Smyth, “Principles of Data Mining”, PHI, 2004.
3. W.H. Inmon, “Building the Data Warehouse”, 3rd Edition, Wiley, 2003.
4. Alex Berson and Stephen J. Smith, “Data Warehousing, Data Mining and OLAP”, McGraw-Hill Edition, 2001.
5. Paulraj Ponniah, “Data Warehousing Fundamentals”, Wiley-Interscience Publication, 2003.
UNIT I BASICS OF DATA WAREHOUSING
Introduction − Data warehouse − Multidimensional data model − Data warehouse
architecture − Implementation − Further development − Data warehousing to data mining.
1.1 Introduction to Data Warehousing
A data warehouse is a collection of data marts representing historical data from
different operations in the company. This data is stored in a structure optimized
for querying and data analysis as a data warehouse. Table design, dimensions
and organization should be consistent throughout a data warehouse so that
reports or queries across the data warehouse are consistent. A data warehouse
can also be viewed as a database for historical data from different functions
within a company.
Bill Inmon coined the term Data Warehouse in 1990, which he defined in the
following way: "A warehouse is a subject-oriented, integrated, time-variant
and non-volatile collection of data in support of management's decision making
process".
• Subject Oriented: Data that gives information about a particular subject
instead of about a company's ongoing operations. Focusing on the
modelling and analysis of data for decision makers, not on daily
operations or transaction processing. It is used to provide a simple and
concise view around particular subject issues by excluding data that are
not useful in the decision support process.
• Integrated: Data that is gathered into the data warehouse from a variety
of sources and merged into a coherent whole. Data cleaning and data
integration techniques are applied. It is used to ensure consistency in
naming conventions, encoding structures, attribute measures, etc.
among different data sources. E.g., Hotel price: currency, tax, breakfast
covered, etc.
• Time-variant: All data in the data warehouse is identified with a
particular time period. The time horizon for the data warehouse is
significantly longer than that of operational systems.
o Operational database: current value data.
o Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
• Non-volatile: Data is stable in a data warehouse. More data is added
but data is never removed. Operational update of data does not occur in
the data warehouse environment. It does not require transaction
processing, recovery, and concurrency control mechanisms. It requires
only two operations in data accessing:
o initial loading of data and
o Access of data.
A data warehouse is a single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context.
It can be
• Used for decision Support
• Used to manage and control business
• Used by managers and end-users to understand the business and make
judgments
Data Warehousing is an architectural construct of information systems that
provides users with current and historical decision support information that is
hard to access or present in traditional operational data stores
Other important terminology
Enterprise Data warehouse: It collects all information about subjects
(customers, products, sales, assets, personnel) that span the entire organization
Data Mart: Departmental subsets that focus on selected subjects. A data mart is
a segment of a data warehouse that can provide data for reporting and analysis
on a section, unit, department or operation in the company. E.g. sales, payroll,
production. Data marts are sometimes complete individual data warehouses
which are usually smaller than the corporate data warehouse.
Decision Support System (DSS): Information technology to help the
knowledge worker (executive, manager, and analyst) make faster and better decisions
Drill-down: Traversing the summarization levels from highly summarized data
to the underlying current or old detail
Metadata: Data about data. It is used to describe location and description of
warehouse system components such as names, definition, structure…
Benefits of data warehousing
• Data warehouses are designed to perform well with aggregate queries
running on large amounts of data.
• The structure of data warehouses is easier for end users to navigate, understand and query against, unlike relational databases, which are primarily designed to handle large volumes of transactions.
• Data warehouses enable queries that cut across different segments of a
company's operation. E.g. production data could be compared against
inventory data even if they were originally stored in different databases
with different structures.
• Queries that would be complex in normalized databases could be easier to
build and maintain in data warehouses, decreasing the workload on
transaction systems.
• Data warehousing is an efficient way to manage and report on data that comes from a variety of sources and is non-uniform and scattered throughout a company.
• Data warehousing is an efficient way to manage demand for lots of
information from lots of users.
• Data warehousing provides the capability to analyze large amounts of
historical data for nuggets of wisdom that can provide an organization with
competitive advantage.
Operational and informational Data
1. Operational data focus on transactional functions such as bank card withdrawals and deposits. They are:
• Detailed
• Updateable
• Reflects current data
2. Informational data focus on providing answers to problems posed by decision makers. They are:
• Summarized
• Non updateable
1.2 Building a Data Warehouse
The selection of data warehouse technology - both hardware and software -
depends on many factors, such as:
• the volume of data to be accommodated,
• the speed with which data is needed,
• the history of the organization,
• which level of data is being built,
• how many users there will be,
• what kind of analysis is to be performed,
• Cost of technology, etc.
The hardware is typically mainframe, parallel, or client/server hardware. The
software that must be selected is for the basic data base manipulation of the data as
it resides on the hardware. Typically the software is either full function DBMS or
specialized data base software that has been optimized for the data warehouse.
Other software that needs to be considered is the interface software that provides
transformation and metadata capability such as PRISM Solutions Warehouse
Manager. A final piece of software that is important is the software needed for
changed data capture.
A rough sizing of data needs to be done to determine the fitness of the hardware
and software platforms. If the hardware and DBMS software are much too large
for the data warehouse, the costs of building and running the data warehouse will
be exorbitant. Even though performance will not be a problem, development and operational costs will be.
Conversely, if the hardware and DBMS software are much too small for the size of
the data warehouse, then performance of operations and the ultimate end user
satisfaction with the data warehouse will suffer. So, it is important that there be a
comfortable fit between the data warehouse and the hardware and DBMS software
that will house and manipulate the warehouse.
There are two factors required to build and use data warehouse. They are:
Business factors:
• Business users want to make decisions quickly and correctly using all
available data.
Technological factors:
• To address the incompatibility of operational data stores
• IT infrastructure is changing rapidly. Its capacity is increasing and cost is
decreasing so that building a data warehouse is easy
There are several things to be considered while building a successful data
warehouse
1.2.1 Business considerations:
Organizations interested in development of a data warehouse can choose one of the
following two approaches:
• Top - Down Approach (Suggested by Bill Inmon)
• Bottom - Up Approach (Suggested by Ralph Kimball)
a. Top - Down Approach
In the top down approach suggested by Bill Inmon, we build a centralized
repository to house corporate wide business data. This repository is called
Enterprise Data Warehouse (EDW). The data in the EDW is stored in a normalized
form in order to avoid redundancy.
The central repository for corporate wide data helps us maintain one version of
truth of the data. The data in the EDW is stored at the most detail level. The reason
to build the EDW on the most detail level is to leverage the flexibility to be used
by multiple departments and to cater for future requirements.
The disadvantages of storing data at the detail level are
1. The complexity of design increases with increasing level of detail.
2. It takes large amount of space to store data at detail level, hence increased
cost.
Once the EDW is implemented, we start building subject-area-specific data marts which contain data in a denormalized form, also called a star schema. The data in the marts are usually summarized based on the end users' analytical requirements.
The reason to denormalize the data in the mart is to provide faster access for end-user analytics. If we queried a normalized schema for the same analytics, we would end up with complex multi-level joins that would be much slower than the equivalent queries on the denormalized schema.
The top-down approach can be used when
1. The business has complete clarity on all or multiple subject areas data
warehouse requirements.
2. The business is ready to invest considerable time and money.
The advantage of using the Top Down approach is that we build a centralized
repository to cater for one version of truth for business data. This is very important
for the data to be reliable, consistent across subject areas and for reconciliation in
case of data related contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time
and initial investment. The business has to wait for the EDW to be implemented
followed by building the data marts before which they can access their reports.
b. Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach
to build a data warehouse. In this approach data marts are built separately at
different points of time as and when the specific subject area requirements are
clear. The data marts are integrated or combined together to form a data
warehouse. Separate data marts are combined through the use of conformed
dimensions and conformed facts. A conformed dimension or a conformed fact is one that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and consistent values across separate data marts. A conformed dimension means exactly the same thing with every fact table to which it is joined.
A Conformed fact has the same definition of measures, same dimensions joined to
it and at the same granularity across data marts.
The bottom up approach helps us incrementally build the warehouse by developing
and integrating data marts as and when the requirements are clear. We don’t have
to wait for knowing the overall requirements of the warehouse. We should
implement the bottom up approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity to only
one data mart.
Merits of Bottom Up approach:
• It does not require high initial costs and has a faster implementation time; hence the business can start using the marts much earlier than with the top-down approach.
Drawbacks of Bottom Up approach:
• It stores data in the de normalized format, so there would be high space
usage for detailed data.
• There is a tendency not to keep detailed data in this approach, hence losing the advantage of having detailed data.
1.2.2 Design considerations
A successful data warehouse designer must adopt a holistic approach by
considering all data warehouse components as parts of a single complex system,
and take into account all possible data sources and all known usage requirements.
Most successful data warehouses have the following common characteristics:
1. Are based on a dimensional model
2. Contain historical and current data
3. Include both detailed and summarized data
4. Consolidate disparate data from multiple sources while retaining
consistency
A data warehouse is difficult to build due to the following reasons:
• Heterogeneity of data sources
• Use of historical data
• Growing nature of the database
The data warehouse design approach must be a business-driven, continuous and iterative engineering approach. In addition to the general considerations, the following specific points are relevant to the data warehouse design:
1. Data Content
The content and structure of the data warehouse are reflected in its data model.
The data model is the template that describes how information will be organized
within the integrated warehouse framework. The data in a data warehouse must be
detailed data. It must be formatted, cleaned up and transformed to fit the
warehouse data model.
2. Meta Data
It defines the location and contents of data in the warehouse. Meta data is
searchable by users to find definitions or subject areas. In other words, it must
provide decision support oriented pointers to warehouse data and thus provides a
logical link between warehouse data and decision support applications.
3. Data Distribution
One of the biggest challenges when designing a data warehouse is the data
placement and distribution strategy. Data volumes continue to grow.
Therefore, it becomes necessary to know how the data should be divided across
multiple servers and which users should get access to which types of data. The
data can be distributed based on the subject area, location (geographical region), or
time (current, month, year).
4. Tools
A number of tools are available that are specifically designed to help in the
implementation of the data warehouse. All selected tools must be compatible with
the given data warehouse environment and with each other. All tools must be able
to use a common Meta data repository.
Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre calculations in the fact table
6. Rounding out the dimension table
7. Choosing the duration of the db
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models
1.2.3 Technical Considerations
A number of technical issues are to be considered when designing a data
warehouse environment. These issues include:
• The hardware platform that would house the data warehouse
• The DBMS that supports the warehouse database
• The communication infrastructure that connects data marts, operational
systems and end users
• The hardware and software to support meta data repository
• The systems management framework that enables centralized management
and administration of the entire environment.
1.2.4 Implementation Considerations
The following logical steps needed to implement a data warehouse:
• Collect and analyze business requirements
• Create a data model and a physical design
• Define data sources
• Choose the database technology and platform
• Extract the data from the operational database, transform it, clean it up and load it into the warehouse (a minimal sketch follows this list)
• Choose database access and reporting tools
• Choose database connectivity software
• Choose data analysis and presentation software
• Update the data warehouse
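The extract-transform-load step listed above can be illustrated with a short script. The following is a minimal Python sketch, not a prescribed implementation: the file name operational_sales.csv, the column names (date, product, amount) and the target table sales_fact are invented for illustration.

# Minimal ETL sketch: extract rows from a hypothetical operational export,
# clean and transform them, then load them into a warehouse table.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    cleaned = []
    for r in rows:
        if not r.get("amount"):                      # drop incomplete records
            continue
        cleaned.append({
            "sale_date": r["date"].strip(),
            "product": r["product"].strip().upper(), # standardize encoding
            "amount": float(r["amount"]),
        })
    return cleaned

def load(rows, db="warehouse.db"):
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                "(sale_date TEXT, product TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_fact VALUES (:sale_date, :product, :amount)", rows)
    con.commit()
    con.close()

load(transform(extract("operational_sales.csv")))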
Access Tools
Data warehouse implementation relies on selecting suitable data access tools. The
best way to choose this is based on the type of data and the kind of access it
permits for a particular user. The following lists the various types of data that can
be accessed:
• Simple tabular form data
• Ranking data
• Multivariable data
• Time series data
• Graphing, charting and pivoting data
• Complex textual search data
• Statistical analysis data
• Data for testing of hypothesis, trends and patterns
• Predefined repeatable queries
• Ad hoc user specified queries
• Reporting and analysis data
• Complex queries with multiple joins, multi level sub queries and
sophisticated search criteria
Data Extraction, Clean Up, Transformation and Migration
Proper attention must be paid to data extraction, which represents a success factor for a data warehouse architecture. When implementing a data warehouse the
following selection criteria should be considered:
• Timeliness of data delivery to the warehouse
• The tool must have the ability to identify the particular data so that it can be read by the conversion tool
• The tool must support flat files and indexed files, since much corporate data is still stored in these formats
• The tool must have the capability to merge data from multiple data stores
• The tool should have specification interface to indicate the data to be
extracted
• The tool should have the ability to read data from data dictionary
• The code generated by the tool should be completely maintainable
• The tool should permit the user to extract the required data
• The tool must have the facility to perform data type and character set
translation
• The tool must have the capability to create summarization, aggregation and
derivation of records
• The data warehouse database system must be able to load data directly from these tools
Data Placement Strategies
As a data warehouse grows, there are at least two options for data placement. One
is to put some of the data in the data warehouse into another storage media. The
second option is to distribute the data in the data warehouse across multiple
servers.
It considers Data Replication and Database gateways.
Metadata
Meta data can define all data elements and their attributes, data sources and timing
and the rules that govern data use and data transformations.
User Sophistication Levels
The users of data warehouse data can be classified on the basis of their skill level
in accessing the warehouse. There are three classes of users:
Casual users: are most comfortable in retrieving information from warehouse in
pre defined formats and running pre existing queries and reports. These users do
not need tools that allow for building standard and ad hoc reports
Power Users: can use pre defined as well as user defined queries to create simple
and ad hoc reports. These users can engage in drill down operations. These users
may have the experience of using reporting and query tools.
Expert users: These users tend to create their own complex queries and perform
standard analysis on the info they retrieve. These users have the knowledge about
the use of query and report tools.
1.3 Multi-Tier Architecture
The functions of a data warehouse are based on relational database technology, implemented in a parallel manner.
There are two advantages of having parallel relational data base technology for
data warehouse:
• Linear speed up: refers to the ability to increase the number of processors in order to reduce the response time
• Linear scale up: refers to the ability to provide the same performance on the same requests as the database size increases
1.3.1 Types of parallelism
There are two types of parallelism:
• Inter Query Parallelism: In which different server threads or processes
handle multiple requests at the same time.
• Intra Query Parallelism: This form of parallelism decomposes the serial
SQL query into lower level operations such as scan, join, sort etc. Then
these lower level operations are executed concurrently in parallel.
Intra query parallelism can be done in either of two ways:
• Horizontal Parallelism: the data base is partitioned across multiple disks
and parallel processing occurs within a specific task that is performed
concurrently on different processors against different set of data.
• Vertical Parallelism: This occurs among different tasks. All query
components such as scan, join, sort etc are executed in parallel in a
pipelined fashion. In other words, an output from one task becomes an input
into another task.
1.3.2 Database Architecture
There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything Architecture
2. Shared disk architecture
3. Shared nothing architecture
1. Shared Memory Architecture
Tightly coupled shared memory systems have the following characteristics:
• Multiple Processor Units share memory.
• Each Processor Unit has full access to all shared memory through a
common bus.
• Communication between nodes occurs via shared memory.
• Performance is limited by the bandwidth of the memory bus.
Fig. 1.3.2.1 Shared Memory Architecture
Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple
SMP nodes can be used with Oracle Parallel Server in a tightly coupled system,
where memory is shared by the multiple Processor Units, and is accessible by all
the Processor Units through a memory bus. Examples of tightly coupled systems
include the Pyramid, Sequent, and Sun SparcServer.
Performance is limited in a tightly coupled system by the factors:
• Memory bandwidth
• Processor Unit to Processor Unit communication bandwidth
• Memory availability
• I/O bandwidth and
• Bandwidth of the common bus.
Parallel processing advantages of shared memory systems are these:
• Memory access is cheaper than inter-node communication. This means that
internal synchronization is faster than using the Lock Manager.
• Shared memory systems are easier to administer than a cluster.
A disadvantage of shared memory systems for parallel processing is as follows:
• Scalability is limited by bus bandwidth and latency, and by available
memory.
2. Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated
in following figure, have the following characteristics:
• Each node consists of one or more Processor Units and associated memory.
• Memory is not shared between nodes.
• Communication occurs over a common high-speed bus.
• Each node has access to the same disks and other resources.
• A node can be an SMP if the hardware supports it.
• Bandwidth of the high-speed bus limits the number of nodes of the system.
Fig. 1.3.2.2 Shared Disk Architecture
Each node has its own data cache, since memory is not shared among the nodes. Cache consistency must be maintained across the nodes, and a lock manager is needed to maintain it. Additionally, instance locks using the DLM (Distributed Lock Manager) on the Oracle level must be maintained to ensure that all nodes in the cluster see identical data.
There is additional overhead in maintaining the locks and ensuring that the data
caches are consistent. The performance impact is dependent on the hardware and
software components, such as the bandwidth of the high-speed bus through which
the nodes communicate, and DLM performance.
Merits of shared disk systems:
• Shared disk systems permit high availability. All data is accessible even if
one node dies.
• These systems have the concept of one database, which is an advantage
over shared nothing systems.
• Shared disk systems provide for incremental growth.
Drawbacks of shared disk systems:
• Inter-node synchronization is required, involving DLM overhead and
greater dependency on high-speed interconnect.
• If the workload is not partitioned well, there may be high synchronization
overhead.
• There is operating system overhead of running shared disk software.
3. Shared Nothing Architecture
Shared nothing systems are typically loosely coupled. In shared nothing systems
only one CPU is connected to a given disk. If a table or database is located on that
disk, access depends entirely on the Processor Unit which owns it. Shared nothing
systems can be represented as follows:
Fig. 1.3.2.3 Distributed Memory Architecture
Shared nothing systems are concerned with access to disks, not access to memory.
Nonetheless, adding more PUs and disks can improve scaleup. Oracle Parallel
Server can access the disks on a shared nothing system as long as the operating
system provides transparent disk access, but this access is expensive in terms of
latency.
Advantages of Shared nothing systems:
• Shared nothing systems provide for incremental growth.
• System growth is practically unlimited.
• MPPs are good for read-only databases and decision support applications.
• Failure is local: if one node fails, the others stay up.
Drawbacks of Shared nothing systems:
• More coordination is required.
• More overhead is required for a process working on a disk belonging to
another node.
• If there is a heavy workload of updates or inserts, as in an online transaction
processing system, it may be worthwhile to consider data-dependent routing
to alleviate contention.
1.4 Data Warehousing Schema
There are three basic schemas that are used in dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
1.4.1 Star schema
The multidimensional view of data that is expressed using relational data base
semantics is provided by the data base schema design called star schema. The
basic premise of the star schema is that information can be classified into two groups:
• Facts
• Dimension
Star schema has one large central table (fact table) and a set of smaller tables
(dimensions) arranged in a radial pattern around the central table.
• Facts are core data element being analyzed
• Dimensions are attributes about the facts.
The determination of which schema model should be used for a data warehouse
should be based upon the analysis of project requirements, accessible tools and
project team preferences.
Fig. 1.4.1.1 Star Schema
Star schema has points radiating from a center. The center of the star consists of
fact table and the points of the star are the dimension tables. Usually the fact tables
in a star schema are in third normal form (3NF) whereas dimensional tables are de-
normalized. Star schema is the simplest architecture and is most commonly used
and recommended by Oracle.
Fact Tables
A fact table is a table that contains summarized numerical and historical data
(facts) and a multipart index composed of foreign keys from the primary keys of
related dimension tables.
A fact table typically has two types of columns: foreign keys to dimension tables and measures, i.e., columns that contain numeric facts. A fact table can contain fact data at a detailed or an aggregated level.
Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month,
quarter, year), Region dimension (profit by country, state, city), Product dimension
(profit for product1, product2).
A dimension is a structure usually composed of one or more hierarchies that
categorizes data. If a dimension does not have hierarchies and levels, it is called a flat
dimension or list. The primary keys of each of the dimension tables are part of the
composite primary key of the fact table. Dimensional attributes help to describe
the dimensional value. They are normally descriptive, textual values. Dimension
tables are generally smaller in size than the fact table.
Measures
Measures are numeric data based on columns in a fact table. They are the primary
data which end users are interested in. E.g. a sales fact table may contain a profit
measure which represents profit on each sale.
Cubes are data processing units composed of fact tables and dimensions from the
data warehouse. They provide multidimensional views of data, querying and
analytical capabilities to clients.
The main characteristics of star schema:
• Simple structure and easy to understand.
• Good query effectiveness, owing to the small number of tables to join.
• Relatively long loading time for dimension tables because of de-normalization; redundant data can cause the tables to become large.
• The most commonly used in the data warehouse implementations.
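To make the fact/dimension/measure vocabulary concrete, here is a minimal Python sketch of a star schema in miniature; the tables, keys and values below are invented for illustration and simply stand in for real warehouse tables.

# Toy star schema: one fact table and two dimension tables held as plain dicts.
time_dim = {1: {"month": "Jan", "year": 2013}, 2: {"month": "Feb", "year": 2013}}
product_dim = {10: {"name": "Pen"}, 20: {"name": "Notebook"}}

# Fact rows carry foreign keys to the dimensions plus numeric measures.
sales_fact = [
    {"time_key": 1, "product_key": 10, "units": 5, "profit": 12.0},
    {"time_key": 1, "product_key": 20, "units": 2, "profit": 8.5},
    {"time_key": 2, "product_key": 10, "units": 7, "profit": 16.0},
]

# "Profit by month": join each fact row to its time-dimension row and aggregate.
profit_by_month = {}
for row in sales_fact:
    month = time_dim[row["time_key"]]["month"]
    profit_by_month[month] = profit_by_month.get(month, 0.0) + row["profit"]

print(profit_by_month)   # {'Jan': 20.5, 'Feb': 16.0}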
1.4.2 Snowflake schema: is the result of decomposing one or more of the
dimensions. The many-to-one relationships among sets of attributes of a dimension can be separated into new dimension tables, forming a hierarchy. The decomposed
snowflake structure visualizes the hierarchical structure of dimensions very well.
1.4.3 Fact constellation schema: For each star schema it is possible to construct
fact constellation schema. The fact constellation architecture contains multiple fact
tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated
design because many variants for particular kinds of aggregation must be
considered and selected.
1.5 Multidimensional Data Model
The multidimensional data model views data as a cube. The table on the left contains detailed sales data by product, market and time. The cube on the right associates sales numbers (units sold) with the dimensions product type, market and time, with the unit variables organized as cells in an array.
This cube can be expanded to include another array, price, which can be associated with all or only some dimensions. As the number of dimensions increases, the number of cube cells increases exponentially.
Dimensions are hierarchical in nature i.e. time dimension may contain hierarchies
for years, quarters, months, week and day. GEOGRAPHY may contain country,
state, city etc.
Fig. 1.5.1 Multidimensional cube
Each side of the cube represents one of the elements of the question. The x-axis
represents the time, the y-axis represents the products and the z-axis represents
different centers. The cells in the cube represent the number of products sold or the price of the items.
When the size of a dimension increases, the size of the cube also increases exponentially. The response time of the cube depends on the size of the cube.
1.5.1 Operations in Multidimensional Data Model:
• Aggregation (roll-up)
– dimension reduction: e.g., total sales by city
– summarization over aggregate hierarchy: e.g., total sales by city and
year -> total sales by region and by year
• Selection (slice) defines a sub cube
– e.g., sales where city = Palo Alto and date = 1/15/96
• Navigation to detailed data (drill-down)
– e.g., (sales - expense) by city, top 3% of cities by average income
• Visualization Operations (e.g., Pivot or dice)
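The aggregation (roll-up) and selection (slice) operations just listed can be sketched in a few lines of Python; the cube cells below are invented toy data, not a real warehouse.

from collections import defaultdict

# Toy cube cells: (product, city, year) -> units sold.
cells = {
    ("Pen", "Chennai", 2012): 100, ("Pen", "Chennai", 2013): 120,
    ("Pen", "Mumbai", 2013): 80,  ("Notebook", "Chennai", 2013): 60,
}

# Roll-up: aggregate away the city dimension (total units by product and year).
rollup = defaultdict(int)
for (product, city, year), units in cells.items():
    rollup[(product, year)] += units

# Slice: fix one dimension (year = 2013) to obtain a sub-cube.
slice_2013 = {key: units for key, units in cells.items() if key[2] == 2013}

print(dict(rollup))
print(slice_2013)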
1.6 OLAP Operations
OLAP stands for Online Analytical Processing. It uses database tables (fact and
dimension tables) to enable multidimensional viewing, analysis and querying of
large amounts of data. OLAP technology could provide management with fast
answers to complex queries on their operational data or enable them to analyze
their company's historical data for trends and patterns.
Online Analytical Processing (OLAP) applications and tools are those that are
designed to ask “complex queries of large multidimensional collections of data.”
Operations:
 Roll up (drill-up): summarize data
 by climbing up hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
 from higher level summary to lower level summary or detailed data,
or introducing new dimensions
 Slice and dice:
 project and select
 Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D planes.
 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its back-end
relational tables (using SQL)
1.6.1 OLAP Guidelines
Dr. E.F. Codd, the “father” of the relational model, created a list of rules to deal with OLAP systems. Users should prioritize these rules according to their needs to match their business requirements. These rules are:
1) Multidimensional conceptual view: The OLAP should provide an
appropriate multidimensional Business model that suits the Business
problems and Requirements.
2) Transparency: The OLAP tool should provide transparency to the input data
for the users.
3) Accessibility: The OLAP tool should access only the data required for the analysis needed.
4) Consistent reporting performance: The size of the database should not affect reporting performance in any way.
5) Client/server architecture: The OLAP tool should use the client server
architecture to ensure better performance and flexibility.
6) Generic dimensionality: Every dimension should be equivalent in both its structure and operational capabilities.
7) Dynamic sparse matrix handling: The OLAP tool should be able to manage sparse matrices and so maintain the level of performance.
8) Multi-user support: The OLAP tool should allow several users to work concurrently.
9) Unrestricted cross-dimensional operations: The OLAP tool should be able
to perform operations across the dimensions of the cube.
10)Intuitive data manipulation. “Consolidation path re-orientation, drilling
down across columns or rows, zooming out, and other manipulation
inherent in the consolidation path outlines should be accomplished via
direct action upon the cells of the analytical model, and should neither
require the use of a menu nor multiple trips across the user interface.”
11)Flexible reporting: It is the ability of the tool to present the rows and
column in a manner suitable to be analyzed.
12)Unlimited dimensions and aggregation levels: This depends on the kind of
Business, where multiple dimensions and defining hierarchies can be made.
In addition to these guidelines an OLAP system should also support:
• Comprehensive database management tools: This gives database management the ability to control distributed businesses.
• The ability to drill down to the detail source-record level, which requires that the OLAP tool allow smooth transitions in the multidimensional database.
• Incremental database refresh: The OLAP tool should provide partial refresh.
• Structured Query Language (SQL interface): the OLAP system should be
able to integrate effectively in the surrounding enterprise environment.
1.7 Data Warehouse Implementation
1.7.1 Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
 The bottom-most cuboid is the base cuboid
 The top-most cuboid (apex) contains only one cell
 How many cuboids in an n-dimensional cube with L levels? In total, T = (L1 + 1) × (L2 + 1) × … × (Ln + 1), where Li is the number of levels of dimension i (the extra 1 accounts for the virtual “all” level).
Materialization of data cube
 Materialize every (cuboid) (full materialization), none (no
materialization), or some (partial materialization)
 Selection of which cuboids to materialize
 Based on size, sharing, access frequency, etc.
Cube definition and computation in DMQL
define cube sales[item, city, year]: sum(sales_in_dollars)
compute cube sales
Transform it into a SQL-like language (with a new operator cube by,
introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
This requires computing the following group-bys:
(item, city, year),
(item, city), (item, year), (city, year),
(item), (city), (year),
()
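A minimal Python sketch of full materialization is shown below: it enumerates every group-by (cuboid) of a small fact table and computes SUM(amount) for each, mirroring what the cube by operator asks the system to do. The rows and amounts are invented for illustration.

from itertools import combinations
from collections import defaultdict

# Toy fact rows with three dimensions and one measure.
rows = [
    {"item": "Pen", "city": "Chennai", "year": 2013, "amount": 10},
    {"item": "Pen", "city": "Mumbai",  "year": 2013, "amount": 15},
    {"item": "Ink", "city": "Chennai", "year": 2012, "amount": 7},
]
dims = ("item", "city", "year")

# Full materialization: compute SUM(amount) for every subset of the dimensions,
# i.e. all 2^n group-bys, from (item, city, year) down to the apex cuboid ().
cube = {}
for k in range(len(dims) + 1):
    for group in combinations(dims, k):
        agg = defaultdict(int)
        for r in rows:
            agg[tuple(r[d] for d in group)] += r["amount"]
        cube[group] = dict(agg)

print(cube[("item",)])   # {('Pen',): 25, ('Ink',): 7}
print(cube[()])          # {(): 32} -- the apex cuboid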
1.7.2 Indexing OLAP Data: Join Index
 Join index: JI(R-id, S-id), where R(R-id, …) ⋈ S(S-id, …)
 Traditional indices map the values to a list of record ids
 It materializes relational join in JI file and speeds up relational join
— a rather costly operation
 In data warehouses, a join index relates the values of the dimensions of a star
schema to rows in the fact table.
 E.g. fact table: Sales and two dimensions city and product
 A join index on city maintains for each distinct city a list of R-
IDs of the tuples recording the Sales in the city
 Join indices can span multiple dimensions
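The idea of a join index can be sketched as a dictionary from dimension values to fact-table row ids; the fact rows below are invented, and the structure simply illustrates why the index avoids a full fact-table scan.

from collections import defaultdict

# Fact-table rows identified by row id (R-ID); city is a dimension value.
sales_fact = {
    101: {"city": "Chennai", "product": "Pen", "amount": 10},
    102: {"city": "Mumbai",  "product": "Pen", "amount": 15},
    103: {"city": "Chennai", "product": "Ink", "amount": 7},
}

# Join index on city: each distinct city maps to the list of fact-row ids.
join_index = defaultdict(list)
for rid, row in sales_fact.items():
    join_index[row["city"]].append(rid)

# A city lookup touches only the indexed rows instead of scanning the fact table.
chennai_total = sum(sales_fact[rid]["amount"] for rid in join_index["Chennai"])
print(chennai_total)   # 17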
1.8 Data Warehouse to Data Mining
1.8.1 Data Warehouse Usage
 Three kinds of data warehouse applications
 Information processing
 supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
 Analytical processing
 multidimensional analysis of data warehouse data
 supports basic OLAP operations, slice-dice, drilling, pivoting
 Data mining
 knowledge discovery from hidden patterns
 supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
Differences among the three tasks
1.8.2 From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
 Why online analytical mining?
 High quality of data in data warehouses
 DW contains integrated, consistent, cleaned data
 Available information processing structure surrounding data
warehouses
 ODBC, OLEDB, Web accessing, service facilities, reporting
and OLAP tools
 OLAP-based exploratory data analysis
 mining with drilling, dicing, pivoting, etc.
 On-line selection of data mining functions
 integration and swapping of multiple mining functions,
algorithms, and tasks.
1.8.3 Architecture of OLAM
UNIT II DATA PREPROCESSING, LANGUAGE, ARCHITECTURES,
CONCEPT DESCRIPTION
Why preprocessing − Cleaning − Integration − Transformation − Reduction −
Discretization – Concept hierarchy generation − Data mining primitives − Query
language − Graphical user interfaces − Architectures − Concept description − Data
generalization − Characterizations − Class comparisons − Descriptive statistical
measures.
2.1 Data preprocessing
Data preprocessing describes any type of processing performed on raw data to
prepare it for another processing procedure. Commonly used as a preliminary data
mining practice, data preprocessing transforms the data into a format that will be
more easily and effectively processed for the purpose of the user.
We need data preprocessing because data in the real world are dirty: they can be incomplete, noisy and inconsistent. Such data need to be preprocessed in order to improve the quality of the data and, in turn, the quality of the mining results.
• Without quality data there are no quality mining results; quality decisions must be based on quality data.
• If there is much irrelevant and redundant information present or noisy and
unreliable data, then knowledge discovery during the training phase is more
difficult.
• Incomplete data may come from
o “Not applicable” data value when collected
o Different considerations between the time when the data was collected
and when it is analyzed.
o Due to Human/hardware/software problems
o e.g., occupation=“ ”.
• Noisy data (incorrect values) may come from
o Faulty data collection by instruments
o Human or computer error at data entry
o Errors in data transmission; such data may contain errors or outliers, e.g., Salary=“-10”
• Inconsistent data may come from
o Different data sources
o Functional dependency violation (e.g., modify some linked data)
o Having discrepancies in codes or names. e.g., Age=“42”
Birthday=“03/07/1997”
2.5.1 Major Tasks in Data Preprocessing
• Data cleaning
o Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
o Integration of multiple databases, data cubes, or files
• Data transformation
o Normalization and aggregation
• Data reduction
o Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
o Part of data reduction but with particular importance, especially for
numerical data
Fig. 2.5.1.1 Forms of Data Preprocessing
2.2 Data cleaning:
Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
i. Missing Values:
The various methods for handling the problem of missing values in data tuples
include:
(a) Ignoring the tuple: When the class label is missing the tuple can be
ignored. This method is not very effective unless the tuple contains several
attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
(b) Manually filling in the missing value: In general, this approach is
time-consuming and may not be a reasonable task for large data sets with
many missing values, especially when the value to be filled in is not easily
determined.
(c) Using a global constant to fill in the missing value: Replace all
missing attribute values by the same constant, such as a label like
“Unknown,” or −∞. If missing values are replaced by, say, “Unknown,”
then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common — that of “Unknown” .
(d) Using the attribute mean for quantitative (numeric) values or
attribute mode for categorical (nominal) values, for all samples
belonging to the same class as the given tuple: For example, if classifying
customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that
of the given tuple.
(e) Using the most probable value to fill in the missing value: This may
be determined with regression, inference-based tools using Bayesian
formalism, or decision tree induction. For example, using the other
customer attributes in your data set, you may construct a decision tree to
predict the missing values for income.
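Method (d) above, replacing a missing numeric value by the attribute mean of tuples in the same class, can be sketched as follows; the credit-risk tuples and income figures are invented for illustration.

# Fill a missing 'income' with the mean income of tuples in the same risk class.
tuples = [
    {"risk": "low",  "income": 50000},
    {"risk": "low",  "income": 60000},
    {"risk": "high", "income": 20000},
    {"risk": "low",  "income": None},   # missing value to be filled
]

def class_mean(rows, cls, attr):
    values = [r[attr] for r in rows if r["risk"] == cls and r[attr] is not None]
    return sum(values) / len(values)

for r in tuples:
    if r["income"] is None:
        r["income"] = class_mean(tuples, r["risk"], "income")

print(tuples[-1])   # {'risk': 'low', 'income': 55000.0}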
ii. Noisy data:
Noise is a random error or variance in a measured variable. Data smoothing techniques are used to remove such noisy data.
Several Data smoothing techniques used:
a. Binning Method
b. Regression Method
c. Cluster Method
1. Binning methods: Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it. The sorted values are distributed into a number of “buckets”, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
In this technique,
1. The data are first sorted.
2. The sorted list is partitioned into equi-depth bins.
3. One can then smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
a. Smoothing by bin means: Each value in the bin is replaced by the
mean value of the bin.
b. Smoothing by bin medians: Each value in the bin is replaced by the
bin median.
c. Smoothing by boundaries: The min and max values of a bin are
identified as the bin boundaries. Each bin value is replaced by the
closest boundary value.
• Example: Binning Methods for Data Smoothing
o Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
o Partition into (equi-depth) bins (depth of 4, since each bin contains four values):
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
o Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
o Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of
the bin. For example, the mean of the values 4, 8, 9, and 15 in Bin 1 is 9. Therefore,
each original value in this bin is replaced by the value 9.
Smoothing by bin medians can be employed, in which each bin value is replaced
by the bin median. In smoothing by bin boundaries, the minimum and maximum
values in a given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value.
Suppose that the data for analysis include the attribute age. The age values for the
data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25,
25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3.
The following steps are required to smooth the above data using smoothing by bin
means with a bin depth of 3.
• Step 1: Sort the data. (This step is not required here as the data are already
sorted.)
• Step 2: Partition the data into equidepth bins of depth 3.
Bin 1: 13, 15, 16 Bin 2: 16, 19, 20 Bin 3: 20, 21, 22
Bin 4: 22, 25, 25 Bin 5: 25, 25, 30 Bin 6: 33, 33, 35
Bin 7: 35, 35, 35 Bin 8: 36, 40, 45 Bin 9: 46, 52, 70
• Step 3: Calculate the arithmetic mean of each bin.
• Step 4: Replace each of the values in each bin by the arithmetic mean
calculated for the bin.
Bin 1: 14, 14, 14 Bin 2: 18, 18, 18 Bin 3: 21, 21, 21
Bin 4: 24, 24, 24 Bin 5: 26, 26, 26 Bin 6: 33, 33, 33
Bin 7: 35, 35, 35 Bin 8: 40, 40, 40 Bin 9: 56, 56, 56
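The price example above can be reproduced with a short Python sketch of equi-depth binning, smoothing by bin means and smoothing by bin boundaries; only the price list from the text is used here.

# Equi-depth binning and smoothing for the price example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
depth = 4                                                # values per bin
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of min/max of its bin.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]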
2. Regression: smooth by fitting the data to regression functions.
• Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other.
Fig. 2.5.1.2 Regression
• Multiple linear regression is an extension of linear regression, where more than
two variables are involved and the data are fit to a multidimensional surface.
Using regression to find a mathematical equation to fit the data helps smooth out
the noise.
3. Clustering: Outliers in the data may be detected by clustering, where similar
values are organized into groups, or ‘clusters’. Values that fall outside of the set of
clusters may be considered outliers.
Fig. 2.5.1.3 Clustering
iii. Data Cleaning Process:
• Field overloading is a source of errors that typically occurs when developers squeeze new attribute definitions into unused portions of already defined attributes.
• A unique rule says that each value of the given attribute must be different from all other values of that attribute.
• A consecutive rule says that there can be no missing values between the lowest and highest values of the attribute and that all values must also be unique.
• Null rule specifies the use of blanks, question marks, special characters or
other strings that may indicate the null condition and how such values should
be handled.
2.3 Data Integration
It combines data from multiple sources into a coherent store. There are number of
issues to consider during data integration.
Issues:
• Schema integration: refers integration of metadata from different sources.
• Entity identification problem: identifying whether an entity in one data source is the same as an entity in another. For example, customer_id in one database and customer_no in another database may refer to the same entity.
• Detecting and resolving data value conflicts: Attribute values from
different sources can be different due to different representations, different
scales. E.g. metric vs. British units
• Redundancy: Redundancy can occur due to the following reasons:
• Object identification: the same attribute may have different names in different databases.
• Derived data: one attribute may be derived from another attribute.
• Correlation analysis is used to detect the redundancy.
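Correlation analysis for detecting redundancy can be sketched with the Pearson correlation coefficient; the two income attributes below are invented, with the second derived from the first so that the coefficient comes out close to 1.

import math

# Pearson correlation between two numeric attributes; a value near +1 or -1
# suggests that one attribute is largely redundant given the other.
def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

annual_income = [30000, 45000, 60000, 80000]
monthly_income = [2500, 3750, 5000, 6667]   # derived attribute
print(round(pearson(annual_income, monthly_income), 4))   # approximately 1.0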
2.4 Data Transformation
In data transformation, the data are transformed or consolidated into forms
appropriate for mining.
Data transformation can involve the following:
• Smoothing is used to remove noise from the data. It includes binning,
regression, and clustering.
• Aggregation, where summary operations (such as sum, count or average) are applied to the data.
o For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts.
• Generalization of the data, where low-level or “primitive” (raw) data are
replaced by higher-level concepts through the use of concept hierarchies. For
example, categorical attributes, like street, can be generalized to higher-level
concepts, like city or country.
• Normalization is used to scale the attribute data to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0 (a minimal sketch follows this list).
• Attribute construction (or feature construction) is used to construct new attributes, which can be added to the given set of attributes to help the mining process.
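As promised in the normalization item above, here is a minimal Python sketch of min-max normalization to the range [0.0, 1.0]; the income values are invented for illustration.

# Min-max normalization of an attribute to a specified range (default [0.0, 1.0]).
def min_max(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 30000, 54000, 98000]
print(min_max(incomes))   # [0.0, 0.2093..., 0.4883..., 1.0]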
2.5 Data Reduction
Data reduction is a technique used to obtain a reduced representation of the data set.
Various Strategies used for data reduction:
1. Data cube aggregation uses aggregation operations that can be applied to the
data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant or redundant
attributes or dimensions may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the
data set size.
4. Numerosity reduction, where the data are replaced or estimated by smaller data
representations such as parametric models or nonparametric methods such as
clustering, sampling, and the use of histograms.
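Two of the numerosity-reduction ideas above, sampling and histograms, can be sketched as follows; the synthetic attribute values are generated on the fly and the bucket count of 10 is an arbitrary choice.

import random

data = [random.gauss(50, 15) for _ in range(10000)]   # synthetic attribute values

# Numerosity reduction 1: simple random sampling without replacement.
sample = random.sample(data, 100)

# Numerosity reduction 2: an equal-width histogram (bucket counts replace raw data).
def equal_width_histogram(values, n_buckets=10):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        counts[min(int((v - lo) / width), n_buckets - 1)] += 1
    return counts

print(len(sample), equal_width_histogram(data))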
2.6 Data Discretization
In data discretization, raw data values for attributes are replaced by ranges or higher conceptual levels.
The various methods used in data discretization are binning, histogram analysis, entropy-based discretization, interval merging by χ² (chi-square) analysis, and clustering.
 Three types of attributes:
 Nominal — values from an unordered set
 Ordinal — values from an ordered set
 Continuous — real numbers
 Discretization:
 divide the range of a continuous attribute into intervals
 Some classification algorithms only accept categorical attributes.
 Reduce data size by discretization.
Concept hierarchies
 reduce the data by collecting and replacing low level concepts (such
as numeric values for the attribute age) by higher level concepts
(such as young, middle-aged, or senior).
 Prepare for further analysis
 Binning
 Histogram analysis
 Clustering analysis
 Entropy-based discretization
 Segmentation by natural partitioning
Entropy-Based Discretization
 Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
 The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
 The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., when the information gain Ent(S) − E(S, T) falls below a small threshold.
 Experiments show that it may reduce data size and improve classification
accuracy
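A minimal Python sketch of entropy-based discretization follows: it evaluates every candidate boundary T and returns the one that minimizes the weighted entropy E(S, T) defined above. The ages and class labels are invented toy data.

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split(values, labels):
    # Return the boundary T minimizing |S1|/|S|*Ent(S1) + |S2|/|S|*Ent(S2).
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2        # candidate boundary
        s1 = [lab for v, lab in pairs if v <= t]
        s2 = [lab for v, lab in pairs if v > t]
        e = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(pairs)
        if e < best_e:
            best_t, best_e = t, e
    return best_t

ages = [13, 15, 16, 19, 35, 40, 45, 52]
buys = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
print(best_split(ages, buys))   # 27.0, which splits the two classes cleanly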
Concept Hierarchy Generation for Categorical Data
 Specification of a partial ordering of attributes explicitly at the schema level
by users or experts
 Specification of a portion of a hierarchy by explicit data grouping
 Specification of a set of attributes, but not of their partial ordering
 Specification of only a partial set of attributes
2.7 Data Mining Primitives
 Finding all the patterns autonomously in a database? — unrealistic because
the patterns could be too many but uninteresting
 Data mining should be an interactive process
 User directs what to be mined
 Users must be provided with a set of primitives to be used to communicate
with the data mining system
 Incorporating these primitives in a data mining query language
 More flexible user interaction
 Foundation for design of graphical user interface
 Standardization of data mining industry and practice
Task-relevant data
 Database or data warehouse name
 Database tables or data warehouse cubes
 Condition for data selection
 Relevant attributes or dimensions
 Data grouping criteria
Types of knowledge to be mined
 Characterization
 Discrimination
 Association
 Classification/prediction
 Clustering
 Outlier analysis
 Other data mining tasks
Background knowledge: Concept Hierarchies
 Schema hierarchy
 E.g., street < city < province_or_state < country
 Set-grouping hierarchy
 E.g., {20-39} = young, {40-59} = middle_aged
 Operation-derived hierarchy
 email address: login-name < department < university < country
 Rule-based hierarchy
 low_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 -
P2) < $50
Measurements of Pattern Interestingness
 Simplicity
e.g., (association) rule length, (decision) tree size
 Certainty
e.g., confidence, P(A|B) = n(A and B)/ n (B), classification reliability or
accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
 Utility
potential usefulness, e.g., support (association), noise threshold
(description)
 Novelty
not previously known, surprising (used to remove redundant rules, e.g., the Canada vs. Vancouver rule implication support ratio).
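The certainty and utility measures above (confidence and support) can be computed directly from transaction data; the toy transactions and the rule {milk} => {bread} below are invented for illustration.

# Support and confidence of an association rule A => B over toy transactions.
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread"}, {"milk", "butter"}, {"milk", "bread"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"milk"}, {"bread"}
confidence = support(A | B) / support(A)   # n(A and B) / n(A)
print(support(A | B), confidence)          # 0.6 0.75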
2.8 Data Mining Query Language (DMQL)
 Motivation
 A DMQL can provide the ability to support ad-hoc and interactive
data mining
 By providing a standardized language like SQL
 Hope to achieve a similar effect like that SQL has on
relational database
 Foundation for system development and evolution
 Facilitate information exchange, technology transfer,
commercialization and wide acceptance
 Design
 DMQL is designed with the primitives described earlier
Syntax for DMQL
 Syntax for specification of
 task-relevant data
 the kind of knowledge to be mined
 concept hierarchy specification
 interestingness measure
 pattern presentation and visualization
 Putting it all together — a DMQL query
Syntax for task-relevant data specification
 use database database_name, or use data warehouse data_warehouse_name
 from relation(s)/cube(s) [where condition]
 in relevance to att_or_dim_list
 order by order_list
 group by grouping_list
 having condition
Syntax for specifying the kind of knowledge to be mined
 Characterization
Mine_Knowledge_Specification ::=
mine characteristics [as pattern_name]
analyze measure(s)
 Discrimination
Mine_Knowledge_Specification ::=
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
 Association
Mine_Knowledge_Specification ::=
mine associations [as pattern_name]
Classification
Mine_Knowledge_Specification ::=
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
Prediction
Mine_Knowledge_Specification ::=
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
Syntax for concept hierarchy specification
 To specify what concept hierarchies to use
use hierarchy <hierarchy> for <attribute_or_dimension>
 We use different syntax to define different type of hierarchies
 schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]
 set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
Syntax for interestingness measure specification
 Interestingness measures and thresholds can be specified by the user with
the statement:
with <interest_measure_name> threshold = threshold_value
 Example:
with support threshold = 0.05
with confidence threshold = 0.7
2.9 Designing Graphical User Interfaces based on a data
mining query language
 What tasks should be considered in the design GUIs based on a data mining
query language?
 Data collection and data mining query composition
 Presentation of discovered patterns
 Hierarchy specification and manipulation
 Manipulation of data mining primitives
 Interactive multilevel mining
 Other miscellaneous information
2.10 Data Mining System Architectures
 Coupling data mining system with DB/DW system
 No coupling—flat file processing, not recommended
 Loose coupling
 Fetching data from DB/DW
 Semi-tight coupling—enhanced DM performance
 Provide efficient implement a few data mining primitives in a
DB/DW system, e.g., sorting, indexing, aggregation,
histogram analysis, multiway join, precomputation of some
stat functions
 Tight coupling—A uniform information processing environment
 DM is smoothly integrated into a DB/DW system, mining
query is optimized based on mining query analysis, indexing, and query
processing methods, etc.
35
2.11Concept Description
 Descriptive vs. predictive data mining
 Descriptive mining: describes concepts or task-relevant data sets in
concise, summarative, informative, discriminative forms
 Predictive mining: Based on data and analysis, constructs models for
the database, and predicts the trend and properties of unknown data
 Concept description:
 Characterization: provides a concise and succinct summarization of
the given collection of data
 Comparison: provides descriptions comparing two or more
collections of data
Concept Description vs. OLAP
 Concept description:
 can handle complex data types of the attributes and their
aggregations
 a more automated process
 OLAP:
 restricted to a small number of dimension and measure types
 user-controlled process
2.12Data Generalization and Summarization-based
Characterization
 Data generalization
 A process which abstracts a large set of task-relevant data in a
database from low conceptual levels to higher ones.
 Approaches:
 Data cube approach(OLAP approach)
 Attribute-oriented induction approach
Characterization: Data Cube Approach (without using AO-Induction)
 Perform computations and store results in data cubes
 Strength
 An efficient implementation of data generalization
 Computation of various kinds of measures
 e.g., count( ), sum( ), average( ), max( )
 Generalization and specialization can be performed on a data cube
by roll-up and drill-down
 Limitations
36
 handle only dimensions of simple nonnumeric data and measures of
simple aggregated numeric values.
 Lack of intelligent analysis, can’t tell which dimensions should be
used or to what level the generalization should reach
Attribute-Oriented Induction
 Proposed in 1989 (KDD ‘89 workshop)
 Not confined to categorical data nor particular measures.
 How it is done?
 Collect the task-relevant data( initial relation) using a relational
database query
 Perform generalization by attribute removal or attribute
generalization.
 Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts.
 Interactive presentation with users.
Basic Principles of Attribute-Oriented Induction
 Data focusing: task-relevant data, including dimensions, and the result is the
initial relation.
 Attribute-removal: remove attribute A if there is a large set of distinct
values for A but (1) there is no generalization operator on A, or (2) A’s
higher level concepts are expressed in terms of other attributes.
 Attribute-generalization: If there is a large set of distinct values for A, and
there exists a set of generalization operators on A, then select an operator
and generalize A.
 Attribute-threshold control: typical 2-8, specified/default.
 Generalized relation threshold control: control the final relation/rule size.
Basic Algorithm for Attribute-Oriented Induction
 InitialRel: Query processing of task-relevant data, deriving the initial
relation.
 PreGen: Based on the analysis of the number of distinct values in each
attribute, determine generalization plan for each attribute: removal? or how
high to generalize?
 PrimeGen: Based on the PreGen plan, perform generalization to the right
level to derive a “prime generalized relation”, accumulating the counts.
37
 Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3)
mapping into rules, cross tabs, visualization presentations.
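To make the principles and algorithm steps above concrete, here is a minimal Python sketch of attribute-oriented induction on a small in-memory relation. The student tuples, the concept-hierarchy functions and the attribute threshold below are illustrative assumptions, not part of the course example.

from collections import Counter

def generalize(relation, hierarchies, attr_threshold=3):
    """relation: list of dicts; hierarchies: attribute -> function mapping a value
    to its higher-level concept, or None if no generalization operator exists."""
    attrs = list(relation[0].keys())
    for a in attrs:
        for _ in range(5):                      # climb at most a few hierarchy levels
            distinct = {t[a] for t in relation if a in t}
            if len(distinct) <= attr_threshold:
                break
            if hierarchies.get(a) is None:
                for t in relation:              # attribute removal: no operator on A
                    t.pop(a, None)
                break
            for t in relation:                  # attribute generalization: climb one level
                t[a] = hierarchies[a](t[a])
    # merge identical generalized tuples and accumulate counts (prime relation)
    counts = Counter(tuple(sorted(t.items())) for t in relation)
    return [{**dict(k), "count": v} for k, v in counts.items()]

# toy usage (assumed data and hierarchies)
students = [{"name": "a", "age": 23, "major": "CS"},
            {"name": "b", "age": 31, "major": "EE"},
            {"name": "c", "age": 25, "major": "CS"},
            {"name": "d", "age": 37, "major": "CS"}]
hier = {"name": None,                                        # no operator -> remove
        "age": lambda v: "young" if v < 30 else "middle_aged",
        "major": lambda v: v}                                # already general enough
print(generalize(students, hier))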
Example
 DMQL: Describe general characteristics of graduate students in the Big-
University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#,
gpa
from student
where status in “graduate”
 Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD” }
Presentation of Generalized Results
 Generalized relation:
 Relations where some or all attributes are generalized, with counts or
other aggregation values accumulated.
 Cross tabulation:
 Mapping results into cross tabulation form (similar to contingency
tables).
 Visualization techniques:
 Pie charts, bar charts, curves, cubes, and other visual forms.
 Quantitative characteristic rules:
 Mapping generalized result into characteristic rules with quantitative
information associated with it.
2.13 Mining Class Comparisons
 Comparison: Comparing two or more classes.
 Method:
 Partition the set of relevant data into the target class and the
contrasting class(es)
 Generalize both classes to the same high level concepts
 Compare tuples with the same high level descriptions
 Present for every tuple its description and two measures:
 support - distribution within single class
 comparison - distribution between classes
 Highlight the tuples with strong discriminant features
 Relevance Analysis:
 Find attributes (features) which best distinguish different classes.
38
 Task
 Compare graduate and undergraduate students using discriminant
rule.
 DMQL query
 use Big_University_DB
 mine comparison as “grad_vs_undergrad_students”
 in relevance to name, gender, major, birth_place, birth_date, residence,
phone#, gpa
 for “graduate_students”
 where status in “graduate”
 versus “undergraduate_students”
 where status in “undergraduate”
 analyze count%
 from student
Example: Analytical comparison (2)
 Given
 attributes name, gender, major, birth_place, birth_date, residence,
phone# and gpa
 Gen(ai) = concept hierarchies on attributes ai
 Ui = attribute analytical thresholds for attributes ai
 Ti = attribute generalization thresholds for attributes ai
 R = attribute relevance threshold
Example: Analytical comparison (3)
 1. Data collection
 target and contrasting classes
 2. Attribute relevance analysis
 remove attributes name, gender, major, phone#
 3. Synchronous generalization
 controlled by user-specified dimension thresholds
 prime target and contrasting class(es) relations/cuboids
Class Description
 Quantitative characteristic rule
 necessary
 Quantitative discriminant rule
 sufficient
 Quantitative description rule
 necessary and sufficient

39
2.14Mining descriptive statistical measures in large databases
2.14.1 Mining Data Dispersion Characteristics
 Motivation
 To better understand the data: central tendency, variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
 Mean
 Weighted arithmetic mean
 Median: A holistic measure
 Middle value if odd number of values, or average of the middle two
values otherwise
   estimated by interpolation
 Mode
   Value that occurs most frequently in the data
   Unimodal, bimodal, trimodal
   Empirical formula: mean − mode ≈ 3 × (mean − median)
Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
   Quartiles: Q1 (25th percentile), Q3 (75th percentile)
   Inter-quartile range: IQR = Q3 − Q1
   Five number summary: min, Q1, M, Q3, max
   Boxplot: ends of the box are the quartiles, median is marked, whiskers extend
   outside the box, and outliers are plotted individually
   Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
 Variance and standard deviation
   Variance: s² = (1/(n−1)) Σ (xᵢ − x̄)²  (algebraic, scalable computation)
   Standard deviation s is the square root of the variance s²
Boxplot Analysis
 Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
 Boxplot
40
 Data is represented with a box
 The ends of the box are at the first and third quartiles, i.e., the height
of the box is the IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extend to Minimum and
Maximum
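The central-tendency and dispersion measures above can be computed directly with the Python standard library. The data values in the sketch below are illustrative; it prints the mean, median, mode, five-number summary, variance, standard deviation and the 1.5 × IQR outliers.

import statistics

data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 30, 33, 35, 35, 46]

mean = statistics.mean(data)
median = statistics.median(data)            # holistic measure: middle value(s)
mode = statistics.mode(data)                # most frequent value (unimodal case)

q1, _, q3 = statistics.quantiles(data, n=4) # 25th, 50th, 75th percentiles
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print("mean:", mean, "median:", median, "mode:", mode)
print("five-number summary:", min(data), q1, median, q3, max(data))
print("variance:", statistics.variance(data),
      "std dev:", statistics.stdev(data),
      "outliers:", outliers)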
Graphic Displays of Basic Statistical Descriptions
 Histogram: (shown before)
 Boxplot: (covered before)
 Quantile plot: each value xi is paired with fi indicating that approximately
100 fi % of data are ≤ xi
 Quantile-quantile (q-q) plot: graphs the quantiles of one univariate
distribution against the corresponding quantiles of another
 Scatter plot: each pair of values is a pair of coordinates and plotted as points
in the plane
 Loess (local regression) curve: add a smooth curve to a scatter plot to
provide better perception of the pattern of dependence
UNIT III ASSOCIATION RULES
Association rule mining − Single-dimensional boolean association rules from
transactional databases − Multi level association rules from transaction databases
3.1 Association rule mining:
Association rule mining is used for finding frequent patterns, associations,
correlations, or causal structures among sets of items or objects in transaction
databases, relational databases, and other information repositories. It searches for
interesting relationships among items in a given data set.
3.1.1 Market basket analysis: Electronic shops
A motivating example for association rule mining
• Motivation: finding regularities in data
• What products were often purchased together? — Beer and diapers?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
Association rule mining is used for analyzing buying behavior. Frequently
purchased items can be placed in close proximity in order to further encourage the
sale of such items together.
If customers who purchase computers also tend to buy financial management
software at the same time, then placing the hardware display close to the software
display may help to increase the sales of both of these items.
41
Each basket can then be represented by a Boolean vector of values assigned to
these variables. The Boolean vectors can be analyzed for buying patterns which
reflect items that are frequently associated or purchased together. These patterns can
be represented in the form of association rules.
For example, the information that customers who purchase computers also tend to
buy financial management software at the same time is represented in association
Rule.
computer =>financial management software [support = 2%; confidence =
60%]
Example of association rule mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different
items that customers place in their “shopping baskets”.
Fig. 3.1.1 Market basket analysis
The discovery of such associations can help retailers develop marketing strategies
by gaining insight into which items are frequently purchased together by
customers. For instance, if customers are buying milk, how likely are they to also
buy bread (and what kind of bread) on the same trip to the supermarket? Such
information can lead to increased sales by helping retailers to do selective
marketing and plan their shelf space.
For instance, placing milk and bread within close proximity may further encourage
the sale of these items together within single visits to the store.
3.1.2 Basic Concepts: Frequent Patterns and Association Rules
• Itemset X={x1, …, xk}
• Find all the rules X → Y with min confidence and support
• support, s, probability that a transaction contains X∪Y
42
• confidence, c, conditional probability that a transaction having X also
contains Y.
Rule support and confidence are two measures of rule interestingness.
A support of 2% for the above association rule means that 2% of all the transactions
under analysis show that computer and financial management software are
purchased together
A confidence of 60% means that 60% of the customers who purchased a
computer also bought the software. Typically, association rules are considered
interesting if they satisfy both a minimum support threshold and a minimum
confidence threshold. Such thresholds can be set by users or domain experts.
Rules that satisfy both a minimum support threshold (min sup) and a
minimum confidence threshold (min conf) are called strong. By convention, min sup
and min conf values are written as percentages between 0% and 100% rather than as
fractions between 0 and 1.0.
• A set of items is referred to as an itemset.
• An itemset that contains k items is a k-itemset.
• The set of computer, financial management software is a 2-itemset.
• The occurrence frequency of an itemset is the number of transactions that
contain the itemset. This is also known as the frequency or support count
of the itemset.
• The number of transactions required for the itemset to satisfy minimum
support is referred to as the minimum support count.
Association rule mining - a two-step process:
Step 1: Find all frequent itemsets. By definition, each of these itemsets will
occur at least as frequently as a pre-determined minimum support count.
Step 2: Generate strong association rules from the frequent itemsets. By
definition, these rules must satisfy minimum support and minimum
confidence.
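As a small illustration of the support and confidence definitions above, the following Python sketch computes both measures for the rule computer => financial management software over a handful of made-up baskets (the transactions are assumptions, not data from the text).

transactions = [
    {"computer", "financial_management_software", "printer"},
    {"computer", "financial_management_software"},
    {"computer", "printer"},
    {"milk", "bread"},
    {"computer", "financial_management_software", "milk"},
]

def support(itemset, transactions):
    # fraction of transactions that contain the whole itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # conditional probability that a transaction containing lhs also contains rhs
    return support(lhs | rhs, transactions) / support(lhs, transactions)

lhs, rhs = {"computer"}, {"financial_management_software"}
print("support    =", support(lhs | rhs, transactions))   # P(computer and software)
print("confidence =", confidence(lhs, rhs, transactions)) # P(software | computer)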
3.1.3 Association rule mining:
Association rules can be classified in various ways, based on the following criteria:
1. Based on the types of values handled in the rule:
• If a rule concerns associations between the presence or absence of items, it
is a Boolean association rule.
• If a rule describes associations between quantitative items or attributes, then
it is a quantitative association rule. In these rules, quantitative values for
items or attributes are partitioned into intervals.
43
age(X, "30...39") ∧ income(X, "42K...48K") => buys(X, "high resolution TV")
2. Based on the dimensions of data involved in the rule:
If the items or attributes in an association rule each reference only one dimension,
then it is a single- dimensional association rule.
The above rule could be rewritten as
buys(X, "computer") => buys(X, "financial management software")
The above example is a single-dimensional association rule since it refers to only
one dimension, i.e., buys. If a rule references two or more dimensions, such as the
dimensions buys, time of transaction, and customer category, then it is a
multidimensional association rule.
3. Based on the levels of abstractions involved in the rule set:
Some methods for association rule mining can find rules at differing levels of
abstraction.
For example, suppose that a set of mined association rules includes the following:
age(X, "30...39") => buys(X, "laptop computer")
age(X, "30...39") => buys(X, "computer")
In these rules, the items bought are referenced at different levels of
abstraction. We refer to the rule set mined as consisting of multilevel association
rules. If, instead, the rules within a given set do not reference items or attributes at
different levels of abstraction, then the set contains single-level association rules.
4. Based on the nature of the association involved in the rule:
Association mining can be extended to correlation analysis, where the absence or
presence of correlated items can be identified.
3.2 Mining single-Dimensional Boolean association rules from
Transactional databases
Different methods for mining the simplest form of association rules -
single-dimensional, single-level, Boolean association rules, such as those
discussed for market basket analysis presenting Apriori. It is a basic algorithm for
finding frequent itemsets. It uses a procedure for generating strong association
rules from frequent itemsets.
3.2.1 The Apriori algorithm: Finding frequent itemsets
Apriori is an influential algorithm for mining frequent itemsets for Boolean
association rules. The name of the algorithm is based on the fact that the algorithm
uses prior knowledge of frequent itemset properties.
44
Apriori employs an iterative approach known as a level-wise search, where k-
itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is
found. This set is denoted L1. L1 is used to find L2, the frequent 2-itemsets, which
is used to find L3, and so on, until no more frequent k-itemsets can be found. The
finding of each Lk requires one full scan of the database. To improve the
efficiency of the level-wise generation of frequent itemsets, an important property
called the Apriori property is used to reduce the search space.
The Apriori property. All non-empty subsets of a frequent itemset must also be
frequent.
By definition, if an itemset I does not satisfy the minimum support threshold, s,
then I is not frequent, i.e., P(I) < s. If an item A is added to the itemset I, then the
resulting itemset cannot occur more frequently than I. This property belongs to a
special category of properties called anti-monotone in the sense that if a set cannot
pass a test, all of its supersets will fail the same test as well. It is called anti-
monotone because the property is monotonic in the context of failing a test.
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining
Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in
Lk-1. The notation li[j] refers to the jth item in li.
By convention, Apriori assumes that items within a transaction or itemset are
sorted in increasing lexicographic order. It also ensures that no duplicates are
generated.
2. The prune step: Ck is a superset of Lk, that is, its members may or may not
be frequent, but all of the frequent k-itemsets are included in Ck. A scan of
the database to determine the count of each candidate in Ck would result in
the determination of Lk. Ck can be huge, and so this could involve heavy
computation.
The Apriori Algorithm
Join Step: Ck is generated by joining Lk-1with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a
frequent k-itemset
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
To reduce the size of Ck, the Apriori property is used as follows. Any (k-1)-
itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if
any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot
be frequent either and so can be removed from Ck. This subset testing can be done
quickly by maintaining a hash tree of all frequent itemsets.
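The level-wise search can also be sketched in a few lines of Python. The version below follows the join/prune/count outline above but uses plain set operations instead of a hash tree, and the nine transactions are modelled on the All Electronics example; treat it as an illustrative sketch rather than an efficient implementation.

from itertools import combinations

def apriori(transactions, min_sup_count):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets, found with one scan of the database
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {c for c, n in counts.items() if n >= min_sup_count}
    frequent = {c: counts[c] for c in Lk}
    k = 1
    while Lk:
        # join step: candidate (k+1)-itemsets from Lk joined with itself
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # one scan of the database to count the surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_sup_count}
        frequent.update((c, counts[c]) for c in Lk)
        k += 1
    return frequent

# nine illustrative transactions with minimum support count 2
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
      {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
      {"I1", "I2", "I3"}]
for itemset, count in sorted(apriori(db, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)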
46
Fig. 3.2.1.1 Transactional data for an All Electronics branch
.
Let's look at a concrete example of Apriori, based on the All Electronics
transaction database, D, of Fig. 3.2.1.1. There are nine transactions in this database.
1. In the first iteration of the algorithm, each item is a member of the set of
candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in
order to count the number of occurrences of each item.
2. Suppose that the minimum transaction support count required is 2 (i.e., min sup
= 2). The set of frequent 1-itemsets, L1, can then be determined. It consists of the
candidate 1-itemsets having minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses L1×L1 to
generate a candidate set of 2-itemsets, C2.
4. Next, the transactions in D are scanned and the support count of each candidate
itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those
candidate 2-itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3. Based on the Apriori
property that all subsets of a frequent itemset must also be frequent, we can
determine that the four latter candidates cannot possibly be frequent. We therefore
remove them from C3, thereby saving the effort of unnecessarily obtaining their
counts during the subsequent scan of D to determine L3.
7. The transactions in D are scanned in order to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3×L3 to generate a candidate set of 4-itemsets, C4.
Generating association rules from frequent itemsets:
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them (where strong
association rules satisfy both minimum support and minimum confidence). This
can be done using the following equation for confidence, where the conditional
probability is expressed in terms of itemset support count:
confidence(A => B) = P(B | A) = support count(A U B) / support count(A)
• support count(A U B) is the number of transactions containing the itemsets
AUB, and
• support count(A) is the number of transactions containing the itemset A.
• Based on this equation, association rules can be generated as follows.
• For each frequent itemset l, generate all non-empty subsets of l.
• For every non-empty subset s of l, output the rule "s => (l − s)" if
support count(l) / support count(s) ≥ min_conf,
where min_conf is the minimum confidence threshold.
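A small Python sketch of this rule-generation step follows; the frequent-itemset counts are illustrative values consistent with the nine-transaction example above, and min_conf is set to 70%.

from itertools import combinations

# frequent itemsets and their support counts (illustrative, assumed values)
frequent = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}

def generate_rules(frequent, min_conf):
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                rhs = itemset - lhs
                # confidence = support count(A U B) / support count(A)
                conf = count / frequent[lhs]
                if conf >= min_conf:
                    rules.append((set(lhs), set(rhs), conf))
    return rules

for lhs, rhs, conf in generate_rules(frequent, 0.7):
    print(sorted(lhs), "=>", sorted(rhs), f"confidence = {conf:.2f}")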
Variations of the Apriori algorithm
Many variations of the Apriori algorithm have been proposed. A number of
these variations are enumerated below. Most of them focus on improving the
efficiency of the original algorithm, while the last, calendric market basket
analysis, considers transactions over time.
1. A hash-based technique: Hashing itemset counts.
A hash-based technique can be used to reduce the size of the candidate k-
itemsets, Ck, for k > 1. For example, when scanning each transaction in the
database to generate the frequent 1-itemsets, L1, from the candidate 1-itemsets
in C1, we can generate all of the 2-itemsets for each transaction, hash them into
the buckets of a hash table structure, and increase the corresponding bucket counts.
A 2-itemset whose corresponding bucket count in the hash table is below the
support threshold cannot be frequent and thus should be removed from the
candidate set. Such a hash-based technique may substantially reduce the number of
the candidate k-itemsets examined (especially when k = 2).
2. Transaction reduction: Reducing the number of transactions scanned in future
iterations. A transaction which does not contain any frequent k-itemsets cannot
contain any frequent (k + 1)-itemsets. Therefore, such a transaction can be marked
or removed from further consideration since subsequent scans of the database for j-
itemsets, where j > k, will not require it.
3. Partitioning:
It is used for partitioning the data to find candidate itemsets. A partitioning
technique can be used which requires just two database scans to mine the frequent
itemsets. It consists of two phases.
• In Phase I, the algorithm subdivides the transactions of D into n
non-overlapping partitions. If the minimum support threshold for
transactions in D is min_sup, then the minimum itemset support count for a
partition is min_sup*the number of transactions in that partition.
• For each partition, all frequent itemsets within the partition are found.
These are referred to as local frequent itemsets.
• The procedure employs a special data structure which, for each itemset,
records the TID's of the transactions containing the items in the itemset.
This allows it to find all of the local frequent k-itemsets, for k = 1, 2, ...,
in just one scan of the database.
• The collection of frequent itemsets from all partitions forms a global
candidate itemset with respect to D.
• In Phase II, a second scan of D is conducted in which the actual support of
each candidate is assessed in order to determine the global frequent
48
itemsets. Partition size and the number of partitions are set so that each
partition can fit into main memory and therefore be read only once in each
phase.
4. Sampling:
It is used for Mining on a subset of the given data. The basic idea of the sampling
approach is to pick a random sample S of the given data D, and then search for
frequent itemsets in S instead of D.
5. Dynamic itemset counting:
It adds candidate itemsets at different points during a scan. A dynamic itemset
counting technique was proposed in which the database is partitioned into blocks
marked by start points.
In this variation, new candidate itemsets can be added at any start point, unlike in
Apriori, which determines new candidate itemsets only immediately prior to each
complete database scan. The technique is dynamic in that it estimates the support
of all of the itemsets that have been counted so far, adding new candidate itemsets
if all of their subsets are estimated to be frequent. The resulting algorithm requires
two database scans.
6. Calendric market basket analysis: Finding itemsets that are frequent in a set of
user-defined time intervals. Calendric market basket analysis uses transaction time
stamps to define subsets of the given database.
Construct FP-tree from a Transaction DB
Steps:
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree
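The three construction steps can be sketched in Python as follows. The FPNode class, the tie-breaking order for equally frequent items and the nine sample transactions are illustrative assumptions; the header table and node-links used by full FP-growth are omitted for brevity.

from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_sup_count):
    # step 1: one scan of the DB to find frequent single items
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup_count}
    root = FPNode(None)
    for t in transactions:
        # step 2: keep only frequent items, ordered by descending frequency
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        # step 3: second scan - insert the ordered items, sharing common prefixes
        node = root
        for item in ordered:
            child = node.children.setdefault(item, FPNode(item, node))
            child.count += 1
            node = child
    return root, freq

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
      {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
      {"I1", "I2", "I3"}]
root, freq = build_fp_tree(db, 2)
show(root)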
49
Benefits of the FP-tree Structure
 Completeness:
 never breaks a long pattern of any transaction
 preserves complete information for frequent pattern mining
 Compactness
 reduce irrelevant information—infrequent items are gone
 frequency descending ordering: more frequent items are more likely
to be shared
   never larger than the original database (not counting node-links
and counts)
 Example: For Connect-4 DB, compression ratio could be over 100
Mining Frequent Patterns Using FP-tree
 General idea (divide-and-conquer)
 Recursively grow frequent pattern path using the FP-tree
 Method
 For each item, construct its conditional pattern-base, and then its
conditional FP-tree
 Repeat the process on each newly created conditional FP-tree
 Until the resulting FP-tree is empty, or it contains only one path
(single path will generate all the combinations of its sub-paths, each
of which is a frequent pattern)
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in the FP-tree
50
2) Construct conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so
far
 If the conditional FP-tree contains a single path, simply enumerate all
the patterns
3.3 Mining multilevel association rules from transaction
databases
Multilevel association rules
For many applications, it is difficult to find strong associations among data items
at low or primitive levels of abstraction due to the sparsity of data in
multidimensional space. Strong associations discovered at very high concept levels
may represent common sense knowledge.
Example: Suppose we are given the task-relevant set of transactional data for
sales at the computer department of an All Electronics branch, showing the items
purchased for each transaction TID. A concept hierarchy defines a sequence of
mappings from a set of low level concepts to higher level, more general concepts.
Data can be generalized by replacing low level concepts within the
Fig. 3.3.1 Class Hierarchy
data by their higher level concepts, or ancestors, from a concept hierarchy. The
concept hierarchy has four levels, referred to as levels 0, 1, 2, and 3. By
convention, levels within a concept hierarchy are numbered from top to bottom,
starting with level 0 at the root node for all (the most general abstraction level).
• Level 1 includes computer, software, printer and computer accessory,
• Level 2 includes home computer, laptop computer, education software,
financial management software, .., and
• Level 3 includes IBM home computer, .., Microsoft educational software,
and so on. Level 3 represents the most specific abstraction level of this
hierarchy.
51
Fig. 3.3.2 Multilevel Mining with Reduced Support
Rules generated from association rule mining with concept hierarchies are called
multiple-level or multilevel association rules, since they consider more than one
concept level.
Approaches to mining multilevel association rules
In general, a top-down strategy is employed, where counts are accumulated
for the calculation of frequent itemsets at each concept level, starting at the
concept level 1 and working towards the lower, more specific concept levels, until
no more frequent itemsets can be found. That is, once all frequent itemsets at
concept level 1 are found, then the frequent itemsets at level 2 are found, and so
on.
For each level, any algorithm for discovering frequent itemsets may be used, such
as Apriori or its variations.
1. Using uniform minimum support for all levels (referred to as uniform
support): The same minimum support threshold is used when mining at each level
of abstraction. For example, a minimum support threshold of 5% is used
throughout (e.g., for mining from “computer" down to “laptop computer"). Both
“computer" and “laptop computer" are found to be frequent, while “home
computer" is not.
When a uniform minimum support threshold is used, the search procedure is
simplified. The method is also simple in that users are required to specify only one
minimum support threshold. An optimization technique can be adopted, based on
the knowledge that an ancestor is a superset of its descendents: the search avoids
examining itemsets containing any item whose ancestors do not have minimum
support.
Fig. 3.3.3 Multilevel Mining with Uniform Support
52
With the uniform support approach, it is unlikely that items at lower levels of
abstraction will occur as frequently as those at higher levels. If the minimum
support threshold is set too high, it could miss several meaningful associations
occurring at low abstraction levels.
If the threshold is set too low, it may generate many uninteresting associations
occurring at high abstraction levels. This provides the motivation for the following
approach.
2. Using reduced minimum support at lower levels (referred to as reduced
support): Each level of abstraction has its own minimum support threshold. The
lower the abstraction level is, the smaller the corresponding threshold is. For
example, the minimum support thresholds for levels 1 and 2 are 5% and 3%,
respectively. In this way, “computer", “laptop computer", and “home computer"
are all considered frequent.
Fig. 3.3.4 Multilevel Mining with Reduced Support
For mining multiple-level associations with reduced support, there are a number of
alternative search strategies.
These include:
1. Level-By-Level Independent: This is a full breadth search, where no background
knowledge of frequent itemsets is used for pruning. Each node is examined,
regardless of whether or not its parent node is found to be frequent.
2. Level-Cross Filtering By Single Item: An item at the i-th level is examined if and
only if its parent node at the (i-1)-th level is frequent.
If a node is frequent, its children will be examined; otherwise, its descendents are
pruned from the search. For example, the descendent nodes of “computer" (i.e.,
“laptop computer" and home computer") are not examined, since “computer" is
not frequent.
3. Level-Cross Filtering By K-Itemset: A k-itemset at the i-th level is examined if
and only if its corresponding parent k-itemset at the (i-1)-th level is frequent.
53
UNIT IV CLASSIFICATION AND CLUSTERING
Classification and prediction − Issues − Decision tree induction − Bayesian
classification – Association rule based − Other classification methods − Prediction
− Classifier accuracy − Cluster analysis – Types of data − Categorization of
methods − Partitioning methods − Outlier analysis.
4.1 Classification vs. Prediction
 Classification:
 predicts categorical class labels
 classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in
classifying new data
 Prediction:
 models continuous-valued functions, i.e., predicts unknown or
missing values
 Typical Applications
 credit approval
 target marketing
 medical diagnosis
 treatment effectiveness analysis
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
 The set of tuples used for model construction: training set
 The model is represented as classification rules, decision trees, or
mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the
classified result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set, otherwise over-fitting
will occur
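A tiny Python sketch of the accuracy estimate described above: the model and the labelled test tuples are illustrative assumptions, and the accuracy rate is simply the fraction of test samples whose known label matches the model's prediction.

def accuracy(model, test_set):
    """model: callable mapping a sample to a predicted label;
    test_set: list of (sample, known_label) pairs independent of the training set."""
    correct = sum(model(x) == label for x, label in test_set)
    return correct / len(test_set)

# illustrative classifier and test tuples (assumptions, not from the notes)
model = lambda x: "yes" if x["age"] == "youth" and x["student"] == "yes" else "no"
test = [({"age": "youth", "student": "yes"}, "yes"),
        ({"age": "senior", "student": "no"}, "no"),
        ({"age": "youth", "student": "no"}, "yes")]
print("accuracy rate:", accuracy(model, test))   # 2 of 3 correctly classified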
54
 Supervised learning (classification)
 Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data are unknown
 Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
4.2 Issues Regarding Classification and Prediction: Evaluating Classification Methods
 Predictive accuracy
 Speed and scalability
   time to construct the model
   time to use the model
 Robustness
   handling noise and missing values
 Scalability
   efficiency in disk-resident databases
 Interpretability
   understanding and insight provided by the model
 Goodness of rules
55
 decision tree size
 compactness of classification rules
4.3 Classification by Decision Tree Induction
• Decision tree
o A decision tree is a flowchart-like tree structure, where each internal
node denotes a test on an attribute.
o Each branch represents an outcome of the test, and each leaf node holds
a class label.
o The topmost node in a tree is the root node.
o Internal nodes are denoted by rectangles, and leaf nodes are denoted by
ovals.
o Some decision tree algorithms produce only binary trees, whereas others
can produce non-binary trees.
• Decision tree generation consists of two phases
o Tree construction
 Attribute selection measures are used to select the attribute that
best partitions the tuples into distinct classes.
o Tree pruning
 Tree pruning attempts to identify and remove such branches,
with the goal of improving classification accuracy on unseen
data.
• Use of decision tree: Classifying an unknown sample
o Test the attribute values of the sample against the decision
tree
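Classifying an unknown sample is a walk from the root to a leaf. The sketch below encodes a small decision tree as nested Python tuples (an assumed representation, loosely based on the buys_computer example) and tests the sample's attribute values at each internal node.

# internal node: (attribute_to_test, {branch_value: subtree_or_leaf})
tree = ("age",
        {"youth":  ("student", {"yes": "buys_computer=yes", "no": "buys_computer=no"}),
         "middle": "buys_computer=yes",                       # leaf node: class label
         "senior": ("credit", {"fair": "buys_computer=yes",
                               "excellent": "buys_computer=no"})})

def classify(node, sample):
    while isinstance(node, tuple):            # follow branches until a leaf is reached
        attribute, branches = node
        node = branches[sample[attribute]]
    return node

print(classify(tree, {"age": "youth", "student": "yes", "credit": "fair"}))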
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
o Tree is constructed in a top-down recursive divide-and-conquer manner
o At start, all the training examples are at the root
o Attributes are categorical (if continuous-valued, they are discretized in
advance)
o Examples are partitioned recursively based on selected attributes
o Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
o All samples for a given node belong to the same class
o There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
o There are no samples left
Attribute Selection Measure
• Information gain
o All attributes are assumed to be categorical; the measure can be modified
for continuous-valued attributes
56
• Gini index
o All attributes are assumed continuous-valued
o Assume there exist several possible split values for each attribute
o May need other tools, such as clustering, to get the possible split values
o Can be modified for categorical attributes
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
o Let the set of examples S contain p elements of class P and n elements
of class N
o The amount of information needed to decide if an arbitrary example in
S belongs to P or N is defined as
I(p, n) = − (p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
Information Gain in Decision Tree Induction:
• Assume that using attribute A a set S will be partitioned into sets {S1, S2, …, Sv}
• If Si contains pi examples of P and ni examples of N, the entropy, or the
expected information needed to classify objects in all subtrees Si, is
E(A) = Σ (i = 1 to v) ((pi + ni) / (p + n)) · I(pi, ni)
• The encoding information that would be gained by branching on A is
Gain(A) = I(p, n) − E(A)
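The following Python sketch evaluates I(p, n), E(A) and Gain(A) as defined above; the 9-positive/5-negative split into three subsets is an illustrative example, not data from the notes.

from math import log2

def info(p, n):
    """I(p, n): expected information to classify an example as P or N."""
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)

def gain(p, n, partitions):
    """partitions: list of (p_i, n_i) pairs produced by splitting on attribute A."""
    e_a = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partitions)
    return info(p, n) - e_a

# e.g. 9 positive / 5 negative samples split by an attribute into three subsets
print(round(gain(9, 5, [(2, 3), (4, 0), (3, 2)]), 3))   # approximately 0.246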
4.4 Bayesian Classification:
• Probabilistic Learning: Calculate explicit probabilities for hypothesis,
among the most practical approaches to certain types of learning
problems
• Incremental: Each training example can incrementally increase/decrease
the probability that a hypothesis is correct. Prior knowledge can be
combined with observed data.
• Probabilistic Prediction: Predict multiple hypotheses, weighted by their
probabilities
• Standard: Even when Bayesian methods are computationally intractable,
they can provide a standard of optimal decision making against which
other methods can be measured
Bayesian Theorem
57
• Given training data D, the posteriori probability of a hypothesis h, P(h|D),
follows from the Bayes theorem:
P(h|D) = P(D|h) P(h) / P(D)
• MAP (maximum posteriori) hypothesis: the h that maximizes P(h|D), i.e.
P(D|h) P(h)
• Practical difficulty: It requires initial knowledge of many
probabilities, significant computational cost.
Naïve Bayes Classifier
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
o Let D be a training set of tuples and their associated class labels. As
usual, each tuple is represented by an n-dimensional attribute vector,
X = (x1, x2, ..., xn), depicting n measurements made on the tuple
from n attributes, respectively, A1, A2, ..., An.
o Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X,
the classifier will predict that X belongs to the class having the
highest posterior probability, conditioned on X.
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
o The class Ci for which P(Ci|X) is maximized is called the maximum
posteriori hypothesis.
o As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be
maximized. If the class prior probabilities are not known, then it is
commonly assumed that the classes are equally likely, that is, P(C1)
= P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci)P(Ci).
o The attributes are conditionally independent of one another, given
the
class label of the tuple
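A minimal Python sketch of the naive Bayesian classifier for categorical attributes is given below. The training tuples are illustrative assumptions, the priors and conditionals are maximum-likelihood estimates, and no Laplacian correction is applied, so a zero count drives a class score to zero.

from collections import Counter, defaultdict

def train(data):
    """data: list of (attribute-dict, class-label) pairs."""
    class_counts = Counter(label for _, label in data)
    value_counts = defaultdict(Counter)        # (class, attribute) -> value counts
    for x, label in data:
        for a, v in x.items():
            value_counts[(label, a)][v] += 1
    return class_counts, value_counts, len(data)

def predict(x, model):
    class_counts, value_counts, n = model
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                         # prior P(Ci)
        for a, v in x.items():                 # class-conditional independence
            score *= value_counts[(c, a)][v] / cc   # P(xk | Ci), no smoothing
        if score > best_score:
            best, best_score = c, score
    return best

# illustrative training tuples
data = [({"age": "youth", "student": "no"}, "no"),
        ({"age": "youth", "student": "yes"}, "yes"),
        ({"age": "middle", "student": "no"}, "yes"),
        ({"age": "senior", "student": "yes"}, "yes"),
        ({"age": "senior", "student": "no"}, "no")]
model = train(data)
print(predict({"age": "youth", "student": "yes"}, model))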
Rule Based Classification
A set of IF-THEN rules are used in Rule Based Classification.
Using IF-THEN Rules for Classification
Rules are a good way of representing information or bits of knowledge. A rule-
based classifier uses a set of IF-THEN rules for classification.
An IF-THEN rule is an expression of the form
IF condition THEN conclusion.
An example is rule R1,
R1: IF age = youth AND student = yes THEN buys computer = yes.
58
• The “IF”-part (or left-hand side) of a rule is known as the rule antecedent or
precondition. The “THEN”-part (or right-hand side) is the rule consequent.
R1 can also be written as
R1: (age = youth) ∧ (student = yes) => (buys computer = yes).
Name            Blood Type   Give Birth   Can Fly   Live in Water   Class
human           warm         yes          no        no              mammals
python          cold         no           no        no              reptiles
salmon          cold         no           no        yes             fishes
whale           warm         yes          no        yes             mammals
frog            cold         no           no        sometimes       amphibians
komodo          cold         no           no        no              reptiles
bat             warm         yes          yes       no              mammals
pigeon          warm         no           yes       no              birds
cat             warm         yes          no        no              mammals
leopard shark   cold         yes          no        yes             fishes
turtle          cold         no           no        sometimes       reptiles
penguin         warm         no           no        sometimes       birds
porcupine       warm         yes          no        no              mammals
eel             cold         no           no        yes             fishes
salamander      cold         no           no        sometimes       amphibians
gila monster    cold         no           no        no              reptiles
platypus        warm         no           no        no              mammals
owl             warm         no           yes       no              birds
dolphin         warm         yes          no        yes             mammals
eagle           warm         no           yes       no              birds
Fig. If Then Example
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of
the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
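Rule coverage can be sketched in Python as a simple conjunction test over the antecedent. The attribute names follow rules R1-R5 above; the hawk and grizzly bear instances are illustrative.

# each rule: (antecedent condition dict, consequent class)
rules = [
    ({"Give Birth": "no",  "Can Fly": "yes"},       "Birds"),      # R1
    ({"Give Birth": "no",  "Live in Water": "yes"}, "Fishes"),     # R2
    ({"Give Birth": "yes", "Blood Type": "warm"},   "Mammals"),    # R3
    ({"Give Birth": "no",  "Can Fly": "no"},        "Reptiles"),   # R4
    ({"Live in Water": "sometimes"},                "Amphibians"), # R5
]

def covers(antecedent, instance):
    # a rule covers an instance when every condition in its antecedent is satisfied
    return all(instance.get(a) == v for a, v in antecedent.items())

def classify(instance, rules):
    for antecedent, consequent in rules:     # first covering rule fires
        if covers(antecedent, instance):
            return consequent
    return None

hawk = {"Blood Type": "warm", "Give Birth": "no", "Can Fly": "yes", "Live in Water": "no"}
grizzly = {"Blood Type": "warm", "Give Birth": "yes", "Can Fly": "no", "Live in Water": "no"}
print(classify(hawk, rules), classify(grizzly, rules))   # Birds Mammals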
Advantages of Rule-Based Classifiers
• As highly expressive as decision trees
59
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

• Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations. The focus is on the modelling and analysis of data for decision makers, not on daily operations or transaction processing. A simple and concise view is provided around particular subject issues by excluding data that are not useful in the decision support process.
• Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources (e.g., hotel price: currency, tax, whether breakfast is covered).
• Time-variant: All data in the data warehouse is identified with a particular time period. The time horizon for the data warehouse is significantly longer than that of operational systems.
  o Operational database: current value data.
  o Data warehouse data: information from a historical perspective (e.g., the past 5-10 years).
• Non-volatile: Data is stable in a data warehouse. More data is added, but data is never removed. Operational update of data does not occur in the data warehouse environment, so transaction processing, recovery and concurrency control mechanisms are not required. Only two operations are needed for data access:
  o initial loading of data, and
  o access of data.
Data Warehouse is a single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context. It can be
• used for decision support,
• used to manage and control the business,
• used by managers and end users to understand the business and make judgments.
Data Warehousing is an architectural construct of information systems that provides users with current and historical decision support information that is hard to access or present in traditional operational data stores.
Other important terminology
Enterprise Data Warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.
Data Mart: a departmental subset that focuses on selected subjects. A data mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses, usually smaller than the corporate data warehouse.
Decision Support System (DSS): information technology that helps the knowledge worker (executive, manager, analyst) make faster and better decisions.
Drill-down: traversing the summarization levels from highly summarized data down to the underlying current or old detail.
Metadata: data about data, used to describe the location and description of warehouse system components such as names, definitions and structure.
Benefits of data warehousing
• Data warehouses are designed to perform well with aggregate queries running on large amounts of data.
• The structure of data warehouses is easier for end users to navigate, understand and query against, unlike relational databases, which are primarily designed to handle large numbers of transactions.
• Data warehouses enable queries that cut across different segments of a company's operation. E.g. production data could be compared against inventory data even if they were originally stored in different databases with different structures.
• Queries that would be complex in normalized databases can be easier to build and maintain in data warehouses, decreasing the workload on transaction systems.
• Data warehousing is an efficient way to manage and report on data that comes from a variety of sources and is non-uniform and scattered throughout a company.
• Data warehousing is an efficient way to manage demand for large amounts of information from many users.
• Data warehousing provides the capability to analyze large amounts of historical data for nuggets of wisdom that can give an organization a competitive advantage.
Operational and informational data
1. Operational data is used for transactional functions such as bank card withdrawals and deposits. It is
  • detailed
  • updateable
  • reflects current data.
2. Informational data is used to provide answers to problems posed by decision makers. It is
  • summarized
  • non-updateable.
1.2 Building a Data Warehouse
The selection of data warehouse technology - both hardware and software - depends on many factors, such as:
• the volume of data to be accommodated,
• the speed with which data is needed,
• the history of the organization,
• which level of data is being built,
• how many users there will be,
• what kind of analysis is to be performed,
• cost of technology, etc.
The hardware is typically mainframe, parallel, or client/server hardware. The software that must be selected is for the basic database manipulation of the data as it resides on the hardware. Typically the software is either a full-function DBMS or specialized database software that has been optimized for the data warehouse. Other software that needs to be considered is the interface software that provides transformation and metadata capability, such as PRISM Solutions Warehouse Manager.
A final piece of software that is important is the software needed for changed data capture.
A rough sizing of the data needs to be done to determine the fitness of the hardware and software platforms. If the hardware and DBMS software are much too large for the data warehouse, the costs of building and running the data warehouse will be exorbitant: even though performance will be no problem, development and operational costs will be. Conversely, if the hardware and DBMS software are much too small for the size of the data warehouse, then the performance of operations and the ultimate end-user satisfaction with the data warehouse will suffer. So it is important that there be a comfortable fit between the data warehouse and the hardware and DBMS software that will house and manipulate the warehouse.
There are two factors required to build and use a data warehouse:
Business factors:
• Business users want to make decisions quickly and correctly using all available data.
Technological factors:
• To address the incompatibility of operational data stores.
• IT infrastructure is changing rapidly; its capacity is increasing and its cost is decreasing, so building a data warehouse is easier.
There are several things to be considered while building a successful data warehouse.
1.2.1 Business considerations
Organizations interested in the development of a data warehouse can choose one of the following two approaches:
• Top-Down Approach (suggested by Bill Inmon)
• Bottom-Up Approach (suggested by Ralph Kimball)
a. Top-Down Approach
In the top-down approach suggested by Bill Inmon, we build a centralized repository to house corporate-wide business data. This repository is called the Enterprise Data Warehouse (EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy. The central repository for corporate-wide data helps us maintain one version of the truth of the data. The data in the EDW is stored at the most detailed level. The reason to build the EDW at the most detailed level is to leverage the flexibility to be used by multiple departments and to cater for future requirements.
The disadvantages of storing data at the detail level are:
1. The complexity of design increases with increasing level of detail.
2. It takes a large amount of space to store data at the detail level, hence increased cost.
Once the EDW is implemented we start building subject-area-specific data marts, which contain data in a denormalized form, also called a star schema. The data in the marts is usually summarized based on the end users' analytical requirements. The reason to denormalize the data in the mart is to provide faster access to the data for end-user analytics. If we were to query a normalized schema for the same analytics, we would end up with complex multi-level joins that would be much slower than the same query on the denormalized schema.
The top-down approach can be used when:
1. The business has complete clarity on all or multiple subject areas' data warehouse requirements.
2. The business is ready to invest considerable time and money.
The advantage of using the top-down approach is that we build a centralized repository to cater for one version of the truth for business data. This is very important for the data to be reliable, consistent across subject areas, and for reconciliation in case of data-related contention between subject areas.
The disadvantage of using the top-down approach is that it requires more time and initial investment. The business has to wait for the EDW to be implemented, followed by building the data marts, before they can access their reports.
b. Bottom-Up Approach
The bottom-up approach suggested by Ralph Kimball is an incremental approach to building a data warehouse. In this approach data marts are built separately at different points of time, as and when the specific subject area requirements are clear. The data marts are then integrated or combined together to form a data warehouse. Separate data marts are combined through the use of conformed dimensions and conformed facts. A conformed dimension and a conformed fact are ones that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and consistent values across separate data marts. The conformed dimension means exactly the same thing with every fact table it is joined to. A conformed fact has the same definition of measures, the same dimensions joined to it, and the same granularity across data marts.
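To make the idea of a conformed dimension concrete, here is a small illustrative sketch in Python with pandas. It is not part of the course material; the table and column names (dim_date, sales_fact, returns_fact, date_key) are assumptions chosen for the example. Two marts built at different times can be compared because they share the same date keys and attributes.

# A minimal sketch: two data marts that share a conformed Date dimension,
# so their facts can be compared side by side.
import pandas as pd

# Conformed dimension: same keys, attribute names and values in every mart.
dim_date = pd.DataFrame({
    "date_key": [20240101, 20240102],
    "month": ["Jan-2024", "Jan-2024"],
})

# Two subject-area data marts built at different times.
sales_fact = pd.DataFrame({"date_key": [20240101, 20240102], "sales_amt": [500, 700]})
returns_fact = pd.DataFrame({"date_key": [20240102], "return_amt": [50]})

# Because the date_key values are conformed, the marts can be joined and compared.
combined = (sales_fact.merge(dim_date, on="date_key")
                      .merge(returns_fact, on="date_key", how="left")
                      .fillna({"return_amt": 0}))
print(combined[["month", "sales_amt", "return_amt"]])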
The bottom-up approach helps us incrementally build the warehouse by developing and integrating data marts as and when the requirements become clear. We do not have to wait until the overall requirements of the warehouse are known. We should implement the bottom-up approach when:
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear; we have clarity on only one data mart.
Merits of the bottom-up approach:
• It does not require high initial costs and has a faster implementation time; hence the business can start using the marts much earlier than with the top-down approach.
Drawbacks of the bottom-up approach:
• It stores data in denormalized format, so there is higher space usage for detailed data.
• There is a tendency not to keep detailed data in this approach, hence losing the advantage of having detail data.
1.2.2 Design considerations
A successful data warehouse designer must adopt a holistic approach by considering all data warehouse components as parts of a single complex system, and take into account all possible data sources and all known usage requirements. Most successful data warehouses have the following common characteristics:
1. Are based on a dimensional model
2. Contain historical and current data
3. Include both detailed and summarized data
4. Consolidate disparate data from multiple sources while retaining consistency
A data warehouse is difficult to build due to the following reasons:
• Heterogeneity of data sources
• Use of historical data
• Growing nature of the database
The data warehouse design approach must be a business-driven, continuous and iterative engineering approach. In addition to the general considerations, the following specific points are relevant to data warehouse design:
1. Data Content
The content and structure of the data warehouse are reflected in its data model. The data model is the template that describes how information will be organized within the integrated warehouse framework. The data in a data warehouse must be detailed data. It must be formatted, cleaned up and transformed to fit the warehouse data model.
2. Meta Data
It defines the location and contents of data in the warehouse. Metadata is searchable by users to find definitions or subject areas. In other words, it must provide decision-support-oriented pointers to warehouse data, and thus provides a logical link between warehouse data and decision support applications.
3. Data Distribution
One of the biggest challenges when designing a data warehouse is the data placement and distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary to know how the data should be divided across multiple servers and which users should get access to which types of data. The data can be distributed based on the subject area, location (geographical region), or time (current, month, year).
4. Tools
A number of tools are available that are specifically designed to help in the implementation of the data warehouse. All selected tools must be compatible with the given data warehouse environment and with each other. All tools must be able to use a common metadata repository.
Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and query models
1.2.3 Technical Considerations
A number of technical issues are to be considered when designing a data warehouse environment. These issues include:
• The hardware platform that would house the data warehouse
• The DBMS that supports the warehouse database
• The communication infrastructure that connects the data marts, operational systems and end users
• The hardware and software to support the metadata repository
• The systems management framework that enables centralized management and administration of the entire environment.
1.2.4 Implementation Considerations
The following logical steps are needed to implement a data warehouse:
• Collect and analyze business requirements
• Create a data model and a physical design
• Define data sources
• Choose the database technology and platform
• Extract the data from the operational database, transform it, clean it up and load it into the warehouse
• Choose database access and reporting tools
• Choose database connectivity software
• Choose data analysis and presentation software
• Update the data warehouse
Access Tools
Data warehouse implementation relies on selecting suitable data access tools. The best way to choose these is based on the type of data and the kind of access it permits for a particular user. The following lists the various types of data that can be accessed:
• Simple tabular form data
• Ranking data
• Multivariable data
• Time series data
• Graphing, charting and pivoting data
• Complex textual search data
• Statistical analysis data
• Data for testing of hypotheses, trends and patterns
• Predefined repeatable queries
• Ad hoc user-specified queries
• Reporting and analysis data
• Complex queries with multiple joins, multi-level subqueries and sophisticated search criteria
Data Extraction, Clean Up, Transformation and Migration
Proper attention must be paid to data extraction, which represents a success factor for a data warehouse architecture. When implementing a data warehouse the following selection criteria should be considered:
• Timeliness of data delivery to the warehouse
• The tool must have the ability to identify the particular data so that it can be read by the conversion tool
• The tool must support flat files and indexed files, since much corporate data is still of this type
• The tool must have the capability to merge data from multiple data stores
• The tool should have a specification interface to indicate the data to be extracted
• The tool should have the ability to read data from the data dictionary
• The code generated by the tool should be completely maintainable
• The tool should permit the user to extract the required data
• The tool must have the facility to perform data type and character set translation
• The tool must have the capability to create summarization, aggregation and derivation of records
• The data warehouse database system must be able to load data directly from these tools
Data Placement Strategies
As a data warehouse grows, there are at least two options for data placement. One is to put some of the data in the data warehouse onto another storage medium. The second option is to distribute the data in the data warehouse across multiple servers. This involves data replication and database gateways.
Metadata
Metadata can define all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformations.
User Sophistication Levels
The users of data warehouse data can be classified on the basis of their skill level in accessing the warehouse. There are three classes of users:
Casual users: are most comfortable retrieving information from the warehouse in predefined formats and running pre-existing queries and reports. These users do not need tools that allow for building standard and ad hoc reports.
Power users: can use predefined as well as user-defined queries to create simple and ad hoc reports. These users can engage in drill-down operations. These users may have experience of using reporting and query tools.
Expert users: tend to create their own complex queries and perform standard analysis on the information they retrieve. These users have knowledge about the use of query and report tools.
1.3 Multi-Tier Architecture
The functions of a data warehouse are based on relational database technology, implemented in a parallel manner.
There are two advantages of having parallel relational database technology for a data warehouse:
• Linear speed-up: the ability to increase the number of processors to reduce response time.
• Linear scale-up: the ability to provide the same performance on the same requests as the database size increases.
1.3.1 Types of parallelism
There are two types of parallelism:
• Inter-query parallelism: different server threads or processes handle multiple requests at the same time.
• Intra-query parallelism: this form of parallelism decomposes a serial SQL query into lower-level operations such as scan, join, sort, etc. These lower-level operations are then executed concurrently, in parallel.
Intra-query parallelism can be done in either of two ways:
• Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.
• Vertical parallelism: this occurs among different tasks. All query components such as scan, join, sort, etc. are executed in parallel in a pipelined fashion. In other words, an output from one task becomes an input into another task.
1.3.2 Database Architecture
There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything architecture
2. Shared disk architecture
3. Shared nothing architecture
1. Shared Memory Architecture
Tightly coupled shared memory systems have the following characteristics:
• Multiple processor units share memory.
• Each processor unit has full access to all shared memory through a common bus.
• Communication between nodes occurs via shared memory.
• Performance is limited by the bandwidth of the memory bus.
Fig. 1.3.2.1 Shared Memory Architecture
Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP nodes can be used with Oracle Parallel Server in a tightly coupled system, where memory is shared by the multiple processor units and is accessible by all the processor units through a memory bus. Examples of tightly coupled systems include the Pyramid, Sequent, and Sun SparcServer.
Performance in a tightly coupled system is limited by the following factors:
• Memory bandwidth
• Processor-unit-to-processor-unit communication bandwidth
• Memory availability
• I/O bandwidth
• Bandwidth of the common bus.
Parallel processing advantages of shared memory systems:
• Memory access is cheaper than inter-node communication. This means that internal synchronization is faster than using the lock manager.
• Shared memory systems are easier to administer than a cluster.
A disadvantage of shared memory systems for parallel processing:
• Scalability is limited by bus bandwidth and latency, and by available memory.
2. Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated in the following figure, have the following characteristics:
• Each node consists of one or more processor units and associated memory.
• Memory is not shared between nodes.
• Communication occurs over a common high-speed bus.
• Each node has access to the same disks and other resources.
• A node can be an SMP if the hardware supports it.
• Bandwidth of the high-speed bus limits the number of nodes of the system.
Fig. 1.3.2.2 Shared Disk Architecture
Each node has its own data cache, as memory is not shared among the nodes. Cache consistency must be maintained across the nodes, and a lock manager is needed to maintain that consistency. Additionally, instance locks using the DLM at the Oracle level must be maintained to ensure that all nodes in the cluster see identical data. There is additional overhead in maintaining the locks and ensuring that the data caches are consistent. The performance impact depends on the hardware and software components, such as the bandwidth of the high-speed bus through which the nodes communicate, and DLM performance.
Merits of shared disk systems:
• Shared disk systems permit high availability. All data is accessible even if one node dies.
• These systems have the concept of one database, which is an advantage over shared nothing systems.
• Shared disk systems provide for incremental growth.
Drawbacks of shared disk systems:
• Inter-node synchronization is required, involving DLM overhead and greater dependency on high-speed interconnect.
• If the workload is not partitioned well, there may be high synchronization overhead.
• There is operating system overhead from running shared disk software.
3. Shared Nothing Architecture
Shared nothing systems are typically loosely coupled. In shared nothing systems only one CPU is connected to a given disk. If a table or database is located on that disk, access depends entirely on the processor unit which owns it. Shared nothing systems can be represented as follows:
Fig. 1.3.2.3 Distributed Memory Architecture
Shared nothing systems are concerned with access to disks, not access to memory. Nonetheless, adding more processor units and disks can improve scale-up.
Oracle Parallel Server can access the disks on a shared nothing system as long as the operating system provides transparent disk access, but this access is expensive in terms of latency.
Advantages of shared nothing systems:
• Shared nothing systems provide for incremental growth.
• System growth is practically unlimited.
• MPPs are good for read-only databases and decision support applications.
• Failure is local: if one node fails, the others stay up.
Drawbacks of shared nothing systems:
• More coordination is required.
• More overhead is required for a process working on a disk belonging to another node.
• If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may be worthwhile to consider data-dependent routing to alleviate contention.
1.4 Data Warehousing Schema
There are three basic schemas that are used in dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
1.4.1 Star schema
The multidimensional view of data that is expressed using relational database semantics is provided by the database schema design called the star schema. The basic idea of the star schema is that information can be classified into two groups:
• Facts
• Dimensions
A star schema has one large central table (the fact table) and a set of smaller tables (the dimensions) arranged in a radial pattern around the central table.
• Facts are the core data elements being analyzed.
• Dimensions are attributes about the facts.
The determination of which schema model should be used for a data warehouse should be based upon the analysis of project requirements, available tools and project team preferences.
Fig. 1.4.1.1 Star Schema
A star schema has points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas dimension tables are de-normalized. The star schema is the simplest architecture, is most commonly used, and is recommended by Oracle.
Fact Tables
A fact table is a table that contains summarized numerical and historical data (facts) and a multipart index composed of foreign keys from the primary keys of related dimension tables. A fact table typically has two types of columns: foreign keys to dimension tables, and measures, which contain numeric facts. A fact table can contain facts at a detailed or an aggregated level.
Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year), a Region dimension (profit by country, state, city), or a Product dimension (profit for product1, product2). A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension does not have hierarchies and levels it is called a flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Dimension tables are generally smaller in size than the fact table.
Measures
Measures are numeric data based on columns in a fact table. They are the primary data in which end users are interested. E.g. a sales fact table may contain a profit measure which represents the profit on each sale.
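The star schema described above can be sketched with small in-memory tables. The following Python/pandas snippet is illustrative only (table names such as fact_sales, dim_time and dim_product are assumptions); it shows a fact table holding foreign keys and a profit measure, and a typical query that joins the dimensions and summarizes the measure.

# A minimal illustrative sketch of a star schema (hypothetical tables; pandas
# stands in for relational joins): one fact table with foreign keys and a
# measure, plus Time and Product dimension tables.
import pandas as pd

dim_time = pd.DataFrame({"time_key": [1, 2], "month": ["Jan", "Feb"], "year": [2024, 2024]})
dim_product = pd.DataFrame({"product_key": [10, 11], "product_name": ["P1", "P2"]})

fact_sales = pd.DataFrame({
    "time_key":    [1, 1, 2],
    "product_key": [10, 11, 10],
    "profit":      [120.0, 80.0, 150.0],   # measure column
})

# Typical star-schema query: join the fact table to its dimensions and
# summarize the measure by a dimensional attribute.
profit_by_month = (fact_sales
                   .merge(dim_time, on="time_key")
                   .merge(dim_product, on="product_key")
                   .groupby(["year", "month"])["profit"].sum())
print(profit_by_month)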
Cubes are data processing units composed of fact tables and dimensions from the data warehouse. They provide multidimensional views of data, and querying and analytical capabilities, to clients.
The main characteristics of the star schema:
• Simple structure that is easy to understand.
• Great query effectiveness, because only a small number of tables need to be joined.
• Relatively long time for loading data into dimension tables because of de-normalization; the redundant data means the tables can become large.
• The most commonly used schema in data warehouse implementations.
1.4.2 Snowflake schema
The snowflake schema is the result of decomposing one or more of the dimensions. The many-to-one relationships among sets of attributes of a dimension can be separated into new dimension tables, forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical structure of dimensions very well.
1.4.3 Fact constellation schema
For each star schema it is possible to construct a fact constellation schema. The fact constellation architecture contains multiple fact tables that share many dimension tables. The main shortcoming of the fact constellation schema is a more complicated design, because many variants for particular kinds of aggregation must be considered and selected.
1.5 Multidimensional data model
The multidimensional data model views data as a cube. The table on the left contains detailed sales data by product, market and time. The cube on the right associates sales numbers (units sold) with the dimensions product type, market and time, with the unit variables organized as cells in an array. This cube can be expanded to include another array - price - which can be associated with all or only some dimensions. As the number of dimensions increases, the number of cube cells increases exponentially. Dimensions are hierarchical in nature; i.e. the time dimension may contain hierarchies for years, quarters, months, weeks and days; GEOGRAPHY may contain country, state, city, etc.
Fig. 1.5.1 Multidimensional cube
Each side of the cube represents one of the elements of the question. The x-axis represents time, the y-axis represents products and the z-axis represents different centers. The cells in the cube represent the number of products sold, or can represent the price of the items. When the size of a dimension increases, the size of the cube increases exponentially. The response time of the cube depends on the size of the cube.
1.5.1 Operations in the Multidimensional Data Model
• Aggregation (roll-up)
  o dimension reduction: e.g., total sales by city
  o summarization over an aggregate hierarchy: e.g., total sales by city and year -> total sales by region and by year
• Selection (slice) defines a sub-cube
  o e.g., sales where city = Palo Alto and date = 1/15/96
• Navigation to detailed data (drill-down)
  o e.g., (sales - expense) by city, top 3% of cities by average income
• Visualization operations (e.g., pivot or dice)
1.6 OLAP operations
OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data. OLAP technology can provide management with fast answers to complex queries on their operational data, or enable them to analyze their company's historical data for trends and patterns. Online Analytical Processing (OLAP) applications and tools are those that are designed to ask "complex queries of large multidimensional collections of data."
Operations:
• Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
• Drill down (roll down): the reverse of roll-up; from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions.
• Slice and dice: project and select.
• Pivot (rotate): re-orient the cube; visualization; 3D to a series of 2D planes.
• Other operations:
  o drill across: involving (across) more than one fact table
  o drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
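A brief, hypothetical sketch of the roll-up, slice and pivot operations listed above, using Python/pandas group-by and pivot operations in place of an OLAP engine; the data values and the city-to-region hierarchy are invented for illustration.

# A small hypothetical sales cube used to illustrate roll-up, slice and pivot.
import pandas as pd

cube = pd.DataFrame({
    "city":    ["Palo Alto", "Palo Alto", "Chennai", "Chennai"],
    "region":  ["West", "West", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [100, 120, 90, 110],
})

# Roll-up: climb the location hierarchy city -> region (dimension reduction).
rollup = cube.groupby(["region", "quarter"])["sales"].sum()

# Slice: fix one dimension value to get a sub-cube.
slice_q1 = cube[cube["quarter"] == "Q1"]

# Pivot (rotate): re-orient the cube as a 2-D view, cities vs. quarters.
pivot = cube.pivot_table(index="city", columns="quarter", values="sales", aggfunc="sum")

print(rollup, slice_q1, pivot, sep="\n\n")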
1.6.1 OLAP Guidelines
Dr. E. F. Codd, the "father" of the relational model, created a list of rules to deal with OLAP systems. Users should prioritize these rules according to their needs and business requirements. These rules are:
1) Multidimensional conceptual view: The OLAP tool should provide an appropriate multidimensional business model that suits the business problems and requirements.
2) Transparency: The OLAP tool should provide transparency of the input data to the users.
3) Accessibility: The OLAP tool should access only the data required for the analysis needed.
4) Consistent reporting performance: The size of the database should not affect performance in any way.
5) Client/server architecture: The OLAP tool should use a client/server architecture to ensure better performance and flexibility.
6) Generic dimensionality: Data entered should be equivalent to the structure and operation requirements.
7) Dynamic sparse matrix handling: The OLAP tool should be able to manage sparse matrices and so maintain the level of performance.
8) Multi-user support: The OLAP tool should allow several users to work together concurrently.
9) Unrestricted cross-dimensional operations: The OLAP tool should be able to perform operations across the dimensions of the cube.
10) Intuitive data manipulation: "Consolidation path re-orientation, drilling down across columns or rows, zooming out, and other manipulation inherent in the consolidation path outlines should be accomplished via direct action upon the cells of the analytical model, and should neither require the use of a menu nor multiple trips across the user interface."
11) Flexible reporting: The ability of the tool to present rows and columns in a manner suitable for analysis.
12) Unlimited dimensions and aggregation levels: This depends on the kind of business; multiple dimensions and defined hierarchies should be possible.
In addition to these guidelines an OLAP system should also support:
• Comprehensive database management tools: This gives database management control over distributed businesses.
• The ability to drill down to detail source record level: This requires that the OLAP tool allow smooth transitions in the multidimensional database.
• Incremental database refresh: The OLAP tool should provide partial refresh.
• Structured Query Language (SQL) interface: The OLAP system should be able to integrate effectively into the surrounding enterprise environment.
1.7 Data warehouse implementation
1.7.1 Efficient Data Cube Computation
A data cube can be viewed as a lattice of cuboids.
• The bottom-most cuboid is the base cuboid.
• The top-most cuboid (apex) contains only one cell.
• How many cuboids are there in an n-dimensional cube with L levels? If dimension i has Li levels, the total number of cuboids is the product of (Li + 1) over all n dimensions.
Materialization of the data cube
• Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization).
• Selection of which cuboids to materialize is based on size, sharing, access frequency, etc.
Cube definition and computation in DMQL:
define cube sales[item, city, year]: sum(sales_in_dollars)
compute cube sales
This can be transformed into a SQL-like language (with a new operator, cube by, introduced by Gray et al. '96):
SELECT item, city, year, SUM(amount)
FROM SALES
CUBE BY item, city, year
This requires computing the following group-bys:
(item, city, year), (item, city), (item, year), (city, year), (item), (city), (year), ()
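As a rough sketch of what CUBE BY computes, the following Python snippet materializes every group-by (cuboid) of the three dimensions item, city and year for a small, made-up sales table; it is not the algorithm used by any particular warehouse engine.

# Full cube materialization sketch: compute all 2^3 group-bys of (item, city, year).
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "item": ["TV", "TV", "Phone"],
    "city": ["Chennai", "Mumbai", "Chennai"],
    "year": [2023, 2023, 2024],
    "amount": [1000, 1500, 800],
})

dims = ["item", "city", "year"]
cuboids = {}
for k in range(len(dims), -1, -1):
    for group in combinations(dims, k):
        if group:                                  # e.g. (item, city), (year), ...
            cuboids[group] = sales.groupby(list(group))["amount"].sum()
        else:                                      # apex cuboid (): grand total
            cuboids[group] = sales["amount"].sum()

for group, result in cuboids.items():
    print(group, result, sep="\n")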
• Join index: JI(R-id, S-id), where relation R(R-id, ...) is joined with S(S-id, ...).
• Traditional indices map values to a list of record ids.
• A join index materializes a relational join in the JI file and speeds up the relational join, which is a rather costly operation.
• In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
  o E.g. for a fact table Sales and two dimensions city and product, a join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city.
• Join indices can span multiple dimensions.
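A toy sketch of a join index in plain Python (the fact rows are hypothetical): for each distinct city value, the index keeps the R-IDs of the matching fact-table rows, so a query restricted to one city touches only those rows.

# Build a join index on the city dimension of a tiny, made-up Sales fact table.
sales_fact = [
    {"rid": 1, "city": "Chennai", "amount": 1000},
    {"rid": 2, "city": "Mumbai",  "amount": 1500},
    {"rid": 3, "city": "Chennai", "amount": 800},
]

join_index = {}
for row in sales_fact:
    join_index.setdefault(row["city"], []).append(row["rid"])

print(join_index)            # {'Chennai': [1, 3], 'Mumbai': [2]}

# Answering "total sales in Chennai" touches only the indexed rows,
# avoiding a full join between the city dimension and the fact table.
chennai_rids = set(join_index["Chennai"])
chennai_total = sum(r["amount"] for r in sales_fact if r["rid"] in chennai_rids)
print(chennai_total)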
1.8 Data Warehouse to Data Mining
1.8.1 Data Warehouse Usage
There are three kinds of data warehouse applications:
• Information processing
  o supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
• Analytical processing
  o multidimensional analysis of data warehouse data
  o supports basic OLAP operations: slice and dice, drilling, pivoting
• Data mining
  o knowledge discovery from hidden patterns
  o supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
Differences among the three tasks
1.8.2 From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
Why online analytical mining?
• High quality of data in data warehouses
  o A data warehouse contains integrated, consistent, cleaned data.
• Available information processing infrastructure surrounding data warehouses
  o ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools.
• OLAP-based exploratory data analysis
  o mining with drilling, dicing, pivoting, etc.
• On-line selection of data mining functions
  o integration and swapping of multiple mining functions, algorithms and tasks.
1.8.3 Architecture of OLAM
UNIT II DATA PREPROCESSING, LANGUAGE, ARCHITECTURES, CONCEPT DESCRIPTION
Why preprocessing − Cleaning − Integration − Transformation − Reduction − Discretization − Concept hierarchy generation − Data mining primitives − Query language − Graphical user interfaces − Architectures − Concept description − Data generalization − Characterizations − Class comparisons − Descriptive statistical measures.
2.1 Data preprocessing
Data preprocessing transforms the data into a format that will be more easily and effectively processed for the purpose of the user. It describes any type of processing performed on raw data to prepare it for another processing procedure, and is commonly used as a preliminary data mining practice.
We need data preprocessing because data in the real world are dirty: they can be incomplete, noisy and inconsistent. Such data need to be preprocessed in order to improve the quality of the data and the quality of the mining results.
• If there is no quality data, then there are no quality mining results. Quality decisions are always based on quality data.
• If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult.
• Incomplete data may come from
  o "Not applicable" data values when collected
  o Different considerations between the time when the data was collected and when it is analyzed
  o Human/hardware/software problems
  o e.g., occupation=" ".
• Noisy data (incorrect values) may come from
  o Faulty data collection instruments
  o Human or computer error at data entry
  o Errors in data transmission
  o e.g., Salary="-10".
• Inconsistent data may come from
  o Different data sources
  o Functional dependency violation (e.g., modifying some linked data)
  o Discrepancies in codes or names, e.g., Age="42", Birthday="03/07/1997".
2.5.1 Major Tasks in Data Preprocessing
• Data cleaning
  o Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration
  o Integration of multiple databases, data cubes, or files.
• Data transformation
  o Normalization and aggregation.
• Data reduction
  o Obtains a reduced representation in volume that produces the same or similar analytical results.
• Data discretization
  o Part of data reduction, but of particular importance for numerical data.
Fig. 2.5.1.1 Forms of Data Preprocessing
2.2 Data cleaning
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
i. Missing Values
The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple: When the class label is missing, the tuple can be ignored. This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
(b) Manually filling in the missing value: In general, this approach is time-consuming and may not be a reasonable task for large data sets with many missing values, especially when the value to be filled in is not easily determined.
(c) Using a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown", or −∞. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, namely "Unknown".
(d) Using the attribute mean for quantitative (numeric) values, or the attribute mode for categorical (nominal) values, for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
(e) Using the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
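Strategy (d) above, filling a missing numeric value with the attribute mean of tuples in the same class, can be sketched as follows. The snippet uses Python with pandas, and the customer data and column names are assumptions for illustration.

# Fill each missing income with the mean income of its credit-risk class.
import pandas as pd

customers = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high"],
    "income":      [50000.0, None, 20000.0, 22000.0],
})

customers["income"] = (customers.groupby("credit_risk")["income"]
                                .transform(lambda s: s.fillna(s.mean())))
print(customers)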
ii. Noisy Data
Noise is a random error or variance in a measured variable. Data smoothing techniques are used for removing such noisy data. Several data smoothing techniques are used:
a. Binning
b. Regression
c. Clustering
1. Binning methods: Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it. The sorted values are distributed into a number of 'buckets', or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. In this technique:
  1. The data are first sorted.
  2. The sorted list is partitioned into equi-depth bins.
  3. Then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
    a. Smoothing by bin means: each value in the bin is replaced by the mean value of the bin.
    b. Smoothing by bin medians: each value in the bin is replaced by the bin median.
    c. Smoothing by bin boundaries: the minimum and maximum values of a bin are identified as the bin boundaries. Each bin value is replaced by the closest boundary value.
• Example: Binning Methods for Data Smoothing
  o Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  o Partition into (equi-depth) bins (depth 4, since each bin contains four values):
    Bin 1: 4, 8, 9, 15
    Bin 2: 21, 21, 24, 25
    Bin 3: 26, 28, 29, 34
  o Smoothing by bin means:
    Bin 1: 9, 9, 9, 9
    Bin 2: 23, 23, 23, 23
    Bin 3: 29, 29, 29, 29
  o Smoothing by bin boundaries:
    Bin 1: 4, 4, 4, 15
    Bin 2: 21, 21, 25, 25
    Bin 3: 26, 26, 26, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, 9 and 15 in Bin 1 is 9, so each original value in this bin is replaced by the value 9. Smoothing by bin medians can also be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value.
Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. The following steps are required:
• Step 1: Sort the data. (This step is not required here, as the data are already sorted.)
• Step 2: Partition the data into equi-depth bins of depth 3.
  Bin 1: 13, 15, 16   Bin 2: 16, 19, 20   Bin 3: 20, 21, 22
  Bin 4: 22, 25, 25   Bin 5: 25, 25, 30   Bin 6: 33, 33, 35
  Bin 7: 35, 35, 35   Bin 8: 36, 40, 45   Bin 9: 46, 52, 70
• Step 3: Calculate the arithmetic mean of each bin.
• Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
  Bin 1: 14, 14, 14   Bin 2: 18, 18, 18   Bin 3: 21, 21, 21
  Bin 4: 24, 24, 24   Bin 5: 26, 26, 26   Bin 6: 33, 33, 33
  Bin 7: 35, 35, 35   Bin 8: 40, 40, 40   Bin 9: 56, 56, 56
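The price example above can be reproduced with a short, plain-Python sketch of equi-depth binning followed by smoothing by bin means and by bin boundaries (rounding of the bin means is an assumption made to match the worked values).

# Equi-depth binning and smoothing of the price example.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
depth = 4                                                 # 3 bins of 4 values each

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]
smoothed_by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
smoothed_by_boundaries = [[min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b]
                          for b in bins]

print(bins)                     # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed_by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smoothed_by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]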
2. Regression: smooth by fitting the data to regression functions.
• Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other.
Fig. 2.5.1.2 Regression
• Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface. Using regression to find a mathematical equation to fit the data helps smooth out the noise.
3. Clustering: Outliers in the data may be detected by clustering, where similar values are organized into groups, or 'clusters'. Values that fall outside of the set of clusters may be considered outliers.
Fig. 2.5.1.3 Clustering
iii. Data Cleaning Process
• Field overloading is a source of errors that typically occurs when developers compress new attribute definitions into unused portions of already defined attributes.
• A unique rule says that each value of the given attribute must be different from all other values of that attribute.
• A consecutive rule says that there can be no missing values between the lowest and highest values of the attribute, and that all values must also be unique.
• A null rule specifies the use of blanks, question marks, special characters or other strings that may indicate the null condition, and how such values should be handled.
2.3 Data Integration
Data integration combines data from multiple sources into a coherent store. There are a number of issues to consider during data integration.
Issues:
• Schema integration: refers to the integration of metadata from different sources.
• Entity identification problem: identifying an entity in one data source that is the same as an entity in another. For example, customer_id in one database and customer_no in another database may refer to the same entity.
• Detecting and resolving data value conflicts: attribute values from different sources can differ due to different representations or different scales, e.g. metric vs. British units.
• Redundancy: redundancy can occur due to the following reasons:
  o Object identification: the same attribute may have different names in different databases.
  o Derived data: one attribute may be derived from another attribute.
  Correlation analysis is used to detect such redundancy.
2.4 Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:
• Smoothing, which is used to remove noise from the data; it includes binning, regression and clustering.
• Aggregation, where summary operations are applied to the data. For example, daily sales data may be aggregated so as to compute monthly and annual total amounts.
• Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes like street can be generalized to higher-level concepts like city or country.
• Normalization, which is used to scale the attribute data to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0 (a short sketch follows this section).
• Attribute construction (or feature construction), which is used to construct new attributes that can be added to the given set of attributes to help the mining process.
2.5 Data Reduction
Data reduction techniques are used to obtain a reduced representation of the data set. Various strategies are used for data reduction:
1. Data cube aggregation, which applies aggregation operations to the data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant or redundant attributes or dimensions may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by smaller data representations, such as parametric models or nonparametric methods like clustering, sampling, and the use of histograms.
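As noted under data transformation, here is a small sketch of normalization in plain Python: min-max scaling into [0, 1] and z-score scaling of a hypothetical income column.

# Min-max and z-score normalization of a small, made-up attribute.
values = [12000.0, 35000.0, 58000.0, 98000.0]

lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]          # falls within [0.0, 1.0]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_score = [(v - mean) / std for v in values]              # mean 0, unit variance

print(min_max)
print(z_score)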
• 31. 4. Numerosity reduction, where the data are replaced or estimated by smaller data representations such as parametric models, or by nonparametric methods such as clustering, sampling, and the use of histograms.
2.6 Data Discretization
Raw data values for attributes are replaced by ranges or higher conceptual levels in data discretization. The various methods used in data discretization are Binning, Histogram Analysis, Entropy-Based Discretization, Interval Merging by χ2 (chi-square) Analysis, and Clustering.
 Three types of attributes:
 Nominal — values from an unordered set
 Ordinal — values from an ordered set
 Continuous — real numbers
 Discretization:
 divide the range of a continuous attribute into intervals
 Some classification algorithms only accept categorical attributes.
 Reduce data size by discretization.
Concept hierarchies
 reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).
 Prepare for further analysis
 Binning
 Histogram analysis
 Clustering analysis
 Entropy-based discretization
 Segmentation by natural partitioning
Entropy-Based Discretization
 Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
E(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)
 The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
 The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., until the information gain of the best split falls below a threshold.
 Experiments show that it may reduce data size and improve classification accuracy 31
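A minimal Python sketch of the entropy-based split step described above. The sample values, class labels, and the min_gain stopping threshold are illustrative assumptions, not values from the text.

import math

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_split(values, labels, min_gain=0.01):
    """Choose the boundary T that minimizes E(S, T); return None when the
    information gain is below min_gain (the stopping criterion)."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2.0     # candidate boundary
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        e = (len(left) / len(pairs)) * entropy(left) + \
            (len(right) / len(pairs)) * entropy(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t if base - best_e >= min_gain else None

# Hypothetical attribute values with class labels
ages = [23, 25, 31, 35, 40, 46, 52, 60]
cls = ["no", "no", "no", "yes", "yes", "yes", "no", "no"]
print(best_split(ages, cls))   # boundary of the first binary split, e.g. 33.0

Applying best_split recursively to the two resulting intervals gives the full discretization described above.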
• 32. Concept hierarchy generation for categorical data
 Specification of a partial ordering of attributes explicitly at the schema level by users or experts
 Specification of a portion of a hierarchy by explicit data grouping
 Specification of a set of attributes, but not of their partial ordering
 Specification of only a partial set of attributes
2.7 Data Mining Primitives
 Finding all the patterns autonomously in a database? — unrealistic, because the patterns could be too many but uninteresting
 Data mining should be an interactive process
 The user directs what is to be mined
 Users must be provided with a set of primitives to be used to communicate with the data mining system
 Incorporating these primitives in a data mining query language
 More flexible user interaction
 Foundation for the design of graphical user interfaces
 Standardization of data mining industry and practice
Task-relevant data
 Database or data warehouse name
 Database tables or data warehouse cubes
 Condition for data selection
 Relevant attributes or dimensions
 Data grouping criteria
Types of knowledge to be mined
 Characterization
 Discrimination
 Association
 Classification/prediction
 Clustering
 Outlier analysis
 Other data mining tasks
Background knowledge: Concept hierarchies
 Schema hierarchy
 E.g., street < city < province_or_state < country
 Set-grouping hierarchy
 E.g., {20-39} = young, {40-59} = middle_aged 32
• 33.  Operation-derived hierarchy
 email address: login-name < department < university < country
 Rule-based hierarchy
 low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
Measurements of Pattern Interestingness
 Simplicity: e.g., (association) rule length, (decision) tree size
 Certainty: e.g., confidence, P(A|B) = n(A and B) / n(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
 Utility: potential usefulness, e.g., support (association), noise threshold (description)
 Novelty: not previously known, surprising (also used to remove redundant rules, e.g., the Canada vs. Vancouver rule implication support ratio)
2.8 Data Mining Query Language (DMQL)
 Motivation
 A DMQL can provide the ability to support ad-hoc and interactive data mining
 By providing a standardized language like SQL
 Hope to achieve an effect similar to the one SQL has had on relational databases
 Foundation for system development and evolution
 Facilitates information exchange, technology transfer, commercialization and wide acceptance
 Design
 DMQL is designed with the primitives described earlier
Syntax for DMQL
 Syntax for specification of
 task-relevant data
 the kind of knowledge to be mined
 concept hierarchy specification
 interestingness measures
 pattern presentation and visualization
 Putting it all together — a DMQL query 33
• 34. Syntax for task-relevant data specification
 use database database_name, or use data warehouse data_warehouse_name
 from relation(s)/cube(s) [where condition]
 in relevance to att_or_dim_list
 order by order_list
 group by grouping_list
 having condition
Syntax for specifying the kind of knowledge to be mined
 Characterization
Mine_Knowledge_Specification ::= mine characteristics [as pattern_name] analyze measure(s)
 Discrimination
Mine_Knowledge_Specification ::= mine comparison [as pattern_name] for target_class where target_condition {versus contrast_class_i where contrast_condition_i} analyze measure(s)
 Association
Mine_Knowledge_Specification ::= mine associations [as pattern_name]
 Classification
Mine_Knowledge_Specification ::= mine classification [as pattern_name] analyze classifying_attribute_or_dimension
 Prediction
Mine_Knowledge_Specification ::= mine prediction [as pattern_name] analyze prediction_attribute_or_dimension {set {attribute_or_dimension_i = value_i}}
Syntax for concept hierarchy specification
 To specify what concept hierarchies to use: use hierarchy <hierarchy> for <attribute_or_dimension>
 We use different syntax to define different types of hierarchies
 schema hierarchies: define hierarchy time_hierarchy on date as [date, month, quarter, year] 34
• 35.  set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
Syntax for interestingness measure specification
 Interestingness measures and thresholds can be specified by the user with the statement: with <interest_measure_name> threshold = threshold_value
 Example: with support threshold = 0.05 with confidence threshold = 0.7
2.9 Designing Graphical User Interfaces based on a data mining query language
 What tasks should be considered in the design of GUIs based on a data mining query language?
 Data collection and data mining query composition
 Presentation of discovered patterns
 Hierarchy specification and manipulation
 Manipulation of data mining primitives
 Interactive multilevel mining
 Other miscellaneous information
2.10 Data Mining System Architectures
 Coupling a data mining system with a DB/DW system
 No coupling—flat file processing, not recommended
 Loose coupling
 Fetching data from the DB/DW
 Semi-tight coupling—enhanced DM performance
 Provide efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
 Tight coupling—a uniform information processing environment
 DM is smoothly integrated into a DB/DW system; mining queries are optimized based on mining query analysis, data structures, indexing, query processing methods, etc. 35
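As a rough illustration of the coupling options above, the following Python sketch fetches task-relevant data from a relational store (loose coupling) and, for contrast, pushes a simple aggregation primitive into SQL instead of computing it in the application, which is the idea behind semi-tight coupling. The table name, columns, and rows are hypothetical.

import sqlite3

# Hypothetical task-relevant data in a relational table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("computer", "north", 1200.0),
                  ("software", "north", 300.0),
                  ("computer", "south", 1150.0)])

# Loose coupling: fetch the task-relevant data and do all mining work here
rows = conn.execute("SELECT item, amount FROM sales").fetchall()
totals = {}
for item, amount in rows:
    totals[item] = totals.get(item, 0.0) + amount
print(totals)                      # aggregation computed in the application

# Semi-tight coupling: let the DB system execute the aggregation primitive
for item, total in conn.execute(
        "SELECT item, SUM(amount) FROM sales GROUP BY item"):
    print(item, total)             # aggregation computed inside the DB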
• 36. 2.11 Concept Description
 Descriptive vs. predictive data mining
 Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms
 Predictive mining: based on data and analysis, constructs models for the database, and predicts the trend and properties of unknown data
 Concept description:
 Characterization: provides a concise and succinct summarization of the given collection of data
 Comparison: provides descriptions comparing two or more collections of data
Concept Description vs. OLAP
 Concept description:
 can handle complex data types of the attributes and their aggregations
 a more automated process
 OLAP:
 restricted to a small number of dimension and measure types
 user-controlled process
2.12 Data Generalization and Summarization-based Characterization
 Data generalization
 A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
 Approaches:
 Data cube approach (OLAP approach)
 Attribute-oriented induction approach
Characterization: Data Cube Approach (without using AO-Induction)
 Perform computations and store results in data cubes
 Strength
 An efficient implementation of data generalization
 Computation of various kinds of measures
 e.g., count( ), sum( ), average( ), max( )
 Generalization and specialization can be performed on a data cube by roll-up and drill-down
 Limitations 36
• 37.  handle only dimensions of simple nonnumeric data and measures of simple aggregated numeric values.
 Lack of intelligent analysis; it cannot tell which dimensions should be used or what level the generalization should reach
Attribute-Oriented Induction
 Proposed in 1989 (KDD ‘89 workshop)
 Not confined to categorical data nor particular measures.
 How is it done?
 Collect the task-relevant data (initial relation) using a relational database query
 Perform generalization by attribute removal or attribute generalization.
 Apply aggregation by merging identical, generalized tuples and accumulating their respective counts.
 Interactive presentation with users.
Basic Principles of Attribute-Oriented Induction
 Data focusing: task-relevant data, including dimensions, and the result is the initial relation.
 Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes.
 Attribute-generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
 Attribute-threshold control: typically 2-8, specified/default.
 Generalized relation threshold control: control the final relation/rule size.
Basic Algorithm for Attribute-Oriented Induction
 InitialRel: query processing of task-relevant data, deriving the initial relation.
 PreGen: based on the analysis of the number of distinct values in each attribute, determine a generalization plan for each attribute: removal? or how high to generalize?
 PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts. 37
• 38.  Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
Example
 DMQL: Describe general characteristics of graduate students in the Big_University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in “graduate”
 Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD”}
Presentation of Generalized Results
 Generalized relation:
 Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
 Cross tabulation:
 Mapping results into cross tabulation form (similar to contingency tables).
 Visualization techniques:
 Pie charts, bar charts, curves, cubes, and other visual forms.
 Quantitative characteristic rules:
 Mapping generalized results into characteristic rules with quantitative information associated with them, e.g., rules annotated with t-weights indicating what percentage of the target class tuples each generalized tuple represents.
2.13 Mining Class Comparisons
 Comparison: comparing two or more classes.
 Method:
 Partition the set of relevant data into the target class and the contrasting class(es)
 Generalize both classes to the same high level concepts
 Compare tuples with the same high level descriptions
 Present for every tuple its description and two measures:
 support - distribution within single class
 comparison - distribution between classes
 Highlight the tuples with strong discriminant features
 Relevance Analysis:
 Find attributes (features) which best distinguish different classes. 38
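Relevance analysis is often done with a measure such as information gain. The short Python sketch below ranks the attributes of a small relation this way, so that weakly relevant attributes can be dropped before generalization; the attribute names and tuples are illustrative assumptions, not taken from the Big_University_DB example.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(tuples, attr_index, labels):
    """Expected reduction in entropy when partitioning on one attribute."""
    base = entropy(labels)
    groups = {}
    for row, y in zip(tuples, labels):
        groups.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical tuples: (gender, major, residence_type); class = student status
data = [("M", "CS", "urban"), ("F", "CS", "rural"),
        ("F", "EE", "urban"), ("M", "EE", "urban")]
cls = ["graduate", "graduate", "undergraduate", "undergraduate"]

attrs = ["gender", "major", "residence_type"]
ranked = sorted(range(len(attrs)),
                key=lambda i: info_gain(data, i, cls), reverse=True)
print([attrs[i] for i in ranked])   # most relevant attribute first

Attributes whose gain falls below a chosen relevance threshold would be removed, as in step 2 of the analytical comparison example that follows.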
• 39.  Task
 Compare graduate and undergraduate students using a discriminant rule.
 DMQL query
 use Big_University_DB
 mine comparison as “grad_vs_undergrad_students”
 in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
 for “graduate_students”
 where status in “graduate”
 versus “undergraduate_students”
 where status in “undergraduate”
 analyze count%
 from student
Example: Analytical comparison (2)
 Given
 attributes name, gender, major, birth_place, birth_date, residence, phone# and gpa
 Gen(ai) = concept hierarchies on attributes ai
 Ui = attribute analytical thresholds for attributes ai
 Ti = attribute generalization thresholds for attributes ai
 R = attribute relevance threshold
Example: Analytical comparison (3)
 1. Data collection
 target and contrasting classes
 2. Attribute relevance analysis
 remove attributes name, gender, major, phone#
 3. Synchronous generalization
 controlled by user-specified dimension thresholds
 prime target and contrasting class(es) relations/cuboids
Class Description
 Quantitative characteristic rule: gives a necessary condition for the target class
 Quantitative discriminant rule: gives a sufficient condition
 Quantitative description rule: gives a necessary and sufficient condition 39
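A minimal Python sketch of the generalization step shared by characterization (2.12) and class comparison (2.13): attributes with too many distinct values are removed or generalized through a concept hierarchy, and identical generalized tuples are merged while their counts are accumulated. The hierarchy, threshold, and tuples are illustrative assumptions.

from collections import Counter

# Hypothetical concept hierarchy: city -> country
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Bombay": "India", "Chicago": "USA"}

def generalize(tuples, attrs, hierarchies, threshold=3):
    """Attribute-oriented induction: remove or generalize attributes with
    more than `threshold` distinct values, then merge identical tuples."""
    plan = {}
    for a in attrs:
        distinct = {t[a] for t in tuples}
        if len(distinct) <= threshold:
            plan[a] = "keep"
        elif a in hierarchies:
            plan[a] = "generalize"
        else:
            plan[a] = "remove"
    rows = []
    for t in tuples:
        row = []
        for a in attrs:
            if plan[a] == "keep":
                row.append((a, t[a]))
            elif plan[a] == "generalize":
                row.append((a, hierarchies[a][t[a]]))
        rows.append(tuple(row))
    return Counter(rows)        # prime generalized relation with counts

students = [
    {"gender": "M", "birth_place": "Vancouver"},
    {"gender": "M", "birth_place": "Toronto"},
    {"gender": "F", "birth_place": "Toronto"},
    {"gender": "M", "birth_place": "Bombay"},
    {"gender": "F", "birth_place": "Chicago"},
]
for row, count in generalize(students, ["gender", "birth_place"],
                             {"birth_place": city_to_country}).items():
    print(dict(row), "count =", count)

For a comparison task, the same generalization would be applied synchronously to the target and contrasting classes before their counts are compared.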
• 40. 2.14 Mining descriptive statistical measures in large databases
2.14.1 Mining Data Dispersion Characteristics
 Motivation
 To better understand the data: central tendency, variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
 Mean
 Weighted arithmetic mean
 Median: a holistic measure
 Middle value if odd number of values, or average of the middle two values otherwise
 estimated by interpolation
 Mode
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 Empirical formula: mean − mode = 3 × (mean − median)
Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 − Q1
 Five number summary: min, Q1, M (median), Q3, max
 Boxplot: ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
 Variance and standard deviation
 Variance s2: (algebraic, scalable computation)
 Standard deviation s is the square root of variance s2
Boxplot Analysis
 Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
 Boxplot 40
• 41.  Data is represented with a box
 The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extend to Minimum and Maximum
Graphic Displays of Basic Statistical Descriptions
 Histogram: (shown before)
 Boxplot: (covered before)
 Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi % of the data are ≤ xi
 Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
 Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
 Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence
UNIT III ASSOCIATION RULES
Association rule mining − Single-dimensional boolean association rules from transactional databases − Multi level association rules from transaction databases
3.1 Association rule mining:
Association rule mining is used for finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. It searches for interesting relationships among items in a given data set.
3.1.1 Market basket analysis: Electronic shops
A motivating example for association rule mining
• Motivation: finding regularities in data
• What products were often purchased together? — Beer and diapers?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
Association rule mining is used for analyzing buying behavior. Frequently purchased items can be placed in close proximity in order to further encourage the sale of such items together. If customers who purchase computers also tend to buy financial management software at the same time, then placing the hardware display close to the software display may help to increase the sales of both of these items. 41
• 42. Each basket can then be represented by a Boolean vector of values assigned to these variables. The Boolean vectors can be analyzed for buying patterns which reflect items that are frequently associated or purchased together. These patterns can be represented in the form of association rules. For example, the information that customers who purchase computers also tend to buy financial management software at the same time is represented in the following association rule:
computer => financial management software [support = 2%; confidence = 60%]
A typical example of association rule mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”.
Fig. 3.1.1 Market basket analysis
The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers to do selective marketing and plan their shelf space. For instance, placing milk and bread within close proximity may further encourage the sale of these items together within single visits to the store.
3.1.2 Basic Concepts: Frequent Patterns and Association Rules
• Itemset X = {x1, …, xk}
• Find all the rules X => Y with minimum confidence and support
• support, s, probability that a transaction contains X ∪ Y 42
• 43. • confidence, c, conditional probability that a transaction having X also contains Y.
Rule support and confidence are two measures of rule interestingness. A support of 2% for the above association rule means that 2% of all the transactions under analysis show that computer and financial management software are purchased together. A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts. Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence threshold (min conf) are called strong. By convention, we write min sup and min conf values so as to occur between 0% and 100%, rather than between 0 and 1.0.
• A set of items is referred to as an itemset.
• An itemset that contains k items is a k-itemset.
• The set {computer, financial management software} is a 2-itemset.
• The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known as the frequency or support count of the itemset.
• The number of transactions required for the itemset to satisfy minimum support is referred to as the minimum support count.
Association rule mining - a two-step process:
Step 1: Find all frequent itemsets. By definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count.
Step 2: Generate strong association rules from the frequent itemsets. By definition, these rules must satisfy minimum support and minimum confidence.
3.1.3 Association rule mining:
Association rules can be classified in various ways, based on the following criteria:
1. Based on the types of values handled in the rule:
• If a rule concerns associations between the presence or absence of items, it is a Boolean association rule.
• If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, quantitative values for items or attributes are partitioned into intervals. 43
• 44. age(X, “30…39”) ∧ income(X, “42K…48K”) => buys(X, “high resolution TV”)
2. Based on the dimensions of data involved in the rule:
If the items or attributes in an association rule each reference only one dimension, then it is a single-dimensional association rule. The earlier rule “computer => financial management software” could be rewritten as
buys(X, “computer”) => buys(X, “financial management software”)
This is a single-dimensional association rule since it refers to only one dimension, i.e., buys. If a rule references two or more dimensions, such as the dimensions buys, time of transaction, and customer category, then it is a multidimensional association rule.
3. Based on the levels of abstraction involved in the rule set:
Some methods for association rule mining can find rules at differing levels of abstraction. For example, suppose that a set of mined association rules included
age(X, “30…39”) => buys(X, “laptop computer”)
age(X, “30…39”) => buys(X, “computer”)
In these examples the items bought are referenced at different levels of abstraction. We refer to the rule set mined as consisting of multilevel association rules. If, instead, the rules within a given set do not reference items or attributes at different levels of abstraction, then the set contains single-level association rules.
4. Based on the nature of the association involved in the rule:
Association mining can be extended to correlation analysis, where the absence or presence of correlated items can be identified.
3.2 Mining single-dimensional Boolean association rules from transactional databases
This section presents methods for mining the simplest form of association rules — single-dimensional, single-level, Boolean association rules, such as those discussed for market basket analysis — starting with Apriori, a basic algorithm for finding frequent itemsets, together with a procedure for generating strong association rules from frequent itemsets.
3.2.1 The Apriori algorithm: Finding frequent itemsets
Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties. 44
• 45. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found. This set is denoted L1. L1 is used to find L2, the frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.
The Apriori property: all non-empty subsets of a frequent itemset must also be frequent. By definition, if an itemset I does not satisfy the minimum support threshold, s, then I is not frequent, i.e., P(I) < s. If an item A is added to the itemset I, then the resulting itemset cannot occur more frequently than I. This property belongs to a special category of properties called anti-monotone, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It is called anti-monotone because the property is monotonic in the context of failing a test.
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li. By convention, Apriori assumes that items within a transaction or itemset are sorted in increasing lexicographic order. It also ensures that no duplicates are generated.
2. The prune step: Ck is a superset of Lk, that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk. Ck can be huge, and so this could involve heavy computation.
The Apriori Algorithm
Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Pseudo-code:
Ck: candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do 45
• 46.         increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
To reduce the size of Ck, the Apriori property is used as follows. Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets. 46
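A compact, self-contained Python sketch of the pseudo-code above, including the join and prune steps, together with the rule-generation step described in the next part. The transactions and thresholds are illustrative; the join is done with simple set unions and the prune with plain set lookups rather than the lexicographic join and hash tree used in full implementations.

from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                       # first scan: 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: merge (k-1)-itemsets whose union has k items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

def rules(frequent, min_conf):
    """Generate strong rules s => (l - s) from the frequent itemsets."""
    out = []
    for l, sup_l in frequent.items():
        if len(l) < 2:
            continue
        for i in range(1, len(l)):
            for s in map(frozenset, combinations(l, i)):
                conf = sup_l / frequent[s]       # support_count(l)/support_count(s)
                if conf >= min_conf:
                    out.append((set(s), set(l - s), conf))
    return out

db = [["computer", "software", "printer"],
      ["computer", "software"],
      ["computer", "printer"],
      ["software"]]
freq = apriori(db, min_sup=2)
for lhs, rhs, conf in rules(freq, min_conf=0.6):
    print(lhs, "=>", rhs, "confidence =", round(conf, 2))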
• 47. Fig. 3.2.1.1 Transactional data for an All Electronics branch
Let's look at a concrete example of Apriori, based on the All Electronics transaction database, D, of Fig. 3.2.1.1. There are nine transactions in this database.
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum transaction support count required is 2 (i.e., min sup = 2). The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets having minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses L1×L1 to generate a candidate set of 2-itemsets, C2.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3: based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort of unnecessarily obtaining their counts during the subsequent scan of D to determine L3.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3×L3 to generate a candidate set of 4-itemsets, C4.
Generating association rules from frequent itemsets:
Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). This can be done using the following equation for confidence, where the conditional probability is expressed in terms of itemset support counts:
confidence(A => B) = P(B | A) = support_count(A ∪ B) / support_count(A)
• support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and
• support_count(A) is the number of transactions containing the itemset A.
• Based on this equation, association rules can be generated as follows.
• For each frequent itemset l, generate all non-empty subsets of l.
• For every non-empty subset s of l, output the rule “s => (l − s)” if support_count(l) / support_count(s) ≥ min_conf, 47
• 48. where min_conf is the minimum confidence threshold.
Variations of the Apriori algorithm
Many variations of the Apriori algorithm have been proposed. A number of these variations are enumerated below. Methods 1 to 5 focus on improving the efficiency of the original algorithm, while method 6 considers transactions over time.
1. A hash-based technique: hashing itemset counts.
A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For example, when scanning each transaction in the database to generate the frequent 1-itemsets, L1, we can also generate all of the 2-itemsets for each transaction, hash them into the buckets of a hash table structure, and increase the corresponding bucket counts. A 2-itemset whose corresponding bucket count in the hash table is below the support threshold cannot be frequent and thus should be removed from the candidate set. Such a hash-based technique may substantially reduce the number of candidate k-itemsets examined (especially when k = 2).
2. Transaction reduction: reducing the number of transactions scanned in future iterations.
A transaction which does not contain any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration, since subsequent scans of the database for j-itemsets, where j > k, will not require it.
3. Partitioning: it is used for partitioning the data to find candidate itemsets.
A partitioning technique can be used which requires just two database scans to mine the frequent itemsets. It consists of two phases.
• In Phase I, the algorithm subdivides the transactions of D into n non-overlapping partitions. If the minimum support threshold for transactions in D is min_sup, then the minimum itemset support count for a partition is min_sup × the number of transactions in that partition.
• For each partition, all frequent itemsets within the partition are found. These are referred to as local frequent itemsets.
• The procedure employs a special data structure which, for each itemset, records the TIDs of the transactions containing the items in the itemset. This allows it to find all of the local frequent k-itemsets, for k = 1, 2, …, in just one scan of the database.
• The collection of frequent itemsets from all partitions forms a global candidate itemset with respect to D.
• In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent 48
• 49. itemsets. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase.
4. Sampling: it is used for mining on a subset of the given data.
The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.
5. Dynamic itemset counting: it adds candidate itemsets at different points during a scan.
A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points. In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately prior to each complete database scan. The technique is dynamic in that it estimates the support of all of the itemsets that have been counted so far, adding new candidate itemsets if all of their subsets are estimated to be frequent. The resulting algorithm requires two database scans.
6. Calendric market basket analysis: finding itemsets that are frequent in a set of user-defined time intervals.
Calendric market basket analysis uses transaction time stamps to define subsets of the given database.
Frequent itemsets can also be mined without candidate generation: the FP-growth method compresses the database into a frequent-pattern tree (FP-tree) and mines the tree directly.
Construct FP-tree from a Transaction DB
Steps:
1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree 49
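A minimal Python sketch of the three construction steps listed above: one scan to count items, ordering each transaction by descending frequency, and a second scan that inserts the ordered transactions into a shared-prefix tree. Only the tree with node counts is built; the conditional-pattern-base mining described next is omitted, and the example transactions and min_sup are assumptions.

from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Step 1: scan once to find frequent items and their counts
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    root = FPNode(None)
    # Steps 2 and 3: order each transaction by descending frequency, then insert
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, parent=node)
            child.count += 1            # shared prefixes accumulate counts
            node = child
    return root

def show(node, depth=0):
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
    for child in node.children.values():
        show(child, depth + 1)

db = [["f", "a", "c", "m", "p"],
      ["f", "a", "c", "b", "m"],
      ["f", "b"],
      ["c", "b", "p"],
      ["f", "a", "c", "m", "p"]]
show(build_fp_tree(db, min_sup=3))

Any fixed frequency-descending order works; ties are broken here alphabetically, which is an arbitrary design choice.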
• 50. Benefits of the FP-tree Structure
 Completeness:
 never breaks a long pattern of any transaction
 preserves complete information for frequent pattern mining
 Compactness
 reduces irrelevant information—infrequent items are gone
 frequency descending ordering: more frequent items are more likely to be shared
 never larger than the original database (not counting node-links and counts)
 Example: for the Connect-4 DB, the compression ratio could be over 100
Mining Frequent Patterns Using FP-tree
 General idea (divide-and-conquer)
 Recursively grow frequent pattern paths using the FP-tree
 Method
 For each item, construct its conditional pattern-base, and then its conditional FP-tree
 Repeat the process on each newly created conditional FP-tree
 Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in the FP-tree 50
• 51. 2) Construct conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
 If the conditional FP-tree contains a single path, simply enumerate all the patterns
3.3 Mining multilevel association rules from transaction databases
Multilevel association rules
For many applications, it is difficult to find strong associations among data items at low or primitive levels of abstraction due to the sparsity of data in multidimensional space. Strong associations discovered at very high concept levels may represent common sense knowledge.
Example: Suppose we are given the task-relevant set of transactional data for sales at the computer department of an All Electronics branch, showing the items purchased for each transaction TID.
A concept hierarchy defines a sequence of mappings from a set of low level concepts to higher level, more general concepts. Data can be generalized by replacing low level concepts within the data by their higher level concepts, or ancestors, from a concept hierarchy.
Fig. 3.3.1 Class Hierarchy
The concept hierarchy has four levels, referred to as levels 0, 1, 2, and 3. By convention, levels within a concept hierarchy are numbered from top to bottom, starting with level 0 at the root node, all (the most general abstraction level).
• Level 1 includes computer, software, printer and computer accessory,
• Level 2 includes home computer, laptop computer, education software, financial management software, .., and
• Level 3 includes IBM home computer, .., Microsoft educational software, and so on.
Level 3 represents the most specific abstraction level of this hierarchy. 51
  • 52. Fig. 2.3.2 Multilevel Mining with Reduced Support Rules generated from association rule mining with concept hierarchies are called multiple-level or multilevel association rules, since they consider more than one concept level. Approaches to mining multilevel association rules In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at the concept level 1 and working towards the lower, more specific concept levels, until no more frequent itemsets can be found. That is, once all frequent itemsets at concept level 1 are found, then the frequent itemsets at level 2 are found, and so on. For each level, any algorithm for discovering frequent itemsets may be used, such as Apriori or its variations. 1. Using uniform minimum support for all levels (referred to as uniform support): The same minimum support threshold is used when mining at each level of abstraction. For example, a minimum support threshold of 5% is used throughout (e.g., for mining from “computer" down to “laptop computer"). Both “computer" and “laptop computer" are found to be frequent, while “home computer" is not. When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify only one minimum support threshold. An optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its descendents: the search avoids examining itemsets containing any item whose ancestors do not have minimum support. Fig. 2.7.3.2.1 Multilevel Mining with Uniform Support 52
• 53. A drawback of the uniform support approach is that items at lower levels of abstraction are unlikely to occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, it could miss several meaningful associations occurring at low abstraction levels. If the threshold is set too low, it may generate many uninteresting associations occurring at high abstraction levels. This provides the motivation for the following approach.
2. Using reduced minimum support at lower levels (referred to as reduced support): Each level of abstraction has its own minimum support threshold. The lower the abstraction level is, the smaller the corresponding threshold is. For example, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively. In this way, “computer", “laptop computer", and “home computer" are all considered frequent.
Fig. 2.7.3.2.2 Multilevel Mining with Reduced Support
For mining multiple-level associations with reduced support, there are a number of alternative search strategies. These include:
1. Level-by-level independent: This is a full breadth search, where no background knowledge of frequent itemsets is used for pruning. Each node is examined, regardless of whether or not its parent node is found to be frequent.
2. Level-cross filtering by single item: An item at the i-th level is examined if and only if its parent node at the (i-1)-th level is frequent. If a node is frequent, its children will be examined; otherwise, its descendants are pruned from the search. For example, the descendant nodes of “computer" (i.e., “laptop computer" and “home computer") are not examined, since “computer" is not frequent.
3. Level-cross filtering by k-itemset: A k-itemset at the i-th level is examined if and only if its corresponding parent k-itemset at the (i-1)-th level is frequent. 53
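A small Python sketch of the top-down, reduced-support strategy described above: transactions over leaf-level items are generalized through a hypothetical concept hierarchy, and frequent concepts are computed level by level, each level with its own minimum support. Only single items are counted to keep the sketch short; in practice each level would be mined with Apriori or FP-growth.

from collections import Counter

# Hypothetical hierarchy: leaf item -> (level-2 concept, level-1 concept)
hierarchy = {
    "IBM home computer":      ("home computer", "computer"),
    "Dell laptop":            ("laptop computer", "computer"),
    "Microsoft edu software": ("education software", "software"),
    "Quicken":                ("financial mgmt software", "software"),
}

transactions = [
    ["IBM home computer", "Quicken"],
    ["Dell laptop", "Microsoft edu software"],
    ["Dell laptop", "Quicken"],
    ["IBM home computer", "Microsoft edu software"],
]

# Reduced minimum support: smaller threshold at the lower, more specific level
min_sup = {1: 3, 2: 2}

def frequent_items_at_level(level):
    counts = Counter()
    for t in transactions:
        concepts = {hierarchy[item][0] if level == 2 else hierarchy[item][1]
                    for item in t}
        counts.update(concepts)
    return {c: n for c, n in counts.items() if n >= min_sup[level]}

for level in (1, 2):                     # top-down, level by level
    print("level", level, frequent_items_at_level(level))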
• 54. UNIT IV CLASSIFICATION AND CLUSTERING
Classification and prediction − Issues − Decision tree induction − Bayesian classification – Association rule based − Other classification methods − Prediction − Classifier accuracy − Cluster analysis – Types of data − Categorization of methods − Partitioning methods − Outlier analysis.
4.1 Classification vs. Prediction
 Classification:
 predicts categorical class labels
 classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
 Prediction:
 models continuous-valued functions, i.e., predicts unknown or missing values
 Typical applications
 credit approval
 target marketing
 medical diagnosis
 treatment effectiveness analysis
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
 The set of tuples used for model construction: training set
 The model is represented as classification rules, decision trees, or mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate the accuracy of the model
 The known label of the test sample is compared with the classified result from the model
 Accuracy rate is the percentage of test set samples that are correctly classified by the model
 The test set is independent of the training set; otherwise over-fitting will occur 54
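A brief Python sketch of the two-step process just described: a model is constructed on a training set and then used on an independent test set to estimate its accuracy. It assumes scikit-learn is available; the tiny feature matrix and labels are made up for illustration, and the decision-tree learner (covered in Section 4.3) is just one possible choice of model.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical samples: [age_code, income_code] with a class label
X = [[0, 1], [1, 1], [1, 0], [2, 0], [2, 1], [0, 0], [1, 1], [2, 0]]
y = ["no", "yes", "yes", "no", "yes", "no", "yes", "no"]

# Step 1: model construction on the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier(criterion="entropy")  # information-gain splits
model.fit(X_train, y_train)

# Step 2: model usage — classify the held-out test set and estimate accuracy
predictions = model.predict(X_test)
print("accuracy =", accuracy_score(y_test, predictions))

Keeping the test set separate from the training set is what makes the accuracy estimate a guard against over-fitting.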
• 55.  Supervised learning (classification)
 Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
4.2 Issues regarding classification and prediction: Evaluating Classification Methods
 Predictive accuracy
 Speed and scalability
 time to construct the model
 time to use the model
 Robustness
 handling noise and missing values
 Scalability
 efficiency in disk-resident databases
 Interpretability:
 understanding and insight provided by the model
 Goodness of rules 55
  • 56.  decision tree size  compactness of classification rules 4.3 Classification by Decision Tree Induction • Decision tree o A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute. o Each branch represents an outcome of the test, and each leaf node holds a class label. o The topmost node in a tree is the root node. o Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. o Some decision tree algorithms produce only binary trees whereas others can produce non binary trees. • Decision tree generation consists of two phases o Tree construction  Attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes. o Tree pruning  Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data. • Use of decision tree: Classifying an unknown sample o Test the attribute values of the sample against the decision tree Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) o Tree is constructed in a top-down recursive divide-and-conquer manner o At start, all the training examples are at the root o Attributes are categorical (if continuous-valued, they are discretized in advance) o Examples are partitioned recursively based on selected attributes o Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning o All samples for a given node belong to the same class o There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf o There are no samples left Attribute Selection Measure • Information gain o All attributes are assumed to be categorical and can be modified for continuous-valued attributes 56
• 57. • Gini index
o All attributes are assumed continuous-valued
o Assume there exist several possible split values for each attribute
o May need other tools, such as clustering, to get the possible split values
o Can be modified for categorical attributes
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
o Let the set of examples S contain p elements of class P and n elements of class N
o The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
I(p, n) = − (p / (p + n)) log2(p / (p + n)) − (n / (p + n)) log2(n / (p + n))
Information Gain in Decision Tree Induction:
• Assume that using attribute A a set S will be partitioned into sets {S1, S2, …, Sv}
• If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is
E(A) = Σ i=1..v ((pi + ni) / (p + n)) · I(pi, ni)
• The encoding information that would be gained by branching on A is
Gain(A) = I(p, n) − E(A)
4.4 Bayesian Classification:
• Probabilistic Learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic Prediction: predict multiple hypotheses, weighted by their probabilities
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem 57
• 58. • Given training data D, the posteriori probability of a hypothesis h, P(h | D), follows the Bayes theorem:
P(h | D) = P(D | h) P(h) / P(D)
• MAP (maximum a posteriori) hypothesis:
h_MAP = arg max h∈H P(h | D) = arg max h∈H P(D | h) P(h)
• Practical difficulty: it requires initial knowledge of many probabilities, and significant computational cost.
Naïve Bayes Classifier
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
o Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, …, xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, …, An.
o Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X; that is, X is assigned to class Ci if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
o The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.
o As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = … = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
o The attributes are assumed to be conditionally independent of one another, given the class label of the tuple. 58
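A short Python sketch of the naïve Bayesian classifier described above, for categorical attributes, using the conditional-independence assumption P(X|Ci) = Πk P(xk|Ci). The training tuples and the Laplace smoothing constant are illustrative assumptions.

from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(Ci) and P(xk | Ci) counts from categorical training tuples."""
    priors = Counter(labels)
    cond = defaultdict(Counter)          # (class, attr_index) -> value counts
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            cond[(c, k)][v] += 1
    return priors, cond, len(labels)

def classify(x, priors, cond, n, smooth=1.0):
    """Pick the class maximizing P(Ci) * prod_k P(xk | Ci)."""
    best_class, best_score = None, -1.0
    for c, class_count in priors.items():
        score = class_count / n                      # P(Ci)
        for k, v in enumerate(x):
            counts = cond[(c, k)]
            # Laplace-smoothed estimate of P(xk | Ci)
            score *= (counts[v] + smooth) / (class_count + smooth * (len(counts) + 1))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical tuples: (age, student) with class buys_computer
X = [("youth", "no"), ("youth", "yes"), ("middle", "no"),
     ("senior", "yes"), ("senior", "no"), ("middle", "yes")]
y = ["no", "yes", "yes", "yes", "no", "yes"]

priors, cond, n = train_naive_bayes(X, y)
print(classify(("youth", "yes"), priors, cond, n))   # predicted class label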
• 59. • The “IF”-part (or left-hand side) of a rule is known as the rule antecedent or precondition. The “THEN”-part (or right-hand side) is the rule consequent. R1 can also be written as
R1: (age = youth) ∧ (student = yes) → (buys computer = yes).
Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
human         warm        yes         no       no             mammals
python        cold        no          no       no             reptiles
salmon        cold        no          no       yes            fishes
whale         warm        yes         no       yes            mammals
frog          cold        no          no       sometimes      amphibians
komodo        cold        no          no       no             reptiles
bat           warm        yes         yes      no             mammals
pigeon        warm        no          yes      no             birds
cat           warm        yes         no       no             mammals
leopard shark cold        yes         no       yes            fishes
turtle        cold        no          no       sometimes      reptiles
penguin       warm        no          no       sometimes      birds
porcupine     warm        yes         no       no             mammals
eel           cold        no          no       yes            fishes
salamander    cold        no          no       sometimes      amphibians
gila monster  cold        no          no       no             reptiles
platypus      warm        no          no       no             mammals
owl           warm        no          yes      no             birds
dolphin       warm        yes         no       yes            mammals
eagle         warm        no          yes      no             birds
Fig. If-Then Example
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
Advantages of Rule-Based Classifiers
• As highly expressive as decision trees 59