This document provides an overview of data warehousing and data mining. It defines a data warehouse as a centralized repository of integrated data from various sources used to support management decision making. Key characteristics of a data warehouse include being subject-oriented, integrated, non-volatile, and time-variant. The document contrasts operational data with data in a warehouse and discusses components of a data warehouse system like data acquisition, staging areas, and data marts. It also outlines the history and growth of data warehousing and data mining as well as their applications in domains like marketing, finance, fraud detection, and more.
5. Which are our
lowest/highest margin
customers ?
Who are my customers
and what products
are they buying?
Which customers
are most likely to go
to the competition ?
What impact will
new products/services
have on revenue
and margins?
What product prom-
-otions have the biggest
impact on revenue?
What is the most
effective distribution
channel?
A PRODUCER WANTS TO KNOW….
6. DATA, DATA EVERYWHERE
YET ...
I can’t get the data I need
need an expert to get the data
I can’t understand the data I found
available data poorly documented
I can’t use the data I found
results are unexpected
data needs to be transformed from
one form to other
I can’t find the data I need
data is scattered over the network
many versions, subtle differences
7. 1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-
relational, OO, deductive, etc.) and application-oriented DBMS
(spatial, scientific, engineering, etc.)
1990s—2000s:
Data mining and data warehousing, multimedia databases, and
Web databases
7
8. DATA WAREHOUSE
The data warehouse is that portion of an overall
Architected Data Environment that serves as the
single integrated source of data for processing
information.
12. HOW IS THE WAREHOUSE
DIFFERENT?
The data warehouse is distinctly different from
the operational data used and maintained by day-
to-day operational systems. Data warehousing is
not simply an “access wrapper” for operational
data, where data is simply “dumped” into tables
for direct access.
13. OPERATIONAL DATA
Application oriented
Detailed
Accurate, as of the moment
of access
Serves the clerical
community
Performance sensitive
(immediate response required
when entering a transaction)
Flexible structure; variable
contents
Small amount of data used in
a process
DATA WAREHOUSE
Subject oriented
Summarized
Represents values over time
Serves the managerial
community
Performance relaxed
(immediacy not required)
Static structure
large amount of data used in
a process
15. Data explosion problem
Automated data collection tools and mature database
technology lead to tremendous amounts of data stored in
databases, data warehouses and other information
repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases
15
16. Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns
from data in large databases
Alternative names:
Data mining: a misnomer?
Knowledge discovery(mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting, business
intelligence, etc.
What is not data mining?
(Deductive) query processing.
Expert systems or small ML/statistical programs
16
17. Database analysis and decision support
Market analysis and management
target marketing, customer relation management, market
basket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved
underwriting, quality control, competitive analysis
Fraud detection and management
Other Applications
Text mining (news group, email, documents)
Stream data mining
Web mining.
DNA data analysis
17
18. Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Conversion of single to a joint bank account: marriage, etc.
Cross-market analysis
Associations/co-relations between product sales
Prediction based on the association information
18
19. 19
Customer profiling
data mining can tell you what types of customers buy what
products (clustering or classification)
Identifying customer requirements
identifying the best products for different customers
use prediction to find what factors will attract new customers
Provides summary information
various multidimensional summary reports
statistical summary information (data central tendency and
variation)
20. Finance planning and asset evaluation
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
Resource planning:
summarize and compare the resources and spending
Competition:
monitor competitors and market directions
group customers into classes and a class-based pricing procedure
set pricing strategy in a highly competitive market
20
21. Applications
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent behavior and use
data mining to help identify similar instances
Examples
auto insurance: detect a group of people who stage accidents to
collect on insurance
money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients and ring of
doctors and ring of references
21
22. 22
Detecting inappropriate medical treatment
Australian Health Insurance Commission identifies that in many
cases blanket screening tests were requested (save Australian
$1m/yr.).
Detecting telephone fraud
Telephone call model: destination of the call, duration, time of
day or week. Analyze patterns that deviate from an expected
norm.
British Telecom identified discrete groups of callers with frequent
intra-group calls, especially mobile phones, and broke a
multimillion dollar fraud.
Retail
Analysts estimate that 38% of retail shrink is due to dishonest
employees.
23. Sports
IBM Advanced Scout analyzed NBA game statistics (shots
blocked, assists, and fouls) to gain competitive advantage for
New York Knicks and Miami Heat
Astronomy
JPL and the Palomar Observatory discovered 22 quasars with the
help of data mining
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to Web access logs
for market-related pages to discover customer preference and
behavior pages, analyzing effectiveness of Web
marketing, improving Web site organization, etc.
23
Notas del editor
A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context
Subject-Oriented: Information is presented according to specific subjects or areas of interest, not simply as computer files. Data is manipulated to provide information about a particular subject. For example, the SRDB is not simply made accessible to end-users, but is provided structure and organized according to the specific needs. Integrated: A single source of information for and about understanding multiple areas of interest. The data warehouse provides one-stop shopping and contains information about a variety of subjects. Thus the OIRAP data warehouse has information on students, faculty and staff, instructional workload, and student outcomes. Non-Volatile: Stable information that doesn’t change each time an operational process is executed. Information is consistent regardless of when the warehouse is accessed. Time-Variant: Containing a history of the subject, as well as current information. Historical information is an important component of a data warehouse. Accessible: The primary purpose of a data warehouse is to provide readily accessible information to end-users. Process-Oriented: It is important to view data warehousing as a process for delivery of information. The maintenance of a data warehouse is on-going and iterative in nature.
Data Mart: A data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single, analytic application used by a distinct set of workers. Staging Area: Any data store that is designed primarily to receive data into a warehousing environment. OLAP (On-Line Analytical Processing): A method by which multidimensional analysis occurs where multidimensional analysis is the ability to manipulate information by a variety of relevant categories or “dimensions” to facilitate analysis and understanding of the underlying data. It is also sometimes referred to as “drilling-down”, “drilling-across” and “slicing and dicing”OLAP Tools: A set of software products that attempt to facilitate multidimensional analysis. Can incorporate data acquisition, data access, data manipulation, or any combination thereof.