Lecture 05 - The Data Warehouse and Technology

Building the Data Warehouse by
Inmon

Chapter 5: The Data Warehouse and Technology

http://it-slideshares.blogspot.com/

5.0 Overview
The data warehouse requires a simpler set of
technological features than its operational
predecessors:
◦ Online updating: not needed.
◦ Locking, integrity: needs are minimal.
◦ Teleprocessing interface: only a very basic
one is required.
This chapter outlines some of the
technological requirements for the data
warehouse.

MANAGING LARGE
AMOUNTS OF DATA

1. Manage volumes
2. Manage multiple
media
3. Index and
monitor data
4. Interface to receive
and pass data

Managing Multiple Media
Following is a hierarchy of storage of data in terms
of speed of access and cost of storage:

Medium            Speed of access   Cost of storage
Main memory       Very fast         Very expensive
Expanded memory   Very fast         Expensive
Cache             Very fast         Expensive
DASD              Fast              Moderate
Magnetic tape     Not fast          Not expensive
Near line         Not fast*         Not expensive
Optical disk      Not slow          Not expensive
Fiche             Slow              Cheap

*Not fast to find first record sought; very fast to find all other records in the block.

Indexing and Monitoring Data
Monitoring data warehouse data
determines such factors as the following:
◦ If a reorganization needs to be done
◦ If an index is poorly structured
◦ If too much or not enough data is in overflow
◦ The statistical composition of the access of
the data
◦ Available remaining space
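
The monitoring factors above can be sketched as a small check over per-table statistics. This is a minimal illustration only: the `TableStats` fields, the `monitor` function, and both thresholds are hypothetical and do not come from any particular DBMS.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Hypothetical statistics a warehouse monitor might collect per table."""
    name: str
    rows_in_overflow: int
    total_rows: int
    used_bytes: int
    allocated_bytes: int


def monitor(stats, overflow_limit=0.10, space_limit=0.90):
    """Return findings for one table; both limits are illustrative defaults."""
    findings = []
    overflow_ratio = (
        stats.rows_in_overflow / stats.total_rows if stats.total_rows else 0.0
    )
    if overflow_ratio > overflow_limit:
        findings.append("reorganize: too much data in overflow")
    if stats.used_bytes / stats.allocated_bytes > space_limit:
        findings.append("low remaining space")
    return findings


crowded = TableStats("sales", rows_in_overflow=200, total_rows=1000,
                     used_bytes=95, allocated_bytes=100)
healthy = TableStats("customer", rows_in_overflow=10, total_rows=1000,
                     used_bytes=50, allocated_bytes=100)
crowded_findings = monitor(crowded)
healthy_findings = monitor(healthy)
```

A real monitor would also track index structure and the statistical pattern of access; those are left out here to keep the sketch short.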

Interfaces to Many Technologies
The interface to different technologies requires
several considerations:
Does the data pass from one DBMS to another
easily?
Does it pass from one operating system to another
easily?
Does it change its basic format in passage (EBCDIC,
ASCII, and so forth)?
Can passage into multidimensional processing be
done easily?
Can selected increments of data, such as those from
changed data capture (CDC), be passed rather than
entire tables?
Is the context of data lost in translation as data is
moved to other environments?
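
The format question above (EBCDIC versus ASCII) can be illustrated with a short sketch. It assumes the source system uses the `cp037` EBCDIC code page, which is only one of several EBCDIC variants; the function name `ebcdic_to_ascii` is hypothetical.

```python
def ebcdic_to_ascii(raw: bytes, ebcdic_codec: str = "cp037") -> bytes:
    """Convert EBCDIC bytes to ASCII bytes.

    Assumption: the data is text in the cp037 code page. Characters with
    no ASCII equivalent are replaced with '?' rather than raising.
    """
    text = raw.decode(ebcdic_codec)
    return text.encode("ascii", errors="replace")


# Simulate a record arriving from an EBCDIC-based system.
ebcdic_record = "HELLO".encode("cp037")
ascii_record = ebcdic_to_ascii(ebcdic_record)
```

The same bytes mean different things under each encoding, which is why data that passes between environments without conversion loses its meaning.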

PROGRAMMER OR
DESIGNER CONTROL OF
DATA PLACEMENT

Place data at block/page
level
Manage data in parallel
Solid metadata control
Rich language interface

Parallel Storage and Management of
Data

Metadata Management
Data warehouse table structures
Data warehouse table attribution
Data warehouse source data (the system of
record)
Mapping from the system of record to the data
warehouse
Data model specification
Extract logging
Common routines for access of data
Definitions and/or descriptions of data
Relationships of one unit of data to another

Language Interface
Typically, the language interface to the data
warehouse should do the following:
◦ Be able to access data a set at a time
◦ Be able to access data a record at a time
◦ Specifically ensure that one or more indexes
will be used in the satisfaction of a query
◦ Have an SQL interface
◦ Be able to insert, delete, or update data
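
The requirements above can be sketched with Python's built-in `sqlite3` module standing in for the warehouse DBMS. The `sales` table, its columns, and the index name are hypothetical; `INDEXED BY` is SQLite-specific syntax, used here as one example of explicitly requiring an index for a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("west", 20), ("east", 5)],
)
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")

# Set at a time: one SQL statement answers the whole question.
total_east = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = ?", ("east",)
).fetchone()[0]

# Record at a time: the program walks the rows one by one.
# INDEXED BY makes SQLite fail rather than silently skip the index.
running_total = 0
cursor = conn.execute(
    "SELECT region, amount FROM sales INDEXED BY idx_sales_region "
    "WHERE region = ?",
    ("east",),
)
for region, amount in cursor:
    running_total += amount
```

Both paths reach the same answer; the difference is whether the DBMS or the program does the iteration.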

EFFICIENT LOADING OF
DATA

Load efficiently
Use indexes efficiently
Store data in a compact
way
Support compound
keys

Efficient Index Utilization

Technology can support efficient index access in
several ways:
◦ Using bit maps
◦ Having multileveled indexes
◦ Storing all or parts of an index in main memory
◦ Compacting the index entries when the order of
the data being indexed allows such compaction
◦ Creating selective indexes and range indexes
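
As a rough illustration of the first technique in the list, the sketch below builds a bitmap index using Python integers as bit sets. The function names and sample columns are hypothetical, and real DBMS bitmap indexes add compression and other refinements that are omitted here.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose set bits are the row ids."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_from_bitmap(bitmap):
    """Return the row ids whose bits are set, in ascending order."""
    rows = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


regions = ["east", "west", "east", "north"]
statuses = ["open", "open", "closed", "open"]
region_index = build_bitmap_index(regions)
status_index = build_bitmap_index(statuses)

# A two-condition query becomes a single bitwise AND of two bitmaps.
east_rows = rows_from_bitmap(region_index["east"])
east_and_open = rows_from_bitmap(region_index["east"] & status_index["open"])
```

Bitmaps suit columns with few distinct values, because each value needs only one bit per row and conditions combine with cheap bitwise operations.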

Compaction of Data
Compaction helps manage large amounts of data.
The programmer gets the most out of a given
I/O when data is stored compactly.
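
The effect of compaction on I/O can be estimated with simple arithmetic. The sketch below assumes a hypothetical 4096-byte block and two hypothetical record layouts; it only counts how many fixed-size records fit in one block, which is one way to read "getting the most out of a given I/O".

```python
import struct

BLOCK_SIZE = 4096  # assumed block size in bytes, for illustration only

# Loose layout: 8-byte integer key, 8-byte float, 20-byte text field.
LOOSE_RECORD = struct.calcsize("<qd20s")
# Compact layout: 4-byte integer key, 4-byte float, 8-byte code field.
COMPACT_RECORD = struct.calcsize("<if8s")


def records_per_block(record_size, block_size=BLOCK_SIZE):
    """Number of whole records retrieved by reading one block."""
    return block_size // record_size


loose_per_block = records_per_block(LOOSE_RECORD)
compact_per_block = records_per_block(COMPACT_RECORD)
```

Under these assumptions the compact layout returns more than twice as many records for the same single read.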

Compound Keys
Compound keys occur throughout the data warehouse,
primarily because of:
The time variancy of data warehouse data.
Key-foreign key relationships, which are quite
common in the atomic data.
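
A compound key that includes an element of time might look like the following sketch, again using `sqlite3` as a stand-in DBMS. The `account_balance` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE account_balance (
        account_id    INTEGER NOT NULL,
        snapshot_date TEXT    NOT NULL,
        balance       INTEGER NOT NULL,
        PRIMARY KEY (account_id, snapshot_date)  -- compound key with time
    )
    """
)
conn.executemany(
    "INSERT INTO account_balance VALUES (?, ?, ?)",
    [(1, "2024-01-31", 100), (1, "2024-02-29", 120), (2, "2024-01-31", 50)],
)

# The same account may appear many times, but only once per snapshot date.
try:
    conn.execute("INSERT INTO account_balance VALUES (1, '2024-01-31', 999)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

history = conn.execute(
    "SELECT snapshot_date, balance FROM account_balance "
    "WHERE account_id = 1 ORDER BY snapshot_date"
).fetchall()
```

Neither column identifies a row on its own; the key is unique only as a pair, which is what lets the table hold history.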

VARIABLE-LENGTH
DATA
Manage variable-length data efficiently
Lock manager under explicit control at the programmer level
Able to do index-only processing
Restore data in bulk efficiently

Lock Management
Ensures that two or more people are not
updating the same record at the same
time.
The ability to turn the lock manager off and on is
necessary.

Index-Only Processing
Servicing a request by looking in an index (or
indexes), without going to the primary source of
data.
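
Index-only processing can be observed in SQLite, used here purely as an illustration. When an index holds every column a query needs, SQLite's query plan is expected to report a covering index, meaning the table itself is not read. The table, index, and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales "
    "(id INTEGER PRIMARY KEY, region TEXT, amount INTEGER, note TEXT)"
)
# The index contains both the filter column and the selected column.
conn.execute("CREATE INDEX idx_region_amount ON sales (region, amount)")
conn.executemany(
    "INSERT INTO sales (region, amount, note) VALUES (?, ?, ?)",
    [("east", 10, "a"), ("west", 20, "b"), ("east", 5, "c")],
)

plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT amount FROM sales WHERE region = ?",
    ("east",),
).fetchall()
# The human-readable description is the last column of each plan row.
plan_text = " ".join(row[3] for row in plan_rows)
uses_covering_index = "COVERING INDEX" in plan_text

amounts = sorted(
    row[0]
    for row in conn.execute(
        "SELECT amount FROM sales WHERE region = ?", ("east",)
    )
)
```

Selecting the `note` column as well would force a visit to the table, because that column is not part of the index.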

Fast Restore
The capability to quickly restore a data
warehouse table from non-DASD storage

Other Technological Features
Some features of classical DBMS technology play only
a small role in the data warehouse, including the
following:
◦ Transaction integrity
◦ High-speed buffering
◦ Row- or page-level locking
◦ Referential integrity
◦ VIEWs of data
◦ Partial block loading

DBMS Types and the Data
Warehouse
Data warehouses manage massive amounts of data
because they contain:
Granular, atomic detail
Historical information
Summary as well as detailed data
Because record-level, transaction-based updates are a
regular feature of the general-purpose DBMS, it must
offer facilities for:
Locking
COMMITs
Checkpoints
Log tape processing
Deadlock
Backout

Changing DBMS Technology
Such a change may be in order for several reasons:
Newer DBMS technologies may be available.
The size of the warehouse has grown.
Use of the warehouse has escalated and changed.
The basic DBMS decision must be revisited from
time to time.
Should the decision be made to go to a new DBMS
technology, what are the considerations?
Will the new DBMS technology meet the foreseeable
requirements?
How will the conversion from the older DBMS
technology to the newer DBMS technology be done?
How will the transformation programs be converted?

Multidimensional DBMS and the
Data Warehouse

The multidimensional DBMS:
1. holds at least an order of magnitude less data.
2. is geared for very heavy and unpredictable access
and analysis of data.
3. holds a much shorter time horizon of data.
4. allows unfettered access.

The data warehouse:
1. holds massive amounts of data.
2. is geared for a limited amount of flexible access.
3. contains data with a very lengthy time horizon
(from 5 to 10 years).
4. allows analysts to access its data in a constrained
fashion.

Rather than the data warehouse being housed in a
multidimensional DBMS, the two enjoy a complementary
relationship.

Multidimensional DBMS and the Data Warehouse con’t

Following is the relational foundation for
multidimensional DBMS data marts:
Strengths:
Can support a lot of data.
Can support dynamic joining of data.
Has proven technology.
Is capable of supporting general-purpose update
processing.
If there is no known pattern of usage of data,
then the relational structure is as good as any
other.
Weaknesses:
Has performance that is less than optimal.
Cannot be purely optimized for access.

Following is the cube foundation for multidimensional
DBMS data marts:
Strengths:
Performance that is optimal for DSS processing.
Can be optimized for very fast access of data.
If pattern of access of data is known, then the structure of
data can be optimized.
Can easily be sliced and diced.
Can be examined in many ways.
Weaknesses:
Cannot handle nearly as much data as a standard
relational format.
Does not support general-purpose update processing.
May take a long time to load.
If access is desired on a path not supported by the design
of the data, the structure is not flexible.
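
The "slice and dice" strength of the cube foundation can be sketched with a plain dictionary keyed by dimension values. This is a toy model under stated assumptions: the three dimensions (region, product, quarter), the sample facts, and the function names are all hypothetical, and real multidimensional DBMSs precompute and store aggregates far more efficiently.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, amount).
facts = [
    ("east", "widget", "2024-Q1", 10),
    ("east", "gadget", "2024-Q1", 4),
    ("west", "widget", "2024-Q1", 7),
    ("east", "widget", "2024-Q2", 12),
]


def build_cube(rows):
    """Aggregate amounts into cells keyed by (region, product, quarter)."""
    cube = defaultdict(int)
    for region, product, quarter, amount in rows:
        cube[(region, product, quarter)] += amount
    return dict(cube)


def slice_cube(cube, region=None, product=None, quarter=None):
    """Sum the cells matching the fixed dimensions; None means 'all'."""
    wanted = (region, product, quarter)
    return sum(
        value
        for key, value in cube.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    )


cube = build_cube(facts)
east_total = slice_cube(cube, region="east")
widget_q1_total = slice_cube(cube, product="widget", quarter="2024-Q1")
grand_total = slice_cube(cube)
```

The weakness noted above shows up here too: any dimension that was not chosen when the cube was built (for example, salesperson) simply cannot be queried.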

MULTIDIMENSIONAL DBMS
AND THE DATA
WAREHOUSE CON’T

Data Warehousing across Multiple
Storage Media
A large amount of data is spread across
more than one storage medium.
◦ One processing environment is the DASD
environment where online, interactive
processing is done.
◦ The other processing environment is often a
tape or mass store environment

The Role of Metadata in the Data
Warehouse Environment

Context and Content
The contents of a report can be understood only
when their context is explained.

Three Types of Contextual
Information
Three levels of contextual information must be
managed:
Simple contextual information
Complex contextual information
External contextual information
Simple contextual information relates to the basic
structure of data itself, and includes such things
as these:
The structure of data
The encoding of data
The naming conventions used for data
The metrics describing the data, such as:
◦ How much data there is
◦ How fast the data is growing
◦ What sectors of the data are growing

Three Types of Contextual Information con’t
Complex contextual information addresses such aspects
of data as these:
◦ Product definitions
◦ Marketing territories
◦ Pricing
◦ Packaging
◦ Organization structure
◦ Distribution

Three Types of Contextual Information con’t
Some examples of external contextual
information include the following:
Economic forecasts:
◦ Inflation
◦ Financial trends
◦ Taxation
◦ Economic growth
Political information
Competitive information
Technological advancements
Consumer demographic movements

Capturing and Managing Contextual
Information
Complex and external contextual types
of information are hard to capture and
quantify because they are so
unstructured.

Looking at the Past
Past attempts at managing contextual information
had shortcomings, including the following:
The information management attempts
were aimed at the information systems
developer, not the end user.
Attempts at contextual management
were passive.
Attempts at contextual information
management were in many cases
removed from the development effort.
Attempts to manage contextual

Refreshing the Data Warehouse
Reading a log tape to refresh the data warehouse
is no small matter. Many obstacles are in the way,
including the following:
The log tape contains much extraneous
data.
The log tape format is often arcane.
The log tape contains spanned records.
The log tape often contains addresses
instead of data values.
The log tape reflects the idiosyncrasies of

Testing
It is very unusual to find a separate test
environment, of the kind common in the operational
world, in the world of the data
warehouse, for the following reasons:
Data warehouses are so large that a
corporation has a hard time justifying one
of them, much less two of them.
The nature of the development life cycle
for the data warehouse is iterative.
For the most part, programs are run in a
heuristic manner, not in a repetitive

Summary
Some technological features are
required:
Robust language interface
Compound keys
Variable-length data
The abilities to do the following:
◦ Manage large amounts of data
◦ Manage data on diverse media
◦ Easily index and monitor data
◦ Interface with a wide number of technologies
◦ Allow the programmer to place the data directly on
the physical device
◦ Store and access data in parallel
◦ Have metadata control of the warehouse
◦ Efficiently load the warehouse
◦ Efficiently use indexes
◦ Store data in a compact way
◦ Support compound keys
◦ Selectively turn off the lock manager
◦ Do index-only processing
◦ Quickly restore from bulk storage

Summary con’t
The data architect must recognize the
differences between a transaction-based
DBMS and a data warehouse-based
DBMS.

Summary con’t
Multidimensional OLAP technology is suited for
data mart processing and not data warehouse
processing.

When the data mart approach is used as a basis for
data warehousing, many problems become evident:
The number of extract programs grows large.
Each new multidimensional database must return to
the legacy operational environment for its own data.
There is no basis for reconciliation of differences in
analysis.
A tremendous amount of redundant data among
different multidimensional DBMS environments
exists.

Summary con’t
Metadata in the data warehouse
environment plays a very different role
than metadata in the operational legacy
environment.
