2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
Datawarehouse and OLAP
2. What is Data Warehouse?
A data ware is a
Subject
orient
Integrated
Time variant(historic perspective)
Nonvolatile
collection
of data in support of management’s
decision making process(William H Inmon)
Semantically consistent data store that serve as
a physical implementation of decision support
data model
3. Data Warehouse—Subject-Oriented
3
Organized around major subjects, such as customer,
product, sales
Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing
Provide a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process
4. Data Warehouse—Integrated
4
Constructed by integrating multiple, heterogeneous data
sources
relational databases, flat files, on-line transaction
records
Data cleaning and data integration techniques are
applied.
Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
5. Data Warehouse—Time Variant
5
The time horizon for the data warehouse is significantly
longer than that of operational systems
Operational database: current value data
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain
“time element”
6. Data Warehouse—Nonvolatile
6
A physically separate store of data transformed from the
operational environment
Operational update of data does not occur in the data
warehouse environment
Does not require transaction processing, recovery, and
concurrency control mechanisms
Requires only two operations in data accessing:
initia l lo a d ing
o f d a ta and a c c e s s o f d a ta
7. Differences between Operational Database
Systems and Data Warehouses
Online Operational Database Systems
Perform online transaction and query processing system
Online transaction processing (OLTP) systems
Cover the day-to-day operations of an organization
(purchasing,inventory,banking,payroll etc..)
Data warehouses
Serve users or knowledge workers in the role of
data analysis and decision making
Online analytical processing(OLAP)systems
8. Differences between OLTP and OLAP
Users and system orientation
Data contents
OLTP– customer oriented
OLAP-market-oriented
OLTP–manages current data easily used for decision making
OLAP-manages large amount of historic data, provides
facilities for summarization.
Database design
OLTP–entity-relationship(ER) data model and
application-oriented DB design
OLAP-star or snowflake model and a subjectoriented database design
9. Differences between OLTP and OLAP(contd..)
View
OLTP–manages current data within an enterprise or
department without referring to historic data or data
in different organization
OLAP-deal with information from different
organization ,integrating information from many
data stores
Access patterns
OLTP- Mainly of short, atomic transaction
OLAP-complex queries
10. OLTP vs. OLAP
10
OLTP
OLAP
users
clerk, IT professional
knowledge worker
function
day to day operations
decision support
DB design
application-oriented
subject-oriented
data
current, up-to-date
detailed, flat relational
isolated
repetitive
historical,
summarized, multidimensional
integrated, consolidated
ad-hoc
lots of scans
unit of work
read/write
index/hash on prim. key
short, simple transaction
# records accessed
tens
millions
#users
thousands
hundreds
DB size
100MB-GB
100GB-TB
metric
transaction throughput
query throughput, response
usage
access
complex query
11. 11
Why a Separate Data
Warehouse?
High performance for both systems
DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
Different functions and different data:
missing data: Decision support requires historical data which
operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
12. Data Warehouse: A Multi-Tiered
Architecture
Bottom tier
Middle tier
Warehouse database server
Backend tools &utilities are used to feed data into bottom layer
from other databases
contains metadata repository
Data are extracted using application program interfaces
known as gateways . Eg.OBDC,OLEDB,JDBC
OLAP server, Implemented using 1) relational OLAP(RALOP)
2) multi dimensional
OLAP(MOLAP)
Top tier: front end client layer which contain query and reporting
tools, analysis tools and/or data mining tools
14. Data Warehouse: A Multi-Tiered Architecture
Other
sources
Operational
DBs
Metadata
Extract
Transform
Load
Refresh
Monitor
&
Integrator
Data
Warehouse
OLAP Server
Serve
Analysis
Query
Reports
Data mining
Data Marts
14
Data Sources
Data Storage
OLAP Engine
Front-End Tools
15. Three Data Warehouse Models (Architecture point of
view)
15
Enterprise warehouse
collects all of the information about subjects spanning the entire
organization(Either detailed or summarized data in gigabytes ,
terabyte or beyond)
Implemented in mainframes, superservers or parallel architecture
platforms
Data Mart
a subset of corporate-wide data that is of value to a specific groups of
users. Its scope is confined to specific, selected groups, such as
marketing data mart(customer, item, sales)
Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
A set of views over operational databases for efficient query
Implemented on low cost department servers that are in UnixLinux or
Windows based
16. Extraction, Transformation, and Loading (ETL)
16
Back end tools and utilities include the following functions
Data extraction
get data from multiple, heterogeneous, and external sources
Data cleaning
detect errors in the data and rectify them when possible
Data transformation
convert data from legacy or host format to warehouse format
Load
sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions
Refresh
propagate the updates from the data sources to the
warehouse
17. Metadata Repository
17
Meta data is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse
schema, view, dimensions, hierarchies, derived data defn, data mart
locations and contents
Operational meta-data
data lineage (history of migrated data and transformation path), currency of
data (active, archived, or purged), monitoring information (warehouse usage
statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
business terms and definitions, ownership of data, charging policies
18. From Tables and Spreadsheets to
Data Cubes
18
A data warehouse is based on a multidimensional data model which
views data in the form of a data cube : defined by dimension and fact
A data cube allows data to be modeled and viewed in multiple
dimensions
Example: sales with dimension time, branch, Item and location
Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
Fact table contains numeric measures (such as dollars_sold) and
keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
19. Cube: A Lattice of Cuboids
19
all
time
0-D (apex) cuboid
item
time,location
time,item
location
supplier
item,location
time,supplier
location,supplier
item,supplier
time,location,supplier
time,item,location
time,item,supplier
1-D cuboids
2-D cuboids
3-D cuboids
item,location,supplier
4-D (base) cuboid
time, item, location, supplier
20. Conceptual Modeling of Data Warehouses
20
Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a
set of dimension tables
1) large central table(fact table) 2) a set of dimension tables
Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
21. Example of Star Schema
21
time
item
time_key
day
day_of_the_week
month
quarter
year
Sales Fact Table
time_key
item_key
branch_key
branch
branch_key
branch_name
branch_type
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_type
location
location_key
street
city
state_or_province
country
22. 22
Example of Snowflake
Schema
time
time_key
day
day_of_the_week
month
quarter
year
item
Sales Fact Table
time_key
item_key
branch_key
branch
location_key
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_key
supplier
supplier_key
supplier_type
location
location_key
street
city_key
city
city_key
city
state_or_province
country
23. 23
time
Example of Fact
Constellation
time_key
day
day_of_the_week
month
quarter
year
item
Sales Fact Table
time_key
item_key
item_key
item_name
brand
type
supplier_type
Shipping Fact Table
time_key
item_key
shipper_key
from_location
branch_key
location_key
branch
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
Measures
location
to_location
location_key
street
city
province_or_state
country
dollars_cost
units_shipped
shipper
shipper_key
shipper_name
location_key
shipper_type
24. A Concept Hierarchy:
Dimension (location)
24
all
all
Europe
region
country
city
office
Germany
Frankfurt
...
...
...
Spain
North_America
Canada
Vancouver
L. Chan
...
...
M. Wind
...
Toronto
Mexico
25. Data Cube Measures: Three Categories
25
Distributive: if the result derived by applying the function to
n aggregate values is the same as that derived by applying
the function on all the data without partitioning
Algebraic: if it can be computed by an algebraic function
with Marguments (where Mis a bounded integer), each of
which is obtained by applying a distributive aggregate
function
E.g., count(), sum(), min(), max()
E.g., avg(), min_N(), standard_deviation()
Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
E.g., median(), mode(), rank()
26. Multidimensional Data
26
Sales volume as a function of product, month,
and region
Re
g
io
n
Dimensions: Product, Location, Time
Hierarchical summarization paths
Industry Region
Year
Category Country Quarter
Product
Product
City
Office
Month
Month
Day
Week
27. A Sample Data Cube
27
Pr
TV
PC
VCR
sum
1Qtr
2Qtr
3Qtr
4Qtr
sum
Total annual sales
of TVs in U.S.A.
U.S.A
Canada
Mexico
sum
Country
od
uc
t
Date
28. 28
Cuboids Corresponding to the
Cube
all
0-D (apex) cuboid
product
product,date
date
country
product,country
1-D cuboids
date, country
2-D cuboids
product, date, country
3-D (base) cuboid
29. Typical OLAP Operations
29
Roll up (drill-up): summarize data
Drill down (roll down): reverse of roll-up
by c lim bing up hie ra rc hy o r by d im e ns io n re d uc tio n
fro m hig he r le ve l s um m a ry to lo we r le v e l s um m a ry o r d e ta ile d d a ta , o r
intro d uc ing ne w d im e ns io ns
Slice and dice: p ro je c t a nd s e le c t
Dice: sub cube selecting 2 or more dimensions Eg: (Location =“Toronto” or
“Vancouver”
Time=“Q1” or “Q2”)
Pivot (rotate):
Slice: selection of one dimension resulting sub cube .Eg. Time=“Q1”
re o rie nt the c ube , vis ua liz a tio n, 3 D to s e rie s o f 2 D p la ne s
Other operations
d rill a c ro s s : invo lv ing (a c ro s s ) m o re tha n o ne fa c t ta ble
d rill thro ug h: thro ug h the bo tto m le ve l o f the c ube to its ba c k-e nd re la tio na l
ta ble s (us ing SQL)
31. Starnet Query model
Querying multidimensional DBs
Consists of radial lines originated from a
central point
Each line represent a concept hierarchy for
dimension
Each abstraction level in the hierarchy –
footprint
32. A Star-Net Query Model
32
Customer Orders
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS
ORDER
TRUCK
Time
PRODUCT LINE
ANNUALY QTRLY
DAILY
CITY
Product
PRODUCT ITEM PRODUCT GROUP
SALES PERSON
COUNTRY
DISTRICT
REGION
Location
DIVISION
Each circle is
called a footprint
Promotion
Organization
34. Data W
arehouse design and usage
What goes into a data warehouse
design?
How are data warehouses used?
How do data warehousing and OLAP
relate to data mining
35. Design of Data Warehouse: A Business
Analysis Framework
35
Wha t c a n bus ine s s a na ly s ts g a in
fro m ha ving a d a ta wa re ho us e ?
Productivity
Customer relationship management
Competitive advantage
Cost reduction
Effective data warehouse design need
to understand and analysis business
needs and construct a business
36. Design of Data Warehouse: A Business
Analysis Framework(contd..)
Four views regarding the design of a data warehouse
Top-down view
Data source view
exposes the information being captured, stored, and managed
by operational systems.(modeled by E-R model or CASE)
Data warehouse view
allows selection of the relevant information necessary for the
data warehouse(information matches current and future
business needs)
consists of fact tables and dimension tables with precalculated
totals and counts(source ,date,time)
Business query view
sees the perspectives of data in the warehouse from the view
of end-user
37. Design of Data Warehouse: A Business Analysis
Framework(contd..)
Building data warehouse is complex and long-term task
requires
Business skills
Technology skills
Program management skills
Implementation scope should be clearly defined. The
goals of an initial data warehouse implementation
should be
Specific
Achievable
Measurable
38. Data Warehouse Design Process
38
Top-down, bottom-up approaches or a combination of
both
Top-down: Starts with overall design and planning
(mature)
Bottom-up: Starts with experiments and prototypes
(rapid)
From software engineering point of view
Waterfall: structured and systematic analysis at each
step before proceeding to the next
Spiral: rapid generation of increasingly functional
systems, short turn around time, quick turn around
39. Data Warehouse Design Process
(contd..)
Typical data warehouse design process
Choose a business process to model, e.g., orders,
invoices, etc.
Choose the g ra in (a to m ic le ve l o f d a ta ) of the
business process(eg. Individual transactions)
Choose the dimensions that will apply to each fact
table record. Typical dimensions are
time,item,customer etc..
Choose the measure that will populate each fact table
record eg. Dollars_sold or units_sold
40. Data warehouse usage for information
Processing
Information processing
Supports
reporting
query processing , statistical analysis and
Analytical processing
Support
basic OLAP operations, including slice-anddice, drill-down, roll-up and pivoting
Data mining
Support
knowledge discovery by finding hidden
patterns and association, constructing analytical
models, performing classification and predication and
presentation
41. From On-Line Analytical Processing (OLAP)
to On Line Analytical Mining (OLAM)
41
Integrate OLAP with data mining to uncover knowledge
in multidimensional databases .
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data
warehouses
ODBC, OLEDB, Web accessing, service facilities, reporting
and OLAP tools
OLAP-based exploratory data analysis
Mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
Integration and swapping of multiple mining functions,
algorithms, and tasks
Notas del editor
ion
MK 08.11.09 Former title: Data Warehouse Back-End Tools and Utilities