Data Warehousing 101(and a video)

Data Warehousing 101
Everything
you never wanted
to know about
big databases
but were forced
to find out anyway
Josh Berkus
Open Source Bridge 2011

contents
covering
● concepts of DW
● some DW
techniques
● databases
not covering
● hardware
● analytics/reporting
tools

What is a
“data warehouse”?

OLTP vs DW
● many single-row
writes
● current data
● queries generated
by user activity
● < 1s response
times
● 1000's of users
● few large batch
imports
● years of data
● queries generated
by large reports
● queries can run for
minutes/hours
● 10's of users

OLTP vs DW
big data for
many
concurrent
requests to
small amounts
of data each
big data
for low
concurrency
requests to very
large amounts
of data each

archiving
WORN data: “write once, read never”
● grows indefinitely
● usually a result of regulatory compliance
● main concern: storage efficiency

data mining
the database where you don't know what's in
there, but you want to find out
● lots of data (TB to PB)
● mostly “semi-structured”
● data produced as a side effect of other
business processes
● needs CPU-intensive processing

BI: Business Intelligence
DSS: Decision Support
OLAP: Online Analytical
Processing
Analytics

BI/DSS/OLAP/Analytics
databases which support visualization of
large amounts of data
● data is fairly well understood
● most data can be reduced to categories,
geography, and taxonomy
● primarily about indexing

dimensions vs. facts
Fact
Table
customers
/ accounts
category
subcategory
sub-subcategory

dimension examples
● location/region/country/quadrant
● product categorization
● URL
● transaction type
● account heirarchy
● IP address
● OS/version/build

dimension synonyms
● facet
● taxonomy
● secondary index
● view

Extract, Transform, Load
● how you turn external raw data into useful
database data
● Apache logs → web analytics DB
● CSV POS files → financial reporting DB
● OLTP server → 10-year data warehouse
● also called ELT when the transformation is
done inside the database

Purpose of ETL/ELT
getting data into the data warehouse
● clean up garbage data
● split out attributes
● “normalize” dimensional data
● deduplication
● calculate materialized views / indexes

ELT Tips
think volume
● bulk processing or parallel processing
● no row-at-a-time, document-at-a-time
● insert into permanent storage should be
the last step
● no updates

What kind of
database should I
use for DW?

5 Types
1. Standard Relational
2. MPP
3. Column Store
4. Map/Reduce
5. Enterprise Search
` `

standard relational
the all-purpose solution for not-that-big data
● adequate for all tasks
● but not excellent at any of them
● easy to use
● low resource requirements
● well-supported by all software
● familiar
● not suitable for really big data

MySQL
PostgreSQL
DW Database
0 5 10 15 20 25 30
0 5 10 15 20 25 30
Sweet Spots

MPP
cpu-intensive data warehousing
● data mining, some analytics
● supporting complex query logic
● moderately big data (1-200TB)
● drawbacks: proprietary, expensive
● now hybridizes
● with other types

column store
inversion of a row store:
indexes become data
data becomes indexes

column stores
for aggregations and transformations of
highly structured data
● good for BI, analytics, some archiving
● moderately big data (0.5-100TB)
● bad for data mining
● slow to add new data / purge data
● usually support compression

map/reduce
// map
function(doc) {
for (var i in doc.links)
emit([doc.parent, i], null);
}
}
// reduce
function(keys, values) {
return null;
}

map/reduce
// Map
function (doc) {
emit(doc.val, doc.val)
}
// Reduce
function (keys, values, rereduce) {
// This computes the standard deviation of the mapped results
var stdDeviation=0.0;
var count=0;
var total=0.0;
var sqrTotal=0.0;
if (!rereduce) {
// This is the reduce phase, we are reducing over emitted values from
// the map functions.
for(var i in values) {
total = total + values[i];
sqrTotal = sqrTotal + (values[i] * values[i]);
}
count = values.length;
}
else {
// This is the rereduce phase, we are re-reducing previosuly
// reduced values.
for(var i in values) {
count = count + values[i].count;
total = total + values[i].total;
sqrTotal = sqrTotal + values[i].sqrTotal;
}
}
var variance = (sqrTotal - ((total * total)/count)) / count;
stdDeviation = Math.sqrt(variance);
// the reduce result. It contains enough information to be rereduced
// with other reduce results.
return {"stdDeviation":stdDeviation,"count":count,
"total":total,"sqrTotal":sqrTotal};
};

map/reduce vs. MPP
● open source
● petabytes
● write routines by
hand
● inefficient
● generic
● cheap HW / cloud
● DIY tools
● proprietary
● terabytes
● advanced query
support
● efficient
● specific
● needs good HW
● integrated tools

enterprise search
ElasticSearch

enterprise search
when you need to do DW with a huge pile of
partly processed “documents”
● does: light data mining, light BI/analytics
● best “full text” and keyword search
● supports “approximate results”
● lots of special features for web data

E.S. vs. C-Store
● batch load
● semi-structured
data
● uncompressed
● star schema
● sharding
● approximate results
● batch load
● fully normalized
data
● compressed
● snowflake schema
● parallel query
● exact results

TABLE events (
event_id INT,
event_type TEXT,
start TIMESTAMPTZ,
duration INTERVAL,
event_desc TEXT
);

SELECT MAX(concurrent)
FROM (
SELECT SUM(tally)
OVER (ORDER BY start)
AS concurrent
FROM (
SELECT start, 1::INT as tally
FROM events
UNION ALL
SELECT (start + duration), -1
FROM events )
AS event_vert) AS ec;

stream processing SQL
● replace multiple queries with a single
query
● avoid scanning large tables multiple times
● replace pages of application code
● and MB of data transmission
● SQL alternative to map/reduce
● (for some data mining tasks)

query results as table
● calculate once, read many time
● complex/expensive queries
● frequently referenced
● not necessarily a whole query
● often part of a query
● might be manually or automatically
updated
● depends on product

non-relational matviews
● CouchDB Views
● cache results of map/reduce jobs
● updated on data read
● Solr / Elastic Search “Faceted Search”
● cached indexed results of complex searches
● updated on data change

maintaining matviews
BEST: update matviews
at batch load time
GOOD: update matview according
to clock/calendar
FAIR: update matview on data
request
BAD for DW: update matviews
using a trigger

matview tips
● matviews should be small
● 1/10 to ¼ of RAM on each node
● each matview should support several
queries
● or one really really important one
● truncate + append, don't update
● index matviews like crazy
● if they are not indexes themselves

cubes
Site
R
e
p
e
a
t
V
i
s
i
t
o
r
s
B
r
o
w
s
e
r

OLAP
● OnLine Analytical Processing
● Visualization technique
● all data as a multi-dimensional space
● great for decision support
● CPU & RAM intensive
● hard to do on really big data
● Works well with column stores

Contact
● Josh Berkus: josh@pgexperts.com
● blog: blogs.ittoolbox.com/database/soup
● twitter: @fuzzychef
● PostgreSQL: www.postgresql.org
● pgexperts: www.pgexperts.com
This talk is copyright 2011 Josh Berkus and is licensed under the Creative Commons Attribution
license. Many images were taken from google images and are copyright their original creators,
whom I don't actually know. Logos are trademark their respective owners, and are used here
under fair use.

Data Warehousing 101(and a video)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Data Warehousing 101(and a video)

Similar to Data Warehousing 101(and a video) (20)

More from PostgreSQL Experts, Inc.

More from PostgreSQL Experts, Inc. (20)

Recently uploaded

Recently uploaded (20)

Data Warehousing 101(and a video)