For more than a decade, the evolution of database was governed largely by incremental improvement in the major RDBMS products. Then, in the past few years, a whole series of innovations started to arrive. This presentation will touch on the most significant, including these "Top 12":
The impact of SSD
Vector registers
The ARM processor
Column store databases and analytic databases
In memory architecture and database
NoSQL and the failure of SQL
Big data/machine data
Hadoop and friends
Data virtualization
Cloud database - database-as-a-service
Streaming and time series databases
A mathematics of data
1. THE DATABASE REVOLUTION
Robin Bloor, Ph.D.
Tuesday, August 2, 11
2. This Presentation
Intro: The RDBMS
Computer Hardware Trends
The NoSQL trend (Either No as
in none or NO as in Not Only)
What to do...
Main Take Away:
Database is no longer a commodity
3. A Point Of Departure
In the 1990s, Relational Database
quickly became the dominant form
of database.
The SQL language became the
dominant data access mechanism.
The RDBMS conferred mathematical
respectability on itself and even
claimed an underlying “Relational
Algebra.”
The RDBMS dominated because it
dealt effectively with transactional
and BI apps.
4. Relational Dogma
Data and Process should be kept
separate.
The database embodies a data
model within a schema
Normalization to 3NF (or 5NF) is
the correct way to design the
schema
The query language (SQL) is part
DDL and part DML (Select,
Project, Join)
Ordering doesn’t matter
5. The 1990s RDBMS
The RDBMS of the 1990s was
physically based on B-tree
structures and an optimizer.
This scaled up within reason but
it scaled out poorly.
It was fundamentally an index-
based data store.
It managed megabytes and
gigabytes fine.
But look what happened to
data....
6. Moore’s Law Cubed
Moore’s Law suggests that CPU power increases
10-fold every 6 years (and other technologies have
stayed in step to some degree)
Large database volumes have grown 1000-fold over each 6-year period:
In ~1992 measured in megabytes
In ~1998 measured in gigabytes
In ~2004 measured in terabytes
In ~2010 measured in petabytes
Exabytes by ~2016?
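The two growth figures on this slide can be turned into yearly rates with a little arithmetic. The sketch below assumes constant compound growth within each 6-year period, which is a simplification; the function name is illustrative only.

```python
# Growth factors per 6-year period, taken from the slide.
CPU_FACTOR_PER_PERIOD = 10       # "10-fold every 6 years"
DATA_FACTOR_PER_PERIOD = 1000    # "1000-fold" over the same period
PERIOD_YEARS = 6


def annual_rate(factor: float, years: int) -> float:
    """Constant yearly multiplier that compounds to `factor` over `years`."""
    return factor ** (1.0 / years)


cpu_annual = annual_rate(CPU_FACTOR_PER_PERIOD, PERIOD_YEARS)
data_annual = annual_rate(DATA_FACTOR_PER_PERIOD, PERIOD_YEARS)

print(f"CPU power grows about {cpu_annual:.2f}x per year")
print(f"Data volume grows about {data_annual:.2f}x per year")

# The data factor is the cube of the CPU factor (10 ** 3 == 1000),
# which is where the "Moore's Law Cubed" title comes from.
gap_per_period = DATA_FACTOR_PER_PERIOD / CPU_FACTOR_PER_PERIOD
print(f"Data outgrows CPU by {gap_per_period:.0f}x every {PERIOD_YEARS} years")
```

Under these assumptions the yearly data multiplier is the cube of the yearly CPU multiplier, so the gap between what must be stored and what a single processor can chew through widens every year.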
11. A Database is a Cupboard
Some are transactional (for
operational systems)
Some service large queries
against large data heaps
Some are content oriented for
accessing complex objects
(object based systems mainly)
All databases need to deliver
performance
14. A Database is a Cupboard
Some are transactional (for
operational systems): RDBMS ✔
Some service large queries
against large data heaps: RDBMS ??
Some are content oriented for
accessing complex objects
(object based systems mainly): RDBMS ??
All databases need to deliver
performance
15. Hardware Data Points
Moore’s Law now proceeds by adding
cores rather than by increasing clock
speed. Vector registers are now standard on
Intel chips.
Parallelism is now on the rise and will
eventually become the normal mode of
processing
Memory is about 1 million times faster
than disk, and random disk reads have become
very expensive in terms of latency
The Intel processor is now being
challenged by the ARM processor (it’s
about heat)
17. Memory v Disk
The decline in memory
costs is (on current
trends) likely to have
memory cheaper than
disk around 2016
This means that non-
volatile SSDs will
prevail relatively soon.
SSDs are between
1000 and 100,000
times faster than
spinning disk
18. Massive Scale-Out
CPUs are now
doubling cores every
18 months or so.
This trend, combined
with memory cost
trends, suggests that
massive scale out will
eventually become a
much rarer
requirement.
But we cannot know
that for sure.
19. Consequences
SSD will replace disk - but slowly...
Many DBMS tasks can now be
handled in memory - but better
physical architectures are possible
for this.
Physical indexes are becoming
irrelevant
Scale out and parallelism are now
the driving force for large data
volume applications.
The physical architecture of the
traditional RDBMS is now an
anachronism
22. RDBMS & SQL As Anachronisms
For big BI, the RDBMS has been
superseded by the column store DBMS,
primarily because the RDBMS didn’t scale out
and indexes have become far less
important.
The use of snowflake schemas and
star schemas had already
demonstrated that 3NF was a limited
modeling technique and nothing
more.
And then came Hadoop & MapReduce
for massive scale-out - which cares
nothing for SQL or RDBMS
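The column store point above can be illustrated with a toy comparison of the two layouts. This is a minimal sketch with a hypothetical sales table and hypothetical function names; it shows only the layout idea, not how any particular product stores data.

```python
from array import array

# Hypothetical sales table: (order_id, region, amount).
rows = [
    (1, "EU", 120.0),
    (2, "US", 75.5),
    (3, "EU", 30.0),
    (4, "APAC", 210.25),
]


def total_amount_row_store(records):
    # Row layout: each record is kept together, so an aggregate over
    # one attribute still walks every full record.
    return sum(record[2] for record in records)


# Column layout: each attribute lives in its own contiguous sequence.
columns = {
    "order_id": array("q", (r[0] for r in rows)),
    "region": [r[1] for r in rows],
    "amount": array("d", (r[2] for r in rows)),
}


def total_amount_column_store(cols):
    # The aggregate reads only the one column it needs, with no index.
    return sum(cols["amount"])


print(total_amount_row_store(rows))
print(total_amount_column_store(columns))
```

Both functions return the same total. The difference is in how much data each has to touch, which is why a large analytic scan favors the column layout and makes a B-tree index less important.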
23. A Fundamental Error
Actions: Add, Modify, Delete,
Archive
From day 1 there was a fundamental
error in the simple mechanics of
database and file systems.
When you update data you destroy
the old value. No audit trail.
A correct theory of data was
invented by (perhaps) Luca Pacioli.
It is the basis of accounting.
A few databases (Firebird is one)
were built so that data was only ever
added or archived.
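The add-only idea on this slide can be sketched as a small in-memory store in which every write appends an entry and nothing is overwritten. The class and method names here are hypothetical and are not modeled on any product named above.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)
class Entry:
    seq: int      # position in the log, assigned on append
    key: str
    value: Any


class AppendOnlyStore:
    """Hypothetical store where writes add entries and never overwrite."""

    def __init__(self) -> None:
        self._log: List[Entry] = []

    def put(self, key: str, value: Any) -> None:
        self._log.append(Entry(len(self._log), key, value))

    def get(self, key: str) -> Optional[Any]:
        # The current value is the most recent entry for the key.
        for entry in reversed(self._log):
            if entry.key == key:
                return entry.value
        return None

    def history(self, key: str) -> List[Any]:
        # Every value the key has held, oldest first: the audit trail.
        return [e.value for e in self._log if e.key == key]


store = AppendOnlyStore()
store.put("balance:alice", 100)
store.put("balance:alice", 80)
print(store.get("balance:alice"))
print(store.history("balance:alice"))
```

An "update" here is just another append, so the old value survives, much as a ledger records a correcting entry rather than erasing the original one.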
24. The Ordering Of Data
“A data set is an unordered
collection of unique, non-duplicated
items.”
This is an absurd constraint to place
upon data, as data is naturally
ordered by time if by nothing else.
Events are ordered by time.
Changes to entities are ordered
by time
There are lots of applications
requiring time series capability.
This has led to TSDB products like
StreamBase, Vhayu, OpenTSDB,
etc.
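A short sketch of why keeping data in time order is useful: a time-range lookup becomes two binary searches over the stored order. The readings and the function name are hypothetical, and this does not reflect how any of the products named above works internally.

```python
from bisect import bisect_left, bisect_right

# Hypothetical readings, appended in time order.
timestamps = [10, 20, 30, 40, 50]      # seconds
values = [1.0, 1.5, 1.2, 1.8, 2.1]


def range_query(start: int, end: int):
    """Return (timestamp, value) pairs with start <= timestamp <= end.

    Because the data is already ordered by time, two binary searches
    locate the slice without scanning or sorting the whole series.
    """
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return list(zip(timestamps[lo:hi], values[lo:hi]))


print(range_query(20, 40))
```

In a model that treats the data set as unordered, the same question requires an explicit sort or an index on the time column before the range can be located.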
25. The Separation of Data and Process
The assumption was that this
separation could be enforced
But when you try to enforce it, you
forever encounter data and process
locked together in a guilty embrace.
It is a wrong separation of concerns.
In truth it cannot be enforced without
there being a true algebra of data
So many databases (object
databases and other NoSQL
databases) do not enforce it.
However, their interfaces to data are
not perfect either.
[Diagram labels: Process, SQL, SCHEMA, DBMS]
26. Relational Algebra Isn’t An Algebra
Set aside the fact that RDBMS
focus so strongly on Table structures
that they cannot naturally represent
other important data structures
(such as BOMP and MOLAP).
And that RDBMS rail against the
ordering of data (“No order”)
Ignore the stored procedures (which
violate the separation of data and
process).
Even so Relational Algebra is not
even an algebra. (NULLs?)
There is at least one algebraic
(NoSQL) database
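The "(NULLs?)" aside can be made concrete. The sketch below uses Python's built-in sqlite3 module and a hypothetical one-column table to show how SQL's three-valued logic breaks identities that would hold in an ordinary two-valued algebra. Other SQL engines are expected to behave the same way on these queries, but only SQLite is exercised here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

# NULL = NULL evaluates to NULL (unknown), which Python sees as None.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# "x = 1 OR x <> 1" looks like a tautology, yet the NULL row fails it.
total_rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
tautology_rows = conn.execute(
    "SELECT COUNT(*) FROM t WHERE x = 1 OR x <> 1"
).fetchone()[0]

# A NULL inside the list makes NOT IN unknown for every non-matching row.
not_in_rows = conn.execute(
    "SELECT x FROM t WHERE x NOT IN (1, NULL)"
).fetchall()

print(null_equals_null)
print(total_rows, tautology_rows)
print(not_in_rows)
```

A value that is not equal to itself, and a predicate of the form "p or not p" that fails to select every row, are the kind of behavior the slide has in mind.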
27. The SQL Barrier
SQL has:
DDL (for data definition)
DML (for Select, Project and Join)
But it has no MML or TML
Usually result sets are brought to the
client for further manipulation, but
using them for further data access
becomes problematic.
Conclusions:
This separation of data from
process is arbitrary and unhelpful
Any database to which this
doesn’t apply is NoSQL
[Diagram labels: SQL Barrier; SQL; Results; "processing must be done here" (twice); Analytic DBMS]
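The barrier described on this slide can be sketched with Python's built-in sqlite3 module. The tables, the data, and the client-side rule `is_priority` are all hypothetical; the point is only the shape of the interaction, in which a result set is pulled to the client, processed there, and then used to drive further queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 50.0), (1, 25.0), (2, 10.0), (3, 90.0);
""")


def is_priority(name: str) -> bool:
    # Hypothetical client-side process that lives outside the database.
    return name in {"Ann", "Cy"}


# Step 1: the result set crosses the barrier to the client.
customers = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()

# Step 2: processing is done on the client side of the barrier.
priority_ids = [cid for cid, name in customers if is_priority(name)]

# Step 3: using the results for further data access means going back
# across the barrier, here with one round trip per customer.
totals = {}
for cid in priority_ids:
    row = conn.execute(
        "SELECT SUM(amount) FROM orders WHERE customer_id = ?", (cid,)
    ).fetchone()
    totals[cid] = row[0]

print(totals)
```

With three customers this is harmless. With millions of rows, the repeated crossings in step 3 are where the approach becomes problematic.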
28. Other NDBMS Directions
Some NDBMS do not attempt to provide all ACID
properties. (Atomicity, Consistency, Isolation, Durability)
Some NDBMS deploy a distributed scale-out
architecture with data redundancy.
XML DBMS using XQuery are NDBMS.
Some document stores are NDBMS (OrientDB,
Terrastore, etc.)
Object databases are NDBMS (Gemstone, Objectivity,
ObjectStore, etc.)
Key value stores = schema-less stores (Cassandra,
MongoDB, Berkeley DB, etc.)
Graph DBMS (DEX, OrientDB, etc.) are NDBMS
Large data pools (BigTable, HBase, Mnesia, etc.) are
NDBMS
30. What Is The Problem You Are Trying To Solve?
The primary message of this presentation is that
database is no longer a commodity (if it ever
was).
Despite faults and weaknesses, the General
Purpose Relational Database works fine for many
areas of application and:
It is well understood
Skills (for any popular product) are abundant
It can be inexpensive (by license or Open
Source)
Beyond such products, it is “horses for courses”
and “caveat emptor.”
31. Other Selection Criteria
Don’t fall for fashion.
Proven performance?
Skills, both for design and for administration.
Interfaces & middleware
The hardware bill.
Product roadmap.
External support/internal support.
Calculate a TCO (note that even for expensive
DBMS the license fees are rarely more than
15% of the TCO)
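A TCO calculation of the kind suggested above can be sketched in a few lines. Every figure and cost category below is hypothetical, chosen only so that the license share lands on the 15% mark mentioned in the slide; a real calculation would use your own estimates.

```python
# Hypothetical multi-year cost figures, in arbitrary currency units.
costs = {
    "licenses": 150_000,
    "hardware": 250_000,
    "staff_and_skills": 450_000,
    "support": 100_000,
    "middleware": 50_000,
}


def total_cost_of_ownership(components: dict) -> int:
    """Sum every cost component, not just the license line."""
    return sum(components.values())


def license_share(components: dict) -> float:
    """Fraction of the TCO that is attributable to license fees."""
    return components["licenses"] / total_cost_of_ownership(components)


tco = total_cost_of_ownership(costs)
print(f"TCO: {tco}")
print(f"License share: {license_share(costs):.0%}")
```

Comparing products on the license line alone ignores most of the total, which is why the other criteria on this slide (skills, hardware, support) belong in the same calculation.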
32. Take Aways
Hardware trends have brought change,
and will bring more change
There are many RDBMS weaknesses
There are a huge number of “new”
database products both
No SQL Whatsoever, and
Not Only SQL
Select database products with caution
Main Take Away:
Database is no longer a commodity