SciQL, Bridging the Gap between Science and Relational DBMS

Martin Kersten, Ying Zhang, Milena Ivanova, Niels Nes
CWI Amsterdam

IDEAS 2011, Sep. 21-23, 2011, Lisbon, Portugal

Who needs arrays anyway?

Seismology – 1-D waveforms, 3-D spatial data

Astronomy – temporal ordered rasters

Climate simulation – temporal ordered grid

Remote sensing – images of 2-D or higher

Genomics – ordered DNA strings

Scientists love arrays:
HDF5, NETCDF, FITS, MSEED, …
but also use:
lists, tables, XML, ...

2011-09-22 IDEAS 2011 2

Arrays In DBMS
Research issues already in the 80's

Algebraic frameworks
  (S)RAM, AQL, AML, ...

SQL language extension
  RasQL, AQuery, SRQL, ...
  a notion of order

SQL:1999, SQL:2003
  collection type, C-style arrays
  aggregation functions over arrays

OODB, multi-dimensional DBMS, Sequence DBMS, ...

The Longhorn Array Database

RasDaMan
  Store large arrays in chunks as BLOBs
  Array query (RasQL) optimisation on top of DBMS
  Known to work up to 12 TBs!

PostgreSQL 8.1

SciDB
  Array DBMS from scratch
  Overlapping chunks for parallel execution

What is the problem with RDBMS?

Appropriate array denotations?

Functionally complete operation set?

Size limitations (due to BLOB representations)?

Existing foreign files?

Scale?

...


SciQL

An array query language based on SQL:2003

Pronounced as ‘cycle’

Distinguishing features:

Arrays and tables as first class citizens of DBMSs

Seamless integration of relational and array paradigms

Named dimensions with constraints

Flexible structure-based grouping

Seismology use case


Array Definitions

Dimensions and cell values

Dimension range: [(start|∗) : (step|∗) : (stop|∗)]

A short cut for integer-typed dimensions: [size]

Dimension data type: scalar data types

Cells:

≥ 0 value(s) per cell

all data types of normal table columns


Array Definitions
Fixed array

CREATE ARRAY A1 (
x INT DIMENSION[0:1:4],
y INT DIMENSION[0:1:4],
v FLOAT DEFAULT 0.0);

y
3 | 0.0  0.0  0.0  0.0
2 | 0.0  0.0  0.0  0.0
1 | 0.0  0.0  0.0  0.0
0 | 0.0  0.0  0.0  0.0
    0    1    2    3    x
Cells outside the dimension ranges are null.
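The range [0:1:4] of A1 can be read as a half-open interval: it yields the index values 0, 1, 2, 3, as the figure shows. A minimal Python sketch of that reading follows. It only emulates the semantics; the helper names dimension_values and create_fixed_array are illustrative assumptions, not SciQL functions.

```python
from itertools import product


def dimension_values(start, step, stop):
    """Index values of a bounded integer dimension [start:step:stop].

    Assumption: stop is exclusive, as suggested by A1 ([0:1:4] -> 0..3).
    """
    return list(range(start, stop, step))


def create_fixed_array(dims, default):
    """Materialise every cell of a fixed array as {(index, ...): value}."""
    axes = [dimension_values(*dim) for dim in dims]
    return {idx: default for idx in product(*axes)}


# CREATE ARRAY A1 (x INT DIMENSION[0:1:4], y INT DIMENSION[0:1:4],
#                  v FLOAT DEFAULT 0.0);
a1 = create_fixed_array([(0, 1, 4), (0, 1, 4)], default=0.0)
print(len(a1), a1[(3, 3)])
```

Every cell of a fixed array exists from the moment of creation and carries the default value, which is why the dictionary is filled eagerly here.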

Array Definitions
Unbounded array

CREATE ARRAY A2 (
x INT DIMENSION,
y INT DIMENSION,
v FLOAT DEFAULT 0.0);

INSERT INTO A2 VALUES
(1,0,5.5), (1,1,0.4), (2,2,4.5);

current range: x = 1..2, y = 0..2

y
2 | 0.0  4.5
1 | 0.4  0.0
0 | 5.5  0.0
    1    2    x
Cells outside the current range are null.

Array & Table Coercions

SELECT x, y, v FROM A1;

CREATE ARRAY A1 (
x INT DIMENSION[0:1:4],
y INT DIMENSION[0:1:4],
v FLOAT DEFAULT 0.0);

y
3 | 0.0  0.0  0.0  0.0
2 | 0.0  0.0  0.0  0.0
1 | 0.0  0.0  0.0  0.0
0 | 0.0  0.0  0.0  0.0
    0    1    2    3    x
Cells outside the dimension ranges are null.

Result table (full materialisation!):

x  y  v
0  0  0.0
0  1  0.0
0  2  0.0
0  3  0.0
1  0  0.0
1  1  0.0
1  2  0.0
1  3  0.0
2  0  0.0
2  1  0.0
2  2  0.0
2  3  0.0
3  0  0.0
3  1  0.0
3  2  0.0
3  3  0.0

Array & Table Coercions

SELECT [x], [y], v FROM T2;

dimension qualifiers: ‘[’, ‘]’

CREATE TABLE T2 (
x INT, y INT, v FLOAT);
INSERT INTO T2 VALUES
(1,0,5.5), (1,1,0.4),
(2,2,4.5), (1,1,1.3);

T2:

x  y  v
1  0  5.5
1  1  0.4
2  2  4.5
1  1  1.3

Result:

y
2 | 0.0  4.5
1 | 0.4  0.0
0 | 5.5  0.0
    1    2    x
Cells outside the derived ranges are null.

An unbounded array
dimension ranges derived from the minimal bounding box
cell values from the table or the column default
duplicates are overwritten arbitrarily
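The coercion rules for SELECT [x], [y], v FROM T2 can be sketched in Python as follows. This is an emulation under stated assumptions, not SciQL code: the function name table_to_array is hypothetical, and because the slide says duplicates are overwritten arbitrarily, the sketch simply lets the last row win, which is one of the permitted outcomes.

```python
def table_to_array(rows, default=0.0):
    """Coerce (x, y, v) rows into an array over their minimal bounding box."""
    cells = {}
    for x, y, v in rows:
        # A duplicate position overwrites the earlier value (here: last wins).
        cells[(x, y)] = v
    if not cells:
        return {}
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    # Dimension ranges come from the bounding box; gaps take the default.
    return {(x, y): cells.get((x, y), default)
            for x in range(min(xs), max(xs) + 1)
            for y in range(min(ys), max(ys) + 1)}


t2 = [(1, 0, 5.5), (1, 1, 0.4), (2, 2, 4.5), (1, 1, 1.3)]
array_from_t2 = table_to_array(t2)
for position in sorted(array_from_t2):
    print(position, array_from_t2[position])
```

The result covers x = 1..2 and y = 0..2, so it has six cells even though the table holds only three distinct positions.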

Array Modifications

DELETE FROM A1 WHERE x = 1;

y
3 | 0.0  null  0.0  0.0
2 | 0.0  null  0.0  0.0
1 | 0.0  null  0.0  0.0
0 | 0.0  null  0.0  0.0
    0    1     2    3    x
Cells outside the dimension ranges are null.

creates holes in the array

Array Modifications

UPDATE A1 SET v = 0.5 WHERE y = 1;
INSERT INTO A1 VALUES
(0,1,0.5), (1,1,0.5), (2,1,0.5), (3,1,0.5);

y
3 | 0.0  0.0  0.0  0.0
2 | 0.0  0.0  0.0  0.0
1 | 0.5  0.5  0.5  0.5
0 | 0.0  0.0  0.0  0.0
    0    1    2    3    x
Cells outside the dimension ranges are null.

set/change cell values
overwrite existing values
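The three modification statements can be emulated on a plain dictionary, as in the Python sketch below. The helper names are hypothetical, and a hole is represented by None as an assumption; the point is that DELETE leaves the array shape untouched, while UPDATE and INSERT both end up overwriting cell values.

```python
def make_a1():
    """A1 as defined on the slides: 4 x 4 cells, all at the default 0.0."""
    return {(x, y): 0.0 for x in range(4) for y in range(4)}


def delete_where(array, predicate):
    """DELETE: matching cells become holes (None); no cell disappears."""
    for idx in array:
        if predicate(*idx):
            array[idx] = None


def update_where(array, predicate, value):
    """UPDATE: set the value of every matching cell."""
    for idx in array:
        if predicate(*idx):
            array[idx] = value


def insert_cells(array, cells):
    """INSERT: overwrite the values at the given (x, y) positions."""
    for x, y, v in cells:
        array[(x, y)] = v


holes = make_a1()
delete_where(holes, lambda x, y: x == 1)       # DELETE FROM A1 WHERE x = 1

updated = make_a1()
update_where(updated, lambda x, y: y == 1, 0.5)  # UPDATE A1 SET v = 0.5 WHERE y = 1

inserted = make_a1()
insert_cells(inserted, [(0, 1, 0.5), (1, 1, 0.5), (2, 1, 0.5), (3, 1, 0.5)])

print(updated == inserted)
```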

Array Views

CREATE ARRAY VIEW A2 (
x INT DIMENSION [-1:1:5],
y INT DIMENSION [-1:1:5],
w FLOAT DEFAULT 0.0) AS
SELECT x-1, y, v FROM A1 WHERE x > 1 UNION
SELECT x, y, 1.0 FROM A1 WHERE x = 3;

A1:

y
3 | -1.0  -1.0  -1.0  -1.0
2 | -1.0  -1.0  -1.0  -1.0
1 | -1.0   0.5   0.5   0.5
0 | -1.0  -1.0  -1.0  -1.0
     0     1     2     3    x

[Figure: the view A2 as a 6 x 6 grid (x and y from -1 to 4), built up in three steps: every cell starts at the default 0.0, then the cells taken from A1 are filled in, then the cells set to 1.0.]

Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x:x+2][y:y+2];

A1:

y
3 | 0.0  0.0  0.0  0.0
2 | 0.0  0.0  0.0  0.0
1 | 0.0  0.5  0.5  0.5
0 | 0.0  0.0  0.0  0.0
    0    1    2    3    x

Anchor point: A1[x][y]

Result:

y
3 | 0.0    0.0   0.0   0.0
2 | 0.0    0.0   0.0   0.0
1 | 0.125  0.25  0.25  0.25
0 | 0.125  0.25  0.25  0.25
    0      1     2     3     x
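The 2 x 2 tiling can be reproduced with a short Python sketch. It is an emulation under assumptions: the upper bound of x:x+2 is taken as exclusive, tiles that stick out over the array border are averaged over the cells that exist, and the function name tile_average is hypothetical.

```python
def make_a1():
    """A1 as shown on the tiling slides: 0.0 everywhere, 0.5 at x = 1..3, y = 1."""
    a1 = {(x, y): 0.0 for x in range(4) for y in range(4)}
    for x in (1, 2, 3):
        a1[(x, 1)] = 0.5
    return a1


def tile_average(array, width, height):
    """AVG(v) ... GROUP BY A[x:x+width][y:y+height], anchored at every cell."""
    result = {}
    for (x, y) in array:
        values = [array[(i, j)]
                  for i in range(x, x + width)
                  for j in range(y, y + height)
                  if array.get((i, j)) is not None]  # skip holes and outside cells
        result[(x, y)] = sum(values) / len(values) if values else None
    return result


averages = tile_average(make_a1(), 2, 2)
for y in (3, 2, 1, 0):
    print(y, [averages[(x, y)] for x in range(4)])
```

Each cell acts as the anchor point of its own tile, so the result has the same shape as A1.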

Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x-1][y], A1[x][y-1],
A1[x][y], A1[x+1][y], A1[x][y+1];

A1:

y
3 | 0.0  0.0  0.0  0.0
2 | 0.0  0.0  0.0  0.0
1 | 0.0  0.5  0.5  0.5
0 | 0.0  0.0  0.0  0.0
    0    1    2    3    x

Result:

y
3 | 0.0    0.0    0.0    0.0
2 | 0.0    0.1    0.1    0.125
1 | 0.125  0.2    0.3    0.25
0 | 0.0    0.125  0.125  0.167
    0      1      2      3      x
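A tile does not have to be a rectangle: the GROUP BY above enumerates the anchor cell and its four direct neighbours. The Python sketch below emulates that cell-list form with a list of offsets. The names CROSS and neighbourhood_average are hypothetical, and neighbours that fall outside the array are assumed to be ignored rather than counted as zero.

```python
# Offsets of A1[x-1][y], A1[x][y-1], A1[x][y], A1[x+1][y], A1[x][y+1].
CROSS = [(-1, 0), (0, -1), (0, 0), (1, 0), (0, 1)]


def make_a1():
    """A1 as shown on the tiling slides: 0.0 everywhere, 0.5 at x = 1..3, y = 1."""
    a1 = {(x, y): 0.0 for x in range(4) for y in range(4)}
    for x in (1, 2, 3):
        a1[(x, 1)] = 0.5
    return a1


def neighbourhood_average(array, offsets):
    """AVG(v) over the cells listed relative to each anchor point."""
    result = {}
    for (x, y) in array:
        total, count = 0.0, 0
        for dx, dy in offsets:
            value = array.get((x + dx, y + dy))
            if value is not None:  # missing neighbours do not count
                total += value
                count += 1
        result[(x, y)] = total / count if count else None
    return result


smoothed = neighbourhood_average(make_a1(), CROSS)
for y in (3, 2, 1, 0):
    print(y, [round(smoothed[(x, y)], 3) for x in range(4)])
```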

Seismology Use Case
Recent aftershock in Chile

2TB waveform data at 100Hz

detecting seismic events using
STA/LTA (e.g., 2 sec / 15 sec)

remove false positives

window-based 3 min. cuts

further analysis: digital signal
processing operations

Current problems

accessing waveform files too slow

unpacking and positioning MSEED data every time takes too long

Seismology Use Case
CREATE ARRAY MSeed (
station VARCHAR(5) DIMENSION ['0':*:'ZZZZZ'],
time TIMESTAMP DIMENSION,
data DECIMAL(8,6)
);

[Figure: the MSeed array, with station (abc, bcd, bce, efg) on the vertical axis and time on the horizontal axis]

Seismology Use Case
detecting seismic events using STA/LTA (e.g., 2 sec / 15 sec)

-- avg of 2 sec. windows:
SELECT M.station, M.time, AVG(M.data)
FROM MSeed AS M
GROUP BY
M[station][time - INTERVAL '2' SECOND : time];

Seismology Use Case
detecting seismic events using STA/LTA (e.g., 2 sec / 15 sec)

CREATE TABLE Event(
station VARCHAR(5),
time TIMESTAMP,
ratio FLOAT,
PRIMARY KEY (station, time));

INSERT INTO Event
SELECT M1.station, M1.time,
AVG(M1.data)/AVG(M2.data) AS ratio
FROM MSeed AS M1, MSeed AS M2
WHERE M1.station = M2.station
AND M1.time = M2.time
GROUP BY
M1[station][time - INTERVAL '2' SECOND : time],
M2[station][time - INTERVAL '15' SECOND : time]
HAVING AVG(M1.data)/AVG(M2.data) > ?delta;
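The STA/LTA query compares a short trailing average with a long trailing average at every time step and keeps the points where the ratio exceeds ?delta. A minimal Python sketch for a single station follows. It makes several assumptions: window widths are given in samples instead of seconds, the window ends just before the current sample, plain averages are used as on the slide, and the names trailing_mean and detect_events are hypothetical.

```python
def trailing_mean(samples, end, width):
    """Mean of samples[end - width:end], clipped at the start; None if empty."""
    window = samples[max(0, end - width):end]
    return sum(window) / len(window) if window else None


def detect_events(samples, sta_width, lta_width, delta):
    """Return (index, ratio) wherever STA / LTA exceeds delta."""
    events = []
    for i in range(len(samples)):
        sta = trailing_mean(samples, i, sta_width)
        lta = trailing_mean(samples, i, lta_width)
        if sta is None or not lta:  # no data yet, or a zero long-term average
            continue
        ratio = sta / lta
        if ratio > delta:
            events.append((i, ratio))
    return events


# Synthetic waveform: quiet background with a short burst.
waveform = [1.0] * 20 + [10.0] * 4 + [1.0] * 6
events = detect_events(waveform, sta_width=2, lta_width=15, delta=3.0)
print([index for index, _ in events])
```

At 100 Hz the 2 sec and 15 sec windows of the slide would correspond to widths of 200 and 1500 samples; the small widths here only keep the example readable.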

Seismology Use Case
remove false positives

-- detect isolated errors by direct environment
-- using wave propagation statics
CREATE TABLE Neighbors(
station1 VARCHAR(5),
station2 VARCHAR(5),
mindelay INTERVAL SECOND,
maxdelay INTERVAL SECOND,
weight FLOAT
);

-- remove the false positives from Event
DELETE FROM Event WHERE (station, time) NOT IN (
SELECT E1.station, E1.time
FROM Event AS E1, Event AS E2, Neighbors AS N
WHERE E1.station = N.station1
AND E2.station = N.station2
AND E2.time BETWEEN E1.time + N.mindelay
AND E1.time + N.maxdelay
AND E1.ratio > E2.ratio * N.weight);
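The DELETE keeps an event only when a neighbouring station reports an event inside the expected delay window with a compatible ratio. The Python sketch below emulates that filter on in-memory tuples. It assumes times and delays are plain numbers in seconds, and the function name remove_false_positives is hypothetical.

```python
def remove_false_positives(events, neighbors):
    """Keep an event only if a neighbouring station confirms it.

    events:    list of (station, time, ratio)
    neighbors: list of (station1, station2, mindelay, maxdelay, weight)
    """
    def confirmed(event):
        station, time, ratio = event
        for station1, station2, mindelay, maxdelay, weight in neighbors:
            if station1 != station:
                continue
            for other_station, other_time, other_ratio in events:
                if (other_station == station2
                        and time + mindelay <= other_time <= time + maxdelay
                        and ratio > other_ratio * weight):
                    return True
        return False

    return [event for event in events if confirmed(event)]


events = [("abc", 10.0, 5.0), ("bcd", 12.0, 4.0), ("efg", 50.0, 6.0)]
neighbors = [("abc", "bcd", 1.0, 3.0, 1.0)]
print(remove_false_positives(events, neighbors))
```

Confirmation is directional, exactly as in the query: the event at bcd confirms the one at abc, but is itself removed because no Neighbors row starts at bcd.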

Seismology Use Case
window-based 3 min. cuts
further analysis: digital signal processing operations

-- pass time series to a UDF, written in, e.g., C:
SELECT myfunction(M[station].*)
FROM MSeed AS M, Event AS E
WHERE M.station = E.station
AND M.time = E.time
GROUP BY DISTINCT
M[station][time - INTERVAL '1' MINUTE :
time + INTERVAL '2' MINUTE];
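The window cut takes, for every detected event, the samples from one minute before to two minutes after the event and hands them to a user-defined function. A hedged Python sketch of that step follows. Times are assumed to be seconds, the upper window bound is assumed to be exclusive, cut_windows is a hypothetical name, and max merely stands in for the UDF myfunction of the slide.

```python
def cut_windows(series, events, before, after):
    """Cut [time - before, time + after) out of each event's station series.

    series: {station: [(time, value), ...]}
    events: [(station, time), ...]
    """
    cuts = {}
    for station, time in events:
        cuts[(station, time)] = [
            value
            for sample_time, value in series.get(station, [])
            if time - before <= sample_time < time + after
        ]
    return cuts


# One sample every 30 seconds for ten minutes; the value is time / 10.
series = {"abc": [(float(t), float(t) / 10) for t in range(0, 600, 30)]}
events = [("abc", 120.0)]

cuts = cut_windows(series, events, before=60.0, after=120.0)
peaks = {key: max(values) for key, values in cuts.items() if values}
print(peaks)
```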

Conclusion
SciQL: a first step towards a tailored scientific DBMS
A symbiosis of relational and array paradigms

Under active implementation

Open issues:
Appropriate array denotations

Functionally complete operation set

Size limitations (due to BLOB representations)

Existing foreign files

Scale
