SciQL, Bridging the Gap between Science and Relational DBMS
1. SciQL
Bridging The Gap Between Science
And Relational DBMS
Martin Kersten, Ying Zhang, Milena Ivanova, Niels Nes
CWI Amsterdam
IDEAS 2011, Sep. 21-23, 2011,
Lisbon, Portugal
2. Who needs arrays anyway?
Seismology – 1-D waveforms, 3-D spatial data
Astronomy – temporal ordered rasters
Climate simulation – temporal ordered grid
Remote sensing – images of 2-D or higher
Genomics – ordered DNA strings
Scientists love arrays:
HDF5, NETCDF, FITS, MSEED, …
but also use:
lists, tables, XML, ...
2011-09-22 IDEAS 2011 2
3. Arrays In DBMS
Research issues already in the 80’s
OODB, multi-dimensional DBMS, Sequence DBMS, ...
Algebraic frameworks
(S)RAM, AQL, AML, ...
SQL language extension
RasQL, AQuery, SRQL, …
a notion of order
SQL:1999, SQL:2003
collection type, C-style arrays
aggregation functions over arrays
The Longhorn Array Database
RasDaMan
Store large arrays in chunks as BLOBs
Array query (RasQL) optimisation on top of DBMS
Known to work up to 12 TBs!
PostgreSQL 8.1
SciDB
Array DBMS from scratch
Overlapping chunks for parallel execution
4. What is the problem with RDBMS?
Appropriate array denotations?
Functionally complete operation set?
Size limitations (due to BLOB representations)?
Existing foreign files?
Scale?
...
5. SciQL
An array query language based on SQL:2003
Pronounced as ‘cycle’
Distinguishing features:
Arrays and tables as first class citizens of DBMSs
Seamless integration of relational and array paradigms
Named dimensions with constraints
Flexible structure-based grouping
Seismology use case
6. Array Definitions
Dimensions and cell values
Dimension range: [(start|∗) : (step|∗) : (stop|∗)]
A shortcut for integer-typed dimensions: [size]
Dimension data type: scalar data types
Cells:
≥ 0 value(s) per cell
all data types of normal table columns
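To make the range notation concrete, here is a minimal Python sketch that expands a bounded integer dimension [start:step:stop] into its index values. It assumes the stop bound is exclusive, which is how the later slides draw DIMENSION[0:1:4] (indices 0 to 3). Reading [size] as [0:1:size] is also an assumption, and the function names are hypothetical.

```python
def dimension_values(start, step, stop):
    """Expand a bounded integer range [start:step:stop]; stop is treated as exclusive."""
    if step <= 0:
        raise ValueError("this sketch only handles positive steps")
    return list(range(start, stop, step))


def shortcut(size):
    """Assumed reading of the [size] shortcut: indices 0 .. size-1."""
    return dimension_values(0, 1, size)


print(dimension_values(0, 1, 4))   # [0, 1, 2, 3]
print(dimension_values(-1, 1, 5))  # [-1, 0, 1, 2, 3, 4]
print(shortcut(4))                 # [0, 1, 2, 3]
```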
7. Array Definitions
Fixed array
CREATE ARRAY A1 (
x INT DIMENSION[0:1:4],
y INT DIMENSION[0:1:4],
v FLOAT DEFAULT 0.0);
[Figure: array A1 as a 4×4 grid, x and y from 0 to 3; every cell holds the default 0.0 and positions outside the range are null]
8. Array Definitions
Unbounded array
CREATE ARRAY A2 (
x INT DIMENSION,
y INT DIMENSION,
v FLOAT DEFAULT 0.0);
[Figure: array A2 directly after creation; the x and y axes are drawn from 0 to 3 and no cell holds a value yet (null)]
11. Array Definitions
Unbounded array
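CREATE ARRAY A2 (
x INT DIMENSION,
y INT DIMENSION,
v FLOAT DEFAULT 0.0);
INSERT INTO A2 VALUES
(1,0,5.5), (1,1,0.4), (2,2,4.5);
[Figure: array A2 after the insert. The current range is the bounding box x = 1..2, y = 0..2, holding (1,0) = 5.5, (1,1) = 0.4 and (2,2) = 4.5; the remaining cells inside the box take the default 0.0 and everything outside is null.]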
12. Array & Table Coercions
CREATE ARRAY A1 (
x INT DIMENSION[0:1:4],
y INT DIMENSION[0:1:4],
v FLOAT DEFAULT 0.0);
SELECT x, y, v FROM A1;
full materialisation!
[Figure: array A1 as a 4×4 grid, x and y from 0 to 3, every cell 0.0]
Result table:
x y v
0 0 0.0
0 1 0.0
0 2 0.0
0 3 0.0
1 0 0.0
1 1 0.0
1 2 0.0
1 3 0.0
2 0 0.0
2 1 0.0
2 2 0.0
2 3 0.0
3 0 0.0
3 1 0.0
3 2 0.0
3 3 0.0
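The array-to-table coercion can be pictured as emitting one row per cell, defaults included. The Python sketch below is only an illustration of that idea; the function name and the dictionary representation of stored cells are assumptions, not part of SciQL.

```python
from itertools import product


def materialise(x_range, y_range, cells, default=0.0):
    """Return one (x, y, v) row per cell of a bounded 2-D array.

    `cells` maps (x, y) to explicitly stored values; every other position
    inside the ranges is emitted with the column default.
    """
    return [(x, y, cells.get((x, y), default))
            for x, y in product(x_range, y_range)]


rows = materialise(range(0, 4), range(0, 4), cells={})
print(len(rows))  # 16
print(rows[:2])   # [(0, 0, 0.0), (0, 1, 0.0)]
```

Even an array with no explicitly stored cell produces 16 rows here, which is the cost the slide flags as full materialisation.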
13. Array & Table Coercions
CREATE TABLE T2 (
x INT, y INT,
v FLOAT DEFAULT 0.0);
INSERT INTO T2 VALUES
(1,0,5.5), (1,1,0.4),
(2,2,4.5), (1,1,1.3);
x y v
1 0 5.5
1 1 0.4
2 2 4.5
1 1 1.3
SELECT [x], [y], v FROM T2;
dimension qualifiers: ‘[’, ‘]’
[Figure: resulting array over x = 1..2, y = 0..2, with (1,0) = 5.5, (1,1) = 0.4, (2,2) = 4.5 and the default 0.0 in the remaining cells; positions outside are null]
An unbounded array
dimension ranges derived from the minimal bounding box
cell values from the table or the column default
duplicates are overwritten arbitrarily
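The table-to-array rules can be sketched in a few lines of Python. This is a hedged illustration, not SciQL's implementation: the function name is hypothetical, and letting the last duplicate win is just one of the arbitrary outcomes the slide allows (the slide's figure happens to keep 0.4 for cell (1,1)).

```python
def table_to_array(rows, default=0.0):
    """Coerce (x, y, v) rows into a dense array over their minimal bounding box."""
    cells = {}
    for x, y, v in rows:
        cells[(x, y)] = v  # a later duplicate replaces an earlier one (assumed policy)
    if not cells:
        return range(0), range(0), {}
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    x_range = range(min(xs), max(xs) + 1)
    y_range = range(min(ys), max(ys) + 1)
    # Positions inside the box without a row take the column default.
    array = {(x, y): cells.get((x, y), default)
             for x in x_range for y in y_range}
    return x_range, y_range, array


t2 = [(1, 0, 5.5), (1, 1, 0.4), (2, 2, 4.5), (1, 1, 1.3)]
x_range, y_range, array = table_to_array(t2)
print(list(x_range), list(y_range))  # [1, 2] [0, 1, 2]
print(array[(2, 0)])                 # 0.0
```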
14. Array Modifications
DELETE FROM A1 WHERE x = 1;
[Figure: array A1 after the delete; the column x = 1 is null for y = 0..3, all other cells remain 0.0]
creates holes in the array
15. Array Modifications
UPDATE A1 SET v = 0.5 WHERE y = 1;
INSERT INTO A1 VALUES
(0,1,0.5), (1,1,0.5), (2,1,0.5), (3,1,0.5);
[Figure: array A1 after the update; the row y = 1 holds 0.5 for x = 0..3, all other cells remain 0.0]
set/change cell values
overwrite existing values
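The difference from table semantics is that array positions never disappear or multiply: a delete leaves a hole, and an insert overwrites. A small Python sketch of that behaviour follows; the helper names are hypothetical and holes are modelled as None.

```python
def delete_where(cells, predicate):
    """DELETE on an array: matching cells become holes (None); positions stay."""
    return {pos: (None if predicate(*pos) else v) for pos, v in cells.items()}


def insert_values(cells, rows):
    """INSERT on an array: overwrite the value stored at each given position."""
    updated = dict(cells)
    for x, y, v in rows:
        updated[(x, y)] = v
    return updated


a1 = {(x, y): 0.0 for x in range(4) for y in range(4)}

holes = delete_where(a1, lambda x, y: x == 1)
print(holes[(1, 2)], len(holes))  # None 16

a1 = insert_values(a1, [(x, 1, 0.5) for x in range(4)])
print(a1[(2, 1)], a1[(2, 0)], len(a1))  # 0.5 0.0 16
```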
21. Array Views
CREATE ARRAY VIEW A2 (
x INT DIMENSION [-1:1:5],
y INT DIMENSION [-1:1:5],
w FLOAT DEFAULT 0.0) AS
SELECT x-1, y, v FROM A1 WHERE x > 1 UNION
SELECT x, y, 1.0 FROM A1 WHERE x = 3;
[Figure: A1 (4×4, x and y from 0 to 3; the row y = 1 holds -1.0, 0.5, 0.5, 0.5 and all other cells -1.0) next to the resulting view A2 (6×6, x and y from -1 to 4). In A2 the rows y = -1 and y = 4 and the columns x = -1 and x = 4 hold the default 0.0, the column x = 3 holds 1.0 for y = 0..3, and the columns x = 0..2 hold A1 values shifted one position to the left.]
23. Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x:x+2][y:y+2];
[Figure: input array A1, 4×4, x and y from 0 to 3; the row y = 1 holds 0.0, 0.5, 0.5, 0.5 and all other cells are 0.0]
Anchor point: A1[x][y]
28. Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x:x+2][y:y+2];
Result (rows y = 3 down to 0, columns x = 0 to 3):
y=3 0.0 0.0 0.0 0.0
y=2 0.0 0.0 0.0 0.0
y=1 0.125 0.25 0.25 0.25
y=0 0.125 0.25 0.25 0.25
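The result grid can be reproduced with a short Python sketch of range-based tiling. It assumes that each cell anchors a 2×2 tile covering x..x+1 and y..y+1 (stop exclusive) and that tile positions outside the array are simply skipped; the function name and dictionary layout are hypothetical.

```python
def tile_avg(cells, x_range, y_range, width=2, height=2):
    """Average each width x height tile anchored at (x, y), skipping missing cells."""
    result = {}
    for ax in x_range:
        for ay in y_range:
            tile = [cells[(x, y)]
                    for x in range(ax, ax + width)
                    for y in range(ay, ay + height)
                    if cells.get((x, y)) is not None]
            result[(ax, ay)] = sum(tile) / len(tile) if tile else None
    return result


# Input array from the slide: row y = 1 holds 0.0, 0.5, 0.5, 0.5.
a1 = {(x, y): 0.0 for x in range(4) for y in range(4)}
a1.update({(1, 1): 0.5, (2, 1): 0.5, (3, 1): 0.5})

avg = tile_avg(a1, range(4), range(4))
print(avg[(0, 0)], avg[(1, 0)], avg[(3, 1)], avg[(0, 2)])  # 0.125 0.25 0.25 0.0
```

At the right edge the tile anchored at (3,1) contains only two cells, which is why its average is 0.25 rather than 0.125.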
29. Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x-1][y], A1[x][y-1],
A1[x][y], A1[x+1][y], A1[x][y+1];
[Figure: input array A1, 4×4, x and y from 0 to 3; the row y = 1 holds 0.0, 0.5, 0.5, 0.5 and all other cells are 0.0]
33. Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x-1][y], A1[x][y-1],
A1[x][y], A1[x+1][y], A1[x][y+1];
Result (rows y = 3 down to 0, columns x = 0 to 3):
y=3 0.0 0.0 0.0 0.0
y=2 0.0 0.1 0.1 0.125
y=1 0.125 0.2 0.3 0.25
y=0 0.0 0.125 0.125 0.167
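When the GROUP BY lists individual cells, the tile is an arbitrary shape relative to the anchor. The Python sketch below models the five-cell cross from the query as a list of offsets; as before, neighbours that fall outside the array are assumed to be ignored, and all names are hypothetical.

```python
# Offsets for A1[x-1][y], A1[x][y-1], A1[x][y], A1[x+1][y], A1[x][y+1].
CROSS = [(-1, 0), (0, -1), (0, 0), (1, 0), (0, 1)]


def neighbourhood_avg(cells, offsets=CROSS):
    """Average the cells at the given offsets around every anchor point."""
    result = {}
    for ax, ay in cells:
        values = [cells[(ax + dx, ay + dy)]
                  for dx, dy in offsets
                  if (ax + dx, ay + dy) in cells]
        result[(ax, ay)] = sum(values) / len(values) if values else None
    return result


a1 = {(x, y): 0.0 for x in range(4) for y in range(4)}
a1.update({(1, 1): 0.5, (2, 1): 0.5, (3, 1): 0.5})

avg = neighbourhood_avg(a1)
print(avg[(1, 1)], avg[(2, 1)], round(avg[(3, 0)], 3))  # 0.2 0.3 0.167
```

The corner cell (3,0) has only three in-range members, so its average is 0.5 / 3.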
34. Seismology Use Case
Recent aftershock in Chile
2TB waveform data at 100Hz
detecting seismic events using STA/LTA (e.g., 2 sec / 15 sec)
remove false positives
window-based 3 min. cuts
further analysis: digital signal processing operations
Current problems
accessing waveform files too slow
unpacking and positioning MSEED data every time takes too long
35. Seismology Use Case
CREATE ARRAY MSeed (
station VARCHAR(5) DIMENSION ['0':*:'ZZZZZ'],
time TIMESTAMP DIMENSION,
data DECIMAL(8,6)
);
[Figure: the MSeed array drawn as a grid with station (abc, bcd, bce, efg) on the vertical axis and time on the horizontal axis]
36. Seismology Use Case
-- avg of 2 sec. windows:
SELECT M.station, M.time, AVG(M.data)
FROM MSeed AS M
GROUP BY
M[station][time - INTERVAL '2' SECOND : time];
37. Seismology Use Case
CREATE TABLE Event(
station VARCHAR(5),
time TIMESTAMP,
ratio FLOAT,
PRIMARY KEY (station, time));
INSERT INTO Event
SELECT M1.station, M1.time,
AVG(M1.data)/AVG(M2.data) AS ratio
FROM MSeed AS M1, MSeed AS M2
WHERE M1.station = M2.station
AND M1.time = M2.time
GROUP BY
M1[station][time - INTERVAL '2' SECOND : time],
M2[station][time - INTERVAL '15' SECOND : time]
HAVING AVG(M1.data)/AVG(M2.data) > ?delta;
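For readers unfamiliar with STA/LTA, the Python sketch below computes the same short-term over long-term ratio for a single station's samples using trailing windows. It is an illustration under stated assumptions: it averages absolute amplitudes (a common choice, whereas the query above averages the raw data column), the window ends at the current sample, and the function names and synthetic signal are hypothetical.

```python
def sta_lta(samples, rate_hz=100, sta_sec=2, lta_sec=15):
    """Return (sample_index, ratio) pairs of short-term over long-term averages."""
    sta_n, lta_n = sta_sec * rate_hz, lta_sec * rate_hz
    # Prefix sums of absolute amplitude give every window sum in O(1).
    prefix = [0.0]
    for s in samples:
        prefix.append(prefix[-1] + abs(s))
    ratios = []
    for i in range(lta_n, len(samples) + 1):
        sta = (prefix[i] - prefix[i - sta_n]) / sta_n
        lta = (prefix[i] - prefix[i - lta_n]) / lta_n
        if lta > 0:
            ratios.append((i - 1, sta / lta))
    return ratios


def detect(samples, delta, **windows):
    """Indices whose STA/LTA ratio exceeds the threshold delta."""
    return [i for i, ratio in sta_lta(samples, **windows) if ratio > delta]


# Synthetic trace sampled at 10 Hz: 20 s of background, then a 2 s burst.
trace = [1.0] * 200 + [10.0] * 20
hits = detect(trace, delta=3.0, rate_hz=10)
print(hits[0], hits[-1])  # 207 219
```

The threshold plays the role of the `?delta` parameter in the HAVING clause.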
38. Seismology Use Case
-- detect isolated errors by direct environment
-- using wave propagation statistics
CREATE TABLE Neighbors(
station1 VARCHAR(5),
station2 VARCHAR(5),
mindelay INTERVAL SECOND,
maxdelay INTERVAL SECOND,
weight FLOAT
);
-- remove the false positives from Event
-- (Event has no id column, so its key (station, time) identifies a row)
DELETE FROM Event WHERE (station, time) NOT IN (
SELECT E1.station, E1.time
FROM Event AS E1, Event AS E2, Neighbors AS N
WHERE E1.station = N.station1
AND E2.station = N.station2
AND E2.time BETWEEN E1.time + N.mindelay
AND E1.time + N.maxdelay
AND E1.ratio > E2.ratio * N.weight);
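The filtering rule reads as: keep an event only if some neighbouring station reported an event within the expected propagation delay and the ratio test holds. A minimal Python sketch of that rule follows; times are plain seconds instead of TIMESTAMP and INTERVAL values, and the sample stations, delays and weights are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    station: str
    time: float   # seconds; a simplification of TIMESTAMP
    ratio: float


@dataclass(frozen=True)
class Neighbor:
    station1: str
    station2: str
    mindelay: float
    maxdelay: float
    weight: float


def confirmed(events, neighbors):
    """Keep events corroborated by an event at a neighbouring station."""
    keep = []
    for e1 in events:
        for n in neighbors:
            if n.station1 != e1.station:
                continue
            if any(e2.station == n.station2
                   and e1.time + n.mindelay <= e2.time <= e1.time + n.maxdelay
                   and e1.ratio > e2.ratio * n.weight
                   for e2 in events):
                keep.append(e1)
                break
    return keep


events = [Event("abc", 100.0, 5.0), Event("bcd", 103.0, 4.0), Event("efg", 500.0, 6.0)]
neighbors = [Neighbor("abc", "bcd", 1.0, 5.0, 1.0),
             Neighbor("bcd", "abc", -5.0, -1.0, 0.5),
             Neighbor("efg", "abc", 1.0, 5.0, 1.0)]
print([e.station for e in confirmed(events, neighbors)])  # ['abc', 'bcd']
```

The isolated event at station efg has no corroborating neighbour and is dropped as a false positive.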
39. Seismology Use Case
-- pass time series to a UDF, written in, e.g., C:
SELECT myfunction(M[station].*)
FROM MSeed AS M, Event AS E
WHERE M.station = E.station
AND M.time = E.time
GROUP BY DISTINCT
M[station][time - INTERVAL '1' MINUTE :
time + INTERVAL '2' MINUTE];
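The 3-minute cut around each event can be sketched in Python as a slice of a time-sorted series, handed to a function that stands in for the UDF. The sketch assumes one station, times in seconds, and an exclusive upper bound; `peak_amplitude` is a hypothetical placeholder for `myfunction`, and the effect of DISTINCT in the query is not modelled.

```python
from bisect import bisect_left


def cut_windows(times, values, event_times, before=60.0, after=120.0):
    """Slice [t - before, t + after) out of a time-sorted series for each event."""
    windows = []
    for t in event_times:
        lo = bisect_left(times, t - before)
        hi = bisect_left(times, t + after)  # upper bound treated as exclusive
        windows.append(values[lo:hi])
    return windows


def peak_amplitude(window):
    """Stand-in for the user-defined function applied to each cut."""
    return max((abs(v) for v in window), default=None)


# Synthetic 1 Hz series covering 10 minutes.
times = [float(i) for i in range(600)]
values = [float(i % 7) for i in range(600)]

cuts = cut_windows(times, values, event_times=[300.0])
print(len(cuts[0]), peak_amplitude(cuts[0]))  # 180 6.0
```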
40. Conclusion
SciQL: a first step towards a tailored scientific DBMS
A symbiosis of relational and array paradigms
Under active implementation
Open issues:
Appropriate array denotations
Functionally complete operation set
Size limitations (due to BLOB representations)
Existing foreign files
Scale