This presentation was presented by Ying Zhang from CWI at The ADASS XXI on November 06-10, 2011 in Paris, France.
The ever growing use of high precision experimental instruments in astronomical projects, e.g., SDSS, LSST and LOFAR, amounts to an avalanche of data to be stored, curated and analysed. Ingestion of gigabytes and even terabytes of data on a daily basis is taking place in many projects, while planned experimental devices are expected to scale ingestion up to petabytes soon. Efficient data management as part of a data exploration infrastructure has become a discriminative factor for scientific progress. Relational database management systems (RDBMSs) are the prime means to fulfill the role of application mediator for data exchange and data persistence.
Nevertheless, scientific applications are still poorly served by contemporary RDBMSs. At best, the system provides a bridge towards an external library using user-defined functions, explicit import/export facilities or linked-in Java/C# interpreters. To bridge the gap between the needs of the data-intensive scientific research fields like astronomy and the current DBMS technologies, we introduce SciQL (pronounced as `cycle'), the first SQL-based query language for scientific applications with both tables and arrays as first class citizens. SciQL provides a seamless symbiosis of array-, set-, and sequence- interpretation. A key innovation is the extension of value-based grouping in SQL:2003 with structural grouping, i.e., fixed-sized and unbounded groups based on explicit relationships between the dimensional attributes of array cells. This leads to a generalization of window-based query processing with wide applicability in science domains.
In this talk, Ying Zhang demonstrates the usefulness of SciQL for astronomical data processing with examples from transient radio phenomena detection. The Transients Key Science Project (KSP) of the LOw Frequency ARray (LOFAR) focuses on exploring and understanding the explosive and dynamic universe by observing transient and variable radio sources. With its 2688 dipoles (LBA), 200 MHz sampling, 2 polarisations and 12 bit digitisation, the LOFAR antennas are capable of producing 138 petabyte of raw data per day. To process this massive volume of data, extraordinarily efficient query plans and algorithms are a must. The key operations in the Transient KSP project include cross-catalogue correlation and radio pulsars detection. For traditional RDBMSs, they are extremely hard to express in SQL and optimise for query execution. With SciQL, however, such array data oriented operations can be expressed easily and concisely. Furthermore, by revealing the properties of array data, SciQL enables the potentials of the RDBMSs to better optimise query plans.
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Astronomical Data Processing Using SciQL, an SQL Based Query Language for Array Data
1. Astronomical Data Processing Using
SciQL, an SQL Based Query
Language for Array Data
Ying Zhang, Bart Scheers, Martin Kersten, Milena Ivanova, Niels Nes
CWI Amsterdam
ADASS XXI, Nov. 06-10, 2011, Paris, France
!"#$%&'()*+,#-&$.#/(012#&+$#%3$%#,(
2.#(4&#$5()*+,#-&$".1(6&$&
!"#$%&'()&"#*+,-( ./0/123
4")*'()5"%%,%*'(*#-(( 6!7(8 9:7;;9
2.
3. Why Not RDBMS?
SQL is difficult
No appropriate array denotations
No functional complete operation set
DBMSs are slow
Too much overhead
Size limitations (due to BLOB representations)
Existing foreign files
Scale
...
2011-11-09 ADASS XXI 3
4. SciQL
An array query language based on SQL:2003
To lower the entrance fee to RDBMSs
Distinguish features (Kersten et al. AD ’11; Zhang et al.
IDEAS2011):
Arrays and tables as first class citizens of DBMSs
Seamless integration of relational and array paradigms
Named dimensions with constraints
Flexible structure-based grouping
LOFAR Transient Key Project use case
2011-11-09 ADASS XXI 4
5. Array Definitions
y null
3 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
null null
1 0.0 0.0 0.0 0.0
0 0.0 0.0 0.0 0.0
x
0 1 2 3
null
CREATE ARRAY A1 (
x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4],
v FLOAT DEFAULT 0.0);
2011-11-09 ADASS XXI 5
6. Array Definitions
y null
3 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
null null
1 0.0 0.0 0.0 0.0
0 0.0 0.0 0.0 0.0
x
0 1 2 3
null
CREATE ARRAY A1 (
x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4],
v FLOAT DEFAULT 0.0);
dimensions,
any scalar data type
2011-11-09 ADASS XXI 5
7. Array Definitions
y null
3 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
null null
1 0.0 0.0 0.0 0.0
0 0.0 0.0 0.0 0.0
x
0 1 2 3
null
CREATE ARRAY A1 (
x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4],
v FLOAT DEFAULT 0.0);
dimensional range:
[(start|∗) : (step|∗) : (stop|∗)]
2011-11-09 ADASS XXI 6
8. Array Definitions
y null
3 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
null null
1 0.0 0.0 0.0 0.0
0 0.0 0.0 0.0 0.0
x
0 1 2 3
null
CREATE ARRAY A1 (
x INT DIMENSION [0:1:4], y INT DIMENSION [0:1:4],
v FLOAT DEFAULT 0.0);
cell values,
any column data type
2011-11-09 ADASS XXI 7
9. Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x:x+2][y:y+2];
y null
3 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
null null
1 0.0 0.5 0.5 0.5
0 0.0 0.0 0.0 0.0
0 1 2 3
x
null
2011-11-09 ADASS XXI 8
10. Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x:x+2][y:y+2];
y null
3 0.0 0.0 0.0 0.0
Anchor point: 2 0.0 0.0 0.0 0.0
A1[x][y] null null
1 0.0 0.5 0.5 0.5
0 0.0 0.0 0.0 0.0
0 1 2 3
x
null
2011-11-09 ADASS XXI 8
11. Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x:x+2][y:y+2];
y null
3 0.0 0.0 0.0 0.0
Anchor point: 2 0.0 0.0 0.0 0.0
A1[x][y] null null
1 0.0 0.5 0.5 0.5
0 0.0 0.0 0.0 0.0
0 1 2 3
x
null
2011-11-09 ADASS XXI 8
12. Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x:x+2][y:y+2];
y null
3 0.0 0.0 0.0 0.0
Anchor point: 2 0.0 0.0 0.0 0.0
A1[x][y] null null
1 0.0 0.5 0.5 0.5
0 0.0 0.0 0.0 0.0
0 1 2 3
x
null
2011-11-09 ADASS XXI 8
13. Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x:x+2][y:y+2];
y null
3 0.0 0.0 0.0 0.0
Anchor point: 2 0.0 0.0 0.0 0.0
A1[x][y] null null
1 0.0 0.5 0.5 0.5
0 0.0 0.0 0.0 0.0
0 1 2 3
x
null
2011-11-09 ADASS XXI 8
14. Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x:x+2][y:y+2];
y null
3 0.0 0.0 0.0 0.0
Anchor point: 2 0.0 0.0 0.0 0.0
A1[x][y] null null
1 0.0 0.5 0.5 0.5
0 0.0 0.0 0.0 0.0
0 1 2 3
x
null
2011-11-09 ADASS XXI 8
15. Array Tiling
SELECT [x], [y], AVG(v) FROM A1
GROUP BY A1[x:x+2][y:y+2];
y null
3 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
null null
1 0.125 0.25 0.25 0.25
0 0.125 0.25 0.25 0.25
0 1 2 3
x
null
2011-11-09 ADASS XXI 9
16. LOFAR Catalogue
ra DOUBLE,
zone (Gray et al. 2006) frequency decl DOUBLE,
ra_err DOUBLE,
90 decl_err DOUBLE,
... flux DOUBLE,
... ...
ν4
2
ν3
1
ν2 V
0 U
ν1 Q
-1 I
-2 t1 t2 t3 t4 ... time
...
-90
0 1 2 3 ... 357 358 359 meridian
CREATE ARRAY LOFARsrc (
zone INT DIMENSION[-90:1:91], mrdn INT DIMENSION[0:1:360],
ts TIMESTAMP DIMENSION, freq INT DIMENSION[30:10:241],
id INT DIMENSION[0:1:*], stks CHAR(1) DIMENSION
CHECK(stks=`I' OR stks=`Q' OR stks=`U' OR stks=`V'),
ra DOUBLE, decl DOUBLE, ra_err DOUBLE, decl_err DOUBLE,
flux DOUBLE, ...);
2011-11-09 ADASS XXI 10
17. LOFAR Use Case
ra DOUBLE,
zone (Gray et al. 2006) frequency decl DOUBLE,
ra_err DOUBLE,
90 decl_err DOUBLE,
... flux DOUBLE,
... ...
ν4
2
ν3
1
ν2 V
0 U
ν1 Q
-1 I
-2 t1 t2 t3 t4 ... time
...
-90
0 1 2 3 ... 357 358 359 meridian
Similarity of the flux of a LOFAR source at frequencies 30
MHz and 200 MHz
cross-correlation of two time series
2011-11-09 ADASS XXI 11
18. Cross-Correlation
idx 0 1 2 3
F
val 4 3 6 2
idx 0 1 2
G
val 1 5 7
idx -3 -2 -1 0 1 2
Cr
val
2011-11-09 ADASS XXI 12
19. Cross-Correlation
Cr.idx = -3
idx 0 1 2 3
F F [3 : 4]
val 4 3 6 2
idx 0 1 2 G [0 : 1]
G
val 1 5 7
idx -3 -2 -1 0 1 2
Cr
val 2
2011-11-09 ADASS XXI 13
20. Cross-Correlation
Cr.idx = -2
idx 0 1 2 3
F F [2 : 4]
val 4 3 6 2
idx 0 1 2 G [0 : 2]
G
val 1 5 7
idx -3 -2 -1 0 1 2
Cr
val 2 16
2011-11-09 ADASS XXI 14
21. Cross-Correlation
Cr.idx = -1
idx 0 1 2 3
F F [1 : 4]
val 4 3 6 2
idx 0 1 2 G [0 : 3]
G
val 1 5 7
idx -3 -2 -1 0 1 2
Cr
val 2 16 47
2011-11-09 ADASS XXI 15
22. Cross-Correlation
Cr.idx = 0
idx 0 1 2 3
F F [0 : 3]
val 4 3 6 2
idx 0 1 2 G [0 : 3]
G
val 1 5 7
idx -3 -2 -1 0 1 2
Cr
val 2 16 47 61
2011-11-09 ADASS XXI 16
23. Cross-Correlation
Cr.idx = 1
idx 0 1 2 3
F F [0 : 2]
val 4 3 6 2
idx 0 1 2 G [1 : 3]
G
val 1 5 7
idx -3 -2 -1 0 1 2
Cr
val 2 16 47 61 41
2011-11-09 ADASS XXI 17
24. Cross-Correlation
Cr.idx = 2
idx 0 1 2 3
F F [0 : 1]
val 4 3 6 2
idx 0 1 2 G [2 : 3]
G
val 1 5 7
idx -3 -2 -1 0 1 2
Cr
val 2 16 47 61 41 28
2011-11-09 ADASS XXI 18
25. LOFAR Use Case
ra DOUBLE,
zone (Gray et al. 2006) frequency decl DOUBLE,
ra_err DOUBLE,
90 decl_err DOUBLE,
... flux DOUBLE,
... ...
ν4
2
ν3
1
ν2 V
0 U
ν1 Q
-1 I
-2 t1 t2 t3 t4 ... time
...
-90
0 1 2 3 ... 357 358 359 meridian
DECLARE fcnt INT, gcnt INT;
SET fcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][30][11][‘I’];
SET gcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][200][11][‘I’];
CREATE ARRAY VIEW F (idx INT DIMENSION[0:1:fcnt], flux DOUBLE DEFAULT 0.0) AS SELECT flux FROM
LOFARsrc[*][*][*][30][11][‘I’];
CREATE ARRAY VIEW G (idx INT DIMENSION[0:1:gcnt], val DOUBLE DEFAULT 0.0) AS SELECT flux FROM
LOFARsrc[*][*][*][200][11][‘I’];
CREATE ARRAY CrCorr30_200 (idx INT DIMENSION[-fcnt+1:1:gcnt], val DOUBLE DEFAULT 0.0);
INSERT INTO CrCorr SELECT SUM(F.flux * G.flux) FROM F, G, CrCorr30_200 AS C
GROUP BY F[MAX(0, -C.idx) : MIN(fcnt, gcnt-C.idx)], G[MAX(0, C.idx) : MIN(gcnt, fcnt+C.idx)];
2011-11-09 ADASS XXI 19
26. LOFAR Use Case
ra DOUBLE,
zone (Gray et al. 2006) frequency decl DOUBLE,
ra_err DOUBLE,
90 decl_err DOUBLE,
... flux DOUBLE,
... ...
ν4
2
ν3
1
ν2 V
0 U
ν1 Q
-1 I
-2 t1 t2 t3 t4 ... time
...
-90
0 1 2 3 ... 357 358 359 meridian retrieve the time series
DECLARE fcnt INT, gcnt INT;
SET fcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][30][11][‘I’];
SET gcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][200][11][‘I’];
CREATE ARRAY VIEW F (idx INT DIMENSION[0:1:fcnt], flux DOUBLE DEFAULT 0.0) AS SELECT flux FROM
LOFARsrc[*][*][*][30][11][‘I’];
CREATE ARRAY VIEW G (idx INT DIMENSION[0:1:gcnt], val DOUBLE DEFAULT 0.0) AS SELECT flux FROM
LOFARsrc[*][*][*][200][11][‘I’];
CREATE ARRAY CrCorr30_200 (idx INT DIMENSION[-fcnt+1:1:gcnt], val DOUBLE DEFAULT 0.0);
INSERT INTO CrCorr SELECT SUM(F.flux * G.flux) FROM F, G, CrCorr30_200 AS C
GROUP BY F[MAX(0, -C.idx) : MIN(fcnt, gcnt-C.idx)], G[MAX(0, C.idx) : MIN(gcnt, fcnt+C.idx)];
2011-11-09 ADASS XXI 19
27. LOFAR Use Case
ra DOUBLE,
zone frequency decl DOUBLE,
ra_err DOUBLE,
90 decl_err DOUBLE,
... flux DOUBLE,
... ...
ν4
2
ν3
1
ν2 V
0 U
ν1 Q
-1 I
-2 t1 t2 t3 t4 ... time
...
-90
0 1 2 3 ... 357 358 359 meridian dynamic grouping for every iteration
DECLARE fcnt INT, gcnt INT;
SET fcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][30][11][‘I’];
SET gcnt = SELECT COUNT(*) FROM LOFARsrc[*][*][*][200][11][‘I’];
CREATE ARRAY VIEW F (idx INT DIMENSION[0:1:fcnt], flux DOUBLE DEFAULT 0.0) AS SELECT flux FROM
LOFARsrc[*][*][*][30][11][‘I’];
CREATE ARRAY VIEW G (idx INT DIMENSION[0:1:gcnt], val DOUBLE DEFAULT 0.0) AS SELECT flux FROM
LOFARsrc[*][*][*][200][11][‘I’];
CREATE ARRAY CrCorr30_200 (idx INT DIMENSION[-fcnt+1:1:gcnt], val DOUBLE DEFAULT 0.0);
INSERT INTO CrCorr SELECT SUM(F.flux * G.flux) FROM F, G, CrCorr30_200 AS C
GROUP BY F[MAX(0, -C.idx) : MIN(fcnt, gcnt-C.idx)], G[MAX(0, C.idx) : MIN(gcnt, fcnt+C.idx)];
2011-11-09 ADASS XXI 20
28. Conclusion
SciQL: a novel query language for scientific data
A symbiosis of relational and array paradigm
Simplifies expression of complex scientific algorithms
Leave optimisation to DBMS kernel
Opens opportunities to enhance scientific data mining
Under active implementation
!"#$%&'()*+,#-&$.#/(012#&+$#%3$%#,(
www.scilens.org www.monetdb.org
2.#(4&#$5()*+,#-&$".1(6&$&
!"#$%&'()&"#*+,-( ./0/123
4")*'()5"%%,%*'(*#-(( 6!7(8 9:7;;9
2011-11-09 ADASS XXI 21