Workshop on TelegraphCQ:
Concept of a Data Stream Management System (DSMS).
TelegraphCQ: the DSMS developed at Berkeley; internal architecture.
Differences from traditional databases.
Adaptive query processing using the new concept of eddies as routing operators.
Problems with joining streams (for which no statistics are available) and relations, and the two solutions: STAIRs and SteMs.
STAIR: a join operator that allows its internal state to be changed through primitive functions visible to the eddy.
SteM: a half-join operator that keeps homogeneous tuples; its internal state is independent of prior routing decisions.
The eddy routing policy, implemented with lottery scheduling (Waldspurger & Weihl, 1994).
1. Berkeley's TelegraphCQ: details about TelegraphCQ; adaptivity: Eddies, SteMs and STAIRs; routing policy: lottery scheduling. Friday, July 15, 2011. Alberto Minetti, Advanced Data Management @ DISI, Università degli Studi di Genova
2. Data Stream Management System. Continuous, unbounded, rapid, time-varying streams of data elements occur in a variety of modern applications: network monitoring and traffic engineering; sensor networks, RFID tags; telecom call records; financial applications; web logs and click-streams; manufacturing processes. DSMS = Data Stream Management System
3. The Beginning: Telegraph. Several continuous queries over several data streams. Initially written in Java, then C-based using PostgreSQL. No distributed scheduling. The level of adaptivity does not change under overload. Ignores system resources. Data management fully in memory.
4. Redesign: TelegraphCQ. Developed by Berkeley University. Written in C/C++. OpenBSD license. Based on the PostgreSQL sources. Current version: 0.6 (PostgreSQL 7.3.2). Project closed in 2006. Several points of interest and features. Software: http://telegraph.cs.berkeley.edu. Papers: http://db.cs.berkeley.edu/telegraph. Commercial spinoff: Truviso.
5. TelegraphCQ Architecture. PostgreSQL backends: many TelegraphCQ front-ends, only one TelegraphCQ back-end. Front-end: forks for every client connection; does not hold streams; parses continuous queries into shared memory. Back-end: has an eddy; joins query plans together; plans can be shared among queries; puts results into shared memory.
6. TelegraphCQ Architecture. [Diagram: a TelegraphCQ front-end (listener, parser, planner, mini-executor, proxy) communicates through shared memory (query plan queue, eddy control queue, query result queues, buffer pool, catalog) with the TelegraphCQ back-end (CQEddies, modules, splits, scans) and with the TelegraphCQ Wrapper ClearingHouse (wrappers), which reads from disk.] Taken from Michael Franklin, UC Berkeley
7. Module Types. Input and caching (relations and streams): interfaces to external data sources; wrappers for HTML, XML, the filesystem, P2P proxies; remote databases with caching support. Query execution: non-blocking versions of the classical operators (sel, proj); SteMs, STAIRs. Adaptive routing: reoptimizes the plan during execution; Eddies choose the route tuple by tuple; Juggle orders on the fly (by value or timestamp); Flux routes tuples among the machines of a cluster. (Fjords framework)
8. Fjords: Framework in Java for Operators on Remote Data Streams. Interconnects modules; supports queues between modules; non-blocking; supports both relations and streams.
9. Streams in TelegraphCQ. Unarchived stream: never written to disk; lives in shared memory between the executor and the wrapper. Archived stream: an append-only method to send tuples to the system; no update, insert, or delete; queries aggregate over windows. The column tcqtime, of type TIMESTAMP with the constraint TIMESTAMPCOLUMN, is used for window queries.
10. DDL: Create Stream
CREATE STREAM measurements (
  tcqtime TIMESTAMP TIMESTAMPCOLUMN,
  stationid INTEGER,
  speed REAL) TYPE ARCHIVED;
CREATE STREAM tinydb (
  tcqtime TIMESTAMP TIMESTAMPCOLUMN,
  light REAL,
  temperature REAL) TYPE UNARCHIVED;
DROP STREAM measurements;
11. Data Acquisition. Sources must identify themselves before sending data. Wrapper: user-defined functions describing how the data is processed, running inside the Wrapper ClearingHouse process. Push sources: initiate a connection to TelegraphCQ. Pull sources: the wrapper initiates the connection. Data from different wrappers can merge into the same stream. Heartbeat: a punctuated tuple without data, carrying only a timestamp; once a punctuated tuple is seen, no earlier data will arrive.
12. Wrappers in the WCH. Non-blocking over network sockets (TCP/IP). A wrapper function is called when data is available on the socket, or when data is available in the buffer. Each function returns one tuple at a time (the classic iterator), as an array of PostgreSQL Datum values. Init(WrapperState*): allocates resources and state. Next(WrapperState*): the tuples are placed in the WrapperState. Done(WrapperState*): frees resources and destroys the state. Everything lives in PostgreSQL's memory infrastructure.
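The Init/Next/Done lifecycle above can be sketched as a Python analogy (this is not the real C API: the CSV framing, field layout, and the `data` parameter are assumptions for the example; in TelegraphCQ the bytes come from the socket and the result is an array of Datum values):

```python
# Illustrative sketch of the wrapper lifecycle: init allocates state,
# next yields one tuple per call (or None when no complete tuple is
# buffered yet), done frees the state.

class WrapperState:
    def __init__(self):
        self.buffer = ""      # partial data read from the socket
        self.tuples = []      # complete, parsed tuples ready to emit

def measurements_init(state):
    state.buffer, state.tuples = "", []
    return True

def measurements_next(state, data=""):
    """Feed newly arrived bytes; return one tuple per call, iterator-style."""
    state.buffer += data
    while "\n" in state.buffer:            # a full CSV line = a full tuple
        line, state.buffer = state.buffer.split("\n", 1)
        ts, station, speed = line.split(",")
        state.tuples.append((ts, int(station), float(speed)))
    return state.tuples.pop(0) if state.tuples else None

def measurements_done(state):
    state.buffer, state.tuples = "", []

s = WrapperState()
measurements_init(s)
t = measurements_next(s, "2003-07-15 10:00:00,42,88.5\n")
measurements_done(s)
```

Note how buffering handles the case (mentioned in the editor's notes) where only a fragment of a tuple has arrived: `next` simply returns None until a full line is available.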
13. DDL: Create Wrapper
CREATE FUNCTION measurements_init(INTEGER)
RETURNS BOOLEAN
AS 'libmeasurements.so', 'measurements_init'
LANGUAGE 'C';
CREATE WRAPPER mywrapper (
  init=measurements_init,
  next=measurements_next,
  done=measurements_done);
14. HtmlGet and WebQueryServer. HtmlGet allows the user to execute a series of HTML GET or POST requests to arrive at a page, and then to extract data out of the page using a TElegraph Screen Scraper definition file. Once extracted, this data can be output to a CSV file. From the TESS homepage: "TESS is the TElegraph Screen Scraper. It is part of The Telegraph Project at UC Berkeley. TESS is a program that takes data from web forms (like search engines or database queries) and turns it into a representation that is usable by a database query processor." http://telegraph.cs.berkeley.edu/tess/
15. Self-Monitoring Capability. Three special streams give information about the system state. Support for introspective queries: a dynamic catalog, queried like any other stream. tcq_queries(tcqtime, qrynum, qid, kind, qrystr); tcq_operators(tcqtime, opnum, opid, numqrs, kind, qid, opstr, opkind, opdesc); tcq_queues(tcqtime, opid, qkind, kind)
16. Example
Welcome to psql 0.2, the PostgreSQL interactive terminal.
# CREATE SCHEMA traffic;
# CREATE STREAM traffic.measurements (
    stationid INT,
    speed REAL,
    tcqtime TIMESTAMP TIMESTAMPCOLUMN ) TYPE ARCHIVED;
# ALTER STREAM traffic.measurements ADD WRAPPER csvwrapper;
$ cat tcpdump.log | source.pl localhost 5533 csvwrapper,network.tcpdum
source.pl is the default TelegraphCQ script, written in Perl, that simulates sources sending CSV data; 5533 is the default port.
17. Load Shedding. CREATE STREAM … TYPE UNARCHIVED ON OVERLOAD … with one of: BLOCK: stop reading (default); DROP: drop tuples; KEEP COUNTS: keep a count of the dropped tuples; REGHIST: build a fixed-grid histogram of the shed tuples; MYHIST: build an MHIST (multidimensional histogram); WAVELET wavelet-params: build a wavelet histogram; SAMPLE: keep a reservoir sample.
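The SAMPLE option keeps a reservoir sample of the shed tuples. A minimal sketch of reservoir sampling (Vitter's Algorithm R, a standard way to draw a uniform sample from an unbounded stream; the seeded generator is only there to make the sketch reproducible):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k tuples from a stream of
    unknown length, touching each tuple exactly once."""
    rng = rng or random.Random(0)     # seeded for reproducibility
    sample = []
    for i, tup in enumerate(stream):
        if i < k:
            sample.append(tup)        # fill the reservoir first
        else:
            j = rng.randint(0, i)     # tuple i kept with probability k/(i+1)
            if j < k:
                sample[j] = tup       # evict a random resident
    return sample

sample = reservoir_sample(range(1000), 10)
```

Each arriving tuple has the same chance of ending up in the final sample, which is why the summary stream's `__samplemult` column (next slide) can scale each kept tuple back up to the number of dropped tuples it represents.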
18. Load Shedding: Summary Streams. For a stream named schema.stream, two streams are created automatically: schema.__stream_dropped and schema.__stream_kept. For WAVELET, MYHIST, REGHIST and COUNTS, the schema contains the summary data and the summary interval. For SAMPLE, it is the same schema with an extra column __samplemult, recording how many actual tuples each sampled tuple represents.
19. Querying TelegraphCQ: StreaQuel. Continuous queries over standard relations (inherited from PostgreSQL) and windowed data streams (sliding, hopping, jumping). RANGE BY specifies the window size; SLIDE BY specifies the update rate; START AT (optional) specifies when the query will begin.
SELECT stream.color, COUNT(*)
FROM stream [RANGE BY '9' SLIDE BY '1']
GROUP BY stream.color
[Diagram: a window of size 9 sliding one tuple at a time over a stream of 1s and 2s, emitting counts at each slide.] Adapted from Jarle Søberg
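The RANGE BY / SLIDE BY semantics of the query above can be sketched as follows (a simplification: the window here is counted in tuples, whereas TelegraphCQ windows are defined over the tcqtime timestamps):

```python
from collections import Counter, deque

def windowed_counts(stream, range_by, slide_by):
    """Every slide_by tuples, emit COUNT(*) per value over the
    last range_by tuples, mimicking GROUP BY over a sliding window."""
    window = deque(maxlen=range_by)   # oldest tuples fall out automatically
    results = []
    for i, color in enumerate(stream, start=1):
        window.append(color)
        if i >= range_by and (i - range_by) % slide_by == 0:
            results.append(dict(Counter(window)))
    return results

results = windowed_counts([1, 1, 1, 1, 1, 2, 2, 1, 2], 9, 1)
# -> [{1: 6, 2: 3}]
```

With SLIDE BY smaller than RANGE BY the windows overlap (sliding); equal values give hopping windows, and larger values leave gaps (jumping).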
20. Querying TelegraphCQ: StreaQuel (2). wtime(*) returns the last timestamp in the window. Recursive queries use WITH [SQL:1999]. StreaQuel does not allow subqueries.
SELECT S.a, wtime(*)
FROM S [RANGE BY '10 seconds' SLIDE BY '10 seconds'],
     R [RANGE BY '10 seconds' SLIDE BY '10 seconds']
WHERE S.a = R.a;
[Diagram: a 10-second window over each data stream.]
21. Net Monitor Windowed Queries.
All active connections:
SELECT (CASE WHEN outgoing = true THEN src_ip ELSE dst_ip END) AS inside_ip,
       (CASE WHEN outgoing = true THEN dst_ip ELSE src_ip END) AS outside_ip,
       sum(bytes_sent) + sum(bytes_recv) AS bytes
FROM flow [RANGE BY $w SLIDE BY $w]
GROUP BY inside_ip, outside_ip
The 100 sources with the largest number of connections:
SELECT src_ip, wtime(*), COUNT(DISTINCT(dst_ip||dst_port)) AS fanout
FROM flow [RANGE BY $w SLIDE BY $w]
WHERE outgoing = false
GROUP BY src_ip
ORDER BY fanout DESC LIMIT 100 PER WINDOW;
The 100 most significant sources of traffic:
SELECT sum(bytes_sent), src_ip, wtime(*) AS now
FROM flow [RANGE BY $w SLIDE BY $w]
WHERE outgoing = false
GROUP BY src_ip
ORDER BY sum(bytes_sent) DESC LIMIT 100 PER WINDOW;
Taken from: İlkay Ozan Kay
22. Adaptive Query Processing: Evolution. A spectrum from evolutionary to revolutionary: static plans, late binding, inter-operator adaptivity, intra-operator adaptivity, per-tuple adaptivity. Examples along the way: traditional DBMS, dynamic QEP, parametric, query scrambling, mid-query reoptimization, XJoin, DPHJ, convergent QP, competitive, progressive optimization, Eddies. Taken from: Amol Deshpande, Vijayshankar Raman
23. Adaptive Query Processing: Current Background. Several plans; parametric queries. Continuous queries: focus on incremental output. Complex queries (10 to 20 relations in a join). Data streams and asynchronous data: statistics not available. Interactive queries and user preferences. Sharing of state among several operators. XML data and text. Wide-area federations.
24. Adaptive Query Processing: System R. Repeat: observe the environment daily/weekly (runstats); choose a behaviour (optimizer); if the current plan is not the best plan (analyzer), actuate the new plan (executor). Cost-based optimization. The runstats-optimize-execute loop is very coarse-grained: weekly adaptivity! Goal: adaptivity per tuple, merging the four phases: measure, analyze, plan, actuate. Taken from: Avnur, Hellerstein
25. TelegraphCQ Executor: Eddy. The idea is taken from fluid mechanics: a continuously adaptive query processing mechanism. An eddy is a routing operator: it decides which modules a tuple must visit, and in what order; once a tuple has visited all modules, it can be output. The eddy sees tuples before and after each module (operator), closing the measure-analyze-plan-actuate loop per tuple. Taken from: Amol Deshpande, Vijayshankar Raman
26. Eddies: Correctness. Every tuple carries two bit vectors, with one position per operator. Ready: tells whether the tuple is ready for that operator, so the eddy can decide which tuples to send where. Done: tells whether the tuple was already processed by that operator, so the eddy never sends a tuple to the same operator twice. When all Done bits are set, the tuple is output. A joined tuple's Ready and Done vectors are the bitwise OR of its inputs'; for simple selections, Ready = complement(Done).
27. Eddies: Simple Selection Routing. SELECT * FROM S WHERE S.a > 10 AND S.b < 15 AND S.c = 15. [Diagram: the eddy routes each S tuple through σa (S.a > 10), σb (S.b < 15) and σc (S.c = 15) in any order; a tuple with a = 15, b = 0, c = 15 starts with Ready = 111 and Done = 000, and each selection it visits flips the corresponding bits.] Adapted from Manuel Hertlein
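The selection-routing example above can be sketched in a few lines. This is a toy model of the Done bookkeeping, not TelegraphCQ's implementation: the routing choice here is uniformly random, whereas a real eddy applies a policy such as lottery scheduling, and Ready is left implicit since for pure selections Ready = complement(Done):

```python
import random

# Predicates from the slide's query: S.a > 10 AND S.b < 15 AND S.c = 15
ops = [lambda t: t["a"] > 10,
       lambda t: t["b"] < 15,
       lambda t: t["c"] == 15]

def eddy(tuples, rng=None):
    rng = rng or random.Random(0)
    out = []
    for t in tuples:
        done = [False] * len(ops)
        alive = True
        while alive and not all(done):
            ready = [i for i, d in enumerate(done) if not d]
            i = rng.choice(ready)   # routing choice (random stand-in policy)
            done[i] = True          # never route to the same operator twice
            alive = ops[i](t)       # a failed selection drops the tuple
        if alive:
            out.append(t)           # all Done bits set: output the tuple
    return out

survivors = eddy([{"a": 15, "b": 0, "c": 15},   # passes all predicates
                  {"a": 5, "b": 20, "c": 15}])  # fails a>10 (and b<15)
```

Whatever order the predicates are visited in, the same tuples survive; the order only affects how much work is done, which is exactly what the routing policy optimizes.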
28. Relational Binary Join: R ⋈ S ⋈ T. Join order (R ⋈ S) ⋈ T. This works fine with direct access to the data (index or sequential scan). Taken from Jarle Søberg
29. Stream Binary Join: R ⋈ S ⋈ T. Join order (R ⋈ S) ⋈ T. But if the data is pushed by the sources, blocking or dropping some tuples is inevitable! Taken from Jarle Søberg
30. Stream Binary Join: R ⋈ S ⋈ T. On-the-fly optimization is necessary: streams change often, and reoptimization takes a lot of time. Not dynamic enough! Taken from Jarle Søberg
31. Stream Binary Join: Eddies. Using an eddy (the initial approach of Telegraph) gives tuple-based adaptivity and takes dynamic changes in the stream into account. Taken from Jarle Søberg
32. Eddies: Scheduling Join Problem. Scheduling on the selectivity of the joins alone does not work. Example: |S ⋈ E| << |E ⋈ C|; E and C arrive early, S is delayed. Taken from Amol Deshpande
33. [Diagram: with |S ⋈ E| << |E ⋈ C|, E and C arriving early and S delayed, the eddy decides to route E to E ⋈ C, filling hash tables on E.Course and C.Course. The prefix S0 of S sent and received so far suggests that S ⋈ E is the better option, and the eddy eventually learns the correct sizes, but too late: the state S0 ⋈ E and (S − S0) ⋈ E is already committed.] Taken from Amol Deshpande
34. [Diagram: state got embedded in the join hash tables as a result of the earlier routing decisions, so despite |S ⋈ E| << |E ⋈ C|, the execution plan actually used joins E with C first: the query is executed using the worse plan. Too late!] Taken from Amol Deshpande
35. STAIR: Storage, Transformation and Access for Intermediate Results. Exposes the internal state of the join to the eddy and provides primitive functions for state management: demotion and promotion, in addition to the usual operations of insertion (build) and lookup (probe). [Diagram: the eddy connects S, E and C to the S.Name, E.Name, E.Course and C.Course STAIRs; a tuple s1 is built into the S.Name STAIR and probes the E.Name STAIR.] Taken from Amol Deshpande
36. STAIR: Demotion. Demoting e2c1 to e2 is a simple projection: the intermediate tuple e2c1 in a STAIR is replaced by its base component e2. Demotion can be thought of as undoing work. [Diagram: hash tables on the S.Name, E.Name, E.Course and C.Course STAIRs holding e1, s1e1, e2c1 and c1.] Adapted from Amol Deshpande
37. STAIR: Promotion. A join is used to promote a tuple: e.g. e1 in the E.Course STAIR is promoted to e1c1 using the C.Course STAIR. Promotion can be thought of as precomputation of work. [Diagram: hash tables holding e1c1, e2c1, s1e1 and c1.] Adapted from Amol Deshpande
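The two STAIR primitives can be sketched on a toy state (this is only an illustration of the demote/undo and promote/precompute idea: real STAIRs are hash indexes and the names e2, c1, e2c1 mirror the slides):

```python
def demote(stair, joined, base):
    """Demotion: replace an intermediate tuple by its base component
    (a simple projection). Can be thought of as undoing work."""
    stair.remove(joined)
    stair.append(base)

def promote(stair, base, other_stair, join):
    """Promotion: eagerly join a base tuple against another STAIR's
    state and store the results instead. Precomputation of work."""
    stair.remove(base)
    matches = [join(base, o) for o in other_stair]
    stair.extend(m for m in matches if m is not None)

# Toy state: the E.Course STAIR holds the intermediate tuple e2c1.
e_course = ["e2c1"]
demote(e_course, "e2c1", "e2")       # undo: keep only the E component
after_demote = list(e_course)        # -> ["e2"]

c_course = ["c1"]                    # C.Course STAIR state
promote(e_course, "e2", c_course,    # redo the work via E x C
        lambda e, c: e + c)          # toy join: concatenate names
```

Because demotion followed by promotion reconstructs exactly the tuples a join would have produced, the eddy can move state between plans mid-query without losing or duplicating results (Theorem 3.1 below).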
39. Demotion OR Promotion. Taken from: Lifting the Burden of History from Adaptive Query Processing, Amol Deshpande and Joseph M. Hellerstein
40. Demotion AND Promotion. Taken from: Lifting the Burden of History from Adaptive Query Processing, Amol Deshpande and Joseph M. Hellerstein
41. [Diagram: |S ⋈ E| << |E ⋈ C|; E and C arrive early, S is delayed. The eddy first decides to route E to E ⋈ C, building S0 ⋈ E and E ⋈ C state; once it learns the correct selectivities, it decides to migrate E by promoting E using E ⋈ C.] Adapted from Amol Deshpande
42. [Diagram: the remaining tuples S − S0 are then processed against the migrated state, producing (S − S0) ⋈ E ⋈ C.] Adapted from Amol Deshpande
43. [Diagram: the partial results S0 ⋈ E ⋈ C and (S − S0) ⋈ E ⋈ C are combined by a UNION: most of the data is processed using the correct plan.] Adapted from Amol Deshpande
44. STAIRs: Correctness. Theorem [3.1]: an eddy with STAIRs always produces the correct query result in spite of arbitrary applications of the promotion and demotion operations. STAIRs will produce every result tuple, and there will be no spurious duplicates. Taken from: Lifting the Burden of History from Adaptive Query Processing, Amol Deshpande and Joseph M. Hellerstein
45. State Module (SteM). A kind of temporary data repository: a half-join operator that keeps homogeneous tuples. The state inside the operator is independent of routing decisions. It supports the operations: insertion (build), lookup (probe), and deletion (eviction), the latter useful for windows. Similar to an MJoin but more adaptive. Allows sharing state among other continuous queries; but since intermediate results are not stored, the computation cost increases significantly.
46. Eddies Join with SteMs. More adaptivity: the eddy knows about the half-joins. Different access methods: index access, scan access. Can simulate several kinds of join. Under overload? Hash join (fast!). Under memory limits? Index join. Join family? Hash join (equi-join), B-tree join (<, <=, >). A parametric query can be thought of as a join. Adapted from Jarle Søberg
47. SteMs: Correctness. When R and S probe each other there is a correctness problem: possible duplicates! Solution: a globally unique sequence number per tuple; only the younger tuple may probe. Taken from Jarle Søberg
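The build/probe/evict operations and the younger-probes-only rule can be sketched together (a toy model: tuples are dicts, the hash index is a defaultdict, and the `seq`/`a` field names are made up for the example):

```python
from collections import defaultdict

class SteM:
    """Half-join state module: stores homogeneous tuples, supports
    build, probe, and eviction. Each tuple carries a globally unique
    sequence number; a probe only matches OLDER stored tuples, so no
    join result can be produced twice from both sides."""
    def __init__(self, key):
        self.key = key
        self.table = defaultdict(list)   # hash index on the join key

    def build(self, tup):
        self.table[tup[self.key]].append(tup)

    def probe(self, tup):
        return [s for s in self.table.get(tup[self.key], [])
                if s["seq"] < tup["seq"]]   # only the younger tuple probes

    def evict(self, oldest_seq):
        for k in list(self.table):          # window slid: drop old tuples
            self.table[k] = [s for s in self.table[k]
                             if s["seq"] >= oldest_seq]

# R join S on attribute a: every arriving tuple is built into its own
# SteM, then probes the other side's SteM.
r, s = SteM("a"), SteM("a")
r.build({"a": 1, "seq": 0})
s.build({"a": 1, "seq": 1})
hits_s = r.probe({"a": 1, "seq": 1})   # S tuple (younger) probes R
r.build({"a": 1, "seq": 2})
hits_r = s.probe({"a": 1, "seq": 2})   # next R tuple probes S
r.evict(2)                             # window slides past seq 0 and 1
```

Each matching pair is reported exactly once, by whichever tuple arrived later; eviction removes only the expired tuples instead of rebuilding the whole hash table, which is the point of the sliding-window slide below.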
48. SteMs: Sliding Window.
SELECT * FROM Str [RANGE BY '5 seconds' SLIDE BY '2 seconds'], Relation
WHERE Relation.a = Str.a
The SteM keeps the state for the sliding window (eviction). At time 40, what will happen at time 42? Instead of rebuilding the whole hash table, it only removes the old tuples and adds the new ones. [Diagram: tuples A, B, C arriving with timestamps 18:49:36 to 18:49:41; when the window slides, the tuples with timestamp 18:49:37 and older are evicted.]
49. Binary Join, STAIR, SteM: Comparison.
SELECT * FROM customer c, orders o, lineitem l
WHERE c.custkey = o.custkey
  AND o.orderkey = l.orderkey
  AND c.nationkey = 1
  AND c.acctbal > 9000
  AND l.shipdate > date '1996-01-01'
lineitem arrives with ascending shipdate; initial routing (O ⋈ L) ⋈ C. The binary join needs recomputation and is not adaptive. Taken from: Lifting the Burden of History from Adaptive Query Processing, Amol Deshpande and Joseph M. Hellerstein
50. Eddies: Routing Policy. How to choose the best plan? Through routing: every tuple is routed individually, so the routing policy determines the system's efficiency. The eddy has a tuple buffer with priorities: tuples enter with low priority and gain a higher priority when they exit an operator, so each tuple is sent to the output as early as possible. This avoids congesting system memory and keeps memory consumption low.
52. Eddies' Routing Policy: Lottery Scheduling. Waldspurger & Weihl, 1994: an algorithm for scheduling shared resources that can "rapidly focus available resources". Every operator begins with N tickets. An operator receives a ticket when it takes a tuple: this promotes operators that consume tuples quickly. An operator loses a ticket when it returns a tuple: this promotes operators with low selectivity (low: operators that return few tuples after processing many). When two operators compete for a tuple, the tuple is assigned to the operator that wins the lottery. Never leave an operator without tickets, and add some random exploration.
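The ticket bookkeeping above can be sketched as follows (a minimal model: operator names, the starting ticket count of 1, and the seeded generator are assumptions for the example; the real policy also mixes in random exploration beyond what weighted lotteries already provide):

```python
import random

class LotteryRouter:
    """Lottery-scheduling routing policy: an operator gains a ticket
    when it takes a tuple and loses one when it returns a tuple, so
    fast, low-selectivity operators win more lotteries."""
    def __init__(self, operators, rng=None):
        self.tickets = {op: 1 for op in operators}  # everyone starts equal
        self.rng = rng or random.Random(0)

    def choose(self, eligible):
        # hold a lottery among the operators competing for this tuple,
        # weighted by their current ticket counts
        pool = [op for op in eligible for _ in range(self.tickets[op])]
        return self.rng.choice(pool)

    def on_take(self, op):
        self.tickets[op] += 1        # rewards operators that consume fast

    def on_return(self, op):
        if self.tickets[op] > 1:     # never leave an operator ticketless
            self.tickets[op] -= 1    # penalizes high-selectivity operators

router = LotteryRouter(["s1", "s2"])
for _ in range(10):                  # s1 consumes ten tuples...
    router.on_take("s1")
router.on_return("s1")               # ...and returns only one
```

After this history s1 holds 10 tickets to s2's 1, so s1 wins roughly 10 of every 11 lotteries: the eddy has "rapidly focused" its tuples on the cheaper, more selective operator.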
53. Eddies' Routing Policy: Lottery Scheduling. Lottery scheduling performs better than back-pressure. [Graph: cost(s1) = cost(s2), sel(s2) = 50%, sel(s1) varies.] Taken from: Avnur, Hellerstein
57. Other Works. Distributed Eddies. Freddies: DHT-Based Adaptive Query Processing via Federated Eddies. Content-Based Routing. Partial Results for Online Query Processing. Flux: An Adaptive Partitioning Operator for Continuous Query Systems. Java Support for Data-Intensive Systems: Experiences Building the Telegraph Dataflow System. Ripple Join for Online Aggregation. Highly Available, Fault-Tolerant, Parallel Dataflows.
58. Bibliography. TelegraphCQ: An Architectural Status Report; Continuous Dataflow Processing for an Uncertain World; Enabling Real-Time Querying of Live and Historical Stream Data; Declarative Network Monitoring with an Underprovisioned Query Processor; Lifting the Burden of History from Adaptive Query Processing [STAIRs]; Eddies: Continuously Adaptive Query Processing; Using State Modules for Adaptive Query Processing. And others: http://telegraph.cs.berkeley.edu/papers.html. Telegraph team @ UC Berkeley: Mike Franklin, Joe Hellerstein, Bryan Fulton, Sirish Chandrasekaran, Amol Deshpande, Ryan Huebsch, Edwin Mach, Garrett Jacobson, Sailesh Krishnamurthy, Boon Thau Loo, Nick Lanham, Sam Madden, Fred Reiss, Mehul Shah, Eric Shan, Kyle Stanek, Owen Cooper, David Culler, Lisa Hellerstein, Wei Hong, Scott Shenker, Torsten Suel, Ion Stoica, Doug Tygar, Hal Varian, Ron Avnur, David Yu Chen, Mohan Lakhamraju, Vijayshankar Raman. Lottery Scheduling: Flexible Proportional-Share Resource Management, Carl A. Waldspurger & William E. Weihl @ MIT
Editor's notes
Open-source DBMS, PostgreSQL, taken as the starting point for implementing TelegraphCQ. Developed at Berkeley University. Written in C/C++. OpenBSD license. Based on the PostgreSQL code. Current version: 2.1 on PostgreSQL 7.3.2. Project closed in 2006. Important points of interest and features. Software: http://telegraph.cs.berkeley.edu. Papers: http://db.cs.berkeley.edu/telegraph. Commercial spinoff Truviso.
Non-blocking versions of the classical operators (sel, proj). Eddies: decide the routing tuple by tuple. Flux: routes tuples among the machines of a cluster to support parallelism, load balancing, and fault tolerance.
Sources must identify themselves before sending data. Wrapper: user-defined functions describing how the data must be processed, inside the Wrapper ClearingHouse process. Push sources: initiate a connection to TelegraphCQ. Pull sources: the wrapper initiates the connection; a pull wrapper might, for example, connect to a mail server and check the mail every minute. Data from different wrappers can merge into the same stream. Heartbeat: a punctuated tuple without data, only a timestamp. When a punctuated tuple arrives, no earlier data will follow.
The arriving data may be only fragments of tuples, so buffering is necessary; a call is not guaranteed to yield a tuple. WrapperState lets the user functions communicate with the WCH. If fewer fields arrive, the missing ones default to NULL; if too many arrive, they are truncated.
A kind of temporary data repository. A half-join operator that stores homogeneous tuples. Its state is independent of previous routing decisions (since it does not store intermediate tuples). It supports the operations: insertion (build), lookup (probe), deletion (eviction), useful for windows. Similar to MJoins but more adaptive. Similar to the simple routing policy for selection-only queries. State sharing among other continuous queries. Does not store intermediate results. Increased computation cost.
The opportunity to improve: optimizers pick a single plan for a query; however, different subsets of the data may have very different statistical properties, so it may be more efficient to use different plans for different subsets of the data.