Incremental Recomputation: Those who cannot remember the past are condemned to recompute it

Bertram Ludäscher
ludaesch@illinois.edu
Workshop: Incremental Re-computation:
Provenance and beyond
IRPb@ProvenanceWeek
2018-07-12..13
Director, Center for Informatics Research in Science & Scholarship (CIRSS)
School of Information Sciences (iSchool@Illinois)
& National Center for Supercomputing Applications (NCSA)
& Department of Computer Science (CS@Illinois)

All-in-One (Teaser & Summary)
• Incremental (re-)comp ~ Deltas ~ Derivatives
• Database Queries
– Datalog evaluation: Naïve → Seminaïve → Magic Sets
– Incremental View Maintenance
• … using Provenance Semirings!
• (Scientific) Workflows (Dataflow Programming)
– Make
– … SDF (Synchronous Dataflow)
– ... PN (Process Networks) ... COMAD
• Bottom Line:
– MoP = MoC +/- ∆
– T = R – I + M

Computing with Deltas …
• Derivatives in Calculus (Product Rule)
(F * G)’ = F * G’ + F’ * G
• Delta Computations in Datalog:
– R(…) :- P(…), Q(…)
– Naive Eval:
• Bottom-up fixpoint
– Seminaive Eval:
∆ R(...) :- P(..), ∆ Q(...)
∆ R(...) :- ∆ P(..), Q(...)
– ... and more (Magic Sets, ... )
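A minimal sketch of the delta idea, assuming the transitive-closure program tc(X,Y) :- e(X,Y). tc(X,Y) :- e(X,Z), tc(Z,Y). as the running example. The function names and the `joins` counter are illustrative only. Because e does not change during the fixpoint, only the rule with ∆tc is needed here.

```python
def naive_tc(edges):
    """Naive bottom-up fixpoint: re-join the full tc relation in every round."""
    tc, joins = set(edges), 0
    while True:
        derived = set()
        for (x, z) in edges:
            for (z2, y) in tc:
                joins += 1                 # count every candidate pair inspected
                if z == z2:
                    derived.add((x, y))
        if derived <= tc:                  # nothing new: fixpoint reached
            return tc, joins
        tc |= derived


def seminaive_tc(edges):
    """Seminaive: join e only with the tc facts that were new in the last round."""
    tc, delta, joins = set(edges), set(edges), 0
    while delta:
        derived = set()
        for (x, z) in edges:
            for (z2, y) in delta:          # delta tc instead of all of tc
                joins += 1
                if z == z2:
                    derived.add((x, y))
        delta = derived - tc
        tc |= delta
    return tc, joins


chain = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
full, naive_work = naive_tc(chain)
same, semi_work = seminaive_tc(chain)
print(full == same, naive_work, semi_work)
```

Both functions return the same relation; the seminaive version inspects fewer candidate pairs because each round only looks at what changed in the previous one.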

Incremental View Maintenance (DRed)
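As a hedged illustration of delete-and-rederive (DRed) style maintenance, the sketch below maintains a materialized transitive-closure view when base edges are deleted. The view definition, the function names, and the tuple encoding are assumptions made for this example, not code from the talk.

```python
def transitive_closure(edges):
    """Materialize tc(X,Y) :- e(X,Y).  tc(X,Y) :- e(X,Z), tc(Z,Y)."""
    tc, delta = set(edges), set(edges)
    while delta:
        delta = {(x, y) for (x, z) in edges for (z2, y) in delta if z == z2} - tc
        tc |= delta
    return tc


def dred_delete(edges, view, removed):
    """Maintain `view` (the tc of `edges`) after deleting the edges in `removed`."""
    remaining = edges - removed
    # Phase 1 (over-delete): collect every tc fact that has at least one
    # derivation using a deleted edge, directly or via another such fact.
    over = set(removed)
    over |= {(x, y) for (x, z) in removed for (z2, y) in view if z == z2}
    frontier = set(over)
    while frontier:
        frontier = {(x, y) for (x, z) in edges
                    for (z2, y) in frontier if z == z2} - over
        over |= frontier
    kept = view - over
    # Phase 2 (rederive): restore over-deleted facts that still have a
    # derivation from the remaining edges and the surviving tc facts.
    changed = True
    while changed:
        changed = False
        for (x, y) in over - kept:
            if (x, y) in remaining or any(
                    (z, y) in kept for (a, z) in remaining if a == x):
                kept.add((x, y))
                changed = True
    return remaining, kept


edges = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}
view = transitive_closure(edges)
remaining, new_view = dred_delete(edges, view, {("a", "b")})
print(new_view == transitive_closure(remaining))
```

Only the facts touched by the deletion are revisited; tc(a,c) and tc(a,d) are over-deleted first and then rederived via the surviving edge e(a,c).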


A language for declaring updates & change:
Statelog: Datalog + States

Using Provenance for Profiling:
Comparing (Abstract) Execution Traces
• Number of Facts:
DerivedFact(H) :-
g(_,_,out,H).
DerivedHeadCount(C) :-
C = count{
H : DerivedFact(H)
}.
• Number of Firings:
Firing(F) :- g(_,F,out,_).
FiringCount(C) :-
C = count{F : Firing(F)}.
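The two counts can be mimicked outside Datalog. The sketch below assumes two hypothetical variants of the tc program (right-recursive and doubly recursive) and counts derived facts and distinct rule firings for each; the function name and the tuple encoding of firings are illustrative.

```python
def profile(edges, doubly_recursive=False):
    """Return (number of derived tc facts, number of distinct rule firings)."""
    tc, firings = set(), set()
    while True:
        # r1: tc(X,Y) :- e(X,Y).
        new_firings = {("r1", x, y) for (x, y) in edges}
        # r2: tc(X,Y) :- e(X,Z), tc(Z,Y).   (right-recursive variant)
        #     tc(X,Y) :- tc(X,Z), tc(Z,Y).  (doubly recursive variant)
        left = tc if doubly_recursive else edges
        new_firings |= {("r2", x, z, y)
                        for (x, z) in left for (z2, y) in tc if z == z2}
        new_facts = {(f[1], f[-1]) for f in new_firings}
        if new_firings <= firings and new_facts <= tc:
            return len(tc), len(firings)
        firings |= new_firings
        tc |= new_facts


chain = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
print(profile(chain))                         # → (10, 10)
print(profile(chain, doubly_recursive=True))  # → (10, 14)
```

Both variants derive the same ten facts, but the doubly recursive one fires more often, which is the kind of difference a provenance-based profile makes visible.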
[Figure: provenance graphs for two evaluations of tc over the chain e(a,b), e(b,c), e(c,d), e(d,e); each firing and each tc fact is annotated with the round in which it is derived. In one trace tc(a,e) appears in round [4], in the other in round [3].]

Step 1: Capturing Rule Firings (“F-trick”)
• Capture rule firings and keep “witness info” (existential variables)
– no premature projections in the rule head please!
• Example. Instead of a given rule …
tc(X,Y) :- e(X,Z), tc(Z,Y).
… we rather use these two rules, keeping witnesses Z around:
fire2(X,Z,Y) :- e(X,Z), tc(Z,Y).
tc(X,Y) :- fire2(X,Z,Y).
Example rule firings
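A small sketch of the F-trick in Python, assuming a base rule tc(X,Y) :- e(X,Y) alongside the recursive rule above. The relation fire2 keeps the witness Z; tc is obtained from it by projection. Function and variable names are illustrative.

```python
def eval_with_firings(edges):
    """Evaluate tc through fire2(X,Z,Y) so that the witness Z is not projected away."""
    tc = set(edges)        # assumed base rule: tc(X,Y) :- e(X,Y).
    fire2 = set()
    while True:
        # fire2(X,Z,Y) :- e(X,Z), tc(Z,Y).
        new_fire2 = {(x, z, y) for (x, z) in edges for (z2, y) in tc if z == z2}
        # tc(X,Y) :- fire2(X,Z,Y).
        new_tc = {(x, y) for (x, _, y) in new_fire2}
        if new_fire2 <= fire2 and new_tc <= tc:
            return tc, fire2
        fire2 |= new_fire2
        tc |= new_tc


tc, fire2 = eval_with_firings({("a", "b"), ("b", "c"), ("c", "d")})
print(sorted(fire2))  # → [('a', 'b', 'c'), ('a', 'b', 'd'), ('b', 'c', 'd')]
```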

Step 2: Graph Transformation (“G-trick”)
• Reify provenance atoms & firings in a labeled graph g/3
• Example for N = 2 subgoals and 1 head atom …
fire2(X,Z,Y) :- e(X,Z), tc(Z,Y). % two in-edges
tc(X,Y) :- fire2(X,Z,Y). % one out-edge
… generates N+1 “reification rules” (Skolems are safe):
g( e(X,Z), in, ffire2(X,Z,Y) ) :- fire2(X,Z,Y).
g( tc(Z,Y), in, ffire2(X,Z,Y) ) :- fire2(X,Z,Y).
g( ffire2(X,Z,Y), out, tc(X,Y) ) :- fire2(X,Z,Y).
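The three reification rules above can be sketched as a function from firings to labeled edges. The nested-tuple encoding of atoms and of the Skolem term for a firing is an assumption of this example.

```python
def reify(fire2):
    """Turn fire2(X,Z,Y) tuples into labeled edges g(source, label, target)."""
    g = set()
    for (x, z, y) in fire2:
        firing = ("fire2", x, z, y)              # Skolem-style node for the firing
        g.add((("e", x, z), "in", firing))       # first subgoal feeds the firing
        g.add((("tc", z, y), "in", firing))      # second subgoal feeds the firing
        g.add((firing, "out", ("tc", x, y)))     # the firing produces the head atom
    return g


graph = reify({("a", "b", "d")})
for edge in sorted(graph, key=repr):
    print(edge)
```

For a rule with N subgoals this yields N in-edges and one out-edge per firing, matching the N+1 reification rules.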
Example instance generated by these rules:
e(a,b) –in→ fire2(a,b,d), tc(b,d) –in→ fire2(a,b,d), fire2(a,b,d) –out→ tc(a,d)

Step 3: Using Statelog (“S-Trick”)
• Use Statelog to keep record of firing rounds:
– Add state (=stage) argument to provenance rules and graph relations
– EDB facts are derived in state 0.
– Subsequently: extract earliest round for firings and IDB facts
• Example:
rin : fr(S1, X) :- B1(S, X1), … , Bn(S, Xn), next(S, S1).
rout : H(S, Y) :- fr(S, X).
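[Figure: state-annotated provenance graph for the EDB facts e(a,b), e(b,c), e(c,b): firings of r1 and r2 and the facts tc(a,b) [1], tc(b,b) [2], tc(c,b) [1], each labeled with its firing round]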

[r1] tc(X,Y) :- e(X,Y)
[r2] tc(X,Y) :- e(X,Z), tc(Z,Y)
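A sketch of the staging idea for r1 and r2, under the assumption that facts persist from one state to the next: EDB facts live in state 0, a firing happens in the state after its body holds, and the head is derived in the same state as the firing. Only the earliest round of each firing and fact is kept. The encoding and names are illustrative.

```python
def earliest_rounds(edges):
    """Return ({fact: round}, {firing: round}) for the rules r1 and r2."""
    fact_round = {("e", x, y): 0 for (x, y) in edges}    # EDB facts: state 0
    firing_round = {}
    state = 0
    while True:
        state += 1
        known = list(fact_round)              # facts true in the previous state
        firings = {}
        for (p, x, z) in known:
            if p != "e":
                continue
            firings[("r1", x, z)] = ("tc", x, z)              # [r1]
            for (q, z2, y) in known:
                if q == "tc" and z2 == z:
                    firings[("r2", x, z, y)] = ("tc", x, y)   # [r2]
        new = {f: h for f, h in firings.items() if f not in firing_round}
        if not new:
            return fact_round, firing_round
        for firing, head in new.items():
            firing_round[firing] = state          # earliest round of the firing
            fact_round.setdefault(head, state)    # earliest round of the fact


facts, firings = earliest_rounds({("a", "b"), ("b", "c"), ("c", "b")})
print(facts[("tc", "a", "b")], facts[("tc", "b", "b")], firings[("r2", "a", "b", "b")])
# → 1 2 3
```

The fact tc(a,b) is first derived by r1 in round 1 and again by r2 in round 3 (via tc(b,b)); keeping the earliest round separates the first derivation from later re-derivations.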

… from Queries/Datalog to ...
Workflows

Application Example: Protein 3D Structure
(1) Resonance Assignments
(a) Sequential
(b) Side-Chain
(2) Identification of Secondary Structural Elements
(a) Based on Chemical Shift
(b) Based on NOE Patterns
(3) Determine Distance Constraints
(a) From 2D/3D NOESY Spectra
(b) Calibrate Distance from Vol
(4) Determine Torsion Angle Constraints (φ, ψ, χ)
(a) Based on Chemical Shift
(b) Based on J-couplings
(5) Structure Determination
(a) Torsion Angle Dynamics
(b) Simulated Annealing
→ High Resolution Structure (iterative)
Michael Gryk: We cannot assign all of the resonances in part (1), or all of the NOESY
peaks in part (3) before doing step (5). So we run (5) with incomplete information and get
a preliminary answer. This helps rectify ambiguities in steps 1-4 and we fix that data and
run again. And again. And again. It literally can take dozens of attempts before we get
a high-resolution structure.
→ Question of both efficiency and (months or years later) reproducibility

A simpler example …
• Some inputs and/or
params of the workflow
change
→ “smart re-run”
• Similar to executing Make
• … on a DAG
– … e.g. via Datalog to compute
subworkflow to be re-executed
(“rescue-DAG”)
• So much winning! But ...
https://openprovenance.org/provenance-challenge/WebHome.html
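A hedged sketch of the “smart re-run” idea on a DAG: given which inputs or parameters changed, compute the downstream steps that must be re-executed, in a valid execution order. The step names and the `deps` mapping are hypothetical; `graphlib.TopologicalSorter` is part of the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def rerun_plan(deps, changed):
    """deps maps each step to the inputs/steps it reads; returns the steps to re-run."""
    # Invert the dependency map: who consumes each node?
    consumers = {}
    for step, preds in deps.items():
        for pred in preds:
            consumers.setdefault(pred, set()).add(step)
    # Everything reachable downstream of a changed node is stale.
    stale, stack = set(), list(changed)
    while stack:
        node = stack.pop()
        for step in consumers.get(node, ()):
            if step not in stale:
                stale.add(step)
                stack.append(step)
    # Keep only the stale steps, in topological order.
    return [n for n in TopologicalSorter(deps).static_order() if n in stale]


# Hypothetical workflow: names are for illustration only.
deps = {
    "align": {"image", "params"},
    "average": {"align"},
    "render": {"average"},
    "checksum": {"image"},
}
print(rerun_plan(deps, {"params"}))  # → ['align', 'average', 'render']
```

The step `checksum` is untouched by a parameter change and is skipped, which is the saving a Make-style re-run provides.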

What is the granularity of steps?
What looks like one step … (in “Workflow-land”)
… may be many invocations (in “Trace-land”)

Models of Computation (MoCs) can be much
more complex (streaming, state, collections, ...)

From Models of Computation to Models of Provenance
M. Anand, S. Bowers,
et al., SSDBM’09

Fine-grained, Data & MoC-aware MoP
M. Anand, S. Bowers,
et al., SSDBM’09

When workflows crash…
Use incremental recomputation
… to recover and avoid starting from zero …

“Fault Tolerance through Provenance-based Recovery” (7/20/2011)
Example: Checkpoint in SDF
• Workflow with a mix of stateful and
stateless actors.
Corresponding schedule of the workflow
with a fault during invocation B:2

Prototype Implementation in Kepler
• Upon recovery request:
– SDF director calls the recovery engine
• Recovery:
– Restore the internal state of actors
– Replay successful invocations using input tokens from
provenance
– Restore content of all queues
– Repeat faulty invocations
– Return to SDF director with information about where to
resume
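The recovery steps above can be sketched for a single stateful actor. This is not the Kepler implementation: the actor class, the checkpoint format (invocation count, saved state), and the list of recorded input tokens are all assumptions made to keep the example self-contained.

```python
class RunningSum:
    """A stateful actor: each firing emits the running sum of its inputs."""

    def __init__(self):
        self.total = 0

    def fire(self, token):
        self.total += token
        return self.total

    def set_state(self, state):
        self.total = state


def recover(make_actor, checkpoint, recorded_inputs):
    """Restore the checkpointed state, then replay invocations recorded after it."""
    invocation, state = checkpoint
    actor = make_actor()
    actor.set_state(state)                     # restore internal state
    replayed = [actor.fire(tok) for tok in recorded_inputs[invocation:]]
    return actor, replayed


# Failure-free reference run over four tokens.
reference = RunningSum()
expected = [reference.fire(t) for t in [1, 2, 3, 4]]

# Crashed run: a checkpoint was taken after 2 invocations (state 3), the third
# invocation succeeded and its input is in provenance, the fourth one crashed.
actor, replayed = recover(RunningSum, checkpoint=(2, 3), recorded_inputs=[1, 2, 3])
resumed = actor.fire(4)                        # repeat the faulty invocation
print(replayed, resumed, expected)             # → [6] 10 [1, 3, 6, 10]
```

Only the invocation after the checkpoint is replayed from recorded inputs; the first two are never re-executed.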

UC Davis: S. Koehler, T. McPhillips, S. Riddle, D. Zinn, B. Ludaescher (7/20/2011)
Execution with Failure
• Execution of the
previous workflow
• Checkpoints for
actor B and D but
not for C
• At invocation B:2 -
Crash
• Tokens t4 and t7 -
in queue
• Token t9 - to be
restored
• Token t10 - to be
deleted

Stages of Checkpoint Recovery

Provenance Recording Overhead
[Chart: recording overhead compared for three settings: without provenance, standard provenance, extended provenance (worst-case scenario)]
If you already capture provenance … you might as well do it right :-)

Fault Tolerance Solutions Compared

When workflows are slow …
Use provenance
… to understand what’s going on…

Hamming Numbers in a Dataflow Network
Compute Hamming numbers H in order, where
H = 2^i · 3^j · 5^k, where i, j, k ≥ 0
a.k.a. regular numbers or 5-smooth numbers (numbers whose prime factors are <= 5).
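As a rough Python analogue of such a network, the generator below keeps one queue per multiplier (×2, ×3, ×5) and merges the queue heads in order, dropping duplicates. It is a sketch of the general idea, not a transcription of the actors and queues in the figure.

```python
from collections import deque
from itertools import islice, takewhile


def hamming():
    """Yield the Hamming numbers in increasing order."""
    queues = {2: deque(), 3: deque(), 5: deque()}   # one queue per multiplier
    h = 1
    while True:
        yield h
        for factor, queue in queues.items():
            queue.append(factor * h)                # feed h back, scaled
        h = min(queue[0] for queue in queues.values())   # ordered merge
        for queue in queues.values():
            if queue[0] == h:                       # drop duplicates such as 6 = 2*3 = 3*2
                queue.popleft()


print(list(islice(hamming(), 10)))   # → [1, 2, 3, 4, 5, 6, 8, 9, 10, 12]
up_to_100 = list(takewhile(lambda n: n <= 100, hamming()))
print(len(up_to_100))                # → 34
```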
[Figure: dataflow network with actors X2, X3, X5, S2, S3, S5, M1, M2 and queues Q1 to Q8]

1-Loop Hamming Workflow in Datalog
#maxint=100.
h(1).
h(Y) :- h(X), Y = 2*X.
h(Y) :- h(X), Y = 3*X.
h(Y) :- h(X), Y = 5*X.
Output:
{h(1), h(2), h(3), h(4), h(5), h(6),
h(8), h(9), h(10), h(12), h(15),
h(16), h(18), h(20), h(24), h(25),
h(27), h(30), h(32), h(36), h(40),
h(45), h(48), h(50), h(54), h(60),
h(64), h(72), h(75), h(80), h(81),
h(90), h(96), h(100)}

3-Loop Hamming Workflow in Datalog
#maxint=100.
h2(1).
h2(Y) :- h2(X), Y = 2*X.
h23(X) :- h2(X).
h23(Y) :- h23(X), Y = 3*X.
h(X) :- h23(X).
h(Y) :- h(X), Y = 5*X.
Output:
{h2(1), h2(2), h2(4), h2(8), h2(16), h2(32), h2(64), h23(1), h23(2), h23(3),
h23(4), h23(6), h23(8), h23(9), h23(12), h23(16), h23(18), h23(24), h23(27),
h23(32), h23(36), h23(48), h23(54), h23(64), h23(72), h23(81), h23(96),
h(1), h(2), h(3), h(4), h(5), h(6), h(8), h(9), h(10), h(12), h(15), h(16), h(18),
h(20), h(24), h(25), h(27), h(30), h(32), h(36), h(40), h(45), h(48), h(50),
h(54), h(60), h(64), h(72), h(75), h(80), h(81), h(90), h(96), h(100)}
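To compare the two Datalog variants, the sketch below evaluates both in Python and also records every multiplication firing (factor, X, Y). Function names are illustrative, and the copy rules of the 3-loop variant are treated as seeds rather than counted as firings.

```python
def closure(seeds, factor, limit):
    """Facts and firings of p(Y) :- p(X), Y = factor*X, starting from seeds."""
    facts, firings, frontier = set(seeds), set(), set(seeds)
    while frontier:
        step = {(factor, x, factor * x) for x in frontier if factor * x <= limit}
        firings |= step
        frontier = {y for (_, _, y) in step} - facts
        facts |= frontier
    return facts, firings


def one_loop(limit=100):
    """h(1). h(Y) :- h(X), Y = f*X for f in 2, 3, 5 (a single recursive predicate)."""
    facts, firings, frontier = {1}, set(), {1}
    while frontier:
        step = {(f, x, f * x) for x in frontier for f in (2, 3, 5) if f * x <= limit}
        firings |= step
        frontier = {y for (_, _, y) in step} - facts
        facts |= frontier
    return facts, firings


def three_loop(limit=100):
    """h2, then h23 seeded from h2, then h seeded from h23."""
    h2, f2 = closure({1}, 2, limit)
    h23, f3 = closure(h2, 3, limit)
    h, f5 = closure(h23, 5, limit)
    return h, f2 | f3 | f5


h_one, firings_one = one_loop()
h_three, firings_three = three_loop()
print(h_one == h_three, len(firings_one), len(firings_three))
```

Both variants compute the same h, but in the 1-loop variant a number such as 6 is derived twice (2·3 and 3·2), whereas in the 3-loop variant each number has exactly one multiplication firing. This difference in derivations is what the two provenance graphs show.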

Hamming Workflow Provenance
[Figure: provenance graphs of the Hamming numbers (values up to 1000) for the two workflows: 1-Loop variant (“Fish”) and 3-Loop variant (“Sail”)]

Computational / Workflow Thinking:
The limits of my language are the limits of my world …
• Vanilla Process Network
• Functional Programming
Dataflow Network
• XML Transformation
Network
• Collection-oriented
Modeling & Design
framework (COMAD)
– “Look Ma: No Shims!”

A Jupyter (& Python) MoC
https://dashboard.wholetale.org

[Figure: fine-grained execution trace of the Python script simulate_data_collection, showing line-numbered calls and variables (e.g. parse_args, spreadsheet_rows, calculate_strategy, collect_next_image, transform_image, writer.writerow) and the files cassette_q55_spreadsheet.csv, calibration.img, run/run_log.txt, run/rejected_samples.txt, run/raw/q55/DRT240/e10000/image_001.raw, run/data/DRT240/DRT240_10000eV_001.img, run/collected_images.csv]
NW: Python Model of Computation (MoC)
NW+YW: Workflow Model of Computation (MoC)

From MoC to MoP via Observables
• Model of Computation MoC
– specification/algorithm to compute Outputs = MoC(Wf,Params,Inputs)
– a director or scheduler implements MoC
– gives rise to formal notions of
• computation (aka run) R
– Formalisms to define M?
• Model of Provenance MoP
– associate with a MoC a “default” MoP (= MoC ± Δ)
– the MoP is a “trimmed” MoC
• T = R – I + M
– Trace = Run – Ignored-observables + Modeled-observables
• Observables (of a MoC / MoP)
– functional observables (may influence output o)
• token rate, notions of firing, …
– non-functional observables (not part of M, do not influence o)
• token timestamp, size, … (unless the MoC cares about those)
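One way to read T = R – I + M operationally, as a sketch under strong simplifying assumptions: a run is a list of event records, ignored observables are keys to drop, and modeled observables are derived fields to add. The event fields and the `firing` label are hypothetical.

```python
def provenance_trace(run, ignored, modeled):
    """T = R - I + M over a run given as a list of event dicts."""
    trace = []
    for index, event in enumerate(run):
        record = {k: v for k, v in event.items() if k not in ignored}   # R - I
        for name, derive in modeled.items():                            # + M
            record[name] = derive(index, event)
        trace.append(record)
    return trace


run = [
    {"actor": "B", "token": "t1", "timestamp": 17.2, "size": 512},
    {"actor": "B", "token": "t2", "timestamp": 17.9, "size": 640},
]
trace = provenance_trace(
    run,
    ignored={"timestamp", "size"},                            # non-functional observables
    modeled={"firing": lambda i, e: f"{e['actor']}:{i + 1}"},  # a modeled notion of firing
)
print(trace[1])  # → {'actor': 'B', 'token': 't2', 'firing': 'B:2'}
```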

All-in-One (Summary)
• Provenance & Incremental Recomputation
– What You See (Think/Model) Is What You Get!
– WYTIWYG (“witty-wig”)
• These assembly language instructions
• … implementing these VM Instructions
• … in this programming language
• ... implementing an algorithm
• ... that schedules a workflow
• ... that applies this bioinformatics method
• … to test this scientific hypothesis ....
→ Need to capture provenance at the “right level”
– … for efficiency
– ... for transparency & understanding
• Bottom Line: MoP = MoC +/- ∆
– T = R – I + M
– Provenance Trace (MoP thing) = Run (MoC thing) – “nah..” + “yeah!”

Teaser (for Vasa …)
Incremental computation in
… Games
… aka Argumentation Frameworks

Argumentation Frameworks
& Game Provenance
[Figure: game graph with positions a, b, c, d, e, f, g, h, k, l, m, n; node labels are 1, 2, 3, or ∞]
• Query evaluation and logic-
based argumentation can be
understood as a game!
• One logic rule to rule them all …
win(X) :- move(X,Y), not win(Y)
• node color => edge color
– good vs bad moves
• good moves = natural, new
notion of provenance!
• Implement, e.g. using Answer
Set Programming
Aside: Games ~ Argumentation Frameworks
win(X) :- move(X,Y), not win(Y)
def(X) :- attacks(Y,X), not def(Y)
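The win rule can be evaluated directly by backward induction. The sketch below labels each position as won, lost, or drawn; it is a plain Python stand-in for the three-valued reading of the rule, not an Answer Set Programming encoding, and the example graph is made up.

```python
def solve_game(moves):
    """moves maps a position to its successor positions; returns (won, lost, drawn)."""
    positions = set(moves) | {y for ys in moves.values() for y in ys}
    won, lost = set(), set()
    changed = True
    while changed:
        changed = False
        for x in positions - won - lost:
            successors = moves.get(x, ())
            if any(y in lost for y in successors):
                won.add(x)        # win(X) :- move(X,Y), not win(Y).
                changed = True
            elif all(y in won for y in successors):
                lost.add(x)       # every move (possibly none) leads to a won position
                changed = True
    drawn = positions - won - lost    # neither provably won nor lost
    return won, lost, drawn


moves = {"a": ["b"], "b": ["c"], "c": [], "d": ["e"], "e": ["d"]}
won, lost, drawn = solve_game(moves)
print(sorted(won), sorted(lost), sorted(drawn))  # → ['b'] ['a', 'c'] ['d', 'e']
```

Position c has no moves and is lost, so b is won and a is lost; d and e can only move to each other and are drawn.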

Game Provenance
Move types (value of source position → value of target position):
– W → W: bad
– W → D: bad
– W → L: winning
– D → W: bad
– D → D: drawing
– D → L: n/a
– L → W: delaying
– L → D: n/a
– L → L: n/a
Extracting Provenance:
✓ Why/how win(x)?
• [x] –G.(R.G)*–> [y]
✓ Why-not win(x)?
• [x] –(R.G)*–> [y]
• [x] –(Y+)–> [y]
