Provenance capture relies on instrumenting processes (e.g. with probes or extensive logging). The more instrumentation we add, the richer our provenance traces can be: for example, comprehensive descriptions of the steps performed, mappings to higher levels of abstraction through ontologies, or a distinction between automated and user actions. However, instrumentation has a cost in capture time and overhead, and it can be difficult to determine upfront what should be instrumented. In this talk, I'll discuss our research on using record-replay technology within virtual machines to add provenance instrumentation incrementally by replaying computations after the fact.
Progressive Provenance Capture Through Re-computation
1. Progressive Provenance Capture Through Re-computation
Paul Groth
Elsevier Labs
@pgroth | pgroth.com
Joint work with Manolis Stamatogiannakis and Herbert Bos (Vrije Universiteit Amsterdam)
Incremental Re-computation Workshop - Provenance Week 2018
2. What to capture?
Simon Miles, Paul Groth, Steve Munroe, Luc Moreau.
PrIMe: A methodology for developing provenance-aware applications.
ACM Transactions on Software Engineering and Methodology, 20(3), 2011.
3. Provenance is Post-Hoc
• What if we missed something?
• Disclosed provenance systems:
– Re-apply the methodology (e.g. PrIMe), produce a new
application version.
– Time consuming.
• Observed provenance systems:
– Update the applied instrumentation.
– Instrumentation becomes progressively more intense.
4. Provenance is Post-Hoc
Aim: Eliminate the need for developers to know
what provenance needs to be captured.
5. Re-execution
• Common tactic in disclosed provenance:
– DB: Reenactment queries (Glavic ‘14)
– DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13),
DistTape (Zhao ‘12)
– Workflows: Pegasus (Groth ‘09)
– PL: Slicing (Perera ‘12)
– Desktop: Excel (Asuncion ‘11)
• Can we extend this idea to observed
provenance systems?
8. Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis. (Dolan-Gavitt ‘14)
• Based on the QEMU virtualization platform.
9. Prototype Implementation (2/3)
• PANDA logs self-contained execution traces.
– An initial RAM snapshot.
– Non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
– Virtual device state is not logged, so we can’t “go live”.
[Figure: PANDA execution trace, consisting of an initial RAM snapshot and a non-determinism log of the input, interrupt, and DMA events delivered to the CPU and RAM.]
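The record and replay idea on this slide can be illustrated with a toy model. This is only a sketch and does not use PANDA: the `step`, `record`, and `replay` functions and the arithmetic inside `step` are hypothetical stand-ins. The point is that a deterministic machine plus an initial snapshot and a log of non-deterministic inputs is enough to reproduce every intermediate state.

```python
import random


def step(state, value):
    # Deterministic transition: the same state and input always give the same result.
    return (state * 31 + value) % 1_000_003


def record(initial_state, n_steps, read_input):
    """Run 'live', logging the initial snapshot and every non-deterministic input."""
    nd_log, states = [], []
    state = initial_state
    for _ in range(n_steps):
        value = read_input()       # non-deterministic: stands in for device input
        nd_log.append(value)       # only the input is logged, not the computation
        state = step(state, value)
        states.append(state)
    return {"snapshot": initial_state, "nd_log": nd_log}, states


def replay(recording):
    """Re-run the computation from the snapshot, feeding inputs from the log."""
    state = recording["snapshot"]
    states = []
    for value in recording["nd_log"]:
        state = step(state, value)
        states.append(state)
    return states


rng = random.Random()
recording, live_states = record(42, 5, lambda: rng.randrange(256))
replayed_states = replay(recording)
```

Because the replay never calls `read_input`, it cannot continue past the end of the log, which mirrors the "can't go live" limitation noted above.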
10. Prototype Implementation (3/3)
• Analysis plugins
– Read-only access to the VM state.
– Invoked per instruction, memory access, context switch, etc.
– Can be combined to implement complex functionality.
– OSI Linux, PROV-Tracer, ProcStrMatch, Taint tracking
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL.
[Figure: PANDA replays the execution trace (CPU, RAM); plugins A, B, and C analyze the replay and write to the triple store.]
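The plugin model described on this slide can be sketched as a set of callbacks driven by a replay loop. The classes, event kinds, and field names below are hypothetical and are not PANDA's plugin API; the sketch only shows how independent plugins can observe the same replay without modifying it, and how their results can be combined afterwards.

```python
from collections import Counter


class EventCounter:
    """Hypothetical plugin: counts how often each kind of event occurs."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["kind"]] += 1


class FileUseTracker:
    """Hypothetical plugin: remembers which files each process opened."""

    def __init__(self):
        self.files_by_process = {}

    def on_event(self, event):
        if event["kind"] == "file_open":
            files = self.files_by_process.setdefault(event["process"], set())
            files.add(event["path"])


def replay_with_plugins(trace, plugins):
    """Feed every event of the trace to every plugin."""
    for event in trace:
        for plugin in plugins:
            plugin.on_event(dict(event))  # pass a copy so plugins cannot alter the trace


# Made-up trace for illustration only.
trace = [
    {"kind": "context_switch", "process": "wget"},
    {"kind": "file_open", "process": "wget", "path": "/tmp/index.html"},
    {"kind": "context_switch", "process": "vim"},
    {"kind": "file_open", "process": "vim", "path": "/tmp/index.html"},
]

counter, tracker = EventCounter(), FileUseTracker()
replay_with_plugins(trace, [counter, tracker])
```

Passing a copy of each event is one simple way to model the read-only access mentioned above.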
[Figure: PROV core model. Types: Entity, Activity, Agent. Relations: used, wasGeneratedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf, wasInformedBy. Activity properties: startedAtTime, endedAtTime (xsd:dateTime).]
11. OS Introspection
• What processes are currently executing?
• Which libraries are used?
• What files are used?
• Possible approaches:
– Execute code inside the guest-OS.
– Reproduce guest-OS semantics purely from the
hardware state (RAM/registers).
12. An example
(1) Alice downloads the front page of example.org.
(2) Alice edits the document and fixes a link that points to the wrong page.
(3) Alice re-uploads the HTML document and the image.
(4) Bob downloads the front page of example.org.
(5) Bob removes a paragraph of text.
(6) Bob re-uploads the HTML document.
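To make the scenario concrete, here is a minimal sketch of how it might look as PROV-style triples, with a small pattern matcher standing in for a SPARQL query. The `ex:` identifiers and the helper functions are assumptions made for this illustration; only the `prov:` relation names come from the PROV model.

```python
# Hypothetical identifiers: ex:index_v0 is the page Alice downloaded,
# ex:index_v1 is Alice's upload, ex:index_v2 is Bob's upload.
triples = {
    ("ex:index_v1", "prov:wasDerivedFrom", "ex:index_v0"),
    ("ex:index_v1", "prov:wasAttributedTo", "ex:alice"),
    ("ex:index_v2", "prov:wasDerivedFrom", "ex:index_v1"),
    ("ex:index_v2", "prov:wasAttributedTo", "ex:bob"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    pattern = (s, p, o)
    return sorted(
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    )


def derivation_chain(triples, entity):
    """Follow prov:wasDerivedFrom links back from an entity."""
    chain, seen, current = [], {entity}, entity
    while True:
        found = match(triples, s=current, p="prov:wasDerivedFrom")
        if not found or found[0][2] in seen:  # stop at the root or on a cycle
            return chain
        current = found[0][2]
        seen.add(current)
        chain.append(current)


def contributors(triples, entity):
    """Agents attributed to an entity or to any version it derives from."""
    versions = [entity] + derivation_chain(triples, entity)
    return sorted(
        o for v in versions
        for (_, _, o) in match(triples, s=v, p="prov:wasAttributedTo")
    )


chain = derivation_chain(triples, "ex:index_v2")
people = contributors(triples, "ex:index_v2")
```

Asking for the contributors to Bob's upload returns both Alice and Bob, since his version derives from hers.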
14. Thoughts
• Decoupling provenance analysis from execution is
possible by the use of VM record & replay.
• Execution traces can be used for post-hoc
provenance analysis.
• 24/7 execution recording seems possible.
• Can we extend this notion of instrumentation to
other capture systems?
Manolis Stamatogiannakis, Elias Athanasopoulos, Herbert Bos, Paul Groth:
PROV2R: Practical Provenance Analysis of Unstructured Processes. ACM
Transactions on Internet Technology 17(4): 37:1-37:24 (2017)
Editor's Notes
A big problem for systems capturing provenance is deciding what to capture.
For disclosed provenance systems we can apply some methodology to decide what to capture.
The root of the problem is that provenance is post-hoc.
Deciding what to capture in advance will always miss something.
Ideally, we would like to…
Decouple analysis from execution.
This has been proposed for security analysis on mobile phones (Paranoid Android, Portokalidis ‘10).
Execution Capture: happens in real time
Instrumentation: applied on the captured trace to generate provenance information
Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries)
Selection: a subset of the execution trace is selected – we start again with more intensive instrumentation
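The four stages above can be sketched as a small loop over a captured trace. Everything here is a simplified assumption: the trace is a made-up list of file operations, and the "light" and "heavy" instrumentation functions are hypothetical stand-ins for analysis plugins of different cost.

```python
def light_instrumentation(trace):
    """Cheap pass over the whole trace: which process touched which file."""
    return {(e["process"], e["path"]) for e in trace}


def heavy_instrumentation(trace):
    """More detailed pass: keep the operation and byte count of every event."""
    return [(e["process"], e["op"], e["path"], e["nbytes"]) for e in trace]


def select(trace, path):
    """Selection: keep only the part of the trace that concerns one file."""
    return [e for e in trace if e["path"] == path]


# Execution capture: in the real system this is recorded once, in real time.
trace = [
    {"process": "wget", "op": "write", "path": "index.html", "nbytes": 512},
    {"process": "wget", "op": "write", "path": "logo.png", "nbytes": 2048},
    {"process": "editor", "op": "read", "path": "index.html", "nbytes": 512},
    {"process": "editor", "op": "write", "path": "index.html", "nbytes": 498},
]

coarse = light_instrumentation(trace)       # instrumentation, first round
interesting = "index.html"                  # analysis picks something to look into
subset = select(trace, interesting)         # selection
detailed = heavy_instrumentation(subset)    # instrumentation, second round
```

The expensive pass only runs on the selected subset, which is what makes progressively more intensive instrumentation affordable.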
We implemented our methodology using PANDA.
PANDA is based on QEMU.
Input includes both executed instructions and data.
RAM snapshot + ND log are enough to accurately replay the whole execution.
The ND log consists of inputs to the CPU and RAM. Other device state is not logged, so we can replay but we cannot “go live” (i.e. resume execution).
Note: Technically, plugins can modify VM state. However, this will eventually crash the execution, as the trace will be out of sync with the replay state.
Plugins are implemented as dynamic libraries.
We focus on the highlighted plugins in this presentation.
Typical information that can be retrieved through VM introspection.
In general, executing code inside the guest OS is complex.
Moreover, in the case of PANDA we don’t have access to the state of devices. This makes injection and execution of new code even more complex and also more limited.