Progressive Provenance Capture Through Re-computation

Provenance capture relies on instrumenting processes (e.g. with probes or extensive logging). The more instrumentation we add, the richer our provenance traces can be: for example, comprehensive descriptions of the steps performed, mappings to higher levels of abstraction through ontologies, or a distinction between automated and user actions. However, instrumentation has costs in capture time and overhead, and it can be difficult to know upfront what should be instrumented. In this talk, I'll discuss our research on using record-replay technology within virtual machines to add provenance instrumentation incrementally by replaying computations after the fact.

Progressive Provenance Capture Through Re-computation
Paul Groth
Elsevier Labs
@pgroth | pgroth.com
Joint work with Manolis Stamatogiannakis and Herbert Bos (Vrije Universiteit Amsterdam)
Incremental Re-computation Workshop - Provenance Week 2018

What to capture?
Simon Miles, Paul Groth, Steve Munroe, Luc Moreau.
PrIMe: A methodology for developing provenance-aware applications.
ACM Transactions on Software Engineering and Methodology, 20(3), 2011.
2

Provenance is Post-Hoc
• What if we missed something?
• Disclosed provenance systems:
– Re-apply methodology (e.g. PrIMe), produce new
application version.
– Time consuming.
• Observed provenance systems:
– Update the applied instrumentation.
– Instrumentation becomes progressively more intense.
3

Provenance is Post-Hoc
Aim: Eliminate the need for developers to know
what provenance needs to be captured.
4

Re-execution
• Common tactic in disclosed provenance:
– DB: Reenactment queries (Glavic ‘14)
– DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13),
DistTape (Zhao ‘12)
– Workflows: Pegasus (Groth ‘09)
– PL: Slicing (Perera ‘12)
– Desktop: Excel (Asuncion ‘11)
• Can we extend this idea to observed
provenance systems?
5

Methodology
Selection
Provenance analysis
Instrumentation
Execution Capture
7

Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt ‘14).
• Based on the QEMU virtualization platform.
8

Prototype Implementation (2/3)
• PANDA logs self-contained execution traces.
– An initial RAM snapshot.
– Non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
– Virtual device state is not logged → can’t “go-live”.
[Figure: PANDA CPU and RAM with their inputs (Input, Interrupt, DMA); the PANDA Execution Trace holds the Initial RAM Snapshot and the Non-determinism log.]
9

Prototype Implementation (3/3)
• Analysis plugins
– Read-only access to the VM state.
– Invoked per instr., memory access, context switch, etc.
– Can be combined to implement complex functionality.
– OSI Linux, PROV-Tracer, ProcStrMatch, Taint tracking
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL.
[Figure: PANDA Execution Trace; PANDA (CPU, RAM; Plugin A, Plugin B, Plugin C); Triple Store.]
[Figure: PROV data model. Types: Entity, Activity, Agent. Relations: used, wasGeneratedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf, wasInformedBy. startedAtTime and endedAtTime take xsd:dateTime values.]
10
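
To make the "PROV triples, queried with a pattern language" idea concrete, here is a minimal sketch in plain Python. It is not the prototype's store or SPARQL itself: the `match` helper, the `ex:` identifiers, and the sample triples are all assumptions made for illustration. Only the `prov:` relation names come from the slide.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Hypothetical provenance for a download followed by an edit.
triples = [
    ("ex:wget-1", "rdf:type", "prov:Activity"),
    ("ex:index.html", "rdf:type", "prov:Entity"),
    ("ex:index.html", "prov:wasGeneratedBy", "ex:wget-1"),
    ("ex:vim-1", "rdf:type", "prov:Activity"),
    ("ex:vim-1", "prov:used", "ex:index.html"),
    ("ex:index-fixed.html", "prov:wasGeneratedBy", "ex:vim-1"),
    ("ex:vim-1", "prov:wasAssociatedWith", "ex:alice"),
]


def derived_from(triples, entity):
    """Entities generated by any activity that used `entity` (one hop).

    Roughly the join a SPARQL query would express as:
      SELECT ?out WHERE { ?act prov:used <entity> .
                          ?out prov:wasGeneratedBy ?act . }
    """
    result = set()
    for activity, _, _ in match(triples, p="prov:used", o=entity):
        for generated, _, _ in match(triples, p="prov:wasGeneratedBy", o=activity):
            result.add(generated)
    return result


downstream = derived_from(triples, "ex:index.html")
```

In a real deployment the same question would go to an RDF store as a SPARQL query; the sketch only shows the shape of the data and of the join.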

OS Introspection
• What processes are currently executing?
• Which libraries are used?
• What files are used?
• Possible approaches:
– Execute code inside the guest-OS.
– Reproduce guest-OS semantics purely from the
hardware state (RAM/registers).
11
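
The second approach, reconstructing guest-OS semantics from raw hardware state, can be illustrated with a toy example. The sketch below assumes an invented record layout (next pointer, PID, 16-byte name) and an invented head address. It does not reflect the Linux task list layout or any PANDA plugin; it only shows the idea of walking kernel data structures in a RAM image without running code in the guest.

```python
import struct

# Hypothetical process record: next_addr (u32), pid (u32), name (16 bytes).
RECORD = struct.Struct("<II16s")


def write_record(ram, addr, next_addr, pid, name):
    """Helper that builds the toy RAM image; a real guest does this itself."""
    RECORD.pack_into(ram, addr, next_addr, pid, name.encode())


def list_processes(ram, head_addr):
    """Walk the linked list of process records starting at head_addr."""
    procs, addr, visited = [], head_addr, set()
    while addr != 0 and addr not in visited:  # 0 is the null pointer
        visited.add(addr)                     # guard against corrupt cycles
        next_addr, pid, raw_name = RECORD.unpack_from(ram, addr)
        procs.append((pid, raw_name.rstrip(b"\0").decode()))
        addr = next_addr
    return procs


ram = bytearray(4096)
write_record(ram, 0x100, 0x300, 1, "init")
write_record(ram, 0x300, 0x200, 412, "wget")
write_record(ram, 0x200, 0, 977, "vim")

processes = list_processes(ram, head_addr=0x100)
```

The practical difficulty is that the real layout depends on the guest kernel version and configuration, so an introspection plugin needs that layout supplied from outside.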

An example
(1) Alice downloads the front page of example.org.
(2) Alice edits the document and fixes a link that points to the wrong page.
(3) Alice re-uploads the HTML document and the image.
(4) Bob downloads the front page of example.org.
(5) Bob removes a paragraph of text.
(6) Bob re-uploads the HTML document.
12

Thoughts
• Decoupling provenance analysis from execution is
possible by the use of VM record & replay.
• Execution traces can be used for post-hoc
provenance analysis.
• 24/7 execution recording seems possible.
• Can we extend this notion of instrumentation to
other capture systems?
14
Manolis Stamatogiannakis, Elias Athanasopoulos, Herbert Bos, Paul Groth:
PROV2R: Practical Provenance Analysis of Unstructured Processes. ACM
Transactions on Internet Technology 17(4): 37:1-37:24 (2017)

Full-system logging and replay
6

Select Replay
13

Editor's Notes

A big problem for systems capturing provenance is deciding what to capture. For disclosed provenance systems we can apply some methodology to decide what to capture.
The root of the problem is that provenance is post-hoc. Deciding what to capture in advance will always miss something. Ideally, we would like to…
Decouple analysis from execution. Has been proposed for security analysis on mobile phones. (Paranoid Android, Portokalidis ‘10)
Execution Capture: happens in real time.
Instrumentation: applied on the captured trace to generate provenance information.
Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries).
Selection: a subset of the execution trace is selected, and we start again with more intensive instrumentation.
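
The four stages above can be sketched as a loop over a recorded trace. This is a toy model, not the prototype: the `Event` type, the two instrumentation levels, and the sample trace are assumptions chosen to show how a cheap first pass narrows down what a more expensive second pass has to cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    step: int
    process: str
    syscall: str
    target: str


def coarse(event):
    """Cheap instrumentation: only which process touched which target."""
    return [(event.process, "touched", event.target)]


def detailed(event):
    """Heavier instrumentation: PROV-style relations for reads and writes."""
    if event.syscall == "read":
        return [(event.process, "used", event.target)]
    if event.syscall == "write":
        return [(event.target, "wasGeneratedBy", event.process)]
    return []


def replay(trace, instrument, start=0, end=None):
    """Re-run a slice of the recorded trace under one instrumentation level."""
    end = len(trace) if end is None else end
    triples = []
    for event in trace[start:end]:
        triples.extend(instrument(event))
    return triples


# 1. Execution capture: happens once (here, a hard-coded stand-in trace).
trace = [
    Event(0, "wget", "write", "index.html"),
    Event(1, "cron", "read", "/etc/crontab"),
    Event(2, "vim", "read", "index.html"),
    Event(3, "vim", "write", "index.html"),
]

# 2. Instrumentation: a cheap pass over the whole trace.
overview = replay(trace, coarse)

# 3. Analysis: which processes touched the file we care about?
interesting = {s for (s, _, o) in overview if o == "index.html"}

# 4. Selection: narrow to the relevant steps, then replay them with
#    heavier instrumentation and keep the triples about that file.
steps = [e.step for e in trace if e.target == "index.html"]
start, end = min(steps), max(steps) + 1
detail = [t for t in replay(trace, detailed, start, end) if "index.html" in t]
```

The point of the design is that `trace` is captured once, while `replay` can be called as often as needed with progressively heavier instrumentation.
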
We implemented our methodology using PANDA.
PANDA is based on QEMU. Input includes both executed instructions and data. The RAM snapshot and the ND (non-determinism) log are enough to accurately replay the whole execution. The ND log consists of inputs to the CPU/RAM; the state of other devices is not logged, so we can replay but we cannot “go live” (i.e. resume execution).
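
Why a snapshot plus a log of non-deterministic inputs is enough can be shown with a toy machine. The sketch below is an illustration under simplifying assumptions: the `step` function stands in for deterministic CPU execution, and a random byte stands in for an interrupt, DMA transfer, or port read. None of the names correspond to PANDA's actual interfaces.

```python
import random


def step(state, nd_input):
    """Deterministic transition: next state depends only on state and input."""
    return (state * 31 + nd_input) % 1_000_003


def record(initial_state, n_steps, rng):
    """Run 'live', logging every non-deterministic input as it arrives."""
    state, nd_log = initial_state, []
    for _ in range(n_steps):
        nd_input = rng.randrange(256)  # stand-in for interrupt/DMA/port input
        nd_log.append(nd_input)
        state = step(state, nd_input)
    return {"snapshot": initial_state, "nd_log": nd_log}, state


def replay(trace, observers=()):
    """Re-run from the snapshot, feeding logged inputs instead of live ones."""
    state = trace["snapshot"]
    for i, nd_input in enumerate(trace["nd_log"]):
        state = step(state, nd_input)
        for observe in observers:  # analysis hooks see every replayed state
            observe(i, state)
    return state


trace, live_final = record(initial_state=42, n_steps=1000, rng=random.Random())

seen = []
replayed_final = replay(trace, observers=[lambda i, s: seen.append(s)])
```

Replay reproduces the live run exactly, and observers can be attached after the fact. It also shows the "go live" limitation in miniature: once the log is exhausted, there are no further inputs to feed the machine.
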
Note: Technically, plugins can modify VM state. However this will eventually crash the execution as the trace will be out of sync with the replay state. Plugins are implemented as dynamic libraries. We focus on the highlighted plugins in this presentation.
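
The plugin model (callbacks invoked on events, read-only access to state, plugins combined to build richer analyses) can be sketched as follows. This is a hypothetical dispatcher written for illustration. PANDA plugins are dynamic libraries with their own callback API, which this sketch does not attempt to reproduce.

```python
from collections import Counter, defaultdict
from types import MappingProxyType


class ReplayEngine:
    """Hypothetical replay engine that dispatches events to plugins."""

    def __init__(self):
        self._callbacks = defaultdict(list)

    def register(self, event_kind, callback):
        self._callbacks[event_kind].append(callback)

    def run(self, trace):
        for kind, payload in trace:
            view = MappingProxyType(payload)  # plugins get a read-only view
            for callback in self._callbacks[kind]:
                callback(view)


class ProcessTracker:
    """Remembers the current process, updated on every context switch."""

    def __init__(self):
        self.current = None

    def on_context_switch(self, event):
        self.current = event["process"]


class WriteCounter:
    """Builds on ProcessTracker to attribute memory writes to processes."""

    def __init__(self, tracker):
        self.tracker = tracker
        self.writes = Counter()

    def on_mem_write(self, event):
        self.writes[self.tracker.current] += 1


trace = [
    ("context_switch", {"process": "vim"}),
    ("mem_write", {"addr": 0x1000}),
    ("mem_write", {"addr": 0x1004}),
    ("context_switch", {"process": "wget"}),
    ("mem_write", {"addr": 0x2000}),
]

engine = ReplayEngine()
tracker = ProcessTracker()
counter = WriteCounter(tracker)
engine.register("context_switch", tracker.on_context_switch)
engine.register("mem_write", counter.on_mem_write)
engine.run(trace)
```

Here `WriteCounter` only works because `ProcessTracker` runs alongside it, which mirrors how an OS introspection plugin supplies process context to a provenance plugin.
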
Typical information that can be retrieved through VM introspection. In general, executing code inside the guest OS is complex. Moreover, in the case of PANDA we don’t have access to the state of devices. This makes injection and execution of new code even more complex and also more limited.
