1. Using Provenance for Repeatability
Quan Pham¹, Tanu Malik², Ian Foster¹,²
¹Department of Computer Science, University of Chicago
²Computation Institute, University of Chicago and Argonne National Laboratory
TaPP 2013
2. Publication Process
• Traditional academic publication process
• Submit paper & experiment • Review ideas • Learn novel methods
• Emerging academic publication process
• Submit paper & experiments • Review ideas • Validate software • Are we reading something that is repeatable and reproducible?
3. Repeatability Testing
• Scientific progress relies on novel claims and verifiable
results
• Scientific paper reviewers
• Validate announced results
• Validate for different
data and parameters
• Validate under different
conditions and environments
• Challenge: Work under
time & budget constraints
Image: from http://catsandtheirmews.blogspot.com/2012/05/update-on-computer-crash.html
4. Repeatability Testing
Challenges & Constraints
• Repeatability requirements
• Hardware : Single machine/Clusters
• Software
• Operating System : Which operating system was used?
(Ubuntu/RedHat/Debian/Gentoo)
• Environment: How to capture all environment variables?
• Tools & libraries installation: How to precisely know all the dependencies?
• Knowledge constraints
• Experiment setup: how to set up the experiment?
• Experiment usage: how is the experiment run?
• Resource constraints
• Requires massive processing power.
• Operates on large amounts of data.
• Performs significant network communication.
• Is long-running.
5. An Approach to Repeatability Testing
Challenges & Constraints → Possible Solution
• Repeatability requirements
• Hardware requirement → Provide a virtual machine
• Software requirement → Provide portable software
• Knowledge constraints (experiment setup, experiment usage) → Provide a reference execution
• Resource constraints → Provide selective replay
6. PTU – Provenance-To-Use
• PTU
• Minimizes computation time during repeatability testing
• Guarantees that events are processed in the same order
using the same data
• Authors build a package that includes:
• Software program
• Input data
• Provenance trace
• Testers may select a subset of the package’s
processes for a partial deterministic replay
7. PTU Functionalities
• ptu-audit tool
• Builds a package of authors' source code, data, and environment variables
• Records process- and file-level details about a reference execution
% ptu-audit java TextAnalyzer news.txt
• PTU package
• Displays the provenance graph and accompanying run-time details
• ptu-exec tool
• Re-executes a specified part of the provenance graph
% ptu-exec java TextAnalyzer news.txt
8. ptu-audit
• Uses ptrace to monitor
system calls
• execve, sys_fork
• read, write, sys_io
• bind, connect, socket
• Collects provenance
• Collects runtime
information
• Makes package
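The runtime information mentioned above (memory footprint, CPU consumption, I/O activity) is read from /proc/$pid/stat, per the speaker notes. A minimal sketch of parsing such a line, assuming the field layout documented in proc(5); the sample line and the choice of extracted fields are illustrative, not taken from a real PTU run:

```python
# Sketch: extract runtime details from a /proc/<pid>/stat line, as
# ptu-audit does for each audited process. Field numbers follow proc(5).

def parse_stat(stat_line):
    # comm (field 2) is parenthesized and may contain spaces,
    # so split around the last ')' rather than naively on whitespace.
    rpar = stat_line.rindex(")")
    comm = stat_line[stat_line.index("(") + 1:rpar]
    rest = stat_line[rpar + 2:].split()  # rest[0] is field 3 (state)
    return {
        "comm": comm,
        "state": rest[0],
        "utime_ticks": int(rest[11]),  # field 14: user-mode CPU time
        "stime_ticks": int(rest[12]),  # field 15: kernel-mode CPU time
        "rss_pages": int(rest[21]),    # field 24: resident set size
    }

# Illustrative stat line for a hypothetical TextAnalyzer process.
sample = ("1234 (TextAnalyzer) S 1 1234 1234 0 -1 4194304 500 0 0 0 "
          "120 30 0 0 20 0 1 0 100 10485760 2560 0 0 0 0 0 0 0 0 0")
info = parse_stat(sample)
print(info["comm"], info["utime_ticks"], info["rss_pages"])
```

In a real audit this would read `open("/proc/%d/stat" % pid).read()` for each traced child instead of a literal string.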
10. PTU Package
• [Figure 2. The PTU package. The tester chooses
to run the sub-graph rooted at /bin/calculate ]
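Figure 2 shows the tester choosing to replay only the sub-graph rooted at /bin/calculate. A minimal sketch of that selection over a toy provenance graph; the node names and edges are hypothetical, loosely following the PEEL0 workflow:

```python
# Sketch: select the sub-graph of the provenance graph reachable from a
# chosen root, i.e. the part of the reference execution to re-run.
from collections import defaultdict, deque

# Hypothetical provenance edges (process -> process/file).
edges = [
    ("/bin/get", "/bin/reclassify"),
    ("/bin/reclassify", "/bin/calculate"),
    ("/bin/calculate", "result.dat"),
    ("/bin/calculate", "stats.dat"),
]

def subgraph_from(root, edges):
    """Return all nodes reachable from `root` via breadth-first search."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(subgraph_from("/bin/calculate", edges)))
# /bin/get and /bin/reclassify are upstream of the root, so they are
# skipped: their outputs are already in the package.
```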
11. ptu-exec
• [Figure 3. ptu-exec re-runs part of the application
from /bin/calculate. It uses CDE to re-route file
dependencies]
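The CDE-style re-routing in Figure 3 maps file arguments onto their copies inside the package, so replay reads packaged files rather than the tester's filesystem. A minimal sketch, assuming the cde-package/cde-root/ layout named on the slide; the helper name is ours:

```python
# Sketch: CDE-style path re-routing as used by ptu-exec. Absolute paths
# are redirected under the package root; relative paths resolve against
# the package working directory as-is.
import os.path

PACKAGE_ROOT = "cde-package/cde-root"

def reroute(path):
    """Map an absolute path to the corresponding path within the package."""
    if os.path.isabs(path):
        return os.path.join(PACKAGE_ROOT, path.lstrip("/"))
    return path

print(reroute("/usr/lib/libraster.so"))  # rerouted into the package
print(reroute("news.txt"))               # relative path, left alone
```

The real CDE performs this rewrite on intercepted system-call arguments via ptrace; the pure string form here only illustrates the mapping.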
12. Current PTU Components
• Uses the CDE (Code, Data, Environment) tool to create a package
• Uses ptrace to create a provenance graph representing a reference run-time execution
• Uses SQLite to store the provenance graph
• Uses graphviz for graph presentation
• Enhances CDE to run the package
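PTU stores the provenance graph in SQLite and renders it with graphviz. A minimal sketch of that pairing; the table schema, edge labels, and node names are illustrative assumptions, not PTU's actual schema:

```python
# Sketch: persist provenance edges in SQLite, then emit a Graphviz DOT
# description of the graph for presentation.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edges (src TEXT, op TEXT, dst TEXT)")
db.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    ("/bin/calculate", "read", "input.tif"),    # hypothetical file read
    ("/bin/calculate", "write", "result.dat"),  # hypothetical file write
])

def to_dot(db):
    """Render the stored edges as a DOT digraph (feed to `dot -Tpng`)."""
    lines = ["digraph provenance {"]
    for src, op, dst in db.execute("SELECT src, op, dst FROM edges"):
        lines.append('  "%s" -> "%s" [label="%s"];' % (src, dst, op))
    lines.append("}")
    return "\n".join(lines)

print(to_dot(db))
```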
13. PEEL0
• Best, N., et al., Synthesis of a Complete Land Use/Land Cover Dataset for the Conterminous United States. RDCEP Working Paper, 2012. 12(08).
• Three-step workflow: Wget + Bash script → R (Raster, Rgdal) reclassify → R geospatial algorithm
14. PEEL0
• [Figure 4: Time reduction in testing PEEL0 using
PTU]
15. TextAnalyzer
• Murphy, J., et al., Textual Hydraulics: Mining Online Newspapers to Detect Physical, Social, and Institutional Water Management Infrastructure, 2013, Technical Report, Argonne National Lab.
• Runs a named-entity recognition analysis program using several data dictionaries
• Splits the input file into multiple input files on which it runs a parallel analysis
17. Conclusion
• PTU is a step toward repeatability testing of software programs submitted to conference proceedings and journals
• Easy and attractive for authors
• Fine-grained control and efficient re-execution for testers
18. Future Work
• Other workflow types
• Distributed workflows
• Improve performance
• Decide how to store provenance compactly in a package
• Presentation
• Improve the graphical user interface and presentation
19. Acknowledgements
• Neil Best
• Jonathan Ozik
• Center for Robust Decision making on Climate
and Energy Policy (NSF grant number 0951576)
• Contractors of the US Government under contract number DE-AC02-06CH11357
Speaker notes
Hi everyone, my name is Quan Pham. In this presentation I'd like to introduce a system that uses provenance for repeatability. This work was done with Tanu Malik at the Computation Institute, University of Chicago, and Ian Foster at Argonne National Laboratory.
What is the problem with repeatability in the scientific community? Consider the publication process: I submit a paper; reviewers find interesting claims and want to verify them by re-running the experiment with different data and parameters, under different conditions and environments. There are so many things to validate, and so little time and budget (hardware). One slide introduces the "new" publication process (authors -> tester -> readers) and the concepts of author, tester, and repeatability.
• Uses ptrace to monitor ~50 system calls, including process system calls, such as execve() and sys_fork(), for collecting process provenance; file system calls, such as read(), write(), and sys_io(), for collecting file provenance; and network calls, such as bind(), connect(), socket(), and send(), for auditing network activity.
• Obtains process name, owner, group, parent, host, creation time, command line, and environment variables; and file name, path, host, size, and modification time.
• Obtains memory footprint, CPU consumption, and I/O activity data for each process from /proc/$pid/stat.
• Copies each accessed file into a package directory that consists of all sub-directories and symbolic links to the original file's location.
Reason for choosing /bin/calculate: it is memory intensive and long-running.
• When the entire reference run finishes, PTU builds a reference execution file consisting of the topological sort of the provenance graph. The nodes of the graph enumerate run-time details, such as process memory consumption and file sizes. The tester, as described next, can utilize the database and the graph for efficient re-execution.
• Testers can examine the provenance graph contained in a package to study the accompanying reference execution, and can request a re-execution, either by specifying nodes in the provenance graph or by modifying a run configuration file that is included in the package.
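The reference execution file described here is a topological sort of the provenance graph. A minimal sketch of that ordering (Kahn's algorithm) over a hypothetical process graph; the node names follow the PEEL0 steps but are ours:

```python
# Sketch: topologically sort provenance-graph nodes so processes can be
# replayed in an order that respects their dependencies.
from collections import defaultdict, deque

def topo_order(edges):
    succs, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for a, b in edges:
        succs[a].append(b)
        indeg[b] += 1
        nodes.update((a, b))
    # Start from nodes with no unfinished predecessors.
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succs[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order

print(topo_order([("get", "reclassify"), ("reclassify", "calculate")]))
# -> ['get', 'reclassify', 'calculate']
```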
• Obtains run configuration and environment variables for each process from the SQLite database.
• Monitors the re-execution via ptrace and reuses CDE functionality to replace path argument(s) to refer to the corresponding path within the package cde-package/cde-root/.
• Provides fast audit and re-execution independent of the application. The profiling of processes enables testers to choose which processes to run.
PEEL0 is a three-step workflow, implemented as R programs; classification is memory intensive. [Pic: PEEL0 workflow: get -> reclassify -> calculate -> final result; testers look at (calculate) with question marks]
PEEL0 has five process nodes, 10,000 exclusive file reads based on the number of files in the dataset, and 422 exclusive file writes for the aggregated dataset. Slowdown when using PTU: ~35% for PEEL0.
TextAnalyzer has eight process nodes that in aggregate conduct 616 exclusive file reads, 124 exclusive file writes, and 50 file nodes that are read and written again. Slowdown when using PTU: ~15% for TextAnalyzer. TextAnalyzer shows a particularly large improvement (>98%) since the entire process is run on a much smaller file.
PTU is a step toward repeatability testing of software programs submitted to conference proceedings and journals. Peer reviewers often must review these programs in a short period of time. By providing one utility that packages a software program together with its reference execution, without modifying the application, we have made PTU easy and attractive for authors to use, and a fine-grained, efficient way for testers to control re-execution.