SlideShare una empresa de Scribd logo
1 de 19
Descargar para leer sin conexión
Digital Document
Preservation Simulation
Richard Landau
(late of Digital Equipment,
Compaq, Dell)
- Research Affiliate, MIT Libraries
Program on Information Science
- http://informatics.mit.edu/people/
richard-landau
2014-07-22
The Problem
• How do you preserve digital documents long-term?
• Surely not on CDs, tapes, etc., with short lifetimes
• LOCKSS: Lots Of Copies Keep Stuff Safe
• What's the threat model?
– Failures: media, format obsolescence, fires, floods, earthquakes,
institutional failures, mergers, funding cuts, malicious insiders,….
• Some data exists on disk media reliability, RAID, etc.
• Little data on reliability of storage strategies
BPG - Digital Document Preservation Simulation 20140722 RBL 2
The Project
• Digital Document Preservation Simulation
• Part of Program on Information Science, MIT Libraries:
Dr. Micah Altman, Director of Research
• Develop empirical data that real libraries can use to
make decisions about storage strategies and costs
– Simulate a range of situations, let the clients decide what level of
risk they are willing to accept
• I do this for fun: volunteer intern, 1-2 days/week
BPG - Digital Document Preservation Simulation 20140722 RBL 3
Questions to Be Answered
• Start small: 10,000 documents for 10 years, 1-20 copies
• (Short term questions, lots more in the long term)
• Question 0: If I place a number of copies out into the
network, how many will I lose over time?
– For various error rates and copies, what's the risk?
• Question 1: If I audit the remote copies and repair them if
they're broken, how many will I lose over time?
– For various auditing strategies and frequencies, what's the risk?
BPG - Digital Document Preservation Simulation 20140722 RBL 4
Tools
• Windows 7 on largish PCs
• Cygwin for Windows (like Linux on Windows)
• Python 2.7
– SimPy library for discrete event simulations
– argparse module for CLI
– csv module for reading parameter files
– logging module for recording events
– itertools for generating serial numbers
– random to generate exponentials, uniforms, Gaussians
• Random seed values from random.org
BPG - Digital Document Preservation Simulation 20140722 RBL 5
Approach
• Very traditional object and service oriented design
– Client, Document, Collection of documents
– Server, Copy of document, Storage Shelf
– Auditor
• Asynchronous processes managed by SimPy
– Sector failure, shelf failure, at exponential intervals
– Audit a collection on a server, at regular intervals
• SimPy resource, network mutex for auditors
• Slightly perverse code (e.g., Hungarian naming)
BPG - Digital Document Preservation Simulation 20140722 RBL 6
How SimPy Works
• Library for discrete event simulation, V3
• Discrete event queue sorted by time
• Asynchronous processes, resources, events, timers
– Process is a Python generator, yield to wait for event(s)
• Timeouts, discrete events, resource requests
• Very simple to use, very efficient, good tutorial examples
• Time is floating point, choose your units
BPG - Digital Document Preservation Simulation 20140722 RBL 7
Early Results for Q0
BPG - Digital Document Preservation Simulation 20140722 RBL 8
Is Python Fast Enough?
• Yes, if one is careful
– Code it all in C++? No. Get results faster with Python.
– "Premature optimization is the root of all evil" -- Knuth
• Numerous small optimizations on the squeaky wheels
– Just a few reduced CPU and I/O by 98%
• If it's still not fast enough, use a bigger computer
BPG - Digital Document Preservation Simulation 20140722 RBL 9
Burn Those CPUs!
• One test only
2 sec to 8 min
• 2000-4000 tests
• Scheduler runs
5-7 programs at
once
• All 4 cores
• (Poor man's
parallelism)
• 99% CPU use
BPG - Digital Document Preservation Simulation 20140722 RBL 10
Class Instances Should Have Names
• <client.CDocument object at 0x6ffff9f6690> is
imperfectly informative
– D127 is much better for humans: name, not address
– With zillions of instances, identifying one during debugging is
hard
• Suggestion:
– Assign a unique ID to every instance that denotes class and a
serial number
– Always always always pass IDs as arguments rather than
instance pointers
BPG - Digital Document Preservation Simulation 20140722 RBL 11
How I Assign IDs
• Every class begins like this:
• itertools.count(1).next function returns a unique
integer for this class, starting at 1, when you invoke it
• Global dictionary dID2Xxxx translates from an ID string
to an instance of class Xxxx
• Pass ID as argument; when a function needs the pointer,
it can use a single fast dictionary lookup
BPG - Digital Document Preservation Simulation 20140722 RBL 12
class CDocument(object):
getID = itertools.count(1).next
(in __init__)
self.ID = "D" + str(self.getID())
G.dID2Document[self.ID] = self
Lessons Learned
• Python is fast enough
• SimPy is really dandy
• argparse is a great lib for CLIs
• csv.DictReader makes everything a string, beware
• item-in-list is really slow, O(n/2), use dict or set instead
• Lots of comments in the code!
BPG - Digital Document Preservation Simulation 20140722 RBL 13
Code is on github
• On github: MIT-Informatics/PreservationSimulation
• Questions: mailto:landau@ricksoft.com
BPG - Digital Document Preservation Simulation 20140722 RBL 14
Further Details (Not Tonight)
• Many small optimizations add up
– Fast find of victim document on a shelf
– Shorten log file by removing many or all details
– Minimize just-in-time checks during auditing
– Change item-in-list checks to item-in-dict
• Wide variety of scenarios covered
– 1000X span on document size, 100X on storage shelf size,
10000X span on failure rates, 20X span on number of copies
– A test run takes from 2 seconds to 8 minutes (CPU)
• Running external programs
• Post-processing of structured log files
BPG - Digital Document Preservation Simulation 20140722 RBL 15
Forming and Running
External Commands
• Substitute multiple arguments into a string with format()
• Run command and capture output into string (or list)
BPG - Digital Document Preservation Simulation 20140722 RBL 16
TemplateCmd = "ps | grep {Name} | wc -l"
RealCmd = TemplateCmd.format(**dictArgs)
ResultString = ""
for Line in os.popen(RealCmd):
ResultString += Line.strip()
nProcesses = int(ResultString)
if nProcesses < nProcessMax:
. . .
A More Complicated Command
• Read a dict of parameters with csv.DictReader
• And then substitute lots of arguments with format()
• Execute and capture output with os.popen()
BPG - Digital Document Preservation Simulation 20140722 RBL 17
dir,specif,len,seed,cop,ber,berm,extra1
../q1,v3d50,0,919028296,01,0050,0050000,
python main.py {family} {specif} {len} {seed} 
--ncopies={cop} --ber={berm} {extra1} > 
{dir}/{specif}/log{ber}/c{cop}b{ber}s{seed}.log 
2>&1 &
Subprocess Module Instead
• os.popen is being replaced by more general functions
– subprocess.check_output
– subprocess.Popen
– subprocess.Pipe
– subprocess.communicate
BPG - Digital Document Preservation Simulation 20140722 RBL 18
Supersede That Old Function
• Have a complicated function that works, but you want to
replace it with a spiffier version?
– Edit in place?
• Might break the whole thing for a while
– Comment it out with ''' or """?
• Not perfectly reliable, like "if 0" in C
• Supersede it: add the new version after the old
– Last version replaces previous one in module or class dict
BPG - Digital Document Preservation Simulation 20140722 RBL 19
def Foo(…):
(moldy old code)
def Foo(…):
(shiny new code)

Más contenido relacionado

La actualidad más candente

Chronix Time Series Database - The New Time Series Kid on the Block
Chronix Time Series Database - The New Time Series Kid on the BlockChronix Time Series Database - The New Time Series Kid on the Block
Chronix Time Series Database - The New Time Series Kid on the BlockQAware GmbH
 
Introduction to InfluxDB
Introduction to InfluxDBIntroduction to InfluxDB
Introduction to InfluxDBJorn Jambers
 
A Deep Learning use case for water end use detection by Roberto Díaz and José...
A Deep Learning use case for water end use detection by Roberto Díaz and José...A Deep Learning use case for water end use detection by Roberto Díaz and José...
A Deep Learning use case for water end use detection by Roberto Díaz and José...Big Data Spain
 
Real-time driving score service using Flink
Real-time driving score service using FlinkReal-time driving score service using Flink
Real-time driving score service using FlinkDongwon Kim
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeFlink Forward
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series DatabasePramit Choudhary
 
Reactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxReactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxSumant Tambe
 
Predictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache FlinkPredictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache FlinkDongwon Kim
 
Intelligent integration with WSO2 ESB & WSO2 CEP
Intelligent integration with WSO2 ESB & WSO2 CEP Intelligent integration with WSO2 ESB & WSO2 CEP
Intelligent integration with WSO2 ESB & WSO2 CEP Sriskandarajah Suhothayan
 
Scaling event aggregation at twitter
Scaling event aggregation at twitterScaling event aggregation at twitter
Scaling event aggregation at twitterlohitvijayarenu
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniquesLars Albertsson
 
Self driving computers active learning workflows with human interpretable ve...
Self driving computers  active learning workflows with human interpretable ve...Self driving computers  active learning workflows with human interpretable ve...
Self driving computers active learning workflows with human interpretable ve...Adam Gibson
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series WorldMapR Technologies
 
Increasing Security Awareness in Enterprise Using Automated Feature Extractio...
Increasing Security Awareness in Enterprise Using Automated Feature Extractio...Increasing Security Awareness in Enterprise Using Automated Feature Extractio...
Increasing Security Awareness in Enterprise Using Automated Feature Extractio...Burman Noviansyah
 
Complex Event Processing: What?, Why?, How?
Complex Event Processing: What?, Why?, How?Complex Event Processing: What?, Why?, How?
Complex Event Processing: What?, Why?, How?Alexandre Vasseur
 
Complex Event Processing - A brief overview
Complex Event Processing - A brief overviewComplex Event Processing - A brief overview
Complex Event Processing - A brief overviewIstván Dávid
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
Whole Genome Sequencing - Data Processing and QC at SciLifeLab NGI
Whole Genome Sequencing - Data Processing and QC at SciLifeLab NGIWhole Genome Sequencing - Data Processing and QC at SciLifeLab NGI
Whole Genome Sequencing - Data Processing and QC at SciLifeLab NGIPhil Ewels
 
Complex Event Processing with Esper
Complex Event Processing with EsperComplex Event Processing with Esper
Complex Event Processing with EsperAntónio Alegria
 

La actualidad más candente (20)

Chronix Time Series Database - The New Time Series Kid on the Block
Chronix Time Series Database - The New Time Series Kid on the BlockChronix Time Series Database - The New Time Series Kid on the Block
Chronix Time Series Database - The New Time Series Kid on the Block
 
Introduction to InfluxDB
Introduction to InfluxDBIntroduction to InfluxDB
Introduction to InfluxDB
 
A Deep Learning use case for water end use detection by Roberto Díaz and José...
A Deep Learning use case for water end use detection by Roberto Díaz and José...A Deep Learning use case for water end use detection by Roberto Díaz and José...
A Deep Learning use case for water end use detection by Roberto Díaz and José...
 
Real-time driving score service using Flink
Real-time driving score service using FlinkReal-time driving score service using Flink
Real-time driving score service using Flink
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
 
Reactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxReactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and Rx
 
Predictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache FlinkPredictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache Flink
 
Intelligent integration with WSO2 ESB & WSO2 CEP
Intelligent integration with WSO2 ESB & WSO2 CEP Intelligent integration with WSO2 ESB & WSO2 CEP
Intelligent integration with WSO2 ESB & WSO2 CEP
 
Scaling event aggregation at twitter
Scaling event aggregation at twitterScaling event aggregation at twitter
Scaling event aggregation at twitter
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniques
 
Self driving computers active learning workflows with human interpretable ve...
Self driving computers  active learning workflows with human interpretable ve...Self driving computers  active learning workflows with human interpretable ve...
Self driving computers active learning workflows with human interpretable ve...
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series World
 
Increasing Security Awareness in Enterprise Using Automated Feature Extractio...
Increasing Security Awareness in Enterprise Using Automated Feature Extractio...Increasing Security Awareness in Enterprise Using Automated Feature Extractio...
Increasing Security Awareness in Enterprise Using Automated Feature Extractio...
 
Complex Event Processing: What?, Why?, How?
Complex Event Processing: What?, Why?, How?Complex Event Processing: What?, Why?, How?
Complex Event Processing: What?, Why?, How?
 
Complex Event Processing - A brief overview
Complex Event Processing - A brief overviewComplex Event Processing - A brief overview
Complex Event Processing - A brief overview
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Whole Genome Sequencing - Data Processing and QC at SciLifeLab NGI
Whole Genome Sequencing - Data Processing and QC at SciLifeLab NGIWhole Genome Sequencing - Data Processing and QC at SciLifeLab NGI
Whole Genome Sequencing - Data Processing and QC at SciLifeLab NGI
 
Complex Event Processing with Esper
Complex Event Processing with EsperComplex Event Processing with Esper
Complex Event Processing with Esper
 

Similar a Digital Document Preservation Simulation - Boston Python User's Group

Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDogRedis Labs
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Guglielmo Iozzia
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Javamalduarte
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analyticsamesar0
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesDavid Martínez Rego
 
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ..."Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...Dataconomy Media
 
Webinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDBWebinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDBMongoDB
 
MongoDB Best Practices
MongoDB Best PracticesMongoDB Best Practices
MongoDB Best PracticesLewis Lin 🦊
 
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data WarehouseReal-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data WarehousePrecisely
 
James Corcoran, Head of Engineering EMEA, First Derivatives, "Simplifying Bi...
James Corcoran, Head of Engineering EMEA, First Derivatives,  "Simplifying Bi...James Corcoran, Head of Engineering EMEA, First Derivatives,  "Simplifying Bi...
James Corcoran, Head of Engineering EMEA, First Derivatives, "Simplifying Bi...Dataconomy Media
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpcDr Reeja S R
 
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...Anand Narayanan
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performancePiotr Przymus
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...LibbySchulze
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...Amazon Web Services
 
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scaleDataScienceConferenc1
 

Similar a Digital Document Preservation Simulation - Boston Python User's Group (20)

Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Java
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ..."Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
 
Webinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDBWebinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDB
 
MongoDB Best Practices
MongoDB Best PracticesMongoDB Best Practices
MongoDB Best Practices
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Server Tips
Server TipsServer Tips
Server Tips
 
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data WarehouseReal-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse
 
James Corcoran, Head of Engineering EMEA, First Derivatives, "Simplifying Bi...
James Corcoran, Head of Engineering EMEA, First Derivatives,  "Simplifying Bi...James Corcoran, Head of Engineering EMEA, First Derivatives,  "Simplifying Bi...
James Corcoran, Head of Engineering EMEA, First Derivatives, "Simplifying Bi...
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpc
 
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performance
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
 
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 

Más de Micah Altman

Selecting efficient and reliable preservation strategies
Selecting efficient and reliable preservation strategiesSelecting efficient and reliable preservation strategies
Selecting efficient and reliable preservation strategiesMicah Altman
 
Well-Being - A Sunset Conversation
Well-Being - A Sunset ConversationWell-Being - A Sunset Conversation
Well-Being - A Sunset ConversationMicah Altman
 
Matching Uses and Protections for Government Data Releases: Presentation at t...
Matching Uses and Protections for Government Data Releases: Presentation at t...Matching Uses and Protections for Government Data Releases: Presentation at t...
Matching Uses and Protections for Government Data Releases: Presentation at t...Micah Altman
 
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019Micah Altman
 
Well-being A Sunset Conversation
Well-being A Sunset ConversationWell-being A Sunset Conversation
Well-being A Sunset ConversationMicah Altman
 
Can We Fix Peer Review
Can We Fix Peer ReviewCan We Fix Peer Review
Can We Fix Peer ReviewMicah Altman
 
Academy Owned Peer Review
Academy Owned Peer ReviewAcademy Owned Peer Review
Academy Owned Peer ReviewMicah Altman
 
Redistricting in the US -- An Overview
Redistricting in the US -- An OverviewRedistricting in the US -- An Overview
Redistricting in the US -- An OverviewMicah Altman
 
A Future for Electoral Districting
A Future for Electoral DistrictingA Future for Electoral Districting
A Future for Electoral DistrictingMicah Altman
 
A History of the Internet :Scott Bradner’s Program on Information Science Talk
A History of the Internet :Scott Bradner’s Program on Information Science Talk  A History of the Internet :Scott Bradner’s Program on Information Science Talk
A History of the Internet :Scott Bradner’s Program on Information Science Talk Micah Altman
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...Micah Altman
 
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...Micah Altman
 
Utilizing VR and AR in the Library Space:
Utilizing VR and AR in the Library Space:Utilizing VR and AR in the Library Space:
Utilizing VR and AR in the Library Space:Micah Altman
 
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-NotsCreative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-NotsMicah Altman
 
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...Micah Altman
 
Ndsa 2016 opening plenary
Ndsa 2016 opening plenaryNdsa 2016 opening plenary
Ndsa 2016 opening plenaryMicah Altman
 
Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...Micah Altman
 
Software Repositories for Research-- An Environmental Scan
Software Repositories for Research-- An Environmental ScanSoftware Repositories for Research-- An Environmental Scan
Software Repositories for Research-- An Environmental ScanMicah Altman
 
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...Micah Altman
 
Gary Price, MIT Program on Information Science
Gary Price, MIT Program on Information ScienceGary Price, MIT Program on Information Science
Gary Price, MIT Program on Information ScienceMicah Altman
 

Más de Micah Altman (20)

Selecting efficient and reliable preservation strategies
Selecting efficient and reliable preservation strategiesSelecting efficient and reliable preservation strategies
Selecting efficient and reliable preservation strategies
 
Well-Being - A Sunset Conversation
Well-Being - A Sunset ConversationWell-Being - A Sunset Conversation
Well-Being - A Sunset Conversation
 
Matching Uses and Protections for Government Data Releases: Presentation at t...
Matching Uses and Protections for Government Data Releases: Presentation at t...Matching Uses and Protections for Government Data Releases: Presentation at t...
Matching Uses and Protections for Government Data Releases: Presentation at t...
 
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
 
Well-being A Sunset Conversation
Well-being A Sunset ConversationWell-being A Sunset Conversation
Well-being A Sunset Conversation
 
Can We Fix Peer Review
Can We Fix Peer ReviewCan We Fix Peer Review
Can We Fix Peer Review
 
Academy Owned Peer Review
Academy Owned Peer ReviewAcademy Owned Peer Review
Academy Owned Peer Review
 
Redistricting in the US -- An Overview
Redistricting in the US -- An OverviewRedistricting in the US -- An Overview
Redistricting in the US -- An Overview
 
A Future for Electoral Districting
A Future for Electoral DistrictingA Future for Electoral Districting
A Future for Electoral Districting
 
A History of the Internet :Scott Bradner’s Program on Information Science Talk
A History of the Internet :Scott Bradner’s Program on Information Science Talk  A History of the Internet :Scott Bradner’s Program on Information Science Talk
A History of the Internet :Scott Bradner’s Program on Information Science Talk
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
 
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...
 
Utilizing VR and AR in the Library Space:
Utilizing VR and AR in the Library Space:Utilizing VR and AR in the Library Space:
Utilizing VR and AR in the Library Space:
 
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-NotsCreative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots
 
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...
 
Ndsa 2016 opening plenary
Ndsa 2016 opening plenaryNdsa 2016 opening plenary
Ndsa 2016 opening plenary
 
Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...
 
Software Repositories for Research-- An Environmental Scan
Software Repositories for Research-- An Environmental ScanSoftware Repositories for Research-- An Environmental Scan
Software Repositories for Research-- An Environmental Scan
 
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
The Open Access Network: Rebecca Kennison’s Talk for the MIT Prorgam on Infor...
 
Gary Price, MIT Program on Information Science
Gary Price, MIT Program on Information ScienceGary Price, MIT Program on Information Science
Gary Price, MIT Program on Information Science
 

Último

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 

Digital Document Preservation Simulation - Boston Python User's Group

  • 1. Digital Document Preservation Simulation Richard Landau (late of Digital Equipment, Compaq, Dell) - Research Affiliate, MIT Libraries Program on Information Science - http://informatics.mit.edu/people/ richard-landau 2014-07-22
  • 2. The Problem • How do you preserve digital documents long-term? • Surely not on CDs, tapes, etc., with short lifetimes • LOCKSS: Lots Of Copies Keep Stuff Safe • What's the threat model? – Failures: media, format obsolescence, fires, floods, earthquakes, institutional failures, mergers, funding cuts, malicious insiders,…. • Some data exists on disk media reliability, RAID, etc. • Little data on reliability of storage strategies BPG - Digital Document Preservation Simulation 20140722 RBL 2
  • 3. The Project • Digital Document Preservation Simulation • Part of Program on Information Science, MIT Libraries: Dr. Micah Altman, Director of Research • Develop empirical data that real libraries can use to make decisions about storage strategies and costs – Simulate a range of situations, let the clients decide what level of risk they are willing to accept • I do this for fun: volunteer intern, 1-2 days/week BPG - Digital Document Preservation Simulation 20140722 RBL 3
  • 4. Questions to Be Answered • Start small: 10,000 documents for 10 years, 1-20 copies • (Short term questions, lots more in the long term) • Question 0: If I place a number of copies out into the network, how many will I lose over time? – For various error rates and copies, what's the risk? • Question 1: If I audit the remote copies and repair them if they're broken, how many will I lose over time? – For various auditing strategies and frequencies, what's the risk? BPG - Digital Document Preservation Simulation 20140722 RBL 4
  • 5. Tools • Windows 7 on largish PCs • Cygwin for Windows (like Linux on Windows) • Python 2.7 – SimPy library for discrete event simulations – argparse module for CLI – csv module for reading parameter files – logging module for recording events – itertools for generating serial numbers – random to generate exponentials, uniforms, Gaussians • Random seed values from random.org BPG - Digital Document Preservation Simulation 20140722 RBL 5
  • 6. Approach • Very traditional object and service oriented design – Client, Document, Collection of documents – Server, Copy of document, Storage Shelf – Auditor • Asynchronous processes managed by SimPy – Sector failure, shelf failure, at exponential intervals – Audit a collection on a server, at regular intervals • SimPy resource, network mutex for auditors • Slightly perverse code (e.g., Hungarian naming) BPG - Digital Document Preservation Simulation 20140722 RBL 6
  • 7. How SimPy Works • Library for discrete event simulation, V3 • Discrete event queue sorted by time • Asynchronous processes, resources, events, timers – Process is a Python generator, yield to wait for event(s) • Timeouts, discrete events, resource requests • Very simple to use, very efficient, good tutorial examples • Time is floating point, choose your units BPG - Digital Document Preservation Simulation 20140722 RBL 7
  • 8. Early Results for Q0 BPG - Digital Document Preservation Simulation 20140722 RBL 8
  • 9. Is Python Fast Enough? • Yes, if one is careful – Code it all in C++? No. Get results faster with Python. – "Premature optimization is the root of all evil" -- Knuth • Numerous small optimizations on the squeaky wheels – Just a few reduced CPU and I/O by 98% • If it's still not fast enough, use a bigger computer BPG - Digital Document Preservation Simulation 20140722 RBL 9
  • 10. Burn Those CPUs! • One test only 2 sec to 8 min • 2000-4000 tests • Scheduler runs 5-7 programs at once • All 4 cores • (Poor man's parallelism) • 99% CPU use BPG - Digital Document Preservation Simulation 20140722 RBL 10
  • 11. Class Instances Should Have Names • <client.CDocument object at 0x6ffff9f6690> is imperfectly informative – D127 is much better for humans: name, not address – With zillions of instances, identifying one during debugging is hard • Suggestion: – Assign a unique ID to every instance that denotes class and a serial number – Always always always pass IDs as arguments rather than instance pointers BPG - Digital Document Preservation Simulation 20140722 RBL 11
  • 12. How I Assign IDs • Every class begins like this: • itertools.count(1).next function returns a unique integer for this class, starting at 1, when you invoke it • Global dictionary dID2Xxxx translates from an ID string to an instance of class Xxxx • Pass ID as argument; when a function needs the pointer, it can use a single fast dictionary lookup BPG - Digital Document Preservation Simulation 20140722 RBL 12 class CDocument(object): getID = itertools.count(1).next (in __init__) self.ID = "D" + str(self.getID()) G.dID2Document[self.ID] = self
  • 13. Lessons Learned • Python is fast enough • SimPy is really dandy • argparse is a great lib for CLIs • csv.DictReader makes everything a string, beware • item-in-list is really slow, O(n/2), use dict or set instead • Lots of comments in the code! BPG - Digital Document Preservation Simulation 20140722 RBL 13
  • 14. Code is on github • On github: MIT-Informatics/PreservationSimulation • Questions: mailto:landau@ricksoft.com BPG - Digital Document Preservation Simulation 20140722 RBL 14
  • 15. Further Details (Not Tonight) • Many small optimizations add up – Fast find of victim document on a shelf – Shorten log file by removing many or all details – Minimize just-in-time checks during auditing – Change item-in-list checks to item-in-dict • Wide variety of scenarios covered – 1000X span on document size, 100X on storage shelf size, 10000X span on failure rates, 20X span on number of copies – A test run takes from 2 seconds to 8 minutes (CPU) • Running external programs • Post-processing of structured log files BPG - Digital Document Preservation Simulation 20140722 RBL 15
  • 16. Forming and Running External Commands • Substitute multiple arguments into a string with format() • Run command and capture output into string (or list) BPG - Digital Document Preservation Simulation 20140722 RBL 16 TemplateCmd = "ps | grep {Name} | wc -l" RealCmd = TemplateCmd.format(**dictArgs) ResultString = "" for Line in os.popen(RealCmd): ResultString += Line.strip() nProcesses = int(ResultString) if nProcesses < nProcessMax: . . .
  • 17. A More Complicated Command • Read a dict of parameters with csv.DictReader • And then substitute lots of arguments with format() • Execute and capture output with os.popen() BPG - Digital Document Preservation Simulation 20140722 RBL 17 dir,specif,len,seed,cop,ber,berm,extra1 ../q1,v3d50,0,919028296,01,0050,0050000, python main.py {family} {specif} {len} {seed} --ncopies={cop} --ber={berm} {extra1} > {dir}/{specif}/log{ber}/c{cop}b{ber}s{seed}.log 2>&1 &
  • 18. Subprocess Module Instead • os.popen is being replaced by more general functions – subprocess.check_output – subprocess.Popen – subprocess.Pipe – subprocess.communicate BPG - Digital Document Preservation Simulation 20140722 RBL 18
  • 19. Supersede That Old Function • Have a complicated function that works, but you want to replace it with a spiffier version? – Edit in place? • Might break the whole thing for a while – Comment it out with ''' or """? • Not perfectly reliable, like "if 0" in C • Supersede it: add the new version after the old – Last version replaces previous one in module or class dict BPG - Digital Document Preservation Simulation 20140722 RBL 19 def Foo(…): (moldy old code) def Foo(…): (shiny new code)