To Preserve Or Not To Preserve?
1. To Preserve Or Not To Preserve?
The Challenges in Appraising Electronic Records
Peter Bajcsy, PhD
- Research Scientist, NCSA
- Adjunct Assistant Professor ECE & CS at
UIUC
- Associate Director Center for Humanities,
Social Sciences and Arts (CHASS), Illinois
Informatics Institute (I3), UIUC
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Date: January 21st, 2009
2. Acknowledgement
• This research was partially supported by a National
Archives and Records Administration (NARA) supplement
to NSF PACI cooperative agreement CA #SCI-9619019
and NCSA Industrial Partners.
• The views and conclusions contained in this document
are those of the authors and should not be interpreted as
representing the official policies, either expressed or
implied, of the National Archives and Records
Administration, or the U.S. government.
• Contributions by: Peter Bajcsy, Kenton McHenry, Rob
Kooper, Michal Ondrejcek, William McFadden, Sang-Chul
Lee, David Clutter and Alex Yahja
Imaginations unbound
3. Outline
• Introduction
• Stakeholders
• Conceptual Challenges
• Some Open Problems
• Research Examples Illustrating Open
Problems
• Summary, Observations and Future
Vision
4. Introduction
• Two Trends in the Context of Decision Processes
(Government, Medical, Natural Disasters, …)
• Decision processes are moving from paper based
to electronic record based (~ computer assisted
decision processes)
• Electronic records depend on rapidly changing
information technology
• Whether a decision is optimal depends on the available knowledge
• Any learning from electronic records depends on
preservation and reconstruction of the records, as
well as on quality and granularity of the information
5. Fundamental Problems
• Limited learning from historical records
today
• It is often due to missing information and
high uncertainty/ low quality of historical
records.
• Lack of understanding how to preserve and
reconstruct data and decision processes.
• It is due to insufficient
forecasting/simulation capabilities.
6. To Be Preserved!
[Diagram: an AGENCY holds a digital representation of information & knowledge; information transfer (?) to the ARCHIVES for preservation]
7. Motivation
• The problems related to preservation of electronic records
are only going to become more serious
• Information becomes more heterogeneous and complex
• More data types
• Higher dimensional data
• New file formats
• Volumes of electronic records have been increasing and will
continue to grow
• The model of a paperless office (4 years of Bush’s email > 8
years of Clinton’s email)
• The paradigm shift to eScience
• Digital information technology has been changing faster than
any previous preservation media
• The time scale of electronic media is ephemeral in comparison
with paper or clay tablets
8. Example of Preservation Needs in Medicine
• Short term:
• Medical practice requires comparing patients’
records acquired today with the patients’
records from 5, 10, 50, or 70 years ago in order to
assess functional, structural or low level
biological changes due to diseases,
treatments and/or aging.
• Long term:
• Genealogy studies compare data sets over
several hundreds and thousands of years
9. Who Are the Stakeholders?
• Multiple institutions and organizations are active in the area
of medical record preservation
• National Library of Medicine (NLM)
• Research Information Network (RIN)
• Medical Research Council (MRC) in UK
• National Archives and Records Administration (NARA)
• Identified common goals:
• Seamless, uninterrupted access to expanding collections
of biomedical data, medical knowledge, and health
information
• Preserve medical record collections in highly usable
forms and contribute to comprehensive strategies for
preservation of biomedical information in the U.S. and
worldwide.
10. Other Stakeholders
• Government agencies
• Prediction of patterns signaling natural disasters
based on historical measurements
• Detection of terrorist attacks based on past
experience
• Learning about other planets from past space shuttle
missions
• Preservation of cultural heritage
• Companies
• Preservation of engineering drawings and
architectural designs – Boeing, John Deere, GM
• Preservation of simulation results – Caterpillar, Ford
• Backward compatibility of hardware/software - GE
11. NARA as One of the Key Stakeholders
• According to The Strategic Plan of The
National Archives and Records
Administration 2006–2016, “Preserving the
Past to Protect the Future”
• “Strategic Goal: We will preserve and
process records to ensure access by the
public as soon as legally possible”
• “D. We will improve the efficiency with
which we manage our holdings from
the time they are scheduled through
accessioning, processing, storage,
preservation, and public use.”
12. Conceptual Challenges
• Learning Requires Reusing Electronic Records
• How to enable and support preservation and
reconstruction of electronic records?
• Advances in Sensors and Instruments Lead to New
Types of High Dimensional Data and Large Volumes
• How to design preservation methodologies that
scale well?
• Process to Enable Learning over Time from
Electronic Records Requires Large Financial
Investments
• How to minimize computational hardware,
software, and storage cost and maximize the
amount of preserved information?
13. What Are The Key Open Problems?
14. Some Open Problems -> Intellectual Merit
• Appraisal Methodology
• Appraisal by Visual Exploration
• Support of Appraisals by Enabling Comparisons
• Scalability of Appraisals with Increasing Heterogeneity of
Information, Dimensionality of Data and Volume of Electronic
Records
• Support of Archival Decisions
• Simulate Preservation Costs as a Function of Information
Granularity and Information Technology
• Optimal Utilization of Computational and Human Resources
• Automation of Processing for Preservation
• Discovery of Relationships Among Electronic Records
• Information Preserving Conversions of Electronic Records
• Sampling, Authenticity and Integrity Verification of a Collection of
Temporally Changing Records
15. Broader Impacts
[Diagram: Electronic Records enter a Process to Enable Learning Over Time, producing Knowledge that supports Optimal Decision Making; the arrows carry +$ and -$ labels]
17. Open Problems Related to Appraisal
Methodology
1. Appraisal by Visual Exploration
2. Support of Appraisals by Enabling Comparisons
3. Scalability of Appraisals with Increasing Heterogeneity of
Information, Dimensionality of Data and Volume of Electronic
Records
18. Definition of Appraisal in Archival Context
• Appraisal -- the process of determining the value and thus
the final disposition of Federal records, making them either
temporary or permanent.
• See http://www.archives.gov/records-mgmt/initiatives/appraisal.html
• The basis of appraisal decisions may include
• the records' provenance and content,
• the records' authenticity and reliability,
• the records' order and completeness,
• the records' condition and costs to preserve them, and
• the records' intrinsic value
19. Open Problem 1: Appraisal by Visual
Exploration
• How to visualize the transition from raw data to information?
• Raw data (byte stream) -> Information: 0F0 -> (R,G,B) -> GREEN
• How to encode and represent heterogeneous information for
visual exploration and for computer-assisted operations?
• Encoding (e.g., shape consisting of a set of Bezier
curves is encoded by a set of straight lines)
• Representation (e.g., colors are represented by an
ordered sequence of intensity values from all bands)
• How to summarize representations for visual exploration?
• Frequency of occurrence of primitives
• Local and global summarizations
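The transition from raw data to information, and a frequency-of-occurrence summary, can be sketched in a few lines. This is a minimal illustration, not the viewer described in the talk: it assumes the byte stream "0F0" is three-digit hexadecimal shorthand for an (R,G,B) triple, and the `NAMED_COLORS` table and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical lookup table; a real viewer would need a fuller color vocabulary.
NAMED_COLORS = {(0, 255, 0): "GREEN", (255, 0, 0): "RED", (0, 0, 255): "BLUE"}


def decode_rgb(hex3):
    """Interpret three hex digits such as '0F0' as an (R, G, B) triple.

    Assumption: each digit is shorthand for a full byte, so 'F' means 0xFF.
    """
    if len(hex3) != 3:
        raise ValueError("expected exactly three hex digits")
    return tuple(int(digit * 2, 16) for digit in hex3)


def color_name(rgb):
    """Map an (R, G, B) triple to a name, or 'UNNAMED' if it is not in the table."""
    return NAMED_COLORS.get(rgb, "UNNAMED")


def color_frequency(pixels):
    """Global summary: how often each named color occurs in a list of pixels."""
    return Counter(color_name(decode_rgb(p)) for p in pixels)


pixels = ["0F0", "0F0", "F00", "00F", "0F0"]
print(decode_rgb("0F0"), color_name(decode_rgb("0F0")))
print(color_frequency(pixels).most_common())
```

A local summary would apply the same counting to a sub-region of the raster instead of the whole image.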
20. Example: Adobe Portable Document
Format (PDF)
• Why PDF? - PDF is just an example of a container
• Office environment (Adobe PDF, PS, MS Word, HTML, …)
• Satellite measurements (HDF, netCDF, …)
[Figure labels: 3D, Adobe Library 6.0, Movie, Adobe Library 7.0]
21. Exploration of PDF Documents Using PDF
Viewer
• PDF Viewer presents information as a set of pages with
their layouts
• PDF Viewer renders layers of internal objects
(components) and hence only the top layer is visible
22. Needed Exploration of PDF Components
• There is no support for archival appraisals that would
include visual exploration of components in a document
(a container of components)
• Needed viewers for appraisal analyses that present
information stored in a container (e.g., PDF) as a set of
components and their characteristics
• Text – word frequency
• Images (rasters) – color frequency (histogram)
• Vector graphics – line frequency
• Exploration for appraisal analyses needs to include
visible and invisible objects
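A text-component summary of the kind listed above (word frequency, with separate counts for numbers and an "ignore" list) might look like the following sketch. It assumes the text has already been extracted from the container; the `IGNORE_WORDS` list and the function names are hypothetical, and extraction from PDF itself is out of scope here.

```python
import re
from collections import Counter

# Assumed "ignore" list; in practice an appraiser would edit it interactively.
IGNORE_WORDS = {"the", "of", "and", "a", "to", "in"}


def word_frequency(text, ignore=IGNORE_WORDS):
    """Count alphabetic words, case-insensitively, skipping ignored words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in ignore)


def number_frequency(text):
    """Count occurrences of integer or decimal numbers as written in the text."""
    return Counter(re.findall(r"\d+(?:\.\d+)?", text))


sample = "The records of 2003 and the records of 1981 describe the mine."
print(word_frequency(sample).most_common())
print(number_frequency(sample).most_common())
```

The same counting pattern carries over to the other components: colors for rasters, and vertical or horizontal lines for vector graphics.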
23. Exploration of Text Components
[Screenshot panels: Loaded files; Occurrence of words; Occurrence of numbers; “Ignore” words]
24. Exploration of Image Components
[Screenshot panels: Loaded files; List of images; Occurrence of colors; “Ignore” colors; Preview]
25. Exploration of Vector Graphics
Components
[Screenshot panels: Loaded files; Preview; Occurrence of v/h lines]
26. Exploration of Visible And Invisible Objects
Objects intersected at the
mouse click location
27. Open Problem 2: Support of Appraisals
by Enabling Comparisons
• How to compare containers with heterogeneous
information?
• Methodology
• Metrics
• Weighting factors for fusion
• How to quantify differences between the same
type of information?
• Encodings and Representations
• Metrics
• Local versus global differences
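One way to make the questions about metrics and weighting factors concrete is a per-component similarity followed by a weighted fusion. The sketch below is an assumption-laden illustration, not the method used in the reported experiments: it uses cosine similarity over frequency counts as the metric, and the component names, weights and function names are hypothetical. Weights could, for example, be set in proportion to the share of document surface each component covers.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two frequency tables: 1 = same profile, 0 = disjoint."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def fused_similarity(doc_a, doc_b, weights):
    """Weighted average of per-component similarities (text, image, vector, ...)."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for component, weight in weights.items():
        freq_a = doc_a.get(component, Counter())
        freq_b = doc_b.get(component, Counter())
        score += weight * cosine_similarity(freq_a, freq_b)
    return score / total


# Hypothetical documents described by per-component frequency tables.
doc_1 = {"text": Counter(records=2, mine=1), "image": Counter(green=40, red=5)}
doc_2 = {"text": Counter(records=1, mine=1), "image": Counter(blue=30)}
weights = {"text": 0.7, "image": 0.3}  # assumed coverage shares
print(round(fused_similarity(doc_1, doc_2, weights), 3))
```

Reporting the per-component scores alongside the fused score keeps local differences visible when the global score hides them.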
30. Experimental Example
INPUT = 10 PDF docs in two groups (4 and 6 documents)
UNIQUE ID = 1, 2, 3, 4 (one group); UNIQUE ID = 5, 6, 7, 8, 9, 10 (other group)
31. Comparative Experimental Results
INPUT = 10 PDF docs
(6 & 4 members in each Group)
[Figure panels: Vector-based similarity; Text-based similarity; Image-based similarity]
32. Comparative Experimental Results
[Figure panels: Vector Graphics Similarity and Word Similarity Combined; Portion of Document Surface Allotted to Each Document Feature; Comparison Using Combination of Document Features in Proportion to Coverage]
33. Accuracy Comparisons
Method                     Average Similarity   Average Similarity   Average Similarity
                           of Group 1           of Group 2           Across Group 1 & 2
TEXT ONLY                  1                    0.489                0
TEXT & IMAGE & GRAPHICS    0.906                0.520                0.075
One refers to high similarity & zero refers to low similarity
Conclusions:
• Differences in similarity are up to 10% of the score
• Documents in Group 2 would likely be misclassified with text only, as 0.5
similarity would be the threshold between similar and
dissimilar documents (text only scores 0.489; the combined features score 0.520)
34. Open Problem 3: Scalability of
Appraisals
• Scalability of appraisals with increasing
heterogeneity of information,
dimensionality of data and volume of
electronic records
• How should the appraisal process change
as 3D data is added to file containers?
• How should the appraisal process change
as 3D+time, 2D+spectrum,
3D+time+spectrum, nD, … data are added?
• How should appraisal operations be
designed to accommodate growing
volume of electronic records?
35. Approaches to Computational Scalability of
Document Appraisals
• Options for parallel processing
• message-passing interface (MPI)
• MPI is designed for the coordination of a program running as multiple
processes in a distributed memory environment by passing
control messages.
• open multi-processing (OpenMP)
• OpenMP is intended for shared memory machines. It uses a
multithreading approach where the master thread forks any
number of slave threads.
• MapReduce parallel programming paradigm for commodity
clusters
• It lets programmers write a simple Map function and Reduce
function, which are then automatically parallelized without
requiring the programmers to code the details of parallel
processes and communications
• Specialized Hardware: FPGA, Cell processors, GPU
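The Map and Reduce split described above can be illustrated with a word count over documents. This is a single-process simulation in plain Python, meant only to show the division of labor; it is not the Hadoop API, and the function names and sample file names are hypothetical. In a real framework the map calls and reduce calls would run in parallel on different machines, with the framework handling the shuffle step.

```python
from collections import defaultdict
from itertools import chain


def map_words(doc_id, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: collapse all values for one key into a single result."""
    return word, sum(counts)


def run_job(documents):
    """Run map, shuffle and reduce sequentially over a dict of documents."""
    mapped = chain.from_iterable(map_words(i, t) for i, t in documents.items())
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


docs = {"a.pdf": "shuttle foam shuttle", "b.pdf": "foam debris"}
print(run_job(docs))
```

Because each map call sees one document and each reduce call sees one key, neither function needs to know how many machines are involved.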
37. Hardware & Software Dependencies with
Hadoop
• Test data: 15 PDF files from the Columbia investigation
web site at http://caib.nasa.gov/.
• Software configuration: Linux OS (Ubuntu flavor) and
the Hadoop implementation of Map and Reduce
functionalities
• Hardware configuration: homogeneous &
heterogeneous machines
[Chart: Hadoop Average Speed; y-axis: average speed in seconds (0 to 60); x-axis: #machines (1 to 5); panels: Homogeneous Hardware, Heterogeneous Hardware]
38. Open Problems Related to Archival
Decisions
• Simulate Preservation Costs as a Function of Information
Granularity and Information Technology
• Optimal Utilization of Computational and Human
Resources
39. Open Problem: Archival Decision Support
• Decision support for forecasting preservation
costs
• How to predict computational and storage
requirements of preservation as a function
of technology variables and information
granularity?
• How to optimize computational hardware,
software, storage, and networking
investments?
40. Basic Questions About Information to be
Preserved
41. Challenges in Forecasting
• Volatility of software/hardware/storage media
• Updates: Windows operating systems since 2000: Two major new
releases, two minor service pack updates, around fifty security
patches since SP2
• Upgrades: Microsoft Office Pro for Windows
95/98/ME/2000/XP/2003/2007
• Media life expectancy: Optical ~ 5 years, Disk ~ 15 years, Microfiche ~
100, microfilm ~ 300, newspaper ~ 50, clay tablet ~ 10,000 (life
expectancy vs. information density – [P. Conway, 1996])
• Cost of software/hardware/storage media
• Operating System: Windows 3.1/95/98/NT/2000/XP/Vista: Windows
95 = $209; Windows NT = $280; Windows XP = $300; Windows Vista =
$399->$319 (2008)
• 128 MB of SDRAM: Year 1999 ~ $120 -> $40 -> $200-250 due to
Earthquake in Taiwan -> March 2000 ~ $55 -> March 2007 ~ $8.96
(flash card) - www.pricewatch.com (1TB ~ $109.95 as of 01/15/2009)
• High performance computers: 2006: DARPA awards approximately
$500 million to Cray and IBM; 2007: NSF $200 million to NCSA/IBM
42. Archival Decision Support
• Lack of forecasting models to predict preservation costs
• Our work: Understand the tradeoffs between information
value and computational/storage costs by providing
simulation frameworks
• Information granularity, organization, compression, encryption,
document format, ...
• Versus
• Cost of CPU for gathering information, for processing and for
input/output operations; cost of storage media, upgrades, storage
room, …
• Prototype simulation framework: Image Provenance To
Learn available for downloading from
http://isda.ncsa.uiuc.edu
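The tradeoff between information granularity and storage cost can be sketched as a toy calculation. This is not the Image Provenance To Learn framework; it is a minimal model under stated assumptions: the bytes-per-event figures and function name are hypothetical, the storage price is the 1 TB figure quoted in this talk for 01/15/2009, and CPU, upgrade and storage-room costs are ignored.

```python
# Storage price quoted in this talk (1 TB ~ $109.95 as of 01/15/2009).
PRICE_PER_TB_USD = 109.95
BYTES_PER_TB = 10 ** 12

# Hypothetical bytes stored per recorded event at each granularity level.
BYTES_PER_EVENT = {"interpreted": 200, "raw": 5_000, "snapshot": 1_000_000}


def storage_cost_usd(events_per_day, days, granularity,
                     price_per_tb=PRICE_PER_TB_USD):
    """Estimate the media cost of keeping every event at a given granularity."""
    total_bytes = events_per_day * days * BYTES_PER_EVENT[granularity]
    return total_bytes / BYTES_PER_TB * price_per_tb


# One year of 10,000 events per day at each level of detail.
for level in BYTES_PER_EVENT:
    print(level, round(storage_cost_usd(10_000, 365, level), 4))
```

Rerunning the loop with a different `price_per_tb` shows how a change in storage technology shifts the point at which finer granularity becomes affordable.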
43. Simulation Framework
[Diagram: a Decision Maker's actions pass through an Information Gathering and Storage Process (Preservation), which produces Provenance Information; that information feeds Information Retrieval and Learning (Reconstruction). System components: Image Viewer, Information Gathering System, Process Reconstruction System. Inset plot for Cost / Information Granularity Analysis: Value (linear vs. observed) against Cost (memory, CPU)]
46. Storage vs. Information Organization
Tradeoffs: Test Case
• Information granules include interpreted, raw and snapshots
• Files were not compressed
[Bar chart: Saved Size in bytes (log scale, 1 to 10,000,000) per Event Name, for two metadata models: RDF (Resource Description Framework metadata model) and Key Pair (XML metadata model). Event names: Change Auto Zoom, Change Gray Scale, Change RGB Band, Add Annotation, Mouse Clicked, Mouse Clicked, Magnification, Change Selection, Window Hidden, Change Gamma, Window Shown, New Image, Change Visible Region, Change Zoom Factor, Window Created]
47. Open Problems Related to Automating
Archival Processing for Preservation
1. Discovery of Relationships Among Electronic Records
2. Information Preserving Conversions of Electronic Records
3. Sampling, Authenticity and Integrity Verification of a Collection
of Temporally Changing Records
48. Open Problem 1: Discovering
Relationships Among Files
• How should one establish relationships among electronic
records coming from disparate sources or from the same
source at multiple time instances?
• How to extract metadata?
• What ontology to use to represent the extracted
metadata?
• How to automate metadata extraction from multiple data
types, e.g., 2D drawings and 3D CAD models?
• How to discover relationships between electronic records
corresponding to the same physical objects but different
multidimensional observations?
• Need to Understand the Complexity of the Problem
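Once metadata has been extracted, one simple way to look for relationships is to express it as (subject, property, value) triples and link records that share a property value. The sketch below assumes metadata is already available as dictionaries; the file names, property names and function names are hypothetical, and real collections would need an ontology to reconcile differing property names.

```python
from collections import defaultdict
from itertools import combinations


def to_triples(record_id, metadata):
    """Turn one record's metadata dictionary into (subject, property, value) triples."""
    return [(record_id, prop, value) for prop, value in metadata.items()]


def related_records(triples):
    """Map each pair of records to the (property, value) pairs they have in common."""
    index = defaultdict(set)
    for subject, prop, value in triples:
        index[(prop, value)].add(subject)
    links = defaultdict(set)
    for key, subjects in index.items():
        for a, b in combinations(sorted(subjects), 2):
            links[(a, b)].add(key)
    return dict(links)


# Hypothetical records: one 2D drawing and two 3D CAD models.
records = {
    "drawing_017.tif": {"part": "hull", "project": "TWR 841"},
    "hull_model.stp": {"part": "hull", "project": "TWR 841"},
    "engine_model.stp": {"part": "engine", "project": "TWR 841"},
}
triples = [t for rid, meta in records.items() for t in to_triples(rid, meta)]
for pair, shared in related_records(triples).items():
    print(pair, sorted(shared))
```

The pairwise step grows quadratically with the number of records sharing a value, which is one reason the complexity of the problem needs to be understood before automating it.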
49. Metadata Extraction: Complexity & Size
the Crandon Mine Reports
from 1981 till 2003
http://digicoll.library.wisc.edu/cgi-bin/EcoNatRes/EcoNatRes-idx?type=browse&scope=ECONATRES.CRANDONMINE
RDF triples extracted using Aperture and visualized using RDF-Gravity
(red – edges, green – literal values, violet – properties)
50. Relationships Among Multiple Data Types
• Example Data: Torpedo Weapon Retriever 841
• 784 existing 2D image drawings and N>22 3D CAD
models
• How to establish relationships among the 3D
CAD models and 2D image drawings during a
product lifecycle?
Hypothetical Distribution of 3D CAD models for
TWR 841
51. Understanding Challenges in Automation
[Diagram: automation stages: Representation, Descriptors (metadata), OCR, Relationship Discovery]
52. Open Problem 2: Conversions of
Electronic Records
• Conversions of electronic records are needed because
• Visual exploration depends on various software
packages
• Many formats are retired (deprecated) over time
• A subset of formats is selected for preservation
purposes
• How to measure the degree of information
preservation when files are converted from format A to
format B?
• During conversions, information could be lost, added or modified
• What is the importance of each byte, object, etc. ?
• How to introduce a framework for measuring the
quality of conversion and visualization software?
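A first approximation of the degree of information preservation is to compare what a file contains before and after conversion and report what was lost, added or modified. The sketch below assumes each file can be summarized as a dictionary of named properties; the property names, the equal weighting of every property and the function name are all assumptions, and the question of how important each object is remains open.

```python
def conversion_report(before, after):
    """Compare property dictionaries taken before and after a format conversion."""
    lost = sorted(before.keys() - after.keys())
    added = sorted(after.keys() - before.keys())
    modified = sorted(k for k in before.keys() & after.keys()
                      if before[k] != after[k])
    preserved = len(before) - len(lost) - len(modified)
    # Every property counts equally here; real appraisal would weight them.
    fraction = preserved / len(before) if before else 1.0
    return {"lost": lost, "added": added, "modified": modified,
            "preserved_fraction": fraction}


# Hypothetical summaries of a 3D model before and after a round trip.
original = {"vertices": 1200, "faces": 2400, "colors": "per-face", "units": "mm"}
converted = {"vertices": 1200, "faces": 2396, "units": "mm", "normals": "generated"}
print(conversion_report(original, converted))
```

Running the same report over a chain of conversions shows at which step information disappears.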
53. Example: Conversion of X3D to STEP to X3D
[Diagram: round-trip conversion among X3D, WRL and STEP files using the software X3dToVrml97, A3D Reviewer and Vrml97ToX3d; one result is annotated “Nothing!”]
54. Automation of 3D File Format Mapping &
Conversion
55. Open Problem 3: Sampling,
Integrity and Authenticity
• Given finite resources and increasing amounts of electronic
records, automation of sampling, integrity and authenticity
verification is very much needed
• What are the criteria for sampling a collection of temporally
changing versions of ‘the same’ document?
• Authenticity
• Integrity
• Information content
• How to measure a degree of authenticity?
• Computers might assign inaccurate time stamps to records
• How to detect integrity failures?
• A record containing a female patient with prostate cancer
• How to incorporate constraints into sampling?
• Storage space, compression, computational cost, etc.
56. Example:Temporal Ranking and Integrity
Verification
• Chronological ranking
based on time stamps of
files
• Last modification (current
implementation)
• Ranking can be
changed by a human
• Content referring to
dates can be used for
integrity verification
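The ranking and the date-based check above can be sketched as follows. This is an illustration rather than the implementation referred to on the slide: the function names are hypothetical, dates are reduced to four-digit years found by a simple pattern, and the time-stamp lookup is a parameter so the ranking can be tried without real files.

```python
import os
import re
from datetime import datetime

# Assumed pattern: four-digit years from 1900 to 2099 stand in for "dates".
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def rank_by_modification(paths, mtime=os.path.getmtime):
    """Order files chronologically by last-modification time stamp."""
    return sorted(paths, key=mtime)


def years_in_text(text):
    """Return every year mentioned in the text, in order of appearance."""
    return [int(match) for match in YEAR_PATTERN.findall(text)]


def content_postdates_stamp(text, stamp_seconds):
    """Flag content that mentions a year later than the file's time stamp."""
    stamp_year = datetime.fromtimestamp(stamp_seconds).year
    return any(year > stamp_year for year in years_in_text(text))


# Hypothetical time stamps (seconds since the epoch) for three versions.
stamps = {"v1.pdf": 100.0, "v2.pdf": 300.0, "v3.pdf": 200.0}
print(rank_by_modification(list(stamps), mtime=stamps.get))
print(content_postdates_stamp("Revised in 2003.",
                              datetime(2000, 6, 15).timestamp()))
```

A flagged record is only a candidate for review, since a document may legitimately refer to future dates; this is one reason a human can override the ranking.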
[Figure: document versions arranged along a TIME axis]
57. Rules and Attributes for Integrity Verification
• Document integrity attributes?
• appearance or disappearance of document images
• appearance and disappearance of dates embedded in
documents
• file size
• count of image groups
• number of sentences
• average value of dates found in a document
• Rules?
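Several of the attributes listed above can be computed per document version and then compared between versions. The sketch below is one possible reading, not a rule set from the talk: it works on already extracted text, takes the image count from the caller (extracting images requires a format-specific parser), reduces dates to four-digit years, and all names are hypothetical. What change should count as an integrity failure is left to the rules.

```python
import re


def integrity_attributes(text, image_count):
    """Compute a few integrity attributes for one version of a document."""
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "size": len(text.encode("utf-8")),   # stand-in for file size
        "images": image_count,               # supplied by the caller
        "sentences": len(sentences),
        "dates": sorted(set(years)),
        "mean_date": sum(years) / len(years) if years else None,
    }


def changed_attributes(old, new):
    """Report every attribute whose value differs between two versions."""
    return {k: (old[k], new[k]) for k in old if old[k] != new[k]}


v1 = integrity_attributes("Filed in 1981. Reviewed in 2003.", image_count=2)
v2 = integrity_attributes("Filed in 1981.", image_count=1)
print(changed_attributes(v1, v2))
```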
58. Summary
• Introduced a set of open problems
related to
• Appraisal of electronic records
• Archival forecasting of preservation
costs
• Automation of processing for
preservation
• Examples used for illustrating the open
problems from our research just
scratch the surface of some of the open
problems
59. Observations
• Many stakeholders are already aware of some of the
open problems including government agencies and
companies
• As all government agencies have been
computerized, the continuity and functioning of the
agencies depend on preservation and reconstruction
of electronic records
• Right now, we are at the beginning of the
exponential growth of electronic records (many more
electronic records will be coming)
• Some scientific fields are already facing real time
decisions about preserving electronic records (e.g.,
astronomers)
60. Future Vision
• It is envisioned that the preservation and
reconstruction of electronic records have to
follow different paradigms that incorporate
• Scalability (heterogeneity, dimensionality
and volume)
• Forecasting of preservation costs
• New level of automation and quality
control in processing for preservation
purposes
• The field of electronic record management
and preservation needs forward looking
solutions to stay abreast with the dynamics
of digital information
61. References to Presented Research
• Bajcsy P., R. Kooper and S-C. Lee, “Understanding Preservation and Reconstruction Requirements for Computer
Assisted Decision Processes,” ACM Journal on Computers and Cultural Heritage (JOCCH), (submitted October 2008).
• Bajcsy P., “A Perspective on Cyberinfrastructure for Water Research Driven by Informatics Methodologies,” Geography
Compass, Volume 2, Issue 6 (p 2040-2061), 2008 Blackwell Publishing Ltd, URL: http://www3.interscience.wiley.com/cgi-bin/fulltext/121478978/PDFSTART
• Bajcsy P., R. Kooper, L. Marini and J. Myers, “Community-Scale Cyberinfrastructure for Exploratory Science,” In:
Cyberinfrastructure Technologies and Applications book, Editor: Junwei Cao, Nova Science Publishers, Inc., Chapter 12,
2009; URL: https://www.novapublishers.com/catalog/product_info.php?products_id=8011
• McHenry K. and P. Bajcsy, “An Overview of 3D Data Content, File Formats and Viewers,” Technical Report NCSA-
ISDA08-002, October 31, 2008
• McFadden W., K. McHenry, R. Kooper, M. Ondrejcek, A. Yahja and P. Bajcsy, “Advanced Information Systems for
Archival Appraisals of Contemporary Documents,” the 4th IEEE International Conference on e-Science, December 8-12,
2008, Indianapolis, IN.
• Lee S-C, W. McFadden and P. Bajcsy, “Text, Image and Vector Graphics Based Appraisal of Contemporary
Documents,” The Seventh International Conference on Machine Learning and Applications, December 11-13, 2008, San
Diego, CA.
• Bajcsy P. and S-C Lee, “Computer Assisted Appraisal of Contemporary PDF Documents,” ARCHIVES 2008: Archival
R/Evolution & Identities 72nd Annual Meeting Pre-conference Programs: August 24-27, 2008, San Francisco, CA.
• Lee S-C. and P. Bajcsy, “Understanding Challenges in Preserving and Reconstructing Computer-Assisted Medical
Decision Processes,” the Workshop on Machine Learning in Biomedicine and Bioinformatics (MLBB07) of the 2007
International Conference on Machine Learning and Application (ICMLA07), Cincinnati, Ohio, December 13-15, 2007.
• Bajcsy P. and D. Clutter, “Gathering and Analyzing Information about Decision Making Processes Using Geospatial
Electronic Records,” the 2006 Winter Federation of Earth Science Information Partners (“Federation”) Conference,
poster, January 4-6, 2006 in Washington, DC.
62. Questions
• Project URL:
http://isda.ncsa.uiuc.edu/NARA/index.html
and http://isda.ncsa.uiuc.edu/CompTradeoffs/
• Publications – see our URL at
http://isda.ncsa.uiuc.edu/publications
• Peter Bajcsy; email: pbajcsy@ncsa.uiuc.edu