To Preserve or Not to Preserve?
The Challenges in Appraising Electronic Records
 Peter Bajcsy, PhD
 - Research Scientist, NCSA
 - Adjunct Assistant Professor ECE & CS at
 UIUC
 - Associate Director Center for Humanities,
 Social Sciences and Arts (CHASS), Illinois
 Informatics Institute (I3), UIUC

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

      Date: January 21st, 2009
Acknowledgement

   • This research was partially supported by a National
     Archives and Records Administration (NARA) supplement
     to NSF PACI cooperative agreement CA #SCI-9619019
     and NCSA Industrial Partners.
   • The views and conclusions contained in this document
     are those of the authors and should not be interpreted as
     representing the official policies, either expressed or
     implied, of the National Archives and Records
     Administration, or the U.S. government.
   • Contributions by: Peter Bajcsy, Kenton McHenry, Rob
     Kooper, Michal Ondrejcek, William McFadden, Sang-Chul
     Lee, David Clutter and Alex Yahja


Imaginations unbound
Outline

• Introduction
• Stakeholders
• Conceptual Challenges
• Some Open Problems
• Research Examples Illustrating Open
  Problems
• Summary, Observations and Future
  Vision
Introduction
• Two Trends in the Context of Decision Processes
  (Government, Medical, Natural Disasters, …)
   • Decision processes are moving from paper-based
     to electronic-record-based (i.e., computer-assisted
     decision processes)
   • Electronic records depend on rapidly changing
     information technology
   • Whether decisions are optimal depends on the available knowledge
• Any learning from electronic records depends on
  preservation and reconstruction of the records, as
  well as on quality and granularity of the information

Fundamental Problems

• Limited learning from historical records
  today
   • This is often due to missing information and
     the high uncertainty or low quality of historical
     records.
• Lack of understanding of how to preserve and
  reconstruct data and decision processes.
   • This is due to insufficient
     forecasting/simulation capabilities.

To Be Preserved!
[Diagram labels: AGENCY; Information transfer?; Digital representation of information & knowledge; Preservation; ARCHIVES]
Motivation
 • The problems related to preservation of electronic records
   are only going to become more serious
      • Information becomes more heterogeneous and complex
          • More data types
          • Higher dimensional data
          • New file formats
      • Volumes of electronic records have been increasing and will
        continue to grow
          • The model of a paperless office (4 years of Bush’s email > 8
            years of Clinton’s email)
          • The paradigm shift to eScience
      • Digital information technology has been changing faster than
        any previous preservation media
          • The time scale of electronic media is ephemeral in comparison
            with paper or clay tablets
Example of Preservation Needs in Medicine

• Short term:
   • Medical practice requires comparing patients’
     records acquired today with the patients’
     records from 5, 10, 50 or 70 years ago in order to
     assess functional, structural or low level
     biological changes due to diseases,
     treatments and/or aging.
• Long term:
   • Genealogy studies compare data sets over
     several hundreds and thousands of years
Who Are the Stakeholders?
 • Multiple institutions and organizations are active in the area
   of medical record preservation
    • National Library of Medicine (NLM)
    • Research Information Network (RIN)
    • Medical Research Council (MRC) in UK
    • National Archives and Records Administration (NARA)
 • Identified common goals:
    • Seamless, uninterrupted access to expanding collections
      of biomedical data, medical knowledge, and health
      information
    • Preserve medical record collections in highly usable
      forms and contribute to comprehensive strategies for
      preservation of biomedical information in the U.S. and
      worldwide.
Other Stakeholders
• Government agencies
   • Prediction of patterns signaling natural disasters
     based on historical measurements
   • Detection of terrorist attacks based on past
     experience
   • Learning about other planets from past space shuttle
     missions
   • Preservation of cultural heritage
• Companies
   • Preservation of engineering drawings and
     architectural designs – Boeing, John Deere, GM
   • Preservation of simulation results – Caterpillar, Ford
   • Backward compatibility of hardware/software – GE
NARA as One of the Key Stakeholders
• According to The Strategic Plan of The
  National Archives and Records
  Administration 2006–2016, “Preserving the
  Past to Protect the Future”:
  • “Strategic Goal: We will preserve and
    process records to ensure access by the
    public as soon as legally possible”
     • “D. We will improve the efficiency with
       which we manage our holdings from
       the time they are scheduled through
       accessioning, processing, storage,
       preservation, and public use.”
Conceptual Challenges
• Learning Requires Reusing Electronic Records
   • How to enable and support preservation and
     reconstruction of electronic records?
• Advancing Sensors and Instruments Leads to New
  Types of High Dimensional Data and Large Volumes
   • How to design preservation methodologies that
     scale well?
• Process to Enable Learning over Time from
  Electronic Records Requires Large Financial
  Investments
   • How to minimize computational hardware,
     software, and storage cost and maximize the
     amount of preserved information?
What Are The Key Open Problems?




Some Open Problems -> Intellectual Merit
• Appraisal Methodology
      • Appraisal by Visual Exploration
      • Support of Appraisals by Enabling Comparisons
      • Scalability of Appraisals with Increasing Heterogeneity of
        Information, Dimensionality of Data and Volume of Electronic
        Records
• Support of Archival Decisions
      • Simulate Preservation Costs as a Function of Information
        Granularity and Information Technology
      • Optimal Utilization of Computational and Human Resources
• Automation of Processing for Preservation
      • Discovery of Relationships Among Electronic Records
      • Information Preserving Conversions of Electronic Records
      • Sampling, Authenticity and Integrity Verification of a Collection of
        Temporally Changing Records
Broader Impacts
[Diagram labels: Electronic Records; Process to Enable Learning Over Time; +$; Knowledge; -$; Optimal Decision Making]
Concrete Research Examples Illustrating
  Open Problems
Open Problems Related to Appraisal
 Methodology
       1. Appraisal by Visual Exploration
       2. Support of Appraisals by Enabling Comparisons
       3. Scalability of Appraisals with Increasing Heterogeneity of
          Information, Dimensionality of Data and Volume of Electronic
          Records


Definition of Appraisal in Archival Context

• Appraisal: the process of determining the value and thus
  the final disposition of Federal records, making them either
  temporary or permanent.
   • See http://www.archives.gov/records-mgmt/initiatives/appraisal.html
• The basis of appraisal decisions may include
   • the records' provenance and content,
   • the records' authenticity and reliability,
   • the records' order and completeness,
   • the records' condition and costs to preserve them, and
   • the records' intrinsic value
Open Problem 1: Appraisal by Visual
   Exploration

• How to visualize the transition from raw data to information?
   • Raw data (byte stream) -> Information: 0F0 -> (R,G,B) -> GREEN
• How to encode and represent heterogeneous information for
  visual exploration and for computer-assisted operations?
   • Encoding (e.g., a shape consisting of a set of Bezier
     curves is encoded by a set of straight lines)
   • Representation (e.g., colors are represented by an
     ordered sequence of intensity values from all bands)
• How to summarize representations for visual exploration?
   • Frequency of occurrence of primitives
   • Local and global summarizations
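
The raw-data-to-information chain on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming that 0F0 is a three-digit hexadecimal color with one digit per band; the function names and the small color-name table are hypothetical and are not part of the presented software.

```python
# Hypothetical lookup table; a real viewer would use a richer color model.
COLOR_NAMES = {
    (255, 0, 0): "RED",
    (0, 255, 0): "GREEN",
    (0, 0, 255): "BLUE",
}


def decode_rgb(raw):
    """Interpret a 3-digit hex string as (R, G, B), one hex digit per band."""
    if len(raw) != 3:
        raise ValueError("expected exactly three hex digits")
    # Scale each 4-bit digit (0..15) to the 8-bit range (0..255): 0xF -> 0xFF.
    r, g, b = (int(digit, 16) * 17 for digit in raw)
    return (r, g, b)


def name_color(rgb):
    """Map an (R, G, B) triple to a human-readable name, if one is known."""
    return COLOR_NAMES.get(rgb, "UNKNOWN")


rgb = decode_rgb("0F0")      # byte stream -> band intensities
label = name_color(rgb)      # band intensities -> information
```

The same bytes carry no meaning until an encoding (hex digits) and a representation (ordered band intensities) are assumed, which is the point the slide makes about appraisal by visual exploration.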

Example: Adobe Portable Document
   Format (PDF)
 • Why PDF? PDF is just an example of a container
      • Office environment (Adobe PDF, PS, MS Word, HTML, …)
      • Satellite measurements (HDF, netCDF, …)
 [Figure labels: 3D (Adobe Library 6.0); Movie (Adobe Library 7.0)]
Exploration of PDF Documents Using PDF
Viewer
• PDF Viewer presents information as a set of pages with
  their layouts
• PDF Viewer renders layers of internal objects
  (components) and hence only the top layer is visible
Needed Exploration of PDF Components
• There is no support for archival appraisals that would
  include visual exploration of components in a document
  (a container of components)

• Needed viewers for appraisal analyses that present
  information stored in a container (e.g., PDF) as a set of
  components and their characteristics
   • Text – word frequency
   • Images (rasters) – color frequency (histogram)
   • Vector graphics – line frequency
• Exploration for appraisal analyses needs to include
  visible and invisible objects
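
The per-component summaries listed above (word frequency for text, color frequency for images) can be sketched as follows. This is an illustrative sketch only, assuming the text and pixel values have already been extracted from the container; `word_frequency`, `color_frequency` and the `ignore` parameter are hypothetical names, not the interface of the presented viewer.

```python
import re
from collections import Counter


def word_frequency(text, ignore=()):
    """Count word occurrences, skipping words on the 'ignore' list."""
    ignored = {w.lower() for w in ignore}
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in ignored)


def color_frequency(pixels, ignore=()):
    """Count occurrences of each (R, G, B) value in a flat list of pixels."""
    ignored = set(ignore)
    return Counter(p for p in pixels if p not in ignored)


# Made-up inputs, for illustration only.
text_summary = word_frequency("The shuttle and the foam", ignore=["the", "and"])
pixel_summary = color_frequency(
    [(0, 255, 0), (0, 255, 0), (255, 255, 255)],
    ignore=[(255, 255, 255)],   # e.g. ignore the white page background
)
```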
Exploration of Text Components
[Screenshot panels: Loaded files; Occurrence of words; Occurrence of numbers; “Ignore” words]
Exploration of Image Components
[Screenshot panels: Loaded files; List of images; Occurrence of colors; Preview; “Ignore” colors]
Exploration of Vector Graphics
   Components
[Screenshot panels: Loaded files; Preview; Occurrence of v/h lines]
Exploration of Visible And Invisible Objects
[Screenshot: objects intersected at the mouse click location]
Open Problem 2: Support of Appraisals
by Enabling Comparisons
   • How to compare containers with heterogeneous
     information?
      • Methodology
      • Metrics
      • Weighting factors for fusion
   • How to quantify differences between the same
     type of information?
      • Encodings and Representations
      • Metrics
      • Local versus global differences
Comparisons
Methodology
[Diagram labels: Partial solutions in literature (Ref.); CAPTCHA; Open problems; Relationship to Permanent Records]
Experimental Example
INPUT = 10 PDF docs in two groups (4 & 6 members)
   Group of 4: UNIQUE ID = 1,2,3,4
   Group of 6: UNIQUE ID = 5,6,7,8,9,10
Comparative Experimental Results

INPUT = 10 PDF docs (6 & 4 members in each Group)
[Figure panels: Vector-based similarity; Text-based similarity; Image-based similarity]
Comparative Experimental Results
[Figure panels: Vector Graphics Similarity and Word Similarity Combined; Portion of Document Surface Allotted to Each Document Feature; Comparison Using Combination of Document Features in Proportion to Coverage]
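
One way to read "combination of document features in proportion to coverage" is as a weighted average of per-feature similarity scores, where each weight is the fraction of the document surface that the feature occupies. The sketch below assumes that interpretation; the function name, the dictionary layout and the numbers in the usage lines are hypothetical and are not taken from the experiment.

```python
def combined_similarity(similarities, coverage):
    """Weighted average of per-feature similarities, weighted by coverage.

    similarities: feature name -> similarity in [0, 1]
    coverage:     feature name -> portion of document surface (any positive scale)
    """
    total = sum(coverage[f] for f in similarities)
    if total == 0:
        raise ValueError("coverage must not sum to zero")
    return sum(similarities[f] * coverage[f] for f in similarities) / total


# Made-up example values, for illustration only.
sims = {"text": 1.0, "image": 0.5, "vector": 0.0}
cover = {"text": 0.5, "image": 0.25, "vector": 0.25}
score = combined_similarity(sims, cover)
```

With these made-up inputs, text dominates the score because it covers half of the page; a document that is mostly drawings would instead be driven by the vector-graphics similarity.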
Accuracy Comparisons

  Method                    Average         Average         Average
                            Similarity of   Similarity of   Similarity Across
                            Group 1         Group 2         Group 1 & 2
  TEXT ONLY                 1               0.489           0
  TEXT & IMAGE & GRAPHICS   0.906           0.520           0.075

 One refers to high similarity & zero refers to low similarity

 Conclusions:
 • Differences in similarity are up to 10% of the score
 • Documents in Group 2 would likely be misclassified, since 0.5
   similarity would be the threshold between similar and
   dissimilar documents
Open Problem 3: Scalability of
     Appraisals
• Scalability of appraisals with increasing
  heterogeneity of information,
  dimensionality of data and volume of
  electronic records
   • How should appraisal process change
     as 3D data is added to file containers?
   • How should appraisal process change
     as 3D+time, 2D+spectrum,
     3D+time+spectrum, nD, …
   • How should appraisal operations be
     designed to accommodate growing
     volume of electronic records?
Approaches to Computational Scalability of
   Document Appraisals
 • Options for parallel processing
      • message-passing interface (MPI)
          • MPI is designed for the coordination of a program running as multiple
            processes in a distributed memory environment by passing
            control messages.
      • open multi-processing (OpenMP)
          • OpenMP is intended for shared memory machines. It uses a
            multithreading approach where the master thread forks any
            number of slave threads.
      • Map Reduce parallel programming paradigm for commodity
        clusters
          • It lets programmers write simple Map and Reduce
            functions, which are then automatically parallelized without
            requiring the programmers to code the details of parallel
            processes and communications
 • Specialized Hardware: FPGA, Cell processors, GPU
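
The Map Reduce option can be illustrated with a word-count sketch in plain Python. This is not Hadoop code: it only mimics the map, shuffle and reduce phases inside a single process, and all function names and sample documents are illustrative assumptions.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group values by key, as the framework would do between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(documents):
    pairs = []
    for doc_id, text in documents.items():
        # Each document is mapped independently, so this loop could be
        # distributed across machines by a real Map Reduce framework.
        pairs.extend(map_words(doc_id, text))
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


counts = word_count({"a.pdf": "foam debris foam", "b.pdf": "Debris report"})
```

The programmer writes only `map_words` and `reduce_counts`; in a real framework the grouping and the distribution of work are handled automatically, which is the property the slide highlights.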
Computational Requirements for Executing the Methodology
[Diagram: yellow indicates computations; labels: Relationship to Permanent Records; Appraisal & Sampling]
Hardware & Software Dependencies with
   Hadoop
  • Test data: 15 PDF files from the Columbia investigation
    web site at http://caib.nasa.gov/.
  • Software configuration: Linux OS (Ubuntu flavor) and
    the Hadoop implementation of Map and Reduce
    functionalities
  • Hardware configuration: homogeneous &
    heterogeneous machines
  [Charts: "Hadoop Average Speed", average speed in seconds (axis 0 to 60) versus #machines (1 to 5); panels: Homogeneous Hardware, Heterogeneous Hardware]
Open Problems Related to Archival
 Decisions
    •Simulate Preservation Costs as a Function of Information
    Granularity and Information Technology
    •Optimal Utilization of Computational and Human
    Resources
Open Problem: Archival Decision Support

• Decision support for forecasting preservation
  costs
   • How to predict computational and storage
     requirements of preservation as a function
     of technology variables and information
     granularity?
   • How to optimize computational hardware,
     software, storage, and networking
     investments?
Basic Questions About Information to be
   Preserved




Challenges in Forecasting
•   Volatility of software/hardware/storage media
     • Updates: Windows operating systems since 2000: Two major new
       releases, two minor service pack updates, around fifty security
       patches since SP2
     • Upgrades: Microsoft Office Pro for Windows
       95/98/ME/2000/XP/2003/2007
     • Media life expectancy: Optical ~ 5 years, Disk ~ 15 years, Microfiche ~
       100, microfilm ~ 300, newspaper ~ 50, clay tablet ~ 10,000 (life
       expectancy vs. information density – [P. Conway, 1996])
•   Cost of software/hardware/storage media
     • Operating System: Windows 3.1/95/98/NT/2000/XP/Vista: Windows
       95 = $209; Windows NT = $280; Windows XP = $300; Windows Vista =
       $399->$319 (2008)
     • 128 MB of SDRAM: Year 1999 ~ $120 -> $40 -> $200-250 due to
       Earthquake in Taiwan -> March 2000 ~ $55 -> March 2007 ~ $8.96
       (flash card) - www.pricewatch.com (1TB ~$109.95 as of 01/15/2009)
     • High performance computers: 2006: DARPA awards approximately
       $500 million to Cray and IBM; 2007 NSF $200 million to NCSA/IBM
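
The media life expectancies above translate directly into how often a collection must be copied to fresh media. The following back-of-the-envelope sketch assumes that a record is migrated once each time its medium reaches the life expectancy quoted on the slide; the 100-year retention period and the function name are assumptions made for illustration, not part of the presented forecasting work.

```python
import math

# Approximate life expectancies in years, as listed on the slide.
LIFE_EXPECTANCY_YEARS = {
    "optical": 5,
    "disk": 15,
    "newspaper": 50,
    "microfiche": 100,
    "microfilm": 300,
    "clay tablet": 10_000,
}


def migrations_needed(medium, retention_years):
    """Number of copies to fresh media needed to cover the retention period."""
    life = LIFE_EXPECTANCY_YEARS[medium]
    # The first medium covers `life` years; each migration adds another `life`.
    return max(0, math.ceil(retention_years / life) - 1)


# Assumed retention period of 100 years, for illustration only.
plan = {m: migrations_needed(m, 100) for m in LIFE_EXPECTANCY_YEARS}
```

Under these assumptions, a century of retention requires many migrations on optical media and none on microfiche, which is one reason cost forecasts are sensitive to the choice of medium.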

Archival Decision Support

• Lack of forecasting models to predict preservation costs

• Our work: Understand the tradeoffs between information
  value and computational/storage costs by providing
  simulation frameworks
   • Information granularity, organization, compression, encryption,
     document format, ...
   • Versus
   • Cost of CPU for gathering information, for processing and for
     input/output operations; cost of storage media, upgrades, storage
     room, …
• Prototype simulation framework: Image Provenance To
  Learn available for downloading from
  http://isda.ncsa.uiuc.edu
Simulation Framework
[Diagram labels: Decision Maker; Information Gathering and Storage; Preservation; Information Retrieval and Process Reconstruction; Learning; Value; Provenance Information; Image Viewer; Information Gathering System; Process Reconstruction System]
[Plot: Value versus Cost (memory, CPU), with "linear" and "observed" curves; caption: Cost / Information Granularity Analysis]
Image Event Category Tracker



[Screenshot labels: Events; Summary of Events; Viewed Area; Storage; Time]
Information Granularity
Storage vs. Information Organization Tradeoffs: Test Case
 • Information granules include interpreted, raw and snapshots
 • Files were not compressed
 • RDF = Resource Description Framework Metadata Model
 • Key pair = XML Metadata Model
 [Bar chart: saved size in bytes (log scale, 1 to 10,000,000) per event name, for the RDF and key pair models. Events: Change Auto Zoom, Change Gray Scale, Change RGB Band, Add Annotation, Mouse Clicked, Mouse Clicked, Magnification, Change Selection, Window Hidden, Change Gamma, Window Shown, New Image, Change Visible Region, Change Zoom Factor, Window Created]
Open Problems Related to Automating
 Archival Processing for Preservation
       1. Discovery of Relationships Among Electronic Records
       2. Information Preserving Conversions of Electronic Records
       3. Sampling, Authenticity and Integrity Verification of a Collection
          of Temporally Changing Records



Open Problem 1: Discovering
   Relationships Among Files
• How should one establish relationships among electronic
  records coming from disparate sources or from the same
  source at multiple time instances?
   • How to extract metadata?
   • What ontology to use to represent the extracted
     metadata?
   • How to automate metadata extraction from multiple data
     types, e.g., 2D drawings and 3D CAD models?
   • How to discover relationships between electronic records
     corresponding to the same physical objects but different
     multidimensional observations?
• Need to Understand the Complexity of the Problem
Metadata Extraction: Complexity & Size

Data: the Crandon Mine Reports from 1981 till 2003
http://digicoll.library.wisc.edu/cgi-bin/EcoNatRes/EcoNatRes-idx?type=browse&scope=ECONATRES.CRANDONMINE
[Figure: RDF triples extracted using Aperture and visualized using RDF-Gravity (red: edges, green: literal values, violet: properties)]
Relationships Among Multiple Data Types
  • Example Data: Torpedo Weapon Retriever 841
       • 784 existing 2D image drawings and N>22 3D CAD
         models
  • How to establish relationships among the 3D
    CAD models and 2D image drawings during a
    product lifecycle?




   [Figure: Hypothetical Distribution of 3D CAD models for TWR 841]
Understanding Challenges in Automation




[Diagram labels: OCR; Relationship Discovery; Descriptors (metadata); Representation]
Open Problem 2: Conversions of
   Electronic Records
  • Conversions of electronic records are needed because
     • Visual exploration depends on various software
       packages
     • Many formats are retired (deprecated) over time
     • A subset of formats is selected for preservation
       purposes
  • How to measure the degree of information
    preservation when files are converted from format A to
    format B?
       • During conversions, information could be lost, added or modified
       • What is the importance of each byte, object, etc.?
  • How to introduce a framework for measuring the
    quality of conversion and visualization software?
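
One simple way to quantify what a conversion loses, adds or modifies is to compare the objects of a file before and after a round trip. The sketch below assumes that each file has already been parsed into a dictionary mapping an object identifier to its value; the parsing step, the function name and the example objects are hypothetical, and every object is given equal importance, which the slide itself leaves as an open question.

```python
def conversion_report(before, after):
    """Compare two {object_id: value} dictionaries from before/after conversion."""
    lost = sorted(k for k in before if k not in after)
    added = sorted(k for k in after if k not in before)
    modified = sorted(k for k in before if k in after and before[k] != after[k])
    preserved = len(before) - len(lost) - len(modified)
    # Fraction of original objects that survived unchanged (equal weights assumed).
    ratio = preserved / len(before) if before else 1.0
    return {"lost": lost, "added": added, "modified": modified,
            "preserved_ratio": ratio}


# Made-up objects standing in for a parsed 3D model, for illustration only.
original = {"mesh1": "cube", "mesh2": "sphere", "color1": "green", "label": "part A"}
round_trip = {"mesh1": "cube", "mesh2": "ellipsoid", "color1": "green", "unit": "mm"}
report = conversion_report(original, round_trip)
```

A weighted variant would replace the equal-weight ratio with per-object importance values, which is exactly the "importance of each byte, object, etc." question raised above.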
Example: Conversion of X3D to STEP to X3D

[Diagram: X3D, WRL and STEP files converted with the software X3dToVrml97, A3D Reviewer and Vrml97ToX3d; one result is annotated "Nothing!"]
Automation of 3D File Format Mapping &
   Conversion
Open Problem 3: Sampling,
 Integrity and Authenticity
• Given finite resources and increasing amounts of electronic
  records, automation of sampling, integrity and authenticity
  verification is very much needed
• What are the criteria for sampling a collection of temporally
  changing versions of ‘the same’ document?
      • Authenticity
      • Integrity
      • Information content
• How to measure a degree of authenticity?
      • Computers might assign inaccurate time stamps to records
• How to detect integrity failures?
      • A record containing a female patient with prostate cancer
• How to incorporate constraints into sampling?
      • Storage space, compression, computational cost, etc.
Example: Temporal Ranking and Integrity
   Verification
• Chronological ranking
  based on time stamps of
  files
     • Last modification (current
       implementation)
• Ranking can be
  changed by a human
• Content referring to
  dates can be used for
  integrity verification
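
The ranking and the date-based check can be sketched as follows. This is a minimal illustration, not the presented implementation: it assumes each version is described by a name, a last-modification date and its extracted text, it recognizes only ISO-style dates (YYYY-MM-DD), and it flags a version when its content mentions a date later than its own time stamp. All names and sample records are hypothetical.

```python
import re
from datetime import date

# Only ISO-style dates are recognized in this sketch (an assumption).
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def rank_by_time(versions):
    """Chronological ranking by last-modification date (oldest first)."""
    return sorted(versions, key=lambda v: v["modified"])


def dates_in_text(text):
    """Extract ISO-style dates mentioned in the content."""
    return [date(int(y), int(m), int(d)) for y, m, d in DATE_PATTERN.findall(text)]


def integrity_warnings(versions):
    """Flag versions whose content refers to a date after their time stamp."""
    return [v["name"] for v in versions
            if any(d > v["modified"] for d in dates_in_text(v["text"]))]


# Made-up sample records, for illustration only.
versions = [
    {"name": "v2.pdf", "modified": date(2003, 8, 26), "text": "Report of 2003-08-26"},
    {"name": "v1.pdf", "modified": date(2003, 2, 10), "text": "Meeting on 2003-03-01"},
]
ranked = [v["name"] for v in rank_by_time(versions)]
flagged = integrity_warnings(versions)
```

A flagged version does not prove tampering; it may simply mean the computer assigned an inaccurate time stamp, which is why the slide leaves the final ranking open to human correction.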



[Figure axis label: TIME]
Rules and Attributes for Integrity Verification
   • Document integrity attributes?
        • appearance or disappearance of document images
        • appearance and disappearance of dates embedded in
          documents
        • file size
        • count of image groups
        • number of sentences
        • average value of dates found in a document
   • Rules?




Summary
• Introduced a set of open problems
  related to
   • Appraisal of electronic records
   • Archival forecasting of preservation
     costs
   • Automation of processing for
     preservation

• Examples used for illustrating the open
  problems from our research just
  scratch the surface of some of the open
  problems
Observations
• Many stakeholders are already aware of some of the
  open problems including government agencies and
  companies

• As all government agencies have been
  computerized, the continuity and functioning of the
  agencies depend on preservation and reconstruction
  of electronic records

• Right now, we are at the beginning of the
  exponential growth of electronic records (many more
  electronic records will be coming)

• Some scientific fields are already facing real-time
  decisions about preserving electronic records (e.g.,
  astronomers)
Future Vision
  • It is envisioned that the preservation and
    reconstruction of electronic records have to
    follow different paradigms that incorporate
     • Scalability (heterogeneity, dimensionality
        and volume)
     • Forecasting of preservation costs
     • New level of automation and quality
        control in processing for preservation
        purposes
  • The field of electronic record management
    and preservation needs forward-looking
    solutions to stay abreast of the dynamics
    of digital information
References to Presented Research

•   Bajcsy P., R. Kooper and S-C. Lee, “Understanding Preservation and Reconstruction Requirements for Computer
    Assisted Decision Processes,” ACM Journal on Computers and Cultural Heritage (JOCCH), (submitted October 2008).
•   Bajcsy P., “A Perspective on Cyberinfrastructure for Water Research Driven by Informatics Methodologies,” Geography
    Compass, Volume 2, Issue 6 (p 2040-2061), 2008 Blackwell Publishing Ltd, URL:
    http://www3.interscience.wiley.com/cgi-bin/fulltext/121478978/PDFSTART
•   Bajcsy P., R. Kooper, L. Marini and J. Myers, “Community-Scale Cyberinfrastructure for Exploratory Science,” In:
    Cyberinfrastructure Technologies and Applications book, Editor: Junwei Cao, Nova Science Publishers, Inc., Chapter 12,
    2009; URL: https://www.novapublishers.com/catalog/product_info.php?products_id=8011
•   McHenry K. and P. Bajcsy, “An Overview of 3D Data Content, File Formats and Viewers,” Technical Report
    NCSA-ISDA08-002, October 31, 2008.
•   McFadden W., K. McHenry, R. Kooper, M. Ondrejcek, A. Yahja and P. Bajcsy, “Advanced Information Systems for
    Archival Appraisals of Contemporary Documents,” the 4th IEEE International Conference on e-Science, December 8-12,
    2008, Indianapolis, IN.
•   Lee S-C., W. McFadden and P. Bajcsy, “Text, Image and Vector Graphics Based Appraisal of Contemporary
    Documents,” The Seventh International Conference on Machine Learning and Applications, December 11-13, 2008, San
    Diego, CA.
•   Bajcsy P. and S-C. Lee, “Computer Assisted Appraisal of Contemporary PDF Documents,” ARCHIVES 2008: Archival
    R/Evolution & Identities, 72nd Annual Meeting Pre-conference Programs: August 24-27, 2008, San Francisco, CA.
•   Lee S-C. and P. Bajcsy, “Understanding Challenges in Preserving and Reconstructing Computer-Assisted Medical
    Decision Processes,” the Workshop on Machine Learning in Biomedicine and Bioinformatics (MLBB07) of the 2007
    International Conference on Machine Learning and Applications (ICMLA07), Cincinnati, Ohio, December 13-15, 2007.
•   Bajcsy P. and D. Clutter, “Gathering and Analyzing Information about Decision Making Processes Using Geospatial
    Electronic Records,” the 2006 Winter Federation of Earth Science Information Partners (“Federation”) Conference,
    poster, January 4-6, 2006 in Washington, DC.



Questions


• Project URL:
  http://isda.ncsa.uiuc.edu/NARA/index.html
  and http://isda.ncsa.uiuc.edu/CompTradeoffs/

• Publications – see our URL at
  http://isda.ncsa.uiuc.edu/publications

• Peter Bajcsy; email: pbajcsy@ncsa.uiuc.edu

To Preserve Or Not To Preserve?

  • 1. To Preserve Or Not To Preserve? The Challenges in Appraising Electronic Records ect o c eco ds Peter Bajcsy, PhD - Research Scientist, NCSA - Adjunct Assistant Professor ECE & CS at UIUC - Associate Director Center for Humanities, Social Sciences and Arts (CHASS), Illinois Informatics Institute (I3), UIUC National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Date: January 21st, 2009
  • 2. Acknowledgement • This research was partially supported by a National Archive and Records Administration (NARA) supplement ( ) pp to NSF PACI cooperative agreement CA #SCI-9619019 and NCSA Industrial Partners. • The views and conclusions contained in this doc ment ie s concl sions document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Archive and Records Administration, or the U.S. government. • Contributions by: Peter Bajcsy Kenton McHenry Rob Bajcsy, McHenry, Kooper, Michal Ondrejcek, William McFadden, Sang- Chul Lee, David Clutter and Alex Yahja Imaginations unbound
  • 3. Outline • Introduction • Stakeholders • Conceptual Challenges • Some Open Problems • Research Examples Illustrating Open Problems • Summary Observations and Future Summary, Vision
  • 4. Introduction • Two Trends in the Context of Decision Processes (Government, Medical, Natural Disasters, …) • Decision processes are moving from paper based to electronic record based (~ computer assisted decision processes) • Electronic records depend on rapidly changing information technology • Decisions are optimal depending on knowledge • Any learning from electronic records depends on preservation and reconstruction of the records, as well as on quality and granularity of the information National Center for Supercomputing Applications
  • 5. Fundamental Problems • Limited learning from historical records today • It is often due to missing information and high uncertainty/ low quality of historical records. • Lack of understanding how to preserve and reconstruct data and decision processes. • It is due to insufficient forecasting/simulation capabilities. National Center for Supercomputing Applications
  • 6. To Be Preserved! Digital representation of Preservation information i f ti & knowledge Information transfer ? AGENCY ARCHIVES Imaginations unbound
  • 7. Motivation • The problems related to preservation of electronic records are only going to become more serious • Information becomes more heterogeneous and complex • More data types • Higher dimensional data • N New fil f file formats t • Volumes of electronic records have been increasing and will continue to grow • The model of a paperless office (4 years of Bush’s email > 8 years of Clinton’s email) • The paradigm shift to eScience • Digital information technology has been changing faster than any previous preservation media • The time scale of electronic media is ephemeral in comparison p p with paper or clay tablets Imaginations unbound
  • 8. Example of Preservation Needs in Medicine • Short term: • Medical practice requires comparing patients’ records acquired today with the patients’ records f d from 5 10 50 or 70 years i order t 5, 10, 50, in d to assess functional, structural or low level biological changes due to diseases diseases, treatments and/or aging. • Long term: • Genealogy studies compare data sets over several hundreds and thousands of years y National Center for Supercomputing Applications
  • 9. Who Are the Stakeholders? • Multiple institutions and organizations are active in the area of medical record preservation • National Library of Medicine (NLM) y ( ) • Research Information Network (RIN) • Medical Research Council (MRC) in UK • National Archives and Record Administration (NARA) • Identified common goals: • S Seamless, uninterrupted access t expanding collections l i t t d to di ll ti of biomedical data, medical knowledge, and health information • Preserve medical record collections in highly usable forms and contribute to comprehensive strategies for preservation of biomedical information in the U S and U.S. worldwide. National Center for Supercomputing Applications
  • 10. Other Stakeholders • Government agencies • Prediction of patterns signaling natural disasters based on hi t i l measurements b d historical t • Detection of terrorist attacks based on past experience • Learning about other planets from past space shuttle missions • Preservation of cultural heritage • Companies • P Preservation of engineering d ti f i i drawings and i d architectural designs – Boeing, John Deere, GM • Preservation of simulation results – Caterpillar, Ford p , • Backward compatibility of hardware/software - GE Imaginations unbound
  • 11. NARA as One of the Key Stakeholders • According to The Strategic Plan of The National Archives and Records Administration 2006–2016. “Preserving th Ad i i t ti 2006 2016 “P i the Past to Protect the Future” • “Strategic Goal: We will preserve and Strategic process records to ensure access by the p public as soon as legally p g y possible” • “D. We will improve the efficiency with which we manage our holdings from the time th are scheduled th th ti they h d l d through h accessioning, processing, storage, preservation, and public use.” use.
  • 12. Conceptual Challenges • Learning Requires Reusing Electronic Records • How to enable and support preservation and reconstruction of electronic records? • Advancing Sensors and Instruments Leads to New Types of High Dimensional Data and Large Volumes • How to design preservation methodologies that scale well? • Process to Enable Learning over Time from Electronic Records Requires Large Financial Investments • How to minimize computational hardware, software, software and storage cost and maximize the amount of preserved information? National Center for Supercomputing Applications
  • 13. What Are The Key Open Problems? Imaginations unbound
  • 14. Some Open Problems -> Intellectual Merit • Appraisal Methodology • Appraisal by Visual Exploration • Support of Appraisals by Enabling Comparisons • Scalability of Appraisals with Increasing Heterogeneity of Information, Dimensionality of Data and Volume of Electronic Records • Support of Archival Decisions • Simulate Preservation Costs as a Function of Information Granularity and I f G l it d Information Technology ti T h l • Optimal Utilization of Computational and Human Resources • Automation of Processing for Preservation g • Discovery of Relationships Among Electronic Records • Information Preserving Conversions of Electronic Records • Sampling Authenticity and Integrity Verification of a Collection of Sampling, Temporally Changing Records Imaginations unbound
  • 15. Broader Impacts Process to Enable Learning Over Time Electronic +$ Knowledge Records -$ Optimal Decision Making National Center for Supercomputing Applications
  • 16. Concrete Research Examples Illustrating Open Problems p Imaginations unbound
  • 17. Open Problems Related to Appraisal Methodology 1. Appraisal by Visual Exploration 2. Support of Appraisals by Enabling Comparisons 3. Scalability of Appraisals with Increasing Heterogeneity of Information, Dimensionality of Data and Volume of Electronic Records Imaginations unbound
  • 18. Definition of Appraisal in Archival Context • Appraisal -- the process of determining the value and thus the final disposition of Federal records making them either records, temporary or permanent. • See http://www.archives.gov/records- p g mgmt/initiatives/appraisal.html • The basis of appraisal decisions may include • th records'' provenance and content, the d d t t • the records' authenticity and reliability, • the records‘ order and completeness, records completeness • the records‘ condition and costs to preserve them, and • the records‘ intrinsic value records Imaginations unbound
  • 19. Open Problem 1: Appraisal by Visual Exploration • How to visualize the transition from raw data to information? • Raw data (Byte stream) -> Information 0F0 ->(R.G,B)->GREEN • How to encode and represent heterogeneous information for visual exploration and for computer assisted operations? computer-assisted • Encoding (e.g., shape consisting of a set of Bezier curves is encoded by a set of straight lines) • Representation (e.g., colors are represented by an ordered sequence of intensity values from all bands) • H How t summarize representations for visual exploration? to i t ti f i l l ti ? • Frequency of occurrence of primitives • Local and global summarizations Imaginations unbound
  • 20. Example: Adobe Portable Document Format (PDF) • Why PDF? - PDF is just an example of a container • Office environment (Adobe PDF PS, MS Word, HTML …) PDF, PS Word HTML, ) • Satellite measurements (HDF, netCDF, …) 3D Adobe Library 6.0 Movie Adobe Lib Ad b Library 7 0 7.0 Imaginations unbound
  • 21. Exploration of PDF Documents Using PDF Viewer • PDF Viewer presents information as a set of pages with their layouts • PDF Viewer renders layers of internal objects (components) and hence only the top layer is visible
  • 22. Needed Exploration of PDF Components p p • There is no support for archival appraisals that would include visual exploration of components in a document (a container of components) • Needed viewers for appraisal analyses that present information stored in a container (e.g., PDF) as a set of components and their characteristics • Text – word frequency • Images (rasters) – color frequency (histogram) • Vector graphics – line frequency • Exploration for appraisal analyses needs to include visible and invisible objects
  • 23. Exploration of Text Components LOADED FILES Occurrence of words Occurrence of numbers “Ignore” words
  • 24. Exploration of Image Components LOADED FILES “Ignore” colors List of images Occurrence of colors Preview
  • 25. Exploration of Vector Graphics Components LOADED FILES Preview Occurrence of v/h lines Imaginations unbound
  • 26. Exploration of Visible And Invisible Objects Objects intersected at the mouse click location
  • 27. Open Problem 2: Support of Appraisals by Enabling Comparisons • How to compare containers with heterogeneous information? i f ti ? • Methodology • Metrics • Weighting factors for fusion • How to quantify differences between the same type of information? • Encodings and Representations • Metrics • Local versus global differences Imaginations unbound
  • 29. Methodology Partial solutions in literature -Ref. +… CAPTCHA Open problems +… Relationship to Permanent Records
  • 30. Experimental Example INPUT = 10 PDF docs (4 & 6 Groups) UNIQUE ID= 1,2,3,4 UNIQUE ID= 5,6,7,8,9,10 Imaginations unbound
  • 31. Comparative Experimental Results INPUT = 10 PDF docs (6 & 4 members in each Group) Vector-based similarity V b d i il i Text-based similarity Image-based similarity
  • 32. Comparative Experimental Results Vector Graphics Similarity Portion of Document Surface and Word Similarity Combined Allotted to Each Document Feature Comparison Using Combination of Document Features in Proportion to Coverage
  • 33. Accuracy Comparisons Method Average Average Average Similarity of Similarity of Similarity Across Group 1 Group 2 Group 1 & 2 TEXT ONLY 1 0.489 0 TEXT & IMAGE & 0.906 0 906 0.520 0 520 0.075 0 075 GRAPHICS One refers to high similarity & zero refers to low similarity g y y Conclusions: •Differences in similarity are up to 10% of the score •Documents in Group 2 would likely be misclassified as 0.5 similarity would be the threshold between similar and dissimilar documents Imaginations unbound
  • 34. Open Problem 3: Scalability of Appraisals • Scalability of appraisals with increasing heterogeneity of information, dimensionality of data and volume of electronic records • H How should appraisal process change h ld i l h as 3D data is added to file containers? • H How should appraisal process change h ld i l h as 3D+time, 2D+spectrum, 3D+time+spectrum, nD, 3D+time+spectrum nD … • How should appraisal operations be designed to accommodate growing volume of electronic records? Imaginations unbound
  • 35. Approaches to Computational Scalability of Document Appraisals • Options for parallel processing • message-passing interface (MPI) • MPI is d i i designed f the coordination of a program running as multiple d for h di i f i li l processes in a distributed memory environment by using passing control messages. • open multi-processing (OpenMP) multi processing • OpenMP is intended for shared memory machines. It uses a multithreading approach where the master threads forks any number of slave threads threads. • Map Reduce parallel programming paradigm for commodity clusters • It l t programmers write simple Map function and Reduce lets it i l M f ti dR d function, which are then automatically parallelized without requiring the programmers to code the details of parallel processes and communications • Specialized Hardware: FPGA, Cell processors, GPU Imaginations unbound
  • 36. Computational Requirements for Executing the Methodology [Diagram: yellow indicates computations; labels: Relationship to Permanent Records, Appraisal & Sampling]
  • 37. Hardware & Software Dependencies with Hadoop • Test data: 15 PDF files from the Columbia investigation web site at http://caib.nasa.gov/. • Software configuration: Linux OS (Ubuntu flavor) and the Hadoop implementation of Map and Reduce functionalities • Hardware configuration: homogeneous & heterogeneous machines [Figure: Hadoop Average Speed; average speed in seconds versus number of machines (1 to 5), for homogeneous and heterogeneous hardware]
  • 38. Open Problems Related to Archival Decisions • Simulate Preservation Costs as a Function of Information Granularity and Information Technology • Optimal Utilization of Computational and Human Resources
  • 39. Open Problem: Archival Decision Support • Decision support for forecasting preservation costs • How to predict computational and storage requirements of preservation as a function of technology variables and information granularity? • How to optimize computational hardware, software, storage, and networking investments?
  • 40. Basic Questions About Information to be Preserved
  • 41. Challenges in Forecasting • Volatility of software/hardware/storage media • Updates: Windows operating systems since 2000: two major new releases, two minor service pack updates, around fifty security patches since SP2 • Upgrades: Microsoft Office Pro for Windows 95/98/ME/2000/XP/2003/2007 • Media life expectancy (years): optical ~ 5, disk ~ 15, microfiche ~ 100, microfilm ~ 300, newspaper ~ 50, clay tablet ~ 10,000 (life expectancy vs. information density, [P. Conway, 1996]) • Cost of software/hardware/storage media • Operating system: Windows 3.1/95/98/NT/2000/XP/Vista: Windows 95 = $209; Windows NT = $280; Windows XP = $300; Windows Vista = $399 -> $319 (2008) • 128 MB of SDRAM: 1999 ~ $120 -> $40 -> $200-250 due to the earthquake in Taiwan -> March 2000 ~ $55 -> March 2007 ~ $8.96 (flash card), source: www.pricewatch.com (1 TB ~ $109.95 as of 01/15/2009) • High performance computers: 2006: DARPA awards approximately $500 million to Cray and IBM; 2007: NSF awards $200 million to NCSA/IBM
  • 42. Archival Decision Support • Lack of forecasting models to predict preservation costs • Our work: Understand the tradeoffs between information value and computational/storage costs by providing simulation frameworks • Information granularity, organization, compression, encryption, document format, ... • Versus • Cost of CPU for gathering information, for processing and for input/output operations; cost of storage media, upgrades, storage room, … • Prototype simulation framework: Image Provenance To Learn, available for downloading from http://isda.ncsa.uiuc.edu
  • 43. Simulation Framework [Diagram labels: Information Gathering and Storage; Information Retrieval and Learning; Decision Maker; Process Preservation; Reconstruction; Provenance Information; Value (linear, observed); Cost (memory, CPU); Cost / Information Granularity Analysis; Image Viewer; Process Reconstruction System; Information Gathering System]
  • 44. Image Event Category Tracker Events Summary of Events Viewed Area Storage Time
  • 45. Information Granularity
  • 46. Storage vs. Information Organization Tradeoffs: Test Case • Information granules include interpreted, raw and snapshots • Files were not compressed [Chart: saved size in bytes (log scale, 1 to 10,000,000) per event name, e.g., Change Auto Zoom, Change Gray Scale, Change RGB Band, Add Annotation, Mouse Clicked, Window Hidden, Change Gamma, Window Shown, New Image, Change Visible Region, Change Zoom Factor, Window Created; legend: RDF = Resource Description Framework metadata model, Key pair = XML metadata model]
  • 47. Open Problems Related to Automating Archival Processing for Preservation 1. Discovery of Relationships Among Electronic Records 2. Information Preserving Conversions of Electronic Records 3. Sampling, Authenticity and Integrity Verification of a Collection of Temporally Changing Records
  • 48. Open Problem 1: Discovering Relationships Among Files • How should one establish relationships among electronic records coming from disparate sources or from the same source at multiple time instances? • How to extract metadata? • What ontology to use to represent the extracted metadata? • How to automate metadata extraction from multiple data types, e.g., 2D drawings and 3D CAD models? • How to discover relationships between electronic records corresponding to the same physical objects but different multidimensional observations? • Need to understand the complexity of the problem
  • 49. Metadata Extraction: Complexity & Size • Example data: the Crandon Mine Reports from 1981 till 2003, http://digicoll.library.wisc.edu/cgi-bin/EcoNatRes/EcoNatRes-idx?type=browse&scope=ECONATRES.CRANDONMINE • RDF triples extracted using Aperture and visualized using RDF-Gravity (red: edges, green: literal values, violet: properties)
  • 50. Relationships Among Multiple Data Types • Example Data: Torpedo Weapon Retriever 841 • 784 existing 2D image drawings and N>22 3D CAD models • How to establish relationships among the 3D CAD models and 2D image drawings during a product lifecycle? [Figure: Hypothetical Distribution of 3D CAD models for TWR 841]
  • 51. Understanding Challenges in Automation [Diagram labels: Relationship Discovery; OCR; Descriptors (metadata); Representation]
  • 52. Open Problem 2: Conversions of Electronic Records • Conversions of electronic records are needed because • Visual exploration depends on various software packages • Many formats are retired (deprecated) over time • A subset of formats is selected for preservation purposes • How to measure the degree of information preservation when files are converted from format A to format B? • During conversions, information could be lost, added or modified • What is the importance of each byte, object, etc.? • How to introduce a framework for measuring the quality of conversion and visualization software?
  • 53. Example: Conversion of X3D to STEP to X3D [Diagram: conversion chain X3D, WRL, STEP, WRL, X3D; software labels in order of appearance: X3dToVrml97, A3D Reviewer, A3D Reviewer, Nothing!, Vrml97ToX3d]
  • 54. Automation of 3D File Format Mapping & Conversion
  • 55. Open Problem 3: Sampling, Integrity and Authenticity • Given finite resources and increasing amounts of electronic records, automation of sampling, integrity and authenticity verification is very much needed • What are the criteria for sampling a collection of temporally changing versions of ‘the same’ document? • Authenticity • Integrity • Information content • How to measure a degree of authenticity? • Computers might assign inaccurate time stamps to records • How to detect integrity failures? • A record containing a female patient with prostate cancer • How to incorporate constraints into sampling? • Storage space, compression, computational cost, etc.
  • 56. Example: Temporal Ranking and Integrity Verification • Chronological ranking based on time stamps of files • Last modification (current implementation) • Ranking can be changed by a human • Content referring to dates can be used for integrity verification [Figure: documents ordered along a time axis]
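The chronological ranking by last-modification time, and the idea of checking dates in the content against the time stamp, might look roughly like the sketch below. It is an illustration under assumptions, not the presented system: the function names `rank_by_last_modified` and `years_newer_than_timestamp`, the four-digit-year regular expression, and the rule "a year mentioned in the content should not be later than the file's time stamp" are all hypothetical.

```python
import os
import re
import time


def rank_by_last_modified(paths):
    """Return file paths ordered from oldest to newest modification time."""
    return sorted(paths, key=os.path.getmtime)


def years_newer_than_timestamp(text, mtime):
    """List years mentioned in the text that are later than the time stamp.

    mtime is a POSIX time stamp (seconds since the epoch). A non-empty
    result hints that the time stamp or the content may be inconsistent.
    """
    stamp_year = time.gmtime(mtime).tm_year
    # Assumption: only four-digit years from 1900 to 2099 are considered.
    years = {int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)}
    return sorted(y for y in years if y > stamp_year)
```

A human reviewer could then reorder the ranked list by hand for the files that the second function flags.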
  • 57. Rules and Attributes for Integrity Verification • Document integrity attributes? • appearance or disappearance of document images • appearance and disappearance of dates embedded in documents • file size • count of image groups • number of sentences • average value of dates found in a document • Rules?
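A few of the text-based attributes listed above (file size, number of sentences, dates embedded in the document and their average) can be computed with a short sketch like the following. The attribute definitions are assumptions chosen for illustration: dates are reduced to four-digit years, sentences are split on terminal punctuation, and the names `integrity_attributes` and `date_changes` are hypothetical. Image-related attributes are left out because they require a document parser.

```python
import re
from statistics import mean

# Assumption: a "date" is approximated by a four-digit year between 1900 and 2099.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def integrity_attributes(text):
    """Compute simple integrity attributes for the text of one document version."""
    years = [int(y) for y in YEAR.findall(text)]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "size": len(text.encode("utf-8")),  # size of the text in bytes
        "sentences": len(sentences),
        "years": sorted(set(years)),
        "mean_year": mean(years) if years else None,
    }


def date_changes(older, newer):
    """Report years that appeared or disappeared between two versions."""
    old, new = set(older["years"]), set(newer["years"])
    return {"appeared": sorted(new - old), "disappeared": sorted(old - new)}


# Hypothetical versions of 'the same' document.
v1 = integrity_attributes("Report from 1981. Updated in 2003.")
v2 = integrity_attributes("Report from 1981. Updated in 2007.")
changes = date_changes(v1, v2)
```

Rules could then be expressed over these attributes, for example flagging a version whose dates disappear or whose sentence count drops sharply; which rules are appropriate remains the open question the slide raises.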
  • 58. Summary • Introduced a set of open problems related to • Appraisal of electronic records • Archival forecasting of preservation costs • Automation of processing for preservation • The examples from our research used to illustrate the open problems just scratch the surface of some of them
  • 59. Observations • Many stakeholders, including government agencies and companies, are already aware of some of the open problems • As all government agencies have been computerized, the continuity and functioning of the agencies depend on preservation and reconstruction of electronic records • Right now, we are at the beginning of the exponential growth of electronic records (many more electronic records will be coming) • Some scientific fields are already facing real-time decisions about preserving electronic records (e.g., astronomers)
  • 60. Future Vision • It is envisioned that the preservation and reconstruction of electronic records have to follow different paradigms that incorporate • Scalability (heterogeneity, dimensionality and volume) • Forecasting of preservation costs • New level of automation and quality control in processing for preservation purposes • The field of electronic record management and preservation needs forward-looking solutions to stay abreast of the dynamics of digital information
  • 61. References to Presented Research • Bajcsy P., R. Kooper and S-C. Lee, “Understanding Preservation and Reconstruction Requirements for Computer Assisted Decision Processes,” ACM Journal on Computers and Cultural Heritage (JOCCH), (submitted October 2008). • Bajcsy P., “A Perspective on Cyberinfrastructure for Water Research Driven by Informatics Methodologies,” Geography Compass, Volume 2, Issue 6 (p 2040-2061), 2008, Blackwell Publishing Ltd, URL: http://www3.interscience.wiley.com/cgi-bin/fulltext/121478978/PDFSTART • Bajcsy P., R. Kooper, L. Marini and J. Myers, “Community-Scale Cyberinfrastructure for Exploratory Science,” In: Cyberinfrastructure Technologies and Applications book, Editor: Junwei Cao, Nova Science Publishers, Inc., Chapter 12, 2009; URL: https://www.novapublishers.com/catalog/product_info.php?products_id=8011 • McHenry K. and P. Bajcsy, “An Overview of 3D Data Content, File Formats and Viewers,” Technical Report NCSA-ISDA08-002, October 31, 2008. • McFadden W., K. McHenry, R. Kooper, M. Ondrejcek, A. Yahja and P. Bajcsy, “Advanced Information Systems for Archival Appraisals of Contemporary Documents,” the 4th IEEE International Conference on e-Science, December 8-12, 2008, Indianapolis, IN. • Lee S-C., W. McFadden and P. Bajcsy, “Text, Image and Vector Graphics Based Appraisal of Contemporary Documents,” The Seventh International Conference on Machine Learning and Applications, December 11-13, 2008, San Diego, CA. • Bajcsy P. and S-C. Lee, “Computer Assisted Appraisal of Contemporary PDF Documents,” ARCHIVES 2008: Archival R/Evolution & Identities, 72nd Annual Meeting Pre-conference Programs, August 24-27, 2008, San Francisco, CA. • Lee S-C. and P. Bajcsy, “Understanding Challenges in Preserving and Reconstructing Computer-Assisted Medical Decision Processes,” the Workshop on Machine Learning in Biomedicine and Bioinformatics (MLBB07) of the 2007 International Conference on Machine Learning and Application (ICMLA07), Cincinnati, Ohio, December 13-15, 2007. • Bajcsy P. and D. Clutter, “Gathering and Analyzing Information about Decision Making Processes Using Geospatial Electronic Records,” the 2006 Winter Federation of Earth Science Information Partners (“Federation”) Conference, poster, January 4-6, 2006, Washington, DC.
  • 62. Questions • Project URL: http://isda.ncsa.uiuc.edu/NARA/index.html and http://isda.ncsa.uiuc.edu/CompTradeoffs/ • Publications: see our URL at http://isda.ncsa.uiuc.edu/publications • Peter Bajcsy; email: pbajcsy@ncsa.uiuc.edu