SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
Emancipating Digital
 Data: The Lincoln
 Digitization Project
 Di iti ti    P j t
  Peter Bajcsy, PhD
  -RResearch S i ti t NCSA
             h Scientist,
  - Adjunct Assistant Professor ECE
  & CS at UIUC
  - Associate Director Center for
  Humanities, Social Sciences and
  Arts (CHASS), Illinois Informatics
  Institute (I3), UIUC

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Outline
• Introduction to Lincoln project
   • Emancipating Digital Data: The Lincoln Digitization Project
   • From Large Volumes Of Scanned Lincoln Papers To Virtual
     Observatories
• Image Cropping
      g      pp g
• Georeferencing and Re-Projections of Historical Maps
• System Architecture for Web-based Delivery of
  Information and Services
  I f     ti    dS     i
• Delivering Layered Information and Providing Services
• Summary
Acknowledgement
•   Funding Agencies:
     • NASA, NARA, NSF, NIH, NAVY, DARPA, ONR, NCSA Industrial
       Partners, NCSA Internal, COM UIUC, State of Illinois, UIUC
       Provost,
       Provost NCSA International Partners Google Summer Code
                                   Partners,
•   Full Time Employees:
     • Peter Bajcsy, Rob Kooper, Michal Ondrejcek, Kenton McHenry,
       Jason Kastner and Luigi Marini
•   Students:
     • Andrew Spencer, Hye Jung Na, Suk Kyu Lee, Rahul Malik, William
       McFadden,
       McFadden Chandra Ramachandran, Ben Raichel, Maryam
                            Ramachandran         Raichel
       Moslemi Naeini
•   Collaborators on Lincoln Project:
     • Daniel Stowell & Stacy McDermott Lincoln Library in Springfield
                              McDermott,                    Springfield,
       IL; Vernon Burton & Kevin Franklin, I-CHASS, UIUC; Melvin
       Casares, and Jose Castro from Instituto Tecnológico de Costa
       Rica (ITCR); Piotr Wendykier and James Nagy from Emory
            (     );           y                  gy            y
       University Atlanta
FROM LARGE VOLUMES OF SCANNED
  LINCOLN PAPERS TO VIRTUAL
  OBSERVATORIES

  INTRODUCTION




Imaginations unbound
Background

• The Papers of Abraham Lincoln is a research
  initiative with the ultimate goal of making all
  writings by America's 16th president available
  on-line.

• Complex workflow process from paper
  documents to on-line virtual observatories
• Multiple end users
   • R
     Researchers
             h
   • General public
Input: Paper Copies of Docs & Metadata
Output: Multi-dimensional Views




The Lincoln Log, A Chronology compiled by
               g,             gy p      y
the Lincoln Sesquicentennial Commission:
http://www.thelincolnlog.org/


DIMENSIONS
Hyperlink to the Lincoln Log (temporal
representation)
Hyperlink to the Markers (spatial
representation)
Hyperlinks to the Image scans
(document content).
Output: Hyperlinked Multi-media Views

   •   Audio (e.g., music of Lincoln’s time)
   •   Images and maps
       I         d
   •   Video
   •   3D objects (e g musical instruments)
                   (e.g.,

                                               IMAGES




                       SONGS

Imaginations unbound
Output: Services to Search, Display and
  Transcribe Digital Data
                                            Google service




The Lincoln Log, A Chronology compiled by
               g,             gy p      y                        Transcription service
the Lincoln Sesquicentennial Commission:
http://www.thelincolnlog.org/




                                                Search service
Output: On-line Virtual Observatory
  • Digital Information Organization
       • Multi-dimensional views in time, space and document
         dimensions
       • Hyperlinked multi-media views including all existing n-
         dimensional data
  • Computational Services to Operate on Digital Data
       • Search
       • Layered display with third party data
       • Transcription of documents
  • Educational Services to Enable Learning
       • Simple demonstrations
       • Homework exercises
       • Support of forensic studies


Imaginations unbound
From Input to Output: A Few Key Components
1. Cropping of scanned documents (algorithm, accuracy &
   robustness, scalability, computational resources).
2.
2 Cleaning and parsing of metadata obtained from The
   Lincoln Log and The Papers of Abraham Lincoln in
   Springfield (Lat, Lng, places, ASCII characters, populating
   MySQL Database etc.)
3. Designing an underlying architecture of information
   storage and retrieval
         g
4. Geo-referencing and re-projection of historical maps.
5. Building web-based interfaces and providing services
   (Programming against Google Maps API Database
                                         API,
   Ajax/Javascript requests using PHP and mySQL).
   http://isda.ncsa.uiuc.edu/lpapers/index.html
FROM LARGE VOLUMES OF SCANNED
  LINCOLN PAPERS TO VIRTUAL
  OBSERVATORIES

  IMAGE CROPPING




Imaginations unbound
Image Cropping: Understanding Variability
    of Document Scans

•   Background paper color and
    intensity
•   Ink color and intensity
•   Density of writing
•   Color scale bar position


• Task: Automatically
  classify images for pre-
  p
  processing and
             g
  remove the Kodak
  color scale bar if
  needed.
  needed
Image Cropping Approach


         Training




         Classify            Crop




                    Output
Humanities & High Performance Computing

   • Assuming that the world is perfect ….
        • Image cropping 300 000 files times 60 seconds per file = 5 000
                 cropping: 300,000                                 5,000
          hours = 208.3 days
        • Other operations such as file format conversions (TIFF->PDF),
          pyramid construction for web deployment
        • Storage requirements for original (100K-300K images ~ 45
          Terabytes), cropped (?) and pyramid representation for fast
          retrieval over the Internet (?)
   • Need to joint forces and form interdisciplinary teams
        • The storage requirements and p
                   g    q              preservation – NCSA mass
          storage
        • The CPU requirements – parallel codes to utilize HPC



Imaginations unbound
FROM LARGE VOLUMES OF SCANNED
  LINCOLN PAPERS TO VIRTUAL
  OBSERVATORIES

  GEO-REFERENCING AND RE-PROJECTION
  OF HISTORICAL MAPS




Imaginations unbound
Georeferencing Historical Maps

 • Goal: to overlay historical maps on top of Google
   Maps
 • Challenges: Geodetic information is not always
   available.
   available
      • The geodetic coordinate system consists of a datum, a
        projection, an origin, a unit system and two axis.
 • T
   Target Projection: G
        t P j ti      Google M
                          l Maps uses WGS84
                                          WGS84,
   Mercator projection and a pixel unit system.
      • Most of the maps of the United States are in conical projection
                                                             projection,
        Lambert Conformal Conic and Albers Equal Area or in Molweide
        Pseudocylindrical Projection.



Imaginations unbound
Layered Geospatial Information: Google Map
Example
Geospatial Characteristics: Neighborhoods
Example of Map Georeferencing
   • Software: Used Global Mapper




        Albers equal-area conic
        Lambert's conformal conic                Mercator cylindrical
        Mollweide pseudocylindrical


In our case the projection does not have to be exact. For
small areas in Molweide projection, for example a simple
perspective correction can be sufficient for the map of the
US 1861-1865
Imaginations unbound
FROM LARGE VOLUMES OF SCANNED
  LINCOLN PAPERS TO VIRTUAL
  OBSERVATORIES

  DESIGNING AN UNDERLYING
  ARCHITECTURE OF VIRTUAL
  OBSERVATORIES




Imaginations unbound
Software Architecture Design




The front-end consists of a HTML file with Google Map loaded, a JavaScript
  script, and a search form with pre-defined data sets. The client-side HTML
  and JavaScript files make requests to the server. The server-side consists
  of a PHP file which bridges the gap between Ajax and connects to MySQL
  database. The result is returned as an XML response to the Ajax engine.
 Imaginations unbound
Data Storage and Organization




Imaginations unbound
FROM LARGE VOLUMES OF SCANNED
  LINCOLN PAPERS TO VIRTUAL
  OBSERVATORIES

  DELIVERING LAYERED INFORMATION
  AND PROVIDING SERVICES




Imaginations unbound
Multi Dimensional View of Lincoln Papers

• Delivering Layers of Information (geospatial – historical
  maps and current maps, temporal – Lincoln log, relational –
     p                  p ,     p                g,
  source & destination links, content – document scans
User Interface
                 Information in
                 time, space and
                 document
                 dimensions.

                      Time
                      Ti

                      Space




                      Search
Providing Search Services
Providing Transcription Services
Safety Guards for Transcription Services


          RFC Valid e-mail addresses
               abc@example.com
               Abc@example.com
               aBC@example.com
             abc.123@example.com
             abc 123@example com
            "abc@def"@example.com
            "Abc@def"@example.com
           1234567890@example.com                               Standards for email addresses: RFC822
             _______@example.com
abc+mailbox/department=shipping@example.com
abc mailbox/department shipping@example.com
                                                                (published in 1982) defines, amongst other
       !#$%&'*+-/=?^_`.{|}~@example.com                         things, the f format for internet text message
                                                                                       f
     "Fred "quota" Bloggs"@example.com                        (email) addresses.
           "Abc@def"@example.com
          "Fred Bloggs"@example.com
           "JoeBlow"@example.com
 customer/department=shipping@example.com
             $A12345@example.com
                                                       RFC Invalid e-mail addresses
          !def!xyz%abc@example.com
                                                 Abc.example.com (character @ is missing)
           _somename@example.com
                                           Abc.@example.com (character dot(.) is last in local part)
                                             Abc..123@example.com (
                                                     @       p        (character dot(.) is double)
                                                                                    ()           )
                                   A@b@c@example.com (only one @ is allowed outside quotations marks)
                       ()[];:,<>@example.com (none of the characters before the @ is allowed outside quotation marks)
What Would You Learn ?




In this example a letter was sent from Fort Randall to President Abraham
Lincoln on October 26, 1862. The bits of information about the document
(metadata) namely the time, the location of a sender and the location of
President Lincoln are known. The letter path is visualized in Google
Maps, the document can be retrieved from the database and edited.
Additionally, user can overlay one of the historical maps. The markers are
positioned with hi h accuracy b
    iti   d ith high             based on th l tit d and l
                                       d     the latitude   d longitude of
                                                                  it d    f
historical sites.
Summary
  • Design and implementation of automated document
    cropping.
  • Integration of spatial, temporal and document
    information.
  • Design and prototype a web-based user interface to
    heterogeneous data.
  • ---------------------------------------------------------------------
  • The system is available at
    http://isda.ncsa.uiuc.edu/lpapers/search.html
  • W would b excited if you would fi d th system useful
    We         ld be       it d               ld find the       t         f l
    in your research or education!



Imaginations unbound

Más contenido relacionado

Similar a Overview of Lincoln Paper Design

g-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionalityg-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking FunctionalityNicholas Loulloudes
 
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives Chris Freeland
 
Graeme earl introduction
Graeme earl introductionGraeme earl introduction
Graeme earl introductionGraeme Earl
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoopcneudecker
 
How the Web of Data Will be Won
How the Web of Data Will be WonHow the Web of Data Will be Won
How the Web of Data Will be WonJeni Tennison
 
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAUNye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAUInfinIT - Innovationsnetværket for it
 
Digital Medieval Data Curation
Digital Medieval Data CurationDigital Medieval Data Curation
Digital Medieval Data Curationblalbritton
 
Digitizing Spectator - Libraries Digital Program
Digitizing Spectator - Libraries Digital ProgramDigitizing Spectator - Libraries Digital Program
Digitizing Spectator - Libraries Digital ProgramRobert Frech
 
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web AppsGIS in the Rockies
 
Core Metadata Project: Harmonizing Legacy Metadata for the Future
Core Metadata Project: Harmonizing Legacy Metadata for the FutureCore Metadata Project: Harmonizing Legacy Metadata for the Future
Core Metadata Project: Harmonizing Legacy Metadata for the FutureWiLS
 
IIIF for CNI Spring 2014 Membership Meeting
IIIF for CNI Spring 2014 Membership MeetingIIIF for CNI Spring 2014 Membership Meeting
IIIF for CNI Spring 2014 Membership MeetingTom-Cramer
 
Multimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspaperscneudecker
 
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyGuy Lansley
 
Class 5-introto dl
Class 5-introto dlClass 5-introto dl
Class 5-introto dlmadhuvardhan
 
Class 5-introto dl
Class 5-introto dlClass 5-introto dl
Class 5-introto dlmadhuvardhan
 
Network Mapping - Esri UK Annual Conference 2016
Network Mapping - Esri UK Annual Conference 2016Network Mapping - Esri UK Annual Conference 2016
Network Mapping - Esri UK Annual Conference 2016Esri UK
 

Similar a Overview of Lincoln Paper Design (20)

g-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionalityg-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionality
 
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
 
Graeme earl introduction
Graeme earl introductionGraeme earl introduction
Graeme earl introduction
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoop
 
How the Web of Data Will be Won
How the Web of Data Will be WonHow the Web of Data Will be Won
How the Web of Data Will be Won
 
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAUNye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
 
IMIA Chiang Spatial Computing - 2016
IMIA Chiang Spatial Computing - 2016IMIA Chiang Spatial Computing - 2016
IMIA Chiang Spatial Computing - 2016
 
Digital Medieval Data Curation
Digital Medieval Data CurationDigital Medieval Data Curation
Digital Medieval Data Curation
 
Digitizing Spectator - Libraries Digital Program
Digitizing Spectator - Libraries Digital ProgramDigitizing Spectator - Libraries Digital Program
Digitizing Spectator - Libraries Digital Program
 
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
 
Core Metadata Project: Harmonizing Legacy Metadata for the Future
Core Metadata Project: Harmonizing Legacy Metadata for the FutureCore Metadata Project: Harmonizing Legacy Metadata for the Future
Core Metadata Project: Harmonizing Legacy Metadata for the Future
 
IIIF for CNI Spring 2014 Membership Meeting
IIIF for CNI Spring 2014 Membership MeetingIIIF for CNI Spring 2014 Membership Meeting
IIIF for CNI Spring 2014 Membership Meeting
 
Statistical data in RDF
Statistical data in RDFStatistical data in RDF
Statistical data in RDF
 
Multimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspapers
 
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
 
Class 5-introto dl
Class 5-introto dlClass 5-introto dl
Class 5-introto dl
 
Class 5-introto dl
Class 5-introto dlClass 5-introto dl
Class 5-introto dl
 
Network Mapping - Esri UK Annual Conference 2016
Network Mapping - Esri UK Annual Conference 2016Network Mapping - Esri UK Annual Conference 2016
Network Mapping - Esri UK Annual Conference 2016
 
Our World is Socio-technical
Our World is Socio-technicalOur World is Socio-technical
Our World is Socio-technical
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 

Último

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Último (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Overview of Lincoln Paper Design

  • 1. Emancipating Digital Data: The Lincoln Digitization Project Di iti ti P j t Peter Bajcsy, PhD -RResearch S i ti t NCSA h Scientist, - Adjunct Assistant Professor ECE & CS at UIUC - Associate Director Center for Humanities, Social Sciences and Arts (CHASS), Illinois Informatics Institute (I3), UIUC National Center for Supercomputing Applications University of Illinois at Urbana-Champaign
  • 2. Outline • Introduction to Lincoln project • Emancipating Digital Data: The Lincoln Digitization Project • From Large Volumes Of Scanned Lincoln Papers To Virtual Observatories • Image Cropping g pp g • Georeferencing and Re-Projections of Historical Maps • System Architecture for Web-based Delivery of Information and Services I f ti dS i • Delivering Layered Information and Providing Services • Summary
  • 3. Acknowledgement • Funding Agencies: • NASA, NARA, NSF, NIH, NAVY, DARPA, ONR, NCSA Industrial Partners, NCSA Internal, COM UIUC, State of Illinois, UIUC Provost, Provost NCSA International Partners Google Summer Code Partners, • Full Time Employees: • Peter Bajcsy, Rob Kooper, Michal Ondrejcek, Kenton McHenry, Jason Kastner and Luigi Marini • Students: • Andrew Spencer, Hye Jung Na, Suk Kyu Lee, Rahul Malik, William McFadden, McFadden Chandra Ramachandran, Ben Raichel, Maryam Ramachandran Raichel Moslemi Naeini • Collaborators on Lincoln Project: • Daniel Stowell & Stacy McDermott Lincoln Library in Springfield McDermott, Springfield, IL; Vernon Burton & Kevin Franklin, I-CHASS, UIUC; Melvin Casares, and Jose Castro from Instituto Tecnológico de Costa Rica (ITCR); Piotr Wendykier and James Nagy from Emory ( ); y gy y University Atlanta
  • 4. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES INTRODUCTION Imaginations unbound
  • 5. Background • The Papers of Abraham Lincoln is a research initiative with the ultimate goal of making all writings by America's 16th president available on-line. • Complex workflow process from paper documents to on-line virtual observatories • Multiple end users • R Researchers h • General public
  • 6. Input: Paper Copies of Docs & Metadata
  • 7. Output: Multi-dimensional Views The Lincoln Log, A Chronology compiled by g, gy p y the Lincoln Sesquicentennial Commission: http://www.thelincolnlog.org/ DIMENSIONS Hyperlink to the Lincoln Log (temporal representation) Hyperlink to the Markers (spatial representation) Hyperlinks to the Image scans (document content).
  • 8. Output: Hyperlinked Multi-media Views • Audio (e.g., music of Lincoln’s time) • Images and maps I d • Video • 3D objects (e g musical instruments) (e.g., IMAGES SONGS Imaginations unbound
  • 9. Output: Services to Search, Display and Transcribe Digital Data Google service The Lincoln Log, A Chronology compiled by g, gy p y Transcription service the Lincoln Sesquicentennial Commission: http://www.thelincolnlog.org/ Search service
  • 10. Output: On-line Virtual Observatory • Digital Information Organization • Multi-dimensional views in time, space and document dimensions • Hyperlinked multi-media views including all existing n- dimensional data • Computational Services to Operate on Digital Data • Search • Layered display with third party data • Transcription of documents • Educational Services to Enable Learning • Simple demonstrations • Homework exercises • Support of forensic studies Imaginations unbound
  • 11. From Input to Output: A Few Key Components 1. Cropping of scanned documents (algorithm, accuracy & robustness, scalability, computational resources). 2. 2 Cleaning and parsing of metadata obtained from The Lincoln Log and The Papers of Abraham Lincoln in Springfield (Lat, Lng, places, ASCII characters, populating MySQL Database etc.) 3. Designing an underlying architecture of information storage and retrieval g 4. Geo-referencing and re-projection of historical maps. 5. Building web-based interfaces and providing services (Programming against Google Maps API Database API, Ajax/Javascript requests using PHP and mySQL). http://isda.ncsa.uiuc.edu/lpapers/index.html
  • 12. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES IMAGE CROPPING Imaginations unbound
  • 13. Image Cropping: Understanding Variability of Document Scans • Background paper color and intensity • Ink color and intensity • Density of writing • Color scale bar position • Task: Automatically classify images for pre- p processing and g remove the Kodak color scale bar if needed. needed
  • 14. Image Cropping Approach Training Classify Crop Output
  • 15. Humanities & High Performance Computing • Assuming that the world is perfect …. • Image cropping 300 000 files times 60 seconds per file = 5 000 cropping: 300,000 5,000 hours = 208.3 days • Other operations such as file format conversions (TIFF->PDF), pyramid construction for web deployment • Storage requirements for original (100K-300K images ~ 45 Terabytes), cropped (?) and pyramid representation for fast retrieval over the Internet (?) • Need to joint forces and form interdisciplinary teams • The storage requirements and p g q preservation – NCSA mass storage • The CPU requirements – parallel codes to utilize HPC Imaginations unbound
  • 16. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES GEO-REFERENCING AND RE-PROJECTION OF HISTORICAL MAPS Imaginations unbound
  • 17. Georeferencing Historical Maps • Goal: to overlay historical maps on top of Google Maps • Challenges: Geodetic information is not always available. available • The geodetic coordinate system consists of a datum, a projection, an origin, a unit system and two axis. • T Target Projection: G t P j ti Google M l Maps uses WGS84 WGS84, Mercator projection and a pixel unit system. • Most of the maps of the United States are in conical projection projection, Lambert Conformal Conic and Albers Equal Area or in Molweide Pseudocylindrical Projection. Imaginations unbound
  • 18. Layered Geospatial Information: Google Map Example
  • 20. Example of Map Georeferencing • Software: Used Global Mapper Albers equal-area conic Lambert's conformal conic Mercator cylindrical Mollweide pseudocylindrical In our case the projection does not have to be exact. For small areas in Molweide projection, for example a simple perspective correction can be sufficient for the map of the US 1861-1865 Imaginations unbound
  • 21. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES DESIGNING AN UNDERLYING ARCHITECTURE OF VIRTUAL OBSERVATORIES Imaginations unbound
  • 22. Software Architecture Design The front-end consists of a HTML file with Google Map loaded, a JavaScript script, and a search form with pre-defined data sets. The client-side HTML and JavaScript files make requests to the server. The server-side consists of a PHP file which bridges the gap between Ajax and connects to MySQL database. The result is returned as an XML response to the Ajax engine. Imaginations unbound
  • 23. Data Storage and Organization Imaginations unbound
  • 24. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES DELIVERING LAYERED INFORMATION AND PROVIDING SERVICES Imaginations unbound
  • 25. Multi Dimensional View of Lincoln Papers • Delivering Layers of Information (geospatial – historical maps and current maps, temporal – Lincoln log, relational – p p , p g, source & destination links, content – document scans
  • 26. User Interface Information in time, space and document dimensions. Time Ti Space Search
  • 29. Safety Guards for Transcription Services RFC Valid e-mail addresses abc@example.com Abc@example.com aBC@example.com abc.123@example.com abc 123@example com "abc@def"@example.com "Abc@def"@example.com 1234567890@example.com Standards for email addresses: RFC822 _______@example.com abc+mailbox/department=shipping@example.com abc mailbox/department shipping@example.com (published in 1982) defines, amongst other !#$%&'*+-/=?^_`.{|}~@example.com things, the f format for internet text message f "Fred "quota" Bloggs"@example.com (email) addresses. "Abc@def"@example.com "Fred Bloggs"@example.com "JoeBlow"@example.com customer/department=shipping@example.com $A12345@example.com RFC Invalid e-mail addresses !def!xyz%abc@example.com Abc.example.com (character @ is missing) _somename@example.com Abc.@example.com (character dot(.) is last in local part) Abc..123@example.com ( @ p (character dot(.) is double) () ) A@b@c@example.com (only one @ is allowed outside quotations marks) ()[];:,<>@example.com (none of the characters before the @ is allowed outside quotation marks)
  • 30. What Would You Learn ? In this example a letter was sent from Fort Randall to President Abraham Lincoln on October 26, 1862. The bits of information about the document (metadata) namely the time, the location of a sender and the location of President Lincoln are known. The letter path is visualized in Google Maps, the document can be retrieved from the database and edited. Additionally, user can overlay one of the historical maps. The markers are positioned with hi h accuracy b iti d ith high based on th l tit d and l d the latitude d longitude of it d f historical sites.
  • 31. Summary • Design and implementation of automated document cropping. • Integration of spatial, temporal and document information. • Design and prototype a web-based user interface to heterogeneous data. • --------------------------------------------------------------------- • The system is available at http://isda.ncsa.uiuc.edu/lpapers/search.html • W would b excited if you would fi d th system useful We ld be it d ld find the t f l in your research or education! Imaginations unbound