This set of slides has been presented to the Illinois Program for Research in the Humanities at the University of Illinois at Urbana-Champaign on 02-27-2009
1. Emancipating Digital
Data: The Lincoln
Digitization Project
Di iti ti P j t
Peter Bajcsy, PhD
-RResearch S i ti t NCSA
h Scientist,
- Adjunct Assistant Professor ECE
& CS at UIUC
- Associate Director Center for
Humanities, Social Sciences and
Arts (CHASS), Illinois Informatics
Institute (I3), UIUC
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
2. Outline
• Introduction to Lincoln project
• Emancipating Digital Data: The Lincoln Digitization Project
• From Large Volumes Of Scanned Lincoln Papers To Virtual
Observatories
• Image Cropping
g pp g
• Georeferencing and Re-Projections of Historical Maps
• System Architecture for Web-based Delivery of
Information and Services
I f ti dS i
• Delivering Layered Information and Providing Services
• Summary
3. Acknowledgement
• Funding Agencies:
• NASA, NARA, NSF, NIH, NAVY, DARPA, ONR, NCSA Industrial
Partners, NCSA Internal, COM UIUC, State of Illinois, UIUC
Provost,
Provost NCSA International Partners Google Summer Code
Partners,
• Full Time Employees:
• Peter Bajcsy, Rob Kooper, Michal Ondrejcek, Kenton McHenry,
Jason Kastner and Luigi Marini
• Students:
• Andrew Spencer, Hye Jung Na, Suk Kyu Lee, Rahul Malik, William
McFadden,
McFadden Chandra Ramachandran, Ben Raichel, Maryam
Ramachandran Raichel
Moslemi Naeini
• Collaborators on Lincoln Project:
• Daniel Stowell & Stacy McDermott Lincoln Library in Springfield
McDermott, Springfield,
IL; Vernon Burton & Kevin Franklin, I-CHASS, UIUC; Melvin
Casares, and Jose Castro from Instituto Tecnológico de Costa
Rica (ITCR); Piotr Wendykier and James Nagy from Emory
( ); y gy y
University Atlanta
4. FROM LARGE VOLUMES OF SCANNED
LINCOLN PAPERS TO VIRTUAL
OBSERVATORIES
INTRODUCTION
Imaginations unbound
5. Background
• The Papers of Abraham Lincoln is a research
initiative with the ultimate goal of making all
writings by America's 16th president available
on-line.
• Complex workflow process from paper
documents to on-line virtual observatories
• Multiple end users
• R
Researchers
h
• General public
7. Output: Multi-dimensional Views
The Lincoln Log, A Chronology compiled by
g, gy p y
the Lincoln Sesquicentennial Commission:
http://www.thelincolnlog.org/
DIMENSIONS
Hyperlink to the Lincoln Log (temporal
representation)
Hyperlink to the Markers (spatial
representation)
Hyperlinks to the Image scans
(document content).
8. Output: Hyperlinked Multi-media Views
• Audio (e.g., music of Lincoln’s time)
• Images and maps
I d
• Video
• 3D objects (e g musical instruments)
(e.g.,
IMAGES
SONGS
Imaginations unbound
9. Output: Services to Search, Display and
Transcribe Digital Data
Google service
The Lincoln Log, A Chronology compiled by
g, gy p y Transcription service
the Lincoln Sesquicentennial Commission:
http://www.thelincolnlog.org/
Search service
10. Output: On-line Virtual Observatory
• Digital Information Organization
• Multi-dimensional views in time, space and document
dimensions
• Hyperlinked multi-media views including all existing n-
dimensional data
• Computational Services to Operate on Digital Data
• Search
• Layered display with third party data
• Transcription of documents
• Educational Services to Enable Learning
• Simple demonstrations
• Homework exercises
• Support of forensic studies
Imaginations unbound
11. From Input to Output: A Few Key Components
1. Cropping of scanned documents (algorithm, accuracy &
robustness, scalability, computational resources).
2.
2 Cleaning and parsing of metadata obtained from The
Lincoln Log and The Papers of Abraham Lincoln in
Springfield (Lat, Lng, places, ASCII characters, populating
MySQL Database etc.)
3. Designing an underlying architecture of information
storage and retrieval
g
4. Geo-referencing and re-projection of historical maps.
5. Building web-based interfaces and providing services
(Programming against Google Maps API Database
API,
Ajax/Javascript requests using PHP and mySQL).
http://isda.ncsa.uiuc.edu/lpapers/index.html
12. FROM LARGE VOLUMES OF SCANNED
LINCOLN PAPERS TO VIRTUAL
OBSERVATORIES
IMAGE CROPPING
Imaginations unbound
13. Image Cropping: Understanding Variability
of Document Scans
• Background paper color and
intensity
• Ink color and intensity
• Density of writing
• Color scale bar position
• Task: Automatically
classify images for pre-
p
processing and
g
remove the Kodak
color scale bar if
needed.
needed
15. Humanities & High Performance Computing
• Assuming that the world is perfect ….
• Image cropping 300 000 files times 60 seconds per file = 5 000
cropping: 300,000 5,000
hours = 208.3 days
• Other operations such as file format conversions (TIFF->PDF),
pyramid construction for web deployment
• Storage requirements for original (100K-300K images ~ 45
Terabytes), cropped (?) and pyramid representation for fast
retrieval over the Internet (?)
• Need to joint forces and form interdisciplinary teams
• The storage requirements and p
g q preservation – NCSA mass
storage
• The CPU requirements – parallel codes to utilize HPC
Imaginations unbound
16. FROM LARGE VOLUMES OF SCANNED
LINCOLN PAPERS TO VIRTUAL
OBSERVATORIES
GEO-REFERENCING AND RE-PROJECTION
OF HISTORICAL MAPS
Imaginations unbound
17. Georeferencing Historical Maps
• Goal: to overlay historical maps on top of Google
Maps
• Challenges: Geodetic information is not always
available.
available
• The geodetic coordinate system consists of a datum, a
projection, an origin, a unit system and two axis.
• T
Target Projection: G
t P j ti Google M
l Maps uses WGS84
WGS84,
Mercator projection and a pixel unit system.
• Most of the maps of the United States are in conical projection
projection,
Lambert Conformal Conic and Albers Equal Area or in Molweide
Pseudocylindrical Projection.
Imaginations unbound
20. Example of Map Georeferencing
• Software: Used Global Mapper
Albers equal-area conic
Lambert's conformal conic Mercator cylindrical
Mollweide pseudocylindrical
In our case the projection does not have to be exact. For
small areas in Molweide projection, for example a simple
perspective correction can be sufficient for the map of the
US 1861-1865
Imaginations unbound
21. FROM LARGE VOLUMES OF SCANNED
LINCOLN PAPERS TO VIRTUAL
OBSERVATORIES
DESIGNING AN UNDERLYING
ARCHITECTURE OF VIRTUAL
OBSERVATORIES
Imaginations unbound
22. Software Architecture Design
The front-end consists of a HTML file with Google Map loaded, a JavaScript
script, and a search form with pre-defined data sets. The client-side HTML
and JavaScript files make requests to the server. The server-side consists
of a PHP file which bridges the gap between Ajax and connects to MySQL
database. The result is returned as an XML response to the Ajax engine.
Imaginations unbound
24. FROM LARGE VOLUMES OF SCANNED
LINCOLN PAPERS TO VIRTUAL
OBSERVATORIES
DELIVERING LAYERED INFORMATION
AND PROVIDING SERVICES
Imaginations unbound
25. Multi Dimensional View of Lincoln Papers
• Delivering Layers of Information (geospatial – historical
maps and current maps, temporal – Lincoln log, relational –
p p , p g,
source & destination links, content – document scans
26. User Interface
Information in
time, space and
document
dimensions.
Time
Ti
Space
Search
29. Safety Guards for Transcription Services
RFC Valid e-mail addresses
abc@example.com
Abc@example.com
aBC@example.com
abc.123@example.com
abc 123@example com
"abc@def"@example.com
"Abc@def"@example.com
1234567890@example.com Standards for email addresses: RFC822
_______@example.com
abc+mailbox/department=shipping@example.com
abc mailbox/department shipping@example.com
(published in 1982) defines, amongst other
!#$%&'*+-/=?^_`.{|}~@example.com things, the f format for internet text message
f
"Fred "quota" Bloggs"@example.com (email) addresses.
"Abc@def"@example.com
"Fred Bloggs"@example.com
"JoeBlow"@example.com
customer/department=shipping@example.com
$A12345@example.com
RFC Invalid e-mail addresses
!def!xyz%abc@example.com
Abc.example.com (character @ is missing)
_somename@example.com
Abc.@example.com (character dot(.) is last in local part)
Abc..123@example.com (
@ p (character dot(.) is double)
() )
A@b@c@example.com (only one @ is allowed outside quotations marks)
()[];:,<>@example.com (none of the characters before the @ is allowed outside quotation marks)
30. What Would You Learn ?
In this example a letter was sent from Fort Randall to President Abraham
Lincoln on October 26, 1862. The bits of information about the document
(metadata) namely the time, the location of a sender and the location of
President Lincoln are known. The letter path is visualized in Google
Maps, the document can be retrieved from the database and edited.
Additionally, user can overlay one of the historical maps. The markers are
positioned with hi h accuracy b
iti d ith high based on th l tit d and l
d the latitude d longitude of
it d f
historical sites.
31. Summary
• Design and implementation of automated document
cropping.
• Integration of spatial, temporal and document
information.
• Design and prototype a web-based user interface to
heterogeneous data.
• ---------------------------------------------------------------------
• The system is available at
http://isda.ncsa.uiuc.edu/lpapers/search.html
• W would b excited if you would fi d th system useful
We ld be it d ld find the t f l
in your research or education!
Imaginations unbound