Digitising Hansard

            Edward Wood
Director of Information Management
        House of Commons
                16.6.08
digitising Hansard

• digitising Hansard: scanning and OCR
• the policy context
• database and front end
Hansard
• the official report of debates in Parliament
• actually an unofficial private enterprise at first
• “nationalised” in 1909
• early reports written in the third person
• eventually developed into a (nearly) verbatim
  account
• volumes from 1803 – 2005 were digitised
• nearly 3 million pages
“though not strictly verbatim, [it] is substantially
the verbatim report, with repetitions and
redundancies omitted and with obvious mistakes
corrected, but [...] on the other hand leaves out
nothing that adds to the meaning of the speech
or illustrates the argument.”
why digitise?
•   enable preservation
•   conservation is expensive
•   increase access
•   increase usability
•   improve business processes
•   re-use physical storage space
•   costs have fallen significantly
•   quality improving steadily
preservation vs. conservation

conservation
  direct intervention to prevent/make good damage to
  materials

preservation
  a broader term than conservation. It includes all
  managerial and financial considerations including
  storage and accommodation provision, staffing levels,
  policies, techniques, and methods involved in preserving
  library and archive materials and the information
  contained therein
preservation
•   originals printed on poor quality paper
•   starting to deteriorate
•   reduce wear and tear from daily use
•   keep in a controlled environment
•   conservation is expensive
improve access
• internal
   – extensive day to day business use across a very large
     site
• public
   – national heritage and birthright
   – disposal by libraries
   – international interest
increase usability
•   search
•   print
•   share
•   novel uses/mash-ups
quality of digitisation techniques
improving steadily
costs
• costs have fallen significantly
• alternative funding models
• reduce physical storage needs
  – dispose of surplus copies
  – locate in less valuable space
• but beware the hidden costs…
ongoing costs

•   developing a front-end and database
•   hosting
•   storing images
•   digital preservation
•   format migration
alternatives

• microfilm
• conservation
• facsimile
why not leave it to the big boys?

in a word, control
•   subject matter
•   quality
•   value added
•   use
funding models

•   self-funding
•   commercial funding
•   joint funding
•   grants
doing the work
• In house or contractor?
• method
  –   image only
  –   re-keying (single, double, triple...)
  –   OCR (optical character recognition)
  –   image plus text
  –   metadata capture
  –   manual intervention increases quality and costs!
scanning from...

•   microfilm
•   loose originals
•   bound originals
•   dis-bound originals
OCR

•   how accurate does it need to be?
•   mass vs batch capture
•   double or triple compare
•   diminishing returns
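A minimal sketch of what "double or triple compare" might look like in practice: several independent transcriptions of the same text are merged by majority vote, and any word with no clear majority is flagged for manual checking. The function name is hypothetical, and the sketch assumes the passes already contain the same number of words; real OCR output would need an alignment step first.

```python
from collections import Counter


def compare_passes(passes):
    """Merge independent transcriptions word by word using a majority vote.

    Assumption: every pass has the same number of words (no alignment step).
    Returns the merged words and the positions where no strict majority exists.
    """
    token_lists = [p.split() for p in passes]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("passes must contain the same number of words")

    merged, disputed = [], []
    for position, candidates in enumerate(zip(*token_lists)):
        word, votes = Counter(candidates).most_common(1)[0]
        if votes * 2 <= len(candidates):
            # No strict majority: leave this word for a human to check.
            disputed.append(position)
        merged.append(word)
    return merged, disputed


if __name__ == "__main__":
    words, doubtful = compare_passes([
        "the hon. Member",
        "the hon. Mernber",
        "tbe hon. Member",
    ])
    print(" ".join(words), doubtful)
```

With two passes a disagreement can only be flagged, not resolved; a third pass lets the vote settle most of them, which is one way to picture the diminishing returns of adding further passes.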
QA (quality assurance)
• automate where possible
• contractor
   – 100% proof reading
• client
   – heavy sampling of images
   – 1% sampling of text
• third party?
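The client-side 1% text sample could be drawn along these lines. This is only a sketch: the function name, the page identifiers and the use of a fixed seed (so that the same sample can be re-drawn for audit) are assumptions, not a description of how the project selected pages.

```python
import random


def sample_pages(page_ids, fraction=0.01, seed=None):
    """Pick a random sample of pages for proofreading.

    Always returns at least one page when any pages are supplied.
    """
    if not page_ids:
        return []
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    size = min(len(page_ids), max(1, round(len(page_ids) * fraction)))
    return sorted(rng.sample(list(page_ids), size))


if __name__ == "__main__":
    batch = list(range(1, 1001))  # hypothetical batch of 1,000 page numbers
    print(sample_pages(batch, fraction=0.01, seed=2008))
```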
the need for a policy framework
• Hansard was the first major digitisation project in
  the UK parliament
• an earlier project to digitise Local and Private
  Acts captured images only
• we needed a digitisation policy for parliament to
  ensure consistency and learning from
  experience
policy aims

• ensure that individual projects:
   – take into account the wider information context both
     inside and outside Parliament
   – deliver their target benefits
   – offer value for money
• ensure the resources created can be:
   – exploited fully
   – used for as long as is required
policy scope

•   publications
•   photographs
•   archival documents
•   business records
policy principles

• digitisation needs to be seen as an integral part
  of the information work carried out by parliament
• use of appropriate technical standards
• scan once for many purposes
• business cases should take account of all
  relevant costs
selection criteria

• measurable user demand (for public use)
• business need (for internal use)
• the potential for learning and educational use
• cost and the availability of other resources
• technical considerations
• the uniqueness of the items
• conservation requirements
• intellectual property rights and copyright issues
• the availability of digitised versions of the same material
  elsewhere
• the potential for revenue raising
• the feasibility of long-term preservation, where required
other aspects of the policy
• the delivery method will be planned at the outset
• the preservation master will be an
  uncompressed TIFF file
• metadata will be created, to support resource
  discovery, use, storage and digital preservation
• we will adopt international standards where
  possible
• we will work with partners where possible
developing a digitisation strategy


• a project board has been created
• an integral part of an online parliamentary
  history programme for parliament
• will use the criteria set out in the digitisation
  policy to prioritise future digitisation work
practical guidelines


• guidelines have been developed for all parts of
  parliament which need to create digitised assets:
  –   a checklist for doing the work
  –   glossary
  –   details of file formats, OCR options
  –   a description of popular myths about costs
hosting
•   text and images
•   text only
•   navigation
•   search
•   web 2.0
•   funding models
•   give it away!?
    http://www.parliament.uk/publications/archives.cfm
developing a web interface
drivers
• keep costs down
• work closely with users
• meaningful search across a large amount of data
solution
• experimental approach
• open source
methodology and progress
• small team of developers from Parliamentary
  ICT working closely with users (inside and
  outside Parliament)
• uses “micro formats” approach
• XML is parsed into HTML before loading into the
  database
• JPEGs not currently being used
• half of the data has been loaded (mainly 20th
  century)
• public discussion group and issues log
http://hansard.millbanksystems.com
faceted classification
•   faceted approach to browsing and searching
•   assignment of multiple classifications to an object
•   classifications can be ordered in a variety of ways
•   facets include
    –   date
    –   volume number
    –   monarch
    –   chamber
    –   content type (debates or questions)
    –   constituencies
    –   Members of Parliament
    –   offices held.
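The faceted approach can be illustrated with a short sketch: each object carries several classifications, and browsing is a matter of filtering on any combination of them and counting what remains. The records and field names below are hypothetical and do not reflect the actual database schema.

```python
from collections import Counter

# Hypothetical records; the keys mirror some of the facets listed above.
ITEMS = [
    {"date": "1941-05-07", "chamber": "commons", "content_type": "debates"},
    {"date": "1941-05-07", "chamber": "commons", "content_type": "questions"},
    {"date": "1941-05-07", "chamber": "lords", "content_type": "debates"},
]


def filter_by_facets(items, **wanted):
    """Keep only the items that match every requested facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in wanted.items())
    ]


def facet_counts(items, facet):
    """Count how many items fall under each value of one facet."""
    return Counter(item[facet] for item in items if facet in item)


if __name__ == "__main__":
    commons = filter_by_facets(ITEMS, chamber="commons")
    print(len(commons), facet_counts(commons, "content_type"))
```

Because the facets are independent, they can be applied in any order, which is what allows the same material to be browsed by date, chamber or Member.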
other features
• references using the standard format can be located
  using the search box
  HC Deb Vol 385 13 May 2002 c498
• predictable URLs

  http://hansard.millbanksystems.com/commons/1941/may/07/w
• pages created for:
   –   individual Members of Parliament
   –   constituencies
   –   acts
   –   bills
   –   divisions
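As an illustration of how a standard reference could be mapped to a predictable URL, the sketch below parses a reference in the format shown above and builds the address of the sitting day. Several things here are assumptions rather than documented behaviour of the site: the URL pattern is inferred from the single example above, the "HL" prefix and "lords" path segment are assumed by analogy, the month is assumed to be a three-letter lowercase abbreviation, and the function names are hypothetical. The part of the example URL after the date is left out.

```python
import re

BASE = "http://hansard.millbanksystems.com"
MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]

# Matches references such as "HC Deb Vol 385 13 May 2002 c498".
REFERENCE = re.compile(
    r"^(?P<house>HC|HL) Deb Vol (?P<volume>\d+) "
    r"(?P<day>\d{1,2}) (?P<month>[A-Za-z]+) (?P<year>\d{4}) "
    r"c(?P<column>\d+)$"
)


def parse_reference(text):
    """Split a standard reference into house, volume, date and column."""
    match = REFERENCE.match(text.strip())
    if match is None or match["month"].lower() not in MONTHS:
        raise ValueError(f"not a recognised reference: {text!r}")
    return {
        "house": match["house"],
        "volume": int(match["volume"]),
        "day": int(match["day"]),
        "month": match["month"].lower(),
        "year": int(match["year"]),
        "column": int(match["column"]),
    }


def sitting_day_url(ref):
    """Build the (assumed) URL for the sitting day of a parsed reference."""
    chamber = {"HC": "commons", "HL": "lords"}[ref["house"]]
    return f"{BASE}/{chamber}/{ref['year']}/{ref['month'][:3]}/{ref['day']:02d}"


if __name__ == "__main__":
    parsed = parse_reference("HC Deb Vol 385 13 May 2002 c498")
    print(parsed["volume"], parsed["column"], sitting_day_url(parsed))
```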
http://hansard.millbanksystems.com

Editor's notes

  1. Today we’re going to say a little bit about our project to digitise Hansard, which is of course the official record of debates in Parliament. Although this isn’t records management in the traditional sense, it is about taking one of the most important records in the country, if not the world, using it to create a new digital asset, and fully exploiting its value as an information source rather than leaving it to gather dust on a shelf.