Linked open data for digital humanities
research at media archives
Victor de Boer

Slides for my invited talk at the Workshop "Corpus Analysis of Time-Based Arts & Media" in Berlin 6-11-2018


  1. Linked open data for digital humanities research at media archives. Victor de Boer. Corpus Analysis of Time-Based Arts & Media.
  2. Digital Humanities: part of the effort of humanities researchers has moved from physical archives to digital ones. New possibilities for humanities research. (Images: www.doaks.org, www.dkrz.de)
  3. Integrating collections as Linked Data; tools built on top of the data; continuous enrichment; embed in humanities methodology. Continuous collection enrichment: multimedia analysis (image, text, video); human computation.
  4. Linked Data
  5. Some examples: Netherlands Institute for Sound and Vision (Beeld en Geluid)
  6. Individuals; media professionals; heritage/museum professionals; teachers and pupils; researchers
  7. 11 petabytes of data; 65,000 hrs of yearly ingest; digitization of old material; born-digital new content
  8. Openbeelden.nl / openimages.eu: CC BY-SA as preferred license; 3000 items; “Internet Quality”
  9. Reuse
  10. The Challenge of Open. NISV collection: ~1,000,000 hrs; public domain (~320 hrs = 0.032%); digitized part (30%); online (~1,800 hrs = 0.18%)
  11. Slide: Johan Oomen. https://blog.brightcove.com/en/2011/02/guest-post-content-king-metadata-rules
  12. Slide: Lotte Belice Balthussen
  13. Metadata
      • INSERT PICTURE OF IMMIX
        – ±15 archivists, with changing role
        – Speech recognition and radio streaming in
  14. Gemeenschappelijke Thesaurus Audiovisuele Archieven (GTAA)
      • SKOS thesaurus
      • ~160,000 terms:
        – ~3,800 Subjects
        – ~97,000 Persons
        – ~27,000 Names
        – ~14,000 Locations
        – 113 Genres
        – ~18,000 Makers
      • http://gtaa.beeldengeluid.nl/
  15. Slide: Lotte Belice Balthussen
  16. Slide: Lotte Belice Balthussen
  17. How? Image recognition; speaker recognition; speech recognition; broadcasting guides; crowdsourcing; TT888 subtitles; … “Donkey”: how could you do this?
  18. Broadcasting guides: segmentation -> OCR -> match to vocabulary
  19. Over 350 speaker models can be recognised; average precision is 90%
  20. Visual analysis
  21. Term extraction from TT888 subtitles
  22. Term extraction from TT888 subtitles (flowchart of the web service). Boxes: Input (OAI program metadata; other subtitle source); Extract subtitles; Remove stopwords (Dutch stopword lists); N-grams frequency list (Dutch word frequencies); Normalized frequency > T; Named Entity Recognition (NER module); Frequency > T; Match with thesaurus (GTAA thesaurus, ElasticSearch); Results
  23. Crowdsourcing: WAISDA? (What’s That?)
      - Game-With-a-Purpose (GWAP)
      - Allows internet users to annotate audiovisual archive material in the form of a (serious) game
      - The goal of the game is consensus between players (which also works as a filter)
      - Fun and competition as motivation
  24. GAME MECHANICS 1/5
  25. GAME MECHANICS 2/5
  26. GAME MECHANICS 4/5
  27. GAME MECHANICS 5/5
  28. RESULTS AND FINDINGS 1/2
      • Three implementations resulted in over a million social tags (ongoing)
      • ~40-50% of the social tags consist of matched tags
      • ~10-20% of the social tags are unique
      • ‘Super taggers’ are responsible for the vast majority of the social tags that are added
  29. Human-based computation: nichesourcing; crowdsourcing; professional annotation. Niche groups of amateur experts with shared characteristics
  30. Crowd-/Nichesourcing: Filmtagging (Aschwin Stacia, Vrije Universiteit Amsterdam)
      • Provenance of annotations (who, when, what)
      • Simple expertise model developed based on interviews with film scholars
      • Cinematography, Cultural history, Locations
      • Varying needs; crowdsourcing does provide opportunities to broaden the annotations
  31. Filmtagging
  32. Nichesourcing: Accurator (Chris Dijkshoorn, Rijksmuseum Amsterdam)
      • Annotating large amounts of media objects
        – Rijksmuseum prints
      • Beyond crowdsourcing
      • Find niche groups with required expertise
      • Use the knowledge of expert laymen
      • Cluster tasks based on topic
  33. Birdwatching @ Rijksmuseum. Niche: birdwatchers
  34. CLARIAH Media Suite
      • http://mediasuite.clariah.nl/
      • Media Scholars
      • Combines datasets and analysis tools in an integrated workspace
  35. CLARIAH Media Suite: annotation
  36. Moving away from silos (image: CC-by-nc-nd https://www.flickr.com/photos/joinash/)
  37. Linked Data. "Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net"
  38. ...TO CONTEXT: MUTUALLY CONNECTED COLLECTIONS... Internal: video hyperlinking. Connecting collections (topics, people, genres, etc.): Catalogue; Photos; B&G Wiki; Programme guides
  39. External: Networked heritage
  40. Why Linked Open Data: machine-readable format; standardized; flexibility to connect heterogeneous data; link what can be linked; re-use and re-usability. OBJECT, EVENT, PLACE, TIME, PERSON, CONCEPT, PROVENANCE
  41. Example: Linking Dutch and Flemish collections. Alignment: VRT thesaurus, GTAA
  42. Application: Browsing Dutch-Flemish connections. https://youtu.be/iOJvcHRfvDY?t=42
  43. Wikidata
      • “Collaboratively edited knowledge base”
      • Drives ‘facts’ in Wikipedia
      • 50 million items, 500 million statements (= triples)
      • Wikidata query service
        – https://query.wikidata.org/
      • Partially aligned with GTAA using Mix ‘n’ Match tool
        – Persons
  44. Internal and External use cases. Tim de Bruyn and John Brooks
  45. Wikidata retrieval service for the CLARIAH media suite
      • Based on interviews with five users of the Media Suite, focusing on 1) Drugs, 2) Sports, 3) Occupations, 4) History and 5) Disruptive media events
      • Exploratory search by properties
      • Send SPARQL query to Wikidata Query Service
      • Retrieve list of persons based on properties
      • View additional information (Wikidata/GTAA)
      • Exploratory search
  46. Exploratory extension of the CLARIAH Media Suite
  47. Validation
      ● Interviewees
      ● Four tasks
        ○ Sports, politics, disruptive media events
      ● Share feedback
        ○ Discuss limitations
        ○ Propose improvements
      ● Added value for exploratory search
      ● Provides insight into background knowledge
      ● Participants report feeling that they grasp the context
      ● Data (in)completeness is a major issue
  48. DIVE INTO THE EVENT-BASED BROWSING OF LINKED HISTORICAL MEDIA
  49. Support exploratory, event-centric browsing of multiple, heterogeneous collections for Media Scholars
  50. DIVE+: access to integrated online multimedia collections using Linked Open Data. Interactive Exploration & Discovery in Context: linking objects to events and entities; building automatic storylines (narratives)
  51. DIVE+ Case study
  52. FOUR DATA SOURCES
      • OPENIMAGES.EU: 3,220 news broadcasts; Netherlands Institute for Sound & Vision; GTAA thesaurus
      • DELPHER.NL: 197,199 scans of radio bulletins, 1937 – 1984
      • AMSTERDAM MUSEUM: 73,447 cultural heritage objects; AM Thesaurus
      • TROPENMUSEUM: 78,270 cultural heritage objects; SVNC thesaurus
  53. Hybrid Enrichment Pipeline: ENTITY EXTRACTION; EVENTS CROWDSOURCING AND LINKING TO CONCEPTS THROUGH CROWDTRUTH.ORG; SEGMENTATION & KEYFRAMES; LINKING EVENTS AND CONCEPTS TO KEYFRAMES. CROWDTRUTH.ORG
  54. Crowdsourcing for Events in Texts & Videos. CrowdTruth.org
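  55. NER event extraction with the FROG NLP toolkit (Victor Kramer; https://languagemachines.github.io/frog/). Description -> Event:
      • "Foto is genomen tijdens de Eerste Zuid Nieuw-Guinea Expeditie" (Photo was taken during the First South New Guinea Expedition) -> Eerste Zuid Nieuw-Guinea Expeditie
      • "Foto is genomen tijdens de Eerste- of de Tweede Zuid Nieuw-Guinea Expeditie" (Photo was taken during the First or the Second South New Guinea Expedition) -> Tweede Zuid Nieuw-Guinea Expeditie
      • "Masker gedragen tijdens oogstfeesten. Het feest in kwestie is het Sokari spel dat eenmaal per jaar wordt opgevoerd gedurende zeven opeenvolgende nachten na Nieuwjaar, medio april. …" (Mask worn during harvest festivals. The festival in question is the Sokari play, which is performed once a year during seven consecutive nights after New Year, mid-April. …) -> Nieuwjaar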
  56. Linking via SKOS thesauri results in a Media Knowledge Graph
  57. DIVE+ Enrichments

      |       | Enrichment method              | Media Objects | Actors  | Places | Events  | Other   | Alignments |
      |-------|--------------------------------|---------------|---------|--------|---------|---------|------------|
      | OI    | Crowd + NER                    | 3,204         | 1,249   | 1,412  | 1,916   | 185,846 | 623        |
      | NB    | Interpreted + NER              | 197,200       | 194,890 | 54,571 | 197,200 | 6,736   | 6,353      |
      | AM    | original thesaurus             | 73,447        | 66,966  | 5,973  | 148     | 28,047  | 6,865      |
      | TM    | original thesaurus + FROG NER  | 78,226        | 27,829  | 3,896  | 23*     | 13,269  | -          |
      | Total |                                | 352,077       | 290,934 | 65,852 | 199,264 | 233,898 | -          |

      *) more to come
  58. DIVE+ UI. API Layer: https://github.com/CLARIAH/grlc
  59. DIVE+ UI: INFINITY OF EXPLORATION (http://diveplus.beeldengeluid.nl)
      • Support exploration and serendipity
      • Visual inspection of media objects and entities
      • Lets user build, save and share Proto-Narratives
  60. Take home
      • Cultural Heritage Organisations (Libraries, Museums, Archives) are becoming more Open, Smart, Connected
      • Continuous enrichment and linking of heterogeneous collections brings new possibilities for access, analysis
      • Combination of semi-automatic methods, crowd- and nichesourcing, expert annotations -> human(s) in the loop
      • Shift from tech push to user needs. We show added value for a variety of users
      • Data completeness and Quality are (and will remain) key
  61. Thank you: Lora Aroyo, Roeland Ordelman, Tim de Bruyn, John Brooks, Willem Melder, Maarten Brinkerink, Jesse de Vos, Ashwin Stacia, Chris Dijkshoorn
  62. Thank you. v.de.boer@vu.nl; http://victordeboer.com; @victordeboer
  63. For more detailed and technical information:
      • Connecting Dutch and Flemish thesaurus:
        – http://www.victordeboer.com/digital-humanities/sound-and-vision/connecting-collections-across-national-borders/ (also links to a scientific paper)
      • Enriching videos from Subtitles:
        – http://www.victordeboer.com/digital-humanities/sound-and-vision/paper-about-automatic-labeling-in-ijdl/ (also links to a scientific paper)
      • Waisda Game with a purpose:
        – https://github.com/beeldengeluid/waisda
        – http://blog.waisda.nl/
      • Axes visual search (image recognition):
        – http://labs.beeldengeluid.nl/application/6ade370a-1b50-11e5-b980-005056a71e3a
        – https://github.com/kencoken/axes-lite
      • DIVE+
        – http://diveplus.beeldengeluid.nl
        – http://diveproject.beeldengeluid.nl
      • CLARIAH Media Suite
        – http://mediasuite.clariah.nl
      • More at http://labs.beeldengeluid.nl/
  64. More CLARIAH Media Suite. Roeland Ordelman, Carlos Martinez, et al.
  65. filters; results ordering
  66. filter on media objects; order media objects by date
  67. filter on events
  68. explore event-related entities
  69. explore event; event-related entities
  70. place entity exploration
  71. narrative
  72. bookmarking
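
The term-extraction steps named on slide 22 (remove stopwords, build an n-gram frequency list, keep entries with frequency > T, match with the thesaurus) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Sound and Vision web service: the stopword set, the thesaurus dictionary with its identifiers, and all function names are hypothetical, and the NER branch and the frequency normalization are left out.

```python
import re
from collections import Counter

# Hypothetical toy resources. The slides mention Dutch stopword lists and the
# GTAA thesaurus (via ElasticSearch); these stand-ins are assumptions.
STOPWORDS = {"de", "het", "een", "in", "en", "van", "is", "op"}
THESAURUS = {"ezel": "example:ezel", "den haag": "example:den-haag"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, max_n=2):
    """Yield every n-gram of length 1..max_n as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def extract_terms(subtitles, threshold=1):
    """Map n-grams with frequency > threshold to their thesaurus entries."""
    tokens = [t for t in tokenize(subtitles) if t not in STOPWORDS]
    counts = Counter(ngrams(tokens))
    return {
        term: THESAURUS[term]
        for term, freq in counts.items()
        if freq > threshold and term in THESAURUS
    }


subtitles = "De ezel loopt in Den Haag. Een ezel in Den Haag is de ezel van het jaar."
terms = extract_terms(subtitles, threshold=1)
print(terms)
```

Because stopwords are removed before the n-grams are built, an n-gram can span a removed word; a production pipeline may prefer to build n-grams first and filter afterwards.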
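
Slide 23 describes consensus between players as the goal of the Waisda? game and as a filter on the tags. One way to picture such a filter is sketched below: a tag counts as matched when two different players enter it close together in time. The 10-second window, the lowercase normalization and the function name are assumptions made for illustration; the slides do not specify the matching rule.

```python
from collections import defaultdict


def matched_tags(entries, window=10.0):
    """Return tags entered by at least two different players within `window` seconds.

    `entries` is an iterable of (player, tag, time_in_seconds) tuples.
    The window length is an assumed parameter, not taken from the slides.
    """
    by_tag = defaultdict(list)
    for player, tag, t in entries:
        by_tag[tag.strip().lower()].append((t, player))

    matched = set()
    for tag, hits in by_tag.items():
        hits.sort()  # order by time so the inner loop can stop early
        for i, (t1, p1) in enumerate(hits):
            for t2, p2 in hits[i + 1:]:
                if t2 - t1 > window:
                    break
                if p1 != p2:
                    matched.add(tag)
    return matched


entries = [
    ("ann", "Ezel", 12.0),
    ("bob", "ezel", 15.5),       # same tag, other player, 3.5 s later: match
    ("ann", "boerderij", 20.0),
    ("ann", "boerderij", 22.0),  # same player twice: no consensus
    ("bob", "kar", 90.0),
    ("cat", "kar", 300.0),       # too far apart in time
]
consensus = matched_tags(entries)
print(consensus)
```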
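
Slide 45 lists "Send SPARQL query to Wikidata Query Service" and "Retrieve list of persons based on properties" as steps of the retrieval service. The sketch below only builds such a query and the request URL; it does not send anything. It is an illustration, not the Media Suite implementation: the function names are hypothetical, and the example property and value (P106 occupation, Q82955 politician) are chosen here for demonstration.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def persons_by_property_query(prop, value, limit=25):
    """Build a SPARQL query for humans (Q5) whose `prop` is set to `value`."""
    return (
        "SELECT ?person ?personLabel WHERE {\n"
        "  ?person wdt:P31 wd:Q5 ;\n"
        f"          wdt:{prop} wd:{value} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(query):
    """Return the GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


# Example (assumed for illustration): persons with occupation (P106) politician (Q82955).
query = persons_by_property_query("P106", "Q82955")
url = request_url(query)
print(query)
print(url)
```

Fetching `url` with any HTTP client returns the result bindings as JSON; the persons found could then be looked up in GTAA where an alignment exists, as the slide suggests.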
