NoisyDataAnalysisMetadataMining
1. Indexing and searching
of noisy data
Franciska de Jong
University of Twente, cluster Human Media Interaction, Enschede, The Netherlands
Erasmus University, Erasmus Studio for e-research, Rotterdam, The Netherlands
http://hmi.ewi.utwente.nl/~fdejong
IMPACT Closing Event - The Hague 1
2. Overview
Part I: Noisy data analysis – other examples
Part II: Emerging scenarios of scholarly use
Part III: From noisy (meta)data towards
metadata mining
3. Noisy Channel for Spelling Correction
J&M Figure 5.23
noise: limitations in spelling skills
4. Noisy Channel for Speech Recognition
J&M Figure 9.2
noise: limitations in sound captured
5. Noisy Channel for Machine Translation
J&M Figure 25.15
noise: loss of information through translation
6. Noisy Channel for OCR
J&M Figure 5.23
noise: loss of information through typesetting/handwriting
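The four noisy-channel slides share one decision rule. As a sketch in our own notation (the symbols w, o and \hat{w} are assumptions, not taken from the slides): given an observed, distorted signal o (a misspelled word, an audio signal, a foreign sentence, a scanned page), the decoder picks the source w that most probably produced it.

```latex
\hat{w} = \arg\max_{w} P(w \mid o)
        = \arg\max_{w} \frac{P(o \mid w)\, P(w)}{P(o)}
        = \arg\max_{w} P(o \mid w)\, P(w)
```

P(o | w) models the channel (the noise) and P(w) models the source; P(o) can be dropped because it is the same for every candidate w.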
7. Decoding spoken audio
• Audio modelling: collect data on the ground
truth for audio segments
• Language modelling: collect data on
co-occurrences of words
• 100 hours of speech
• Text data (500 M words)
There is no data like more data
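To make "data on co-occurrences of words" concrete, here is a minimal sketch of a bigram language model estimated by simple counting. The function name and the toy sentence are illustrative assumptions; real systems train on far larger text collections and apply smoothing, which is omitted here.

```python
from collections import Counter
from typing import Dict, List, Tuple


def bigram_probabilities(tokens: List[str]) -> Dict[Tuple[str, str], float]:
    """Estimate P(next | previous) by relative frequency (no smoothing)."""
    # Count how often each word occurs as the first element of a pair.
    first_counts = Counter(tokens[:-1])
    # Count how often each adjacent word pair occurs.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        pair: count / first_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Toy corpus, for illustration only.
model = bigram_probabilities("the cat sat the cat ran".split())
print(model[("the", "cat")])  # 1.0
print(model[("cat", "sat")])  # 0.5
```

With more text, more word pairs are observed and the estimates become more reliable, which is the point of the slogan above.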
8. After decoding
• multiple hypotheses with varying probabilities
of being correct
• selection from n-best list: errors unavoidable
• post-editing can be an option, but never
without extra costs
– time (editors), money (editing platform)
– complexity of workflow
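The selection step can be sketched as follows. This is a hypothetical illustration, not the decoder used in the talk: the n-best list is assumed to be a list of (hypothesis, probability) pairs, and the `margin` threshold for flagging a segment for post-editing is an invented parameter.

```python
from typing import List, Tuple

Hypothesis = Tuple[str, float]  # (transcript text, probability of being correct)


def pick_best(nbest: List[Hypothesis]) -> Hypothesis:
    """Return the hypothesis with the highest probability."""
    return max(nbest, key=lambda h: h[1])


def needs_post_editing(nbest: List[Hypothesis], margin: float = 0.1) -> bool:
    """Flag a segment when the top two hypotheses are nearly tied."""
    ranked = sorted(nbest, key=lambda h: h[1], reverse=True)
    if len(ranked) < 2:
        return False
    return ranked[0][1] - ranked[1][1] < margin


# Invented example list, for illustration only.
nbest = [
    ("recognize speech", 0.42),
    ("wreck a nice beach", 0.40),
    ("recognise peach", 0.18),
]
print(pick_best(nbest)[0])        # recognize speech
print(needs_post_editing(nbest))  # True
```

Even the best-scoring hypothesis can be wrong, so a flag like this only helps to spend the editing budget where the decoder is least certain.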
9. Impact of noise on access tasks
• Content/metadata with a certain amount of
errors
• Search with reduced accuracy:
– missed hits (false negatives)
– incorrect hits (false positives; ‘noise’)
• Noisy data less suited for presentation layer
– pdf versus ascii
– original audio versus transcript; alternatives: word
clouds, related content
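Missed hits and incorrect hits are commonly quantified as recall and precision. A minimal sketch, assuming document identifiers are plain strings and that a set of truly relevant documents is known (the identifiers below are made up):

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision drops with incorrect hits, recall drops with missed hits."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search result on a noisy transcript collection.
retrieved = {"d1", "d2", "d3", "d4"}  # d3 and d4 are incorrect hits
relevant = {"d1", "d2", "d5"}         # d5 is a missed hit
precision, recall = precision_recall(retrieved, relevant)
print(precision)  # 0.5
```

Here two of the four hits are correct (precision 0.5) and two of the three relevant documents are found (recall 2/3).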
10. Access to interviews: transcript generation
[Figure: workflow diagram. Components: multimedia interview archive with metadata; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; automatic metadata extraction (summarization, text mining, tagging); search engine with query and result presentation; users: general public, archivists, researchers.]
11. Optimization Strategies (1)
• Error correction: post-editing, better
recognition
• Improved recognition
– typically effective for core collections (WER below
20%)
– less effective for the long tail
Case: interviews with Willem Frederik Hermans
• With models for news: 81% WER
• Aim: reduction to around 60%
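For readers unfamiliar with the metric: word error rate (WER) is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and equal cost for substitutions, deletions and insertions (the function name and example strings are ours):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (e) on five reference words.
print(word_error_rate("a b c d e", "a x c d"))  # 0.4
```

A WER of 81% therefore means roughly four errors for every five reference words.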
12. Optimization Strategies (2)
• Dedicated, task-specific evaluation
– for search applications, errors in function words are
less critical than errors in e.g. names of persons
and locations
• Dedicated weighting schemes for search tasks
– assign confidence scores to fragments found and
rerank search results accordingly
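One possible way to rerank on confidence is a linear combination of the search engine's relevance score and the recognizer's confidence for the matched fragment. This is a sketch under assumptions: the `Hit` fields, the `weight` parameter and the linear formula are illustrative choices, not the scheme used in the project described.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    fragment_id: str
    relevance: float   # score from the search engine, assumed in [0, 1]
    confidence: float  # recognizer confidence for the fragment, in [0, 1]


def rerank(hits: List[Hit], weight: float = 0.5) -> List[Hit]:
    """Sort hits by a weighted mix of relevance and recognition confidence."""
    def combined(hit: Hit) -> float:
        return (1 - weight) * hit.relevance + weight * hit.confidence
    return sorted(hits, key=combined, reverse=True)


# Invented hits: "a" matches well but was recognized with low confidence.
hits = [Hit("a", relevance=0.9, confidence=0.2),
        Hit("b", relevance=0.8, confidence=0.9)]
print([h.fragment_id for h in rerank(hits)])  # ['b', 'a']
```

Setting `weight` to 0 reproduces the original ranking, so the influence of the confidence scores can be tuned per search task.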
13. Access to interviews: support for users
[Figure: workflow diagram. Components: multimedia interview archive with metadata; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; automatic metadata extraction (summarization, text mining, tagging); search engine with query and result presentation; users: general public, archivists, researchers.]
14. Part II: Emerging scenarios of scholarly use
15. DLs and knowledge discovery
• Focus of attention for analysis is no longer the
document alone.
• Room for statistical methods to analyse entire
collections, archives, libraries.
• Tools that automatically detect and capture
various semantic layers and feed the patterns
found back into the metadata structures.
• Discovery versus item finding: room for
serendipity and data-driven content
exploration.
16. Paradigm evolution
• Experimental work
– science examples: direct observation
– information studies examples: interpretation/decoding of texts
• Theoretical modeling
– science examples: E = mc², a² + b² = c²
– information studies examples: S → NP VP, Principle of Compositionality
• Computational modeling
– science examples: change simulation
– information studies examples: GIS for visualisation of mobility patterns
• Data-intensive computing
– science examples: particle physics, astronomy
– information studies examples: text-mining: cross-document entity linking for cultural heritage libraries; rule-based parsing of large corpora (typology studies)
17. More than search: metadata
extraction
• For large-scale digital (distributed) collections the
potential added value of automatically generated
metadata is becoming more and more apparent.
• Automatic content labeling:
– not just a matter of speeding up the annotation process and
enlarging the scope of analysis, also
– starting point for generating annotation layers at collection
level, and
– basis for link structures for all kinds of semantic aspects of
content, such as chronological trends, topic shifts, style and
authenticity.
– potentially noisy
18. “Multi”-issues for DL metadata (1)
• Multi-layer
– beyond tombstone: content description at
fragment level (full text, full content, etc.)
– free text annotation versus thesaurus-based
labeling
• Multiple media formats
– text, text, text
– spoken audio, video, still images, music, scores,
numerical data, sensor data, census data, etc.
19. Multi-issues for DL metadata (2)
• Multiple perspectives
– cover more than local context
– cover more than one domain perspective
– cover more than one language
• Multiple values due to uncertainty
– multiple human annotators
– automatic labeling extracted from potentially
noisy data
– dynamics in collection/context
20. Scholarly use
• Comparative perspective
– Quantitative and qualitative issues
• Need for enhanced content presentation:
– Multiple layers
– Links to context
– Links to related content
• Emerging methodological shift
– Enhanced collection exploration (think of Google
n-grams)
21. Part III
From noisy data/metadata towards metadata
mining
22. Metadata mining: crucial steps
• Treat all annotation types (classical
metadata, automatically extracted
metadata, scholarly annotation, community
tagging) as assets.
• Learn how to integrate the various types and
layers to enhance accessibility and to be able to
exploit the knowledge captured in metadata
– Exploiting manual annotation for machine learning
training
– Detection of collection-level semantic features
– Innovative interface and interaction design
23. What can metadata mining bring?
• Quality added to metadata for increased accessibility
of content:
– structured search (full text + classification-based)
– navigation across collections, rich presentation layers
• Increased insight into relations between data
collections (across media types, languages, etc.)
• Increased understanding of knowledge production
as captured by metadata and annotation processing
• Support for capturing the essence of association and
analogy.
There is no data like metadata!
24. Issues for metadata models
Old
• annotation interoperability (e.g., metadata
integration for content annotated with coding
tools such as thesauri and ontologies)
New
• how to capture fuzziness and uncertainty coming
from multiple sources and/or statistical
processing
• coding of change over time (e.g., metadata for
the dynamics of temporal and geo-spatial details)
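As one hypothetical way to address the two new issues, a metadata model can store each value as an assertion with its source, a confidence score and an optional validity period, rather than as a single fixed value. All names below (`Assertion`, `values_for`, the field names and sample values) are illustrative assumptions and do not refer to an existing metadata standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Assertion:
    key: str                          # metadata field, e.g. "place"
    value: str
    source: str                       # who or what produced the value
    confidence: float                 # 1.0 = certain, lower = uncertain
    valid_from: Optional[int] = None  # first year the value holds, if known
    valid_to: Optional[int] = None    # last year the value holds, if known


def values_for(assertions: List[Assertion], key: str,
               year: Optional[int] = None,
               min_confidence: float = 0.0) -> List[Assertion]:
    """Return matching assertions, most confident first."""
    matches = []
    for a in assertions:
        if a.key != key or a.confidence < min_confidence:
            continue
        if year is not None:
            if a.valid_from is not None and year < a.valid_from:
                continue
            if a.valid_to is not None and year > a.valid_to:
                continue
        matches.append(a)
    return sorted(matches, key=lambda a: a.confidence, reverse=True)


# Placeholder records, for illustration only.
records = [
    Assertion("place", "name-A", source="archivist", confidence=1.0, valid_to=1950),
    Assertion("place", "name-B", source="tagger", confidence=0.6, valid_from=1951),
    Assertion("place", "name-C", source="tagger", confidence=0.3),
]
print([a.value for a in values_for(records, "place", year=1940)])
```

Keeping all assertions side by side lets an interface show competing values with their provenance instead of silently choosing one.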
25. Issues for scholarly users
Individual level
• Learn to deal with imperfection
• Understand the limitations of technological
innovation
Community level
• Stay tuned with developers
• Organize methodology teaching
• Study emerging practices
• Share success stories
26. Issues for developers
• Learn about scholarly practices
• Stay tuned with users during the entire
process
• Organize structured feedback loops
• Study best practices
• Share responsibility for centers of expertise
27. Issues for e-humanities
• e-humanities is e-research
• multiple media, multiple platforms
• keep connecting!
28. Contact
• email:
f.m.g.dejong@utwente.nl or
fdejong@ese.eur.nl
• url: http://hmi.ewi.utwente.nl/~fdejong