Indexing and searching
        of noisy data

              Franciska de Jong
           University of Twente           Erasmus University
cluster Human Media Interaction           Erasmus Studio for e-research
     Enschede, The Netherlands            Rotterdam, The Netherlands
         http://hmi.ewi.utwente.nl/~fdejong




                   IMPACT Closing Event - The Hague                       1
Overview

Part I: Noisy data analysis – other examples
Part II: Emerging scenarios of scholarly use
Part III: From noisy (meta)data towards
          metadata mining




Noisy Channel for Spelling Correction




                                        J&M Figure 5.23

noise: limitations in spelling skills
Noisy Channel for Speech Recognition




                                       J&M Figure 9.2

noise: limitations in sound captured
Noisy Channel for Machine Translation




                                                 J&M Figure 25.15
noise: loss of information through translation
Noisy Channel for OCR




                                             J&M Figure 5.23

noise: loss of information through typesetting/handwriting
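The four noisy-channel settings above share one decoding rule: given an observation o, pick the source w that maximises P(o | w) · P(w). A minimal Python sketch of that rule, assuming a fixed candidate list; the function name and all probabilities are invented for illustration and are not taken from the slides or from J&M:

```python
def noisy_channel_decode(observed, candidates, channel_prob, prior_prob):
    """Return the candidate w that maximises P(observed | w) * P(w)."""
    return max(candidates, key=lambda w: channel_prob(observed, w) * prior_prob(w))


# Toy numbers, made up for this example only.
PRIOR = {"across": 0.0003, "actress": 0.00002, "acres": 0.00003}
CHANNEL = {
    ("acress", "across"): 0.000009,
    ("acress", "actress"): 0.0001,
    ("acress", "acres"): 0.00003,
}

best = noisy_channel_decode(
    "acress",
    list(PRIOR),
    lambda o, w: CHANNEL.get((o, w), 0.0),  # channel model P(o | w)
    lambda w: PRIOR.get(w, 0.0),            # source model P(w)
)
print(best)
```

Only the two models change between spelling correction, speech recognition, machine translation and OCR; the argmax stays the same.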
Decoding spoken audio
• Audio modelling: collect data on the ground
  truth for audio segments
• Language modelling: collect data on
  co-occurrences of words
• 100 hours of speech
• Text data (500 M words)

There is no data like more data
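Collecting co-occurrence data for language modelling can be sketched as bigram counting. The example below is a toy under stated assumptions: whitespace tokenisation, add-one smoothing, and a two-sentence corpus invented for illustration; a real model would be trained on text at the scale mentioned above.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count word co-occurrences and return an add-one smoothed P(word | prev)."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        context_counts.update(tokens[:-1])              # every token that has a successor
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))

    def prob(prev, word):
        # Add-one smoothing keeps unseen pairs above zero probability.
        return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

    return prob


prob = train_bigram_model(["the archive is open", "the archive is large"])
print(prob("the", "archive"), prob("the", "open"))
```

With more text the counts become less sparse and the smoothing matters less, which is the point of the slogan above.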
After decoding
• multiple hypotheses with varying probabilities
  of being correct
• selection from n-best list: errors unavoidable
• post-editing can be an option, but never
  without extra costs
  – time (editors), money (editing platform)
  – complexity of workflow
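Selection from an n-best list can be sketched as follows. The hypotheses and log scores are invented for illustration; the function only normalises the scores over the list, so the resulting posteriors are relative to the hypotheses present, not absolute probabilities of being correct.

```python
import math


def nbest_posteriors(nbest):
    """Turn (hypothesis, log_score) pairs into posteriors, best hypothesis first."""
    top = max(score for _, score in nbest)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [(hyp, math.exp(score - top)) for hyp, score in nbest]
    total = sum(weight for _, weight in weights)
    ranked = [(hyp, weight / total) for hyp, weight in weights]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


ranked = nbest_posteriors([
    ("recognise speech", -10.0),
    ("wreck a nice beach", -10.5),
    ("recognise peach", -12.0),
])
for hypothesis, posterior in ranked:
    print(f"{posterior:.2f}  {hypothesis}")
```

Even the top hypothesis keeps a posterior well below 1 here, which is one way to see why errors are unavoidable when a single transcript is selected.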


Impact of noise on access tasks
• Content/metadata with a certain amount of
  errors
• Search with reduced accuracy:
  – missed hits (false negatives)
  – incorrect hits (false positives; ‘noise’)
• Noisy data less suited for presentation layer
  – pdf versus ascii
  – original audio versus transcript; alternatives: word
    clouds, related content
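The two kinds of search error map onto the standard precision and recall measures: false positives lower precision, false negatives lower recall. A small sketch with invented document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a result set against the set of relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# "d5" is an incorrect hit (false positive); "d3" and "d4" are missed hits (false negatives).
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d5"],
    relevant=["d1", "d2", "d3", "d4"],
)
print(precision, recall)
```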
Access to interviews: transcript generation
[Diagram. Components shown: multimedia interview archive with metadata; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; automatic metadata extraction (summarization, text mining, tagging); search engine with query and result presentation; users: general public, archivists, researchers.]
Optimization Strategies (1)
• Error correction: post-editing, better
  recognition
• Improved recognition
  – typically effective for core collections (WER below
    20%)
  – less effective for the long tail
Case: interviews with Willem Frederik Hermans
• With models for news: 81% WER
• Aim: reduction to around 60%
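Word error rate (WER) is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recogniser output, divided by the number of reference words. A minimal sketch, assuming whitespace tokenisation and an example sentence pair invented for illustration:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1        # reference word missing from hypothesis
            insertion = current[j - 1] + 1    # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate(
    "the interview was recorded in amsterdam",
    "the interview is recorded amsterdam",
)
print(f"{wer:.0%}")
```

Because insertions count as errors, WER can exceed 100% on very poor output.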
Optimization Strategies (2)
• Dedicated/task-specific evaluation
  – for search applications, errors in function words are
    less critical than errors in e.g. names of persons
    and locations
• Dedicated weighting schemes for search tasks
  – assign confidence scores to fragments found and
    rerank search results accordingly
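One possible weighting scheme for search, sketched under assumptions: each hit carries a retrieval score and per-word recogniser confidences, and the two are combined by simple multiplication with the mean confidence. The field names and the combination rule are hypothetical choices for illustration, not the scheme used in the project.

```python
def rerank_by_confidence(hits):
    """Sort hits by retrieval score weighted with mean recognition confidence."""
    def weighted_score(hit):
        confidences = hit["word_confidences"]
        mean_confidence = sum(confidences) / len(confidences) if confidences else 0.0
        return hit["match_score"] * mean_confidence

    return sorted(hits, key=weighted_score, reverse=True)


hits = [
    {"fragment": "A", "match_score": 0.9, "word_confidences": [0.3, 0.4]},
    {"fragment": "B", "match_score": 0.8, "word_confidences": [0.9, 0.95]},
]
reranked = rerank_by_confidence(hits)
print([hit["fragment"] for hit in reranked])
```

Fragment B overtakes A because its matched words were recognised with much higher confidence, although its raw match score is lower.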



Access to interviews: support for users
[Diagram. Components shown: multimedia interview archive with metadata; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; automatic metadata extraction (summarization, text mining, tagging); search engine with query and result presentation; users: general public, archivists, researchers.]
Part II
Emerging scenarios of scholarly use




DLs and knowledge discovery
• Focus of attention for analysis is no longer the
  document alone.
• Room for statistical methods to analyse entire
  collections, archives, libraries.
• Tools that automatically detect and capture
  various semantic layers and feed the patterns
  found back into the metadata structures.
• Discovery versus item finding: room for
  serendipity and data-driven content
  exploration.
Paradigm evolution
                           Science examples              Information studies examples
Experimental work          direct observation            interpretation/decoding of texts
Theoretical modeling       E = mc^2; a^2 + b^2 = c^2     S → NP VP; Principle of Compositionality
Computational modeling     change simulation             GIS for visualisation of mobility patterns
Data-intensive computing   particle physics, astronomy   text mining: cross-document entity linking for
                                                         cultural heritage libraries; rule-based parsing
                                                         of large corpora (typology studies)
More than search: metadata
extraction
• For large-scale digital (distributed) collections the
  potential added value of automatically generated
  metadata is becoming more and more apparent.
• Automatic content labeling:
   – not just a matter of speeding up the annotation process and
     enlarging the scope of analysis, also
   – starting point for generating annotation layers at collection
     level, and
   – basis for link structures for all kinds of semantic aspects of
     content, such as chronological trends, topic shifts, style and
     authenticity.
   – potentially noisy
“Multi”-issues for DL metadata (1)
• Multi-layer
  – beyond tombstone: content description at
    fragment level (full text, full content, etc.)
  – free text annotation versus thesaurus-based
    labeling
• Multiple media formats
  – text, text, text
  – spoken audio, video, still images, music, scores,
    numerical data, sensor data, census data, etc.
“Multi”-issues for DL metadata (2)
• Multiple perspectives
  – cover more than local context
  – cover more than one domain perspective
  – cover more than one language
• Multiple values due to uncertainty
  – multiple human annotators
  – automatic labeling extracted from potentially
    noisy data
  – dynamics in collection/context
Scholarly use
• Comparative perspective
  – Quantitative and qualitative issues
• Need for enhanced content presentation:
  – Multiple layers
  – Links to context
  – Links to related content
• Emerging methodological shift
  – Enhanced collection exploration (think of Google
    n-grams)

Part III
From noisy data/metadata towards metadata
mining




Metadata mining: crucial steps
• Treat all annotation types (classical
  metadata, automatically extracted
  metadata, scholarly annotation, community
  tagging) as assets.
• Learn how to integrate the various types and
  layers to enhance accessibility and to be able to
  exploit the knowledge captured in metadata
  – Exploiting manual annotation for machine learning
    training
  – Detection of collection-level semantic features
  – Innovative interface and interaction design
What can metadata mining bring?
• Quality added to metadata for increased accessibility
  of content:
   – structured search (full text + classification-based)
   – navigation across collections, rich presentation layers
• Increased insight in relations between data
  collections (across media types, languages, etc.)
• Increased understanding of knowledge production
  as captured by metadata and annotation processing
• Support for capturing the essence of association and
  analogy.
There is no data like metadata!
Issues for metadata models
Old
• annotation interoperability (e.g., metadata
  integration for content annotated with coding
  tools such as thesauri and ontologies)
New
• how to capture fuzziness and uncertainty coming
  from multiple sources and/or statistical
  processing
• coding of change over time (e.g., metadata for
  the dynamics of temporal and geo-spatial details)
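The two new issues can be made concrete with a small data-structure sketch: a metadata field that holds several candidate values, each with its source, a confidence, and an optional validity period. All class names, field names and example values below are hypothetical and only illustrate one way such a model could look.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataValue:
    value: str
    source: str                       # e.g. "catalogue record", "automatic tagging"
    confidence: float                 # 0.0 (pure guess) to 1.0 (certain)
    valid_from: Optional[int] = None  # first year the value applies, if known
    valid_to: Optional[int] = None    # last year the value applies, if known


@dataclass
class MetadataField:
    name: str
    values: List[MetadataValue] = field(default_factory=list)

    def add(self, value, source, confidence, valid_from=None, valid_to=None):
        self.values.append(MetadataValue(value, source, confidence, valid_from, valid_to))

    def at(self, year):
        """Values whose validity period covers the given year."""
        return [v for v in self.values
                if (v.valid_from is None or v.valid_from <= year)
                and (v.valid_to is None or year <= v.valid_to)]

    def best(self, year=None):
        """Most confident value, optionally restricted to one year."""
        pool = self.values if year is None else self.at(year)
        return max(pool, key=lambda v: v.confidence, default=None)


location = MetadataField("location")
location.add("Town A", "catalogue record", 0.9, valid_to=1950)
location.add("Town B", "catalogue record", 0.9, valid_from=1951)
location.add("Town C", "automatic tagging", 0.4)
print(location.best(1930).value, location.best(2000).value)
```

Keeping all candidate values, rather than collapsing them to one, lets a search or presentation layer decide later how much uncertainty to expose.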

Issues for scholarly users
Individual level
• Learn to deal with imperfection
• Understand the limitations of technological
  innovation
Community level
• Stay tuned with developers
• Organize methodology teaching
• Study emerging practices
• Share success stories
Issues for developers
• Learn about scholarly practices
• Stay tuned with users during the entire
  process
• Organize structured feedback loops
• Study best practices
• Share responsibility for centers of expertise



Issues for e-humanities
• e-humanities is e-research
• multiple media, multiple platforms
• keep connecting!




Contact
• email:
  f.m.g.dejong@utwente.nl or
  fdejong@ese.eur.nl
• url:     http://hmi.ewi.utwente.nl/~fdejong




                  IMPACT Closing Event - The Hague   28

Más contenido relacionado

Destacado

Destacado (6)

National library of the netherlands judith rog
National library of the netherlands   judith rogNational library of the netherlands   judith rog
National library of the netherlands judith rog
 
IMPACT Final Conference - Richard Boulderstone
IMPACT Final Conference - Richard BoulderstoneIMPACT Final Conference - Richard Boulderstone
IMPACT Final Conference - Richard Boulderstone
 
BIT Alpha - ICoC
BIT Alpha - ICoCBIT Alpha - ICoC
BIT Alpha - ICoC
 
Neural Network Language Models for Candidate Scoring in Multi-System Machine...
 Neural Network Language Models for Candidate Scoring in Multi-System Machine... Neural Network Language Models for Candidate Scoring in Multi-System Machine...
Neural Network Language Models for Candidate Scoring in Multi-System Machine...
 
Statistical Machine Translation for Language Localisation
Statistical Machine Translation for Language LocalisationStatistical Machine Translation for Language Localisation
Statistical Machine Translation for Language Localisation
 
IMPACT Final Conference - Claus Gravenhorst
IMPACT Final Conference - Claus GravenhorstIMPACT Final Conference - Claus Gravenhorst
IMPACT Final Conference - Claus Gravenhorst
 

Similar a NoisyDataAnalysisMetadataMining

Integrating digital traces into a semantic enriched data
Integrating digital traces into a semantic enriched dataIntegrating digital traces into a semantic enriched data
Integrating digital traces into a semantic enriched dataDhaval Thakker
 
Big data 4 webmonday
Big data 4 webmondayBig data 4 webmonday
Big data 4 webmondayDaniel Koller
 
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation FrameworkBL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation FrameworkIMPACT Centre of Competence
 
Semi-automated metadata extraction in the long-term
Semi-automated metadata extraction in the long-termSemi-automated metadata extraction in the long-term
Semi-automated metadata extraction in the long-termPERICLES_FP7
 
Moving forward data centric sciences weaving AI, Big Data & HPC
Moving forward data centric sciences  weaving AI, Big Data & HPCMoving forward data centric sciences  weaving AI, Big Data & HPC
Moving forward data centric sciences weaving AI, Big Data & HPCGenoveva Vargas-Solar
 
20120411 travelalliancemcguinnessfinal
20120411 travelalliancemcguinnessfinal20120411 travelalliancemcguinnessfinal
20120411 travelalliancemcguinnessfinalDeborah McGuinness
 
Auto Mapping Texts for Human-Machine Analysis and Sensemaking
Auto Mapping Texts for Human-Machine Analysis and SensemakingAuto Mapping Texts for Human-Machine Analysis and Sensemaking
Auto Mapping Texts for Human-Machine Analysis and SensemakingShalin Hai-Jew
 
SWiM – A wiki for collaborating on mathematical ontologies
SWiM – A wiki for collaborating on mathematical ontologiesSWiM – A wiki for collaborating on mathematical ontologies
SWiM – A wiki for collaborating on mathematical ontologiesChristoph Lange
 
SWiM – A Semantic Wiki for Mathematical Knowledge Management
SWiM – A Semantic Wiki for Mathematical Knowledge ManagementSWiM – A Semantic Wiki for Mathematical Knowledge Management
SWiM – A Semantic Wiki for Mathematical Knowledge ManagementChristoph Lange
 
Archiving and managing a million or more data files on BiG Grid
Archiving and managing a million or more data files on BiG GridArchiving and managing a million or more data files on BiG Grid
Archiving and managing a million or more data files on BiG Gridpkdoorn
 
Web Annotations – A Game Changer for Language Technology?
Web Annotations – A Game Changer for Language Technology?Web Annotations – A Game Changer for Language Technology?
Web Annotations – A Game Changer for Language Technology?Georg Rehm
 
ALOE - Combining User Generated Content and Traditional Metadata
ALOE - Combining User Generated Content and Traditional MetadataALOE - Combining User Generated Content and Traditional Metadata
ALOE - Combining User Generated Content and Traditional MetadataMartin Memmel
 
The VESTA Platform: Video Evaluation System for Task Analysis
The VESTA Platform: Video Evaluation System for Task AnalysisThe VESTA Platform: Video Evaluation System for Task Analysis
The VESTA Platform: Video Evaluation System for Task AnalysisAgence du Numérique (AdN)
 
Taming digital traces for informal learning dhaval
Taming digital traces for informal learning  dhavalTaming digital traces for informal learning  dhaval
Taming digital traces for informal learning dhavalDhavalkumar Thakker
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptbutest
 
Annotation seminar
Annotation seminarAnnotation seminar
Annotation seminarhozifa1010
 
Sem tech in CH, Linked Data Meetup, 2014-08-21, Malmo, Sweden
Sem tech in CH, Linked Data Meetup, 2014-08-21, Malmo, SwedenSem tech in CH, Linked Data Meetup, 2014-08-21, Malmo, Sweden
Sem tech in CH, Linked Data Meetup, 2014-08-21, Malmo, SwedenVladimir Alexiev, PhD, PMP
 

Similar a NoisyDataAnalysisMetadataMining (20)

Integrating digital traces into a semantic enriched data
Integrating digital traces into a semantic enriched dataIntegrating digital traces into a semantic enriched data
Integrating digital traces into a semantic enriched data
 
Bne impact iif
Bne impact iifBne impact iif
Bne impact iif
 
Big data 4 webmonday
Big data 4 webmondayBig data 4 webmonday
Big data 4 webmonday
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation FrameworkBL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework
 
Semi-automated metadata extraction in the long-term
Semi-automated metadata extraction in the long-termSemi-automated metadata extraction in the long-term
Semi-automated metadata extraction in the long-term
 
Moving forward data centric sciences weaving AI, Big Data & HPC
Moving forward data centric sciences  weaving AI, Big Data & HPCMoving forward data centric sciences  weaving AI, Big Data & HPC
Moving forward data centric sciences weaving AI, Big Data & HPC
 
20120411 travelalliancemcguinnessfinal
20120411 travelalliancemcguinnessfinal20120411 travelalliancemcguinnessfinal
20120411 travelalliancemcguinnessfinal
 
Auto Mapping Texts for Human-Machine Analysis and Sensemaking
Auto Mapping Texts for Human-Machine Analysis and SensemakingAuto Mapping Texts for Human-Machine Analysis and Sensemaking
Auto Mapping Texts for Human-Machine Analysis and Sensemaking
 
SWiM – A wiki for collaborating on mathematical ontologies
SWiM – A wiki for collaborating on mathematical ontologiesSWiM – A wiki for collaborating on mathematical ontologies
SWiM – A wiki for collaborating on mathematical ontologies
 
SWiM – A Semantic Wiki for Mathematical Knowledge Management
SWiM – A Semantic Wiki for Mathematical Knowledge ManagementSWiM – A Semantic Wiki for Mathematical Knowledge Management
SWiM – A Semantic Wiki for Mathematical Knowledge Management
 
Archiving and managing a million or more data files on BiG Grid
Archiving and managing a million or more data files on BiG GridArchiving and managing a million or more data files on BiG Grid
Archiving and managing a million or more data files on BiG Grid
 
Web Annotations – A Game Changer for Language Technology?
Web Annotations – A Game Changer for Language Technology?Web Annotations – A Game Changer for Language Technology?
Web Annotations – A Game Changer for Language Technology?
 
Text Mining
Text MiningText Mining
Text Mining
 
ALOE - Combining User Generated Content and Traditional Metadata
ALOE - Combining User Generated Content and Traditional MetadataALOE - Combining User Generated Content and Traditional Metadata
ALOE - Combining User Generated Content and Traditional Metadata
 
The VESTA Platform: Video Evaluation System for Task Analysis
The VESTA Platform: Video Evaluation System for Task AnalysisThe VESTA Platform: Video Evaluation System for Task Analysis
The VESTA Platform: Video Evaluation System for Task Analysis
 
Taming digital traces for informal learning dhaval
Taming digital traces for informal learning  dhavalTaming digital traces for informal learning  dhaval
Taming digital traces for informal learning dhaval
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
 
Annotation seminar
Annotation seminarAnnotation seminar
Annotation seminar
 
Sem tech in CH, Linked Data Meetup, 2014-08-21, Malmo, Sweden
Sem tech in CH, Linked Data Meetup, 2014-08-21, Malmo, SwedenSem tech in CH, Linked Data Meetup, 2014-08-21, Malmo, Sweden
Sem tech in CH, Linked Data Meetup, 2014-08-21, Malmo, Sweden
 

Más de IMPACT Centre of Competence

Más de IMPACT Centre of Competence (20)

Session6 01.helmut schmid
Session6 01.helmut schmidSession6 01.helmut schmid
Session6 01.helmut schmid
 
Session1 03.hsian-an wang
Session1 03.hsian-an wangSession1 03.hsian-an wang
Session1 03.hsian-an wang
 
Session7 03.katrien depuydt
Session7 03.katrien depuydtSession7 03.katrien depuydt
Session7 03.katrien depuydt
 
Session7 02.peter kiraly
Session7 02.peter kiralySession7 02.peter kiraly
Session7 02.peter kiraly
 
Session6 04.giuseppe celano
Session6 04.giuseppe celanoSession6 04.giuseppe celano
Session6 04.giuseppe celano
 
Session6 03.sandra young
Session6 03.sandra youngSession6 03.sandra young
Session6 03.sandra young
 
Session6 02.jeremi ochab
Session6 02.jeremi ochabSession6 02.jeremi ochab
Session6 02.jeremi ochab
 
Session5 04.evangelos varthis
Session5 04.evangelos varthisSession5 04.evangelos varthis
Session5 04.evangelos varthis
 
Session5 02.tom derrick
Session5 02.tom derrickSession5 02.tom derrick
Session5 02.tom derrick
 
Session5 01.rutger vankoert
Session5 01.rutger vankoertSession5 01.rutger vankoert
Session5 01.rutger vankoert
 
Session4 04.senka drobac
Session4 04.senka drobacSession4 04.senka drobac
Session4 04.senka drobac
 
Session3 04.arnau baro
Session3 04.arnau baroSession3 04.arnau baro
Session3 04.arnau baro
 
Session3 03.christian clausner
Session3 03.christian clausnerSession3 03.christian clausner
Session3 03.christian clausner
 
Session3 02.kimmo ketunnen
Session3 02.kimmo ketunnenSession3 02.kimmo ketunnen
Session3 02.kimmo ketunnen
 
Session3 01.clemens neudecker
Session3 01.clemens neudeckerSession3 01.clemens neudecker
Session3 01.clemens neudecker
 
Session2 04.ashkan ashkpour
Session2 04.ashkan ashkpourSession2 04.ashkan ashkpour
Session2 04.ashkan ashkpour
 
Session2 03.juri opitz
Session2 03.juri opitzSession2 03.juri opitz
Session2 03.juri opitz
 
Session2 02.christian reul
Session2 02.christian reulSession2 02.christian reul
Session2 02.christian reul
 
Session2 01.emad mohamed
Session2 01.emad mohamedSession2 01.emad mohamed
Session2 01.emad mohamed
 
Session1 04.florian fink
Session1 04.florian finkSession1 04.florian fink
Session1 04.florian fink
 

Último

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 

NoisyDataAnalysisMetadataMining

  • 1. Indexing and searching of noisy data Franciska de Jong University of Twente Erasmus University cluster Human Media Interaction Erasmus Studio for e-research Enschede, The Netherlands Rotterdam, The Netherlands http://hmi.ewi.utwente.nl/~fdejong IMPACT Closing Event - The Hague 1
  • 2. Overview. Part I: Noisy data analysis – other examples. Part II: Emerging scenarios of scholarly use. Part III: From noisy (meta)data towards metadata mining.
  • 3. Noisy Channel for Spelling Correction (J&M Figure 5.23). Noise: limitations in spelling skills.
  • 4. Noisy Channel for Speech Recognition (J&M Figure 9.2). Noise: limitations in sound captured.
  • 5. Noisy Channel for Machine Translation (J&M Figure 25.15). Noise: loss of information through translation.
  • 6. Noisy Channel for OCR (J&M Figure 5.23). Noise: loss of information through typesetting/handwriting.
  • 7. Decoding spoken audio • Audio modelling: collect data on the ground truth for audio segments • Language modelling: collect data on co-occurrences of words • 100 hours of speech • Text data (500 M words) There is no data like more data.
  • 8. After decoding • multiple hypotheses with varying probabilities of being correct • selection from n-best list: errors unavoidable • post-editing can be an option, but never without extra costs – time (editors), money (editing platform) – complexity of workflow
  • 9. Impact of noise on access tasks • Content/metadata with a certain amount of errors • Search with reduced accuracy: – missed hits (false negatives) – incorrect hits (false positives; ‘noise’) • Noisy data less suited for presentation layer – pdf versus ascii – original audio versus transcript; alternatives: word clouds, related content
  • 10. Access to interviews: transcript generation. [Diagram; labels: multimedia interview archive; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; automatic metadata extraction (summarization, text mining, tagging); metadata; search engine; query; result presentation; users: general public, archivists, researchers.]
  • 11. Optimization Strategies (1) • Error correction: post-editing, better recognition • Improved recognition – typically effective for core collections (word error rate, WER, below 20%) – less effective for the long tail. Case: interviews with Willem Frederik Hermans • With models for news: 81% WER • Aim: reduction to around 60%
  • 12. Optimization Strategies (2) • Dedicated/task-specific evaluation – for search applications, errors in function words are less critical than errors in e.g. names of persons and locations • Dedicated weighting schemes for search tasks – assign confidence scores to fragments found and rerank search results accordingly
  • 13. Access to interviews: support for users. [Diagram; labels: multimedia interview archive; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; automatic metadata extraction (summarization, text mining, tagging); metadata; search engine; query; result presentation; users: general public, archivists, researchers.]
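
The word error rates quoted on slide 11 (81% with news models, a target of around 60%, below 20% for core collections) follow the usual definition: the minimum number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the evaluation tooling used in the talk; the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (made-up) example: one substitution in four words.
print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors, the value can exceed 1.0 for very poor output. A task-specific variant in the spirit of slide 12 could weight errors in names more heavily than errors in function words, but the weights would have to be chosen for the application.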
  • 14. Part II: Emerging scenarios of scholarly use
  • 15. DLs and knowledge discovery • Focus of attention for analysis is no longer the document alone. • Room for statistical methods to analyse entire collections, archives, libraries. • Tools that automatically detect and capture various semantic layers and feed the patterns found back into the metadata structures. • Discovery versus item finding: room for serendipity and data-driven content exploration.
  • 16. Paradigm evolution (paradigm: science examples / information studies examples) • Experimental work: direct observation / interpretation/decoding of texts • Theoretical modeling: E = mc², a² + b² = c² / S → NP VP, Principle of Compositionality • Computational simulation: change modeling / GIS for visualisation of mobility patterns • Data-intensive computing: particle physics, astronomy / text-mining: cross-document entity linking for cultural heritage libraries, rule-based parsing of large corpora (typology studies)
  • 17. More than search: metadata extraction • For large-scale digital (distributed) collections the potential added value of automatically generated metadata is becoming more and more apparent. • Automatic content labeling: – not just a matter of speeding up the annotation process and enlarging the scope of analysis, also – starting point for generating annotation layers at collection level, and – basis for link structures for all kinds of semantic aspects of content, such as chronological trends, topic shifts, style and authenticity. – potentially noisy
  • 18. “Multi”-issues for DL metadata (1) • Multi-layer – beyond tombstone: content description at fragment level (full text, full content, etc.) – free text annotation versus thesaurus-based labeling • Multiple media formats – text, text, text – spoken audio, video, still images, music, scores, numerical data, sensor data, census data, etc.
  • 19. “Multi”-issues for DL metadata (2) • Multiple perspectives – cover more than local context – cover more than one domain perspective – cover more than one language • Multiple values due to uncertainty – multiple human annotators – automatic labeling extracted from potentially noisy data – dynamics in collection/context
  • 20. Scholarly use • Comparative perspective – Quantitative and qualitative issues • Need for enhanced content presentation: – Multiple layers – Links to context – Links to related content • Emerging methodological shift – Enhanced collection exploration (think of Google n-grams)
  • 21. Part III: From noisy data/metadata towards metadata mining
  • 22. Metadata mining: crucial steps • Treat all annotation types (classical metadata, automatically extracted metadata, scholarly annotation, community tagging) as assets. • Learn how to integrate the various types and layers to enhance accessibility and to be able to exploit the knowledge captured in metadata – Exploiting manual annotation for machine learning training – Detection of collection-level semantic features – Innovative interface and interaction design
  • 23. What can metadata mining bring? • Quality added to metadata for increased accessibility of content: – structured search (full text + classification-based) – navigation across collections, rich presentation layers • Increased insight into relations between data collections (across media types, languages, etc.) • Increased understanding of knowledge production as captured by metadata and annotation processing • Support for capturing the essence of association and analogy. There is no data like metadata!
  • 24. Issues for metadata models. Old • annotation interoperability (e.g., metadata integration for content annotated with coding tools such as thesauri and ontologies) New • how to capture fuzziness and uncertainty coming from multiple sources and/or statistical processing • coding of change over time (e.g., metadata for the dynamics of temporal and geo-spatial details)
  • 25. Issues for scholarly users. Individual level • Learn to deal with imperfection • Understand the limitations of technological innovation. Community level • Stay tuned with developers • Organize methodology teaching • Study emerging practices • Share success stories
  • 26. Issues for developers • Learn about scholarly practices • Stay tuned with users during the entire process • Organize structured feedback loops • Study best practices • Share responsibility for centers of expertise
  • 27. Issues for e-humanities • e-humanities is e-research • multiple media, multiple platforms • keep connecting!
  • 28. Contact • email: f.m.g.dejong@utwente.nl or fdejong@ese.eur.nl • url: http://hmi.ewi.utwente.nl/~fdejong
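
Slides 19 and 24 raise the question of how a metadata model can hold multiple values that stem from uncertainty (several human annotators, automatic labels extracted from noisy data). One possible reading, sketched below, is to store every value together with its source and a confidence score and to rank competing values instead of discarding all but one. The class name, the field names, the example values and the rule of taking the highest confidence per value are all assumptions made for illustration; the talk does not prescribe a model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One metadata statement with its provenance (hypothetical schema)."""
    attribute: str     # e.g. "location"
    value: str         # e.g. "Den Haag"
    source: str        # e.g. "archivist" or "automatic-ner"
    confidence: float  # 0.0 .. 1.0, as reported by the source


def rank_values(annotations):
    """Keep every competing value per attribute, ranked by confidence.

    Assumption: when several sources give the same value, the highest
    confidence among them is used.
    """
    best = defaultdict(dict)
    for ann in annotations:
        current = best[ann.attribute].get(ann.value, 0.0)
        best[ann.attribute][ann.value] = max(current, ann.confidence)
    return {
        attribute: sorted(values.items(), key=lambda kv: (-kv[1], kv[0]))
        for attribute, values in best.items()
    }


# Made-up example: a manual label and two automatic labels for one item.
annotations = [
    Annotation("location", "Den Haag", "archivist", 1.0),
    Annotation("location", "Den Haag", "automatic-ner", 0.6),
    Annotation("location", "Haarlem", "automatic-ner", 0.3),
]
print(rank_values(annotations))
```

Keeping the lower-ranked values lets a search engine or presentation layer decide later how much noise to tolerate, which matches the idea on slide 22 of treating all annotation types as assets.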