PoliMedia Symposium
  Linking the data sets


  Damir Juric (TU Delft)
 Amsterdam, 23.01.2013.
Background: the PoliMedia project
• The PoliMedia project:
   – driven by research questions from historians
   – interested in media coverage across several types of
     media outlets
   – Cross-media comparisons
      • conducted over a longer period of time, on different topics
      • focus on the coverage of the debates in the Dutch parliament
      • insight on the different choices that different media make while
        reporting on those debates
   – three phases:
      • modeling phase: creating a semantic model
      • data production phase: creating links between debates and
        associated media sources
      • application phase: searching and navigating linked datasets
Introduction
• The PoliMedia semantic model needs to represent:
  – people
  – topics
  – time
  – media types
• Model has to be expressive enough:
  – describing events from the Dutch parliament
Data Sets
• Primary data set:
  – The Dutch parliamentary debates
    (Handelingen der Staten-Generaal,
    or Dutch Hansard)
  – transcripts of speeches that
    politicians gave in parliament
  – this project uses data from the
    Political Mashup project
  – all debates until the year 1995:
     • published as XML documents (OCR
       of satisfactory quality is
       used).
     • data shows a fine-grained structure.
Data Sets
• Secondary data set:
  – different media types:
     • newspaper articles and radio
       bulletins
        – National Library of the
          Netherlands
     • newscasts
        – evening news and current affairs
          programs
Semantic model: Goals
• Goal of the project:
  – to publish the links on the Web
  – to use open Web formats and standards
  – to use a Web query language
  – to use unique identifiers (URIs)
• Model has to be expressive:
  – important information regarding parliamentary
    debates should be easily accessed
Debate: The structure

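[Figure: structure of a debate. A debate carries debate metadata and is divided into
topics (Topic 1, Topic 2, ...). Each topic holds a sequence of speeches (Speaker 1 /
Content, Speaker 2 / Content, Speaker 3 / Content). Metadata with named entities (NEs)
is attached to the topic description and to the individual speeches.]

Example annotations shown in the figure:
• topic description: "Aan de orde is de behandeling van: de brief van de minister van
  Economische Zaken inzake Borssele (16226, nr. 26). De beraadslaging wordt geopend."
  with metadata NEs={Economische Zaken, Borssele}
• speech metadata: NEs={Borssele, Partij van de Arbeid, D66}
• speech content: "Mijnheer de Voorzitter! Met de verdragen tot uitbreiding van de EEG met
  Denemarken, Engeland, Ierland en Noorwegen wordt een van de doelstellingen van ons
  buitenlands beleid verwezenlijkt."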
Semantic model: Description




Part of semantic model representation of the
debates dataset
Semantic model: Description




Semantic model representation of the debates
dataset
Polimedia linking method

•   The challenge: how to create a representation of the speech that contains enough
    information, so it can be used as a query to retrieve relevant media articles from
    the archive?

•   Debate speeches and newspaper articles generally differ in style and scope, so
    computing plain document similarity between them does not work

•   Speeches can contain a large number of NEs and digressions:
     –   Problem: it is hard to identify the right context for each speech

•   Newspaper articles:
     –   very strict and concise
     –   words are used sparingly
Polimedia linking method

• Our PoliMedia linking method consists of four steps:

    1. topics: enriching the existing debate metadata with topics

    2. preselection of articles: candidate articles are selected by when they were
       published and who spoke in the debate (timeframe and speakers)

    3. automatic query creation: candidate articles are ranked based on similarity to
       the query (automatically created from speech text) by comparing vectors of
       topics and named entities

    4. link creation: links are created between a speech and an article if the
       similarity score is above a threshold t
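
The preselection and link-creation steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Speech` and `Article` records, the 7-day window after the debate, the speaker-name match on the article text, the term-share `score` placeholder, and the threshold value 0.5 are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Article:
    article_id: str
    published: date
    text: str


@dataclass
class Speech:
    speech_id: str
    speaker: str
    debate_date: date
    query_terms: set  # topics and named entities gathered for the speech


def preselect(speech, articles, window_days=7):
    """Step 2: keep articles published within the window that mention the speaker."""
    latest = speech.debate_date + timedelta(days=window_days)
    return [
        a for a in articles
        if speech.debate_date <= a.published <= latest
        and speech.speaker.lower() in a.text.lower()
    ]


def score(speech, article):
    """Step 3 (placeholder): share of query terms that occur in the article."""
    words = set(article.text.lower().split())
    terms = {t.lower() for t in speech.query_terms}
    return len(terms & words) / len(terms) if terms else 0.0


def create_links(speech, articles, threshold=0.5):
    """Step 4: link the speech to every candidate scoring above the threshold."""
    ranked = sorted(
        ((score(speech, a), a.article_id) for a in preselect(speech, articles)),
        reverse=True,
    )
    return [(speech.speech_id, aid, s) for s, aid in ranked if s > threshold]
```

In practice the placeholder `score` would be swapped for one of the similarity measures listed at the end of the deck.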
Topics
•   Topic modeling:
     –   popular tool for the unsupervised analysis of text,
     –   used to check models, summarize the corpus, and guide exploration of its contents
     –   topic models lead to semantically meaningful decompositions of text because they tend to place high probability on
         words that represent concepts

•   Extracting topics from speech:
     –   ten words that represent one topic discussed inside the speech are extracted
     –   all speeches contained inside one debate segment are concatenated into one text
     –   set of ten words that represent one topic of the debate segment as a whole is extracted from that text

•   Input: text, number of iterations, number of topics
•   Output: generic names for topics and the words that cluster around each topic
•   Example:
     – Test case: debate nr. 1975 / number of iterations: 2000 / number of topics: 1
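
The extraction of one ten-word set per speech and per debate segment can be sketched as below. The slides do not name the topic-modeling tool, so this is a stand-in rather than the project's method: with the number of topics set to 1, an LDA-style model's single topic is essentially the smoothed word-frequency distribution of the text, so the sketch simply ranks content words by frequency. The stopword list and the function names are assumptions for illustration.

```python
import re
from collections import Counter

# Tiny illustrative Dutch stopword list (an assumption, not the project's list).
STOPWORDS = {"de", "het", "een", "van", "en", "in", "op", "dat", "is", "te", "met", "voor"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[^\W\d_]+", text.lower()) if w not in STOPWORDS]


def topic_words(text, num_words=10):
    """Return the num_words most frequent content words of one text."""
    return [word for word, _ in Counter(tokenize(text)).most_common(num_words)]


def segment_topic_words(speeches, num_words=10):
    """Concatenate all speeches of a debate segment, then extract one word set."""
    return topic_words(" ".join(speeches), num_words)
```

Calling `topic_words` on a single speech gives the speech-level word set, and `segment_topic_words` on all speeches of a segment gives the segment-level set.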
Automatic query creation
[Figure: the term sets extracted for one speech, shown next to the debate structure
(Debate metadata; Topic 1 with Speaker 1 to 3 / Content; Topic 2 with Speaker 1 / Content).]

• TopicSet Speech: inkomstenbelasting, bronheffing, Kombrink, rente, contourennota,
  Nederland, vereenvoudiging, tarief, word, tussen
• TopicSet Topic: moeten, fraude, wetgeving, sociale, misbruik, ten, fraudebestrijding,
  vraag, gebruik, kamer
• NE Speech: Staatssecretaris, Regering, Euro-kapitaalmarkt, Tariefnota, Financiën,
  Zwitserland, Grave, Brussel, EEG, Netherlandse, Contourennota, Kombrink, Nederland,
  Contou, OESO-verband, Midden-Oosten, Engwirda, Couprie
• NE Topic: misbruik fraudebestrijding ismo-rapport
• Actor: the speaker of the speech
Automatic query creation
Actor: Scholten

Generated query:

    +(text:wetsontwerp text:later text:september text:prijzen text:lonen text:ontwikkeling text:zeggen text:staatssecretaris text:gebracht text:ertoe)
    (title:wetsontwerp title:later title:september title:prijzen title:lonen title:ontwikkeling title:zeggen title:staatssecretaris title:gebracht title:ertoe)
    +(text:staatssecretaris text:huurverhoging text:jaar text:moeten text:april text:uitstel text:percentage text:nieuwe)
    (title:staatssecretaris title:huurverhoging title:jaar title:moeten title:april title:uitstel title:percentage title:nieuwe)
    +(text:regelen text:wet)
    (title:regelen title:wet)
    +text:staatssecretaris
    title:staatssecretaris

Speech text the query was created from:

    Mijnheer de Voorzitter! In de memorie van toelichting bij het voorliggende wetsontwerp zegt de Staatssecretaris, dat hij over het
    trendmatige huurstijgingspercentage voor 1977 nog niets kan zeggen omdat de gegevens over de te verwachten ontwikkeling van lonen en
    prijzen voor 1977 nog niet bekend zijn. Dit is gedateerd 14 september. Impliceert dit, wanneer er geen sprake zou zijn van een
    wetsontwerp tot verschuiving van de ingangsdatum, dan ook ten aanzien van de 8 procent per 1 april zou gelden dat nog afgewacht moet
    worden, of het dat percentage zal worden, omdat men pas later iets meer weet over de ontwikkeling van lonen en prijzen? De
    Staatssecretaris voelt zich door dit wetsontwerp eigenlijk gedwongen op een vrij vroeg tijdstip toch daarover iets te zeggen. Immers,
    een week later namelijk bij brief van 21 september komt hij wel met een bepaald concreet voorstel. Daarin stelt hij: Het overleg met
    de vaste commissie heeft mij ertoe gebracht ...

ExpandedQuery = {NERs Speech, TopicSet Speech, NER Topic, TopicSet Topic}
                + Speaker X = ActorFromSpeech
                + TimeFrame
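
The shape of the generated query can be reproduced with a short sketch. The query on this slide resembles Lucene query-parser syntax, where a leading `+` marks a required clause and an unprefixed clause is optional; the slides do not state which search engine is used, so treat that reading as an assumption. The function names `field_clause` and `build_expanded_query` are hypothetical.

```python
def field_clause(field, terms):
    """Join the terms for one field, e.g. '(text:regelen text:wet)'."""
    body = " ".join(f"{field}:{t}" for t in terms)
    # A single term needs no parentheses, matching the last clause on the slide.
    return f"({body})" if len(terms) > 1 else body


def build_expanded_query(term_sets):
    """Emit a required text clause and an optional title clause per term set."""
    lines = []
    for terms in term_sets:
        if not terms:
            continue  # skip empty sets, e.g. a speech with no named entities
        lines.append("+" + field_clause("text", terms))
        lines.append(field_clause("title", terms))
    return "\n".join(lines)
```

Each term set (NERs Speech, TopicSet Speech, NER Topic, TopicSet Topic) contributes one pair of clauses, so a term must match in the article text while a title match only raises the score.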
Example of the relevant article
vvd: van dam baseert beleid teveel op rossige prognoses van planbureau
kamermeerderheid tegen uitstel van huurverhoging

den haag - een meerderheid van de tweede kamer voelt er niets voor de huurverhoging van volgend jaar uit te, stellen van 1 april tot
1 juli. de fracties van kvp, arp, chu, vvd, ds'7o en de kleine christelijke pairtijen willen de huurverhoging op 1 april laten
doorgaan. staatssecretaris van dam van volkshuisvesting wil uitstel om op 1 juli volgend jaar een nieuw huurbeleid te kunnen invoeren.
daarvoor zal hij op korte tetmijn drie wetsontwerpen indienen: de huurprijzenwet, de wet op de huurcommissie en een wijziging van het
burgerlijk wetboek. de bewindsman zei dat met het afwijzen van uitstel in feite invoering van het nieuwe huurbeleid op 1 juli volgend
jaar onmogelijk wordt gemaakt. het nieuwe stelsel zal dan pas in 1978 ingevoerd kunnen worden. „met een uitstel van drie maanden
komen we precies uit", aldus de heer van dam. de arp'er scholten, die mede namens kvp en chu sprak, zegde de regering alle
medewerking toe om de nieuwe huurwet nog in deze kabinetsperiode te behandelen, maar hij twijfelde eraan of op 1 juli 1977 het
nieuwe huurbeleid al ingevoerd kan worden. de confessionelen en de vvd houden vast aan een huurverhoging van 8 procent op 1 april.
staatssecretaris van dam wil pas op 1 juli zon verhoging. zou de verhoging toch op 1 april moeten ingaan, dan wil hij een verhoging
van 7 procent. de bewindsman komt volgens de confessionelen te vroeg met een verlaging van de jaarlijkse huurverhoging. het
d'66-kamerlid nypels diende een motie in waarin hij de regering verzoekt bij verwerping van het uitstel te komen met een wetsontwerp
voor 7 procent op 1 april. ook de pvda'et kombrink suggereerde deze oplossing. de heer kombrink deed een dringend beroep op de
confessionelen om het uitstel te aanvaarden. de vvder de beer vond dat elr niet voldoende redenen zijn voor uitstel van de
huurverhoging. de staatssecretaris baseert zijn beleid teveel op „de rossige prognoses van het centraal plan bureau", vindt de vvd.
ook de christendemocraten vinden dat van dam teveel van prognoses uitgaat die vaak te laag zijn. de pvda is het met de regering eens
dat de huren op 1 juli met 8 procent omhoog moeten. wijst de kamer dat af, dan moeten de huren op 1 april met 7 procent worden
verhoogd. men moet niet alleen kijken naar de ontwikkeling van lonen en prijzen, men moet ook kijken naar het vrij besteedbare
inkomen. de stijging daarvan zal in de komende jaren uiterst gering zijn", zei kombrink. cpn-woordvoerder draagstra zei dat de huren
bevroren moeten worden op het huidige peil.
Polimedia pipeline
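[Figure: the PoliMedia pipeline. Components shown in the diagram:]
• data: PoliticalMashup (XML), the semantic model, and the resulting RDF files
• query input: Query NE and Query content
• processing of the content: stopword removal, topic modeling
• term sets: NERs Speech, NERs Topic, TopicSet Speech, TopicSet Topic (contextual vectors)
• automatic query creation and expanded query creation
• KB (preselect data) with an SRU query (actor, date range)
• similarity calculation, ranking, filtering
• output: article metadata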
Evaluation
• We tried three different approaches:

      • Experiment 1: NEs in speech

      • Experiment 2: NEs + topics in speech

      • Experiment 3: NEs + topics in speech and debate

• Conclusion:
      • best approach:
             • named entities (speech + debate descriptions) and topics (speech + debate)
Results discussion
• structural elements of the transcript:

      • used to create a complex and rich query from the speech

• treating a particular speech as part of a bigger context (the conversation) and creating a query that is a
 mixture of those elements:

      • a higher number of related articles retrieved

• What did we learn?

      • the definition of a link can be vague

      • simple document similarity methods do not work

      • journalists use their own “compression” methods when writing about debates

      • long speeches with dozens of NEs and topics are sometimes represented by a few concise sentences
End
• Thank you for listening
• more information on polimedia.nl
similarity measures
•    similarity measures: metrics that quantify the similarity or dissimilarity (distance) between two text strings,
     used for approximate string matching or comparison and in fuzzy string searching
•    Given the expanded query Q and a document D from the media archive, each term t is assigned its term frequency
     tf(t, Q) in the query and tf(t, D) in the document. The similarity between Q and D is then computed with the
     cosine similarity formula, whose value lies between 0 and 1:


•    CosineSimilarity(Q,D) = $\frac{\sum_{t} tf(t,Q)\, tf(t,D)}{\sqrt{\sum_{t} tf(t,Q)^2}\,\sqrt{\sum_{t} tf(t,D)^2}}$
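
As a minimal sketch of cosine similarity over raw term-frequency vectors: the function name and the list-of-tokens input are assumptions for illustration, and the project may weight terms differently.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between the term-frequency vectors of Q and D."""
    q, d = Counter(query_terms), Counter(doc_terms)
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(q[t] * d[t] for t in q.keys() & d.keys())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty query or document shares nothing
    return dot / (norm_q * norm_d)
```

Because term frequencies are never negative, the result stays within the range 0 to 1.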


•    BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in
     each document, regardless of the inter-relationship between the query terms within a document (e.g., their
     relative proximity).
    Given a query Q containing keywords t_1, ..., t_n, the BM25 score of a document D is:


•    BM25Score(Q,D) = $\sum_{i=1}^{n} IDF(t_i) \cdot \frac{f(t_i, D)\,(k_1 + 1)}{f(t_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{avgdl}\right)}$



•    where f(t_i, D) is the term frequency of the term t_i in the document D, |D| is the length of the document D in
     words, and avgdl is the average document length in the text collection from which documents are drawn.
     k_1 and b are free parameters. IDF(t_i) is the inverse document frequency weight of the query term t_i.
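
A minimal BM25 sketch over a small in-memory collection follows. The defaults `k1=1.2` and `b=0.75` are commonly used values, not values reported in the slides, and the smoothed IDF variant used here (which stays non-negative) is likewise an assumption.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document (a list of words) against the query."""
    if not documents:
        return []
    n_docs = len(documents)
    # Average document length; fall back to 1.0 if every document is empty.
    avgdl = sum(len(doc) for doc in documents) / n_docs or 1.0
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))

    def idf(term):
        # Smoothed inverse document frequency (assumed variant).
        return math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)

    scores = []
    for doc in documents:
        tf = Counter(doc)
        total = 0.0
        for term in query_terms:
            f = tf[term]
            denom = f + k1 * (1 - b + b * len(doc) / avgdl)
            total += idf(term) * f * (k1 + 1) / denom
        scores.append(total)
    return scores
```

The length normalization controlled by `b` is relevant here because long speeches and short newspaper articles differ strongly in length.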
similarity measures

•   The overlap coefficient is a similarity measure related to the Jaccard index that
    computes the overlap between two sets. It is defined as follows:




                overlap(Q,D) = $\frac{|Q \cap D|}{\min(|Q|, |D|)}$


•   If Q is a subset of D or the converse, then the overlap coefficient is equal to one.
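
A minimal sketch of the overlap coefficient on term sets. The function name is hypothetical, and returning 0.0 for an empty set is an assumption, since the coefficient is undefined in that case.

```python
def overlap_coefficient(q_terms, d_terms):
    """Size of the intersection divided by the size of the smaller set."""
    q, d = set(q_terms), set(d_terms)
    if not q or not d:
        return 0.0  # undefined for empty sets; 0.0 chosen by convention here
    return len(q & d) / min(len(q), len(d))
```

Unlike the Jaccard index, this measure does not penalize a short query matched against a much longer document, which suits queries built from a handful of topic words and named entities.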

Más contenido relacionado

Destacado

Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 

Destacado (20)

Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 

Polimedia Syposium - Linking the data sets

  • 1. PolimediaSymposium Linking the data sets Damir Juric (TU Delft) Amsterdam, 23.01.2013.
  • 2. Background: the PoliMedia project • The PoliMedia project: – driven by research questions from historians – interested in media coverage across several types of media outlets – Cross-media comparisons • conducted over a longer period of time, on different topics • focus on the coverage of the debates in the Dutch parliament • insight on the different choices that different media make while reporting on those debates – three phases : • modeling phase: creating a semantic model • data production phase: creating links between debates and associated media sources • application phase: searching and navigating linked datasets
  • 3. Introduction • Polimediasemantic model needs to represent: – people – topics – time – media types • Model has to be expressive enough: – describing events from the Dutch parliament
  • 4. Data Sets • Primary data set: – The Dutch parliamentary debates (Handelingender Staten-General or Dutch Hansard) – transcripts of speeches that politicians had in the parliament – this project uses data from the Political Mashup – all debates until the year 1995: • published as XML documents (OCR with satisfactory quality is being used). • data shows a fine-grained structure.
  • 5. Data Sets • Secondary data set: – different media types: • newspaper articles and radio bulletins – National Library of the Netherlands • newscasts – evening news and current affairs programs
  • 6. Semantic model: Goals • Goal of the project: – to publish the links on the Web – to use open Web formats and standards – Web query language – unique identifiers (URI’s) • Model has to be expressive: – important information regarding parliamentary debates should be easily accessed
  • 7. Debate: The structure Metadata Debate Metadata NEs={EconomischeZaken, Borssele} Aan de orde is de behandeling van: - de brief van de minister van Topic 1 EconomischeZakeninzakeBorssele(16226, nr. 26). De beraadslagingwordtgeopend. Speaker 1 Speaker 1 / Content NEs={Borssele, Partij van de Arbeid, D66} Speaker 2 Speaker 2 / Content Mijnheer de Voorzitter! Met de verdragen tot uitbreiding van de EEG met Denemarken,Engeland, Ierland en Noorwegenwordteen van de doelstellingen Speaker 3 Speaker 3 / Content van onsbuitenlandsbeleidverwezenlijkt. Topic 2 Speaker 1 / Content
  • 8. Semantic model: Description Part of semantic model representation of the debates dataset
  • 9. Semantic model: Description Semantic model representation of the debates dataset
  • 10. Polimedia linking method • The challenge: how to create a representation of the speech that contains enough information, so it can be used as a query to retrieve relevant media articles from the archive? • Debate speeches and newspaper articles are generally different types of documents (so computing document similarity doesn’t work) in the style and scope • Speeches can contain large number of NEs and digressions: – Problem: hard to distinguish the right context for each speech • Newspaper articles: – very strict and concise – words are used sparingly
  • 11. Polimedia linking method • Our PoliMedia linking method consists of four steps: 1. topics: enriching the existing debate metadata with topics 2. preselection of articles: when the candidate articles were published and who spoke in the debate (timeframe and speakers)? 3. automatic query creation: candidate articles are ranked based on similarity to the query (automatically created from speech text) by comparing vectors of topics and named entities 4. link creation: links are created between a speech and an article if the similarity score is above a threshold t
  • 12. Topics • Topic modeling: – popular tool for the unsupervised analysis of text, – used to check models, summarize the corpus, and guide exploration of its contents – topic models lead to semantically meaningful decompositions of text because they tend to place high probability on words that represent concepts • Extracting topics from speech: – ten words that represent one topic discussed inside the speech are extracted – all speeches contained inside one debate segment are concatenated into one text – set of ten words that represent one topic of the debate segment as a whole is extracted from that text • Input: text /number of iterations/number of topics • Output: generic names for topics/words that cluster around one topic • Example: – Test case: debate nr. 1975/number of iterations: 2000/numbner of topics: 1
  • 13. Automatic query creation Metadata NE Speech Staatssecretaris Regering Euro-kapitaalmarkt Debate Tariefnota Metadata TopicSet Topic Financiën moeten Zwitserland Topic 1 fraude Grave TopicSet Speech wetgeving Brussel Speaker 1 / Content inkomstenbelasting sociale EEG bronheffing Speaker 2 / Content misbruik Netherlandse Kombrink ten Contourennota Speaker 3 / Content rente fraudebestrijding Kombrink contourennota vraag Nederland Nederland gebruik Contou Topic 2 vereenvoudiging kamer OESO-verband Speaker 1 / Content tarief Midden-Oosten word misbruikfraudebest Engwirda rijdingismo-rapport tussen Couprie NE Topic Actor
  • 14. Automatic query creation Scholten +(text:wetsontwerptext:latertext:septembertext:prijzentext:lonentext:ontwikkelingtext:zeggentext:staatssecretaristext:gebrachttext:ertoe) (title:wetsontwerptitle:latertitle:septembertitle:prijzentitle:lonentitle:ontwikkelingtitle:zeggentitle:staatssecretaristitle:gebrachttitle:ertoe) +(text:staatssecretaristext:huurverhogingtext:jaartext:moetentext:apriltext:uitsteltext:percentagetext:nieuwe) (title:staatssecretaristitle:huurverhogingtitle:jaartitle:moetentitle:apriltitle:uitsteltitle:percentagetitle:nieuwe) +(text:regelentext:wet) (title:regelentitle:wet) +text:staatssecretaris title:staatssecretaris Mijnheerde Voorzitter ! In de memorie van toelichtingbij het voorliggendewetsontwerpzegt de Staatssecretaris , dathij over het trendmatigehuurstijgingspercentagevoor 1977 nognietskanzeggenomdat de gegevens over de teverwachtenontwikkeling van lonen en prijzenvoor 1977 nognietbekendzijn . Dit is gedateerd 14 september . Impliceertdit , wanneerergeensprakezouzijn van eenwetsontwerp tot verschuiving van de ingangsdatum , danook ten aanzien van de 8 procent per 1 aprilzougeldendatnogafgewachtmoetworden , of het dat percentage zalworden , omdat men pas later ietsmeerweet over de ontwikkeling van lonen en prijzen ? De Staatssecretarisvoeltzich door ditwetsontwerpeigenlijkgedwongen op eenvrijvroegtijdstiptochdaaroverietstezeggen . Immers , een week later namelijkbij brief van 21 septemberkomthijwel met eenbepaaldconcreetvoorstel . Daarinstelthij : Het overleg met de vastecommissieheeftmijertoegebracht ... ExpandedQuery = NERsSpeech TopicSet Speech NER Topic TopicSet Topic + Speaker X = ActorFromSpeech TimeFrame
  • 15. Example of the relevant article vvd: van dam baseert beleid teveel op rossige prognoses van planbureau kamermeerderheid tegen uitstel van huurverhoging den haag - een meerderheid van de tweede kamer voelt er niets voor de huurverhoging van volgend jaar uit te, stellen van 1 april tot 1 juli. de fracties van kvp, arp, chu, vvd, ds'7o en de kleine christelijke pairtijen willen de huurverhoging op 1 april laten doorgaan. staatssecretaris van dam van volkshuisvesting wil uitstel om op 1 juli volgend jaar een nieuw huurbeleid te kunnen invoeren. daarvoor zal hij op korte tetmijn drie wetsontwerpen indienen: de huurprijzenwet, de wet op de huurcommissie en een wijziging van het burgerlijk wetboek. de bewindsman zei dat met het afwijzen van uitstel in feite invoering van het nieuwe huurbeleid op 1 juli volgend jaar onmogelijk wordt gemaakt. het nieuwe stelsel zal dan pas in 1978 ingevoerd kunnen worden. „met een uitstel van drie maanden komen we precies uit", aldus de heer van dam. de arp'er scholten, die mede namens kvp en chu sprak, zegde de regering alle medewerking toe om de nieuwe huurwet nog in deze kabinetsperiode te behandelen, maar hij twijfelde eraan of op 1 juli 1977 het nieuwe huurbeleid al ingevoerd kan worden. de confessionelen en de vvd houden vast aan een huurverhoging van 8 procent op 1 april. staatssecretaris van dam wil pas op 1 juli zon verhoging. zou de verhoging toch op 1 april moeten ingaan, dan wil hij een verhoging van 7 procent. de bewindsman komt volgens de confessionelen te vroeg met een verlaging van de jaarlijkse huurverhoging. het d'66-kamerlid nypels diende een motie in waarin hij de regering verzoekt bij verwerping van het uitstel te komen met een wetsontwerp voor 7 procent op 1 april. ook de pvda'et kombrink suggereerde deze oplossing. de heer kombrink deed een dringend beroep op de confessionelen om het uitstel te aanvaarden. de vvder de beer vond dat elr niet voldoende redenen zijn voor uitstel van de huurverhoging. de staatssecretaris baseert zijn beleid teveel op „de rossige prognoses van het centraal plan bureau", vindt de vvd.
ook de christendemocraten vinden dat van dam teveel van prognoses uitgaat die vaak te laag zijn. de pvda is het met de regering eens dat de huren op 1 juli met 8 procent omhoog moeten. wijst de kamer dat af, dan moeten de huren op 1 april met 7 procent worden verhoogd. men moet niet alleen kijken naar de ontwikkeling van lonen en prijzen, men moet ook kijken naar het vrij besteedbare inkomen. de stijging daarvan zal in de komende jaren uiterst gering zijn", zei kombrink. cpn-woordvoerder draagstra zei dat de huren bevroren moeten worden op het huidige peil.
  • 16. PoliMedia pipeline [Diagram; recoverable labels: semantic model; RDF files; PoliticalMashup RDF (XML); query NE; query content; KB (preselect data); expanded query creation; NERs Speech; NERs Topic; stopword removal; similarity calculation; SRU query (actor, date range); TopicSet Speech; topic modeling; ranking; TopicSet Topic; filtering; automatic query creation; contextual vectors; article metadata]
  • 17. Evaluation • We tried three different approaches: • Experiment 1: NEs in speech • Experiment 2: NEs + topics in speech • Experiment 3: NEs + topics in speech and debate • Conclusion: • best approach: • named entities (speech + debate descriptions) and topics (speech + debate)
  • 18. Results discussion • structural elements of the transcript: • used to create a complex and rich query from the speech • treating a particular speech as part of a bigger context (the conversation) and creating a query that is a mixture of those elements: • a higher number of related articles retrieved • What did we learn? • the definition of a link can be vague • simple document similarity methods do not work • journalists use their own “compression” methods when writing about debates • long speeches with dozens of NEs and topics are sometimes represented by a few concise sentences
  • 19. End • Thank you for listening • more information on polimedia.nl
  • 20. similarity measures • similarity measure: a metric that measures similarity or dissimilarity (distance) between two text strings, used for approximate string matching or comparison and in fuzzy string searching • Given two segments, the expanded query Q and the document D from the media archive, a term frequency (TF) is associated with each term t in Q and in D. The similarity between Q and D is computed with the cosine similarity formula, whose value varies between 0 and 1: • $\mathrm{CosineSimilarity}(Q,D) = \dfrac{\sum_{t} \mathrm{tf}(t,Q)\,\mathrm{tf}(t,D)}{\sqrt{\sum_{t} \mathrm{tf}(t,Q)^{2}}\;\sqrt{\sum_{t} \mathrm{tf}(t,D)^{2}}}$ • BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). Given a query Q containing keywords $t_1, \ldots, t_n$, the BM25 score of a document D is: • $\mathrm{BM25Score}(Q,D) = \sum_{i=1}^{n} \mathrm{IDF}(t_i)\cdot\dfrac{f(t_i,D)\,(k_1+1)}{f(t_i,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$ • where $f(t_i,D)$ is the term frequency of the query term $t_i$ in the document D, $|D|$ is the length of the document D in words, and avgdl is the average document length in the text collection from which documents are drawn. $k_1$ and $b$ are free parameters. $\mathrm{IDF}(t_i)$ is the inverse document frequency weight of the query term $t_i$.
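The two scoring functions described on this slide can be sketched in plain Python as follows. This is a minimal illustration, not the PoliMedia implementation: documents are assumed to be pre-tokenised lists of words, the corpus is assumed to be non-empty, the defaults `k1=1.2` and `b=0.75` are commonly used values rather than values from the slides, and the IDF variant (the non-negative smoothed form) is an assumption, since the slide does not spell out the IDF formula.

```python
import math
from collections import Counter


def cosine_similarity(query_tokens, doc_tokens):
    """Cosine similarity between raw term-frequency vectors of Q and D."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty query or document has no similarity
    return dot / (norm_q * norm_d)


def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """BM25 score of one document for a query, given the whole corpus.

    corpus is a non-empty list of tokenised documents (assumption).
    """
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_tokens):
        # number of documents that contain the term
        n_t = sum(1 for doc in corpus if term in doc)
        # assumed IDF variant: smoothed and never negative
        idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
        f = tf[term]
        denom = f + k1 * (1.0 - b + b * len(doc_tokens) / avgdl)
        score += idf * f * (k1 + 1.0) / denom
    return score


corpus = [
    ["huurverhoging", "uitstel", "staatssecretaris"],
    ["lonen", "prijzen", "ontwikkeling"],
    ["voetbal", "uitslag"],
]
query = ["huurverhoging", "staatssecretaris"]
ranked = sorted(corpus, key=lambda doc: bm25_score(query, doc, corpus), reverse=True)
print(ranked[0])
```

In this toy corpus only the first document shares terms with the query, so it is ranked first; the other two receive a score of zero.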
  • 21. similarity measures • The overlap coefficient is a similarity measure related to the Jaccard index that computes the overlap between two sets, here the term sets of Q and D. It is defined as follows: $\mathrm{overlap}(Q,D) = \dfrac{|Q \cap D|}{\min(|Q|,\,|D|)}$ • If Q is a subset of D, or the converse, the overlap coefficient is equal to one.
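A minimal Python sketch of the overlap coefficient, assuming the query and the document have already been reduced to collections of terms. Returning 0.0 when either set is empty is a convention chosen here, not something stated on the slide.

```python
def overlap_coefficient(query_terms, doc_terms):
    """Size of the intersection divided by the size of the smaller set."""
    q, d = set(query_terms), set(doc_terms)
    if not q or not d:
        return 0.0  # convention chosen for this sketch
    return len(q & d) / min(len(q), len(d))


# The query terms are a subset of the document terms, so the score is 1.0.
print(overlap_coefficient(["wet", "huurverhoging"],
                          ["wet", "huurverhoging", "april"]))
```

Because the denominator is the smaller set, a short query fully contained in a long article scores 1.0, whereas the Jaccard index would penalise the size difference.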