Search engines for the humanities that go beyond Google
Suzan Verberne
Centre for Language and Speech Technology, Radboud University Nijmegen
Brainstorm Meeting e-Humanities, March 29, 2011

Outline
- Searching with Google
- Limitations of Google search
- Searching in text collections
- Better guidance through texts
- What technology is needed?

Searching with Google
[Slide images not included]

How does Google work?
[Diagram: query, index, relevance model]
Google calculates the relevance of web pages using word counts and popularity estimates. So Google does not ‘understand’ the texts it sees; it efficiently estimates a document’s relevance from the words it contains. This is very effective and efficient for retrieving full documents (web pages).

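To make the idea of "word counts and popularity estimates" concrete, here is a minimal sketch in Python. It is not Google's actual ranking function: the TF-IDF weighting, the `weight` parameter, the toy documents and the popularity values are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def score(query, doc_tokens, doc_freq, n_docs, popularity, weight=0.5):
    """Word-count (TF-IDF) score, boosted by a popularity estimate.

    `weight` is a hypothetical knob for how much popularity matters.
    """
    counts = Counter(doc_tokens)
    text_score = 0.0
    for term in tokenize(query):
        tf = counts[term]
        if tf:
            # Smoothed inverse document frequency: rare words count more.
            idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
            text_score += tf * idf
    return text_score * (1.0 + weight * popularity)


def rank(query, docs, popularity):
    """Return names of matching documents, best first."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    doc_freq = Counter(t for toks in tokens.values() for t in set(toks))
    scored = [
        (score(query, toks, doc_freq, len(docs), popularity.get(name, 0.0)), name)
        for name, toks in tokens.items()
    ]
    return [name for s, name in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]


# Toy collection and made-up popularity estimates.
DOCS = {
    "a": "multatuli wrote max havelaar",
    "b": "letters by multatuli and about multatuli",
    "c": "coffee auctions in amsterdam",
}
POPULARITY = {"a": 0.9, "b": 0.1, "c": 0.5}
```

Nothing in this scoring looks at meaning: a document ranks high because it repeats the query word or because it is popular, which is exactly the limitation the next slide addresses.
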
Limitations of Google
But what if I have a more specific information need?
- Which books did Multatuli write?
- How did other writers respond to Multatuli’s work?
- To which events did Multatuli refer in ‘Max Havelaar’?
Then I need:
- a more specialized text collection than the web
- a search engine that guides me through the retrieved documents

Specialized text collection: DBNL
DBNL: The Digital Library of Dutch Literature
A website about Dutch literature, language and cultural history. It contains literary texts, secondary literature and additional information such as biographies, portraits and hyperlinks.
http://www.dbnl.org

Searching in DBNL
- 4 pages of results, not sorted by relevance.
- The document is 6 pages long.
- Only the query term is highlighted.

Better guidance through texts
Step 1: Label important terms and entities in the text:
- Person and place names
- Book and journal titles
- Events
- Other terms of interest
This task is called ‘named entity recognition’. It is well developed in the field of computational linguistics.

Better guidance through texts
[Example text with entity labels: journal title, book title, person name, person name]

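Step 1 can be illustrated with a very small dictionary-based (gazetteer) labeller. This is only a sketch of the labelling task: practical named entity recognizers typically rely on trained statistical models, and the gazetteer entries and label names below are assumptions taken from the examples in this talk.

```python
import re

# Hypothetical gazetteer: surface form -> entity label.
GAZETTEER = {
    "Multatuli": "PERSON",
    "Max Havelaar": "BOOK_TITLE",
    "Vaderlandsche Letteroefeningen": "JOURNAL_TITLE",
}


def label_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, surface form, label) for every gazetteer match."""
    # Try longer names first so they win over any shorter overlapping entry.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(r"\b" + re.escape(n) + r"\b" for n in names))
    return [
        (m.start(), m.end(), m.group(0), gazetteer[m.group(0)])
        for m in pattern.finditer(text)
    ]


EXAMPLE = "Multatuli published Max Havelaar in 1860."
LABELLED = label_entities(EXAMPLE)
```

Each result carries character offsets, so a viewer could highlight the labelled spans in the original text.
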
Better guidance through texts
Step 2: Collect information about entities in the text:
- Factual information: what is it and to whom does it relate?
- Links to external sources (biographies, encyclopaedias)
- Links to other mentions in the collection
Automatically collecting large amounts of factual information is a current research topic in computational linguistics/artificial intelligence.

Better guidance through texts
Vaderlandsche Letteroefeningen was, for more than a century, one of the leading literary and cultural journals of the Netherlands. It appeared monthly. The last issue came off the press in December 1876. Its primary aim was to point readers to useful publications, both recent works and books that had appeared long ago and no longer received attention.
http://www.kb.nl/dossiers/vaderlandscheletteroefeningen/

Better guidance through texts
Max Havelaar, of de koffij-veilingen der Nederlandsche Handel-Maatschappij is a novel by Multatuli, published in 1860. The book is about a man who tries to fight the corrupt government system of the Dutch East Indies, and it went on to have great influence on Dutch literature as well as on Dutch colonial policy. Max Havelaar is regarded as one of the most important works in Dutch literature.
http://nl.wikipedia.org/wiki/Max_Havelaar_(boek)

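The third item of Step 2, links to other mentions in the collection, amounts to an index from entity names to the documents that mention them. The sketch below assumes plain substring matching and invented document identifiers; a real system would reuse the entity labels from Step 1.

```python
from collections import defaultdict


def build_mention_index(documents, entity_names):
    """Map each entity name to the sorted ids of documents that mention it."""
    index = defaultdict(list)
    for doc_id, text in sorted(documents.items()):
        for name in entity_names:
            if name in text:
                index[name].append(doc_id)
    return dict(index)


def other_mentions(index, name, current_doc):
    """Documents mentioning `name`, excluding the one being read."""
    return [d for d in index.get(name, []) if d != current_doc]


# Hypothetical mini collection.
COLLECTION = {
    "doc1": "A review of Max Havelaar in Vaderlandsche Letteroefeningen.",
    "doc2": "Multatuli wrote Max Havelaar.",
    "doc3": "A biography of Multatuli.",
}
INDEX = build_mention_index(COLLECTION, ["Multatuli", "Max Havelaar"])
```
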
Collecting facts from text
- Dutch Wikipedia: 678,683 articles (March 2011)
- Articles are categorized by topic
- Number of articles about Dutch writers: 439

Collecting facts from text
- Split the texts into sentences.
- Analyze the sentences with a parser that indicates the most important syntactic parts of each sentence.
- Generate (nuclear) facts from the syntactic analysis:
SUBJECT | VERB | OBJECT/PREDICATE | COMPLEMENTS
Multatuli | write | Max Havelaar | in 1860, in Java

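The pipeline above can be sketched as follows. The parser itself is out of scope here: the code assumes its output is already available as a dictionary with hypothetical keys (`subject`, `verb_lemma`, `object`, `complements`), and the sentence splitter is a naive punctuation-based stand-in.

```python
import re
from dataclasses import dataclass


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


@dataclass(frozen=True)
class Factoid:
    subject: str
    verb: str
    obj: str
    complements: tuple = ()

    def to_row(self):
        """Render in the SUBJECT | VERB | OBJECT/PREDICATE | COMPLEMENTS layout."""
        return " | ".join(
            [self.subject, self.verb, self.obj, ", ".join(self.complements)]
        )


def factoid_from_parse(parse):
    """Build a factoid from (assumed) parser output for one sentence."""
    return Factoid(
        subject=parse["subject"],
        verb=parse["verb_lemma"],
        obj=parse["object"],
        complements=tuple(parse.get("complements", ())),
    )


# Hand-written stand-in for what a parser might return.
PARSE = {
    "subject": "Multatuli",
    "verb_lemma": "write",
    "object": "Max Havelaar",
    "complements": ["in 1860", "in Java"],
}
ROW = factoid_from_parse(PARSE).to_row()
```
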
Collecting facts from text
Hans Dekkers
http://nl.wikipedia.org/wiki/Hans_Dekkers_(1954)
“Hij schrijft romans, korte verhalen, gedichten en theaterstukken.”
“He writes novels, short stories, poems and plays.”
Factoids:
hij | schrijven | theaterstukken |
hij | schrijven | gedichten |
hij | schrijven | romans |
hij | schrijven | korte verhalen |

Collecting facts from text
P.F. Thomése
http://nl.wikipedia.org/wiki/P.F._Thom%C3%A9se
“In 1991 en 2003 ontving hij literaire prijzen.”
“In 1991 and 2003, he received literary awards.”
Factoids:
hij | ontvangen | literaire prijzen | in 1991, in 2003

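The two examples above show coordination being unpacked: a list of objects yields one factoid per object, and a coordinated time expression yields one complement per year. A minimal sketch of that step, assuming the relevant phrases have already been isolated by the parser and that the Dutch conjunction is "en":

```python
import re


def split_conjuncts(phrase, conjunction="en"):
    """Split 'a, b, c en d' into ['a', 'b', 'c', 'd']."""
    parts = re.split(r",\s*|\s+" + re.escape(conjunction) + r"\s+", phrase)
    return [p.strip() for p in parts if p.strip()]


def expand_objects(subject, verb, object_phrase, complements=()):
    """One (subject, verb, object, complements) tuple per coordinated object."""
    return [
        (subject, verb, obj, tuple(complements))
        for obj in split_conjuncts(object_phrase)
    ]


def distribute_preposition(phrase, conjunction="en"):
    """Turn 'in 1991 en 2003' into ['in 1991', 'in 2003']."""
    preposition, _, rest = phrase.partition(" ")
    return [f"{preposition} {c}" for c in split_conjuncts(rest, conjunction)]


WRITES = expand_objects(
    "hij", "schrijven", "romans, korte verhalen, gedichten en theaterstukken"
)
AWARDS = ("hij", "ontvangen", "literaire prijzen",
          tuple(distribute_preposition("in 1991 en 2003")))
```

Splitting on a literal conjunction is fragile (it would also split a title that happens to contain "en"), which is one reason the talk relies on a syntactic parser rather than string patterns.
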
Better guidance through texts
Step 3: Enrich the text collection with this factual information. When the user clicks one of the labelled terms, the most important factual information is shown, together with links to sources.

Better guidance through texts
[Example: for the labelled title Max Havelaar, the Wikipedia summary from the Step 2 example is shown, with a link to its source: http://nl.wikipedia.org/wiki/Max_Havelaar_(boek)]

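Step 3 can be pictured as a lookup from a labelled term to the facts and source links collected for it. The structure below is a sketch under assumptions: the dictionary layout, the `max_facts` cut-off and the function name are hypothetical, and the single entry reuses the fact and URL from earlier slides.

```python
# Hypothetical store of collected information, keyed by labelled term.
KNOWLEDGE_BASE = {
    "Max Havelaar": {
        "label": "BOOK_TITLE",
        "facts": ["Multatuli | write | Max Havelaar | in 1860, in Java"],
        "sources": ["http://nl.wikipedia.org/wiki/Max_Havelaar_(boek)"],
    },
}


def info_panel(term, kb=KNOWLEDGE_BASE, max_facts=3):
    """Return what to show when the user clicks `term`, or None if unknown."""
    entry = kb.get(term)
    if entry is None:
        return None
    return {
        "term": term,
        "label": entry["label"],
        "facts": entry["facts"][:max_facts],  # only the most important facts
        "sources": list(entry["sources"]),
    }


PANEL = info_panel("Max Havelaar")
```
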
How to proceed?
There are multiple initiatives (also in the Netherlands) to develop the described techniques.
Challenges:
- What are the needs of the target group? Collaboration is essential.
- Older varieties of Dutch: development of resources and tools is needed (some already exist).
- User interfacing is very important: specialist knowledge is needed.
- …

Thank you!
You can find more information on my web site (Google my name and you will get there).
