Text Mining lecture
Information Retrieval
Prof.dr.ir. Arjen P. de Vries
arjen@acm.org
Nijmegen, October 18th, 2017
Information Retrieval intro TMM

Intro IR for the Text and Multimedia Mining course

  1. Text Mining lecture: Information Retrieval
     Prof.dr.ir. Arjen P. de Vries, arjen@acm.org
     Nijmegen, October 18th, 2017
  2. A Tutorial on Models of Information Seeking, Searching & Retrieval by @leifos & @guidozuc
  3. Core Research Questions
     - How to represent information?
       - The information need and search requests
       - The objects to be shown in response to an information request
     - How to match information representations?
     The information objects to be retrieved are not necessarily textual!
     Van Rijsbergen, 1979
  4. Two views on ‘search’
     DB
     - Business applications
     - Deductive reasoning
     - Precise and efficient query processing
     - Users with technical skills (SQL) and precise information needs
     - Selection: Books where category=‘CS’
     IR
     - Digital libraries, patent collections, etc.
     - Inductive reasoning
     - Best-effort processing
     - Untrained users with imprecise information needs
     - Ranking: Books about CS
     Note: SemWeb more DB than IR!!!
     Symbolic / Connectionist
  5. Search Flow Chart
     A Tutorial on Models of Information Seeking, Searching & Retrieval by @leifos & @guidozuc
  6. IR vs. AI
     - Many related topics in AI:
       - Computational Linguistics
       - Natural Language Processing
       - Question Answering
       - Information Extraction
       - Machine Translation
       - Computer vision / Multimedia
     vs.
     - Information Retrieval?
  7. IR vs. AI (Artificial Intelligence)
     “In some sense, of course, classic IR is superhuman: there was no pre-existing human skill, as there was with seeing, talking or even chess playing that corresponded to the search through millions of words of text on the basis of indices. But if one took the view, by contrast, that theologians, lawyers and, later, literary scholars were able, albeit slowly, to search vast libraries of sources for relevant material, then on that view IR is just the optimisation of a human skill and not a superhuman activity. If one takes that view, IR is a proper part of AI, as traditionally conceived.”
     Yorick Wilks, “Unhappy bedfellows: the relationship of AI and IR”, an essay in honour of Karen Spärck Jones, 2006
  10. Relevance
     - Inherently dependent on user, context and task
     - Different “relevance criteria”
       - Topicality: is the document about the information request?
       - Readability: can I understand the text?
       - Authoritativeness: can I trust the text?
       - Child-suitability: is the text appropriate for children?
       - Etc.
  11. “Computational Relevance”
     “Intellectually it is possible for a human to establish the relevance of a document to a query. For a computer to do this we need to construct a model within which relevance decisions can be quantified. It is interesting to note that most research in information retrieval can be shown to have been concerned with different aspects of such a model.”
     Van Rijsbergen, 1976
     Retrieval Model
  12. ‘Computational Relevance’
     - How to combine different indicators of relevance?
       - E.g., topicality, child-suitability, polarity, …
     - Apply ‘copulas’ (a technique from econometrics) to model non-linear dependencies (SIGIR 2013, CIKM 2014)
  13. Relevance
     - Various aspects of understanding this notion of relevance position information retrieval between computer science and information science
     - Examples of questions that traditionally do not even presume involvement of a computer:
       - What makes an information object relevant?
       - What stages constitute a search process?
       - How does relevance evolve during this search process?
       - How do users learn from the search process?
       - Why do users issue short queries even if we know that long ones are more effective?
       - Etc.
  14. NLP in IR
     - Stemming & Stopping
       - De facto default setting
     - N-grams (bi-grams)
       - SDM (Sequential Dependence Model)
     - Entity tagging
  15. Footnote in Victor Lavrenko’s PhD thesis
     “It is my personal observation that almost every mathematically inclined graduate student in Information Retrieval attempts to formulate some sort of a non-independent model of IR within the first two or three years of his studies. The vast majority of these attempts yield no improvements and remain unpublished.”
  16. Take words as they stand!
  17. The Secret
     - The user can simply reformulate their information need in response to insufficiently relevant results retrieved by the system!
  18. Why Search Remains Difficult to Get Right
     - Heterogeneous data sources
       - WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, …
     - Varying result types
       - “Documents”, tweets, courses, people, experts, gene expressions, temperatures, …
     - Multiple dimensions of relevance
       - Topicality, recency, reading level, …
     Actual information needs often require a mix within and across dimensions, e.g., “recent news and patents from our top competitors”.
  19. (Why Search Remains Difficult to Get Right, continued)
     - System’s internal information representation
       - Linguistic annotations: named entities, sentiment, dependencies, …
       - Knowledge resources: Wikipedia, Freebase, IDC9, IPTC, …
       - Links to related documents: citations, URLs
     - Anchors that describe the URI
       - Anchor text
     - Queries that lead to clicks on the URI
       - Session, user, dwell-time, …
     - Tweets that mention the URI
       - Time, location, user, …
     - Other social media that describe the URI
       - User, rating
       - Tag, organisation of ‘folksonomy’
     + UNCERTAINTY ALL OVER!
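
The contrast on slide 4 between DB-style selection and IR-style ranking, together with the idea on slide 11 that a retrieval model quantifies relevance, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `BOOKS` collection, the helper names `select` and `rank`, and the choice of a plain tf-idf score without length normalisation as the retrieval model. It is not the model used in the lecture.

```python
import math
from collections import Counter

# Hypothetical toy collection; titles and categories are invented for illustration.
BOOKS = [
    {"title": "Introduction to Algorithms", "category": "CS"},
    {"title": "Computer Science Illuminated", "category": "CS"},
    {"title": "The Science of Cooking", "category": "Food"},
    {"title": "Computer Vision for Science Labs", "category": "Engineering"},
]


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def select(books, category):
    """DB view: exact-match selection; every result satisfies the predicate."""
    return [b["title"] for b in books if b["category"] == category]


def rank(books, query):
    """IR view: best-effort ranking of titles by a simple tf-idf score."""
    docs = [Counter(tokenize(b["title"])) for b in books]
    n = len(docs)
    scored = []
    for book, tf in zip(books, docs):
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for d in docs if term in d)  # document frequency
            if df:
                score += tf[term] * math.log(1 + n / df)
        scored.append((score, book["title"]))
    # Stable sort: ties keep their original collection order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(title, round(score, 3)) for score, title in scored if score > 0]


selected = select(BOOKS, "CS")
ranked = rank(BOOKS, "computer science")
print(selected)
print(ranked)
```

In this toy run the selection returns exactly the two books labelled `CS`, while the ranking returns three titles ordered by score, including books outside the `CS` category and omitting "Introduction to Algorithms", whose title shares no term with the query. That is the precise-versus-best-effort trade-off the slide describes.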

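
The preprocessing steps named on slide 14 (stopping, stemming, bi-grams) can be sketched as follows. The stop list, the `toy_stem` suffix stripper and the sample sentence are assumptions made for illustration; `toy_stem` is a crude stand-in and not an implementation of any established stemmer.

```python
import re

# Illustrative stop list (an assumption); real systems use much longer lists.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Stopping: drop very frequent function words."""
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bigrams(tokens):
    """Pairs of adjacent tokens, in order."""
    return list(zip(tokens, tokens[1:]))


text = "Searching the indexes of digital libraries"
content = remove_stopwords(tokenize(text))
stems = [toy_stem(t) for t in content]
pairs = bigrams(stems)
print(stems)
print(pairs)
```

Adjacent-term pairs like these are the kind of unit that dependence models score in addition to single terms; the sketch only produces the units and does not score them.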