The diversity and complexity of content available on the web have increased dramatically in recent years. Multimedia content such as images, videos, maps, and voice recordings is published more often than before. Document genres have also diversified, for instance news, blogs, FAQs, and wikis. These diverse information sources are often dealt with separately. For example, in web search, users have to switch between search verticals to access different sources. Recently, there has been growing interest in finding effective ways to aggregate these information sources so as to hide the complexity of the information spaces from users searching for relevant information. For example, so-called aggregated search, investigated by the major search engine companies, provides search results from several sources in a single result page. Aggregation itself is not a new paradigm; for instance, aggregate operators are common in database technology.
This talk presents the challenges faced by the likes of web search engines and digital libraries in providing the means to aggregate information from several complex information spaces in a way that helps users in their information seeking tasks. It also discusses how other disciplines, including databases, artificial intelligence, and cognitive science, can be brought into building effective and efficient aggregated search systems.
4. Three retrieval paradigms: Document Retrieval, Focused Retrieval, Aggregated Retrieval, ordered by increasing complexity of the information space(s).
5. Classical document retrieval: a Retrieval System takes a Query against a Document corpus and returns Ranked Documents. One homogeneous information space.
6. Classical document retrieval process: Documents and the Query each pass through a Representation Function, yielding a Document Representation (stored in an Index) and a Query Representation; a Retrieval Function matches the two to produce ranked documents.
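The pipeline on this slide can be sketched in a few lines. This is a minimal, illustrative vector-space instance: the representation function is a bag-of-words term count, the retrieval function is cosine similarity, and the toy corpus is invented for the example.

```python
import math
from collections import Counter

# Toy corpus standing in for the "Documents" box in the slide.
corpus = {
    "d1": "aggregated search blends results from several verticals",
    "d2": "classical document retrieval ranks whole documents",
    "d3": "xml retrieval returns document elements not whole documents",
}

def represent(text):
    """Representation function: bag-of-words term frequencies."""
    return Counter(text.lower().split())

def cosine(q, d):
    """Retrieval function: cosine similarity between two representations."""
    dot = sum(q[t] * d.get(t, 0) for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

# The index stores the document representations.
index = {doc_id: represent(text) for doc_id, text in corpus.items()}

def retrieve(query):
    """Rank all documents against the query representation."""
    q = represent(query)
    return sorted(index, key=lambda d: cosine(q, index[d]), reverse=True)

print(retrieve("document retrieval"))  # d2 ranks first, then d3, then d1
```

Any other representation (tf-idf, language models) or retrieval function slots into the same two-function skeleton, which is the point of the diagram.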
7. Information retrieval process (after The Turn, Ingwersen & Järvelin, 2005): Documents and the Query pass through Representation Functions into Object and Query Representations; a Retrieval Function over the Index produces Results. The process is embedded in a wider context: Task, Context, Interface, Interaction, Multimodality, Genre, Media, Language, Structure, Heterogeneity.
12. Logical structure — XML document. The document content "This is a heading / This is some text / This is a quote" is marked up as <doc> <head>This is a heading</head> <text>This is some text</text> <quote>This is a quote</quote> </doc>, corresponding to a tree with root doc and children head, text, and quote.
18. "Element" ranking algorithms — "Aggregation" in semi-complex information spaces is a combination-of-evidence problem, combining element score, document score, element size, and so on. Models used include: vector space model, language model, probabilistic model, logistic regression, Bayesian network, divergence from randomness, Boolean model, belief model, statistical model, machine learning, natural language processing, structured text models, polyrepresentation, and extended database models.
38. Result presentation: user studies of blended vs. non-blended interfaces, with 3 verticals (image, video, news), 3 positions, and 3 levels of vertical intent (high, medium, low). Image placements studied: top, middle, bottom, top-right, left, bottom-right.
40. Evaluation: test collections. Existing test collections (ImageCLEF photo retrieval track, TREC web track, INEX ad-hoc track, TREC blog track, ...) each provide, per topic t1, a list of documents d1...dn with relevance judgments (R/N). These are mapped into (simulated) verticals (Blog, Reference/Encyclopedia, Image, General Web, Shopping, ...), so that each topic t1 has, for every vertical V1...Vk, its own judged document list.
41. Evaluation: test collections (statistics from Zhou).

Collection statistics by medium:
  medium   size (GB)   documents
  text     2,125       86,186,315
  image    41.1        670,439
  video    445.5       1,253*
  total    2,611.6     86,858,007
  * Each video clip (document) contains on average more than 100 events/shots.

Statistics on topics:
  number of topics: 150
  average relevant documents per topic: 110.3
  average relevant verticals per topic: 1.75
  ratio of "General Web" topics: 29.3%
  ratio of topics with two vertical intents: 66.7%
  ratio of topics with more than two vertical intents: 4.0%
Editor's notes
Passage retrieval was an active research area in the mid 90s. The idea is that a document is decomposed into passages, and then IR is performed against the passages rather than the whole document. A main issue is the actual decomposition of the document into passages. There were three main techniques: using sliding windows of words; using the discourse, e.g. every sentence is a passage or, more likely, every paragraph is a passage; or using a topic segmentation algorithm such as TextTiling to identify shifts in topic and assign passage boundaries at those topic shifts.
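The first two decomposition techniques are simple enough to sketch directly (TextTiling is a full algorithm in its own right). Window size and step below are illustrative choices, not values from the talk.

```python
def window_passages(text, size=50, step=25):
    """Sliding windows of `size` words, overlapping by `size - step` words."""
    words = text.split()
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - step, 1), step)]

def paragraph_passages(text):
    """Discourse-based decomposition: every paragraph is a passage."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# A 60-word document yields two overlapping 50/35-word window passages.
print(len(window_passages(" ".join(["word"] * 60))))  # 2
```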
Now I will concentrate on the use of structure, and in particular XML, to perform focused retrieval. But let me first say a few words about structure and documents. Structure is everywhere in a document. I will mainly concentrate on structure in the sense of the hierarchical or logical structure.
This illustrates the logical structure of a document, and how it is represented in the XML markup language. XML is the de facto markup language for documents, and represents both the content and the structure of documents. Note that here doc, head, text, and quote are what are called XML elements. doc is the root element, or in other words the whole document.
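The document/tree correspondence on the slide can be checked directly by parsing the slide's XML with Python's standard library:

```python
import xml.etree.ElementTree as ET

# The exact XML document from the slide.
xml = """<doc>
  <head>This is a heading</head>
  <text>This is some text</text>
  <quote>This is a quote</quote>
</doc>"""

root = ET.fromstring(xml)
print(root.tag)  # doc  (the root element, i.e. the whole document)
for child in root:
    # head, text, quote, each with its textual content
    print(child.tag, "->", child.text)
```

In XML element retrieval, any node in this tree (doc, head, text, quote) is a candidate answer, which is what distinguishes it from whole-document retrieval.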
How can the structure be used for focused retrieval? Before describing the use of structure to retrieve so-called XML elements, I want to say a few words about complex information needs.
To express information needs in the context of XML retrieval, we need query languages. These can be divided into four main groups of increasing expressivity. Which query language is appropriate depends on the application and its users.
Now, what XML retrieval is about: beforehand, we do not know at which level of the hierarchy the best answer to a given query is contained. So we do not know what the retrieval unit actually is, as any element can be returned. Elements are related to each other, and this may be useful for finding those that should be returned. Finally, we may have structural constraints to satisfy.
This is an example of the outcome of the techniques described so far, where the result for a given query is a ranked list of elements.
Many models from IR, and approaches from other areas, have been used to rank elements for a given query. It is not yet known which model is best. XML element retrieval is more a combination-of-evidence problem, where what matters is what gets combined. We know that, whatever the model, combining the element score (estimated relevance), the document score, and some information about element size is what brings the best performance.
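The combination-of-evidence point can be sketched as a simple weighted mixture. The weights and the log-size prior below are illustrative assumptions, not a formula from INEX; the base scores would come from whichever retrieval model is in use.

```python
import math

def combined_score(element_score, document_score, element_length,
                   w_elem=0.6, w_doc=0.3, w_size=0.1):
    """Mix the three sources of evidence named in the note:
    the element's own score, its document's score, and its size."""
    # Log-length prior: favours non-trivial elements over tiny fragments.
    size_prior = math.log(1 + element_length)
    return w_elem * element_score + w_doc * document_score + w_size * size_prior

# A longer element in the same document, with the same base score,
# is preferred over a very short one under this mixture.
print(combined_score(0.8, 0.7, 500) > combined_score(0.8, 0.7, 5))  # True
```

The specific mixture is less important than the structure: element evidence, document context, and a size prior each enter the final score.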
At INEX, a number of user studies were carried out (Tassos Tombros at QMUL was heavily involved in these). One outcome was that, although users liked being returned XML elements, they preferred these to be grouped by the document containing them. This led to the definition of a retrieval task at INEX called relevance in context: rank articles, then highlight the relevant elements within them. This can be implemented as a heat map. This is an example of an aggregated result.
A second line of outcomes of the user studies was that users liked to be shown the context of the element they were reading. This can be done with the following interface: on one side we display the element, and on the other the ToC of the document containing that element. As part of his PhD, Zoltan looked at ways to build a ToC that adapts to the element being displayed, where for example the structure of the elements around the displayed element is shown in more detail than that of those further away. This is again an example of an aggregated result.
Initial ranking of elements, where elements in the same document are shown in the same colour. We could aggregate elements in various ways to form virtual documents. The terminology of virtual documents came a while ago from a discussion between TR and YC, and Thomas has extended it in so-called augmentation contexts. RiC is a special case of the above, where we group elements per document. There is nothing stopping us from doing the grouping across documents.
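This grouping step is easy to make concrete. A minimal sketch, with an invented ranked list of (element_id, doc_id, score) triples: relevance in context is the special case where the grouping key is the containing document, and any other key function groups across documents.

```python
from collections import defaultdict

# Illustrative ranked element list: (element_id, doc_id, score).
ranked = [("d1/sec2", "d1", 0.9), ("d2/sec1", "d2", 0.8),
          ("d1/sec5", "d1", 0.7), ("d3/abs", "d3", 0.6)]

def group_elements(ranked, key=lambda e: e[1]):
    """Aggregate a ranked element list into virtual documents.
    The default key groups per containing document (relevance in context);
    passing another key groups across documents."""
    groups = defaultdict(list)
    for elem in ranked:  # iteration preserves the ranking within each group
        groups[key(elem)].append(elem[0])
    return dict(groups)

print(group_elements(ranked))
# {'d1': ['d1/sec2', 'd1/sec5'], 'd2': ['d2/sec1'], 'd3': ['d3/abs']}
```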
The notion of virtual documents is not new, for example on the web, although the aggregation aspect there is not exactly the same. One example of virtual documents is the organization of results into clusters: a cluster is a virtual document. We can go further and provide a summary of the documents forming the cluster, which is what WebInEssence (no longer running) did. In the news domain, the TDT task at TREC is about detecting topics and tracking them, so as to present the whole news item to users in one go.
Aggregated retrieval may also consist of what I call aggregated views. Going back to our example, we could for instance have one window showing one type of result (title elements), another window showing abstract elements, and so on. Note that the ToC is a special case, where we have an element in one window and, in the other, a selected number of elements that correspond to the ToC for that element.
Test collection generation: start from existing test collections (topic, document, judgment); define a set of verticals (varied across genres and media); classify documents (mapping documents in the existing test collections into the defined verticals); detect duplicate topics across existing test collections (e.g. "golden gate bridge" in both an image and a web test collection); extract topics with more than one vertical intent; add topics with only a "General Web" vertical intent.
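The steps above can be sketched end to end on toy data. The document-to-vertical mapping and the judgments here are invented stand-ins for the real classification step; the point is how multi-intent topics fall out of the mapping.

```python
# Step "document classification": each document is mapped to a vertical
# (in reality via a classifier; here hard-coded toy data).
doc_vertical = {"img1": "Image", "blog7": "Blog", "web3": "General Web"}

# Existing test-collection judgments: topic -> {doc: relevant?}.
judgments = {
    "t1": {"img1": True, "web3": True, "blog7": False},
    "t2": {"web3": True},
}

def vertical_intents(topic):
    """Verticals containing at least one relevant document for the topic."""
    return {doc_vertical[d] for d, rel in judgments[topic].items() if rel}

# Step "extracting topics with more than one vertical intent".
multi_intent = [t for t in judgments if len(vertical_intents(t)) > 1]
print(multi_intent)  # ['t1']
```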
Two tables and one figure. Table 1: statistics of the whole test collection. Figure 1: breakdown of text documents into different genres. Table 2: basic statistics of topics. These results are based on the AIRS paper.