Archive for the 'Paper abstracts' Category

Query log analysis for children

Tuesday, May 11th, 2010, posted by Djoerd Hiemstra

Query log analysis in the context of Information Retrieval for children

by Sergio Duarte Torres, Djoerd Hiemstra, and Pavel Serdyukov

In this paper we analyze queries and sessions intended to satisfy children’s information needs using a large-scale query log. The aim of this analysis is twofold: i) to identify differences between such queries and sessions, and general queries and sessions; ii) to enhance the query log by including annotations of queries, sessions, and actions for future research on information retrieval for children. We found statistically significant differences between general-purpose queries and queries seeking content intended for children. We show that our findings are consistent with previous studies on the physical behavior of children using Web search engines.

[figure: most frequent queries]

The paper will be presented at the ACM IIiX Conference in New Brunswick, USA.

[download preprint]

MIREX: MapReduce IR Experiments

Wednesday, April 28th, 2010, posted by Djoerd Hiemstra

MIREX (MapReduce Information Retrieval Experiments) provides solutions to easily and quickly run large-scale information retrieval experiments on a cluster of machines using Hadoop. Version 0.1 has tools for the TREC ClueWeb09 collection. The code is available to other researchers at: http://mirex.sourceforge.net/.
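MIREX itself is built on Hadoop, but the map/reduce pattern behind such experiments can be sketched locally. The toy collection, the function names, and the choice to compute document frequencies are illustrative assumptions, not MIREX code:

```python
# Hypothetical local simulation of a MapReduce-style IR experiment:
# computing document frequencies over a small invented collection.
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    """Emit (term, 1) once per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, 1

def reduce_phase(pairs):
    """Sum the counts per term, as a Hadoop reducer would."""
    pairs = sorted(pairs)  # stands in for Hadoop's shuffle/sort step
    return {term: sum(c for _, c in group)
            for term, group in groupby(pairs, key=itemgetter(0))}

collection = {"d1": "web search engines", "d2": "web information retrieval"}
pairs = [kv for doc_id, text in collection.items()
         for kv in map_phase(doc_id, text)]
df = reduce_phase(pairs)
print(df["web"])  # document frequency of "web" -> 2
```

On a real cluster the map and reduce phases run on different machines over ClueWeb09-scale data; only the contract between them is shown here.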

Query Performance Prediction

Friday, April 23rd, 2010, posted by Djoerd Hiemstra

Evaluation Contrasted with Effectiveness

by Claudia Hauff, Leif Azzopardi, Djoerd Hiemstra, and Franciska de Jong

Query performance predictors are commonly evaluated by reporting correlation coefficients to denote how well the methods perform at predicting the retrieval performance of a set of queries. Despite the amount of research dedicated to this area, one aspect remains neglected: how strong does the correlation need to be in order to realize an improvement in retrieval effectiveness in an operational setting? We address this issue in the context of two settings: Selective Query Expansion and Meta-Search. In an empirical study, we control the quality of a predictor in order to examine how the strength of the correlation achieved affects the effectiveness of an adaptive retrieval system. The results of this study show that many existing predictors fail to achieve a correlation strong enough to reliably improve retrieval effectiveness in both the Selective Query Expansion and the Meta-Search setting.
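The core trick of controlling predictor quality can be sketched as follows. This is not the paper's exact protocol: the per-query effectiveness scores are random inventions, and the simple "blend truth with noise" predictor is an assumption made for illustration:

```python
# Sketch: degrade a perfect predictor by mixing noise into the true
# per-query average precision, and watch the correlation drop.
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(0)
true_ap = [random.random() for _ in range(50)]  # invented true effectiveness

def predictor(quality):
    """quality=1.0 reproduces the truth; lower values mix in noise."""
    return [quality * ap + (1 - quality) * random.random() for ap in true_ap]

perfect = pearson(true_ap, predictor(1.0))
noisy = pearson(true_ap, predictor(0.3))
assert perfect > 0.99 and noisy < perfect
```

An adaptive system (e.g., one deciding per query whether to expand) would then be run against predictors of each quality level to see where effectiveness actually starts to improve.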

[download pdf]

A Case for Automatic System Evaluation

Friday, April 16th, 2010, posted by Djoerd Hiemstra

by Claudia Hauff, Djoerd Hiemstra, Leif Azzopardi, and Franciska de Jong

Ranking a set of retrieval systems according to their retrieval effectiveness without relying on relevance judgments was first explored by Soboroff et al. (Soboroff, I., Nicholas, C., Cahan, P.: Ranking retrieval systems without relevance judgments. In: SIGIR 2001, pp. 66-73). Over the years, a number of alternative approaches have been proposed, all of which have been evaluated on early TREC test collections. In this work, we perform a wider analysis of system ranking estimation methods on sixteen TREC data sets, which cover more tasks and corpora than previous work. Our analysis reveals that the performance of system ranking estimation approaches varies across topics. This observation motivates the hypothesis that the performance of such methods can be improved by selecting the “right” subset of topics from a topic set. We show that using topic subsets improves the performance of automatic system ranking methods by 26% on average, with a maximum of 60%. We also observe that the commonly experienced problem of underestimating the performance of the best systems is data set dependent and not inherent to system ranking estimation. These findings support the case for automatic system evaluation and motivate further research.
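The idea behind the cited Soboroff et al. approach can be sketched in a few lines: draw pseudo-relevant documents at random from the pooled results and score systems against them instead of against human judgments. The runs, the pool fraction, and the use of precision at rank 2 below are invented for illustration:

```python
# Hedged sketch of ranking systems without relevance judgments via
# randomly sampled pseudo-qrels (in the spirit of Soboroff et al.).
import random

runs = {                      # system -> ranked result list for one topic
    "sysA": ["d1", "d2", "d3", "d4"],
    "sysB": ["d4", "d3", "d2", "d1"],
    "sysC": ["d2", "d1", "d4", "d3"],
}

def rank_without_judgments(runs, sample_frac=0.5, seed=42):
    pool = sorted({d for ranking in runs.values() for d in ranking})
    rng = random.Random(seed)
    pseudo_qrels = set(rng.sample(pool, max(1, int(sample_frac * len(pool)))))
    def p_at_2(ranking):  # precision at rank 2 against the pseudo judgments
        return sum(d in pseudo_qrels for d in ranking[:2]) / 2
    return sorted(runs, key=lambda s: p_at_2(runs[s]), reverse=True)

print(rank_without_judgments(runs))
```

In practice the sampling is repeated over many trials and many topics, and the estimated ranking is compared to the official one with a rank correlation measure; topic subset selection, as studied in the paper, changes which topics feed into that estimate.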

[download pdf]

Query-based sampling using only snippets

Thursday, April 1st, 2010, posted by Djoerd Hiemstra

by Almer Tigelaar and Djoerd Hiemstra

Query-based sampling is a commonly used approach to model the content of servers. Conventionally, queries are sent to a server and the documents in the search results returned are downloaded in full as a representation of the server’s content. We present an approach that uses the document snippets in the search results as samples instead of downloading the entire documents. We show this yields equal or better modeling performance for the same bandwidth consumption, depending on collection characteristics like document length distribution and homogeneity. Query-based sampling using snippets is a useful approach for real-world systems, since it requires no extra operations beyond exchanging queries and search results.
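The bandwidth trade-off can be made concrete with a toy comparison. The two-document "server", the first-five-words snippet, and the byte-count cost model are all invented assumptions, not the paper's experimental setup:

```python
# Illustrative sketch: vocabulary gained versus bytes spent when sampling
# a server with full documents or with snippets only.
docs = {
    "d1": "distributed retrieval sends queries to many remote search servers",
    "d2": "snippets summarise a document in the result list of an engine",
}

def snippet(text, n=5):
    return " ".join(text.split()[:n])  # crude stand-in for a result snippet

def sample(use_snippets):
    vocab, cost = set(), 0
    for text in docs.values():
        sample_text = snippet(text) if use_snippets else text
        vocab |= set(sample_text.split())
        cost += len(sample_text.encode())  # bandwidth spent, in bytes
    return len(vocab), cost

full_vocab, full_cost = sample(use_snippets=False)
snip_vocab, snip_cost = sample(use_snippets=True)
assert snip_cost < full_cost  # snippets cost less bandwidth per query
```

The paper's finding is that, measured per byte rather than per document, the smaller snippet samples can model the collection as well as or better than full downloads.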

The paper will be presented at the SIGIR 2010 Workshop on Large-Scale Distributed Systems for Information Retrieval, on July 23rd, 2010 in Geneva, Switzerland.

[download pdf]

Automatic summarisation of discussion fora

Thursday, March 25th, 2010, posted by Djoerd Hiemstra

by Almer Tigelaar, Rieks op den Akker, and Djoerd Hiemstra

Web-based discussion fora proliferate on the Internet. These fora consist of threads about specific matters. Existing forum search facilities provide an easy way for finding threads of interest. However, understanding the content of threads is not always trivial. This problem becomes more pressing as threads become longer. It frustrates users who are looking for specific information and also makes it more difficult to make valuable contributions to a discussion. We postulate that having a concise summary of a thread would greatly help forum users. But how would we best create such summaries? In this paper, we present an automated method of summarising threads in discussion fora. Compared with summarisation of unstructured texts and spoken dialogues, the structural characteristics of threads give important advantages. We studied how to best exploit these characteristics. Messages in threads contain both explicit and implicit references to each other and are structured. Therefore, we term the threads hierarchical dialogues. Our proposed summarisation algorithm produces one summary of a hierarchical dialogue by ‘cherry-picking’ sentences out of the original messages that make up a thread. We try to select sentences usable for obtaining an overview of the discussion. Our method is built around a set of heuristics based on observations of real forum discussions. The data used for this research was in Dutch, but the developed method equally applies to other languages. We evaluated our approach using a prototype. Users judged our summariser as very useful, half of them indicating they would use it regularly or always when visiting fora.
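The cherry-picking idea can be sketched with two toy heuristics. The example thread, the weights, and the two heuristics (prefer longer sentences, boost the opening post) are invented for illustration; the paper's heuristic set is considerably richer:

```python
# Illustrative sketch of 'cherry-picking' sentences from thread messages.
thread = [
    ("opening post", "How do I configure the proxy? It keeps timing out."),
    ("reply 1", "Same here."),
    ("reply 2", "Set the timeout in the config file to 60 seconds."),
]

def score(sentence, position):
    words = sentence.split()
    s = min(len(words), 10) / 10       # prefer reasonably long sentences
    s += 0.5 if position == 0 else 0   # the opening post frames the thread
    return s

def summarise(thread, k=2):
    sentences = [(score(sent, i), i, sent)
                 for i, (_, msg) in enumerate(thread)
                 for sent in msg.split(". ")]
    top = sorted(sentences, reverse=True)[:k]
    # restore original thread order so the summary still reads as a dialogue
    return [sent for _, _, sent in sorted(top, key=lambda t: t[1])]

print(summarise(thread))
```

A real implementation would also exploit the reply structure, explicit quotes, and the other message-level cues the paper derives from observing actual forum discussions.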

Published in Natural Language Engineering 16(2): 161–192, Cambridge University Press. [download pdf]

Query-Based Sampling: Can we do Better than Random?

Tuesday, March 16th, 2010, posted by Djoerd Hiemstra

by Almer Tigelaar and Djoerd Hiemstra

Many servers on the web offer content that is only accessible via a search interface. These are part of the deep web. Using conventional crawling to index the content of these remote servers is impossible without some form of cooperation. Query-based sampling provides an alternative to crawling, requiring no cooperation beyond a basic search interface. In this approach, conventionally, random queries are sent to a server to obtain a sample of documents of the underlying collection. The sample represents the entire server content. This representation is called a resource description. In this research we explore whether better resource descriptions can be obtained by using alternative query construction strategies. The results indicate that randomly choosing queries from the vocabulary of sampled documents is indeed a good strategy. However, we show that, when sampling a large collection, using the least frequent terms in the sample yields a better resource description than using randomly chosen terms.
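The two query-construction strategies compared here are easy to state in code. The sampled text below is invented, and the tie-breaking by alphabetical order is an assumption added to keep the sketch deterministic:

```python
# Sketch of two probe-query strategies over the vocabulary sampled so far:
# pick a term at random, or pick the least frequent term seen.
import random
from collections import Counter

sampled_text = "search search search engine engine index"
freq = Counter(sampled_text.split())  # term frequencies in the sample

def random_term(seed=1):
    """Baseline: a uniformly random term from the sampled vocabulary."""
    return random.Random(seed).choice(sorted(freq))

def least_frequent_term():
    """Alternative: the rarest sampled term (ties broken alphabetically)."""
    return min(freq, key=lambda t: (freq[t], t))

assert least_frequent_term() == "index"   # rarest term in the sample
```

Intuitively, rare terms probe parts of the collection the sample has not covered yet, which matches the paper's finding that they help on large collections.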

[download pdf]

Learning to Merge Search Results

Tuesday, March 16th, 2010, posted by Djoerd Hiemstra

Learning to Merge Search Results for Efficient Distributed Information Retrieval

by Kien Tjin-Kam-Jet and Djoerd Hiemstra

Merging search results from different servers is a major problem in Distributed Information Retrieval. We used Regression-SVM and Ranking-SVM, which learn a function that merges results based on information that is readily available, i.e., the ranks, titles, summaries and URLs contained in the result pages. By not downloading additional information, such as the full documents, we decrease bandwidth usage. CORI and Round-Robin merging were used as our baselines; surprisingly, our results show that the SVM methods do not improve over these baselines.
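The Round-Robin baseline and the kind of cheap per-result features the SVMs learn from can be sketched as follows. The result lists, the query, and the three-feature vector are invented; the paper's feature set and the SVM training itself are not shown:

```python
# Sketch: round-robin merging plus rank/title/summary features that a
# learned merger could use without downloading full documents.
from itertools import zip_longest

def round_robin(*ranked_lists):
    """Interleave results from several servers, skipping duplicates."""
    merged, seen = [], set()
    for group in zip_longest(*ranked_lists):
        for doc in group:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

def features(query, rank, title, summary):
    """Cheap evidence available in a result page: rank and term overlap."""
    q = set(query.lower().split())
    return [1.0 / rank,                                    # rank evidence
            len(q & set(title.lower().split())) / len(q),  # title overlap
            len(q & set(summary.lower().split())) / len(q)]

merged = round_robin(["a", "b"], ["c", "a", "d"])
assert merged == ["a", "c", "b", "d"]
```

A Regression-SVM would map such feature vectors to a merging score; the paper's result is that this learned score did not beat the simple interleaving above or CORI.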

[download pdf]

Remko Nolten graduates on automatic hyperlinking

Friday, January 22nd, 2010, posted by Djoerd Hiemstra

WikiLink: Anchor Detection and Link Generation in Wiki’s

by Remko Nolten

In this research we try to automate the process of link generation in wikis by looking at existing link generation techniques and enhancing them with our own ideas. We started the research by analyzing a large document corpus to find out more about the links we want to create. In our analysis we looked into three aspects of our datasets. First, we wanted to know more about the relation between the text that is used to display the link and the title of the page where the link points to. We showed that a large majority of the links could theoretically be identified by matching the text of the link with the page title of the appropriate page, but we also identified several problems with this approach. Second, we wanted to learn more about the existing link structure in our dataset. Here, we confirmed most advantages and disadvantages of using existing links in a link generation algorithm that were also identified by other studies. Finally, we decided to analyze the grammatical structure of links, to see if we could use this later on. Our analysis showed that a very large majority of the links were nouns or noun phrases, which suggests that this would be a good way to identify links in a text.

Based on the results of this analysis, we built a framework in which we could implement new and existing methods for link generation. In the framework, the processes of ‘anchor detection’ (the technique of discovering phrases in a larger text that could be used as a basis for a link) and ‘destination finding’ (the process of finding a suitable destination page for a short piece of text) were separated. This way we could try multiple combinations to see which would work best. Using this framework, we found that our grammar-based anchor detection algorithm combined with multiple destination finding algorithms resulted in the best performance. Our final performance figures were better than those of most competitors, which shows the potential of our techniques.
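The title-matching side of anchor detection described above can be sketched with a longest-match scan. The title dictionary is invented, and this simple lookup stands in for only one of the thesis's components; the grammar-based detection of noun-phrase anchors is not shown:

```python
# Hedged sketch: find anchors in a text by matching phrases against a
# dictionary of known page titles, preferring the longest match.
titles = {"information retrieval": "Information_Retrieval",
          "retrieval": "Retrieval"}

def find_anchors(text, titles, max_len=3):
    words, anchors, i = text.lower().split(), [], 0
    while i < len(words):
        for n in range(max_len, 0, -1):       # try the longest phrase first
            phrase = " ".join(words[i:i + n])
            if phrase in titles:
                anchors.append((phrase, titles[phrase]))
                i += n                        # skip past the matched anchor
                break
        else:
            i += 1                            # no title starts here
    return anchors

anchors = find_anchors("a course on information retrieval", titles)
assert anchors == [("information retrieval", "Information_Retrieval")]
```

The longest-match preference avoids linking the bare word "retrieval" when the more specific title "information retrieval" covers it, one of the problems a naive title match runs into.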

Towards Affordable Disclosure of Spoken Heritage Archives

Friday, December 11th, 2009, posted by Djoerd Hiemstra

by Roeland Ordelman, Willemijn Heeren, Franciska de Jong, Marijn Huijbregts, and Djoerd Hiemstra

This paper presents and discusses ongoing work aiming at affordable disclosure of real-world spoken heritage archives in general, and in particular of a collection of recorded interviews with Dutch survivors of World War II concentration camp Buchenwald. Given such collections, we at least want to provide search at different levels and a flexible way of presenting results. Strategies for automatic annotation based on speech recognition (supporting, e.g., within-document search) are outlined and discussed with respect to the Buchenwald interview collection. In addition, usability aspects of the spoken word search are discussed on the basis of our experiences with the online Buchenwald web portal. It is concluded that, although user feedback is generally fairly positive, automatic annotation performance is not yet satisfactory, and requires additional research.

To be published in the Journal of Digital Information 10(6).

[download pdf]