Archive for 2010

Query log analysis for children

Tuesday, May 11th, 2010, posted by Djoerd Hiemstra

Query log analysis in the context of Information Retrieval for children

by Sergio Duarte Torres, Djoerd Hiemstra, and Pavel Serdyukov

In this paper we analyze queries and sessions intended to satisfy children’s information needs using a large-scale query log. The aim of this analysis is twofold: i) to identify differences between such queries and sessions and general queries and sessions; ii) to enhance the query log by including annotations of queries, sessions, and actions for future research on information retrieval for children. We found statistically significant differences between the set of general-purpose queries and the set of queries seeking content intended for children. We show that our findings are consistent with previous studies on the physical behavior of children using Web search engines.

most frequent queries

The paper will be presented at the ACM IIiX Conference in New Brunswick, USA.

[download preprint]

Quality from Twente

Monday, May 3rd, 2010, posted by Djoerd Hiemstra

FC Twente Champions of the Netherlands!

Newest logo UT a real bargain ;-)

Automatic Reformulation of Children’s Search Queries

Monday, May 3rd, 2010, posted by Djoerd Hiemstra

Maarten van Kalsbeek, Joost de Wit, Dolf Trieschnigg, Paul van der Vet, Theo Huibers and Djoerd Hiemstra

The number of children who have access to an Internet connection (at home or at school) is large and growing fast. Many of these children search the web using a search engine. However, these search engines do not consider children's skills and preferences, which makes searching difficult. This paper tries to uncover methods and techniques that can be used to automatically improve search results for queries formulated by children. To achieve this, a prototype query expander is built that implements several of these techniques. The paper concludes with an evaluation of the prototype and a discussion of the promising results.

download pdf

MIREX: MapReduce IR Experiments

Wednesday, April 28th, 2010, posted by Djoerd Hiemstra

MIREX (MapReduce Information Retrieval Experiments) provides solutions to easily and quickly run large-scale information retrieval experiments on a cluster of machines using Hadoop. Version 0.1 has tools for the TREC ClueWeb09 collection. The code is available to other researchers at: http://mirex.sourceforge.net/.

Anchor text for ClueWeb09 Category A

Tuesday, April 27th, 2010, posted by Djoerd Hiemstra

We’ve put anchor text for the English Category A documents of the TREC ClueWeb09 collection online using BitTorrent:

The file contains anchor text for about 88% of the pages in Category A. The text is cut after more than 10 MB of anchors have been collected for one page to keep the file manageable. The size is about 24.5 GB (gzipped). The file is a tab-separated text file consisting of (TREC-ID, URL, ANCHOR TEXT). The anchor text extraction is described in (please cite the report if you use the data in your research): The source code is available from: http://mirex.sourceforge.net
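A minimal sketch of how such a file could be read, given only the layout described above (gzipped, tab-separated, three fields per line). The file name, the UTF-8 encoding, and the names `AnchorRecord`, `parse_line`, and `read_anchors` are assumptions made for illustration; they are not part of MIREX. Because the file is large, the sketch streams it line by line instead of loading it into memory.

```python
import gzip
from typing import Iterator, NamedTuple


class AnchorRecord(NamedTuple):
    """One line of the anchor text file: (TREC-ID, URL, ANCHOR TEXT)."""
    trec_id: str
    url: str
    anchor_text: str


def parse_line(line: str) -> AnchorRecord:
    # Split on the first two tabs only, so that any further tab
    # stays inside the anchor text field.
    trec_id, url, anchor_text = line.rstrip("\n").split("\t", 2)
    return AnchorRecord(trec_id, url, anchor_text)


def read_anchors(path: str) -> Iterator[AnchorRecord]:
    # Assumption: the text is UTF-8; undecodable bytes are replaced
    # rather than raising, since web anchor text can be messy.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.count("\t") < 2:
                continue  # skip lines that do not have three fields
            yield parse_line(line)
```

A loop such as `for record in read_anchors("anchors.tsv.gz"): ...` (hypothetical file name) would then process one page's anchor text at a time.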

Query Performance Prediction

Friday, April 23rd, 2010, posted by Djoerd Hiemstra

Evaluation Contrasted with Effectiveness

by Claudia Hauff, Leif Azzopardi, Djoerd Hiemstra, and Franciska de Jong

Query performance predictors are commonly evaluated by reporting correlation coefficients to denote how well the methods perform at predicting the retrieval performance of a set of queries. Despite the amount of research dedicated to this area, one aspect remains neglected: how strong does the correlation need to be in order to realize an improvement in retrieval effectiveness in an operational setting? We address this issue in the context of two settings: Selective Query Expansion and Meta-Search. In an empirical study, we control the quality of a predictor in order to examine how the strength of the correlation achieved affects the effectiveness of an adaptive retrieval system. The results of this study show that many existing predictors fail to achieve a correlation strong enough to reliably improve the retrieval effectiveness in the Selective Query Expansion as well as the Meta-Search setting.
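To make the evaluation step concrete, the sketch below computes two commonly reported correlation coefficients between a predictor's per-query scores and the measured retrieval performance of the same queries. It is an illustration only: the function names are hypothetical, the Kendall variant is the simple tau-a without tie correction, and any scores fed to it here are made up rather than taken from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Linear correlation between predicted and measured performance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


def kendall_tau(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Rank correlation (tau-a): do the two lists order query pairs alike?"""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Here `xs` would hold the predictor's score per query and `ys` a per-query effectiveness measure such as average precision. The question raised in the abstract is how high these coefficients must be before acting on the predictor actually improves retrieval.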

[download pdf]

A Case for Automatic System Evaluation

Friday, April 16th, 2010, posted by Djoerd Hiemstra

by Claudia Hauff, Djoerd Hiemstra, Leif Azzopardi, and Franciska de Jong

Ranking a set of retrieval systems according to their retrieval effectiveness without relying on relevance judgments was first explored by Soboroff et al. (Soboroff, I., Nicholas, C., Cahan, P. Ranking retrieval systems without relevance judgments. In: SIGIR 2001, pp. 66-73). Over the years, a number of alternative approaches have been proposed, all of which have been evaluated on early TREC test collections. In this work, we perform a wider analysis of system ranking estimation methods on sixteen TREC data sets which cover more tasks and corpora than previous studies. Our analysis reveals that the performance of system ranking estimation approaches varies across topics. This observation motivates the hypothesis that the performance of such methods can be improved by selecting the “right” subset of topics from a topic set. We show that using topic subsets improves the performance of automatic system ranking methods by 26% on average, with a maximum of 60%. We also observe that the commonly experienced problem of underestimating the performance of the best systems is data set dependent and not inherent to system ranking estimation. These findings support the case for automatic system evaluation and motivate further research.

[download pdf]

Query-based sampling using only snippets

Thursday, April 1st, 2010, posted by Djoerd Hiemstra

by Almer Tigelaar and Djoerd Hiemstra

Query-based sampling is a commonly used approach to model the content of servers. Conventionally, queries are sent to a server and the documents in the returned search results are downloaded in full as a representation of the server’s content. We present an approach that uses the document snippets in the search results as samples instead of downloading the entire documents. We show this yields equal or better modeling performance for the same bandwidth consumption, depending on collection characteristics such as document length distribution and homogeneity. Query-based sampling using snippets is a useful approach for real-world systems, since it requires no extra operations beyond exchanging queries and search results.
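The idea can be sketched as a short loop: send a query, add the terms of the returned snippets to a term-frequency model of the server, and pick the next query from the vocabulary seen so far. This is a rough illustration under stated assumptions, not the method evaluated in the paper: the `search` callable stands in for a real server, whitespace tokenization is a simplification, and choosing the next query term uniformly at random is an assumption borrowed from the usual description of query-based sampling.

```python
import random
from collections import Counter
from typing import Callable, List


def sample_with_snippets(
    search: Callable[[str], List[str]],  # hypothetical: query -> result snippets
    seed_query: str,
    iterations: int,
    rng: random.Random,
) -> Counter:
    """Build a term-frequency model of a server from snippets only."""
    model: Counter = Counter()
    query = seed_query
    for _ in range(iterations):
        # Only the snippets in the result list are used;
        # no documents are downloaded.
        for snippet in search(query):
            model.update(snippet.lower().split())
        if not model:
            break  # nothing learned yet, so no term to query with
        # Assumption: next query is a random term from the vocabulary so far.
        query = rng.choice(sorted(model))
    return model
```

With a real server, `search` would wrap whatever query interface that server offers; the loop itself needs nothing beyond the exchange of queries and search results.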

The paper will be presented at the SIGIR 2010 Workshop on Large-Scale Distributed Systems for Information Retrieval, on July 23rd, 2010 in Geneva, Switzerland.

[download pdf]

SIGIR Workshop on Accessible Search Systems

Monday, March 29th, 2010, posted by Djoerd Hiemstra

We are organizing a workshop on an exciting new theme at SIGIR on 23 July 2010 in Geneva, Switzerland.

Current search systems are not adequate for individuals with specific needs: children, older adults, people with visual or motor impairments, and people with intellectual disabilities or low literacy. Search services are typically created for average users (young or middle-aged adults without physical or mental disabilities) and information retrieval methods are based on their perception of relevance as well. The workshop will be the first ever to raise the discussion on how to make search engines accessible for different types of users, including those with problems in reading, writing or comprehension of complex content. Search accessibility means that people whose abilities are considerably different from those that average users have will be able to use search systems with the same success.

The objective of the workshop is to provide a forum and initiate collaborations between academics and industrial practitioners interested in making search more usable for users in general and for users with specific needs in particular. We encourage presentation and participation from researchers working at the intersection of information retrieval, natural language processing, human-computer interaction, ambient intelligence and related areas. The workshop will be a mix of oral presentations for long papers (maximum of 8 pages), a session for posters (maximum of 2 pages) and a panel discussion. All submissions will be reviewed by at least two PC members. Workshop proceedings will be available at the workshop. The workshop welcomes, but is not limited to, contributions on a range of the following key issues:

  • Understanding of search behavior of users with specific needs
  • Understanding of relevance criteria of users with specific needs
  • Understanding the effects of domain expertise, age, user experience and cognitive abilities on search goals and results evaluation
  • Non-topical aspects of relevance: text style, readability, appropriateness of language (harassment and explicit content detection)
  • Development of test collections for evaluation of accessible search systems
  • Collaborative search techniques for assisting users with specific needs (e.g. parents helping children)
  • Potential of search personalization techniques to satisfy users with specific needs
  • Search interfaces and result representation for people with specific needs
  • Using assistive technologies for interaction with search systems, e.g. speech recognition or eye tracking software for querying and browsing.

See the Workshop website.

Automatic summarisation of discussion fora

Thursday, March 25th, 2010, posted by Djoerd Hiemstra

by Almer Tigelaar, Rieks op den Akker, and Djoerd Hiemstra

Web-based discussion fora proliferate on the Internet. These fora consist of threads about specific matters. Existing forum search facilities provide an easy way of finding threads of interest. However, understanding the content of threads is not always trivial. This problem becomes more pressing as threads become longer. It frustrates users who are looking for specific information and also makes it more difficult to make valuable contributions to a discussion. We postulate that having a concise summary of a thread would greatly help forum users. But how would we best create such summaries? In this paper, we present an automated method of summarising threads in discussion fora. Compared with summarisation of unstructured texts and spoken dialogues, the structural characteristics of threads give important advantages. We studied how to best exploit these characteristics. Messages in threads contain both explicit and implicit references to each other and are structured. Therefore, we term the threads hierarchical dialogues. Our proposed summarisation algorithm produces one summary of a hierarchical dialogue by ‘cherry-picking’ sentences out of the original messages that make up a thread. We try to select sentences usable for obtaining an overview of the discussion. Our method is built around a set of heuristics based on observations of real fora discussions. The data used for this research was in Dutch, but the developed method equally applies to other languages. We evaluated our approach using a prototype. Users judged our summariser as very useful, half of them indicating they would use it regularly or always when visiting fora.

Published in Natural Language Engineering 16(2): 161–192, Cambridge University Press. [download pdf]