Archive for the 'Paper abstracts' Category

Query Performance Prediction

Friday, April 23rd, 2010, posted by Djoerd Hiemstra

Evaluation Contrasted with Effectiveness

by Claudia Hauff, Leif Azzopardi, Djoerd Hiemstra, and Franciska de Jong

Query performance predictors are commonly evaluated by reporting correlation coefficients to denote how well the methods perform at predicting the retrieval performance of a set of queries. Despite the amount of research dedicated to this area, one aspect remains neglected: how strong does the correlation need to be in order to realize an improvement in retrieval effectiveness in an operational setting? We address this issue in the context of two settings: Selective Query Expansion and Meta-Search. In an empirical study, we control the quality of a predictor in order to examine how the strength of the achieved correlation affects the effectiveness of an adaptive retrieval system. The results of this study show that many existing predictors fail to achieve a correlation strong enough to reliably improve the retrieval effectiveness in the Selective Query Expansion as well as the Meta-Search setting.
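As a rough illustration of the evaluation practice described above, the sketch below computes a rank correlation between a predictor's scores and the observed retrieval performance per query. The abstract does not say which coefficient is reported; Kendall's tau (the tau-a variant) is chosen here as an assumption, and the scores are made-up values, not results from the paper.

```python
from itertools import combinations


def kendall_tau(predicted, actual):
    """Kendall's tau-a between two equally long score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(predicted) != len(actual):
        raise ValueError("score lists must have the same length")
    if len(predicted) < 2:
        raise ValueError("at least two queries are needed")
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        sign = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical predictor scores and average precision values for five queries.
predictor_scores = [0.9, 0.4, 0.7, 0.2, 0.5]
average_precision = [0.8, 0.3, 0.4, 0.1, 0.6]
tau = kendall_tau(predictor_scores, average_precision)
print(f"Kendall's tau: {tau:.2f}")
```

A coefficient like this summarises how well the predictor orders the queries. The question raised in the abstract is how high it must be before an adaptive system actually benefits.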

[download pdf]

A Case for Automatic System Evaluation

Friday, April 16th, 2010, posted by Djoerd Hiemstra

by Claudia Hauff, Djoerd Hiemstra, Leif Azzopardi, and Franciska de Jong

Ranking a set of retrieval systems according to their retrieval effectiveness without relying on relevance judgments was first explored by Soboroff et al. (Soboroff, I., Nicholas, C., Cahan, P. Ranking retrieval systems without relevance judgments. In: SIGIR 2001, pp. 66-73). Over the years, a number of alternative approaches have been proposed, all of which have been evaluated on early TREC test collections. In this work, we perform a wider analysis of system ranking estimation methods on sixteen TREC data sets which cover more tasks and corpora than previously. Our analysis reveals that the performance of system ranking estimation approaches varies across topics. This observation motivates the hypothesis that the performance of such methods can be improved by selecting the “right” subset of topics from a topic set. We show that using topic subsets improves the performance of automatic system ranking methods by 26% on average, with a maximum of 60%. We also observe that the commonly experienced problem of underestimating the performance of the best systems is data set dependent and not inherent to system ranking estimation. These findings support the case for automatic system evaluation and motivate further research.

[download pdf]

Query-based sampling using only snippets

Thursday, April 1st, 2010, posted by Djoerd Hiemstra

by Almer Tigelaar and Djoerd Hiemstra

Query-based sampling is a commonly used approach to model the content of servers. Conventionally, queries are sent to a server and the documents in the search results returned are downloaded in full as a representation of the server’s content. We present an approach that uses the document snippets in the search results as samples instead of downloading the entire documents. We show this yields equal or better modeling performance for the same bandwidth consumption, depending on collection characteristics like document length distribution and homogeneity. Query-based sampling using snippets is a useful approach for real-world systems, since it requires no extra operations beyond exchanging queries and search results.

The paper will be presented at the SIGIR 2010 Workshop on Large-Scale Distributed Systems for Information Retrieval, on July 23rd, 2010 in Geneva, Switzerland.

[download pdf]

Automatic summarisation of discussion fora

Thursday, March 25th, 2010, posted by Djoerd Hiemstra

by Almer Tigelaar, Rieks op den Akker, and Djoerd Hiemstra

Web-based discussion fora proliferate on the Internet. These fora consist of threads about specific matters. Existing forum search facilities provide an easy way of finding threads of interest. However, understanding the content of threads is not always trivial. This problem becomes more pressing as threads become longer. It frustrates users who are looking for specific information and also makes it more difficult to make valuable contributions to a discussion. We postulate that having a concise summary of a thread would greatly help forum users. But how would we best create such summaries? In this paper, we present an automated method of summarising threads in discussion fora. Compared with summarisation of unstructured texts and spoken dialogues, the structural characteristics of threads give important advantages. We studied how to best exploit these characteristics. Messages in threads contain both explicit and implicit references to each other and are structured. Therefore, we term the threads hierarchical dialogues. Our proposed summarisation algorithm produces one summary of a hierarchical dialogue by ‘cherry-picking’ sentences out of the original messages that make up a thread. We try to select sentences usable for obtaining an overview of the discussion. Our method is built around a set of heuristics based on observations of real fora discussions. The data used for this research was in Dutch, but the developed method equally applies to other languages. We evaluated our approach using a prototype. Users judged our summariser as very useful, half of them indicating they would use it regularly or always when visiting fora.

Published in Natural Language Engineering 16(2): 161–192, Cambridge University Press. [download pdf]

Query-Based Sampling: Can we do Better than Random?

Tuesday, March 16th, 2010, posted by Djoerd Hiemstra

by Almer Tigelaar and Djoerd Hiemstra

Many servers on the web offer content that is only accessible via a search interface. These are part of the deep web. Using conventional crawling to index the content of these remote servers is impossible without some form of cooperation. Query-based sampling provides an alternative to crawling requiring no cooperation beyond a basic search interface. In this approach, conventionally, random queries are sent to a server to obtain a sample of documents of the underlying collection. The sample represents the entire server content. This representation is called a resource description. In this research we explore if better resource descriptions can be obtained by using alternative query construction strategies. The results indicate that randomly choosing queries from the vocabulary of sampled documents is indeed a good strategy. However, we show that, when sampling a large collection, using the least frequent terms in the sample yields a better resource description than using randomly chosen terms.
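The sampling loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the in-memory `search` function stands in for a remote search interface, the resource description is reduced to plain term counts, and the names `sample_server`, `top_k` and the `"least_frequent"` strategy label are hypothetical.

```python
import random
from collections import Counter


def search(collection, term, top_k=2):
    """Toy search interface: return up to top_k documents containing the term."""
    return [doc for doc in collection if term in doc.split()][:top_k]


def sample_server(collection, seed_term, n_queries, strategy, rng):
    """Sample documents through the search interface and return term counts."""
    sampled = []
    used = set()
    term = seed_term
    for _ in range(n_queries):
        used.add(term)
        for doc in search(collection, term):
            if doc not in sampled:
                sampled.append(doc)
        counts = Counter(word for doc in sampled for word in doc.split())
        # The next query is drawn from the vocabulary of the sample so far.
        candidates = sorted(t for t in counts if t not in used)
        if not candidates:
            break
        if strategy == "random":
            term = rng.choice(candidates)
        else:  # "least_frequent": rarest term in the sample, ties broken alphabetically
            term = min(candidates, key=lambda t: (counts[t], t))
    return Counter(word for doc in sampled for word in doc.split())


collection = [
    "deep web search interface",
    "query based sampling of servers",
    "random queries sample the server",
    "resource description of the server",
]
description = sample_server(collection, "server", 3, "least_frequent", random.Random(0))
print(description.most_common(3))
```

Swapping the strategy argument to `"random"` gives the conventional variant, so the two query construction strategies can be compared on the same collection.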

[download pdf]

Learning to Merge Search Results

Tuesday, March 16th, 2010, posted by Djoerd Hiemstra

Learning to Merge Search Results for Efficient Distributed Information Retrieval

Kien Tjin-Kam-Jet and Djoerd Hiemstra

Merging search results from different servers is a major problem in Distributed Information Retrieval. We used Regression-SVM and Ranking-SVM, which learn a function that merges results based on information that is readily available, i.e. the ranks, titles, summaries and URLs contained in the results pages. By not downloading additional information, such as the full document, we decrease bandwidth usage. CORI and Round Robin merging were used as our baselines; surprisingly, our results show that the SVM methods do not improve over those baselines.
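For readers unfamiliar with the simpler of the two baselines, Round Robin merging can be sketched as below. The abstract gives no implementation details, so this is a minimal sketch: skipping duplicates is an assumption, and the document identifiers are hypothetical strings (the code relies on `None` never being a valid identifier).

```python
from itertools import zip_longest


def round_robin_merge(result_lists):
    """Interleave ranked lists: the first result of each server, then the second, and so on."""
    merged = []
    seen = set()
    for tier in zip_longest(*result_lists):
        for doc in tier:
            # None marks a server whose list is already exhausted.
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


server_a = ["a1", "a2", "a3"]
server_b = ["b1", "b2"]
server_c = ["c1", "a2", "c3"]
merged = round_robin_merge([server_a, server_b, server_c])
print(merged)
```

Round Robin uses only the rank of each result within its own list, which is why it needs no extra downloads and serves as a natural point of comparison for the learned merging functions.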

[download pdf]

Remko Nolten graduates on automatic hyperlinking

Friday, January 22nd, 2010, posted by Djoerd Hiemstra

WikiLink: Anchor Detection and Link Generation in Wiki’s

by Remko Nolten

In this research we try to automate the process of link generation in wikis by looking at existing link generation techniques and enhancing these with our own ideas. We started the research by analyzing a large document corpus to find out more about the links we want to create. In our analysis we looked into three aspects of our datasets. First, we wanted to know more about the relation between the text that is used to display the link and the title of the page the link points to. We showed that a large majority of the links could theoretically be identified by matching the text of the link with the page title of the appropriate page, but we also identified several problems with this approach. Second, we wanted to learn more about the existing link structure in our dataset. Here, we confirmed most advantages and disadvantages of using existing links in a link generation algorithm that were also identified by other studies. Finally, we decided to analyze the grammatical structure of links, to see if we could use this later on. Our analysis showed that a very large majority of the links were nouns or noun phrases, which suggests that this would be a good way to identify links in a text.
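The title-matching idea from the first part of the analysis can be illustrated with a small sketch. It is not the thesis' algorithm: the longest-match-first scan, the `max_words` limit, the punctuation stripping and the example titles are all assumptions made for illustration.

```python
def detect_anchors(text, page_titles, max_words=3):
    """Return (anchor text, page title) pairs found by exact title matching."""
    titles = {title.lower(): title for title in page_titles}
    words = text.split()
    anchors = []
    i = 0
    while i < len(words):
        matched = 0
        # Try the longest phrase first so multi-word titles win over single words.
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n]).strip(".,;:!?")
            if phrase.lower() in titles:
                anchors.append((phrase, titles[phrase.lower()]))
                matched = n
                break
        i += matched if matched else 1
    return anchors


titles = ["Information retrieval", "Query", "Document"]
text = "Information retrieval systems rank documents for a query."
anchors = detect_anchors(text, titles)
print(anchors)
```

In this toy run the plural "documents" does not match the title "Document", which shows one way in which exact matching can miss a plausible link.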

Based on the results of this analysis, we built a framework in which we could implement new and existing methods for link generation. In the framework, the process of ‘anchor detection’ (the technique of discovering phrases in a larger text that could be used as a basis for a link) and ‘destination finding’ (the process of finding a suitable destination page for a short piece of text) were separated. This way we could try multiple combinations to see which would work best. Using this framework, we found that our grammar-based anchor detection algorithm combined with multiple destination finding algorithms resulted in the best performance. Our final performance figures were better than those of most competitors, which showed the potential of our techniques.

Towards Affordable Disclosure of Spoken Heritage Archives

Friday, December 11th, 2009, posted by Djoerd Hiemstra

by Roeland Ordelman, Willemijn Heeren, Franciska de Jong, Marijn Huijbregts, and Djoerd Hiemstra

This paper presents and discusses ongoing work aiming at affordable disclosure of real-world spoken heritage archives in general, and in particular of a collection of recorded interviews with Dutch survivors of World War II concentration camp Buchenwald. Given such collections, we at least want to provide search at different levels and a flexible way of presenting results. Strategies for automatic annotation based on speech recognition – supporting, e.g., within-document search – are outlined and discussed with respect to the Buchenwald interview collection. In addition, usability aspects of the spoken word search are discussed on the basis of our experiences with the online Buchenwald web portal. It is concluded that, although user feedback is generally fairly positive, automatic annotation performance is not yet satisfactory, and requires additional research.

To be published in the Journal of Digital Information 10(6).

[download pdf]

Indexing half a billion web pages

Tuesday, October 27th, 2009, posted by Djoerd Hiemstra

by Claudia Hauff and Djoerd Hiemstra

The University of Twente participated in three tasks of TREC 2009: the adhoc task, the diversity task and the relevance feedback task. All experiments are performed on the English part of ClueWeb09. In this draft paper, we describe our approach to tuning our retrieval system in the absence of training data in Section 3. We describe the use of categories and a query log for diversifying search results in Section 4. Section 5 describes preliminary results for the relevance feedback task.

[download pdf]

Matthijs Ooms graduates on Provenance Management for Bioinformatics

Monday, September 28th, 2009, posted by Djoerd Hiemstra

by Matthijs Ooms

Scientific Workflow Management Systems (SWfMSs), such as our own research prototype e-BioFlow, are being used by bioinformaticians to design and run data-intensive experiments, connecting local and remote (Web) services and tools. Preserving data for later inspection or reuse, determining the quality of results, and validating results are essential for scientific experiments. This can all be achieved by collecting provenance data. The dependencies between services and data are captured in a provenance model, such as the interchangeable Open Provenance Model (OPM). This research consists of the following two provenance-related goals:

  1. Using a provenance archive effectively and efficiently as cache for workflow tasks.
  2. Designing techniques to support browsing and navigation through a provenance archive.

The use case identified is called OligoRAP, taken from the life science domain. OligoRAP is cast as a workflow in the SWfMS e-BioFlow. Its performance in terms of duration was measured and its results validated by comparing them to the results of the original Perl implementation. By casting OligoRAP as a workflow and using parallelism, its performance is improved by a factor of two.

Many improvements were made to e-BioFlow in order to run OligoRAP, among which is a new provenance implementation based on the OPM that enables provenance capturing during the execution of OligoRAP in e-BioFlow. During this research, e-BioFlow has grown from a proof-of-concept to a powerful research prototype. For the OPM implementation, a profile has been proposed that defines how provenance is collected during workflow enactment. The proposed profile maintains the hierarchical structure of (sub)workflows in the collected provenance data. With this profile, interoperability of the OPM for SWfMS is improved. A caching strategy is proposed for caching workflow tasks and is implemented in e-BioFlow. It queries the OPM implementation for previous task executions. The queries are optimised by formulating them differently and creating several indices. The performance improvement of each optimisation was measured using a query set taken from an OligoRAP cache run. Three tasks in OligoRAP were cached, resulting in an additional performance improvement of 19%. A provenance archive based on the OPM can be used to effectively cache workflow tasks. A provenance browser is introduced that incorporates several techniques to help browsing through large provenance archives. Its primary visualisation is the graph representation specified by the OPM.
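The idea of using recorded task executions as a cache can be sketched as follows. This is a simplified illustration and not e-BioFlow's API: the thesis queries an OPM implementation, whereas the sketch assumes an in-memory dictionary keyed on a hash of the task name and its inputs. The `ProvenanceCache` class and the `reverse_complement` task are hypothetical.

```python
import hashlib
import json


class ProvenanceCache:
    """Hypothetical stand-in for a provenance archive used as a task cache."""

    def __init__(self):
        self._archive = {}  # key -> recorded output of an earlier execution
        self.hits = 0

    @staticmethod
    def _key(task_name, inputs):
        # Inputs are assumed to be JSON-serialisable keyword arguments.
        payload = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, task_name, task_fn, inputs):
        """Reuse a recorded output if the same task ran on the same inputs before."""
        key = self._key(task_name, inputs)
        if key in self._archive:
            self.hits += 1
            return self._archive[key]
        output = task_fn(**inputs)
        self._archive[key] = output
        return output


calls = []


def reverse_complement(sequence):
    """Example task: reverse complement of a DNA sequence."""
    calls.append(sequence)
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(sequence))


cache = ProvenanceCache()
first = cache.run("reverse_complement", reverse_complement, {"sequence": "ATGC"})
second = cache.run("reverse_complement", reverse_complement, {"sequence": "ATGC"})
print(first, second, cache.hits)
```

The second call is answered from the recorded execution, so the task function runs only once. A cache of this kind is only safe for tasks that are deterministic in their inputs.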

More information at the e-BioFlow project page at SourceForge, or in Matthijs’ master’s thesis in ePrints.