Archive for the 'Paper abstracts' Category

Sander Bockting graduates on collection selection for distributed web search

Tuesday, February 17th, 2009, posted by Djoerd Hiemstra

Using Highly Discriminative Keys, Query-driven Indexing and ColRank

Current popular web search engines, such as Google, Live Search and Yahoo!, rely on crawling to build an index of the World Wide Web. Crawling is a continuous process to keep the index fresh and generates an enormous amount of data traffic. By far the largest part of the web remains unindexed, because crawlers are unaware of the existence of web pages and they have difficulties crawling dynamically generated content.

These problems were the main motivation to research distributed web search. We assume that web sites, or peers, can index a collection consisting of local content, but possibly also content from other web sites. Peers cooperate with a broker by sending it a part of their index. By receiving indices from many peers, the broker gains a global overview of the peers’ content. When a user poses a query to a broker, the broker selects a few peers to which it forwards the query. The selected peers should be the ones most likely to produce a good result set with many relevant documents. The result sets are merged at the broker and sent to the user.

This research focuses on collection selection, which corresponds to the selection of the most promising peers. Highly discriminative keys are used as a strategy to select those peers. A highly discriminative key is a term set that is an index entry at the broker. The key is highly discriminative with respect to the collections because the posting lists pointing to the collections are relatively small. Query-driven indexing is applied to reduce the index size by only storing index entries that are part of popular queries. A PageRank-like algorithm is also tested to assign scores to collections that can be used for ranking.

The Sophos prototype was developed to test these methods. Sophos was evaluated on different aspects, such as collection selection performance and index sizes. The performance of the methods is compared to a baseline that applies language modeling to the merged documents of each collection. The results show that Sophos can outperform the baseline with ad-hoc queries on a web-based test set. Query-driven indexing is able to substantially reduce index sizes at the cost of a small loss in collection selection performance. We also found large differences in how difficult queries are to answer on the various corpus splits.
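
To make the broker-side idea concrete, here is a minimal sketch of collection selection with highly discriminative keys and query-driven indexing. It is a toy illustration under stated assumptions and not the Sophos implementation: the function names, the `max_postings` threshold, the in-memory dictionaries and the scoring by matched key size are all hypothetical. The PageRank-like collection scoring is not shown.

```python
from collections import defaultdict
from itertools import combinations


def build_hdk_index(peer_terms, max_key_size=2, max_postings=2, popular_queries=None):
    """Build a toy broker index that maps term sets (keys) to peers.

    peer_terms: dict of peer name -> set of terms that the peer reports.
    A key is kept only if its posting list holds at most max_postings peers,
    which is the (assumed) criterion for being highly discriminative.
    If popular_queries is given, a key is kept only if it is a subset of at
    least one popular query (a simplified form of query-driven indexing).
    """
    postings = defaultdict(set)
    for peer, terms in peer_terms.items():
        for size in range(1, max_key_size + 1):
            for key in combinations(sorted(terms), size):
                postings[frozenset(key)].add(peer)

    index = {}
    for key, peers in postings.items():
        if len(peers) > max_postings:
            continue  # posting list too long: not discriminative enough
        if popular_queries is not None and not any(key <= q for q in popular_queries):
            continue  # key never occurs in a popular query: do not store it
        index[key] = peers
    return index


def select_peers(index, query_terms, top_n=2):
    """Rank peers by the total size of the index keys contained in the query."""
    query = frozenset(query_terms)
    scores = defaultdict(int)
    for key, peers in index.items():
        if key <= query:
            for peer in peers:
                scores[peer] += len(key)
    # Sort by descending score, then by peer name for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [peer for peer, _ in ranked[:top_n]]
```

In this sketch a term that occurs at every peer never becomes a key on its own, while a rarer combination containing it can still be stored. Passing `popular_queries` shrinks the index further because only keys that occur in those queries are kept.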

More information in [E-Prints]

WikiTranslate: translations using only Wikipedia

Monday, February 9th, 2009, posted by Djoerd Hiemstra

WikiTranslate: Query Translation for Cross-lingual Information Retrieval using only Wikipedia

by Dong Nguyen, Arnold Overwijk, Claudia Hauff, Dolf Trieschnigg, Djoerd Hiemstra and Franciska de Jong

This paper presents WikiTranslate, a system which performs query translation for cross-lingual information retrieval (CLIR) using only Wikipedia to obtain translations. Queries are mapped to Wikipedia concepts and the corresponding translations of these concepts in the target language are used to create the final query. WikiTranslate is evaluated by searching with topics formulated in Dutch, French and Spanish in an English data collection. The system achieved a performance of 67% compared to the monolingual baseline.
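
The mapping from query to concepts to translations can be sketched as follows. This is a toy illustration, not the WikiTranslate system: the dictionaries stand in for Wikipedia article titles and cross-lingual links, and the greedy longest-phrase matching is an assumption made for this example.

```python
def translate_query(query, concepts, cross_links):
    """Translate a query by mapping phrases to concepts and following links.

    concepts: dict of lowercase source-language title -> concept id
              (hypothetical stand-in for Wikipedia article titles).
    cross_links: dict of concept id -> target-language title
                 (hypothetical stand-in for cross-lingual links).
    Words that match no concept are kept untranslated.
    """
    words = query.lower().split()
    translated = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            concept = concepts.get(phrase)
            if concept is not None and concept in cross_links:
                translated.append(cross_links[concept])
                i = j
                break
        else:
            # No concept found for any phrase starting here.
            translated.append(words[i])
            i += 1
    return translated
```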

[download pdf]

Digital museum of information retrieval research

Tuesday, February 3rd, 2009, posted by Djoerd Hiemstra

by Djoerd Hiemstra, Tristan Pothoven, Marijn van Vliet, and Donna Harman

As more and more of the world becomes digital, and documents become easily available over the Internet, we are suddenly able to access all kinds of information. The downside of this, however, is that information that is not digital is accessed less, and is liable to be lost to us and to future generations. Although there are many scanning projects underway, such as Google Books and the Open Library Alliance, these projects are not going to know about, much less find, the specialized scientific literature within various fields. This short paper describes the beginnings of a project to digitize some of the older literature in the information retrieval field. The paper finishes with some thoughts for future work on making more of our IR literature available for searching.

[abstract] [more information]

Welcome to DIR 2009

Friday, January 30th, 2009, posted by Djoerd Hiemstra

Welcome to the 9th Dutch-Belgian Information Retrieval Workshop (DIR). I very well remember the DIR workshop in 2001 that was also organized in Twente. It took place exactly one day before my PhD defense, to give us the opportunity to have one of the PhD committee members, Stephen Robertson, as the keynote speaker. I am proud to see that DIR does not need PhD defenses any more to attract excellent keynotes. This year, DIR presents Rene van Erk, Director of Product- and Business Development Europe at Wolters Kluwer as the industry keynote; we present professor Gerhard Weikum, Scientific Director at the Max-Planck Institute for Informatics, as the academic keynote. We also present comedian Daniel van Veen as the “cultural keynote” to let our participants taste a bit of our university’s unique campus activities.

We have tried this year to especially encourage PhD students and researchers from industry to submit their research. Each submission was reviewed by at least two and often three or four program committee members. We thank the program committee members for their high quality reviews. Of 15 submissions to DIR, 12 were accepted. Another 5 submissions were accepted as poster presentations. Four papers were written with participation from industry and most other papers have a PhD student as the first author.

Special thanks to the Netherlands Research School for Information and Knowledge Systems (SIKS) for sponsoring the participation of Dutch SIKS members, to the Werkgemeenschap Informatiewetenschap (WGI) for providing a solid financial basis for organizing DIR now and in the future, to the Netherlands Organization of Scientific Research (NWO) for sponsoring the travel and hotel costs of our international keynote speaker Gerhard Weikum, to the Centre for Telematics and Information Technology (CTIT) for sponsoring the proceedings, to the University of Twente and to Cultural Center the Vrijhof for sponsoring our cultural activity on Monday.

To an inspiring DIR 2009!

The technology behind StreetTiVo

Monday, January 26th, 2009, posted by Djoerd Hiemstra

StreetTiVo: Using a P2P XML Database System to Manage Multimedia Data in Your Living Room

by Ying Zhang, Arjen de Vries, Peter Boncz, Djoerd Hiemstra, and Roeland Ordelman

StreetTiVo is a project that aims at bringing research results into the living room; in particular, a mix of current results in the areas of Peer-to-Peer XML Database Management System (P2P XDBMS), advanced multimedia analysis techniques, and advanced information retrieval techniques. The project develops a plug-in application for the so-called Home Theatre PCs, such as set-top boxes with MythTV or Windows Media Center Edition installed, that can be considered as programmable digital video recorders. StreetTiVo distributes compute-intensive multimedia analysis tasks over multiple peers (i.e., StreetTiVo users) that have recorded the same TV program, such that a user can search in the content of a recorded TV program shortly after its broadcasting; i.e., it enables near real-time availability of the meta-data (e.g., speech recognition) required for searching the recorded content. StreetTiVo relies on our P2P XDBMS technology, which in turn is based on a DHT overlay network, for distributed collaborator discovery, work coordination and meta-data exchange in a volatile WAN environment. The technologies of video analysis and information retrieval are seamlessly integrated into the system as XQuery functions.

The paper will be presented at the Joint International Conferences on Asia-Pacific Web Conference (APWeb) and Web-Age Information Management (WAIM) on 1-4 April, 2009 in Suzhou, China.

[download pdf]

Information Extraction and Linking in a Retrieval Context

Saturday, January 17th, 2009, posted by Djoerd Hiemstra

Marie-Francine Moens and I will give a tutorial at ECIR 2009 on using the results of information extraction and linking for retrieval systems. The tutorial’s main goal is to give the participants a clear and detailed overview of content modeling approaches and tools, and the integration of their results into ranking functions. A small set of integrated and interactive exercises will sharpen the audience’s understanding. By attending the tutorial, attendees will:
  • Acquire an understanding of current information extraction, topic modeling and entity linking techniques;
  • Acquire an understanding of ranking models in information retrieval;
  • Be able to integrate the (probabilistic) content models into the ranking models;
  • Be able to choose a model for retrieval that is well-suited for a particular task and to integrate the necessary content models.
The tutorial includes several motivating examples and applications among which are expert search using output from named entity tagging, connecting names to faces in videos for person search using output from named entity tagging and face detection, video search using output from concept detectors, and spoken document retrieval using speech lattices and posterior probabilities of recognized words. The examples will be combined in a larger case study: Retrieval of news broadcast video.

[download pdf]

More info at the ECIR tutorial page.

Saving and Accessing the Old IR Literature

Thursday, January 8th, 2009, posted by Djoerd Hiemstra

SIGIR presents the first results of a project to digitize the older literature in the information retrieval field. So far 14 of the old reports, such as the Cranfield reports and the SMART reports, have been scanned, along with Karen Sparck Jones’s Information Retrieval Experiment book. The PDF versions of these are available from the SIGIR Digital Museum of Information Retrieval Research, which provides room for exhibits of historic interest and allows searching of the material using the PF/Tijah XML search system. The complete library is available for download on request. Requests can be directed to the SIGIR Information Director by email.

[download pdf]

The Combination and Evaluation of Query Performance Prediction Methods

Friday, December 12th, 2008, posted by Djoerd Hiemstra

by Claudia Hauff, Leif Azzopardi, and Djoerd Hiemstra

In this paper, we examine a number of newly applied methods for combining pre-retrieval query performance predictors in order to obtain a better prediction of the query’s performance. However, in order to adequately and appropriately compare such techniques, we critically examine the current evaluation methodology and show how linear correlation coefficients (i) do not provide an intuitive measure indicative of a method’s quality, (ii) can provide a misleading indication of performance, and (iii) overstate the performance of combined methods. To address this, we extend the current evaluation methodology to include cross validation, report a more intuitive and descriptive statistic, and apply statistical testing to determine significant differences. During the course of a comprehensive empirical study over several TREC collections, we evaluate nineteen pre-retrieval predictors and three combination methods.
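
The difference between the two styles of evaluation can be sketched as follows. This is an illustration only, not the paper's experimental code: the single-predictor linear fit, the interleaved fold assignment and the choice of root mean squared error (RMSE) as the reported statistic are assumptions made for this example.

```python
import math


def pearson(xs, ys):
    """Linear correlation between predictor scores and observed performance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (xs must not be constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def cross_validated_rmse(xs, ys, folds=5):
    """Fit on the training folds, measure squared error on the held-out fold."""
    pairs = list(zip(xs, ys))
    squared_errors = []
    for fold in range(folds):
        train = [p for i, p in enumerate(pairs) if i % folds != fold]
        held_out = [p for i, p in enumerate(pairs) if i % folds == fold]
        slope, intercept = fit_line(*zip(*train))
        squared_errors += [(slope * x + intercept - y) ** 2 for x, y in held_out]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Unlike a correlation coefficient computed on all queries at once, the cross-validated error is expressed in the units of the performance measure and is computed only on queries that were not used for fitting.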

The paper will be presented at the 31st European Conference on Information Retrieval (ECIR), April 6-9, 2009 in Toulouse, France.

[download pdf]

Efficient XML and Entity Retrieval with PF/Tijah

Wednesday, December 3rd, 2008, posted by Djoerd Hiemstra

by Henning Rode, Djoerd Hiemstra, Arjen de Vries, and Pavel Serdyukov

PF/Tijah is a research prototype created by the University of Twente and CWI Amsterdam with the goal of creating a flexible environment for setting up search systems. PF/Tijah is first of all a system for structured retrieval on XML data. Compared to other open source retrieval systems it comes with a number of unique features:

  • It can execute any NEXI query without limits to a predefined set of tags. Using the same index, it can easily produce a “focused”, “thorough”, or “article” ranking, depending only on the specified query and retrieval options.
  • The applied retrieval model, score propagation and combination operators are set at query time, which makes PF/Tijah an ideal experimental platform.
  • PF/Tijah embeds NEXI queries as functions in the XQuery language. This way the system supports ad hoc result presentation by means of its query language. The INEX efficiency task submission described in the paper demonstrates this feature. The declared function INEXPath for instance computes a string that matches the desired INEX submission format.
  • PF/Tijah supports text search combined with traditional database querying, including for instance joins on values. The entity ranking experiments described in this article intensively exploit this feature.
With this year’s INEX experiments, we try to demonstrate the mentioned features of the system. All experiments were carried out with the least possible pre- and post-processing outside PF/Tijah.

[download draft paper]

TREC Video Workshop 2008

Friday, October 31st, 2008, posted by Djoerd Hiemstra

by Robin Aly, Djoerd Hiemstra, Arjen de Vries, and Henning Rode

In this report we describe our experiments performed for TRECVID 2008. We participated in the High Level Feature extraction and the Search task. For the High Level Feature extraction task we mainly installed our detection environment. In the Search task we applied our new PRFUBE ranking model together with a method that estimates a vital parameter of the model, the probability of a concept occurring in relevant shots. The PRFUBE model has similarities to the well-known Probabilistic Text Information Retrieval methodology and follows the Probability Ranking Principle.

[download pdf]