Digital museum of information retrieval research

by Djoerd Hiemstra, Tristan Pothoven, Marijn van Vliet, and Donna Harman

As more and more of the world becomes digital, and documents become easily available over the Internet, we are suddenly able to access all kinds of information. The downside, however, is that information that is not digital is accessed less often and is liable to be lost to us and to future generations. Although many scanning projects are underway, such as Google Books and the Open Library Alliance, these projects are not going to know about, much less find, the specialized scientific literature within various fields. This short paper describes the beginnings of a project to digitize some of the older literature in the information retrieval field. The paper finishes with some thoughts for future work on making more of our IR literature available for searching.

[abstract] [more information]

PF/Tijah facts and figures

The MultimediaN project will finish later this year, and the MultimediaN board asks for the “economic impact” of PF/Tijah.

  • In 2008, the PF/Tijah web site received 1,885 visits and 6,284 page views in total.
  • During that period, MonetDB/XQuery was downloaded 75 times via the PF/Tijah site. In total, MonetDB/XQuery, including PF/Tijah, was downloaded over 2,000 times in 2008.

Go to: PF/Tijah.

Saving and Accessing the Old IR Literature

SIGIR presents the first results of a project to digitize the older literature in the information retrieval field. So far, 14 of the old reports, such as the Cranfield reports and the SMART reports, have been scanned, along with Karen Sparck Jones’s Information Retrieval Experiment book. The PDF versions of these are available from the SIGIR Digital Museum of Information Retrieval Research, which provides room for exhibits of historic interest and allows searching of the material using the PF/Tijah XML search system. The complete library is available for download on request. Requests can be directed to the SIGIR Information Director by sending an email to infodir_sigir@acm.org.

[download pdf]

Efficient XML and Entity Retrieval with PF/Tijah

by Henning Rode, Djoerd Hiemstra, Arjen de Vries, and Pavel Serdyukov

PF/Tijah is a research prototype created by the University of Twente and CWI Amsterdam with the goal of creating a flexible environment for setting up search systems. PF/Tijah is first of all a system for structured retrieval on XML data. Compared to other open source retrieval systems, it comes with a number of unique features:

  • It can execute any NEXI query without limits to a predefined set of tags. Using the same index, it can easily produce a “focused”, “thorough”, or “article” ranking, depending only on the specified query and retrieval options.
  • The applied retrieval model, score propagation and combination operators are set at query time, which makes PF/Tijah an ideal experimental platform.
  • PF/Tijah embeds NEXI queries as functions in the XQuery language. This way the system supports ad hoc result presentation by means of its query language. The INEX efficiency task submission described in the paper demonstrates this feature. The declared function INEXPath for instance computes a string that matches the desired INEX submission format.
  • PF/Tijah supports text search combined with traditional database querying, including for instance joins on values. The entity ranking experiments described in this article intensively exploit this feature.
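
The idea of choosing the result granularity and the retrieval model per query, rather than at indexing time, can be illustrated with a small toy sketch. This is not PF/Tijah code and does not use its API: the in-memory `ELEMENTS` index, the two scoring functions, and the `search` function are all hypothetical names introduced only for illustration.

```python
from collections import Counter

# Hypothetical toy "index": element path -> element text.
ELEMENTS = {
    "/article[1]": "xml retrieval with structured queries",
    "/article[1]/sec[1]": "xml retrieval",
    "/article[1]/sec[2]": "structured queries on databases",
    "/article[2]": "databases and joins",
    "/article[2]/sec[1]": "databases and joins",
}


def tf_score(terms, text):
    """Raw term frequency: how often the query terms occur in the text."""
    counts = Counter(text.split())
    return float(sum(counts[t] for t in terms))


def normalized_tf_score(terms, text):
    """Term frequency divided by element length (favours short elements)."""
    words = text.split()
    return tf_score(terms, text) / len(words) if words else 0.0


# The retrieval model is looked up by name for each query (an assumption
# made for this sketch, not how PF/Tijah is implemented).
MODELS = {"tf": tf_score, "ntf": normalized_tf_score}


def search(terms, tag, model="tf"):
    """Rank all elements with the given tag, using the model named at query time."""
    score = MODELS[model]
    hits = []
    for path, text in ELEMENTS.items():
        last_step = path.rsplit("/", 1)[-1]
        if last_step.startswith(tag + "["):
            s = score(terms, text)
            if s > 0:
                hits.append((path, s))
    # Highest score first; ties are broken by path to keep the order stable.
    return sorted(hits, key=lambda h: (-h[1], h[0]))
```

Calling `search(["xml"], "article")` and `search(["xml"], "sec")` returns article-level and section-level rankings from the same toy index, and passing `model="ntf"` switches the scoring function without rebuilding anything.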

With this year's INEX experiments, we try to demonstrate the mentioned features of the system. All experiments were carried out with the least possible pre- and post-processing outside PF/Tijah.

[download draft paper]

Towards Affordable Disclosure of Spoken Word Archives

by Roeland Ordelman, Willemijn Heeren, Marijn Huijbregts, Djoerd Hiemstra, and Franciska de Jong

This paper presents and discusses ongoing work aiming at affordable disclosure of real-world spoken word archives in general, and in particular of a collection of recorded interviews with Dutch survivors of World War II concentration camp Buchenwald. Given such collections, the least we want to be able to provide is search at different levels and a flexible way of presenting results. Strategies for automatic annotation based on speech recognition, supporting for instance within-document search, are outlined and discussed with respect to the Buchenwald interview collection. In addition, usability aspects of the spoken word search are discussed on the basis of our experiences with the online Buchenwald web portal. It is concluded that, although user feedback is generally fairly positive, automatic annotation performance is still far from satisfactory, and requires additional research.

[download pdf]

Search for the Future

Information Retrieval is the discipline that studies computer-based search tools. Many applications that handle information on the internet would be completely inadequate without the support of information retrieval technology. How would we manage our email without spam filtering? How would we find information on the world wide web if there were no web search engines? The rise of web search engines has been one of the major success stories in computer science of the last decade: Internet and search companies like Google and Yahoo are now among the world's most influential information technology companies.

Today, search technology is provided and developed by major search providers like Google and Yahoo, and by small specialized companies with specialized staff. But as search technology matures, it will have to be available to non-expert application developers as well. A major obstacle to achieving this is the lack of theories and high-level abstractions of search systems and the lack of declarative query languages. Another obstacle is the lack of methods to handle non-textual data, such as images, audio and video. Several projects of the Database Group of the University of Twente try to solve these problems for application areas such as Entity Search, Expert Search, Video Search, and Distributed Search. The models and approaches that are developed in these projects are evaluated on large-scale, realistic testbeds, and implemented in the group's open source search system PF/Tijah, a search system that combines keyword queries with structured queries on XML databases. The research contributes to several courses in the university's graduate programs, for instance Information Retrieval, XML & Databases 1, and XML & Databases 2.

Sound ranking algorithms for XML search

by Djoerd Hiemstra, Stefan Klinger, Henning Rode, Jan Flokstra, and Peter Apers

Ranking algorithms for XML should reflect the actual combined content and structure constraints of queries, while at the same time producing equal rankings for queries that are semantically equal. Ranking algorithms that produce different rankings for queries that are semantically equal are easily detected by tests on large databases: We call such algorithms not sound. We report the behavior of different approaches to ranking content-and-structure queries on pairs of queries for which we expect equal ranking results from the query semantics. We show that most of these approaches are not sound. Of the remaining approaches, only 3 adhere to the W3C XQuery Full-Text standard.
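
The kind of test described above can be sketched as follows. This is a minimal illustration, not the procedure used in the paper: the query labels, the canned results in `CANNED`, and the function names `ranking` and `find_violations` are hypothetical, and rankings are compared by element order only, not by score values.

```python
def ranking(results):
    """Turn (element_id, score) pairs into an ordered list of element ids.

    Sorted by descending score; ties are broken by id so the order is stable.
    """
    return [eid for eid, _ in sorted(results, key=lambda r: (-r[1], r[0]))]


def find_violations(query_pairs, run_query):
    """Return the pairs of (assumed) semantically equal queries whose rankings differ."""
    violations = []
    for q1, q2 in query_pairs:
        if ranking(run_query(q1)) != ranking(run_query(q2)):
            violations.append((q1, q2))
    return violations


# Hypothetical canned results standing in for a real search system.
CANNED = {
    "q_a": [("e1", 0.9), ("e2", 0.4)],
    "q_a_rewritten": [("e1", 0.7), ("e2", 0.2)],  # scores differ, order is the same
    "q_b": [("e1", 0.9), ("e2", 0.4)],
    "q_b_rewritten": [("e2", 0.8), ("e1", 0.3)],  # order is flipped
}

PAIRS = [("q_a", "q_a_rewritten"), ("q_b", "q_b_rewritten")]
violations = find_violations(PAIRS, CANNED.get)
```

In this toy setup only the second pair is reported, because its two queries order the same elements differently. A ranking approach that yields any such pair would count as not sound in the sense used above.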

The paper will be presented at the SIGIR 2008 Workshop on Focused Retrieval in Singapore.

[download pdf]