Archive for 2008

Opinion mining by Vox-Pop

Monday, December 22nd, 2008, posted by Djoerd Hiemstra

Vox-Pop provides funny gadgets that show the power of so-called opinion mining or sentiment analysis. The site uses natural language processing tools to find named entities (person names) and to detect whether each person was mentioned in a positive, negative, or neutral way. See below what the “vox populi” think of Dutch football players and trainers. But… why don’t they mention Blaise N’Kufo?

Reind van de Riet passed away

Friday, December 19th, 2008, posted by Djoerd Hiemstra

by Roel Wieringa

Yesterday evening Prof. Reind van de Riet, former chair and co-founder of SIKS, passed away unexpectedly at the age of 69. Reind contributed in an important way to the growth and identity of SIKS, and he introduced a considerable number of the senior researchers of SIKS to the world of science. Some of us will remember him as an intellectual father. We will miss his unique presence at SIKS-related conferences.

The Combination and Evaluation of Query Performance Prediction Methods

Friday, December 12th, 2008, posted by Djoerd Hiemstra

by Claudia Hauff, Leif Azzopardi, and Djoerd Hiemstra

In this paper, we examine a number of newly applied methods for combining pre-retrieval query performance predictors in order to obtain a better prediction of the query’s performance. However, in order to compare such techniques adequately and appropriately, we critically examine the current evaluation methodology and show that linear correlation coefficients (i) do not provide an intuitive measure indicative of a method’s quality, (ii) can provide a misleading indication of performance, and (iii) overstate the performance of combined methods. To address this, we extend the current evaluation methodology to include cross validation, report a more intuitive and descriptive statistic, and apply statistical testing to determine significant differences. In the course of a comprehensive empirical study over several TREC collections, we evaluate nineteen pre-retrieval predictors and three combination methods.
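To make the evaluation methodology under discussion concrete, here is a minimal sketch of how a linear (Pearson) correlation coefficient between a predictor and retrieval performance can be computed. The predictor scores and average precision values are made-up illustration data, and the function name `pearson_r` is hypothetical; nothing here is taken from the paper's experiments.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up data: one predictor score and one average precision value per query.
predictor_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
average_precision = [0.10, 0.20, 0.25, 0.40, 0.55]

r = pearson_r(predictor_scores, average_precision)
print(f"Pearson r = {r:.3f}")
```

A single coefficient like this summarizes how well the predictor scores line up with performance across queries, but it does not say how far off an individual prediction is, which is one reason the abstract argues for a more descriptive statistic.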

The paper will be presented at the 31st European Conference on Information Retrieval (ECIR), April 6-9, 2009 in Toulouse, France.

[download pdf]

How Cyril Cleverdon set the stage for IR research

Thursday, December 11th, 2008, posted by Djoerd Hiemstra

Cyril Cleverdon (9 September 1914 – 4 December 1997) was a British librarian and computer scientist who is best known for his work on the evaluation of information retrieval systems.

Cyril Cleverdon was born in Bristol, England. He worked at the Bristol Libraries from 1932 to 1938, and from 1938 to 1946 he was the librarian of the Engine Division of the Bristol Aeroplane Co. Ltd. In 1946 he was appointed librarian of the College of Aeronautics at Cranfield (later the Cranfield Institute of Technology), where he served until his retirement in 1979, the last two years as professor of Information Transfer Studies.

With the help of NSF funding, Cleverdon started a series of projects in 1957 that lasted for about 10 years, in which he and his colleagues set the stage for information retrieval research. In the Cranfield project, retrieval experiments were conducted on test databases in a controlled, laboratory-like setting. The aim of the research was to find ways to improve the retrieval effectiveness of information retrieval systems by developing better indexing languages and methods. The components of the experiments were: 1) a collection of documents, 2) a set of user requests or queries, and 3) a set of relevance judgments, that is, a set of documents judged to be relevant to each query. Together, these components form an information retrieval test collection. The test collection serves as a gold standard for testing retrieval approaches, and the success of each approach is measured in terms of two measures: precision and recall. Test collections and evaluation measures based on precision and recall are still driving forces behind research on search systems today. Cleverdon’s research approach forms a blueprint for the successful Text Retrieval Conference series that started in 1992.
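As a minimal sketch of how the two measures are computed from a test collection: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The query, document identifiers, and function name below are hypothetical illustration data, not taken from the Cranfield collection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: documents returned by the system for the query
    relevant:  documents judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical test collection fragment: relevance judgments for one query,
# and the documents that some retrieval approach returned for it.
relevance_judgments = {"q1": {"d1", "d3", "d4"}}
system_output = {"q1": ["d1", "d2", "d3", "d5"]}

p, r = precision_recall(system_output["q1"], relevance_judgments["q1"])
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents are found.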

Cleverdon’s Cranfield studies not only introduced experimental research in computer science; the outcomes of the project also established the basis of automatic indexing as done in today’s search engines. Basically, Cleverdon found that using single terms from the documents, as opposed to manually assigned thesaurus terms, synonyms, etc., achieved the best retrieval performance. These results were very controversial at the time. In the Cranfield 2 Report, Cleverdon says:

This conclusion is so controversial and so unexpected that it is bound to throw considerable doubt on the methods which have been used (…) A complete recheck has failed to reveal any discrepancies (…) there is no other course except to attempt to explain the results which seem to offend against every canon on which we were trained as librarians.

Cyril Cleverdon also ran, for many years, the Cranfield conferences, which provided a major international forum for discussion of ideas and research in information retrieval. This function was taken over by the SIGIR conferences in the 1970s.

Cleverdon received several awards during his life: the Professional Award of the Special Libraries Association (1962), the Award of Merit of the American Society for Information Science (1971), and the Gerard Salton Award of the Special Interest Group on Information Retrieval of the Association for Computing Machinery (1991).

Written for Wikipedia.

Efficient XML and Entity Retrieval with PF/Tijah

Wednesday, December 3rd, 2008, posted by Djoerd Hiemstra

by Henning Rode, Djoerd Hiemstra, Arjen de Vries, and Pavel Serdyukov

PF/Tijah is a research prototype created by the University of Twente and CWI Amsterdam with the goal of creating a flexible environment for setting up search systems. PF/Tijah is first of all a system for structured retrieval on XML data. Compared to other open source retrieval systems, it comes with a number of unique features:

  • It can execute any NEXI query without limits to a predefined set of tags. Using the same index, it can easily produce a “focused”, “thorough”, or “article” ranking, depending only on the specified query and retrieval options.
  • The applied retrieval model, score propagation and combination operators are set at query time, which makes PF/Tijah an ideal experimental platform.
  • PF/Tijah embeds NEXI queries as functions in the XQuery language. This way the system supports ad hoc result presentation by means of its query language. The INEX efficiency task submission described in the paper demonstrates this feature. The declared function INEXPath for instance computes a string that matches the desired INEX submission format.
  • PF/Tijah supports text search combined with traditional database querying, including for instance joins on values. The entity ranking experiments described in this article intensively exploit this feature.

With this year’s INEX experiments, we try to demonstrate the mentioned features of the system. All experiments were carried out with the least possible pre- and post-processing outside PF/Tijah.

[download draft paper]

New group member: Anandeshwar Singh

Sunday, November 23rd, 2008, posted by Djoerd Hiemstra

Anandeshwar Singh will work on an XQuery Full-text version of PF/Tijah. XQuery Full-text is a W3C Candidate Recommendation that extends XQuery for text search in XML data. Welcome Anandeshwar!

GIS Presentations and Demonstrations

Wednesday, November 5th, 2008, posted by Djoerd Hiemstra

The best project results of the course Advanced Database Systems will be presented on Friday 7 November 2008, 13.45h. - 15.30h. in room HO-B1220. In the projects, students built or analyzed applications of Geographic Information Systems (GIS). We will have the following presentations:

  1. Study of the Design of a GIS application by Haihan Yin, Rabah Khilfeh, and Ravi Khadka
  2. Betonning & Peiling by Menno Tammens and Steven ten Brinke
  3. ReBata: Replaying the Batavierenrace by Michiel Hakvoort, Gido Hakvoort, and Ronald Peterson

User study for concept retrieval available

Monday, November 3rd, 2008, posted by Djoerd Hiemstra

In our recent TRECVID experiments we evaluated a concept retrieval approach to video retrieval, i.e. the user searches a collection of video shots by using automatically detected concepts such as face, people, indoor, sky, building, etc. The performance of such systems is still far from sufficient to be usable in reality. But is this because automatic detectors are bad, because users cannot write concept queries, because systems cannot rank concept queries, or possibly all of the above?

To help researchers answer this question, we made the data from a user study involving 24 users available. In the experiment, users had to select from a set of 101 concepts those concepts they expected to be helpful for finding certain information. For instance, suppose one needs to find shots of “one or more palm trees”. Most people, 18 out of 24, choose the concept tree, but others choose outdoor (15), vegetation (9), sky (8), beach (8), or desert (4). The summarized results can now be accessed from Robin Aly’s page.
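As a small worked example, the counts quoted above for the “one or more palm trees” request can be turned into the share of users who selected each concept. The counts are those given in the text; the variable names are only illustrative, and this is not the format of the downloadable study data.

```python
NUM_USERS = 24

# Number of users (out of 24) who selected each concept for the
# "one or more palm trees" request, as quoted in the text.
concept_counts = {
    "tree": 18,
    "outdoor": 15,
    "vegetation": 9,
    "sky": 8,
    "beach": 8,
    "desert": 4,
}

# Share of users selecting each concept. Users could pick several concepts,
# so the shares are not expected to sum to 1.
shares = {concept: count / NUM_USERS for concept, count in concept_counts.items()}

for concept, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{concept}: {share:.1%}")
```

Even the most popular concept, tree, is chosen by only three quarters of the users, which illustrates how much users differ in the concepts they expect to be helpful.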

Download the user study data.

TREC Video Workshop 2008

Friday, October 31st, 2008, posted by Djoerd Hiemstra

by Robin Aly, Djoerd Hiemstra, Arjen de Vries, and Henning Rode

In this report we describe our experiments performed for TRECVID 2008. We participated in the High Level Feature extraction task and the Search task. For the High Level Feature extraction task we mainly installed our detection environment. In the Search task we applied our new PRFUBE ranking model together with an estimation method that estimates a vital parameter of the model: the probability of a concept occurring in relevant shots. The PRFUBE model has similarities to the well-known Probabilistic Text Information Retrieval methodology and follows the Probability Ranking Principle.

[download pdf]

Towards Affordable Disclosure of Spoken Word Archives

Thursday, October 30th, 2008, posted by Djoerd Hiemstra

by Roeland Ordelman, Willemijn Heeren, Marijn Huijbregts, Djoerd Hiemstra, and Franciska de Jong

This paper presents and discusses ongoing work aiming at affordable disclosure of real-world spoken word archives in general, and in particular of a collection of recorded interviews with Dutch survivors of World War II concentration camp Buchenwald. Given such collections, the least we want to be able to provide is search at different levels and a flexible way of presenting results. Strategies for automatic annotation based on speech recognition (supporting, e.g., within-document search) are outlined and discussed with respect to the Buchenwald interview collection. In addition, usability aspects of the spoken word search are discussed on the basis of our experiences with the online Buchenwald web portal. It is concluded that, although user feedback is generally fairly positive, automatic annotation performance is still far from satisfactory, and requires additional research.

[download pdf]