Archive for 2011

Search Result Caching in P2P Information Retrieval Networks

Tuesday, March 15th, 2011, posted by Djoerd Hiemstra

by Almer Tigelaar, Djoerd Hiemstra, and Dolf Trieschnigg

See Almer’s post: For peer-to-peer web search engines it is important to quickly process queries and return search results. How to keep the perceived latency low is an open challenge. In this paper we explore the solution potential of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that a small bounded cache offers performance comparable to an unbounded cache. Furthermore, we explore partially centralised and fully distributed scenarios, and find that in the most realistic distributed case caching can reduce the query load by thirty-three percent. With optimisations this can be boosted to nearly seventy percent.
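As a rough illustration of the bounded versus unbounded comparison, the sketch below measures the cache hit ratio over a stream of queries. It is a minimal sketch, not the simulation from the paper: the least-recently-used (LRU) eviction policy, the `hit_ratio` function, and the toy query stream are all assumptions made for this example.

```python
from collections import OrderedDict


def hit_ratio(queries, capacity=None):
    """Return the fraction of queries answered from the cache.

    capacity=None models an unbounded cache; an integer models a bounded
    cache with LRU eviction (an assumed policy, chosen for illustration).
    """
    cache = OrderedDict()
    hits = 0
    for query in queries:
        if query in cache:
            hits += 1
            cache.move_to_end(query)  # mark as most recently used
        else:
            cache[query] = True  # stand-in for the stored search results
            if capacity is not None and len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(queries) if queries else 0.0


# Hypothetical query stream; real query logs are far larger and more skewed.
queries = ["a", "b", "a", "c", "a", "b", "d", "a"]
print(hit_ratio(queries))              # unbounded cache
print(hit_ratio(queries, capacity=2))  # bounded cache holding two entries
```

On a stream this short the bounded cache does noticeably worse. How small the cache can be before performance suffers depends on how skewed the query distribution is.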

The paper will be presented at the Information Retrieval Facility Conference (IRFC 2011) on 6 June in Vienna, Austria.

[download preprint]

Answers to homework series 1 available

Friday, March 4th, 2011, posted by Maurice van Keulen

I wrote a document with many good alternative solutions to the assignments. In some places I added a comment explaining an important point or pointing out common mistakes. Note that the document is not complete! There may be more correct answers. Compare your solutions to these to find out whether you made mistakes, and why. If you are not sure whether your solution to a particular question is correct, just send me an e-mail.

Welcome to XML and Databases 1

Monday, January 31st, 2011, posted by Djoerd Hiemstra

Welcome to the course “XML & Databases 1”. In this course you will read about some of the latest research on XML databases. We will discuss XML querying with XPath and XQuery, publishing of relational data, and (at length) relational storage structures for XML data and query processing techniques. There will also be lectures on full-text querying, distributed querying, and querying stand-off data. To obtain the study material necessary for this course, you need to buy the reader “XML & Databases 1 & 2”. It is possible to use the old readers from 2008/2009 or 2009/2010. We hope you will enjoy the course.

Free-Text Search versus Complex Web Forms

Thursday, January 13th, 2011, posted by Djoerd Hiemstra

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

We investigated the use of free-text queries as an alternative means for searching “behind” web forms. We conducted a user study where we evaluated our prototype free-text interface in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at completing their search tasks.

The paper will be presented in April at the 33rd European Conference on Information Retrieval (ECIR 2011) in Dublin, Ireland.

[download pdf]

Query Load Balancing in P2P Search

Monday, January 10th, 2011, posted by Djoerd Hiemstra

Query Load Balancing by Caching Search Results in Peer-to-Peer Information Retrieval Networks

by Almer Tigelaar and Djoerd Hiemstra

For peer-to-peer web search engines it is important to keep the delay between receiving a query and providing search results within an acceptable range for the end user. How to achieve this remains an open challenge. One way to reduce delays is by caching search results for queries and allowing peers to access each other’s caches. In this paper we explore the limitations of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that cache hit ratios of at least thirty-three percent are attainable.

The paper will be presented at the 11th Dutch-Belgian Information Retrieval Workshop (DIR) on February 4 in Amsterdam.

[download pdf]

Eelco Eerenberg graduates on economic models for distributed search

Monday, January 10th, 2011, posted by Djoerd Hiemstra

Towards Distributed Information Retrieval based on Economic Models

by Eelco Eerenberg

The aim of this research is to build a successful distributed information retrieval system based on an economic model, allowing servers to open up their part of the deep web. This research consists of three parts: 1) selecting suitable economic models, 2) simulating these models, and 3) performing a real-world test. We found the Vickrey auction and bond redistribution models to be the most suitable ones. These models behaved well in our simulation and both outperformed a naive comparison model. The Vickrey auction model performed best in the scenario that most closely resembles the Internet. On average, 69% of all models with a strong correlation between the economic outcomes and the performance of information retrieval (Kendall’s τ > 0.6) are Vickrey auction models. In the real-world test we show that users appreciate both the use and administration of an information retrieval system based on an economic model. Furthermore, if we apply a perfect categorization, the economic model outperforms the comparison engine with a 66% increase in performance.
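For readers unfamiliar with the Vickrey auction, the sketch below shows the generic sealed-bid second-price mechanism: the highest bidder wins but pays the second-highest bid. It is a minimal sketch of the textbook auction only. How the thesis maps bidders and bids onto search servers and queries is not described in this summary, so the `vickrey_auction` function, the bidder names, and the single-bidder rule are assumptions made for this example.

```python
def vickrey_auction(bids):
    """Run a sealed-bid second-price auction.

    bids maps a bidder name to a bid amount. Returns (winner, price), where
    the winner is the highest bidder and the price is the second-highest bid.
    With a single bidder, the winner pays their own bid (an assumed
    simplification; real auctions typically use a reserve price instead).
    """
    if not bids:
        raise ValueError("need at least one bid")
    # Sort from highest to lowest bid; ties keep their insertion order.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner, top_bid = ranked[0]
    price = ranked[1][1] if len(ranked) > 1 else top_bid
    return winner, price


# Hypothetical bidders and bids, for illustration only.
winner, price = vickrey_auction({"server1": 5, "server2": 8, "server3": 3})
print(winner, price)
```

Because the price paid does not depend on the winner's own bid, bidding one's true valuation is a dominant strategy in this mechanism.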

more information

Solutions on Blackboard

Tuesday, January 4th, 2011, posted by Djoerd Hiemstra

Solutions for Assignment 4 (Sawzall) and for Assignment 5 (HBase Schema) are now on Blackboard.

AXES: Access to Audiovisual Archives

Monday, January 3rd, 2011, posted by Djoerd Hiemstra


AXES is a large-scale integrating project (IP) funded by the European Union’s 7th Framework Programme that starts in January 2011. The goal of AXES is to develop tools that provide various types of users with new engaging ways to interact with audiovisual libraries, helping them discover, browse, navigate, search and enrich archives. In particular, apart from a search-oriented scheme, we will explore how suggestions for audiovisual content exploration can be generated via a myriad of information trails crossing the archive. This will be approached from three perspectives (or axes): users, content, and technology.

Within AXES, innovative indexing techniques are developed in close cooperation with a number of user communities through tailored use cases and validation stages. Rather than just starting new investments in technical solutions, we propose the co-development of innovative paradigms of use and novel navigation and search facilities. We will target media professionals, educators, students, amateur researchers, and home users.

Based on an existing Open Source service platform for digital libraries, novel navigation and search functionalities will be offered via interfaces tuned to user profiles and workflow. To this end, AXES will develop tools for content analysis deploying weakly supervised classification methods. Information in scripts, audio tracks, wikis or blogs will be used for the cross-modal detection of people, places, events, etc., and for link generation between audiovisual content. Users will be engaged in the annotation process: with the support of selection and feedback tools, they will enable the gradual improvement of tagging performance. AXES technology will open up audiovisual digital libraries, increasing their cultural value and their exposure to the European public and academia at large.

The consortium is a perfect match to the multi-disciplinary nature of the project, with professional content owners, academic and industrial experts in audiovisual analysis, retrieval, and user studies, and partners experienced in system integration and project management. Our partners in AXES are: GEIE ERCIM, Katholieke Universiteit Leuven, University of Oxford, Institut National de Recherche en Informatique et en Automatique (INRIA), Dublin City University, Fraunhofer Gesellschaft, BBC, Netherlands Institute for Sound and Vision, Deutsche Welle, Technicolor, EADS, and Erasmus University Rotterdam.