Archive for the 'Distributed Search' Category

Peer-to-Peer Information Retrieval: An Overview

Friday, February 10th, 2012, posted by Djoerd Hiemstra

by Almer Tigelaar, Djoerd Hiemstra, Dolf Trieschnigg

Peer-to-peer technology is widely used for file sharing. In the past decade a number of prototype peer-to-peer information retrieval systems have been developed. Unfortunately, none of these have seen widespread real-world adoption and thus, in contrast with file sharing, information retrieval is still dominated by centralised solutions. In this paper we provide an overview of the key challenges for peer-to-peer information retrieval and the work done so far. We want to stimulate and inspire further research to overcome these challenges. This will open the door to the development and large-scale deployment of real-world peer-to-peer information retrieval systems that rival existing centralised client-server solutions in terms of scalability, performance, user satisfaction and freedom.

The paper will appear in ACM Transactions on Information Systems.

Treinplanner on Dutch television

Sunday, January 29th, 2012, posted by Djoerd Hiemstra

Dutch broadcaster BNN tests the intuitive train planner developed at the Database Group. Their verdict: “ingenious”, and “approved for the elderly”. Picture of Kien Tjin-Kam-Jet proudly in the back (in Dutch). See the Treinplanner in action at: http://treinplanner.info

Treinplanner

Monday, January 23rd, 2012, posted by Djoerd Hiemstra

We released a demo today: the Treinplanner, built by Kien, which allows you to search the Dutch Railways journey planner with a single search box (in Dutch).

Free-Text Search over Complex Web Forms

Monday, April 4th, 2011, posted by Djoerd Hiemstra

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

This paper investigates the problem of using free-text queries as an alternative means for searching ‘behind’ web forms. We introduce a novel specification language for specifying free-text interfaces, and report the results of a user study where we evaluated our prototype in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at completing their search tasks.

The paper will be presented at the Information Retrieval Facility Conference (IRFC 2011) on 6 June in Vienna, Austria.

[download preprint]

Search Result Caching in P2P Information Retrieval Networks

Tuesday, March 15th, 2011, posted by Djoerd Hiemstra

by Almer Tigelaar, Djoerd Hiemstra, and Dolf Trieschnigg

See Almer’s post: For peer-to-peer web search engines it is important to quickly process queries and return search results. How to keep the perceived latency low is an open challenge. In this paper we explore the solution potential of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that a small bounded cache offers performance comparable to an unbounded cache. Furthermore, we explore partially centralised and fully distributed scenarios, and find that in the most realistic distributed case caching can reduce the query load by thirty-three percent. With optimisations this can be boosted to nearly seventy percent.
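To make the bounded versus unbounded comparison concrete, here is a minimal sketch of how a cache hit ratio could be measured over a query stream. It is not the simulator from the paper: the least-recently-used eviction policy, the `LRUCache` class and the `hit_ratio` helper are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Search result cache with least-recently-used eviction (assumed policy)."""

    def __init__(self, capacity=None):
        # capacity=None models an unbounded cache.
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, query):
        """Return True on a cache hit; on a miss, store the query and return False."""
        if query in self.entries:
            # Mark the entry as most recently used.
            self.entries.move_to_end(query)
            return True
        # Placeholder for the search results that would be cached.
        self.entries[query] = "results"
        if self.capacity is not None and len(self.entries) > self.capacity:
            # Evict the least recently used entry.
            self.entries.popitem(last=False)
        return False


def hit_ratio(queries, capacity=None):
    """Fraction of queries in the stream that are answered from the cache."""
    cache = LRUCache(capacity)
    hits = sum(cache.lookup(q) for q in queries)
    return hits / len(queries)
```

Calling `hit_ratio(stream)` and `hit_ratio(stream, capacity=k)` on the same query stream gives the two numbers to compare. Every hit is a query that never has to be forwarded to other peers.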

The paper will be presented at the Information Retrieval Facility Conference (IRFC 2011) on 6 June in Vienna, Austria.

[download preprint]

Free-Text Search versus Complex Web Forms

Thursday, January 13th, 2011, posted by Djoerd Hiemstra

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

We investigated the use of free-text queries as an alternative means for searching “behind” web forms. We conducted a user study where we evaluated our prototype free-text interface in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at completing their search tasks.

The paper will be presented in April at the 33rd European Conference on Information Retrieval (ECIR 2011) in Dublin, Ireland.

[download pdf]

Query Load Balancing in P2P Search

Monday, January 10th, 2011, posted by Djoerd Hiemstra

Query Load Balancing by Caching Search Results in Peer-to-Peer Information Retrieval Networks

by Almer Tigelaar and Djoerd Hiemstra

For peer-to-peer web search engines it is important to keep the delay between receiving a query and providing search results within an acceptable range for the end user. How to achieve this remains an open challenge. One way to reduce delays is by caching search results for queries and allowing peers to access each other’s caches. In this paper we explore the limitations of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that cache hit ratios of at least thirty-three percent are attainable.

The paper will be presented at the 11th Dutch-Belgian Information Retrieval Workshop (DIR) on February 4 in Amsterdam.

[download pdf]

Eelco Eerenberg graduates on economic models for distributed search

Monday, January 10th, 2011, posted by Djoerd Hiemstra

Towards Distributed Information Retrieval based on Economic Models

by Eelco Eerenberg

The aim of this research is to build a successful distributed information retrieval system based on an economic model, allowing servers to open up their part of the deep web. This research consists of three parts: 1) selecting suitable economic models, 2) simulating these models, and 3) performing a real-world test. We found the models of Vickrey auction and bond redistribution to be the most suitable ones. These models behaved well in our simulation and both outperformed a naive comparison model. The Vickrey auction model performed best in a scenario that mostly resembles the Internet. On average 69% of all models with a strong correlation between the economic outcomes and the performance of information retrieval (Kendall’s τ > 0.6) are Vickrey auction models. In the real-world test we show that users appreciate both the use and administration of an information retrieval system based on an economic model. Furthermore, if we apply a perfect categorization, the economic model outperforms the comparison engine with a 66% increase in performance.
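For readers unfamiliar with the mechanism, a Vickrey auction is a sealed-bid auction in which the highest bidder wins but pays the second-highest bid. The sketch below shows only that textbook rule. How servers bid in the thesis is not described here, so the `vickrey_auction` function and the server names are hypothetical.

```python
def vickrey_auction(bids):
    """Run a sealed-bid second-price auction.

    bids maps a bidder name to its bid. Returns (winner, price), where the
    winner is the highest bidder and the price is the second-highest bid.
    """
    if len(bids) < 2:
        raise ValueError("a second-price auction needs at least two bids")
    # Rank bidders from highest to lowest bid.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


# Hypothetical servers bidding for the right to answer a query.
winner, price = vickrey_auction({"server_a": 5.0, "server_b": 8.0, "server_c": 3.0})
```

Because the winner's payment does not depend on its own bid, bidding one's true valuation is a dominant strategy, which is the property that makes this auction format attractive for uncoordinated participants.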

more information

Bertold van Voorst graduates on collection selection using database clustering

Monday, July 26th, 2010, posted by Djoerd Hiemstra

Cluster-based collection selection in uncooperative distributed information retrieval

by Bertold van Voorst

The focus of this research is collection selection for distributed information retrieval. The collection descriptions that are necessary for selecting the most relevant collections are often created from information gathered by random sampling. Collection selection based on an incomplete index constructed by using random sampling instead of a full index leads to inferior results.

In this research we propose to use collection clustering to compensate for the incompleteness of the indexes. When collection clustering is used we select not only the collections that are considered relevant based on their collection descriptions, but also collections that have similar content in their indexes. Most existing clustering algorithms require the specification of the number of clusters prior to execution. We describe a new clustering algorithm that allows us to specify the sizes of the produced clusters instead of the number of clusters.
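The selection step described above can be sketched as follows, assuming the clusters have already been computed. This is an illustration of the idea only, not the algorithm from the thesis: the `select_collections` function, the score dictionary and the cluster sets are hypothetical.

```python
def select_collections(scores, clusters, top_k):
    """Select the top_k collections by score, then add their cluster mates.

    scores maps a collection name to a relevance score estimated from its
    (sampled) collection description. clusters is a list of sets of
    collection names with similar content.
    """
    # Collections considered relevant based on their descriptions alone.
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    selected = list(ranked)
    for name in ranked:
        for cluster in clusters:
            if name in cluster:
                # Also select collections that share a cluster with a top hit.
                for mate in sorted(cluster):
                    if mate not in selected:
                        selected.append(mate)
    return selected


# Hypothetical scores and clusters for four collections.
scores = {"A": 0.9, "B": 0.1, "C": 0.5, "D": 0.2}
clusters = [{"A", "B"}, {"C", "D"}]
chosen = select_collections(scores, clusters, top_k=1)
```

In this example collection B scores poorly on its sampled description, but it is still selected because it shares a cluster with the top-ranked collection A.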

Our experiments show that collection clustering can indeed improve the performance of distributed information retrieval systems that use random sampling. There is not much difference in retrieval performance between our clustering algorithm and the well-known k-means algorithm. We suggest using the algorithm we proposed because it is more scalable.

[download pdf]

MIREX: MapReduce IR Experiments

Wednesday, April 28th, 2010, posted by Djoerd Hiemstra

MIREX (MapReduce Information Retrieval Experiments) provides solutions to easily and quickly run large-scale information retrieval experiments on a cluster of machines using Hadoop. Version 0.1 has tools for the TREC ClueWeb09 collection. The code is available to other researchers at: http://mirex.sourceforge.net/.