Archive for the 'Distributed Search' Category

Query Load Balancing in P2P Search

Monday, January 10th, 2011, posted by Djoerd Hiemstra

Query Load Balancing by Caching Search Results in Peer-to-Peer Information Retrieval Networks

by Almer Tigelaar and Djoerd Hiemstra

For peer-to-peer web search engines it is important to keep the delay between receiving a query and providing search results within an acceptable range for the end user. How to achieve this remains an open challenge. One way to reduce delays is by caching search results for queries and allowing peers to access each others cache. In this paper we explore the limitations of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that cache hit ratios of at least thirty-three percent are attainable.

The paper will be presented at the 11th Dutch-Belgian Information Retrieval Workshop (DIR) on February 4 in Amsterdam

[download pdf]

Eelco Eerenberg graduates on economic models for distributed search

Monday, January 10th, 2011, posted by Djoerd Hiemstra

Towards Distributed Information Retrieval based on Economic Models

by Eelco Eerenberg

The aim of this research is to build a successful distributed information retrieval system based on an economic model, allowing servers to open up their part of the deep web. This research consists of three parts: 1) selecting suitable economic models, 2) simulating these models, and 3) performing a real-world test. We found the models of Vickrey auction and bond redistribution to be the most suitable ones. These models behaved well in our simulation and both outperformed a naive comparison model. The Vickrey auction model performed best in a scenario that mostly resembles the Internet. On average 69% of all models with a strong correlation between the economic outcomes and the performance of information retrieval (Kendall’s-τ > 0.6) is a Vickrey auction model. In the real-world test we show that users appreciate both the use and administration of an information retrieval system based on an economic model. Furthermore, if we apply a perfect categorization, the economic model outperforms the comparison engine with a 66% increase in performance.

more information

Bertold van Voorst graduates on collection selection using database clustering

Monday, July 26th, 2010, posted by Djoerd Hiemstra

Cluster-based collection selection in uncooperative distributed information retrieval

by Bertold van Voorst

The focus of this research is collection selection for distributed information retrieval. The collection descriptions that are necessary for selecting the most relevant collections are often created from information gathered by random sampling. Collection selection based on an incomplete index constructed by using random sampling instead of a full index leads to inferior results.

In this research we propose to use collection clustering to compensate for the incompleteness of the indexes. When collection clustering is used we do not only select the collections that are considered relevant based on their collection descriptions, but also collections that have similar content in their indexes. Most existing cluster algorithms require the specification of the number of clusters prior to execution. We describe a new clustering algorithm that allows us to specify the sizes of the produced clusters instead of the number of clusters.

Our experiments show that that collection clustering can indeed improve the performance of distributed information retrieval systems that use random sampling. There is not much difference in retrieval performance between our clustering algorithm and the well-known k-means algorithm. We suggest to use the algorithm we proposed because it is more scalable.

[download pdf]

MIREX: MapReduce IR Experiments

Wednesday, April 28th, 2010, posted by Djoerd Hiemstra

MIREXMIREX (MapReduce Information Retrieval Experiments) provides solutions to easily and quickly run large-scale information retrieval experiments on a cluster of machines using Hadoop. Version 0.1 has tools for the TREC ClueWeb09 collection.The code is available to other researchers at: http://mirex.sourceforge.net/.

Query-based sampling using only snippets

Thursday, April 1st, 2010, posted by Djoerd Hiemstra

by Almer Tigelaar and Djoerd Hiemstra

Query-based sampling is a commonly used approach to model the content of servers. Conventionally, queries are sent to a server and the documents in the search results returned are downloaded in full as representation of the server’s content. We present an approach that uses the document snippets in the search results as samples instead of downloading the entire documents. We show this yields equal or better modeling performance for the same bandwidth consumption depending on collection characteristics, like document length distribution and homogeneity. Query-based sampling using snippets is a useful approach for real-world systems, since it requires no extra operations beyond exchanging queries and search results.

The paper will be presented at the SIGIR 2010 Workshop on Large-Scale Distributed Systems for Information Retrieval, on July 23rd, 2010 in Geneva, Switzerland

[download pdf]

QueryBased Sampling: Can we do Better than Random?

Tuesday, March 16th, 2010, posted by Djoerd Hiemstra

by Almer Tigelaar and Djoerd Hiemstra

Many servers on the web offer content that is only accessible via a search interface. These are part of the deep web. Using conventional crawling to index the content of these remote servers is impossible without some form of cooperation. Query-based sampling provides an alternative to crawling requiring no cooperation beyond a basic search interface. In this approach, conventionally, random queries are sent to a server to obtain a sample of documents of the underlying collection. The sample represents the entire server content. This representation is called a resource description. In this research we explore if better resource descriptions can be obtained by using alternative query construction strategies. The results indicate that randomly choosing queries from the vocabulary of sampled documents is indeed a good strategy. However, we show that, when sampling a large collection, using the least frequent terms in the sample yields a better resource description than using randomly chosen terms.

[download pdf]

Learning to Merge Search Results

Tuesday, March 16th, 2010, posted by Djoerd Hiemstra

Learning to Merge Search Results for Efficient Distributed Information Retrieval

Kien Tjin-Kam-Jet and Djoerd Hiemstra

Merging search results from different servers is a major problem in Distributed Information Retrieval. We used Regression-SVM and Ranking-SVM which learn a function that merges results based on information that is readily available, i.e. the ranks, titles, summaries and URLs contained in the results pages. By not downloading additional information, such as the full document, we decrease bandwidth usage. CORI and Round Robin merging were used as our baselines; surprisingly, our results show that the SVM methods do not improve over those baselines

[download pdf]

Sander Bockting wint ENIAC scriptieprijs

Monday, December 7th, 2009, posted by Djoerd Hiemstra

Sander Bockting heeft dit jaar de ENIAC scriptieprijs gewonnen. ENIAC is de de alumnivereniging voor oud-studenten van Informatica, Bedrijfsinformatietechnologie en Telematica. ENIAC reikt elk jaar een prijs uit voor de beste afstudeerscriptie. Het juryrapport luidt:

De jury heeft besloten de ENIAC scriptieprijs 2009 toe te kennen aan de scriptie “Collection Selection for Distributed Web Search: Using Highly Discriminative Keys, Query-driven Indexing and ColRank”, van Sander Bockting. De jury heeft gekozen voor deze scriptie, vanwege de relevantie van het onderzoek, de wetenschappelijke benadering en het grote deel ‘ontwerp’ (het prototype Sophos) dat in het werk besloten ligt. Hiernaast biedt Sanders onderzoek een (mogelijk) antwoord op het toegankelijke houden van het internet. Zoeken op internet en de bijbehorende zoekmachines vervullen een maatschappelijke functie in het ontsluiten van informatie. Door de sterke groei van het internet is het echter onmogelijk om het gehele internet centraal te blijven indexeren. Tevens geeft deze methode veel macht aan de eigenaren van enkele centrale zoekmachines. Sander laat zien dat het toepassen van gedistribueerde zoeksystemen een veelbelovende aanpak is, die in potentie gegevens beter ontsluit terwijl de afhankelijkheid van enkele centrale zoekmachines afneemt. De vijf door hem vergelijken technieken zijn dan ook een prima basis voor maatschappelijk en wetenschappelijk relevant vervolgonderzoek.

Kien Tjin-Kam-Jet joins Database Group

Friday, August 7th, 2009, posted by Djoerd Hiemstra
Kien Tjin-Kam-Jet joined the Database Group to work on Distributed Information Retrieval. Welcome Kien!

Open Search Validator

Wednesday, June 10th, 2009, posted by Djoerd Hiemstra

Pretty useful, the Open Search Validator shows that my site is all green.

More information at Tatham Oddie’s blog.