Archive for the 'Distributed Search' Category

Overview of the TREC 2014 Federated Web Search Track

Wednesday, November 26th, 2014, posted by Djoerd Hiemstra

by Thomas Demeester, Dolf Trieschnigg, Dong Nguyen, Ke Zhou, and Djoerd Hiemstra

The TREC Federated Web Search track facilitates research in topics related to federated web search, by providing a large realistic data collection sampled from a multitude of online search engines. The FedWeb 2013 challenges of Resource Selection and Results Merging are again included in FedWeb 2014, and we additionally introduced the task of vertical selection. Other new aspects are the required link between Resource Selection and Results Merging, and the importance of diversity in the merged results. After an overview of the new data collection and relevance judgments, the individual participants’ results for the tasks are introduced, analyzed, and compared.

[download pdf]

Presented at the 23rd Text Retrieval Conference (TREC) in Gaithersburg, USA

Workshop on Heterogeneous Information Access

Tuesday, September 23rd, 2014, posted by Djoerd Hiemstra

We organize a workshop on Heterogeneous Information Access, hosted by the 8th International Conference on Web Search and Data Mining, on 6 February 2015 in Shanghai, China.

Invited speakers: Mounia Lalmas (Yahoo) and Milad Shokouhi (Microsoft Research)

Information access is becoming increasingly heterogeneous. Especially when the user’s information need is exploratory, returning a set of diverse results from different resources can benefit the user. For example, when a user is planning a trip to China on the Web, retrieving and presenting results from vertical search engines such as travel, flight information, maps, and Q&A sites could satisfy the user’s rich and diverse information need. This heterogeneous search aggregation paradigm is useful in many contexts and brings many new challenges.

Aggregated search and composite retrieval are two instances of this new heterogeneous information access paradigm. They are applied on the Web with heterogeneous vertical search engines. This paradigm can be useful in many other scenarios: a user aims to re-find comprehensive information about a query in personal search (emails, slides); a user searches and gathers different information nuggets (e.g., about an entity) from a set of RDF Web datasets (e.g., DBpedia, IMDB); or a user searches a set of different files (e.g., images, documents) in a peer-to-peer online file sharing system.

This is an emerging area, as the services provided are becoming more heterogeneous and complex. There are therefore a number of directions that might be interesting for the research and industrial community. How to select the most relevant resources and present them concisely in order to best satisfy the user? How to model the complex user behaviour in this search scenario? How can we evaluate the performance of these systems? These are a few of the key research questions to study for heterogeneous information access.

The workshop topics of interest are within the context of heterogeneous information access. They include but are not limited to:

  • User modeling for Heterogeneous Information Access, Personalization
  • Metrics, measurements, and test collections
  • Optimization: Resource and vertical selection, Result presentation and diversification
  • Applications: Aggregated/Federated search, Composite retrieval, Structured/Semantic search, P2P search

The workshop includes invited talks by leading researchers in the field from both industry and academia, presentations of contributed submissions, and organized and open discussions on heterogeneous information access.

More information at: http://hia-workshop.com/.

The future of TREC FedWeb

Tuesday, September 16th, 2014, posted by Djoerd Hiemstra

Thanks everyone for submitting runs to one of the TREC Federated Web Search tasks. We had roughly the same number of participants as last year; not bad, although our goal was to grow. Interestingly, our automatic submission system received an amazing 917 runs.

We discussed the future of the FedWeb track, and we decided that we will not propose a FedWeb 2015 track as coordinators. We were unable to secure funding. Combined with the fact that we created the FedWeb collection for three years in a row (although the first time independently of TREC), we believe it is best to properly finish the track at TREC this year, but not to run it again next year. Read more…

Thomas Demeester, Dong Nguyen, Dolf Trieschnigg, Ke Zhou, and Djoerd Hiemstra

Aligning Vertical Collection Relevance with User Intent

Wednesday, September 10th, 2014, posted by Djoerd Hiemstra

by Ke Zhou, Thomas Demeester, Dong Nguyen, Djoerd Hiemstra, and Dolf Trieschnigg

Selecting and aggregating different types of content from multiple vertical search engines is becoming popular in web search. The user vertical intent, the verticals the user expects to be relevant for a particular information need, might not correspond to the vertical collection relevance, the verticals containing the most relevant content. In this work we propose different approaches to define the set of relevant verticals based on document judgments. We correlate the collection-based relevant verticals obtained from these approaches to the real user vertical intent, and show that they can be aligned relatively well. The set of relevant verticals defined by those approaches could therefore serve as an approximate but reliable ground-truth for evaluating vertical selection, avoiding the need for collecting explicit user vertical intent, and vice versa.
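As a rough illustration of the idea above, the sketch below derives a set of relevant verticals from document judgments and compares it with a user-intent set. The fraction-based rule, the threshold, the F1 overlap measure, and all data are assumptions made for illustration; they are not the approaches or the correlation analysis used in the paper.

```python
from typing import Dict, List, Set


def relevant_verticals(judgments: Dict[str, List[int]],
                       threshold: float = 0.5) -> Set[str]:
    """Return verticals whose fraction of relevant documents reaches the threshold.

    judgments maps a vertical name to binary document judgments (1 = relevant).
    The fraction-based rule and the threshold are illustrative assumptions.
    """
    selected = set()
    for vertical, labels in judgments.items():
        if labels and sum(labels) / len(labels) >= threshold:
            selected.add(vertical)
    return selected


def overlap_f1(predicted: Set[str], intent: Set[str]) -> float:
    """F1 overlap between collection-based verticals and user-intent verticals."""
    hits = len(predicted & intent)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(intent)
    return 2 * precision * recall / (precision + recall)


# Hypothetical judgments and user intent for a single topic.
judgments = {
    "travel": [1, 1, 0, 1],
    "images": [0, 0, 1, 0],
    "news": [1, 0, 1, 0],
}
collection_based = relevant_verticals(judgments, threshold=0.5)
user_intent = {"travel", "images"}
print(sorted(collection_based), overlap_f1(collection_based, user_intent))
```

Lowering or raising the threshold changes which verticals count as relevant, which is one simple way to see how a collection-based definition can drift towards or away from what users expect.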

To be presented at the ACM International Conference on Information and Knowledge Management (CIKM 2014) in Shanghai, China on 3-7 November 2014

[download pdf]

Evaluate FedWeb runs online

Wednesday, July 30th, 2014, posted by Djoerd Hiemstra

The TREC Federated Web Search track provides a new online tool that checks the syntax of your runs and provides preliminary evaluation results on 10 of the 75 provided topics. Now you can easily see how you compare to other runs submitted to the system. The official TREC evaluation results will be based on at least 50 of the remaining topics in your run. Check your run at:
http://circus.ewi.utwente.nl/fedweb/.

Please note that the site does NOT submit runs to TREC. Submit your runs to TREC via the TREC active participants site: before August 18, 2014 (Resource & Vertical Selection); before September 15, 2014 (Results Merging).

Follow @TRECFedWeb on Twitter.

Yoran Heling graduates on peer selection in Direct Connect

Thursday, June 19th, 2014, posted by Djoerd Hiemstra

by Yoran Heling

In a distributed peer-to-peer (P2P) system such as Direct Connect, files are often distributed over multiple source peers. It is up to the downloading peer to decide from how many and from which source peers to download the particular file of interest. Biased Random Period Switching (BRPS) is an algorithm, implemented at the downloading peer, that determines at what point to download from which source peer. The number of source peers that a downloading peer downloads from at a certain point is called the Degree of Parallelism (DoP). This research focussed on implementing BRPS in an existing Direct Connect client and comparing the downloading performance against an unmodified client. Two implementations of BRPS in Direct Connect have been made. The simple implementation follows the original BRPS algorithm as closely as possible, with minor modifications that were required to ensure that the downloading process would not get stuck on an unavailable source peer. The improved implementation makes slight modifications to the original BRPS algorithm: it incorporates two improvements to ensure that the DoP does not drop below its desired value in the face of unavailable source peers.

The original client and the two BRPS implementations have been evaluated in a controlled Direct Connect network with 50 downloading peers and a variable number of source peers. The source peers have been configured to throttle their available bandwidth to an average of 500 KB/s, following a realistic bandwidth distribution based on measurements from the Tor P2P network. In each experiment, all downloading peers downloaded the same file at the same time, and measurements were taken on the side of these downloading peers. Four experiments have been performed, with one varying parameter in each experiment. The size of the file being downloaded was varied between 100 MB and 1024 MB in the first experiment; the second experiment varied the DoP between 1 and 15. The number of source peers was varied between 10 and 100 in the third experiment, and in the last experiment between 0% and 80% unavailable source peers were added to the network.

In all experiments, both BRPS implementations performed close to the optimal average download time, and were consistently faster than the original client by a factor of 2 to 5. In the last experiment, the improved BRPS implementation did keep the measured DoP closer to its desired value than the simple implementation, but this has not resulted in a significant difference in the measured download times.
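The role of the Degree of Parallelism described above can be illustrated with a small sketch. This is not the BRPS algorithm from the thesis: it picks sources uniformly at random in fixed periods and leaves out the bias entirely. The function name, peer names, and the set of unavailable peers are all hypothetical. It only shows how a downloader might keep the number of active sources at the desired DoP by skipping peers that are known to be unavailable.

```python
import random
from typing import List, Set


def pick_sources(peers: List[str], unavailable: Set[str], dop: int,
                 rng: random.Random) -> List[str]:
    """Choose up to `dop` distinct source peers uniformly at random,
    skipping peers that are known to be unavailable."""
    candidates = [p for p in peers if p not in unavailable]
    return rng.sample(candidates, min(dop, len(candidates)))


rng = random.Random(42)  # fixed seed so repeated runs pick the same peers
peers = [f"peer{i}" for i in range(10)]
unavailable = {"peer1", "peer4", "peer7"}

# Re-draw the active sources at the start of each (hypothetical) period.
for period in range(3):
    active = pick_sources(peers, unavailable, dop=4, rng=rng)
    print(period, active)
```

When fewer than `dop` peers are available, the sketch simply uses all of them, so the measured DoP drops below its desired value only when the network leaves no alternative.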

[download pdf]

Test our simple train planner

Wednesday, March 12th, 2014, posted by Djoerd Hiemstra

The train planner of the U. Twente startup Q-Able is now live at the site of the Dutch Railways. Please, give it a try and fill in the feedback — see the button at the right. (in Dutch)

See: ns.nl/proefmetplanner.

Overview of TREC FedWeb 2013

Monday, March 10th, 2014, posted by Djoerd Hiemstra

Overview of the TREC 2013 Federated Web Search Track

by Thomas Demeester, Dolf Trieschnigg, Dong Nguyen, Djoerd Hiemstra

The TREC Federated Web Search track is intended to promote research related to federated search in a realistic web setting, and to this end provides a large data collection gathered from a series of online search engines. This overview paper discusses the results of the first edition of the track, FedWeb 2013. The focus was on basic challenges in federated search: (1) resource selection, and (2) results merging. After an overview of the provided data collection and the relevance judgments for the test topics, the participants’ individual approaches and results on both tasks are discussed. Promising research directions and an outlook on the 2014 edition of the track are provided as well.

Ellen Voorhees presenting FedWeb at TREC 2013

The FedWeb task is organized as part of the Text REtrieval Conference (TREC)

[download pdf]

The Lowlands at TREC

Monday, March 10th, 2014, posted by Djoerd Hiemstra

by Robin Aly, Djoerd Hiemstra, Dolf Trieschnigg, and Thomas Demeester

We describe the participation of the Lowlands at the Web Track and the FedWeb track of TREC 2013. For the Web Track we used the MIREX MapReduce library with out-of-the-box approaches. For the FedWeb Track we adapted our shard selection method Taily for resource selection. Our results are above the median performance of TREC participants.

Presented at the 22nd Text REtrieval Conference (TREC) at the USA National Institute of Standards and Technology (NIST) in Gaithersburg, USA

[download pdf]

Ilya Markov defends PhD thesis on Distributed Information Retrieval

Friday, January 31st, 2014, posted by Djoerd Hiemstra

Today, Ilya Markov successfully defended his PhD thesis at the Università della Svizzera italiana in Lugano, Switzerland.

Uncertainty in Distributed Information Retrieval

by Ilya Markov

Large amounts of available digital information call for distributed processing and management solutions. Distributed Information Retrieval (DIR), also known as Federated Search, provides techniques for performing retrieval over such distributed data. In particular, it studies approaches to aggregating multiple searchable sources of information within a single interface.
DIR provides an efficient and low-cost solution to a distributed retrieval problem. As opposed to a centralized retrieval system, which acquires, stores and processes all available information locally, DIR delegates the search task to distributed sources. This way, DIR lowers the storage and processing costs and provides a user with up-to-date information even if this information is not crawlable (i.e. cannot be reached using hyperlinks).
DIR is usually based on a brokered architecture, according to which distributed retrieval is managed by a single broker. The broker-based DIR can be divided into five steps: resource discovery, resource description, resource selection, score normalization and results presentation. Among these steps, resource description, resource selection and score normalization are actively studied within DIR research, while the resource discovery step is addressed by the database community and results presentation is studied within aggregated search.
Despite the large volume of research on resource selection and score normalization, no unified framework of developed techniques exists, which makes it difficult to apply and compare the available methods. The first goal of this dissertation is to summarize, analyze and evaluate existing resource selection and score normalization techniques within a unified framework. This should improve the understanding of available methods, reveal their underlying assumptions and limitations and describe their properties. This, in turn, will help to improve existing resource selection and score normalization techniques and to apply the right method in the right setting.
The second and main contribution of this dissertation is in stating and addressing the problem of uncertainty in DIR. In Information Retrieval (IR) this problem has been recognized for a long time and numerous techniques have been proposed to deal with uncertainty in various IR tasks. This dissertation raises the question of uncertainty in DIR, outlines the sources of uncertainty in the different DIR phases and proposes methods for measuring and reducing this uncertainty.
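The broker-based steps mentioned above (resource description, resource selection, score normalization, and merging for presentation) can be sketched in a few lines. This is a minimal toy example and not a method from the dissertation: the term-count descriptions, the summed-count selection score, min-max normalization, and all names and data are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Resource descriptions: hypothetical term counts sampled from each source.
DESCRIPTIONS: Dict[str, Dict[str, int]] = {
    "news": {"election": 8, "storm": 5},
    "travel": {"flight": 9, "hotel": 7, "storm": 1},
    "photos": {"sunset": 6, "hotel": 2},
}


def select_resources(query: List[str], k: int) -> List[str]:
    """Resource selection: rank sources by summed counts of the query terms."""
    scores = {name: sum(desc.get(term, 0) for term in query)
              for name, desc in DESCRIPTIONS.items()}
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return [name for name in ranked if scores[name] > 0][:k]


def min_max(results: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Score normalization: map one source's raw scores onto [0, 1]."""
    if not results:
        return []
    lo = min(score for _, score in results)
    hi = max(score for _, score in results)
    span = hi - lo
    return [(doc, (score - lo) / span if span else 1.0)
            for doc, score in results]


def merge(per_source: Dict[str, List[Tuple[str, float]]]) -> List[Tuple[str, float]]:
    """Merge the normalized result lists into a single ranking."""
    merged = [pair for results in per_source.values()
              for pair in min_max(results)]
    return sorted(merged, key=lambda pair: (-pair[1], pair[0]))


selected = select_resources(["hotel", "storm"], k=2)

# Hypothetical raw results from the selected sources, on different score scales.
raw = {
    "travel": [("t1", 12.0), ("t2", 4.0)],
    "news": [("n1", 0.75), ("n2", 0.25), ("n3", 0.5)],
}
ranking = merge({name: raw[name] for name in selected})
print(selected, ranking)
```

Without the normalization step, every result from the source with the larger score scale would outrank every result from the other source, which is why score normalization is treated as a step of its own.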