Archive for the 'Distributed Search' Category

Robert Neumayer defends thesis on distributed entity search

Friday, March 8th, 2013, posted by Djoerd Hiemstra

Today, Robert Neumayer defended his Ph.D. thesis Semantic and Distributed Entity Search in the Web of Data at the Norwegian University of Science and Technology (NTNU) in Trondheim, Norway.

by Robert Neumayer

Both the growth and the ubiquitous character of the Internet have had a profound effect on how we access and consume data and information. More recently, the Semantic Web, an extension of the current Web, has become increasingly relevant due to its widespread adoption.
The Web of Data (WoD) is an extension of the current web, where not only documents are interlinked by means of hyperlinks but also data in terms of predicates. Specifically, it describes objects, entities or “things” in terms of their attributes and their relationships, using RDF data (and the term is often used interchangeably with Linked Data). Given its growth, there is a strong need for making this wealth of knowledge accessible by keyword search (the de-facto standard paradigm for accessing information online).
The overall goal of this thesis is to provide new techniques for accessing this data, i.e., to unlock its full potential for end users. We therefore address the following four main issues: a) how can the Web of Data be searched by means of keyword search?, b) what sets apart search in the WoD from traditional web search?, c) how can these elements be used in a theoretically sound and effective way?, and d) how can the techniques be adapted to a distributed environment? To this end, we develop techniques for effectively searching WoD sources. We build upon and formalise existing entity modelling approaches within a generative language modelling framework, and compare them experimentally using standard test collections. We show that these models outperform the current state-of-the-art in terms of retrieval effectiveness; however, this comes at the cost of abandoning a large part of the semantics behind the data. We propose a novel entity model capable of preserving the semantics associated with entities, without sacrificing retrieval effectiveness. We further show how these approaches can be applied in the distributed context, both with low (federated search) and high (peer-to-peer, or P2P) numbers of independent repositories, collections, or nodes.
The main contributions are as follows:

  • We develop a hybrid approach to search in the Web of Data, using elements from traditional information retrieval and structured retrieval alike.
  • We formalise our approaches in a language model setting.
  • Our extensions are successfully evaluated with respect to their applicability in different distributed environments such as federated search and P2P.
  • We discuss and analyse our empirical evaluation and provide insights into the entity search problem.
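
As a rough illustration of the generative language modelling framework the thesis builds on, the sketch below ranks entities by smoothed query likelihood over bag-of-words descriptions derived from RDF-style (predicate, value) pairs. The toy entities, the Jelinek-Mercer smoothing and its parameter are illustrative assumptions, not the thesis' exact models.

```python
import math
from collections import Counter

def build_entity_model(triples):
    """Bag-of-words entity description from (predicate, value) pairs."""
    terms = []
    for predicate, value in triples:
        terms.extend(str(value).lower().split())
    return Counter(terms)

def query_likelihood(query, entity_tf, collection_tf, collection_len, lam=0.8):
    """Jelinek-Mercer smoothed query log-likelihood for one entity."""
    entity_len = sum(entity_tf.values())
    score = 0.0
    for term in query.lower().split():
        p_entity = entity_tf[term] / entity_len if entity_len else 0.0
        p_collection = collection_tf[term] / collection_len if collection_len else 0.0
        p = lam * p_entity + (1.0 - lam) * p_collection
        score += math.log(p) if p > 0.0 else math.log(1e-12)  # floor for unseen terms
    return score

# Toy example: two entities described by RDF-style (predicate, value) pairs.
entities = {
    "ntnu": [("rdfs:label", "Norwegian University of Science and Technology"),
             ("dbo:city", "Trondheim")],
    "utwente": [("rdfs:label", "University of Twente"),
                ("dbo:city", "Enschede")],
}
models = {e: build_entity_model(t) for e, t in entities.items()}
collection_tf = sum(models.values(), Counter())
collection_len = sum(collection_tf.values())

query = "university trondheim"
ranking = sorted(models, key=lambda e: query_likelihood(
    query, models[e], collection_tf, collection_len), reverse=True)
print(ranking)
```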

Snippet-based Relevance Predictions

Wednesday, February 13th, 2013, posted by Djoerd Hiemstra

Snippet-based Relevance Predictions for Federated Web Search

by Thomas Demeester, Dong Nguyen, Dolf Trieschnigg, Chris Develder, and Djoerd Hiemstra

How well can the relevance of a page be predicted, purely based on snippets? This would be highly useful in a Federated Web Search setting where caching large amounts of result snippets is more feasible than caching entire pages. The experiments reported in this paper make use of result snippets and pages from a diverse set of actual Web search engines. A linear classifier is trained to predict the snippet-based user estimate of page relevance, but also to predict the actual page relevance, again based on snippets alone. The presented results confirm the validity of the proposed approach and provide promising insights into future result merging strategies for a Federated Web Search setting.
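
For a flavour of the setup, here is a minimal sketch of training a linear classifier on snippet-only features to predict page relevance. The feature set and the toy data are assumptions for illustration, not the paper's actual features or judgments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def snippet_features(query, title, summary):
    """Simple snippet-only features: query-term overlap and summary length."""
    q = set(query.lower().split())
    t = set(title.lower().split())
    s = set(summary.lower().split())
    return [
        len(q & t) / max(len(q), 1),   # fraction of query terms in the title
        len(q & s) / max(len(q), 1),   # fraction of query terms in the summary
        len(summary.split()),          # summary length in words
    ]

# Toy training data: (query, snippet title, snippet summary, page relevance).
train = [
    ("federated search", "Federated search - overview", "Combining results from many engines", 1),
    ("federated search", "Cheap flights", "Book your holiday now", 0),
    ("entity ranking", "Entity ranking in the Web of Data", "Ranking entities with language models", 1),
    ("entity ranking", "Celebrity gossip", "Latest news about the stars", 0),
]
X = np.array([snippet_features(q, t, s) for q, t, s, _ in train])
y = np.array([label for _, _, _, label in train])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([snippet_features(
    "federated search", "A test collection for federated web search",
    "Results from over a hundred search engines")])[:, 1])
```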

The paper will be presented at the 35th European Conference on Information Retrieval (ECIR) on 25 March 2013 in Moscow, Russia.

[download pdf]

A probabilistic approach for mapping free-text queries to complex web forms

Friday, December 21st, 2012, posted by Djoerd Hiemstra

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

Web applications with complex interfaces consisting of multiple input fields should understand free-text queries. We propose a probabilistic approach to map parts of a free-text query to the fields of a complex web form. Our method uses token models rather than only static dictionaries to create this mapping, offering greater flexibility and requiring less domain knowledge than existing systems. We evaluate different implementations of our mapping model and show that our system effectively maps free-text queries without using a dictionary. If a dictionary is available, the performance increases and is significantly better than a rule-based baseline.
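
The sketch below gives a simplified impression of token-model-based mapping: each form field has a token model, and the most likely assignment of query tokens to fields wins. The travel-planner fields, the toy token counts, and the brute-force search over assignments are illustrative assumptions, not the paper's exact model.

```python
from collections import Counter
from itertools import product

# Per-field token models for a hypothetical travel-planner form (toy counts).
field_tokens = {
    "from": Counter({"enschede": 6, "amsterdam": 4, "utrecht": 3}),
    "to":   Counter({"amsterdam": 6, "enschede": 4, "utrecht": 3}),
    "time": Counter({"tomorrow": 5, "9:00": 3, "10:30": 2}),
}

def token_prob(field, token, eps=1e-4):
    """Smoothed probability of a token under one field's token model."""
    tf = field_tokens[field]
    total = sum(tf.values())
    return (tf[token] + eps) / (total + eps * len(tf))

def best_mapping(tokens):
    """Brute force: try every token-to-field assignment, keep the most likely one."""
    best, best_score = None, 0.0
    for assignment in product(field_tokens, repeat=len(tokens)):
        score = 1.0
        for field, token in zip(assignment, tokens):
            score *= token_prob(field, token.lower())
        if score > best_score:
            best, best_score = assignment, score
    return list(zip(tokens, best)), best_score

print(best_mapping("enschede amsterdam tomorrow".split()))
```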

[download pdf]

University of Twente Touch Stories

Monday, November 19th, 2012, posted by Djoerd Hiemstra

Probing the hidden depths of the web: a great performance by Kien Tjin-Kam-Jet, who explains what motivates him in his research (in Dutch).

Students and researchers explain what “High-Tech Human Touch” means to them at UT Touch Stories.

Join TREC FedWeb’13

Tuesday, November 13th, 2012, posted by Djoerd Hiemstra

FedWeb ’13 is the new TREC (Text Retrieval Conference) Federated Web Search track, which will provide a test collection that supports and stimulates research in many areas related to federated search, including aggregated search, distributed search, peer-to-peer search and meta-search engines. The track will evaluate federated and aggregated search in a large heterogeneous setting using the search results of existing search engines.

Join the mailing list and keep up to date with FedWeb’13.

Ensemble clustering for result diversification

Friday, October 26th, 2012, posted by Djoerd Hiemstra

by Dong Nguyen and Djoerd Hiemstra

This paper describes the participation of the University of Twente in the Web track of TREC 2012. Our baseline approach uses the Mirex toolkit, an open source tool that sequentially scans all the documents. For result diversification, we experimented with improving the quality of clusters through ensemble clustering. We combined clusters obtained by different clustering methods (such as LDA and K-means) and clusters obtained by using different types of data (such as document text and anchor text). Our two-layer ensemble run performed better than the LDA-based diversification and also better than a non-diversification run.
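
As a rough sketch of the ensemble idea (not the paper's exact two-layer method), the snippet below combines several base clusterings through a co-association matrix and derives consensus clusters from it. The toy documents and the choice of base clusterers are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple banana fruit", "banana orange fruit", "dog cat pet",
        "cat mouse pet", "train station travel", "bus travel ticket"]
X = TfidfVectorizer().fit_transform(docs).toarray()

# Base clusterings (stand-ins for e.g. LDA- and K-means-based clusterings
# over document text and anchor text).
labelings = [
    KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X),
]

# Co-association: fraction of base clusterings that put two documents together.
n = len(docs)
coassoc = np.zeros((n, n))
for labels in labelings:
    coassoc += (labels[:, None] == labels[None, :]).astype(float)
coassoc /= len(labelings)

# Final consensus clusters from the co-association similarities.
consensus = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average"
).fit_predict(1.0 - coassoc)
print(consensus)
```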

[download pdf]

Almer Tigelaar defends PhD thesis on P2P Search

Thursday, September 27th, 2012, posted by Djoerd Hiemstra

Peer-to-peer information retrieval

by Almer Tigelaar

The Internet has become an integral part of our daily lives. However, the essential task of finding information is dominated by a handful of large centralised search engines. In this thesis we study an alternative to this approach. Instead of using large data centres, we propose using the machines that we all use every day: our desktop, laptop and tablet computers, to build a peer-to-peer web search engine. We provide a definition of the associated research field: peer-to-peer information retrieval. We examine what separates it from related fields, give an overview of the work done so far and provide an economic perspective on peer-to-peer search. Furthermore, we introduce our own architecture for peer-to-peer search systems, inspired by BitTorrent. Distributing the task of providing search results for queries introduces the problem of query routing: a query needs to be sent to a peer that can provide relevant search results. We investigate how the content of peers can be represented so that queries can be directed to the best ones in terms of relevance. While cooperative peers can provide their own representation, the content of uncooperative peers can be accessed only through a search interface, and thus they cannot actively provide a description of themselves. We look into representing these uncooperative peers by probing their search interface to construct a representation. Finally, the capacity of the machines in peer-to-peer networks differs considerably, making it challenging to provide search results quickly. To address this, we present an approach where copies of search results for previous queries are retained at peers and used to serve future requests, and we show that participation can be incentivised using reputations. There are still problems to be solved before a real-world peer-to-peer web search engine can be built. This thesis provides a starting point for this ambitious goal and also provides a solid basis for reasoning about peer-to-peer information retrieval systems in general.
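
To make the query-routing step a bit more concrete, here is a minimal sketch that scores peers by a smoothed query likelihood over their term statistics and forwards the query to the most promising ones. The peer names, the statistics, and the scoring formula are illustrative assumptions, not the thesis' exact method.

```python
import math
from collections import Counter

# Term statistics per peer, e.g. obtained cooperatively or by sampling.
peer_stats = {
    "peer-a": Counter({"python": 120, "programming": 80, "tutorial": 40}),
    "peer-b": Counter({"football": 200, "match": 90, "scores": 60}),
    "peer-c": Counter({"python": 30, "snake": 100, "reptile": 70}),
}

def peer_score(query_terms, stats, mu=100):
    """Dirichlet-smoothed log-likelihood of the query under a peer's term model."""
    total = sum(stats.values())
    all_terms = sum(peer_stats.values(), Counter())
    all_total = sum(all_terms.values())
    score = 0.0
    for term in query_terms:
        p_bg = (all_terms[term] + 1) / (all_total + len(all_terms))
        score += math.log((stats[term] + mu * p_bg) / (total + mu))
    return score

def route(query, k=2):
    """Return the k peers most likely to hold relevant results for the query."""
    terms = query.lower().split()
    ranked = sorted(peer_stats, key=lambda p: peer_score(terms, peer_stats[p]),
                    reverse=True)
    return ranked[:k]

print(route("python tutorial"))  # forward the query to these peers
```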

[download pdf]

What snippets say about web pages

Wednesday, September 12th, 2012, posted by Djoerd Hiemstra

What Snippets Say About Pages in Federated Web Search

by Thomas Demeester, Dong Nguyen, Dolf Trieschnigg, Chris Develder, and Djoerd Hiemstra

What is the likelihood that a Web page is considered relevant to a query, given the relevance assessment of the corresponding snippet? Using a new federated IR test collection that contains search results from over a hundred search engines on the Internet, we are able to investigate such research questions from a global perspective. Our test collection covers the main Web search engines like Google, Yahoo!, and Bing, as well as a number of smaller search engines dedicated to multimedia, shopping, etc., and as such reflects a realistic Web environment. Using a large set of relevance assessments, we are able to investigate the connection between snippet quality and page relevance. The dataset is strongly inhomogeneous, and although the assessors’ consistency is shown to be satisfactory, care is required when comparing resources. To this end, a number of probabilistic quantities, based on snippet and page relevance, are introduced and evaluated.
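
One of the simplest quantities of this kind is the probability that a page is relevant given that its snippet was judged relevant, estimated from paired snippet/page assessments. The sketch below computes it; the counts are made-up illustration data, not the paper's results.

```python
def conditional_page_relevance(judgments):
    """Estimate P(page relevant | snippet relevant).

    judgments: iterable of (snippet_relevant, page_relevant) booleans.
    """
    pages_for_relevant_snippets = [page for snippet, page in judgments if snippet]
    if not pages_for_relevant_snippets:
        return 0.0
    return sum(pages_for_relevant_snippets) / len(pages_for_relevant_snippets)

# Toy paired judgments for one resource: (snippet relevant?, page relevant?)
judgments = [(True, True), (True, False), (True, True), (False, False),
             (False, True), (True, True)]
print(conditional_page_relevance(judgments))  # 3 of 4 relevant snippets -> 0.75
```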

The paper will be presented at the Asia Information Retrieval Societies Conference (AIRS 2012) in Tianjin, China.

[download pdf]

Shard Ranking and Cutoff Estimation for Topically Partitioned Collections

Monday, August 27th, 2012, posted by Djoerd Hiemstra

by Anagha Kulkarni, Almer Tigelaar, Djoerd Hiemstra, and Jamie Callan

Large document collections can be partitioned into topical shards to facilitate distributed search. In a low-resource search environment only a few of the shards can be searched in parallel. Such a search environment faces two intertwined challenges. First, determining which shards to consult for a given query: shard ranking. Second, how many shards to consult from the ranking: cutoff estimation. In this paper we present a family of three algorithms that address both of these problems. As a basis we employ a commonly used data structure, the central sample index (CSI), to represent the shard contents. Running a query against the CSI yields a flat document ranking that each of our algorithms transforms into a tree structure. A bottom-up traversal of the tree is used to infer a ranking of shards and also to estimate a stopping point in this ranking that yields cost-effective selective distributed search. As compared to a state-of-the-art shard ranking approach, the proposed algorithms provide substantially higher search efficiency while providing comparable search effectiveness.
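
For readers unfamiliar with CSI-based shard ranking, the sketch below shows the common sample-based baseline this line of work starts from (in the style of ReDDE), not the tree-based algorithms proposed in the paper: each shard gets credit for the sampled documents it contributes to the top of the CSI ranking, scaled by how much of the shard was sampled. The document IDs, sample sizes and shard sizes are toy assumptions.

```python
from collections import defaultdict

# Ranked list returned by the central sample index (CSI) for some query:
# (doc_id, shard_id) pairs, best first.
csi_ranking = [("d1", "shard-sports"), ("d2", "shard-sports"), ("d3", "shard-news"),
               ("d4", "shard-tech"), ("d5", "shard-news"), ("d6", "shard-tech")]

# How many documents each shard contributed to the sample, and its full size.
sample_size = {"shard-sports": 100, "shard-news": 100, "shard-tech": 100}
shard_size = {"shard-sports": 50000, "shard-news": 20000, "shard-tech": 80000}

def rank_shards(csi_ranking, top_k=5):
    """Estimate each shard's number of relevant documents from the CSI top-k."""
    scores = defaultdict(float)
    for doc_id, shard in csi_ranking[:top_k]:
        # Each sampled document stands in for shard_size/sample_size documents.
        scores[shard] += shard_size[shard] / sample_size[shard]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_shards(csi_ranking))
```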

To be presented at the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012).

[download pdf]

Federated Search in the Wild

Thursday, August 16th, 2012, posted by Djoerd Hiemstra

The Combined Power of over a Hundred Search Engines

by Dong Nguyen, Thomas Demeester (Ghent University), Dolf Trieschnigg, and Djoerd Hiemstra

Federated search has the potential of improving web search: the user becomes less dependent on a single search provider and parts of the deep web become available through a unified interface, leading to a wider variety in the retrieved search results. However, a publicly available dataset for federated search reflecting an actual web environment has been absent. As a result, it has been difficult to assess whether proposed systems are suitable for the web setting. We introduce a new test collection containing the results from more than a hundred actual search engines, ranging from large general web search engines such as Google and Bing to small domain-specific engines. We discuss the design and analyze the effect of several sampling methods. For a set of test queries, we collected relevance judgments for the top 10 results of each search engine. The dataset is publicly available and is useful for researchers interested in resource selection for web search collections, result merging and size estimation of uncooperative resources.
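
One of the sampling methods analysed in work like this is query-based sampling of uncooperative engines: probe the engine with terms drawn from what has been sampled so far and accumulate a resource description. The sketch below illustrates the idea; the search() function is a hypothetical stand-in for calling an engine's result-page interface, and the parameters are illustrative assumptions.

```python
import random
from collections import Counter

def search(query):
    """Hypothetical: return a list of snippet texts for a query from some engine."""
    raise NotImplementedError("replace with a call to the engine's interface")

def query_based_sampling(seed_term, num_probes=50, results_per_probe=10):
    """Build term statistics for an engine by repeatedly probing it with queries."""
    description = Counter()
    probe = seed_term
    for _ in range(num_probes):
        snippets = search(probe)[:results_per_probe]
        for text in snippets:
            description.update(text.lower().split())
        if description:
            # Draw the next probe term from the vocabulary sampled so far.
            probe = random.choice(list(description))
    return description  # usable for resource selection or size estimation

# Usage (once search() is implemented for a particular engine):
# description = query_based_sampling("news")
```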

To be presented at the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012).

[download pdf]