Archive for the 'Paper abstracts' Category

Indexing half a billion web pages

Tuesday, October 27th, 2009, posted by Djoerd Hiemstra

by Claudia Hauff and Djoerd Hiemstra

The University of Twente participated in three tasks of TREC 2009: the adhoc task, the diversity task and the relevance feedback task. All experiments are performed on the English part of ClueWeb09. In this draft paper, we describe our approach to tuning our retrieval system in the absence of training data in Section 3. Section 4 describes the use of categories and a query log for diversifying search results. Section 5 describes preliminary results for the relevance feedback task.

[download pdf]

Matthijs Ooms graduates on Provenance Management for Bioinformatics

Monday, September 28th, 2009, posted by Djoerd Hiemstra

by Matthijs Ooms

Scientific Workflow Management Systems (SWfMSs), such as our own research prototype e-BioFlow, are used by bioinformaticians to design and run data-intensive experiments, connecting local and remote (Web) services and tools. Preserving data for later inspection or reuse and validating results are essential for scientific experiments and determine the quality of the results. Both can be achieved by collecting provenance data. The dependencies between services and data are captured in a provenance model, such as the interchangeable Open Provenance Model (OPM). This research has the following two provenance-related goals:

  1. Using a provenance archive effectively and efficiently as cache for workflow tasks.
  2. Designing techniques to support browsing and navigation through a provenance archive.
The use case identified is OligoRAP, taken from the life science domain. OligoRAP was cast as a workflow in the SWfMS e-BioFlow. Its performance in terms of duration was measured and its results were validated by comparing them to the results of the original Perl implementation. By casting OligoRAP as a workflow and using parallelism, its performance was improved by a factor of two.

Many improvements were made to e-BioFlow in order to run OligoRAP, among which a new provenance implementation based on the OPM, enabling provenance capturing during the execution of OligoRAP in e-BioFlow. During this research, e-BioFlow has grown from a proof-of-concept into a powerful research prototype. For the OPM implementation, a profile for the OPM has been proposed that defines how provenance is collected during workflow enactment. The proposed profile maintains the hierarchical structure of (sub)workflows in the collected provenance data and thereby improves the interoperability of the OPM for SWfMSs.

A caching strategy for workflow tasks is proposed and implemented in e-BioFlow. It queries the OPM implementation for previous task executions. The queries were optimised by reformulating them and by creating several indices; the performance improvement of each optimisation was measured using a query set taken from an OligoRAP cache run. Three tasks in OligoRAP were cached, resulting in an additional performance improvement of 19%. This shows that a provenance archive based on the OPM can be used to effectively cache workflow tasks. Finally, a provenance browser is introduced that incorporates several techniques for browsing through large provenance archives. Its primary visualisation is the graph representation specified by the OPM.
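The caching idea can be illustrated with a small sketch. This is not the actual e-BioFlow or OPM code: the archive class, the cache key and the blast() stand-in below are hypothetical substitutes for the real provenance store and its indexed queries.

```python
# Illustrative only: a toy cache of workflow task results keyed by the task
# and its inputs. ProvenanceArchive, run_cached and blast() are hypothetical;
# they are not the e-BioFlow/OPM implementation.
import hashlib
import json


class ProvenanceArchive:
    """Toy in-memory archive; e-BioFlow queries a real OPM store instead."""

    def __init__(self):
        self._records = {}   # cache key -> stored task outputs
        self._by_task = {}   # task name -> set of cache keys (a simple "index")

    @staticmethod
    def _key(task_name, inputs):
        blob = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha1(blob.encode()).hexdigest()

    def lookup(self, task_name, inputs):
        return self._records.get(self._key(task_name, inputs))

    def store(self, task_name, inputs, outputs):
        key = self._key(task_name, inputs)
        self._records[key] = outputs
        self._by_task.setdefault(task_name, set()).add(key)


def run_cached(task_name, task_fn, inputs, archive):
    """Reuse a previously recorded execution with identical inputs, if any."""
    cached = archive.lookup(task_name, inputs)
    if cached is not None:
        return cached                 # cache hit: skip re-execution
    outputs = task_fn(**inputs)       # cache miss: run the task
    archive.store(task_name, inputs, outputs)
    return outputs


def blast(sequence):
    """Stand-in for a real (expensive) bioinformatics tool invocation."""
    return {"hits": len(sequence)}


archive = ProvenanceArchive()
run_cached("blast", blast, {"sequence": "ACGT"}, archive)   # executes the task
run_cached("blast", blast, {"sequence": "ACGT"}, archive)   # answered from the archive
```

Keying the lookup on the task and its inputs is what makes a provenance archive reusable as a cache: an identical task execution can be answered from the archive instead of being re-run.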

More information is available at the e-BioFlow project page at SourceForge, or in Matthijs’ master’s thesis in ePrints.

Information Retrieval Models Tutorial

Thursday, August 20th, 2009, posted by Djoerd Hiemstra

Many applications that handle information on the internet would be completely inadequate without the support of information retrieval technology. How would we find information on the world wide web if there were no web search engines? How would we manage our email without spam filtering? Much of the development of information retrieval technology, such as web search engines and spam filters, requires a combination of experimentation and theory. Experimentation and rigorous empirical testing are needed to keep up with increasing volumes of web pages and emails. Furthermore, experimentation and constant adaptation of technology is needed in practice to counteract the effects of people who deliberately try to manipulate the technology, such as email spammers. However, if experimentation is not guided by theory, engineering becomes trial and error. New problems and challenges for information retrieval come up constantly. They cannot possibly be solved by trial and error alone. So, what is the theory of information retrieval? There is not one convincing answer to this question. There are many theories, here called formal models, and each model is helpful for the development of some information retrieval tools, but not so helpful for the development of others. In order to understand information retrieval, it is essential to learn about these retrieval models. In this chapter, some of the most important retrieval models are gathered and explained in a tutorial style.

The tutorial will be published in Ayse Goker and John Davies (eds.), Information Retrieval: Searching in the 21st Century, Wiley, 2009.

[download draft]

[download exercise solutions]

Concept Detectors: How Good is Good Enough?

Monday, August 10th, 2009, posted by Djoerd Hiemstra

A Monte Carlo Simulation Based Approach

by Robin Aly and Djoerd Hiemstra

Today, semantic concept-based video retrieval systems often show insufficient performance for real-life applications. A large share of the reason is the lacking performance of the concept detectors. While concept detectors keep improving, the following important questions need to be addressed: “How good do detectors need to be to produce usable search systems?” and “How does the detector performance influence different concept combination methods?”. We use Monte Carlo simulations to provide answers to these questions. The main contribution of this paper is a probabilistic model of detectors which output confidence scores indicating the likelihood that a concept occurs. This score is also converted into a posterior probability and a binary classification. We investigate the influence of changes to the model’s parameters on the performance of multiple concept combination methods. Current web search engines produce a mean average precision (MAP) of around 0.20. Our simulation reveals that the best-performing video search methods achieve this performance using detectors with 0.60 MAP and are therefore usable in real life. Furthermore, perfect detection allows the best-performing combination method to produce 0.39 search MAP in an artificial environment with oracle settings. We also find that MAP is not necessarily a good evaluation measure for concept detectors, since it is not always correlated with search performance.
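To make the simulation approach concrete, here is a minimal Monte Carlo sketch. It is not the detector model from the paper: each simulated detector simply draws a confidence score from one of two Gaussians, depending on whether the concept truly occurs in a shot, and a hypothetical separation parameter stands in for detector quality.

```python
# Illustrative Monte Carlo simulation of concept detectors (not the paper's
# model): a detector emits a confidence score drawn from one of two Gaussians,
# and the separation between them is a hypothetical "quality" knob.
import random


def simulate_scores(n_shots, prior, separation, rng):
    """Return (labels, scores) for one simulated detector run."""
    labels, scores = [], []
    for _ in range(n_shots):
        occurs = rng.random() < prior
        mean = separation if occurs else 0.0
        labels.append(occurs)
        scores.append(rng.gauss(mean, 1.0))
    return labels, scores


def average_precision(labels, scores):
    """Average precision when shots are ranked by descending detector score."""
    ranked = [label for _, label in sorted(zip(scores, labels), reverse=True)]
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)


rng = random.Random(42)
for separation in (0.5, 1.0, 2.0):        # hypothetical detector quality levels
    runs = [average_precision(*simulate_scores(1000, 0.1, separation, rng))
            for _ in range(20)]           # 20 Monte Carlo runs per level
    print(f"separation {separation}: mean AP {sum(runs) / len(runs):.2f}")
```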

The paper will be presented at ACM Multimedia in Beijing, China.

[download pdf]

Pavel Serdyukov defends PhD thesis on Expert Search

Thursday, June 25th, 2009, posted by Djoerd Hiemstra

by Pavel Serdyukov

The automatic search for knowledgeable people within the scope of an organization is a key function which makes modern enterprise search systems commercially successful and socially demanded. A number of effective approaches to expert finding have recently been proposed in academic publications. Although most of them use reasonably defined measures of personal expertise, they often limit themselves to rather unrealistic and sometimes oversimplified principles. In this thesis, we explore several ways to go beyond state-of-the-art assumptions used in research on expert finding and propose several novel solutions for this and related tasks. First, we describe measures of expertise that do not assume independent occurrence of terms and persons in a document, which makes them perform better than measures based on independence of all entities in a document. One of these measures makes persons central to the process of term generation in a document. Another one assumes that the position of a person’s mention in a document with respect to the positions of query terms indicates the relation of the person to the document’s relevant content. Second, we go beyond direct expertise evidence for a person, which is concentrated within the document space of the person’s current employer and within those organizational documents that mention the person. We successfully utilize the predictive potential of additional indirect expertise evidence that is publicly available on the Web and in organizational documents implicitly related to a person. Finally, besides the proposed expert finding methods, we also demonstrate solutions for tasks from related domains. In one case, we use several algorithms of multi-step relevance propagation to search for typed entities in Wikipedia. In another case, we suggest generic methods for placing photos uploaded to Flickr on the world map, using language models of locations built entirely on the annotations provided by users, with a few task-specific extensions.

[download pdf]

Robin Aly presents at SIGIR Doctoral Consortium

Wednesday, June 10th, 2009, posted by Djoerd Hiemstra

Modeling Uncertainty in Video Retrieval: A Retrieval Model for Uncertain Semantic Representations of Videos

by Robin Aly

The need for content-based multimedia retrieval is increasing rapidly because collection sizes grow ever faster. However, retrieval systems often do not perform well enough for real-life applications. A promising approach is to detect semantic primitives at indexing time. Currently investigated primitives are the utterance of words and the occurrence of so-called semantic concepts, such as “Outdoor” and “Person”. We refer to a concrete instantiation of these primitives as the representation of the video document. Most detector programs emit scores reflecting the likelihood of each primitive. However, the detection is far from perfect and a lot of uncertainty about the real representation remains. Some retrieval algorithms ignore this uncertainty, which clearly hurts precision and recall. Other methods use the scores as anonymous features and learn their relationship to relevance. This has the disadvantage of requiring vast amounts of training data and has to be redone for every detector change.

The main contribution of our work is a formal retrieval model for treating this uncertainty. We conceptually consider the retrieval problem as two steps: (1) the determination of the posterior probability distribution over all representations given the scores (using existing methods), and (2) the derivation of a ranking status value (RSV) for each representation. We then take the expected RSV, weighted by the representations’ posterior probabilities, as the effective RSV of the shot for ranking. We claim that our approach has the following advantages: (a) step (2) is easier to achieve than with the machine learning alternative, and (b) it benefits from all detector improvements.
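In generic notation (a sketch of the idea only, not necessarily the notation used in the thesis), with o the detector scores observed for a shot and R ranging over its possible representations:

```latex
% Sketch of the expected-RSV idea; o: observed detector scores of a shot,
% R: possible semantic representations of that shot.
\[
  \mathrm{RSV}_{\mathrm{eff}}(\mathrm{shot})
    \;=\; \mathbb{E}\bigl[\mathrm{RSV}(R) \mid o\bigr]
    \;=\; \sum_{R} P(R \mid o)\,\mathrm{RSV}(R)
\]
```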

[more information]

Parallel and Distributed Databases at Euro-Par 2009

Thursday, June 4th, 2009, posted by Djoerd Hiemstra

by Djoerd Hiemstra, Alfons Kemper, Manuel Prieto, and Alex Szalay

Euro-Par is an annual series of international conferences dedicated to the promotion and advancement of all aspects of parallel and distributed computing. Euro-Par focuses on all aspects of hardware, software, algorithms and applications for parallel and distributed computing. The Euro-Par 2009 conference will take place in Delft, the Netherlands, from August 25th until August 28th, 2009.

Sessions at Euro-Par cover several topics. Euro-Par Topic 5 addresses data management issues in parallel and distributed computing. Advances in data management (storage, access, querying, retrieval, mining) are inherent to current and future information systems. Today, accessing large volumes of information is a reality: data-intensive applications enable huge user communities to transparently access multiple pre-existing autonomous, distributed and heterogeneous resources (data, documents, images, services, etc.). Data management solutions need efficient techniques for exploiting and mining large datasets available in clusters, peer-to-peer and grid architectures. Parallel and distributed file systems, databases, data warehouses, and digital libraries are key elements for achieving scalable, efficient systems that cost-effectively manage and extract data from huge amounts of highly distributed and heterogeneous digital data repositories. Each paper submitted to Euro-Par’s topic Parallel and Distributed Databases was reviewed by at least three reviewers. Of the 11 papers submitted to the topic this year, 3 were accepted, an acceptance rate of 27%. The three accepted papers discuss diverse issues: database transactions, efficient and reliable structured peer-to-peer systems, and selective replicated declustering.

In their paper Unifying Memory and Database Transactions, Ricardo Dias and João Lourenço present a simple but powerful idea: combining software transactional memory with database transactions. The paper proposes to provide unified memory and database transactions by integrating the database transaction control with a software framework for transactional memory. Experimental results show that the overhead for unified transactions is low, and the approach is likely to lower the burden on the application developer.

The paper by Hao Gong, Guangyu Shi, Jian Chen, and Lingyuan Fan, A DHT Key-Value Storage System with Carrier Grade Performance, aims to achieve reliability and efficiency in peer-to-peer systems in order to support telecom services. The proposed design adopts a two-layer distributed hash table and embeds location information into peer IDs; it provides one-hop routing by enhancing each peer with an additional one-hop routing table, where super-peers are in charge of updating and synchronizing this routing information; and it replicates subscriber data on multiple peers.

Finally, Kerim Oktay, Ata Turk, and Cevdet Aykanat present a paper on Selective Replicated Declustering for Arbitrary Queries. The authors present a new algorithm for selective replicated declustering for arbitrary queries, which uses the available query information to decide on the assignment of data to different disks and on which data to replicate, while respecting space constraints. They further describe how to apply the proposed algorithm recursively to obtain a multi-way replicated declustering. Experiments show that the algorithm outperforms existing replicated declustering schemes, especially for low replication constraints.

[download pdf]

More info at the Euro-Par 2009 conference site.

Towards an Information Retrieval Theory of Everything

Monday, May 25th, 2009, posted by Djoerd Hiemstra

I present three well-known probabilistic models of information retrieval in tutorial style: the binary independence probabilistic model, the language modeling approach, and Google’s PageRank. Although all three models are based on probability theory, they are very different in nature. Each model seems well-suited for solving certain information retrieval problems, but not so useful for solving others. So, essentially each model solves part of a bigger puzzle, and a unified view on these models might be a first step towards an Information Retrieval Theory of Everything.
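As a toy illustration of one of these models, the sketch below scores documents by smoothed query likelihood, the core of the language modeling approach. The documents, the query and the interpolation weight are made up for the example; a real system would add proper tokenisation and smoothing tuned on data.

```python
# Toy query-likelihood scoring with Jelinek-Mercer (linear interpolation)
# smoothing. The documents, query and lambda value are invented for the example.
import math
from collections import Counter


def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """Log-probability of the query under the smoothed document language model."""
    doc_counts = Counter(doc_terms)
    coll_counts = Counter(collection_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / len(doc_terms)
        p_coll = coll_counts[term] / len(collection_terms)
        # Linear interpolation of the document model and the collection model.
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score


docs = {
    "d1": "web search engines index web pages".split(),
    "d2": "spam filters classify email messages".split(),
}
collection = [term for doc in docs.values() for term in doc]
query = "web search".split()
ranking = sorted(docs, reverse=True,
                 key=lambda d: query_likelihood(query, docs[d], collection))
print(ranking)   # d1 should rank above d2 for this query
```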

The paper is published in the newsletter of the NVTI, the “Nederlandse Vereniging voor Theoretische Informatica” (Dutch Association for Theoretical Computer Science). A more extensive overview of information retrieval theory, covering eight models, is given in: Djoerd Hiemstra, Information Retrieval Models. In: Ayse Goker and John Davies (eds.), Information Retrieval: Searching in the 21st Century, Wiley, 2009.

[download pdf]

Reusing Annotation Labor for Concept Selection

Wednesday, May 6th, 2009, posted by Djoerd Hiemstra

by Robin Aly, Djoerd Hiemstra and Arjen de Vries

Describing shots through the occurrence of semantic concepts is the first step towards modeling the content of a video semantically. An important challenge is to automatically select the right concepts for a given information need. For example, systems should be able to decide whether the concept “Outdoor” should be included in a search for “Street Basketball”. In this paper we provide an innovative method to automatically select concepts for an information need. To achieve this, we estimate the occurrence probability of a concept in relevant shots, which helps us to quantify the helpfulness of the concept. Our method reuses existing training data that is annotated with concept occurrences to build a text collection. Searching this collection with a text retrieval system, while knowing the concept occurrences, allows us to come up with a good estimate of this probability. We evaluate our method against a concept selection benchmark and search runs on both the TRECVID 2005 and 2007 collections. These experiments show that the estimation consistently improves retrieval effectiveness.
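A rough sketch of this idea, under strong simplifications: the annotated shots, the term-overlap ranking and the cut-off below are hypothetical stand-ins for the real training collection and text retrieval system used in the paper.

```python
# Hedged sketch: estimate how likely a concept is to occur in shots relevant
# to a text query by "searching" a collection of annotated training shots and
# counting concept occurrences in the top results. The data and the crude
# term-overlap ranking are invented for the example.
from collections import Counter

# Hypothetical training shots: textual annotation plus observed concepts.
training_shots = [
    {"text": "kids playing street basketball outdoors", "concepts": {"Outdoor", "Sports"}},
    {"text": "indoor basketball game in a gym", "concepts": {"Indoor", "Sports"}},
    {"text": "city street with cars and pedestrians", "concepts": {"Outdoor", "Urban"}},
]


def estimate_concept_probabilities(query, shots, top_k=2):
    """Estimate P(concept occurs | relevant) from the top-k retrieved shots."""
    query_terms = set(query.lower().split())
    ranked = sorted(shots, reverse=True,
                    key=lambda s: len(query_terms & set(s["text"].split())))[:top_k]
    counts = Counter(concept for shot in ranked for concept in shot["concepts"])
    return {concept: n / len(ranked) for concept, n in counts.items()}


probs = estimate_concept_probabilities("street basketball", training_shots)
print(probs)   # "Sports" gets the highest estimate for this query
```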

[download pdf]

Two Twente-Yahoo papers at SIGIR 2009

Tuesday, April 21st, 2009, posted by Djoerd Hiemstra

Both Pavel Serdyukov and Claudia Hauff have joint papers accepted for the 32nd Annual ACM SIGIR Conference in Boston, USA.

Placing Flickr Photos on a Map
by Pavel Serdyukov, Vanessa Murdock (Yahoo!), and Roelof van Zwol (Yahoo!)

In this paper we investigate generic methods for placing photos uploaded to Flickr on the world map. As primary input for our methods, we use the textual annotations provided by the users to predict the single most probable location where the image was taken. Central to our approach is a language model based entirely on the annotations provided by users. We define extensions to improve over the language model using tag-based smoothing and cell-based smoothing, and leveraging spatial ambiguity. Further, we demonstrate how to incorporate GeoNames, a large external database of locations. For varying levels of granularity, we are able to place images on a map with at least twice the precision of the state-of-the-art reported in the literature.
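A minimal sketch of the core idea: place a photo by its tags using a per-location unigram language model. The training data, the additive smoothing and the location granularity are all hypothetical and much simpler than the models, smoothing extensions and GeoNames integration described in the paper.

```python
# Toy location language models built from photo tags; the data and the
# smoothing constant are invented for the example.
import math
from collections import Counter, defaultdict

training = [
    ("amsterdam", ["canal", "bike", "tulips"]),
    ("amsterdam", ["canal", "museum"]),
    ("paris", ["eiffel", "tower", "museum"]),
]

location_models = defaultdict(Counter)
for location, tags in training:
    location_models[location].update(tags)

vocab = {tag for _, tags in training for tag in tags}


def log_likelihood(tags, model, alpha=1.0):
    """Additively smoothed log-probability of a photo's tags under a location model."""
    total = sum(model.values()) + alpha * len(vocab)
    return sum(math.log((model[tag] + alpha) / total) for tag in tags)


photo_tags = ["canal", "museum"]
best = max(location_models,
           key=lambda loc: log_likelihood(photo_tags, location_models[loc]))
print(best)   # expected: "amsterdam"
```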

Efficiency trade-offs in two-tier web search systems
by Ricardo Baeza-Yates (Yahoo!), Vanessa Murdock (Yahoo!), and Claudia Hauff

Search engines rely on searching multiple partitioned corpora to return results to users in a reasonable amount of time. In this paper we analyze the standard two-tier architecture for Web search, with the difference that the corpus to be searched for a given query is predicted in advance. We show that any predictor better than random yields time savings, but this decrease in processing time comes with an increase in infrastructure cost. We provide an analysis and investigate this trade-off in the context of two different scenarios on real-world data. We demonstrate that in general the decrease in answer time is justified by a small increase in infrastructure cost.
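A back-of-the-envelope model can illustrate the kind of trade-off being analyzed. It is not the paper’s model, and all timings, rates and tier sizes below are hypothetical: queries predicted to need the second tier are dispatched to both tiers in parallel, which hides the first-tier latency for correctly predicted queries but spends second-tier work on false positives.

```python
# Hypothetical two-tier latency/cost model, invented for illustration only.
def two_tier(frac_tier2, tpr, fpr, t1=50.0, t2=200.0):
    """Return (average answer time in ms, tier-2 queries per incoming query)."""
    tp = frac_tier2 * tpr              # needs tier 2, predicted: both tiers in parallel
    fn = frac_tier2 * (1 - tpr)        # needs tier 2, missed: tier 1 first, then tier 2
    fp = (1 - frac_tier2) * fpr        # answerable by tier 1, but tier 2 queried as well
    tn = (1 - frac_tier2) * (1 - fpr)  # answerable by tier 1, tier 1 only
    avg_time = tp * t2 + fn * (t1 + t2) + (fp + tn) * t1
    tier2_volume = tp + fn + fp        # crude proxy for infrastructure cost
    return avg_time, tier2_volume


for tpr, fpr in [(0.0, 0.0), (0.7, 0.3), (0.95, 0.05)]:   # no, modest, good predictor
    avg_time, load = two_tier(frac_tier2=0.3, tpr=tpr, fpr=fpr)
    print(f"tpr={tpr:.2f} fpr={fpr:.2f}: avg time {avg_time:.1f} ms, tier-2 load {load:.2f}")
```

In this toy setup a better predictor lowers the average answer time, while the extra second-tier load, a crude proxy for infrastructure cost, grows with the false-positive rate.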

See: list of accepted papers at SIGIR’09.