Reusing Annotation Labor for Concept Selection

by Robin Aly, Djoerd Hiemstra and Arjen de Vries

Describing shots through the occurrence of semantic concepts is the first step towards modeling the content of a video semantically. An important challenge is to automatically select the right concepts for a given information need. For example, systems should be able to decide whether the concept “Outdoor” should be included in a search for “Street Basketball”. In this paper we provide an innovative method to automatically select concepts for an information need. To achieve this, we estimate the occurrence probability of a concept in relevant shots, which lets us quantify the helpfulness of a concept. Our method reuses existing training data, annotated with concept occurrences, to build a text collection. Searching this collection with a text retrieval system, combined with knowledge of the concept occurrences, yields a good estimate of this probability. We evaluate our method against a concept selection benchmark and with search runs on both the TRECVID 2005 and 2007 collections. These experiments show that the estimation consistently improves retrieval effectiveness.
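The idea of estimating a concept's occurrence probability from text search over annotated training shots can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy shots, the term-overlap score, and the top-k cutoff stand in for a real text retrieval system and are not the estimator used in the paper.

```python
from collections import Counter

# Hypothetical annotated training shots: shot text plus the concepts marked as present.
SHOTS = [
    {"text": "street basketball game in the city", "concepts": {"Outdoor", "Sports"}},
    {"text": "basketball players on a street court", "concepts": {"Outdoor", "Sports", "People"}},
    {"text": "indoor basketball arena final", "concepts": {"Sports", "People"}},
    {"text": "news anchor in the studio", "concepts": {"People", "Studio"}},
]


def score(query, text):
    """Toy text score (assumption): number of query-term occurrences in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())


def estimate_concept_probabilities(query, shots, k=2):
    """Treat the top-k matching shots as relevant and return, per concept,
    the fraction of those shots in which the concept occurs."""
    scored = [(score(query, shot["text"]), shot) for shot in shots]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = [shot for s, shot in scored[:k] if s > 0]
    if not top:
        return {}
    freq = Counter(concept for shot in top for concept in shot["concepts"])
    return {concept: n / len(top) for concept, n in freq.items()}


probs = estimate_concept_probabilities("street basketball", SHOTS, k=2)
```

Concepts with a high estimated probability (here "Outdoor") would be candidates for inclusion in the search, while concepts absent from the top-ranked shots would not.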

[download pdf]

Two Twente-Yahoo papers at SIGIR 2009

Both Pavel Serdyukov and Claudia Hauff have joint papers accepted for the 32nd Annual ACM SIGIR Conference in Boston, USA.

Placing Flickr Photos on a Map
by Pavel Serdyukov, Vanessa Murdock (Yahoo!), and Roelof van Zwol (Yahoo!)

In this paper we investigate generic methods for placing photos uploaded to Flickr on the world map. As primary input for our methods we use the textual annotations provided by the users to predict the single most probable location where the image was taken. Central to our approach is a language model based entirely on the annotations provided by users. We define extensions that improve over the language model using tag-based smoothing, cell-based smoothing, and spatial ambiguity. Further, we demonstrate how to incorporate GeoNames, a large external database of locations. For varying levels of granularity, we are able to place images on a map with at least twice the precision of the state-of-the-art reported in the literature.
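A minimal sketch of the basic idea, a per-cell language model over tags that picks the most probable cell for a photo, might look as follows. The cell names, the toy tags, and the simple interpolation with a background model (weight `lam`) are assumptions for illustration; the paper's tag-based and cell-based smoothing and its use of GeoNames are not reproduced here.

```python
import math
from collections import Counter

# Hypothetical training data: grid cell id -> tags of photos taken in that cell.
CELL_TAGS = {
    "amsterdam": ["canal", "bike", "museum", "canal", "tulip"],
    "paris": ["eiffel", "museum", "louvre", "seine", "eiffel"],
    "london": ["thames", "museum", "bridge", "bus", "bridge"],
}


def build_models(cell_tags):
    """Return per-cell tag counts and the background (all cells) tag counts."""
    cell_counts = {cell: Counter(tags) for cell, tags in cell_tags.items()}
    background = Counter()
    for counts in cell_counts.values():
        background.update(counts)
    return cell_counts, background


def log_likelihood(tags, counts, background, lam=0.8):
    """Log P(tags | cell), interpolating the cell model with the background model."""
    cell_total = sum(counts.values())
    bg_total = sum(background.values())
    ll = 0.0
    for tag in tags:
        p_bg = background[tag] / bg_total
        if p_bg == 0:
            continue  # tag never seen in training: carries no location evidence
        p_cell = counts[tag] / cell_total
        ll += math.log(lam * p_cell + (1 - lam) * p_bg)
    return ll


def place_photo(tags, cell_counts, background):
    """Return the single most probable cell for a photo's tags."""
    return max(cell_counts, key=lambda c: log_likelihood(tags, cell_counts[c], background))


cell_counts, background = build_models(CELL_TAGS)
best_cell = place_photo(["museum", "canal"], cell_counts, background)
```

The ambiguous tag "museum" occurs in every cell, so the decision is driven by "canal", which is the kind of effect smoothing is meant to handle gracefully.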

Efficiency trade-offs in two-tier web search systems
by Ricardo Baeza-Yates (Yahoo!), Vanessa Murdock (Yahoo!), and Claudia Hauff

Search engines rely on searching multiple partitioned corpora to return results to users in a reasonable amount of time. In this paper we analyze the standard two-tier architecture for Web search with the difference that the corpus to be searched for a given query is predicted in advance. We show that any predictor better than random yields time savings, but this decrease in the processing time yields an increase in the infrastructure cost. We provide an analysis and investigate this trade-off in the context of two different scenarios on real-world data. We demonstrate that in general the decrease in answer time is justified by a small increase in infrastructure cost.

See: list of accepted papers at SIGIR'09.

Kien Tjin-Kam-Jet graduates on result merging for distributed information retrieval

Centralized Web search has difficulties with crawling and indexing the Visible Web. The Invisible Web is estimated to contain much more content, and this content is even more difficult to crawl. Metasearch, a form of distributed search, is a possible solution. However, a major problem is how to merge the results from several search engines into a single result list. We train two types of Support Vector Machines (SVMs): a regression model and a preference classification model. Round Robin (RR) is used as our merging baseline. We varied the number of search engines being merged, the selection policy, and the document collection size of the engines. Our findings show that RR is the fastest method and that, in a few cases, it performs as well as regression-SVM. Both SVM methods are much slower but, judging by merging performance, regression-SVM is the best of the three methods. The choice of which method to use depends strongly on the usage scenario. In most cases, we recommend using regression-SVM.
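For reference, the Round Robin baseline can be sketched in a few lines: take the first result of each engine, then the second of each, and so on. Skipping documents that were already returned by another engine is an assumption of this sketch, as is the convention that document identifiers are never `None`.

```python
from itertools import zip_longest


def round_robin_merge(result_lists):
    """Interleave ranked lists rank by rank, keeping the first copy of each document."""
    merged, seen = [], set()
    # zip_longest pads shorter lists with None so longer lists are still exhausted.
    for rank_slice in zip_longest(*result_lists):
        for doc in rank_slice:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


merged = round_robin_merge([["a", "b", "c"], ["b", "d"], ["e"]])
```

Round Robin needs no training and no document scores, which explains its speed; the SVM approaches instead learn how to compare results across engines.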

[download pdf]

Jobs: Three PhD student positions

Position: Distributed Information Retrieval

The Database Group of the University of Twente offers a job opening in the NWO Vidi Project “Distributed Information Retrieval by means of Keyword Auctions”. The project's aim is to distribute internet search functionality in such a way that communities of users and/or federations of small search systems provide search services in a collaborative way. Instead of getting all data to a centralized point and processing queries centrally, as is done by today's search systems, the project will distribute queries over many small autonomous search systems and process them locally. In this project, the PhD student will research a new approach to distributed search: distributed information retrieval by means of keyword auctions. Keyword auctions like Google's AdWords give advertisers the opportunity to provide targeted advertisements by bidding on specific keywords. Analogous to these keyword auctions, local search systems will bid for keywords at a central broker. They “pay” by serving queries for the broker. The broker will send queries to those local search systems that optimize the overall effectiveness of the system, i.e., local search systems that are willing to serve many queries, but are also able to provide high quality results. The PhD student will work within a small team of researchers that approaches the problem from three different angles: 1) modeling the local search system, including models for automatic bidding and multi-word keywords, 2) modeling the search broker's optimization using the bids, the quality of the answers, and click-through rates, and 3) integration of structured data typically available behind web forms of local search systems with text search.
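To make the broker's role concrete, here is a purely illustrative sketch in which the broker ranks local search systems by the product of their bid and an estimated result quality. The scoring rule, the data layout, and all names are assumptions; the project description does not specify how bids and quality are combined.

```python
def select_peers(keyword, bids, quality, k=2):
    """Pick up to k local search systems for a keyword.

    bids:    peer -> {keyword: number of queries the peer offers to serve}
    quality: peer -> estimated result quality in [0, 1]
    The bid * quality score is a hypothetical rule, not the project's model.
    """
    scored = [
        (bids[peer].get(keyword, 0) * quality.get(peer, 0.0), peer)
        for peer in bids
    ]
    # Peers that did not bid on the keyword (score 0) are never selected.
    ranked = sorted((pair for pair in scored if pair[0] > 0), reverse=True)
    return [peer for _, peer in ranked[:k]]


# Hypothetical bids and quality estimates for three local search systems.
BIDS = {"A": {"jazz": 10}, "B": {"jazz": 4, "rock": 8}, "C": {"rock": 5}}
QUALITY = {"A": 0.5, "B": 0.9, "C": 0.7}

jazz_peers = select_peers("jazz", BIDS, QUALITY)
```

A rule of this shape rewards systems that are willing to serve many queries only insofar as they also return good results, which is the trade-off the broker has to optimize.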

See official announcement. (Deadline: 19 April 2009)

Two positions: PuppyIR, Information Retrieval for Children

The groups Human Media Interaction and Databases of the University of Twente offer two job openings in the European Project PuppyIR. Current Information Retrieval (IR) systems are designed for adults: they return information that is unsuitable for children, present information in lists that children find difficult to manage and make it difficult for children to ask for information. PuppyIR will create information search services that are tailored to the specific needs of children, giving children the opportunity to fully and safely exploit the power of the Internet. PuppyIR will develop new interaction paradigms to allow children to easily express their information need, to have results presented in an intuitive way and to engage children in system interaction. It will develop a set of Information Services: components to summarise textual and audiovisual content for children, to help children safely explore new information, to moderate information for children at different ages, to build new social networks and to intelligently aggregate and present information to children. PuppyIR will offer an open source platform that enables system designers to construct useful and usable information retrieval systems for children. The project will demonstrate the effectiveness of the PuppyIR modules through demonstrator systems constructed in collaboration with the Netherlands Public Library Association and the Emma Children's Hospital. At the University of Twente, a team of six senior researchers and three PhD students will cooperate in PuppyIR. One PhD student will work on user interaction design. The other two positions are described below.

Position 1: Analyzing and structuring textual information (at Human Media Interaction)

Analyzing and structuring textual information studies how natural language processing tools can assist the organization of information in a way that enables children to easily access the information. The PhD student at Human Media Interaction will focus on information extraction, text classification, and story understanding and summarization on written and spoken data, for instance for questions or comments created by children (e.g., chats, blogs) and content created explicitly for children (e.g., stories).

Position 2: Multimedia content mining (at Databases)

Multimedia content mining will develop database search technology that enables better understanding of the individual behavior of the child and consequently his/her information need. The PhD student at Databases will focus on concept retrieval, faceted search, query formulation assistance, and intuitive relevance feedback mechanisms that allow children to easily access the content of multimedia data sources, for instance for content sharing within online groups including moderated discovery.

See official announcement. (Deadline: 15 April 2009)

PuppyIR: IR for Children

As adults we are keen to help children reach their full potential. Developing children’s abilities to find and understand information is key to their development as young adults. The Internet offers children exciting new ways to meet people, learn about different cultures and develop their creative potential. In a world where the Internet and technology play as important a role as they do today, it is essential that children can assess the meaning of the information they gather and can engage with content in child-friendly ways.

However, children’s ability to use the Internet is severely hampered by the lack of appropriate search tools. Most Information Retrieval (IR) systems are designed for adults: they return information that is unsuitable for children, present information in lists that children find difficult to manage and make it difficult for children to identify the relevant parts. Worse, almost all Internet search engines confront children with inappropriate material.

PuppyIR is an FP7 project that will help children search the Internet safely and successfully by designing an open-source platform of child-friendly information services. These Information Services will be able to summarise content for children, moderate information for children, help children safely build social networks, and intelligently aggregate information for presentation to children. PuppyIR aims to facilitate the creation of child-centric information access, based on an understanding of the behaviour and needs of children. PuppyIR will provide a suite of components that system designers can use to construct usable and tailored IR systems for children, giving children the opportunity to fully exploit the Internet. PuppyIR will develop new interaction paradigms that allow children to express their information needs simply and have results presented in an intuitive way. PuppyIR will contribute to the evaluation of children’s IR systems by developing child-centred evaluation methods.

More info at: PuppyIR project page at NIRICT.

Sander Bockting graduates on collection selection for distributed web search

Using Highly Discriminative Keys, Query-driven Indexing and ColRank

Current popular web search engines, such as Google, Live Search and Yahoo!, rely on crawling to build an index of the World Wide Web. Crawling is a continuous process that keeps the index fresh and generates an enormous amount of data traffic. By far the largest part of the web remains unindexed, because crawlers are unaware of the existence of many web pages and have difficulty crawling dynamically generated content.

These problems were the main motivation to research distributed web search. We assume that web sites, or peers, can index a collection consisting of local content, but possibly also content from other web sites. Peers cooperate with a broker by sending a part of their index. Receiving indices from many peers, the broker gains a global overview of the peers’ content. When a user poses a query to a broker, the broker selects a few peers to which it forwards the query. The selected peers should be the ones most likely to produce a good result set with many relevant documents. The result sets are merged at the broker and sent to the user.

This research focuses on collection selection, which corresponds to the selection of the most promising peers. Highly discriminative keys are employed as a strategy to select those peers. A highly discriminative key is a term set which is an index entry at the broker. The key is highly discriminative with respect to the collections because the posting lists pointing to the collections are relatively small. Query-driven indexing is applied to reduce the index size by only storing index entries that are part of popular queries. A PageRank-like algorithm is also tested to assign scores to collections that can be used for ranking.

The Sophos prototype was developed to test these methods. Sophos was evaluated on different aspects, such as collection selection performance and index sizes. The performance of the methods is compared to a baseline that applies language modeling to merged documents in collections. The results show that Sophos can outperform the baseline with ad-hoc queries on a web-based test set. Query-driven indexing is able to substantially reduce index sizes at the cost of a small loss in collection selection performance. We also found large differences in the level of difficulty of answering queries on various corpus splits.
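The combination of highly discriminative keys and query-driven indexing can be illustrated with a small sketch. It is not the Sophos implementation: the toy peers, the posting-list threshold `max_postings`, the key-size limit, and the scoring by matched key size are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical peers and the terms occurring in each peer's collection.
COLLECTIONS = {
    "peer1": {"jazz", "guitar", "lessons"},
    "peer2": {"jazz", "festival", "tickets"},
    "peer3": {"guitar", "repair", "shop"},
    "peer4": {"jazz", "guitar", "festival"},
}


def build_hdk_index(collections, popular_queries, max_postings=2, max_key_size=2):
    """Build a broker index of term-set keys with short posting lists."""
    # Query-driven indexing: only term sets taken from popular queries are candidates.
    candidates = set()
    for query in popular_queries:
        terms = sorted(set(query.split()))
        for size in range(1, max_key_size + 1):
            candidates.update(frozenset(c) for c in combinations(terms, size))
    index = {}
    for key in candidates:
        postings = sorted(p for p, terms in collections.items() if key <= terms)
        # Keep a key only if it is discriminative: few collections contain it.
        if 0 < len(postings) <= max_postings:
            index[key] = postings
    return index


def select_collections(query, index, k=2):
    """Score peers by the sizes of the indexed keys contained in the query."""
    q_terms = set(query.split())
    scores = {}
    for key, postings in index.items():
        if key <= q_terms:
            for peer in postings:
                scores[peer] = scores.get(peer, 0) + len(key)
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]


index = build_hdk_index(COLLECTIONS, ["jazz guitar", "jazz festival"])
selected = select_collections("jazz guitar", index)
```

In this toy data the single terms "jazz" and "guitar" each occur in three collections and are therefore not stored, while the pair {"jazz", "guitar"} occurs in only two and becomes a key.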

More information in [E-Prints]