Archive for 2012

Query Recommendation for Children

Tuesday, September 11th, 2012, posted by Djoerd Hiemstra

by Sergio Duarte Torres, Djoerd Hiemstra, Ingmar Weber (Yahoo), Pavel Serdyukov (Yandex)

One of the biggest problems that children experience while searching the web occurs during the query formulation process. Children have been found to struggle to formulate keyword queries, given their limited vocabulary and their difficulty in choosing the right keywords. In this work we propose a method that utilizes tags from social media to suggest queries related to children's topics. Concretely, we propose a simple yet effective approach to bias a random walk defined on a bipartite graph of web resources and tags through keywords that are more commonly used to describe resources for children. We evaluate our method using a large query log sample of queries aimed at retrieving information for children. We show that our method outperforms the query suggestions of state-of-the-art search engines and state-of-the-art query suggestions based on random walks.
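To make the idea of a biased random walk on a resource/tag graph more concrete, here is a minimal sketch in Python. It is not the method from the paper: the function name `suggest_tags`, the multiplicative `boost` for child-oriented tags, the `restart` probability, and the toy data are all assumptions chosen for illustration.

```python
from collections import defaultdict


def normalize(weights):
    """Turn non-negative weights into a probability distribution."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()} if total else {}


def suggest_tags(tagged, query_tag, child_terms, boost=3.0, steps=1, restart=0.2):
    """Rank tags reachable from query_tag by a biased walk (illustrative only).

    tagged:      dict mapping resource -> {tag: count}
    child_terms: tags assumed to be typical for children's resources
    boost:       hypothetical factor favouring child_terms on resource -> tag moves
    """
    # Invert the resource -> tag mapping to get the other side of the bipartite graph.
    tag_to_resources = defaultdict(dict)
    for resource, tags in tagged.items():
        for tag, count in tags.items():
            tag_to_resources[tag][resource] = count

    tag_prob = {query_tag: 1.0}
    for _ in range(steps):
        # Move from tags to resources, proportionally to tagging counts.
        resource_prob = defaultdict(float)
        for tag, prob in tag_prob.items():
            for resource, weight in normalize(tag_to_resources.get(tag, {})).items():
                resource_prob[resource] += prob * weight
        # Move back from resources to tags, boosting child-oriented tags.
        next_prob = defaultdict(float)
        for resource, prob in resource_prob.items():
            biased = {
                tag: count * (boost if tag in child_terms else 1.0)
                for tag, count in tagged[resource].items()
            }
            for tag, weight in normalize(biased).items():
                next_prob[tag] += (1.0 - restart) * prob * weight
        next_prob[query_tag] += restart  # jump back to the starting tag
        tag_prob = dict(next_prob)

    candidates = [tag for tag in tag_prob if tag != query_tag]
    return sorted(candidates, key=lambda tag: (-tag_prob[tag], tag))


# Toy data, invented for this example.
tagged = {
    "site_a": {"dinosaurs": 3, "games": 2, "paleontology": 2},
    "site_b": {"dinosaurs": 1, "coloring": 2},
    "site_c": {"paleontology": 4, "journal": 3},
}
child_terms = {"games", "coloring"}
print(suggest_tags(tagged, "dinosaurs", child_terms))
```

With `boost=1.0` the walk is unbiased; raising it shifts probability mass toward tags in `child_terms`, which is the general effect the abstract describes.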

to be presented at the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012.

[download pdf]

Seminar on Distributing Search

Friday, September 7th, 2012, posted by Djoerd Hiemstra

The 7th SSR on Distributing Search will take place on 26 September 2012 at the University of Twente. Invited speakers are:

  • Jamie Callan (Carnegie Mellon University, USA)
  • Johan Pouwelse (Delft University of Technology)
  • Fabio Crestani (University of Lugano, Switzerland)

SSR-7 will take place at the campus of the University of Twente at the Ravelijn, lecture hall 2504. The event is sponsored by the Netherlands research School for Information and Knowledge Systems (SIKS), the Netherlands Organisation for Scientific Research (NWO), and the Centre for Telematics and Information Technology (CTIT). Please visit the SSR-7 home page for more information.

Joost Wolfswinkel graduates on enriching ontologies

Friday, August 31st, 2012, posted by Djoerd Hiemstra

Semi-Automatically Enriching Ontologies: A Case Study in the e-Recruiting Domain

by Joost Wolfswinkel

The thesis is inspired by a practical problem that was identified by Epiqo. Epiqo is an Austrian company that wants to expand to other countries within Europe and to other domains within Austria with their e-Recruiter system. For the e-Recruiter system to work, it needs domain-specific ontologies. These ontologies need to be built from the ground up by domain experts, which is a time-consuming and thus expensive endeavor. This prompted Epiqo to ask whether this could be done (semi-)automatically.

The current research presents a solution for semi-automatically enriching domain-specific ontologies. We adapt the general Ontology-Based Information Extraction (OBIE) architecture of Wimalasuriya and Dou (2010) to be more suitable for domain-specific applications by automatically generating a domain-specific semantic lexicon. We then apply this general solution to the case study of Epiqo. Based on this architecture we develop a proof-of-concept tool and perform some exploratory experiments with domain experts from Epiqo. We show that our solution has the potential to provide ontologies of sufficient quality to be comparable to standard ontologies.

[download pdf]

Shard Ranking and Cutoff Estimation for Topically Partitioned Collections

Monday, August 27th, 2012, posted by Djoerd Hiemstra

by Anagha Kulkarni, Almer Tigelaar, Djoerd Hiemstra, and Jamie Callan

Large document collections can be partitioned into topical shards to facilitate distributed search. In a low-resource search environment only a few of the shards can be searched in parallel. Such a search environment faces two intertwined challenges. First, determining which shards to consult for a given query: shard ranking. Second, determining how many shards to consult from the ranking: cutoff estimation. In this paper we present a family of three algorithms that address both of these problems. As a basis we employ a commonly used data structure, the central sample index (CSI), to represent the shard contents. Running a query against the CSI yields a flat document ranking that each of our algorithms transforms into a tree structure. A bottom-up traversal of the tree is used to infer a ranking of shards and also to estimate a stopping point in this ranking that yields cost-effective selective distributed search. Compared to a state-of-the-art shard ranking approach, the proposed algorithms provide substantially higher search efficiency while providing comparable search effectiveness.
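The following Python sketch illustrates how a flat CSI document ranking can yield both a shard ranking and a cutoff. It deliberately does not build the tree structure described above; instead it swaps in a simpler rank-decayed voting scheme. The function name `rank_shards`, the parameters `decay` and `min_vote`, and the toy data are assumptions, not taken from the paper.

```python
from collections import defaultdict


def rank_shards(csi_ranking, doc_to_shard, decay=0.5, min_vote=0.05):
    """Rank shards from a CSI result list and stop when votes become negligible.

    csi_ranking:  list of (doc_id, score), sorted by descending score
    doc_to_shard: dict mapping each sampled document to its shard
    decay:        hypothetical factor by which a vote shrinks per rank position
    min_vote:     hypothetical threshold below which the result list is cut off
    """
    votes = defaultdict(float)
    for rank, (doc_id, score) in enumerate(csi_ranking):
        vote = score * decay ** rank
        if vote < min_vote:
            # Everything further down the ranking is ignored, so only shards
            # seen so far are selected: this acts as the cutoff estimate.
            break
        votes[doc_to_shard[doc_id]] += vote
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Toy CSI result list, invented for this example.
csi_ranking = [("d1", 1.0), ("d2", 0.9), ("d3", 0.8), ("d4", 0.7), ("d5", 0.6)]
doc_to_shard = {"d1": "sports", "d2": "sports", "d3": "news",
                "d4": "science", "d5": "news"}

for shard, vote in rank_shards(csi_ranking, doc_to_shard):
    print(shard, round(vote, 4))
```

The number of shards returned is the cutoff: a stricter `min_vote` stops earlier in the CSI ranking and therefore selects fewer shards to search.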

to be presented at the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012.

[download pdf]

Federated Search in the Wild

Thursday, August 16th, 2012, posted by Djoerd Hiemstra

The Combined Power of over a Hundred Search Engines

by Dong Nguyen, Thomas Demeester (Ghent University), Dolf Trieschnigg, and Djoerd Hiemstra

Federated search has the potential of improving web search: the user becomes less dependent on a single search provider and parts of the deep web become available through a unified interface, leading to a wider variety in the retrieved search results. However, a publicly available dataset for federated search reflecting an actual web environment has been absent. As a result, it has been difficult to assess whether proposed systems are suitable for the web setting. We introduce a new test collection containing the results from more than a hundred actual search engines, ranging from large general web search engines such as Google and Bing to small domain-specific engines. We discuss the design and analyze the effect of several sampling methods. For a set of test queries, we collected relevance judgments for the top 10 results of each search engine. The dataset is publicly available and is useful for researchers interested in resource selection for web search collections, result merging and size estimation of uncooperative resources.

to be presented at the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012.

[download pdf]

Emiel Mols graduates on sharding Spotify search

Wednesday, August 15th, 2012, posted by Djoerd Hiemstra

Today, Emiel Mols graduated after presenting the master's thesis project he did at Spotify in Stockholm, Sweden. Emiel got quite some attention last year when he launched SpotifyOnTheWeb, leaving Spotify “no choice but to hire him”.

In the master's thesis, Emiel describes a prototype implementation of a term-sharded full-text search architecture. The system’s requirements are based on the use case of searching for music in the Spotify catalogue. He benchmarked the system using non-synthetic data gathered from Spotify’s infrastructure.
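As a rough illustration of what term sharding means, the Python sketch below partitions an inverted index by term, so that each term's posting list lives on exactly one shard. It is not the prototype from the thesis: the class `TermShardedIndex`, the CRC32-based shard assignment, and the toy track titles are assumptions made for this example.

```python
import zlib
from collections import defaultdict


class TermShardedIndex:
    """Inverted index partitioned by term (illustrative sketch only)."""

    def __init__(self, num_shards):
        # Each shard maps term -> set of document ids.
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def _shard_for(self, term):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        return self.shards[zlib.crc32(term.encode("utf-8")) % len(self.shards)]

    def add(self, doc_id, text):
        for term in text.lower().split():
            self._shard_for(term)[term].add(doc_id)

    def search(self, query):
        """Return documents containing all query terms (boolean AND)."""
        terms = query.lower().split()
        if not terms:
            return set()
        # Each term is looked up on the single shard that owns it.
        postings = [self._shard_for(term).get(term, set()) for term in terms]
        return set.intersection(*postings)


# Toy catalogue, invented for this example.
index = TermShardedIndex(num_shards=4)
index.add("t1", "Dancing Queen")
index.add("t2", "Killer Queen")
index.add("t3", "Dancing in the Dark")

print(sorted(index.search("queen")))
print(sorted(index.search("dancing queen")))
```

In a term-sharded design a multi-term query may touch several shards and must combine their posting lists, whereas a document-sharded design would send the whole query to every shard and merge the results.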

The thesis will be available from ePrints.

ACM SIGIR honors Norbert Fuhr

Tuesday, August 14th, 2012, posted by Djoerd Hiemstra

For pioneering contributions to approaches that now dominate the search industry, ACM SIGIR honors Norbert Fuhr from the University of Duisburg-Essen (Germany) with the 2012 Gerard Salton Award. Fuhr developed probabilistic retrieval models for databases and XML, and his research on probabilistic models anticipated the current interest in learning-to-rank approaches in search operations. Fuhr received the award at the ACM SIGIR Conference in Portland, Oregon, USA, where he gave the opening keynote address. Read more in the ACM Press release.

A framework for concept-based video retrieval

Friday, August 10th, 2012, posted by Djoerd Hiemstra

The Uncertain Representation Ranking Framework for Concept-Based Video Retrieval

by Robin Aly, Aiden Doherty (DCU, Ireland), Djoerd Hiemstra, Franciska de Jong, and Alan Smeaton (DCU, Ireland)

Concept-based video retrieval often relies on imperfect and uncertain concept detectors. We propose a general ranking framework to define effective and robust ranking functions by explicitly addressing detector uncertainty. It can cope with multiple concept-based representations per video segment and it allows the re-use of effective text retrieval functions which are defined on similar representations. The final ranking status value is a weighted combination of two components: the expected score of the possible scores, which represents the risk-neutral choice, and the scores’ standard deviation, which represents the risk or opportunity that the score for the actual representation is higher. The framework consistently improves the search performance in the shot retrieval task and the segment retrieval task over several baselines in five TRECVid collections and two collections which use simulated detectors of varying performance.
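The combination of expected score and standard deviation can be sketched in a few lines of Python. The abstract only states that the two components are combined with weights; the additive form below, the function name `ranking_status_value`, and the parameter `risk_weight` are assumptions for illustration.

```python
import math


def ranking_status_value(representations, risk_weight=0.5):
    """Combine the expected score and its standard deviation (illustrative).

    representations: list of (probability, score) pairs, one per possible
                     representation of a video segment; probabilities are
                     assumed to sum to 1
    risk_weight:     hypothetical weight; positive values reward uncertainty
                     (opportunity), negative values penalize it (risk)
    """
    expected = sum(prob * score for prob, score in representations)
    variance = sum(prob * (score - expected) ** 2 for prob, score in representations)
    return expected + risk_weight * math.sqrt(variance)


# Two equally likely representations of the same segment, invented for this example.
uncertain_segment = [(0.5, 0.0), (0.5, 2.0)]
certain_segment = [(1.0, 1.0)]

print(ranking_status_value(uncertain_segment, risk_weight=0.5))
print(ranking_status_value(certain_segment, risk_weight=0.5))
```

Both segments have the same expected score, so a risk-neutral ranking (`risk_weight=0`) treats them alike; a non-zero weight separates them according to how uncertain the detectors are.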

[more information]

STW grant for StructWeb

Friday, August 10th, 2012, posted by Djoerd Hiemstra

Wim Korevaar received a valorization grant from the Dutch Technology Foundation STW for his proposal StructWeb: Structuring the Web for Organizations. The concept is based on StructWeb, an innovative information system that builds on the latest insights in search technology and offers an intuitive user interface. The new technology will be used to help businesses and organizations structure their vast information resources and make them easier for their staff and clients to access.

More information at:

ECIR 2013 Call for Tutorials

Friday, July 6th, 2012, posted by Djoerd Hiemstra

35th ECIR
European Conference on Information Retrieval

Moscow, Russia
24-27 March 2013

The goal of the ECIR 2013 Tutorials is to offer conference attendees and local participants a stimulating and informative selection of tutorials reflecting current topics in information retrieval and related areas. Proposals are invited for tutorials of either a half day (3 hours plus breaks) or a full day (6 hours plus breaks). Each tutorial should cover a single topic in detail, addressing state-of-the-art methods in core information retrieval, related research, or novel and emerging applications. The tutorials will take place on 24 March 2013. The deadline for tutorial submissions is 16 September 2012.

More information at the ECIR 2013 web site.