The Lowlands at TREC

by Robin Aly, Djoerd Hiemstra, Dolf Trieschnigg, and Thomas Demeester

We describe the participation of the Lowlands in the Web Track and the FedWeb Track of TREC 2013. For the Web Track we used the MIREX MapReduce library with out-of-the-box approaches. For the FedWeb Track we adapted our shard selection method Taily for resource selection. Our results are above the median performance of TREC participants.

Presented at the 22nd Text REtrieval Conference (TREC) at the U.S. National Institute of Standards and Technology (NIST) in Gaithersburg, USA.

[download pdf]

Cum laude PhD degree for Sergio Duarte Torres

Sergio Duarte Torres and defense committee

Sergio Duarte Torres' PhD defense last Friday, February 14th, resulted in an exceptional PhD degree cum laude. His PhD thesis, “Information Retrieval for Children: Search Behavior and Solutions”, was written at the Database Group as part of the European project PuppyIR, a joint project with, among others, Human Media Interaction. Sergio's research shows an extraordinary diversity and heterogeneity, touching many areas of computer science, including Information Retrieval, Big Data analysis, and Machine Learning. Sergio sought cooperation with leading search engine companies in the field: Yahoo and Yandex. He did a three-month internship at Yahoo Research in Barcelona. Sergio's work is well received. His paper on vertical selection for search for children was nominated for the Best Student Paper Award at the joint ACM/IEEE conference on Digital Libraries in Indianapolis, USA. His work has been accepted by two important journals in the field: the ACM Transactions on the Web, and the Journal of the American Society for Information Science and Technology. Particularly worth mentioning is the user study with children aged 8 to 10 that Sergio carried out to evaluate the child-friendly search approaches he developed. We are proud of the achievements of Sergio Duarte Torres. He will be an excellent ambassador of the University of Twente.

[download pdf]

Ilya Markov defends PhD thesis on Distributed Information Retrieval

Today, Ilya Markov successfully defended his PhD thesis at the Università della Svizzera italiana in Lugano, Switzerland.

Uncertainty in Distributed Information Retrieval

by Ilya Markov

Large amounts of available digital information call for distributed processing and management solutions. Distributed Information Retrieval (DIR), also known as Federated Search, provides techniques for performing retrieval over such distributed data. In particular, it studies approaches to aggregating multiple searchable sources of information within a single interface.
DIR provides an efficient and low-cost solution to a distributed retrieval problem. As opposed to a centralized retrieval system, which acquires, stores and processes all available information locally, DIR delegates the search task to distributed sources. This way, DIR lowers the storage and processing costs and provides a user with up-to-date information even if this information is not crawlable (i.e., cannot be reached using hyperlinks).
DIR is usually based on a brokered architecture, in which distributed retrieval is managed by a single broker. Broker-based DIR can be divided into five steps: resource discovery, resource description, resource selection, score normalization and results presentation. Among these steps, resource description, resource selection and score normalization are actively studied within DIR research, while the resource discovery step is addressed by the database community and results presentation is studied within aggregated search.
Despite the large volume of research on resource selection and score normalization, no unified framework of developed techniques exists, which makes it difficult to apply and compare the available methods. The first goal of this dissertation is to summarize, analyze and evaluate existing resource selection and score normalization techniques within a unified framework. This should improve the understanding of available methods, reveal their underlying assumptions and limitations and describe their properties. This, in turn, will help to improve existing resource selection and score normalization techniques and to apply the right method in the right setting.
The second and main contribution of this dissertation is in stating and addressing the problem of uncertainty in DIR. In Information Retrieval (IR) this problem has been recognized for a long time, and numerous techniques have been proposed to deal with uncertainty in various IR tasks. This dissertation raises the question of uncertainty in DIR, outlines the sources of uncertainty in the different DIR phases and proposes methods for measuring and reducing this uncertainty.
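
To make the brokered architecture described in the abstract a little more concrete, here is a minimal Python sketch of a broker that runs resource description, resource selection, score normalization and a merged results list over two toy resources. It is only an illustration under stated assumptions and is not taken from the thesis: the resources, the term-count description, the term-overlap scoring and the min-max normalization are all hypothetical stand-ins, and resource discovery is skipped because the resources are given up front.

```python
from collections import Counter

# Hypothetical toy resources: each maps a document id to its text.
RESOURCES = {
    "news": {"n1": "election results and politics", "n2": "sports results today"},
    "recipes": {"r1": "pasta recipe with tomato", "r2": "tomato soup recipe"},
}

def describe(resource_docs):
    """Resource description: term counts over a resource's documents."""
    counts = Counter()
    for text in resource_docs.values():
        counts.update(text.split())
    return counts

def select(query, descriptions, k=1):
    """Resource selection: rank resources by summed query-term counts."""
    scores = {
        name: sum(desc[term] for term in query.split())
        for name, desc in descriptions.items()
    }
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]

def search(query, resource_docs):
    """Stand-in for a resource's own engine: number of matching query terms."""
    terms = set(query.split())
    hits = {doc_id: len(terms & set(text.split())) for doc_id, text in resource_docs.items()}
    return {doc_id: score for doc_id, score in hits.items() if score > 0}

def normalize(results):
    """Score normalization: min-max scale one resource's scores to [0, 1]."""
    if not results:
        return {}
    lo, hi = min(results.values()), max(results.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in results}
    return {doc_id: (score - lo) / (hi - lo) for doc_id, score in results.items()}

def broker(query, k=1):
    """Run the selected resources and merge their normalized results."""
    descriptions = {name: describe(docs) for name, docs in RESOURCES.items()}
    merged = []
    for name in select(query, descriptions, k):
        for doc_id, score in normalize(search(query, RESOURCES[name])).items():
            merged.append((score, name, doc_id))
    merged.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(name, doc_id) for _, name, doc_id in merged]
```

Calling `broker("tomato recipe")` sends the query only to the `recipes` resource, since the `news` description contains neither query term. A real broker would typically build its descriptions from sampled documents rather than from full access to each collection.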

Simple models and Big Data better than smart models

Simple models and Big Data beat smart models

Big Data, or in Dutch the more nicely alliterating “Grote Gegevens”, is a term that has been used since the beginning of this century to refer to data collections that were hard to process with the software of the time: collections many terabytes or petabytes in size. Techniques for processing and analyzing such enormous data collections were developed mainly by Google. Google's starting point: put a great many cheap machines together in large data centers, and use smart tools so that application developers and data analysts can use the entire data center for their data analyses. The data center is the new computer! Google's smart tools touch many core elements of computer science: file systems (Google File System), new programming paradigms (MapReduce), new programming languages (for example Sawzall) and new approaches to managing data (BigTable), all developed to make large data collections easily accessible. These techniques are now also available in open source variants. The best known, Hadoop, was developed in large part at Google's competitor Yahoo. At the University of Twente, these techniques have been taught in the Computer Science master's programme since 2009. Now that we are able to train on large-scale data collections, the following phenomenon arises: simple models trained on big data beat complex models based on less data…

[Read more]

Published in STAtOR 14(3-4), Vereniging voor Statistiek en Operationele Research
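
For readers unfamiliar with the MapReduce paradigm mentioned in the piece above, the following is a minimal single-machine sketch of its map, shuffle and reduce phases, using word counting as the example. It is an illustration only: the function names are hypothetical and do not correspond to the API of Hadoop or of Google's implementation, and a real system would distribute each phase over many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)

def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each call to `map_phase` looks at one document and each call to `reduce_phase` looks at one key, both phases can in principle run in parallel, which is what makes the paradigm a fit for large clusters of cheap machines.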

Exploiting User Disagreement for Web Search Evaluation

Exploiting User Disagreement for Web Search Evaluation: An experimental approach

by Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, Dolf Trieschnigg, and Chris Develder

To express a more nuanced notion of relevance as compared to binary judgments, graded relevance levels can be used for the evaluation of search results. Especially in Web search, users strongly prefer top results over less relevant results, and yet they often disagree on which are the top results for a given information need. Whereas previous works have generally considered disagreement as a negative effect, this paper proposes a method to exploit this user disagreement by integrating it into the evaluation procedure. First, we present experiments that investigate the user disagreement. We argue that, with a high disagreement, lower relevance levels might need to be promoted more than in the case where there is global consensus on the top results. This is formalized by introducing the User Disagreement Model, resulting in a weighting of the relevance levels with a probabilistic interpretation. A validity analysis is given, and we explain how to integrate the model with well-established evaluation metrics. Finally, we discuss a specific application of the model, in the estimation of suitable weights for the combined relevance of Web search snippets and pages.

To be presented at the 7th ACM Conference on Web Search and Data Mining (WSDM) in New York City, USA on 24-28 February.

[Read more]

Empirical Co-occurrence Rate Networks For Sequence Labeling

by Zhemin Zhu, Djoerd Hiemstra, Peter Apers, and Andreas Wombacher

Structured prediction has wide applications in many areas. Powerful and popular models for structured prediction have been developed. Despite their successes, they suffer from some known problems: (i) Hidden Markov models (HMMs) are generative models which suffer from the mismatch problem. It is also difficult to incorporate overlapping, non-independent features into a hidden Markov model explicitly. (ii) Conditional Markov models suffer from the label bias problem. (iii) Conditional Random Fields (CRFs) overcome the label bias problem by global normalization. But the global normalization of CRFs can be expensive, which prevents CRFs from being applied to big data. In this paper, we propose the Empirical Co-occurrence Rate Networks (ECRNs) for sequence labeling. ECRNs are discriminative models, so they overcome the problems of HMMs. ECRNs are also immune to the label bias problem even though they are locally normalized. To make the estimation of ECRNs as fast as possible, we simply use the empirical distributions as the parameter estimates. Experiments on two real-world NLP tasks show that ECRNs reduce the training time radically while obtaining accuracy competitive with the state-of-the-art models.
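
As a rough illustration of estimating parameters directly from empirical distributions, the sketch below computes a co-occurrence rate between adjacent labels from relative frequencies. It rests on assumptions the abstract does not spell out: the co-occurrence rate is taken here to be the joint probability of two adjacent labels divided by the product of their marginal probabilities, and observations are ignored entirely, so this is a simplified label-only quantity and not the ECRN model itself. The function name is hypothetical.

```python
from collections import Counter

def empirical_cr(label_sequences):
    """Estimate CR(a, b) = P(a, b) / (P(a) * P(b)) for adjacent labels.

    All probabilities are plain relative frequencies from the training
    label sequences (an assumed definition, see the text above).
    """
    unigrams = Counter()
    bigrams = Counter()
    for seq in label_sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))  # adjacent label pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (a, b): (count / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni))
        for (a, b), count in bigrams.items()
    }
```

Under this definition a value above 1 means two labels follow each other more often than independence would predict. The estimate needs only a single counting pass over the data, with no iterative optimization, which is in the spirit of the fast training the abstract describes.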

Presented at the International Conference on Machine Learning and Applications (ICMLA) in Miami, Florida

[download pdf]