Archive for the 'Paper abstracts' Category

Joost Wolfswinkel graduates on enriching ontologies

Friday, August 31st, 2012, posted by Djoerd Hiemstra

Semi-Automatically Enriching Ontologies: A Case Study in the e-Recruiting Domain

by Joost Wolfswinkel

The thesis is inspired by a practical problem identified by Epiqo, an Austrian company that wants to expand its e-Recruiter system to other countries within Europe and to other domains within Austria. For the e-Recruiter system to work, it needs domain-specific ontologies. These ontologies have to be built from the ground up by domain experts, which is a time-consuming and thus expensive endeavor. This raised the question at Epiqo of whether the ontologies could be built (semi-)automatically.

This research presents a solution for semi-automatically enriching domain-specific ontologies. We adapt the general Ontology-Based Information Extraction (OBIE) architecture of Wimalasuriya and Dou (2010) to make it more suitable for domain-specific applications by automatically generating a domain-specific semantic lexicon. We then apply this general solution to the Epiqo case study. Based on this architecture we develop a proof-of-concept tool and perform explorative experiments with domain experts from Epiqo. We show that our solution has the potential to produce ontologies of good enough quality to be comparable to standard ontologies.

[download pdf]

A framework for concept-based video retrieval

Friday, August 10th, 2012, posted by Djoerd Hiemstra

The Uncertain Representation Ranking Framework for Concept-Based Video Retrieval

by Robin Aly, Aiden Doherty (DCU, Ireland), Djoerd Hiemstra, Franciska de Jong, and Alan Smeaton (DCU, Ireland)

Concept-based video retrieval often relies on imperfect and uncertain concept detectors. We propose a general ranking framework for defining effective and robust ranking functions that explicitly address detector uncertainty. The framework can cope with multiple concept-based representations per video segment and allows the re-use of effective text retrieval functions defined on similar representations. The final ranking status value is a weighted combination of two components: the expected score over the possible scores, which represents the risk-neutral choice, and the scores’ standard deviation, which represents the risk or opportunity that the score for the actual representation is higher. The framework consistently improves search performance in the shot retrieval task and the segment retrieval task over several baselines, in five TRECVid collections and two collections that use simulated detectors of varying performance.
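The weighted combination of expected score and standard deviation can be sketched in a few lines, assuming a discrete set of possible representations with detector-derived probabilities. The parameter name b and the probability estimation are illustrative, not the paper's notation:

```python
import math

def rsv(rep_scores, rep_probs, b=0.5):
    """Ranking status value for one video segment.

    rep_scores: retrieval scores of the possible concept-based
                representations of the segment
    rep_probs:  detector-derived probabilities of those representations
    b:          weight of the uncertainty term (hypothetical parameter
                name; the paper's notation may differ)
    """
    # Risk-neutral component: the expected score
    expected = sum(p * s for p, s in zip(rep_probs, rep_scores))
    # Risk/opportunity component: the scores' standard deviation
    variance = sum(p * (s - expected) ** 2
                   for p, s in zip(rep_probs, rep_scores))
    return expected + b * math.sqrt(variance)
```

With b = 0 the function ranks by expected score alone; a positive b rewards segments whose score might be higher under the actual representation.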

[more information]

Free-Text Search versus Complex Web Forms

Thursday, January 13th, 2011, posted by Djoerd Hiemstra

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

We investigated the use of free-text queries as an alternative means for searching “behind” web forms. We conducted a user study where we evaluated our prototype free-text interface in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at completing their search tasks.

The paper will be presented in April at the 33rd European Conference on Information Retrieval (ECIR 2011) in Dublin, Ireland.

[download pdf]

Query Load Balancing in P2P Search

Monday, January 10th, 2011, posted by Djoerd Hiemstra

Query Load Balancing by Caching Search Results in Peer-to-Peer Information Retrieval Networks

by Almer Tigelaar and Djoerd Hiemstra

For peer-to-peer web search engines it is important to keep the delay between receiving a query and providing search results within an acceptable range for the end user. How to achieve this remains an open challenge. One way to reduce delays is to cache search results for queries and allow peers to access each other's caches. In this paper we explore the limitations of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that cache hit ratios of at least thirty-three percent are attainable.
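As a rough illustration of the idea, each peer keeps a bounded cache of results and falls back to the caches of other peers on a local miss. The LRU eviction policy below is a common choice, not necessarily the one simulated in the paper:

```python
from collections import OrderedDict

class ResultCache:
    """Minimal LRU cache of search results per peer (illustrative
    sketch; the paper's simulation is considerably more detailed)."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)   # mark as recently used
            return self.entries[query]
        return None

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

def lookup(query, own_cache, neighbour_caches):
    """Check the local cache first, then the caches of other peers."""
    hit = own_cache.get(query)
    if hit is not None:
        return hit
    for cache in neighbour_caches:
        hit = cache.get(query)
        if hit is not None:
            return hit
    return None  # cache miss: the query must actually be executed
```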

The paper will be presented at the 11th Dutch-Belgian Information Retrieval Workshop (DIR) on February 4 in Amsterdam.

[download pdf]

University of Twente at TREC 2010

Thursday, October 28th, 2010, posted by Djoerd Hiemstra

MapReduce for Experimental Search

by Djoerd Hiemstra and Claudia Hauff

This draft report presents preliminary results for the TREC 2010 ad-hoc web search task. We ran our MIREX system on 0.5 billion web documents from the ClueWeb09 crawl. On average, the system retrieves at least 3 relevant documents on the first result page containing 10 results, using a simple index consisting of anchor texts, page titles, and spam removal.

[download pdf]

Dolf Trieschnigg defends PhD thesis on Biomedical IR

Monday, September 6th, 2010, posted by Djoerd Hiemstra

Proof of Concept: Concept-based Biomedical Information Retrieval

by Dolf Trieschnigg

In this thesis we investigate the possibility of integrating domain-specific knowledge into biomedical information retrieval (IR). Recent decades have shown a fast-growing interest in biomedical research, reflected by an exponential growth in scientific literature. Biomedical IR is concerned with the disclosure of these vast amounts of written knowledge. It is not only important for end users, such as biologists, biochemists, and bioinformaticians searching directly for relevant literature, but also plays an important role in more sophisticated knowledge discovery. An important problem for biomedical IR is dealing with the complex and inconsistent terminology encountered in biomedical publications. Multiple synonymous terms can be used for a single biomedical concept, such as a gene or disease. Conversely, a single term can be ambiguous and refer to multiple concepts. Dealing with this terminology problem requires domain knowledge stored in terminological resources: controlled indexing vocabularies and thesauri. The integration of this knowledge into modern word-based information retrieval is, however, far from trivial. This thesis investigates the problem of handling biomedical terminology along three research themes.

The first research theme deals with robust word-based retrieval. Effective retrieval models commonly use a word-based representation for retrieval. As so many spelling variations are present in biomedical text, the way in which these word-based representations are obtained affects retrieval effectiveness. We investigated the effect of choices in document preprocessing heuristics on retrieval effectiveness. This investigation included stop-word removal, stemming, different approaches to breakpoint identification and normalisation, and character n-gramming. In particular, breakpoint identification and normalisation (that is, determining word parts in biomedical compounds) showed a strong effect on retrieval performance. A combination of effective preprocessing heuristics was identified and used to obtain word-based representations from text for the remainder of this thesis.
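A minimal sketch of such a preprocessing pipeline, with an illustrative breakpoint heuristic and a toy stop-word list; the thesis evaluates these choices far more carefully:

```python
import re

STOPWORDS = {"the", "of", "in", "and", "a"}  # tiny illustrative list

def normalise_breakpoints(token):
    """Split biomedical compounds at likely breakpoints, e.g. hyphens,
    slashes, and letter-digit boundaries ('nf-kappab2' becomes
    ['nf', 'kappab', '2']). A heuristic sketch, not the thesis's
    exact strategy."""
    out = []
    for part in re.split(r"[-/]", token):
        # separate alphabetic runs from digit runs
        out.extend(re.findall(r"[a-z]+|\d+", part))
    return out

def preprocess(text):
    """Lowercase, split compounds at breakpoints, drop stop words."""
    words = []
    for tok in re.findall(r"\S+", text.lower()):
        words.extend(normalise_breakpoints(tok))
    return [w for w in words if w not in STOPWORDS]
```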

The second research theme deals with concept-based retrieval. We investigated two representation vocabularies for concept-based indexing, one based on the Medical Subject Headings thesaurus, the other based on the Unified Medical Language System metathesaurus extended with a number of gene and protein dictionaries.

We investigated the following five topics.

  1. How documents are represented in a concept-based representation.
  2. To what extent such a document representation can be obtained automatically.
  3. To what extent a text-based query can be automatically mapped onto a concept-based representation and how this affects retrieval performance.
  4. To what extent a concept-based representation is effective in representing information needs.
  5. How the relationship between text and concepts can be used to determine the relatedness of concepts.

We compared different classification systems to obtain concept-based document and query representations automatically. We proposed two classification methods based on statistical language models, one based on K-Nearest Neighbours (KNN) and one based on Concept Language Models (CLM).

For a selection of classification systems we carried out a document classification experiment in which we investigated to what extent automatic classification could reproduce manual classification. The proposed KNN system performed well in comparison to the out-of-the-box systems. Manual analysis indicated the improved exhaustiveness of automatic classification over manual classification. Retrieval based on concepts alone was demonstrated to be significantly less effective than word-based retrieval. This deteriorated performance could be explained by errors in the classification process, limitations of the concept vocabularies, and the limited exhaustiveness of the concept-based document representations. Retrieval based on a combination of word-based and automatically obtained concept-based query representations did, however, significantly improve over word-only retrieval. In an artificial setting, we compared the optimal retrieval performance obtainable with word-based and concept-based representations. Contrary to our intuition, on average a single word-based query performed better than a single concept-based representation, even when the best concept term precisely represented part of the information need.

We investigated to what extent the relatedness between pairs of concepts as indicated by human judgements could be automatically reproduced. Results on a small test set indicated that a method based on comparing concept language models performed particularly well in comparison to systems based on taxonomy structure, information content and (document) association.

In the third and last research theme of this thesis we propose a framework for concept-based retrieval. We approached the integration of domain knowledge in monolingual information retrieval as a cross-lingual information retrieval (CLIR) problem. Two languages were identified in this monolingual setting: a word-based representation language based on free text, and a concept-based representation language based on a terminological resource. Similar to what is common in traditional CLIR, queries and documents are translated into the same representation language and matched. The cross-lingual perspective gives us the opportunity to adopt a large set of established CLIR methods and techniques for this domain. In analogy to established CLIR practice, we investigated translation models based on a parallel corpus containing documents in multiple representations and translation models based on a thesaurus. Surprisingly, even the integration of very basic translation models showed improvements in retrieval effectiveness over word-only retrieval. A translation model based on pseudo-feedback translation was shown to perform particularly well. We proposed three extensions to a basic cross-lingual retrieval model which, similar to previous approaches in established CLIR, improved retrieval effectiveness by combining multiple translation models. Experimental results indicate that, even when using very basic translation models, monolingual biomedical IR can benefit from a cross-lingual approach to integrate domain knowledge.
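The translation step of this cross-lingual view can be illustrated with a minimal sketch, assuming a word-to-concept translation model whose probabilities have been estimated elsewhere (for instance from a parallel corpus of dual-representation documents). The concept identifiers and parameter names here are hypothetical:

```python
def translate_query(word_query, translation_probs, threshold=0.01):
    """Translate a word-based query into a concept-based one using a
    word-to-concept translation model (an illustrative sketch of the
    cross-lingual approach, not the thesis's retrieval model).

    word_query:        list of query words
    translation_probs: dict mapping word -> {concept: P(concept | word)}
    threshold:         drop very unlikely translations
    """
    concept_weights = {}
    for word in word_query:
        for concept, p in translation_probs.get(word, {}).items():
            if p >= threshold:
                concept_weights[concept] = concept_weights.get(concept, 0.0) + p
    return concept_weights
```

The resulting concept weights can then be matched against concept-based document representations, just as a translated query is matched in traditional CLIR.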

[download pdf]

A Cross-lingual Framework for Monolingual Biomedical Information Retrieval

Tuesday, August 24th, 2010, posted by Djoerd Hiemstra

by Dolf Trieschnigg, Djoerd Hiemstra, Franciska de Jong, and Wessel Kraaij

An important challenge for biomedical information retrieval (IR) is dealing with the complex, inconsistent and ambiguous biomedical terminology. Frequently, a concept-based representation defined in terms of a domain-specific terminological resource is employed to deal with this challenge. In this paper, we approach the incorporation of a concept-based representation in monolingual biomedical IR from a cross-lingual perspective. In the proposed framework, this is realized by translating and matching between text and concept-based representations. The approach allows for deployment of a rich set of techniques proposed and evaluated in traditional cross-lingual IR. We compare six translation models and measure their effectiveness in the biomedical domain. We demonstrate that the approach can result in significant improvements in retrieval effectiveness over word-based retrieval. Moreover, we demonstrate increased effectiveness of a cross-lingual IR framework for monolingual biomedical IR if basic translations models are combined.

The paper will be presented at the 19th ACM International Conference on Information and Knowledge Management on October 26-30 in Toronto, Canada.

Bertold van Voorst graduates on collection selection using database clustering

Monday, July 26th, 2010, posted by Djoerd Hiemstra

Cluster-based collection selection in uncooperative distributed information retrieval

by Bertold van Voorst

The focus of this research is collection selection for distributed information retrieval. The collection descriptions necessary for selecting the most relevant collections are often created from information gathered by random sampling. Collection selection based on such an incomplete index, instead of a full index, leads to inferior results.

In this research we propose to use collection clustering to compensate for the incompleteness of the indexes. When collection clustering is used, we select not only the collections that are considered relevant based on their collection descriptions, but also collections that have similar content in their indexes. Most existing clustering algorithms require the number of clusters to be specified prior to execution. We describe a new clustering algorithm that allows us to specify the sizes of the produced clusters instead of their number.
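A greedy sketch conveys what size-constrained clustering means; this is an illustration under simplifying assumptions (pairwise similarities keyed on ordered pairs, first remaining collection as seed), not the algorithm of the thesis:

```python
def cluster_with_sizes(collections, similarities, sizes):
    """Greedy size-constrained clustering: seed one cluster per
    requested size, then fill each cluster up to its capacity with
    the collections most similar to its seed.

    collections:  list of collection identifiers
    similarities: dict mapping ordered (a, b) pairs to a similarity
    sizes:        desired size of each cluster, e.g. [3, 3, 2]
    """
    assert sum(sizes) == len(collections)
    remaining = list(collections)
    clusters = []
    for size in sizes:
        seed = remaining.pop(0)
        cluster = [seed]
        while len(cluster) < size:
            # add the remaining collection most similar to the seed
            best = max(remaining,
                       key=lambda c: similarities.get((seed, c), 0.0))
            remaining.remove(best)
            cluster.append(best)
        clusters.append(cluster)
    return clusters
```

Unlike k-means, the number and sizes of clusters are inputs rather than emergent properties, which is the property the thesis exploits.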

Our experiments show that collection clustering can indeed improve the performance of distributed information retrieval systems that use random sampling. There is not much difference in retrieval performance between our clustering algorithm and the well-known k-means algorithm. We suggest using our algorithm because it is more scalable.

[download pdf]

Let’s quickly test this on 12TB of data

Thursday, June 24th, 2010, posted by Djoerd Hiemstra

MapReduce for Information Retrieval Evaluation

by Djoerd Hiemstra and Claudia Hauff

We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low-cost machines to search a web crawl of 0.5 billion pages, showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers.
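The sequential-scanning idea fits MapReduce naturally: the map step scores each document against all test queries, and the reduce step keeps the best documents per query. A toy sketch with naive term-count scoring; MIREX itself runs on Hadoop with more refined retrieval models:

```python
def map_fn(doc_id, text, queries):
    """Map step: score one document against every test query.
    Scoring here is a naive term count, purely for illustration."""
    terms = text.lower().split()
    for qid, query in queries.items():
        score = sum(terms.count(t) for t in query.lower().split())
        if score > 0:
            yield qid, (doc_id, score)

def reduce_fn(qid, scored_docs, k=10):
    """Reduce step: keep the top-k documents for one query."""
    ranking = sorted(scored_docs, key=lambda d: d[1], reverse=True)[:k]
    return qid, ranking
```

Because every document is scanned anyway, changing the retrieval model only means changing `map_fn`; no index needs to be rebuilt.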

The paper will be presented at the CLEF 2010 Conference on Multilingual and Multimodal Information Access Evaluation on 20-23 September 2010 in Padua, Italy.

Tangible Information Retrieval for Children

Sunday, May 16th, 2010, posted by Djoerd Hiemstra

by Michel Jansen, Wim Bos, Paul van der Vet, Theo Huibers and Djoerd Hiemstra

Despite several efforts to make search engines more child-friendly, children still have trouble using systems that require keyboard input. We present TeddIR: a system using a tangible interface that allows children to search for books by placing tangible figurines and books they like/dislike in a green/red box, causing relevant results to be shown on a display. This way, issues with spelling and query formulation are avoided. A fully functional prototype was built and evaluated with children aged 6-8 at a primary school. The children understood TeddIR to a large extent and enjoyed the playful interaction.

TeddIR in the set-up used during evaluation.

TeddIR will be presented at the 9th International Conference on Interaction Design and Children, Barcelona, June 9-11, 2010.

[download pdf]