October 10th, 2011, posted by Djoerd Hiemstra
Effective Focused Retrieval by Exploiting Query Context and Document Structure
by Rianne Kaptein
The classic IR (Information Retrieval) model of the search process consists of three elements: query, documents and search results. A user looking to fulfill an information need formulates a query usually consisting of a small set of keywords summarizing the information need. The goal of an IR system is to retrieve documents containing information which might be useful or relevant to the user. Throughout the search process there is a loss of focus, because keyword queries entered by users often do not suitably summarize their complex information needs, and IR systems do not sufficiently interpret the contents of documents, leading to result lists containing irrelevant and redundant information. The main research question of this thesis is how to exploit query context and document structure to provide more focused retrieval.
More info
Posted in Uncategorized | Comments Off
September 22nd, 2011, posted by Djoerd Hiemstra
Scientific programmer: folktale search and visualisation
The FACT project will investigate new possibilities for humanities researchers (folktale researchers, narratologists, documentalists, etc.) to study folktales based on annotations and relations that have been automatically assigned using data-driven methods. The Dutch Folktale Database (Nederlandse Volksverhalenbank) of the Meertens Institute is a very large and varied collection of Dutch Folktales. Within FACT, software will be developed to automatically enrich the folktales in this collection with metadata such as names, keywords, genre, a summary and type. An additional research goal is to investigate if automatic analysis of the folktale collection can reveal relations between folktales that are difficult to discover through human inspection. The annotation and clustering methods to be developed will be integrated in a user-friendly XML-based platform for the annotation and exploration of folktales, to support research on the variability of human oral and written transmission.
The University of Twente has vacancies for a PhD-student, a postdoc and a scientific programmer, who will be working together as a team to achieve the project goals. In addition there will be close cooperation with the Tunes & Tales project (funded under the Computational Humanities programme of KNAW) that is aimed at investigating sequences of motifs in, and variability of, melodies and folktales in oral transmission.
The scientific programmer will work on the development of user-friendly tools for folktale researchers that incorporate the annotation and clustering techniques developed by the postdoc and the PhD student. The annotation tool should allow for (semi-)automatic annotation of folktales with language, genre, keywords, names, summary and type. The visualization tool should enable easy inspection of document clusters. In addition, the programmer will develop an XML-based search system that allows the general public to search for folktales in the Folktale Database based on their annotations.
Apply on-line (Deadline: 1 November 2011)
Posted in Cultural heritage | Comments Off
August 18th, 2011, posted by Djoerd Hiemstra
by Sergio Duarte Torres and Ingmar Weber (Yahoo! Research)
The Internet has become an important part of the daily
life of children as a source of information and leisure
activities. Nonetheless, given that most of the content available
on the web is aimed at the general public, children are
constantly exposed to inappropriate content, either because the
language goes beyond their reading skills, their attention
span differs from that of grown-ups, or simply because the content
is not targeted at children, as is the case for ads and adult
content. In this work we employed a large query log sample
from a commercial web search engine to identify the
struggles and search behavior of users ranging from children aged 6
to young adults aged 18. Concretely, we hypothesized
that the large and complex volume of information to which
children are exposed leads to ill-defined searches and to
disorientation during the search process. For this purpose, we
quantified their search difficulties based on query metrics
(e.g. fraction of queries posed in natural language), session
metrics (e.g. fraction of abandoned sessions) and click
activity (e.g. fraction of ad clicks). We also used the search
logs to retrace stages of child development. Concretely we
looked for changes in the user interests (e.g. distribution of
topics searched), language development (e.g. readability of
the content accessed) and cognitive development (e.g.
sentiment expressed in the queries) among children and adults.
We observed that these metrics clearly demonstrate an
increased level of confusion and unsuccessful search sessions
among children. We also found a clear relation between the
reading level of the clicked pages and the demographic
characteristics of the users, such as age and average educational
attainment of the zone in which the user is located.

The paper will be presented at the 20th ACM
International Conference on Information and Knowledge
Management (CIKM) in Glasgow, 24-28 October 2011.
[download pdf]
Posted in IR for children | Comments Off
July 27th, 2011, posted by Djoerd Hiemstra
This year’s SIGIR best paper award was presented to Mikhail Ageev (Moscow State University), and Qi Guo, Dmitry Lagun, and Eugene Agichtein (Emory University) for their paper Find It If You Can: A Game for Modeling Different Types of Web Search Success Using Interaction Data in which they propose a principled formalization of different types of success for informational search, and a scalable game-like infrastructure for crowdsourcing search behavior studies.
The best student paper award went to Shuang-Hong Yang (Georgia Institute of Technology),
Bo Long and Alexander J. Smola (Yahoo! Labs), Hongyuan Zha (Georgia Institute of Technology), and
Zhaohui Zheng (Yahoo! Labs Beijing) for their paper Collaborative Competitive Filtering: Learning Recommender using Context of User Choice. The paper proposes Collaborative Competitive Filtering (CCF), a framework for learning user preferences by modeling the choice process in recommender systems.
There were honorable mentions for the papers: Parameterized Concept Weighting in Verbose Queries, Understanding Re-finding Behaviour in Naturalistic Email Interaction Log, Out of sight, not out of mind: On the effect of social and physical detachment on information need, Enhanced Results for Web Search, and Recommending Ephemeral Items at Web Scale.
Posted in Conference & Workshop | Comments Off
July 25th, 2011, posted by Djoerd Hiemstra
The Dutch Belgian Database Day (DBDBD) will be in Twente this year on
2 December 2011. The DBDBD is a yearly one-day workshop
organized by a Belgian or Dutch university, whose general topic is
database research. DBDBD invites submissions (1 page abstract) on a
broad range of database and database-related topics, including but
not limited to data storage and management, theoretical database
issues, database performance, data integration, data mining, data
security, and data search.
At the DBDBD, junior researchers from the Netherlands and Belgium
can present their recent results, and meet senior researchers in
the field of databases. It is an excellent opportunity to meet up
with your Belgian/Dutch colleagues, and to get informed about the
(recent) database-related research performed in Belgian/Dutch
universities. The workshop is also open to non-Belgian/Dutch
participants (presentations are in English). The workshop
consists of oral presentations. There are no printed proceedings.
Abstracts of talks will be published on the workshop’s website.
The keynote speaker at the DBDBD will be Prof. Stefano Ceri from Politecnico
di Milano, Italy.
See the call for abstracts on the DBDBD 2011 web site.
Posted in Conference & Workshop, SIKS | 1 Comment »
July 1st, 2011, posted by Djoerd Hiemstra
FACT, Folktales As Classifiable Texts, is a project funded by the NWO Catch program. In the FACT project, the HMI group and DB group of the University of Twente will cooperate with the Meertens Institute to study new possibilities for researchers from humanities disciplines (folktale and narratology researchers, documentalists, etc.) to explore folktales based on annotations and links generated by data-driven methods. To this end, FACT will develop software enabling the computer to automatically enrich a corpus of Dutch folktales with metadata such as names, genre, type, and a summary. In addition, FACT represents the first effort to systematically apply and evaluate various clustering techniques on a very large (40,000+) and diverse collection of folktales. The algorithms developed in the project will be integrated in a user-friendly platform that supports annotation as well as exploratory research into variability in oral and written transmission, using XML database technology to model all folktale data (both annotations and the text of the tale itself) in one unifying framework. A large part of the scientific research in FACT will deal with the pros and cons of human classification and computerized clustering to investigate variation in (oral) transmission. By using document clustering, we hope to discover relationships between documents that cannot be readily identified by human annotators. The main challenge will be to make the computer decide which texts are related and which are not. This is not a black-or-white issue: folktales may be related to each other on different dimensions and to varying degrees. Will the computer be able to recognize the cultural DNA of tales, and make a distinction between different types (no kinship) and versions of the same type (kinship)?
See also: Nederlandse Volksverhalenbank.
Posted in Cultural heritage | Comments Off
June 6th, 2011, posted by Djoerd Hiemstra
Simulating the future of concept-based video retrieval under improved detector performance
by Robin Aly, Djoerd Hiemstra, Franciska de Jong and Peter Apers
In this paper we address the following important questions for concept-based video retrieval: (1) What is the impact of detector performance on the performance of concept-based retrieval engines, and (2) will these engines be applicable to real-life search tasks if detector performance improves in the future? We use Monte Carlo simulations to answer these questions. To generate the simulation input, we propose to use a probabilistic model of two Gaussians for the confidence scores that concept detectors emit. Modifying the model’s parameters affects the detector performance and the search performance. We study the relation between these two performances on two video collections. For detectors with similar discriminative power and a concept vocabulary of around 100 concepts, the simulation reveals that in order to achieve a search performance of 0.20 mean average precision (MAP), which is considered sufficient performance for real-life applications, one needs detectors with at least 0.60 MAP. We also find that, given our simulation model and low detector performance, MAP is not always a good evaluation measure for concept detectors since it is not strongly correlated with the search performance.
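To make the two-Gaussian idea concrete, here is a minimal sketch of the detector side only. It is not the paper's simulation: the collection size, the number of positive shots, the means, the shared standard deviation and all function names are assumptions chosen for illustration, and the search-performance half of the study is not modelled. Confidence scores for shots that contain a concept are drawn from one Gaussian, scores for all other shots from a second one, and moving the two means apart raises the detector's average precision.

```python
import random


def simulate_scores(n_pos, n_neg, mu_pos, mu_neg, sigma, rng):
    """Draw confidence scores from two Gaussians: one for shots that
    contain the concept (True) and one for shots that do not (False)."""
    scored = [(rng.gauss(mu_pos, sigma), True) for _ in range(n_pos)]
    scored += [(rng.gauss(mu_neg, sigma), False) for _ in range(n_neg)]
    return scored


def average_precision(scored):
    """Average precision of the ranking induced by the scores."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_ap(mu_pos, runs=50, seed=0):
    """Monte Carlo estimate: average the AP over repeated random draws.
    All sizes and parameters below are illustrative assumptions."""
    rng = random.Random(seed)
    aps = [
        average_precision(simulate_scores(50, 950, mu_pos, 0.0, 1.0, rng))
        for _ in range(runs)
    ]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    # A larger distance between the two means gives a better simulated detector.
    for mu_pos in (0.5, 1.0, 2.0, 3.0):
        print(f"positive-class mean {mu_pos:.1f}: detector AP ~ {mean_ap(mu_pos):.2f}")
```

In this sketch a single parameter, the mean of the positive class, controls detector quality. A fuller simulation would also vary the standard deviations and feed the simulated scores into a retrieval model to measure search performance.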
This article is published with open access at Springer.com
[download pdf]
Posted in Multimedia Search | No one commented »
May 17th, 2011, posted by Djoerd Hiemstra
by Saskia Akkersdijk, Merel Brandon, Hanna Jochmann-Mannak, Djoerd Hiemstra, and Theo Huibers
Recent work shows that children are quite capable of searching with Google, due to their familiarity with the interface. However, children do have difficulties with the vertical list representation of the results. In this paper, we present an alternative result representation for a touch interface, the ImagePile. The ImagePile displays the results as a pile of images that the user navigates by swiping horizontally. This representation was tested on a search engine for the Emma child hospital’s library. In a within-subject experiment, both representations were tested with children to compare their usability. The vertical representation was perceived as easier to use, but the ImagePile system was considered more fun to use. Also, with the ImagePile system the children chose more relevant results and were more aware of the number of results.
[download pdf]
Posted in Course IR, Photos, IR for children | No one commented »
May 17th, 2011, posted by Djoerd Hiemstra

The university’s main entrance honours the FC
Posted in Photos | No one commented »
May 16th, 2011, posted by Djoerd Hiemstra
The Database Group of the University of Twente offers a PhD student position in the Dutch national project COMMIT, a 100M Euro project involving 10 universities and 70 companies. The program brings together leading researchers in search engines, parallel computing, databases, interaction in context, embedded systems and knowledge technology.
A large part of the web, the invisible web or deep web, cannot be indexed by web crawlers, for instance dynamic web pages that are returned in response to filling in a web form, or performing a search in a search engine. Instead of crawling deep web data, the approach will monitor web pages for certain (types of) queries. The objective is to develop approaches for monitoring web data that allow users to see a page’s full history of relevant/important changes by identifying entities: people, organizations, products, geographic locations, events, etc. The approach should relate changes in multiple web sites, giving the user a data-warehouse-like overview of the pages they monitor; drilling down to time periods, persons, events, etc.
The research will be done in co-operation with WCC. WCC, started in 1996, is a successful software company based in Utrecht (NL) and Reston (USA). WCC’s current focus areas are the Employment and Identification Security markets. Both commercial and government customers worldwide use WCC’s smart search & match solutions to support their primary processes. Both WCC and the Database Group of the University of Twente have made significant advances in entity matching and entity ranking, applied for instance to Employment Matching and Expert Search. This project will extend this work to the monitoring of deep web pages, such as social networking sites, micro-blogging sites, job sites, etc. The candidate will spend part of the time at WCC in Utrecht.
[official vacancy text] (deadline: July 3rd, 2011)
Posted in Uncategorized | Comments Off