Archive for 2011

AXES at TRECVid 2011

Friday, December 23rd, 2011, posted by Djoerd Hiemstra

by Kevin McGuinness, Robin Aly, et al.

The AXES project participated in the interactive known-item search task (KIS) and the interactive instance search task (INS) for TRECVid 2011. We used the same system architecture and a nearly identical user interface for both the KIS and INS tasks. Both systems made use of text search on ASR, visual concept detectors, and visual similarity search. The user experiments were carried out with media professionals and media students at the Netherlands Institute for Sound and Vision, with media professionals performing the KIS task and media students participating in the INS task. This paper describes the results and findings of our experiments.

[download pdf]

New team member: Mohammad Khelghati

Thursday, December 15th, 2011, posted by Djoerd Hiemstra
Mohammad Khelghati joined the database group to work on deep web entity monitoring. Welcome Mohammad!

DIR 2012 in beautiful Ghent

Wednesday, December 14th, 2011, posted by Djoerd Hiemstra

Ghent

On February 23rd and 24th 2012, the “12th Dutch-Belgian Information Retrieval Workshop” (DIR 2012) will be organised at the Ghent University conference center Het Pand. One of the primary goals of this year’s edition of DIR is to create an informal meeting place between the major IR research groups and related companies in Belgium and the Netherlands, to exchange information and to present innovative research developments. If your field of expertise is situated in the broad area of Information Retrieval, you are warmly invited to submit a “short paper” (4 pages) containing new research results, or a “compressed contribution” (2 pages) from a recent high-standard conference or journal paper. For companies active in this area, we offer the opportunity to submit demo papers, focusing on novel IR technology and applications. All further information regarding DIR 2012 can be found on http://dir2012.intec.ugent.be/.

2011’s DB colloquium

Monday, December 12th, 2011, posted by Djoerd Hiemstra
Below you will find last year’s DB colloquia, usually held on Tuesdays from 13:45h to 14:30h in ZI-3126.
6 January 2011 (Thursday at 15.00 h.)
Andreas Wombacher
Data driven inter-model conformance checking
Sensor data document changes in the physical world, which can be understood based on metadata modelling part of the physical world. In the digital world, information systems handle the exchange of information between different actors, where some of that information relates to physical objects. Since these objects are potentially the same objects observed by the sensors, the sensor model (metadata) and the information system should describe the handling of physical objects in the same way, i.e., the information system and the sensor model should conform. So far, conformance checking has been done at the model level. I propose to use observed sensor data, and potentially information system data, to check conformance.

13 January 2011 (Thursday, 10:45-12:30h. in CR-3E)
Data Warehousing and Data Mining guest lectures

Tom Jansen (Distimo)
Data warehousing for app store analytics. Distimo is an innovative app store analytics company built to solve the challenges created by a widely fragmented app store marketplace filled with equally fragmented information and statistics.

Jacques Niehof and Alexandra Molenaar (SIOD)
Data Mining for selection profiles for fraud and misuse of social security. The Social Intelligence and Investigation Service (SIOD) of the Ministry of Social Affairs and Employment (Ministerie van Sociale Zaken en Werkgelegenheid) fights criminality in the field of social security.

8 February 2011
Paul Stapersma
A probabilistic XML database on top of MayBMS
We use the probabilistic XML model proposed by Van Keulen and De Keijzer to create a prototype of a probabilistic XML database. One disadvantage of the XML data model is that queries cannot be executed as efficiently as in the relational database model. Many non-probabilistic mapping techniques have been developed to map semi-structured data into relational databases to overcome this disadvantage. In this research, we use the schema-less mapping technique ‘XPath Accelerator’ to build a probabilistic XML database (PXML-DBMS) on top of a URDBMS. A working prototype can be found at http://code.google.com/p/pxmlconverter/.
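The ‘XPath Accelerator’ technique mentioned above is not detailed in this summary. As an illustrative sketch (not the PXML-DBMS code itself), the idea is to map every XML node to a (pre, post, level, tag) tuple, so that XPath axis steps become range predicates over an ordinary relational table:

```python
import xml.etree.ElementTree as ET

def xpath_accelerator_encode(xml_text):
    """Map each XML node to a (pre, post, level, tag) tuple.

    In the XPath Accelerator encoding, the descendants of a node v are
    exactly the nodes w with pre(v) < pre(w) and post(w) < post(v),
    so XPath axis steps become range predicates over a relational table.
    """
    table, pre, post = [], 0, 0

    def visit(node, level):
        nonlocal pre, post
        my_pre = pre                 # pre rank: document order
        pre += 1
        for child in node:
            visit(child, level + 1)
        table.append((my_pre, post, level, node.tag))
        post += 1                    # post rank: after all descendants

    visit(ET.fromstring(xml_text), 0)
    return sorted(table)             # ordered by pre rank

rows = xpath_accelerator_encode("<a><b><c/></b><d/></a>")
# descendants of <b> (pre=1, post=1): nodes with pre > 1 and post < 1 -> only <c>
print(rows)
```

Stored in a relational (or in this case probabilistic-relational) table, the descendant, ancestor, following and preceding axes all reduce to comparisons over the pre/post columns, which the underlying DBMS can index.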

24 May 2011 (11:30h. in ZI-5126)
Almer Tigelaar
Search Result Caching in P2P Information Retrieval Networks
We explore the solution potential of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that a small bounded cache offers performance comparable to an unbounded cache. Furthermore, we explore partially centralised and fully distributed scenarios, and find that in the most realistic distributed case caching can reduce the query load by thirty-three percent. With optimisations this can be boosted to nearly seventy percent.
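The talk describes simulation results rather than code; as a minimal sketch of the underlying idea (all names and numbers below are illustrative, not from the simulator), a small bounded LRU cache absorbs the repeated queries of a skewed query stream:

```python
from collections import OrderedDict

class BoundedResultCache:
    """A small bounded LRU cache for search results."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def lookup(self, query, fetch):
        if query in self.entries:
            self.entries.move_to_end(query)   # mark as recently used
            self.hits += 1
            return self.entries[query]
        self.misses += 1
        results = fetch(query)                # forward to the p2p network
        self.entries[query] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return results

# Skewed stream: a few popular queries repeat, as in real query logs.
stream = ["q%d" % (i % 3) for i in range(90)] + ["rare%d" % i for i in range(10)]
cache = BoundedResultCache(capacity=16)
for q in stream:
    cache.lookup(q, fetch=lambda q: ["result for " + q])
print("load reduction: %.0f%%" % (100.0 * cache.hits / len(stream)))
```

Because query popularity is highly skewed, even a small bounded cache captures most repeats, which is why it can perform comparably to an unbounded one.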

31 May 2011 (11:30h. in ZI-5126)
Kien Tjin-Kam-Jet
Free-Text Search over Complex Web Forms
This paper investigates the problem of using free-text queries as an alternative means for searching ‘behind’ web forms. We introduce a novel specification language for specifying free-text interfaces, and report the results of a user study where we evaluated our prototype in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at completing their search tasks.

7 June 2011
CTIT Symposium Security and Privacy - something to worry about?
The list of invited speakers includes Prof.dr.ir. Vincent Rijmen (TU Graz, Austria and KU Leuven), Dr. George Danezis (Microsoft Research), Dr. Steven Murdoch (University of Cambridge), Prof. Bert-Jaap Koops (TILT), Prof.mr.dr. Mireille Hildebrandt (RU Nijmegen), Dr.ir. Martijn van Otterlo (KU Leuven) and Prof. Pieter Hartel (UT).
Read more…

21 June 2011
Andreas Wombacher
Sensor Data Visualization & Aggregation: A self-organizing approach
In this year’s Advanced Database Systems course, the students had the assignment to design and implement the database functionality to visualize sensor data based on user requests in a web-based system. To guarantee good response times of the database, it is necessary to pre-aggregate data. The core of the assignment was to find good pre-aggregations that minimize the query response times while using only a limited amount of storage space. The pre-aggregation levels must adjust in case the characteristics of the user requests change. In this talk I will present the students’ different approximation approaches as well as an optimal, but NP-hard, solution.
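As a hedged illustration of the pre-aggregation idea (not any of the students’ actual solutions), one can materialize average buckets at a few granularities and answer each range query from the coarsest level that still meets the requested resolution:

```python
def build_preaggregates(readings, levels=(10, 100)):
    """Materialize average buckets of several sizes over the raw readings."""
    return {
        size: [
            sum(readings[i:i + size]) / size
            for i in range(0, len(readings) - len(readings) % size, size)
        ]
        for size in levels
    }

def answer(aggregates, readings, start, end, max_points):
    """Serve a range query from the coarsest level that still fits max_points."""
    span = end - start
    for size in sorted(aggregates, reverse=True):  # try coarse levels first
        if size <= span and span // size <= max_points:
            return aggregates[size][start // size:end // size]
    return readings[start:end]  # fall back to raw data

readings = list(range(1000))
agg = build_preaggregates(readings)
points = answer(agg, readings, start=0, end=1000, max_points=20)
print(len(points), points[0])  # 10 buckets; the first averages 0..99 -> 49.5
```

The optimization problem in the talk is then which bucket sizes to maintain, trading the storage for the extra levels against the response time of queries that would otherwise scan raw data.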

28 June 2011
Marijn Koolen (University of Amsterdam)
Relevance, Diversity and Redundancy in Web Search Evaluation
Diversity performance is often evaluated with a measure that combines relevance, subtopic coverage and redundancy. Although this is understandable from a user perspective, it is problematic when analysing the impact of diversifying techniques on diversity performance. Such techniques not only affect subtopic coverage, but often the underlying relevance ranking as well. An evaluation measure that conflates these aspects hampers our progress in developing systems that provide diverse search results. In this talk, I argue that to further our understanding of how system components affect diversity, we need to look at relevance, coverage and redundancy individually. Using the official runs of the TREC 2009 Diversity task, I show that differences in diversity performance are mainly due to differences in the relevance ranking, with only minimal differences in how the relevant documents are ordered amongst themselves. If we measure diversity independent of the relevance ranking, we find that some of the systems that perform badly on conflated measures have the most diverse ordering of relevant documents.

30 June 2011 (Thursday, 10:30-11:30h. in ZI-3126)
Design Project Presentations

Tristan Brugman, Maarten Hoek, Mustafa Radha, Iwan Timmer, Steven van der Vegt
UT-Search. UT Search was built as a replacement for Google Custom Search. The aim was to create a search engine that uses the knowledge of the various systems within the UT to also find the wealth of information that cannot be found via custom search. We use aggregated search to query the university’s different systems. Using faceted search, the user can select which systems to search in order to refine the query. The goal is a central search engine for the university from which all the different systems can be searched, making the information more accessible.

Ralph Broenink, Rick van Galen, Jarmo van Lenthe, Bas Stottelaar, Niek Tax
Alexia: the drinks-management system for Stichting Borrelbeheer Zilverling. With the activities of five associations, graduation drinks and other events, the two bar rooms of the Educafé are heavily occupied. This high occupancy has created the need for a management system that can schedule drinks, track inventory, manage bartenders and also register consumption during the drinks. As an extra, it is also possible to pay by tab using RFID cards. In the presentation we show how, starting from the requirements of these five associations, we developed a web application that fulfils the above functions, using the latest technologies such as Django, CSS3 and HTML5.

12 July 2011
Sergio Duarte
Sergio will talk about his internship at Yahoo! Research, Barcelona.

23 August 2011
Rezwan Huq
Inferring Fine-grained Data Provenance in Stream Data Processing: Reduced Storage Cost, High Accuracy
Fine-grained data provenance ensures reproducibility of results in decision making, process control and e-science applications. However, maintaining this provenance is challenging in stream data processing because of its massive storage consumption, especially with large overlapping sliding windows. In this paper, we propose an approach to infer fine-grained data provenance by using a temporal data model and the coarse-grained data provenance of the processing. The approach has been evaluated on a real dataset and the results show that our proposed inference method provides provenance information as accurate as explicit fine-grained provenance at reduced storage consumption.
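A minimal sketch of the inference idea (illustrative only; the paper’s temporal data model and window semantics are richer): rather than storing an explicit link for every input-output tuple pair, the window specification from coarse-grained provenance plus tuple timestamps suffice to reconstruct which inputs contributed to an output:

```python
def infer_provenance(output_time, window_size, input_times):
    """Infer which input tuples fed the window that produced an output tuple.

    The window specification (coarse-grained provenance) and the tuple
    timestamps (a temporal data model) are enough to reconstruct the
    contributing inputs: everything in (output_time - window_size, output_time].
    """
    window_start = output_time - window_size
    return [t for t in input_times if window_start < t <= output_time]

inputs = [1, 2, 3, 4, 5, 6, 7, 8]
contributing = infer_provenance(output_time=6, window_size=4, input_times=inputs)
print(contributing)  # tuples with timestamps in (2, 6]
```

With overlapping sliding windows each input tuple belongs to many windows, so storing explicit per-tuple links multiplies storage; inference from timestamps avoids that cost entirely.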

30 August 2011 at 14:30h.
Maarten Fokkinga
Aggregation - polymorphic and polytypic
Repeating the work of Meertens’ “Calculate Polytypically!” we show how to define in a few lines a very general “aggregation” function. Our intention is to give a self-contained exposition that is, compared to Meertens’ work, more accessible to the uninitiated reader who wants to see the idea with a minimum of formal details.

5 September 2011 (Monday at 13.30h.)
Robin Aly
Towards a Better Understanding of the Relationship Between Probabilistic Models in IR
Probability of relevance (PR) models are generally assumed to implement the Probability Ranking Principle (PRP) of IR, and recent publications claim that PR models and language models are similar. However, a careful analysis reveals two gaps in the chain of reasoning behind this statement. First, the PRP considers the relevance of particular documents, whereas PR models consider the relevance of any query-document pair. Second, unlike PR models, language models consider draws of terms and documents. We bridge the first gap by showing how the probability measure of PR models can be used to define the probabilistic model of the PRP. Furthermore, we argue that given the differences between PR models and language models, the second gap cannot be bridged at the probabilistic model level. We instead define a new PR model based on logistic regression, which has a similar score function to the one of the query likelihood model. The performance of both models is strongly correlated, hence providing a bridge for the second gap at the functional and ranking level. Understanding language models in relation to logistic regression models opens up many new research directions, which we propose as future work.

20 September 2011
Juan Amiguet
Annotation propagation and topology based approaches
When making sensor data stream applications more robust to changes in the sensing environment by introducing annotations that represent those changes, we need to propagate such annotations across processing elements. We present a technique for performing such propagation that exploits the relations amongst the inputs and outputs, from both an information-theoretic and a topological perspective. Topologies are used to describe the structure of the inputs and the outputs separately, whilst information-theoretic techniques are used to model the transform as a channel, enabling the topological transformations to be treated as optimisation problems. We present a framework of functions that is generic across all transforms and enables the maximisation of the entropy across the transform.

26 October 2011
Sergio Duarte
What and How Children Search on the Web
In this work we employed a large query log sample from a commercial web search engine to identify the struggles and search behavior of users ranging from children aged 6 to young adults aged 18. Concretely, we hypothesized that the large and complex volume of information to which children are exposed leads to ill-defined searches and to disorientation during the search process. For this purpose, we quantified their search difficulties based on query metrics (e.g. the fraction of queries posed in natural language), session metrics (e.g. the fraction of abandoned sessions) and click activity (e.g. the fraction of ad clicks). We also used the search logs to retrace stages of child development. Concretely, we looked for changes in user interests (e.g. the distribution of topics searched), language development (e.g. the readability of the content accessed) and cognitive development (e.g. the sentiment expressed in the queries) among children and adults. We observed that these metrics clearly demonstrate an increased level of confusion and more unsuccessful search sessions among children. We also found a clear relation between the reading level of the clicked pages and demographic characteristics of the users, such as age and the average educational attainment of the area in which the user is located.
Read more

29 November 2011
Rezwan Huq
Adaptive Inference of Fine-grained Data Provenance to Achieve High Accuracy at Lower Storage Costs
In stream data processing, data arrives continuously and is processed by decision making, process control and e-science applications. To control and monitor these applications, reproducibility of results is a vital requirement. However, it requires a massive amount of storage space to store fine-grained provenance data, especially for transformations with overlapping sliding windows. In this paper, we propose techniques that can significantly reduce storage costs while achieving high accuracy. Our evaluation shows that the adaptive inference technique can achieve more than 90% accurate provenance information for a given dataset at lower storage costs than the other techniques. Moreover, we present a guideline on the usage of the different provenance collection techniques described in this paper, based on the transformation operation and the stream characteristics.

See also: 2010’s DB colloquium.

Keynote talk by Stefano Ceri at DBDBD

Thursday, November 17th, 2011, posted by Djoerd Hiemstra

Stefano Ceri will give a keynote talk at the DBDBD on 2 December 2011. Ceri is professor of Database Systems at Politecnico di Milano, Italy. He co-authored over 250 articles in international journals and conference proceedings, and is co-author or editor of many international books, including best-selling classics like “Conceptual database design: an Entity-relationship approach” with Carlo Batini and Shamkant Navathe. His research interests cover many aspects of database management systems, including distributed databases, deductive and active databases, streaming data, object orientation, XML query languages, as well as design methods for data-intensive web sites.

Professor Ceri was awarded the prestigious IDEAS Advanced Grant, funded by the European Research Council (ERC), on Search Computing (search-computing.it). Search computing enables answering questions via a constellation of dynamically selected, cooperating search services. It should enable answering complex queries like: “Who are the strongest European competitors on software ideas?”, “Who is the best doctor to cure insomnia in a nearby hospital?”, or, very important for poor PhD students, “Where can I attend an interesting conference in my field closest to a sunny beach?”

More information on: dbdbd.nl

Keynote lecture by Jimmy Lin at Big Data tutorial

Tuesday, October 18th, 2011, posted by Djoerd Hiemstra

Jimmy Lin will give a keynote lecture at the SIKS/BigGrid Big Data tutorial that precedes the DBDBD on 30 November and 1 December 2011. Dr. Lin, who holds a PhD from MIT, is associate professor in the iSchool at the University of Maryland. He also has appointments in the Institute for Advanced Computer Studies (UMIACS) and the Department of Computer Science at Maryland. Lin works at the intersection of natural language processing (NLP) and information retrieval (IR), with a recent emphasis on scalable algorithm design and large-data issues. He directs the recently formed Cloud Computing Center, an interdisciplinary group which explores the many aspects of cloud computing as it impacts technology, people, and society. He is also a member of both the Computational Linguistics and Information Processing Lab (CLIP) and the Human-Computer Interaction Lab (HCIL). Lin worked on Cloudera, which aims to bring Hadoop MapReduce to the enterprise, and is currently spending a sabbatical at Twitter.

See: DBDBD Big Data Tutorial

Rianne Kaptein defends PhD thesis on Focused Retrieval

Monday, October 10th, 2011, posted by Djoerd Hiemstra

Effective Focused Retrieval by Exploiting Query Context and Document Structure

by Rianne Kaptein

The classic IR (Information Retrieval) model of the search process consists of three elements: query, documents and search results. A user looking to fulfill an information need formulates a query, usually consisting of a small set of keywords summarizing the information need. The goal of an IR system is to retrieve documents containing information which might be useful or relevant to the user. Throughout the search process there is a loss of focus, because keyword queries entered by users often do not suitably summarize their complex information needs, and IR systems do not sufficiently interpret the contents of documents, leading to result lists containing irrelevant and redundant information. The main research question of this thesis is how to exploit query context and document structure to provide more focused retrieval.

More info

OLC-IT Annual Report 2010-2011

Tuesday, September 27th, 2011, posted by Djoerd Hiemstra

The IT programme committee (OLC-IT) deals with the examination regulations and the curriculum of the bachelor programmes Informatica and Telematica and the master programmes Computer Science and Telematics. It has the legal right to give solicited and unsolicited advice to the programme director and the dean. This year in the OLC’s annual report:

  • Curriculum changes
  • Quality assurance
  • Working groups
  • New bachelor programmes

Read the full annual report 2010-2011.

Job Vacancy: Scientific Programmer

Thursday, September 22nd, 2011, posted by Djoerd Hiemstra

Scientific programmer: folktale search and visualisation

The FACT project will investigate new possibilities for humanities researchers (folktale researchers, narratologists, documentalists, etc.) to study folktales based on annotations and relations that have been automatically assigned using data-driven methods. The Dutch Folktale Database (Nederlandse Volksverhalenbank) of the Meertens Institute is a very large and varied collection of Dutch Folktales. Within FACT, software will be developed to automatically enrich the folktales in this collection with metadata such as names, keywords, genre, a summary and type. An additional research goal is to investigate if automatic analysis of the folktale collection can reveal relations between folktales that are difficult to discover through human inspection. The annotation and clustering methods to be developed will be integrated in a user-friendly XML-based platform for the annotation and exploration of folktales, to support research on the variability of human oral and written transmission.

The University of Twente has vacancies for a PhD-student, a postdoc and a scientific programmer, who will be working together as a team to achieve the project goals. In addition there will be close cooperation with the Tunes & Tales project (funded under the Computational Humanities programme of KNAW) that is aimed at investigating sequences of motifs in, and variability of, melodies and folktales in oral transmission.

The scientific programmer will work on the development of user-friendly tools for folktale researchers that incorporate the annotation and clustering techniques developed by the postdoc and the PhD student. The annotation tool should allow for (semi) automatic annotation of folktales with language, genre, keywords, names, summary and type. The visualization tool should enable easy inspection of document clusters. In addition, the programmer will develop an XML-based search system that allows the general public to search for folktales in the Folktale Database based on their annotations.

Apply on-line (Deadline: 1 November 2011)

What and How Children Search on the Web

Thursday, August 18th, 2011, posted by Djoerd Hiemstra

by Sergio Duarte Torres and Ingmar Weber (Yahoo! Research)

The Internet has become an important part of the daily life of children as a source of information and leisure activities. Nonetheless, given that most of the content available on the web is aimed at the general public, children are constantly exposed to inappropriate content, either because the language goes beyond their reading skills, because their attention span differs from that of grown-ups, or simply because the content is not targeted at children, as is the case with ads and adult content. In this work we employed a large query log sample from a commercial web search engine to identify the struggles and search behavior of users ranging from children aged 6 to young adults aged 18. Concretely, we hypothesized that the large and complex volume of information to which children are exposed leads to ill-defined searches and to disorientation during the search process. For this purpose, we quantified their search difficulties based on query metrics (e.g. the fraction of queries posed in natural language), session metrics (e.g. the fraction of abandoned sessions) and click activity (e.g. the fraction of ad clicks). We also used the search logs to retrace stages of child development. Concretely, we looked for changes in user interests (e.g. the distribution of topics searched), language development (e.g. the readability of the content accessed) and cognitive development (e.g. the sentiment expressed in the queries) among children and adults. We observed that these metrics clearly demonstrate an increased level of confusion and more unsuccessful search sessions among children. We also found a clear relation between the reading level of the clicked pages and demographic characteristics of the users, such as age and the average educational attainment of the area in which the user is located.
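The metrics mentioned above are, at their core, simple ratios over log records. The sketch below uses illustrative proxy definitions and a toy log, not the paper’s exact metric definitions or the commercial search engine’s data:

```python
# Each log record: (user_id, session_id, query, clicked_url or None, is_ad_click)
log = [
    ("u1", "s1", "dinosaur games", "kids.example/dino", False),
    ("u1", "s1", "how do volcanoes work", None, False),
    ("u2", "s2", "what is the biggest animal in the world", "zoo.example/whale", True),
    ("u2", "s3", "math homework help", None, False),
]

def natural_language_fraction(log, min_words=4):
    """Fraction of queries phrased like sentences (proxy: query length)."""
    return sum(len(r[2].split()) >= min_words for r in log) / len(log)

def abandoned_session_fraction(log):
    """Fraction of sessions with no click at all."""
    had_click = {}
    for _, sid, _, clicked, _ in log:
        had_click[sid] = had_click.get(sid, False) or clicked is not None
    return sum(not c for c in had_click.values()) / len(had_click)

def ad_click_fraction(log):
    """Fraction of clicks that land on ads."""
    clicks = [r for r in log if r[3] is not None]
    return sum(r[4] for r in clicks) / len(clicks)

print(natural_language_fraction(log))   # 0.5
print(abandoned_session_fraction(log))
print(ad_click_fraction(log))           # 0.5
```

Computed per age group over a real log, such ratios make the hypothesized differences between children and adults directly measurable.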

Tag Cloud

The paper will be presented at the 20th ACM International Conference on Information and Knowledge Management (CIKM) in Glasgow, 24-28 October 2011.

[download pdf]