Archive for 2010

2010’s DB colloquium

Tuesday, December 21st, 2010, posted by Djoerd Hiemstra
Below you find this year’s DB colloquia, usually on Tuesdays from 13:45h to 14:30h in ZI-3126.
24 March 2010 (Wednesday)
Robin Aly
Beyond shot retrieval: Searching for Broadcast News Items Using Language Models of Concepts
In this paper we use a method to evaluate the performance of story retrieval, based on the TRECVID shot-based retrieval ground truth. Our experiments on the TRECVID 2005 collection show a significant performance improvement over four standard methods.

30 March 2010
Maarten Fokkinga
A Greedy Algorithm for Team Formation that is Fair over Time
Using a concrete example, we derive a “fast”, so-called greedy algorithm for a “hard” problem (one with exponential time complexity). The concrete problem is the formation of teams from a given set of players such that, when repeated many times, each player is a teammate of every other player equally often. We also formalize our greedy algorithm in a general setting.
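
To give a feeling for the idea, here is a minimal sketch of one possible greedy strategy. It is not the algorithm derived in the talk: the function names, the tie-breaking rule and the choice to seed each team with the first free player are all assumptions made for illustration. In every round, a team is filled by repeatedly adding the player who has been a teammate of the current members least often so far.

```python
from collections import defaultdict
from itertools import combinations


def form_teams(players, team_size, together):
    """Greedily split players into teams, preferring pairs that rarely met.

    `together[(a, b)]` (with a < b) counts how often a and b were teammates.
    """
    remaining = sorted(players)
    teams = []
    while remaining:
        team = [remaining.pop(0)]  # assumption: seed with the first free player
        while len(team) < team_size and remaining:
            # Pick the player with the lowest total past pairing count
            # with the current team members (ties broken by name).
            best = min(
                remaining,
                key=lambda p: (sum(together[tuple(sorted((p, q)))] for q in team), p),
            )
            remaining.remove(best)
            team.append(best)
        teams.append(team)
    return teams


def play_rounds(players, team_size, rounds):
    """Run several rounds, updating the pair counts after each round."""
    together = defaultdict(int)
    schedule = []
    for _ in range(rounds):
        teams = form_teams(players, team_size, together)
        for team in teams:
            for a, b in combinations(sorted(team), 2):
                together[(a, b)] += 1
        schedule.append(teams)
    return schedule, together
```

Calling `play_rounds(list("abcd"), 2, 3)` returns the teams of each round together with the pair counts. A greedy choice like this is fast, but in general it is not guaranteed to be perfectly fair.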

6 April 2010
Sergio Duarte
An analysis of queries intended to search information for children
In this paper we analyze queries and groups of queries intended to satisfy children’s information needs using a large-scale query log to compare the characteristics of these queries. The aim of this analysis is twofold: i) to identify differences in the query space, content space, user sessions and user behavior of these two types of queries; ii) to enhance this query log by including annotations on children’s queries, sessions and actions.

13 April 2010
Djoerd Hiemstra
MapReduce information retrieval experiments
We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost machines to search a web crawl of 0.5 billion pages (ClueWeb09, Category A) showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort.
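
The sequential scanning idea can be sketched in plain Python, without Hadoop, by simulating the map and reduce phases in a single process. This is an illustration under stated assumptions, not the code used in the case study: the function names are hypothetical and the score (a count of query term occurrences) is a toy stand-in for a real retrieval model.

```python
import heapq
from collections import defaultdict


def map_document(doc_id, text, queries):
    """Map step: score one document against every query by scanning its text."""
    tokens = text.lower().split()
    for query_id, query in queries.items():
        # Assumed toy score: number of tokens that are query terms.
        terms = set(query.lower().split())
        score = sum(1 for token in tokens if token in terms)
        if score > 0:
            yield query_id, (score, doc_id)


def reduce_query(query_id, scored_docs, k):
    """Reduce step: keep the k highest scoring documents for one query."""
    return query_id, heapq.nlargest(k, scored_docs)


def run_job(documents, queries, k=2):
    """Simulate the shuffle between map and reduce in a single process."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for query_id, scored in map_document(doc_id, text, queries):
            grouped[query_id].append(scored)
    return dict(reduce_query(q, docs, k) for q, docs in grouped.items())
```

Because every document is scanned for every batch of queries, no index is needed: testing a new retrieval approach only means changing the scoring line in the map step.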

20 April 2010
Maurice van Keulen
Climbing trees in parallel universes
When doing research on data uncertainty, one enters the realm of science fiction, where the same data can co-exist in parallel universes in slightly different forms. An uncertain XML document is like a tree that exists in many parallel universes where its leaves and branches may exist in one, but possibly not in another. Querying XML is navigating through XML trees … so I’m going to teach you how to efficiently climb trees in parallel universes. The trick is to simultaneously climb the tree in all parallel universes at once, while not stepping on branches that do not exist … because then you will fall down in one universe and not in the other, splitting yourself into a live and a dead person like Schrödinger’s cat … sorry, my imagination carries me away …

27 April 2010
Almer Tigelaar
Query-Based Sampling using Only Snippets
Query-based sampling is a popular approach to model the content of an uncooperative server. It works by sending queries to the server and downloading the returned documents in the search results in full. This sample of documents then represents the server’s content. We present an approach that uses the document snippets as samples instead of downloading entire documents. This yields more stable results at the same amount of bandwidth usage as the full document approach. Additionally, we show that using snippets does not necessarily incur more latency, but can actually save time.
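
As a rough sketch of the sampling loop described above: the code below builds a term-frequency model from snippets only. The `search` callable, the toy server and the rule for choosing the next query (a random term from the model so far) are assumptions for illustration, not the setup used in the paper.

```python
import random
from collections import Counter


def sample_with_snippets(search, seed_query, num_queries, rng):
    """Build a term-frequency model of a server from result snippets only.

    `search(query)` is a hypothetical callable returning a list of snippets.
    """
    model = Counter()
    seen = set()  # do not count the same snippet twice
    query = seed_query
    for _ in range(num_queries):
        for snippet in search(query):
            if snippet not in seen:
                seen.add(snippet)
                model.update(snippet.lower().split())
        if model:
            # Assumed strategy: the next query is a random term seen so far.
            query = rng.choice(sorted(model))
    return model


def toy_search(query, collection=("red apple pie", "green apple tart", "blue cheese pie")):
    """Stand-in for an uncooperative server: returns matching texts as snippets."""
    return [doc for doc in collection if query in doc.split()]
```

A call such as `sample_with_snippets(toy_search, "apple", 5, random.Random(0))` never downloads a full document; everything the model knows comes from the text returned in the result list.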

4 May 2010
Riham Abdel Kader
Run-time Optimization for Pipelined Systems
Traditional optimizers fail to pick good execution plans when faced with increasingly complex queries and large data sets. This failure is even more acute in the context of XQuery, due to the structured nature of the XML language. To overcome the vulnerabilities of traditional optimizers, we have previously proposed ROX, a Run-time Optimizer for XQueries, which interleaves optimization and execution of full tables. ROX has proved to be robust, even in the presence of strong correlations, but it has one limitation: it uses full materialization of intermediate results, making it unsuitable for pipelined systems. Therefore, this paper proposes ROX-sampled, a variant of ROX, which executes small data samples, thus generating smaller intermediates. We conduct extensive experiments which show that ROX-sampled is comparable to ROX in performance, and that it is still robust against correlations. The main benefit of ROX-sampled is that it allows the large number of pipelined databases to import the ROX idea into their optimization paradigm.

11 May 2010
Peter Apers
How bright looks your research future.
In the beginning of your research career you set out your own goals and take decisions that may negatively affect your career. With this presentation I want to make you aware of this. Topics that are discussed: Research Challenges, Trends in Research Funding, Funding agencies with their own programs, What is expected from you?

18 May 2010
Robin Aly
Uncertainty in Information Retrieval
Today, information retrieval systems take many things for granted. For example: (1) A classifier decides whether a concept occurs in a medical document. (2) Every document that contains an expert’s email address describes his competence. (3) The system’s parameters in a retrieval function are constant for all queries. Finally, a score is the final number to rank documents by, ignoring the other documents in the ranking. In this talk, I will first identify three main sources of uncertainty in an information retrieval system. Afterwards, I will describe existing approaches to this uncertainty and propose future directions in this field of research. This talk contains the core ideas of the research I would like to conduct in the future and therefore has a more visionary character.

25 May 2010
Andreas Wombacher
Uncertainty principle in stream processing
In environmental applications, sensor data are processed online as a stream of measurements for warning, decision support, forecasting and controlling applications. The stream processing can be (i) accurate but delayed, since it requires waiting for delayed measurements, or it can be (ii) timely but inaccurate, since the processing is done based on available data. In my talk I will discuss this uncertainty principle and present research challenges derived from it.
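
The trade-off can be made concrete with a small sketch. The data layout, a tuple of event time, arrival time and value, and the function name are assumptions for illustration only. The same window is averaged twice: once as soon as the window closes, and once after waiting for late measurements.

```python
def window_average(measurements, window_start, window_end, processed_at):
    """Average of values in [window_start, window_end) that arrived by processed_at.

    Each measurement is a hypothetical (event_time, arrival_time, value) tuple.
    """
    values = [
        value
        for event_time, arrival_time, value in measurements
        if window_start <= event_time < window_end and arrival_time <= processed_at
    ]
    return sum(values) / len(values) if values else None


readings = [
    (1, 1, 10.0),
    (2, 2, 12.0),
    (3, 9, 20.0),  # measured inside the window, but delivered late
]

timely = window_average(readings, 0, 5, processed_at=5)     # answers at t=5
accurate = window_average(readings, 0, 5, processed_at=10)  # waits until t=10
```

Here the timely answer is 11.0 and the accurate answer is 14.0: answering early means missing the late measurement, and waiting for it means answering five time units later.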

1 June 2010
CTIT Symposium Dependable ICT: who cares?
The central theme of this year’s CTIT annual symposium is Dependable ICT. A system is dependable if we can justifiably rely on its services. A dependable system should be robust against unavoidable physical faults, for instance a jammed communication channel. Also, a dependable system should resist human error, be it during operation or at design time, for instance software errors. Dependable ICT systems should even defend themselves against malicious attacks by intrusion or abuse.

8 June 2010
Rezwan Huq
Facilitating Fine Grained Data Provenance using Temporal Data Model
E-science applications use fine-grained data provenance to maintain the reproducibility of scientific results, i.e., for each processed data tuple, the data used to process the data tuple as well as the approach used is documented. Since most e-science applications perform on-line processing of sensor data using overlapping time windows, the overhead of maintaining fine-grained data provenance is huge, especially in longer data processing chains. This is because data items are used by many time windows. Here, we propose an approach to reduce storage costs for fine-grained data provenance by maintaining data provenance on the relation level instead of on the tuple level, and by making the content of the database used reproducible. The approach has prototypically been implemented for streaming and manually sampled data.
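
A small sketch shows why overlapping windows make explicit tuple-level provenance expensive, and how provenance could instead be re-derived from timestamps. This is only an illustration of the storage argument under assumed names and an assumed data layout (pairs of tuple id and timestamp); it is not the implementation presented in the talk.

```python
def sliding_windows(start, end, size, slide):
    """Return [window_start, window_end) intervals covering event times start..end."""
    return [(t, t + size) for t in range(start, end - size + 1, slide)]


def tuple_level_provenance(tuples, windows):
    """Explicit provenance: for every window, list the ids of all input tuples."""
    return {
        window: [tid for tid, ts in tuples if window[0] <= ts < window[1]]
        for window in windows
    }


def inputs_at(tuples, window):
    """Inferred provenance: re-derive a window's inputs from timestamps alone."""
    return [tid for tid, ts in tuples if window[0] <= ts < window[1]]


tuples = [(f"t{ts}", ts) for ts in range(10)]   # one tuple per time unit
windows = sliding_windows(0, 10, size=4, slide=1)
explicit = tuple_level_provenance(tuples, windows)
stored_ids = sum(len(ids) for ids in explicit.values())
```

With ten input tuples and a window of size 4 sliding by 1, the explicit approach stores 28 tuple references, because each tuple is used by several windows. If the timestamped input remains reproducible, storing the window boundaries is enough and `inputs_at` recovers the same lists.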

15 June 2010
Juan Amiguet Vercher
Annotations: Purposeful Stream Data Processing
In e-science, data provenance and technical data quality measurements are two major recent contributions, aiming to make data processing more accountable and verifiable. Both techniques have a series of drawbacks. Important changes impacting data interpretation cannot be recognised from changes in data quality measurements. A sensor may continue reporting correctly yet its environment can change without it being reflected in its data. Annotations can address this by conveying information about the data. Annotations take the form of tokens, manually or automatically generated, which are streamed separately from the data. The information they convey can help drive the data transform or explain the impact of the latter. Issues discussed are: Stream non-update principle violation, Stream synchronisation, Incomplete annotation understanding, and Data Invalidation through partial annotation implementation.

22 June 2010
Dolf Trieschnigg
A Cross-lingual Framework for Monolingual Biomedical Information Retrieval
We approach the incorporation of a concept-based representation in monolingual biomedical IR from a cross-lingual perspective. In the proposed framework, this is realized by translating and matching between text and concept-based representations. We compare six translation models and measure their effectiveness in the biomedical domain. We demonstrate that the approach can result in significant improvements in retrieval effectiveness over word-based retrieval. Moreover, we demonstrate increased effectiveness of a cross-lingual IR framework for monolingual biomedical IR if basic translation models are combined.

29 June 2010
Robin Aly
From Stars and Planets to Information Retrieval:
Events, Event Spaces and Random Variables in IR.
Recently, a discussion about the event spaces used for probability functions in IR emerged in the research community. Based on practical examples I will explain the different assumptions of a selection of models.

6 July 2010
Design Project Groups
Design project: jury management for the Dutch athletics union (Nederlandse AtletiekUnie):
Five students developed a jury management application for the AtletiekUnie. For this, they built a framework and, on top of it, a web application that administers the deployment of jury members at competitions and tackles scheduling problems.
XML-DataSet Converter
An adapter was developed for converting an XML document into a DataSet (relational database). Another application must be able to manipulate the data in the DataSet (modify, add and delete records). The adapter must then convert the DataSet back into an XML document. The structure of the XML documents must remain the same in the process.

28 September 2010 (Wednesday in ZI-2126)
Robin Aly
Guest lecture on Multimedia Information Retrieval
More and more information is stored as multimedia: from rap courses for kids, through historical documents, to academic publications in video format. Because of this data explosion, multimedia information retrieval is quickly gaining importance. This talk will give an overview of the field, starting from the main difference with text information retrieval: the human-incomprehensible data format of multimedia documents. Four different approaches for the understanding of multimedia documents are presented: human annotations, low-level feature vectors, spoken document words and concept-based representations. The overview of the first three approaches is kept at a high level and the focus of the talk is on concept-based retrieval.

6 October 2010 (Wednesday in ZI-2126)
Thijs Westerveld (Teezir B.V., Utrecht)
Guest lecture: Automatically Analyzing Word of Mouth
In this talk I will demonstrate Teezir’s Opinion Analysis dashboards and discuss the underlying technology. For collecting content from web sites we developed advanced crawling technology that automatically identifies relevant news, blog and forum pages and extracts the relevant content and metadata. The collected content is then further analyzed to identify the main sentiments before everything is indexed to be disclosed in the online dashboards. Various sentiment analysis variants that have proven successful in an academic setting have been evaluated on our live collections. I will demonstrate that success on academic test collections does not necessarily imply the practical use of a sentiment analysis algorithm.

20 October 2010 (Wednesday in ZI-2126)
Arjen de Vries (CWI, Amsterdam)
How search logs can help improve future searches
In the European project Vitalas, we had the opportunity to analyze the search log data from a commercial picture portal of a European news agency, which offers access to photographic images to professional users. I will discuss how these logs can be used in various ways to improve image search: to expand the image representation, to make suggestions of alternative queries, to adapt the search results to user context, and to automatically build concept detectors for content-based image retrieval.

10 November 2010 (Wednesday, 11:30h. - 12.15h.)
Robin Aly
Exploiting Uncertainty about the Knowledge of Objects for Searching
The aim of this project is to improve the experience of users searching the internet for complex objects by exploiting the uncertainty a search engine has about the object’s representation.

17 November 2010 (Wednesday, 11:30h. - 12.15h.)
Mena Habib
Neogeography: The Challenge of Channeling Large and Ill-behaved Data Streams
In this project, our broad objective is to propose a new portable, domain-independent XML-based technology that involves a set of free services that: enable end-user communities to express and share their spatial knowledge using free text; extract specific spatial information from this text; build a database from all the users’ contributions; and make use of this collective knowledge to answer users’ natural language questions through a question answering service.

2 December 2010 (Thursday)
Thomas Demeester (Ghent University, Belgium)
INTEC’s Broadband Communication Networks group and Information Retrieval
The main purpose of this short and high-level presentation is to present our research group at Ghent University, in the light of a future collaboration. An overview of the different activities within the group will be followed by our work in the field of Information Retrieval, for a project with the Flemish digital audiovisual archive. Furthermore, as our collaboration might be initiated with a short stay of mine in your group, it could be interesting for you to know my background. I will briefly introduce myself, and how I made the change from the field of electromagnetics (my Ph.D.) to machine learning and information retrieval.

CAES & Co football team

Monday, December 20th, 2010, posted by Djoerd Hiemstra

CAES Futsal team

I did some research into Computer Architecture for Embedded Systems by joining the CAES & Co football team at the university’s Christmas tournament, starring: Djoerd, Ricardo, Paul, Arend, Harm, Sergio, and Yacob.

Solutions to Assignment 3

Monday, December 6th, 2010, posted by Djoerd Hiemstra

The solutions to Assignment 3 are now on-line in the Course Material Section on Blackboard. You need the solutions for Assignment 4, deadline next Friday, 10 December.

Small Haskell wrap-up meeting

Friday, December 3rd, 2010, posted by Djoerd Hiemstra

Next Monday, 6 December at 14.30 - 15.15h. in ZI-3126, there is a short meeting to discuss the solutions for Assignment 2 and 3. The solutions, which are helpful for Assignment 4, will also be put on Blackboard.
Next Tuesday, 7 December: the Hadoop Hackathon!

Solution for Assignment 2

Monday, November 29th, 2010, posted by Djoerd Hiemstra

The grades for Assignment 2 are now on Blackboard’s Grade Center. A correct solution for Assignment 2, which is needed for Assignment 3, can be found under “Course Materials” on Blackboard.

Grades for Assignment 1 on Blackboard Grade Center

Tuesday, November 23rd, 2010, posted by Djoerd Hiemstra

The grades for Assignment 1 are now on Blackboard’s Grade Center. Please send me an email as soon as possible if you cannot find your grades, if you cannot find an explanation of your grade (including a per question result), or if you did not submit solutions at all for Assignment 1, but still want to participate in the course. Deadline for Assignment 2 is next Friday, 26 November.

Guest lecture by Peter Dickman from Google

Tuesday, November 16th, 2010, posted by Djoerd Hiemstra

Friday 26 November, Peter Dickman from Google will talk about Google’s infrastructure. The lecture will start at 10:30 h. (so 15 minutes earlier than usual) in RA-1501.

This is a rapid overview of the approach Google uses to develop and offer global products. I will briefly (and somewhat superficially) cover the whole of our infrastructure from physical systems, such as the data centers, through the software stack to our software development methodology and the corporate engineering culture that both builds and utilizes the infrastructure.

Peter Dickman is an engineering manager in Google’s main European engineering centre in Zurich. He is involved with both the internals of the Google search engine and projects to protect user data in Google’s systems. Prior to working at Google, Peter was an academic in the UK, researching large-scale distributed systems (though on arrival at Google he discovered what large really meant).

Crash course Functional Programming

Friday, November 12th, 2010, posted by Maarten Fokkinga

The crash course Functional Programming, intended to enable you to describe the word count program in a functional language, will be given by Maarten Fokkinga in room Zilverling, West 1, on Friday Nov 19, 13:45-15:30. We’ll use the programming language Amanda (one executable running under Windows), but any other functional programming language, such as Haskell, may be used for the homework as well. A download for Amanda is provided with the material for Assignment 2.
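
For readers who want a preview of what a functional description of word count looks like, here is a small sketch. It is written in Python rather than Amanda or Haskell, and it is an illustration only, not course material: the counts are built with a map and a fold instead of a loop that updates a variable in place.

```python
from functools import reduce


def word_count(text):
    """Count words using map and a fold (reduce), without mutable loops."""
    words = map(str.lower, text.split())
    return reduce(
        # Each step returns a new dictionary instead of changing the old one.
        lambda counts, word: {**counts, word: counts.get(word, 0) + 1},
        words,
        {},
    )
```

For example, `word_count("to be or not to be")` counts "to" and "be" twice and "or" and "not" once.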

Open source alternatives for Blackboard?

Thursday, November 11th, 2010, posted by Djoerd Hiemstra

Since 2009, the University of Twente has used Blackboard as its on-line learning management system. However, Blackboard turns out to be very insecure; see for instance the news item (in Dutch) Universiteitssoftware blijkt langdurig lek. Among other things, it is not only possible but actually easy for students to hack into a teacher’s account and invisibly change grades. As it turns out, this has been known amongst our students for quite some time.

Blackboard is a commercial system and its internals are a company secret. Kerckhoffs’s principle states that a secure system must not require secrecy of its design, so that it can be stolen by the enemy without causing trouble. In the design of software systems, this argument is used in favour of open source software security: security through obscurity is considered bad practice, see for instance Jaap-Henk Hoepman and Bart Jacobs’ Communications of the ACM article Increased security through open source (CACM 50-1, 2007). So, maybe it is time to look at some of the open source alternatives out there, such as Sakai or Moodle. Both come with commercial support, in case our technical university does not want to invest in the expertise to deploy such a system in-house.

PhD-position: semantic linking of multimedia content

Friday, November 5th, 2010, posted by Djoerd Hiemstra

The digital library of the future will be a dynamic and highly networked entity, consisting of both the original documents and user-generated annotations and links to and from external resources. Among other things, the Human Media Interaction (HMI) group of the University of Twente investigates the possibilities for multimedia content analysis and information linking to support and provide facilities for navigating and exploring digital libraries with content in a variety of formats including text, audio, images and video. There is funding available for a PhD position starting from January 2010.

The PhD research will be carried out in the context of AXES, a multidisciplinary research project funded by the EU (FP7, Digital Libraries). The research will focus on deploying diverse, automatically generated, time-labeled annotations (for example those coming from automatic speech recognition) for connecting heterogeneous data sources, and will be strongly evaluation-driven.

More information (deadline: 21 November)