A probabilistic approach for mapping free-text queries to complex web forms

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

Web applications with complex interfaces consisting of multiple input fields should understand free-text queries. We propose a probabilistic approach to map parts of a free-text query to the fields of a complex web form. Our method uses token models rather than only static dictionaries to create this mapping, offering greater flexibility and requiring less domain knowledge than existing systems. We evaluate different implementations of our mapping model and show that our system effectively maps free-text queries without using a dictionary. If a dictionary is available, the performance increases and is significantly better than a rule-based baseline.
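The core idea can be illustrated with a minimal sketch, assuming one unigram "token model" per form field; the field names, probabilities, and smoothing constant below are illustrative assumptions, not the paper's actual model. Each query token is assigned to the field whose token model makes the overall assignment most probable.

```python
from itertools import product

def score_assignment(tokens, assignment, field_models):
    # P(assignment) ∝ product of per-field token probabilities
    # (0.001 is an assumed smoothing mass for unseen tokens)
    p = 1.0
    for token, field in zip(tokens, assignment):
        p *= field_models[field].get(token, 0.001)
    return p

def best_mapping(query, field_models):
    # Brute force over all token-to-field assignments; fine for a sketch,
    # a real system would use dynamic programming over segmentations.
    tokens = query.lower().split()
    fields = list(field_models)
    best = max(product(fields, repeat=len(tokens)),
               key=lambda a: score_assignment(tokens, a, field_models))
    return list(zip(tokens, best))
```

For a hypothetical travel-planner form, `best_mapping("Amsterdam Utrecht", models)` would map the first city to the departure field and the second to the destination field, because that assignment maximizes the product of token probabilities.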

[download pdf]

2012’s DB colloquium

Below you will find a list of last year's DB colloquia, usually held on Tuesdays from 13:45h. – 14:30h. in ZI-3126.

26 January 2012 (Thursday at 16.00 – in Dutch)
Design Project Group
Quetis – a self-configuring interface between devices
Quetis is a software tool that enables communication with paralyzed patients by interactively calibrating and configuring any given set of human input devices that have a compatible middleware driver. Quetis detects the user's capabilities per device, configures itself to utilize only the subset of actions that the patient can actually use, and maps these to a configuration driving a premade GUI, though it could use any generic output system with proper middleware. In short: Quetis is a generalized and self-configuring interface between specialized input devices and the environment for paralyzed patients in ICUs.

27 February 2012
Tjitze Rienstra (University of Luxembourg)
Argumentation Theory
In the theory of abstract argumentation, the acceptance status of arguments is normally determined for the complete set of arguments at once, under a single semantics. However, this is not always desired. For example, in multi-agent systems, the provenance of the arguments and the competence of the agents often suggest different evaluation criteria for different arguments.

13 March 2012
Robin Aly
AXES: Access to Audiovisual Archives
The EU Project AXES aims at opening large audio-visual archives to a variety of user groups. My main task within the project is to provide search and linking functionality. Meanwhile, the first year of the project has passed and its progress will be reviewed next week. This colloquium is a rehearsal of the presentation I will give there. I will also provide a general overview of the project and demos of existing work.

28 March 2012 (Wednesday at 15.00h.-15.45h.)
Dong Nguyen
Evaluating federated search on the Web
Federated search systems have previously been evaluated by reusing existing TREC datasets. However, these datasets do not reflect realistic search systems found on the Web. As a result, it has been difficult to assess whether these systems are suitable for federated search on the Web. We therefore introduce a new dataset containing more than a hundred actual search engines. We first discuss the design of the dataset and present several analyses. We then compare several popular resource selection methods and discuss the results. Finally, we present several suggested modifications that incorporate more Web-specific features.

28 March 2012 (Wednesday at 10.45h.-11.45h. in CR-1B)
Henning Rode (Textkernel)
Structured Retrieval in Practice
In this talk I will give a demo of a CV search system built for job recruiters, and describe the challenges of building such a system, including user-friendly faceted search, synonym handling, and location search. When searching richly structured documents such as CVs, we also encountered a number of ranking problems using the standard language modelling approach for retrieval. The second part of the presentation will therefore discuss these issues in more detail and explain why they require field-specific solutions. Finally, I will share some ideas on how to further improve the search experience by making use of large domain knowledge sources.

3 April 2012
Maarten Fokkinga
Database Design
what you have always been doing but were never fully aware of
We show how to construct (in an almost algorithmic way) a query formulation for a database schema out of an (arguably simpler) query formulation in terms of an Entity-Relationship diagram. Doing so first requires a thorough understanding of the construction of a database schema out of the ER diagram. For this latter task, we show how to express the relations between the various steps in the development of the database schema, and what proof obligations exist.

17 April 2012
Lesley Wevers
A functional database programming language
We explore the possibilities of using functional languages in database management. We will develop a prototype implementation and compare it to the traditional approach of a general-purpose language combined with a database management system, on the aspects of performance and usability.

1 May 2012
Juan Amiguet, Rezwan Huq, and Andreas Wombacher
Data Processing – provenance and propagation
We will briefly introduce two case studies we are currently working on, which Juan and Rezwan will use to evaluate their research. This talk will hopefully allow us to discuss the differences between propagation, investigated by Juan, and provenance, investigated by Rezwan.

2 May 2012 (Wednesday at 11:30h. in ZI-4126)
Dolf Trieschnigg
An Exploration of Language Identification Techniques for Dutch Folktales
The Dutch Folktale Database contains fairy tales, traditional legends, urban legends, and jokes written in a large variety and combination of languages including (Middle and 17th century) Dutch, Frisian and a number of Dutch dialects. In this work we compare a number of approaches to automatic language identification for this collection. We show that in comparison to typical language identification tasks, classification performance for highly similar languages with little training data is low.
Read more

12 June 2012
Robin Aly
Statistical Shard Selection
Large search engines partition their documents into so-called shards (each shard contains the index of many documents). For a query, the search engine has to decide which shards should be used for searching (usually the top-n). To decide which shards to use, the shards are represented by sample documents, which are retrieved in a first retrieval step. However, generating good document samples is not trivial and requires storage space. In this talk, I want to reflect on my ideas for the shard selection problem. The ideas are based on the fact that most current retrieval models are simply weighted sums of features, for which simple statistical laws exist: the expectation of a sum is the sum of its expectations, and the variance of a sum of independent variables is the sum of their variances. I propose to represent shards by their expected feature values, feature variances and co-variances. Using this representation, I hope one can determine the score distribution for the current query in each shard. The shards to select are those which have a fair chance of containing documents with a higher score than a certain threshold, according to this distribution.
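The statistical laws the talk refers to can be sketched in a few lines of code. The sketch below is a simplified illustration (it ignores cross-feature covariances, which the talk explicitly mentions, and all function names and the normal approximation are assumptions): each shard is summarized by per-feature means and variances, and shards are ranked by the approximate probability that a document's score exceeds a threshold.

```python
import math

def shard_score_stats(feature_means, feature_vars, weights):
    # Linearity of expectation: E[sum w_i f_i] = sum w_i E[f_i]
    mean = sum(w * m for w, m in zip(weights, feature_means))
    # Assuming independent features: Var[sum w_i f_i] = sum w_i^2 Var[f_i]
    var = sum(w * w * v for w, v in zip(weights, feature_vars))
    return mean, var

def prob_score_above(mean, var, threshold):
    # Normal approximation of the score distribution in a shard
    if var == 0:
        return 1.0 if mean > threshold else 0.0
    z = (threshold - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(score > threshold)

def select_shards(shards, weights, threshold, top_n):
    # shards: dict shard_id -> (feature_means, feature_vars)
    scored = []
    for shard_id, (means, variances) in shards.items():
        mean, var = shard_score_stats(means, variances, weights)
        scored.append((prob_score_above(mean, var, threshold), shard_id))
    scored.sort(reverse=True)
    return [shard_id for _, shard_id in scored[:top_n]]
```

The key point of the representation is that no document samples are needed: a handful of summary statistics per shard suffices to rank the shards for a query.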

18 June 2012 (Monday all day)
CTIT Symposium 2012
ICT: The Innovation Highway
In EU’s Horizon 2020 three objectives have been set: excellent research, competitive industries, and better society. ICT plays a central role in reaching these goals. At this year’s CTIT symposium we will take up the challenges defined at the national and European level. Challenges which, when solved, will lead us to a better future. Together with you we will show that ICT is the true innovation highway.
Read more

19 June 2012
Design Projects
Yasona: a peer-to-peer social network
Our goal was to design a peer-to-peer network structure to create a decentralized social network. The design should not focus on the social aspects of networking, or the various possibilities a social medium could have, per se, but should instead offer a platform on which such elements can be built. Yasona was developed to give people the opportunity to communicate with each other and share media without being dependent on a central server. We will demonstrate our prototype and discuss our goals, design choices and recommendations.

Design, implementation and evaluation of the Bata App
For the Batavierenrace, an Android smartphone app has been designed and implemented which shows, in a well-organized way, all available information from the official Batavierenrace organization (like teams, running times, standings, etc.). After some initial updates, the app has performed satisfactorily, without bugs, and was downloaded over two thousand times (reaching the top 10 of the Google market). The design, problems encountered, and experiences will be discussed during the talk.

17 July 2012
Brend Wanders
Semi-structured data in a wiki
Wikis offer free form editing of, and collaboration on, texts. These texts are usually of an informative nature, and are intended for consumption by people. By embedding semi-structured information in a wiki, the information can also be used by other systems. In this short talk I will present my take on using a wiki as a basis for the collaborative creation and curation of data sets by offering ad-hoc data entry and querying.

15 August 2012 (Wednesday at 13.30h.)
Nick Barkas (Spotify, Sweden)
Search at Spotify
Nick Barkas is a software developer at Spotify in Stockholm, Sweden, working mostly with backend/server-side systems. He studied scientific computing at KTH in Stockholm and the University of Washington in Seattle. Barkas will talk about how Spotify serves music metadata to users and how that relates to search.

11 September 2012
Mohammed Salem (Humboldt University, Berlin)
Journalistic Multimedia Data Analytics
In this project we propose to develop applications and tools for content-based journalistic data management, analysis, retrieval and visualization. New algorithms are needed for automatic extraction of content-related metadata and annotations, not only for text documents but also for news videos, images, audio signals and animations. Moreover, new retrieval methods are needed that exploit the multimodal nature of news data and are able to return different materials related to a certain news story.

25 September 2012
Mena Habib
Toponym extraction and disambiguation
Toponym extraction and disambiguation are key topics recently addressed by the fields of Information Extraction and Geographical Information Retrieval. Toponym extraction and disambiguation are highly interdependent processes: not only does extraction effectiveness affect disambiguation, but disambiguation results may also help improve extraction accuracy. In this paper we propose a hybrid toponym extraction approach based on Hidden Markov Models (HMM) and Support Vector Machines (SVM). A Hidden Markov Model is used for extraction with high recall and low precision; an SVM is then used to filter out false positives, based on informativeness features and coherence features derived from the disambiguation results. Experiments on a set of descriptions of holiday homes, with the aim of extracting and disambiguating toponyms, showed that the proposed approach outperforms state-of-the-art extraction methods and also proved to be robust. Robustness is demonstrated in three respects: language independence, high and low HMM threshold settings, and limited training data.

28 September 2012 (Friday at 14.00h.)
Hans Wormer (Almere Data Capital)
Growing with Big Data
Hans Wormer is program manager of Almere Data Capital. The term Data Capital refers to the concentration of companies, services, knowledge and facilities that support the collection, storage, access, sharing, editing and visualization of big data. The program Almere Data Capital brings together supply and demand safely and efficiently. Hans Wormer will address: Almere's vision on the developments around Big Data; Almere's approach to stimulating new activities; the creation of new jobs for the city and region; and finally, how to get involved in Almere Data Capital.

1 October 2012 (Monday 13.30h. in ZI-3126)
Victor de Graaff (with an introduction from Djoerd Hiemstra)
The theory behind scrum
Djoerd will give a 6-minute-and-40-second “pecha kucha” introduction on the plans for the module “Data & Information” of the new Computer Science bachelor.
Victor will give a 45-minute presentation on the theory behind Scrum, an increasingly popular software development methodology. Scrum is an implementation of Agile development, and is based on the team's capability to plan and review its own work.

23 October 2012
Rezwan Huq
From Scripts Towards Provenance Inference
Scientists require provenance information either to validate their model or to investigate the origin of an unexpected value. However, they rarely maintain any provenance information, and even designing the processing workflow explicitly is rare in practice. Therefore, in this paper, we propose a solution that builds the workflow provenance graph by interpreting the scripts used for the actual processing. Scientists can then request fine-grained provenance information based on the inferred workflow provenance. We also provide a guideline to customize the workflow provenance graph based on user preferences. Our evaluation shows that the proposed approach is relevant and suitable for scientists to manage provenance.

12 November 2012 (Monday, 12.30h. in ZI-2126)
Iwe Muiser
Cleaning up and Standardizing a Folktale Corpus for Humanities Research
Recordings in the field of folk narrative have been made around the world for many decades. By digitizing and annotating these texts, they are frozen in time and become better suited for searching, sorting and research. This paper describes the first steps in standardizing and preparing digital folktale metadata for scientific use, improving the availability of the data for humanities research and, more specifically, folktale research. The Dutch Folktale Database has been used as a case study but, since these problems are common to all corpora with manually created metadata, the description of the process is kept as general as possible.

14 November 2012 (Wednesday, 13.45h. in CR-3E)
Thijs Westerveld (Teezir, Utrecht)
Analysing Online Sentiments: Big Data, Small Building Blocks
The term big data has become so mainstream that it is showing up in lists of most annoying management buzzwords. Teezir helps its customers to find value in big data beyond the hype. By collecting and analysing almost half a million documents on a daily basis and ordering, summarizing and aggregating the gathered information, we turn big data into valuable insights. To process a continuous stream of Tweets, Facebook updates, forum and blog posts, and online and offline news articles, we have developed a series of building blocks. In this talk I will discuss some of these, including our smart crawlers that learn which links to follow based on user interaction, our text analysis components that detect the language and sentiment of a document, and the index structures we use to quickly produce suitable aggregates in a faceted-search-like fashion. To conclude, I will give a demonstration of our analytics dashboards and show some examples of how our customers interact with this data and how they incorporate our technology in their daily process.

20 November 2012
Sergio Duarte
Query Recommendation for Children
In this work we propose a method that utilizes tags from social media to suggest queries related to children's topics. Concretely, we propose a simple yet effective approach that biases a random walk, defined on a bipartite graph of web resources and tags, through keywords that are more commonly used to describe resources for children. We evaluate our method using a large query-log sample of queries aimed at retrieving information for children. We show that our method outperforms the query suggestions of state-of-the-art search engines as well as state-of-the-art query suggestions based on random walks.
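A biased random walk with restart on such a bipartite graph can be sketched as follows. This is a generic illustration of the technique, not the paper's actual algorithm: the graph, the bias weights, and the parameter values are all assumptions.

```python
def biased_walk(edges, bias, steps=50, restart=0.15):
    # edges: dict node -> list of neighbours (bipartite: resources <-> tags)
    # bias: dict node -> weight steering the walk toward, e.g., tags commonly
    #       used for children's resources (nodes not listed get weight 1.0)
    nodes = list(edges)
    prob = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(steps):
        nxt = {n: restart / len(nodes) for n in nodes}  # restart mass
        for n, p in prob.items():
            neigh = edges[n]
            if not neigh:  # dangling node: keep its mass in place
                nxt[n] += (1 - restart) * p
                continue
            total = sum(bias.get(m, 1.0) for m in neigh)
            for m in neigh:  # distribute mass proportionally to the bias
                nxt[m] += (1 - restart) * p * bias.get(m, 1.0) / total
        prob = nxt
    return prob
```

Nodes with high stationary probability under the biased walk are the candidate suggestions; without the bias this reduces to an ordinary random walk with restart.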

27 November 2012
Mohammad Khelghati
Size Estimation of Non-Cooperative Data Collections
In this paper, approaches for estimating the size of non-cooperative databases and search engines are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on modifications of the available techniques are introduced and evaluated. One of these modifications improves the estimations of other approaches by 35 to 65 percent.
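One classical family of techniques in this area is capture-recapture estimation, sketched below; note this is a textbook illustration of the idea, not necessarily one of the paper's evaluated methods. Two independent random samples are drawn from the collection (e.g. via random queries), and the size is estimated from their overlap.

```python
def capture_recapture_estimate(sample_a, sample_b):
    # Lincoln-Petersen estimator: N ≈ |A| * |B| / |A ∩ B|
    # sample_a, sample_b: two independent samples of document identifiers
    overlap = len(set(sample_a) & set(sample_b))
    if overlap == 0:
        return None  # no overlap: the estimator is undefined
    return len(sample_a) * len(sample_b) / overlap
```

Intuitively, if two samples of 1000 documents each share 100 documents, roughly one tenth of the collection was captured per sample, so the collection holds about 10,000 documents. In practice search-engine samples are far from uniform, which is exactly why the modified estimators evaluated in the paper are needed.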

18 December 2012
Fabian Panse (University of Hamburg)
Indeterministic Handling of Uncertain Decisions in Deduplication
In this paper, we present an indeterministic approach to deduplication using a probabilistic target model, including techniques for a proper probabilistic interpretation of similarity matching results. Thus, instead of deciding for a most likely situation, all realistic situations are modeled in the resultant data. This approach minimizes the negative impact of false decisions. Furthermore, the deduplication process becomes almost fully automatic and human effort can be reduced to a large extent. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways. We also describe a full-indeterministic method for theoretical and presentational reasons.
Read more

See also: 2011's DB colloquium.

An open email to the rector

Published today in the UT Nieuws (in Dutch).
With a response (also at the UT Nieuws).

To: Prof. dr. Ed Brinksma
From: Dr. Djoerd Hiemstra
Date: Thu, 6 Dec 2012 10:32:04 +0100
Subject: TOM is like Windows without a browser

Dear prof. Brinksma, dear Ed,

I know that stories from Computer Science are dear to your heart, and you will undoubtedly remember the following one: in 1995, Microsoft launched the operating system Windows 95 without a web browser, and Bill Gates later admitted that the rise of the world wide web had taken him by surprise. But Microsoft was big and rich, and the competition too small to punish the mistake.

Your dies natalis address last Friday was inspiring. You also dwelt on the launch of a new product: TOM 2013, our Twente Educational Model (Twents Onderwijs Model). Just as for Microsoft in 1995, the world wide web has come to play an important role in university education, as you aptly pointed out in your address. Renowned universities such as Stanford and MIT offer their best lectures and course material on the web, for free, for everyone to follow in the online learning environments of, for instance, Coursera and Udacity. They issue certificates, and our students have already come to you with the question: “How many credits is this online Stanford certificate worth, mister rector?” I am curious what you told the students, because, like Bill Gates at Microsoft, you are our figurehead. How do we, as the University of Twente, deal with this? Where do we want to be in five years?

Your answer disappointed me. You said that the UT will focus on: “gaining experience, meeting people, personal contact, in short, creating a learning experience.” In other words, we will not focus on the internet, but precisely on those things that the internet cannot take away from the university, at least not yet. So the UT launches TOM 2013 without open course material, without an open learning environment, without support for us teachers to compete with our colleagues from Stanford: TOM is like Windows without a web browser. In five years we may still have 10,000 students, but our students will follow online lectures from other universities, do online assignments from other universities, and we teachers will create a learning experience with personal contact. Are those really our ambitions?

Where could we be in five years? In five years, millions of students all over the world could be taking the University of Twente's courses, with online lectures, online assignments, and online certificates from the UT. Some of our professors will be known to students all over the world. And if I ever apply to Stanford again, I want to proudly show them my open, online TOM course material.

In this email I ask: please let us know that the rise of online courses has taken you by surprise. Perhaps Bill Gates can be an inspiration here, because without open, online courses, the University of Twente will forever be overtaken by our international competitors. Let us take an ambitious step towards challenging, contemporary and *open* education in TOM 2013. I, and colleagues and students of our university, are more than willing to help.

Kind regards,
Djoerd Hiemstra
(Member of the Young Academy of the University of Twente)

Conditional Random Fields on Steroids

I have never been more excited about a paper that I contributed to! In this technical report, Zhemin Zhu introduces a new theory for factorizing undirected graphical models, with astonishing results: the training time for conditional random fields drops from weeks to seconds on a part-of-speech tagging task. Reducing the training time from weeks to seconds is like approaching the moon to within a distance of about 100 meters, or buying a Ferrari F12 for 10 cents!

Separate Training for Conditional Random Fields Using Co-occurrence Rate Factorization

by Zhemin Zhu, Djoerd Hiemstra, Peter Apers, and Andreas Wombacher

The standard training method of Conditional Random Fields (CRFs) is very slow for large-scale applications. In this paper, we present separate training for undirected models based on the novel Co-occurrence Rate Factorization (CR-F). Separate training is a local training method. In contrast to piecewise training, separate training is exact. In contrast to MEMMs, separate training is unaffected by the label bias problem. Experiments show that separate training (i) is unaffected by the label bias problem; (ii) reduces the training time from weeks to seconds; and (iii) obtains results competitive with standard and piecewise training on linear-chain CRFs.

[download pdf]

ECIR Workshop on Group Membership and Search

From Republicans to Teenagers

We are organizing a workshop at ECIR’13 in Moscow that takes a group-centric approach to Information Retrieval and invites contributions that either (i) propose and evaluate IR systems for a particular user group, or (ii) describe how the search behavior of specific groups differs, potentially requiring a different way of addressing their needs.

Papers deadline: 18 January 2013
Workshop: 24 March 2013

More information at the Group Membership and Search site.

Participate in the Dutch Common Crawl Challenge

What can you do with 6 billion webpages?

Together with Common Crawl and SARA, we invite students and researchers studying at or employed by research institutes or universities in the Netherlands to dive into the Common Crawl web corpus using the SARA Hadoop service. The best submission will receive the Norvig Web Data Science Award, a tablet, and 1500 Euro to spend on travel, accommodation, and the conference registration fee for SIGIR 2013, to be held in Dublin, Ireland.

The award is named after Peter Norvig, Google's director of research, whose resume is too impressive to summarize. Peter is on the advisory board of Common Crawl, and chairs the jury for this award. The other jury members are Ricardo Baeza-Yates (Yahoo!), Hilary Mason (bit.ly), Jimmy Lin (University of Maryland), and Evert Lammerts (SARA).

Find out more at the Norvig Award page at Github, the Common Crawl Blog, or come to the Inter-Actief Challenges Information Lunch on 22 November at 12.30h. in Absint.

Join TREC FedWeb’13

FedWeb '13 is the new TREC (Text Retrieval Conference) Federated Web Search track, which will provide a test collection that organizes and stimulates research in many areas related to federated search, including aggregated search, distributed search, peer-to-peer search and meta-search engines. The track will evaluate federated and aggregated search in a large heterogeneous setting using the search results of existing search engines.

Join the mailing list to keep up to date with FedWeb'13.

Assigning reviewers to papers

Multi-Aspect Group Formation using Facility Location Analysis

by Mahmood Neshati, Hamid Beigy, and Djoerd Hiemstra

In this paper, we propose an optimization framework to retrieve an optimal group of experts to perform a given multi-aspect task/project. Each task requires a diverse set of skills, and the group of assigned experts should be able to collectively cover all required aspects of the task. We consider three types of multi-aspect team formation problems and propose a unified framework to solve these problems accurately and efficiently. Our framework is based on Facility Location Analysis, a well-known branch of Operations Research. Our experiments on a real dataset show significant improvement over state-of-the-art approaches to the team formation problem.
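The flavor of the problem can be illustrated with a greedy skill-cover sketch. To be clear, this is a simplified stand-in for intuition only; the paper's actual method is an optimization based on Facility Location Analysis, not this greedy heuristic, and all names below are invented for illustration.

```python
def greedy_team(experts, required_skills):
    # experts: dict name -> set of skills
    # Repeatedly pick the expert covering the most still-uncovered skills,
    # so the team collectively covers all aspects of the task.
    uncovered = set(required_skills)
    team = []
    while uncovered:
        best = max(experts, key=lambda e: len(experts[e] & uncovered))
        gained = experts[best] & uncovered
        if not gained:
            break  # remaining skills cannot be covered by anyone
        team.append(best)
        uncovered -= gained
    return team, uncovered
```

A facility-location formulation additionally accounts for costs and the interplay between aspects, which is where the accuracy gains over such simple heuristics come from.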

The paper will be presented at the 17th Australasian Document Computing Symposium (ADCS 2012) at the University of Otago, Dunedin, New Zealand, on 5 and 6 December 2012.

[download pdf]

Welcome to the Big Data course

Welcome to the new course Managing Big Data. We will closely follow developments in managing huge amounts of data on large clusters of commodity machines, initiated by Google and followed by many other web companies such as Yahoo, Amazon, AOL, Facebook, Hyves, Spotify, and Twitter. Big data gives rise to a redesign of many core computer science concepts: we will discuss file systems (Google FS), programming paradigms (MapReduce), programming and query languages (for instance Sawzall and Pig Latin), and 'noSQL' database paradigms (for instance BigTable and Dynamo) for managing big data. The first lecture is next Friday, 16 November, at 10.45h. in RA 2502.
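To give a taste of the MapReduce paradigm covered in the course, here is the classic word-count example as a single-machine sketch; a real MapReduce framework runs the same three phases distributed over a cluster, and the helper names below are our own.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework would
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one word
    return key, sum(values)

def word_count(documents):
    pairs = chain.from_iterable(map_phase(d) for d in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

For example, `word_count(["Big data", "big clusters"])` yields `{"big": 2, "data": 1, "clusters": 1}`: the programmer writes only the map and reduce functions, while the framework handles partitioning, shuffling, and fault tolerance across the cluster.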

More information on Blackboard. (Access restricted; sorry, our university does not like me to share courses 🙁 )