• Wednesday, October 19th, 2011

The WODC (the Research and Documentation Centre of the Dutch Ministry of Security and Justice) and the Trimbos Instituut are looking for a graduate student for the research project below, on estimating the size of phenomena such as “crime” or “drug use”.
Size estimation of “dark numbers”
WODC and Trimbos Instituut
Phenomena whose size cannot be derived directly from registrations are also referred to as “dark numbers”. The difference between the registered and the actual volume of crime is an example of a dark number. A substantial part of all crime never comes to the attention of the police and is therefore not registered. Nevertheless, policy makers need insight into the total volume of crime. Various statistical techniques have therefore been introduced for estimating dark numbers. Two methods known from the literature are the so-called “Capture Recapture” (CRC) method and the “Treatment Multiplier” (TM) method. Both techniques are based on set theory and probability theory. It can be shown theoretically that the TM method is a special case of the CRC method. However, when we test the methods empirically on drug-related data from police records, addiction care records, and survey research, the two methods yield different outcomes. We suspect that this has to do with how the methods deal with violations of the assumptions on which they are based.
The goal of the graduation assignment is to show empirically that TM is a special case of CRC. The methods are to be applied to simulated data that satisfy the (four) assumptions of both methods, and to real-world data for which it is not evident which assumptions are and are not met.
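To make the relation between the two methods concrete, here is a minimal sketch in Python. It assumes the simplest textbook forms: the two-list Lincoln-Petersen estimator for CRC (N = n1 · n2 / m) and a multiplier estimate of the form N = T / p for TM, where T is the number of people in treatment and p is the fraction of the population in treatment. The function names, the population size, and the inclusion probabilities are illustrative assumptions and are not taken from the WODC/Trimbos data. The simulated data satisfy the independence assumption by construction.

```python
import random


def lincoln_petersen(n1, n2, m):
    """Two-list capture-recapture estimate of population size: n1 * n2 / m."""
    if m == 0:
        raise ValueError("the two lists do not overlap; estimate is undefined")
    return n1 * n2 / m


def treatment_multiplier(treated, in_treatment_fraction):
    """Multiplier estimate: number in treatment divided by the treated fraction."""
    if not 0 < in_treatment_fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return treated / in_treatment_fraction


def simulate(population_size, p_treatment, p_survey, seed=0):
    """Draw two independent lists from a closed population (hypothetical setup).

    Returns (n1, n2, m): size of the treatment list, size of the survey
    sample, and the number of people who appear in both.
    """
    rng = random.Random(seed)
    in_treatment = [rng.random() < p_treatment for _ in range(population_size)]
    in_survey = [rng.random() < p_survey for _ in range(population_size)]
    n1 = sum(in_treatment)
    n2 = sum(in_survey)
    m = sum(t and s for t, s in zip(in_treatment, in_survey))
    return n1, n2, m


# Assumed parameters, chosen only for illustration.
n1, n2, m = simulate(population_size=10_000, p_treatment=0.2, p_survey=0.1, seed=42)

crc = lincoln_petersen(n1, n2, m)
# If the treated fraction is estimated from the same survey sample (m / n2),
# the multiplier estimate reduces algebraically to the CRC estimate.
tm = treatment_multiplier(n1, m / n2)

print(f"CRC estimate: {crc:.0f}")
print(f"TM estimate:  {tm:.0f}")
```

In this setting the two estimates coincide up to floating-point rounding, because n1 / (m / n2) = n1 · n2 / m. Differences can arise when the treated fraction comes from a different source than the second list, or when the lists are not independent.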

• Monday, September 05th, 2011

I wrote a position paper about a different approach to developing information extractors, which I call Sherlock Holmes-style after his famous quote “when you have eliminated the impossible, whatever remains, however improbable, must be the truth”. The idea is that we treat annotations as fundamentally uncertain. We start from a “no knowledge” position in which everything is possible, and then interactively add knowledge, applying it directly to the annotation state by removing annotations that have become impossible and recalculating the probabilities of the remaining ones. For example, “Paris Hilton”, “Paris”, and “Hilton” can all be interpreted as a City, Hotel or Person name. But adding knowledge like “If a phrase is interpreted as a Person Name, then its subphrases should not be interpreted as a City” makes the annotations <"Paris Hilton":Person Name> and <"Paris":City> mutually exclusive. Observe that initially all annotations were independent, and these two are now dependent. We argue in the paper that the main challenge in this approach lies in efficient storage and conditioning of probabilistic dependencies, because trivial approaches do not work.
Handling Uncertainty in Information Extraction.
Maurice van Keulen, Mena Badieh Habib
This position paper proposes an interactive approach for developing information extractors based on the ontology definition process with knowledge about possible (in)correctness of annotations. We discuss the problem of managing and manipulating probabilistic dependencies.
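The conditioning step in the “Paris Hilton” example can be illustrated with a small brute-force sketch in Python. This is not the method from the paper: it enumerates all possible worlds explicitly, which is exactly the kind of trivial approach that stops working as the number of annotations grows, since n candidate annotations give 2^n worlds. The prior probabilities of 0.5 and all function names are assumptions made for illustration.

```python
from itertools import product


def possible_worlds(candidates):
    """Enumerate all worlds over independent candidate annotations.

    candidates maps an annotation to its prior probability of being correct.
    Returns a list of (set of annotations that hold, probability of the world).
    """
    names = list(candidates)
    worlds = []
    for truth in product([True, False], repeat=len(names)):
        prob = 1.0
        for name, holds in zip(names, truth):
            prob *= candidates[name] if holds else 1.0 - candidates[name]
        worlds.append((frozenset(n for n, h in zip(names, truth) if h), prob))
    return worlds


def condition(worlds, rule):
    """Eliminate the worlds that violate the rule and renormalise the rest."""
    kept = [(world, prob) for world, prob in worlds if rule(world)]
    total = sum(prob for _, prob in kept)
    if total == 0:
        raise ValueError("the rule eliminates every possible world")
    return [(world, prob / total) for world, prob in kept]


def marginal(worlds, annotation):
    """Probability that the given annotation holds."""
    return sum(prob for world, prob in worlds if annotation in world)


PERSON = ("Paris Hilton", "Person Name")
CITY = ("Paris", "City")

# Assumed "no knowledge" priors: each annotation is equally likely to hold or not.
worlds = possible_worlds({PERSON: 0.5, CITY: 0.5})


def no_city_inside_person(world):
    """Rule from the example: a Person Name's subphrase is not a City."""
    return not (PERSON in world and CITY in world)


conditioned = condition(worlds, no_city_inside_person)

print(marginal(worlds, PERSON), marginal(conditioned, PERSON))
print(marginal(worlds, CITY), marginal(conditioned, CITY))
```

Before conditioning, the two annotations are independent. Afterwards the world containing both has been eliminated, so their joint probability is zero while each marginal is still positive, which makes them dependent.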

The paper will be presented at the URSW workshop co-located with ISWC 2011, 23 October 2011, Bonn, Germany [details]

• Thursday, September 01st, 2011

A master student performed a problem exploration for the PayDIBI project. This is the report he wrote.
Integration of Biological Sources – Exploring the Case of Protein Homology
Tjeerd W. Boerman, Maurice van Keulen, Paul van der Vet, Edouard I. Severing (Wageningen University)
Data integration is a key issue in the domain of bioinformatics, which deals with huge amounts of heterogeneous biological data that grows and changes rapidly. This paper serves as an introduction in the field of bioinformatics and the biological concepts it deals with, and an exploration of the integration problems a bioinformatics scientist faces. We examine ProGMap, an integrated protein homology system used by bioinformatics scientists at Wageningen University, and several use cases related to protein homology. A key issue we identify is the huge manual effort required to unify source databases into a single resource. Uncertain databases are able to contain several possible worlds, and it has been proposed that they can be used to significantly reduce initial integration efforts. We propose several directions for future work where uncertain databases can be applied to bioinformatics, with the goal of furthering the cause of bioinformatics integration.
[details]

• Monday, August 22nd, 2011

One of my PhD students, Mena Badieh Habib, and I submitted a paper about improving the effectiveness of named entity extraction (NEE) with what we call “the reinforcement effect” to the MUD workshop of VLDB2011.
Named Entity Extraction and Disambiguation: The Reinforcement Effect.
Mena Badieh Habib, Maurice van Keulen
Named entity extraction and disambiguation have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and semantic web. Although these topics are highly dependent, almost no existing works examine this dependency. It is the aim of this paper to examine the dependency and show how one affects the other, and vice versa. We conducted experiments with a set of descriptions of holiday homes with the aim to extract and disambiguate toponyms as a representative example of named entities. We experimented with three approaches for disambiguation with the purpose to infer the country of the holiday home. We examined how the effectiveness of extraction influences the effectiveness of disambiguation, and reciprocally, how filtering out ambiguous names (an activity that depends on the disambiguation process) improves the effectiveness of extraction. Since this, in turn, may improve the effectiveness of disambiguation again, it shows that extraction and disambiguation may reinforce each other.

The paper will be presented at the MUD workshop co-located with VLDB 2011, 29 August 2011, Seattle, USA [details]

• Monday, January 24th, 2011

I have a vacancy for a PhD position in a project called “Pay-As-You-Go Data Integration for Bio-Informatics” (PayDIBI). In short, the objective is to develop data coupling and integration technology to support bio-informatics scientists in quickly constructing targeted data sets for researching questions that require the combination of information from more than one biological database. More information and a webform to apply can be found here.

• Thursday, January 20th, 2011

The Bedrijfsinformatietechnologie (Business Information Technology, BIT) programme at the University of Twente is called a “real high-flyer” by the Keuzegids Hoger Onderwijs Universiteiten 2011. In a comparison of all “Informatiekunde” (information science) programmes in the Netherlands, BIT receives a total score of 82, head and shoulders above number 2, Informatiekunde in Groningen, with 74 points. See the article in the weekly newspaper.
I have strong ties with BIT: I am a member of the BIT programme committee, which advises on the curriculum and other matters; I am also active in informing prospective students about BIT; and I teach BIT courses and supervise BIT students.

• Friday, January 07th, 2011

On 7 January 2011, Eelco Eerenberg defended his MSc thesis “Towards Distributed Information Retrieval based on Economic Models”. The MSc project was supervised by Djoerd Hiemstra, Kien Tjin-Kam-Jet, and me.
“Towards Distributed Information Retrieval based on Economic Models” [download]
The aim of this research is to build a successful distributed information retrieval system based on an economic model, allowing servers to open up their part of the deep web. This research consists of three parts: 1) selecting suitable economic models, 2) simulating these models, and 3) performing a real-world test. We found the models of Vickrey auction and bond redistribution to be the most suitable ones. These models behaved well in our simulation and both outperformed a naive comparison model. The Vickrey auction model performed best in a scenario that mostly resembles the Internet. On average 69% of all models with a strong correlation between the economic outcomes and the performance of information retrieval (Kendall’s-τ > 0.6) is a Vickrey auction model. In the real-world test we show that users appreciate both the use and administration of an information retrieval system based on an economic model. Furthermore, if we apply a perfect categorization, the economic model outperforms the comparison engine with a 66% increase in performance.
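For readers unfamiliar with the Vickrey auction mentioned in the abstract, the sketch below shows the generic sealed-bid second-price rule in Python: the highest bidder wins but pays the second-highest bid. How the thesis maps this onto servers and queries is not described here, so the server names, the bid values, and the function name are purely hypothetical.

```python
def vickrey_auction(bids):
    """Sealed-bid second-price auction.

    bids maps a bidder name to its bid. Returns (winner, price), where the
    winner is the highest bidder and the price is the second-highest bid.
    Ties are broken in favour of the bidder listed first.
    """
    if len(bids) < 2:
        raise ValueError("a second-price auction needs at least two bids")
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


# Hypothetical bids from three servers competing to answer a query.
winner, price = vickrey_auction({"server_a": 0.8, "server_b": 0.5, "server_c": 0.3})
print(winner, price)  # server_a 0.5
```

Because the winner's payment does not depend on its own bid, bidding one's true valuation is a dominant strategy, which is the usual reason for choosing this auction format.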

• Wednesday, December 22nd, 2010

One of my PhD students, Mena Badieh Habib, submitted a paper with his research plans in the Neogeography project to the PhD workshop of ICDE2011.
Neogeography: The Challenge of Channelling Large and Ill-Behaved Data Streams
Mena Badieh Habib
Neogeography is the combination of user generated data and experiences with mapping technologies. In this paper we propose a research project to extract valuable structured information with a geographic component from unstructured user generated text in wikis, forums, or SMSes. The project intends to help workers communities in developing countries to share their knowledge, providing a simple and cheap way to contribute and get benefit using the available communication technology.

The paper will be presented at the PhD workshop co-located with ICDE 2011, 11 April 2011, Hannover, Germany [details]

• Tuesday, December 14th, 2010

On November 16th (1st defense) and December 14th (public 2nd defense) Antoon Bronselaer of the University of Ghent defended his PhD thesis “Coreferentie van atomaire en complexe objecten” (Co-reference of atomic and complex objects). I was a member of his PhD committee. His research provides a well-founded and well-validated possibilistic framework for co-reference of objects, also known as entity resolution or record linkage. The possibilistic approach is especially suited for circumstances where data and its origins are incomplete and uncertain.
[Antoon Bronselaer, publications, Vakgroep Telecommunicatie en Informatieverwerking, University of Ghent, Belgium]

• Tuesday, December 07th, 2010

On 7 December I introduced the second education seminar, on “Publishing by students”, a topic I had proposed myself. Together with Nelly Litvak (TW/SOR) and Raymond Veldhuis (EL/SAS), I also sat on a panel for a forum discussion on this topic, chaired by the programme director of EL, Wouter Olthuis. It was a lively discussion in which a total of 25 lecturers took part.
[report and slides]