Tag-Archive for » uncertainty in data «

Monday, June 03rd, 2013

Two Master's students of mine, Jasper Kuperus and Jop Hofsté, each have a paper at FORTAN 2013, co-located with EISIC 2013.
Increasing NER recall with minimal precision loss
Jasper Kuperus, Maurice van Keulen, and Cor Veenman
Named Entity Recognition (NER) is broadly used as a first step toward the interpretation of text documents. However, for many applications, such as forensic investigation, recall is currently inadequate, leading to loss of potentially important information. Entity class ambiguity cannot be resolved reliably due to the lack of context information or the exploitation thereof. Consequently, entity classification introduces too many errors, leading to severe omissions in answers to forensic queries.
We propose a technique based on multiple candidate labels, effectively postponing decisions for entity classification to query time. Entity resolution exploits user feedback: a user is only asked for feedback on entities relevant to his/her query. Moreover, giving feedback can be stopped at any time when query results are considered good enough. We propose several interaction strategies that obtain increased recall with little loss in precision. [details]
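The abstract stays at the conceptual level; as a rough illustration of what multiple candidate labels with query-time feedback could look like, here is a minimal sketch in Python. The data structures, probabilities and feedback step are my own assumptions, not the authors' implementation.

# Rough sketch (my own assumptions, not the authors' implementation): keep all
# candidate entity classes per mention instead of committing at extraction time.
from dataclasses import dataclass, field

@dataclass
class Mention:
    text: str
    candidates: dict = field(default_factory=dict)  # class -> probability

def query(mentions, wanted_class, threshold=0.0):
    # Favour recall: return every mention that may still belong to wanted_class.
    return [m for m in mentions if m.candidates.get(wanted_class, 0.0) > threshold]

def apply_feedback(mention, rejected_class):
    # The user rejects one reading; remove it and renormalise the rest.
    mention.candidates.pop(rejected_class, None)
    total = sum(mention.candidates.values())
    if total > 0:
        mention.candidates = {c: p / total for c, p in mention.candidates.items()}

m = Mention("Paris", {"City": 0.5, "Person": 0.3, "Hotel": 0.2})
hits = query([m], "Person")   # still returned: classification is postponed to query time
apply_feedback(m, "Person")   # feedback only on mentions relevant to the query
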
Digital-forensics based pattern recognition for discovering identities in electronic evidence
Hans Henseler, Jop Hofsté, and Maurice van Keulen
With the pervasiveness of computers and mobile devices, digital forensics becomes more important in law enforcement. Detectives increasingly depend on the scarce support of digital specialists, which impedes the efficiency of criminal investigations. This paper proposes an algorithm to extract, merge and rank identities that are encountered in the electronic evidence during processing. Two experiments are described demonstrating that our approach can assist with the identification of frequently occurring identities so that investigators can prioritize the investigation of evidence units accordingly. [details]
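Again purely as an illustration: the sketch below merges extracted identity traces on a shared e-mail address and ranks the merged identities by the number of evidence units they occur in. Both the merge key and the ranking criterion are my own assumptions, not the algorithm from the paper.

# Rough sketch (assumed merge key and ranking criterion, not the published algorithm).
from collections import defaultdict

def merge_and_rank(traces):
    # traces: dicts like {"name": ..., "email": ..., "evidence_unit": ...}
    merged = defaultdict(lambda: {"names": set(), "units": set()})
    for t in traces:
        key = t["email"].lower()               # merge identities sharing an e-mail address
        merged[key]["names"].add(t["name"])
        merged[key]["units"].add(t["evidence_unit"])
    # rank by how many evidence units mention the identity
    return sorted(merged.items(), key=lambda kv: len(kv[1]["units"]), reverse=True)

ranked = merge_and_rank([
    {"name": "J. Doe",   "email": "jdoe@example.com", "evidence_unit": "laptop-01"},
    {"name": "John Doe", "email": "JDoe@example.com", "evidence_unit": "phone-03"},
])
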
Both papers will be presented at the FORTAN 2013 workshop, 12 Aug 2013, Uppsala, Sweden.

Wednesday, February 27th, 2013

Tomorrow, 28 Feb 2013, a PhD student of mine, Victor de Graaff, will give a presentation on how to estimate the boundaries of objects for which you only have a point location and other public data such as OpenStreetMap [Announcement].
Point of Interest to Region of Interest Conversion
Date/Time: Thursday, February 28, 2013 – 13:30 to 14:30; Room: 0-142
GPS trajectories from a mobile device, such as a smartphone, indirectly contain a vast amount of information on the interests of the owner of the device. Collections of GPS trajectories even provide insight into the popularity of locations, and the time spent at those locations. To obtain this information, the visited places on such a trajectory need to be recognized. However, the location information on a point of interest (POI) in a database is normally limited to an address and a GPS coordinate, rather than a geometry describing its boundaries. To create a match with a GPS trajectory, a two-dimensional shape representing this place, a region of interest (ROI), is needed. In the absence of expensive and hard-to-obtain detailed spatial data like cadastral data, we need to estimate this ROI. In this research project, we bridge this gap by presenting several approaches to estimate the size and shape of the ROI, and validate these estimations against the cadastral data of the city of Enschede, The Netherlands.
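To give an intuition for the problem, here is a naive baseline sketch that approximates an ROI as a fixed-radius circular buffer around the POI coordinate. The radius and the circular shape are my own illustrative assumptions, not one of the approaches presented in the talk.

# Naive baseline sketch (my assumption, not one of the presented approaches):
# approximate the ROI as a circular buffer of fixed radius around the POI.
import math

def circular_roi(lat, lon, radius_m=40.0, n_points=36):
    # metres per degree, approximated locally around the POI latitude
    m_per_deg_lat = 111_320.0
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(lat))
    polygon = []
    for i in range(n_points):
        angle = 2 * math.pi * i / n_points
        polygon.append((lat + radius_m * math.sin(angle) / m_per_deg_lat,
                        lon + radius_m * math.cos(angle) / m_per_deg_lon))
    return polygon

roi = circular_roi(52.2215, 6.8937)   # illustrative coordinate in Enschede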

Monday, December 10th, 2012

Brend Wanders, a PhD student of mine, presents a poster at the BeNeLux Bioinformatics Conference (BBC 2012) in Nijmegen.
Pay-as-you-go data integration for bio-informatics
Brend Wanders
Background: Scientific research in bio-informatics is often data-driven and supported by biological databases. In a growing number of research projects, researchers would like to ask questions that require the combination of information from more than one database. Most bio-informatics papers do not detail the integration of different databases. As roughly 30% of all tasks in workflows are data transformation tasks, database integration is an important issue. Integrating multiple data sources can be difficult: as data sources are created, many design decisions are made by their creators.
Methods: Our research is guided by two use cases: homologues, the representation and integration of groupings; and metabolomics integration, with a focus on the TCA cycle.
Results: We propose to approach the time-consuming problem of integrating multiple biological databases through the principles of ‘pay-as-you-go’ and ‘good-is-good-enough’. By assisting the user in defining a knowledge base of data mapping rules, trust information and other evidence, we allow the user to focus on the work and put in as little effort as is necessary for the integration. Through user feedback on query results and trust assessments, the integration can be improved over time.
Conclusions: We conclude that this direction of research is worthy of further exploration. [details]

Wednesday, November 21st, 2012

Brend Wanders, a PhD student of mine, presents his research at the Dutch-Belgian Database Day (DBDBD 2012) in Brussels.
Pay-as-you-go data integration for bio-informatics
Brend Wanders
Scientific research in bio-informatics is often data-driven and supported by numerous biological databases. A biological database contains factual information collected from scientific experiments and computational analyses about areas including genomics, proteomics, metabolomics, microarray gene expression and phylogenetics. Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations as well as similarities of biological sequences and structures.
In a growing number of research projects, bio-informatics researchers would like to ask combined questions, i.e., questions that require the combination of information from more than one database. We have observed that most bio-informatics papers do not go into detail on the integration of different databases. Since roughly 30% of all tasks in bio-informatics workflows are data transformation tasks, a lot of time is spent integrating these databases [1].
As data sources are created and evolve, many design decisions are made by their creators. Not all of these choices are documented. Some of these choices are made implicitly, based on the experience or preference of the creator. Other choices are mandated by the purpose of the data source, or by inherent data quality issues such as imprecision in measurements or ongoing scientific debates. Integrating multiple data sources can therefore be difficult.
We propose to approach the time-consuming problem of integrating multiple biological databases through the principles of ‘pay-as-you-go’ and ‘good-is-good-enough’. By assisting the user in defining a knowledge base of data mapping rules, schema alignment, trust information and other evidence, we allow the user to focus on the work and put in as little effort as is necessary for the integration to serve the purposes of the user. Through user feedback on query results and trust assessments, the integration can be improved over time.
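
To make the idea a bit more concrete, here is a minimal sketch of what a knowledge base of mapping rules with trust scores and query-time feedback could look like. The rule representation, the attribute names, the trust values and the update step are my own assumptions, not the proposed system.

# Rough sketch (assumed representation, not the proposed system): mapping rules
# between two databases carry trust scores that are refined pay-as-you-go from
# user feedback on query results.
from dataclasses import dataclass

@dataclass
class MappingRule:
    source_attr: str        # e.g. "db_a.gene_symbol"   (illustrative attribute names)
    target_attr: str        # e.g. "db_b.gene_name"
    trust: float = 0.5      # prior trust, refined over time

    def feedback(self, correct: bool, step: float = 0.1):
        # nudge trust up or down based on the user's judgement of a query result
        self.trust = min(1.0, self.trust + step) if correct else max(0.0, self.trust - step)

knowledge_base = [
    MappingRule("db_a.gene_symbol", "db_b.gene_name", trust=0.6),
    MappingRule("db_a.ec_number",   "db_b.enzyme_id", trust=0.4),
]

def usable_rules(kb, min_trust=0.5):
    # "good is good enough": only apply mappings the user currently trusts
    return [r for r in kb if r.trust >= min_trust]

knowledge_base[1].feedback(correct=True)   # feedback gradually improves the integration
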
The research will be guided by a set of use cases. As the research is in its early stages, we have determined three use cases:

  • Homologues, the representation and integration of groupings. Homology is the relationship between two characteristics that have descended, usually with divergence, from a common ancestral characteristic. A characteristic can be any genic, structural or behavioural feature of an organism.
  • Metabolomics integration, with a focus on the TCA cycle. The TCA cycle (also known as the citric acid cycle or Krebs cycle) is used by aerobic organisms to generate energy from the oxidation of carbohydrates, fats and proteins.
  • Bibliography integration and improvement, the correction and expansion of citation databases.

[1] I. Wassink. Work flows in life science. PhD thesis, University of Twente, Enschede, January 2010. [details]

Monday, September 05th, 2011

I wrote a position paper about a different approach towards the development of information extractors, which I call Sherlock Holmes-style after his famous quote “when you have eliminated the impossible, whatever remains, however improbable, must be the truth”. The idea is that we fundamentally treat annotations as uncertain. We even start from a “no knowledge”, i.e., “everything is possible” state, and then interactively add knowledge, applying it directly to the annotation state by removing annotations that the knowledge rules out and recalculating the probabilities of the remaining ones. For example, “Paris Hilton”, “Paris”, and “Hilton” can all be interpreted as a City, Hotel or Person name. But adding knowledge like “If a phrase is interpreted as a Person Name, then its subphrases should not be interpreted as a City” makes the annotations <"Paris Hilton":Person Name> and <"Paris":City> mutually exclusive. Observe that initially all annotations were independent, whereas these two are now dependent. We argue in the paper that the main challenge in this approach lies in efficient storage and conditioning of probabilistic dependencies, because trivial approaches do not work.
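To make the “eliminate the impossible” step concrete, here is a minimal sketch under my own assumptions about how annotations and knowledge rules could be represented; it is not the formalism from the paper, which is precisely about handling the probabilistic dependencies far more efficiently than enumerating possible worlds.

# Minimal sketch (assumed representation, not the paper's formalism): start from
# "everything is possible", let a knowledge rule eliminate worlds, renormalise.
from itertools import product

annotations = {                      # candidate interpretations per phrase
    "Paris Hilton": ["Person", "Hotel", "City"],
    "Paris":        ["Person", "Hotel", "City"],
}

def possible_worlds(annotations):
    # every combination of interpretations is initially possible
    phrases = list(annotations)
    return [dict(zip(phrases, combo)) for combo in product(*annotations.values())]

def apply_rule(worlds, rule):
    # keep only the worlds that the added knowledge does not eliminate
    return [w for w in worlds if rule(w)]

# knowledge: if the phrase is a Person Name, its subphrase must not be a City
rule = lambda w: not (w["Paris Hilton"] == "Person" and w["Paris"] == "City")

worlds = apply_rule(possible_worlds(annotations), rule)
probability = {i: 1 / len(worlds) for i, _ in enumerate(worlds)}   # uniform prior assumed
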
Handling Uncertainty in Information Extraction.
Maurice van Keulen, Mena Badieh Habib
This position paper proposes an interactive approach for developing information extractors based on the ontology definition process with knowledge about possible (in)correctness of annotations. We discuss the problem of managing and manipulating probabilistic dependencies.

The paper will be presented at the URSW workshop co-located with ISWC 2011, 23 October 2011, Bonn, Germany [details]

Thursday, October 21st, 2010

Fabian Panse from the University of Hamburg in Germany just launched a website about our cooperation on the topic of “Quality of Uncertain Data (QloUD)”.

Friday, July 30th, 2010

For his “Research Topic” course, MSc student Emiel Hollander experimented with a mapping from Probabilistic XML to the probabilistic relational database Trio to investigate whether or not it is feasible to use Trio as a back-end for processing XPath queries on Probabilistic XML.
Storing and Querying Probabilistic XML Using a Probabilistic Relational DBMS
Emiel Hollander, Maurice van Keulen
This work explores the feasibility of storing and querying probabilistic XML in a probabilistic relational database. Our approach is to adapt known techniques for mapping XML to relational data such that the possible worlds are preserved. We show that this approach can work for any XML-to-relational technique by adapting a representative schema-based (inlining) as well as a representative schemaless technique (XPath Accelerator). We investigate the maturity of probabilistic relational databases for this task with experiments with one of the state-of-the-art systems, called Trio.
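The abstract describes the mapping only at a high level; to illustrate the underlying idea of preserving possible worlds when shredding XML into relations, the sketch below stores every node as a tuple extended with an alternative-set identifier and a confidence. The table layout and column names are my own assumptions, and no Trio-specific syntax is used.

# Rough sketch (assumed layout, not the mapping from the paper): shred a tiny
# probabilistic XML fragment into tuples that still encode all possible worlds.
# Document: <person><name> is either "J. Smith" (p=0.7) or "John Smith" (p=0.3).
from itertools import product

# (node_id, parent_id, label, alt_set, probability); alt_set groups mutually
# exclusive tuples, alt_set None means the node is certain.
node_table = [
    (1, None, "person",     None, 1.0),
    (2, 1,    "name",       None, 1.0),
    (3, 2,    "J. Smith",   "a1", 0.7),
    (4, 2,    "John Smith", "a1", 0.3),
]

def possible_worlds(table):
    # combine the certain tuples with exactly one choice from each alternative set
    certain = [t for t in table if t[3] is None]
    alt_sets = {}
    for t in table:
        if t[3] is not None:
            alt_sets.setdefault(t[3], []).append(t)
    worlds = []
    for combo in product(*alt_sets.values()):
        prob = 1.0
        for t in combo:
            prob *= t[4]
        worlds.append((certain + list(combo), prob))
    return worlds

for world, prob in possible_worlds(node_table):
    print(prob, [t[2] for t in world])
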

The paper will be presented at the 4th International Workshop on Management of Uncertain Data (MUD 2010) co-located with VLDB, 13 September 2010, Singapore [details]

Friday, June 18th, 2010

I have written an article about “uncertain databases” for DB/M Database Magazine of Array Publications. It appears in issue 4, the June issue, which is in stores now. The theme of this special issue is “Datakwaliteit” (data quality).
Onzekere databases (Uncertain databases)
A recent development in database research concerns so-called ‘uncertain databases’. This article describes what uncertain databases are, how they can be used, and which applications in particular could benefit from this technology [details].

Tuesday, April 27th, 2010

I have written an article about “uncertain databases” for Database Magazine of Array Publications. It will appear in issue 4, a special issue on “Datakwaliteit” (data quality).

Friday, March 12th, 2010

To improve the integration of the new faculty ITC (Geo-Information Science and Earth Observation) into the university, the boards of directors of ITC and UT decided some time ago to subsidize several cooperation projects, each with two PhD students, one at ITC and one at the UT. I am involved in one: “Neogeography: the challenge of channelling large and ill-behaved data streams” (see description below). Rolf de By (ITC) and I presented our Neogeography project at the kick-off meeting on 12 March 2010 [presentation]. Rolf’s PhD student is Clarisse Kagoyire and she arrived in The Netherlands just in time to make it to the meeting. My PhD student is Mena Badieh Habib; he will start 1 May 2010.

Neogeography: the challenge of channelling large and ill-behaved data streams
In this project, we develop XML-based data technology to support the channeling of large and ill-behaved neogeographic data streams. In neogeography, geographic information is derived from end-users, not from official bodies like mapping agencies, cadasters or other official, (para-)governmental organizations. The motivation is that multiple (neo)geographic information sources on the same phenomenon can be mutually enriching.
Content provision and feedback from large communities of end-users has great potential for sustaining a high level of data quality. The technology is meant to reach a substantial user community in the less-developed world through content provision and delivery via cell phone networks. Exploiting such neogeographic data requires, among other things, the extraction of the where and when from textual descriptions. This comes with intrinsic uncertainty in space and time, but also thematically in terms of entity identification: which is the restaurant, bus stop, farm, market, or forest mentioned in this information source? The rise of sensor networks adds to the mix a badly needed verification mechanism for real-time neogeographic data.
We strive for a proper mix of carefully integrated techniques in geoinformation handling, approaches to spatiotemporal imprecision and incompleteness, as well as data augmentation through sensors in a generic framework with which purpose-oriented end-user communities can be served appropriately.
The UT PhD position focuses on spatiotemporal data technology in XML databases and theory and support technology for storage, manipulation and reasoning with spatiotemporal and thematic uncertainty. The work is to be validated through testbed use cases, such as the H20 project with google.org (water consumers in Zanzibar), the AGCommons project with the Gates Foundation (smallholder farmers in sub-Saharan Africa), and other projects with large user communities.