Tag-Archive for » uncertain databases «

Thursday, January 14th, 2016 | Author:

Today I gave a presentation at the Data Science Northeast Netherlands Meetup about
Managing uncertainty in data: the key to effective management of data quality problems [slides (PDF)]

Business analytics and data science are significantly impaired by a wide variety of ‘data handling’ issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers around data semantics and data quality. We have developed a generic method which is based on modeling such problems as uncertainty *in* the data. A recently conceived new kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or “Uncertain Database”. Together, they allow one to, e.g., postpone the resolution of data problems, assess what their influence is on analytical results, etc. We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with ambiguity of natural language and many other problems encountered when using unstructured data.

Thursday, March 08th, 2012 | Author:

A journal paper with my vision on data interoperability and a basis formalization has been accepted for a special issue of the Journal of IT volume 54, issue 3.
Managing Uncertainty: The Road Towards Better Data Interoperability.
Maurice van Keulen
Data interoperability encompasses the many data management activities needed for effective information management in anyone´s or any organization´s everyday work such as data cleaning, coupling, fusion, mapping, and information extraction. It is our conviction that a significant amount of money and time in IT that is devoted to these activities, is about dealing with one problem: “semantic uncertainty”. Sometimes data is subjective, incomplete, not current, or incorrect, sometimes it can be interpreted in different ways, etc. In our opinion, clean correct data is only a special case, hence data management technology should treat data quality problems as a fact of life, not as something to be repaired afterwards. Recent approaches treat uncertainty as an additional source of information which should be preserved to reduce its impact. We believe that the road towards better data interoperability, is to be found in teaching our data processing tools and systems about all forms of doubt and how to live with them. In this paper, we show for several data interoperability use cases (deduplication, data coupling/fusion, and information extraction) how to formally model the associated data quality problems as semantic uncertainty. Furthermore, we provide an argument why our approach leads to better data interoperability in terms of natural problem exposure and risk assessment, more robustness and automation, reduced development costs, and potential for natural and effective feedback loops leveraging human attention.

Thursday, September 01st, 2011 | Author:

A master student performed a problem exploration for the PayDIBI project. This is the report he wrote.
Integration of Biological Sources – Exploring the Case of Protein Homology
Tjeerd W. Boerman, Maurice van Keulen, Paul van der Vet, Edouard I. Severing (Wageningen University)
Data integration is a key issue in the domain of bioin- formatics, which deals with huge amounts of heterogeneous biological data that grows and changes rapidly. This paper serves as an introduction in the field of bioinformatics and the biological concepts it deals with, and an exploration of the integration problems a bioinformatics scientist faces. We examine ProGMap, an integrated protein homology system used by bioinformatics scientists at Wageningen University, and several use cases related to protein homology. A key issue we identify is the huge manual effort required to unify source databases into a single resource. Uncertain databases are able to contain several possible worlds, and it has been proposed that they can be used to significantly reduce initial integration efforts. We propose several directions for future work where uncertain databases can be applied to bioinformatics, with the goal of furthering the cause of bioinformatics integration.

Thursday, October 21st, 2010 | Author:

QloUDFabian Panse from the University of Hamburg in Germany just lauched a website about our cooperation on the topic of “Quality of Uncertain Data (QloUD)”.

Friday, July 30th, 2010 | Author:

For his “Research Topic” course, MSc student Emiel Hollander experimented with a mapping from Probabilistic XML to the probabilistic relational database Trio to investigate whether or not it is feasible to use Trio as a back-end for processing XPath queries on Probabilistic XML.
Storing and Querying Probabilistic XML Using a Probabilistic Relational DBMS
Emiel Hollander, Maurice van Keulen
This work explores the feasibility of storing and querying probabilistic XML in a probabilistic relational database. Our approach is to adapt known techniques for mapping XML to relational data such that the possible worlds are preserved. We show that this approach can work for any XML-to-relational technique by adapting a representative schema-based (inlining) as well as a representative schemaless technique (XPath Accelerator). We investigate the maturity of probabilistic relational databases for this task with experiments with one of the state-of- the-art systems, called Trio.

The paper will be presented at the 4th International Workshop on Management of Uncertain Data (MUD 2010) co-located with VLDB, 13 September 2010, Singapore [details]

Friday, June 18th, 2010 | Author:

Ik heb een artikel geschreven over “onzekere databases” voor DB/M Database Magazine van Array Publications. Hij staat in nummer 4, het juni-nummer dus nu te koop. Het thema van dit speciale nummer is “Datakwaliteit”.
Onzekere databases
Een recente ontwikkeling in het databaseonderzoek betreft de zogenaamde ‘onzekere databases’. Dit artikel beschrijft wat onzekere databases zijn, hoe gebruikt kunnen worden en welke toepassingen met name voordeel zouden kunnen hebben van deze technologie [details].

Tuesday, April 27th, 2010 | Author:

Ik heb een artikel geschreven over “onzekere databases” voor Database Magazine van Array Publications. Het wordt geplaatst in nummer 4, een speciaal nummer over “Datakwaliteit”.

Wednesday, November 25th, 2009 | Author:

As a product of my cooperation with Fabian Panse from the University of Hamburg, we got a paper accepted at the NTII-workshop co-located with ICDE 2010.
Duplicate Detection in Probabilistic Data
Fabian Panse, Maurice van Keulen, Ander de Keijzer, Norbert Ritter
Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities.

The paper will be presented at the Second International Workshop on New Trends in Information Integration (NTII 2010), Long Beach, California, USA [details]