Tag-Archive for » deduplication «

Monday, November 05th, 2012

Fabian Panse, a PhD student at the University of Hamburg, will visit me for an entire week, 17-21 December 2012. He will present at the DB seminar on Tuesday: “Duplicate detection in probabilistic data”.

Thursday, March 08th, 2012

A journal paper presenting my vision on data interoperability, together with a basic formalization, has been accepted for a special issue of the Journal of IT, volume 54, issue 3.
Managing Uncertainty: The Road Towards Better Data Interoperability.
Maurice van Keulen
Data interoperability encompasses the many data management activities needed for effective information management in anyone's or any organization's everyday work such as data cleaning, coupling, fusion, mapping, and information extraction. It is our conviction that a significant amount of money and time in IT that is devoted to these activities, is about dealing with one problem: “semantic uncertainty”. Sometimes data is subjective, incomplete, not current, or incorrect, sometimes it can be interpreted in different ways, etc. In our opinion, clean correct data is only a special case, hence data management technology should treat data quality problems as a fact of life, not as something to be repaired afterwards. Recent approaches treat uncertainty as an additional source of information which should be preserved to reduce its impact. We believe that the road towards better data interoperability, is to be found in teaching our data processing tools and systems about all forms of doubt and how to live with them. In this paper, we show for several data interoperability use cases (deduplication, data coupling/fusion, and information extraction) how to formally model the associated data quality problems as semantic uncertainty. Furthermore, we provide an argument why our approach leads to better data interoperability in terms of natural problem exposure and risk assessment, more robustness and automation, reduced development costs, and potential for natural and effective feedback loops leveraging human attention.

Sunday, February 26th, 2012

I wrote a journal paper with Fabian Panse (University of Hamburg) about handling ambiguous situations in deduplication in a probabilistic way. It has been accepted for the ACM Journal of Data and Information Quality.
Indeterministic Handling of Uncertain Decisions in Deduplication.
Fabian Panse, Norbert Ritter, and Maurice van Keulen
In current research and practice, deduplication is usually considered as a deterministic approach in which tuples are either declared to be duplicates or not. In ambiguous situations, however, it is often not completely clear-cut, which tuples represent the same real-world entity. In deterministic approaches, many realistic possibilities may be ignored, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for deduplication by using a probabilistic target model including techniques for proper probabilistic interpretation of similarity matching results. Thus, instead of deciding for a most likely situation, all realistic situations are modeled in the resultant data. This approach minimizes the negative impact of false decisions. Furthermore, the deduplication process becomes almost fully automatic and human effort can be reduced to a large extent. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways. We also describe a full-indeterministic method for theoretical and presentational reasons.
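To give a feel for the difference between deciding and not deciding, here is a minimal sketch in Python. It is an illustration only, not the method from the paper: the function names, the two thresholds, and the linear mapping from similarity score to duplicate probability are all assumptions made for this example. It also treats the pairwise decisions as independent and ignores transitivity between pairs, which a real approach would have to handle. Pairs with a clearly low or clearly high similarity are decided deterministically; only the ambiguous pairs in between are kept open, and every remaining combination becomes a possible world with a probability.

```python
from itertools import product


def match_probability(similarity, lower=0.3, upper=0.8):
    """Map a similarity score in [0, 1] to a duplicate probability.

    Hypothetical mapping: scores at or below `lower` are certain
    non-duplicates, scores at or above `upper` are certain duplicates,
    and scores in between are interpolated linearly.
    """
    if similarity <= lower:
        return 0.0
    if similarity >= upper:
        return 1.0
    return (similarity - lower) / (upper - lower)


def possible_worlds(pair_similarities, lower=0.3, upper=0.8):
    """Enumerate all duplicate/non-duplicate assignments with non-zero probability.

    `pair_similarities` maps a tuple pair to its similarity score.
    Returns a list of (assignment, probability) tuples, where `assignment`
    maps each pair to True (duplicates) or False (distinct).
    Assumption: pair decisions are independent of each other.
    """
    pairs = list(pair_similarities)
    probs = [match_probability(pair_similarities[p], lower, upper) for p in pairs]
    worlds = []
    for decisions in product([True, False], repeat=len(pairs)):
        weight = 1.0
        for is_duplicate, p in zip(decisions, probs):
            weight *= p if is_duplicate else 1.0 - p
        if weight > 0.0:  # drop worlds that contradict a certain decision
            worlds.append((dict(zip(pairs, decisions)), weight))
    return worlds


if __name__ == "__main__":
    similarities = {
        ("t1", "t2"): 0.9,   # clearly duplicates
        ("t3", "t4"): 0.55,  # ambiguous, handled indeterministically
        ("t5", "t6"): 0.1,   # clearly distinct
    }
    for assignment, probability in possible_worlds(similarities):
        print(assignment, probability)
```

With these made-up scores only the pair (t3, t4) stays undecided, so the result contains two possible worlds instead of one forced decision. Moving the two thresholds closer together reduces the number of indeterministically handled pairs, which is the rough intuition behind the semi-indeterministic variants mentioned in the abstract.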

Category: Probabilistic Data Integration