On August 28, Ander de Keijzer and I organized another MUD workshop (Management of Uncertain Data). We had five research paper presentations, two invited talks (Olivier Pivert on possibilistic databases and Christoph Koch on probabilistic databases), and a discussion on these two approaches to managing uncertain data. I regard this year’s MUD as very successful: we counted about 30 participants and the aforementioned discussion was lively. We plan to submit a workshop report to SIGMOD Record.
I recently got a paper accepted for the upcoming special issue of The VLDB Journal on Uncertain and Probabilistic Databases. The special issue is not out yet, but Springer has already published it online: see SpringerLink.
Qualitative Effects of Knowledge Rules and User Feedback in Probabilistic Data Integration
Maurice van Keulen, Ander de Keijzer
In data integration efforts, portal development in particular, much development time is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates or solve other semantic conflicts. It proves impossible, however, to automatically get rid of all semantic problems. An often-used rule of thumb states that about 90% of the development effort is devoted to semi-automatically resolving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that strives for a ‘good enough’ initial integration which stores any remaining semantic uncertainty and conflicts in a probabilistic database. The remaining cases are resolved with user feedback during query time. The main contribution of this paper is an experimental investigation of the effects and sensitivity of rule definition, threshold tuning, and user feedback on the integration quality. We claim that our approach indeed reduces development effort, and does not merely shift it, by showing that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ initial integration that can be meaningfully used, and that user feedback is effective in gradually improving the integration quality.
The paper will appear shortly in The VLDB Journal’s Special Issue on Uncertain and Probabilistic Databases. [details]
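To make the approach more concrete for readers of this blog, below is a minimal sketch of threshold-based entity resolution that keeps the hard cases as probabilistic alternatives. It is a toy Python illustration, not the system from the paper: the two thresholds, the similarity function, and the record representation are all assumptions of mine.

    from difflib import SequenceMatcher

    # Assumed thresholds (not from the paper): pairs scoring above T_HIGH
    # are confidently merged, pairs below T_LOW are confidently kept apart,
    # and everything in between is a 'hard case'.
    T_HIGH, T_LOW = 0.9, 0.5

    def similarity(a, b):
        """Toy string similarity; real systems use advanced measures."""
        return SequenceMatcher(None, a, b).ratio()

    def resolve(rec1, rec2):
        """Return possible interpretations as (world, probability) pairs."""
        s = similarity(rec1, rec2)
        if s >= T_HIGH:                  # confident duplicate: merge
            return [(("merged", rec1), 1.0)]
        if s <= T_LOW:                   # confident non-duplicate
            return [(("distinct", rec1, rec2), 1.0)]
        # Hard case: instead of forcing a decision, store both
        # interpretations in the probabilistic database, weighted by s.
        return [(("merged", rec1), s), (("distinct", rec1, rec2), 1.0 - s)]

    print(resolve("J. Smith", "John Smith"))
    # hard case: both a 'merged' and a 'distinct' world, weighted by s

Raising T_LOW or lowering T_HIGH trades fewer stored alternatives against a higher risk of wrong automatic decisions, which is exactly the kind of threshold sensitivity the paper investigates.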
The project description for Tjitze Rienstra’s MSc project has been finalized. The project is supervised by Paul van der Vet and me.
Dealing with uncertainty in the Semantic Web
The notion of data integration is essential to the Semantic Web. Its real advantage is that it enables us to gather data from different sources, reason over these data, and obtain results that would otherwise not have been easy to find. However, data integration can lead to conflicts: different sources may provide contradicting information about the same real-world objects. The result is uncertainty. The technologies of the Semantic Web are assertional, which means that they cannot deal with uncertainty very well.
The essential standards (RDF, RDFS, OWL, SPARQL) will be extended to deal with uncertainty. We will first make clear what is required in terms of expressiveness. We will then specify an extension by formalizing a ‘possible world’ semantics for RDF. It will be necessary to consider the consequences for RDFS and OWL. Finally, querying with SPARQL must be adapted to work with this possible world model while remaining computationally efficient. Validation will be done by testing a prototype against a movie database containing conflicting data from different sources.
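To give a flavour of what such a ‘possible world’ semantics could look like, here is a small Python sketch of my own: conflicting triples from different sources form sets of mutually exclusive alternatives, a world picks one alternative per conflict, and a query answer is weighted by the total probability of the worlds that support it. All data, the weights, and the independence assumption are made up for illustration; this is not the project’s actual design.

    from itertools import product

    # Two sources disagree about a movie. Each conflict is a list of
    # mutually exclusive (triple, weight) alternatives. (Toy data.)
    conflicts = [
        [(("Blade Runner", "director", "Ridley Scott"), 0.8),
         (("Blade Runner", "director", "R. Scott Jr."), 0.2)],
        [(("Blade Runner", "year", "1982"), 0.9),
         (("Blade Runner", "year", "1984"), 0.1)],
    ]

    def possible_worlds(conflicts):
        """Enumerate worlds as (set of triples, probability), assuming
        conflicts are independent."""
        for choice in product(*conflicts):
            triples = {t for t, _ in choice}
            prob = 1.0
            for _, p in choice:
                prob *= p
            yield triples, prob

    def query(subj, pred):
        """Probability of each answer: sum over the worlds containing it.
        A SPARQL-like extension would return such weighted answers."""
        answers = {}
        for triples, prob in possible_worlds(conflicts):
            for s, p, o in triples:
                if s == subj and p == pred:
                    answers[o] = answers.get(o, 0.0) + prob
        return answers

    print(query("Blade Runner", "director"))
    # approximately {'Ridley Scott': 0.8, 'R. Scott Jr.': 0.2}

Enumerating all worlds is exponential in the number of conflicts, which is why the efficiency requirement mentioned above matters: a practical system must answer queries without materializing every world.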
Qualitative Effects of Knowledge Rules in Probabilistic Data Integration
One of the problems in data integration is data overlap: the fact that different data sources have data on the same real-world entities. Much development time in data integration projects is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates from the integration result or solve other semantic conflicts, but it proves impossible to get rid of all semantic problems in data integration. An often-used rule of thumb states that about 90% of the development effort is devoted to solving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that stores any remaining semantic uncertainty and conflicts in a probabilistic database, so that the integration result can already be meaningfully used. The main development effort in our approach is devoted to defining and tuning knowledge rules and thresholds. Rules and thresholds directly impact the size and quality of the integration result. We measure integration quality indirectly, by measuring the quality of answers to queries on the integrated data set in an information retrieval-like way. The main contribution of this report is an experimental investigation of the effects and sensitivity of rule definition and threshold tuning on the integration quality. The experiments show that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ integration that can be meaningfully used, which demonstrates that our approach indeed reduces development effort and does not merely shift it to rule definition and threshold tuning.
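As a rough sketch of what an information retrieval-like quality measure can look like, the snippet below computes precision and recall of probabilistic query answers against a reference answer set, letting answer probabilities act as fractional counts. This is a simplified metric of my own, with made-up data, purely for illustration; the report defines its own measures and experiments.

    def expected_precision_recall(answers, truth):
        """answers: dict mapping answer -> probability from the
        probabilistic integration; truth: set of correct answers.
        Probabilities act as fractional counts (an assumed, simplified
        metric)."""
        retrieved = sum(answers.values())
        relevant_retrieved = sum(p for a, p in answers.items() if a in truth)
        precision = relevant_retrieved / retrieved if retrieved else 0.0
        recall = relevant_retrieved / len(truth) if truth else 0.0
        return precision, recall

    # Toy query result over an integrated data set (values assumed).
    answers = {"Ridley Scott": 0.8, "R. Scott Jr.": 0.2}
    truth = {"Ridley Scott"}
    print(expected_precision_recall(answers, truth))  # (0.8, 0.8)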