• Friday, January 22nd, 2010
On Friday 22 January 2010, Michiel Punter defended his MSc thesis “Multi-Source Entity Resolution”. The MSc project was supervised by me, Ander de Keijzer, and Riham Abdel Kader.
“Multi-Source Entity Resolution” [download]
Background: The focus of this research was on multi-source entity resolution in the setting of pair-wise data integration. In contrast to most existing approaches to entity resolution, this research does not consider matching to be transitive. A consequence of this is that entity resolution on multiple sources is not guaranteed to be associative: the result may depend on the order in which the sources are integrated. The goal of this research was to construct a generic model for multi-source entity resolution in the setting of pair-wise data integration that is associative.
Results: The main contributions of this research are: (1) a formal model for multi-source entity resolution and (2) strategies that can be used to resolve matching conflicts in a way that renders multi-source entity resolution associative. Possible worlds semantics is used to handle uncertainty originating from possible matches. The presented model is generic enough to allow different match and merge functions as well as different strategies to resolve matching conflicts.
Conclusions: A formalization of an example of multi-source entity resolution is presented to show the utility of the proposed model. Small examples in which three sources are integrated show that the strategies result in associative behavior of the integrate function.
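To make the associativity problem concrete, here is a minimal sketch in Python. It is not the model from the thesis: the records are plain numbers, and the `match`, `merge` and `integrate` functions are hypothetical stand-ins chosen only so that matching is non-transitive. It integrates three single-record sources in two different orders and gets two different results.

```python
def match(x, y, tol=1):
    """Hypothetical similarity match: records match if they differ by at most tol.
    This relation is not transitive: 1~2 and 2~3, but 1 and 3 do not match."""
    return abs(x - y) <= tol


def merge(x, y):
    """Hypothetical merge function: keep the smaller of two matching records."""
    return min(x, y)


def integrate(left, right):
    """Pair-wise integration: merge each right record into the first matching
    left record, or add it as a new record if nothing matches."""
    result = list(left)
    for r in right:
        for i, l in enumerate(result):
            if match(l, r):
                result[i] = merge(l, r)
                break
        else:
            result.append(r)
    return sorted(set(result))


a, b, c = [1], [2], [3]
left_first = integrate(integrate(a, b), c)   # (A + B) + C
right_first = integrate(a, integrate(b, c))  # A + (B + C)
print(left_first, right_first)
```

In this toy setting the two orders disagree, which is the kind of matching conflict the strategies in the thesis are meant to resolve.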
• Thursday, November 05th, 2009
On Thursday 5 November 2009, Tjitze Rienstra defended his MSc thesis “Dealing with uncertainty in the semantic web”. The MSc project was supervised by me, Paul van der Vet, and Maarten Fokkinga. The work was evaluated by the committee as excellent and received the rarely awarded grade of 10.
“Dealing with uncertainty in the semantic web” [download]
Standardizing the Semantic Web is still an ongoing process. For some aspects, the standardization seems to have completed. For example, the syntax layer, the RDF data model layer and the RDFS and OWL semantic extensions have proven to fulfill their purpose in real-world applications. Other aspects, while necessary to realize the greater ideal of the Semantic Web, are yet to be standardized. One of these is dealing with uncertainty. Like classical logic, the languages of the Semantic Web (RDF, RDFS and OWL) work under the assumption that knowledge is certain. Many forms of knowledge, e.g. in computer vision, computational linguistics and information retrieval, exhibit notions of uncertainty. Uncertainty also arises as a side effect of knowledge integration and ontology mapping. This thesis describes an extension for the Semantic Web to deal with uncertainty. The extension, called URDF (Uncertain RDF), extends RDF with the capability to express uncertainty by allowing RDF formulas to be associated with probabilities. It not only extends RDF, but also supports the semantics of RDFS and part of OWL. The main contribution is an extension that adheres to the incremental design of the Semantic Web language stack. It can act as a unifying framework for different kinds of probabilistic representation and reasoning, at different levels of expressivity (RDF, RDFS or OWL). In this thesis, we focus on two kinds of reasoning: rule-based reasoning with RDFS/OWL knowledge, and Bayesian networks and inference.
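The idea of attaching probabilities to RDF statements and reasoning over them can be illustrated with a small Python sketch. This is not URDF itself. It assumes that the uncertain triples are independent, uses made-up `ex:` names, and applies only one RDFS rule (rdfs9, subclass inheritance) by brute-force enumeration of all possible worlds.

```python
from itertools import product

# Hypothetical uncertain triples with their probabilities (assumed independent).
uncertain = {
    ("ex:film1", "rdf:type", "ex:Movie"): 0.9,
    ("ex:Movie", "rdfs:subClassOf", "ex:Work"): 0.8,
}


def rdfs9_closure(triples):
    """Apply RDFS rule rdfs9 until nothing new is derived:
    (x rdf:type C) and (C rdfs:subClassOf D) imply (x rdf:type D)."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(closed):
            if p != "rdf:type":
                continue
            for c, q, d in list(closed):
                derived = (s, "rdf:type", d)
                if q == "rdfs:subClassOf" and c == o and derived not in closed:
                    closed.add(derived)
                    changed = True
    return closed


def probability(query, uncertain_triples):
    """Sum the probabilities of all possible worlds in which the query triple
    holds after rule-based reasoning."""
    items = list(uncertain_triples.items())
    total = 0.0
    for choices in product([True, False], repeat=len(items)):
        weight = 1.0
        world = set()
        for (triple, p), present in zip(items, choices):
            weight *= p if present else 1.0 - p
            if present:
                world.add(triple)
        if query in rdfs9_closure(world):
            total += weight
    return total


p_work = probability(("ex:film1", "rdf:type", "ex:Work"), uncertain)
print(round(p_work, 2))
```

The derived statement holds only in the world that contains both uncertain triples, so under the independence assumption its probability is the product of the two. Enumeration is exponential in the number of uncertain triples and is only meant to show the semantics, not an efficient reasoner.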
• Monday, September 07th, 2009
In cooperation with ITC (International Institute for Geo-Information Science and Earth Observation), we have a PhD position available on Neogeography: the challenge of channeling large and ill-behaved data streams. In neogeography, geographic information is derived from end-users, not from official bodies. The technology is meant to reach a substantial user community in the less-developed world through content provision and delivery via cell phone networks. Exploiting such neogeographic data requires, among other things, the extraction of the where and when from textual descriptions. This comes with intrinsic uncertainty in space and time, but also thematically in terms of entity identification: which is the restaurant, bus stop, farm, market, or forest mentioned in this information source? Anyone with an MSc degree interested in doing PhD research on this topic is welcome to apply before October 10 (see the vacancy for details).
• Monday, September 07th, 2009
On August 28, Ander de Keijzer and I organized another MUD workshop (Management of Uncertain Data). We had 5 presentations of research papers, 2 invited speakers (Olivier Pivert on possibilistic databases and Christoph Koch on probabilistic databases), and a discussion on these two approaches for managing uncertain data. I regard this year’s MUD as very successful: we counted about 30 participants and the aforementioned discussion was lively. We plan to submit a workshop report to SIGMOD Record.
• Tuesday, August 11th, 2009
I recently got a paper accepted for the upcoming special issue of The VLDB Journal on Uncertain and Probabilistic Databases. The special issue is not out yet, but Springer has already published the paper online: see SpringerLink.
• Friday, July 17th, 2009
Qualitative Effects of Knowledge Rules and User Feedback in Probabilistic Data Integration
Maurice van Keulen, Ander de Keijzer
In data integration efforts, portal development in particular, much development time is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates or solve other semantic conflicts. It proves impossible, however, to automatically get rid of all semantic problems. An often-used rule of thumb states that about 90% of the development effort is devoted to semi-automatically resolving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that strives for a ‘good enough’ initial integration which stores any remaining semantic uncertainty and conflicts in a probabilistic database. The remaining cases are to be resolved with user feedback during query time. The main contribution of this paper is an experimental investigation of the effects and sensitivity of rule definition, threshold tuning, and user feedback on the integration quality. We claim that our approach indeed reduces development effort (and does not merely shift the effort) by showing that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ initial integration that can be meaningfully used, and that user feedback is effective in gradually improving the integration quality.
The paper will appear shortly in The VLDB Journal’s Special Issue on Uncertain and Probabilistic Databases. [details]
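The role of rough safe thresholds and user feedback described in the abstract can be sketched as follows. This is an illustration under assumptions, not the implementation from the paper: the threshold values, the use of `difflib` string similarity, and the choice to reuse the similarity score as the match probability are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical "rough safe" thresholds: only clear cases are decided up front.
LOWER, UPPER = 0.3, 0.9


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for a real measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def decide(sim, lower=LOWER, upper=UPPER):
    """Classify a candidate pair by its similarity score.
    Returns (label, probability that both records denote the same entity)."""
    if sim >= upper:
        return ("match", 1.0)
    if sim <= lower:
        return ("non-match", 0.0)
    # Assumption: the similarity itself is stored as the match probability.
    return ("possible", sim)


def apply_feedback(decision, user_says_same):
    """User feedback at query time turns an uncertain decision into a certain one."""
    label, _ = decision
    if label != "possible":
        return decision
    return ("match", 1.0) if user_says_same else ("non-match", 0.0)


print(decide(similarity("Example Movie", "example movie")))
print(apply_feedback(decide(0.5), user_says_same=False))
```

Only the pairs that fall between the two thresholds need to be stored as uncertain, so widening or narrowing that band trades database size against the number of cases left for feedback.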
• Friday, February 06th, 2009
The project description for Tjitze Rienstra’s MSc project has been finalized. The project is being supervised by me and Paul van der Vet.
Dealing with uncertainty in the Semantic Web
The notion of data integration is essential to the Semantic Web. Its real advantage is that it enables us to gather data from different sources, reason over this data and get results that may otherwise not have been easy to find. However, data integration can lead to conflicts. Different sources may provide contradicting information about the same real world objects. The result is uncertainty. The technologies of the Semantic Web are assertional, which means that they cannot deal with uncertainty very well.
The essential standards (RDF, RDFS, OWL, SPARQL) will be extended in order to deal with uncertainty. We will first make clear what is required in terms of expressiveness. We then specify an extension by formalizing a ‘possible world’ semantics for RDF. It will be necessary to consider what the consequences are for RDFS and OWL. Finally, querying with SPARQL must be adapted to work with this possible world model while remaining computationally efficient. Validation will be done by testing a prototype against a movie database containing conflicting data from different sources.
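As a rough illustration of what a ‘possible world’ reading of conflicting data could look like, consider two sources that disagree on the release year of one movie. The sketch below is an assumption-laden toy, not the planned formalization: the triples, the probabilities and the `query` helper are hypothetical, and the two conflicting claims are treated as mutually exclusive alternatives.

```python
# Triples that all sources agree on (hypothetical data).
certain = {("ex:movie1", "ex:title", "Example Movie")}

# Mutually exclusive alternatives from conflicting sources, with probabilities.
alternatives = [
    ({("ex:movie1", "ex:year", "1988")}, 0.7),
    ({("ex:movie1", "ex:year", "1989")}, 0.3),
]


def worlds():
    """Each alternative, combined with the certain triples, is one possible world."""
    for triples, p in alternatives:
        yield certain | triples, p


def query(subject, predicate):
    """Return every object value with the total probability of the worlds
    in which the triple (subject, predicate, value) holds."""
    answers = {}
    for world, p in worlds():
        for s, pr, o in world:
            if s == subject and pr == predicate:
                answers[o] = answers.get(o, 0.0) + p
    return answers


years = query("ex:movie1", "ex:year")
titles = query("ex:movie1", "ex:title")
print(years, titles)
```

A query answer is then a set of values with probabilities rather than a single value: the title is certain because it holds in every world, while each year is only as likely as the world that contains it.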
• Wednesday, July 30th, 2008
Qualitative Effects of Knowledge Rules in Probabilistic Data Integration
One of the problems in data integration is data overlap: the fact that different data sources have data on the same real-world entities. Much development time in data integration projects is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates from the integration result or solve other semantic conflicts, but it proves impossible to get rid of all semantic problems in data integration. An often-used rule of thumb states that about 90% of the development effort is devoted to solving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that stores any remaining semantic uncertainty and conflicts in a probabilistic database, enabling it to already be meaningfully used. The main development effort in our approach is devoted to defining and tuning knowledge rules and thresholds. Rules and thresholds directly impact the size and quality of the integration result. We measure integration quality indirectly by measuring the quality of answers to queries on the integrated data set in an information retrieval-like way. The main contribution of this report is an experimental investigation of the effects and sensitivity of rule definition and threshold tuning on the integration quality. This proves that our approach indeed reduces development effort (and does not merely shift the effort to rule definition and threshold tuning) by showing that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ integration that can be meaningfully used.
The technical report was published as TR-CTIT-08-42 by the CTIT. [electronic version] [details]
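Measuring answer quality "in an information retrieval-like way" can be illustrated with probability-weighted precision and recall. The exact measures used in the report are not given here, so the definitions below are an assumption: each query answer carries a probability, and that probability is used as its weight.

```python
def expected_precision_recall(answers, truth):
    """Probability-weighted precision and recall (an assumed definition).

    answers: dict mapping each returned answer to its probability
    truth:   set of answers that are actually correct
    """
    mass = sum(answers.values())
    correct = sum(p for a, p in answers.items() if a in truth)
    precision = correct / mass if mass else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical query result: one certain answer and two uncertain ones.
result = {"a": 1.0, "b": 0.5, "c": 0.5}
precision, recall = expected_precision_recall(result, truth={"a", "b"})
print(precision, recall)
```

With these weights, a wrong answer that the database considers unlikely hurts precision less than a wrong answer it considers certain, which is one way to reward an integration for being honest about its uncertainty.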