Archive for the Category » MultimediaN «

Friday, June 18th, 2010

I have written an article on "uncertain databases" for DB/M Database Magazine from Array Publications. It appears in issue 4 (the June issue), which is on sale now. The theme of this special issue is "Datakwaliteit" (data quality).
Onzekere databases (Uncertain Databases)
A recent development in database research concerns the so-called 'uncertain databases'. This article describes what uncertain databases are, how they can be used, and which applications in particular stand to benefit from this technology [details].
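To give a flavour of the concept: below is a minimal Python sketch of tuple-level uncertainty under possible-worlds semantics, the model most uncertain-database proposals build on. The relation, tuples, and probabilities are invented for illustration, and tuples are treated as independent, which a real system would refine with, for example, mutual-exclusion constraints.

    from itertools import product

    # A tiny "uncertain relation": each tuple is present with the given
    # probability (tuple-level uncertainty). Data invented for illustration;
    # tuples are assumed independent for simplicity.
    person = [
        (("Alice", "Amsterdam"), 0.9),
        (("Alice", "Arnhem"),    0.1),  # conflicting extraction result
        (("Bob",   "Breda"),     0.7),
    ]

    def possible_worlds(relation):
        """Enumerate every possible world together with its probability."""
        for choices in product([True, False], repeat=len(relation)):
            world, p = [], 1.0
            for (tup, prob), present in zip(relation, choices):
                p *= prob if present else (1.0 - prob)
                if present:
                    world.append(tup)
            yield world, p

    # Probability that the query "Alice lives in Amsterdam" holds,
    # summed over all worlds containing the supporting tuple:
    answer_prob = sum(p for world, p in possible_worlds(person)
                      if ("Alice", "Amsterdam") in world)
    print(f"P(Alice lives in Amsterdam) = {answer_prob:.2f}")  # 0.90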

Tuesday, August 11th, 2009

I recently got a paper accepted for the upcoming special issue of the VLDB Journal on Uncertain and Probabilistic Databases. The special issue is not out yet, but Springer has already published it online: see SpringerLink.

Friday, July 17th, 2009

Qualitative Effects of Knowledge Rules and User Feedback in Probabilistic Data Integration

Maurice van Keulen, Ander de Keijzer
In data integration efforts, portal development in particular, much development time is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates or solve other semantic conflicts. It proves impossible, however, to automatically get rid of all semantic problems. An often-used rule of thumb states that about 90% of the development effort is devoted to semi-automatically resolving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that strives for a 'good enough' initial integration which stores any remaining semantic uncertainty and conflicts in a probabilistic database. The remaining cases are to be resolved with user feedback during query time. The main contribution of this paper is an experimental investigation of the effects and sensitivity of rule definition, threshold tuning, and user feedback on the integration quality. We claim that our approach indeed reduces development effort, and does not merely shift it, by showing that setting rough safe thresholds and defining only a few rules suffices to produce a 'good enough' initial integration that can be meaningfully used, and that user feedback is effective in gradually improving the integration quality.
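As an aside, the two-threshold idea behind the 'good enough' initial integration can be sketched in a few lines of Python. The similarity measure, threshold values, and names below are illustrative assumptions, not the paper's actual implementation: pairs scoring above the upper threshold are merged automatically, pairs below the lower one are kept apart, and the grey zone in between is stored as uncertain matches to be resolved by user feedback at query time.

    from difflib import SequenceMatcher

    # Illustrative two-threshold entity resolution. Thresholds and the
    # similarity measure are assumptions made for this sketch.
    T_HIGH, T_LOW = 0.9, 0.5

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def resolve(a: str, b: str):
        s = similarity(a, b)
        if s >= T_HIGH:
            return ("same", 1.0)       # merge automatically
        if s <= T_LOW:
            return ("different", 1.0)  # keep apart
        return ("uncertain", s)        # store both worlds, P(same) ~ s

    for a, b in [("J. Smith", "J. Smith"),
                 ("J. Smith", "John Smith"),
                 ("J. Smith", "M. Jones")]:
        verdict, score = resolve(a, b)
        print(f"{a!r} vs {b!r}: {verdict} ({score:.2f})")

    # User feedback at query time replaces an uncertain match by a
    # certain one, pruning the possible worlds that disagree with it:
    def apply_feedback(user_says_same: bool):
        return ("same", 1.0) if user_says_same else ("different", 1.0)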

The paper will appear shortly in The VLDB Journal’s Special Issue on Uncertain and Probabilistic Databases. [details]

Tuesday, February 17th, 2009

Last week (11-13 February) we held a Pathfinder project meeting. 20 people attended from 6 institutes (Uni.Tübingen, CWI, Uni.Twente, NFI Den Haag, Uni.Konstanz, ETH Zürich). We discussed scalability, multiple front- and back-ends, full-text support, porting to MonetDB5, and more. Regarding the latter, we decided to indeed invest in porting Pathfinder to MonetDB5, starting in April. There were also a number of presentations: I presented our first attempts at adding spatial support to Pathfinder, and Riham presented our ROX run-time join optimization approach for XQuery.

Wednesday, July 30th, 2008

Qualitative Effects of Knowledge Rules in Probabilistic Data Integration
One of the problems in data integration is data overlap: the fact that different data sources contain data on the same real-world entities. Much development time in data integration projects is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates from the integration result or solve other semantic conflicts, but it proves impossible to get rid of all semantic problems in data integration. An often-used rule of thumb states that about 90% of the development effort is devoted to solving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that stores any remaining semantic uncertainty and conflicts in a probabilistic database, so that the integration result can already be meaningfully used. The main development effort in our approach is devoted to defining and tuning knowledge rules and thresholds. Rules and thresholds directly impact the size and quality of the integration result. We measure integration quality indirectly by measuring the quality of answers to queries on the integrated data set in an information retrieval-like way. The main contribution of this report is an experimental investigation of the effects and sensitivity of rule definition and threshold tuning on the integration quality. It shows that our approach indeed reduces development effort, and does not merely shift it to rule definition and threshold tuning, by demonstrating that setting rough safe thresholds and defining only a few rules suffices to produce a 'good enough' integration that can be meaningfully used.
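The 'information retrieval-like' quality measurement mentioned above boils down to comparing the answers a query returns on the integrated data set against a manually verified gold standard. A small Python sketch, with made-up answer sets:

    # Classic IR-style evaluation of integration quality: precision and
    # recall of query answers against verified correct answers.
    def precision_recall(returned: set, relevant: set):
        tp = len(returned & relevant)
        precision = tp / len(returned) if returned else 1.0
        recall = tp / len(relevant) if relevant else 1.0
        return precision, recall

    gold = {"Alice", "Bob", "Carol"}               # correct answers
    answers = {"Alice", "Bob", "Bob (duplicate)"}  # one miss, one duplicate

    p, r = precision_recall(answers, gold)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")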

The technical report was published as TR-CTIT-08-42 by the CTIT. [electronic version] [details]

Monday, June 26th, 2006

MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine
Relational XQuery systems try to re-use mature relational data management infrastructures to create fast and scalable XML database technology. This paper describes the main features, key contributions, and lessons learned while implementing such a system. Its architecture consists of (i) a range-based encoding of XML documents into relational tables, (ii) a compilation technique that translates XQuery into a basic relational algebra, (iii) a restricted (order) property-aware peephole relational query optimization strategy, and (iv) a mapping from XML update statements into relational updates. Thus, this system implements all essential XML database functionalities (rather than a single feature) such that we can learn from the full consequences of our architectural decisions. While implementing this system, we had to extend the state of the art with a number of new technical contributions, such as loop-lifted staircase join and efficient relational query evaluation strategies for XQuery theta-joins with existential semantics. These contributions as well as the architectural lessons learned are also deemed valuable for other relational back-end engines. The performance and scalability of the resulting system are evaluated on the XMark benchmark up to data sizes of 11 GB. The performance section also provides an extensive comparison of all major XMark results published previously, which confirms that the goal of purely relational XQuery processing, namely speed and scalability, has been met.
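To illustrate feature (i), here is a simplified Python rendering of a range-based encoding, and of how an XPath descendant step then becomes a pure range predicate over the resulting table. It sketches the general pre/size numbering idea only; the actual MonetDB/XQuery encoding stores additional columns (level, kind, text content, and so on).

    import xml.etree.ElementTree as ET

    # Simplified range-based encoding: each node gets a pre-order rank
    # 'pre' and a subtree size 'size'; node d is a descendant of node a
    # iff  a.pre < d.pre <= a.pre + a.size.
    doc = ET.fromstring("<a><b><c/><d/></b><e/></a>")

    rows, counter = [], [0]

    def encode(node):
        pre = counter[0]
        counter[0] += 1
        for child in node:
            encode(child)
        rows.append((pre, counter[0] - pre - 1, node.tag))

    encode(doc)
    rows.sort()  # the relational table, ordered by pre rank

    # The descendant axis becomes a pure range scan over the table:
    def descendants(table, pre, size):
        return [row for row in table if pre < row[0] <= pre + size]

    b = next(row for row in rows if row[2] == "b")
    print(descendants(rows, b[0], b[1]))  # the <c/> and <d/> nodes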

The paper was presented at the ACM International Conference on Management of Data (SIGMOD 2006), 26-29 June 2006, Chicago, IL, USA. [electronic version] [details]