Together with Ander de Keijzer, I gave a 4-hour lecture, including a 1-hour exercise, for the SIKS Advanced Course “Probabilistic Methods for Entity Resolution and Entity Ranking” (see Program for the slides). Our lecture was about a “Probabilistic Data Integration approach to Entity Resolution”. In the exercise we tried to integrate information about movies from two independent sources using the probabilistic database Trio. This proved too difficult for a 1-hour exercise in such a course, but at the same time very interesting as an example use of a probabilistic database. We decided to try to turn it into a sample script that illustrates both the power of probabilistic databases like Trio and how to do probabilistic data integration.
Archive for April, 2009
On Friday 30 January 2009, Luuk Peters defended his MSc thesis “Battle of the Bulk: Corporate XMLDB vs. Research XMLDB”. The MSc project was carried out at Finan. It was supervised by me, Riham Abdel Kader (UT), Michiel Schipper (Finan) and Joost Willemse (Finan).
“Battle of the Bulk: Corporate XMLDB vs. Research XMLDB”
Finan, a company offering solutions for financial analysis, uses Oracle’s XML support to store, query and analyze financial reports obtained in XBRL, an open XML-based standard for defining and exchanging business and financial performance information. The goal of the project was to improve the performance of querying a high volume of financial XML documents with changing schemas. A secondary goal was to compare Oracle’s performance on this task with that of MonetDB/XQuery as a representative of a successful XML DBMS from academia.
Luuk thoroughly investigated many strategies for improvement: Oracle’s storage strategies (CLOB, Binary XML, XMLType, XMLTable), some schema-less and some schema-based; other, non-standard XML document schemas; and variations in query formulation. He experimented with many possible combinations of these alternatives under varying conditions (database size, query complexity).
Since performance comparisons with Oracle are sensitive information, I cannot say anything specific about the outcomes. What I can say is that Oracle offers a wide variety of XML techniques, each with its own strengths and weaknesses. Much was learned about how these techniques work and how they affect execution performance. It proved hard to formulate simple and concrete advice regarding the best strategy, because so many factors are of influence. Nevertheless, considerable improvement could be obtained by choosing the right strategies. Furthermore, MonetDB/XQuery also proved up to the task of financial analysis. We believe that industry and academia can learn from each other’s techniques. My hope is that this MSc project brought us a step closer to efficient and scalable general-purpose XML DBMS technology.
A public version of Luuk’s MSc thesis is not available yet, but I will of course write about it here as soon as it is.