On 7 December 2012, Paul Stapersma defended his MSc thesis “Efficient Query Evaluation on Probabilistic XML Data”. The MSc project was supervised by me, Maarten Fokkinga and Jan Flokstra. The thesis is the result of a more than 2 year cooperation between Paul and me to build a probabilistic XML database system on top of a relational one: MayBMS.
“Efficient Query Evaluation on Probabilistic XML Data”[download]
In many application scenarios, reliability and accuracy of data are of great importance. Data is often uncertain or inconsistent because the exact state of represented real world objects is unknown. A number of uncertain data models have emerged to cope with imperfect data in order to guarantee a level of reliability and accuracy. These models include probabilistic XML (P-XML) –an uncertain semi-structured data model– and U-Rel –an uncertain table-structured data model. U-Rel is used by MayBMS, an uncertain relational database management system (URDBMS) that provides scalable query evaluation. In contrast to U-Rel, there does not exist an efficient query evaluation mechanism for P-XML.
In this thesis, we approach this problem by instructing MayBMS to cope with P-XML in order to evaluate XPath queries on P-XML data as SQL queries on uncertain relational data. This approach entails two aspects: (1) a data mapping from P-XML to U-Rel that ensures that the same information is represented by database instances of both data structures, and (2) a query mapping from XPath to SQL that ensures that the same question is specified in both query languages.
We present a specification of a P-XML to U-Rel data mapping and a corresponding XPath to SQL mapping. Additionally, we present two designs of this specification. The first design constructs a data mapping in such way that the corresponding query mapping is a traditional XPath to SQL mapping. The second design differs from the first in the sense that a component of the data mapping is evaluated as part of the query evaluation process. This offers the advantage that the data mapping is more efficient. Additionally, the second design allows for a number of optimizations that affect the performance of the query evaluation process. However, this process is burdened with the extra task of evaluating the data mapping component.
An extensive experimental evaluation on synthetically generated data sets and real-world data sets shows that our implementation of the second design is more efficient in most scenarios. Not only is the P-XML data mapping executed more efficient, the query evaluation performance is also improved in most scenarios.
Archive for the Category ◊ XML databases ◊
On November 25th, Riham Abdel Kader defended her thesis on her ROX-approach for run-time optimization of XQueries. Her work and thesis were well-received by the PhD committee. The ROX-approach brings more robustness to query optimizers in finding near-optimal execution plans and it can exploit intricate correlations in the data. Albeit meant for XML databases, the approach can be applied to ordinary relational databases as well RDF stores. Riham recently accepted a position at ASML. I am very proud of her and her work.
“ROX: Run-Time Optimization of XQueries”[download, OPAQUE project]
Query optimization is the most important and complex phase of answering a user query. While sufficient for some applications, the widely used type of relational optimizers are not always robust, picking execution plans that are far from optimal. This is due to several reasons. First, they depend on statistics and a cost model which are often inaccurate, and sometimes even absent. Second, they fail to detect correlations which can unexpectedly make certain plans considerably cheaper than others. Finally, they cannot efficiently handle the large search space of big queries.
The challenges faced by traditional relational optimizers and their impact on the quality of the chosen plans are aggravated in the context of XML and XQueries. This is due to the fact that in XML, it is harder to collect and maintain representative statistics since they have to capture more information about the document. Moreover, the search space of plans for an XQuery query is on average larger than that of relational queries, due to the higher number of joins resulting from the existence of many XPath steps in a typical XQuery.
To overcome the above challenges, we propose ROX, a Run-time Optimizer for XQueries. ROX is autonomous, i.e. it does not depend on any statistics and cost models, robust in always finding a good execution plan while detecting and benefiting from correlations, and efficient in exploring the search space of plans. We show, through experiments, that ROX is indeed robust and efficient, and performs better than relational compile-time optimizers. ROX adopts a fundamentally different internal design which moves the optimization to run-time, and interleaves it with query execution. The search space is efficiently explored by alternating optimization and execution phases, defining the plan incrementally. Every execution step executes a set of operators and materializes the results, allowing the next optimization phase to benefit from the knowledge extracted from the newly materialized intermediates. Sampling techniques are used to accurately estimate the cardinality and cost of operators. To detect correlations, we introduce the chain sampling technique, the first generic and robust method to deal with any type of correlated data. We also extend the ROX idea to pipelined architectures to allow most of the existing database systems to benefit from our research.
For his “Research Topic” course, MSc student Emiel Hollander experimented with a mapping from Probabilistic XML to the probabilistic relational database Trio to investigate whether or not it is feasible to use Trio as a back-end for processing XPath queries on Probabilistic XML.
Storing and Querying Probabilistic XML Using a Probabilistic Relational DBMS
Emiel Hollander, Maurice van Keulen
This work explores the feasibility of storing and querying probabilistic XML in a probabilistic relational database. Our approach is to adapt known techniques for mapping XML to relational data such that the possible worlds are preserved. We show that this approach can work for any XML-to-relational technique by adapting a representative schema-based (inlining) as well as a representative schemaless technique (XPath Accelerator). We investigate the maturity of probabilistic relational databases for this task with experiments with one of the state-of- the-art systems, called Trio.
The paper will be presented at the 4th International Workshop on Management of Uncertain Data (MUD 2010) co-located with VLDB, 13 September 2010, Singapore [details]
Mena Badieh Habib started his PhD research in the Neogeography-project today. For details, see my earlier post on “Kick-Off of Neogeography project”.
We designed a variant of our ROX approach for run-time query optimization that works on database systems with a pipelined architecture (i.e., almost all commercial relational databases).
Run-time Optimization for Pipelined Systems
Riham Abdel Kader, Maurice van Keulen, Peter Boncz, Stefan Manegold
Traditional optimizers fail to pick good execution plans, when faced with increasingly complex queries and large data sets. This failure is even more acute in the context of XQuery, due to the structured nature of the XML language. To overcome the vulnerabilities of traditional optimizers, we have previously proposed ROX, a Run-time Optimizer for XQueries, which interleaves optimization and execution of full tables. ROX has proved to be robust, even in the presence of strong correlations, but it has one limitation: it uses full materialization of intermediate results making it unsuitable for pipelined systems. Therefore, this paper proposes ROX-sampled, a variant of ROX, which executes small data samples, thus generating smaller intermediates. We conduct extensive experiments which proved that ROX-sampled is comparable to ROX in performance, and that it is still robust against correlations. The main benefit of ROX-sampled is that it allows the large number of pipelined databases to import the ROX idea into their optimization paradigm.
The paper will be presented at the IV Alberto Mendelzon Workshop on Foundations of Data Management (AMW2010), 17-20 May 2010, Buenos Aires, Argentina [details]
To improve the integration of the new faculty ITC (Geo-Information Science and Earth Observation) into the university, the boards of directors of ITC and UT decided some time ago to subsidize several cooperation projects with each two PhD students, one at ITC and one at the UT. I am involved in one: “Neogeography: the challenge of channelling large and ill-behaved data streams” (see description below). Rolf de By (ITC) and I presented our Neogeography project on the Kick-off meeting 12 March 2010 [presentation]. Rolf’s PhD student is Clarisse Kagoyire and she arrived in The Netherlands just in time to make it to the meeting. My PhD student is Mena Badieh Habib; he will start 1 May 2010.
Neogeography: the challenge of channelling large and ill-behaved data streams
In this project, we develop XML-based data technology to support the channeling of large and ill-behaved neogeographic data streams. In neogeography, geographic information is derived from end-users, not from official bodies like mapping agencies, cadasters or other official, (para-)governmental organizations. The motivation is that multiple (neo)geographic information sources on the same phenomenon can be mutually enriching.
Content provision and feedback from large communities of end-users has great potential for sustaining a high level of data quality. The technology is meant to reach a substantial user community in the less-developed world through content provision and delivery via cell phone networks. Exploiting such neogeographic data requires a.o. the extraction of the where and when from textual descriptions. This comes with intrinsic uncertainty in space, time, but also thematically in terms of entity identification: which is the restaurant, bus stop, farm, market, forest mentioned in this information source? The rise of sensor networks adds to the mix a badly needed verification mechanism for the real-time neogeographic data.
We strive for a proper mix of carefully integrated techniques in geoinformation handling, approaches to spatiotemporal imprecision and incompleteness, as well as data augmentation through sensors in a generic framework with which purpose- oriented end-user communities can be served appropriately.
The UT PhD position focuses on spatiotemporal data technology in XML databases and theory and support technology for storage, manipulation and reasoning with spatiotemporal and thematic uncertainty. The work is to be validated through testbed use cases, such as the H20 project with google.org (water consumers in Zanzibar), AGCommons project with the Gates Foundation (smallholder farmers in sub-Saharan Africa), and other projects with large user communities.
Wired published an article “The Good Enough Revolution: When Cheap and Simple Is Just Fine” which perfectly describes what I want to achieve with building systems that connect to and integrate data sources and systems. “We now favor flexibility over high fidelity, convenience over features, quick and dirty over slow and polished. Having it here and now is more important than having it perfect.”
We developed a demonstration that shows and explains what happens behind the scenes of our ROX approach for run-time query optimization.
The Robustness of a Run-Time XQuery Optimizer against Correlated Data
Riham Abdel Kader, Peter Boncz, Stefan Manegold, Maurice van Keulen
We demonstrate ROX, a run-time optimizer of XQueries, that focuses on finding the best execution order of XPath steps and relational joins in an XQuery. The problem of join ordering has been extensively researched, but the proposed techniques are still unsatisfying. These either rely on a cost model which might result in inaccurate estimations, or explore only a restrictive number of plans from the search space. ROX is developed to tackle these problems. ROX does not need any cost model, and defers query optimization to run-time intertwining optimization and execution steps. In every optimization step, sampling techniques are used to estimate the cardinality of unexecuted steps and joins to make a decision which sequence of operators to process next. Consequently, each execution step will provide updated and accurate knowledge about intermediate results, which will be used during the next optimization round. This demonstration will focus on: (i) illustrating the steps that ROX follows and the decisions it makes to choose a good join order, (ii) showing ROX’s robustness in the face of data with different degree of correlation, (iii) comparing the performance of the plan chosen by ROX to different plans picked from the search space, (iv) proving that the run-time overhead needed by ROX is restricted to a small fraction of the execution time.
The paper will be presented at the 26th International Conference on Data Engineering (ICDE2010), 1-6 Mar 2010, Long Beach, California, USA [details]
On Friday 30 January 2009, Luuk Peters defended his MSc thesis “Battle of the Bulk: Corporate XMLDB vs. Research XMLDB”. The MSc project was carried out at Finan. It was supervised by me, Riham Abdel Kader (UT), Michiel Schipper (Finan) and Joost Willemse (Finan).
“Battle of the Bulk: Corporate XMLDB vs. Research XMLDB”
Finan, a company offering solutions for financial analysis, uses Oracle’s XML-support to store, query and analyze financial reports obtained in XBRL, an open XML-based standard for defining and exchanging business and financial performance information. The goal of the project was to improve the performance of querying a high volume of financial XML documents with changing schemas. A secondary goal was to compare Oracle’s performance on this task with that of MonetDB/XQuery as a representative of a successful XML DBMS from academia.
Luuk thoroughly investigated many strategies for improvement: Oracle’s storage strategies (CLOB, Binary XML, XMLType, XMLTable) some schema-less some schema-based, other non-standard XML-document schemas, and variations in query formulation. He experimented with many possible combinations of these alternatives under varying conditions (database size, query complexity).
Since performance comparisons with Oracle is sensitive information, I cannot say anything specific about the outcomes. What I can say, is that Oracle offers a wide variety XML techniques, each with its own strengths and weaknesses. Much was learned on how these techniques work and how they affect execution performance. It proved hard to formulate a simple and concrete advise regarding the best strategy, because of the influence of so many factors. Nevertheless, considerable improvement could be obtained by choosing the right strategies. Furthermore, MonetDB/XQuery also proved up to the task of financial analysis. We believe that both industry and academia can learn from each other’s techniques. My hope is that this MSc project brought us a step closer to efficient and scalable general-purpose XML DBMS technology.
A public version of Luuk’s MSc thesis is not available yet, but I will of course immediately write about it here as soon as it does.
ROX: Run-time Optimization of XQueries
Riham Abdel Kader (UT), Peter Boncz (CWI), Stefan Manegold (CWI), Maurice van Keulen (UT)
Optimization of complex XQuery queries that combine many XPath steps as well as join conditions is currently hindered by the absence of good result size estimation and cost models for XQuery. Additionally, the state-of-the-art of even relational query optimization still struggles to cope with cost model estimation errors that increase with plan size, as well as with the effect of correlated join, selection and aggregation predicates.
In this research, we propose to radically depart from the traditional path of separating the query compilation and query execution phases, by having the optimizer execute and materialize partial results on the fly, observing intermediate result characteristics as well as applying sampling techniques to evaluate the real observed query cost. The query optimization problem studied here takes as input a Join Graph where the edges are either equi-predicates or XPath axis steps, and the execution environment provides value- and structural-join algorithms, in addition to structural and value-based indices.
While run-time optimization with sampling removes many of the vulnerabilities of classical optimizers, it brings its own challenges with respect to keeping resource usage under control, both with respect to the materialization of intermediates, as well as the cost of plan exploration using sampling. The ROX approach deals with these issues by limiting the run-time search space to so-called “zero-investment” algorithms for which sampling can be guaranteed to be strictly linear in sample size. While the Join Graph used in ROX is a purely relational concept, it crucially fits our XQuery domain as all structural join algorithms and XML value indices we use have the zero-investment property.
We perform extensive experimental evaluation on large XML datasets that shows that our run-time query optimizer finds good query plans in a robust fashion and has limited run-time overhead.
The paper will be presented at the ACM International Conference on Management of Data (SIGMOD 2009), 29 June – 2 July 2009, Providence, Rhode Island, USA. [details]
