Tag-Archive for » xml databases «

Thursday, November 25th, 2010

On November 25th, Riham Abdel Kader defended her thesis on her ROX approach for run-time optimization of XQueries. Her work and thesis were well received by the PhD committee. The ROX approach brings more robustness to query optimizers in finding near-optimal execution plans, and it can exploit intricate correlations in the data. Although designed for XML databases, the approach can be applied to ordinary relational databases as well as to RDF stores. Riham recently accepted a position at ASML. I am very proud of her and her work.
“ROX: Run-Time Optimization of XQueries” [download, OPAQUE project]
Query optimization is the most important and complex phase of answering a user query. While sufficient for some applications, the widely used relational optimizers are not always robust and may pick execution plans that are far from optimal. This is due to several reasons. First, they depend on statistics and a cost model which are often inaccurate, and sometimes even absent. Second, they fail to detect correlations which can unexpectedly make certain plans considerably cheaper than others. Finally, they cannot efficiently handle the large search space of big queries.
The challenges faced by traditional relational optimizers, and their impact on the quality of the chosen plans, are aggravated in the context of XML and XQuery. This is because, for XML, it is harder to collect and maintain representative statistics, since they have to capture more information about the document. Moreover, the search space of plans for an XQuery query is on average larger than that of relational queries, due to the higher number of joins resulting from the many XPath steps in a typical XQuery.
To overcome the above challenges, we propose ROX, a Run-time Optimizer for XQueries. ROX is autonomous, i.e. it does not depend on any statistics and cost models, robust in always finding a good execution plan while detecting and benefiting from correlations, and efficient in exploring the search space of plans. We show, through experiments, that ROX is indeed robust and efficient, and performs better than relational compile-time optimizers. ROX adopts a fundamentally different internal design which moves the optimization to run-time, and interleaves it with query execution. The search space is efficiently explored by alternating optimization and execution phases, defining the plan incrementally. Every execution step executes a set of operators and materializes the results, allowing the next optimization phase to benefit from the knowledge extracted from the newly materialized intermediates. Sampling techniques are used to accurately estimate the cardinality and cost of operators. To detect correlations, we introduce the chain sampling technique, the first generic and robust method to deal with any type of correlated data. We also extend the ROX idea to pipelined architectures to allow most of the existing database systems to benefit from our research.
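
To make the interleaving concrete, here is a minimal sketch in Python of the kind of optimize-execute loop the abstract describes. It is an illustrative reconstruction, not ROX's actual code: the operator abstraction, the names, and the cardinality-only cost measure are my own simplifications.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Operator:
    inputs: List[str]   # names of materialized tables this operator reads
    output: str         # name under which its full result is materialized
    run: Callable       # run(*tables) -> result table (a list of rows)

def rox_style_plan(operators: List[Operator],
                   tables: Dict[str, list],
                   sample_size: int = 100) -> List[Operator]:
    """Build the plan incrementally by alternating optimization phases
    (cost candidates on samples of real intermediates) with execution
    phases (run the winner in full and materialize its result)."""
    plan, remaining = [], list(operators)
    while remaining:
        # Only operators whose inputs are already materialized are candidates.
        ready = [op for op in remaining if all(n in tables for n in op.inputs)]

        def estimated_cardinality(op: Operator) -> float:
            samples, scale = [], 1.0
            for name in op.inputs:
                t = tables[name]
                s = random.sample(t, min(sample_size, len(t)))
                samples.append(s)
                scale *= len(t) / max(len(s), 1)  # extrapolation factor
            return len(op.run(*samples)) * scale

        # Optimization phase: pick the cheapest-looking candidate.
        op = min(ready, key=estimated_cardinality)
        # Execution phase: run it in full; the materialized result gives the
        # next optimization phase exact knowledge of this intermediate.
        tables[op.output] = op.run(*(tables[n] for n in op.inputs))
        plan.append(op)
        remaining.remove(op)
    return plan
```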

Monday, May 03rd, 2010

Mena Badieh Habib started his PhD research in the Neogeography-project today. For details, see my earlier post on “Kick-Off of Neogeography project”.

Friday, April 02nd, 2010

We designed a variant of our ROX approach for run-time query optimization that works on database systems with a pipelined architecture (i.e., almost all commercial relational databases).
Run-time Optimization for Pipelined Systems
Riham Abdel Kader, Maurice van Keulen, Peter Boncz, Stefan Manegold
Traditional optimizers fail to pick good execution plans when faced with increasingly complex queries and large data sets. This failure is even more acute in the context of XQuery, due to the structured nature of the XML language. To overcome the vulnerabilities of traditional optimizers, we have previously proposed ROX, a Run-time Optimizer for XQueries, which interleaves optimization and execution steps on full tables. ROX has proved to be robust, even in the presence of strong correlations, but it has one limitation: it fully materializes intermediate results, making it unsuitable for pipelined systems. Therefore, this paper proposes ROX-sampled, a variant of ROX which executes operators on small data samples, thus generating smaller intermediates. We conduct extensive experiments which show that ROX-sampled is comparable to ROX in performance, and that it is still robust against correlations. The main benefit of ROX-sampled is that it allows the large number of pipelined databases to import the ROX idea into their optimization paradigm.
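
To picture how sampling avoids full materialization, here is a hedged sketch of the ROX-sampled idea; the operator representation and the smallest-sampled-intermediate criterion are illustrative simplifications, not the paper's exact algorithm.

```python
import random

def order_for_pipeline(operators, base_table, sample_size=100):
    """Choose an operator order by pushing only a small sample through the
    candidates; no full intermediate is ever materialized, so the chosen
    order can afterwards be handed to a pipelined executor."""
    sample = random.sample(base_table, min(sample_size, len(base_table)))
    order, remaining = [], list(operators)
    while remaining:
        # Cost each candidate on the current sample only; prefer the one
        # that keeps the (sampled) intermediate smallest.
        outputs = {op: op(sample) for op in remaining}
        best = min(remaining, key=lambda op: len(outputs[op]))
        sample = outputs[best]
        order.append(best)
        remaining.remove(best)
    return order   # executed in full, pipelined, in this order
```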

The paper will be presented at the IV Alberto Mendelzon Workshop on Foundations of Data Management (AMW2010), 17-20 May 2010, Buenos Aires, Argentina [details]

Friday, March 12th, 2010

To improve the integration of the new faculty ITC (Geo-Information Science and Earth Observation) into the university, the boards of directors of ITC and UT decided some time ago to subsidize several cooperation projects, each with two PhD students: one at ITC and one at the UT. I am involved in one of them: “Neogeography: the challenge of channelling large and ill-behaved data streams” (see description below). Rolf de By (ITC) and I presented our Neogeography project at the kick-off meeting on 12 March 2010 [presentation]. Rolf’s PhD student is Clarisse Kagoyire; she arrived in The Netherlands just in time to make it to the meeting. My PhD student is Mena Badieh Habib; he will start on 1 May 2010.

Neogeography: the challenge of channelling large and ill-behaved data streams
In this project, we develop XML-based data technology to support the channeling of large and ill-behaved neogeographic data streams. In neogeography, geographic information is derived from end-users, not from official bodies such as mapping agencies, cadasters, or other (para-)governmental organizations. The motivation is that multiple (neo)geographic information sources on the same phenomenon can be mutually enriching.
Content provision and feedback from large communities of end-users have great potential for sustaining a high level of data quality. The technology is meant to reach a substantial user community in the less-developed world through content provision and delivery via cell phone networks. Exploiting such neogeographic data requires, among other things, extracting the where and when from textual descriptions. This comes with intrinsic uncertainty in space and time, but also thematic uncertainty in entity identification: which restaurant, bus stop, farm, market, or forest is mentioned in this information source? The rise of sensor networks adds to the mix a badly needed verification mechanism for real-time neogeographic data.
We strive for a proper mix of carefully integrated techniques in geoinformation handling, approaches to spatiotemporal imprecision and incompleteness, as well as data augmentation through sensors, in a generic framework with which purpose-oriented end-user communities can be served appropriately.
The UT PhD position focuses on spatiotemporal data technology in XML databases and theory and support technology for storage, manipulation and reasoning with spatiotemporal and thematic uncertainty. The work is to be validated through testbed use cases, such as the H20 project with google.org (water consumers in Zanzibar), AGCommons project with the Gates Foundation (smallholder farmers in sub-Saharan Africa), and other projects with large user communities.
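
As a toy illustration of the thematic uncertainty mentioned above (which restaurant, bus stop, or market is meant?), one could represent an extracted mention as a probability distribution over candidate entities. Everything below, including the names and the scoring, is hypothetical and not the project's actual design.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CandidateMatch:
    entity_id: str      # e.g. an identifier from a gazetteer or POI register
    probability: float  # degree of belief that this is the entity meant

def resolve_mention(candidates: List[str],
                    scores: Dict[str, float]) -> List[CandidateMatch]:
    """Turn raw evidence scores (e.g. textual similarity plus spatial
    proximity) into a normalized probability distribution."""
    total = sum(scores[c] for c in candidates)
    return [CandidateMatch(c, scores[c] / total) for c in candidates]

# e.g. the mention "the market" matching two gazetteer entries:
matches = resolve_mention(["market_A", "market_B"],
                          {"market_A": 3.0, "market_B": 1.0})
# -> [CandidateMatch('market_A', 0.75), CandidateMatch('market_B', 0.25)]
```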

Friday, September 18th, 2009

We developed a demonstration that shows and explains what happens behind the scenes of our ROX approach for run-time query optimization.
The Robustness of a Run-Time XQuery Optimizer against Correlated Data
Riham Abdel Kader, Peter Boncz, Stefan Manegold, Maurice van Keulen
We demonstrate ROX, a run-time optimizer of XQueries, that focuses on finding the best execution order of XPath steps and relational joins in an XQuery. The problem of join ordering has been extensively researched, but the proposed techniques are still unsatisfying: they either rely on a cost model, which might result in inaccurate estimations, or explore only a restricted number of plans from the search space. ROX was developed to tackle these problems. ROX does not need any cost model and defers query optimization to run-time, intertwining optimization and execution steps. In every optimization step, sampling techniques are used to estimate the cardinality of unexecuted steps and joins, to decide which sequence of operators to process next. Consequently, each execution step provides updated and accurate knowledge about intermediate results, which is used during the next optimization round. This demonstration will focus on: (i) illustrating the steps that ROX follows and the decisions it makes to choose a good join order, (ii) showing ROX’s robustness in the face of data with different degrees of correlation, (iii) comparing the performance of the plan chosen by ROX to different plans picked from the search space, and (iv) showing that the run-time overhead needed by ROX is restricted to a small fraction of the execution time.
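
Point (ii) rests on the fact that executing predicates on an actual sample observes their joint selectivity, whereas a classical optimizer multiplies per-predicate selectivities under an independence assumption. A self-contained toy example (data made up for illustration):

```python
import random

# A fully correlated data set: country determines language.
rows = [{"country": "NL", "language": "Dutch"} for _ in range(900)] + \
       [{"country": "FR", "language": "French"} for _ in range(100)]

p1 = lambda r: r["country"] == "NL"      # selectivity 0.9
p2 = lambda r: r["language"] == "Dutch"  # selectivity 0.9, correlated with p1

# Textbook estimate under independence: 0.9 * 0.9 = 0.81 (wrong).
independence_estimate = 0.9 * 0.9

# Sampling-based estimate: evaluate the conjunction on a real sample.
sample = random.sample(rows, 100)
sampled_estimate = sum(1 for r in sample if p1(r) and p2(r)) / len(sample)
# ~0.9: the sample sees the correlation the independence model misses.
```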

The paper will be presented at the 26th International Conference on Data Engineering (ICDE2010), 1-6 Mar 2010, Long Beach, California, USA [details]

Tuesday, April 07th, 2009

On Friday 30 January 2009, Luuk Peters defended his MSc thesis “Battle of the Bulk: Corporate XMLDB vs. Research XMLDB”. The MSc project was carried out at Finan. It was supervised by me, Riham Abdel Kader (UT), Michiel Schipper (Finan) and Joost Willemse (Finan).

“Battle of the Bulk: Corporate XMLDB vs. Research XMLDB”
Finan, a company offering solutions for financial analysis, uses Oracle’s XML-support to store, query and analyze financial reports obtained in XBRL, an open XML-based standard for defining and exchanging business and financial performance information. The goal of the project was to improve the performance of querying a high volume of financial XML documents with changing schemas. A secondary goal was to compare Oracle’s performance on this task with that of MonetDB/XQuery as a representative of a successful XML DBMS from academia.

Luuk thoroughly investigated many strategies for improvement: Oracle’s storage strategies (CLOB, Binary XML, XMLType, XMLTable), some schema-less and some schema-based; alternative, non-standard XML-document schemas; and variations in query formulation. He experimented with many combinations of these alternatives under varying conditions (database size, query complexity).
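
To give a flavour of what such variations look like, here is a sketch of querying via one of the listed options, XMLTable, which shreds XML content into rows inside a query. The connection details, table, and element names are invented and not Finan's actual schema.

```python
import os
import oracledb  # python-oracledb; any Oracle client library would do

conn = oracledb.connect(user="finan",
                        password=os.environ["ORACLE_PWD"],
                        dsn="localhost/xepdb1")
cur = conn.cursor()
# Shred each XBRL-like report into relational rows on the fly.
cur.execute("""
    SELECT x.company, x.revenue
    FROM   reports r,
           XMLTABLE('/report/entity'
                    PASSING r.doc
                    COLUMNS company VARCHAR2(100) PATH 'name',
                            revenue NUMBER        PATH 'revenue') x
""")
for company, revenue in cur:
    print(company, revenue)
```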

Since performance comparisons with Oracle are sensitive information, I cannot say anything specific about the outcomes. What I can say is that Oracle offers a wide variety of XML techniques, each with its own strengths and weaknesses. Much was learned about how these techniques work and how they affect execution performance. It proved hard to formulate simple and concrete advice regarding the best strategy, because of the influence of so many factors. Nevertheless, considerable improvement could be obtained by choosing the right strategies. Furthermore, MonetDB/XQuery also proved up to the task of financial analysis. We believe that both industry and academia can learn from each other’s techniques. My hope is that this MSc project brought us a step closer to efficient and scalable general-purpose XML DBMS technology.

A public version of Luuk’s MSc thesis is not available yet, but I will of course write about it here as soon as it is.

Category: Student projects, XML databases
Wednesday, March 11th, 2009

ROX: Run-time Optimization of XQueries
Riham Abdel Kader (UT), Peter Boncz (CWI), Stefan Manegold (CWI), Maurice van Keulen (UT)
Optimization of complex XQuery queries that combine many XPath steps as well as join conditions is currently hindered by the absence of good result-size estimation and cost models for XQuery. Additionally, even the state of the art in relational query optimization still struggles to cope with cost-model estimation errors that increase with plan size, as well as with the effect of correlated join, selection and aggregation predicates.

In this research, we propose to radically depart from the traditional path of separating the query compilation and query execution phases, by having the optimizer execute and materialize partial results on the fly, observing intermediate result characteristics as well as applying sampling techniques to evaluate the real observed query cost. The query optimization problem studied here takes as input a Join Graph where the edges are either equi-predicates or XPath axis steps, and the execution environment provides value- and structural-join algorithms, in addition to structural and value-based indices.

While run-time optimization with sampling removes many of the vulnerabilities of classical optimizers, it brings its own challenges in keeping resource usage under control, both for the materialization of intermediates and for the cost of plan exploration using sampling. The ROX approach deals with these issues by limiting the run-time search space to so-called “zero-investment” algorithms, for which sampling can be guaranteed to be strictly linear in sample size. While the Join Graph used in ROX is a purely relational concept, it crucially fits our XQuery domain, as all structural join algorithms and XML value indices we use have the zero-investment property.

We perform an extensive experimental evaluation on large XML datasets, which shows that our run-time query optimizer finds good query plans in a robust fashion and has limited run-time overhead.
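
One way to picture the zero-investment restriction is as a filter on the Join Graph's edges before any sampling probe is issued. The sketch below is illustrative only, with invented names, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class JoinEdge:
    left: str                 # input table / node set
    right: str
    kind: str                 # an "equi" predicate or an XPath "axis" step
    algorithm: Callable       # join implementation for this edge
    zero_investment: bool     # cost on a sample is linear in sample size

def probe_candidates(edges: List[JoinEdge],
                     samples: Dict[str, list]) -> List[Tuple[JoinEdge, int]]:
    """Probe only zero-investment edges on samples, so that the total
    exploration overhead is bounded by the sample sizes themselves."""
    results = []
    for e in edges:
        if not e.zero_investment:
            continue  # excluded from the run-time search space
        out = e.algorithm(samples[e.left], samples[e.right])
        results.append((e, len(out)))  # observed sampled cardinality
    return results
```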

The paper will be presented at the ACM International Conference on Management of Data (SIGMOD 2009), 29 June – 2 July 2009, Providence, Rhode Island, USA. [details]

Category: Opaque, XML databases
Wednesday, February 25th, 2009

Clarisse Kagoyire today defended her MSc thesis at the International Institute for Geo-Information Science and Earth Observation (ITC) in Enschede. She explored the application of XML database technology for distributed spatial data processing using web services. The idea is that XML database technology, if equipped with spatial support, can avoid development and run-time overhead compared to typical relational SDI*-based solutions, because it works directly on the exchanged GML data and allows you to stay in the XML domain. Clarisse implemented a soil-erosion scenario involving five independent XML database servers, using MonetDB/XQuery as the XML database platform. MonetDB/XQuery’s support for XRPC proved important and powerful for realizing the distributed query processing. Furthermore, Clarisse used, and helped specify, the recently implemented XQuery spatial functionality. The research shows that XML database technology is suitable for implementing web services and that the preliminary, unoptimized spatial support in MonetDB/XQuery is already sufficient for certain distributed spatial data processing tasks.
(*) SDI = Spatial Data Infrastructure
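
XRPC itself is MonetDB/XQuery's mechanism for shipping XQuery function calls between servers. The sketch below only illustrates the general pattern of the scenario, shipping a query to several servers and combining partial results locally, with invented endpoints and query; it does not reproduce the actual XRPC protocol.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Five hypothetical XML database servers participating in the scenario.
SERVERS = [f"http://node{i}.example.org/xquery" for i in range(1, 6)]
QUERY = b"""for $p in doc('parcels.xml')//parcel
            where $p/slope > 10
            return <at-risk>{data($p/@id)}</at-risk>"""

def ship(url: str) -> bytes:
    """POST the query to one server and return its partial result."""
    req = urllib.request.Request(url, data=QUERY,
                                 headers={"Content-Type": "application/xquery"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

with ThreadPoolExecutor() as pool:
    partial_results = list(pool.map(ship, SERVERS))
# The partial results are then combined locally into the final answer.
```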

Monday, June 26th, 2006

MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine
Relational XQuery systems try to re-use mature relational data management infrastructures to create fast and scalable XML database technology. This paper describes the main features, key contributions, and lessons learned while implementing such a system. Its architecture consists of (i) a range-based encoding of XML documents into relational tables, (ii) a compilation technique that translates XQuery into a basic relational algebra, (iii) a restricted, (order) property-aware, peephole relational query optimization strategy, and (iv) a mapping from XML update statements into relational updates. Thus, this system implements all essential XML database functionalities (rather than a single feature), such that we can learn from the full consequences of our architectural decisions. While implementing this system, we had to extend the state of the art with a number of new technical contributions, such as the loop-lifted staircase join and efficient relational query evaluation strategies for XQuery theta-joins with existential semantics. These contributions, as well as the architectural lessons learned, are also deemed valuable for other relational back-end engines. The performance and scalability of the resulting system are evaluated on the XMark benchmark, up to data sizes of 11 GB. The performance section also provides an extensive comparison with all major XMark results published previously, which confirms that the goal of purely relational XQuery processing, namely speed and scalability, was met.
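
The range-based encoding of (i) can be illustrated with a short sketch; the actual MonetDB/XQuery table layout differs in detail, but the pre/size/level idea, in the spirit of the XPath accelerator, is as follows.

```python
import xml.etree.ElementTree as ET

def encode(root):
    """Encode an XML tree as (pre, size, level, tag) rows: 'pre' is the
    preorder rank, 'size' the number of descendants, 'level' the depth.
    The descendants of a node n are exactly the rows with
    n.pre < pre <= n.pre + n.size, so XPath steps become range
    predicates (joins) on an ordinary relational table."""
    table = []

    def visit(elem, level):
        pre = len(table)
        table.append(None)  # reserve this node's row, filled in below
        for child in elem:
            visit(child, level + 1)
        size = len(table) - pre - 1
        table[pre] = (pre, size, level, elem.tag)

    visit(root, 0)
    return table

doc = ET.fromstring("<a><b><c/></b><d/></a>")
for row in encode(doc):
    print(row)
# (0, 3, 0, 'a'), (1, 1, 1, 'b'), (2, 0, 2, 'c'), (3, 0, 1, 'd')
```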

The paper was presented at the ACM International Conference on Management of Data (SIGMOD 2006), 26-29 June 2006, Chicago, IL, USA. [electronic version] [details]