Today, PhD student Haihan Yin defended his PhD thesis. I served on his defense committee.
“Defusing the Debugging Scandal – Dedicated Debugging Technologies for Advanced Dispatching Languages”[download]
Archive for the Category » XML databases «
Today, PhD student Haihan Yin defended his PhD thesis. I served on his defense committee.
On 7 December 2012, Paul Stapersma defended his MSc thesis “Efficient Query Evaluation on Probabilistic XML Data”. The MSc project was supervised by me, Maarten Fokkinga and Jan Flokstra. The thesis is the result of a more than 2 year cooperation between Paul and me to build a probabilistic XML database system on top of a relational one: MayBMS.
“Efficient Query Evaluation on Probabilistic XML Data”[download]
In many application scenarios, reliability and accuracy of data are of great importance. Data is often uncertain or inconsistent because the exact state of represented real world objects is unknown. A number of uncertain data models have emerged to cope with imperfect data in order to guarantee a level of reliability and accuracy. These models include probabilistic XML (P-XML) –an uncertain semi-structured data model– and U-Rel –an uncertain table-structured data model. U-Rel is used by MayBMS, an uncertain relational database management system (URDBMS) that provides scalable query evaluation. In contrast to U-Rel, there does not exist an efficient query evaluation mechanism for P-XML.
In this thesis, we approach this problem by instructing MayBMS to cope with P-XML in order to evaluate XPath queries on P-XML data as SQL queries on uncertain relational data. This approach entails two aspects: (1) a data mapping from P-XML to U-Rel that ensures that the same information is represented by database instances of both data structures, and (2) a query mapping from XPath to SQL that ensures that the same question is specified in both query languages.
We present a specification of a P-XML to U-Rel data mapping and a corresponding XPath to SQL mapping. Additionally, we present two designs of this specification. The first design constructs a data mapping in such way that the corresponding query mapping is a traditional XPath to SQL mapping. The second design differs from the first in the sense that a component of the data mapping is evaluated as part of the query evaluation process. This offers the advantage that the data mapping is more efficient. Additionally, the second design allows for a number of optimizations that affect the performance of the query evaluation process. However, this process is burdened with the extra task of evaluating the data mapping component.
An extensive experimental evaluation on synthetically generated data sets and real-world data sets shows that our implementation of the second design is more efficient in most scenarios. Not only is the P-XML data mapping executed more efficient, the query evaluation performance is also improved in most scenarios.
On November 25th, Riham Abdel Kader defended her thesis on her ROX-approach for run-time optimization of XQueries. Her work and thesis were well-received by the PhD committee. The ROX-approach brings more robustness to query optimizers in finding near-optimal execution plans and it can exploit intricate correlations in the data. Albeit meant for XML databases, the approach can be applied to ordinary relational databases as well RDF stores. Riham recently accepted a position at ASML. I am very proud of her and her work.
“ROX: Run-Time Optimization of XQueries”[download, OPAQUE project]
Query optimization is the most important and complex phase of answering a user query. While sufficient for some applications, the widely used type of relational optimizers are not always robust, picking execution plans that are far from optimal. This is due to several reasons. First, they depend on statistics and a cost model which are often inaccurate, and sometimes even absent. Second, they fail to detect correlations which can unexpectedly make certain plans considerably cheaper than others. Finally, they cannot efficiently handle the large search space of big queries.
The challenges faced by traditional relational optimizers and their impact on the quality of the chosen plans are aggravated in the context of XML and XQueries. This is due to the fact that in XML, it is harder to collect and maintain representative statistics since they have to capture more information about the document. Moreover, the search space of plans for an XQuery query is on average larger than that of relational queries, due to the higher number of joins resulting from the existence of many XPath steps in a typical XQuery.
To overcome the above challenges, we propose ROX, a Run-time Optimizer for XQueries. ROX is autonomous, i.e. it does not depend on any statistics and cost models, robust in always finding a good execution plan while detecting and benefiting from correlations, and efficient in exploring the search space of plans. We show, through experiments, that ROX is indeed robust and efficient, and performs better than relational compile-time optimizers. ROX adopts a fundamentally different internal design which moves the optimization to run-time, and interleaves it with query execution. The search space is efficiently explored by alternating optimization and execution phases, defining the plan incrementally. Every execution step executes a set of operators and materializes the results, allowing the next optimization phase to benefit from the knowledge extracted from the newly materialized intermediates. Sampling techniques are used to accurately estimate the cardinality and cost of operators. To detect correlations, we introduce the chain sampling technique, the first generic and robust method to deal with any type of correlated data. We also extend the ROX idea to pipelined architectures to allow most of the existing database systems to benefit from our research.
For his “Research Topic” course, MSc student Emiel Hollander experimented with a mapping from Probabilistic XML to the probabilistic relational database Trio to investigate whether or not it is feasible to use Trio as a back-end for processing XPath queries on Probabilistic XML.
Storing and Querying Probabilistic XML Using a Probabilistic Relational DBMS
Emiel Hollander, Maurice van Keulen
This work explores the feasibility of storing and querying probabilistic XML in a probabilistic relational database. Our approach is to adapt known techniques for mapping XML to relational data such that the possible worlds are preserved. We show that this approach can work for any XML-to-relational technique by adapting a representative schema-based (inlining) as well as a representative schemaless technique (XPath Accelerator). We investigate the maturity of probabilistic relational databases for this task with experiments with one of the state-of- the-art systems, called Trio.
The paper will be presented at the 4th International Workshop on Management of Uncertain Data (MUD 2010) co-located with VLDB, 13 September 2010, Singapore [details]
Mena Badieh Habib started his PhD research in the Neogeography-project today. For details, see my earlier post on “Kick-Off of Neogeography project”.
We designed a variant of our ROX approach for run-time query optimization that works on database systems with a pipelined architecture (i.e., almost all commercial relational databases).
Run-time Optimization for Pipelined Systems
Riham Abdel Kader, Maurice van Keulen, Peter Boncz, Stefan Manegold
Traditional optimizers fail to pick good execution plans, when faced with increasingly complex queries and large data sets. This failure is even more acute in the context of XQuery, due to the structured nature of the XML language. To overcome the vulnerabilities of traditional optimizers, we have previously proposed ROX, a Run-time Optimizer for XQueries, which interleaves optimization and execution of full tables. ROX has proved to be robust, even in the presence of strong correlations, but it has one limitation: it uses full materialization of intermediate results making it unsuitable for pipelined systems. Therefore, this paper proposes ROX-sampled, a variant of ROX, which executes small data samples, thus generating smaller intermediates. We conduct extensive experiments which proved that ROX-sampled is comparable to ROX in performance, and that it is still robust against correlations. The main benefit of ROX-sampled is that it allows the large number of pipelined databases to import the ROX idea into their optimization paradigm.
The paper will be presented at the IV Alberto Mendelzon Workshop on Foundations of Data Management (AMW2010), 17-20 May 2010, Buenos Aires, Argentina [details]
To improve the integration of the new faculty ITC (Geo-Information Science and Earth Observation) into the university, the boards of directors of ITC and UT decided some time ago to subsidize several cooperation projects with each two PhD students, one at ITC and one at the UT. I am involved in one: “Neogeography: the challenge of channelling large and ill-behaved data streams” (see description below). Rolf de By (ITC) and I presented our Neogeography project on the Kick-off meeting 12 March 2010 [presentation]. Rolf’s PhD student is Clarisse Kagoyire and she arrived in The Netherlands just in time to make it to the meeting. My PhD student is Mena Badieh Habib; he will start 1 May 2010.
Neogeography: the challenge of channelling large and ill-behaved data streams
In this project, we develop XML-based data technology to support the channeling of large and ill-behaved neogeographic data streams. In neogeography, geographic information is derived from end-users, not from official bodies like mapping agencies, cadasters or other official, (para-)governmental organizations. The motivation is that multiple (neo)geographic information sources on the same phenomenon can be mutually enriching.
Content provision and feedback from large communities of end-users has great potential for sustaining a high level of data quality. The technology is meant to reach a substantial user community in the less-developed world through content provision and delivery via cell phone networks. Exploiting such neogeographic data requires a.o. the extraction of the where and when from textual descriptions. This comes with intrinsic uncertainty in space, time, but also thematically in terms of entity identification: which is the restaurant, bus stop, farm, market, forest mentioned in this information source? The rise of sensor networks adds to the mix a badly needed verification mechanism for the real-time neogeographic data.
We strive for a proper mix of carefully integrated techniques in geoinformation handling, approaches to spatiotemporal imprecision and incompleteness, as well as data augmentation through sensors in a generic framework with which purpose- oriented end-user communities can be served appropriately.
The UT PhD position focuses on spatiotemporal data technology in XML databases and theory and support technology for storage, manipulation and reasoning with spatiotemporal and thematic uncertainty. The work is to be validated through testbed use cases, such as the H20 project with google.org (water consumers in Zanzibar), AGCommons project with the Gates Foundation (smallholder farmers in sub-Saharan Africa), and other projects with large user communities.
Wired published an article “The Good Enough Revolution: When Cheap and Simple Is Just Fine” which perfectly describes what I want to achieve with building systems that connect to and integrate data sources and systems. “We now favor flexibility over high fidelity, convenience over features, quick and dirty over slow and polished. Having it here and now is more important than having it perfect.”
We developed a demonstration that shows and explains what happens behind the scenes of our ROX approach for run-time query optimization.
The Robustness of a Run-Time XQuery Optimizer against Correlated Data
Riham Abdel Kader, Peter Boncz, Stefan Manegold, Maurice van Keulen
We demonstrate ROX, a run-time optimizer of XQueries, that focuses on finding the best execution order of XPath steps and relational joins in an XQuery. The problem of join ordering has been extensively researched, but the proposed techniques are still unsatisfying. These either rely on a cost model which might result in inaccurate estimations, or explore only a restrictive number of plans from the search space. ROX is developed to tackle these problems. ROX does not need any cost model, and defers query optimization to run-time intertwining optimization and execution steps. In every optimization step, sampling techniques are used to estimate the cardinality of unexecuted steps and joins to make a decision which sequence of operators to process next. Consequently, each execution step will provide updated and accurate knowledge about intermediate results, which will be used during the next optimization round. This demonstration will focus on: (i) illustrating the steps that ROX follows and the decisions it makes to choose a good join order, (ii) showing ROX’s robustness in the face of data with different degree of correlation, (iii) comparing the performance of the plan chosen by ROX to different plans picked from the search space, (iv) proving that the run-time overhead needed by ROX is restricted to a small fraction of the execution time.
The paper will be presented at the 26th International Conference on Data Engineering (ICDE2010), 1-6 Mar 2010, Long Beach, California, USA [details]
It happens too infrequently with students for my taste, but Irma Veldman, a student of mine, got a paper accepted for the SUM conference about her research project.
Compression of Probabilistic XML documents
Irma Veldman, Ander de Keijzer, Maurice van Keulen
Database techniques to store, query and manipulate data that contains uncertainty receives increasing research interest. Such UDBMSs can be classified according to their underlying data model: relational, XML, or RDF. We focus on uncertain XML DBMS with as representative example the Probabilistic XML model (PXML) of . The size of a PXML document is obviously a factor in performance. There are PXML-specific techniques to reduce the size, such as a push down mechanism, that produces equivalent but more compact PXML documents. It can only be applied, however, where possibilities are dependent. For normal XML documents there also exist several techniques for compressing a document. Since Probabilistic XML is (a special form of) normal XML, it might benefit from these methods even more. In this paper, we show that existing compression mechanisms can be combined with PXML-specific compression techniques. We also show that best compression rates are obtained with a combination of PXML-specific technique with a rather simple generic DAG-compression technique.
The paper will be presented at the third International Conference on Scalable Uncertainty Management (SUM2009), 28-30 Sep 2009, Washington, DC, USA [details]