I have a vacancy for a PhD position in a project called “Pay-As-You-Go Data Integration for Bio-Informatics” (PayDIBI). In short, the objective is to develop data coupling and integration technology to support bio-informatics scientists in quickly constructing targeted data sets for researching questions that require the combination of information from more than one biological database. More information and a webform to apply can be found here.
The Bedrijfsinformatietechnologie (Business Information Technology, BIT) programme at the University of Twente is called a “real high-flyer” by the Keuzegids Hoger Onderwijs Universiteiten 2011. In a comparison of all “Informatiekunde” (information science) programmes in the Netherlands, BIT receives an overall score of 82, head and shoulders above number 2, Informatiekunde in Groningen, with 74 points. See the article in the weekly newspaper.
I have strong ties with BIT: I am a member of the BIT programme committee, which advises on the curriculum and other matters; I am also active in informing prospective students about BIT; and I teach BIT courses and supervise BIT students.
On November 25th, Riham Abdel Kader defended her thesis on her ROX approach for run-time optimization of XQueries. Her work and thesis were well received by the PhD committee. The ROX approach makes query optimizers more robust in finding near-optimal execution plans, and it can exploit intricate correlations in the data. Although meant for XML databases, the approach can be applied to ordinary relational databases as well as RDF stores. Riham recently accepted a position at ASML. I am very proud of her and her work.
“ROX: Run-Time Optimization of XQueries”[download, OPAQUE project]
Query optimization is the most important and complex phase of answering a user query. While sufficient for some applications, the widely used type of relational optimizers are not always robust, picking execution plans that are far from optimal. This is due to several reasons. First, they depend on statistics and a cost model which are often inaccurate, and sometimes even absent. Second, they fail to detect correlations which can unexpectedly make certain plans considerably cheaper than others. Finally, they cannot efficiently handle the large search space of big queries.
The challenges faced by traditional relational optimizers and their impact on the quality of the chosen plans are aggravated in the context of XML and XQueries. This is due to the fact that in XML, it is harder to collect and maintain representative statistics since they have to capture more information about the document. Moreover, the search space of plans for an XQuery query is on average larger than that of relational queries, due to the higher number of joins resulting from the existence of many XPath steps in a typical XQuery.
To overcome the above challenges, we propose ROX, a Run-time Optimizer for XQueries. ROX is autonomous, i.e. it does not depend on any statistics and cost models, robust in always finding a good execution plan while detecting and benefiting from correlations, and efficient in exploring the search space of plans. We show, through experiments, that ROX is indeed robust and efficient, and performs better than relational compile-time optimizers. ROX adopts a fundamentally different internal design which moves the optimization to run-time, and interleaves it with query execution. The search space is efficiently explored by alternating optimization and execution phases, defining the plan incrementally. Every execution step executes a set of operators and materializes the results, allowing the next optimization phase to benefit from the knowledge extracted from the newly materialized intermediates. Sampling techniques are used to accurately estimate the cardinality and cost of operators. To detect correlations, we introduce the chain sampling technique, the first generic and robust method to deal with any type of correlated data. We also extend the ROX idea to pipelined architectures to allow most of the existing database systems to benefit from our research.
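The alternation of optimization and execution phases described above can be illustrated with a small Python sketch. This is not the ROX implementation: the function names, the list-of-dicts tables, and the greedy "smallest estimated join first" rule are all assumptions made for illustration, and plain random sampling stands in for the thesis's sampling techniques (chain sampling is not shown).

```python
import random

def join(left, right, key):
    """Hash join of two lists of dicts on a shared key; materializes the result."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def estimate_join_size(left, right, key, sample_size, rng):
    """Estimate the join cardinality by joining a random sample of `left`."""
    if not left:
        return 0.0
    sample = rng.sample(left, min(sample_size, len(left)))
    return len(join(sample, right, key)) * len(left) / len(sample)

def run_time_optimize(start, candidates, sample_size=50, seed=0):
    """Alternate optimization (sampling) and execution (full join) phases.

    `candidates` maps a hypothetical table name to (rows, join_key).
    Returns the join order that was chosen and the final materialized result.
    """
    rng = random.Random(seed)
    intermediate, remaining, order = start, dict(candidates), []
    while remaining:
        # Optimization phase: sample against the materialized intermediate,
        # so no precomputed statistics or cost model are needed.
        estimates = {
            name: estimate_join_size(intermediate, rows, key, sample_size, rng)
            for name, (rows, key) in remaining.items()
        }
        best = min(estimates, key=estimates.get)
        # Execution phase: run the chosen operator and materialize its result.
        rows, key = remaining.pop(best)
        intermediate = join(intermediate, rows, key)
        order.append(best)
    return order, intermediate
```

Because every estimate is taken on the actual intermediate result, a join that turns out to be very selective on the data at hand is picked first even when nothing was known about it beforehand.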
On 22 October 2010, Emiel Hollander defended his MSc thesis “Dynamic Access Control”. The MSc project was supervised by me, Virginia Nunes Franqueira, and Anton Boerma and Richard Scholten from Exxellence.
“Dynamic Access Control”[download]
An increasing number of services require access control. On the web, access control is usually enforced using a combination of username and password. Users are encouraged to choose secure passwords. These secure passwords are very hard to remember, which causes people to write passwords down, re-use the same password or choose a simple password. Our goal is to design an access control system that is easier to use, while still offering the same amount of security.
The main idea behind this research is that not every service needs the same amount of security. It may not be necessary to ask the secure password for every service; for services that require less security, an access control method that is less secure, but easier to use, may be sufficient.
We have built a system that is capable of dynamically determining the access control method or methods that it has to use to ensure sufficient security. When the user requests a service, the system looks up the amount of security that is needed and adapts the used access control methods to this.
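As a rough illustration of this idea, the Python sketch below picks the least burdensome combination of access control methods that reaches the security level a service requires. The method names, the numeric security and effort scores, the service names, and the assumption that security levels of combined methods simply add up are all hypothetical; the thesis may model this differently.

```python
from itertools import combinations

# Hypothetical methods: name -> (security points, effort for the user)
METHODS = {
    "caller_id": (1, 1),
    "pin": (2, 2),
    "password": (3, 3),
    "sms_code": (3, 4),
}

# Hypothetical services and the security level each one requires
REQUIRED_LEVEL = {"opening_hours": 0, "view_appointment": 2, "change_address": 5}

def choose_methods(service):
    """Return the lowest-effort set of methods whose combined security suffices.

    Assumption: the security of a combination is the sum of its parts.
    """
    needed = REQUIRED_LEVEL[service]
    best = None
    for size in range(len(METHODS) + 1):
        for combo in combinations(sorted(METHODS), size):
            security = sum(METHODS[m][0] for m in combo)
            effort = sum(METHODS[m][1] for m in combo)
            if security >= needed and (best is None or effort < best[0]):
                best = (effort, combo)
    if best is None:
        raise ValueError(f"no combination of methods is secure enough for {service}")
    return list(best[1])
```

With these made-up scores, a service that needs no security asks for nothing at all, while a sensitive service requires two methods combined.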
The evaluation of this system shows that people appreciate the fact that the system is able to choose easier access control methods for services that do not require a high security level. According to the participants, the dynamic system is easier and more pleasant to use than an access control method based on caller ID, and easier and more pleasant than DigiD with additional SMS authentication. The participants, however, did not find the dynamic system easier or more pleasant to use than username and password. This system is so common and widely-used that it is hard to beat. We do believe, however, that the dynamic system can become better than username and password when users get more accustomed to it, and when some usability problems have been looked into.
Fabian Panse from the University of Hamburg in Germany just launched a website about our cooperation on the topic of “Quality of Uncertain Data (QloUD)”.
On Thursday 26 August 2010, Guido van der Zanden defended his MSc thesis “Quality Assessment of Medical Health Records using Information Extraction”. The MSc project was supervised by me, Ander de Keijzer, and Vincent Ivens and Daan van Berkel from Topicus Zorg.
“Quality Assessment of Medical Health Records using Information Extraction” [download]
The most important information in Electronic Health Records is in free text form. The result is that the quality of Electronic Health Records is hard to assess. Since Electronic Health Records are exchanged more and more, badly written or incomplete records can cause problems when other healthcare providers do not completely understand them. In this thesis we try to automatically assess the quality of Electronic Health Records using Information Extraction. Another advantage of the automated analysis of Electronic Health Records is to extract management information which can be used in order to increase efficiency and decrease cost, another popular subject in healthcare nowadays.
Our solution for automated assessment of Electronic Health Records consists of two parts. In the first part we theoretically determine what the quality of Electronic Health Records is, based upon Data and Information Quality theory. Based upon this analysis we propose three quality metrics. The first two check whether an Electronic Health Record is written as prescribed by guidelines of the association of general practitioners. The first checks whether the SOEP methodology is used correctly, the second whether a treatment is carried out according to the guideline for that illness. The third metric is more generally applicable and measures conciseness.
In the second part we designed and implemented a prototype system to execute the quality assessment. Due to time limitations we only implemented the SOEP methodology metric. This metric tests whether a piece of text is placed in the right field. The fields that can be used by a healthcare provider are (S)ubjective, (O)bjective, (E)valuation and (P)lan. We implemented a prototype based upon the ‘General Architecture for Text Engineering’. Many generic Information Extraction tasks were available already; we implemented two domain-specific tasks ourselves. The first looks up words in a thesaurus (the UMLS) in order to give meaning to the text, since every word in the thesaurus is assigned one or more semantic types. The semantic types found in a sentence are then resolved to one of the four SOEP types. In a good Electronic Health Record, sentences are resolved to the SOEP field they are actually in.
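The lookup-and-resolve step can be sketched in a few lines of Python. This is only an illustration, not the GATE-based prototype: the tiny thesaurus, the mapping from semantic types to SOEP fields, and the majority-vote rule are assumptions, and a real system would query the UMLS rather than a hard-coded dictionary.

```python
from collections import Counter

# Toy stand-in for a thesaurus lookup: word -> semantic types (hypothetical entries)
THESAURUS = {
    "headache": ["Sign or Symptom"],
    "nausea": ["Sign or Symptom"],
    "temperature": ["Laboratory or Test Result"],
    "migraine": ["Disease or Syndrome"],
    "paracetamol": ["Pharmacologic Substance"],
}

# Hypothetical mapping from semantic type to SOEP field
TYPE_TO_SOEP = {
    "Sign or Symptom": "S",
    "Laboratory or Test Result": "O",
    "Disease or Syndrome": "E",
    "Pharmacologic Substance": "P",
}

def resolve_sentence(sentence):
    """Resolve a sentence to the SOEP field most of its known words point to."""
    votes = Counter()
    for word in sentence.lower().split():
        for semantic_type in THESAURUS.get(word.strip(".,;:"), []):
            votes[TYPE_TO_SOEP[semantic_type]] += 1
    return votes.most_common(1)[0][0] if votes else None

def misplaced(record):
    """Given (field, sentence) pairs, return those resolved to a different field.

    Sentences without any known word cannot be resolved and are skipped.
    """
    return [(field, sentence) for field, sentence in record
            if resolve_sentence(sentence) not in (None, field)]
```

A record in which `misplaced` returns nothing would count as using the SOEP fields correctly under this toy metric.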
To validate our prototype we annotated text from real Electronic Health Records with S, O, E and P and compared it to the output of our prototype. We found a precision of roughly 50% and a recall of 20-25%. Although the result is not perfect, we think it is encouraging for further research, given that we had neither the time nor the resources to involve domain experts. Furthermore, we have shown with use cases that our other two metrics are sensible. Although this is no proof that they are feasible in practice, they show that a whole set of different metrics can be used to assess the quality of Electronic Health Records.
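For readers unfamiliar with these measures, the following Python sketch shows one common way to compute precision and recall from manual annotations. The function and the example labels are made up for illustration; the thesis may have counted matches at a different granularity.

```python
def precision_recall(gold, predicted):
    """Compare predicted labels against manual annotations.

    gold:      dict sentence_id -> annotated label (S, O, E or P)
    predicted: dict sentence_id -> label from the system; sentences the
               system could not resolve are simply absent.
    """
    correct = sum(1 for key, label in predicted.items() if gold.get(key) == label)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Made-up example: 8 annotated sentences, the system labels only 4 of them
gold = {1: "S", 2: "O", 3: "E", 4: "P", 5: "S", 6: "O", 7: "E", 8: "P"}
predicted = {1: "S", 2: "O", 3: "P", 4: "S"}
print(precision_recall(gold, predicted))  # (0.5, 0.25)
```

In this invented example half of the predictions are right, but recall is only a quarter because half of the annotated sentences receive no label at all.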
For his “Research Topic” course, MSc student Emiel Hollander experimented with a mapping from Probabilistic XML to the probabilistic relational database Trio to investigate whether or not it is feasible to use Trio as a back-end for processing XPath queries on Probabilistic XML.
Storing and Querying Probabilistic XML Using a Probabilistic Relational DBMS
Emiel Hollander, Maurice van Keulen
This work explores the feasibility of storing and querying probabilistic XML in a probabilistic relational database. Our approach is to adapt known techniques for mapping XML to relational data such that the possible worlds are preserved. We show that this approach can work for any XML-to-relational technique by adapting a representative schema-based (inlining) as well as a representative schemaless technique (XPath Accelerator). We investigate the maturity of probabilistic relational databases for this task with experiments with one of the state-of-the-art systems, called Trio.
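To give an impression of the schemaless technique mentioned here, the Python sketch below builds the pre/post-order node table on which the XPath Accelerator encoding is based and evaluates a descendant step on it. It covers ordinary XML only: how probabilities and possible worlds are carried over to Trio is the paper's contribution and is not reproduced, and the function names are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

def pre_post_table(xml_text):
    """Return one row (pre, post, tag) per element, as a relational node table."""
    pre_counter, post_counter = itertools.count(), itertools.count()
    rows = []

    def visit(node):
        pre = next(pre_counter)            # rank in document (preorder) order
        for child in node:
            visit(child)
        rows.append((pre, next(post_counter), node.tag))  # postorder rank

    visit(ET.fromstring(xml_text))
    return sorted(rows)

def descendants(rows, pre, post):
    """Descendant axis as a range condition: later pre rank, earlier post rank."""
    return [row for row in rows if row[0] > pre and row[1] < post]
```

The point of the encoding is that an XPath axis becomes a simple comparison on two integer columns, which a relational engine can evaluate with ordinary selections and joins.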
The paper will be presented at the 4th International Workshop on Management of Uncertain Data (MUD 2010) co-located with VLDB, 13 September 2010, Singapore [details]
On Wednesday 14 July 2010, Tom Palsma defended his MSc thesis entitled “Discovering groups using short messages from social network profiles”. The MSc project was carried out at Topicus FinCare. It was supervised by me, Dolf Trieschnigg (UT), Jasper Laagland (FinCare), and Wouter de Jong (FinCare).
Discovering groups using short messages from social network profiles[download]
In the past few years people used the internet more and more for sharing their photos, thoughts and activities via photo albums, user profiles, blogs and short text messages on online social networking sites, like Facebook and MySpace. This information could be very useful for (personal) marketing and advertising. Besides groups formed by the users of the social network sites explicitly, people could have a relation based on similar interests that they implicitly leave in short text messages.
This research has a focus on discovering groups, taking into account semantic relations between user profiles and describing the characteristics of the groups and relations. Because it is hard to discover these types of relations at word level by matching similar words in messages, we introduce a hierarchical structure of concepts obtained from the Wikipedia category system to discover groups and (semantic) relations at more abstract conceptual levels between profiles with short text messages obtained from Twitter.
In order to provide a general approach that is not limited to a specific set of concepts we use a naive classification approach. Concepts (and their parent concepts) are assigned to profiles when concept (related) terms occur in a short message of the profile. Manual evaluation of this approach shows that 37.4% of the assignments are correct (the precision), which results in an F-score of 0.54. To improve the precision of the classification results we use Support Vector Machines. Using features related to characteristics of the concept structure and the relations between concepts and profiles raises the F-score by 0.14, to 0.68.
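The naive assignment step can be pictured with the following Python sketch. The concept hierarchy and the term lists are invented for illustration and are far smaller than the Wikipedia category system; the exact matching rules used in the thesis are not given here, so simple word overlap is assumed.

```python
# Hypothetical fragment of a concept hierarchy: concept -> parent concept
PARENT = {
    "Football": "Team sports",
    "Team sports": "Sports",
    "Jazz": "Music genres",
    "Music genres": "Music",
}

# Hypothetical concept-related terms
TERMS = {
    "Football": {"football", "goal"},
    "Jazz": {"jazz", "saxophone"},
}

def ancestors(concept):
    """Walk up the hierarchy and collect all parent concepts."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def assign_concepts(messages):
    """Assign a concept, plus its parents, when one of its terms occurs in a message."""
    assigned = set()
    for message in messages:
        words = set(message.lower().split())
        for concept, terms in TERMS.items():
            if words & terms:
                assigned.add(concept)
                assigned.update(ancestors(concept))
    return assigned
```

A profile containing "my goal is to finish this report" would also be assigned Football and its parents, which hints at why the precision of such a naive approach is low and why ambiguous terms are a problem.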
The grouping process consists of clustering of statistical data of concept occurrences in user profiles. Interesting groups discovered based on the clustering results are groups of concepts that are not grouped together in the original Wikipedia category structure. Besides these types of groups the results also show groups of concepts that have a semantic relation, which is not reflected in the Wikipedia category structure. This information could be used to improve the Wikipedia category structure.
The overall process shows that the usage of hierarchical concepts and clustering helps to discover groups based on semantic relations on abstract conceptual levels. The selection of concepts and assigning them to user profiles could guide the grouping results to desired domains and the concepts help to describe the groups. However, due to problems with ambiguous meaning of concepts and characteristics of the messages, another approach of assigning concepts could improve the quality of the discovered groups. To know how useful the groups are for marketing and advertising requires more research.
I have written an article about “uncertain databases” for DB/M Database Magazine from Array Publications. It appears in issue 4, the June issue, so it is on sale now. The theme of this special issue is “Datakwaliteit” (data quality).
Onzekere databases (Uncertain databases)
A recent development in database research concerns so-called ‘uncertain databases’. This article describes what uncertain databases are, how they can be used, and which applications in particular could benefit from this technology [details].
I have written an article about “uncertain databases” for Database Magazine from Array Publications. It will be published in issue 4, a special issue on “Datakwaliteit” (data quality).
