Tag-Archive for » data quality «

Tuesday, February 02nd, 2016 | Author:

The project proposal “Time To Care: Using sensor technology to dynamically model social interactions of healthcare professionals at work in relation to healthcare quality” has been accepted in our university’s Tech4People program. The project is a cooperation with Educational Sciences (chair OWK) and Psychology of Conflict, Risk and Safety (chair PCRS) with whom the funded PhD student will be shared.

What I am particularly enthusiastic about in this project is that it is not only an interdisciplinary cooperation towards a shared goal, but that disciplinary research questions from each of the participating disciplines can be answered as well. For me, it is a unique opportunity to test whether probabilistic modeling of the data quality problems / noise in the social interaction data obtained from the sensors indeed provides significantly different results when predicting team performance.

Thursday, January 14th, 2016 | Author:

Today I gave a presentation at the Data Science Northeast Netherlands Meetup about
Managing uncertainty in data: the key to effective management of data quality problems [slides (PDF)]

Business analytics and data science are significantly impaired by a wide variety of ‘data handling’ issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers around data semantics and data quality. We have developed a generic method which is based on modeling such problems as uncertainty *in* the data. A recently conceived new kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or “Uncertain Database”. Together, they allow one, for example, to postpone the resolution of data problems and to assess their influence on analytical results. We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with the ambiguity of natural language and many other problems encountered when using unstructured data.
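As a toy illustration of the idea (a minimal sketch, not the actual UDBMS implementation; all table and attribute names below are invented), uncertain data can be pictured as tuples whose attributes carry several alternative values with probabilities. A query then propagates the probability mass instead of forcing a cleaning decision up front:

```python
# Sketch: representing a data quality problem as uncertainty *in* the data.
# Instead of picking one (possibly wrong) cleaned value, we keep the
# alternatives with probabilities and let queries aggregate over them.

# Each uncertain attribute is a list of (value, probability) alternatives.
# Example: a record where the city could not be unambiguously resolved.
customers = [
    {"name": "Jansen", "city": [("Enschede", 0.7), ("Eindhoven", 0.3)]},
    {"name": "de Vries", "city": [("Enschede", 1.0)]},
]

def prob_count(rows, attr, value):
    """Expected number of rows where `attr` equals `value`,
    summing the probability mass of the matching alternatives."""
    return sum(p for row in rows for v, p in row[attr] if v == value)

# The query answer is probability-weighted rather than a single hard count:
print(prob_count(customers, "city", "Enschede"))  # 0.7 + 1.0 = 1.7
```

The point of the sketch is that postponing resolution is cheap: re-running the query after a data problem is (partially) resolved simply means updating the alternatives.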

Wednesday, February 25th, 2015 | Author:

Today I gave a presentation at the SIKS Smart Auditing workshop at the University of Tilburg.

Friday, June 22nd, 2012 | Author:

On 22 June 2012, Mike Niblett defended his MSc thesis “A method to obtain sustained data quality at Distimo”. The MSc project was carried out at Distimo, a mobile app analytics company.
“A method to obtain sustained data quality at Distimo”[download]
This thesis attempts to answer the following research question: “how can we determine and improve data quality?”.
A method is proposed to systematically analyse the demands and current state of data quality within an organisation. The mission statement and information systems architecture are used to characterise the organisation. A list of data quality characteristics based on literature is used to express the organisation in terms of data quality. Metrics are established to quantify the data quality characteristics. A risk analysis determines which are the most important areas to improve upon. After improvement, the metrics can be used to evaluate the success of the improvements.
Distimo is an innovative application store analytics company aiming to solve the challenges created by a widely fragmented application store marketplace filled with equally fragmented information and statistics. As Distimo’s products are very data driven, data quality is very important. The method will be applied to Distimo as a case study.
The proposed method provides a way to determine the current state of data quality, to determine what to improve, and to evaluate whether the improvements provide the desired outcome. The case study resulted in an in-depth analysis of Distimo, which in turn yielded a number of data quality improvements that are now in production and have improved data quality.
Because of the generic nature of the input data, the proposed method is applicable to any organisation looking to improve data quality. The iterative improvement process allows for fine-grained control of changes to organisational processes and systems.
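To give a flavour of the metrics step (a minimal sketch; the completeness metric and field names are my illustrative choices, not Distimo's actual metrics), a data quality characteristic is quantified so that it can be measured before and after an improvement:

```python
# Sketch: quantifying one data quality characteristic (completeness) as a
# metric, so improvements can be evaluated by re-measuring afterwards.

def completeness(records, required_fields):
    """Fraction of required fields that are filled in, over all records."""
    total = len(records) * len(required_fields)
    filled = sum(1 for r in records for f in required_fields
                 if r.get(f) not in (None, ""))
    return filled / total

# Invented example data for an app analytics setting.
apps = [
    {"app_id": "a1", "downloads": 120, "price": 0.99},
    {"app_id": "a2", "downloads": None, "price": 0.0},
]

score = completeness(apps, ["app_id", "downloads", "price"])
print(f"completeness: {score:.2f}")  # 5 of 6 fields filled -> 0.83
```

In the method above, such metrics are first used in the risk analysis to prioritise what to improve, and later re-evaluated to check whether the improvement worked.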

Category: Student projects
Thursday, October 21st, 2010 | Author:

Fabian Panse from the University of Hamburg in Germany just launched a website about our cooperation on the topic of “Quality of Uncertain Data (QloUD)”.

Friday, August 27th, 2010 | Author:

On Thursday 26 August 2010, Guido van der Zanden defended his MSc thesis “Quality Assessment of Medical Health Records using Information Extraction”. The MSc project was supervised by me, Ander de Keijzer, and Vincent Ivens and Daan van Berkel from Topicus Zorg.

“Quality Assessment of Medical Health Records using Information Extraction” [download]
The most important information in Electronic Health Records is in free text form. As a result, the quality of Electronic Health Records is hard to assess. Since Electronic Health Records are exchanged more and more, badly written or incomplete records can cause problems when other healthcare providers do not completely understand them. In this thesis we try to automatically assess the quality of Electronic Health Records using Information Extraction. Another advantage of the automated analysis of Electronic Health Records is that management information can be extracted, which can be used to increase efficiency and decrease cost, another popular subject in healthcare nowadays.
Our solution for automated assessment of Electronic Health Records consists of two parts. In the first part we theoretically determine what the quality of Electronic Health Records is, based upon Data and Information Quality theory. From this analysis we propose three quality metrics. The first two check whether an Electronic Health Record is written as prescribed by the guidelines of the association of general practitioners: the first checks whether the SOEP methodology is used correctly, the second whether a treatment is carried out according to the guideline for that illness. The third metric is more generally applicable and measures conciseness.
In the second part we designed and implemented a prototype system to execute the quality assessment. Due to time limitations we only implemented the SOEP methodology metric. This metric tests whether a piece of text is placed in the right field. The fields that can be used by a healthcare provider are (S)ubjective, (O)bjective, (E)valuation and (P)lan. We implemented a prototype based upon the ‘General Architecture for Text Engineering’. Many generic Information Extraction tasks were already available; we implemented two domain-specific tasks ourselves. The first looks up words in a thesaurus (the UMLS) in order to give meaning to the text, since one or more semantic types are assigned to every word in the thesaurus. The semantic types found in a sentence are then resolved to one of the four SOEP types. In a good Electronic Health Record, sentences resolve to the SOEP field they are actually in.
To validate our prototype we annotated text from real Electronic Health Records with S, O, E and P and compared the annotations to the output of our prototype. We found a precision of roughly 50% and a recall of 20–25%. Although not perfect, and given that we had neither the time nor the resources to involve domain experts, we think this result is encouraging for further research. Furthermore, we showed with use cases that our other two metrics are sensible. Although this is no proof that they are feasible in practice, it shows that a whole set of different metrics can be used to assess the quality of Electronic Health Records.
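The resolution step described above can be sketched as follows (a toy illustration: the miniature thesaurus and the type-to-SOEP mapping are invented stand-ins for the UMLS and the prototype's real resolution logic):

```python
# Sketch: resolve a sentence to a SOEP field by looking up words' semantic
# types in a thesaurus and taking a majority vote over the types found.

from collections import Counter

# Toy thesaurus: word -> semantic type (the UMLS assigns one or more types;
# we keep one per word for simplicity).
THESAURUS = {
    "headache": "Sign or Symptom",
    "fever": "Sign or Symptom",
    "temperature": "Diagnostic Procedure",
    "migraine": "Disease or Syndrome",
    "paracetamol": "Pharmacologic Substance",
}

# Assumed mapping from semantic types to SOEP fields.
TYPE_TO_SOEP = {
    "Sign or Symptom": "S",
    "Diagnostic Procedure": "O",
    "Disease or Syndrome": "E",
    "Pharmacologic Substance": "P",
}

def classify(sentence):
    """Resolve a sentence to the SOEP field suggested by its words,
    or None when no word is found in the thesaurus."""
    votes = Counter(TYPE_TO_SOEP[THESAURUS[w]]
                    for w in sentence.lower().split() if w in THESAURUS)
    return votes.most_common(1)[0][0] if votes else None

print(classify("Patient reports headache and fever"))  # S
```

Comparing such predicted fields against the field a sentence actually appears in is exactly what yields the precision and recall figures reported above.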

Friday, June 18th, 2010 | Author:

I have written an article about “uncertain databases” for DB/M Database Magazine of Array Publications. It appears in issue 4, the June issue, which is now for sale. The theme of this special issue is “Data quality”.
Uncertain databases
A recent development in database research concerns the so-called ‘uncertain databases’. This article describes what uncertain databases are, how they can be used, and which applications could particularly benefit from this technology [details].

Thursday, June 17th, 2010 | Author:

I just read a very nice post about “Who owns the data”. I especially like the analogy at the end: who owns your house? Not the architect, not the builder, nor the plumber, but you yourself, of course, with everything in it! So it should be the organization, or more precisely, the department that uses the data, that should own it. Not the IT department that builds and/or maintains it. I also like his link to data quality in identifying who should own the data: “Who is going to care most if the data is incorrect?”. These need not be the people who generate or capture the data. This issue is, in my opinion, one of the major causes of problems with data quality and ineffective use of data.

Category: Probabilistic Data Integration
Tuesday, April 27th, 2010 | Author:

I have written an article about “uncertain databases” for Database Magazine of Array Publications. It will be published in issue 4, a special issue on “Data quality”.

Wednesday, November 25th, 2009 | Author:

As a product of my cooperation with Fabian Panse from the University of Hamburg, we got a paper accepted at the NTII workshop co-located with ICDE 2010.
Duplicate Detection in Probabilistic Data
Fabian Panse, Maurice van Keulen, Ander de Keijzer, Norbert Ritter
Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities.

The paper will be presented at the Second International Workshop on New Trends in Information Integration (NTII 2010), Long Beach, California, USA [details]
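As a rough illustration of the problem setting (not of the techniques from the paper; names and probabilities are invented), consider two probabilistic tuples that may describe the same real-world entity. One naive signal for duplicate detection is the probability that the two value distributions agree, assuming independence:

```python
# Sketch: comparing two probabilistic representations of a possibly
# identical real-world entity. Each attribute is a distribution over
# candidate values.

def match_probability(dist_a, dist_b):
    """P(both tuples carry the same value), assuming independence."""
    return sum(p * dist_b.get(v, 0.0) for v, p in dist_a.items())

# Two uncertain 'surname' attributes from autonomous probabilistic databases.
tuple1 = {"Smith": 0.8, "Smyth": 0.2}
tuple2 = {"Smith": 0.6, "Schmidt": 0.4}

p = match_probability(tuple1, tuple2)
print(f"agreement probability: {p:.2f}")  # 0.8 * 0.6 = 0.48
```

A real duplicate detection approach would of course combine such per-attribute evidence over all attributes and use proper similarity measures for near-identical values; the sketch only shows why uncertainty in *both* sources changes the matching problem.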