On 22 June 2012, Mike Niblett defended his MSc thesis “A method to obtain sustained data quality at Distimo”. The MSc project was carried out at Distimo, a mobile app analytics company.
“A method to obtain sustained data quality at Distimo”[download]
This thesis attempts to answer the following research question: “how can we determine and improve data quality?”.
A method is proposed to systematically analyse the demands and current state of data quality within an organisation. The mission statement and information systems architecture are used to characterise the organisation. A list of data quality characteristics based on literature is used to express the organisation in terms of data quality. Metrics are established to quantify the data quality characteristics. A risk analysis determines which are the most important areas to improve upon. After improvement, the metrics can be used to evaluate the success of the improvements.
Distimo is an innovative application store analytics company aiming to solve the challenges created by a widely fragmented application store marketplace filled with equally fragmented information and statistics. As Distimo’s products are very data driven, data quality is very important. The method will be applied to Distimo as a case study.
The proposed method provides a way to determine the current state of data quality, and to determine what to improve, and how to evaluate if the improve- ments provide the desired outcome. The case study of Distimo resulted in an in-depth analysis of Distimo, which in turn yielded a number of data quality improvements that at this very moment are in production and have improved data quality.
Because of the generic nature of input data, the proposed method is applicable to any organisation looking to improve data quality. The iterative improvement process allow for fine grained control of changes to organisational processes and systems.
Tag-Archive for ◊ data quality ◊
Fabian Panse from the University of Hamburg in Germany just lauched a website about our cooperation on the topic of “Quality of Uncertain Data (QloUD)”.
On Thursday 26 August 2010, Guido van der Zanden defended his MSc thesis “Quality Assessment of Medical Health Records using Information Extraction”. The MSc project was supervised by me, Ander de Keijzer, and Vincent Ivens and Daan van Berkel from Topicus Zorg.
“Quality Assessment of Medical Health Records using Information Extraction” [download]
The most important information in Electronic Health Records is in free text form. The result is that the quality of Electronic Health Records is hard to as- sess. Since Electronic Health Records are exchanged more and more, badly writ- ten or incomplete records can cause problems when other healthcare providers do not completely understand them. In this thesis we try to automatically assess the quality of Electronic Health Records using Information Extraction. Another advantage of the automated analysis of Electronic Health Records is to extract management information which can be used in order to increase efficiency and decrease cost, another popular subject in healthcare nowadays.
Our solution for automated assessment of Electronic Health Records consists out of two parts. In the first part we theoretically determine what the quality of Electronic Health Records is, based upon Data and Information Quality theory. Based upon this analysis we propose three quality metrics. The first two check whether an Electronic Health Record is written as prescribed by guidelines of the association of general practitioners. The first checks whether the SOEP methodology is used correctly, the second whether a treatment is carried out according to the guideline for that illness. The third metric is more general applicable and measures conciseness.
In the second part we designed and implemented a prototype system to ex- ecute the quality assessment. Due to time limitations we only implemented the SOEP methodology metric. This metric tests whether a piece of text is placed in the right place. The fields that can be used by a healthcare provider are (S)ubjective, (O)bjective, (E)valuation and (P)lan. We implemented a proto- type based upon the ‘General Architecture for Text Engineering’. Many generic Information Extraction tasks were available already, we implemented two do- main specific tasks ourselves. The first looks up words in a thesaurus (the UMLS) in order to give meaning to the text, since to every word in the the- saurus one or more semantic types are assigned. The semantic types found in a sentence are then resolved to one of the four SOEP types. In a good Electronic Health Record, sentences are resolved to the SOEP field they are actually in.
To validate our prototype we annotated text from real Electronic Health Records with S,O,E and P and compared it to the output of our prototype. We found a Precision of roughly 50% and a recall of 20-25%. Although not perfect, because we had time nor resources to involve domain experts we think this result is encouraging for further research. Furthermore we shown that our other two metrics are sensible with use cases. Although no proof they are feasible in practice, they show that a whole set of different metrics can be used to assess the quality of Electronic Health Records.
Ik heb een artikel geschreven over “onzekere databases” voor DB/M Database Magazine van Array Publications. Hij staat in nummer 4, het juni-nummer dus nu te koop. Het thema van dit speciale nummer is “Datakwaliteit”.
Onzekere databases
Een recente ontwikkeling in het databaseonderzoek betreft de zogenaamde ‘onzekere databases’. Dit artikel beschrijft wat onzekere databases zijn, hoe gebruikt kunnen worden en welke toepassingen met name voordeel zouden kunnen hebben van deze technologie [details].
I just read a very nice post about “Who owns the data”. I especially like the analogy at the end: who owns your house? Not the architect, not the builder, nor the plumber, but you yourself, of course, with everything in it! So, it should be the organization, or more precise, the department that uses the data, who should own it. Not the IT department who builds and/or maintains it. I also like his link to data quality in identifying who should own the data: “Who is going to care most if the data is incorrect?”. These need not be the people who generate or capture the data. This issue is, in my opinion, one of the major causes of problems with data quality and ineffective use of data.
Ik heb een artikel geschreven over “onzekere databases” voor Database Magazine van Array Publications. Het wordt geplaatst in nummer 4, een speciaal nummer over “Datakwaliteit”.
As a product of my cooperation with Fabian Panse from the University of Hamburg, we got a paper accepted at the NTII-workshop co-located with ICDE 2010.
Duplicate Detection in Probabilistic Data
Fabian Panse, Maurice van Keulen, Ander de Keijzer, Norbert Ritter
Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities.
The paper will be presented at the Second International Workshop on New Trends in Information Integration (NTII 2010), Long Beach, California, USA [details]
Qualitative Effects of Knowledge Rules in Probabilistic Data Integration
One of the problems in data integration is data overlap: the fact that different data sources have data on the same real world entities. Much development time in data integration projects is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates from the integration result or solve other semantic conflicts, but it proofs impossible to get rid of all semantic problems in data integration. An often-used rule of thumb states that about 90% of the development effort is devoted to solving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that stores any remaining semantic uncertainty and conflicts in a probabilistic database enabling it to already be meaningfully used. The main development effort in our approach is devoted to defining and tuning knowledge rules and thresholds. Rules and thresholds directly impact the size and quality of the integration result. We measure integration quality indirectly by measuring the quality of answers to queries on the integrated data set in an information retrieval-like way. The main contribution of this report is an experimental investigation of the effects and sensitivity of rule definition and threshold tuning on the integration quality. This proves that our approach indeed reduces development effort — and not merely shifts the effort to rule definition and threshold tuning — by showing that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ integration that can be meaningfully used.
The technical report was published as TR-CTIT-08-42 by the CTIT. [electronic version] [details]
