
Friday, June 22nd, 2012 | Author:

On 22 June 2012, Mike Niblett defended his MSc thesis “A method to obtain sustained data quality at Distimo”. The MSc project was carried out at Distimo, a mobile app analytics company.
“A method to obtain sustained data quality at Distimo” [download]
This thesis attempts to answer the following research question: “How can we determine and improve data quality?”
A method is proposed to systematically analyse the demands on, and the current state of, data quality within an organisation. The mission statement and the information systems architecture are used to characterise the organisation. A list of data quality characteristics drawn from the literature is then used to describe the organisation in terms of data quality, and metrics are established to quantify each characteristic. A risk analysis determines which areas are the most important to improve. After improvement, the same metrics can be used to evaluate whether the improvements were successful.
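As an illustration of the metric-and-risk step, the following is a minimal sketch, not taken from the thesis: one data quality characteristic (completeness) is quantified as a metric, and candidate improvement areas are ranked by a simple risk score. All names, field choices, and the risk formula are illustrative assumptions.

```python
# Illustrative sketch: quantify completeness and rank improvement areas.
# The records, fields, and risk formula are hypothetical examples.

def completeness(records, required_fields):
    """Fraction of records in which every required field is present."""
    if not records:
        return 1.0
    complete = sum(
        1 for r in records
        if all(r.get(f) is not None for f in required_fields)
    )
    return complete / len(records)

def prioritise(areas):
    """Order areas by a simple risk score: (1 - metric) * business impact."""
    return sorted(
        areas,
        key=lambda a: (1 - a["score"]) * a["impact"],
        reverse=True,
    )

records = [
    {"app": "A", "downloads": 120, "country": "NL"},
    {"app": "B", "downloads": None, "country": "US"},
]
score = completeness(records, ["app", "downloads", "country"])  # 0.5

areas = [
    {"name": "completeness", "score": score, "impact": 3},
    {"name": "timeliness", "score": 0.9, "impact": 5},
]
print([a["name"] for a in prioritise(areas)])
```

Re-running the same metrics after an improvement gives the before/after comparison the method calls for.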
Distimo is an innovative application store analytics company aiming to solve the challenges created by a widely fragmented application store marketplace filled with equally fragmented information and statistics. As Distimo’s products are highly data-driven, data quality is essential. The method is applied to Distimo as a case study.
The proposed method provides a way to determine the current state of data quality, to decide what to improve, and to evaluate whether the improvements achieve the desired outcome. The case study resulted in an in-depth analysis of Distimo, which in turn yielded a number of data quality improvements that are currently in production and have measurably improved data quality.
Because of the generic nature of its input data, the proposed method is applicable to any organisation looking to improve data quality. The iterative improvement process allows for fine-grained control of changes to organisational processes and systems.

Thursday, June 21st, 2012 | Author:

On 21 June 2012, Jasper Kuperus defended his MSc thesis “Catching Criminals by Chance: A probabilistic Approach to Named Entity Recognition using Targeted Feedback”. The MSc project was supervised by me, Dolf Trieschnigg, Mena Badieh Habib and Cor Veenman from the Netherlands Forensic Institute (NFI).
“Catching Criminals by Chance: A probabilistic Approach to Named Entity Recognition using Targeted Feedback” [download]
In forensics, large amounts of unstructured data have to be analyzed in order to find evidence or to detect risks: for example, the contents of a personal computer or of USB data carriers belonging to a suspect. Automatic processing of these large amounts of unstructured data, using techniques like Information Extraction, is inevitable. Named Entity Recognition (NER) is an important first step in Information Extraction and still a difficult task.
A main challenge in NER is the ambiguity among the extracted named entities. Most approaches take a hard decision on which class a named entity belongs to or which boundary fits an entity. However, there is often significant ambiguity in this choice, so hard decisions introduce errors. Instead of making such a choice, all possible alternatives can be preserved, each with the probability that it is the correct one. Extracting and handling entities in this probabilistic way is called Probabilistic Named Entity Recognition (PNER).
Combining the fields of Probabilistic Databases and Information Extraction results in a new field of research. This project explores the problem of Probabilistic NER. Although Probabilistic NER avoids hard decisions when ambiguity is involved, it does not by itself resolve that ambiguity. A way of resolving it is to use user feedback to let the probabilities converge to the real-world situation, called Targeted Feedback. The main goal of this project is to improve NER results by using PNER, preventing ambiguity-related extraction errors and using Targeted Feedback to reduce ambiguity.
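The two ideas above can be sketched in a toy example. This is an illustrative sketch only, not the thesis implementation: an ambiguous mention keeps all candidate (class, boundary) readings with confidences, the most uncertain alternative is selected as the question to ask the user, and the answer collapses or renormalises the distribution.

```python
# Toy sketch of PNER with Targeted Feedback (illustrative, hypothetical data):
# keep all candidate readings of a mention, ask the user about the most
# uncertain one, and update the probabilities with the answer.

candidates = {
    # mention "Jordan": alternative (class, boundary) readings
    ("PERSON", "Jordan"): 0.5,
    ("LOCATION", "Jordan"): 0.4,
    ("ORGANISATION", "Air Jordan"): 0.1,
}

def most_uncertain(cands):
    """Pick the alternative whose confidence is closest to 0.5,
    i.e. the question expected to be most informative."""
    return min(cands, key=lambda k: abs(cands[k] - 0.5))

def apply_feedback(cands, alternative, is_correct):
    """Confirmed: that alternative gets probability 1, the rest 0.
    Rejected: it gets 0 and the remaining mass is renormalised."""
    if is_correct:
        return {k: (1.0 if k == alternative else 0.0) for k in cands}
    updated = dict(cands)
    updated[alternative] = 0.0
    total = sum(updated.values())
    return {k: (v / total if total else v) for k, v in updated.items()}

question = most_uncertain(candidates)            # ("PERSON", "Jordan")
resolved = apply_feedback(candidates, question, True)
```

The question-selection strategy is one of the design choices the thesis studies: both the order in which questions are posed and whether the strategy learns from earlier answers affect performance.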
This research project shows that Recall values of the PNER results are significantly higher than for regular NER, with improvements of over 29%. Using Targeted Feedback, both Precision and Recall approach 100% after full user feedback. For Targeted Feedback, both the order in which questions are posed and whether a strategy attempts to learn from the answers of the user provide performance gains. Although PNER shows potential, this research project provides insufficient evidence on whether PNER is better than regular NER.