Following New Scientist, WebWereld now also features an article about my identity extraction work with Fox-IT: “Politiesoftware filtert slim identiteiten uit digibewijs” (in Dutch; roughly, “Police software smartly filters identities from digital evidence”).
The popular science magazine New Scientist features a small article on one of my “Crime Science” endeavors with Hans Henseler and Jop Hofsté from the company Fox-IT: “Fast digital forensics sniff out accomplices” (which also appeared in Mafia Today). It is based on the MSc project work of Jop Hofsté, which will be demonstrated at ICAIL 2013.
On 20 December 2012, Jasper Stoop defended his MSc thesis on process mining for fraud detection in the procurement process. The MSc project was carried out at KPMG.
“Process Mining and Fraud Detection: A case study on the theoretical and practical value of using process mining for the detection of fraudulent behavior in the procurement process”[download]
This thesis presents the results of a six-month research period on process mining and fraud detection. It aims to answer the research question of how process mining can be utilized in fraud detection and what the benefits of doing so are. Based on a literature study, it discusses the theory and application of process mining and its various aspects and techniques. Using both a literature study and an interview with a domain expert, the concepts of fraud and fraud detection are discussed. These results are combined with an analysis of existing case studies on the application of process mining and fraud detection to construct an initial setup of two case studies, in which process mining is applied to detect possible fraudulent behavior in the procurement process. Based on the experiences and results of these case studies, the 1+5+1 methodology is presented as a first step towards operationalizing principles with advice on how process mining techniques can be used in practice when trying to detect fraud. This thesis presents three conclusions: (1) process mining is a valuable addition to fraud detection, (2) using the 1+5+1 concept it was possible to detect indicators of possibly fraudulent behavior, and (3) the practical use of process mining for fraud detection is diminished by the poor performance of the current tools. The techniques and tools that do not suffer from performance issues are an addition to, rather than a replacement for, regular data analysis techniques: they provide new, quicker, or more easily obtainable insights into the process and possible fraudulent behavior.
On 7 December 2012, Paul Stapersma defended his MSc thesis “Efficient Query Evaluation on Probabilistic XML Data”. The MSc project was supervised by me, Maarten Fokkinga and Jan Flokstra. The thesis is the result of more than two years of cooperation between Paul and me to build a probabilistic XML database system on top of a relational one: MayBMS.
“Efficient Query Evaluation on Probabilistic XML Data”[download]
In many application scenarios, reliability and accuracy of data are of great importance. Data is often uncertain or inconsistent because the exact state of represented real world objects is unknown. A number of uncertain data models have emerged to cope with imperfect data in order to guarantee a level of reliability and accuracy. These models include probabilistic XML (P-XML) –an uncertain semi-structured data model– and U-Rel –an uncertain table-structured data model. U-Rel is used by MayBMS, an uncertain relational database management system (URDBMS) that provides scalable query evaluation. In contrast to U-Rel, there does not exist an efficient query evaluation mechanism for P-XML.
In this thesis, we approach this problem by instructing MayBMS to cope with P-XML in order to evaluate XPath queries on P-XML data as SQL queries on uncertain relational data. This approach entails two aspects: (1) a data mapping from P-XML to U-Rel that ensures that the same information is represented by database instances of both data structures, and (2) a query mapping from XPath to SQL that ensures that the same question is specified in both query languages.
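To make the two mappings a little more concrete, here is a minimal sketch in Python with SQLite. It is not the mapping from the thesis and it does not use MayBMS or its SQL syntax. The table names `node` and `world`, the column layout, the sample document, and the function `children` are all assumptions made for illustration. Each XML node becomes a row that optionally carries a condition (`var = val`), and a separate table stores the probability of each condition. A child-step XPath query such as `/person/phone` then becomes a self-join. The sketch only looks at the condition on the selected node itself and ignores conditions on ancestors, which a real mapping would have to combine.

```python
import sqlite3

# Hypothetical relational encoding of a tiny probabilistic XML document:
# a person with a certain name and two mutually exclusive phone numbers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE world (var TEXT, val INTEGER, p REAL);
CREATE TABLE node (
    id INTEGER PRIMARY KEY, parent INTEGER, tag TEXT, text TEXT,
    var TEXT, val INTEGER  -- condition under which the node exists (NULL = certain)
);
""")
conn.executemany("INSERT INTO world VALUES (?, ?, ?)",
                 [("x", 1, 0.7), ("x", 2, 0.3)])
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?, ?)", [
    (1, None, "person", None, None, None),
    (2, 1, "name", "John", None, None),
    (3, 1, "phone", "1111", "x", 1),
    (4, 1, "phone", "2222", "x", 2),
])

def children(conn, parent_tag, child_tag):
    """Evaluate the XPath child step /parent_tag/child_tag as a SQL self-join.

    Returns (text, probability) pairs; nodes without a condition get 1.0.
    """
    return conn.execute("""
        SELECT c.text, COALESCE(w.p, 1.0)
        FROM node AS p
        JOIN node AS c ON c.parent = p.id
        LEFT JOIN world AS w ON w.var = c.var AND w.val = c.val
        WHERE p.tag = ? AND c.tag = ?
        ORDER BY c.id
    """, (parent_tag, child_tag)).fetchall()

phones = children(conn, "person", "phone")
names = children(conn, "person", "name")
```

Here `phones` holds one row per alternative phone number together with its probability, while `names` holds the single certain name with probability 1.0.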
We present a specification of a P-XML to U-Rel data mapping and a corresponding XPath to SQL mapping. Additionally, we present two designs of this specification. The first design constructs a data mapping in such a way that the corresponding query mapping is a traditional XPath to SQL mapping. The second design differs from the first in that a component of the data mapping is evaluated as part of the query evaluation process. This offers the advantage that the data mapping is more efficient. Additionally, the second design allows for a number of optimizations that affect the performance of the query evaluation process. However, this process is burdened with the extra task of evaluating the data mapping component.
An extensive experimental evaluation on synthetically generated data sets and real-world data sets shows that our implementation of the second design is more efficient in most scenarios. Not only is the P-XML data mapping executed more efficiently, the query evaluation performance is also improved in most scenarios.
An MSc student of mine, Jasper Kuperus, was nominated for the ENIAC thesis award for his thesis “Catching criminals by chance” on named entity extraction in digital forensics. Unfortunately, he didn’t win.
On 1 November 2012, Jop Hofsté defended his MSc thesis “Scalable identity extraction and ranking in Tracks Inspector”. The MSc project was carried out at Fox-IT.
“Scalable identity extraction and ranking in Tracks Inspector”[download]
The digital forensic world deals with a growing amount of data that has to be processed. In general, investigators do not have the time to manually analyze all the digital evidence to get a good picture of the suspect. Most investigations involve multiple evidence units per case. This research addresses the extraction and resolution of identities from evidence data. Investigators are supported in their investigations by presenting the identities involved to them. These identities are extracted from multiple heterogeneous sources like system accounts, emails, documents, address books and communication items. Identity resolution is used to merge identities at case level when multiple evidence units are involved.
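The abstract does not say how identities are merged, so the following is only a sketch of one simple way to do case-level identity resolution. It assumes that two extracted identities belong together whenever they share a normalized identifier, such as an email address or phone number. The record layout, the matching rule, and the function name `resolve_identities` are all hypothetical and not taken from Tracks Inspector.

```python
from collections import defaultdict

def resolve_identities(records):
    """Group records that share at least one identifier (transitively).

    `records` is a list of dicts with an "identifiers" list. Returns a sorted
    list of groups, each group being a sorted list of record indices.
    Uses a small union-find structure; matching is exact after lowercasing.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # normalized identifier -> index of first record using it
    for idx, rec in enumerate(records):
        for ident in rec["identifiers"]:
            key = ident.strip().lower()
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())

# Hypothetical identities extracted from four evidence units in one case.
records = [
    {"source": "laptop", "identifiers": ["J.Doe@example.com"]},
    {"source": "phone", "identifiers": ["j.doe@example.com", "+31600000000"]},
    {"source": "usb", "identifiers": ["alice@example.com"]},
    {"source": "tablet", "identifiers": ["+31600000000"]},
]
groups = resolve_identities(records)
```

In this example the laptop, phone and tablet records end up in one group because they are linked through a shared email address and a shared phone number, while the USB record stays on its own. A real tool would need fuzzier matching than exact string equality.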
The functionality for extracting, resolving and ranking identities is implemented in the forensic tool Tracks Inspector and tested on five datasets. The results are compared with two other forensic products, Clearwell and Trident, on the extent to which they support identity functionality. Tracks Inspector delivers very promising results compared to these products: its top 10 identities contain as many or more of the relevant identities than those of Clearwell and Trident. Tracks Inspector also delivers high accuracy: in the tests, its precision is better than Clearwell’s and its recall is approximately equal.
The contribution of this research is to show a method for the extraction and ranking of identities in Tracks Inspector. In the digital forensic world this is a fairly new approach, because no other software products support this kind of functionality. Investigations can now start by exploring the most relevant identities in a case. The nodes which are involved in an identity can be quickly recognized. This means that the evidence data can be filtered at an early stage.
On 9 August 2012, Rudo Denneman defended his MSc thesis on a requirements analysis for business intelligence of CRM processes of municipalities. The MSc project was carried out at Exxellence.
“Management information requirements for customer relationship management in municipalities”[download]
This research project looks into the management information requirements of municipalities in the Netherlands, related to their customer relationship program. Information requirements engineering methodologies for data warehouses are reviewed and a method is proposed based on its perceived suitability for the municipality context. The methodology used, by Winter and Strauch, matches information requirements elicitation with analyses of the data sources to get an overview of requirements and whether they are attainable. The results are a list of management information requirements, representation requirements, and advice to Exxellence Group on how they can meet this demand.
The resulting list of management information requirements seems to indicate that the management of client contact centres would like to see more management information than what is currently prescribed by the Antwoord© concept, on which they have based their management information needs for the most part. The list was sent back to municipalities to allow them to comment and rate the information needs on their usefulness. Also, the COPC standard on which the Antwoord© indicators are based and the Antwoord© indicators themselves were compared to the results. The results seem to cover almost all of the COPC metrics except for several process areas that are not as relevant in the municipality context. Also, potentially interesting additions to the results that could be made from the COPC standard have been identified. The indicators from the Antwoord© concept score relatively high in the ranking of information needs and are a solid basis for measurements.
Overall, the information needs voiced by municipalities are on an operational level to measure performance of departments and individual employees over time. To satisfy the information needs, Exxellence Group will have to combine data from several back-office source systems along with information from other sources such as customer satisfaction surveys. These sources will have to be identified per municipality due to the large variance in the types of back-office systems that are used in different municipalities. A data warehouse schema should be created that matches the information needs. The sources of information used to fill the data warehouse can then be identified per municipality.
In addition, municipalities will have to assess their processes and the training level of their personnel to see whether they are able to correctly capture all the information required to satisfy the information needs.
On 22 June 2012, Mike Niblett defended his MSc thesis “A method to obtain sustained data quality at Distimo”. The MSc project was carried out at Distimo, a mobile app analytics company.
“A method to obtain sustained data quality at Distimo”[download]
This thesis attempts to answer the following research question: “How can we determine and improve data quality?”
A method is proposed to systematically analyse the demands and current state of data quality within an organisation. The mission statement and information systems architecture are used to characterise the organisation. A list of data quality characteristics based on literature is used to express the organisation in terms of data quality. Metrics are established to quantify the data quality characteristics. A risk analysis determines which are the most important areas to improve upon. After improvement, the metrics can be used to evaluate the success of the improvements.
Distimo is an innovative application store analytics company aiming to solve the challenges created by a widely fragmented application store marketplace filled with equally fragmented information and statistics. As Distimo’s products are highly data-driven, data quality is very important. The method is applied to Distimo as a case study.
The proposed method provides a way to determine the current state of data quality, to determine what to improve, and to evaluate whether the improvements provide the desired outcome. The case study of Distimo resulted in an in-depth analysis of Distimo, which in turn yielded a number of data quality improvements that at this very moment are in production and have improved data quality.
Because of the generic nature of the input data, the proposed method is applicable to any organisation looking to improve data quality. The iterative improvement process allows for fine-grained control of changes to organisational processes and systems.
On 21 June 2012, Jasper Kuperus defended his MSc thesis “Catching Criminals by Chance: A probabilistic Approach to Named Entity Recognition using Targeted Feedback”. The MSc project was supervised by me, Dolf Trieschnigg, Mena Badieh Habib and Cor Veenman from the Netherlands Forensic Institute (NFI).
“Catching Criminals by Chance: A probabilistic Approach to Named Entity Recognition using Targeted Feedback”[download]
In forensics, large amounts of unstructured data have to be analyzed in order to find evidence or to detect risks, for example the contents of a personal computer or USB data carriers belonging to a suspect. Automatic processing of these large amounts of unstructured data, using techniques like Information Extraction, is inevitable. Named Entity Recognition (NER) is an important first step in Information Extraction and still a difficult task.
A main challenge in NER is the ambiguity among the extracted named entities. Most approaches take a hard decision on which named entities belong to which class or which boundary fits an entity. However, there is often a significant amount of ambiguity when making this choice, so these hard decisions result in errors. Instead of making such a choice, all possible alternatives can be preserved, each with a confidence expressing the probability that it is the correct choice. Extracting and handling entities in such a probabilistic way is called Probabilistic Named Entity Recognition (PNER).
Combining the fields of Probabilistic Databases and Information Extraction results in a new field of research. This research project explores the problem of Probabilistic NER. Although Probabilistic NER does not make hard decisions when ambiguity is involved, it also does not yet resolve ambiguity. A way of resolving this ambiguity is to use user feedback to let the probabilities converge to the real-world situation, called Targeted Feedback. The main goal of this project is to improve NER results by using PNER, preventing ambiguity-related extraction errors and using Targeted Feedback to reduce ambiguity.
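As a small illustration of the idea, the sketch below keeps all candidate labels for one text span with their probabilities and updates them after a single yes/no answer from the user. The abstract does not describe how the thesis updates probabilities, so the update rule here (plain conditioning and renormalization) is an assumption, as are the label names, the example numbers, and the function name `apply_feedback`.

```python
def apply_feedback(alternatives, label, is_correct):
    """Condition a distribution over candidate labels on one user answer.

    `alternatives` maps label -> probability. If the user confirms `label`,
    it gets probability 1. If the user rejects it, it gets probability 0 and
    the remaining probabilities are renormalized to sum to 1.
    """
    if is_correct:
        return {k: (1.0 if k == label else 0.0) for k in alternatives}
    remaining = {k: p for k, p in alternatives.items() if k != label}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no alternatives left after feedback")
    updated = {k: p / total for k, p in remaining.items()}
    updated[label] = 0.0
    return updated

# Hypothetical PNER output for the span "Paris": no hard decision is taken.
paris = {"LOCATION": 0.5, "PERSON": 0.3, "NONE": 0.2}

# The user answers "no" to the question "Is 'Paris' a location here?"
after_no = apply_feedback(paris, "LOCATION", is_correct=False)

# Alternatively, the user answers "yes" to the same question.
after_yes = apply_feedback(paris, "LOCATION", is_correct=True)
```

After a "no", the probability mass of the rejected label is redistributed over the remaining alternatives in proportion to their original probabilities. After a "yes", the ambiguity for this span is resolved completely.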
This research project shows that Recall values of the PNER results are significantly higher than for regular NER, with improvements of over 29%. Using Targeted Feedback, both Precision and Recall approach 100% after full user feedback. For Targeted Feedback, both the order in which questions are posed and whether a strategy attempts to learn from the answers of the user provide performance gains. Although PNER shows potential, this research project provides insufficient evidence on whether PNER is better than regular NER.
On 18 January 2012, Sjoerd van der Spoel defended his MSc thesis “Outcome and variable prediction for discrete processes: A framework for finding answers to business questions using (process) data”. The MSc project was supervised by me and Chintan Amrit.
“Outcome and variable prediction for discrete processes: A framework for finding answers to business questions using (process) data”[download]
The research described in this paper is aimed at solving planning problems associated with a new hospital declaration methodology called DOT. With this methodology, which will become mandatory starting January 1st, 2012, hospitals will no longer be able to tell in advance how much they will receive for the care they provide. A related problem is that hospitals do not know when delivered care becomes declarable. Topicus Fincare wants to find a solution to both these problems.
These problems, and more generally the problem of answering business questions that involve predicting process outcomes and variables, are what this research aims to solve. The approach chosen is to model the business process as a graph, and to predict the path through that graph, as well as to use the path to predict the variables of interest. For the hospital, the nodes in the graph represent care activities, and the variables to predict are the care product, which determines the value of the provided care, and the duration of care.
A literature study has found data mining and shortest path algorithms in combination with a naive graph elicitation technique to be the best way of accomplishing these two goals. Specifically, Random Forests was found to be the most accurate technique for predicting path-variable relations and for predicting the final step of a process. The Floyd-Warshall shortest path algorithm was found to be the best technique for predicting the path between two nodes in the process graph.
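For readers unfamiliar with it, the following is a minimal sketch of the Floyd-Warshall all-pairs shortest path algorithm with path reconstruction, applied to a toy process graph. The activity names and edge weights are invented for illustration; the abstract does not say how edge weights are derived from the process data, so treating them as plain costs is an assumption.

```python
import math

def floyd_warshall(nodes, edges):
    """All-pairs shortest paths on a directed graph.

    `edges` maps (u, v) -> non-negative weight. Returns (dist, nxt), where
    dist[u][v] is the cheapest cost from u to v and nxt[u][v] is the first
    hop on such a path (None if v is unreachable from u).
    """
    dist = {u: {v: (0.0 if u == v else math.inf) for v in nodes} for u in nodes}
    nxt = {u: {v: None for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        dist[u][v] = w
        nxt[u][v] = v
    for k in nodes:            # allow k as an intermediate node
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt

def predict_path(nxt, start, goal):
    """Reconstruct the path from start to goal, or [] if none exists."""
    if start == goal:
        return [start]
    if nxt[start][goal] is None:
        return []
    path = [start]
    while start != goal:
        start = nxt[start][goal]
        path.append(start)
    return path

# Hypothetical care activities and transition costs.
nodes = ["intake", "scan", "surgery", "discharge"]
edges = {
    ("intake", "scan"): 1,
    ("scan", "surgery"): 2,
    ("intake", "surgery"): 5,
    ("surgery", "discharge"): 1,
    ("scan", "discharge"): 6,
}
dist, nxt = floyd_warshall(nodes, edges)
predicted = predict_path(nxt, "intake", "discharge")
```

In this toy graph the predicted path from intake to discharge goes through scan and surgery, because that route is cheaper than the direct intake-to-surgery edge or the scan-to-discharge edge.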
To test these findings, a number of experiments were performed for the hospital case. These experiments show that Random Forests and the Floyd-Warshall algorithm are indeed the most accurate techniques in the test. Using Random Forests, the care product for a set of performed activities can be predicted with on average 50% accuracy, with lows of 30% and highs of 70%. Using Floyd-Warshall, the consequent set of steps can be predicted with 45% accuracy on average, with lows of 25% and highs of 100%.
From the experiment with the hospital data, a set of processing steps for producing an answer to a business question was derived. The steps are transforming the business question, analyzing and transforming the data, and then, depending on the business question, either classifier training and variable prediction or process elicitation and path prediction. The final step is to analyze the result to see if it adequately answers the question. That these processing steps actually work was validated using a dataset from Topicus’ bug tracking software. In conclusion, the approach presented predicts the total cash flow to be expected from the provided care with an average error between 6 and 17 percent. The time at which the provided care becomes declarable cannot be accurately predicted.
