Archive for the Category » Student projects «
On 20 June 2013, Ben Companjen defended his MSc thesis on matching author names on publications to researcher profiles on the scale of the Netherlands. He carried out this research at DANS, where he applied and validated his techniques on a coupling between the NARCIS scholarly database and the researcher profile database VSOI.
“Probabilistically Matching Author Names to Researchers”[download]
Publications are the most important form of scientific communication, but science also consists of researchers, research projects and organisations. The goal of NARCIS (National Academic Research and Collaboration Information System) is to provide a complete and concise view of current science in the Netherlands.
Connecting publications in retrospect to the researchers, projects and organisations that created them is hard, because author identifiers are rarely used in publications and researcher profiles. There is too much data to identify all researchers in NARCIS manually, so an automatic method is needed to help complete the view of science in the Netherlands.
In this thesis the problems that limit automatic connection of author names in publications to researchers are explored and a method to automatically connect publications and researchers is developed and evaluated.
Using only the author names themselves, the method finds the correct researcher for around 80% of the author names in an experiment with two test sets. However, none of the correct matches were given the highest confidence of the returned matches. Over 90% of the correct matches were ranked second by confidence. Other correct matches were ranked lower; using probabilistic results allows working with the correct results even if they are not the best match. Many names that should not match were included in the matches. The matching algorithm can be optimised to assign confidence to matches differently.
Including a matching function that compares publication titles and researcher’s project titles did not improve the results, but better results are expected when more context elements are used to assign confidences.
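To make the idea of confidence-ranked name matching concrete, here is a minimal sketch. It is not the method from the thesis: the similarity measure (Python's standard `difflib.SequenceMatcher`), the `normalise` helper, the threshold and the example names are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a name and replace periods and commas by spaces."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def rank_candidates(author_name, researchers, threshold=0.5):
    """Return (researcher, confidence) pairs, highest confidence first.

    Confidence is a plain string-similarity ratio in [0, 1]; candidates
    below the (hypothetical) threshold are dropped.
    """
    target = normalise(author_name)
    scored = []
    for researcher in researchers:
        confidence = SequenceMatcher(None, target, normalise(researcher)).ratio()
        if confidence >= threshold:
            scored.append((researcher, confidence))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative names only; these are not taken from NARCIS or VSOI.
for researcher, confidence in rank_candidates(
    "J. Jansen", ["Jan Jansen", "J. Janssen", "Piet de Vries"]
):
    print(f"{researcher}: {confidence:.2f}")
```

Returning the whole ranked list instead of only the top candidate reflects the point made in the abstract: a correct match that is not ranked first can still be used downstream.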
The news feed of the UT homepage features an item on the master project of Henry Been.
Gauging the risk of fraud from social media.
On 18 June 2013, Henry Been defended his MSc thesis on an attempt to find a person’s Twitter account given only name, address, telephone number and email address, for the purpose of risk analysis for fraud detection. It turned out that he could determine a set of a few tens to hundreds of candidate Twitter accounts among which the correct one was indeed present in almost all cases. Henry also paid much attention to the ethical aspects surrounding this research. A news item on the UT homepage made it onto ACM TechNews.
“Finding you on the Internet: Entity resolution on Twitter accounts and real world people”[download]
Over the last years, online social network sites (SNS) have become very popular. There are many scenarios in which it might prove valuable to know which accounts on an SNS belong to a person. For example, the Dutch social investigative authority is interested in extracting characteristics of a person from Twitter to aid in their risk analysis for fraud detection.
In this thesis a novel approach to finding a person’s Twitter account using only known real-world information is developed and tested. The developed approach operates in three steps. First, a set of heuristic queries using known information is executed to find possibly matching accounts. Secondly, all these accounts are crawled and information about the account, and thus its owner, is extracted. Currently, name, URLs, description, language of the tweets and geo tags are extracted. Thirdly, all possible matches are examined and the correct account is determined.
This approach differs from earlier research in that it does not work with extracted and cleaned datasets, but directly with the Internet. The prototype has to cope with all the “noise” on the Internet, such as slang, typos and incomplete profiles. Another important part of the approach was repetition of the three steps. It was expected that repeatedly discovering candidates, enriching them and eliminating false positives would increase the chance that over time the correct account “surfaces.”
During development of the prototype ethical concerns surrounding both the experiments and the application in practice were considered and judged morally justifiable.
Validation of the prototype showed that the first step is executed very well. In an experiment with 12 subjects with a Twitter account, an inclusion of 92% was achieved. This means that for 92% of the subjects the correct Twitter account was found and thus included as a possible match. A number of variations of this experiment were run, which showed that inclusion of both first and last name is necessary to achieve this high inclusion. Leaving out physical addresses, e-mail addresses and telephone numbers does not influence inclusion.
Contrary to those of the first step, the results of the third step were less accurate. The currently extracted features cannot be used to predict if a possible match is actually the correct Twitter account or not. However, there is much ongoing research into feature extraction from tweets and Twitter accounts in general. It is therefore expected that enhancing feature extraction using new techniques will make it a matter of time before it is also possible to identify correct matches in the candidate set.
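The inclusion measure reported for the first step can be written down in a few lines. The sketch below is an illustration, not code from the thesis: the function name, the data layout and the account names are hypothetical. With 12 subjects, 11 correct accounts in the candidate sets is the count that rounds to the reported 92%.

```python
def inclusion(candidate_sets, true_accounts):
    """Fraction of subjects whose correct account is among their candidates.

    candidate_sets: dict mapping subject id -> set of candidate account names
    true_accounts:  dict mapping subject id -> the subject's real account name
    """
    found = sum(
        1
        for subject, account in true_accounts.items()
        if account in candidate_sets.get(subject, set())
    )
    return found / len(true_accounts)


# Made-up data: 12 subjects, the correct account is missed for subject 11.
true_accounts = {i: f"@user{i}" for i in range(12)}
candidate_sets = {i: {f"@user{i}", f"@other{i}"} for i in range(11)}
candidate_sets[11] = {"@someone_else"}

print(f"inclusion = {inclusion(candidate_sets, true_accounts):.0%}")
```

Inclusion only says that the correct account is somewhere in the candidate set; it says nothing about how large that set is or whether the correct account can be singled out, which is what the third step has to do.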
Two Master's students of mine, Jasper Kuperus and Jop Hofsté, each have a paper at FORTAN 2013, co-located with EISIC 2013.
Increasing NER recall with minimal precision loss
Jasper Kuperus, Maurice van Keulen, and Cor Veenman
Named Entity Recognition (NER) is broadly used as a first step toward the interpretation of text documents. However, for many applications, such as forensic investigation, recall is currently inadequate, leading to loss of potentially important information. Entity class ambiguity cannot be resolved reliably due to the lack of context information or the exploitation thereof. Consequently, entity classification introduces too many errors, leading to severe omissions in answers to forensic queries.
We propose a technique based on multiple candidate labels, effectively postponing decisions for entity classification to query time. Entity resolution exploits user feedback: a user is only asked for feedback on entities relevant to their query. Moreover, giving feedback can be stopped at any time when query results are considered good enough. We propose several interaction strategies that obtain increased recall with little loss in precision. [details]
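The idea of keeping several candidate labels and deciding only at query time can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the `Mention` class, the confidence values, the `min_confidence` parameter and the feedback function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    candidates: dict  # label -> confidence; several labels may be kept


def query(mentions, label, min_confidence=0.1):
    """Return every mention that might belong to the label.

    A low threshold favours recall: an ambiguous mention is returned
    for each of its plausible labels instead of only its best one.
    """
    return [m for m in mentions if m.candidates.get(label, 0.0) >= min_confidence]


def apply_feedback(mention, correct_label):
    """Resolve a mention once the user has confirmed its label."""
    mention.candidates = {correct_label: 1.0}


mentions = [
    Mention("Paris", {"LOCATION": 0.6, "PERSON": 0.4}),
    Mention("Fox", {"ORGANISATION": 0.7, "PERSON": 0.3}),
    Mention("Uppsala", {"LOCATION": 0.95}),
]

print([m.text for m in query(mentions, "PERSON")])
apply_feedback(mentions[0], "LOCATION")  # the user says "Paris" is a place
print([m.text for m in query(mentions, "PERSON")])
```

Feedback is only requested for mentions that a query actually returns, so the user can stop as soon as the result list looks good enough.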
Digital-forensics based pattern recognition for discovering identities in electronic evidence
Hans Henseler, Jop Hofsté, and Maurice van Keulen
With the pervasiveness of computers and mobile devices, digital forensics is becoming more important in law enforcement. Detectives increasingly depend on the scarce support of digital specialists, which impedes the efficiency of criminal investigations. This paper proposes an algorithm to extract, merge and rank identities that are encountered in the electronic evidence during processing. Two experiments are described demonstrating that our approach can assist with the identification of frequently occurring identities so that investigators can prioritize the investigation of evidence units accordingly. [details]
Both papers will be presented at the FORTAN 2013 workshop on 12 August 2013 in Uppsala, Sweden.
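As a rough illustration of the extract, merge and rank steps, the sketch below treats e-mail addresses as identities, merges them case-insensitively and ranks them by frequency. This is a deliberately simplified stand-in, not the algorithm from the paper: the regular expression, the merging rule and the sample texts are assumptions.

```python
import re
from collections import Counter

# Simplified e-mail pattern, good enough for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def rank_identities(evidence_units):
    """Extract e-mail addresses, merge case variants, rank by frequency."""
    counts = Counter()
    for text in evidence_units:
        for address in EMAIL.findall(text):  # extract
            counts[address.lower()] += 1     # merge by lower-casing
    return counts.most_common()              # rank, most frequent first


# Made-up evidence units.
units = [
    "Mail from Alice@example.org to bob@example.org",
    "alice@example.org wrote back",
    "no addresses here",
]
for identity, frequency in rank_identities(units):
    print(identity, frequency)
```

A ranking like this gives investigators a first indication of which identities occur most often, and therefore which evidence units to look at first.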
Following New Scientist and WebWereld, the homepage of the UT also features an article about my identity extraction work together with Fox-IT: “Tracks Inspector brengt binnen paar uur netwerk van verdachte in kaart” (Dutch; “Tracks Inspector maps a suspect’s network within a few hours”).
Following New Scientist, WebWereld also features an article about my identity extraction work together with Fox-IT: “Politiesoftware filtert slim identiteiten uit digibewijs” (Dutch; “Police software smartly filters identities out of digital evidence”).
The popular science magazine New Scientist features a small article on one of my “Crime Science” endeavors with Hans Henseler and Jop Hofsté from the company Fox-IT: Fast digital forensics sniff out accomplices (also appeared in Mafia Today). It is based on the MSc-project work of Jop Hofsté which will be demonstrated at ICAIL 2013.
My master's student Henry Been and I have had many interesting discussions with the ethical advisor of our faculty, Aimee van Wynsberghe. She used our research on “Finding you on the Internet” [details] as a case study. Together with us, she wrote a paper on practical guidelines for evaluating and incorporating ethics in research.
To use or not to use: guidelines for researchers using data from online social networking sites
Aimee van Wynsberghe, Henry Been and Maurice van Keulen
The paper was published in The Observatory, the community portal of RRI (Responsible Research and Innovation). [details]
On 20 December 2012, Jasper Stoop defended his MSc thesis on process mining for fraud detection in the procurement process. The MSc project was carried out at KPMG.
“Process Mining and Fraud Detection: A case study on the theoretical and practical value of using process mining for the detection of fraudulent behavior in the procurement process”[download]
This thesis presents the results of a six-month research period on process mining and fraud detection. It aims to answer the research question of how process mining can be utilized in fraud detection and what the benefits of using process mining for fraud detection are. Based on a literature study, it provides a discussion of the theory and application of process mining and its various aspects and techniques. Using both a literature study and an interview with a domain expert, the concepts of fraud and fraud detection are discussed. These results are combined with an analysis of existing case studies on the application of process mining and fraud detection to construct an initial setup of two case studies, in which process mining is applied to detect possible fraudulent behavior in the procurement process. Based on the experiences and results of these case studies, the 1+5+1 methodology is presented as a first step towards operationalizing principles with advice on how process mining techniques can be used in practice when trying to detect fraud. This thesis presents three conclusions: (1) process mining is a valuable addition to fraud detection, (2) using the 1+5+1 concept it was possible to detect indicators of possibly fraudulent behavior, and (3) the practical use of process mining for fraud detection is diminished by the poor performance of the current tools. The techniques and tools that do not suffer from performance issues are an addition to, rather than a replacement for, regular data analysis techniques, providing new, quicker, or more easily obtainable insights into the process and possible fraudulent behavior.