Today I gave a presentation at the Data Science Northeast Netherlands Meetup about
Managing uncertainty in data: the key to effective management of data quality problems [slides (PDF)]
Business analytics and data science are significantly impaired by a wide variety of ‘data handling’ issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers on data semantics and data quality. We have developed a generic method based on modeling such problems as uncertainty *in* the data. A new kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or “Uncertain Database”. Together, these allow one to, e.g., postpone the resolution of data problems and assess their influence on analytical results. We furthermore develop technology for data cleansing, web harvesting, and natural language processing that uses this method to deal with the ambiguity of natural language and many other problems encountered when using unstructured data.
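To make the idea of uncertainty *in* the data concrete, here is a minimal sketch (illustrative only, not the actual UDBMS API): an ambiguous field keeps all of its weighted alternatives instead of being resolved during cleansing, and a query returns a confidence rather than a hard answer.

```python
# Illustrative sketch, not the UDBMS implementation: a field whose true value
# is unknown is stored as a set of alternatives with probabilities.

from dataclasses import dataclass, field

@dataclass
class UncertainValue:
    # Maps each alternative value to the probability that it is the true one.
    alternatives: dict = field(default_factory=dict)

    def prob(self, predicate):
        # Probability that the (unknown) true value satisfies the predicate.
        return sum(p for v, p in self.alternatives.items() if predicate(v))

# A dirty record where the city could not be cleansed with certainty;
# resolution is postponed rather than forced:
city = UncertainValue({"Enschede": 0.7, "Eindhoven": 0.3})

# At query time we get a confidence on the result instead of a hard answer.
print(city.prob(lambda c: c.startswith("E")))  # both alternatives match: 1.0
print(city.prob(lambda c: c == "Enschede"))    # 0.7
```

This shows how postponing resolution lets one assess the influence of a data quality problem on an analytical result: the uncertainty propagates into the query answer instead of being silently decided away.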
Archive for the Category » Data-driven fraud detection «
Today I gave a presentation at the SIKS Smart Auditing workshop at Tilburg University.
Today I’m going to give a presentation about my fraud detection research for the SCS chair.
Information Combination and Enrichment for Data-Driven Fraud Detection
Governmental organizations responsible for keeping certain types of fraud under control often use data-driven methods, both for immediate detection of fraud and for fraud risk analysis aimed at more effectively targeting inspections. A blind spot in such methods is that the source data often represents a ‘paper reality’: fraudsters will attempt to disguise themselves in the data they supply, painting a world in which they do nothing wrong. This blind spot can be counteracted by enriching the data with traces and indicators from more ‘real-world’ sources such as social media and the internet. One of the crucial data management problems in accomplishing this enrichment is how to capture and handle uncertainty in the data. The presentation will start with a real-world example, which is also used as a starting point for a problem generalization in terms of information combination and enrichment (ICE). We then present the ICE technology we have developed and a few more applications in which it has been or is intended to be applied. In terms of the 3 V’s of big data — volume, velocity, and variety — this presentation focuses on the third V: variety.
Date: Wednesday, April 16th, 2014
Room: ZI 2042
Time: 12:30-13:30 hrs
I was interviewed for E-Novation4U, the company magazine of Unit4.
“Big data … Big brothergevoel of juist kans voor de accountant?” (Dutch for “Big data … a Big Brother feeling, or rather an opportunity for the accountant?”)
The news feed of the UT homepage features an item about the master’s project of Henry Been.
Gauging the risk of fraud from social media.
On 18 June 2013, Henry Been defended his MSc thesis on an attempt to find a person’s Twitter account given only a name, address, telephone number, and email address, for the purpose of risk analysis in fraud detection. It turned out that he could determine a set of a few tens to hundreds of candidate Twitter accounts among which the correct one was indeed present in almost all cases. Henry also paid much attention to the ethical aspects surrounding this research. A news item on the UT homepage made it onto ACM TechNews.
“Finding you on the Internet: Entity resolution on Twitter accounts and real world people” [download]
Over the last years, online social network sites (SNS) have become very popular. There are many scenarios in which it might prove valuable to know which accounts on an SNS belong to a person. For example, the Dutch social investigative authority is interested in extracting characteristics of a person from Twitter to aid in their risk analysis for fraud detection.
In this thesis, a novel approach to finding a person’s Twitter account using only known real-world information is developed and tested. The approach operates in three steps. First, a set of heuristic queries using known information is executed to find possibly matching accounts. Second, all these accounts are crawled and information about each account, and thus its owner, is extracted; currently, the name, URLs, description, language of the tweets, and geotags are extracted. Third, all possible matches are examined and the correct account is determined.
This approach differs from earlier research in that it does not work with extracted and cleaned datasets, but directly with the Internet. The prototype has to cope with all the “noise” on the Internet, such as slang, typos, incomplete profiles, etc. Another important part of the approach is the repetition of the three steps: it was expected that repeatedly discovering candidates, enriching them, and eliminating false positives would increase the chance that, over time, the correct account “surfaces”.
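The three steps above can be sketched as follows. This is a toy, self-contained illustration, not the thesis prototype: the data and the helper functions (`heuristic_queries`, `crawl_account`, `score_match`) are hypothetical stand-ins, and the repetition of the loop over a live, changing Internet is omitted.

```python
# Toy stand-in for Twitter: account handle -> profile information.
FAKE_TWITTER = {
    "@jdoe": {"name": "John Doe", "location": "Enschede"},
    "@john_d": {"name": "John D.", "location": "Utrecht"},
    "@unrelated": {"name": "Jane Roe", "location": "Enschede"},
}

def heuristic_queries(known):
    # Step 1: heuristic queries on known information (here: name fragments)
    # collect possibly matching accounts.
    return [a for a, f in FAKE_TWITTER.items()
            if known["last_name"].lower() in f["name"].lower()
            or known["first_name"].lower() in f["name"].lower()]

def crawl_account(account):
    # Step 2: crawl the candidate account and extract features about its owner.
    return FAKE_TWITTER[account]

def score_match(features, known):
    # Step 3: compare extracted features with the known real-world information.
    score = 0.0
    if known["first_name"].lower() in features["name"].lower():
        score += 0.5
    if known["last_name"].lower() in features["name"].lower():
        score += 0.3
    if features.get("location") == known.get("city"):
        score += 0.2
    return score

def find_twitter_account(known):
    candidates = {a: crawl_account(a) for a in heuristic_queries(known)}
    scored = {a: score_match(f, known) for a, f in candidates.items()}
    return max(scored, key=scored.get) if scored else None

print(find_twitter_account(
    {"first_name": "John", "last_name": "Doe", "city": "Enschede"}))  # @jdoe
```

The real prototype repeats these steps so that candidates accumulate features over time and false positives are gradually eliminated; the sketch shows only a single pass.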
During development of the prototype, ethical concerns surrounding both the experiments and the application in practice were considered and judged morally justifiable.
Validation of the prototype in an experiment showed that the first step performs very well. In an experiment with 12 subjects with a Twitter account, an inclusion of 92% was achieved, meaning that for 92% of the subjects the correct Twitter account was found and thus included as a possible match. A number of variations of this experiment were run, which showed that including both first and last name is necessary to achieve this high inclusion; leaving out physical addresses, e-mail addresses, and telephone numbers did not influence inclusion.
In contrast to those of the first step, the results of the third step were less accurate. The currently extracted features cannot be used to predict whether a possible match is actually the correct Twitter account. However, there is much ongoing research into feature extraction from tweets and Twitter accounts in general. It is therefore expected that enhancing feature extraction with new techniques will make it only a matter of time before correct matches can also be identified in the candidate set.
Two Master’s students of mine, Jasper Kuperus and Jop Hofsté, each have a paper at FORTAN 2013, co-located with EISIC 2013.
Increasing NER recall with minimal precision loss
Jasper Kuperus, Maurice van Keulen, and Cor Veenman
Named Entity Recognition (NER) is broadly used as a first step toward the interpretation of text documents. However, for many applications, such as forensic investigation, recall is currently inadequate, leading to loss of potentially important information. Entity class ambiguity cannot be resolved reliably due to the lack of context information or the exploitation thereof. Consequently, entity classification introduces too many errors, leading to severe omissions in answers to forensic queries.
We propose a technique based on multiple candidate labels effectively postponing decisions for entity classification to query time. Entity resolution exploits user feedback: a user is only asked for feedback on entities relevant to his/her query. Moreover, giving feedback can be stopped anytime when query results are considered good enough. We propose several interaction strategies that obtain increased recall with little loss in precision. [details]
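The core idea of the multiple-candidate-label technique can be sketched as follows. This is an illustrative toy, not the paper’s implementation: each recognized entity keeps *all* candidate class labels from the NER stage, ambiguous entities are retained in query answers (preserving recall), and user feedback is requested only for entities relevant to the query at hand.

```python
# Each recognized entity keeps the full set of candidate labels instead of
# forcing a single (possibly wrong) classification at extraction time.
entities = [
    {"text": "Jordan", "candidates": {"PERSON", "LOCATION"}},
    {"text": "Amsterdam", "candidates": {"LOCATION"}},
    {"text": "Philips", "candidates": {"PERSON", "ORGANIZATION"}},
]

def query(label, feedback=None):
    """Return entities that may carry `label`. `feedback` maps an entity's
    text to the true label a user supplied; only entities relevant to this
    query would ever be shown to the user for feedback."""
    feedback = feedback or {}
    results = []
    for e in entities:
        if label not in e["candidates"]:
            continue  # cannot be an answer, so no feedback is ever asked
        if len(e["candidates"]) == 1:
            results.append(e["text"])          # unambiguous: always an answer
        elif e["text"] not in feedback:
            results.append(e["text"])          # ambiguous: kept for recall
        elif feedback[e["text"]] == label:
            results.append(e["text"])          # confirmed by the user
    return results

print(query("LOCATION"))                        # ['Jordan', 'Amsterdam']
print(query("LOCATION", {"Jordan": "PERSON"}))  # ['Amsterdam']
```

Feedback can stop at any time: every answer not yet judged simply stays in the result set, so precision improves monotonically with each judgment while recall never drops below the no-feedback baseline.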
Digital-forensics based pattern recognition for discovering identities in electronic evidence
Hans Henseler, Jop Hofsté, and Maurice van Keulen
With the pervasiveness of computers and mobile devices, digital forensics becomes more important in law enforcement. Detectives increasingly depend on the scarce support of digital specialists, which impedes the efficiency of criminal investigations. This paper proposes an algorithm to extract, merge, and rank identities that are encountered in the electronic evidence during processing. Two experiments are described demonstrating that our approach can assist with the identification of frequently occurring identities so that investigators can prioritize the investigation of evidence units accordingly. [details]
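The extract/merge/rank pipeline can be illustrated with a small sketch. This is a hypothetical simplification, not the paper’s algorithm: identities are merged on a shared identifier (here, an email address) and ranked by how often they occur across evidence items.

```python
from collections import defaultdict

# Toy stand-in for identities extracted while processing evidence units.
evidence = [
    {"name": "J. Smith", "email": "jsmith@example.com"},
    {"name": "John Smith", "email": "jsmith@example.com"},
    {"name": "A. Jones", "email": "ajones@example.com"},
    {"name": "John Smith", "email": "jsmith@example.com"},
]

def merge_and_rank(items):
    # Merge: identities sharing an email address are treated as one person,
    # accumulating all name variants seen in the evidence.
    merged = defaultdict(lambda: {"names": set(), "count": 0})
    for item in items:
        ident = merged[item["email"]]
        ident["names"].add(item["name"])
        ident["count"] += 1
    # Rank by frequency so investigators can prioritize evidence units.
    return sorted(merged.items(), key=lambda kv: kv[1]["count"], reverse=True)

for email, ident in merge_and_rank(evidence):
    print(email, sorted(ident["names"]), ident["count"])
```

Here the frequently occurring identity surfaces at the top of the ranking with all its name variants merged, which is the behavior the paper’s experiments evaluate at realistic scale.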
Both papers will be presented at the FORTAN 2013 workshop, 12 August 2013, Uppsala, Sweden.
Following New Scientist and WebWereld, the UT homepage also features an article about my identity extraction work together with Fox IT: “Tracks Inspector brengt binnen paar uur netwerk van verdachte in kaart” (Dutch for “Tracks Inspector maps a suspect’s network within a few hours”).
Following New Scientist, WebWereld also features an article about my identity extraction work together with Fox IT: “Politiesoftware filtert slim identiteiten uit digibewijs” (Dutch for “Police software smartly filters identities from digital evidence”).