Today I gave a presentation at the Data Science Northeast Netherlands Meetup about
Managing uncertainty in data: the key to effective management of data quality problems [slides (PDF)]
Business analytics and data science are significantly impaired by a wide variety of ‘data handling’ issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems lies in data semantics and data quality. We have developed a generic method based on modeling such problems as uncertainty *in* the data. A recently conceived kind of DBMS, the UDBMS or “Uncertain Database”, can store, manage, and query large volumes of uncertain data. Together, they allow one to, for example, postpone the resolution of data problems and assess their influence on analytical results. We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with the ambiguity of natural language and many other problems encountered when using unstructured data.
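To give a flavour of the idea (this is a minimal sketch, not the actual UDBMS): instead of forcing an ambiguous field to a single value at ingestion time, a field can store weighted alternatives, so queries can report answers together with their confidence and resolution can be postponed. All names and numbers below are illustrative.

```python
# Minimal sketch of uncertainty *in* the data: a field holds weighted
# alternatives instead of one forced value (illustrative only, not the UDBMS).
from dataclasses import dataclass

@dataclass
class UncertainValue:
    # mapping from candidate value to its probability (probabilities sum to <= 1)
    alternatives: dict

    def most_likely(self):
        # the alternative with the highest probability
        return max(self.alternatives, key=self.alternatives.get)

    def prob(self, value):
        # probability of a specific alternative; 0.0 if it is not a candidate
        return self.alternatives.get(value, 0.0)

# An ambiguous country field left unresolved during ingestion:
country = UncertainValue({"Netherlands": 0.7, "Belgium": 0.3})

# A query can return the answer together with its confidence instead of
# silently committing to one interpretation.
print(country.most_likely(), country.prob("Netherlands"))
```

A real uncertain database generalizes this to whole tables and keeps the probabilities consistent through joins and aggregations.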
Dolf Trieschnigg and I received a subsidy to valorize some of the research results of the COMMIT/ TimeTrails, PayDIBI, and FedSS projects. The company involved is Mydatafactory.
SmartCOPI: Smart Consolidation of Product Information
[download public version of project proposal]
Maintaining the quality of detailed product data, ranging from data about required raw materials to detailed specifications of tools and spare parts, is of vital importance in many industries. Ordering or using wrong spare parts (based on wrong or incomplete information) may result in significant production loss or even impact health and safety. The web provides a wealth of product information in various formats and levels of detail, targeted at a variety of audiences. Semi-automatically locating, extracting, and consolidating this information would be a “killer app” for enriching and improving product data quality, with a significant impact on production cost and quality. The industry partner Mydatafactory, new to COMMIT/, is interested in the web harvesting and data cleansing technologies developed in the COMMIT/ projects P1/Infiniti and P19/TimeTrails, both for this potential and for improving Mydatafactory’s data cleansing services. The ICT science questions behind data cleansing and web harvesting are how noise can be detected and reduced in discrete structured data, and how human cognitive skills in information navigation and extraction can be mimicked. Research results on these questions may benefit a wide range of applications from various domains, such as fraud detection and forensics, creating a common operational picture, and safety in food and pharmaceuticals.
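One small ingredient of such data cleansing can be sketched as follows (a hedged illustration only; the records and the threshold are invented, and the actual COMMIT/ technology is far more involved): flagging near-duplicate spare-part descriptions by string similarity, so that inconsistent entries for the same part can be consolidated.

```python
# Toy example of noise detection in discrete structured data: flag pairs of
# spare-part descriptions that are suspiciously similar (likely duplicates).
from difflib import SequenceMatcher

parts = [
    "Bearing, ball, 6204-2RS",
    "Ball bearing 6204 2RS",
    "Gasket, flange, DN50",
]

def similarity(a, b):
    # ratio() is in [0, 1]; 1.0 means the lower-cased strings are identical
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Flag pairs above an (assumed) similarity threshold of 0.6
suspects = [(a, b) for i, a in enumerate(parts)
            for b in parts[i + 1:] if similarity(a, b) > 0.6]
```

Here the two bearing records are flagged as a likely duplicate pair while the gasket record is not; a production pipeline would of course use domain-specific normalization on top of raw string similarity.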
Today I gave a presentation at the SIKS Smart Auditing workshop at Tilburg University.
Today I’m going to give a presentation about my fraud detection research for the SCS chair.
Information Combination and Enrichment for Data-Driven Fraud Detection
Governmental organizations responsible for keeping certain types of fraud under control often use data-driven methods, both for immediate detection of fraud and for fraud risk analysis aimed at more effectively targeting inspections. A blind spot in such methods is that the source data often represents a ‘paper reality’: fraudsters will attempt to disguise themselves in the data they supply, painting a world in which they do nothing wrong. This blind spot can be counteracted by enriching the data with traces and indicators from more ‘real-world’ sources such as social media and the internet. One of the crucial data management problems in accomplishing this enrichment is how to capture and handle uncertainty in the data. The presentation starts with a real-world example, which is also used as the starting point for a problem generalization in terms of information combination and enrichment (ICE). We then present the ICE technology we have developed and a few more applications in which it has been or will be applied. In terms of the three V’s of big data (volume, velocity, and variety), this presentation focuses on the third V: variety.
Date: Wednesday, April 16th, 2014
Room: ZI 2042
Time: 12:30-13:30 hrs
Andreas Wombacher and I received a subsidy to valorize some of the research results of the COMMIT/ TimeTrails project. The companies involved are Arcadis and Nspyre. The functionality of the proof-of-concept product can be summarized as:
- A back-end system for collecting, managing and summarizing information from external sources which includes the novel pre-aggregation technology from COMMIT/TimeTrails
- A visualization component providing a unique view of aggregated information in a map-based application (Geographical Information System). It is geared towards supporting online decision making by providing interactive visualizations of the huge amounts of available information.
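The pre-aggregation idea behind the back-end can be sketched roughly as follows (all details assumed for illustration; the actual TimeTrails technology is more sophisticated): readings are summed into a coarse spatial grid once at ingestion time, so an interactive map view can be answered from the small summary instead of scanning all raw points.

```python
# Simplified sketch of pre-aggregation for map-based queries: aggregate once
# per grid cell at ingestion, then answer interactive queries from the summary.
from collections import defaultdict

def cell(lon, lat, size=1.0):
    # snap a coordinate to the lower-left corner of its grid cell
    return (int(lon // size), int(lat // size))

# (longitude, latitude, measured value) -- invented sample readings
readings = [(6.57, 53.22, 4.0), (6.58, 53.21, 6.0), (5.12, 52.09, 3.0)]

# Pre-aggregation pass: count and sum per cell, done once up front
summary = defaultdict(lambda: [0, 0.0])   # cell -> [count, total]
for lon, lat, value in readings:
    c = cell(lon, lat)
    summary[c][0] += 1
    summary[c][1] += value

# Interactive query: average value in one cell, answered from the summary
count, total = summary[cell(6.5, 53.2)]
print(total / count)   # two readings fall in cell (6, 53)
```

The payoff is that the cost of a map query depends on the number of visible cells, not on the number of underlying readings.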
Besides the proof-of-concept product, we will organize and execute a few pilot projects with customers of Arcadis and Nspyre, develop product training material, and conduct several dissemination activities.
I was interviewed for E-Novation4U, the company magazine of Unit4.
“Big data … a Big Brother feeling, or rather an opportunity for the accountant?”
On 20 June 2013, Ben Companjen defended his MSc thesis on matching author names on publications to researcher profiles on the scale of The Netherlands. He carried out this research at DANS where he applied and validated his techniques on a coupling between the NARCIS scholarly database and the researcher profile database VSOI.
“Probabilistically Matching Author Names to Researchers” [download]
Publications are the most important form of scientific communication, but science also consists of researchers, research projects, and organisations. The goal of NARCIS (National Academic Research and Collaboration Information System) is to provide a complete and concise view of current science in the Netherlands.
Connecting publications in retrospect to the researchers, projects, and organisations that created them is hard, because author identifiers are rarely used in publications and researcher profiles. There is too much data to identify all researchers in NARCIS manually, so an automatic method is needed to help complete the view of science in the Netherlands.
In this thesis, the problems that limit automatic connection of author names in publications to researchers are explored, and a method to automatically connect publications and researchers is developed and evaluated.
In an experiment using two test sets, matching on the author names alone found the correct researcher for around 80% of the author names. However, none of the correct matches received the highest confidence among the returned matches: over 90% of them were ranked second by confidence, and the remaining correct matches were ranked lower. Using probabilistic results allows working with the correct results even when they are not the best match. Many names that should not match were nevertheless included in the matches; the matching algorithm can be optimised to assign confidence to matches differently.
Including a matching function that compares publication titles with researchers’ project titles did not improve the results, but better results are expected when more context elements are used to assign confidences.
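The core idea of keeping scored candidates rather than a single answer can be illustrated with a toy matcher (this is not Companjen's actual algorithm; the names and the similarity measure are assumptions for the example):

```python
# Toy probabilistic name matching: every candidate researcher keeps a
# confidence score, so downstream steps can work with the full ranked list
# instead of committing to only the top match.
from difflib import SequenceMatcher

researchers = ["B. Companjen", "B. Companje", "A. Compaan"]

def confidence(author, researcher):
    # crude stand-in for a real name-matching confidence function
    return SequenceMatcher(None, author.lower(), researcher.lower()).ratio()

author = "Companjen, B."
ranked = sorted(((confidence(author, r), r) for r in researchers), reverse=True)
# Even when the correct researcher does not rank first, keeping all scored
# matches lets later steps (or extra context) still recover it.
```

The point of the probabilistic representation is exactly this: a correct match ranked second is not lost, the way it would be if only the top-confidence match were kept.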
The news feed of the UT homepage features an item about the MSc project of Henry Been.
Gauging the risk of fraud from social media.
On 18 June 2013, Henry Been defended his MSc thesis on an attempt to find a person’s Twitter account given only a name, address, telephone number, and email address, for the purpose of risk analysis in fraud detection. It turned out that he could determine a set of a few tens to hundreds of candidate Twitter accounts among which the correct one was indeed present in almost all cases. Henry also paid much attention to the ethical aspects surrounding this research. A news item on the UT homepage made it onto ACM TechNews.
“Finding you on the Internet: Entity resolution on Twitter accounts and real world people” [download]
Over the last years, online social network sites (SNS) have become very popular. There are many scenarios in which it might prove valuable to know which accounts on an SNS belong to a person. For example, the Dutch social investigative authority is interested in extracting characteristics of a person from Twitter to aid in their risk analysis for fraud detection.
In this thesis, a novel approach to finding a person’s Twitter account using only known real-world information is developed and tested. The approach operates in three steps. First, a set of heuristic queries using known information is executed to find possibly matching accounts. Second, all these accounts are crawled and information about the account, and thus its owner, is extracted; currently, the name, URLs, description, language of the tweets, and geotags are extracted. Third, all possible matches are examined and the correct account is determined.
This approach differs from earlier research in that it does not work with extracted and cleaned datasets, but directly with the internet. The prototype has to cope with all the “noise” on the internet, such as slang, typos, and incomplete profiles. Another important part of the approach is the repetition of the three steps: it was expected that repeatedly discovering candidates, enriching them, and eliminating false positives would increase the chance that the correct account “surfaces” over time.
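The three-step loop can be sketched schematically as follows (everything here is illustrative: no real Twitter API is called, and the helper names, toy index, and scoring rule are invented for the example):

```python
# Schematic sketch of the query -> crawl -> score loop, with repetition.

def heuristic_queries(person):
    # Step 1: build search queries from known real-world information
    return [person["name"], f'{person["name"]} {person["city"]}']

def crawl(query, fake_index):
    # Step 2 stand-in: look up candidate accounts in a toy in-memory index
    return fake_index.get(query, [])

def score(person, account):
    # Step 3 stand-in: crude match score on the extracted name feature
    return 1.0 if account["name"] == person["name"] else 0.0

person = {"name": "Henry Been", "city": "Enschede"}
fake_index = {
    "Henry Been": [{"name": "Henry Been"}, {"name": "Henry B."}],
    "Henry Been Enschede": [{"name": "Henry Been"}],
}

candidates = {}
for _ in range(2):          # repetition: re-run discovery and scoring
    for q in heuristic_queries(person):
        for acc in crawl(q, fake_index):
            candidates[acc["name"]] = score(person, acc)

best = max(candidates, key=candidates.get)
```

In the real prototype the crawl step hits the live web and the scoring step uses many extracted features, but the control flow, including the repetition that lets the correct account surface, has this shape.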
During development of the prototype, ethical concerns surrounding both the experiments and the application in practice were considered and judged morally justifiable.
Validation of the prototype showed that the first step performs very well. In an experiment with 12 subjects with a Twitter account, an inclusion of 92% was achieved, meaning that for 92% of the subjects the correct Twitter account was found and thus included as a possible match. A number of variations of this experiment were run, which showed that inclusion of both first and last name is necessary to achieve this high inclusion; leaving out physical addresses, e-mail addresses, and telephone numbers does not influence inclusion.
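For clarity, the inclusion measure used above is simply the fraction of subjects whose correct account appears anywhere in the candidate set (the concrete count below is illustrative, chosen to match the reported 92% for 12 subjects):

```python
# Inclusion metric: fraction of subjects whose correct account is present
# among the candidates at all (ranking does not matter for this measure).
subjects = 12
found = 11            # subjects whose correct account was in the candidate set
inclusion = found / subjects
print(f"inclusion: {inclusion:.0%}")
```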
Contrary to the first step, the results of the third step were less accurate: the currently extracted features cannot be used to predict whether a possible match is actually the correct Twitter account. However, there is much ongoing research into feature extraction from tweets and Twitter accounts in general, so it is expected that enhanced feature extraction will in time make it possible to also identify the correct match within the candidate set.