Archive for the Category » COMMIT «

Thursday, June 02nd, 2016 | Author:

Today, a PhD student of mine, Mohammad S. Khelghati, defended his thesis.
Deep Web Content Monitoring [Download]
In this thesis, we investigate the path towards a focused web harvesting approach that can automatically and efficiently query websites, navigate through results, download data, store it, and track data changes over time. Such an approach also enables users to access a complete collection of data relevant to their topics of interest and to monitor it over time. To realize such a harvester, we address the following obstacles: finding methods that achieve the best coverage when harvesting data for a topic; reducing the cost of harvesting a website, in terms of the number of submitted requests, by estimating its actual size; monitoring data changes over time in web data repositories; and combining our harvesting experience with studies from the literature into a general design and development framework for a web harvester. It is also important to know how to configure harvesters so that they can be applied to different websites, domains and settings. These steps further improve the data coverage and monitoring functionality of web harvesters and can help users such as journalists, business analysts, organizations and governments to reach the data they need without requiring extensive software and hardware facilities. With this thesis, we hope to have contributed to the goal of focused web harvesting and monitoring topics over time.
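The core loop of such a focused harvester (submit a query, page through the results, store the records, and revisit them later to detect changes) can be illustrated with a minimal Python sketch. The endpoint, parameters, and result format below are hypothetical and serve only to make the idea concrete; they are not taken from the thesis.

```python
import hashlib
import json
import time

import requests

SEARCH_URL = "https://example.com/search"   # hypothetical deep-web search endpoint


def harvest(topic_queries, max_pages=10, delay=1.0):
    """Query the site for each topic keyword, page through the results,
    and return a snapshot: record id -> content hash."""
    snapshot = {}
    for query in topic_queries:
        for page in range(1, max_pages + 1):
            resp = requests.get(SEARCH_URL, params={"q": query, "page": page}, timeout=30)
            resp.raise_for_status()
            results = resp.json().get("results", [])
            if not results:                    # no more result pages for this query
                break
            for record in results:
                body = json.dumps(record, sort_keys=True)
                snapshot[record["id"]] = hashlib.sha256(body.encode()).hexdigest()
            time.sleep(delay)                  # keep the request rate polite
    return snapshot


def changed_records(old_snapshot, new_snapshot):
    """Ids whose content changed between two harvest runs (the monitoring step)."""
    return {rid for rid, h in new_snapshot.items() if old_snapshot.get(rid) != h}
```

The research questions of the thesis sit on top of such a loop: which queries give the best topical coverage, how many requests are really needed given the estimated size of the site, and how often a record has to be revisited to track its changes.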

Category: COMMIT, Web harvesting | Comments off
Thursday, January 28th, 2016 | Author:

My PhD student Mohammad Khelghati released his web harvesting software, called HarvestED.

Thursday, January 14th, 2016 | Author:

Today I gave a presentation at the Data Science Northeast Netherlands Meetup about
Managing uncertainty in data: the key to effective management of data quality problems [slides (PDF)]

Business analytics and data science are significantly impaired by a wide variety of ‘data handling’ issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers around data semantics and data quality. We have developed a generic method which is based on modeling such problems as uncertainty *in* the data. A recently conceived kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or “Uncertain Database”. Together, they allow one, for example, to postpone the resolution of data problems and to assess their influence on analytical results. We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with the ambiguity of natural language and many other problems encountered when using unstructured data.
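A toy example of the idea, in Python rather than in a UDBMS: instead of forcing a choice when two sources disagree, an attribute keeps all alternatives with probabilities, and queries propagate those probabilities. The records and probabilities below are made up for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UncertainValue:
    """An attribute with mutually exclusive alternative values and their probabilities."""
    alternatives: Dict[str, float]      # value -> probability, summing to at most 1.0


# Two records whose 'city' could not be resolved during integration; rather than
# cleaning eagerly, both readings are kept as uncertainty in the data.
customers = [
    {"name": "J. Jansen",   "city": UncertainValue({"Enschede": 0.7, "Hengelo": 0.3})},
    {"name": "P. de Vries", "city": UncertainValue({"Enschede": 1.0})},
]


def probability_of(records: List[dict], attr: str, value: str) -> Dict[str, float]:
    """Per record, the probability that `attr` equals `value`
    (a toy stand-in for a probabilistic selection query)."""
    return {r["name"]: r[attr].alternatives.get(value, 0.0) for r in records}


print(probability_of(customers, "city", "Enschede"))
# {'J. Jansen': 0.7, 'P. de Vries': 1.0}
```

This is what makes it possible to postpone the resolution of a data problem: the analysis runs over all alternatives, and the probabilities show how much the unresolved problem influences the result.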

Wednesday, October 14th, 2015 | Author:

Dolf Trieschnigg and I got some subsidy to valorize some of the research results of the COMMIT/ TimeTrails, PayDIBI, and FedSS projects. The company involved is Mydatafactory.
SmartCOPI: Smart Consolidation of Product Information
[download public version of project proposal]
Maintaining the quality of detailed product data, ranging from data about required raw materials to detailed specifications of tools and spare parts, is of vital importance in many industries. Ordering or using wrong spare parts (based on wrong or incomplete information) may result in significant production loss or even impact health and safety. The web provides a wealth of information on products in various formats and detail levels, targeted at a variety of audiences. Semi-automatically locating, extracting and consolidating this information would be a “killer app” for enriching and improving product data quality, with a significant impact on production cost and quality. The industry partner Mydatafactory, new to COMMIT/, is interested in the web harvesting and data cleansing technologies developed in the COMMIT/ projects P1/Infiniti and P19/TimeTrails, both for this potential and for improving Mydatafactory’s own data cleansing services. The ICT science questions behind data cleansing and web harvesting are how noise can be detected and reduced in discrete structured data, and how human cognitive skills in information navigation and extraction can be mimicked. Research results on these questions may benefit a wide range of applications from various domains, such as fraud detection and forensics, creating a common operational picture, and safety in food and pharmaceuticals.
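To make the consolidation step concrete, here is a small Python sketch that merges spare-part specifications extracted from several hypothetical web sources by majority vote per attribute, flagging conflicts instead of guessing. The part numbers and attributes are invented, and the actual COMMIT/ technologies are considerably more sophisticated.

```python
from collections import Counter, defaultdict
from typing import Dict, List

# Spare-part specifications extracted from three hypothetical web sources.
extracted_specs: List[Dict[str, str]] = [
    {"part_no": "A-1042", "material": "steel", "thread": "M8"},
    {"part_no": "A-1042", "material": "steel", "thread": "M10"},
    {"part_no": "A-1042", "material": "stee1", "thread": "M12"},   # OCR noise + conflict
]


def consolidate(specs: List[Dict[str, str]]) -> Dict[str, str]:
    """Consolidate per attribute by majority vote; attributes without a clear
    majority are flagged for human review instead of being guessed."""
    by_attr: Dict[str, Counter] = defaultdict(Counter)
    for spec in specs:
        for attr, value in spec.items():
            by_attr[attr][value] += 1
    result = {}
    for attr, counts in by_attr.items():
        (top_value, top_count), = counts.most_common(1)
        result[attr] = top_value if top_count > len(specs) / 2 else "(conflict: needs review)"
    return result


print(consolidate(extracted_specs))
# {'part_no': 'A-1042', 'material': 'steel', 'thread': '(conflict: needs review)'}
```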

Wednesday, February 25th, 2015 | Author:

Today I gave a presentation at the SIKS Smart Auditing workshop at Tilburg University.

Wednesday, April 16th, 2014 | Author:

Today I’m going to give a presentation about my fraud detection research for the SCS chair.

Information Combination and Enrichment for Data-Driven Fraud Detection

Abstract
Governmental organizations responsible for keeping certain types of fraud under control often use data-driven methods both for immediate detection of fraud and for fraud risk analysis aimed at more effectively targeting inspections. A blind spot in such methods is that the source data often represents a ‘paper reality’: fraudsters will attempt to disguise themselves in the data they supply, painting a world in which they do nothing wrong. This blind spot can be counteracted by enriching the data with traces and indicators from more ‘real-world’ sources such as social media and the internet. One of the crucial data management problems in accomplishing this enrichment is how to capture and handle uncertainty in the data. The presentation will start with a real-world example, which is also used as the starting point for a problem generalization in terms of information combination and enrichment (ICE). We then present the ICE technology we have developed and a few more applications in which it has been or is intended to be applied. In terms of the 3 V’s of big data (volume, velocity, and variety), this presentation focuses on the third V: variety.
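As a toy illustration of the enrichment idea (not the actual ICE technology), the sketch below combines a declared ‘paper reality’ figure with indicators extracted from web sources, each carried with the confidence of its extraction rather than as a hard fact; all names, weights, and thresholds are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Indicator:
    """A signal extracted from a 'real-world' source (e.g. a social media page),
    kept together with the confidence of the extraction."""
    description: str
    confidence: float           # 0.0 .. 1.0


def risk_score(declared_income: float, indicators: List[Indicator]) -> float:
    """Toy risk score: weight each contradicting indicator by its confidence;
    a low declared income amplifies the contradiction."""
    score = sum(ind.confidence for ind in indicators)
    if declared_income < 15_000:
        score *= 1.5
    return score


found = [
    Indicator("advertises luxury car rental on social media", 0.8),
    Indicator("company website lists 12 employees, none declared", 0.6),
]
print(round(risk_score(12_000, found), 2))   # 2.1
```

The point of carrying confidences through the computation is exactly the uncertainty handling mentioned above: a weak, possibly wrong indicator should raise the risk score less than a well-established one.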

Date: Wednesday, April 16th, 2014
Room: ZI 2042
Time: 12:30-13:30 hrs

Monday, December 23rd, 2013 | Author:

Andreas Wombacher and I got some subsidy to valorize some of the research results of the COMMIT/ TimeTrails project. The companies involved are Arcadis and Nspyre. The functionality of the proof-of-concept product can be summarized as follows:

  • A back-end system for collecting, managing and summarizing information from external sources, which includes the novel pre-aggregation technology from COMMIT/TimeTrails (a rough sketch of the pre-aggregation idea is given after this list)
  • A visualization component providing a unique view of aggregated information in a map-based application (Geographical Information System). It is geared towards supporting online decision making by providing interactive visualizations of the huge amounts of available information.
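
A rough sketch of what such pre-aggregation can look like (grid size, data, and summarized measures are hypothetical; the COMMIT/TimeTrails technology itself is more advanced): point observations are summarized per map cell, so that the visualization component can fetch small per-cell summaries instead of huge numbers of raw points.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def pre_aggregate(points: List[Tuple[float, float, float]],
                  cell_deg: float = 0.1) -> Dict[Cell, Dict[str, float]]:
    """Summarize (lat, lon, value) observations into fixed grid cells so a map
    front-end can retrieve per-cell counts and averages instead of raw points."""
    cells: Dict[Cell, List[float]] = defaultdict(list)
    for lat, lon, value in points:
        cell = (int(lat // cell_deg), int(lon // cell_deg))
        cells[cell].append(value)
    return {cell: {"count": len(vals), "avg": sum(vals) / len(vals)}
            for cell, vals in cells.items()}


observations = [(52.22, 6.89, 3.1), (52.23, 6.88, 2.7), (51.99, 5.66, 4.0)]
print(pre_aggregate(observations))   # two cells: one with 2 points, one with 1
```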

Besides the proof-of-concept product, we will be organizing and executing a few pilot projects with customers of Arcadis and Nspyre, developing product training material, and conducting several dissemination activities.

Tuesday, October 01st, 2013 | Author:

I was interviewed for Unit4’s company magazine E-Novation4U:
“Big data … Big brothergevoel of juist kans voor de accountant?” (“Big data … a Big Brother feeling, or rather an opportunity for the accountant?”)

Wednesday, August 28th, 2013 | Author:

My PhD student, Victor de Graaff, has a poster paper at ACM SIGSPATIAL 2013.
Point of interest to region of interest conversion [details]
Victor de Graaff, Rolf A. de By, Maurice van Keulen, and Jan Flokstra
The paper will be presented at ACM SIGSPATIAL GIS 2013, 5-8 November 2013, Orlando, Florida, USA.

Wednesday, June 26th, 2013 | Author:

ACM TechNews picked up the UT homepage news item Gauging the Risk of Fraud From Social Media on Henry Been’s master’s project “Finding you on the Internet”.