Following New Scientist, WebWereld also features an article about my identity extraction work with Fox-IT: “Politiesoftware filtert slim identiteiten uit digibewijs” (in Dutch; roughly, “Police software smartly filters identities out of digital evidence”).
The popular science magazine New Scientist features a small article on one of my “Crime Science” endeavors with Hans Henseler and Jop Hofsté from the company Fox-IT: Fast digital forensics sniff out accomplices (also appeared in Mafia Today). It is based on the MSc project work of Jop Hofsté, which will be demonstrated at ICAIL 2013.
Tomorrow, 28 Feb 2013, a PhD student of mine, Victor de Graaff, is going to give a presentation on how to estimate the boundaries of objects for which you only have a point location and other public data such as OpenStreetMap [Announcement].
Point of Interest to Region of Interest Conversion
Date/Time: Thursday, February 28, 2013 – 13:30 to 14:30; Room: 0-142
GPS trajectories from a mobile device, such as a smartphone, indirectly contain a vast amount of information on the interests of the owner of the device. Collections of GPS trajectories even provide insight into the popularity of locations, and the time spent at those locations. To obtain this information, the visited places on such a trajectory need to be recognized. However, the location information on a point of interest (POI) in a database is normally limited to an address and a GPS coordinate, rather than a geometry describing its boundaries. To create a match with a GPS trajectory, a two-dimensional shape representing this place, a region of interest (ROI), is needed. In the absence of expensive and hard-to-obtain detailed spatial data like cadastral data, we need to estimate this ROI. In this research project, we bridge this gap by presenting several approaches to estimate the size and shape of the ROI, and validate these estimations against the cadastral data of the city of Enschede, The Netherlands.
On 7 December 2012, Paul Stapersma defended his MSc thesis “Efficient Query Evaluation on Probabilistic XML Data”. The MSc project was supervised by me, Maarten Fokkinga and Jan Flokstra. The thesis is the result of more than two years of cooperation between Paul and me to build a probabilistic XML database system on top of a relational one: MayBMS.
“Efficient Query Evaluation on Probabilistic XML Data”[download]
In many application scenarios, reliability and accuracy of data are of great importance. Data is often uncertain or inconsistent because the exact state of represented real world objects is unknown. A number of uncertain data models have emerged to cope with imperfect data in order to guarantee a level of reliability and accuracy. These models include probabilistic XML (P-XML) –an uncertain semi-structured data model– and U-Rel –an uncertain table-structured data model. U-Rel is used by MayBMS, an uncertain relational database management system (URDBMS) that provides scalable query evaluation. In contrast to U-Rel, there does not exist an efficient query evaluation mechanism for P-XML.
In this thesis, we approach this problem by instructing MayBMS to cope with P-XML in order to evaluate XPath queries on P-XML data as SQL queries on uncertain relational data. This approach entails two aspects: (1) a data mapping from P-XML to U-Rel that ensures that the same information is represented by database instances of both data structures, and (2) a query mapping from XPath to SQL that ensures that the same question is specified in both query languages.
We present a specification of a P-XML to U-Rel data mapping and a corresponding XPath to SQL mapping. Additionally, we present two designs of this specification. The first design constructs a data mapping in such a way that the corresponding query mapping is a traditional XPath to SQL mapping. The second design differs from the first in the sense that a component of the data mapping is evaluated as part of the query evaluation process. This offers the advantage that the data mapping is more efficient. Additionally, the second design allows for a number of optimizations that affect the performance of the query evaluation process. However, this process is burdened with the extra task of evaluating the data mapping component.
An extensive experimental evaluation on synthetically generated data sets and real-world data sets shows that our implementation of the second design is more efficient in most scenarios. Not only is the P-XML data mapping executed more efficiently, the query evaluation performance is also improved in most scenarios.
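To give a feel for what a "traditional XPath to SQL mapping" involves, here is a minimal sketch in Python. It is not the mapping from the thesis: it ignores uncertainty entirely, uses plain SQLite instead of MayBMS and U-Rel, and supports only absolute child-axis paths such as /a/b/c. The table layout (`node`) and the function names `shred` and `xpath_to_sql` are assumptions made up for this illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(xml_text, conn):
    """Store every XML element as one row (id, parent, tag, text).

    Hypothetical layout for illustration only; the thesis maps P-XML to U-Rel.
    """
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, "
        "tag TEXT, text TEXT)"
    )
    counter = 0

    def visit(elem, parent_id):
        nonlocal counter
        counter += 1
        node_id = counter  # ids are assigned in document order
        conn.execute(
            "INSERT INTO node VALUES (?, ?, ?, ?)",
            (node_id, parent_id, elem.tag, (elem.text or "").strip()),
        )
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)


def xpath_to_sql(path):
    """Translate an absolute child-axis path (/a/b/c) into a SQL self-join."""
    steps = path.strip("/").split("/")
    tables = [f"node AS n{i}" for i in range(len(steps))]
    conditions = ["n0.parent IS NULL"]  # the first step must match the root
    params = []
    for i, tag in enumerate(steps):
        conditions.append(f"n{i}.tag = ?")
        params.append(tag)
        if i > 0:
            # each step is a child of the node matched by the previous step
            conditions.append(f"n{i}.parent = n{i - 1}.id")
    last = len(steps) - 1
    sql = (
        f"SELECT n{last}.text FROM {', '.join(tables)} "
        f"WHERE {' AND '.join(conditions)} ORDER BY n{last}.id"
    )
    return sql, params


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    shred("<persons><person><name>Ann</name></person></persons>", conn)
    sql, params = xpath_to_sql("/persons/person/name")
    print(sql)
    print(conn.execute(sql, params).fetchall())
```

Each path step becomes one alias of the same table, joined to the previous step through the parent column. A probabilistic variant would additionally have to carry the conditions under which each node exists, which is the part this sketch leaves out.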
Two of my PhD students, Mohammad Khelghati and Victor de Graaff, are presenting at the Dutch-Belgian DataBase Day (DBDBD): Mohammad on “Size Estimation of Non-Cooperative Data Collections” and Victor on “Semantic Enrichment of GPS Trajectories”.
FedSS (Federated Security Shield) is an ITEA2/CATRENE project in which I participate with a work package on data cleaning. The Dutch part of the funding has been approved. Hopefully the other partners also get the funding approved in their countries. When this happens, I will have a PostDoc position available for 2.5 years.
One of my PhD students, Mena Habib, has won the Best Student Paper Award at the 4th International Conference on Knowledge Discovery and Information Retrieval (KDIR 2012) in Barcelona, Spain, for our paper “Improving Toponym Disambiguation by Iteratively Enhancing Certainty of Extraction”.
One of my PhD students, Mohammad Khelghati, has a paper at iiWAS 2012.
Size Estimation of Non-Cooperative Data Collections
Mohammad Khelghati, Djoerd Hiemstra, and Maurice van Keulen
With the increasing amount of data in deep web sources (hidden from general search engines behind web forms), accessing this data has gained more attention. In the algorithms applied for this purpose, it is the knowledge of a data source's size that enables the algorithms to make accurate decisions about stopping the crawling or sampling processes, which can be very costly in some cases [14]. This interest in knowing the sizes of data sources is increased by the competition among businesses on the Web, in which data coverage is critical. In the context of quality assessment of search engines [7], search engine selection in federated search engines, and resource/collection selection in the distributed search field [19], this information is also helpful. In addition, it can give insight into some useful statistics for public sectors like governments. In any of these scenarios, when facing a non-cooperative collection that does not publish its information, the size has to be estimated [17]. In this paper, the approaches suggested for this purpose in the literature are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on modifications of the available techniques are introduced and evaluated. With one of the modifications, the estimations from other approaches could be improved by 35 to 65 percent.
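As a rough illustration of what estimating the size of a non-cooperative collection can look like, here is a minimal Python sketch of the classic capture-recapture (Lincoln-Petersen) estimator. It is a generic textbook technique, shown under the assumption that two independent random samples of document identifiers can be drawn from the source; it is not one of the four modified methods introduced in the paper, and the function name is made up for this example.

```python
def lincoln_petersen(sample_a, sample_b):
    """Estimate a collection's size from two independent samples of document ids.

    If the samples are random and independent, the fraction of sample_b that
    also occurs in sample_a approximates len(sample_a) / N, so
    N is approximately len(sample_a) * len(sample_b) / overlap.
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("samples do not overlap; the estimate is undefined")
    return len(a) * len(b) / overlap


if __name__ == "__main__":
    # Two hypothetical samples of 100 ids each that share 50 ids.
    print(lincoln_petersen(range(0, 100), range(50, 150)))
```

In practice, samples obtained through a query interface are rarely uniform, which is one reason why the literature proposes corrections to this basic estimator.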
The paper will be presented at the iiWAS 2012 conference, 3-5 Dec 2012, Bali, Indonesia [details]
One of my PhD students, Mena Badieh Habib, got a paper accepted at the Semantic Web and Information Extraction (SWAIE) workshop at the EKAW conference, on improving named entity extraction (NEE) on Twitter.
Unsupervised Improvement of Named Entity Extraction in Short Informal Context Using Disambiguation Clues
Mena Badieh Habib, Maurice van Keulen
Short context messages (like tweets and SMS’s) are a potentially rich source of continuously and instantly updated information. Shortness and informality of such messages are challenges for Natural Language Processing tasks. Most efforts done in this direction rely on machine learning techniques which are expensive in terms of data collection and training.
In this paper we present an unsupervised Semantic Web-driven approach to improve the extraction process by using clues from the disambiguation process. For extraction we used a simple Knowledge-Base matching technique combined with a clustering-based approach for disambiguation. Experimental results on a self-collected set of tweets (as an example of short context messages) show improvement in extraction results when using unsupervised feedback from the disambiguation process.
The paper will be presented at the SWAIE workshop co-located with EKAW 2012, 8-12 October 2012, Galway City, Ireland [details]
One of my PhD students, Mena Badieh Habib, got a paper accepted at the Knowledge Discovery and Information Retrieval (KDIR) conference, on improving named entity extraction (NEE) and disambiguation (NED) by treating them as processes that can reinforce each other.
Improving Toponym Disambiguation by Iteratively Enhancing Certainty of Extraction
Mena Badieh Habib, Maurice van Keulen
Named entity extraction (NEE) and disambiguation (NED) have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and semantic web. This paper addresses two problems with toponym extraction and disambiguation (as a representative example of named entities). First, almost no existing works examine the extraction and disambiguation interdependency. Second, existing disambiguation techniques mostly take as input extracted named entities without considering the uncertainty and imperfection of the extraction process.
It is the aim of this paper to investigate both avenues and to show that explicit handling of the uncertainty of annotation has much potential for making both extraction and disambiguation more robust. We conducted experiments with a set of holiday home descriptions with the aim to extract and disambiguate toponyms. We show that the extraction confidence probabilities are useful in enhancing the effectiveness of disambiguation. Reciprocally, retraining the extraction models with information automatically derived from the disambiguation results improves the extraction models. This mutual reinforcement is shown to have an effect even after several automatic iterations.
The paper will be presented at the KDIR conference, 4-7 October 2012, Barcelona, Spain [details]
