Following New Scientist, WebWereld also features an article about my identity-extraction work with Fox-IT: “Politiesoftware filtert slim identiteiten uit digibewijs” (Dutch; “Police software smartly filters identities from digital evidence”).
The popular science magazine New Scientist features a small article on one of my “Crime Science” endeavors with Hans Henseler and Jop Hofsté from the company Fox-IT: Fast digital forensics sniff out accomplices (also appeared in Mafia Today). It is based on the MSc project work of Jop Hofsté, which will be demonstrated at ICAIL 2013.
Tomorrow, 28 Feb 2013, a PhD student of mine, Victor de Graaff, is going to give a presentation on how to estimate the boundaries of objects for which you only have a point and other public data such as OpenStreetMap [Announcement].
Point of Interest to Region of Interest Conversion
Date/Time: Thursday, February 28, 2013 – 13:30 to 14:30; Room: 0-142
GPS trajectories from a mobile device, such as a smartphone, indirectly contain a vast amount of information on the interests of the owner of the device. Collections of GPS trajectories even provide insight into the popularity of locations, and the time spent at those locations. To obtain this information, the visited places on such a trajectory need to be recognized. However, the location information on a point of interest (POI) in a database is normally limited to an address and a GPS coordinate, rather than a geometry describing its boundaries. To create a match with a GPS trajectory, a two-dimensional shape representing this place, a region of interest (ROI), is needed. In the absence of expensive and hard-to-obtain detailed spatial data like cadastral data, we need to estimate this ROI. In this research project, we bridge this gap by presenting several approaches to estimate the size and shape of the ROI, and validate these estimations against the cadastral data of the city of Enschede, The Netherlands.
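The core idea of POI-to-ROI conversion can be sketched in a few lines: approximate each POI's region as a polygon around its coordinate and test GPS points against it. The per-category radii below are invented stand-ins for the size/shape estimation the talk is about, not values from the actual work:

```python
import math

# Hypothetical default ROI radii (metres) per POI category; the real
# approach estimates size and shape from public data such as OpenStreetMap.
DEFAULT_RADIUS_M = {"restaurant": 25.0, "supermarket": 60.0, "stadium": 150.0}

EARTH_RADIUS_M = 6_371_000.0

def poi_to_roi(lat, lon, category, n_vertices=16):
    """Approximate a POI's region of interest as a circular polygon
    (a list of (lat, lon) vertices) around its GPS coordinate."""
    r = DEFAULT_RADIUS_M.get(category, 30.0)
    # metres per degree of latitude/longitude at this latitude
    m_per_deg_lat = math.pi * EARTH_RADIUS_M / 180.0
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(lat))
    ring = []
    for i in range(n_vertices):
        a = 2 * math.pi * i / n_vertices
        ring.append((lat + r * math.sin(a) / m_per_deg_lat,
                     lon + r * math.cos(a) / m_per_deg_lon))
    return ring

def contains(roi, lat, lon):
    """Ray-casting point-in-polygon test: does a GPS point fall in the ROI?"""
    inside = False
    n = len(roi)
    for i in range(n):
        (y1, x1), (y2, x2) = roi[i], roi[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

roi = poi_to_roi(52.2215, 6.8937, "supermarket")  # a point in Enschede
print(contains(roi, 52.2216, 6.8937))  # a GPS fix ~11 m away: True
print(contains(roi, 52.2300, 6.8937))  # a GPS fix ~1 km away: False
```

Matching a whole trajectory then reduces to counting, per ROI, which GPS fixes fall inside it and for how long.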
Thales managed to find partial funding for the TEC4SE project (pronounced “tec-force”). The project is about developing an innovative control room for police and firefighters. I’m going to develop a Twitter-analysis component for this together with the company ENAI.
One of my PhD students, Mena Habib, has won the Best Student Paper Award at the 4th International Conference on Knowledge Discovery and Information Retrieval (KDIR 2012) in Barcelona, Spain, for our paper “Improving Toponym Disambiguation by Iteratively Enhancing Certainty of Extraction”.
One of my PhD students, Mena Badieh Habib, got a paper accepted at the Semantic Web and Information Extraction (SWAIE) workshop at the EKAW conference on improving named entity extraction (NEE) in Twitter.
Unsupervised Improvement of Named Entity Extraction in Short Informal Context Using Disambiguation Clues
Mena Badieh Habib, Maurice van Keulen
Short context messages (like tweets and SMS messages) are a potentially rich source of continuously and instantly updated information. The shortness and informality of such messages pose challenges for natural language processing tasks. Most efforts in this direction rely on machine learning techniques, which are expensive in terms of data collection and training.
In this paper we present an unsupervised Semantic Web-driven approach to improve the extraction process by using clues from the disambiguation process. For extraction we used a simple Knowledge-Base matching technique combined with a clustering-based approach for disambiguation. Experimental results on a self-collected set of tweets (as an example of short context messages) show improvement in extraction results when using unsupervised feedback from the disambiguation process.
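The interplay described in the abstract can be illustrated with a toy sketch: extraction by matching tweet n-grams against a knowledge base, then disambiguation clues filtering the candidates. The knowledge base, entities, and relatedness pairs below are all invented for illustration, not from the paper:

```python
# Toy knowledge base mapping surface forms to candidate entities
# (all names and entries invented for illustration).
KB = {
    "man utd": ["Manchester United F.C."],
    "united": ["Manchester United F.C.", "United Airlines"],
    "old trafford": ["Old Trafford (stadium)"],
}

# Hypothetical relatedness pairs, a stand-in for the clustering-based
# disambiguation used in the paper.
RELATED = {
    ("Manchester United F.C.", "Old Trafford (stadium)"),
}

def extract_candidates(tweet):
    """Match every word n-gram of the tweet against the KB (n = 1, 2)."""
    words = tweet.lower().split()
    cands = []
    for n in (1, 2):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in KB:
                cands.append((phrase, KB[phrase]))
    return cands

def disambiguate_and_filter(cands):
    """Keep the referent most coherent with the other mentions; drop
    mentions without support -- the 'clue' fed back to extraction."""
    result = {}
    for phrase, refs in cands:
        others = {r for p, rs in cands for r in rs if p != phrase}
        score = lambda r: sum((r, o) in RELATED or (o, r) in RELATED
                              for o in others)
        best = max(refs, key=score)
        if score(best) > 0 or len(refs) == 1:
            result[phrase] = best
    return result

tweet = "Watching united at old trafford tonight"
print(disambiguate_and_filter(extract_candidates(tweet)))
```

In the toy run, the ambiguous mention “united” is resolved to the football club because it coheres with “old trafford”; a mention with no coherent referent would be dropped, which is the unsupervised feedback from disambiguation to extraction.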
The paper will be presented at the SWAIE workshop co-located with EKAW 2012, 8-12 October 2012, Galway City, Ireland [details]
One of my PhD students, Mena Badieh Habib, got a paper accepted at the Knowledge Discovery and Information Retrieval (KDIR) conference on improving named entity extraction (NEE) and named entity disambiguation (NED) by treating them as processes that can reinforce each other.
Improving Toponym Disambiguation by Iteratively Enhancing Certainty of Extraction
Mena Badieh Habib, Maurice van Keulen
Named entity extraction (NEE) and disambiguation (NED) have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and the Semantic Web. This paper addresses two problems with toponym extraction and disambiguation (as a representative example of named entities). First, almost no existing work examines the interdependency of extraction and disambiguation. Second, existing disambiguation techniques mostly take extracted named entities as input without considering the uncertainty and imperfection of the extraction process.
It is the aim of this paper to investigate both avenues and to show that explicit handling of the uncertainty of annotation has much potential for making both extraction and disambiguation more robust. We conducted experiments with a set of holiday home descriptions with the aim of extracting and disambiguating toponyms. We show that the extraction confidence probabilities are useful in enhancing the effectiveness of disambiguation. Reciprocally, retraining the extraction models with information automatically derived from the disambiguation results improves the extraction models. This mutual reinforcement is shown to still have an effect after several automatic iterations.
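The mutual-reinforcement loop from the abstract can be sketched as follows. The gazetteer, confidence threshold, and update numbers are invented for illustration; the paper retrains real extraction models rather than nudging a score table:

```python
# Toy sketch of the reinforcement loop: extraction confidences feed
# disambiguation, and disambiguation outcomes feed back into extraction.
GAZETTEER = {"paris": ["Paris, France", "Paris, Texas"],
             "springfield": ["Springfield, IL", "Springfield, MA"]}

def extract(text, model):
    """Return toponym candidates with the model's confidence (default 0.5)."""
    return [(w, model.get(w, 0.5)) for w in text.lower().split()
            if w in GAZETTEER]

def disambiguate(candidates, context_country):
    """Pick the referent matching the document context, trusting only
    candidates whose extraction confidence clears a threshold."""
    resolved = {}
    for phrase, conf in candidates:
        if conf >= 0.4:
            refs = GAZETTEER[phrase]
            match = [r for r in refs if context_country in r]
            resolved[phrase] = (match or refs)[0]
    return resolved

def retrain(model, resolved, boost=0.2):
    """Feedback step: raise the confidence of phrases the disambiguator
    managed to resolve (a stand-in for retraining the extraction model)."""
    for phrase in resolved:
        model[phrase] = min(1.0, model.get(phrase, 0.5) + boost)
    return model

model = {"paris": 0.45}
text = "Flights from Paris and Springfield"
for _ in range(3):  # a few automatic iterations
    resolved = disambiguate(extract(text, model), context_country="France")
    model = retrain(model, resolved)
print(model)  # both toponyms end up extracted with full confidence
```

The point of the loop is the same as in the paper: each successfully disambiguated toponym is evidence that the extraction was correct, so the extractor becomes more confident on the next pass.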
The paper will be presented at the KDIR conference, 4-7 October 2012, Barcelona, Spain [details]
On 21 June 2012, Jasper Kuperus defended his MSc thesis “Catching Criminals by Chance: A Probabilistic Approach to Named Entity Recognition using Targeted Feedback”. The MSc project was supervised by me, Dolf Trieschnigg, Mena Badieh Habib, and Cor Veenman from the Netherlands Forensic Institute (NFI).
“Catching Criminals by Chance: A Probabilistic Approach to Named Entity Recognition using Targeted Feedback” [download]
In forensics, large amounts of unstructured data have to be analyzed to find evidence or to detect risks, for example the contents of a personal computer or of USB data carriers belonging to a suspect. Automatic processing of such large amounts of unstructured data, using techniques like information extraction, is inevitable. Named entity recognition (NER) is an important first step in information extraction, and still a difficult task.
A main challenge in NER is ambiguity among the extracted named entities. Most approaches take a hard decision on which class a named entity belongs to or which boundary fits an entity. However, there is often significant ambiguity in this choice, so hard decisions introduce errors. Instead of making such a choice, all possible alternatives can be preserved, each with a probability that it is the correct one. Extracting and handling entities in such a probabilistic way is called probabilistic named entity recognition (PNER).
Combining the fields of Probabilistic Databases and Information Extraction results in a new field of research. This research project explores the problem of Probabilistic NER. Although Probabilistic NER does not make hard decisions when ambiguity is involved, it also does not yet resolve ambiguity. A way of resolving this ambiguity is by using user feedback to let the probabilities converge to the real world situation, called Targeted Feedback. The main goal in this project is to improve NER results by using PNER, preventing ambiguity related extraction errors and using Targeted Feedback to reduce ambiguity.
This research project shows that recall values of the PNER results are significantly higher than for regular NER, with improvements of over 29%. Using Targeted Feedback, both precision and recall approach 100% after full user feedback. For Targeted Feedback, both the order in which questions are posed and whether a strategy attempts to learn from the user's answers affect the performance gains. Although PNER shows potential, this research project provides insufficient evidence on whether PNER is better than regular NER.
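The combination of PNER and Targeted Feedback described above can be sketched as follows. The span, its alternative readings, and the probabilities are invented for illustration:

```python
# Minimal sketch of probabilistic NER with Targeted Feedback: every
# alternative reading of an ambiguous span is kept with a probability,
# and user answers eliminate alternatives and renormalise the rest,
# instead of forcing an early hard decision.

def normalise(alts):
    total = sum(alts.values())
    return {a: p / total for a, p in alts.items()}

# Alternative (boundary, class) readings for one ambiguous span,
# with invented prior probabilities.
annotation = normalise({
    ("New York Times", "ORGANIZATION"): 0.5,
    ("New York", "LOCATION"):           0.3,
    ("Times", "ORGANIZATION"):          0.2,
})

def next_question(alts):
    """Targeted Feedback: ask about the alternative whose probability is
    closest to 0.5, where a yes/no answer is most informative."""
    return min(alts, key=lambda a: abs(alts[a] - 0.5))

def apply_feedback(alts, alternative, is_correct):
    """The user confirms or rejects one alternative; renormalise the rest."""
    if is_correct:
        return {alternative: 1.0}
    rest = {a: p for a, p in alts.items() if a != alternative}
    return normalise(rest)

q = next_question(annotation)
annotation = apply_feedback(annotation, q, is_correct=True)
print(annotation)  # one confirmed reading remains, with probability 1.0
```

Because no alternative is ever discarded without evidence, recall stays high, and each answered question moves the probabilities closer to the real-world situation, which is how precision and recall approach 100% under full feedback.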
One of my PhD students, Mena Badieh Habib, has given a talk on the Dutch-Belgian DataBase Day (DBDBD) about “Named Entity Extraction and Disambiguation from an Uncertainty Perspective”.
I wrote a position paper about a different approach towards the development of information extractors, which I call Sherlock Holmes-style after his famous quote “when you have eliminated the impossible, whatever remains, however improbable, must be the truth”. The idea is that we fundamentally treat annotations as uncertain. We even start from a “no knowledge”, i.e., “everything is possible” state, and then interactively add knowledge, applying it directly to the annotation state by removing impossible annotations and recalculating the probabilities of the remaining ones. For example, “Paris Hilton”, “Paris”, and “Hilton” can all be interpreted as a City, Hotel, or Person name. But adding knowledge like “If a phrase is interpreted as a Person Name, then its subphrases should not be interpreted as a City” makes the annotations <"Paris Hilton":Person Name> and <"Paris":City> mutually exclusive. Observe that initially all annotations were independent, and these two are now dependent. We argue in the paper that the main challenge in this approach lies in efficient storage and conditioning of probabilistic dependencies, because trivial approaches do not work.
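The Paris Hilton example can be sketched as a toy possible-worlds computation. The prior probabilities below are invented, and the brute-force enumeration is exactly the kind of trivial approach that does not scale, which is the challenge the paper raises:

```python
from itertools import product

# Each annotation maps to an invented prior probability of being present;
# initially the annotations are independent.
priors = {
    ("Paris Hilton", "Person"): 0.6,
    ("Paris", "City"):          0.5,
    ("Hilton", "Hotel"):        0.4,
}

def rule(world):
    """If 'Paris Hilton' is a Person, its subphrase 'Paris' is not a City."""
    return not (world[("Paris Hilton", "Person")] and world[("Paris", "City")])

anns = list(priors)
worlds = []
for bits in product([True, False], repeat=len(anns)):
    world = dict(zip(anns, bits))
    p = 1.0
    for a, present in world.items():
        p *= priors[a] if present else 1 - priors[a]
    if rule(world):                # eliminate the impossible...
        worlds.append((world, p))

total = sum(p for _, p in worlds)  # ...and recondition what remains
marginals = {a: sum(p for w, p in worlds if w[a]) / total for a in anns}
print(marginals)
```

After conditioning, the Person and City annotations have become dependent (their joint probability is zero) while the Hotel annotation keeps its prior of 0.4, since the rule never mentions it; doing this without enumerating all 2^n worlds is the storage-and-conditioning problem the paper discusses.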
Handling Uncertainty in Information Extraction.
Maurice van Keulen, Mena Badieh Habib
This position paper proposes an interactive approach for developing information extractors based on the ontology definition process with knowledge about possible (in)correctness of annotations. We discuss the problem of managing and manipulating probabilistic dependencies.
The paper will be presented at the URSW workshop co-located with ISWC 2011, 23 October 2011, Bonn, Germany [details]