Tag-Archive for » named entity extraction «

Thursday, May 15th, 2014 | Author:

Tweakers.net, NU.nl and Kennislink.nl picked up the UT homepage news item on the research of my PhD student Mena Badieh Habib on Named Entity Extraction and Named Entity Disambiguation.
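Tweakers.net: UT laat politiecomputers tweets ‘begrijpen’ voor veiligheid bij evenementen (in Dutch; "UT lets police computers 'understand' tweets for safety at events")
NU.nl: Universiteit Twente laat computers beter begrijpend lezen (in Dutch; "University of Twente lets computers read with better comprehension")
Kennislink.nl: Twentse computer leest beter (in Dutch; "Twente computer reads better")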

Wednesday, May 14th, 2014 | Author:

The news feed of the UT homepage features an item on the research of my PhD student Mena Badieh Habib.
Computers leren beter begrijpend lezen dankzij UT-onderzoek (in Dutch; "Computers learn to read with better comprehension thanks to UT research").
Mena defended his PhD thesis, entitled “Named Entity Extraction and Disambiguation for Informal Text – The Missing Link”, on May 9th.

Friday, May 09th, 2014 | Author:

Today, a PhD student of mine, Mena Badieh Habib Morgan, defended his thesis.
Named Entity Extraction and Disambiguation for Informal Text – The Missing Link
Social media content represents a large portion of all textual content appearing on the Internet. These streams of user-generated content (UGC) provide an opportunity and challenge for media analysts to analyze huge amounts of new data and use them to infer and reason with new information. A main challenge of natural language is its ambiguity and vagueness. When we move to the informal language widely used in social media, the language becomes even more ambiguous and thus more challenging for automatic understanding. Named Entity Extraction (NEE) is a subtask of Information Extraction (IE) that aims to locate phrases (mentions) in the text that represent names of entities such as persons, organizations or locations, regardless of their type. Named Entity Disambiguation (NED) is the task of determining which person, place, event, etc. is referred to by a mention. The main goal of this thesis is to mimic the human way of recognizing and disambiguating named entities, especially for domains that lack formal sentence structure. We propose a robust combined framework for NEE and NED in semi-formal and informal text. The achieved robustness has been proven to be valid across languages and domains and to be independent of the selected extraction and disambiguation techniques. It is also shown to be robust against a shortage of labeled training data and against the informality of the language used.

Monday, April 07th, 2014 | Author:

Last year we won the #Microposts2013 challenge; this year we came in second in the new #Microposts2014 challenge called NEEL, “Named Entity Extraction and Linking”, which, unlike last year's challenge, also involves Entity Disambiguation (by linking to DBpedia).
Named Entity Extraction and Linking Challenge: University of Twente at #Microposts2014 [Download]
Mena Badieh Habib, Maurice van Keulen, Zhemin Zhu

Monday, June 03rd, 2013 | Author:

Two Master students of mine, Jasper Kuperus and Jop Hofste, each have a paper at FORTAN 2013, co-located with EISIC 2013.
Increasing NER recall with minimal precision loss
Jasper Kuperus, Maurice van Keulen, and Cor Veenman
Named Entity Recognition (NER) is broadly used as a first step toward the interpretation of text documents. However, for many applications, such as forensic investigation, recall is currently inadequate, leading to loss of potentially important information. Entity class ambiguity cannot be resolved reliably due to the lack of context information or the exploitation thereof. Consequently, entity classification introduces too many errors, leading to severe omissions in answers to forensic queries.
We propose a technique based on multiple candidate labels, effectively postponing decisions for entity classification to query time. Entity resolution exploits user feedback: a user is only asked for feedback on entities relevant to his/her query. Moreover, giving feedback can be stopped at any time once query results are considered good enough. We propose several interaction strategies that obtain increased recall with little loss in precision. [details]
Digital-forensics based pattern recognition for discovering identities in electronic evidence
Hans Henseler, Jop Hofsté, and Maurice van Keulen
With the pervasiveness of computers and mobile devices, digital forensics is becoming more important in law enforcement. Detectives increasingly depend on the scarce support of digital specialists, which impedes the efficiency of criminal investigations. This paper proposes an algorithm to extract, merge and rank identities that are encountered in the electronic evidence during processing. Two experiments are described demonstrating that our approach can assist with the identification of frequently occurring identities so that investigators can prioritize the investigation of evidence units accordingly. [details]
Both papers will be presented at the FORTAN 2013 workshop, 12 Aug 2013, Uppsala, Sweden

Monday, May 13th, 2013 | Author:

Together with my PhD student Mena Badieh Habib and Zhemin Zhu, another PhD student in our group, I participated in the “Making Sense of Microposts” challenge at the WWW 2013 conference … and we won the best IE award!
[paper | presentation | poster]

Thursday, November 01st, 2012 | Author:

On 1 November 2012, Jop Hofste defended his MSc thesis “Scalable identity extraction and ranking in Tracks Inspector”. The MSc project was carried out at Fox-IT.
“Scalable identity extraction and ranking in Tracks Inspector”[download]
The digital forensic world deals with a growing amount of data that has to be processed. In general, investigators do not have the time to manually analyze all the digital evidence to get a good picture of the suspect. Most investigations involve multiple evidence units per case. This research shows how identities can be extracted from evidence data and resolved. Investigators are supported by being presented with the identities involved. These identities are extracted from multiple heterogeneous sources such as system accounts, emails, documents, address books and communication items. Identity resolution is used to merge identities at case level when multiple evidence units are involved.
The functionality for extracting, resolving and ranking identities is implemented in the forensic tool Tracks Inspector and tested on five datasets. The results are compared with two other forensic products, Clearwell and Trident, on the extent to which they support the identity functionality. Tracks Inspector delivers very promising results: its top 10 identities contain at least as many of the relevant identities as those of Clearwell and Trident. In the tests, Tracks Inspector also achieves high accuracy: compared to Clearwell, its precision is better and its recall is approximately equal.
The contribution of this research is a method for the extraction and ranking of identities in Tracks Inspector. In the digital forensic world this is quite a new approach, because no other software products support this kind of functionality. Investigations can now start by exploring the most relevant identities in a case. The nodes that are involved in an identity can be quickly recognized. This means that the evidence data can be filtered at an early stage.
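As a rough illustration of what merging identities at case level and ranking them could look like, here is a minimal Python sketch. It is not the Tracks Inspector implementation: the record layout `(evidence_unit, name, email)`, the choice to merge on a case-insensitive email address, and the ranking by number of occurrences are all assumptions made for the example.

```python
from collections import defaultdict

def resolve_and_rank(records):
    """Merge identity records across evidence units and rank them.

    records: iterable of (evidence_unit, name, email) tuples (hypothetical layout).
    Assumption: two records denote the same identity if their email addresses
    match after trimming whitespace and lowercasing.
    Assumption: identities are ranked by how often they occur.
    """
    merged = defaultdict(lambda: {"names": set(), "units": set(), "count": 0})
    for unit, name, email in records:
        identity = merged[email.strip().lower()]
        identity["names"].add(name)
        identity["units"].add(unit)
        identity["count"] += 1
    # Most frequent identities first; ties are broken alphabetically by key.
    return sorted(merged.items(), key=lambda kv: (-kv[1]["count"], kv[0]))

# Made-up records from two evidence units in one case.
records = [
    ("laptop", "J. Smith", "j.smith@example.com"),
    ("phone", "John Smith", "J.Smith@example.com "),
    ("laptop", "Ann Lee", "ann@example.com"),
]
ranked = resolve_and_rank(records)
for key, identity in ranked:
    print(key, identity["count"], sorted(identity["names"]), sorted(identity["units"]))
```

In this toy case the two spellings of the same address collapse into one identity that spans both evidence units, so it ends up at the top of the ranking.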

Tuesday, September 04th, 2012 | Author:

One of my PhD students, Mena Badieh Habib, got a paper on improving NEE in Twitter accepted at the Semantic Web and Information Extraction (SWAIE) workshop at the EKAW conference.
Unsupervised Improvement of Named Entity Extraction in Short Informal Context Using Disambiguation Clues
Mena Badieh Habib, Maurice van Keulen
Short context messages (like tweets and SMS messages) are a potentially rich source of continuously and instantly updated information. The shortness and informality of such messages are challenges for Natural Language Processing tasks. Most efforts in this direction rely on machine learning techniques, which are expensive in terms of data collection and training.
In this paper we present an unsupervised Semantic Web-driven approach to improve the extraction process by using clues from the disambiguation process. For extraction we used a simple Knowledge-Base matching technique combined with a clustering-based approach for disambiguation. Experimental results on a self-collected set of tweets (as an example of short context messages) show improvement in extraction results when using unsupervised feedback from the disambiguation process.
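To give a feel for the idea of feeding disambiguation clues back into extraction, here is a minimal Python sketch. It is not the method from the paper: the tiny knowledge base `KB`, its topic sets, and both function names are made up, and a simple shared-topic coherence check stands in for the clustering-based disambiguation.

```python
# Hypothetical toy knowledge base: surface form -> {entity id: related topics}.
KB = {
    "paris": {"Paris_(France)": {"france", "europe"}, "Paris_Hilton": {"celebrity"}},
    "louvre": {"Louvre_Museum": {"france", "museum"}},
    "apple": {"Apple_Inc": {"technology"}, "Apple_(fruit)": {"food"}},
}

def extract_candidates(text, kb=KB, max_len=3):
    """Return every token n-gram (up to max_len tokens) that matches a KB surface form."""
    tokens = [t.strip(".,!?#@").lower() for t in text.split()]
    found = []
    for n in range(max_len, 0, -1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in kb:
                found.append(phrase)
    return found

def filter_by_coherence(candidates, kb=KB):
    """Keep a candidate mention only if one of its entities shares a topic
    with an entity of another candidate in the same message."""
    kept = {}
    for mention in candidates:
        other_topics = {
            topic
            for other in candidates if other != mention
            for topics in kb[other].values()
            for topic in topics
        }
        for entity, topics in kb[mention].items():
            if topics & other_topics:
                kept[mention] = entity
                break
    return kept

tweet = "Visiting the Louvre in Paris, then buying an apple!"
candidates = extract_candidates(tweet)
confirmed = filter_by_coherence(candidates)
print(candidates)
print(confirmed)
```

Here "apple" matches the knowledge base but none of its entities fits the rest of the message, so the disambiguation step gives the extraction step a reason to drop it, while "louvre" and "paris" support each other.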
The paper will be presented at the SWAIE workshop co-located with EKAW 2012, 8-12 October 2012, Galway City, Ireland [details]

Tuesday, July 24th, 2012 | Author:

One of my PhD students, Mena Badieh Habib, got a paper accepted at the Knowledge Discovery and Information Retrieval (KDIR) conference on improving NEE and NED by treating them as processes that can reinforce each other.
Improving Toponym Disambiguation by Iteratively Enhancing Certainty of Extraction
Mena Badieh Habib, Maurice van Keulen
Named entity extraction (NEE) and disambiguation (NED) have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and semantic web. This paper addresses two problems with toponym extraction and disambiguation (as a representative example of named entities). First, almost no existing works examine the extraction and disambiguation interdependency. Second, existing disambiguation techniques mostly take as input extracted named entities without considering the uncertainty and imperfection of the extraction process.
It is the aim of this paper to investigate both avenues and to show that explicit handling of the uncertainty of annotation has much potential for making both extraction and disambiguation more robust. We conducted experiments with a set of holiday home descriptions with the aim of extracting and disambiguating toponyms. We show that the extraction confidence probabilities are useful in enhancing the effectiveness of disambiguation. Reciprocally, retraining the extraction models with information automatically derived from the disambiguation results improves the extraction models. This mutual reinforcement is shown to still have an effect after several automatic iterations.
The paper will be presented at the KDIR conference, 4-7 October 2012, Barcelona, Spain [details]

Thursday, June 21st, 2012 | Author:

On 21 June 2012, Jasper Kuperus defended his MSc thesis “Catching Criminals by Chance: A probabilistic Approach to Named Entity Recognition using Targeted Feedback”. The MSc project was supervised by me, Dolf Trieschnigg, Mena Badieh Habib and Cor Veenman from the Netherlands Forensic Institute (NFI).
“Catching Criminals by Chance: A probabilistic Approach to Named Entity Recognition using Targeted Feedback”[download]
In forensics, large amounts of unstructured data have to be analyzed in order to find evidence or to detect risks, for example the contents of a personal computer or USB data carriers belonging to a suspect. Automatic processing of these large amounts of unstructured data, using techniques like Information Extraction, is inevitable. Named Entity Recognition (NER) is an important first step in Information Extraction and still a difficult task.
A main challenge in NER is the ambiguity among the extracted named entities. Most approaches take a hard decision on which named entities belong to which class or which boundary fits an entity. However, there is often a significant amount of ambiguity in this choice, so making these hard decisions results in errors. Instead of making such a choice, all possible alternatives can be preserved, each with a corresponding probability that it is the correct choice. Extracting and handling entities in such a probabilistic way is called Probabilistic Named Entity Recognition (PNER).
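The difference between a hard decision and preserved alternatives can be shown with a small Python sketch. This is only an illustration under stated assumptions: the mentions, labels and probabilities are made up, and the threshold-based `query` function is a hypothetical way of using the alternatives, not the approach from the thesis.

```python
# Hypothetical PNER output: each mention keeps all candidate labels with a probability.
mentions = {
    "Washington": {"PERSON": 0.45, "LOCATION": 0.40, "ORGANIZATION": 0.15},
    "Jordan": {"LOCATION": 0.55, "PERSON": 0.45},
}

def hard_decision(alternatives):
    """Regular NER: keep only the most probable label."""
    return max(alternatives, key=alternatives.get)

def query_hard(mentions, label):
    """Mentions whose single hard label equals the requested label."""
    return sorted(m for m, alts in mentions.items() if hard_decision(alts) == label)

def query(mentions, label, threshold=0.3):
    """Mentions for which the requested label is still a plausible alternative
    (probability at least `threshold`, an assumed cut-off)."""
    return sorted(m for m, alts in mentions.items() if alts.get(label, 0.0) >= threshold)

print(query_hard(mentions, "PERSON"))
print(query(mentions, "PERSON"))
```

With hard decisions, a query for persons only finds "Washington"; with the alternatives preserved, "Jordan" is also returned because PERSON is still a plausible label for it. That is the kind of recall gain the probabilistic representation aims for, at the cost of some precision.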
Combining the fields of Probabilistic Databases and Information Extraction results in a new field of research. This research project explores the problem of Probabilistic NER. Although Probabilistic NER does not make hard decisions when ambiguity is involved, it also does not yet resolve ambiguity. A way of resolving this ambiguity is by using user feedback to let the probabilities converge to the real world situation, called Targeted Feedback. The main goal in this project is to improve NER results by using PNER, preventing ambiguity related extraction errors and using Targeted Feedback to reduce ambiguity.
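A minimal Python sketch of how user feedback could make the probabilities converge is given below. It is an illustration only: the data is made up, renormalising the remaining alternatives after a "no" answer is an assumed update rule, and asking about the highest-entropy mention first is just one hypothetical question-ordering strategy.

```python
import math

def entropy(alternatives):
    """Shannon entropy (in bits) of a label distribution; higher means more ambiguous."""
    return -sum(p * math.log2(p) for p in alternatives.values() if p > 0)

def next_question(mentions):
    """Assumed strategy: ask the user about the most ambiguous mention first."""
    return max(mentions, key=lambda m: entropy(mentions[m]))

def apply_feedback(alternatives, label, is_correct):
    """Update a label distribution after the user confirms or rejects `label`."""
    if is_correct:
        return {l: (1.0 if l == label else 0.0) for l in alternatives}
    remaining = {l: p for l, p in alternatives.items() if l != label}
    total = sum(remaining.values())
    if total == 0:
        # The rejected label held all the probability mass: fall back to uniform.
        updated = {l: 1.0 / len(remaining) for l in remaining}
    else:
        updated = {l: p / total for l, p in remaining.items()}
    updated[label] = 0.0
    return updated

# Hypothetical probabilistic extraction result.
mentions = {
    "Washington": {"PERSON": 0.45, "LOCATION": 0.40, "ORGANIZATION": 0.15},
    "Jordan": {"LOCATION": 0.55, "PERSON": 0.45},
}

asked = next_question(mentions)
rejected = apply_feedback(mentions[asked], "LOCATION", is_correct=False)
confirmed = apply_feedback(mentions[asked], "PERSON", is_correct=True)
print(asked, rejected, confirmed)
```

Rejecting a label redistributes its probability over the remaining alternatives, and confirming a label removes the ambiguity for that mention entirely.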
This research project shows that Recall values of the PNER results are significantly higher than for regular NER, amounting to improvements of over 29%. Using Targeted Feedback, both Precision and Recall approach 100% after full user feedback. For Targeted Feedback, both the order in which questions are posed and whether a strategy attempts to learn from the answers of the user provide performance gains. Although PNER shows potential, this research project provides insufficient evidence to conclude whether PNER is better than regular NER.