Tag-Archive for » Information Extraction «

Friday, January 22nd, 2016 | Author:

I’ve been interviewed by NRCQ about natural language processing, in particular about computers learning to understand the more subtle aspects of language use, such as sarcasm.
Súperhandig hoor, computers die sarcasme kunnen herkennen (“Really handy, computers that can recognize sarcasm”)

Wednesday, February 25th, 2015 | Author:

Today I gave a presentation at the SIKS Smart Auditing workshop at the University of Tilburg.

Monday, May 20th, 2013 | Author:

One of my Master students, Oliver Jundt, has a paper at EUSFLAT 2013.
Sample-based XPath Ranking for Web Information Extraction
Oliver Jundt and Maurice van Keulen
Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive, as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. It is a wrapper induction approach that uses a small and easily obtainable set of sample data to rank XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute.
The paper will be presented at the EUSFLAT 2013 conference, 11-13 Sep 2013, Milan, Italy [details]
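To give an impression of the idea (not the paper’s actual algorithm), here is a minimal sketch that ranks candidate XPaths by how many of a handful of known sample values they recover. It uses lxml for the HTML/XPath handling; the path generalization, function names, and sample data are invented for illustration.

```python
# A minimal sketch of sample-based XPath ranking (illustration only, not the
# implementation from the paper). Given a few known attribute values on sample
# pages, generalize the paths of matching text nodes and rank those paths by
# how many samples they recover.
from collections import Counter
import re
from lxml import html

def candidate_xpaths(page_html, sample_value):
    """Yield generalized XPaths of elements whose text contains the sample value."""
    tree = html.fromstring(page_html)
    for node in tree.xpath('//*[normalize-space(text())]'):
        if sample_value in node.text_content():
            path = tree.getroottree().getpath(node)   # e.g. /html/body/div[2]/td[3]
            yield re.sub(r'\[\d+\]', '', path)        # drop positions to generalize

def rank_xpaths(samples):
    """samples: list of (page_html, known_attribute_value) pairs."""
    votes = Counter()
    for page_html, value in samples:
        votes.update(set(candidate_xpaths(page_html, value)))
    # XPaths that recover the wanted value on the most sample pages rank highest
    return votes.most_common()

# Usage (hypothetical data): feed ~20-25 (page, value) samples, take the top XPath.
# best_xpath, score = rank_xpaths(samples)[0]
```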

Thursday, June 21st, 2012 | Author:

On 21 June 2012, Jasper Kuperus defended his MSc thesis “Catching Criminals by Chance: A Probabilistic Approach to Named Entity Recognition using Targeted Feedback”. The MSc project was supervised by me, Dolf Trieschnigg, Mena Badieh Habib, and Cor Veenman from the Netherlands Forensic Institute (NFI).
“Catching Criminals by Chance: A Probabilistic Approach to Named Entity Recognition using Targeted Feedback” [download]
In forensics, large amounts of unstructured data have to be analyzed in order to find evidence or to detect risks, for example, the contents of a personal computer or USB data carriers belonging to a suspect. Automatic processing of these large amounts of unstructured data, using techniques like Information Extraction, is inevitable. Named Entity Recognition (NER) is an important first step in Information Extraction and still a difficult task.
A main challenge in NER is the ambiguity among the extracted named entities. Most approaches take a hard decision on which class a named entity belongs to or which boundary fits an entity. However, there is often a significant amount of ambiguity in this choice, so these hard decisions result in errors. Instead of making such a choice, all possible alternatives can be preserved, each with a probability that it is the correct one. Extracting and handling entities in such a probabilistic way is called Probabilistic Named Entity Recognition (PNER).
Combining the fields of Probabilistic Databases and Information Extraction results in a new field of research. This research project explores the problem of Probabilistic NER. Although Probabilistic NER does not make hard decisions when ambiguity is involved, it also does not yet resolve ambiguity. A way of resolving this ambiguity, called Targeted Feedback, is to use user feedback to let the probabilities converge to the real-world situation. The main goal of this project is to improve NER results by using PNER, preventing ambiguity-related extraction errors, and using Targeted Feedback to reduce ambiguity.
This research project shows that Recall values of the PNER results are significantly higher than for regular NER, with improvements of over 29%. Using Targeted Feedback, both Precision and Recall approach 100% after full user feedback. For Targeted Feedback, both the order in which questions are posed and whether a strategy attempts to learn from the user’s answers provide performance gains. Although PNER shows potential, this research project provides insufficient evidence as to whether PNER is better than regular NER.
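The gist of PNER with Targeted Feedback can be sketched as follows (a simplification of my own, not the thesis implementation): keep every candidate interpretation of an ambiguous phrase with a probability, pose targeted questions to the user, and eliminate rejected alternatives while renormalizing the rest. The example phrase and probabilities are made up.

```python
# Minimal sketch of probabilistic NER with targeted feedback (illustration only,
# not the thesis implementation). Every candidate (phrase, class) alternative of
# an ambiguous extraction is kept with a probability; user feedback eliminates
# alternatives and the remaining probability mass is renormalized.

def normalize(alternatives):
    total = sum(alternatives.values())
    return {alt: p / total for alt, p in alternatives.items()}

# Candidate annotations for the ambiguous text span "Michael Jordan"
candidates = {
    ('Michael Jordan', 'Person'): 0.6,
    ('Jordan', 'Location'): 0.3,
    ('Jordan', 'Person'): 0.1,
}

def ask_user(alternative):
    """Targeted Feedback would pose the most informative question first;
    here the user's answer is simply hard-coded for the example."""
    return alternative == ('Michael Jordan', 'Person')

# Ask about the most probable alternatives first
for alt in sorted(candidates, key=candidates.get, reverse=True):
    if ask_user(alt):
        candidates = {alt: 1.0}            # confirmed: ambiguity resolved
        break
    del candidates[alt]                    # rejected: eliminate alternative
    candidates = normalize(candidates)     # redistribute probability mass

print(candidates)   # {('Michael Jordan', 'Person'): 1.0}
```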

Friday, December 02nd, 2011 | Author:

One of my PhD students, Mena Badieh Habib, gave a talk at the Dutch-Belgian DataBase Day (DBDBD) about “Named Entity Extraction and Disambiguation from an Uncertainty Perspective”.

Monday, September 05th, 2011 | Author:

I wrote a position paper about a different approach towards the development of information extractors, which I call Sherlock Holmes-style after his famous quote “when you have eliminated the impossible, whatever remains, however improbable, must be the truth”. The idea is that we fundamentally treat annotations as uncertain. We even start from a “no knowledge”, i.e., “everything is possible”, state and then interactively add more knowledge, applying it directly to the annotation state by removing possible annotations and recalculating the probabilities of the remaining ones. For example, “Paris Hilton”, “Paris”, and “Hilton” can all be interpreted as a City, Hotel, or Person name. But adding knowledge like “if a phrase is interpreted as a Person name, then its subphrases should not be interpreted as a City” makes the annotations <"Paris Hilton":Person Name> and <"Paris":City> mutually exclusive. Observe that initially all annotations were independent, whereas these two are now dependent. We argue in the paper that the main challenge in this approach lies in efficient storage and conditioning of probabilistic dependencies, because trivial approaches do not work.
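To make the idea concrete, here is a minimal sketch (my own illustration, not the technique proposed in the paper): start from all interpretations being possible, enumerate the possible worlds, eliminate the worlds that violate an added knowledge rule, and recompute the probabilities of the remaining annotations. Uniform priors and the label sets are assumptions for the example.

```python
# Sketch of the "Sherlock Holmes" idea (illustration only): start with all
# interpretations possible and independent, then apply a knowledge rule that
# eliminates impossible combinations and recompute probabilities from what remains.
from itertools import product

# Initially everything is possible: each phrase has independent candidate labels
annotations = {
    'Paris Hilton': ['Person', 'Hotel'],
    'Paris':        ['City', 'Person'],
    'Hilton':       ['Hotel', 'Person'],
}

def violates(world):
    """Knowledge rule: if a phrase is a Person, its subphrases cannot be a City."""
    return world['Paris Hilton'] == 'Person' and world['Paris'] == 'City'

# Enumerate possible worlds (joint interpretations), drop the impossible ones
phrases = list(annotations)
worlds = [dict(zip(phrases, combo)) for combo in product(*annotations.values())]
remaining = [w for w in worlds if not violates(w)]

# With uniform priors, each remaining world gets probability 1/len(remaining);
# marginals of individual annotations are recomputed from these worlds, which is
# where the dependency between <"Paris Hilton":Person> and <"Paris":City> appears.
p_paris_city = sum(1 for w in remaining if w['Paris'] == 'City') / len(remaining)
print(len(worlds), len(remaining), p_paris_city)   # 8 worlds -> 6 remain
```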
Handling Uncertainty in Information Extraction.
Maurice van Keulen, Mena Badieh Habib
This position paper proposes an interactive approach for developing information extractors based on the ontology definition process with knowledge about possible (in)correctness of annotations. We discuss the problem of managing and manipulating probabilistic dependencies.

The paper will be presented at the URSW workshop co-located with ISWC 2011, 23 October 2011, Bonn, Germany [details]

Monday, August 22nd, 2011 | Author:

One of my PhD students, Mena Badieh Habib, and I submitted a paper about improving the effectiveness of named entity extraction (NEE) with what we call “the reinforcement effect” to the MUD workshop of VLDB 2011.
Named Entity Extraction and Disambiguation: The Reinforcement Effect.
Mena Badieh Habib, Maurice van Keulen
Named entity extraction and disambiguation have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and the semantic web. Although these topics are highly dependent, almost no existing work examines this dependency. It is the aim of this paper to examine the dependency and show how one affects the other, and vice versa. We conducted experiments with a set of descriptions of holiday homes with the aim of extracting and disambiguating toponyms as a representative example of named entities. We experimented with three approaches for disambiguation with the purpose of inferring the country of the holiday home. We examined how the effectiveness of extraction influences the effectiveness of disambiguation, and reciprocally, how filtering out ambiguous names (an activity that depends on the disambiguation process) improves the effectiveness of extraction. Since this, in turn, may improve the effectiveness of disambiguation again, it shows that extraction and disambiguation may reinforce each other.

The paper will be presented at the MUD workshop co-located with VLDB 2011, 29 August 2011, Seattle, USA [details]
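The reinforcement loop from the abstract can be sketched roughly as: extract toponym candidates, disambiguate them to a country, use the disambiguation result to filter out inconsistent or ambiguous names, and feed that filter back into the next extraction pass. The sketch below is illustrative only; the toy gazetteer, the voting scheme, and all function names are my own assumptions, not the paper’s method.

```python
# Rough sketch of the extraction/disambiguation reinforcement loop (illustrative
# only; the gazetteer and scoring here are hypothetical, not the paper's method).

# Toy gazetteer: toponym -> countries it may refer to
GAZETTEER = {
    'Valencia': ['Spain', 'Venezuela'],
    'Costa Blanca': ['Spain'],
    'Paris': ['France', 'United States'],
}

def extract_toponyms(text, blacklist=frozenset()):
    """Naive extraction: gazetteer lookup, skipping names filtered out earlier."""
    return [name for name in GAZETTEER if name in text and name not in blacklist]

def disambiguate(toponyms):
    """Vote for the country most consistent with all extracted toponyms."""
    votes = {}
    for name in toponyms:
        for country in GAZETTEER[name]:
            votes[country] = votes.get(country, 0) + 1 / len(GAZETTEER[name])
    return max(votes, key=votes.get) if votes else None

def reinforcement_loop(text, rounds=2):
    blacklist = set()
    for _ in range(rounds):
        toponyms = extract_toponyms(text, blacklist)
        country = disambiguate(toponyms)
        # Feedback step: drop names whose candidate countries do not include the
        # inferred country -- filtering ambiguity improves the next extraction pass.
        blacklist |= {n for n in toponyms if country not in GAZETTEER[n]}
    return country

print(reinforcement_loop('Holiday home near Valencia on the Costa Blanca'))  # Spain
```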

Wednesday, December 22nd, 2010 | Author:

One of my PhD students, Mena Badieh Habib, submitted a paper with his research plans in the Neogeography project to the PhD workshop of ICDE2011.
Neogeography: The Challenge of Channelling Large and Ill-Behaved Data Streams
Mena Badieh Habib
Neogeography is the combination of user-generated data and experiences with mapping technologies. In this paper we propose a research project to extract valuable structured information with a geographic component from unstructured user-generated text in wikis, forums, or SMS messages. The project intends to help workers’ communities in developing countries share their knowledge, providing a simple and cheap way to contribute and benefit using the available communication technology.

The paper will be presented at the PhD workshop co-located with ICDE 2011, 11 April 2011, Hannover, Germany [details]

Friday, August 27th, 2010 | Author:

On Thursday 26 August 2010, Guido van der Zanden defended his MSc thesis “Quality Assessment of Medical Health Records using Information Extraction”. The MSc project was supervised by me, Ander de Keijzer, and Vincent Ivens and Daan van Berkel from Topicus Zorg.

“Quality Assessment of Medical Health Records using Information Extraction” [download]
The most important information in Electronic Health Records is in free-text form. The result is that the quality of Electronic Health Records is hard to assess. Since Electronic Health Records are exchanged more and more, badly written or incomplete records can cause problems when other healthcare providers do not completely understand them. In this thesis we try to automatically assess the quality of Electronic Health Records using Information Extraction. Another advantage of the automated analysis of Electronic Health Records is that management information can be extracted, which can be used to increase efficiency and decrease costs, another popular subject in healthcare nowadays.
Our solution for automated assessment of Electronic Health Records consists of two parts. In the first part we theoretically determine what the quality of Electronic Health Records is, based upon Data and Information Quality theory. Based upon this analysis we propose three quality metrics. The first two check whether an Electronic Health Record is written as prescribed by the guidelines of the association of general practitioners: the first checks whether the SOEP methodology is used correctly, the second whether a treatment is carried out according to the guideline for that illness. The third metric is more generally applicable and measures conciseness.
In the second part we designed and implemented a prototype system to execute the quality assessment. Due to time limitations we only implemented the SOEP methodology metric. This metric tests whether a piece of text is placed in the right place. The fields that can be used by a healthcare provider are (S)ubjective, (O)bjective, (E)valuation, and (P)lan. We implemented a prototype based upon the ‘General Architecture for Text Engineering’ (GATE). Many generic Information Extraction tasks were already available; we implemented two domain-specific tasks ourselves. The first looks up words in a thesaurus (the UMLS) in order to give meaning to the text, since one or more semantic types are assigned to every word in the thesaurus. The semantic types found in a sentence are then resolved to one of the four SOEP types. In a good Electronic Health Record, sentences are resolved to the SOEP field they are actually in.
To validate our prototype we annotated text from real Electronic Health Records with S, O, E, and P and compared it to the output of our prototype. We found a Precision of roughly 50% and a Recall of 20-25%. Although not perfect, and given that we had neither the time nor the resources to involve domain experts, we think this result is encouraging for further research. Furthermore, we showed with use cases that our other two metrics are sensible. Although this is no proof that they are feasible in practice, it shows that a whole set of different metrics can be used to assess the quality of Electronic Health Records.
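The core classification step, resolving the semantic types found in a sentence to one of the SOEP fields, could look roughly like the sketch below. The toy thesaurus and the type-to-SOEP mapping are invented for illustration; the actual prototype is built on GATE and the UMLS.

```python
# Rough sketch of resolving a sentence to a SOEP field via thesaurus lookup
# (illustrative only; the toy thesaurus and the mapping are invented, the real
# prototype uses GATE and UMLS semantic types).
from collections import Counter

# Toy thesaurus: word -> semantic types (the UMLS assigns such types to concepts)
THESAURUS = {
    'headache': ['Sign or Symptom'],
    'temperature': ['Clinical Attribute'],
    'paracetamol': ['Pharmacologic Substance'],
    'diagnosis': ['Diagnostic Procedure'],
}

# Hypothetical mapping from semantic types to SOEP fields
TYPE_TO_SOEP = {
    'Sign or Symptom': 'S',           # what the patient reports
    'Clinical Attribute': 'O',        # what the physician measures
    'Diagnostic Procedure': 'E',      # the physician's evaluation
    'Pharmacologic Substance': 'P',   # the treatment plan
}

def resolve_soep(sentence):
    """Look up each word, collect semantic types, and vote for a SOEP field."""
    votes = Counter()
    for word in sentence.lower().split():
        for semtype in THESAURUS.get(word.strip('.,'), []):
            votes[TYPE_TO_SOEP[semtype]] += 1
    return votes.most_common(1)[0][0] if votes else None

# A record is flagged when sentences resolve to a different field than the one
# they were actually written in.
print(resolve_soep('Patient complains of headache'))   # 'S'
```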

Monday, May 03rd, 2010 | Author:

Mena Badieh Habib started his PhD research in the Neogeography project today. For details, see my earlier post on “Kick-Off of Neogeography project”.