Tag-Archive for » probabilistic data integration «

Thursday, June 20th, 2013

On 20 June 2013, Ben Companjen defended his MSc thesis on matching author names on publications to researcher profiles at the scale of the Netherlands. He carried out this research at DANS, where he applied and validated his techniques on a coupling between the NARCIS scholarly database and the researcher profile database VSOI.
“Probabilistically Matching Author Names to Researchers” [download]
Publications are the most important form of scientific communication, but science also consists of researchers, research projects and organisations. The goal of NARCIS (National Academic Research and Collaboration Information System) is to provide a complete and concise view of current science in the Netherlands.
Connecting publications in retrospect to the researchers, projects and organisations that created them is hard, because author identifiers are rarely used in publications and researcher profiles. There is too much data to identify all researchers in NARCIS manually, so an automatic method is needed to assist in completing the view of science in the Netherlands.
This thesis explores the problems that limit the automatic connection of author names in publications to researchers, and develops and evaluates a method to automatically connect publications and researchers.
In an experiment with two test sets, using only the author names themselves finds the correct researcher for around 80% of the author names. However, none of the correct matches were given the highest confidence among the returned matches: over 90% of the correct matches were ranked second by confidence, and the remaining correct matches were ranked lower. Working with probabilistic results allows the correct results to be used even when they are not the best match. Many names that should not match were nevertheless included in the matches; the matching algorithm can be optimised to assign confidences to matches differently.
Including a matching function that compares publication titles with researchers' project titles did not improve the results, but better results are expected when more context elements are used to assign confidences.
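
The core idea of returning ranked probabilistic matches, rather than committing to a single best match, can be sketched as follows. This is a minimal illustration, not the thesis code: the similarity heuristic and the names match_author and name_similarity are assumptions.

    # Minimal sketch of probabilistic author-name matching; the crude string
    # similarity stands in for the thesis's more refined matching functions.
    from difflib import SequenceMatcher

    def name_similarity(author_name, researcher_name):
        # Similarity in [0, 1]; illustrative heuristic only.
        return SequenceMatcher(None, author_name.lower(), researcher_name.lower()).ratio()

    def match_author(author_name, researchers, threshold=0.5):
        # Keep ALL candidates above a threshold, ranked by confidence, so a
        # correct match that is only ranked second remains available later.
        candidates = []
        for researcher_id, full_name in researchers:
            confidence = name_similarity(author_name, full_name)
            if confidence >= threshold:
                candidates.append((researcher_id, full_name, confidence))
        return sorted(candidates, key=lambda c: c[2], reverse=True)

    researchers = [(1, "J. de Vries"), (2, "Jan de Vries"), (3, "J.P. de Vries")]
    for rid, name, conf in match_author("J de Vries", researchers):
        print(f"{rid}: {name} (confidence {conf:.2f})")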

Monday, June 3rd, 2013

Two Master's students of mine, Jasper Kuperus and Jop Hofsté, each have a paper at FORTAN 2013, co-located with EISIC 2013.
Increasing NER recall with minimal precision loss
Jasper Kuperus, Maurice van Keulen, and Cor Veenman
Named Entity Recognition (NER) is broadly used as a first step toward the interpretation of text documents. However, for many applications, such as forensic investigation, recall is currently inadequate, leading to loss of potentially important information. Entity class ambiguity cannot be resolved reliably due to the lack of context information or the exploitation thereof. Consequently, entity classification introduces too many errors, leading to severe omissions in answers to forensic queries.
We propose a technique based on multiple candidate labels effectively postponing decisions for entity classification to query time. Entity resolution exploits user feedback: a user is only asked for feedback on entities relevant to his/her query. Moreover, giving feedback can be stopped anytime when query results are considered good enough. We propose several interaction strategies that obtain increased recall with little loss in precision. [details]
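A hypothetical sketch of postponing entity classification to query time (the annotations structure and the query function are illustrative assumptions, not the paper's implementation): every mention keeps its candidate labels, and the user is asked for feedback only on ambiguous entities relevant to the current query.

    # Each mention keeps multiple candidate labels with confidences instead
    # of a single, possibly wrong, class decision (toy data for illustration).
    annotations = {
        "Jordan": [("PERSON", 0.6), ("LOCATION", 0.4)],      # ambiguous
        "Amsterdam": [("LOCATION", 0.9), ("ORGANIZATION", 0.1)],
    }
    feedback = {}  # user decisions, collected lazily at query time

    def query(entity_class):
        # Ask for feedback only on ambiguous entities relevant to this query;
        # the user can stop giving feedback whenever results are good enough.
        results = []
        for entity, labels in annotations.items():
            confidences = [conf for label, conf in labels if label == entity_class]
            if not confidences:
                continue  # the queried class is not among the candidates
            if len(labels) > 1 and entity not in feedback:
                answer = input(f"Is '{entity}' a {entity_class}? (y/n) ")
                feedback[entity] = answer.strip().lower() == "y"
            if feedback.get(entity, True):
                results.append((entity, max(confidences)))
        return results

    print(query("PERSON"))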
Digital-forensics based pattern recognition for discovering identities in electronic evidence
Hans Henseler, Jop Hofsté, and Maurice van Keulen
With the pervasiveness of computers and mobile devices, digital forensics becomes more important in law enforcement. Detectives increasingly depend on the scarce support of digital specialists, which impedes the efficiency of criminal investigations. This paper proposes an algorithm to extract, merge and rank identities that are encountered in the electronic evidence during processing. Two experiments are described demonstrating that our approach can assist with the identification of frequently occurring identities so that investigators can prioritize the investigation of evidence units accordingly. [details]
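A toy sketch of the extract-merge-rank step, under the simplifying assumption that observations sharing an e-mail address denote the same person (merge_identities and the field names are hypothetical, not the paper's algorithm):

    from collections import defaultdict

    def merge_identities(observations):
        # observations: dicts like {"name": ..., "email": ..., "source": ...}.
        # Merge on a shared e-mail address (or name, when no e-mail is known)
        # and rank merged identities by how often they occur in the evidence.
        merged = defaultdict(lambda: {"names": set(), "sources": set(), "count": 0})
        for obs in observations:
            key = obs.get("email") or obs["name"].lower()
            identity = merged[key]
            identity["names"].add(obs["name"])
            identity["sources"].add(obs["source"])
            identity["count"] += 1
        return sorted(merged.values(), key=lambda i: i["count"], reverse=True)

    evidence = [
        {"name": "J. Smith", "email": "js@example.com", "source": "laptop"},
        {"name": "John Smith", "email": "js@example.com", "source": "phone"},
        {"name": "A. Jones", "email": None, "source": "laptop"},
    ]
    for identity in merge_identities(evidence):
        print(identity["names"], identity["count"], identity["sources"])

Frequently occurring identities surface at the top of the ranking, which is what lets investigators prioritize evidence units.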
Both papers will be presented at the FORTAN 2013 workshop, 12 August 2013, Uppsala, Sweden.

Monday, December 10th, 2012

Brend Wanders, a PhD student of mine, presents a poster at the BeNeLux Bioinformatics Conference (BBC 2012) in Nijmegen.
Pay-as-you-go data integration for bio-informatics
Brend Wanders
Background: Scientific research in bio-informatics is often data-driven and supported by biological databases. In a growing number of research projects, researchers want to ask questions that require the combination of information from more than one database. Most bio-informatics papers do not detail the integration of different databases, yet as roughly 30% of all tasks in workflows are data transformation tasks, database integration is an important issue. Integrating multiple data sources can be difficult: as data sources are created, their creators make many design decisions.
Methods: Our research is guided by two use cases: homologues, the representation and integration of groupings; and metabolomics integration, with a focus on the TCA cycle.
Results: We propose to approach the time consuming problem of integrating multiple biological databases through the principles of ‘pay-as-you-go’ and ‘good-is-good-enough’. By assisting the user in defining a knowledge base of data mapping rules, trust information and other evidence we allow the user to focus on the work, and put in as little effort as is necessary for the integration. Through user feedback on query results and trust assessments, the integration can be improved upon over time.
Conclusions: We conclude that this direction of research is worthy of further exploration. [details]
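A minimal sketch of what a pay-as-you-go rule base might look like (the RuleBase class and its update constants are assumptions for illustration, not the actual system): mapping rules carry a trust score that user feedback on query results adjusts over time.

    class RuleBase:
        # Knowledge base of data mapping rules with trust scores in [0, 1].
        def __init__(self):
            self.rules = {}  # rule id -> (description, trust)

        def add_rule(self, rule_id, description, trust=0.5):
            # New rules start with modest trust; no up-front curation needed.
            self.rules[rule_id] = (description, trust)

        def feedback(self, rule_id, correct):
            # Nudge trust up or down after the user judges a query result
            # that depended on this rule (the constants are arbitrary here).
            description, trust = self.rules[rule_id]
            trust = min(1.0, trust + 0.1) if correct else max(0.0, trust - 0.2)
            self.rules[rule_id] = (description, trust)

    kb = RuleBase()
    kb.add_rule("r1", "geneDB.symbol maps to proteinDB.gene_name")
    kb.feedback("r1", correct=True)  # integration improves as feedback arrives
    print(kb.rules["r1"])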

Wednesday, November 21st, 2012

Brend Wanders, a PhD student of mine, presents his research at the Dutch-Belgian Database Day (DBDBD 2012) in Brussels.
Pay-as-you-go data integration for bio-informatics
Brend Wanders
Scientific research in bio-informatics is often data-driven and supported by numerous biological databases. A biological database contains factual information collected from scientific experiments and computational analyses about areas including genomics, proteomics, metabolomics, microarray gene expression and phylogenetics. Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations as well as similarities of biological sequences and structures.
In a growing number of research projects, bio-informatics researchers want to ask combined questions, i.e., questions that require the combination of information from more than one database. We have observed that most bio-informatics papers do not go into detail on the integration of different databases. Since roughly 30% of all tasks in bio-informatics workflows are data transformation tasks, a lot of time is spent integrating these databases [1].
As data sources are created and evolve, many design decisions are made by their creators. Not all of these choices are documented. Some of these choices are made implicitly, based on the experience or preferences of the creator. Other choices are mandated by the purpose of the data source, or stem from inherent data quality issues such as imprecision in measurements or ongoing scientific debates. This makes integrating multiple data sources difficult.
We propose to approach the time-consuming problem of integrating multiple biological databases through the principles of ‘pay-as-you-go’ and ‘good-is-good-enough’. By assisting the user in defining a knowledge base of data mapping rules, schema alignment, trust information and other evidence we allow the user to focus on the work, and put in as little effort as is necessary for the integration to serve the purposes of the user. By using user feedback on query results and trust assessments, the integration can be improved upon over time.
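As an illustration of the ‘good-is-good-enough’ principle, here is a sketch under assumed toy data (the sources, the mapping rule and combined_query are invented for illustration): a combined query is answered through a mapping rule, and each answer carries the trust of the mappings it used, so results are usable before the integration is perfect.

    # Two toy sources linked by one mapping rule with a trust score.
    source_a = {"G1": "succinate dehydrogenase"}         # gene id -> enzyme
    source_b = {"succinate dehydrogenase": "TCA cycle"}  # enzyme -> pathway
    mapping_trust = 0.7  # current trust in the rule linking the two schemas

    def combined_query(gene_id):
        # Answers carry the trust of the mappings they used, so the user can
        # judge whether the integration is already good enough.
        enzyme = source_a.get(gene_id)
        pathway = source_b.get(enzyme) if enzyme is not None else None
        return (gene_id, enzyme, pathway, mapping_trust)

    print(combined_query("G1"))  # ('G1', 'succinate dehydrogenase', 'TCA cycle', 0.7)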
The research will be guided by a set of use cases. As the research is in its early stages, we have determined three use cases:

  • Homologues, the representation and integration of groupings. Homology is the relationship between two characteristics that have descended, usually with divergence, from a common ancestral characteristic. A characteristic can be any genic, structural or behavioural feature of an organism.

  • Metabolomics integration, with a focus on the TCA cycle. The TCA cycle (also known as the citric acid cycle, or Krebs cycle) is used by aerobic organisms to generate energy from the oxidation of carbohydrates, fats and proteins.
  • Bibliography integration and improvement, the correction and expansion of citation databases.

[1] I. Wassink. Work flows in life science. PhD thesis, University of Twente, Enschede, January 2010. [details]

Thursday, March 8th, 2012

A journal paper with my vision on data interoperability and a basic formalization has been accepted for a special issue of the Journal of IT, volume 54, issue 3.
Managing Uncertainty: The Road Towards Better Data Interoperability.
Maurice van Keulen
Data interoperability encompasses the many data management activities needed for effective information management in anyone's or any organization's everyday work, such as data cleaning, coupling, fusion, mapping, and information extraction. It is our conviction that a significant amount of the money and time in IT devoted to these activities goes into dealing with one problem: “semantic uncertainty”. Sometimes data is subjective, incomplete, not current, or incorrect; sometimes it can be interpreted in different ways; etc. In our opinion, clean correct data is only a special case, hence data management technology should treat data quality problems as a fact of life, not as something to be repaired afterwards. Recent approaches treat uncertainty as an additional source of information which should be preserved to reduce its impact. We believe that the road towards better data interoperability is to be found in teaching our data processing tools and systems about all forms of doubt and how to live with them. In this paper, we show for several data interoperability use cases (deduplication, data coupling/fusion, and information extraction) how to formally model the associated data quality problems as semantic uncertainty. Furthermore, we provide an argument why our approach leads to better data interoperability in terms of natural problem exposure and risk assessment, more robustness and automation, reduced development costs, and potential for natural and effective feedback loops leveraging human attention.
[details]
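
The possible-worlds style of modelling that underlies this line of work can be sketched as follows (the toy relation and probabilities are invented for illustration): doubtful data is not repaired up front; every plausible interpretation is kept in a world with a probability, and the probability of a fact is the total probability of the worlds containing it.

    # Toy possible-worlds representation of an uncertain fact.
    possible_worlds = [
        # (world probability, facts holding in that world)
        (0.6, [("p1", "affiliation", "University of Twente")]),
        (0.4, [("p1", "affiliation", "Universiteit Twente")]),
    ]

    def probability(fact):
        # P(fact) = sum of the probabilities of the worlds containing it.
        return sum(p for p, facts in possible_worlds if fact in facts)

    print(probability(("p1", "affiliation", "University of Twente")))  # 0.6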

Thursday, September 1st, 2011

A master student performed a problem exploration for the PayDIBI project. This is the report he wrote.
Integration of Biological Sources – Exploring the Case of Protein Homology
Tjeerd W. Boerman, Maurice van Keulen, Paul van der Vet, Edouard I. Severing (Wageningen University)
Data integration is a key issue in the domain of bioinformatics, which deals with huge amounts of heterogeneous biological data that grow and change rapidly. This paper serves as an introduction to the field of bioinformatics and the biological concepts it deals with, and an exploration of the integration problems a bioinformatics scientist faces. We examine ProGMap, an integrated protein homology system used by bioinformatics scientists at Wageningen University, and several use cases related to protein homology. A key issue we identify is the huge manual effort required to unify source databases into a single resource. Uncertain databases are able to contain several possible worlds, and it has been proposed that they can be used to significantly reduce initial integration efforts. We propose several directions for future work where uncertain databases can be applied to bioinformatics, with the goal of furthering the cause of bioinformatics integration.
[details]

Monday, January 24th, 2011

I have a vacancy for a PhD position in a project called “Pay-As-You-Go Data Integration for Bio-Informatics” (PayDIBI). In short, the objective is to develop data coupling and integration technology that supports bio-informatics scientists in quickly constructing targeted data sets for research questions that require the combination of information from more than one biological database. More information and a web form to apply can be found here.

Thursday, October 21st, 2010

Fabian Panse from the University of Hamburg in Germany just launched a website about our cooperation on the topic of “Quality of Uncertain Data (QloUD)”.

Friday, January 22nd, 2010

On Friday 22 January 2010, Michiel Punter defended his MSc thesis “Multi-Source Entity Resolution”. The MSc project was supervised by me, Ander de Keijzer, and Riham Abdel Kader.

“Multi-Source Entity Resolution” [download]
Background: The focus of this research was on multi-source entity resolution in the setting of pair-wise data integration. In contrast to most existing approaches to entity resolution, this research does not consider matching to be transitive. A consequence of this is that entity resolution on multiple sources is not guaranteed to be associative. The goal of this research was to construct a generic model for multi-source entity resolution in the setting of pair-wise data integration that is associative.
Results: The main contributions of this research are: (1) a formal model for multi-source entity resolution and (2) strategies that can be used to resolve matching conflicts in a way that renders multi-source entity resolution associative. The possible worlds semantics is used to handle uncertainty originating from possible matches. The presented model is generic enough to allow different match and merge functions as well as different strategies to resolve matching conflicts.
Conclusions: A formalization of an example of multi-source entity resolution is presented to show the utility of the proposed model. Using small examples in which three sources are integrated, it is shown that the strategies result in associative behaviour of the integrate function.
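To illustrate what associativity of a pairwise integrate function means, here is a toy sketch (not the thesis's model: the transitive name-overlap clustering below is a deliberately simple strategy that happens to be associative, whereas the thesis targets the harder, non-transitive case): integrating three sources in either grouping yields the same entities.

    def entity(name, src):
        # An entity is a frozenset of (name, source) records.
        return frozenset({(name, src)})

    def names(ent):
        return {name for name, _ in ent}

    def integrate(source1, source2):
        # Pairwise integration: cluster entities whose name sets overlap
        # and merge each cluster into one entity (union of its records).
        clusters = []
        for ent in source1 | source2:
            matching = [c for c in clusters if names(c) & names(ent)]
            merged = ent
            for c in matching:
                merged |= c
            clusters = [c for c in clusters if c not in matching]
            clusters.append(merged)
        return set(clusters)

    a = {entity("smith", "A")}
    b = {entity("smith", "B"), entity("jones", "B")}
    c = {entity("jones", "C")}

    # The grouping of the pairwise integrations does not affect the outcome.
    print(integrate(integrate(a, b), c) == integrate(a, integrate(b, c)))  # True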

Wednesday, November 25th, 2009

As a product of my cooperation with Fabian Panse from the University of Hamburg, we got a paper accepted at the NTII workshop co-located with ICDE 2010.
Duplicate Detection in Probabilistic Data
Fabian Panse, Maurice van Keulen, Ander de Keijzer, Norbert Ritter
Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities.
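To give a flavour of matching probabilistic representations (an assumed toy model, not the paper's actual techniques: each attribute is a distribution over possible values), the probability that two independently drawn attribute values agree can serve as a match score for duplicate detection.

    def expected_match(dist1, dist2):
        # dist: {value: probability}. Returns P(v1 == v2) for independent draws.
        return sum(p1 * dist2.get(value, 0.0) for value, p1 in dist1.items())

    # Two probabilistic tuples that may represent the same real-world person.
    tuple1 = {"name": {"J. Smith": 0.7, "John Smith": 0.3}}
    tuple2 = {"name": {"John Smith": 0.8, "J. Smyth": 0.2}}

    score = expected_match(tuple1["name"], tuple2["name"])
    print(f"P(same name value): {score:.2f}")  # 0.3 * 0.8 = 0.24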

The paper will be presented at the Second International Workshop on New Trends in Information Integration (NTII 2010), Long Beach, California, USA. [details]