Archive for the Category » PayDIBI «

Wednesday, October 14th, 2015 | Author:

Dolf Trieschnigg and I got some subsidy to valorize some of the research results of the COMMIT/ TimeTrails, PayDIBI, and FedSS projects. Company involved is Mydatafactory.
SmartCOPI: Smart Consolidation of Product Information
[download public version of project proposal]
Maintaining the quality of detailed product data, ranging from data about required raw materials to detailed specifications of tools and spare parts, is of vital importance in many industries. Ordering or using wrong spare parts (based on wrong or incomplete information) may result in significant production loss or even impact health and safety. The web provides a wealth of information on products provided in various formats, detail levels, targeted at at a variety of audiences. Semi- automatically locating, extracting and consolidating this information would be a “killer app” for enriching and improving product data quality with a significant impact on production cost and quality. The new to COMMIT/ industry partner Mydatafactory is interested in both the web harvesting and data cleansing technologies developed in COMMIT/-projects P1/Infiniti and P19/TimeTrails for this potential and for improving Mydatafactory’s data cleansing services. The ICT science questions behind data cleansing and web harvesting are how noise can be detected and reduced in discrete structured data, and how human cognitive skills in information navigation and extraction can be mimicked. Research results on these questions may benefit a wide range of applications from various domains such as fraud detection and forensics, creating a common operational picture, and safety in food and pharmaceuticals.

Monday, December 10th, 2012 | Author:

Brend Wanders, a PhD student of mine, presents a poster at the BeNeLux Bioinformatics Conference (BBC 2012) in Nijmegen.
Pay-as-you-go data integration for bio-informatics
Brend Wanders
Background: Scientific research in bio-informatics is often data-driven and supported by biological databases. In a growing number of research projects, researchers like to ask questions that require the combination of information from more than one database. Most bio-informatics papers do not detail the integration of different databases. As roughly 30% of all tasks in workflows are data transformation tasks, database integration is an important issue. Integrating multiple data sources can be difficult. As data sources are created, many design decisions are made by their creators.
Methods: Our research is guided by two use cases: homologues, the representation and integration of groupings; metabolomics integration, with a focus on the TCA cycle
Results: We propose to approach the time consuming problem of integrating multiple biological databases through the principles of ‘pay-as-you-go’ and ‘good-is-good-enough’. By assisting the user in defining a knowledge base of data mapping rules, trust information and other evidence we allow the user to focus on the work, and put in as little effort as is necessary for the integration. Through user feedback on query results and trust assessments, the integration can be improved upon over time.
Conclusions: We conclude that this direction of research is worthy of further exploration. [details]

Wednesday, November 21st, 2012 | Author:

Brend Wanders, a PhD student of mine, presents his research at the Dutch-Belgian Database Day (DBDBD 2012) in Brussels.
Pay-as-you-go data integration for bio-informatics
Brend Wanders
Scientific research in bio-informatics is often data-driven and supported by numerous biological databases. A biological database contains factual information collected from scientific experiments and computational analyses about areas including genomics, proteomics, metabolomics, microarray gene expression and phylogenetics. Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations as well as similarities of biological sequences and structures.
In a growing number of research projects, bio-informatics researchers like to ask combined ques- tions, i.e., questions that require the combination of information from more than one database. We have observed that most bio-informatics papers do not go into detail on the integration of different databases. It has been observed that roughly 30% of all tasks in bio-informatics workflows are data transformation tasks, a lot of time is used to integrate these databases (shown by [1]).
As data sources are created and evolve, many design decisions made by their creators. Not all of these choices are documented. Some of such choices are made implicitly based on experience or preference of the creator. Other choices are mandated by the purpose of the data source, as well as inherent data quality issues such as imprecision in measurements, or ongoing scientific debates. Integrating multiple data sources can be difficult.
We propose to approach the time-consuming problem of integrating multiple biological databases through the principles of ‘pay-as-you-go’ and ‘good-is-good-enough’. By assisting the user in defin- ing a knowledge base of data mapping rules, schema alignment, trust information and other evidence we allow the user to focus on the work, and put in as little effort as is necessary for the integration to serve the purposes of the user. By using user feedback on query results and trust assessments, the integration can be improved upon over time.
The research will be guided by a set of use cases. As the research is in its early stages, we have determined three use cases:

  • Homologues, the representation and integration of groupings. Homology is the relationship between two characteristics that have descended, usually with divergence, from a common ancestral characteristic. A characteristic can be any genic, structural or behavioural feature of an organism

  • Metabolomics integration, with a focus on the TCA cycle. The TCA cycle (also known as the citric acid cycle, or Krebs cycle) is used by aerobic organism to generate energy from the oxidation of carbohydrates, fats and proteins.
  • Bibliography integration and improvement, the correction and expansion of citation databases.

[1] I. Wassink. Work flows in life science. PhD thesis, University of Twente, Enschede, January 2010. [details]

Thursday, September 01st, 2011 | Author:

A master student performed a problem exploration for the PayDIBI project. This is the report he wrote.
Integration of Biological Sources – Exploring the Case of Protein Homology
Tjeerd W. Boerman, Maurice van Keulen, Paul van der Vet, Edouard I. Severing (Wageningen University)
Data integration is a key issue in the domain of bioin- formatics, which deals with huge amounts of heterogeneous biological data that grows and changes rapidly. This paper serves as an introduction in the field of bioinformatics and the biological concepts it deals with, and an exploration of the integration problems a bioinformatics scientist faces. We examine ProGMap, an integrated protein homology system used by bioinformatics scientists at Wageningen University, and several use cases related to protein homology. A key issue we identify is the huge manual effort required to unify source databases into a single resource. Uncertain databases are able to contain several possible worlds, and it has been proposed that they can be used to significantly reduce initial integration efforts. We propose several directions for future work where uncertain databases can be applied to bioinformatics, with the goal of furthering the cause of bioinformatics integration.
[details]

Monday, January 24th, 2011 | Author:

I have a vacancy for a PhD position in a project called “Pay-As-You-Go Data Integration for Bio-Informatics” (PayDIBI). In short, the objective is to develop data coupling and integration technology to support bio-informatics scientists in quickly constructing targeted data sets for researching questions that require the combination of information from more than one biological database. More information and a webform to apply can be found here.