Archive for » November, 2012 «

Wednesday, November 21st, 2012 | Author:

Brend Wanders, a PhD student of mine, presents his research at the Dutch-Belgian Database Day (DBDBD 2012) in Brussels.
Pay-as-you-go data integration for bio-informatics
Brend Wanders
Scientific research in bio-informatics is often data-driven and supported by numerous biological databases. A biological database contains factual information collected from scientific experiments and computational analyses about areas including genomics, proteomics, metabolomics, microarray gene expression and phylogenetics. Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations as well as similarities of biological sequences and structures.
In a growing number of research projects, bio-informatics researchers like to ask combined questions, i.e., questions that require the combination of information from more than one database. We have observed that most bio-informatics papers do not go into detail on the integration of different databases. It has been observed that roughly 30% of all tasks in bio-informatics workflows are data transformation tasks [1], which shows that a lot of time is spent integrating these databases.
As data sources are created and evolve, many design decisions are made by their creators. Not all of these choices are documented. Some choices are made implicitly, based on the experience or preferences of the creator. Others are mandated by the purpose of the data source, or by inherent data quality issues such as imprecision in measurements or ongoing scientific debates. As a result, integrating multiple data sources can be difficult.
We propose to approach the time-consuming problem of integrating multiple biological databases through the principles of ‘pay-as-you-go’ and ‘good-is-good-enough’. By assisting the user in defining a knowledge base of data mapping rules, schema alignments, trust information and other evidence, we allow the user to focus on their work and put in as little effort as is necessary for the integration to serve their purposes. By using user feedback on query results and trust assessments, the integration can be improved over time.
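The pay-as-you-go idea of a knowledge base of mapping rules whose trust improves with user feedback can be sketched as follows. This is a minimal illustration, not the actual system: all names (`MappingRule`, the attribute names, the update rule) are hypothetical, and the trust update is a simple exponential nudge chosen for the example.

```python
# Illustrative sketch (all names hypothetical): a knowledge base of
# data mapping rules with trust scores that improve with user feedback.
from dataclasses import dataclass, field

@dataclass
class MappingRule:
    """Maps an attribute in a source database to a target concept."""
    source: str          # e.g. "uniprot.gene_name"
    target: str          # e.g. "kb.gene"
    trust: float = 0.5   # prior trust in this mapping

@dataclass
class KnowledgeBase:
    rules: list = field(default_factory=list)

    def add_rule(self, rule):
        self.rules.append(rule)

    def mappings_for(self, target, threshold=0.0):
        """Return candidate mappings for a concept, best-trusted first."""
        candidates = [r for r in self.rules
                      if r.target == target and r.trust >= threshold]
        return sorted(candidates, key=lambda r: r.trust, reverse=True)

    def feedback(self, rule, correct, rate=0.2):
        """Nudge a rule's trust up or down based on user feedback."""
        rule.trust += rate * ((1.0 if correct else 0.0) - rule.trust)

kb = KnowledgeBase()
r1 = MappingRule("uniprot.gene_name", "kb.gene")
r2 = MappingRule("kegg.gene_symbol", "kb.gene")
kb.add_rule(r1)
kb.add_rule(r2)
kb.feedback(r1, correct=True)   # user confirms a result derived via r1
kb.feedback(r2, correct=False)  # user rejects a result derived via r2
best = kb.mappings_for("kb.gene")[0]
```

The point of the sketch is that the user never curates the whole knowledge base up front: rules start with a default trust, and only the rules that actually matter to the user's queries receive feedback and become more (or less) trusted over time.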
The research will be guided by a set of use cases. As the research is in its early stages, we have determined three use cases:

  • Homologues, the representation and integration of groupings. Homology is the relationship between two characteristics that have descended, usually with divergence, from a common ancestral characteristic. A characteristic can be any genic, structural or behavioural feature of an organism.

  • Metabolomics integration, with a focus on the TCA cycle. The TCA cycle (also known as the citric acid cycle, or Krebs cycle) is used by aerobic organisms to generate energy from the oxidation of carbohydrates, fats and proteins.
  • Bibliography integration and improvement, the correction and expansion of citation databases.

[1] I. Wassink. Work flows in life science. PhD thesis, University of Twente, Enschede, January 2010. [details]

Wednesday, November 21st, 2012 | Author:

Two of my PhD students, Mohammad Khelgati and Victor de Graaff, are presenting at the Dutch-Belgian Database Day (DBDBD): Mohammad on “Size Estimation of Non-Cooperative Data Collections” and Victor on “Semantic Enrichment of GPS Trajectories”.

Category: COMMIT, Information Extraction  | Tags:  | Comments off
Monday, November 19th, 2012 | Author:

The University of Twente is completely reorganizing all its bachelor studies. We are going to adopt a one-module-per-quartile system with more active forms of teaching. I’ve been asked to coordinate the design and realization of the first 15 EC module for the Technical Informatics programme. The team I’ve assembled comprises Arend Rensink, Pieter-Tjerk de Boer, Pascal van Eck, and Jan Kamphuis.

Category: Course Pearls of Computer Science  | Comments off
Wednesday, November 14th, 2012 | Author:

Thales managed to find partial funding for the TEC4SE project (pronounced “tecforce”). The project is about developing an innovative control room for police and fire fighters. I’m going to develop a Twitter analysis component for it, together with the company ENAI.

Category: Information Extraction, Situational awareness  | Tags:  | Comments off
Sunday, November 11th, 2012 | Author:

An MSc student of mine, Jasper Kuperus, was nominated for the ENIAC thesis award for his thesis “Catching criminals by chance: named entity extraction in digital forensics”. Unfortunately, he didn’t win.

Monday, November 5th, 2012 | Author:

Fabian Panse, a PhD student at the University of Hamburg, is visiting me for a week, 17–21 December 2012. He will present “Duplicate detection in probabilistic data” at the DB seminar on Tuesday.

Thursday, November 1st, 2012 | Author:

On 1 November 2012, Jop Hofste defended his MSc thesis “Scalable identity extraction and ranking in Tracks Inspector”. The MSc project was carried out at Fox-IT.
“Scalable identity extraction and ranking in Tracks Inspector”[download]
The digital forensic world deals with a growing amount of data that must be processed. In general, investigators do not have the time to manually analyze all the digital evidence to get a good picture of the suspect, and investigations typically involve multiple evidence units per case. This research shows how identities can be extracted from and resolved in evidence data. Investigators are supported in their investigations by proposing the involved identities to them. These identities are extracted from multiple heterogeneous sources such as system accounts, emails, documents, address books and communication items. Identity resolution is used to merge identities at the case level when multiple evidence units are involved.
The functionality for extracting, resolving and ranking identities was implemented and tested in the forensic tool Tracks Inspector. The implementation was evaluated on five datasets, and the results were compared with two other forensic products, Clearwell and Trident, on the extent to which they support this identity functionality. Tracks Inspector delivers very promising results: it extracts at least as many of the relevant identities in its top-10 as Clearwell and Trident. The tests also show that Tracks Inspector achieves high accuracy, with better precision than Clearwell and approximately equal recall.
The contribution of this research is a method for the extraction and ranking of identities in Tracks Inspector. This is quite a new approach in the digital forensic world, as no other software products support this kind of functionality. Investigations can now start by exploring the most relevant identities in a case, the nodes involved in an identity can be recognized quickly, and the evidence data can be filtered at an early stage.
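The identity-resolution step described in the abstract, merging identities found in different evidence units and ranking them, can be sketched as follows. This is not the Tracks Inspector implementation: the data, the choice of email address as the merge key, and the occurrence-count ranking are all simplifying assumptions for illustration.

```python
# Illustrative sketch (data and merge key hypothetical): resolve identities
# extracted from multiple evidence units by merging records that share an
# identifier (here: an email address), then rank them by occurrence count.
from collections import defaultdict

def resolve_identities(records):
    """records: (name, email, count) tuples gathered from evidence units.
    Identities that share an email are merged; their counts are summed."""
    merged = defaultdict(lambda: {"names": set(), "count": 0})
    for name, email, count in records:
        merged[email]["names"].add(name)
        merged[email]["count"] += count
    # Rank identities by how often they occur across all evidence.
    return sorted(merged.items(), key=lambda kv: kv[1]["count"], reverse=True)

# Identities found in two evidence units of the same case:
unit1 = [("J. Doe", "jdoe@example.com", 12),
         ("A. Smith", "asmith@example.com", 3)]
unit2 = [("John Doe", "jdoe@example.com", 7)]

ranking = resolve_identities(unit1 + unit2)
top_email, top_entry = ranking[0]
```

In this toy example the two spellings “J. Doe” and “John Doe” collapse into one case-level identity because they share an email address; a real system would of course combine several kinds of evidence rather than a single key.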