by Matthijs Ooms
Scientific Workflow Management Systems (SWfMSs), such as our own
research prototype e-BioFlow, are being used by bioinformaticians to design
and run data-intensive experiments, connecting local and remote (Web)
services and tools. Preserving data for later inspection or reuse helps
determine the quality of results, and validating results is essential for
scientific experiments. All of this can be achieved by collecting provenance
data. The dependencies between services and data are captured in a
provenance model, such as the interchangeable Open Provenance Model
(OPM). This research has the following two provenance-related goals:
- Using a provenance archive effectively and efficiently as a cache for
workflow tasks.
- Designing techniques to support browsing and navigation through
a provenance archive.
The use case identified is called OligoRAP, taken from the life science
domain. OligoRAP was cast as a workflow in the SWfMS e-BioFlow. Its
performance in terms of duration was measured and its results validated
by comparing them to the results of the original Perl implementation. By
casting OligoRAP as a workflow and using parallelism, its performance
is improved by a factor of two.
Many improvements were made to e-BioFlow in order to run OligoRAP,
among which a new provenance implementation based on the OPM, enabling
provenance capturing during the execution of OligoRAP in e-BioFlow.
During this research, e-BioFlow has grown from a proof-of-concept
to a powerful research prototype.
For the OPM implementation, a profile for the OPM has been proposed
that defines how provenance data is collected during workflow
enactment. The proposed
profile maintains the hierarchical structure of (sub)workflows in the
collected provenance data. With this profile, interoperability of the
OPM for SWfMS is improved.
A caching strategy is proposed for caching workflow tasks and is
implemented in e-BioFlow. It queries the OPM implementation for previous
task executions. The queries are optimised by formulating them
differently and creating several indices. The performance improvement of
each optimisation was measured using a query set taken from an OligoRAP
cache run. Three tasks in OligoRAP were cached, resulting in an additional
performance improvement of 19%. A provenance archive based on the
OPM can be used to effectively cache workflow tasks.
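The caching idea described above can be illustrated with a minimal Python sketch. It is not the e-BioFlow implementation: the SQLite store, the table layout, and the names `task_key` and `ProvenanceCache` are all assumptions made for illustration. It shows only the general pattern of querying an archive of previous task executions before running a task, with an index on the lookup key.

```python
import hashlib
import json
import sqlite3


def task_key(task_name, inputs):
    """Hash a task name plus its inputs into a stable lookup key (hypothetical scheme)."""
    payload = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ProvenanceCache:
    """Toy archive of task executions; stands in for a real provenance store."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS executions (key TEXT, output TEXT)"
        )
        # An index on the lookup column keeps cache queries cheap as the archive grows.
        self.db.execute(
            "CREATE INDEX IF NOT EXISTS idx_executions_key ON executions (key)"
        )
        self.hits = 0
        self.misses = 0

    def run(self, task_name, func, inputs):
        """Return the archived output for (task, inputs) or execute and archive it."""
        key = task_key(task_name, inputs)
        row = self.db.execute(
            "SELECT output FROM executions WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            self.hits += 1
            return json.loads(row[0])
        self.misses += 1
        output = func(**inputs)
        self.db.execute(
            "INSERT INTO executions (key, output) VALUES (?, ?)",
            (key, json.dumps(output)),
        )
        self.db.commit()
        return output
```

A task is only a good candidate for this kind of caching if it is deterministic, so that identical inputs always yield the same output; the sketch also assumes inputs and outputs are JSON-serialisable.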
A provenance browser is introduced that incorporates several techniques
to help users browse through large provenance archives. Its primary
visualisation is the graph representation specified by the OPM.
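To make the graph representation concrete, here is a small Python sketch of an OPM-style graph. The node kinds (artifact, process, agent) and the five edge labels come from the OPM; the class `OPMGraph`, its methods, and the example node names are hypothetical and unrelated to the browser's actual code. The `ancestors` method shows the kind of backward traversal a provenance browser might use to find everything a result depends on.

```python
from collections import defaultdict

# Dependency edge labels defined by the OPM; edges point from effect to cause.
OPM_EDGES = {
    "used",
    "wasGeneratedBy",
    "wasControlledBy",
    "wasTriggeredBy",
    "wasDerivedFrom",
}
OPM_NODE_KINDS = {"artifact", "process", "agent"}


class OPMGraph:
    """Minimal in-memory provenance graph (illustrative only)."""

    def __init__(self):
        self.kinds = {}                 # node id -> node kind
        self.edges = defaultdict(list)  # effect id -> [(label, cause id)]

    def add_node(self, node_id, kind):
        if kind not in OPM_NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[node_id] = kind

    def add_edge(self, effect, label, cause):
        if label not in OPM_EDGES:
            raise ValueError(f"unknown edge label: {label}")
        self.edges[effect].append((label, cause))

    def ancestors(self, node_id):
        """Return every node that node_id depends on, directly or indirectly."""
        seen = set()
        stack = [node_id]
        while stack:
            current = stack.pop()
            for _label, cause in self.edges.get(current, []):
                if cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen


# Example: a process consumed one artifact and produced another.
graph = OPMGraph()
graph.add_node("probe.fasta", "artifact")
graph.add_node("blast", "process")
graph.add_node("hits.xml", "artifact")
graph.add_edge("blast", "used", "probe.fasta")
graph.add_edge("hits.xml", "wasGeneratedBy", "blast")
lineage = graph.ancestors("hits.xml")
```

Here `lineage` contains both the process and its input artifact, which is the chain a user would follow when inspecting how a result was produced.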
More information is available on the e-BioFlow project page at SourceForge, or in Matthijs’ master's thesis in ePrints.