Understanding biological processes requires access to many complex datasets. To draw reliable conclusions from these data, it is necessary to integrate the different -omics spaces or scales that correspond to different domains of knowledge (e.g. genomics, transcriptomics, proteomics, metabolomics, structural biology, cellular biology). However, constant progress in scientific technologies (e.g. next-generation sequencing (NGS), microarrays, imaging, high-throughput phenotyping) and in simulation tools that foster in silico experimentation is creating a data overload. In particular, the increasing number of completed genome sequences raises new challenges in the discovery of gene functions. It is therefore crucial for modern biology to be able to integrate large amounts of heterogeneous data, with different formats and semantics, and to manipulate them through complex workflows. This requires new, automated methods and tools for data integration and workflow management that enable users with different backgrounds and interests to easily integrate and manipulate various datasets.
The overall goal of this work package is to make data and knowledge in plant biology easier for scientists to access, reproduce, and share at large scale. We address this challenge by pursuing three complementary research directions: integration of distributed, heterogeneous data using metadata and ontologies; distributed workflow execution involving distributed processes and large amounts of heterogeneous data, with support for data provenance (lineage) to help users understand how result data were produced; and integrated analysis of biological data.