Integration of biology data and knowledge (Databases)

Understanding biological processes requires access to many complex datasets. To draw reliable conclusions from the data, it is necessary to integrate the different -omics spaces or scales corresponding to the different domains of knowledge (e.g. genomics, transcriptomics, proteomics, metabolomics, structural biology, cellular biology). However, constant progress in scientific technologies (e.g. Next Generation Sequencing - NGS, microarrays, imaging, high-throughput phenotyping) and simulation tools (that foster in silico experimentation) creates a huge data overload. In particular, the increasing number of completed genome sequences opens new challenges in the discovery of gene functions. Therefore, it is crucial for modern biology to be able to integrate large amounts of heterogeneous data, with different formats and semantics, and manipulate them through complex workflows. This requires new, automated methods and tools for data integration and workflow management, so that users with different backgrounds and interests can easily integrate and manipulate various datasets.

The overall goal of this workpackage is to make data and knowledge in plant biology easier for scientists to access, reproduce, and share at large scale. We address this challenge by pursuing several complementary research directions: distributed, heterogeneous data integration, using metadata and ontologies; distributed workflow execution involving distributed processes and large amounts of heterogeneous data, with support for data provenance (lineage) to understand result data; and integrated analysis of biological data.

Integrating cell and tissue imaging with Omics Data (Imaging)

The sequence‐driven Omics revolution has excelled at unraveling the gene content and potential gene interactions of many organisms, traditionally referred to in systems biology as the "parts list". However, "genome sequences alone lack spatial and temporal information and are therefore as dynamic and informative as census lists or telephone directories" (Tsien 2003).

The challenge for the 21st century is to spatialize these Omics data into a quantitative description of cells and tissues. High-resolution organ‐wide 3D imaging at cellular resolution on both fixed and living (3D+t) tissue has now become possible, but generates massive volumes of raw data (up to terabytes of images for a single developing embryo). Making sense of these raw data involves further major technological developments in image digitizing and computational image analysis and in the design of databases able to combine spatial and Omics data.

One major goal of these efforts is to generate a quantitative, computable view of the developmental programs of the organism under study.

Follow WP4 events

  • Interdisciplinary spring school on animal and plant morphogenesis

    We are organizing a spring school next spring (Feb 26-March 4, 2017) at the Hameau de l'Etoile, a beautiful estate 30 kilometers north of Montpellier, France. This spring school is intended for early career researchers (masters students to early post-docs) and will focus on the genetic and evolutionary aspects of morphogenesis. We aim to bring together students of...

Scaling‐up evolutionary analyses (Evolution)

The comparison of genetic data (notably at the genomic scale) under an evolutionary perspective naturally follows its acquisition and first treatment (WP1‐HTS). It constitutes the basis for: inferring gene function (in connection with WP3‐Annotation); reconstructing the history of species/populations; elucidating the genetic basis of adaptation (in connection with WP5‐Databases); understanding the dynamics of molecular evolution.

The state of the art relies on the use of probabilistic modeling and advanced algorithmic techniques (e.g. heuristics with performance guarantees, stochastic approaches). Recent years have seen the development of likelihood‐based inferences for population genetics data, using algorithms applicable to the small datasets that have been typical of the field over the last twenty years. However, these algorithms are too slow to handle hundreds of loci or more.

Several more descriptive approaches have been reconsidered to tackle this problem, as well as methods based on refined summaries of the data. An alternative methodology is to obtain distributions of summary statistics by simulation of biological processes for different parameter values. However, even the more recent adaptive versions of this methodology are very far from being able to analyze huge data sets. The challenge is then to scale up these tools to the analysis of modern genomic data, both at the inter‐specific level (e.g. the 10,000 vertebrate genomes project) and at the intra‐specific level (e.g. the 1000 human genomes project).
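The simulation-based methodology mentioned above is commonly known as approximate Bayesian computation (ABC). As an illustration only, here is a minimal rejection-sampling sketch in Python. The toy simulator, the uniform prior, the single summary statistic (mean mutation count per locus) and all function names are assumptions made for this example; they are not the models or tools used in the workpackage.

```python
import random
import statistics


def simulate_summary(theta, n_loci, rng):
    """Toy simulator (hypothetical): each locus has 100 sites, and each site
    mutates independently with probability theta / 100. The summary statistic
    is the mean number of mutations per locus."""
    counts = [
        sum(1 for _ in range(100) if rng.random() < theta / 100)
        for _ in range(n_loci)
    ]
    return statistics.mean(counts)


def abc_rejection(observed, n_loci, n_draws, tolerance, seed=0):
    """Keep the parameter values whose simulated summary statistic falls
    within `tolerance` of the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, 10.0)  # assumed uniform prior on [0, 10]
        simulated = simulate_summary(theta, n_loci, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    posterior_sample = abc_rejection(
        observed=3.0, n_loci=20, n_draws=1000, tolerance=0.3
    )
    print(len(posterior_sample), "accepted draws")
    if posterior_sample:
        print("approximate posterior mean:", statistics.mean(posterior_sample))
```

Every draw requires a full simulation and most draws are rejected, which gives a sense of why the cost grows quickly with the number of loci and the number of summary statistics.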

Another problem with the current approaches is that they largely do not exploit the potential synergy between phylogenetics and population genetics, despite their significant overlaps both in the scope of the research and in the methodology employed.

Structural and functional annotation of proteomes (Annotation)

A huge number of new genomes from a wide range of species are expected in the near future. Today, the growth of sequencing data significantly exceeds the growth of the capacity to analyze these data. In line with the dramatic growth of this information and the urgent need for new bioinformatics tools, our AXE3 deals with the development of new algorithms and software, data integration and workflows to implement the complex processing chains required to analyze proteome data.

Proteins from new genomes are usually annotated based on homology with better‐characterized proteins of already annotated species, using tools such as BLAST, HMMER and the InterPro suite to provide functional annotations in standardized frameworks (e.g. Gene Ontology). However, when applied to genomes that are phylogenetically distant from classic model organisms, this strategy fails to annotate a large part of the proteins. This is especially the case for most human pathogens.

We plan to develop approaches for improving the annotation of protein domains. For example, combining the results obtained on all homologues of the same protein will be one strategy to increase the sensitivity of the procedure. Another approach will be to use 3D structure information and molecular modeling to assess the likelihood of dubious domain occurrences. Special tools will also be developed for the characterization of regions with non-globular structures (arrays of tandem repeats and intrinsically unstructured regions). Conventional approaches developed for globular domains have limited success when applied to these regions, and the existing specialized tools leave considerable room for improvement. Finally, we plan to develop approaches for functional annotation of proteins that integrate structural information (e.g. protein domains and tandem repeats) with other types of data (e.g. gene expression).

Methods for high‐throughput sequencing analysis (HTS)

Due to the volume of data produced by HTS, the bottleneck lies in the bioinformatics analyses. Except for locating reads on a reference genome, the algorithmics for analyzing HTS data has been little explored to date. All life sciences communities urgently need new methods for general tasks like indexing, compressing, or comparing read sets, or for more specialized ones like performing transcriptomic or genomic variation (SNP) analyses.
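To make the general tasks of indexing and comparing read sets more concrete, here is a small Python sketch based on k-mers (substrings of length k). It is only an illustration under simple assumptions: reads are plain strings, the index is an in-memory dictionary, and read sets are compared with the Jaccard similarity of their k-mer sets. The function names are hypothetical, and this is not the method developed in the workpackage.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(reads, k):
    """Map each k-mer to the set of read identifiers (list positions)
    in which it occurs."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(read_id)
    return index


def jaccard(reads_a, reads_b, k):
    """Jaccard similarity between the k-mer sets of two read sets."""
    kmers_a = set()
    for read in reads_a:
        kmers_a |= kmers(read, k)
    kmers_b = set()
    for read in reads_b:
        kmers_b |= kmers(read, k)
    union = kmers_a | kmers_b
    if not union:
        return 0.0
    return len(kmers_a & kmers_b) / len(union)


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG"]
    index = build_index(reads, k=3)
    print(sorted(index["GTA"]))  # reads that contain the k-mer GTA
    print(jaccard([reads[0]], [reads[1]], k=3))
```

A plain dictionary like this does not scale to real HTS read sets, which is precisely where compressed or succinct index structures become necessary.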

The main focus here will be on transcriptome data obtained by RNA sequencing (RNA‐seq). Indeed, unraveling the full complexity of transcriptomes remains a major issue. Single mRNA variants can drastically influence tissue metabolism, and the characterization of alternative splicing events and tissue‐related expression appears to be more complex than expected. Moreover, non‐coding RNAs (ncRNAs) are spread along the entire genome. Ultimately, one also wishes to identify potential fusion RNAs, which are known to drive carcinogenesis and may be highly specific markers, such as in Chronic Myeloid Leukemia (CML), but could also be functional in normal tissues.

Current methods can only partly capture transcriptome complexity, and the most precise methods require combining multiple approaches and integrating complementary sources of information. Presently, RNA‐seq analysis accumulates the errors made in multi‐step procedures and suffers from mapping limitations (cross‐mapping, false positive matching). As a result, tools like TopHat lack precision in exon boundary detection, a problem that paired‐end reads cannot resolve.

The entanglement of transcriptomes and the need for scalability call for algorithmic innovations and new analysis approaches that combine flexibility, efficiency and specificity. Related techniques such as indexing and algorithmics, robust and statistically controlled analysis, and data integration (see WP5‐Databases) will help to overcome these limitations.


Follow WP1 events

  • Tutorial presented in the ECMLPKDD'14 conference

    Ronnie Alves (WP1) gave a tutorial on "The Pervasiveness of Machine Learning in Omics Science" at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD 2014). This event is the premier European machine learning and data mining conference and builds upon a very successful...