Because of the volume of data produced by high-throughput sequencing (HTS), the bottleneck now lies in the bioinformatics analyses. Apart from mapping reads to a reference genome, the algorithmics of HTS data analysis has been little explored to date. All life sciences communities urgently need new methods for general tasks such as indexing, compressing, or comparing read sets, and for more specialized ones such as transcriptomic or genomic variation (SNP) analyses.
The main focus here is on transcriptome data obtained by RNA sequencing (RNA‐seq), because unraveling the full complexity of transcriptomes remains a major challenge. Single mRNA variants can drastically influence tissue metabolism, and the characterization of alternative splicing events and tissue‐related expression appears to be more complex than expected. Moreover, non‐coding RNAs (ncRNAs) are spread along the entire genome. Ultimately, one also wishes to identify potential fusion RNAs, which are known to drive carcinogenesis and may serve as highly specific markers, as in Chronic Myeloid Leukemia (CML), but could also be functional in normal tissues.
Current methods capture transcriptome complexity only in part, and the most precise ones require combining multiple approaches and integrating complementary sources of information. At present, RNA‐seq analysis accumulates the errors made at each stage of multi‐step procedures and suffers from mapping limitations (cross‐mapping, false‐positive matches). As a result, tools like TopHat lack precision in exon boundary detection, a limitation that paired‐end reads cannot resolve.
The entanglement of transcriptomes and the need for scalability call for algorithmic innovations and new analysis approaches that combine flexibility, efficiency and specificity. Techniques such as indexing and algorithmics, robust and statistically controlled analysis, and data integration (see WP5‐Databases) will help to overcome these limitations.
Follow WP1 events
Ronnie Alves (WP1) gave a tutorial on "The Pervasiveness of Machine Learning in Omics Science" at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD 2014). This event is the premier European machine learning and data mining conference and builds upon a very successful...