Hypothesis-free NGS data analysis

Friday 9th March 2018 at 2 pm, IBC Campus St Priest BAT5-03.124

Dr. Daniel Gautheret
Institute for Integrative Biology of the Cell
Université Paris-Sud - CNRS - CEA
Orsay, France

Computational pipelines for NGS data analysis involve multiple hypotheses and simplifications that lead to a substantial loss of information. For instance, a major limiting factor is the mapping step, in which NGS reads are aligned to a reference genome or transcriptome. In RNA-seq analysis, relying on a reference transcriptome amounts to ignoring novel genes, alternative transcripts, and transcripts from repeats or with high levels of mutation or editing. Hundreds of dedicated software tools have been developed to bypass these limitations and retrieve specific event types, with highly divergent results.

We have developed a method for RNA-seq data analysis, DE-kupl (1), in which NGS data are analysed at the level of raw sequence using k-mers (i.e. subsequences of length k, typically with k=31), followed by differential expression analysis. Only k-mers that are differentially represented between two sets of libraries are extracted and analysed. Therefore, all biological variation present in the original NGS dataset is theoretically captured, with no prior hypothesis about its origin.
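
To make the k-mer idea concrete, here is a minimal Python sketch of the general principle: decompose reads into k-mers, then keep those whose abundance differs between two sets of libraries. It is an illustration only, not the DE-kupl implementation. The function names, the fold-change criterion, the pseudocount and the thresholds are all assumptions chosen for brevity; a real analysis would rely on a dedicated k-mer counter, normalisation and a proper statistical test.

```python
from collections import Counter


def count_kmers(reads, k=31):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def differential_kmers(group_a, group_b, k=31, min_fold=2.0, pseudocount=1.0):
    """Return {k-mer: fold change} for k-mers that differ between two groups.

    group_a, group_b: lists of libraries, each library being a list of reads.
    The fold-change filter is a stand-in (assumption) for a real
    differential expression test.
    """
    counts_a = [count_kmers(lib, k) for lib in group_a]
    counts_b = [count_kmers(lib, k) for lib in group_b]
    # Union of all k-mers observed in any library, with no reference needed.
    all_kmers = set().union(*counts_a, *counts_b)

    selected = {}
    for kmer in all_kmers:
        mean_a = sum(c[kmer] for c in counts_a) / len(counts_a)
        mean_b = sum(c[kmer] for c in counts_b) / len(counts_b)
        fold = (mean_a + pseudocount) / (mean_b + pseudocount)
        if fold >= min_fold or fold <= 1.0 / min_fold:
            selected[kmer] = fold
    return selected


if __name__ == "__main__":
    # Toy data: the two conditions differ by a single base (T versus A).
    condition_a = [["ACGTAC"], ["ACGTAC"]]
    condition_b = [["ACGAAC"], ["ACGAAC"]]
    for kmer, fold in sorted(differential_kmers(condition_a, condition_b, k=3).items()):
        print(kmer, fold)
```

In this toy example, the k-mers overlapping the single-base difference are retained, while the shared k-mer ACG is discarded. No reference genome or transcriptome is involved at any point.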

We will show how DE-kupl can be applied to various experimental settings and present our plans for future developments, including application to the discovery of novel biomarkers based on clinically annotated DNA-seq or RNA-seq data.

Reference:
(1) Audoux J, Philippe N, Chikhi R, Salson M, Gallopin M, Gabriel M, Le Coz J, Commes T, Gautheret D. (2017) DE-kupl: Exhaustive capture of biological variation in RNA-seq data through k-mer decomposition. Genome Biol. 18: 243.

IBC seminars