
Acknowledgement:


This work was done in collaboration with: Claire Lemaitre and Pierre Peterlongo, GenScale team, INRIA Rennes Bretagne-Atlantique, IRISA

Abstract:


Structural variation in genomes can be involved in inherited diseases, evolution and speciation. The extent of structural variation in populations has only recently been acknowledged, thanks mainly to 'next generation sequencing' (NGS).

Due to the small size of the 'reads', these variants are difficult to identify. Most methods proposed so far rely on 'mapping' the reads to a 'reference genome'. The main approach calls structural variant breakpoints when mapped read pairs show discordant mappings with respect to the expected insert size and orientation of the reads. Due mainly to repetitions in complex genomes and to mapping errors, these methods suffer from high false positive rates and a small overlap between the predictions obtained by different methods. Notably, 'copy number variations' seem to have attracted most of the attention and effort, whereas balanced structural variants such as inversions have been less investigated, suggesting that the latter are even more difficult to detect in short read data.
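
The discordant-pair criterion described above can be illustrated with a minimal sketch. It is not taken from any particular mapping-based tool: the function name, its parameters, the forward/reverse orientation convention and the three-standard-deviation threshold are all assumptions made for illustration.

```python
def classify_pair(pos1, strand1, pos2, strand2, read_len,
                  mean_insert, sd_insert, n_sd=3.0):
    """Classify one mapped read pair (hypothetical helper, illustration only).

    Assumes a paired-end library in which a concordant pair maps with the
    leftmost mate on the '+' strand and the rightmost mate on the '-' strand.
    Positions are leftmost mapping coordinates on the same chromosome.
    """
    # Order the two mates by their leftmost mapping position.
    (lpos, lstrand), (rpos, rstrand) = sorted([(pos1, strand1), (pos2, strand2)])

    # Unexpected orientation: a possible signature of an inversion.
    if (lstrand, rstrand) != ("+", "-"):
        return "orientation"

    # Observed insert size: from the start of the left mate to the end of the right mate.
    insert = rpos + read_len - lpos
    if abs(insert - mean_insert) > n_sd * sd_insert:
        return "insert_size"

    return "concordant"
```

In such a scheme, a breakpoint would be called only where several pairs with the same kind of discordance cluster together; repeats and mapping errors also produce isolated discordant pairs, which is one source of the false positives mentioned above.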

All these approaches rely on a 'reference genome' and on a first mapping step. This is a strong limitation when dealing with organisms for which no reference genome is available, or for which the available one is of poor quality or too distantly related. Alternatively, one can perform a full 'de novo assembly' of the re-sequenced genomes and compare the resulting assemblies. However, de novo assembly remains a difficult and resource-intensive task, and this cost could be reduced by directly targeting inversion variants.

The problem we address is therefore: can we identify inversion signatures directly in raw NGS reads, without the need for a reference genome or a full assembly of the reads?

We propose a model and an algorithm for detecting inversion breakpoints without a reference genome, directly from raw NGS data. This model is characterized by a fixed-size topological pattern in the de Bruijn graph (dBg). We describe precisely the sources of false positives and false negatives, and we additionally propose a sequence-based filter optimizing the precision/recall results. We implemented these ideas in two prototypes, called InTl and TakeABreak. They were applied to simulated repeats from genomes of various complexity (E. coli to Human dataset). We obtained promising results with a low memory footprint and a small computation time.
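
To make the de Bruijn graph setting concrete, here is a minimal sketch of how an inversion leaves a local trace in such a graph. It is not the pattern search implemented in InTl or TakeABreak: it only builds a small graph over both strands and lists branching k-mers. All function names and the toy sequences in the usage note are assumptions made for illustration.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def build_dbg(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it.

    Every read is scanned on both strands, because the strand of origin
    of a sequencing read is unknown.
    """
    graph = defaultdict(set)
    for read in reads:
        for strand in (read, reverse_complement(read)):
            for i in range(len(strand) - k):
                graph[strand[i:i + k]].add(strand[i + 1:i + k + 1])
    return graph


def branching_nodes(graph):
    """Return the k-mers that have more than one successor."""
    return {kmer for kmer, successors in graph.items() if len(successors) > 1}


def inversion_pair(left, segment, right):
    """Return a toy sequence and its variant with `segment` inverted."""
    return left + segment + right, left + reverse_complement(segment) + right
```

For example, with `original, inverted = inversion_pair("GATTACAG", "CCGTGA", "TTGCA")` and `k = 4`, the graph built from both sequences has the k-mer `ACAG` (the last k-mer of the left flank) followed by both `CAGC` and `CAGT`, so it shows up in `branching_nodes`. Branching nodes alone are not specific to inversions, since repeats and sequencing errors create them as well, which is why a more constrained pattern and a sequence-based filter are needed.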

Reference:


Proceedings of the 1st International Conference on Algorithms for Computational Biology (AlCoB 2014), Tarragona, Spain, July 1-3, 2014, A.-H. Dediu, C. Martín-Vide, B. Truthe (eds.), LNCS 8542, Springer Verlag, pages 119-130. http://link.springer.com/chapter/10.1007%2F978-3-319-07953-0_10#page-2

AbstractLiviuCiortuz (last edited 2014-11-14 10:41:41 by DanielaZaharie)