Title of invention: Assembly of metagenomic sequences
Abstract: Systems and methods for assembly of metagenomic sequences are described herein. In one embodiment, a plurality of metagenomic sequences is represented in three-dimensional space to obtain a plurality of sequence vectors. Based on the plurality of sequence vectors, a cuboid having a plurality of grids is defined in the three-dimensional space such that it encompasses the plurality of metagenomic sequences. Further, the plurality of metagenomic sequences is assembled into one or more contigs based on traversal of the plurality of grids. In one implementation, the one or more contigs are assembled such that a contig includes metagenomic sequences probably originating from the same genome.
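Illustrative note: the grid step summarized in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names `assign_to_grids` and `candidates_for`, the uniform `cell_size` parameter, and the reading of "immediate neighbor" as the 26 surrounding cells (diagonals included) are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import product
import math


def assign_to_grids(points, cell_size):
    """Map each 3D point index to a cubic grid cell.

    Cells are indexed from the minimum corner of the cuboid that
    bounds all points. `cell_size` is an assumed tuning parameter.
    """
    mins = [min(p[d] for p in points) for d in range(3)]
    grids = defaultdict(list)
    for idx, p in enumerate(points):
        cell = tuple(
            int(math.floor((p[d] - mins[d]) / cell_size)) for d in range(3)
        )
        grids[cell].append(idx)
    return dict(grids)


def candidates_for(grids, cell):
    """Return point indices in `cell` and its immediate neighbours.

    Assumption: "immediate neighbour" means any cell whose index
    differs by at most 1 along each axis (up to 26 cells).
    """
    found = []
    for offset in product((-1, 0, 1), repeat=3):
        neighbour = tuple(c + o for c, o in zip(cell, offset))
        found.extend(grids.get(neighbour, []))
    return sorted(found)
```

Sequences whose indices are returned together by `candidates_for` would then be handed to an assembler of choice; restricting each assembly attempt to one cell and its neighbours is what limits the comparisons to sequences that lie close together in the three-dimensional space.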
Publication number: US9372959(B2)
Publication date: 2016.06.21
Application number: US201213484885
Filing date: 2012.05.31
Applicant: Tata Consultancy Services Limited
Inventors: Mande Sharmila Shekhar; Ghosh Tarini Shankar; Mehra Varun
Classification: G06F19/24; G06F19/22
Main classification: G06F19/24
Agency: Lee & Hayes, PLLC
Agent: Lee & Hayes, PLLC
Main claim: 1. A computerized method for assembly of metagenomic sequences comprising:
obtaining sequencing data of a plurality of organisms in an environmental sample to obtain a plurality of metagenomic sequences;
for a metagenomic sequence of the plurality of metagenomic sequences, generating an intermediate vector which represents frequencies with which possible tetra-nucleotides occur in the metagenomic sequence;
splitting the metagenomic sequence into fragments;
for individual ones of the fragments, generating respective fragment vectors comprising the frequencies with which the possible tetra-nucleotides occur in the individual ones of the fragments;
generating a plurality of fragment clusters by clustering the fragment vectors;
computing centroids of individual ones of the fragment clusters;
for individual ones of the centroids, generating respective cluster vectors;
identifying, as a set of reference points, three cluster vectors from the respective cluster vectors, the three cluster vectors having pairwise dot products which are the least correlated amongst computed pairwise dot products of combinations of individual ones of the respective cluster vectors;
transforming the intermediate vector into a three-dimensional sequence vector having coordinates determined by a distance between the intermediate vector and individual ones of the set of reference points, wherein the three-dimensional sequence vector corresponds to the metagenomic sequence;
defining a cuboid having a plurality of grids in a three-dimensional space encompassing the three-dimensional sequence vector, wherein individual ones of the plurality of grids encompass taxonomically similar metagenomic sequences from among the plurality of metagenomic sequences;
selecting a subset of the plurality of metagenomic sequences, wherein the subset includes a first metagenomic sequence located within coordinates defined by one of the plurality of grids in the cuboid and a second metagenomic sequence located within an immediate neighbor of the one of the plurality of grids in the cuboid; and
assembling a one of the metagenomic sequences present in the subset with at least one other metagenomic sequence present in the subset into a contig, wherein the metagenomic sequence and the at least one other metagenomic sequence originate from a same genome.
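Illustrative note: the vector steps of the claim (tetra-nucleotide frequencies, choice of three reference points, and the distance-based transform to three dimensions) can be sketched as below. This is a hedged sketch, not the claimed method itself. Several details are assumptions because the claim does not fix them: counts are normalized to relative frequencies, windows containing characters other than A, C, G, T are skipped, the distance is Euclidean, and "least correlated" is read as the triple with the smallest sum of pairwise dot products. All function names are hypothetical, and the clustering step that would produce the cluster vectors is omitted.

```python
from itertools import combinations, product
import math

# All 256 possible tetra-nucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetra_frequencies(seq):
    """Return a 256-dimensional relative-frequency vector for `seq`.

    Assumption: overlapping windows, normalized by the number of
    valid windows; windows with ambiguous bases are ignored.
    """
    counts = [0] * len(TETRAMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        j = INDEX.get(seq[i:i + 4])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [c / total for c in counts]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pick_reference_points(cluster_vectors):
    """Pick three vectors with the smallest summed pairwise dot product.

    Assumption: this is one possible reading of "least correlated".
    """
    best = min(
        combinations(cluster_vectors, 3),
        key=lambda t: dot(t[0], t[1]) + dot(t[0], t[2]) + dot(t[1], t[2]),
    )
    return list(best)


def to_3d(vector, reference_points):
    """Coordinates are Euclidean distances to the three reference points."""
    return tuple(
        math.sqrt(sum((x - y) ** 2 for x, y in zip(vector, ref)))
        for ref in reference_points
    )
```

In use, `tetra_frequencies` would be applied to each metagenomic sequence, and `to_3d` would map the resulting 256-dimensional vector to the three coordinates used for grid placement.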
Address: Mumbai, Maharashtra IN