Title of invention SYSTEMS, METHODS, AND COMPUTER PROGRAM PRODUCTS FOR MERGING A NEW NUCLEOTIDE OR AMINO ACID SEQUENCE INTO OPERATIONAL TAXONOMIC UNITS
Abstract The present disclosure provides a method for filtering sequence clusters during a process of merging a newly generated nucleotide or amino acid sequence with a set of previously clustered sequences. In another aspect, the disclosure provides a method for assigning newly generated nucleotide or amino acid sequences to presumptive species called operational taxonomic units. In yet another embodiment, the sequences are derived from the cytochrome c oxidase I gene.
Publication number US2016103958(A1) Publication date 2016.04.14
Application number US201414897321 Filing date 2014.06.13
Applicant UNIVERSITY OF GUELPH Inventors Hebert Paul;Ratnasingham Sujeevan
Classification G06F19/24;G06F17/30;G06F19/14 Main classification G06F19/24
Agency Agent
Principal claim 1. A method for operating a computer system to filter out clusters from a group of clusters from further consideration during a process of merging a new nucleic acid or amino acid sequence into the group of clusters based on sequence similarity, the computer comprising a processor and a memory, the method comprising:
a) determining a candidate cluster set including a plurality of candidate clusters, each candidate cluster comprising a plurality of previously classified nucleic acid or amino acid sequences wherein each previously classified nucleic acid or amino acid sequence in a cluster is closer to at least one other previously classified nucleic acid or amino acid sequence in that cluster than to any previously classified nucleic acid or amino acid sequences in other clusters;
b) using the processor of the computer system to determine a plurality of sets of representative sequences, by determining, for each of the candidate clusters in the candidate cluster set, a set of one or more representative sequences, wherein for at least one candidate cluster, the number of representative sequences in the set of one or more representative sequences of the candidate cluster is less than the number of previously classified nucleic acid or amino acid sequences in the plurality of previously classified nucleic acid or amino acid sequences of the candidate cluster;
c) using the processor to determine a plurality of candidate cluster distance measures, by determining, for each of the candidate clusters in the candidate cluster set, a candidate cluster distance measure between the new nucleic acid or amino acid sequence and the candidate cluster, wherein the candidate cluster distance measure between the new nucleic acid or amino acid sequence and the candidate cluster is determined by determining the distance between the nucleic acid or amino acid sequence and the set of one or more representative sequences of the candidate cluster; and
d) using the processor to filter out from further consideration each candidate cluster in the candidate cluster set if and only if the associated candidate cluster distance measure of the candidate cluster is greater than a pre-defined filtering threshold, and retaining and storing in the memory all other candidate clusters in the candidate cluster set for further consideration.
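Steps b) through d) of the claim can be illustrated with a minimal Python sketch. It is not the applicant's implementation. Several choices here are assumptions that the claim does not specify: sequences are taken to be pre-aligned strings of equal length, the pairwise distance is a simple p-distance (fraction of mismatched positions), and the distance from the new sequence to a set of representatives is taken as the minimum over that set. The names `p_distance`, `cluster_distance`, and `filter_clusters` are hypothetical.

```python
from typing import Dict, List


def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions between two aligned sequences.

    Assumption: both sequences are already aligned to the same length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def cluster_distance(new_seq: str, representatives: List[str]) -> float:
    """Distance from the new sequence to a cluster's representative set.

    Assumption: the set distance is the minimum pairwise distance.
    """
    return min(p_distance(new_seq, rep) for rep in representatives)


def filter_clusters(
    new_seq: str,
    representatives_by_cluster: Dict[str, List[str]],
    threshold: float,
) -> Dict[str, float]:
    """Keep every cluster whose distance measure does not exceed the threshold.

    A cluster is dropped if and only if its distance is greater than the
    threshold, mirroring step d). Returns retained cluster ids with distances.
    """
    retained: Dict[str, float] = {}
    for cluster_id, reps in representatives_by_cluster.items():
        distance = cluster_distance(new_seq, reps)
        if distance <= threshold:
            retained[cluster_id] = distance
    return retained


if __name__ == "__main__":
    # Toy data: two hypothetical clusters, each reduced to its representatives.
    reps = {
        "otu_a": ["ACGTACGTAA", "ACGTACCTAA"],
        "otu_b": ["TTTTACGTAC"],
    }
    print(filter_clusters("ACGTACGTAC", reps, threshold=0.15))
```

Because only the representatives are compared, the number of distance computations per cluster is bounded by the size of the representative set rather than by the full cluster size, which is the purpose of step b). Clusters that survive the filter would then go on to whatever finer comparison the merging process applies.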
Address Guelph CA