Title of invention: BIOLOGICAL SEQUENCE TANDEM REPEAT CHARACTERIZATION
Abstract: Short fixed length source sub-sequences are extracted from a collection of source sequences derived from a sample for which the biological signature is to be determined. The extracted short fixed length source sub-sequences are compiled to determine the frequency of each within the collection. Overlaps between the short fixed length source sub-sequences are used to find a chain of overlaps from one or more sub-sequences equivalent to a pre-flanking reference marker sequence to one or more sub-sequences equivalent to a post-flanking reference marker sequence, wherein the reference marker sequences flank a region containing a repetitive sequence region. In response to the chain containing multiple instances of the one or more short fixed length source sub-sequences, thereby defining a cycle, the sequences from the collection derived from the sample are examined to find one or more sequences that span the cycle, and at least one of: (i) the lengths of the spanning sequences are used to determine the length of the cycle; and (ii) the number of repeat motif copies within each spanning sequence is counted.
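The first three steps of the abstract (extracting fixed-length sub-sequences, compiling their frequencies, and following overlaps from a pre-flanking marker to a post-flanking marker) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the choice of k = 5, the toy read, and the decision to ignore reverse complements and sequencing errors are all assumptions made for illustration. A k-mer is treated as part of a cycle when it lies on the chain and can be reached again from itself by following overlaps.

```python
from collections import Counter, deque

BASES = "ACGT"

def count_kmers(reads, k):
    """Count every length-k sub-sequence (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def successors(kmer, counts):
    """Observed k-mers whose first k-1 bases equal the last k-1 bases of kmer."""
    return [kmer[1:] + b for b in BASES if counts.get(kmer[1:] + b, 0) > 0]

def predecessors(kmer, counts):
    """Observed k-mers whose last k-1 bases equal the first k-1 bases of kmer."""
    return [b + kmer[:-1] for b in BASES if counts.get(b + kmer[:-1], 0) > 0]

def reachable(start, step, counts):
    """Breadth-first search over overlaps in the direction given by step."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in step(node, counts):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def chain_kmers(counts, pre_kmer, post_kmer):
    """k-mers lying on some overlap chain from pre_kmer to post_kmer."""
    if counts.get(pre_kmer, 0) == 0 or counts.get(post_kmer, 0) == 0:
        return set()
    forward = reachable(pre_kmer, successors, counts)
    backward = reachable(post_kmer, predecessors, counts)
    return forward & backward

def cycle_kmers(chain, counts):
    """Chain k-mers that can be revisited, i.e. that lie on a cycle."""
    on_cycle = set()
    for kmer in chain:
        seen = set()
        stack = [s for s in successors(kmer, counts) if s in chain]
        while stack:
            node = stack.pop()
            if node == kmer:
                on_cycle.add(kmer)
                break
            if node not in seen:
                seen.add(node)
                stack.extend(s for s in successors(node, counts) if s in chain)
    return on_cycle

# Hypothetical toy read: flank + five copies of the motif GATA + flank.
reads = ["ACGTTGCA" + "GATA" * 5 + "CCTGAACT"]
counts = count_kmers(reads, k=5)
chain = chain_kmers(counts, "ACGTT", "GAACT")
cycle = cycle_kmers(chain, counts)
```

In this toy input the cycle consists of the four rotations of the motif that fit in a 5-mer window. The k-mer graph alone shows that a repeat exists but not how many copies it holds, which is why the abstract goes on to examine sequences that span the cycle.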
Publication number: US2016103955(A1) Publication date: 2016.04.14
Application number: US201514793273 Filing date: 2015.07.07
Applicant: International Business Machines Corporation Inventors: Conway Thomas C.; Wyres Kelly L.
Classification: G06F19/22 Main classification: G06F19/22
Agency: Agent:
Principal claim: 1. A method for determining a signature from biological sequence data, comprising: extracting short fixed length source sub-sequences from a collection of source sequences derived from a sample for which the biological signature is to be determined; compiling the extracted short fixed length source sub-sequences to determine the frequency of each within the collection; using overlaps between the short fixed length source sub-sequences to find a chain of overlaps from one or more sub-sequences equivalent to a pre-flanking reference marker sequence to one or more sub-sequences equivalent to a post-flanking reference marker sequence, the reference marker sequences flanking a region containing a repetitive sequence region; and in response to the chain containing multiple instances of the one or more short fixed length source sub-sequences, thereby defining a cycle, examining the sequences from the collection derived from the sample to find one or more sequences that span the cycle, and at least one of: (i) using the lengths of the spanning sequences to determine the length of the cycle; and (ii) counting the number of repeat motif copies within each spanning sequence; wherein one or more of the above steps are performed in accordance with a processor and a memory.
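The final step of the claim, examining sequences that span the cycle, might look like the following Python sketch. It is an illustration under stated assumptions rather than the claimed method itself: a read is taken to span the cycle when it contains both flanking markers as exact substrings, the repeat region is taken to be the text between them, and the helper names, flank strings, and motif are hypothetical.

```python
def spanning_regions(reads, pre_flank, post_flank):
    """Return the text between the two flanks for each read containing both."""
    regions = []
    for read in reads:
        start = read.find(pre_flank)
        if start < 0:
            continue  # read does not contain the pre-flanking marker
        begin = start + len(pre_flank)
        end = read.find(post_flank, begin)
        if end < 0:
            continue  # read stops before the post-flanking marker
        regions.append(read[begin:end])
    return regions

def count_motif_copies(region, motif):
    """Length of the longest run of back-to-back motif copies in region."""
    best = 0
    for i in range(len(region)):
        run = 0
        j = i
        while region.startswith(motif, j):
            run += 1
            j += len(motif)
        best = max(best, run)
    return best

# Hypothetical reads: the first spans the repeat, the second ends inside it.
full_read = "ACGTTGCA" + "GATA" * 5 + "CCTGAACT"
partial_read = "ACGTTGCA" + "GATA" * 3
regions = spanning_regions([full_read, partial_read], "TTGCA", "CCTGA")

region_lengths = [len(r) for r in regions]                    # option (i)
copy_counts = [count_motif_copies(r, "GATA") for r in regions]  # option (ii)
```

Only the first read contributes, because the second never reaches the post-flanking marker. Real data would also need tolerance for sequencing errors and for reads in the reverse-complement orientation, which this sketch leaves out.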
Address: Armonk, NY, US