Title of Invention |
BIOLOGICAL SEQUENCE TANDEM REPEAT CHARACTERIZATION |
Abstract |
Short fixed length source sub-sequences are extracted from a collection of source sequences derived from a sample for which the biological signature is to be determined. The extracted short fixed length source sub-sequences are compiled to determine the frequency of each within the collection. Overlaps between the short fixed length source sub-sequences are used to find a chain of overlaps from one or more sub-sequences equivalent to a pre-flanking reference marker sequence to one or more sub-sequences equivalent to a post-flanking reference marker sequence, wherein the reference marker sequences flank a region containing a repetitive sequence region. In response to the chain containing multiple instances of the one or more short fixed length source sub-sequences, thereby defining a cycle, the sequences from the collection derived from the sample are examined to find one or more sequences that span the cycle, and at least one of: (i) the lengths of the spanning sequences are used to determine the length of the cycle; and (ii) the number of repeat motif copies within each spanning sequence is counted. |
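For illustration only, the extraction, frequency compilation, and overlap-chain steps described in the abstract can be sketched in Python. This is a minimal sketch under stated assumptions, not the patented implementation: the function names (`count_kmers`, `successors`, `trace_chain`), the sub-sequence length k = 5, and the example read are all hypothetical, and sequencing errors, reverse complements, and frequency thresholds are ignored.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every fixed-length (length-k) sub-sequence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def successors(kmer, counts):
    """Observed k-mers whose first k-1 bases equal the last k-1 bases of kmer."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if counts[suffix + base] > 0]

def trace_chain(counts, start, goal):
    """Depth-first walk over k-1 overlaps from start toward goal.

    Returns (goal_reached, cycle), where cycle is the first loop of k-mers
    found along the walk, or an empty list if none was found.
    """
    visited, path = set(), []
    result = {"reached": False, "cycle": []}

    def visit(kmer):
        if kmer == goal:
            result["reached"] = True
            return
        visited.add(kmer)
        path.append(kmer)
        for nxt in successors(kmer, counts):
            if nxt in path and not result["cycle"]:
                # The walk has returned to a k-mer on the current chain: a cycle.
                result["cycle"] = path[path.index(nxt):]
            elif nxt not in visited:
                visit(nxt)
        path.pop()

    visit(start)
    return result["reached"], result["cycle"]

# Hypothetical read: pre-flank ACGTTGCA, three copies of GATA, post-flank CCTGAAGT.
reads = ["ACGTTGCA" + "GATA" * 3 + "CCTGAAGT"]
counts = count_kmers(reads, 5)
reached, cycle = trace_chain(counts, "TTGCA", "CCTGA")
```

In this toy input the chain reaches the post-flanking k-mer and passes through a four-k-mer loop, which is the situation the abstract describes as "defining a cycle". The loop alone does not say how many times the repeat occurs, which is why spanning sequences are examined next.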
Publication Number |
US2016103955(A1) |
Publication Date |
2016.04.14 |
Application Number |
US201514793273 |
Filing Date |
2015.07.07 |
Applicant |
International Business Machines Corporation |
Inventors |
Conway Thomas C.;Wyres Kelly L. |
Classification |
G06F19/22 |
Main Classification |
G06F19/22 |
Agency |
|
Agent |
|
Independent Claim |
1. A method for determining a signature from biological sequence data, comprising:
extracting short fixed length source sub-sequences from a collection of source sequences derived from a sample for which the biological signature is to be determined;
compiling the extracted short fixed length source sub-sequences to determine the frequency of each within the collection;
using overlaps between the short fixed length source sub-sequences to find a chain of overlaps from one or more sub-sequences equivalent to a pre-flanking reference marker sequence to one or more sub-sequences equivalent to a post-flanking reference marker sequence, the reference marker sequences flanking a region containing a repetitive sequence region; and
in response to the chain containing multiple instances of the one or more short fixed length source sub-sequences, thereby defining a cycle, examining the sequences from the collection derived from the sample to find one or more sequences that span the cycle, and at least one of: (i) using the lengths of the spanning sequences to determine the length of the cycle; and (ii) counting the number of repeat motif copies within each spanning sequence;
wherein one or more of the above steps are performed in accordance with a processor and a memory. |
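The final step of the claim, examining spanning sequences to measure the cycle, can be illustrated with a short Python sketch. It is a simplified reading of options (i) and (ii), not the claimed method itself: the helper names (`spanning_regions`, `count_motif_copies`), the exact-match search for the flanks, and the example reads and motif are all assumptions introduced here.

```python
def spanning_regions(reads, pre_flank, post_flank):
    """Return the stretch between the two flanks for each read that contains both."""
    regions = []
    for read in reads:
        start = read.find(pre_flank)
        if start < 0:
            continue  # read does not contain the pre-flanking marker
        start += len(pre_flank)
        end = read.find(post_flank, start)
        if end < 0:
            continue  # read does not reach the post-flanking marker
        regions.append(read[start:end])
    return regions

def count_motif_copies(region, motif):
    """Count back-to-back copies of motif starting at the beginning of region."""
    if not motif:
        raise ValueError("motif must be non-empty")
    copies = 0
    while region.startswith(motif, copies * len(motif)):
        copies += 1
    return copies

# Hypothetical reads: two span the repeat, one lies entirely inside it.
reads = [
    "ACGTTGCA" + "GATA" * 3 + "CCTGAAGT",
    "TTGCA" + "GATA" * 5 + "CCTGA",
    "GATAGATA",
]
regions = spanning_regions(reads, "TTGCA", "CCTGA")
lengths = [len(region) for region in regions]                         # option (i)
copies = [count_motif_copies(region, "GATA") for region in regions]   # option (ii)
```

Only reads that contain both flanking markers contribute, so the read that lies entirely within the repeat is skipped. A real sample may carry more than one allele, in which case the spanning sequences would yield more than one length or copy count, as in this example.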
Address |
Armonk NY US |