Title of invention Minimization of surprisal data through application of hierarchy of reference genomes
Abstract A method, computer product, and computer system of minimizing surprisal data comprising: at a source, reading and identifying characteristics of a genetic sequence of an organism; receiving an input of rank of at least two identified characteristics of the genetic sequence of the organism; generating a hierarchy of ranked, identified characteristics based on the rank of the at least two identified characteristics of the genetic sequence of the organism; comparing the hierarchy of ranked, identified characteristics to a repository of reference genomes; and if at least one reference genome from the repository matches the hierarchy of ranked, identified characteristics, comparing nucleotides of the genetic sequence of the organism to nucleotides from the at least one matched reference genome, to obtain differences and create surprisal data.
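The ranking and matching steps in the abstract can be illustrated with a minimal Python sketch. The record does not define what counts as a "match", so this sketch assumes a reference genome matches when it agrees with the hierarchy on at least the top-ranked characteristic, and it prefers references that agree on a longer prefix of the hierarchy. The function names, the trait names, and the repository layout are all hypothetical.

```python
from typing import Dict, List, Tuple

Hierarchy = List[Tuple[str, str]]


def build_hierarchy(characteristics: Dict[str, str], ranks: Dict[str, int]) -> Hierarchy:
    """Order the ranked characteristics from most important (rank 1) downward."""
    ranked = [name for name in characteristics if name in ranks]
    if len(ranked) < 2:
        # The abstract requires a rank for at least two identified characteristics.
        raise ValueError("at least two ranked characteristics are required")
    ranked.sort(key=lambda name: ranks[name])
    return [(name, characteristics[name]) for name in ranked]


def match_depth(hierarchy: Hierarchy, reference_traits: Dict[str, str]) -> int:
    """Count how many top-ranked characteristics the reference agrees with, in order."""
    depth = 0
    for name, value in hierarchy:
        if reference_traits.get(name) != value:
            break
        depth += 1
    return depth


def find_matches(hierarchy: Hierarchy, repository: Dict[str, Dict[str, str]]) -> List[str]:
    """Return IDs of matching reference genomes, deepest match first (assumed criterion)."""
    scored = [(match_depth(hierarchy, traits), ref_id) for ref_id, traits in repository.items()]
    scored = [(depth, ref_id) for depth, ref_id in scored if depth > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [ref_id for _, ref_id in scored]


# Hypothetical example data, for illustration only.
characteristics = {"species": "human", "region": "europe", "condition": "diabetes"}
ranks = {"species": 1, "region": 2, "condition": 3}
repository = {
    "ref_a": {"species": "human"},
    "ref_b": {"species": "human", "region": "europe"},
    "ref_c": {"species": "mouse"},
}
hierarchy = build_hierarchy(characteristics, ranks)
matches = find_matches(hierarchy, repository)
```

Ordering by prefix depth is one plausible reading of "hierarchy"; a stricter variant could require agreement on every ranked characteristic.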
Publication number US8855938(B2) Publication date 2014.10.07
Application number US201213475183 Filing date 2012.05.18
Applicant International Business Machines Corporation Inventors Friedlander Robert R.; Kraemer James R.
Classification G06F7/00; G01N19/00 Main classification G06F7/00
Agency Brown & Michaels, PC Agent Brown & Michaels, PC; Pivnichny John R.
Principal claim 1. A method of minimizing surprisal data representing an entire genome of an organism for compression and transmission, comprising, at a source computer having one or more processors and one or more non-transitory computer-readable memories coupled to the one or more processors, performing the steps of:
a) reading and identifying characteristics associated with the organism's medical history and background for a genetic sequence of an organism;
b) receiving an input of rank of at least two identified characteristics associated with the genetic sequence of the organism;
c) generating a hierarchy of ranked, identified characteristics based on the rank of the at least two identified characteristics associated with the genetic sequence of the organism;
d) comparing the hierarchy of ranked, identified characteristics to a repository of reference genomes; and
e) if at least one reference genome from the repository matches the hierarchy of ranked, identified characteristics,
i) storing the at least one matched reference genome in a repository;
ii) comparing nucleotides of the genetic sequence of the organism to nucleotides from the at least one matched reference genome, to find differences where nucleotides of the genetic sequence of the organism which are different from the nucleotides of the at least one matched reference genome;
iii) using the differences to create surprisal data representing an entire genome of the organism and storing the surprisal data in the repository, the surprisal data comprising a starting location of the differences within the reference genome, a count of a number of differences at the location within the at least one matched reference genome and the nucleotides from the genetic sequence of the organism which are different from the nucleotides of the reference genome;
repeating steps (e)(i), (e)(ii), and (e)(iii) if another reference genome from the repository matches the hierarchy of ranked, identified characteristics; and
transmitting to a destination computer having one or more processors and one or more non-transitory computer-readable memories coupled to the one or more processors, a compressed, minimized genome representing an entire genome by sending the surprisal data and the indication of the at least one matched reference genome, and not sending sequences of nucleotides that are the same in the genetic sequence of the organism and the at least one matched reference genome.
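The surprisal data described in the claim (a starting location, a count of differences, and the differing nucleotides) can be sketched in Python as follows. This is an illustrative sketch only, not the patented implementation: it assumes the organism's sequence and the reference genome are already aligned and of equal length, so it handles substitutions but not insertions or deletions. The names `Surprisal`, `create_surprisal`, and `reconstruct` are hypothetical.

```python
from typing import List, NamedTuple


class Surprisal(NamedTuple):
    start: int        # 0-based starting location within the reference genome
    count: int        # number of consecutive differing nucleotides at that location
    nucleotides: str  # the organism's nucleotides that differ from the reference


def create_surprisal(sequence: str, reference: str) -> List[Surprisal]:
    """Record each run of nucleotides where the organism differs from the reference."""
    if len(sequence) != len(reference):
        # Assumption of this sketch: inputs are pre-aligned and equal in length.
        raise ValueError("sequences must be pre-aligned and of equal length")
    records: List[Surprisal] = []
    run_start = None
    for i, (s, r) in enumerate(zip(sequence, reference)):
        if s != r:
            if run_start is None:
                run_start = i  # a new run of differences begins here
        elif run_start is not None:
            records.append(Surprisal(run_start, i - run_start, sequence[run_start:i]))
            run_start = None
    if run_start is not None:  # close a run that reaches the end of the sequence
        records.append(Surprisal(run_start, len(sequence) - run_start, sequence[run_start:]))
    return records


def reconstruct(reference: str, records: List[Surprisal]) -> str:
    """At the destination, rebuild the genome from the reference plus surprisal data."""
    genome = list(reference)
    for rec in records:
        genome[rec.start:rec.start + rec.count] = rec.nucleotides
    return "".join(genome)


# Toy sequences for illustration; real genomes are far longer.
reference = "ACGTACGTAC"
organism = "ACGAACGTTT"
surprisal = create_surprisal(organism, reference)
restored = reconstruct(reference, surprisal)
```

Only `surprisal` and an identifier for the matched reference would need to be transmitted; the destination looks up the same reference and calls `reconstruct`, so nucleotides shared with the reference are never sent.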
Address Armonk NY US