Title of invention MINIMIZATION OF SURPRISAL DATA THROUGH APPLICATION OF HIERARCHY FILTER PATTERN
Abstract A computer product and system for minimizing surprisal data, comprising: at a source, reading and identifying characteristics of an organism's background associated with a genetic sequence of the organism; receiving an input of rank of at least two identified characteristics of the genetic sequence; generating a hierarchy of ranked, identified characteristics based on the rank of the identified characteristics; comparing the hierarchy of ranked, identified characteristics to a repository of reference genomes; and, if at least one reference genome from the repository matches the ranked characteristics, breaking the matched reference genomes into pieces and combining the pieces associated with the identified characteristics from the matched reference genome to form a filter pattern to be compared to the nucleotides of the genetic sequence of the organism. The differences found by the comparison are used to create surprisal data representing an entire genome of the organism.
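The ranking and matching steps in the abstract can be illustrated with a minimal Python sketch. The record does not say how a reference genome "matches" the hierarchy, so the rule used here (a reference matches if it is tagged with at least one ranked characteristic, and references covering higher-ranked characteristics sort first) is an assumption. The function names, the repository layout, and the sample characteristics are all hypothetical.

```python
def build_hierarchy(ranks: dict[str, int]) -> list[str]:
    """Order characteristics from most to least important (rank 1 first)."""
    return sorted(ranks, key=lambda name: ranks[name])


def match_references(hierarchy: list[str],
                     repository: dict[str, set[str]]) -> list[str]:
    """Return IDs of reference genomes tagged with any ranked characteristic.

    Assumed matching rule (not taken from the patent record): references
    that cover higher-ranked characteristics are listed first.
    """
    matches = [ref_id for ref_id, traits in repository.items()
               if any(c in traits for c in hierarchy)]

    def score(ref_id: str) -> tuple[bool, ...]:
        # False sorts before True, so a reference carrying the top-ranked
        # characteristic sorts ahead of one that lacks it.
        return tuple(c not in repository[ref_id] for c in hierarchy)

    return sorted(matches, key=score)


# Hypothetical input: two ranked characteristics and three tagged references.
hierarchy = build_hierarchy({"northern_european": 2, "diabetes": 1})
repository = {
    "ref_a": {"northern_european"},
    "ref_b": {"diabetes", "northern_european"},
    "ref_c": {"asthma"},
}
matched = match_references(hierarchy, repository)
```

In this sketch `ref_c` is dropped because it carries none of the ranked characteristics, and `ref_b` precedes `ref_a` because it covers the top-ranked one.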
Publication number US2015095293(A1) Publication date 2015.04.02
Application number US201414476234 Filing date 2014.09.03
Applicant International Business Machines Corporation Inventors Friedlander Robert R.; Kraemer James R.
Classification G06F17/30; G06F19/22; G06F19/28 Main classification G06F17/30
Agency Agent
Main claim 1. A computer program product for minimizing surprisal data representing an entire genome of an organism for compression and transmission, comprising a source computer having one or more processors and one or more computer-readable memories coupled to the one or more processors, comprising: one or more computer-readable storage devices, and program instructions, stored on the one or more storage devices, the program instructions comprising: program instructions to, at a source computer, read and identify characteristics of the organism's medical history and background associated with a genetic sequence of an organism; program instructions to receive an input of rank of at least two identified characteristics associated with the genetic sequence of the organism; program instructions to generate a hierarchy of ranked, identified characteristics based on the rank of the at least two identified characteristics of the genetic sequence of the organism; program instructions to compare the hierarchy of ranked, identified characteristics to a repository of reference genomes; and program instructions that if at least one reference genome from the repository matches the hierarchy of ranked, identified characteristics, program instructions to:
i) storing the at least one matched reference genome in a repository;
ii) breaking the at least one matched reference genome into pieces comprising nucleotides of the genetic sequence which comprises at least one gene, at least some of the pieces being associated with the identified characteristics;
iii) storing the pieces which are associated with the identified characteristics in the repository;
iv) combining the stored pieces of the at least one matched reference genome into a filter pattern;
v) comparing pieces of the nucleotides of the genetic sequence of the organism which comprises at least one gene which correspond to the stored pieces of the at least one matched reference genome to the nucleotides of the filter pattern of the pieces of the at least one matched reference genome, to find differences where nucleotides of the genetic sequence of the organism which are different from the nucleotides of the at least one matched reference genome;
vi) using the differences to create surprisal data representing an entire genome of the organism and storing the surprisal data in the repository, the surprisal data comprising a starting location of the differences within the reference genome, how the reference genomes were broken into pieces, a count of a number of differences at the location within the at least one matched reference genome and the nucleotides from the genetic sequence of the organism which are different from the nucleotides of the reference genome; and
vii) transmitting to a destination, a compressed, minimized genome representing an entire genome by sending the surprisal data, an indication of the at least one matched reference genome, and how the reference genome were broken into pieces and not sending sequences of nucleotides that are the same in the genetic sequence of the organism and the at least one matched reference genome.
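Steps v) through vii) of the claim can be sketched in Python as follows. The sketch makes a simplifying assumption the claim does not: the organism's sequence and the filter pattern are aligned and equal in length, so only substitutions are handled. The `Surprisal` record carries the three items the claim lists (starting location, count of differences, differing nucleotides). Its name and field names are hypothetical, and `reconstruct` is an added illustration of how a destination holding the same reference could rebuild the sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surprisal:
    start: int   # 0-based offset of the first differing nucleotide in the reference
    count: int   # number of consecutive differing nucleotides from that offset
    bases: str   # the organism's nucleotides at those positions


def make_surprisal(reference: str, organism: str) -> list[Surprisal]:
    """Record only the runs where the organism differs from the reference."""
    if len(reference) != len(organism):
        raise ValueError("this sketch assumes aligned, equal-length sequences")
    records = []
    i, n = 0, len(reference)
    while i < n:
        if reference[i] == organism[i]:
            i += 1
            continue
        j = i
        while j < n and reference[j] != organism[j]:
            j += 1  # extend the run of differences
        records.append(Surprisal(i, j - i, organism[i:j]))
        i = j
    return records


def reconstruct(reference: str, records: list[Surprisal]) -> str:
    """Rebuild the organism's sequence at the destination from the reference."""
    out = list(reference)
    for r in records:
        out[r.start:r.start + r.count] = r.bases
    return "".join(out)


# Hypothetical ten-nucleotide example.
reference = "ACGTACGTAC"
organism = "ACGAACGTTT"
records = make_surprisal(reference, organism)
restored = reconstruct(reference, records)
```

Only `records` and an identifier for the matched reference would need to be transmitted. Nucleotides that are identical in both sequences are never sent, which is where the compression described in step vii) comes from.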
Address Armonk NY US