Title of invention: METHODS FOR IDENTIFYING SEQUENCE MOTIFS, AND APPLICATIONS THEREOF
Abstract: The present invention relates to methods and algorithms for identifying sequence motifs that are under- or over-represented in a given nucleotide sequence, either relative to the frequency at which those sequences would be expected to occur by chance or relative to the frequency at which they occur in other nucleotide sequences, and to methods of scoring sequences based on the occurrence of these sequence motifs. Such sequence motifs may be biologically significant; for example, they may constitute transcription factor binding sites, mRNA stability/instability signals, epigenetic signals, and the like. The methods of the invention can also be used, inter alia, to classify sequences or organisms in terms of their phylogenetic relationships, or to identify the likely host of a pathogenic organism. The methods of the present invention can also be used to optimize expression of proteins.
Figure 1
Step 1: Select a real genome or real genome portion in which to identify over- or under-represented "sequence motifs".
Step 2: Generate a background genome that encodes the same amino acids and has the same codon usage as the real genome, but is otherwise random.
Step 3: Identify, and count the number of occurrences of, each word of a given length in the background genome.
Decision: Has the standard deviation in the number of occurrences converged for the words? If no, repeat steps 2 and 3.
Step 4: Compute the average background count of each word across the background genomes generated in each repetition of step 2, and convert the average background count for each word into a frequency or probability.
Step 5: Count the number of occurrences of each word in the real genome and convert the count for each word into a frequency or probability.
Step 6: Perform an iterative word search algorithm to identify words contributing to the difference between the real and background genome probability distributions.
Output: List of words or "sequence motifs" that contribute most to the difference between the real and background genomes.
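The steps of Figure 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several assumptions that do not come from the record: the input is an in-frame coding sequence in uppercase A/C/G/T, the standard genetic code applies, the background genome is produced by shuffling synonymous codons among the positions of each amino acid, a fixed number of rounds stands in for the standard-deviation convergence check, and Step 6 is replaced by a simple one-pass ranking of each word's contribution to the Kullback-Leibler divergence rather than the iterative word search named in the figure. All function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering (assumption:
# the record does not say which code table is used).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def codons_of(seq):
    """Split an in-frame sequence into codons, ignoring trailing bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def translate(seq):
    return "".join(CODON_TABLE[c] for c in codons_of(seq))

def background_genome(seq, rng):
    """Step 2 (one possible scheme): shuffle synonymous codons in place, so the
    amino acid sequence and the codon usage counts are both preserved."""
    codons = codons_of(seq)
    positions = defaultdict(list)
    for idx, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(idx)
    shuffled = codons[:]
    for idxs in positions.values():
        pool = [codons[i] for i in idxs]
        rng.shuffle(pool)
        for i, codon in zip(idxs, pool):
            shuffled[i] = codon
    return "".join(shuffled) + seq[len(codons) * 3:]

def word_counts(seq, k):
    """Steps 3 and 5: count every overlapping word of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def to_freq(counts):
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def background_frequencies(seq, k, n_rounds, seed=0):
    """Step 4: pool counts over n_rounds background genomes. Every round has
    the same number of words, so normalizing the pooled counts equals
    normalizing the average count. n_rounds is fixed here (assumption)
    instead of testing convergence of the standard deviation."""
    rng = random.Random(seed)
    pooled = Counter()
    for _ in range(n_rounds):
        pooled.update(word_counts(background_genome(seq, rng), k))
    return to_freq(pooled)

def rank_words(real_freq, bg_freq, floor=1e-9):
    """Stand-in for Step 6 (assumption): rank words by the magnitude of their
    term p * log(p / q) in the KL divergence between real and background."""
    scores = {}
    for w in set(real_freq) | set(bg_freq):
        p = real_freq.get(w, floor)
        q = bg_freq.get(w, floor)
        scores[w] = p * math.log(p / q)
    return sorted(scores, key=lambda w: abs(scores[w]), reverse=True)

if __name__ == "__main__":
    real = "ATGGCTGCCGCAGCGTTATTGCTTCTCTAA"  # toy sequence, not real data
    real_freq = to_freq(word_counts(real, 3))
    bg_freq = background_frequencies(real, 3, n_rounds=50)
    print(rank_words(real_freq, bg_freq)[:5])
```

Shuffling codons only among positions that encode the same amino acid is one straightforward way to satisfy both constraints of Step 2 at once; sampling codons from the observed usage frequencies would preserve codon usage only on average.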
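Publication number: AU2013206364(B2) Publication date: 2016.07.07
Application number: AU20130206364 Filing date: 2013.06.17
Applicant: INSTITUTE FOR ADVANCED STUDY Inventors: ROBINS, HARLAN; KRASNITZ, MICHAEL; LEVINE, ARNOLD
Classification: G06F19/00 Main classification: G06F19/00
Agency: Agent:
Principal claim:
Address: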