Title of invention: METHOD AND SYSTEM FOR PROCESSING DATA FOR EVALUATING A QUALITY LEVEL OF A DATASET
Abstract: A method processes data for evaluating the quality level of an original dataset. The original dataset is obtained from an automated sequencing of a chain of nucleotides and represents a plurality of total mapped reads. The method includes sampling the plurality of total mapped reads of the original dataset to produce a subset of mapped reads. The method also includes computing a dispersion indicator for the subset. The dispersion indicator represents the divergence between an actual read count intensity and a theoretical read count intensity. The actual read count intensity corresponds to the number of sampled mapped reads. The theoretical read count intensity corresponds to a theoretical number of sampled mapped reads, which does not depend on the current sampling.
Publication number: US2015310166(A1) Publication date: 2015.10.29
Application number: US201314648250 Filing date: 2013.11.26
Applicant: INSTITUT NATIONAL DE LA SANTE ET DE LA RECHERCHE MEDICALE (INSERM); CENTRE NATIONAL DE LA RECHERCHE SCIENTIFIQUE (C.N.R.S); UNIVERSITÉ DE STRASBOURG Inventor: GRONEMEYER Hinrich; MENDOZA PARRA Marco Antonio
Classification: G06F19/22; G06F17/30 Main classification: G06F19/22
Agency: Agent:
Main claim: 1. Method for processing data for evaluating a quality level of an original dataset resulting from an automated sequencing of a chain of nucleotides, wherein said sequenced chain comprises a plurality of predefined identified regions and said original dataset represents a plurality of total mapped reads and a plurality of read count intensities, each read count intensity corresponding to the number of total mapped reads in an identified region of said sequenced chain, said method comprising: sampling of the plurality of total mapped reads of said original dataset at a selected sampling density to produce at least one data subset comprising a plurality of sampled mapped reads; for each said data subset, computing at least one dispersion indicator for each of said identified regions, representative of the divergence between an actual read count intensity for said identified region in said data subset and a theoretical read count intensity for said identified region in said data subset, the actual read count intensity for said identified region corresponding to the number of sampled mapped reads in this identified region, the theoretical read count intensity for said identified region corresponding to a theoretical number of sampled mapped reads in this identified region which does not depend on the current sampling.
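The two steps of the claim (sampling at a selected density, then a per-region dispersion indicator) can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names `sample_reads` and `dispersion_by_region` are hypothetical, each mapped read is assumed to be already assigned to one identified region, the theoretical read count intensity is assumed to be the sampling density multiplied by the region's total read count, and the divergence is assumed to be expressed as a percent relative deviation. The record itself does not specify the formula.

```python
import random
from collections import Counter


def sample_reads(read_regions, density, seed=None):
    """Randomly keep round(density * N) reads, without replacement.

    read_regions: list with one region identifier per mapped read (assumed input form).
    density: selected sampling density, between 0 and 1.
    """
    rng = random.Random(seed)
    k = round(density * len(read_regions))
    return rng.sample(read_regions, k)


def dispersion_by_region(read_regions, density, seed=None):
    """Return {region: percent deviation of actual from theoretical read count}."""
    total = Counter(read_regions)  # read count intensity per region, original dataset
    sampled = Counter(sample_reads(read_regions, density, seed))

    result = {}
    for region, n_total in total.items():
        # Assumption: the theoretical count depends only on the density and the
        # original count, so it is the same for every random draw.
        theoretical = density * n_total
        actual = sampled[region]  # Counter returns 0 for regions with no sampled read
        result[region] = 100.0 * (actual - theoretical) / theoretical
    return result


if __name__ == "__main__":
    # Toy dataset: 80 reads in region "A", 20 reads in region "B".
    reads = ["A"] * 80 + ["B"] * 20
    for region, delta in dispersion_by_region(reads, density=0.5, seed=1).items():
        print(f"{region}: {delta:+.1f}%")
```

Under these assumptions, regions with few reads tend to show larger dispersion across repeated samplings than regions with many reads, which is the kind of signal a quality evaluation could build on.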
Address: Paris FR