Title: Method and apparatus for underdetermined blind separation of correlated pure components from nonlinear mixture mass spectra
Abstract: The present invention relates to a computer-implemented method and apparatus for data processing for the purpose of blind separation of nonnegative correlated pure components from a smaller number of nonlinear mixtures of mass spectra. More specifically, the invention relates to preprocessing of the recorded matrix of mixture mass spectra by robust principal component analysis, trimmed thresholding, hard thresholding and soft thresholding; empirical kernel map-based nonlinear mapping of the preprocessed matrices of mixture mass spectra into a reproducing kernel Hilbert space; and linear sparseness- and nonnegativity-constrained factorization of the mapped matrices therein. The preprocessing of the recorded matrix of mixture mass spectra is performed to suppress higher-order monomials of the pure components that are induced by the nonlinear mixtures. Components separated by each factorization are correlated with the ones stored in the library. Each component from the library is associated with the separated component with which it has the highest correlation coefficient. The value of the correlation coefficient indicates the degree of purity of the separated component. Separated components that are not assigned to the pure components from the library can be considered as candidates for new pure components. Identified pure components can be used for identification of compounds in chemical synthesis, food quality inspection or pollution inspection; for identification and characterization of compounds obtained from natural sources (microorganisms, plants and animals); or in instrumental diagnostics, namely determination and identification of metabolites and biomarkers present in biological fluids (urine, blood plasma, cerebrospinal fluid, saliva, amniotic fluid, bile, tears, etc.) or tissue extracts.
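The library-matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `cosine_matrix` and `assign_to_library`, the dictionary layout and the toy spectra are hypothetical, and the similarity measure is assumed to be the normalized inner product that the claim writes in [XVII] to [XX]. All spectra are assumed to be nonzero rows of NumPy arrays.

```python
import numpy as np

def cosine_matrix(S_sep, S_ref):
    """Normalized inner products between separated and reference spectra.

    S_sep: (M_hat x R) separated components, S_ref: (J x R) library spectra.
    Rows are assumed to be nonzero, otherwise the norms would be zero.
    """
    a = S_sep / np.linalg.norm(S_sep, axis=1, keepdims=True)
    b = S_ref / np.linalg.norm(S_ref, axis=1, keepdims=True)
    return a @ b.T  # shape (M_hat, J)

def assign_to_library(separated, S_ref):
    """Return {j: (label, m, corr)}: the best separated component per reference j.

    separated maps a factorization label (e.g. "A") to its (M_hat x R) array.
    """
    best = {}
    for label, S_sep in separated.items():
        corr = cosine_matrix(S_sep, S_ref)
        # Each separated component m is tied to its most similar reference j.
        for m, j in enumerate(corr.argmax(axis=1)):
            j = int(j)
            c = float(corr[m, j])
            # Keep, per reference j, the highest correlation over all factorizations.
            if j not in best or c > best[j][2]:
                best[j] = (label, m, c)
    return best

# Toy data (hypothetical): two library spectra, two factorizations.
S_ref = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
separated = {
    "A": np.array([[0.9, 0.1, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]]),
    "B": np.array([[0.1, 1.0, 0.0, 0.0]]),
}
best = assign_to_library(separated, S_ref)

# Components that were not selected for any reference are candidates
# for new pure components.
chosen = {(label, m) for label, m, _ in best.values()}
new_candidates = [(label, m)
                  for label, S in separated.items()
                  for m in range(len(S))
                  if (label, m) not in chosen]
```

In this toy case the second component of factorization "A" matches no library entry, so it is the only one left over as a candidate for a new pure component.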
Publication number: US2015206727(A1)  Publication date: 2015.07.23
Application number: US201414157578  Filing date: 2014.01.17
Applicant: RUDJER BOSKOVIC INSTITUTE  Inventors: Kopriva Ivica; Jeric Ivanka; Brkljacic Lidija
Classification: H01J49/00; G01N33/50; G01N33/49; H01J49/26; G01N33/66  Main classification: H01J49/00
Agency:  Agent:
Independent claim: 1. A method for blind separation of nonnegative correlated pure components from a smaller number of nonlinear mixture mass spectra by using robust principal component analysis, trimmed thresholding, hard thresholding and soft thresholding for preprocessing of the experimental data matrix of mixture mass spectra; empirical kernel map-based nonlinear mapping of the preprocessed matrices onto a reproducing kernel Hilbert space; sparseness- and nonnegativity-constrained factorization of the mapped matrices; correlation of the separated components with the reference components from the library; and assignment of the separated components to the pure components from the library using a maximal correlation criterion, comprising the following steps:
recording and storing the mixture data $\mathbf{X}$, where $\mathbf{X}$ is a nonnegative data matrix comprised of $N \ge 2$ rows that correspond to mixture mass spectra and $R$ columns that correspond to observations at different mass-to-charge (m/z) ratios,
scaling the mixture data matrix by the maximal element of $\mathbf{X}$, $x_{\max}$:
$\mathbf{X} = \mathbf{X}/x_{\max}$  [I]
which yields a new data matrix $\mathbf{X}$ such that $0 \le x_{nr} \le 1$, $n = 1, \ldots, N$ and $r = 1, \ldots, R$,
representing the scaled mixture data matrix in [I] by the nonlinear mixture model:
$\mathbf{X} = \mathbf{f}(\mathbf{S})$  [II]
where $\mathbf{S}$ stands for an unknown nonnegative matrix comprised of $M > N$ rows $\{\mathbf{s}_m\}_{m=1}^{M}$ that correspond to the pure components' mass spectra and $R$ columns that correspond to observations at different m/z ratios; $\mathbf{f}(\mathbf{S})$ implies that the nonlinear mapping is performed column-wise: $\mathbf{x}_r = \mathbf{f}(\mathbf{s}_r)$, $r = 1, \ldots, R$, where $\mathbf{f}(\mathbf{s}_r) = [f_1(\mathbf{s}_r) \; \ldots \; f_N(\mathbf{s}_r)]^{T}$ and $\{f_n : \mathbb{R}_{0+}^{M} \to \mathbb{R}_{0+}\}_{n=1}^{N}$. Scaling [I] implies that $0 \le s_{mr} \le 1$, $m = 1, \ldots, M$ and $r = 1, \ldots, R$,
using a mixed-state probabilistic model for the amplitudes of the pure components' mass spectra $s_{mr}$:
$p(s_{mr}) = \rho_m \delta(s_{mr}) + (1 - \rho_m)\,\delta^{*}(s_{mr})\,f(s_{mr})$  [III]
where $\delta(s_{mr})$ is an indicator function and $\delta^{*}(s_{mr}) = 1 - \delta(s_{mr})$ is its complementary function, and $\rho_m$ stands for the probability that $s_{mr} = 0$. Thus, $1 - \rho_m$ stands for the probability that $s_{mr} > 0$.
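$f(s_{mr})$ is a continuous probability density function that models the sparse probability distribution of the amplitude $s_{mr}$.
representing [II] by using a truncated Taylor expansion:
$\mathbf{X} = \mathbf{G}_{s}\mathbf{S} + \mathbf{G}_{s^2}\left[\{\mathbf{s}_{m_1}\mathbf{s}_{m_2}\}_{m_1,m_2=1}^{M}\right] + \mathrm{HOT}$  [IV]
where $\{\mathbf{s}_{m_1}\mathbf{s}_{m_2}\}_{m_1,m_2=1}^{M}$ stand for second-order monomials that are cross-products between the pure components $\{\mathbf{s}_m\}_{m=1}^{M}$, $\mathbf{G}_{s}$ and $\mathbf{G}_{s^2}$ are matrices of appropriate dimensions, and HOT stands for higher-order terms that include monomials of order greater than 2,
applying robust principal component analysis to $\mathbf{X}$ in [IV] to obtain:
$\mathbf{X} = \mathbf{A} + \mathbf{E}$  [V]
where $\mathbf{A} \approx \mathbf{G}_{s}\mathbf{S} + \mathbf{G}_{s^2}\left[\{\mathbf{s}_{m_1}\mathbf{s}_{m_2}\}_{m_1,m_2=1}^{M}\right]$ stands for a low-rank matrix composed of a linear combination of the original pure components and a linear combination of second-order monomials that represent new components correlated with the original ones, and $\mathbf{E} \approx \mathrm{HOT}$ stands for a sparse matrix that represents error terms associated with higher-order monomials,
applying a hard thresholding operator to $\mathbf{X}$ in [IV] to obtain:
$\mathbf{B} \approx \mathbf{G}_{s}\mathbf{S} + \mathbf{G}_{s^2}\left[\{\mathbf{s}_{m_1}\mathbf{s}_{m_2}\}_{m_1,m_2=1}^{M}\right]$  [VI]
where $\mathbf{B}$ stands for the hard-thresholded version of $\mathbf{X}$ in [IV],
applying a soft thresholding operator to $\mathbf{X}$ in [IV] to obtain:
$\mathbf{C} \approx \mathbf{G}_{s}\mathbf{S} + \mathbf{G}_{s^2}\left[\{\mathbf{s}_{m_1}\mathbf{s}_{m_2}\}_{m_1,m_2=1}^{M}\right]$  [VII]
where $\mathbf{C}$ stands for the soft-thresholded version of $\mathbf{X}$ in [IV],
applying a trimmed thresholding operator to $\mathbf{X}$ in [IV] to obtain:
$\mathbf{D} \approx \mathbf{G}_{s}\mathbf{S} + \mathbf{G}_{s^2}\left[\{\mathbf{s}_{m_1}\mathbf{s}_{m_2}\}_{m_1,m_2=1}^{M}\right]$  [VIII]
where $\mathbf{D}$ stands for the trimmed-thresholded version of $\mathbf{X}$ in [IV],
using an empirical kernel map for nonlinear mapping of $\mathbf{A}$ in [V] onto a reproducing kernel Hilbert space:
$\Psi(\mathbf{A}) = \begin{bmatrix} \kappa(\mathbf{a}_1,\mathbf{v}_1) & \cdots & \kappa(\mathbf{a}_R,\mathbf{v}_1) \\ \vdots & \ddots & \vdots \\ \kappa(\mathbf{a}_1,\mathbf{v}_D) & \cdots & \kappa(\mathbf{a}_R,\mathbf{v}_D) \end{bmatrix}$  [IX]
where $\kappa(\mathbf{a}_r,\mathbf{v}_d)$, $r = 1, \ldots, R$ and $d = 1, \ldots, D$, stands for a positive symmetric kernel function and $\mathbf{v}_d$, $d = 1, \ldots, D$, stand for basis vectors that approximately span the same space as the vectors $\mathbf{a}_r$, $r = 1, \ldots, R$.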
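using an empirical kernel map for nonlinear mapping of $\mathbf{B}$ in [VI] onto a reproducing kernel Hilbert space:
$\Psi(\mathbf{B}) = \begin{bmatrix} \kappa(\mathbf{b}_1,\mathbf{v}_1) & \cdots & \kappa(\mathbf{b}_R,\mathbf{v}_1) \\ \vdots & \ddots & \vdots \\ \kappa(\mathbf{b}_1,\mathbf{v}_D) & \cdots & \kappa(\mathbf{b}_R,\mathbf{v}_D) \end{bmatrix}$  [X]
where the interpretation of $\Psi(\mathbf{B})$ is equivalent to that of $\Psi(\mathbf{A})$ in [IX],
using an empirical kernel map for nonlinear mapping of $\mathbf{C}$ in [VII] onto a reproducing kernel Hilbert space:
$\Psi(\mathbf{C}) = \begin{bmatrix} \kappa(\mathbf{c}_1,\mathbf{v}_1) & \cdots & \kappa(\mathbf{c}_R,\mathbf{v}_1) \\ \vdots & \ddots & \vdots \\ \kappa(\mathbf{c}_1,\mathbf{v}_D) & \cdots & \kappa(\mathbf{c}_R,\mathbf{v}_D) \end{bmatrix}$  [XI]
where the interpretation of $\Psi(\mathbf{C})$ is equivalent to that of $\Psi(\mathbf{A})$ in [IX],
using an empirical kernel map for nonlinear mapping of $\mathbf{D}$ in [VIII] onto a reproducing kernel Hilbert space:
$\Psi(\mathbf{D}) = \begin{bmatrix} \kappa(\mathbf{d}_1,\mathbf{v}_1) & \cdots & \kappa(\mathbf{d}_R,\mathbf{v}_1) \\ \vdots & \ddots & \vdots \\ \kappa(\mathbf{d}_1,\mathbf{v}_D) & \cdots & \kappa(\mathbf{d}_R,\mathbf{v}_D) \end{bmatrix}$  [XII]
where the interpretation of $\Psi(\mathbf{D})$ is equivalent to that of $\Psi(\mathbf{A})$ in [IX],
applying sparseness- and nonnegativity-constrained matrix factorization (sNMF) algorithms to [IX], [X], [XI] and [XII] to obtain estimates of the pure components $\{\mathbf{s}_m\}_{m=1}^{M}$ and some of their cross-products $\{\mathbf{s}_{m_1}\mathbf{s}_{m_2}\}_{m_1,m_2=1}^{M}$:
$\{\bar{\mathbf{s}}_m^{A}\}_{m=1}^{\hat{M}} = \mathrm{sNMF}(\Psi(\mathbf{A}))$  [XIII]
$\{\bar{\mathbf{s}}_m^{B}\}_{m=1}^{\hat{M}} = \mathrm{sNMF}(\Psi(\mathbf{B}))$  [XIV]
$\{\bar{\mathbf{s}}_m^{C}\}_{m=1}^{\hat{M}} = \mathrm{sNMF}(\Psi(\mathbf{C}))$  [XV]
$\{\bar{\mathbf{s}}_m^{D}\}_{m=1}^{\hat{M}} = \mathrm{sNMF}(\Psi(\mathbf{D}))$  [XVI]
where $\hat{M}$ denotes the overall number of components separated in [XIII], [XIV], [XV] and [XVI],
estimating further the pure components by correlating $\{\bar{\mathbf{s}}_m^{A}\}_{m=1}^{\hat{M}}$ from [XIII], $\{\bar{\mathbf{s}}_m^{B}\}_{m=1}^{\hat{M}}$ from [XIV], $\{\bar{\mathbf{s}}_m^{C}\}_{m=1}^{\hat{M}}$ from [XV] and $\{\bar{\mathbf{s}}_m^{D}\}_{m=1}^{\hat{M}}$ from [XVI] with the components stored in the library composed of $J$ reference compounds $\{\mathbf{s}_j^{ref}\}_{j=1}^{J}$:
$c_{mj}^{A} = \arg\max_{j=1,\ldots,J} \dfrac{\langle \bar{\mathbf{s}}_m^{A}, \mathbf{s}_j^{ref} \rangle}{\|\bar{\mathbf{s}}_m^{A}\|\,\|\mathbf{s}_j^{ref}\|}, \quad m = 1, \ldots, \hat{M}$  [XVII]
$c_{mj}^{B} = \arg\max_{j=1,\ldots,J} \dfrac{\langle \bar{\mathbf{s}}_m^{B}, \mathbf{s}_j^{ref} \rangle}{\|\bar{\mathbf{s}}_m^{B}\|\,\|\mathbf{s}_j^{ref}\|}, \quad m = 1, \ldots, \hat{M}$  [XVIII]
$c_{mj}^{C} = \arg\max_{j=1,\ldots,J} \dfrac{\langle \bar{\mathbf{s}}_m^{C}, \mathbf{s}_j^{ref} \rangle}{\|\bar{\mathbf{s}}_m^{C}\|\,\|\mathbf{s}_j^{ref}\|}, \quad m = 1, \ldots, \hat{M}$  [XIX]
$c_{mj}^{D} = \arg\max_{j=1,\ldots,J} \dfrac{\langle \bar{\mathbf{s}}_m^{D}, \mathbf{s}_j^{ref} \rangle}{\|\bar{\mathbf{s}}_m^{D}\|\,\|\mathbf{s}_j^{ref}\|}, \quad m = 1, \ldots, \hat{M}$  [XX]
where $\langle \bar{\mathbf{s}}_m^{A}, \mathbf{s}_j^{ref} \rangle$, $\langle \bar{\mathbf{s}}_m^{B}, \mathbf{s}_j^{ref} \rangle$, $\langle \bar{\mathbf{s}}_m^{C}, \mathbf{s}_j^{ref} \rangle$ and $\langle \bar{\mathbf{s}}_m^{D}, \mathbf{s}_j^{ref} \rangle$ denote the inner products between, respectively, $\bar{\mathbf{s}}_m^{A}$, $\bar{\mathbf{s}}_m^{B}$, $\bar{\mathbf{s}}_m^{C}$, $\bar{\mathbf{s}}_m^{D}$ and $\mathbf{s}_j^{ref}$. $\|\bar{\mathbf{s}}_m^{A}\|$, $\|\bar{\mathbf{s}}_m^{B}\|$, $\|\bar{\mathbf{s}}_m^{C}\|$, $\|\bar{\mathbf{s}}_m^{D}\|$ and $\|\mathbf{s}_j^{ref}\|$ denote, respectively, the $\ell_2$-norms of $\bar{\mathbf{s}}_m^{A}$, $\bar{\mathbf{s}}_m^{B}$, $\bar{\mathbf{s}}_m^{C}$, $\bar{\mathbf{s}}_m^{D}$ and $\mathbf{s}_j^{ref}$.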
assigning to each component in the library $\{\mathbf{s}_j^{ref}\}_{j=1}^{J}$ the components separated in [XIII], [XIV], [XV] and [XVI] that are indexed according to:
$[c_A, m_A^{*}] = \arg\max_{m} \{c_{mj}^{A}\}_{m=1}^{A_j}$  [XXI]
$[c_B, m_B^{*}] = \arg\max_{m} \{c_{mj}^{B}\}_{m=1}^{B_j}$  [XXII]
$[c_C, m_C^{*}] = \arg\max_{m} \{c_{mj}^{C}\}_{m=1}^{C_j}$  [XXIII]
$[c_D, m_D^{*}] = \arg\max_{m} \{c_{mj}^{D}\}_{m=1}^{D_j}$  [XXIV]
where $A_j$, $B_j$, $C_j$ and $D_j$ respectively stand for the number of separated components $\{\bar{\mathbf{s}}_m^{A}\}_{m=1}^{\hat{M}}$, $\{\bar{\mathbf{s}}_m^{B}\}_{m=1}^{\hat{M}}$, $\{\bar{\mathbf{s}}_m^{C}\}_{m=1}^{\hat{M}}$ and $\{\bar{\mathbf{s}}_m^{D}\}_{m=1}^{\hat{M}}$ associated, respectively in [XVII], [XVIII], [XIX] and [XX], with the reference component $\mathbf{s}_j^{ref}$,
obtaining final estimates of the candidates for pure components $\{\hat{\mathbf{s}}_j\}_{j=1}^{J}$ according to:
$I = \arg\max_{A,B,C,D} (c_A, c_B, c_C, c_D), \quad \hat{\mathbf{s}}_j = \bar{\mathbf{s}}_{m_I^{*}}, \quad j = 1, \ldots, J \ \text{and} \ I \in \{A, B, C, D\}$  [XXV]
where separated components $\{\bar{\mathbf{s}}_m^{A}\}_{m=1}^{\hat{M}}$, $\{\bar{\mathbf{s}}_m^{B}\}_{m=1}^{\hat{M}}$, $\{\bar{\mathbf{s}}_m^{C}\}_{m=1}^{\hat{M}}$ and $\{\bar{\mathbf{s}}_m^{D}\}_{m=1}^{\hat{M}}$ that are not assigned to the pure components from the library $\{\hat{\mathbf{s}}_j\}_{j=1}^{J}$ are considered as candidates for new pure components,
presenting the estimated candidates for pure components $\{\hat{\mathbf{s}}_j\}_{j=1}^{J}$ and the candidates for new pure components from [XXV].
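The scaling step [I], the hard and soft thresholding steps and the empirical kernel map of the claim can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the claim does not define the thresholding operators, the threshold value, the kernel or the choice of basis vectors, so the textbook hard and soft operators, the threshold 0.1, a Gaussian kernel with width `sigma`, and a basis made of every fifth column are all hypothetical choices. Robust principal component analysis, trimmed thresholding and the sNMF factorization are left out because the claim gives no formulas for them.

```python
import numpy as np

def hard_threshold(X, tau):
    """Keep entries whose magnitude exceeds tau and zero the rest."""
    return np.where(np.abs(X) > tau, X, 0.0)

def soft_threshold(X, tau):
    """Shrink every magnitude by tau; entries below tau become zero."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def empirical_kernel_map(A, V, sigma=1.0):
    """Return the D x R matrix with entries kappa(a_r, v_d).

    A: (N x R) preprocessed mixtures, V: (N x D) basis vectors.
    Assumption: Gaussian kernel kappa(a, v) = exp(-||a - v||^2 / (2 sigma^2)).
    """
    sq_dist = (
        np.sum(V**2, axis=0)[:, None]
        + np.sum(A**2, axis=0)[None, :]
        - 2.0 * (V.T @ A)
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * sigma**2))

# Synthetic stand-in for recorded mixtures: N = 3 spectra, R = 50 m/z bins.
rng = np.random.default_rng(0)
X = rng.random((3, 50))
X = X / X.max()                 # scaling by the maximal element, as in [I]

B = hard_threshold(X, 0.1)      # hard-thresholded mixtures
C = soft_threshold(X, 0.1)      # soft-thresholded mixtures

V = B[:, ::5]                   # hypothetical basis: every 5th column, D = 10
Psi_B = empirical_kernel_map(B, V)   # D x R mapped matrix, layout as in [X]
```

The mapped matrix `Psi_B` has one row per basis vector and one column per m/z observation, which matches the layout of the matrices in [IX] to [XII]; a nonnegative sparse factorization would then be applied to it.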
Address: Zagreb HR