Title of invention: DATA ANALYSIS METHOD
Abstract: Current classification methods attempt to classify each classification value into a separate class. Consequently, considerable effort is dedicated to distinguishing between two or more similar classification objects, so that supervised learning procedures are slow and produce classifiers that are excessively large. Moreover, the classifiers are often difficult to understand and take a long time to generate. Embodiments of the invention are concerned with reducing the number of classification values that can be used to classify a data item. Relationships between classification values are identified on the basis of attribute values in a set of training data, and those classification values that are determined to be related to one another are subsumed into a single classification group. An embodiment of the invention is thus concerned with identifying groups of classification values corresponding to a set of data, where each data item in the set is characterised by a plurality of attributes, and each attribute has one of a plurality of attribute values associated therewith.
The method comprises the steps of:
(i) selecting an attribute;
(ii) identifying, on the basis of the distribution of attribute values, two classification values that are least similar to one another and allocating a first identified classification value to a first group and a second identified classification value to a second group;
(iii) allocating each unidentified classification value to one of the groups in dependence on correlation between the unidentified classification value and the first and second identified classification values;
(iv) evaluating an association between the first and second groups and the selected attribute;
(v) repeating steps (i) to (iv) for each of at least some of the plurality of attributes;
(vi) comparing associations evaluated at step (iv) and selecting first and second groups corresponding to the weakest association;
(vii) for each of the first and second groups repeating steps (i) to (vi) for the classification values therein, until the association evaluated at step (iv) falls below a predetermined threshold value.
Essentially, classification groups are repeatedly analysed with respect to a range of attributes so as to identify all possible groupings of classification values. For example, classification values Daily Mail, Daily Express, The Times, The Guardian, Vogue, New Scientist, Economist, Cosmopolitan, FHM, House and Garden are analysed with respect to a selection of attributes (e.g. sex, age, occupation etc.). Assuming that the analysis identifies the classification values as falling within two classification groups: [Daily Mail, Daily Express, Cosmopolitan, FHM] and [The Times, The Guardian, Vogue, New Scientist, Economist, House and Garden], each of these groups is then analysed with respect to the same, or a different, selection of attributes. This second round of analysis may identify further clusters of classification values, e.g. the analysis could show that the classification values in the latter group are clustered into two distinct groups: [House and Garden, Vogue] and [The Times, The Guardian, New Scientist, Economist]. After each ro
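Steps (i) to (vii) of the abstract can be illustrated with a minimal Python sketch. It is not the applicants' implementation, and several choices are assumptions made only for illustration: total variation distance stands in for the "least similar" and "correlation" measures, Cramér's V stands in for the "association", the threshold of 0.3 is arbitrary, and the names `distributions`, `distance`, `split_on`, `association` and `group` are hypothetical. The `select` parameter defaults to `min` to mirror the abstract's wording ("weakest association"). The toy rows use placeholder classification values A to D, not real readership data.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def distributions(rows, attr, classes):
    """Relative frequency of each value of `attr`, per classification value."""
    counts = {c: Counter() for c in classes}
    for row in rows:
        if row["class"] in counts:
            counts[row["class"]][row[attr]] += 1
    return {c: {v: n / sum(cnt.values()) for v, n in cnt.items()}
            for c, cnt in counts.items()}


def distance(p, q):
    """Total variation distance (assumed dissimilarity measure)."""
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))


def split_on(rows, attr, classes):
    dist = distributions(rows, attr, classes)
    # Step (ii): the two least similar classification values seed the groups.
    a, b = max(combinations(classes, 2),
               key=lambda pair: distance(dist[pair[0]], dist[pair[1]]))
    g1, g2 = [a], [b]
    # Step (iii): every other value joins the seed it resembles more.
    for c in classes:
        if c not in (a, b):
            closer_to_a = distance(dist[c], dist[a]) <= distance(dist[c], dist[b])
            (g1 if closer_to_a else g2).append(c)
    return g1, g2


def association(rows, attr, g1, g2):
    """Step (iv): Cramér's V between group membership and `attr` (assumed)."""
    table = defaultdict(lambda: [0, 0])
    for row in rows:
        if row["class"] in g1:
            table[row[attr]][0] += 1
        elif row["class"] in g2:
            table[row[attr]][1] += 1
    col = [sum(cell[j] for cell in table.values()) for j in (0, 1)]
    n = sum(col)
    if min(col) == 0 or len(table) < 2:
        return 0.0
    chi2 = 0.0
    for cell in table.values():
        for j in (0, 1):
            expected = sum(cell) * col[j] / n
            chi2 += (cell[j] - expected) ** 2 / expected
    return sqrt(chi2 / n)  # two columns, so min(rows - 1, cols - 1) == 1


def group(rows, attrs, classes, threshold=0.3, select=min):
    classes = list(classes)
    if len(classes) < 2:
        return [classes]
    # Steps (i) and (v): try each attribute in turn.
    candidates = []
    for attr in attrs:
        g1, g2 = split_on(rows, attr, classes)
        candidates.append((association(rows, attr, g1, g2), g1, g2))
    # Step (vi): pick one candidate split by its association score.
    score, g1, g2 = select(candidates, key=lambda cand: cand[0])
    # Step (vii): stop once the association falls below the threshold.
    if score < threshold:
        return [classes]
    return (group(rows, attrs, g1, threshold, select)
            + group(rows, attrs, g2, threshold, select))


# Toy data: A and B share one profile, C and D another.
rows = [{"class": c, "sex": s, "age": a}
        for c, s, a in [("A", "m", "young"), ("B", "m", "young"),
                        ("C", "f", "old"), ("D", "f", "old")]
        for _ in range(2)]
print(group(rows, ["sex", "age"], ["A", "B", "C", "D"]))  # [['A', 'B'], ['C', 'D']]
```

Passing `select=max` would instead keep the split with the strongest association at each level; which rule suits a given data set is left open here.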
Publication number: WO03090117(A1)  Publication date: 2003.10.30
Application number: WO2003GB01471  Filing date: 2003.04.04
Applicant: BRITISH TELECOMMUNICATIONS PUBLIC LIMITED COMPANY; HO, COLIN, KOK, MENG; NAUCK, DETLEF, DANIEL  Inventor: HO, COLIN, KOK, MENG; NAUCK, DETLEF, DANIEL
Classification: G06F17/30; G06K9/62; G06K9/68; (IPC1-7): G06F17/30  Main classification: G06F17/30
Agency:  Agent:
Principal claim:
Address: