Title of invention: Method for classifying sub-trees in semi-structured documents
Abstract: A method and system for classifying semi-structured documents by treating the structural information of a sub-tree as a distinct representative characteristic of the document fragment identified by a sub-tree node. The structural information comprises an inner structure and an outer structure, each of which can be used on its own as representative data in a probabilistic classifier for classifying the sub-tree itself or the entire document. Additional representative feature data can also be used independently for classification: the data content of the fragment structurally represented by the sub-tree, and the node attributes. The classification values generated independently from each of the different feature sets can then be combined in an assembly classifier to produce an automated classification system.
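The approach outlined in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that "inner structure" means the tags of a node's descendants and "outer structure" means the tags of its ancestors, uses a hand-written multinomial naive Bayes model as the probabilistic classifier, and combines the per-view results by summing log scores. All names (`views`, `NaiveBayes`, `train`, `classify`), the sample documents, and the labels are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

VIEWS = ("inner", "outer", "content", "attrs")


def parent_map(root):
    """ElementTree has no parent pointers, so build a child -> parent map."""
    return {child: parent for parent in root.iter() for child in parent}


def views(node, parents):
    """Four independent feature lists for the sub-tree rooted at node."""
    outer, cur = [], node
    while cur in parents:  # walk up to the document root
        cur = parents[cur]
        outer.append("up:" + cur.tag)
    return {
        # Assumption: inner structure = tags of all descendants.
        "inner": ["down:" + d.tag for d in node.iter() if d is not node],
        # Assumption: outer structure = tags of all ancestors.
        "outer": outer,
        "content": "".join(node.itertext()).lower().split(),
        "attrs": ["@" + name for name in node.attrib],
    }


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, samples):
        for feats, label in samples:
            self.class_counts[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def log_scores(self, feats):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            # The extra +1 reserves probability mass for unseen features.
            denom = sum(self.feat_counts[label].values()) + len(self.vocab) + 1
            score = math.log(n / total)
            for f in feats:
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            scores[label] = score
        return scores


def train(labelled):
    """Fit one classifier per view from (views_dict, label) pairs."""
    return {v: NaiveBayes().fit([(x[v], y) for x, y in labelled]) for v in VIEWS}


def classify(models, x):
    """Combine the per-view classifiers by summing their log scores."""
    combined = Counter()
    for v, model in models.items():
        for label, s in model.log_scores(x[v]).items():
            combined[label] += s
    return max(combined, key=combined.get)


# Hypothetical training document and labels.
train_root = ET.fromstring(
    "<doc><header><title lang='en'>Annual report</title></header>"
    "<body><section id='s1'><p>Revenue grew this year</p></section>"
    "<section id='s2'><p>Costs fell this year</p></section></body></doc>"
)
train_parents = parent_map(train_root)
labels = {"header": "metadata", "section": "content"}
labelled = [
    (views(n, train_parents), labels[n.tag])
    for n in train_root.iter() if n.tag in labels
]
models = train(labelled)

# Classify an unseen sub-tree from a second hypothetical document.
new_root = ET.fromstring(
    "<doc><body><section id='s9'><p>Profit grew</p></section></body></doc>"
)
new_node = new_root.find("body/section")
prediction = classify(models, views(new_node, parent_map(new_root)))
print(prediction)
```

Summing log scores is only one simple way to combine the four classifiers; the abstract does not state which combination rule the assembly classifier uses.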
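Publication number: US2006288275(A1)   Publication date: 2006.12.21
Application number: US20050156776   Filing date: 2005.06.20
Applicant: XEROX CORPORATION   Inventors: CHIDLOVSKII BORIS; FUSELIER JEROME
Classification: G06F17/00   Main classification: G06F17/00
Agency:   Agent:
Main claim:
Address: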