Title of invention METHOD FOR DISCOVERING DATA ARTIFACTS IN AN ON-LINE DATA OBJECT
Abstract A method for discovering data artifacts in an on-line data object is described. One embodiment parses the on-line data object into at least one string; divides each string into a set of separate characters; for each set of separate characters, aggregates the separate characters in that set into a sequence of tokens, each token in the sequence being one of a word, a punctuation symbol, a HyperText-Markup-Language tag, and a number; for each sequence of tokens during a first analysis phase, determines, for each of a plurality of rule sets, whether the sequence of tokens includes one or more candidate data artifacts of a distinct type to which that rule set corresponds, each of the plurality of rule sets being adapted to discovery of the distinct type of data artifact to which that rule set corresponds, at least one rule set in the plurality of rule sets including a context-free grammar; computes, for each candidate data artifact of a distinct type, a probability ranking indicating a degree of likelihood that the candidate data artifact is a data artifact of that distinct type; classifies each candidate data artifact as a data artifact of the distinct type for which a most favorable probability ranking was computed for that candidate data artifact; associates with each classified data artifact a subject found within the on-line data object; and stores the classified data artifacts in a storage subsystem that includes at least one data structure, the classified data artifacts in the storage subsystem being indexed and organized by subject for retrieval in response to a search query indicating a particular subject.
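The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the artifact types (`year`, `quantity`), the fixed scores standing in for probability rankings, the use of the first `<h1>` element as the subject, and every function name are assumptions made for illustration. The rule sets here are simple predicates, whereas the abstract says at least one rule set includes a context-free grammar, and a single regular expression stands in for the character-by-character aggregation into tokens.

```python
import re
from collections import defaultdict

# Token kinds named in the abstract: HTML tag, number, word, punctuation symbol.
TOKEN_RE = re.compile(r"</?[A-Za-z][^>]*>|[0-9]+|[A-Za-z]+|[^\sA-Za-z0-9]")


def tokenize(string):
    """Aggregate the characters of a string into a sequence of tokens."""
    return TOKEN_RE.findall(string)


def is_number(token):
    return token.isascii() and token.isdigit()


def year_rules(tokens):
    # Hypothetical rule set: four-digit numbers in a plausible range.
    return [(i, 0.9) for i, t in enumerate(tokens)
            if is_number(t) and len(t) == 4 and 1900 <= int(t) <= 2100]


def quantity_rules(tokens):
    # Hypothetical rule set: any number is a weak "quantity" candidate.
    return [(i, 0.5) for i, t in enumerate(tokens) if is_number(t)]


# One rule set per distinct artifact type (names are illustrative).
RULE_SETS = {"year": year_rules, "quantity": quantity_rules}


def find_subject(tokens):
    """Assumed heuristic: the subject is the text of the first <h1> element."""
    try:
        start = tokens.index("<h1>")
        end = tokens.index("</h1>", start)
    except ValueError:
        return "unknown"
    return " ".join(tokens[start + 1:end])


def discover(data_object, index=None):
    """Classify candidates by best score and index them by subject."""
    index = defaultdict(list) if index is None else index
    subject = find_subject(tokenize(data_object))
    for string in data_object.splitlines():  # parse the object into strings
        tokens = tokenize(string)
        best = {}  # token position -> (score, artifact type)
        for kind, rules in RULE_SETS.items():
            for pos, score in rules(tokens):
                if pos not in best or score > best[pos][0]:
                    best[pos] = (score, kind)
        for pos in sorted(best):
            index[subject].append((tokens[pos], best[pos][1]))
    return index
```

A search query for a particular subject then reduces to a dictionary lookup, for example `discover(page)["Release notes"]`. A real storage subsystem would presumably persist this index rather than keep it in memory.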
Publication number US2008147588(A1) Publication date 2008.06.19
Application number US20070683936 Filing date 2007.03.08
Applicant Inventor LEFFINGWELL DEAN; MILLER JEREMIE; WIDRIG DONALD R.; KOROLEV ALEKSEY; YAKYMA OLEKSANDR
Classification G06N5/02 Main classification G06N5/02
Agency Agent
Main claim
Address