Title of invention: HIERARCHICAL CONDITIONAL RANDOM FIELDS FOR WEB EXTRACTION
Abstract: A method and system for labeling object information of an information page is provided. A labeling system identifies an object record of an information page based on the labeling of object elements within an object record and labels object elements based on the identification of an object record that contains the object elements. To identify the records and label the elements, the labeling system generates a hierarchical representation of blocks of an information page. The labeling system identifies records and elements within the records by propagating probability-related information of record labels and element labels through the hierarchy of the blocks. The labeling system generates a feature vector for each block to represent the block and calculates a probability of a label for a block being correct based on a score derived from the feature vectors associated with related blocks. The labeling system searches for the labeling of records and elements that has the highest probability of being correct.
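The joint search described in the abstract can be illustrated with a minimal sketch. It is not the patented method: it assumes a plain tree of blocks, scores only block/label and parent/child label pairs, uses hand-set weights instead of learned ones, and maximizes an unnormalized score rather than a calibrated probability. The names `Block`, `LABELS`, `WEIGHTS`, `EDGE`, and `best_labeling`, and the two features, are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical label set; the abstract only names record and element labels.
LABELS = ("record", "element", "other")


@dataclass
class Block:
    name: str
    features: list          # feature vector representing the block
    children: list = field(default_factory=list)


def node_score(block, label, weights):
    """Linear score of one label for one block (dot product with its features)."""
    return sum(w * f for w, f in zip(weights[label], block.features))


def best_labeling(root, weights, edge):
    """Return (score, {block name: label}) with the highest total score.

    Scores are propagated bottom-up through the block hierarchy, so the
    label chosen for a record depends on its elements and vice versa.
    """
    def up(block):
        child_tables = [up(child) for child in block.children]
        table = {}
        for label in LABELS:
            score = node_score(block, label, weights)
            assign = {block.name: label}
            for ct in child_tables:
                # Best child label given this parent label.
                cl = max(LABELS, key=lambda c: ct[c][0] + edge[(label, c)])
                score += ct[cl][0] + edge[(label, cl)]
                assign.update(ct[cl][1])
            table[label] = (score, assign)
        return table

    table = up(root)
    return table[max(LABELS, key=lambda l: table[l][0])]


# Assumed features: [looks like a repeated container, contains attribute-like text]
WEIGHTS = {"record": [2.0, -1.0], "element": [-1.0, 2.0], "other": [0.0, 0.0]}

# Assumed parent/child compatibility scores; unlisted pairs score 0.
EDGE = {(p, c): 0.0 for p in LABELS for c in LABELS}
EDGE[("record", "element")] = 1.0
EDGE[("other", "record")] = 0.5

page = Block("page", [0.0, 0.0], [
    Block("item", [1.0, 0.0], [
        Block("title", [0.0, 1.0]),
        Block("price", [0.0, 1.0]),
    ]),
])

score, labels = best_labeling(page, WEIGHTS, EDGE)
print(score, labels)
```

With these assumed weights the search labels `item` as a record and `title` and `price` as elements, because the record/element compatibility score rewards labeling them jointly.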
Publication number: US2010281009(A1) Publication date: 2010.11.04
Application number: US20100776308 Filing date: 2010.05.07
Applicant: MICROSOFT CORPORATION Inventors: WEN JI-RONG;MA WEI-YING;NIE ZAIQING;ZHU JUN
Classification: G06F7/00;G06F17/30 Main classification: G06F7/00
Agency: Agent:
Main claim:
Address: