Title of invention: Webpage entity extraction through joint understanding of page structures and sentences
Abstract: Described is a technology for understanding entities of a webpage, e.g., to label the entities on the webpage. An iterative and bidirectional framework processes a webpage, including a text understanding component (e.g., extended Semi-CRF model) that provides text segmentation features to a structure understanding component (e.g., extended HCRF model). The structure understanding component uses the text segmentation features and visual layout features of the webpage to identify a structure (e.g., labeled block). The text understanding component in turn uses the labeled block to further understand the text. The process continues iteratively until a similarity criterion is met, at which time the entities may be labeled. Also described is the use of multiple mentions of a set of text in the webpage to help in labeling an entity.
Publication number: US9092424(B2) Publication date: 2015.07.28
Application number: US200912569912 Filing date: 2009.09.30
Applicant: Microsoft Technology Licensing, LLC Inventors: Nie Zaiqing;Cao Yong;Wen Ji-Rong;Yang Chunyu
Classification: G06F17/00;G06F17/27 Main classification: G06F17/00
Agency: Agents: Wight Steve;Yee Judy;Minhas Micky
Main claim: 1. In a computing environment, a method comprising: processing a webpage to understand one or more entities of the webpage by bidirectional integration of web structure understanding and text understanding, including understanding text of the webpage into text segmentation data, using the text segmentation data of understanding the text and visual layout features of the webpage to produce webpage structure information including a labeled block, and using the webpage structure information including the labeled block to further understand the text of the webpage including the one or more entities, wherein understanding the text of the webpage and understanding the structure of the webpage are performed iteratively until an iteration similarity stop criterion is met.
Address: Redmond WA US
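
The iterative, bidirectional loop described in the abstract and main claim can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the functions `toy_text_pass` and `toy_structure_pass` are hypothetical rule-based stand-ins for the extended Semi-CRF and extended HCRF models, one label per block replaces real text segmentation, visual layout features are omitted, and `agreement` is just one possible similarity stop criterion. It is not the patented implementation.

```python
from typing import List, Optional, Tuple

Labels = List[str]


def toy_text_pass(blocks: List[str], structure: Optional[Labels]) -> Labels:
    """Hypothetical stand-in for text understanding (one label per block)."""
    if structure is None:
        # First pass: no structure yet, so guess from surface form only.
        return ["NAME" if b.istitle() and len(b.split()) == 2 else "O"
                for b in blocks]
    # Later passes: let the labeled blocks guide the text labels.
    by_block = {"person_header": "NAME", "person_detail": "TITLE"}
    return [by_block.get(s, "O") for s in structure]


def toy_structure_pass(blocks: List[str], segmentation: Labels) -> Labels:
    """Hypothetical stand-in for structure understanding (labels each block)."""
    structure: Labels = []
    seen_header = False
    for label in segmentation:
        if label == "NAME" and not seen_header:
            structure.append("person_header")
            seen_header = True
        elif label in ("NAME", "TITLE") and seen_header:
            structure.append("person_detail")
        else:
            structure.append("other")
    return structure


def agreement(a: Labels, b: Labels) -> float:
    """Fraction of positions on which two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), 1)


def joint_understanding(
    blocks: List[str], threshold: float = 1.0, max_iters: int = 10
) -> Tuple[Labels, Optional[Labels], int]:
    """Alternate text and structure passes until successive text labelings
    are similar enough (assumed stop criterion) or max_iters is reached."""
    structure: Optional[Labels] = None
    previous: Optional[Labels] = None
    segmentation: Labels = []
    iteration = 0
    for iteration in range(1, max_iters + 1):
        segmentation = toy_text_pass(blocks, structure)         # text -> features
        structure = toy_structure_pass(blocks, segmentation)    # features -> blocks
        if previous is not None and agreement(previous, segmentation) >= threshold:
            break
        previous = segmentation
    return segmentation, structure, iteration


page = ["Jane Doe", "Senior Researcher", "Contact us"]
labels, block_labels, n_iters = joint_understanding(page)
```

In this toy run the first text pass mislabels "Senior Researcher" as a name; the structure pass marks it as a detail block under the header, and the next text pass relabels it as a title. The loop stops once two successive text labelings agree.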