Title: Automatic classification of segmented portions of web pages
Abstract: Exemplary methods and apparatuses are provided which may be used for classifying and indexing segmented portions of web pages and providing related information for use in information extraction and/or information retrieval systems. In an embodiment, an index of segmented portions may be used by a search engine to respond to a search query. In an embodiment, one or more machine learned models may be used to identify one or more feature properties of a plurality of segmented portions within one or more files, or otherwise inferable from the one or more files. In an embodiment, one or more machine learned models may be used to classify one or more of a plurality of segmented portions as being at least one of a plurality of segment types.
Publication number: US9514216(B2) Publication date: 2016.12.06
Application number: US201414480528 Filing date: 2014.09.08
Applicant: Yahoo! Inc. Inventors: Duan Lei;Li Fan;Vadrevu Srinivas;Velipasaoglu Emre;Hajela Swapnil;Chakrabarti Deepayan
Classification: G06F17/30;G06F15/18;G06K9/62;G06N5/04;G06N99/00;G06Q10/10 Main classification: G06F17/30
Agency: Berkeley Law & Technology Group, LLP Agent: Berkeley Law & Technology Group, LLP
Main claim: 1. A method comprising: with one or more special purpose computing devices coupled to a memory: accessing a plurality of segmented portions of at least one of a plurality of displayable web pages represented by one or more digital signals of one or more files stored in a memory, wherein a particular displayable web page of the plurality of displayable web pages comprises at least two of the plurality of segmented portions; using one or more machine learned models for: identifying one or more feature properties of the plurality of segmented portions within the one or more files, or otherwise inferable from the one or more files, classifying the at least two of the plurality of segmented portions as being at least one of a plurality of segment types based, at least in part, on the one or more identified feature properties, the one or more identified feature properties comprising at least language feature properties of a language model of content to be displayed in one or more of the at least two of the plurality of segmented portions, and determining content quality scores for at least two of the plurality of segmented portions of at least the particular displayable web page; and storing one or more digital signals in the memory as part of an index for the plurality of segmented portions, the index being based, at least in part, on the segment type, the index indicating the content quality scores.
Address: Sunnyvale CA US
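
Illustrative sketch: The steps recited in the main claim (identify language-model feature properties per segment, classify each segment into a segment type, determine a content quality score, and store an index keyed by segment type) can be illustrated with a minimal Python sketch. This is not the patented implementation. The segment type names ("navigation", "main_content"), the training snippets, the add-one-smoothed unigram language model, the `quality_score` heuristic, and the example URL are all assumptions introduced here for illustration; the record does not specify any of them.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled snippets; the record names no segment types or training data.
TRAINING = {
    "navigation": ["home about contact login", "home products news contact"],
    "main_content": [
        "the study found that search quality improved",
        "the model was trained on labeled pages",
    ],
}


def train_unigram_models(training):
    """Build one unigram count table per segment type, plus a shared vocabulary."""
    counts = {
        label: Counter(w for text in texts for w in text.lower().split())
        for label, texts in training.items()
    }
    vocab = {w for c in counts.values() for w in c}
    return counts, vocab


def log_likelihood(text, counter, vocab):
    """Language feature: average per-token log-probability with add-one smoothing."""
    tokens = text.lower().split()
    if not tokens:
        return float("-inf")
    denom = sum(counter.values()) + len(vocab) + 1  # +1 reserves mass for unseen words
    return sum(math.log((counter[t] + 1) / denom) for t in tokens) / len(tokens)


def classify(text, counts, vocab):
    """Assign the segment type whose language model scores the text highest."""
    scores = {label: log_likelihood(text, c, vocab) for label, c in counts.items()}
    return max(scores, key=scores.get)


def quality_score(text):
    """Toy quality score in [0, 1]: distinct-token ratio, damped for short segments."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return (len(set(tokens)) / len(tokens)) * min(1.0, len(tokens) / 8)


def build_index(pages, counts, vocab):
    """Index every segment by its predicted type, recording its quality score."""
    index = defaultdict(list)
    for url, segments in pages.items():
        for position, text in enumerate(segments):
            index[classify(text, counts, vocab)].append(
                {"url": url, "segment": position, "quality": quality_score(text)}
            )
    return dict(index)


counts, vocab = train_unigram_models(TRAINING)
# One page with two segmented portions, as the claim requires at least two per page.
pages = {"http://example.com/a": ["home news contact", "the model found that pages improved"]}
index = build_index(pages, counts, vocab)
```

A retrieval system could then consult only `index["main_content"]` and rank entries by their `quality` field, which is one plausible reading of how an index that reflects segment type and indicates content quality scores might be used to answer a search query.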