Title of invention: System, method, and computer program product for generation of local content corpus
Abstract: A method for generating a body of content relevant to a geographical space can comprise building a gazette containing a lexicon of at least people, places, and organizations. A system can process content obtained from a plurality of sources to bootstrap an initial set of entities for each of the places in the gazette. A local content corpus can be created utilizing the initial set of entities. This bootstrapping process may utilize geocodes and/or heuristics that are topological, people-oriented, place-oriented, etc. The bootstrapping may further comprise ascribing the content based on human-curated documents known to be local to the place. Documents in the local content corpus are semantically related to each other with respect to the place.
Publication number: US9563644(B1) Publication date: 2017.02.07
Application number: US201213444691 Filing date: 2012.04.11
Applicant: Groupon, Inc. Inventors: Castillo Roger H.; Jack Thomas
Classification: G06F7/00; G06F17/30 Main classification: G06F7/00
Agency: Alston & Bird LLP Agent: Alston & Bird LLP
Principal claim: 1. A computer-implemented method, comprising: building a gazette containing a set of local terms referencing at least people, places, and organizations; for each place in the gazette, generating an initial set of local terms, generating the initial set of local terms comprising: accessing a document, and calculating an initial weighting for at least a portion of the initial set of local terms; creating an initial local content corpus utilizing the initial set of terms, the initial local content corpus containing documents that are semantically related to each other with respect to the place, the local content corpus configured for providing the system with a local entity weighting of each of the local terms associated with each of the places; and monitoring a subset of content sources, each of the subset of content sources identified as comprising local content, wherein when updated content is found, updating existing local entity weighting for each of the local terms in view of the updated content; identifying and indexing additional local content by utilizing the initial local content corpus for targeted web crawling by: performing, for each place, an additional search utilizing each of the local terms and the local entity weighting of each of the local terms to identify additional local content; updating the local entity weighting of each of the local terms in view of the additional local content; updating the initial local content corpus with the updated local entity weighting of each of the local terms; and performing a second additional search utilizing the updated local content corpus with the updated local entity weighting of each of the local terms, wherein the identifying and indexing additional local content is subject to a collection algorithm, the collection algorithm comprising: receiving a target number of content associated with each of a plurality of categories, the target number of content indicative of a minimum threshold number of content from each of the plurality of categories and a maximum number of content from each of the plurality of categories; pulling content from an index; and classifying the pulled content into one of the plurality of categories; and in an instance in which the minimum threshold is not met for a particular category, climbing a place hierarchy to identify additional content sources related to the particular category; and in response to a search from a browsing application, ranking the documents in the local content corpus according to a relevancy of the place.
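Illustrative sketch (not part of the patent text): the collection algorithm recited at the end of the claim can be read as a loop that pulls content from an index, classifies it, caps each category at its maximum, and climbs the place hierarchy only for categories still below their minimum. The Python below is one possible reading under stated assumptions. The function name collect, the dictionary-based index, the parent_of mapping, the classify callback, and all sample data are hypothetical and do not come from the patent.

```python
def collect(place, index, parent_of, classify, targets):
    """Gather content per category, climbing the place hierarchy when short.

    targets maps category -> (minimum, maximum); all names are hypothetical.
    """
    collected = {category: [] for category in targets}
    wanted = set(targets)  # at the starting place every category is eligible
    current = place
    while current is not None and wanted:
        for doc in index.get(current, []):  # pull content from the index
            category = classify(doc)        # classify the pulled content
            if (category in wanted
                    and doc not in collected[category]
                    and len(collected[category]) < targets[category][1]):
                collected[category].append(doc)
        # Keep climbing only for categories whose minimum is still unmet.
        wanted = {c for c, (minimum, _) in targets.items()
                  if len(collected[c]) < minimum}
        current = parent_of.get(current)    # climb the place hierarchy
    return collected


# Hypothetical sample data: a neighborhood, its city, and its state.
index = {
    "Wicker Park": ["wp-food-1", "wp-food-2", "wp-food-3", "wp-event-1"],
    "Chicago": ["chi-event-1", "chi-event-2", "chi-food-1"],
    "Illinois": ["il-event-1"],
}
parent_of = {"Wicker Park": "Chicago", "Chicago": "Illinois"}
targets = {"food": (1, 2), "event": (2, 3)}  # (minimum, maximum) per category

result = collect("Wicker Park", index, parent_of,
                 lambda doc: doc.split("-")[1], targets)
print(result)
```

In this sample the "event" category misses its minimum at the neighborhood level, so the sketch climbs to "Chicago" for that category only. The "food" category has already met its minimum and takes nothing further from the parent place. Whether a climb should fill a category up to its maximum or stop at its minimum is not stated in the claim; filling to the maximum is an assumption made here.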
Address: Chicago IL US