Title of invention: Automatic acquisition of a parallel corpus from a network
Abstract: Network pages are identified based on whether they include image alternative text indicating that they contain links to pages that are translations of each other. A plurality of pages and their respective uniform resource locators are downloaded from a server associated with the domain name of the identified network pages. The uniform resource locators are used to identify a set of candidate parallel page pairs, and a set of features is created for each candidate parallel page pair. The sets of features are used to identify parallel page pairs, wherein the pages in a parallel page pair are translations of each other.
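The steps named in the abstract can be illustrated with a minimal sketch. It is not the claimed implementation. The trigger words, the `en`/`zh` URL tokens, the length-ratio feature, and every function and class name below are assumptions made for illustration; the abstract does not specify them.

```python
import re
from html.parser import HTMLParser

# Hypothetical trigger words; the abstract does not list the actual ones.
TRIGGER_WORDS = {"english", "chinese", "english version", "chinese version"}

# Hypothetical language tokens assumed to appear in URLs (e.g. /en/ vs /zh/).
LANG_TOKEN = re.compile(r"(?<![A-Za-z])(en|zh)(?![A-Za-z])")


class AltTextLinkFinder(HTMLParser):
    """Collect hrefs of links that wrap an image whose alt text is a trigger word."""

    def __init__(self):
        super().__init__()
        self._current_href = None
        self.hits = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._current_href = attrs.get("href")
        elif tag == "img" and self._current_href:
            alt = (attrs.get("alt") or "").strip().lower()
            if alt in TRIGGER_WORDS:
                self.hits.append(self._current_href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


def candidate_pairs(urls):
    """Pair URLs that become identical once the language token is masked."""
    by_key = {}
    for url in urls:
        match = LANG_TOKEN.search(url)
        if match:
            key = LANG_TOKEN.sub("*", url, count=1)
            by_key.setdefault(key, {})[match.group(1)] = url
    return [(d["en"], d["zh"]) for d in by_key.values() if "en" in d and "zh" in d]


def pair_features(text_a, text_b):
    """One illustrative feature: ratio of the shorter text length to the longer."""
    shorter, longer = sorted((len(text_a), len(text_b)))
    return {"length_ratio": shorter / max(longer, 1)}
```

In a fuller pipeline, the feature dictionaries would be passed to a classifier that decides which candidate pairs are true translations. That stage is omitted here because the abstract gives no detail about it.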
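Publication No.: US2008168049(A1) Publication date: 2008.07.10
Application No.: US20070650660 Application date: 2007.01.08
Applicant: MICROSOFT CORPORATION Inventors: GAO JIANFENG; ZHANG YING; WU KE
Classification: G06F17/30 Main classification: G06F17/30
Agency: Agent:
Main claim:
Address: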