Title of invention: Method for unsupervised learning of grammatical parsers
Abstract: The invention comprises a core algorithm to use language regularity in large collections of human-created textual documents, as well as optimization techniques to make the algorithm tractable. The core algorithm includes receiving tuples of text units that may be grammatically linked and processing a stream of such tuples to discover language regularities. After this learning is completed, the algorithm's output is used to evaluate the perceived likelihood that different interpretations of novel sentences would have been intended by a speaker of the language.
Publication number: US9460076(B1) Publication date: 2016.10.04
Application number: US201514941724 Filing date: 2015.11.16
Applicant: Lexalytics, Inc. Inventor: Barba Paul F.
Classification: G06F17/27;G06F17/16 Main classification: G06F17/27
Agency: Doherty, Wallace, Pillsbury & Murphy, P.C. Agent: Doherty, Wallace, Pillsbury & Murphy, P.C.
Independent claim: 1. A method for unsupervised learning of a grammatical parser and the use thereof, wherein the method comprises: providing a processor on a computer, wherein the processor runs a content acquisition system to obtain a corpus of text documents over a computer network; storing the corpus of text documents from the content acquisition system on a storage device; providing a processor on the computer which runs a text analytics engine, wherein the text analytics engine comprises a core algorithm, and a factorization algorithm; using the core algorithm to: divide the corpus of text documents into a plurality of sentences; divide the sentences from the plurality of sentences into a plurality of text units; join the text units from the plurality of text units into a plurality of grammatical links; and using the factorization algorithm to factorize a matrix or a tensor for each of the grammatical links of the plurality of grammatical links to respectively generate a plurality of factorized matrices or a plurality of factorized tensors; and additionally using the core algorithm to: generate parses from a corpus of a novel document, comprising: divide the corpus from the novel document into a plurality of sentences; divide the sentences from the plurality of sentences from the novel document into a plurality of text units; identify a subset of all possible grammatical links from the plurality of text units from the novel document; and determine the relative likelihood of the grammatical links in the corpus of the novel document by using the factorized matrices or the factorized tensors to compute a score representing the likelihood of the grammatical links.
Address: Amherst MA US
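
Illustrative sketch (not part of the patent record): the steps recited in claim 1 can be pictured with a toy Python pipeline. Everything below is an assumption made for illustration only: the regex-based sentence and word splitting, the choice of "ordered word pairs within a small window" as candidate grammatical links, the single co-occurrence count matrix, and the rank-1 power-iteration factorization. The record does not specify how the patented method defines text units, links, or the factorization, and all function names here are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    # Assumption: sentences end with ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_units(sentence):
    # Assumption: a "text unit" is a lowercase word token.
    return re.findall(r"[a-z']+", sentence.lower())


def candidate_links(units, window=2):
    # Assumption: a candidate link is an ordered pair of units that are
    # at most `window` positions apart in the sentence.
    return [(units[i], units[j])
            for i in range(len(units))
            for j in range(i + 1, min(i + 1 + window, len(units)))]


def rank1_factorize(counts, iters=50):
    # Rank-1 factorization counts[l, r] ~ sigma * u[l] * v[r], computed by
    # power iteration on the sparse count matrix (a stand-in technique).
    if not counts:
        raise ValueError("no links observed")
    rows = sorted({l for l, _ in counts})
    cols = sorted({r for _, r in counts})
    u = {l: 1.0 for l in rows}
    sigma = 0.0
    for _ in range(iters):
        # v <- A^T u, then normalize to unit length.
        v = {r: 0.0 for r in cols}
        for (l, r), c in counts.items():
            v[r] += c * u[l]
        norm = sum(x * x for x in v.values()) ** 0.5
        v = {r: x / norm for r, x in v.items()}
        # u <- A v; its length estimates the leading singular value.
        u = {l: 0.0 for l in rows}
        for (l, r), c in counts.items():
            u[l] += c * v[r]
        sigma = sum(x * x for x in u.values()) ** 0.5
        u = {l: x / sigma for l, x in u.items()}
    return sigma, u, v


def train(corpus, window=2):
    # Learning phase: sentences -> text units -> links -> factorized matrix.
    counts = Counter()
    for sentence in split_sentences(corpus):
        counts.update(candidate_links(split_units(sentence), window))
    return rank1_factorize(counts)


def score(link, model):
    # Likelihood-style score of a link; units never seen in that role score 0.
    sigma, u, v = model
    left, right = link
    return sigma * u.get(left, 0.0) * v.get(right, 0.0)


def rank_links(sentence, model, window=2):
    # Parsing phase for a novel sentence: score every candidate link,
    # highest score first.
    links = candidate_links(split_units(sentence), window)
    return sorted(((link, score(link, model)) for link in links),
                  key=lambda pair: pair[1], reverse=True)
```

Calling `train` on a small corpus and then `rank_links` on a new sentence returns that sentence's candidate links ordered by score. A rank-1 model is far too coarse for real parsing; it is used here only because it keeps the factorize-then-score structure visible in a few lines.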