Title of invention: Apparatus and method for generating data useful in indexing and searching
Abstract: Source documents are processed to generate data for indexing, and queries are processed to generate data for searching, in accordance with retrieved tokenization rules and, if desired, retrieved normalization rules. Tokenization rules define exactly which characters (letters, numbers, punctuation characters, etc.) and exactly which patterns of those characters (one or more contiguous characters, every individual character, etc.) constitute indexable and searchable units of data. Normalization rules can modify the tokens created by the tokenizer during indexing and/or searching. Normalization supports case-insensitive searching and handles language-specific cases in which document authors may use accepted spelling variations of a word. For queries to search the databases accurately, query processing must employ the same tokenization and normalization rules as source processing, and must also employ an additional set of concordable characters for use in the query language. This set of "reserved" characters includes characters for wildcard searching, quoted strings, field-qualified searching, range searching, and so forth.
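The relationship between source processing and query processing described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the rule that a token is a run of contiguous alphanumeric characters, the normalization steps (case folding and diacritic stripping), the reserved set `*` and `?`, and all function names are assumptions chosen for illustration only.

```python
import unicodedata

# Assumed reserved query characters (wildcards only). The abstract also
# mentions quoting, field and range characters, which are omitted here.
RESERVED_QUERY_CHARS = frozenset("*?")


def tokenize(text, extra_chars=frozenset()):
    """Split text into runs of contiguous concordable characters.

    Assumed rule: letters and digits are concordable; any other character
    ends the current token unless it is listed in extra_chars.
    """
    tokens, current = [], []
    for ch in text:
        if ch.isalnum() or ch in extra_chars:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def normalize(token):
    """Assumed normalization: case-fold, then strip combining accents."""
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(document):
    """Source processing: tokenize, then normalize each token."""
    return [normalize(t) for t in tokenize(document)]


def query_terms(query):
    """Query processing: same rules, plus the reserved characters."""
    tokens = tokenize(query, extra_chars=RESERVED_QUERY_CHARS)
    return [normalize(t) for t in tokens]
```

Because both paths share `tokenize` and `normalize`, a query term is guaranteed to be comparable with the indexed terms; only the query path keeps the reserved characters, which a later matching step (not shown) would interpret.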
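Publication number: US2003200199(A1)  Publication date: 2003.10.23
Application number: US20030417548  Filing date: 2003.04.16
Applicant: DOW JONES REUTERS BUSINESS INTERACTIVE, LLC  Inventor: SNYDER JAMES D.
Classification: G06F17/27;G06F17/30;(IPC1-7):G06F17/30  Main classification: G06F17/27
Agency:  Agent:
Main claim:
Address: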