Abstract |
A system and method for identifying query-related keywords in documents found in a search, using latent semantic analysis. The documents are represented as a document-term matrix M containing one or more document term-weight vectors d, which may be term-frequency (tf) vectors or term-frequency inverse-document-frequency (tf-idf) vectors. This matrix is subjected to a truncated singular value decomposition. The resulting transform matrix U can be used to project a query term-weight vector q into the reduced N-dimensional space, after which it is expanded back into the full vector space using the inverse of U, yielding q_expanded. To perform a search, the similarity of q_expanded to each candidate document vector is measured in this space. Exemplary similarity functions are the dot product and cosine similarity. The keywords selected are the terms with the highest values in q_expanded that also occur in at least one document. Matching keywords from the query may be highlighted in the search results.
|
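The pipeline summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the claimed implementation, and it rests on several assumptions that the abstract does not state: plain tf weights with whitespace tokenization, a terms-by-documents orientation for M, and the transpose of the truncated U standing in for its inverse (the truncated U has orthonormal columns, so its transpose is its pseudo-inverse). All function names and parameters (`build_term_document_matrix`, `truncated_basis`, `expand_query`, `search`, `n_dims`, `top_docs`, `top_keywords`) are hypothetical.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Return (vocab, M), where M[i, j] is the tf count of term i in document j.

    Assumption: whitespace tokenization and raw tf weights (no idf).
    """
    vocab = sorted({term for doc in docs for term in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    M = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term in doc.split():
            M[index[term], j] += 1.0
    return vocab, M


def truncated_basis(M, n_dims):
    """Truncated SVD: keep the n_dims leading left singular vectors of M."""
    U, _singular_values, _Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :n_dims]


def expand_query(U_n, q):
    """Project q into the reduced space, then expand it back to term space.

    Assumption: U_n.T plays the role of the inverse of the truncated U.
    """
    q_reduced = U_n.T @ q
    return U_n @ q_reduced


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def search(docs, query, n_dims=2, top_docs=2, top_keywords=3):
    """Rank documents against the expanded query and pick related keywords."""
    vocab, M = build_term_document_matrix(docs)
    index = {term: i for i, term in enumerate(vocab)}
    U_n = truncated_basis(M, n_dims)

    # Query term-weight vector q over the same vocabulary as M.
    q = np.zeros(len(vocab))
    for term in query.split():
        if term in index:
            q[index[term]] += 1.0
    q_expanded = expand_query(U_n, q)

    # Similarity of q_expanded to each document vector in the full space.
    scores = [cosine(q_expanded, M[:, j]) for j in range(M.shape[1])]
    ranking = sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)

    # Keywords: highest-valued terms of q_expanded that occur in a top hit.
    present = M[:, ranking[:top_docs]].sum(axis=1) > 0
    order = np.argsort(-q_expanded)
    keywords = [vocab[i] for i in order
                if present[i] and q_expanded[i] > 1e-12][:top_keywords]
    return ranking, keywords, q_expanded


if __name__ == "__main__":
    corpus = [
        "car engine repair",
        "automobile engine service",
        "banana fruit salad",
        "fruit salad recipe",
    ]
    ranking, keywords, _ = search(corpus, "car")
    print(ranking, keywords)
```

In this toy corpus the query contains only "car", yet the expanded query also gives weight to co-occurring terms such as "engine" and "automobile", which is what lets a document that never mentions "car" be retrieved and have related keywords highlighted. Swapping `cosine` for a plain dot product, or the tf counts for tf-idf weights, would fit the same structure.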