A New Feature-Vector-Based Representation Method for Chinese Web Documents, Application No. CN201010618112.5

Title of invention	A New Feature-Vector-Based Representation Method for Chinese Web Documents
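Abstract	The invention discloses a new feature-vector-based representation method for Chinese Web documents. The method comprises: extending the dictionary-based Chinese word segmentation algorithm so that it can discover new words by concatenating fragmented words, and adding those words to the segmentation dictionary; and extending the word-document vector matrix into a word-transaction vector matrix, applying an association rule mining algorithm to find rules in that matrix whose confidence exceeds an empirical threshold, and grouping the words in each rule into one class, which reduces the dimensionality of the feature vector space. The invention addresses the inability of current vector representations of Chinese Web documents to represent new words. It also lowers the dimensionality of the document vectors, which greatly reduces storage overhead and the time complexity of subsequent text data mining computations.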
Publication No.	CN102541935A	Publication date	2012.07.04
Application No.	CN201010618112.5	Filing date	2010.12.31
Applicant	北京安码科技有限公司	Inventors	宫哲;贺智铭;蒋琴琴
Classification	G06F17/30(2006.01)I	Main classification	G06F17/30(2006.01)I
Agency		Agent
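Independent claim	A new feature-vector-based representation method for Chinese Web documents, characterized in that the method comprises: a method for discovering new Chinese Web words; and a method for finding words of the same class using an association rule algorithm; by means of these two methods, a Chinese Web document is represented more effectively as a feature vector of lower dimensionality.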
Address	Room 612, Building 1, Courtyard 32, Xizhimen North Street (西直门北大街), Haidian District, Beijing 100082
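
The two steps named in the abstract and the independent claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the record gives no algorithmic detail, so everything below is an assumption made for illustration. In particular, the sketch assumes forward maximum matching as the dictionary segmenter, treats any run of adjacent out-of-dictionary single characters as a candidate new word, treats a "transaction" as any bag of words (for example a sentence), and mines only pairwise rules. The function names `segment`, `merge_fragments`, `cluster_by_rules`, and `to_vector`, and the default threshold of 0.8, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def segment(text, dictionary, max_len=4):
    """Forward maximum matching (assumed segmenter).

    Characters that match no dictionary entry are emitted as
    single-character tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def merge_fragments(tokens, dictionary):
    """Join runs of adjacent out-of-dictionary single characters.

    Each run longer than one character is treated as a candidate new
    word and added to the dictionary (mutated in place).
    """
    merged, run = [], ""
    for tok in tokens + [None]:  # the None sentinel flushes the last run
        if tok is not None and len(tok) == 1 and tok not in dictionary:
            run += tok
            continue
        if len(run) > 1:
            dictionary.add(run)
        if run:
            merged.append(run)
        run = ""
        if tok is not None:
            merged.append(tok)
    return merged


def cluster_by_rules(transactions, min_conf=0.8):
    """Group words linked by a pairwise rule with confidence >= min_conf.

    confidence(a -> b) = count(a and b together) / count(a).
    Returns a mapping from each word to its class representative.
    """
    count, pair_count = defaultdict(int), defaultdict(int)
    for transaction in transactions:
        items = sorted(set(transaction))
        for word in items:
            count[word] += 1
        for a, b in combinations(items, 2):
            pair_count[(a, b)] += 1

    parent = {word: word for word in count}  # union-find forest

    def find(word):
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for (a, b), together in pair_count.items():
        if together / count[a] >= min_conf or together / count[b] >= min_conf:
            parent[find(a)] = find(b)
    return {word: find(word) for word in count}


def to_vector(tokens, word_to_class):
    """Count tokens per word class; unseen words keep their own dimension."""
    vector = defaultdict(int)
    for tok in tokens:
        vector[word_to_class.get(tok, tok)] += 1
    return dict(vector)


if __name__ == "__main__":
    dictionary = {"我们", "使用", "的", "搜索"}
    tokens = merge_fragments(segment("我们使用微博搜索", dictionary), dictionary)
    print(tokens)

    transactions = [["电脑", "计算机"], ["电脑", "计算机", "软件"],
                    ["软件"], ["电脑", "计算机"]]
    classes = cluster_by_rules(transactions)
    print(to_vector(["电脑", "软件", "计算机", "电脑"], classes))
```

In the demo, the out-of-dictionary characters 微 and 博 are joined into one token and added to the dictionary, and 电脑 and 计算机 always co-occur, so they fall into one class and share a single vector dimension. Because confidence alone is used, a word seen only once will merge with anything it co-occurs with, so a practical version would likely add a minimum-support filter as well.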
