Title: Number of clusters estimation
Abstract: A method of determining a number of clusters for a dataset is provided. Centroid locations for a defined number of clusters are determined using a clustering algorithm. Boundaries are defined for each of the defined clusters. A reference distribution that includes a plurality of data points is created. The plurality of data points are within the defined boundary of at least one cluster of the defined clusters. Second centroid locations for the defined number of clusters are determined using the clustering algorithm and the reference distribution. A gap statistic for the defined number of clusters is computed based on a comparison between a first residual sum of squares and a second residual sum of squares. The processing is repeated with a next number of clusters to create. An estimated best number of clusters for the received data is determined by comparing the gap statistic computed for each iteration of the number of clusters.
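The abstract says only that the gap statistic is "based on a comparison" between the two residual sums of squares. One common form from the gap statistic literature is written out below. It is an assumption for illustration, not a formula taken from this record. The symbols $W_k$, $W^{*}_{k,b}$, $B$, $C_j$ and $c_j$ are likewise introduced here.

```latex
% W_k: residual sum of squares of the received data around its k centroids
W_k = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2

% W*_{k,b}: the same quantity for the b-th reference distribution, b = 1, ..., B
\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W^{*}_{k,b} \;-\; \log W_k
```

Under this assumed form, a larger $\mathrm{Gap}(k)$ means the received data is clustered more tightly than the reference data for that $k$.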
Publication number: US9424337(B2)  Publication date: 2016.08.23
Application number: US201414196299  Filing date: 2014.03.04
Applicant: SAS Institute Inc.  Inventors: Hall Patrick; Kaynar Kabul Ilknur; Sarle Warren; Silva Jorge
Classification: G06F17/30; G06K9/62  Main classification: G06F17/30
Agency: Bell & Manning, LLC  Agent: Bell & Manning, LLC
Principal claim: 1. A non-transitory computer-readable medium having stored thereon computer-readable instructions that when executed by a computing device cause the computing device to:
receive data to cluster;
define a number of clusters to create;
(a) determine centroid locations for the defined number of clusters using a clustering algorithm and the received data to define clusters;
(b) define boundaries for each of the defined clusters by
    determining an eigenvector and an eigenvalue for each dimension of each cluster of the defined clusters using principal components analysis;
    determining a length for each dimension of each cluster as a proportion of the determined eigenvalue for the respective dimension; and
    defining the boundaries for each cluster of the defined clusters as a box with a center of the box as the determined centroid location of the respective cluster, a first boundary point for each dimension defined as the center plus the determined length of the respective dimension aligned with the determined eigenvector of the respective dimension, and a second boundary point for each dimension defined as the center minus the determined length of the respective dimension aligned with the eigenvector of the respective dimension;
(c) create a reference distribution that includes a plurality of data points, wherein the plurality of data points are within the defined boundary of at least one cluster of the defined clusters;
(d) determine second centroid locations for the defined number of clusters using the clustering algorithm and the created reference distribution to define second clusters;
(e) compute a gap statistic for the defined number of clusters based on a comparison between a first residual sum of squares computed for the defined clusters and a second residual sum of squares computed for the defined second clusters;
(f) repeat (a) to (e) with a next number of clusters to create as the defined number of clusters; and
(g) determine an estimated best number of clusters for the received data by comparing the gap statistic computed for each iteration of (e).
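Steps (a) to (g) of the claim can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The clustering algorithm is a plain Lloyd's k-means. Points are drawn uniformly inside each box. The gap statistic is the mean log reference RSS minus the log data RSS. The best number of clusters is taken as the one with the largest gap. All function names and the `proportion` and `n_ref` parameters are hypothetical.

```python
import numpy as np

def assign(X, centroids):
    # Index of the nearest centroid for every row of X.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, rng, n_iter=100):
    # Plain Lloyd's k-means (assumed stand-in for "a clustering algorithm").
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        new = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, assign(X, centroids)

def rss(X, centroids, labels):
    # Residual sum of squares of points around their assigned centroids.
    return float(((X - centroids[labels]) ** 2).sum())

def sample_pca_box(points, center, n, rng, proportion=1.0):
    # Step (b): eigenvectors/eigenvalues of the cluster covariance (PCA);
    # half-length per dimension = proportion * eigenvalue, as the claim words it.
    # Step (c): draw n points uniformly inside the eigenvector-aligned box.
    if len(points) < 2:
        return np.repeat(center[None, :], n, axis=0)
    cov = np.atleast_2d(np.cov(points, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)
    half_len = proportion * np.clip(eigvals, 0.0, None)
    coords = rng.uniform(-1.0, 1.0, size=(n, len(center))) * half_len
    return center + coords @ eigvecs.T

def gap_for_k(X, k, rng, n_ref=5, proportion=1.0):
    centroids, labels = kmeans(X, k, rng)                 # step (a)
    log_w = np.log(rss(X, centroids, labels))
    ref_logs = []
    for _ in range(n_ref):
        parts = [sample_pca_box(X[labels == j], centroids[j],
                                int((labels == j).sum()), rng, proportion)
                 for j in range(k) if (labels == j).any()]
        ref = np.vstack(parts)                            # steps (b)-(c)
        c2, l2 = kmeans(ref, k, rng)                      # step (d)
        ref_logs.append(np.log(rss(ref, c2, l2)))
    return float(np.mean(ref_logs) - log_w)               # step (e), assumed form

def estimate_k(X, k_values, seed=0):
    rng = np.random.default_rng(seed)
    gaps = {k: gap_for_k(X, k, rng) for k in k_values}    # step (f)
    best = max(gaps, key=gaps.get)                        # step (g), assumed rule
    return best, gaps
```

A call such as `best, gaps = estimate_k(X, range(1, 6))` returns the selected number of clusters and the gap value for each candidate. The result depends on `proportion` and on the random initialisation of k-means, so it is worth trying several seeds.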
Address: Cary, NC, US