
Big Data Spectral Clustering Algorithm Based on Adaptive Nyström Sampling

Journal of Software (Ruan Jian Xue Bao), ISSN 1000-9825, CODEN RUXUEW, E-mail: jos@iscas.ac.cn

Journal of Software, 2014, 25(9): 2037-2049 [doi: 10.13328/j.cnki.jos.004643] http://www.jos.org.cn


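DING Shi-Fei1,2, JIA Hong-Jie1,2, SHI Zhong-Zhi2

1(School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116)

2(Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)

Corresponding author: DING Shi-Fei, E-mail: dingsf@cumt.edu.cn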

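Abstract: For data sets with complex structure, spectral clustering is a flexible and effective clustering method. Based on spectral graph theory, it maps the data points into a low-dimensional space spanned by eigenvectors, which optimizes the structure of the data and yields satisfactory clustering results. However, the eigen-decomposition in spectral clustering usually has a computational complexity of O(n^3), which limits the application of spectral clustering to big data. The Nyström extension method uses a subset of points sampled from the data set to perform approximate computation and approach the true eigenspace; it effectively reduces the computational complexity and offers a new direction for big data spectral clustering. Because the choice of sampling strategy is critical to the Nyström extension, an adaptive Nyström sampling method is designed in which the sampling probability of every data point is updated promptly after each sampling pass, and it is proved theoretically that the sampling error decreases exponentially as the number of sampling passes increases. Based on this adaptive Nyström sampling method, a spectral clustering algorithm suitable for big data is proposed, and its feasibility and effectiveness are verified experimentally.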

Key words: big data; spectral clustering; eigen-decomposition; Nyström extension; adaptive sampling

CLC number: TP181

Chinese citation format: DING Shi-Fei, JIA Hong-Jie, SHI Zhong-Zhi. Big data spectral clustering algorithm based on adaptive Nyström sampling. Journal of Software, 2014, 25(9): 2037-2049. http://www.jos.org.cn/1000-9825/4643.htm

English citation format: Ding SF, Jia HJ, Shi ZZ. Spectral clustering algorithm based on adaptive Nyström sampling for big data analysis. Ruan Jian Xue Bao/Journal of Software, 2014, 25(9): 2037-2049 (in Chinese). http://www.jos.org.cn/1000-9825/4643.htm

Spectral Clustering Algorithm Based on Adaptive Nyström Sampling for Big Data Analysis

DING Shi-Fei1,2, JIA Hong-Jie1,2, SHI Zhong-Zhi2

1(School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China)

2(Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, The Chinese Academy of Sciences, Beijing 100190, China)

Corresponding author: DING Shi-Fei, E-mail: dingsf@cumt.edu.cn

Abstract: Spectral clustering is a flexible and effective clustering method for data sets with complex structure. It is based on spectral graph theory and produces satisfactory clustering results by mapping the data points into a low-dimensional space spanned by eigenvectors, so that the data structure is optimized. However, the eigen-decomposition step usually has a computational complexity of O(n^3), which limits the application of spectral clustering to big data problems. The Nyström extension method uses a subset of points sampled from the data set to approximate the true eigenspace. In this way, the computational complexity can be effectively reduced, which provides a new approach to big data spectral clustering. The choice of sampling strategy is essential for the Nyström extension. In this paper, an adaptive Nyström sampling method is presented. The sampling probability of every data point is updated after each sampling pass, and a proof is given that the sampling error decreases exponentially as the number of sampling passes increases. Based on the adaptive Nyström sampling method, a spectral clustering algorithm for big data analysis is presented, and its feasibility and effectiveness are verified by experiments.

Key words: big data; spectral clustering; eigen-decomposition; Nyström extension; adaptive sampling
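
To make the idea in the abstract concrete, the following is a minimal Python sketch of spectral clustering with a Nyström extension, not the algorithm proposed in the paper. It assumes a Gaussian affinity, samples columns uniformly without replacement, and estimates node degrees from the sampled columns only; all of these choices, and every function and parameter name (`rbf_affinity`, `nystrom_embedding`, `kmeans`, `nystrom_spectral_clustering`, `gamma`, `probs`), are illustrative assumptions. The hypothetical `probs` argument only marks where a non-uniform, for example adaptively updated, sampling distribution could be plugged in; the adaptive update rule itself is not reproduced here.

```python
import numpy as np

def rbf_affinity(A, B, gamma):
    """Gaussian affinities exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_embedding(X, idx, k, gamma=0.5):
    """Approximate the k leading eigenvectors of the normalized affinity
    matrix from the sampled columns idx, without forming the n x n matrix."""
    tiny = np.finfo(float).tiny
    C = rbf_affinity(X, X[idx], gamma)            # n x m block of sampled columns
    # Assumption: degrees are estimated from the sampled columns only.
    deg = np.maximum(C.sum(axis=1), tiny)
    C = C / np.sqrt(np.outer(deg, deg[idx]))      # symmetric normalization
    W = C[idx]                                    # m x m block among the samples
    vals, vecs = np.linalg.eigh(W)                # costs O(m^3) instead of O(n^3)
    top = np.argsort(vals)[::-1][:k]              # indices of the k largest eigenvalues
    U = (C @ vecs[:, top]) / vals[top]            # Nystrom extension to all n points
    norms = np.maximum(np.linalg.norm(U, axis=1, keepdims=True), tiny)
    return U / norms                              # unit-length embedding rows

def kmeans(Z, k, iters=20):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([((Z - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(2)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):               # keep the old center if a cluster is empty
                centers[j] = Z[labels == j].mean(0)
    return labels

def nystrom_spectral_clustering(X, k, m, gamma=0.5, probs=None, seed=0):
    """Sample m points (uniformly unless probs is given), embed, then cluster."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False, p=probs)
    return kmeans(nystrom_embedding(X, idx, k, gamma), k)

# Toy usage: two well-separated Gaussian blobs of 100 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(10.0, 0.5, (100, 2))])
labels = nystrom_spectral_clustering(X, k=2, m=20)
```

In this sketch only the m x m block among the sampled points is eigen-decomposed, so the cubic cost applies to m rather than n; how well the result matches full spectral clustering depends on which points are sampled, which is the question the adaptive scheme in the paper addresses.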

* Foundation item: National Basic Research Program of China (973 Program) (2013CB329502); National Natural Science Foundation of China (61379101)

Received 2014-04-07; Accepted 2014-05-14