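A Parallelized Improvement of a Clustering Algorithm and Its Application to Microblog User Clustering
Published: 2018-04-12 19:40
Topic: clustering algorithm + parallelization; Source: Shanghai Jiao Tong University, master's thesis, 2014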
[Abstract]: Cluster analysis is an important technique in data mining. The K-means algorithm is one of the most widely used clustering algorithms, with applications in computer vision, text mining, customer analysis, and other fields. K-means is simple and efficient, but it is sensitive to the initial cluster centers and requires the number of clusters K to be specified manually. The agglomerative fuzzy K-means algorithm is an improvement on K-means: it is less sensitive to the initial points and can search for the number of clusters automatically through an agglomerative process. Its drawback is that it needs an excessive number of iterations.
This thesis first proposes an improved agglomerative fuzzy K-means algorithm that addresses this drawback. The improved algorithm replaces the random initialization used by agglomerative fuzzy K-means with an initial-center selection method, which reduces the number of iterations required. It also adopts a distributed implementation based on the MapReduce framework, which increases its capacity to process large data sets, and it is implemented in a Hadoop and Mahout environment. The thesis then studies the methods and problems of microblog user clustering and introduces a Wikipedia-based topic analysis method for microblog text to extract user features. Finally, the improved algorithm is applied to cluster microblog users, and the clustering results are analyzed. The experimental results show that the improved algorithm reduces the number of iterations required at run time and scales well on a cluster. Analysis of the microblog user clustering results shows that the algorithm produces suitable user clusters.
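The abstract gives no implementation details, so the following is only a minimal sketch of how one iteration of plain (hard) K-means splits into a map step and a reduce step, which is the general pattern a MapReduce implementation relies on. It is not the thesis's improved agglomerative fuzzy K-means, it does not use Hadoop or Mahout, and the function names `map_phase`, `reduce_phase`, and `kmeans` are hypothetical. The returned iteration count is the quantity that a better choice of initial centers would be expected to reduce.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(points, centers):
    """Mapper: for each point, emit (index of nearest center, (point, 1))."""
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        yield nearest, (p, 1)


def reduce_phase(pairs, centers):
    """Reducer: sum the points assigned to each center, then average them."""
    sums = {}
    counts = defaultdict(int)
    for j, (p, n) in pairs:
        if j in sums:
            sums[j] = [a + b for a, b in zip(sums[j], p)]
        else:
            sums[j] = list(p)
        counts[j] += n
    # A center that received no points keeps its previous position.
    return [
        tuple(s / counts[j] for s in sums[j]) if j in sums else tuple(centers[j])
        for j in range(len(centers))
    ]


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat map and reduce until no center moves by more than tol."""
    iterations = 0
    for iterations in range(1, max_iter + 1):
        new_centers = reduce_phase(map_phase(points, centers), centers)
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, iterations


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 0), (10, 2)]
    final_centers, n_iter = kmeans(data, [(0, 0), (10, 0)])
    print(final_centers, n_iter)
```

In a real cluster the mapper output would be grouped by key and shuffled to reducers on different nodes; here both steps run in one process purely to show the data flow.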
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092; TP311.13
Article ID: 1741139
Article link: http://www.sikaile.net/guanlilunwen/ydhl/1741139.html