Research and Improvement of Density-Based Clustering Algorithms

Published: 2018-05-12 08:12

Topic: clustering + density peaks; Source: Inner Mongolia University, 2017 master's thesis

[Abstract]: Cluster analysis is a technique for grouping data according to the similarity between data points, without any prior knowledge. It is called unsupervised classification in pattern recognition and nonparametric estimation in statistics. Cluster analysis is widely used in many academic fields, such as bioinformatics, information security, and text clustering. Over the past few decades, thousands of clustering algorithms have been proposed, yet many research questions remain open: how to handle clusters of different shapes and densities, how to compute sensibly on high-dimensional data, how to determine the number of clusters in a result, how to detect noise points reliably, and how to define and evaluate a correct cluster. In 2014, Alex Rodriguez and Alessandro Laio proposed a new heuristic clustering algorithm, CFSFDP (Clustering by Fast Search and Find of Density Peaks). The algorithm needs few initial parameters, runs quickly, can effectively detect the number of target clusters, and is insensitive to noisy data. This thesis confirms its effectiveness through a series of experiments, and the algorithm's authors used image clustering on the Olivetti face database to show that it can handle high-dimensional data. However, further study shows that the algorithm performs poorly in some situations. First, the initial cluster centers must be selected manually, and the centers of clusters that lie in low-density regions cannot be extracted effectively. Second, the algorithm assumes that each cluster in the dataset has exactly one local density peak, so clusters with several density peaks and clusters that share a density peak are partitioned incorrectly. Third, its noise-identification method labels a relatively large number of data points as noise.
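For reference, the two quantities that CFSFDP plots in its decision graph can be sketched as follows. This is a minimal sketch based on the published description by Rodriguez and Laio, not code from the thesis: it assumes Euclidean distance and the cutoff-kernel density, breaks density ties by index order, and uses illustrative names (`density_peaks`, `assign_clusters`, `d_c`, `parent`). The cluster centers are passed in by hand, which is exactly the manual step the abstract criticizes.

```python
import math

def density_peaks(points, d_c):
    """Return (rho, delta, parent, order) for a list of coordinate tuples."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Cutoff-kernel local density: how many other points lie closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Indices from densest to sparsest; equal densities keep index order
    # (a tie-breaking choice made for this sketch).
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n
    parent = [-1] * n  # nearest neighbour of higher density; -1 for the densest point
    # The densest point has no denser neighbour, so it gets its largest distance.
    delta[order[0]] = max(dist[order[0]])
    for rank in range(1, n):
        i = order[rank]
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j
    return rho, delta, parent, order

def assign_clusters(centers, parent, order):
    """Label every point with the cluster of its nearest denser neighbour."""
    labels = [-1] * len(order)
    for k, c in enumerate(centers):
        labels[c] = k
    for i in order:  # densest first, so a point's parent is labelled before it
        if labels[i] == -1:
            if parent[i] == -1:
                raise ValueError("the densest point must be among the centers")
            labels[i] = labels[parent[i]]
    return labels

# Two small square blobs, each with one dense point in the middle.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
rho, delta, parent, order = density_peaks(points, d_c=1.0)
labels = assign_clusters([4, 9], parent, order)  # centers chosen by hand
```

Points with both a high `rho` and a high `delta` stand out in the decision graph; here those are the two blob midpoints, and every other point inherits the label of its nearest denser neighbour.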
Based on these findings, this thesis proposes a new algorithm based on density peaks. The improved algorithm builds the decision graph with an improved method of computing decision values and identifies cluster centers automatically by finding the inflection point of the decision graph. It then splits and merges incorrectly partitioned sub-clusters by constructing a local density distribution map for each sub-cluster and applying ideas from an improved hierarchical clustering algorithm. Finally, it identifies noise with a newly introduced formula for the outlier degree of a data point. Experiments show that the improved algorithm achieves better clustering results than the original algorithm and other density-based clustering algorithms on multiple datasets.
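The abstract does not give the thesis's decision-value formula or its inflection-point rule, so the following is only a stand-in illustration of what automatic center selection can look like. It assumes the product of local density and separation distance, which the original CFSFDP paper suggests as a ranking quantity, and cuts the descending ranking at its largest drop. The function name `pick_centers` and the largest-drop rule are assumptions made for this sketch, not the thesis's method.

```python
def pick_centers(rho, delta):
    """Pick centers as the points ranked before the largest drop in rho * delta."""
    # Stand-in decision value: density times distance to the nearest denser point.
    gamma = [r * d for r, d in zip(rho, delta)]
    ranked = sorted(range(len(gamma)), key=lambda i: -gamma[i])
    if len(ranked) < 2:
        return ranked
    # Drop between each pair of neighbours in the descending ranking.
    drops = [gamma[ranked[k]] - gamma[ranked[k + 1]]
             for k in range(len(ranked) - 1)]
    # The knee is the position of the largest drop (first one on ties).
    knee = max(range(len(drops)), key=drops.__getitem__)
    return ranked[:knee + 1]

# Hypothetical decision-graph values: two points combine high density
# with a large separation distance, the rest do not.
rho = [1, 1, 1, 1, 4, 1, 1, 1, 1, 4]
delta = [0.7, 0.7, 0.7, 0.7, 14.8, 0.7, 0.7, 0.7, 0.7, 14.1]
centers = pick_centers(rho, delta)
```

A largest-drop cut is crude: a single dominant peak can hide the others, which is one reason a more careful decision value and knee criterion, such as the one the thesis proposes, is worth having.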
[Degree-granting institution]: Inner Mongolia University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP311.13

Article ID: 1877838

Article link: http://www.sikaile.net/shoufeilunwen/xixikjs/1877838.html
