A Spark-Based Spectral Clustering Algorithm and Its Application to QAR Data

Published: 2018-03-23 15:06

Topic: Spark  Focus: spectral clustering  Source: Civil Aviation University of China, master's thesis, 2017

[Abstract]: Spectral clustering (SC) is a clustering method based on graph theory that achieves better clustering quality than traditional clustering algorithms. Because it recasts clustering as an optimal subgraph partitioning problem, it can cluster data in arbitrary sample spaces, and the algorithm converges to the global optimum. The basic idea is to treat the samples as the vertices of a graph and the similarities between samples as its weighted edges, and then to use the Laplacian matrix of this weighted undirected graph to find a partition in which edges within a subgraph carry large weights and edges between subgraphs carry small weights. However, applying spectral clustering to large-scale data sets raises two problems: first, the memory required exceeds the hardware capacity of a single computer; second, clustering takes too long. How to apply spectral clustering to large-scale data sets is therefore a problem worth studying, and spectral clustering built on the distributed platforms Hadoop and Spark is a feasible way to handle such data.

The main work of this thesis is as follows. First, to address the inability of spectral clustering to handle large-scale data, a Spark-based spectral clustering solution is proposed. The parallel graph computation of Spark GraphX is used to compute the similarities between samples and thereby obtain the graph Laplacian matrix; a parallelized Lanczos algorithm then reduces the Laplacian matrix to a tridiagonal matrix, whose first K eigenvectors are computed; finally, a parallelized K-means algorithm clusters the data formed by taking the K eigenvectors as columns. Second, a Hive-based QAR (Quick Access Recorder) data warehouse is built. To show more intuitively how QAR data is organized and stored in the distributed file system, an HDFS visualization system is built on the Hadoop platform. On this basis, the overall architecture and storage structure of the Hive-based QAR data warehouse are described, and experiments show that the warehouse meets the storage and query requirements of massive QAR data. Finally, taking the "air turbulence" event as an example, parallel spectral clustering is carried out on real QAR data from an airline. Experiments show that, while the QAR data warehouse meets fast-query requirements, the spectral clustering algorithm provides effective technical support for QAR data analysis.
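The first two stages of the pipeline described in the abstract (similarity graph to Laplacian, then Lanczos reduction to a tridiagonal matrix) can be sketched on a single machine as follows. This is only an illustrative sketch, not the thesis's Spark GraphX implementation: the Gaussian similarity kernel, the unnormalized Laplacian L = D - W, the sample points, and all function names (`gaussian_similarity`, `laplacian`, `lanczos`) are assumptions introduced here, since the abstract does not specify them.

```python
import math

def gaussian_similarity(points, sigma=1.0):
    """Symmetric similarity matrix W with zero diagonal (Gaussian kernel is an assumption)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w[i][j] = w[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w

def laplacian(w):
    """Unnormalized graph Laplacian L = D - W, where D holds the row sums of W."""
    n = len(w)
    return [[(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(n)]
            for i in range(n)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(a, x):
    # The only step that touches the full matrix; a distributed
    # version would parallelize this product across workers.
    return [dot(row, x) for row in a]

def lanczos(a, m, start):
    """Run up to m Lanczos steps on the symmetric matrix a.

    Returns the diagonal (alphas) and off-diagonal (betas) of the
    tridiagonal matrix T, plus the orthonormal Lanczos vectors q.
    The start vector must not be an eigenvector of a (for a Laplacian,
    avoid the all-ones vector), or the iteration stops immediately.
    """
    norm = math.sqrt(dot(start, start))
    q = [[x / norm for x in start]]
    alphas, betas = [], []
    for j in range(m):
        w = matvec(a, q[j])
        alphas.append(dot(w, q[j]))
        # Full reorthogonalization against every Lanczos vector so far.
        for v in q:
            c = dot(w, v)
            w = [wi - c * vi for wi, vi in zip(w, v)]
        beta = math.sqrt(dot(w, w))
        if j == m - 1 or beta < 1e-12:
            break
        betas.append(beta)
        q.append([wi / beta for wi in w])
    return alphas, betas, q

# Hypothetical sample data: two loose groups of 2-D points.
points = [(0.0, 0.0), (1.0, 0.5), (0.2, 1.5), (4.0, 4.0), (5.0, 4.2), (4.3, 5.6)]
L = laplacian(gaussian_similarity(points))
alphas, betas, Q = lanczos(L, 3, [1.0, -2.0, 3.0, -1.0, 2.0, 0.5])
print("diagonal of T:", alphas)
print("off-diagonal of T:", betas)
```

The resulting tridiagonal matrix is small, so its eigenvectors are cheap to compute; mapping them back through the Lanczos vectors gives approximate eigenvectors of the Laplacian, whose rows would then be passed to K-means as the abstract describes.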
[Degree-granting institution]: Civil Aviation University of China
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13

