Research and Implementation of Data Mining Algorithms Based on Distributed Computing

Published: 2018-12-18 01:39

[Abstract]: As Internet access becomes more convenient, online activity has become an increasingly popular emerging field. The rapid development of the Internet has widened its range of applications, and the Internet industry has consequently produced large volumes of user data. Traditional single-machine computing is increasingly unable to meet the computational demands and speed requirements of real business scenarios in the industry. Research on data mining algorithms based on distributed computing helps bring the advantages of distributed computing in compute capacity and processing speed to bear as Internet data volumes keep growing. This requires rethinking the design of traditional single-machine data mining algorithms and implementing them as distributed algorithms. To this end, this thesis proposes a research approach to data mining based on distributed computing. Building on the principles of single-machine data mining algorithms, it studies and implements distributed versions of the most widely used algorithms: the classification algorithms naive Bayes and SVM, the association rule algorithm FP-Growth, and the clustering algorithms Canopy and k-Means. It then applies text classification based on distributed naive Bayes and FP-Growth association rules, together with cluster analysis based on an improved k-Means algorithm for distributed environments, in a Weibo hot-post analysis system.

The main work of this thesis is as follows:

1. It studies the basic theory of data mining algorithms and the basic design ideas of distributed computing, and sets out the main research subject: data mining algorithms based on distributed computing. In a distributed environment, these are the classification algorithms naive Bayes and SVM, the association rule algorithm FP-Growth, and the clustering algorithms k-Means, Canopy, and an improved k-Means.

2. Building on the research subject set out in the previous step, it studies data mining algorithms in a distributed environment. After a thorough study of the algorithms, it uses the characteristics of the MapReduce programming model in Hadoop to implement distributed versions of naive Bayes classification, SVM classification, FP-Growth association rule mining, Canopy clustering, k-Means clustering, and the improved k-Means clustering. With these implementations, it runs comparative experiments with the different distributed algorithms on classic data sets and analyzes indicators such as processing efficiency.

3. Based on the experimental results and analysis above, it designs and implements a Weibo hot-post analysis system. Experiments show that the approach meets the basic functional requirements of each module of the system and confirm the performance advantage of distributed data mining algorithms over single-machine computation. The system first combines the distributed naive Bayes algorithm with a classification rule algorithm to classify Weibo posts by topic, then applies the improved distributed k-Means algorithm proposed in the thesis to the topic-grouped data to analyze hot posts, and finally analyzes the evaluation metrics based on the results.
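To make the MapReduce formulation mentioned in the abstract more concrete, the following is a minimal single-process sketch of how one k-Means iteration might be split into a map step and a reduce step. It assumes plain k-Means with Euclidean distance; the abstract does not describe the thesis's improved k-Means or its Hadoop code, so this is not the author's implementation. All function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `kmeans`) are hypothetical, and a real Hadoop job would distribute the map calls across nodes rather than run them in a loop.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(pairs):
    """Reduce step: sum the points per centroid index and return the new means."""
    sums = {}
    counts = defaultdict(int)
    for idx, (point, n) in pairs:
        if idx not in sums:
            sums[idx] = list(point)
        else:
            sums[idx] = [s + x for s, x in zip(sums[idx], point)]
        counts[idx] += n
    return {idx: tuple(s / counts[idx] for s in sums[idx]) for idx in sums}


def kmeans_iteration(points, centroids):
    """One map + reduce pass; a centroid with no assigned points is kept as-is."""
    means = kmeans_reduce(kmeans_map(p, centroids) for p in points)
    return [means.get(i, c) for i, c in enumerate(centroids)]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat iterations until no centroid moves by more than tol."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(updated, centroids)):
            return updated
        centroids = updated
    return centroids
```

For example, `kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` groups the first two and last two points. The map step only needs the current centroids, which is why each iteration can be parallelized over the input points, while the reduce step needs only per-cluster sums and counts.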
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13

Qi Dan (祁丹); Research and Implementation of Data Mining Algorithms Based on Distributed Computing [D]; Beijing University of Posts and Telecommunications; 2016

