Research on Big-Data-Based Density-Biased Sampling Algorithms and Their Application

Published: 2019-05-19 14:07

[Abstract]: With the emergence of the concept of "big data", data mining has become a research hotspot in the big data field. Given the computing and storage resources that big data mining consumes, improving the efficiency of processing very large data sets has become the key to the problem. In cluster analysis, there are currently two main ways to improve the execution efficiency of data mining: improving the classical clustering algorithms, or using sampling techniques to reduce the size of the original data set. In the big data setting, data grow far faster than algorithms can be improved and updated, so sampling techniques are particularly important in cluster analysis. When traditional sampling techniques are applied to data sets that are highly skewed or whose distribution is unknown, they lead to poor sampling results, unrepresentative samples, and loss of classes (clusters); density-biased sampling can effectively address these problems. This thesis uses density-biased sampling to study unevenly distributed data sets and explores sampling algorithms suited to such data. In recent years, research on density-biased sampling has focused on how to partition a grid space that is consistent with the data set, based on the information characteristics of the original data. To address the long time needed to construct a variable grid, the thesis improves an existing variable grid partitioning method. First, the method dynamically determines the partition granularity of each dimension from the mean of the original data in that dimension. Second, it adjusts the intervals according to the similarity of interval densities, constructing a variable grid space that is consistent with the distribution of the original data set. Finally, it combines this grid space with density-biased sampling to design an optimized density-biased sampling algorithm that builds the variable grid from mean information. Experimental validation shows that, on large, unevenly distributed data sets, the algorithm avoids class loss, effectively improves sample quality, shortens sampling time, and has some advantage in execution efficiency.
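The general idea of density-biased sampling on a grid can be sketched as follows. This is a minimal illustration and not the thesis's algorithm: it uses a plain equal-width grid instead of the mean-based variable grid described above, and it assumes one common formulation in which a point in a cell holding n_i points is kept with probability alpha / n_i^e. The function name `density_biased_sample` and the parameters `bins_per_dim`, `e`, and `seed` are hypothetical names chosen for this example.

```python
import numpy as np


def density_biased_sample(points, target_size, bins_per_dim=10, e=1.0, seed=0):
    """Grid-based density-biased sampling (illustrative sketch).

    Returns (indices, weights): the indices of the sampled rows and, for
    each, a weight 1 / p, where p is that row's inclusion probability.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)

    # Assign every point to a cell of an equal-width grid (an assumption;
    # the thesis builds a variable grid instead).
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)        # guard against zero range
    cell = np.floor((points - lo) / span * bins_per_dim).astype(int)
    cell = np.clip(cell, 0, bins_per_dim - 1)     # maximum falls in last bin

    # Count the points in each occupied cell.
    _, cell_id, counts = np.unique(
        cell, axis=0, return_inverse=True, return_counts=True
    )
    cell_id = cell_id.ravel()
    n = counts[cell_id].astype(float)             # cell population per point

    # Inclusion probability alpha / n_i**e. alpha is chosen so that the
    # expected sample size equals target_size before clipping to 1.
    alpha = target_size / np.sum(counts.astype(float) ** (1.0 - e))
    prob = np.minimum(1.0, alpha / n ** e)

    keep = rng.random(len(points)) < prob
    idx = np.flatnonzero(keep)
    return idx, 1.0 / prob[idx]
```

With `e=0` every point has the same inclusion probability, so the sketch reduces to uniform random sampling; with `e=1` every occupied cell contributes the same expected number of points, which is what keeps small, sparse clusters from disappearing from the sample. The returned weights let a downstream clustering step compensate for the deliberate bias.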
[Degree-granting institution]: Guizhou Minzu University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: C81

Article ID: 2480786

Article link: http://www.sikaile.net/shekelunwen/shgj/2480786.html
