Research on a Density-Biased Sampling Algorithm for Big Data and Its Applications
[Abstract]: Since the concept of big data was introduced, data mining has become an active research topic in the field. Mining big data consumes substantial computing and storage resources, so improving the efficiency of large-scale data processing is the key to making it practical. In cluster analysis, there are currently two main ways to improve the efficiency of data mining: one is to improve the classical clustering algorithms, and the other is to reduce the size of the original data set through sampling. In the big data setting, data volumes grow much faster than algorithms are improved and updated, so sampling is particularly important in cluster analysis. When traditional sampling techniques are applied to data sets that are heavily skewed and whose distribution is unknown, they tend to give poor sampling results, unrepresentative samples, and class loss (small classes missing from the sample entirely). Density-biased sampling can effectively address these problems. This thesis studies density-biased sampling for unevenly distributed data sets and explores a sampling algorithm suited to such data. In recent years, research on density-biased sampling has focused on how to partition the data space into a grid that matches the data set, based on the characteristics of the original data. Because constructing a variable grid is time-consuming, this thesis improves the existing variable grid partitioning method. First, the granularity of each dimension is determined dynamically from the mean information of that dimension in the original data set. Second, the similarity between interval densities is used to adjust the intervals, producing a variable grid space consistent with the distribution of the original data set.
Finally, a density-biased sampling optimization algorithm based on mean information is designed by combining this grid space with density-biased sampling. Experimental verification and analysis show that the algorithm avoids class loss, effectively improves sample quality, and shortens sampling time, and that it also has some advantages in execution efficiency.
【Degree-granting institution】: Guizhou Minzu University
【Degree level】: Master's
【Year degree awarded】: 2017
【Classification number】: C81