Design and Implementation of a Distributed Hierarchical Clustering Algorithm for Massive Commodity Data

Published: 2019-03-16 15:14

[Abstract]: Advances in computer science and information technology have made it easy for enterprises to collect and store large amounts of data. Left unanalyzed, however, that data merely occupies storage space and adds little value to the enterprise, so enterprises have begun mining it for information. Traditionally, experts analyzed and interpreted the data, an approach that becomes increasingly difficult as data volume and the number of attributes grow sharply. How to discover knowledge automatically in very large databases, and then turn it into the business intelligence an enterprise depends on, has therefore become an important problem that enterprises and institutions must face in the twenty-first century. In production practice, the conflict between the rate at which data grows and the time that data analysis consumes is increasingly prominent. Data mining emerged to address the limitations of traditional analysis methods on large-scale data: by applying self-learning algorithms to large data sets, it extracts knowledge and information that is hidden in the data and otherwise hard to obtain. Customs, as the main regulator of national commodity imports and exports, is both the producer and the owner of massive import and export data. As the digitization of its business processes has deepened and matured, customs has largely achieved data-driven supervision and digital operation. At the same time, the gap between relatively limited data analysis tools and ever-growing data volume and business complexity has become increasingly prominent. How to classify and manage the vast number of declared commodities effectively is an urgent problem in customs supervision.
Built around a customs commodity data analysis project, this thesis implements a series of commodity data processing modules on the MapReduce framework, which together form a distributed clustering system for commodity data. The main components are commodity data preprocessing, TF-IDF calculation, inverted index construction, similarity matrix calculation, and single-linkage hierarchical clustering. Finally, the hierarchical clustering results are used to organize the customs commodity data, providing an accurate basis for grouped statistics in the customs intelligence analysis and assessment module, and the system has produced results in practical use.
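The pipeline named in the abstract (TF-IDF, inverted index, similarity matrix, single-linkage clustering) can be illustrated with a minimal single-machine sketch. This is not the thesis's MapReduce implementation, which the abstract does not describe in detail. The function names, the cosine similarity measure, the similarity threshold, and the toy commodity descriptions are all assumptions made for illustration. Cutting a single-linkage hierarchy at a similarity threshold is equivalent to finding connected components among pairs at or above that threshold, which is what the union-find step does.

```python
import math
from collections import Counter, defaultdict


def tf_idf(docs):
    """Return one L2-normalized {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def inverted_index(vectors):
    """Map each term to the (doc_id, weight) pairs of documents containing it."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, weight in vec.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index


def pairwise_similarity(index):
    """Cosine similarity for every document pair that shares at least one term."""
    sims = defaultdict(float)
    for postings in index.values():
        for i in range(len(postings)):
            for j in range(i + 1, len(postings)):
                (a, wa), (b, wb) = postings[i], postings[j]
                sims[(a, b)] += wa * wb
    return sims


def single_linkage(n, sims, threshold):
    """Merge the most similar pairs first; stop below the threshold (assumed cut)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), s in sorted(sims.items(), key=lambda kv: -kv[1]):
        if s < threshold:
            break
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


# Hypothetical, already-tokenized commodity descriptions.
docs = [
    ["cotton", "shirt", "men"],
    ["cotton", "shirt", "women"],
    ["steel", "pipe", "seamless"],
    ["steel", "pipe", "welded"],
]
vectors = tf_idf(docs)
sims = pairwise_similarity(inverted_index(vectors))
clusters = single_linkage(len(docs), sims, threshold=0.2)
print(clusters)  # [[0, 1], [2, 3]]
```

Accumulating similarity through the inverted index means only pairs that share a term are ever compared, which is one plausible reason a pipeline like this builds the index before the similarity matrix; in a distributed setting each term's posting list could be processed independently.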
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13



Article ID: 2441622


Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/2441622.html
