A Multi-Agent-Based Distributed Text Clustering Model

Published: 2019-01-06 16:38

[Abstract]: Big data on the Internet grows daily, and there is an urgent need for new clustering methods that can handle large-scale semi-structured and unstructured text data. Existing work has three shortcomings: it is applied to a narrow range of text collections, its accuracy on semi-structured and unstructured Web text is low, and its timeliness cannot be guaranteed when the document collection is large. To address these shortcomings, this paper proposes Switch (a Swarm intelligence based text clustering algorithm), a new text clustering model based on swarm intelligence that supports multilingual text clustering, including Tibetan, Chinese, and English. The basic idea is as follows. First, a vector space model of the text is built, and natural language processing and data preprocessing techniques are used to obtain a text collection made up of feature vectors. Next, the parameters of the swarm intelligence text clustering algorithm are initialized. Agents move freely over a two-dimensional text space, compute the similarity between the text in their current grid region and other samples, and use a probability conversion function to obtain the probability of picking up or dropping a sample, which yields the clustering.
A multi-agent architecture for distributed dynamic text stream clustering is also proposed and applied to the swarm intelligence text clustering algorithm. The distributed working environment is designed as a set of software agents that communicate with one another, and three kinds of agents are designed: similarity computation, agent state awareness, and text parsing. By addressing agent state synchronization, processor load balancing, and the cost of inter-processor communication, the computation is divided into subtasks that execute in a distributed manner on multiple processors. In addition, the working principle of the multi-agent-based distributed swarm intelligence text clustering method is described, and a distributed communication architecture is given in which the agents communicate and cooperate to complete the clustering. Distributed text clustering on a cluster is implemented on top of the JADE (Java Agent Development Framework) middleware. The advantages are that distributed computing and large-memory processing offer more processing capacity than a single machine, and that JADE lets agents communicate and cooperate, which makes the clustering efficient. Experiments were carried out on a large number of real semi-structured Web text datasets containing Tibetan, Chinese, and English. Taking Tibetan as an example, the results show that, compared with k-means and with the swarm intelligence clustering algorithm on a single node, the proposed algorithm under the distributed architecture is on average 12.2% and 3.8% more accurate, and reduces time cost by 73.0% and 50.6% on average, respectively. On a cluster of n nodes, when the number of agents is between 150 and 250, the time cost of text clustering is approximately 1/n of that on a single node.
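The pick-up and drop step described in the abstract can be sketched in Python as follows. The abstract does not give the form of the probability conversion function, so this sketch assumes the classic Deneubourg-style functions from ant-based clustering, cosine similarity over term-frequency vectors, and whitespace tokenization. The function names and the constants k1 and k2 are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tf_vector(text):
    """Build a sparse term-frequency vector (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def neighbourhood_similarity(doc, neighbours):
    """Mean similarity of `doc` to the documents in the agent's grid region."""
    if not neighbours:
        return 0.0
    return sum(cosine_similarity(doc, n) for n in neighbours) / len(neighbours)


def pick_probability(f, k1=0.1):
    """Assumed Deneubourg-style form: high when local similarity f is low."""
    return (k1 / (k1 + f)) ** 2


def drop_probability(f, k2=0.15):
    """Assumed Deneubourg-style form: high when local similarity f is high."""
    return (f / (k2 + f)) ** 2
```

With these assumed forms, an agent standing on a document that resembles its neighbours rarely picks it up, and an agent carrying a document tends to drop it where similar documents already lie, so similar texts gradually accumulate in the same grid region.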
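[Author affiliations]: School of Cybersecurity, Chengdu University of Information Technology; School of Management, Chengdu University of Information Technology; School of Data Science and Engineering, East China Normal University; College of Computer Science and Technology, Zhejiang University; School of Information Science and Technology, Southwest Jiaotong University; College of Computer Science, Sichuan University
[Funding]: Supported by the National Natural Science Foundation of China (61772091, 61165013, 61363037); the Humanities and Social Sciences Research Planning Fund of the Ministry of Education (15YJAZH058); the Sichuan Universities Scientific Research Innovation Team Building Program (18TD0027); the Research Fund for Young and Middle-Aged Academic Leaders of Chengdu University of Information Technology (J201701); the Sichuan Science and Technology Program (2018JY0448); and the Guangxi Natural Science Foundation (2017JJD170122y)
[Classification number]: TP391.1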

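[Source]: Qiao Shaojie; Han Nan; Jin Cheqing; Gao Yunjun; Li Tianrui; Tang Changjie; Kang Jian. A Multi-Agent-Based Distributed Text Clustering Model [J]. Chinese Journal of Computers, 2018, No. 08.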


Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/2403046.html
