An Online Supervised Topic Model Based on Stochastic Variational Inference and Its Parallel Implementation

Published: 2018-03-17 02:28

Topic: supervised topic models　Focus: MapReduce　Source: Jilin University, master's thesis, 2017　Document type: degree thesis

[Abstract]: In machine learning, topic models and supervised topic models are general-purpose models for analyzing natural language. They use probability distributions to reveal the internal structure of text and present it in the form of "topic structures" and "labels". Supervised topic models are widely applied in text analysis, public opinion monitoring, and e-commerce, and have therefore become an active research topic in machine learning. However, sLDA, a commonly used supervised topic model, is trained with a variational EM algorithm that has a coordinate ascent algorithm nested inside it. As the amount of data grows, stacking these two iterative optimization procedures makes sLDA's training time grow exponentially. In addition, sLDA's learning algorithm is an offline (batch) algorithm, which makes it unsuitable for applications with strict real-time requirements and large data volumes, such as text classification and public opinion monitoring. These problems seriously constrain the development of supervised topic models. To address them, this thesis makes the following contributions:

1. An efficient online learning algorithm for supervised topic models is proposed. The thesis improves sLDA's learning algorithm with stochastic variational inference. Based on the theory that the natural gradient in Riemannian space points more accurately toward the maximum likelihood, the Euclidean gradient in sLDA's learning algorithm is replaced with the natural gradient, which speeds up convergence. In addition, following the idea of stochastic optimization, each iteration randomly samples a subset of the training data to estimate the gradient of the global parameters. This reduces the computational burden and also gives sLDA the ability to learn online.

2. A parallel learning algorithm for the online supervised topic model is proposed and implemented to support several deployment scenarios. Because the number of documents sampled per iteration affects the label prediction results, the training algorithm must allow the per-iteration sample size to be set flexibly. The thesis adopts the popular MapReduce parallel computing framework to distribute the online supervised topic model so that it can be applied to large-scale data. Furthermore, using the flexibility of Python and mrjob, the algorithm is implemented in versions that support single-machine single-process, single-machine multi-process, distributed, and cloud computing, further extending its range of application.
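The two ideas in the abstract, a minibatch natural-gradient update and a map/reduce split of the per-document work, can be sketched together in a few lines. This is only an illustration under stated assumptions, not the thesis's implementation: it uses plain (unsupervised) LDA and omits sLDA's supervised response term, it simulates the map and reduce phases in-process instead of using mrjob, and every name here (`map_document`, `reduce_counts`, `global_update`, `train`), the toy corpus, the step-size schedule, and the hyperparameter values are hypothetical.

```python
import math
import random
from collections import defaultdict


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))


def map_document(doc, lam, alpha, n_iter=20):
    """Map step: fit one document's local parameters by coordinate ascent,
    then emit ((topic, word), expected count) pairs."""
    n_topics = len(lam)
    # E[log beta_kw] under q(beta_k) = Dirichlet(lam[k])
    e_log_beta = []
    for row in lam:
        total = digamma(sum(row.values()))
        e_log_beta.append({w: digamma(v) - total for w, v in row.items()})
    gamma = [1.0] * n_topics
    phi = {}
    for _ in range(n_iter):
        # The shared -digamma(sum(gamma)) term cancels in the normalization.
        e_log_theta = [digamma(g) for g in gamma]
        new_gamma = [alpha] * n_topics
        for word, count in doc.items():
            logits = [e_log_theta[k] + e_log_beta[k][word] for k in range(n_topics)]
            top = max(logits)
            weights = [math.exp(v - top) for v in logits]
            norm = sum(weights)
            phi[word] = [v / norm for v in weights]
            for k in range(n_topics):
                new_gamma[k] += count * phi[word][k]
        gamma = new_gamma
    for word, count in doc.items():
        for k in range(n_topics):
            yield (k, word), count * phi[word][k]


def reduce_counts(pairs):
    """Reduce step: sum the expected counts that share a (topic, word) key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return totals


def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t) ** -kappa (assumed schedule)."""
    return (tau0 + t) ** (-kappa)


def global_update(lam, totals, eta, corpus_size, batch_size, rho):
    """Natural-gradient step: lam <- (1 - rho) * lam + rho * lam_hat, where
    lam_hat = eta + (corpus_size / batch_size) * minibatch statistics."""
    scale = corpus_size / batch_size
    return [
        {w: (1 - rho) * v + rho * (eta + scale * totals.get((k, w), 0.0))
         for w, v in row.items()}
        for k, row in enumerate(lam)
    ]


def train(corpus, vocab, n_topics=2, batch_size=2, n_steps=20,
          alpha=0.5, eta=0.1, seed=0):
    """Run n_steps of: sample a minibatch, map, reduce, global update."""
    rng = random.Random(seed)
    lam = [{w: rng.uniform(0.5, 1.5) for w in vocab} for _ in range(n_topics)]
    for t in range(n_steps):
        batch = rng.sample(corpus, batch_size)  # adjustable per-iteration sample size
        pairs = (pair for doc in batch for pair in map_document(doc, lam, alpha))
        totals = reduce_counts(pairs)
        lam = global_update(lam, totals, eta, len(corpus), batch_size, step_size(t))
    return lam


# Toy corpus: each document is a word -> count dictionary.
corpus = [
    {"goal": 3, "match": 2},
    {"match": 1, "team": 2, "goal": 1},
    {"stock": 3, "market": 2},
    {"market": 1, "price": 2, "stock": 1},
]
vocab = sorted({w for doc in corpus for w in doc})
lam = train(corpus, vocab)
```

Because each document is processed independently in `map_document` and the results are combined only by summation in `reduce_counts`, the per-document work is the natural unit to distribute. In mrjob these two functions would roughly correspond to the `mapper` and `reducer` methods of an `MRJob` subclass, though how the thesis actually structures its jobs is not stated in the abstract.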
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1;TP181

Article ID: 1622777

Article link: http://www.sikaile.net/kejilunwen/zidonghuakongzhilunwen/1622777.html