Short-Text Topic Mining and Text Classification Based on W-BTM

Published: 2018-01-20 12:58

  Keywords: W-BTM model; topic mining; short text; text classification. Source: Shanxi University of Finance and Economics, 2017 master's thesis. Type: degree thesis.


[Abstract]: With the rapid rise of the Internet, social networking sites and e-commerce, unstructured information, typified by text, has appeared in large volumes. Mining valuable information from it is increasingly important, yet complex semantics make that value increasingly hard to extract. Short texts in particular are sparse and incomplete, which poses a major new challenge for text mining, and research on text mining has therefore gradually shifted toward short texts.

BTM is a topic mining model designed for short text, and it has a clear advantage over other topic models in handling sparsity and incompleteness. However, existing text mining models, BTM included, have no dedicated parameter for handling stop words; they only load a stop-word list during data preprocessing and delete the listed words. Corpora differ, so applying the same stop-word list to every corpus is not well founded. For each corpus, the stop words that reflect its own textual characteristics should be identified.

Considering both the characteristics of short text and the handling of stop words, this thesis uses the difference coefficient as a weight model to express the weight of each word in the text, and then introduces that weight as a parameter of BTM to form the final W-BTM model, reducing the influence of short-text sparsity and stop words on topic mining. The model parameters are estimated with Gibbs sampling: samples are drawn from the prior distribution of the latent variables and used to estimate the posterior parameters. Finally, the model is applied to book-description data from Dangdang, a support vector machine (SVM) classifies the result matrix produced by W-BTM, and the classification results of different models are compared to show the advantage of W-BTM.

W-BTM searches for "word pairs" across the whole corpus on the premise that the weight (the difference coefficient) of each word in the pair is already known. Under this premise a word pair carries more meaning: it no longer represents only two words that co-occur in a document, but also the nature of each word, namely whether it is a stop word. This removes the effect that an inappropriate stop-word selection has on the accuracy of text mining. To verify the validity of W-BTM, text classification experiments are run with LDA and BTM as baselines, and the results are evaluated from two angles, topic mining and text classification. The classification performance of W-BTM is found to be better than that of LDA and BTM.

The contributions of the thesis are as follows. (1) For stop words, the traditional approach of choosing a stop-word list and deleting the listed words outright is replaced by a weight model, which makes the mining results more accurate. (2) The weight model is combined with BTM to form a new topic model, W-BTM, which can be used to classify short text, addresses short-text sparsity, and closes the gap left by stop-word handling during preprocessing. (3) W-BTM is applied to the classification of Dangdang book descriptions, giving the model practical significance. Through the handling of class imbalance in the data, the use of W-BTM, and SVM classification of the text-topic matrix, the validity of W-BTM is verified; comparing its classification results with those of LDA and BTM verifies its advantage.
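
The abstract does not give the formula for the difference coefficient or say how the weight enters BTM. As an illustration only, the sketch below assumes the difference coefficient is the coefficient of variation (standard deviation divided by mean) of a word's relative frequency across documents, and that a word pair's weight is the product of its two word weights. The function names, the product rule, and the toy corpus are all hypothetical and are not taken from the thesis. Under these assumptions, a word spread evenly over every document, as a stop word tends to be, gets a weight near zero, so word pairs containing it contribute almost nothing.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def coefficient_of_variation_weights(docs):
    """Weight each word by the coefficient of variation of its relative
    frequency across documents (assumed reading of 'difference coefficient').
    Every document must be a non-empty list of tokens."""
    vocab = {word for doc in docs for word in doc}
    weights = {}
    for word in vocab:
        freqs = [doc.count(word) / len(doc) for doc in docs]
        avg = mean(freqs)
        weights[word] = pstdev(freqs) / avg if avg > 0 else 0.0
    return weights


def weighted_biterms(docs, weights):
    """Collect unordered pairs of distinct words co-occurring in a document.
    Each occurrence adds the product of the two word weights (an assumed
    combination rule, not the thesis's definition)."""
    totals = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            totals[(a, b)] += weights[a] * weights[b]
    return totals


# Toy corpus: "the" occurs at the same rate in every document.
docs = [
    ["the", "topic", "model"],
    ["the", "book", "price"],
    ["the", "short", "text"],
]
weights = coefficient_of_variation_weights(docs)
biterms = weighted_biterms(docs, weights)
```

In this toy corpus "the" has zero variation across documents, so every word pair that includes it receives zero weight, while pairs of content words such as ("model", "topic") keep a positive weight. A full W-BTM implementation would still need the Gibbs sampler over these weighted pairs, which is not shown here.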
[Degree-granting institution]: Shanxi University of Finance and Economics
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1

Article ID: 1448271



Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/1448271.html

