Research on Several Problems in Text Mining of Network Information

Published: 2018-04-12 06:34

Topic: text mining + feature clustering. Reference: Beijing Institute of Technology, 2015 doctoral dissertation

[Abstract]: Text collections are now very large and extremely high-dimensional, and designing sound, easily extensible text mining algorithms has become an active topic in data mining. This dissertation studies several problems in text mining in depth. Its main innovations fall into the following five areas.

1. The traditional vector space model has too many dimensions and cannot handle synonyms and near-synonyms. To address this, the dissertation proposes a vector space model based on feature clustering. The model first represents each feature as a vector, then clusters the features and treats each resulting cluster as a single feature. In addition, discontinuous phrases of proper nouns are recognized, which makes the feature information in the text representation vector richer and more precise. The method both reduces the dimension of the text vectors and reflects the semantic relationships between text features, and therefore improves the quality of text mining. Experiments show that the resulting text representation vectors have a high feature reduction rate and that the clustering F-measure improves markedly over the traditional method.

2. The traditional K-means algorithm selects its initial center points at random, which easily causes the results to fluctuate. To address this, the dissertation proposes a K-means algorithm based on a similarity matrix. Instead of choosing the initial cluster centers at random, the method uses the similarity matrix to select more effective initial centers in a targeted way. This gives the clustering process a good start and reduces the instability that the initial centers introduce into the final result, so better clustering quality is obtained. Experiments show that the improved algorithm raises the clustering F-measure markedly and that its results are relatively stable.

3. Text mining applications often lack sufficient labeled data. To address this, the dissertation proposes a semi-supervised K-means algorithm. The method uses labeled and unlabeled data together and exploits the characteristics of the labeled data to help label the unlabeled data. When selecting initial points, some are the class centers of the labeled data, and the rest are unlabeled points that lie far from the points already selected. This ensures that the initial points belong to different clusters and leads to more accurate results. Experiments show that the algorithm is effective and, to some extent, solves the problem of insufficient labeled data.

4. Unbalanced training corpora are common and degrade classification quality. To address this, the dissertation proposes a hybrid weighted KNN algorithm. By analyzing the distribution of the training samples and weighting by the reciprocal of the class proportion, the method makes every training sample equally likely to fall into the region around the sample to be classified, so the result is no longer affected by the unbalanced class distribution. It also applies distance weighting, so that the closer a training sample is to the sample to be classified, the larger its weight. Experiments show that the algorithm achieves good classification accuracy and is an effective way to classify with unbalanced training corpora.

5. To improve computational efficiency and make large data sets easier to process, the text clustering and text classification algorithms proposed in the dissertation are parallelized with MapReduce and integrated as modules into a complete text mining system, which automates the whole text mining workflow. Experiments show that parallelizing the improved algorithms does not affect the accuracy of text mining and greatly improves running efficiency.
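The feature-clustering vector space model in the abstract can be illustrated with a small sketch. The abstract does not say how each feature is vectorized or which clustering algorithm is used, so everything specific here is an assumption: each term is represented by its counts across documents, terms are grouped by a greedy cosine-similarity threshold, and the names `cluster_features` and `to_cluster_space` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_features(term_vectors, threshold=0.8):
    """Greedy single-pass clustering (an assumed stand-in for the thesis method).

    A term joins the first cluster whose seed term is at least `threshold`
    similar to it; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of terms; element 0 is the seed
    for term, vec in term_vectors.items():
        for members in clusters:
            if cosine(vec, term_vectors[members[0]]) >= threshold:
                members.append(term)
                break
        else:
            clusters.append([term])
    return clusters


def to_cluster_space(doc_counts, clusters):
    """Re-express one document's term counts with one dimension per cluster."""
    return [sum(doc_counts.get(t, 0) for t in members) for members in clusters]


# Toy data: each term vector holds the term's counts in three documents.
term_vectors = {"car": [2, 0, 1], "automobile": [4, 0, 2], "banana": [0, 3, 0]}
clusters = cluster_features(term_vectors)
print(clusters)  # [['car', 'automobile'], ['banana']]
print(to_cluster_space({"car": 1, "automobile": 2, "banana": 1}, clusters))  # [3, 1]
```

In this toy case three term dimensions collapse to two, and the two synonyms share one dimension, which is the dimension-reduction and synonym-handling effect the abstract describes.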
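For the similarity-matrix K-means described in the abstract, the exact rule for choosing initial centers is not given. The sketch below shows one plausible deterministic rule, stated as an assumption: take the point with the highest total similarity to all points as the first seed, then repeatedly add the point that is least similar to its most similar chosen seed. The function name `pick_seeds` is hypothetical, and the returned indices would be handed to an ordinary K-means loop.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(points):
    """Full pairwise cosine similarity matrix."""
    return [[cosine(p, q) for q in points] for p in points]


def pick_seeds(points, k):
    """Return indices of k initial centers chosen from the similarity matrix.

    Assumed rule, not taken from the thesis: most central point first, then
    farthest-first in terms of similarity to the seeds chosen so far.
    """
    sim = similarity_matrix(points)
    n = len(points)
    first = max(range(n), key=lambda i: sum(sim[i]))
    seeds = [first]
    while len(seeds) < k:
        candidates = [i for i in range(n) if i not in seeds]
        # pick the candidate whose closest seed is least similar to it
        nxt = min(candidates, key=lambda i: max(sim[i][s] for s in seeds))
        seeds.append(nxt)
    return seeds


points = [[1, 0], [1, 0.1], [1, 0.2], [0, 1]]
print(pick_seeds(points, 2))  # [2, 3]
```

Because the selection depends only on the data, repeated runs start from the same centers, which is the source of the stability the abstract reports.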
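The initial-point selection of the semi-supervised K-means can be sketched as follows. The abstract only says that some initial points are class centers of the labeled data and the rest are unlabeled points far from those already selected. The Euclidean distance, the farthest-first rule, and the name `semi_supervised_seeds` are assumptions, and the sketch expects at least one labeled class and `k` no smaller than the number of labeled classes.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def semi_supervised_seeds(labeled, unlabeled, k):
    """Choose k initial centers from labeled and unlabeled data.

    labeled:   list of (vector, label) pairs
    unlabeled: list of vectors
    """
    by_class = defaultdict(list)
    for vec, label in labeled:
        by_class[label].append(vec)
    # one seed per labeled class: the class center
    seeds = [centroid(vecs) for _, vecs in sorted(by_class.items())]
    remaining = list(unlabeled)
    # fill the rest with unlabeled points far from every seed chosen so far
    while len(seeds) < k and remaining:
        far = max(remaining, key=lambda p: min(math.dist(p, s) for s in seeds))
        seeds.append(far)
        remaining.remove(far)
    return seeds


labeled = [([0.0, 0.0], "a"), ([2.0, 0.0], "a"), ([10.0, 10.0], "b")]
unlabeled = [[1.0, 1.0], [0.0, 10.0], [9.0, 9.0]]
print(semi_supervised_seeds(labeled, unlabeled, 3))
# [[1.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
```

The third seed is the unlabeled point that sits far from both class centers, so the three initial points are likely to fall in three different clusters.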
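The hybrid weighted KNN combines two weights per neighbor: the reciprocal of the neighbor's class proportion and a distance weight. The abstract does not give the formula, so the sketch below assumes the vote of each neighbor is `(1 / class share) * (1 / (distance + eps))` with Euclidean distance. The function name and the `eps` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def hybrid_weighted_knn(train, query, k=3, eps=1e-9):
    """Predict a label for `query` from `train`, a list of (vector, label)."""
    counts = Counter(label for _, label in train)
    class_share = {label: n / len(train) for label, n in counts.items()}
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for vec, label in neighbours:
        proportion_weight = 1.0 / class_share[label]  # rarer class, larger weight
        distance_weight = 1.0 / (math.dist(vec, query) + eps)  # closer, larger weight
        votes[label] += proportion_weight * distance_weight
    return max(votes, key=votes.get)


# Eight majority samples and two minority samples on a line.
train = [([x], "maj") for x in (4.4, 4.3, 0.0, 1.0, 2.0, 3.0, -1.0, -2.0)]
train += [([5.5], "min"), ([9.0], "min")]
print(hybrid_weighted_knn(train, [5.0], k=3))  # min
```

With these samples the three nearest neighbors are one minority point and two majority points, so an unweighted majority vote would return "maj". The combined weighting returns "min" because the minority neighbor is both the closest and from the rarer class.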
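The MapReduce parallelization is only named in the abstract, not described. As a rough illustration of how one K-means iteration is commonly split into map and reduce steps, the sketch below simulates the pattern in a single Python process. It does not use the Hadoop API, and `map_point`, `reduce_cluster`, and `kmeans_iteration` are hypothetical names. In a real job the map calls would run in parallel over data partitions.

```python
import math
from collections import defaultdict


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)),
                  key=lambda i: math.dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step: average all points that were sent to one centroid."""
    total = [0.0] * len(values[0][0])
    count = 0
    for vec, n in values:
        total = [t + x for t, x in zip(total, vec)]
        count += n
    return [t / count for t in total]


def kmeans_iteration(points, centroids):
    """One K-means update expressed as map, shuffle, and reduce."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = map_point(p, centroids)
        grouped[key].append(value)
    # a centroid that received no points keeps its old position
    return [reduce_cluster(grouped[i]) if i in grouped else centroids[i]
            for i in range(len(centroids))]


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
print(kmeans_iteration(points, [[1.0, 1.0], [9.0, 9.0]]))
# [[0.0, 1.0], [11.0, 10.0]]
```

Each map call depends only on one point and the current centroids, so the assignments are the same as in a sequential run. This is consistent with the abstract's claim that parallelization leaves accuracy unchanged.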
[Degree-granting institution]: Beijing Institute of Technology
[Degree level]: Doctoral
[Year degree awarded]: 2015
[Classification number]: TP391.1

Article ID: 1738575

Article link: http://www.sikaile.net/shoufeilunwen/xxkjbs/1738575.html
