Research on Semantics-Based Text Clustering Algorithms

Published: 2018-04-09 17:44

Topic: text clustering  Angle: continuous word vectors  Source: Beijing Jiaotong University, master's thesis, 2017

[Abstract]: With the rapid development of information technology, the volume of data on the web is growing exponentially. How to quickly and accurately pick out target information from massive web resources has become one of the important problems people face. Text clustering, an important text mining technique that spans data mining, machine learning, and natural language processing, has emerged against this background.

The vector space model is widely used in text clustering research because it is simple and efficient. However, the traditional vector space model uses the words of a text directly as the features of its representation and ignores the semantic relations that may exist between words, which leads to a loss of the text's semantic information. To address this, some researchers have proposed using semantic disambiguation to map the words in a text to the WordNet concepts that correspond to their senses, so that ambiguous words and synonyms in the text can be identified. Analysis of these methods shows that their disambiguation strategies have some shortcomings. This thesis therefore proposes a semantic disambiguation algorithm based on continuous word vectors, which explores using a neural network language model to mine the semantic similarity between concepts and their contexts in depth and thereby improve disambiguation accuracy. Applying this algorithm to text clustering yields a text clustering algorithm based on continuous-word-vector semantic disambiguation.

Because the WordNet ontology contains a large amount of semantic knowledge organized in a structured form, a number of WordNet-based text representation methods aimed at enriching the semantic expression of text have been proposed and applied to text clustering research. However, because the semantic information in text data is complex and diverse, and WordNet contains as many as 100,000 concepts, these methods generally suffer from excessively high-dimensional text vectors. To address this, the thesis proposes a feature dimensionality reduction algorithm based on concept clusters, which uses concept clustering to extract coarse-grained features from text and so reduce the dimensionality of the text representation. The hardest and most critical problem in this algorithm is how to obtain semantic representations of concepts for the subsequent concept clustering. Building on the effectiveness of neural network language models in semantic feature extraction, the thesis explores encoding the definitional (gloss) relations between WordNet concepts into a concept corpus, and uses a neural network language model to learn the semantic representation of each concept from the co-occurrence of concepts in that corpus.

By combining the semantic disambiguation algorithm based on continuous word vectors with the feature dimensionality reduction algorithm based on concept clusters, the thesis implements a text clustering algorithm based on continuous word vectors and concept clusters, which aims to improve both the accuracy of text clustering and the efficiency of the clustering algorithm. Experimental comparison with several classic text clustering algorithms shows that the proposed algorithm not only effectively improves clustering accuracy but also handles the high dimensionality of text representation well.
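The two ideas in the abstract can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the thesis's implementation: the 2-d vectors, the sense inventory `SENSES`, the concept names, and the `CONCEPT_CLUSTER` assignment are all made up, whereas the thesis would obtain vectors from a trained neural network language model and concepts from WordNet. The sketch picks the candidate concept whose gloss vector is most similar to the averaged context vector, then counts documents over concept clusters instead of individual words.

```python
import math
from collections import Counter

# Toy 2-d "word vectors" (made-up values); a real system would load vectors
# learned by a neural network language model.
WORD_VECS = {
    "money": (1.0, 0.0), "deposit": (0.9, 0.1), "loan": (0.8, 0.2),
    "river": (0.0, 1.0), "water": (0.1, 0.9), "shore": (0.2, 0.8),
}

# Hypothetical sense inventory: each candidate concept is described by a few
# gloss words, standing in for a WordNet definition.
SENSES = {
    "bank": {
        "bank.finance": ["money", "deposit"],
        "bank.river": ["river", "shore"],
    }
}

# Hypothetical concept-to-cluster assignment, as a concept clustering step
# might produce it.
CONCEPT_CLUSTER = {"bank.finance": 0, "bank.river": 1}


def mean_vector(words):
    """Average the vectors of the known words; None if no word is known."""
    vecs = [WORD_VECS[w] for w in words if w in WORD_VECS]
    if not vecs:
        return None
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def disambiguate(word, context):
    """Return the candidate concept whose gloss vector is closest to the context."""
    ctx = mean_vector(context)
    if ctx is None or word not in SENSES:
        return None
    candidates = SENSES[word]
    return max(candidates, key=lambda c: cosine(mean_vector(candidates[c]), ctx))


def cluster_features(tokens):
    """Count disambiguated concepts per concept cluster for one document."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        context = tokens[:i] + tokens[i + 1:]
        concept = disambiguate(tok, context)
        if concept is not None:
            counts[CONCEPT_CLUSTER[concept]] += 1
    return counts
```

In this toy setup, `disambiguate("bank", ["loan", "money"])` selects the financial concept, and the resulting document vector has one dimension per concept cluster rather than one per word or concept, which is the source of the dimensionality reduction.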
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1


Article ID: 1727475


Article link: http://www.sikaile.net/shoufeilunwen/xixikjs/1727475.html
