Short-Text Relatedness Computation Based on Wikipedia
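Title keywords: Short-Text Relatedness Computation Based on Wikipedia. Source: Taiyuan University of Technology, master's thesis, 2017. Document type: degree thesis.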
Related topics: Wikipedia; relatedness; short text; semantic relatedness; association rules
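【Abstract (translated from the Chinese)】: With the development of mobile communication technology and social media, information in the form of Chinese short texts has permeated every area of society and daily life. The enormous growth in information volume has also created enormous practical value, and how to mine the deeper value of these texts has become a hot topic. Natural language processing has therefore become a focus of research. Semantic relatedness computation, a fundamental task in natural language processing, is widely applied in query expansion, word sense disambiguation, machine translation, knowledge extraction, automatic error correction, and other areas. Short text, as an emerging source of textual information, contains few words, conveys weak conceptual signals, and has ambiguous feature information, so effective features are hard to extract. Because the information a short text expresses is limited, a large amount of background knowledge is needed to expand the sample features. Wikipedia, currently the world's largest multilingual, open online encyclopedia, is favored by many researchers, so this thesis chooses Chinese Wikipedia as the external corpus; Wikipedia's structural and semantic information also provides a basis for the semantic analysis of short texts. The thesis divides short text into two levels, words and sentences, and first proposes a Wikipedia-based method for computing relatedness between words. The method combines the structural and semantic information in Wikipedia, whose main structures include the category hierarchy, the link structure in article abstracts, the link structure in article bodies, and redirect and disambiguation pages, and yields a word relatedness measure that integrates category relatedness and link relatedness. To explore the deeper semantic information of words, a method that computes word relatedness using association rules is also proposed. On this basis, the thesis proposes a method for computing relatedness between sentences, approached from three aspects: relatedness of sentence structure, word-pair-based relatedness, and cluster relatedness in which clustering is used to weight topic words. Sentence structure in turn covers two aspects, word form and word order. Word-form relatedness is reflected mainly by the frequency of word co-occurrence, and word-order relatedness by the number of inversions. Word-pair-based relatedness mainly considers the deep semantic information of the words in a sentence and agrees better with human subjective judgment. Clustering groups semantically related words or texts into a class or cluster; the thesis applies it to sentence relatedness computation to improve accuracy. With the theoretical method in place, the experimental plan was designed. First, the Chinese Wikipedia corpus was downloaded and processed; second, word and sentence relatedness were computed; finally, the results were compared with human-annotated sets. The experiments used a manually translated Word Similarity-353 test set and Words-240, compiled by the National University of Defense Technology, as word relatedness test sets, and the short-text semantic relatedness evaluation dataset provided by the Chinese web database knowledge extraction contest as the sentence relatedness test set. Measured by the Spearman coefficient and accuracy, the Spearman coefficient of the proposed word relatedness method is 2.8% higher than that of traditional algorithms, and sentence relatedness accuracy reaches 73.3%. These results support the soundness and practicality of the proposed method.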
[Abstract]: With the rapid development of mobile communication technology and social media, Chinese short-text information has penetrated all fields of society and life. The huge growth in information volume has also produced huge practical value, and how to dig out the deep value of these texts has become a hot topic. Natural language processing has therefore become a research hotspot. Semantic relatedness computation, a basic task in natural language processing, is widely used in word sense disambiguation, query expansion, machine translation, knowledge extraction, automatic error correction, and other fields. Short text, as a new source of textual information, has few words, weak conceptual signals, and fuzzy feature information, so it is difficult to extract effective features from it. Because the information expressed by a short text is limited, a large amount of background knowledge is needed to expand the sample features. Wikipedia, currently the world's largest multilingual, open online encyclopedia, is favored by many researchers, so this thesis chooses Chinese Wikipedia as the external corpus; the structural and semantic information of Wikipedia also provides the basis for short-text semantic analysis. Short text is divided into two parts, words and sentences. The thesis first proposes a Wikipedia-based method for computing word relatedness. The method draws on the structural and semantic information in Wikipedia, whose main structures include the category hierarchy, the link structure in abstracts, the link structure in article bodies, and redirect and disambiguation pages, and combines category relatedness with link relatedness to compute the relatedness between words. To explore deeper semantic information, the thesis also proposes a method that uses association rules to compute word relatedness.
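The abstract does not give the formulas behind the category and link measures, so the following is only a minimal sketch of how two such scores might be combined. It assumes Jaccard overlap for both components and a linear weight `alpha`; the function names, the weight, and the toy page data are all hypothetical and are not taken from the thesis.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def word_relatedness(page_a, page_b, alpha=0.5):
    """Weighted mix of category overlap and link overlap (assumed form)."""
    category_score = jaccard(page_a["categories"], page_b["categories"])
    link_score = jaccard(page_a["links"], page_b["links"])
    return alpha * category_score + (1 - alpha) * link_score


# Toy stand-ins for data extracted from Wikipedia pages (hypothetical).
cat_page = {"categories": {"Mammals", "Pets"}, "links": {"Animal", "Fur", "Mouse"}}
dog_page = {"categories": {"Mammals", "Pets"}, "links": {"Animal", "Fur", "Bone"}}
car_page = {"categories": {"Vehicles"}, "links": {"Engine", "Road"}}

print(word_relatedness(cat_page, dog_page))  # identical categories, half of the links shared
print(word_relatedness(cat_page, car_page))  # nothing shared
```

A linear mix keeps the two signals separable, so the weight can be tuned against a human-annotated word set; the thesis may weight or normalize the components differently.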
On this basis, the thesis puts forward a method for computing relatedness between sentences from three aspects: relatedness of sentence structure, word-pair-based relatedness, and cluster relatedness in which clustering is used to weight topic words. Sentence structure consists of two aspects, word form and word order. Word-form relatedness is reflected by the frequency of word co-occurrence, and word-order relatedness by the number of inversions. Word-pair-based relatedness mainly considers the deep semantic information of the words in a sentence and is more consistent with human subjective understanding. Clustering groups semantically related words or texts into a class or cluster; the thesis applies it to sentence relatedness computation to improve accuracy. With the theoretical method in place, the experimental design was completed. First, the Chinese Wikipedia corpus was downloaded and processed; second, relatedness between words and between sentences was computed; finally, the results were compared with manually annotated sets. A manually translated Word Similarity-353 test set and Words-240, compiled by the National University of Defense Technology, served as word relatedness test sets, and the short-text semantic relatedness evaluation dataset provided by the Chinese web database knowledge extraction contest served as the sentence relatedness test set. Measured by the Spearman coefficient and accuracy, the Spearman coefficient of the proposed word relatedness method is 2.8% higher than that of traditional algorithms, and sentence relatedness accuracy reaches 73.3%. These results support the rationality and practicability of the method.
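The word-form and word-order components described above can be sketched as follows. The abstract names only the ingredients (word co-occurrence and the number of inversions), so the exact formulas here are assumptions: a Dice-style overlap for word form, one minus the normalized inversion count for word order, and a hypothetical mixing weight `beta`. Sentences are assumed to be already segmented into word lists.

```python
def word_form_similarity(s1, s2):
    """Share of words the two sentences have in common (Dice-style, assumed)."""
    common = set(s1) & set(s2)
    total = len(s1) + len(s2)
    return 2 * len(common) / total if total else 0.0


def count_inversions(seq):
    """Number of pairs (i, j) with i < j and seq[i] > seq[j]."""
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def word_order_similarity(s1, s2):
    """1 minus the normalized inversion count of the shared words."""
    common = set(s1) & set(s2)
    # Shared words in the order of their first appearance in s1.
    ordered = [w for i, w in enumerate(s1) if w in common and s1.index(w) == i]
    # Position of each shared word's first occurrence in s2.
    positions = [s2.index(w) for w in ordered]
    n = len(positions)
    if n < 2:
        return float(n)  # 1.0 for a single shared word, 0.0 for none
    max_inversions = n * (n - 1) // 2
    return 1.0 - count_inversions(positions) / max_inversions


def structure_similarity(s1, s2, beta=0.5):
    """Linear mix of word-form and word-order similarity (assumed weighting)."""
    return beta * word_form_similarity(s1, s2) + (1 - beta) * word_order_similarity(s1, s2)


a = ["I", "like", "green", "tea"]
b = ["green", "tea", "I", "like"]
print(word_form_similarity(a, b))   # same words in both sentences
print(word_order_similarity(a, b))  # 4 of 6 possible pairs are inverted
print(structure_similarity(a, b))
```

The quadratic inversion count is adequate for short texts; a merge-sort-based count would be the usual replacement for longer inputs.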
【Degree-granting institution】: Taiyuan University of Technology
【Degree level】: Master's
【Year degree granted】: 2017
【Classification number】: TP391.1