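Research on a Zhuang Word Segmentation Algorithm Based on an Improved Mutual Information Algorithm and the t-Test Difference
Published: 2018-06-22 05:28
Topic: Zhuang word segmentation + improved MI algorithm; Source: Journal of South-Central University for Nationalities (Natural Science Edition), 2017, Issue 04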
[Abstract]: Traditional Zhuang word segmentation treats the spaces between words as delimiters. In most cases this destroys the complete, independent meaning carried by semantic words, that is, units formed by several associated words. Building on earlier work that uses mutual information (MI) to measure the degree of association between adjacent words, this paper applies the improved mutual information algorithm MI^k and the t-test difference to Zhuang text segmentation for the first time. MI^k and the t-test difference have complementary strengths in evaluating the static and the dynamic binding ability of adjacent words, so the paper proposes a hybrid algorithm, TD-MI^k, that combines the two. It then compares the segmentation performance of three methods: MI^k, the t-test difference, and the TD-MI^k hybrid. Experiments use a text collection from the Zhuang-language edition of People's Daily Online as the training and test corpus. The results show that all three methods extract semantic words from text fairly accurately and effectively, and that the TD-MI^k hybrid algorithm achieves the highest segmentation accuracy.
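The abstract names the two association measures but does not state their formulas. The following is a minimal sketch under assumed definitions that are common in the segmentation literature: MI^k(x, y) = log2(p(x, y)^k / (p(x) p(y))), and a t-test difference Δt(x:y) = t_{v,y}(x) - t_{x,w}(y) for a context "v x y w", with variances estimated from raw counts. The function names, the default k = 2, and the placeholder tokens are illustrative assumptions, not taken from the paper. The rule by which TD-MI^k combines the two scores is not given in the abstract, so it is not sketched here.

```python
import math
from collections import Counter

def count_ngrams(sentences):
    """Count words and adjacent word pairs in space-delimited sentences."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        words = s.split()
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi

def mi_k(x, y, uni, bi, k=2):
    """Assumed form: MI^k(x, y) = log2(p(x, y)^k / (p(x) * p(y)))."""
    if bi[(x, y)] == 0:
        return float("-inf")  # pair never observed together
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    p_xy = bi[(x, y)] / n_bi
    p_x, p_y = uni[x] / n_uni, uni[y] / n_uni
    return math.log2(p_xy ** k / (p_x * p_y))

def t_score(x, y, z, uni, bi):
    """Assumed form: t_{x,z}(y) = (p(z|y) - p(y|x)) / sqrt(var_zy + var_yx).

    Positive when y tends to bind to its right neighbour z rather than
    to its left neighbour x.
    """
    if uni[x] == 0 or uni[y] == 0:
        return 0.0  # no evidence for an unseen word
    p_zy = bi[(y, z)] / uni[y]
    p_yx = bi[(x, y)] / uni[x]
    var = bi[(y, z)] / uni[y] ** 2 + bi[(x, y)] / uni[x] ** 2
    return (p_zy - p_yx) / math.sqrt(var) if var > 0 else 0.0

def t_diff(v, x, y, w, uni, bi):
    """Delta-t for the gap between x and y in the context 'v x y w'.

    Larger values suggest x and y should be kept together.
    """
    return t_score(v, x, y, uni, bi) - t_score(x, y, w, uni, bi)

# Placeholder tokens stand in for Zhuang words; this is not real corpus data.
uni, bi = count_ngrams(["a b c d", "a b e", "f a b c"])
print(mi_k("a", "b", uni, bi), mi_k("b", "c", uni, bi))
print(t_diff("f", "a", "b", "c", uni, bi))
```

In this toy corpus "a b" always occurs as a pair, so it scores higher under MI^k than "b c", and its t-test difference in the context "f a b c" is positive. A segmenter built on these scores would merge adjacent words whose score exceeds a threshold tuned on held-out data.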
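[Author affiliations]: School of Computer Science, South-Central University for Nationalities; School of Computer and Information Engineering, Hechi University
[Funding]: Sub-project of the National Key Technology R&D Program of China (2015BAD29B01); Graduate Academic Innovation Fund of South-Central University for Nationalities (2017sycxjj051)
[Classification number]: TP391.1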
Article ID: 2051772
Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/2051772.html