Research on Methods for Constructing a Chinese-Burmese Bilingual Topic Model Fusing Multiple Features

Published: 2018-12-29 18:33

[Abstract]: Chinese-Burmese bilingual parallel corpora are a foundational resource for research on Chinese-Burmese machine translation, cross-lingual retrieval, parallel sentence pair extraction, and bilingual entity extraction. Cross-lingual topic models, a basic tool for multilingual document analysis, measure the relatedness of documents in different languages at the semantic level, which supports both the acquisition of Chinese-Burmese comparable documents and the construction of parallel corpora. Studying how to build a Chinese-Burmese bilingual topic model is therefore important for acquiring Chinese-Burmese comparable documents. Starting from corpus construction, and with the goal of obtaining comparable corpora through a topic model, this thesis studies the construction of bilingual topic models. The main results are as follows.

(1) The construction of a Chinese-Burmese bilingual parallel corpus is described in detail. Chinese-Burmese bilingual text is scarce, and no authoritative public Chinese-Burmese text corpus exists in China or abroad. Building a Chinese-Burmese bilingual topic model requires a certain quantity of bilingual parallel documents as a training set, and the quality of those documents affects the subsequent topic-model research. The thesis describes how the Chinese-Burmese bilingual texts were obtained from web pages, electronic magazines, and the WeChat platform. For web pages, it details the automatic acquisition process using crawler technology; for electronic magazines and the WeChat platform, it explains the manual acquisition process. Finally, the resources are integrated into a Chinese-Burmese bilingual parallel corpus, and the corresponding data storage method is described.

(2) A Chinese-Burmese bilingual topic model that fuses context features is proposed. The model is based on the bilingual LDA topic model and incorporates the context features of the text. The bilingual LDA model exploits the correspondence between parallel texts, namely that parallel texts share the same document-topic distribution matrix, while the fused context features address the fact that the bag-of-words model ignores text structure. In essence, the fused model reduces the negative influence of high-frequency words on the document-topic distribution. Experimental results show that the proposed model yields better document-topic distributions.

(3) A Chinese-Burmese bilingual topic model that fuses semantic expansion is proposed. Building on the topic model with context features, it further incorporates a Chinese-Burmese semantic expansion dictionary. The dictionary is parsed and processed to construct a Chinese-Burmese semantic expansion set. Words are weighted by their context features and a threshold is set; for each word whose weight exceeds the threshold, the corresponding Burmese text is extended using the expansion set. This semantic expansion addresses the problem that a single word can have multiple expressions in Burmese.

The context features and the semantic expansion features are fused into a single bilingual LDA model. Comparative analysis of the experimental results shows that the multi-feature bilingual topic model constructed in this thesis outperforms the baselines it was compared against.
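The threshold-based expansion described in contribution (3) can be sketched roughly as follows. This is an illustration only, not the thesis's implementation: the abstract does not define the context weighting, so an IDF-style score stands in for it here, and the function names, the placeholder tokens (`zh_...`, `my_...`), and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def context_weights(docs):
    """Assumed stand-in for the thesis's context weighting.

    Uses a smoothed IDF-style score, so words that occur in many
    documents (high-frequency words) receive a low weight.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) for w, df in doc_freq.items()}


def expand_burmese(zh_tokens, my_tokens, weights, expansion, threshold):
    """Extend the Burmese side of a parallel document.

    For every Chinese word whose weight exceeds `threshold`, append the
    Burmese variants listed in the expansion set that are not yet present.
    """
    expanded = list(my_tokens)
    for word in zh_tokens:
        if weights.get(word, 0.0) > threshold:
            for variant in expansion.get(word, []):
                if variant not in expanded:
                    expanded.append(variant)
    return expanded


# Hypothetical toy data: placeholder tokens, not real Chinese or Burmese words.
corpus = [
    ["zh_economy", "zh_develop", "zh_of"],
    ["zh_culture", "zh_of"],
    ["zh_economy", "zh_of"],
]
weights = context_weights(corpus)
expansion = {
    "zh_develop": ["my_develop_a", "my_develop_b"],
    "zh_of": ["my_of_a"],
}
result = expand_burmese(
    corpus[0], ["my_economy", "my_develop_a"], weights, expansion, threshold=0.5
)
print(result)  # ['my_economy', 'my_develop_a', 'my_develop_b']
```

In this toy run the word that appears in every document gets weight 0 and is never expanded, while the rarer word crosses the threshold and pulls its second Burmese variant into the document. The extended Burmese token list would then be fed to the bilingual LDA model in place of the original.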
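[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1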



Article ID: 2395219


Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/2395219.html
