Research on Sentence-Level Entity Relation Extraction Incorporating Thai Language Features

Published: 2018-05-15 10:56

  Topics: Thai sentence segmentation + named entity recognition. Source: master's thesis, Kunming University of Science and Technology, 2017


[Abstract]: Entity relation extraction from Thai sentences is an important part of Thai natural language processing, and its performance directly affects higher-level applications such as event extraction, knowledge base construction, and search engines. Thai, however, has complex word formation, makes frequent use of modal particles, and is customarily written without punctuation, which blurs sentence boundaries. All of these properties make Thai text harder to process automatically. Combining Thai language features with statistical machine learning models, this thesis studies Thai sentence segmentation, named entity recognition in Thai sentences, and the extraction of subordinate entity relations from Thai sentences. It reports results in three areas.

(1) In written Thai, sentences are usually separated only by a plain space at the end of the sentence, yet Thai text also contains many spaces that do not end a sentence, so sentence boundaries are ambiguous. The thesis first analyzes and summarizes a set of practical grammar rules related to Thai sentence boundaries. It then recasts sentence segmentation as the classification of spaces in Thai text, training a maximum entropy classifier on the contextual features of each space. Finally, the grammar rule base is used to correct the classifier's output. Compared with a purely rule-based approach, this method avoids writing rules for a large body of complex Thai grammar: rules are built only for the main knowledge relevant to sentence boundary detection, while the maximum entropy model makes better use of the context of spaces in Thai input chunks or paragraphs. As a result, the method performs well and stably on Thai sentence segmentation and lays the groundwork for named entity recognition in Thai sentences.

(2) Named entity recognition in Thai sentences is recast as the task of labeling the word sequence of a sentence. Using the contextual language features of the words in a Thai sentence, a hidden Markov model and a conditional random field model are each built on a Thai entity recognition training corpus, and each resulting sequence labeling model is evaluated on a Thai test corpus. The experimental results confirm that sequence labeling is effective for Thai named entity recognition and lay the groundwork for entity relation extraction from Thai sentences.

(3) Building on named entity recognition, the extraction of subordinate entity relations from Thai sentences is recast as the classification of entity relation triples in a sentence. Because no Thai corpus of subordinate entity relations was available, the thesis first constructs a Thai entity relation corpus from sentence-aligned Chinese-Thai parallel sentence pairs and a Chinese-Thai dictionary. A maximum entropy classifier is then trained on the contextual features surrounding Thai entity words to identify the subordinate relation type of each candidate entity relation triple, which yields the subordinate entity relations in the sentence. Experiments verify that the method is effective for extracting subordinate entity relations from Thai sentences.
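The pipeline in contribution (1) can be illustrated roughly as follows. This is a minimal sketch, not the thesis's implementation: the chunks are placeholder tokens rather than Thai text, the feature set (previous chunk, next chunk, bias) and the training settings are assumptions, and the single correction rule (never break before a chunk named CONJ) is hypothetical. A binary maximum entropy model is equivalent to logistic regression, which is what the sketch trains.

```python
import math

def space_features(chunks, i):
    """Features for the space between chunks[i] and chunks[i + 1] (assumed feature set)."""
    return ["bias", "prev=" + chunks[i], "next=" + chunks[i + 1]]

def train_maxent(samples, epochs=200, lr=0.5):
    """Fit a binary maximum entropy (logistic regression) model by stochastic gradient ascent."""
    weights = {}
    for _ in range(epochs):
        for feats, label in samples:
            score = sum(weights.get(f, 0.0) for f in feats)
            prob = 1.0 / (1.0 + math.exp(-score))
            for f in feats:
                # Gradient of the log-likelihood for a binary feature is (label - prob).
                weights[f] = weights.get(f, 0.0) + lr * (label - prob)
    return weights

def classify_spaces(weights, chunks):
    """Label every space: 1 = sentence boundary, 0 = space inside a sentence."""
    labels = []
    for i in range(len(chunks) - 1):
        score = sum(weights.get(f, 0.0) for f in space_features(chunks, i))
        labels.append(1 if score > 0 else 0)
    return labels

# Hypothetical rule base: chunks that may never follow a sentence boundary.
NO_BREAK_BEFORE = {"CONJ"}

def apply_rules(chunks, labels):
    """Override classifier decisions that violate a rule."""
    return [0 if chunks[i + 1] in NO_BREAK_BEFORE else lab
            for i, lab in enumerate(labels)]

def split_sentences(chunks, labels):
    """Group chunks into sentences according to the space labels."""
    sentences, current = [], [chunks[0]]
    for chunk, lab in zip(chunks[1:], labels):
        if lab == 1:
            sentences.append(current)
            current = [chunk]
        else:
            current.append(chunk)
    sentences.append(current)
    return sentences

# Toy training data: "END" stands in for a chunk that typically closes a sentence.
train_chunks = ["A", "END", "B", "C", "END", "D"]
train_labels = [0, 1, 0, 0, 1]
samples = [(space_features(train_chunks, i), y) for i, y in enumerate(train_labels)]
weights = train_maxent(samples)
```

To segment new text, call classify_spaces, pass the result through apply_rules, and then split_sentences. Applying the rules after classification mirrors the order described in the abstract, where the rule base corrects the classifier's output rather than replacing it.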
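Contribution (2) treats named entity recognition as sequence labeling. The sketch below shows only the hidden Markov model side of that idea, with Viterbi decoding; the conditional random field model is not covered. The tag set (PER, LOC, O), the romanized placeholder words, and the add-one smoothing are all assumptions made for illustration and are not taken from the thesis.

```python
import math
from collections import defaultdict

def train_hmm(tagged_sentences):
    """Collect start, transition and emission counts from tagged sentences."""
    start, trans, emit = defaultdict(int), defaultdict(int), defaultdict(int)
    tag_totals, trans_totals = defaultdict(int), defaultdict(int)
    vocab = set()
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            vocab.add(word)
            emit[(tag, word)] += 1
            tag_totals[tag] += 1
            if prev is None:
                start[tag] += 1
            else:
                trans[(prev, tag)] += 1
                trans_totals[prev] += 1
            prev = tag
    return {"start": start, "trans": trans, "emit": emit,
            "tag_totals": tag_totals, "trans_totals": trans_totals,
            "tags": sorted(tag_totals), "vocab_size": len(vocab),
            "n_sentences": len(tagged_sentences)}

# Add-one smoothed log-probabilities (smoothing scheme is an assumption).
def log_start(m, tag):
    return math.log((m["start"][tag] + 1) / (m["n_sentences"] + len(m["tags"])))

def log_trans(m, prev, tag):
    return math.log((m["trans"][(prev, tag)] + 1) / (m["trans_totals"][prev] + len(m["tags"])))

def log_emit(m, tag, word):
    # The extra +1 in the denominator reserves mass for unknown words.
    return math.log((m["emit"][(tag, word)] + 1) / (m["tag_totals"][tag] + m["vocab_size"] + 1))

def viterbi(m, words):
    """Return the most probable tag sequence for a non-empty list of words."""
    tags = m["tags"]
    best = [{t: log_start(m, t) + log_emit(m, t, words[0]) for t in tags}]
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + log_trans(m, p, t))
            scores[t] = best[-1][prev] + log_trans(m, prev, t) + log_emit(m, t, word)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Toy corpus with placeholder words; real training data would be segmented Thai text.
corpus = [
    [("somchai", "PER"), ("visits", "O"), ("bangkok", "LOC")],
    [("malee", "PER"), ("lives", "O"), ("in", "O"), ("chiangmai", "LOC")],
]
model = train_hmm(corpus)
```

Calling viterbi(model, words) on a list of words returns one tag per word. The entities found this way would then supply the candidate entity pairs for the relation triple classification described in contribution (3).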
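[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1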




Article ID: 1892168


Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/1892168.html

