
Research on Lao Named Entity Recognition

Published: 2018-02-09 12:24

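Keywords: organization name recognition, two-layer model, semi-supervised learning, conditional random field (CRF), disagreement, named entity recognition, support vector machine (SVM), Lao. Source: Kunming University of Science and Technology, 2017 master's thesis. Type: degree thesis.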


[Abstract]: Named entity recognition (NER) has been a fundamental task in natural language processing ever since it was first proposed. For Lao, research on named entities is still weak. As political and economic exchanges between China and Laos grow closer, Lao information processing matters more and more to economic and cultural exchange between the two countries, so research on Lao NER is necessary. This thesis addresses the characteristics specific to Lao named entities and the current scarcity of Lao named entity corpora, and studies recognition methods for Lao place names, person names, and organization names. The main results are as follows.

(1) Disagreement-based Lao named entity recognition. The main obstacle is that Lao named entity corpora are scarce and slow to acquire, and there is very little research on the topic in China or abroad. Corpora obtained only from online resources and from manual annotation by expert teachers and Lao students are far from sufficient. To address this, the thesis proposes a disagreement-based Lao NER algorithm. First, three supervised classifiers are trained on the labeled Lao named entity corpus, using conditional random fields (CRFs). The three classifiers are then each run on the same unlabeled corpus, and a classifier-weighted voting strategy assigns preliminary labels to the unlabeled samples. Next, the preliminarily labeled corpus is verified a second time. Finally, the newly labeled samples are added to the existing Lao corpus.

(2) Lao organization name recognition based on cascaded conditional random fields. The experiments above expanded part of the experimental corpus. Earlier work in the laboratory recognized Lao person names and place names with a single-layer CRF and with a method combining rules and statistics, and obtained good results on small-scale corpora. Lao organization names, however, have not been studied specifically, and because they contain many nested nouns they are hard to recognize with a single-layer model. The thesis therefore proposes an organization name recognition algorithm based on a cascaded CRF model with two layers. The first layer recognizes simple Lao person names, place names, and organization names, and passes its results, together with the observations, to the second-layer CRF. The second layer combines the first-layer results with a corresponding Lao feature template to recognize complex organization names. Experiments show that the method works well for Lao organization names.

(3) Lao organization name recognition with a two-layer model of CRF and SVM. A closer analysis of how Lao organization names are formed shows that most of them contain a boundary feature word, so recognizing these boundary words specifically should raise the recognition rate. The cascaded CRF method above does not handle this well. The thesis therefore proposes another method aimed at the boundary problem, a hybrid of CRF and support vector machine (SVM). The first layer recognizes simple Lao person names, place names, and organization names, and passes its results, combined with the observations, to the second-layer SVM model. The second layer uses a driven approach that recognizes organization names by identifying their boundary features. Finally, a confidence computation corrects the recognition results. Experiments show that targeting organization name boundaries clearly improves recognition accuracy.
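
The weighted voting step in result (1) might look roughly like the sketch below. The abstract does not give the weighting scheme or the second verification rule, so the per-classifier weights and the `threshold` filter here are assumptions, and the function names are hypothetical. The three label sequences stand in for the output of three already-trained taggers.

```python
from collections import defaultdict


def weighted_vote(label_seqs, weights):
    """Combine per-token labels from several taggers by weighted voting.

    label_seqs: one label sequence per tagger, all of equal length.
    weights: one positive weight per tagger (assumed here, e.g. held-out accuracy).
    Returns (voted_labels, support); support[i] is the winning label's
    share of the total weight at token i.
    """
    if len(label_seqs) != len(weights):
        raise ValueError("need exactly one weight per tagger")
    total = sum(weights)
    voted, support = [], []
    for token_labels in zip(*label_seqs):
        scores = defaultdict(float)
        for label, weight in zip(token_labels, weights):
            scores[label] += weight
        best = max(scores, key=scores.get)
        voted.append(best)
        support.append(scores[best] / total)
    return voted, support


def keep_sentence(support, threshold=0.6):
    """Assumed stand-in for the second verification pass: keep a sentence
    only if every token's winning label has enough weighted support."""
    return min(support) >= threshold


# Hypothetical predictions of three taggers for one four-token sentence.
tagger_a = ["B-ORG", "I-ORG", "O", "B-LOC"]
tagger_b = ["B-ORG", "I-ORG", "O", "O"]
tagger_c = ["B-ORG", "O", "O", "B-LOC"]
voted, support = weighted_vote([tagger_a, tagger_b, tagger_c], [0.5, 0.3, 0.2])
accepted = keep_sentence(support)
```

Sentences that pass the filter would be added to the labeled corpus with their voted labels. Sentences that fail would stay unlabeled.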

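The hand-off between the two layers in results (2) and (3) can be pictured as feature construction. The second layer sees each token (the observation) together with the label the first layer assigned to it and to its neighbours. The sketch below is only an illustration. The feature names, the window of one token, and the English placeholder tokens are assumptions, and the thesis's actual feature template is not given in the abstract.

```python
def layer2_features(tokens, layer1_labels, i):
    """Feature dict for token i as seen by the second-layer model.

    Combines the observation (the token itself) with first-layer labels
    in a window of one token on each side (window size is an assumption).
    """
    feats = {"bias": 1.0, "word": tokens[i], "l1": layer1_labels[i]}
    if i > 0:
        feats["word[-1]"] = tokens[i - 1]
        feats["l1[-1]"] = layer1_labels[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["word[+1]"] = tokens[i + 1]
        feats["l1[+1]"] = layer1_labels[i + 1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_to_layer2(tokens, layer1_labels):
    """Build second-layer features for every token of one sentence."""
    if len(tokens) != len(layer1_labels):
        raise ValueError("tokens and first-layer labels must align")
    return [layer2_features(tokens, layer1_labels, i) for i in range(len(tokens))]


# Placeholder tokens (English stand-ins, not Lao): a place name nested
# inside an organization name, which the first layer tags only as a place.
tokens = ["National", "University", "of", "Laos"]
layer1 = ["O", "O", "O", "B-LOC"]
features = sentence_to_layer2(tokens, layer1)
```

A second-layer CRF or SVM trained on such features can learn that a place label at the end of a noun sequence often closes a longer organization name. A single-layer model has no access to that signal.
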
Article ID: 1497922


Article link: http://www.sikaile.net/jingjilunwen/jiliangjingjilunwen/1497922.html

