Research and Implementation of a Hadoop-Based Chinese Word Collocation Extraction System

Published: 2018-09-01 05:42

[Abstract]: A collocation is a recurrent combination of words that follows a certain syntactic structure yet is arbitrary and cannot be derived by analogy. Collocation extraction is the automatic extraction of collocations from a corpus by means of a computer program. With the rapid development of computer technology, automatic collocation extraction has become an increasingly important natural language processing task. On the one hand, collocation extraction plays an important role in many natural language processing applications, such as machine translation, word sense disambiguation, language generation and information retrieval, and collocations are also a valuable aid in language teaching and second language acquisition. On the other hand, Internet data and large-scale corpora have become important knowledge sources for collocation research in computational linguistics, and the explosive growth of Internet data and the continuing expansion of corpora make effective methods for automatic collocation extraction particularly important.

Starting from the trigram data of Google's n-gram corpus, this thesis aims to automatically extract typical collocations of Chinese content words. Building mainly on the key technologies of the Hadoop distributed computing platform, and combining Chinese linguistic knowledge with statistical methods, it studies a distributed word collocation retrieval system based on Java Web and Hadoop, which gives users a new, intelligent and convenient way to obtain collocation information.

The research content is as follows. First, existing statistical collocation extraction methods and the key technologies of the Hadoop distributed platform are described, their advantages and disadvantages are compared, and the evaluation metrics for collocation extraction are introduced: precision, recall and F value. Second, drawing on Chinese linguistic knowledge and the content of the corpus, the part-of-speech composition rules that hold between collocated words are analyzed, the typical collocation types of Chinese content words are selected, and a part-of-speech description of Chinese content-word collocations is given. Finally, the experimental part gives a concrete implementation for extracting typical Chinese content-word collocations from the n-gram corpus.

The main results are as follows:

(1) Statistical collocation extraction methods and Hadoop distributed platform technologies are combined with the part-of-speech composition rules of Chinese collocations to give a concrete implementation of automatic collocation extraction. Under the MapReduce model, sparse data and non-Chinese data are removed, and the NLPIR Chinese word segmentation system is called for word segmentation and part-of-speech tagging, which completes corpus preprocessing. A span is then chosen to extract the candidate collocation set, the part-of-speech composition rules are used to filter content-word collocations, and statistics are computed with three statistical methods: co-occurrence frequency, mutual information and the chi-square test. The intermediate and final results of extraction are stored in the HBase distributed database, and a user dictionary of Chinese word collocations is built.

(2) A front end for the Hadoop-based Chinese word collocation extraction system is developed so that users can obtain collocation information effectively. The front-end pages are designed with the Bootstrap framework and implement condition setting for word retrieval and display of results.

(3) A method for extracting typical collocations centered on content words is summarized. This method, which combines big data technology, linguistic knowledge and statistical methods, is applied in collocation extraction experiments on four classes of content words: nouns, verbs, adjectives and adverbs. Quantitative comparison shows that extraction based on co-occurrence frequency gives the best results. For noun collocations, precision is 86%, recall is 59.72% and the F value is 70.49%. For verb collocations, precision is 80%, recall is 65.57% and the F value is 72.07%. For adjective collocations, precision is 82%, recall is 78.85% and the F value is 80.39%. For adverb collocations, precision is 88%, recall is 43.56% and the F value is 58.28%. Precision for adjective and noun collocations is 2% to 4% higher than that of existing collocation extraction software, which indicates that the method has some value for the automatic extraction of Chinese collocations.
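The abstract names mutual information, the chi-square test and the F value but does not give the formulas used in the thesis. The following is a minimal sketch that assumes the standard textbook definitions (pointwise mutual information in base 2, Pearson's chi-square on a 2x2 contingency table, and the balanced F value). The function names and the toy counts are hypothetical and are not taken from the thesis; only the precision, recall and F figures in `REPORTED` come from the abstract.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information (base 2) of a word pair.

    n_xy: co-occurrence count of the pair, n_x / n_y: counts of each word,
    n_total: total number of observed pairs. Textbook definition (assumed).
    """
    return math.log2((n_xy * n_total) / (n_x * n_y))

def chi_square(n_xy, n_x, n_y, n_total):
    """Pearson chi-square statistic for the pair's 2x2 contingency table."""
    a = n_xy                  # x together with y
    b = n_x - n_xy            # x without y
    c = n_y - n_xy            # y without x
    d = n_total - a - b - c   # neither x nor y
    numerator = n_total * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

def f_value(precision, recall):
    """Balanced F value: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision %, recall %, F %) as reported in the abstract.
REPORTED = {
    "noun": (86.0, 59.72, 70.49),
    "verb": (80.0, 65.57, 72.07),
    "adjective": (82.0, 78.85, 80.39),
    "adverb": (88.0, 43.56, 58.28),
}

if __name__ == "__main__":
    # Hypothetical toy counts, for illustration only.
    print("PMI:", round(pmi(30, 100, 60, 10_000), 3))
    print("chi-square:", round(chi_square(30, 100, 60, 10_000), 1))
    for pos, (p, r, f) in REPORTED.items():
        print(pos, "recomputed F:", round(f_value(p, r), 2), "reported F:", f)
```

Under this definition the recomputed F values agree with the reported ones to within rounding, which suggests the thesis uses the balanced F value; co-occurrence frequency, the third statistic, is simply the raw count `n_xy`.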
[Degree-granting institution]: Yangtze University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1


Article ID: 2216281


Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/2216281.html
