A New Word Extraction Method Based on Domain Specificity and Statistical Language Knowledge

Published: 2018-10-31 07:49

[Abstract]: In recent years, rapid economic and social development has brought a large number of new words into everyday use. In natural language processing, many research directions depend on the automatic extraction of new words. As a basic technology of language information processing, new word extraction has considerable research value and practical application prospects. This thesis proposes a new word extraction method. The main work is as follows. 1. A new word extraction method based on domain specificity and statistical language knowledge is proposed. After observing and analyzing the characteristics of the corpus, a garbage-string filter based on domain specificity removes garbage strings and yields a list of candidate new words; new words are then extracted from this list using statistical language knowledge (word frequency and internal cohesion). Experiments verify the effectiveness of the method. 2. The extraction method is optimized in two respects: internal cohesion is measured with EMI instead of PMI, and external context features are introduced, with left entropy and right entropy measuring how freely a word combines with its neighbors. The method is evaluated and compared in several ways, including different combinations of statistical features and different parameter settings. The experimental results show that, compared with the unoptimized method, extraction quality improves substantially: precision increases by up to 39% and recall by up to 63%. 3. The method is validated in application. When the extracted new words are added to a word segmentation system, segmentation performance improves by 10% on a corpus containing new words. The method can also be applied to building English domain dictionaries. Experiments verify the extensibility and language independence of the method. The method is unsupervised: it requires neither a training corpus nor hand-defined rules, which overcomes the shortcomings of traditional methods. Because it is extensible and language independent, it can extract large numbers of new words and domain terms.
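The abstract names three statistical signals (word frequency, internal cohesion, and left/right entropy) but gives no formulas. The following is a minimal sketch of how such signals are commonly computed over character n-grams, not the thesis's implementation. It uses plain PMI for cohesion, because the abstract does not define EMI, and it omits the domain-specific garbage-string filter. All function names, parameters, and thresholds are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngram_stats(sentences, max_len=4):
    """Count character n-grams and record the characters adjacent to each one."""
    counts = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for s in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(s) - n + 1):
                gram = s[i:i + n]
                counts[gram] += 1
                if i > 0:
                    left[gram][s[i - 1]] += 1
                if i + n < len(s):
                    right[gram][s[i + n]] += 1
    return counts, left, right


def entropy(neighbor_counts):
    """Shannon entropy (in bits) of a neighbor-character distribution."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total)
                for c in neighbor_counts.values())


def min_pmi(gram, counts, total):
    """Lowest PMI over every two-way split of gram (assumed cohesion measure)."""
    p = counts[gram] / total
    best = math.inf
    for k in range(1, len(gram)):
        p_left = counts[gram[:k]] / total
        p_right = counts[gram[k:]] / total
        best = min(best, math.log2(p / (p_left * p_right)))
    return best


def extract_candidates(sentences, min_freq=2, min_pmi_score=1.0,
                       min_entropy=0.5, max_len=4):
    """Keep n-grams that are frequent, cohesive, and free on both sides."""
    counts, left, right = ngram_stats(sentences, max_len)
    total = sum(len(s) for s in sentences)  # character count as normalizer
    results = []
    for gram, freq in counts.items():
        if len(gram) < 2 or freq < min_freq:
            continue
        pmi = min_pmi(gram, counts, total)
        freedom = min(entropy(left[gram]), entropy(right[gram]))
        if pmi >= min_pmi_score and freedom >= min_entropy:
            results.append((gram, freq, pmi, freedom))
    return sorted(results, key=lambda r: (-r[1], r[0]))
```

A candidate is kept only when it passes all three thresholds. Taking the minimum of the left and right entropy means a string that always appears after the same character, or always before the same character, is rejected as a likely fragment of a longer word.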
[Degree-granting institution]: Beijing Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1

[Thesis record]: 梅莉莉; A New Word Extraction Method Based on Domain Specificity and Statistical Language Knowledge [D]; Beijing Institute of Technology; 2016

[Article link]: http://www.sikaile.net/jingjilunwen/jiliangjingjilunwen/2301434.html
