Automatic Word Segmentation of Middle Ancient Chinese Based on CRFs and Dictionary Information

Published: 2018-02-26 06:02

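Keywords: CRFs model; word segmentation consistency; Middle Ancient Chinese; automatic word segmentation　Source: Data Analysis and Knowledge Discovery (《數據分析與知識發現》), 2017, No. 05　Type: journal article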

[Abstract]: [Objective] To verify how word segmentation consistency and corpus category affect the efficiency of CRFs-based word segmentation for Middle Ancient Chinese, and on that basis to further improve segmentation efficiency and reduce the workload of manual proofreading. [Methods] Taking histories, Buddhist scriptures, and fiction from the Middle Ancient period as example corpora, the study optimizes the segmentation principles and combines a CRFs model with a dictionary to eliminate the inconsistencies that tend to appear in manually segmented Middle Ancient Chinese text. It also introduces two features into CRFs segmentation, character classification and dictionary information, and selects the most suitable feature template for each through comparative experiments. [Results] The overall F value of the segmentation results exceeds 99% in the closed test and reaches 89%-95% in the comprehensive open test. [Limitations] The study of segmentation inconsistency mainly targets two-character words, so recognition of words with three or more characters (multi-character words) is slightly weaker. [Conclusions] Provided that segmentation consistency is effectively improved, the character classification and dictionary tagging features effectively improve the accuracy of CRFs segmentation for Middle Ancient Chinese. The proposed segmentation system can also serve multiple categories of Chinese corpora from the Middle Ancient period.
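The abstract does not give the feature templates or the character classes, so the following is only a minimal sketch of how character-classification and dictionary features might be attached to each character before CRF training. The tag set (B/M/E/S), the three character classes, the forward-maximum-matching dictionary marks, the sample sentence, and all function names are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Set


def to_bmes(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character B/M/E/S labels."""
    tags: List[str] = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def char_class(ch: str) -> str:
    """Hypothetical coarse character classes (the paper's classes are not given)."""
    if ch.isdigit() or ch in "一二三四五六七八九十百千萬":
        return "NUM"
    if ch in "，。；：？！、":
        return "PUNCT"
    return "HAN"


def dict_marks(sent: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Mark characters covered by a lexicon word, using forward maximum matching.

    Characters inside a matched word get B/M/E; all others get O.
    """
    marks = ["O"] * len(sent)
    i = 0
    while i < len(sent):
        # Try the longest candidate first; only words of length >= 2 are marked.
        for n in range(min(max_len, len(sent) - i), 1, -1):
            if sent[i:i + n] in lexicon:
                marks[i] = "B"
                for j in range(i + 1, i + n - 1):
                    marks[j] = "M"
                marks[i + n - 1] = "E"
                i += n
                break
        else:
            i += 1
    return marks


def sent_features(sent: str, lexicon: Set[str]) -> List[Dict[str, str]]:
    """Build one feature dictionary per character."""
    marks = dict_marks(sent, lexicon)
    feats: List[Dict[str, str]] = []
    for i, ch in enumerate(sent):
        f = {"char": ch, "class": char_class(ch), "dict": marks[i]}
        if i > 0:
            f["prev_char"] = sent[i - 1]
        if i < len(sent) - 1:
            f["next_char"] = sent[i + 1]
        feats.append(f)
    return feats


# Illustrative sentence and lexicon (not taken from the paper's corpus).
lexicon = {"世尊", "比丘"}
labels = to_bmes(["世尊", "告", "諸", "比丘"])
features = sent_features("世尊告諸比丘", lexicon)
```

The per-character feature dictionaries and label sequences produced this way could then be passed to a CRF toolkit that accepts that input shape. Comparing templates, as the abstract describes, would amount to varying which keys and context windows are included.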
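[Affiliation]: School of Chinese Language and Literature, Nanjing Normal University
[Funding]: One of the research outcomes of the National Social Science Fund of China major project "Construction of a Corpus for Research on the History of Chinese" (Grant No. 10&ZD117) and the National Social Science Fund of China major project "Construction of a Classical Text Knowledge Base and Humanities Computing Research Based on the Sinological Index Series" (Grant No. 15ZDB127); Ministry of Education Humanities and Social Sciences Youth Project "Construction and Quantitative Study of a Diachronic Chinese Vocabulary Database" (Grant No. 16YJC740034)
[Classification No.]: TP391.1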


Article ID: 1536773


Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/1536773.html
