Automatic Word Segmentation of Middle Ancient Chinese Based on CRFs and Dictionary Information
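Published: 2018-02-26 06:02
Keywords: CRFs model; word segmentation consistency; Middle Ancient Chinese; automatic word segmentation. Source: 《數據分析與知識發現》 (Data Analysis and Knowledge Discovery), 2017, Issue 05. Paper type: journal article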
[Abstract]: [Objective] To verify how word segmentation consistency and corpus category affect the efficiency of CRFs-based word segmentation for Middle Ancient Chinese, and on that basis to further improve segmentation efficiency and reduce the workload of manual proofreading. [Methods] Taking historical texts, Buddhist scriptures, and novels of the Middle Ancient period as example corpora, the study addresses automatic word segmentation of Middle Ancient Chinese by optimizing the segmentation principles and combining a CRFs model with a dictionary, in order to eliminate the inconsistencies that tend to appear in manually segmented Middle Ancient Chinese text. It also introduces two kinds of features into CRFs segmentation, character classification and dictionary information, and uses comparative experiments to select the most suitable feature template for each. [Results] The overall F-score of the segmentation results exceeds 99% in the closed test and reaches 89%-95% in the combined open tests. [Limitations] The study of segmentation inconsistency focuses mainly on two-character words, so recognition of words of three or more characters (multi-character words) is somewhat weaker. [Conclusions] Provided that segmentation consistency is effectively improved, character classification and dictionary tagging features can effectively improve the accuracy of CRFs-based segmentation of Middle Ancient Chinese. The segmentation system proposed in the paper can also serve multiple categories of Chinese corpora from the Middle Ancient period.
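The feature design described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the B/M/E/S tag set, the coarse classes in `char_class`, the forward-maximum-matching dictionary tagger, the sample lexicon, and the example sentence are all hypothetical. The feature dictionaries it builds are the kind of input a CRF toolkit would consume; no CRF library is called here.

```python
from typing import Dict, List, Set


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character B/M/E/S labels."""
    tags: List[str] = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def char_class(ch: str) -> str:
    """Assumed coarse character classes; the paper's actual classes are not given."""
    if ch.isdigit() or ch in "一二三四五六七八九十百千萬":
        return "NUM"
    if ch in ",。;:?!、":
        return "PUNCT"
    return "HAN"


def dict_tags(chars: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: mark each character's position in a lexicon word."""
    tags = ["O"] * len(chars)
    i = 0
    while i < len(chars):
        # Try the longest candidate first; single characters are never matched.
        for n in range(min(max_len, len(chars) - i), 1, -1):
            if chars[i:i + n] in lexicon:
                tags[i:i + n] = ["B"] + ["M"] * (n - 2) + ["E"]
                i += n
                break
        else:
            i += 1
    return tags


def featurize(chars: str, lexicon: Set[str]) -> List[Dict[str, str]]:
    """Build one feature dict per character: context, class, dictionary tag."""
    d = dict_tags(chars, lexicon)
    feats = []
    for i, ch in enumerate(chars):
        feats.append({
            "char": ch,
            "prev": chars[i - 1] if i > 0 else "<BOS>",
            "next": chars[i + 1] if i + 1 < len(chars) else "<EOS>",
            "class": char_class(ch),
            "dict": d[i],
        })
    return feats


def f_score(gold: List[str], pred: List[str]) -> float:
    """Word-level F-score: a predicted word counts only if its span matches gold."""
    def spans(words: List[str]) -> Set[tuple]:
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))
            start += len(w)
        return out

    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(p), correct / len(g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example data, for illustration only.
SAMPLE_WORDS = ["王子猷", "居", "山陰"]
SAMPLE_LEXICON = {"王子猷", "山陰"}
sample_features = featurize("".join(SAMPLE_WORDS), SAMPLE_LEXICON)
sample_labels = words_to_tags(SAMPLE_WORDS)
```

Pairing `sample_features` with `sample_labels` gives one training sequence. Comparing F-scores with and without the `class` and `dict` keys is one way to mirror the kind of comparative experiment the abstract mentions.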
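[Author affiliation]: School of Chinese Language and Literature (文學院), Nanjing Normal University
[Funding]: One of the research outcomes of the National Social Science Fund of China Major Project "Research on the Construction of a Corpus for the Study of the History of Chinese" (No. 10&ZD117) and the National Social Science Fund of China Major Project "Construction of a Classics Knowledge Base and Humanities Computing Research Based on 《漢學引得叢刊》" (No. 15ZDB127); Ministry of Education Humanities and Social Sciences Youth Project "Construction and Quantitative Study of a Diachronic Chinese Lexical Database" (No. 16YJC740034)
[Classification number]: TP391.1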
Article ID: 1536773
Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/1536773.html