Chinese Named Entity Recognition for the Medical Domain
Published: 2018-10-29 12:00
[Abstract]: With the explosive growth of text data in recent years and the construction and spread of large-scale knowledge bases, named entity recognition (NER) has become a major research topic in natural language processing. Traditional supervised methods, however, require large annotated corpora, and in the medical domain, where annotated data is scarce, they do not achieve satisfactory results. With the rapid development and adoption of deep learning, recurrent neural networks (RNN, Recurrent Neural Network), and in particular long short-term memory units (LSTM, Long Short-Term Memory), have been widely applied in natural language processing and have achieved results significantly better than traditional methods in many research directions. We therefore first apply an LSTM model to NER in the medical domain and show that, both in research evaluation and in practical application, it outperforms the traditional conditional random field model (CRF, Conditional Random Fields). Because well-annotated corpora in the medical domain are relatively scarce, we build on the LSTM model's advantage over the CRF model and have it incorporate external information, learning both linguistic features from the news domain and unsupervised semantic information from the medical domain. Drawing on transfer learning and pre-training in deep learning, we perform parameter fusion and model tuning on the medical-domain model, which further improves its performance.
Finally, because of the LSTM model's shortcomings in practical application, we explore another approach to domain-adaptive NER. To identify the differences between knowledge domains, we run several groups of comparative experiments that mix corpora from different domains. We then use a gradient boosting decision tree (GBDT) model to integrate these domain differences with unsupervised medical-domain semantic vectors for NER, and obtain good results.
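The abstract treats NER as a sequence labeling task handled by LSTM or CRF models, but it does not state which tagging scheme is used. As an illustration only, the sketch below assumes the common character-level BIO scheme and shows how entity spans could be read back from a predicted tag sequence. The function name `extract_entities` and the label `DIS` (disease) are hypothetical and do not come from the thesis.

```python
from typing import List, Tuple


def extract_entities(chars: List[str], tags: List[str]) -> List[Tuple[str, str, int, int]]:
    """Collect (text, type, start, end) spans from BIO tags; end is exclusive.

    Assumption: tags follow the BIO scheme ("B-X", "I-X", "O"), one per character.
    """
    entities = []
    start, etype = None, None
    # A trailing "O" sentinel flushes an entity that runs to the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            # The current entity ends just before position i.
            entities.append(("".join(chars[start:i]), etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


# Hypothetical example sentence: "The patient has a history of diabetes."
sentence = list("患者有糖尿病史")
predicted = ["O", "O", "O", "B-DIS", "I-DIS", "I-DIS", "O"]
spans = extract_entities(sentence, predicted)
```

Either an LSTM tagger or a CRF would produce the `predicted` sequence in this setup; the decoding step is the same for both, which is one reason the two models can be compared on identical entity-level metrics.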
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1; TP18
Article ID: 2297637
Article link: http://www.sikaile.net/kejilunwen/zidonghuakongzhilunwen/2297637.html