AMR-Based Semantic Annotation and Statistical Analysis of Chinese Sentences
Published: 2018-06-06 07:22
Topics: sentence semantics + semantic annotation; Source: Nanjing Normal University, 2017 master's thesis
[Abstract]: Semantic analysis has long been a hard problem in natural language processing. In today's era of big data, research on machine-learning-based part-of-speech tagging and syntactic parsing has steadily matured, and progress in fields such as machine translation and artificial intelligence increasingly depends on deep semantic analysis of sentences. AMR (Abstract Meaning Representation) is a sentence-level semantic representation whose result is a single-rooted directed acyclic graph. AMR represents the concepts in a sentence and the relations between those concepts; in abstracting from words to concepts and relations, concepts may be added and words in the sentence may be dropped, as the meaning of the sentence requires. Compared with other semantic representations, AMR can therefore capture the rich semantic information in a sentence more completely. However, AMR research has so far focused mainly on English, and its scheme is not suited to representing the semantics of Chinese sentences. For these reasons, this thesis takes the semantic representation of Chinese sentences as its research goal. After reviewing in detail the development of AMR, its scheme, and automatic AMR parsing, it builds on the AMR scheme to establish an abstract meaning representation method suited to Chinese sentences (Chinese AMR, CAMR).

The CAMR annotation scheme established in this study has two main parts: the inheritance and development of AMR, and the CAMR annotation specification. The specification defines a detailed label set and gives careful definitions for both common and special linguistic phenomena in Chinese. The label set is divided into concepts and relations. The concept part specifies how to handle anaphora, mood, the various interrogative pronouns, quantity types, proper nouns and so on, and also adds concepts that represent complex sentences. The relation part comprises 5 core semantic relations and 42 non-core semantic relations. Every rule in the specification is accompanied by a concrete Chinese example.

Building on the specification, the thesis carried out its second task, corpus annotation, in two stages. The first stage annotated the Chinese edition of "The Little Prince"; during annotation, the label set was repeatedly discussed and revised according to the actual needs of analysing the corpus, and the CAMR specification was refined accordingly. In the second stage, after a careful comparison of several corpora, the Chinese Penn Treebank (CTB) was chosen as the annotation target. In total, 1,562 sentences from "The Little Prince" and 5,000 sentences from CTB were annotated.

After annotation was complete, the thesis carried out statistical analyses targeting several characteristics of CAMR. First, given that a CAMR analysis is a single-rooted directed acyclic graph, the corpus statistics show that 39.96% of the sentences are graph structures, which strongly supports the need for graph structures to represent the semantics of Chinese sentences. Second, given that CAMR can add concepts and drop words, the statistics show that 95.2% of the sentences in the corpus involve added concepts and 96.94% involve dropped words when represented in CAMR. This shows that abstraction by adding concepts and dropping words is necessary when representing sentence meaning, and further supports the view that it is reasonable and necessary for CAMR to inherit the abstract approach of AMR. Finally, because predicates have always been a focus of syntactic and semantic research, and CAMR distinguishes predicate senses by their argument structures, the study counted how arguments are used with each predicate sense in the corpus. The result is an argument dictionary of predicate senses, which is available to other linguistics researchers.
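The "graph structure" statistic in the abstract can be illustrated with a small sketch. It assumes that a sentence counts as a graph rather than a tree when at least one concept has more than one incoming relation (a re-entrancy); the abstract does not spell out the criterion, so this reading is an assumption. The edge-triple format, the variable names, and the English toy sentence are all hypothetical and are not taken from the CAMR corpus.

```python
from collections import defaultdict

# Hypothetical toy analysis of "The boy wants to go".
# Each edge is (source variable, relation, target variable).
EDGES = [
    ("w", ":arg0", "b"),  # want-01 -> boy
    ("w", ":arg1", "g"),  # want-01 -> go-01
    ("g", ":arg0", "b"),  # go-01 -> boy: "boy" now has two parents
]


def all_nodes(edges):
    """Collect every variable that appears as a source or a target."""
    return {n for s, _, t in edges for n in (s, t)}


def roots(edges):
    """Nodes with no incoming edge; a well-formed analysis has exactly one."""
    targets = {t for _, _, t in edges}
    return sorted(all_nodes(edges) - targets)


def is_acyclic(edges):
    """Depth-first search that fails when it re-enters a node still in progress."""
    children = defaultdict(list)
    for s, _, t in edges:
        children[s].append(t)
    state = {}  # node -> 1 while being visited, 2 when finished

    def visit(node):
        if state.get(node) == 1:
            return False  # back edge, so there is a cycle
        if state.get(node) == 2:
            return True
        state[node] = 1
        ok = all(visit(child) for child in children[node])
        state[node] = 2
        return ok

    return all(visit(node) for node in all_nodes(edges))


def reentrant_nodes(edges):
    """Nodes with more than one incoming edge (assumed 'graph' criterion)."""
    indegree = defaultdict(int)
    for _, _, t in edges:
        indegree[t] += 1
    return sorted(n for n, d in indegree.items() if d > 1)


def graph_share(corpus):
    """Fraction of analyses in a list that contain at least one re-entrancy."""
    return sum(bool(reentrant_nodes(e)) for e in corpus) / len(corpus)
```

Applied to a list of annotated sentences, `graph_share` would yield the kind of proportion the abstract reports; whether the thesis computed its 39.96% figure this way is not stated in the source.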
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: H146.3
Article number: 1985744
Article link: http://www.sikaile.net/wenyilunwen/yuyanyishu/1985744.html