Research on a Short-Text Feature Expansion Method Based on Word Embedding

Published: 2018-05-20 13:11

Keywords: Word Embedding; Source: Master's thesis, Jilin University, 2017


【Abstract】: With the development of the Internet and the spread of mobile devices, interpersonal communication has become more immediate and convenient. Social media such as SMS, QQ, and Weibo have become an indispensable part of daily life, and the information exchanged on them has grown shorter and freer in form. The number of short texts on the network is growing rapidly, which poses new challenges to traditional automatic information processing and text mining techniques designed for long documents. How to overcome the sparse features and low feature coverage inherent in short texts has therefore become a focus of research, and the most direct and effective remedy is to expand a short text's features. As deep learning has matured, it has been applied widely across fields, and combining it with natural language processing has become an inevitable research trend; Word Embedding is one of the important products of that process. A word embedding is a vector representation of a word. Unlike traditional representations, in which words are mutually independent symbols, it distributes words in a relatively low-dimensional vector space according to the strength of their semantic associations, encoding both the explicit and the implicit regularities of a language. Word vectors are thus no longer mere identifiers; they also carry substantial semantic information. This thesis takes word embeddings as the basis for expanding short-text features and proposes a new feature expansion method that enriches the semantic information of short texts while increasing feature coverage. The specific research contents are as follows:

1. Training word embeddings on a large-scale corpus. Word embeddings are trained with language models built on neural network structures; tracing their development and differing requirements, this thesis introduces four common models: the neural network language model, the recurrent neural network model, CBOW, and Skip_gram. Drawing on other researchers' analyses of these models and on the task requirements of this thesis, the Skip_gram model is selected for training. The English Wikipedia database, rich in content and large in volume, serves as the training data, yielding word embedding representations for more than two million words.
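Purely as an illustration (the thesis itself gives no code), the following is a minimal sketch of Skip-gram training using the gensim library; the corpus file name, vector dimensionality, and all other hyperparameters are assumptions rather than the thesis's settings.

```python
# Minimal Skip-gram training sketch with gensim (assumed library choice;
# the file name and hyperparameters are illustrative, not the thesis's).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams a pre-tokenized corpus: one sentence per line,
# tokens separated by whitespace (e.g. a cleaned Wikipedia dump).
sentences = LineSentence("wiki_en_tokenized.txt")  # hypothetical path

model = Word2Vec(
    sentences,
    vector_size=200,  # embedding dimensionality (assumed)
    sg=1,             # sg=1 selects the Skip-gram architecture
    window=5,         # context window size
    min_count=5,      # drop very rare words
    workers=4,
)
model.wv.save("wiki_en.wordvectors")  # keep only the word vectors
print(model.wv["king"][:5])           # first components of one vector
```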
2. Using the properties of word embeddings to perform simple inference within the scope of a short text through vector computation. Some of the linguistic regularities encoded by word embeddings can be expressed as addition and subtraction between embedding vectors. This thesis applies that property to the ordered word sequence of a short text to obtain vector expressions related to the short text's semantics; the inference vectors produced by these operations lie in the same vector space as the word embeddings themselves.
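The thesis does not spell out the exact arithmetic, so the sketch below only demonstrates the general additive property it relies on; the classic king - man + woman ≈ queen analogy and the plain summation over a text's words are illustrative assumptions.

```python
# Sketch of embedding arithmetic (illustrative; the thesis's exact
# inference over the word sequence is not reproduced here).
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load("wiki_en.wordvectors")  # vectors trained above

# Linguistic regularities as vector offsets: king - man + woman ~ queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# One simple way to build an "inference vector" for a short text is to
# combine the embeddings of its words; plain summation is an assumed
# example. The result lies in the same space as the word embeddings.
tokens = ["short", "text", "feature", "expansion"]
inference_vec = np.sum([wv[t] for t in tokens if t in wv], axis=0)
print(inference_vec.shape)  # same dimensionality as a single embedding
```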
The "semantic unit" is used as the feature item of extended feature, and the vector representation of the same dimension (including the Word Embedding vector corresponding to the short text and the inference vector of Word Embedding introduced earlier) can be mapped to the extended feature space. Finally, short text classification and short text clustering experiments based on Word Embedding are carried out. On Google search segment and China Daily news summary, the classification accuracy is 3.740% higher than that based on LDA, and the clustering F value is 30.64% 17.54% higher than that of traditional clustering method. The experimental results show that this method can better express the information of short text and improve the problems of sparse feature and low coverage of short text.
【Degree-granting institution】: Jilin University
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP391.1

【Similar Literature】

Related master's theses (1):

1. Meng Xin. Research on a Short-Text Feature Expansion Method Based on Word Embedding [D]. Jilin University, 2017.

Document ID: 1914747

Link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/1914747.html

