
Research on Time-Series Events in Personal Weibo

Published: 2019-03-15 14:44
[Abstract]: Weibo, as an emerging social media service, permeates and influences people's lives in many ways and has become an important platform for sharing information and exchanging feelings. Most personal Weibo posts record the author's life experiences, professional interests, and discussions of trending topics, so Weibo data has become a carrier of personal history and emotion. Because posting is real-time and convenient, sometimes down to the second, personal Weibo has gradually replaced the diary and turned into an hour-by-hour or minute-by-minute record. Over a long period the volume of posts becomes very large, and the only way to learn about a blogger is to browse their past posts one by one, which wastes time. How to understand a blogger's activity quickly and accurately has become a problem in urgent need of a solution, and Weibo categorization is proposed to address it. In the categorization process, the precision of the Weibo similarity measure determines the accuracy of the result, so this thesis focuses on how to improve the precision of Weibo similarity.

Personal Weibo data is large in total volume, while individual posts are short and their content is highly casual, so traditional classification methods and information extraction algorithms have certain limitations when applied to it. Considering that a single post is short, contains few effective features, and is colloquial, this thesis expands the feature words of the text with similar words in order to reduce feature loss as far as possible, and proposes a combined similarity algorithm based on improved Jaccard similarity and cosine similarity. First, the collected Weibo data is filtered to remove texts that carry no information, along with irrelevant links, images, and the like, and ICTCLAS, the Chinese lexical analysis system from the Chinese Academy of Sciences, is used to segment the text, tag parts of speech, and filter out stop words and emoticon words. Second, to improve the precision of Weibo similarity, an improved TF-IDF algorithm is used to extract Weibo feature words and an LDA (Latent Dirichlet Allocation) topic model is used to construct similar-word templates. Specifically, the feature selection evaluation function CHI is first used to measure the importance of each feature word to each category, feature words are made to conform to a uniform distribution within the texts of that category, and only then is the TF-IDF value computed to extract the feature words. Then, on the basis of the extracted feature words and the constructed similar-word templates, Jaccard similarity and cosine similarity are combined to compute the combined similarity of personal Weibo posts. This algorithm overcomes the shortcomings of traditional methods that rely only on word co-occurrence, and it can compute the similarity of two posts more deeply and more comprehensively by drawing on both similar-word features and individual numerical features. Finally, a K-Means time-series event categorization algorithm is applied to the personal Weibo data so that posts on the same topic fall into the same set.

Experimental results show that the combined similarity algorithm proposed in this thesis achieves higher precision than traditional similarity algorithms and, to a certain extent, improves the accuracy of time-series event categorization for personal Weibo.
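The idea of combining Jaccard similarity over similar-word templates with cosine similarity over feature weights can be illustrated with a small Python sketch. This is not the thesis's algorithm: the abstract does not say how the two scores are merged or how the Jaccard measure is "improved", so the weighted sum, the `alpha` parameter, the `templates` mapping, and all function names below are assumptions made purely for illustration. Each post is assumed to be already segmented and reduced to a dictionary of feature words and weights (for example TF-IDF values).

```python
import math

def expand(words, templates):
    """Replace each word by its similar-word group label when it has one,
    so two different words from the same group count as a match.
    (Hypothetical stand-in for the LDA-built similar-word templates.)"""
    return {templates.get(w, w) for w in words}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity of two sparse weight vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def combined_similarity(u, v, templates, alpha=0.5):
    """Weighted sum of template-expanded Jaccard and weight-based cosine.
    The linear combination and alpha are assumptions, not from the thesis."""
    j = jaccard(expand(u, templates), expand(v, templates))
    c = cosine(u, v)
    return alpha * j + (1.0 - alpha) * c

# Made-up example data: two posts that share one word and use two
# different words belonging to the same similar-word group.
templates = {"soccer": "sports", "football": "sports"}
post_a = {"soccer": 0.8, "match": 0.6}
post_b = {"football": 0.8, "match": 0.6}

plain = jaccard(set(post_a), set(post_b))
expanded = jaccard(expand(post_a, templates), expand(post_b, templates))
score = combined_similarity(post_a, post_b, templates)
print(plain, expanded, score)
```

In this toy input the plain Jaccard score only credits the one shared word, while the expanded score also credits the two words from the same group. That is the kind of gain over pure word co-occurrence that the abstract describes. A similarity of this form could then feed a clustering step such as K-Means, which the sketch leaves out.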
[Degree-granting institution]: Inner Mongolia University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092


Article ID: 2440722


Article link: http://www.sikaile.net/guanlilunwen/ydhl/2440722.html

