Research on Methods for Identifying Spam Comments on Weibo

Published: 2018-03-21 11:46

Topic: Weibo spam comments; Approach: co-training; Source: Guangxi Normal University, 2017 master's thesis; Type: degree thesis

[Abstract]: Spam comments are comments posted by users that are unrelated to the microblog post, meaningless, or posted with deliberate intent. Early approaches relied on manual identification, mainly through CAPTCHAs and review mechanisms. Later approaches identified spam automatically, mainly using keywords, link counts, or relevance thresholds. More recent approaches first apply rule-based filtering to remove obvious, explicit spam comments such as those containing hyperlinks or special characters, and then combine topic-feature-based methods with a classifier to identify Weibo spam comments. Weibo data is currently obtained mainly in two ways: web crawlers and the Weibo open platform API. The former is slow, so processing the experimental data needed for this thesis would take a great deal of time, while the latter is subject to access-count limits imposed by the Weibo platform servers. Neither is ideal for obtaining experimental data. This thesis therefore proposes a method based on cookies and regular expressions to obtain the required data, including the original Weibo posts, Weibo author information, and Weibo comments. The two common methods and the proposed method were each used to collect the comments on the divorce-themed post published by the Weibo-verified user Wang Baoqiang. The experimental results show that, compared with the two common methods, the proposed method is relatively simple to operate and obtains data faster. Weibo posts and their comments are limited to at most 140 characters, so the content is short and the topic features of a post are not particularly pronounced.
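The cookie-and-regular-expression idea can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the HTML structure, the class names in the pattern, the cookie value, and the helper names `build_request`, `fetch`, and `extract_comments` are all assumptions made for illustration, since the abstract does not describe the actual page markup.

```python
import re
import urllib.request

# Hypothetical markup: the real page structure is not given in the abstract.
COMMENT_RE = re.compile(
    r'<div class="comment"[^>]*>\s*<a[^>]*>(?P<user>[^<]+)</a>\s*'
    r'<span class="text">(?P<text>.*?)</span>',
    re.S,
)
TAG_RE = re.compile(r"<[^>]+>")


def build_request(url: str, cookie: str) -> urllib.request.Request:
    """Attach a logged-in session cookie so the page is served as to a browser."""
    return urllib.request.Request(
        url,
        headers={"Cookie": cookie, "User-Agent": "Mozilla/5.0"},
    )


def fetch(url: str, cookie: str) -> str:
    """Download one page; this performs network access and is not called below."""
    with urllib.request.urlopen(build_request(url, cookie), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_comments(html: str) -> list:
    """Return (user, text) pairs, stripping inline tags from the comment text."""
    return [
        (m.group("user").strip(), TAG_RE.sub("", m.group("text")).strip())
        for m in COMMENT_RE.finditer(html)
    ]


# Made-up sample page used only to exercise the pattern offline.
SAMPLE = '''
<div class="comment" id="c1"><a href="/u/1">alice</a>
<span class="text">Stay strong!</span></div>
<div class="comment" id="c2"><a href="/u/2">bob</a>
<span class="text">Cheap phones at <a href="http://x.example">this link</a></span></div>
'''

comments = extract_comments(SAMPLE)
```

In this sketch the cookie stands in for a logged-in browser session, and the regular expression replaces a full HTML parser, which is presumably where the reported simplicity and speed come from. A real page would need its own pattern.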
Identifying spam comments on Weibo therefore cannot consider only the relevance between a comment and the post, because relying on a single factor increases the misclassification rate for spam comments. This thesis accordingly tries co-training to improve classifier performance and proposes a spam comment identification method based on Co-Training. For the original post and the author information, the relevant information phrases obtained after preprocessing are combined with Weibo-specific sentiment vocabulary and with the sentiment words whose intensity is greater than 5 in the affective lexicon ontology of the Dalian University of Technology Information Retrieval Laboratory to form a feature vocabulary. For the comments, explicit spam comments are first filtered out by a defined rule-based identification method, and the remaining relevant comments are preprocessed. On the one hand, the resulting comment phrases are compared with the constructed feature vocabulary using the Tongyici Cilin (synonym thesaurus) similarity calculation, and the results are fed into an AdaBoost classifier. On the other hand, feature extraction yields comment features that serve as feature vectors for training an SVM classifier. Finally, the two classifiers are trained jointly with a Co-Training algorithm tailored to Weibo spam comments, and the trained model is used to judge whether a comment is spam. The method improves classification accuracy while saving a large amount of sample labeling work. Experiments comparing it with two other typical methods show that the proposed method is feasible and effective.
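The co-training loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: the thesis pairs an AdaBoost classifier (similarity view) with an SVM (comment-feature view), whereas the sketch swaps in a tiny nearest-centroid learner for both so that it runs with the standard library alone. The feature values, the names `CentroidClassifier` and `co_train`, and the parameters `rounds` and `per_round` are hypothetical.

```python
from statistics import mean


class CentroidClassifier:
    """Stand-in learner (NOT AdaBoost or SVM): nearest class centroid."""

    def fit(self, X, y):
        # One centroid per class; both classes must appear in the labeled data.
        self.centroids = {
            label: [mean(col) for col in zip(*(x for x, t in zip(X, y) if t == label))]
            for label in set(y)
        }
        return self

    def predict_with_confidence(self, x):
        # Confidence is the gap between the two smallest squared distances.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, c)), label)
            for label, c in self.centroids.items()
        )
        (d0, label), (d1, _) = dists[0], dists[1]
        return label, d1 - d0


def co_train(view_a, view_b, labels, rounds=5, per_round=1):
    """labels[i] is 0 (normal) or 1 (spam) if labeled, None if unlabeled."""
    labels = list(labels)
    clf_a, clf_b = CentroidClassifier(), CentroidClassifier()

    def refit():
        known = [i for i, t in enumerate(labels) if t is not None]
        y = [labels[i] for i in known]
        clf_a.fit([view_a[i] for i in known], y)
        clf_b.fit([view_b[i] for i in known], y)

    refit()
    for _ in range(rounds):
        if all(t is not None for t in labels):
            break
        # Each classifier pseudo-labels the unlabeled samples it is most sure
        # about; the shared label set then retrains both views.
        for clf, view in ((clf_a, view_a), (clf_b, view_b)):
            pool = [i for i, t in enumerate(labels) if t is None]
            ranked = sorted(
                pool,
                key=lambda i: clf.predict_with_confidence(view[i])[1],
                reverse=True,
            )
            for i in ranked[:per_round]:
                labels[i] = clf.predict_with_confidence(view[i])[0]
        refit()
    return clf_a, clf_b, labels


# Made-up toy data. View A: similarity of the comment to the feature vocabulary.
view_a = [[0.9], [0.1], [0.8], [0.2], [0.85], [0.15]]
# View B: two invented surface features of the comment itself.
view_b = [[0.1, 0.0], [0.8, 0.6], [0.2, 0.1], [0.7, 0.5], [0.15, 0.05], [0.9, 0.7]]
# Only the first two comments carry manual labels.
seed_labels = [0, 1, None, None, None, None]

clf_a, clf_b, final_labels = co_train(view_a, view_b, seed_labels)
```

Only two of the six toy comments are labeled by hand and the rest are labeled by the classifiers themselves, which is the sense in which co-training saves annotation effort. Swapping the stand-in learner for real AdaBoost and SVM models would need classifiers that also report a confidence score.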
[Degree-granting institution]: Guangxi Normal University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP393.092; TP391.1


Article ID: 1643713


Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/1643713.html
