Research and Implementation of Microblog Hot Topic Discovery

Published: 2018-02-23 01:09

  Keywords: microblog; hot topic discovery; microblog API; Single-Pass algorithm; LDA model. Source: Zhengzhou University, 2014 master's thesis. Document type: degree thesis


[Abstract]: With the rapid development of the Internet and the widespread adoption of the mobile Internet, the ways in which Internet users communicate with one another have become increasingly diverse. As an emerging platform, microblogging has won growing favor among users for its distinctive flexibility and convenience. While microblogs make daily life far more convenient, they also have side effects: for example, some people use them to deliberately spread false information, which harms social stability. If such topics can be detected early, appropriate measures can be taken in time. From the user's side, a user sees only the posts on their own home page and cannot tell which events most users across the whole microblog network are discussing or following. Timely discovery of hot microblog topics is therefore highly worthwhile.
This thesis defines the heat of a topic so that hot topics can be expressed quantitatively: for a given topic, the more recent its posts and the more comments and reposts they have, the higher the topic's heat and the more likely it is to be a hot topic. Many researchers in China and abroad have studied hot topic discovery; their approaches fall broadly into three groups (clustering algorithms, LDA models, and sentiment models) or are refinements of these. The first problem in studying microblog hot topic discovery is obtaining a corpus. Traditional web crawlers are not suited to collecting microblog data, and the microblog API can only fetch the posts shown on the user's own home page, so it cannot supply posts in bulk. This thesis therefore gathers a large set of users by following the follow relationships between them and then fetches those users' most recent posts. The posts are then preprocessed: spam filtering, word segmentation, stop-word removal, filtering of useless information, feature-word extraction, and feature-weight calculation, yielding a feature vector for each post. Finally, because microblog posts keep arriving, the thesis adopts the Single-Pass incremental clustering algorithm, which produces a set of clusters; each cluster represents one topic, and each topic contains many posts. Hot topics are then selected from these topics according to the heat defined above.
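The clustering and heat steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not give the feature-weight formula, the similarity measure, the threshold, or the heat formula, so the smoothed TF-IDF weights, cosine similarity, the `threshold` default, and the half-life decay in `topic_heat` are all hypothetical choices. Posts are assumed to be already segmented into word lists.

```python
import math
from collections import Counter
from datetime import datetime

def tfidf_vectors(docs):
    """Turn tokenised posts into sparse TF-IDF dicts (term -> weight).
    The smoothed IDF used here is an assumption, not the thesis's formula."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if not a or not b:
        return 0.0
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

def single_pass(vectors, threshold=0.3):
    """Single-Pass clustering: each vector joins the most similar existing
    cluster if the similarity reaches `threshold`, else it starts a new one.
    Returns a list of clusters, each a list of post indices."""
    clusters = []  # each entry: {"centroid": summed vector, "members": [...]}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            # Cosine ignores scale, so a running sum works as the centroid.
            for term, w in vec.items():
                best["centroid"][term] = best["centroid"].get(term, 0.0) + w
        else:
            clusters.append({"centroid": dict(vec), "members": [idx]})
    return [c["members"] for c in clusters]

def topic_heat(posts, now: datetime, half_life_hours=24.0):
    """Hypothetical heat score: engagement (comments + reposts) of each post,
    halved for every `half_life_hours` of age, summed over the topic."""
    heat = 0.0
    for post in posts:
        age_hours = (now - post["created"]).total_seconds() / 3600
        engagement = 1 + post["comments"] + post["reposts"]
        heat += engagement * 0.5 ** (age_hours / half_life_hours)
    return heat
```

Because each post is compared only with the clusters that already exist, new posts can be added without reclustering the old ones, which is the property the abstract cites for choosing an incremental algorithm. The result does depend on arrival order and on the threshold.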
Earlier research shows that the LDA topic model can also be used to discover topics, but it requires many iterations and runs slowly on large data sets. LDA is, however, stronger at expressing topics, so this thesis combines the Single-Pass algorithm with the LDA model: the posts are first clustered with Single-Pass, each cluster is then processed with LDA, and the hot microblog topics are obtained from the result. This produces more accurate topics than Single-Pass alone and runs faster than LDA alone.
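The second stage of this combination, running LDA separately on each Single-Pass cluster, might look like the sketch below. It uses a small collapsed Gibbs sampler written from scratch so that the example is self-contained. The abstract does not say which LDA implementation or settings the thesis uses, so `lda_gibbs`, `describe_clusters`, and every hyperparameter (`n_topics`, `n_iter`, `alpha`, `beta`, `seed`) are hypothetical. The input is assumed to be tokenised posts plus clusters given as lists of post indices.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA.
    Returns one Counter of word counts per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                        # tokens per topic
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    # Resample each token's topic from its conditional distribution.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word

def describe_clusters(docs, clusters, n_topics=2, top_n=3):
    """Run LDA on each cluster separately and return, per cluster,
    the top words of each topic found inside it."""
    result = []
    for members in clusters:
        cluster_docs = [docs[i] for i in members]
        topic_word = lda_gibbs(cluster_docs, n_topics=n_topics)
        result.append([
            [w for w, c in counts.most_common(top_n) if c > 0]
            for counts in topic_word
        ])
    return result
```

Each LDA run here sees only the posts of one cluster, so the sampler iterates over a small corpus each time instead of the whole collection, which is consistent with the speed advantage over running LDA alone that the abstract claims.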

[Degree-granting institution]: Zhengzhou University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092


Article ID: 1525763

Article link: http://www.sikaile.net/guanlilunwen/ydhl/1525763.html

