Research on a Topic-Based Weibo Web Crawler

Published: 2018-04-29 01:33

  Topics: web page analysis + Weibo crawler; Source: Wuhan University of Technology, 2014 master's thesis


[Abstract]: Following the popularity of Twitter in the United States, the major Chinese portals launched their own Weibo (microblog) services, and Weibo has grown steadily more popular among Internet users. Buzzwords coined on Weibo spread quickly across the Internet, a "Weibo effect" is taking shape, and microblogging has become one of the main online activities of Chinese Internet users. Because of this effect, Weibo topics propagate rapidly among users, so acquiring and analyzing Weibo information has become an important research subject. To ease data acquisition, the major Weibo sites provide APIs for fetching posts, but these APIs limit the number of calls and therefore cannot meet the need for large volumes of Weibo data; the data they return is also often noisy. To address these problems, this thesis introduces web page analysis and topic relevance analysis, and carries out the research and design of a topic-based Weibo web crawler.

The main work is as follows: studying web page analysis techniques and choosing a method for extracting information from Weibo pages according to their characteristics; describing in detail the reasoning behind and the design of a breadth-first search strategy based on "pruning", with emphasis on URL de-duplication and on handling a URL set that changes dynamically; studying short-text topic extraction and multi-keyword matching to determine the design of Weibo topic relevance analysis; and finally designing and implementing a prototype topic-based Weibo web crawler that fetches and stores Weibo data in real time. The core problem is to design a pruning-based breadth-first search strategy suited to the characteristics of Weibo data and to apply it to a Weibo crawler, while using page analysis so that the crawler is not bound by the Weibo platform's API limits, allowing users to collect topic-relevant Weibo data as accurately as possible.

Results for the prototype were obtained through repeated experiments and compared with the crawling results of an API-based Weibo crawler and a web-page-based Weibo crawler. The comparison shows that the proposed crawling strategy can collect topic-relevant Weibo data: efficiency is somewhat lower, but the collected data has better topic relevance. These results indicate that the approach studied in this thesis is feasible.
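
The abstract names a pruning-based breadth-first search, URL de-duplication, and multi-keyword matching, but it does not give the thesis's actual algorithm. The following Python sketch only illustrates how these three ideas can fit together. Everything in it is an assumption made for illustration: the `relevance` score (the fraction of keywords found in the page text), the `threshold` value, the `fetch` callback that stands in for real page download and parsing, and the toy `PAGES` data.

```python
from collections import deque


def relevance(text, keywords):
    """Fraction of topic keywords found in the text (assumed scoring rule)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits / len(keywords)


def crawl(seed, fetch, keywords, threshold=0.5, max_pages=100):
    """Breadth-first crawl that prunes pages scoring below the threshold.

    fetch(url) is a hypothetical callback returning (text, outgoing_links).
    A pruned page is neither stored nor expanded, so its links are never queued.
    """
    queue = deque([seed])
    seen = {seed}  # URL de-duplication: each URL is queued at most once
    kept = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        text, links = fetch(url)
        if relevance(text, keywords) < threshold:
            continue  # prune this branch
        kept.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept


# Toy in-memory "web" used instead of real network access (illustrative data).
PAGES = {
    "u/a": ("crawler design for weibo topic", ["u/b", "u/c"]),
    "u/b": ("lunch photos", ["u/d"]),
    "u/c": ("weibo crawler pruning notes", ["u/a", "u/e"]),
    "u/d": ("weibo crawler deep page", []),
    "u/e": ("weather", []),
}


def fetch_from_memory(url):
    return PAGES[url]


result = crawl("u/a", fetch_from_memory, ["weibo", "crawler"])
print(result)  # ['u/a', 'u/c']
```

In this toy run `u/d` is on topic but is never fetched, because its only parent `u/b` was pruned. That loss of coverage in exchange for a narrower, more relevant crawl is the kind of trade-off a pruning strategy accepts; how the thesis itself balances it is not stated in the abstract.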
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092
