Research and Implementation of Weibo Network Crawler Technology
Published: 2018-05-19 20:29
Topic: web crawler + XPath extraction; Source: master's thesis, Jilin University, 2013
[Abstract]: With the continued development of mobile communication networks and Web 2.0 technology, Weibo has become a basic tool for people's daily communication and entertainment, and more and more people use it to spread advertising, news, topics, and other information. At the same time, because of Weibo's openness and anonymity, it also harbors a great deal of harmful information, such as rumors, violence, and subversive content, which makes the guidance and supervision of public opinion in China very difficult. Research on data collection from the Weibo network is therefore both a foundation for modeling and optimizing information dissemination in that network and a necessary prerequisite for monitoring and analyzing Weibo public opinion, and so carries significant research and practical value.

This thesis takes Sina Weibo as its research object. After surveying current mainstream crawler technology, it designs and implements an efficient incremental Weibo crawler. The main work is as follows:

1. Based on the requirements of information extraction, the thesis analyzes the structure of Sina Weibo information and collects users' basic information, their tags and followed topics, their social relationships (followees and fans), and their posts, designing a corresponding database for the fields to be extracted. For collection, the crawler visits Weibo users' home pages with a simulated browser, converts the downloaded page source into a Document Object Model (DOM) tree, and extracts structured information from the tree with XPath expressions. For storage, software engineering practice is followed: the Hibernate and Spring data-persistence stack is used at the bottom layer, hiding the details of data access and storage.
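The DOM-plus-XPath extraction step described above can be sketched as follows. This is a minimal illustration using the standard `javax.xml.xpath` API; the page snippet, field names, and XPath expressions are invented for the example, and a real Weibo page would first need an HTML-to-XML cleanup pass that the thesis's browser-simulation layer performs.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;

public class XpathExtractor {
    // Parses a (well-formed) page snippet into a DOM tree and pulls one
    // field out with an XPath expression, mirroring the extraction step.
    public static String extract(String page, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical fragment of a user profile page.
        String page = "<user><nick>demo</nick><followers>42</followers></user>";
        System.out.println(extract(page, "/user/nick"));      // demo
        System.out.println(extract(page, "/user/followers")); // 42
    }
}
```

Each extracted field would then be handed to the persistence layer rather than printed; keeping the XPath expressions in configuration makes the extractor easy to adjust when the page layout changes.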
2. In the concrete design, the thesis implements automatic form filling: a packet-capture tool is used to work out the encryption protocol of the Sina Weibo login, a simulated browser fills in and submits the login form, and the cookies returned by the Sina Weibo server are then used to download users' pages. To collect user information efficiently and continuously, the thesis designs and implements a crawler built on a multi-producer, multi-consumer model: the collection side plays the producers, continuously downloading pages from the Sina Weibo server and parsing them into structured data, while the storage side plays the consumers, storing each kind of structured data in its own thread. To further improve the crawler's efficiency, the Sina Weibo API is used to supplement the collection of users' social information.

3. The thesis studies the Weibo crawl-scheduling problem in depth. Because users publish posts at very different rates, polling all users indiscriminately wastes a great deal of bandwidth and network resources. The thesis therefore proposes a crawl-scheduling strategy based on user activity: time-series analysis of the collected post timestamps predicts how many posts each user will publish in the next interval, the predicted volume is taken as the user's activity level, and the crawler visits more active users more frequently. Experiments show that, compared with simple depth-first crawling, this strategy clearly improves both coverage and timeliness.
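The multi-producer, multi-consumer pipeline of item 2 can be sketched with a bounded `BlockingQueue` from `java.util.concurrent`. This is a simplified model, not the thesis's actual code: the "download + parse" and "Hibernate save" steps are stand-ins, and the queue size, record format, and poison-pill shutdown are illustrative choices.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public class CrawlerPipeline {
    // Parsed records flow from fetcher threads (producers) to storage
    // threads (consumers) through a bounded queue; a POISON marker tells
    // each consumer to shut down.
    static final String POISON = "__STOP__";

    public static List<String> run(List<String> pages, int consumers) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        List<String> stored = new CopyOnWriteArrayList<>();

        Thread[] workers = new Thread[consumers];
        for (int i = 0; i < consumers; i++) {
            workers[i] = new Thread(() -> {
                try {
                    for (String rec = queue.take(); !rec.equals(POISON); rec = queue.take()) {
                        stored.add(rec); // stand-in for the Hibernate save
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        for (String page : pages) {
            queue.put("parsed:" + page); // stand-in for download + parse
        }
        for (int i = 0; i < consumers; i++) queue.put(POISON);
        for (Thread w : workers) w.join();
        return stored;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("p1", "p2", "p3"), 2).size()); // 3
    }
}
```

The bounded queue gives the back-pressure the abstract implies: when the storage threads fall behind, `put` blocks the fetchers instead of letting parsed records pile up in memory.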
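The activity-based scheduling of item 3 can be sketched as follows, with simple exponential smoothing standing in for the thesis's time-series model (the abstract does not name the exact method, so the smoothing factor, record layout, and method names here are assumptions): each user's past per-interval post counts yield a forecast for the next interval, and a priority queue orders users so the most active are crawled first.

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ActivityScheduler {
    // Forecasts a user's posting volume for the next interval by simple
    // exponential smoothing over past per-interval post counts.
    public static double predict(List<Integer> postsPerInterval, double alpha) {
        double level = postsPerInterval.get(0);
        for (int i = 1; i < postsPerInterval.size(); i++) {
            level = alpha * postsPerInterval.get(i) + (1 - alpha) * level;
        }
        return level; // forecast for the next interval
    }

    record User(String id, double activity) {}

    // Orders users by predicted activity; poll() yields the user the
    // crawler should visit next.
    public static PriorityQueue<User> schedule(List<String> ids, List<List<Integer>> histories) {
        PriorityQueue<User> queue =
                new PriorityQueue<>(Comparator.comparingDouble(User::activity).reversed());
        for (int i = 0; i < ids.size(); i++) {
            queue.add(new User(ids.get(i), predict(histories.get(i), 0.5)));
        }
        return queue;
    }

    public static void main(String[] args) {
        PriorityQueue<User> q = schedule(
                List.of("quiet", "busy"),
                List.of(List.of(1, 0, 1), List.of(5, 8, 7)));
        System.out.println(q.poll().id()); // busy
    }
}
```

In a running crawler the queue would be rebuilt (or re-keyed) after each interval as new timestamps arrive, so that crawl frequency keeps tracking each user's actual posting rate.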
【Degree-granting institution】: Jilin University
【Degree level】: Master
【Year granted】: 2013
【CLC number】: TP393.092
Article ID: 1911548
Link: http://www.sikaile.net/wenyilunwen/guanggaoshejilunwen/1911548.html