Design and Implementation of a Dynamic Adaptive Resource Collection System

Published: 2018-08-24 15:24

[Abstract]: The Internet offers an ever-growing amount of valuable information, and people have come to rely on search engines to find it. The total number of web pages in China grew by nearly 41% from 2011 to 2012, which places higher demands on how search engines collect web resources. The number of pages on the Internet is enormous, and dynamic pages in particular are growing rapidly. During resource collection, a crawler inevitably runs into abnormal conditions such as slow server responses, excessive duplicate pages and invalid links, and link relationships between web resources that are hard to discover. This thesis focuses on solutions to these problems.

The main goal is to design and implement a resource collection system that can dynamically adjust and automatically adapt to abnormal conditions in a wide-area network, and that can also use already collected information to discover link relationships between pages and predict more similar pages. The system uses real-time statistics gathered during collection as the basis for filtering links in real time, aiming to filter out links with a high duplication rate, invalid links, and links that time out, thereby improving collection efficiency. Compared with general-purpose collection systems, it adapts better to unstable network conditions and copes better with large numbers of junk links. For page links that are hard to discover, the thesis proposes a link analysis and prediction method that makes predictions from an analysis of link statistics. The method discovers a large number of similar pages and widens collection coverage, compensating for the shortcomings of conventional link extraction.

The system is implemented with a distributed architecture. In addition to the basic modules for page downloading, page parsing, URL deduplication, and URL scheduling, it adds a real-time filtering module and a URL prediction module, along with auxiliary modules for statistics, URL clustering, and classification, which give the system its dynamic adaptive behavior. Tests show that the proposed method can recognize various abnormal collection conditions and adjust to them adaptively, improving the robustness of the system and keeping the collection process stable. For hard-to-discover page links, the system can make effective predictions, providing another effective way to discover links besides conventional extraction.
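The real-time filtering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual algorithm, so everything here is an assumption made for illustration: statistics are kept per host, the class and method names (`HostStats`, `RealTimeLinkFilter`, `record`, `allow`) are hypothetical, and the thresholds are arbitrary placeholder values.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class HostStats:
    """Running counters for one host, updated after every fetch attempt."""
    fetched: int = 0
    duplicates: int = 0
    invalid: int = 0
    timeouts: int = 0


class RealTimeLinkFilter:
    """Rejects links from hosts whose crawl statistics look unhealthy.

    Hypothetical sketch: per-host grouping and all thresholds are
    assumptions, not values or design decisions taken from the thesis.
    """

    OUTCOMES = {"ok", "duplicate", "invalid", "timeout"}

    def __init__(self, min_samples=20, max_duplicate_rate=0.8,
                 max_invalid_rate=0.5, max_timeout_rate=0.5):
        self.min_samples = min_samples
        self.max_duplicate_rate = max_duplicate_rate
        self.max_invalid_rate = max_invalid_rate
        self.max_timeout_rate = max_timeout_rate
        self.stats = {}  # host name -> HostStats

    def record(self, url, outcome):
        """Update the statistics for the URL's host with one fetch outcome."""
        if outcome not in self.OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        host = urlsplit(url).netloc
        s = self.stats.setdefault(host, HostStats())
        s.fetched += 1
        if outcome == "duplicate":
            s.duplicates += 1
        elif outcome == "invalid":
            s.invalid += 1
        elif outcome == "timeout":
            s.timeouts += 1

    def allow(self, url):
        """Return True if the URL should still be scheduled for download."""
        s = self.stats.get(urlsplit(url).netloc)
        # Too little evidence about this host: do not filter yet.
        if s is None or s.fetched < self.min_samples:
            return True
        return (s.duplicates / s.fetched <= self.max_duplicate_rate
                and s.invalid / s.fetched <= self.max_invalid_rate
                and s.timeouts / s.fetched <= self.max_timeout_rate)
```

In a crawler loop, the downloader would call `record` after each attempt and the scheduler would call `allow` before enqueuing a newly extracted link, so that the filter tightens or relaxes as conditions on each host change.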
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.092;TP391.3


Article ID: 2201235


Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2201235.html
