Design and Implementation of a Dynamically Adaptive Resource Collection System
[Abstract]: The Internet provides an ever-growing volume of valuable information. The total number of web pages in China grew by nearly 41% from 2011 to 2012, which places higher demands on how search engines collect web resources. The number of pages on the Internet is huge, and the number of dynamic pages is especially large. During collection, a crawler inevitably runs into abnormal situations such as slow server responses, duplicate pages, large numbers of invalid links, and link relationships between web resources that are hard to discover. This thesis focuses on solving these problems. Its main goal is to design and implement a resource collection system that dynamically adjusts and automatically adapts to the anomalies encountered on a wide area network, and that uses the collected information to discover link relationships between pages and predict more similar pages. The system uses statistics gathered in real time during collection as the basis for filtering links in real time, targeting links with a high duplication rate, invalid accesses, or access timeouts, in order to improve efficiency. Compared with a general-purpose collection system, it adapts to unstable network conditions and copes with large numbers of spam links. The thesis also proposes a link analysis and prediction method based on link statistics, which finds large numbers of similar pages and extends collection coverage, making up for the limitations of conventional link extraction. The resource collection system is implemented on a distributed architecture.
In addition to the basic modules for page download, page parsing, and URL deduplication and scheduling, the system adds a real-time filtering module and a URL prediction module, along with auxiliary modules for statistics collection, URL clustering, and classification, which together give the system its dynamic adaptive behavior. Test results show that the proposed method recognizes the various abnormal collection conditions and adjusts to them adaptively, improving the robustness of the system and keeping the collection process stable. The system also makes effective predictions for hard-to-find web links, providing another effective way to discover links beyond conventional link extraction.
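The real-time filtering idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: grouping statistics by host, the class and method names (`HostStats`, `RealTimeLinkFilter`, `record`, `should_fetch`), and the threshold values are all assumptions made for illustration. The sketch keeps running counters of duplicate, invalid, and timed-out fetches per host and stops scheduling links from a host once the share of bad outcomes exceeds a threshold.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlsplit

# Outcome labels are hypothetical; a real crawler would map its own
# fetch results (HTTP status, content hash match, timeout) onto them.
OUTCOMES = {"ok", "duplicate", "invalid", "timeout"}


@dataclass
class HostStats:
    """Running counters for one host, updated after every fetch attempt."""
    fetched: int = 0
    duplicates: int = 0
    invalid: int = 0
    timeouts: int = 0


class RealTimeLinkFilter:
    """Filter links using statistics collected while crawling.

    min_samples and max_bad_ratio are illustrative assumptions,
    not values taken from the thesis.
    """

    def __init__(self, min_samples=20, max_bad_ratio=0.5):
        self.min_samples = min_samples
        self.max_bad_ratio = max_bad_ratio
        self.stats = defaultdict(HostStats)

    def record(self, url, outcome):
        """Record the outcome of one fetch attempt for the URL's host."""
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        s = self.stats[urlsplit(url).netloc]
        s.fetched += 1
        if outcome == "duplicate":
            s.duplicates += 1
        elif outcome == "invalid":
            s.invalid += 1
        elif outcome == "timeout":
            s.timeouts += 1

    def should_fetch(self, url):
        """Return False once a host's bad-outcome ratio exceeds the threshold."""
        s = self.stats.get(urlsplit(url).netloc)
        # Too little evidence yet: keep crawling the host.
        if s is None or s.fetched < self.min_samples:
            return True
        bad = s.duplicates + s.invalid + s.timeouts
        return bad / s.fetched <= self.max_bad_ratio
```

A scheduler would call `record` after each download and `should_fetch` before enqueuing a newly extracted link. Keeping the counters per host is only one possible granularity; the abstract also mentions URL clustering, so statistics per URL pattern would be an equally plausible design.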
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP393.092; TP391.3