Research and Implementation of Web Page Collection Technology Based on Meta-Search Engines

Published: 2018-10-19 14:35

[Abstract]: With the rapid growth of the Internet, the volume of online information has expanded sharply, and government departments, enterprises, and public institutions that are sensitive to Internet information can no longer rely on manual monitoring alone to follow developments online. To help users monitor and analyze online information in real time, a large number of Internet information processing platforms have emerged in recent years. Backed by high-performance computers, these platforms collect online information promptly, accurately, and comprehensively, and then provide users with valuable analysis results. However, existing web page collection techniques still fall short in the timeliness, comprehensiveness, and effective rate of the data they collect. They are also complex to design and hard to maintain, and they consume considerable manpower and material resources. To overcome these shortcomings, this thesis carries meta-search technology over to Internet information collection systems and proposes a web page collection technique based on meta-search engines, called collection-oriented meta-search. Experimental results show that, compared with existing web page collection techniques, the new technique ensures the timeliness, comprehensiveness, and effective rate of the collected data. The main work of this thesis is as follows:

1) Traditional web page collection techniques are studied and analyzed in detail, the strengths and weaknesses of various web crawlers in meeting the collection needs of Internet information processing platforms are described, and a web page collection technique based on meta-search engines is proposed.
2) To address the problem that existing meta-search engines collect too few pages when used as a collection module, a query expansion technique based on local co-occurrence statistics (LCOOCS) is proposed, which retrieves more relevant pages by increasing the number of queries.
3) Because LCOOCS requires text analysis of the initial retrieval results, whereas the results collected by a meta-search engine are raw HTML source, a fully automatic main-text extraction algorithm, TextEx, is designed and implemented.
4) A collection-oriented meta-search system is designed and implemented. The query syntax and result page structure of six major Internet search engines, including Baidu News and Bing News, are summarized and extracted, and query submission and result download are automated.
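The abstract describes query expansion from local co-occurrence statistics only at a high level. The Python sketch below illustrates one plausible reading of that idea and is not the thesis's actual LCOOCS algorithm: terms that co-occur with the query terms in the initial retrieval results are scored, and the top-scoring terms are appended to the original query to form additional queries. The function names, the scoring rule (term frequency multiplied by query-term hits per document), and the whitespace-style tokenizer are all assumptions made for illustration; Chinese text would need a proper word segmenter in place of `tokenize`.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expansion_terms(query, initial_docs, top_k=3, stopwords=frozenset()):
    """Rank candidate terms by co-occurrence with the query terms.

    Assumed scoring rule: within each initially retrieved document,
    a candidate term scores (its frequency) * (number of query-term
    occurrences in that document). Documents that contain no query
    term contribute nothing.
    """
    query_terms = set(tokenize(query))
    scores = Counter()
    for doc in initial_docs:
        counts = Counter(tokenize(doc))
        query_hits = sum(counts[t] for t in query_terms)
        if query_hits == 0:
            continue
        for term, freq in counts.items():
            if term in query_terms or term in stopwords:
                continue
            scores[term] += freq * query_hits
    # Sort by descending score, then alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


def expanded_queries(query, initial_docs, top_k=3):
    """Build one extra query per expansion term."""
    return [f"{query} {term}" for term in expansion_terms(query, initial_docs, top_k)]


if __name__ == "__main__":
    docs = [
        "meta search engine crawler",
        "search engine crawler crawler",
        "weather report today",
    ]
    print(expanded_queries("search engine", docs, top_k=2))
```

Each expanded query would then be submitted to the underlying search engines, which is how issuing more queries can raise the number of relevant pages collected.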
[Degree-granting institution]: Xidian University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.092;TP391.3


Article ID: 2281433

Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2281433.html
