Research and Implementation of Web Page Collection Technology Based on Meta Search Engines
[Abstract]: With the rapid development of the Internet and the explosive growth of online information, government departments and enterprises that are sensitive to Internet information can no longer rely on manual monitoring alone to follow online trends. To help users monitor and analyze online information in real time, a large number of Internet information processing platforms have emerged in recent years. Running on high-performance computers, these platforms collect online information in a timely, accurate, and comprehensive manner and provide users with valuable analysis results. However, existing web page collection technology still falls short in the timeliness, comprehensiveness, and efficiency of data collection. It is also complex to design and difficult to maintain, so it consumes considerable manpower and material resources. To overcome these shortcomings, this thesis transfers meta-search technology to the Internet information collection system and proposes a web page collection technology based on meta search engines, referred to as collection meta-search. The experimental results show that the new technology can ensure the timeliness, comprehensiveness, and efficiency of data collection. The main work of this thesis is as follows:
1) Traditional web page collection technology is studied and analyzed in detail, and the advantages and disadvantages of various web crawlers in meeting the collection needs of Internet information processing platforms are described. On this basis, a web page collection technology based on meta search engines is proposed.
2) To address the problem that the collection scale is too small when an existing meta search engine is used as the collection module, a query expansion technique based on local co-occurrence statistics (LCOOCS) is proposed, which obtains more relevant web pages by increasing the number of queries.
3) LCOOCS requires text analysis of the initial retrieval results, but the results collected by the meta search engine are raw HTML page source. To resolve this, an automatic main-text extraction algorithm, TextEx, is designed and implemented.
4) A collection meta-search system is designed and implemented. The query syntax and result page structure of six Internet search engines, including Baidu News and Bing Information, are summarized and extracted so that query submission and result download are automated.
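The abstract does not give the details of LCOOCS, so the following is only a minimal Python sketch of the general idea of query expansion from local co-occurrence: terms that appear together with the query terms in the initially retrieved documents are counted, and the most frequent ones are appended to the original query to form additional queries. The function names `expand_query` and `build_queries`, the scoring rule, and the sample documents are all illustrative assumptions and are not taken from the thesis.

```python
from collections import Counter


def expand_query(query_terms, top_docs, k=3):
    """Pick the k terms that co-occur most with the query terms.

    top_docs is a list of token lists, one per initially retrieved
    document. The scoring rule here is an assumption for illustration,
    not the thesis's LCOOCS formula.
    """
    query = set(query_terms)
    scores = Counter()
    for tokens in top_docs:
        present = set(tokens)
        hits = len(query & present)  # query terms found in this document
        if hits == 0:
            continue  # document shares nothing with the query
        for term in present - query:
            scores[term] += hits
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def build_queries(query_terms, expansion_terms):
    """Form one new query per expansion term: original query plus that term."""
    base = " ".join(query_terms)
    return [f"{base} {term}" for term in expansion_terms]


# Hypothetical tokenized text of four initially retrieved pages.
docs = [
    ["meta", "search", "engine", "crawler"],
    ["search", "engine", "ranking"],
    ["web", "crawler", "search"],
    ["cooking", "recipe"],
]
terms = expand_query(["search"], docs, k=2)
queries = build_queries(["search"], terms)
print(queries)
```

Each generated query would then be submitted to the underlying search engines, which is how increasing the number of queries enlarges the set of collected pages.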
[Degree-granting institution]: Xidian University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP393.092; TP391.3