Research and Design of Parallelization for Topic-Focused Web Crawlers

Published: 2018-05-18 22:07

Topic: parallelization + crawler; Source: Southwest Petroleum University, 2017 master's thesis

[Abstract]: As the mobile Internet spreads, data is generated ever faster and data volume keeps growing. The number of results a search engine returns satisfies ordinary users but is not enough to support researchers' data analysis within a specific topic area. This thesis takes the acquisition of topic-specific information as its research problem and, driven by practical needs, studies how a topic-focused web crawler can collect relevant data from the Internet efficiently. It combines cluster-based parallel processing with an improved page-similarity algorithm to fetch web pages and judge whether their content is relevant to the topic. The work has three parts: the working principles of crawlers and related background, the parallelization of the crawler, and the judgment of the topical relevance of text during data collection. First, because the crawler is an important component of a search engine, the thesis starts from search engines and the HTTP protocol that the Web follows, and then examines the crawler's collection workflow. Second, building on the ordinary crawler workflow and on commonly used search strategies, it proposes a search algorithm that fuses multiple strategies, addressing the low efficiency of the original search and, as reported, improving efficiency several-fold. Third, the scale of Internet data pushes crawlers toward parallel operation. Based on the needs of each crawler component and the characteristics of the data, the thesis adopts a suitable set of parallel frameworks: RabbitMQ to hold multiple URL queues, the in-memory database Redis for URL de-duplication, the parallel computing framework Storm for processing page data, and the distributed database MongoDB.
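The division of labor between URL de-duplication and multiple URL queues can be illustrated with a minimal sketch. It is not the thesis's implementation: a Python set stands in for Redis, `collections.deque` objects stand in for RabbitMQ queues, and the class name `UrlFrontier`, the SHA-1 fingerprint, and the rule of routing URLs to queues by host are all assumptions made for illustration.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit


class UrlFrontier:
    """In-memory stand-in: a set plays the role of Redis,
    deques play the role of RabbitMQ queues (both are assumptions)."""

    def __init__(self, num_queues=3):
        self.seen = set()  # fingerprints of URLs already enqueued
        self.queues = [deque() for _ in range(num_queues)]

    @staticmethod
    def fingerprint(url):
        # Hash the URL so the de-duplication set stores fixed-size keys.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def queue_index(self, url):
        # Hypothetical routing rule: URLs of one host share one queue.
        host = urlsplit(url).netloc
        digest = hashlib.sha1(host.encode("utf-8")).digest()
        return digest[0] % len(self.queues)

    def push(self, url):
        """Enqueue url unless it was seen before; return True if enqueued."""
        fp = self.fingerprint(url)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.queues[self.queue_index(url)].append(url)
        return True

    def pop(self, index):
        """Take the next URL from one queue, or None if that queue is empty."""
        queue = self.queues[index]
        return queue.popleft() if queue else None
```

In a clustered deployment the set and the queues would live in shared services so that every worker sees the same state; the sketch only shows the check-then-enqueue order.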
Finally, the thesis proposes building the main content of a web page from a pruned content subtree centered on the page title, and then applies a discrimination algorithm that combines the vector space model with semantic information to identify the page's topic, which raises the recognition rate for topic-relevant pages. The system architecture and each module were designed and implemented, and the system was tested with "big data" as the topic. The results show that the system can identify pages related to "big data" with an accuracy of up to 82%, and that parallelization improves the system's efficiency and stability. The system addresses the problem of small and medium-sized crawlers autonomously collecting pages on a given topic, and the collected data is useful for subsequent analysis.
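The vector space model part of the relevance judgment can be sketched as a cosine similarity between term-frequency vectors. This covers only the plain vector space model, not the semantic component or the title-centered subtree extraction described in the abstract. The function names, the raw term-frequency weighting, and the threshold of 0.3 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Map a list of tokens to a term-frequency vector."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)


def is_on_topic(page_tokens, topic_tokens, threshold=0.3):
    """Hypothetical decision rule: relevant if similarity reaches the threshold."""
    score = cosine_similarity(term_vector(page_tokens), term_vector(topic_tokens))
    return score >= threshold
```

For Chinese pages the token lists would come from a word segmenter, and a weighting such as TF-IDF could replace the raw counts without changing the similarity function.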
[Degree-granting institution]: Southwest Petroleum University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.3



Article ID: 1907388


Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/1907388.html
