Research and Design of a Distributed Crawler Based on Docker Clusters

Published: 2018-06-18 10:21

Topic: Docker + distributed crawler; Source: Zhejiang Sci-Tech University, master's thesis, 2017

[Abstract]: Since the government proposed a national big data strategy, Internet big data has become increasingly prominent as a strategic resource. Web crawlers, the main tool for mining that data, have become correspondingly more important. Traditional crawlers, however, are built on VM clusters, which leave host resources underutilized and make the crawler system hard to scale. The rise of Docker, an emerging virtualization technology, offers an opportunity to solve these problems of crawlers running in VM environments. This study of a Docker-cluster-based distributed crawler covers two areas: distributed crawler technology and Docker cluster technology. Current open source crawler frameworks support distributed operation to varying degrees. The Scrapy framework, for example, does not support distributed crawling, and existing frameworks are better suited to VM clusters, so they inherit the poor utilization of system resources that VM clusters bring. A Docker cluster is a new kind of virtualized cluster that uses host resources more sensibly and efficiently than a VM cluster. Building on a study of open source web crawler architectures, this thesis designs and implements a crawler system that fully supports distributed operation and runs it on a Docker cluster. It also improves the crawler's URL deduplication algorithm by adopting a K-division (K分型) Bloom filter algorithm with better deduplication performance and adapting it to distributed use. The main work of the thesis is as follows:

(1) Study in depth how web crawlers work and the design patterns of their overall architecture. Study the orchestration and management tools for Docker clusters in detail, including their working principles and their management and scheduling mechanisms. Study content deduplication algorithms and apply them to the distributed crawler system.

(2) Study open source web crawler frameworks to understand why they do not support distributed operation, then design and implement distributed crawler modules suited to a Docker cluster. Combine these modules into a complete and efficient distributed crawler system. Use Kubernetes, a Docker cluster orchestration and management tool, to deploy and manage each functional module so that the system runs on a Docker cluster.

(3) Deploy the distributed crawler on both a VM cluster and a Docker cluster and compare them experimentally at several levels, to show that on a Docker cluster the system crawls more efficiently, uses host resources more fully, and is easier to scale horizontally.

(4) Study the principle of the classic Bloom filter algorithm and its error probability. Improve the K-division Bloom filter algorithm so that it meets the needs of distributed use, further improving deduplication and lowering the error probability. Experiments confirm that the improved K-division Bloom filter deduplicates more effectively.
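To make the URL deduplication idea in the abstract concrete, the following is a minimal sketch of a classic Bloom filter in Python. It is not the thesis's improved K-division variant, whose details the abstract does not give. The class name `BloomFilter`, the helper `should_crawl`, the double-hashing scheme over SHA-256, and the chosen sizes are all assumptions made for illustration. The error-rate estimate uses the standard approximation (1 - e^(-kn/m))^k for n inserted items, m bits, and k hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Classic Bloom filter with m bits and k hash functions (illustrative only)."""

    def __init__(self, m_bits: int, k_hashes: int) -> None:
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)  # bit array packed into bytes
        self.count = 0  # number of items added so far

    def _positions(self, item: str):
        # Double hashing (an assumption of this sketch): derive k bit
        # positions from two 64-bit halves of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, item: str) -> bool:
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

    def expected_false_positive_rate(self) -> float:
        # Standard approximation: (1 - e^(-k * n / m)) ** k
        return (1.0 - math.exp(-self.k * self.count / self.m)) ** self.k


def should_crawl(url: str, seen: BloomFilter) -> bool:
    """Return True the first time a URL is offered, False if it looks already seen."""
    if url in seen:
        return False
    seen.add(url)
    return True


if __name__ == "__main__":
    seen = BloomFilter(m_bits=1 << 16, k_hashes=5)
    for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
        print(url, should_crawl(url, seen))
    print("estimated false positive rate:", seen.expected_false_positive_rate())
```

A false positive here means a new URL is wrongly skipped, which is why the error probability matters for crawl coverage. In a distributed crawler the bit array would presumably have to live in storage shared by all workers rather than in one process's memory. That part is not shown here.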
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.3



Article ID: 2035147


Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/2035147.html
