Research and Optimization of a Distributed Search Engine Based on Nutch
Topic: Nutch + indexing; Source: Wuhan University of Technology, master's thesis, 2013
[Abstract]: Cloud computing has become a focus of attention in both the computer industry and academia, and Hadoop, currently the most popular cloud computing platform, is seeing increasingly wide use. At the same time, the open-source search engine package Nutch not only provides the tools a search engine needs but is also highly extensible. A growing number of researchers are studying the combination of Hadoop and Nutch, trying various approaches to improve the performance of distributed search. Building on their results, this thesis studies and optimizes a distributed search engine based on Nutch and Hadoop.
The research covers three areas: the Nutch framework, the Hadoop distributed platform, and the principles of distributed crawling. First, the Nutch framework and the Hadoop distributed platform are analyzed, with a detailed look at their architecture and main working principles, including core technologies such as the indexing mechanism, the plug-in mechanism, HDFS, and Map/Reduce. Next, the thesis focuses on crawler technology, distributed crawling in particular. After analyzing the existing Nutch-based crawling mechanism, it changes the underlying data structure and introduces an extensible hash function into the task assignment algorithm, which addresses the load-balancing and efficiency problems of the original algorithm.
Based on this work, the thesis designs a distributed search system built on Nutch and Hadoop. The index module of the system uses the extensible hash function, and the index and search modules take advantage of Nutch's extensibility to integrate ICTCLAS, the Chinese lexical analysis system from the Chinese Academy of Sciences, which effectively improves Nutch's support for Chinese.
Finally, the system is subjected to functional tests, performance tests, and an overall evaluation from several angles on a cluster built in the laboratory. The results confirm the feasibility and scalability of the designed system as well as its performance improvement.
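The abstract does not say which hash function the thesis uses for task assignment, so the following is only a rough illustration of the general idea: mapping crawl targets to crawler nodes with a hash so that load is spread out and the mapping survives changes in cluster size. It swaps in a consistent-hash ring as a stand-in technique. The class name `HashRing`, the node names, the `replicas` parameter, and the example URLs are all assumptions made for this sketch and do not come from the thesis or from Nutch.

```python
import bisect
import hashlib
from collections import Counter
from urllib.parse import urlsplit


def stable_hash(key: str) -> int:
    """Map a string to a 64-bit integer that is the same on every machine and run."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring that assigns hosts to crawler nodes."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual points per node, smooths the load
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place `replicas` virtual points for the node on the ring."""
        for i in range(self.replicas):
            point = stable_hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, url):
        """Return the node responsible for the URL's host."""
        # Hash the host rather than the full URL so one site stays on one node.
        host = urlsplit(url).hostname or url
        idx = bisect.bisect_right(self._points, stable_hash(host))
        return self._owner[self._points[idx % len(self._points)]]


# Usage: assign 3000 made-up hosts to three nodes, then grow the cluster.
ring = HashRing(["crawler-1", "crawler-2", "crawler-3"])
urls = [f"http://site{i}.example.com/index.html" for i in range(3000)]
before = {u: ring.node_for(u) for u in urls}

ring.add_node("crawler-4")
after = {u: ring.node_for(u) for u in urls}

moved = [u for u in urls if before[u] != after[u]]
print("load before:", Counter(before.values()))
print("load after: ", Counter(after.values()))
print("URLs reassigned:", len(moved))
```

With this design, adding a node only moves work onto the new node; assignments between the existing nodes are left alone. That property is one reason a hash-based scheme can scale better than a fixed modulo partition, though the thesis's own algorithm may differ in its details.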
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3