Research and Application of a Hadoop-Based Search Engine
Topic: search engine + Hadoop; Source: Zhejiang Sci-Tech University, master's thesis, 2013
[Abstract]: With the widespread adoption of network information technology, users' requirements for information retrieval are becoming increasingly strict. Fast, accurate, and comprehensive search can bring institutions higher customer satisfaction and good commercial returns. Constrained by limited technical and financial resources, most small and medium-sized institutions find it difficult to build a dedicated, efficient search system tailored to user needs the way large institutions do, and equally difficult to customize such a system further for their own needs. How to make effective use of the technology of the existing search engine giants to provide powerful search computing and diversified services to more institutions, especially small and medium-sized enterprises, universities, and research institutes that hold a certain amount of data but have limited budgets and weak core development capability, has therefore become a focus and a difficulty of current research in the search field.
Driven by practical application needs, this thesis studies the principles, related technologies, and algorithms of a Hadoop-based distributed search engine. It analyzes in depth the distributed computing framework MapReduce and the distributed file system HDFS, introduces a concrete design based on the MapReduce programming model, integrates the BM25 ranking model into Lucene for retrieval scoring, and uses the Paoding analyzer for Chinese word segmentation. It completes the architecture design of the system on the Hadoop platform, defines the division of system functions, analyzes and designs the crawling, indexing, and retrieval workflows, and improves and implements the three subsystems.
Building on an analysis, evaluation, and summary of the needs of small and medium-sized institutions for efficient information search and of the shortcomings of existing approaches, this thesis integrates the design and implementation of the three relatively independent subsystems, completes the setup and configuration of the Hadoop framework, and deploys a three-node distributed search engine system. Finally, the system's performance is tested and evaluated from the perspective of the search needs of users at small and medium-sized institutions. Using the Zhejiang Sci-Tech University website as the test subject, the efficiency of web page crawling and indexing is compared between the three-node distributed platform and a single-machine environment. The measured crawling and indexing times show that, for 20,000 web pages, the cluster takes about 15.64% less time than the single machine, and the gap widens as the number of pages grows. In addition, by comparing how well the retrieval results match the queries for different numbers of web pages, the average retrieval accuracy of the Hadoop-based distributed search engine is calculated to be nearly 20% higher than in the single-machine environment. The experimental results indicate that, once an institution's page count grows beyond a certain level, the distributed search engine system aimed at small and medium-sized institutions retrieves more accurate results faster than a traditional centralized search engine, with improved security, stability, and scalability, thereby improving the information retrieval performance of such institutions and accelerating their informatization.
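The abstract names BM25 as the ranking model used for retrieval scoring but does not give the formula or the parameter values. As a rough illustration only, the following Python sketch scores tokenized documents with the standard Okapi BM25 formula. The function name `bm25_scores`, the defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for the example, not details taken from the thesis, and the input is assumed to be already segmented into tokens (the thesis uses the Paoding analyzer for that step).

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one Okapi BM25 score per document.

    query_terms: list of query tokens.
    docs: non-empty list of documents, each a list of tokens.
    k1, b: common default parameters (assumed, not from the thesis).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # a term absent from the document contributes nothing
            # Smoothed IDF; the +1 inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [
        ["hadoop", "search", "engine"],
        ["hadoop", "hdfs"],
        ["lucene", "index"],
    ]
    for doc, s in zip(docs, bm25_scores(["search", "hadoop"], docs)):
        print(doc, round(s, 3))
```

In this toy corpus the first document matches both query terms and ranks highest, the second matches one term, and the third matches none and scores zero. A production system would rely on the similarity implementation of its search library rather than a hand-written scorer.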
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP391.3