Design and Implementation of an Enterprise-Level Retrieval System Based on Solr
Keywords: enterprise search engine; distributed; consistent hashing; Solr. Source: South China University of Technology, 2013 master's thesis. Document type: degree thesis
[Abstract]: Search engines free people from a vast sea of web pages. An enterprise search engine is a small-to-medium search engine built for enterprise applications: it helps an enterprise process its internal information and links together enterprise-related network information so that resources can be shared and integrated. Kapok Retrieval (木棉检索) is an enterprise search engine for campus networks and also the main node search engine of SE6, a distributed search platform for the next-generation Internet. Building on its original architecture, this thesis redesigns several core modules and workflows and adds new modules, improving the system's performance, scalability, and fault tolerance.

To optimize query performance, the thesis redesigns the query module, introduces the open-source enterprise search engine Solr on the search nodes, and designs a distributed web page store partitioned by consistent hashing. The redesign keeps the parallel querying of the original system and adds index maintenance functions: adding, deleting, and updating index entries. Node communication is changed from RPC to HTTP, which is more open and standard and gives the nodes a more uniform interface. After the redesign, query efficiency improves, as do the system's openness and extensibility.

To address non-standardized management of page body text, slow snippet generation, and index redundancy, the thesis designs a web page metadata management system. Compared with the original body-text management, it is more systematic, standardized, and efficient, and it accommodates a continually growing page collection: when storage nodes are added or removed, it can quickly repartition the data and complete the migration. To improve fault tolerance, scalability, and error recovery, the thesis also designs a dynamic discovery mechanism. This mechanism replaces the original node management scheme; information such as the distribution of nodes in the distributed system is maintained centrally by the discovery mechanism. With it, the system keeps serving normally when nodes join, crash, exit, or suffer network anomalies, so fault tolerance improves substantially.

Finally, the thesis evaluates the performance of the whole system. The evaluation covers index build speed, how evenly web pages are distributed across nodes, and query response time, and it compares the results against the original system. The test data come from live campus-network data on the laboratory's SE6 distributed search engine platform.
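The abstract names consistent hashing as the partitioning strategy for the distributed page store but does not give implementation details. The following is only a minimal sketch of the general technique, not the thesis's code: the class name `ConsistentHashRing`, the use of MD5, the count of 100 virtual nodes per storage node, and the node names and URLs are all assumptions made for illustration.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring with virtual nodes (illustrative; not the thesis's implementation)."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual nodes per storage node (assumed value)
        self._keys = []           # sorted hash positions on the ring
        self._owner = {}          # hash position -> storage node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # MD5 is used here only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Place `replicas` virtual nodes for this storage node on the ring.
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            bisect.insort(self._keys, h)
            self._owner[h] = node

    def remove_node(self, node):
        # Remove every virtual node that belongs to this storage node.
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            self._keys.remove(h)
            del self._owner[h]

    def get_node(self, key):
        # A key belongs to the first virtual node at or after its hash,
        # wrapping around to the start of the ring.
        if not self._keys:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_left(self._keys, self._hash(key)) % len(self._keys)
        return self._owner[self._keys[idx]]


# Hypothetical storage nodes and page URLs.
ring = ConsistentHashRing(["store-1", "store-2", "store-3"])
urls = [f"http://example.edu/page/{i}" for i in range(1000)]
before = {u: ring.get_node(u) for u in urls}

ring.add_node("store-4")
after = {u: ring.get_node(u) for u in urls}
moved = [u for u in urls if before[u] != after[u]]
print(f"{len(moved)} of {len(urls)} pages moved after adding store-4")
```

In this sketch, adding a storage node only reassigns pages to the new node; pages on the other nodes stay where they are. That property is what would let a store like the one described repartition and migrate quickly when storage nodes are added or removed.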
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3