搜索引擎中索引表求交和提前停止技術(shù)優(yōu)化研究

發(fā)布時(shí)間：2019-01-08 11:48

【摘要】：目前搜索引擎系統(tǒng)所處理的網(wǎng)頁數(shù)據(jù)量、用戶的查詢數(shù)量都在與日俱增，如何能高效地完成用戶查詢是最近幾年，無論是工業(yè)界還是學(xué)術(shù)界都在著力解決的問題，本文便是應(yīng)上述需求而展開研究的。具體來說，本文探討了索引表求交算法和提前停止技術(shù)，我們還嘗試采用新一代的圖形顯卡技術(shù)來提升查詢處理的效率。具體工作包括以下幾方面：第一，本文首先系統(tǒng)地回顧了目前的索引表求交算法，并采用真實(shí)搜索引擎的查詢集在GOV和GOV2數(shù)據(jù)集上對各種求交算法和搜索策略進(jìn)行測試和概要分析，評價(jià)各種因素對求交算法性能的影響，并相應(yīng)提出優(yōu)化方法。前人對求交和搜索算法的分析和評價(jià)主要關(guān)注比較次數(shù)對運(yùn)行時(shí)間的影響，本文嘗試分析更接近系統(tǒng)架構(gòu)層面的分支預(yù)測失敗次數(shù)和緩存失效次數(shù)對求交和搜索算法性能的重要影響。作者發(fā)現(xiàn)，基于SvS求交模型并使用哈希分段策略的算法性能最佳，盡管該算法的比較次數(shù)并不是最少的，但是在付出少量額外存儲(chǔ)空間的前提下，運(yùn)行過程中的緩存失效和分支預(yù)測失敗數(shù)相對較低，從而達(dá)到了最佳的性能。本文還討論了文檔重排序?qū)τ谇蠼凰惴ㄐ阅艿挠绊�，大部分求交算法在按照URL排序的索引上的性能優(yōu)于文檔隨機(jī)編號(hào)的索引，且緩存失效數(shù)和分支預(yù)測失敗數(shù)更少。另外，通過表長的相對比值來分析求交算法中各種搜索策略的性能，，本文設(shè)計(jì)了一種啟發(fā)式的調(diào)度策略來提升求交算法的性能，實(shí)驗(yàn)結(jié)果表明這種啟發(fā)式混合調(diào)度策略可以顯著提升索引表求交算法的性能。第二，目前圖形顯卡（GPU）由于擁有強(qiáng)大的并行計(jì)算能力，已經(jīng)越來越多地應(yīng)用在各種科學(xué)計(jì)算的任務(wù)中。本文探討了利用GPU加速搜索引擎中的索引表求交算法以及后續(xù)的文檔分?jǐn)?shù)計(jì)算、排序等步驟，實(shí)驗(yàn)表明本文提出的算法可以大幅度提升查詢處理的吞吐率。具體來說，本文針對搜索引擎中可能存在的高查詢負(fù)載的交互式查詢或者非交互式查詢，提出了CPU-GPU協(xié)同工作模型來加速搜索引擎中的索引表求交運(yùn)算。當(dāng)系統(tǒng)處于低負(fù)載的時(shí)候，CPU還遠(yuǎn)沒有達(dá)到負(fù)載上限時(shí)，查詢處理的響應(yīng)時(shí)間是首先需要考慮的因素，每當(dāng)新的查詢進(jìn)入系統(tǒng)時(shí)，可以采用CPU或者GPU求交算法對查詢進(jìn)行處理。而當(dāng)系統(tǒng)處于高查詢負(fù)載或需要處理非交互式查詢時(shí)，則采用同步模式，其中若干查詢首先在CPU端組成批次，再發(fā)送到GPU上進(jìn)行處理�；谏鲜瞿Ｐ停谔幚砀哓�(fù)載查詢或者非交互查詢的情況下，利用GPU的大規(guī)模并行處理能力，查詢處理的吞吐率可以大幅增加。本文則提出并實(shí)現(xiàn)了這種批次求交的框架。在這個(gè)框架中，求交部分中的搜索算法是性能瓶頸，因此本文著重討論搜索算法的各種優(yōu)化策略。具體來說，在本文提出的GPU表求交算法中，當(dāng)某兩個(gè)索引表1和2進(jìn)行求交操作時(shí)（|1|≤|2|），對于1的每一個(gè)元素，GPU的每一個(gè)線程執(zhí)行某種搜索策略在2中進(jìn)行搜索，以發(fā)現(xiàn)交集元素。其中最簡單的搜索策略為二分搜索，本文以該算法作為基準(zhǔn)，提出了線性回歸搜索、哈希分段搜索和基于Bloom Filter的GPU求交算法。實(shí)驗(yàn)結(jié)果顯示，本文提出幾種策略可以顯著提升GPU求交算法的性能，尤其是哈希分段算法的性能提升最為明顯。同時(shí)，本文還討論了索引重排對于GPU求交算法性能的影響。除此之外，本文還研究了如何在GPU求交算法的基礎(chǔ)上對文檔計(jì)算分?jǐn)?shù)并排序，提出了一種適合GPU批次算法的文檔排序方法。第三，本文還討論了求交運(yùn)算后進(jìn)一步優(yōu)化查詢處理，即利用提前停止技術(shù)優(yōu)化文檔排序工作。提前停止的基本思想是無需掃描完全部索引表，即可將與用戶查詢相關(guān)度分?jǐn)?shù)最高的前k名文檔求出。針對于僅僅依賴全局信息排序的索引結(jié)構(gòu)，很難獲得很好的提前停止效果的問題，本文提出了新的方法來重新組織索引結(jié)構(gòu)，獲得了顯著的性能提升。具體來說，本文首先對提前停止效果和靜態(tài)排名分?jǐn)?shù)、詞頻分?jǐn)?shù)的分布之間的關(guān)系進(jìn)行理論分析。本文發(fā)現(xiàn)只要詞頻分?jǐn)?shù)是均勻分布的，提前停止的效果會(huì)非常好。然而，真實(shí)的詞頻分?jǐn)?shù)是不服從均勻分布的，即某些分?jǐn)?shù)的值的概率會(huì)高于其它分?jǐn)?shù)值。而上述傾斜的分布正是導(dǎo)致全局排名方法提前停止效率較差的原因。本文提出了新的算法來依據(jù)某些額外的信息來重新安排倒排索引中文檔的組織方式。具體地，本文提出了利用文檔詞頻分?jǐn)?shù)上界（UBIR）與靜態(tài)排名分?jǐn)?shù)結(jié)合后的分?jǐn)?shù)作為排序的標(biāo)準(zhǔn)重新組織索引結(jié)構(gòu)，以提升查詢處理中提前停止的效果。通過這種方法，在進(jìn)行查詢處理時(shí)，文檔詞頻數(shù)的估計(jì)值可以顯著地降低，從而可以更快地滿足提前停止的條件。本文還提出了其它幾種的基于全局信息對索引進(jìn)行排序的方法。通過在GOV和GOV2數(shù)據(jù)集上比較本文提出的各種算法，并且與現(xiàn)有的全局排名的方法進(jìn)行比較，結(jié)果顯示，本文的方法可以有效地提升查詢處理的效率。
[Abstract]:At present, the number of web pages processed by the search engine system is increasing, and how to efficiently complete the user's query is the most recent years, whether industry or academia is working hard to solve, this paper is to study the above requirements. In particular, this paper discusses the index table intersection algorithm and the early stop technique, and we also try to improve the efficiency of query processing by using a new generation of graphic graphics card technology. The specific work includes the following aspects: First, this paper first systematically reviews the current index table intersection algorithm, and uses the query set of the real search engine to test and outline the various intersection algorithms and search strategies on the GOV and GOV2 data sets. The influence of various factors on the performance of the algorithm is evaluated, and the optimization method is put forward accordingly. In this paper, we try to analyze the influence of the number of branch prediction failures and the number of cache failures on the performance of the search algorithm and the search algorithm. In response, the author has found that the algorithm performance of the model based on SvS and using the hashing strategy is the best, although the number of the comparisons of the algorithm is not the least, the cache failure and the number of the branch prediction failures in the running process are relatively low on the premise of a small amount of extra storage space. low, thus achieving the best This paper also discusses the effect of re-ordering of documents on the performance of the algorithm. Most of the algorithms are better than the index of the random number of the document in the index ordered by the URL, and the number of cache failures and the number of branch prediction failures are more In addition, the performance of various search strategies in the intersection algorithm is analyzed by the relative ratio of the table length. In this paper, a heuristic scheduling strategy is designed to improve the performance of the algorithm. The experimental results show that this heuristic hybrid scheduling strategy can significantly improve the performance of the algorithm. Second, the current graphics graphics card (GPU) has been used more and more in various scientific computing because of its powerful parallel computing capability In this paper, the paper discusses the steps of using the algorithm of the index table in the GPU to accelerate the search engine and the subsequent calculation and sorting of the documents. The experiments show that the proposed algorithm can greatly improve the query processing. In this paper, an interactive query or non-interactive query of the high query load that may exist in the search engine is presented, and the CPU-GPU cooperative work model is proposed to speed up the index table in the search engine. When the system is in a low load, the response time of the query processing is the first to be considered when the system is in the low load, and each time the new query enters the system, the CPU or the GPU can be used to ask the algorithm to query the query. Line processing. When the system is in a high query load or a non-interactive query needs to be processed, a synchronization pattern is used, in which a number of queries first constitute a batch at the CPU end and then sent to the GPU. line processing. Based on the model, in the case of processing a high-load query or a non-interactive query, the throughput rate of the query processing can be large with the large-scale parallel processing capability of the GPU. In this paper, this paper presents and implements this batch. In this framework, the search algorithm in the intersection is the performance bottleneck, so this paper focuses on the various advantages of the search algorithm. Specifically, in the GPU table intersection algorithm proposed herein, when one of the two index tables 1 and 2 performs an intersection operation (| 1 | 1 | 2 |), for each element of 1, each thread of the GPU executes a search strategy to search in 2 to find the intersection The simplest search strategy is a binary search. In this paper, based on the algorithm, a linear regression search, a hash-segment search, and a Bloom Filter-based GPU are proposed. The experimental results show that several strategies can improve the performance of the algorithm, especially the performance of the hash segmentation algorithm. This paper also discusses the performance of the index rearrangement for GPU. In addition, the paper also studies how to compute the score of the document on the basis of the GPU's intersection algorithm, and put forward a kind of document bank suitable for GPU batch algorithm. The third, this paper also discusses the further optimization of query processing after the intersection operation, that is, using the advance stop technique to optimize the text. File sorting. The basic idea to stop in advance is that you can query the top k with the highest correlation score with the user without the need to scan the full index table. In this paper, a new method is proposed to re-organize the index structure. In particular, in this paper, the relationship between the early stop effect and the static ranking score and the word frequency score is first In this paper, we find that as long as the word frequency score is uniformly distributed, the effect of early stop The fruit will be very good. However, the true word frequency score is not subject to uniform distribution, that is, the probability of certain scores will be higher other fractional values. The above-described skew distribution is the leading end of the global ranking approach This paper proposes a new algorithm to rearrange the documents in the inverted index according to some extra information. In this paper, the index structure of the index is re-organized by using the score of the upper bound (UBIR) and the static ranking score in the document, so as to advance the advance in the query processing. By this method, when the query processing is performed, the estimated value of the frequency of the document word can be significantly reduced, so that the advance can be more quickly satisfied. In this paper, several other global information-based indexes are also put forward. By comparing the various algorithms proposed in this paper on the GOV and GOV2 data sets, and compared with the existing global ranking method, the results show that the method can effectively improve the query.
【學(xué)位授予單位】：南開大學(xué)
【學(xué)位級(jí)別】：博士
【學(xué)位授予年份】：2012
【分類號(hào)】：TP391.3

【相似文獻(xiàn)】

相關(guān)期刊論文前10條

1 那罡;;移動(dòng)搜索的“簡單”邏輯[J];中國計(jì)算機(jī)用戶;2006年26期

2 劉興波;;數(shù)據(jù)挖掘在搜索引擎中的應(yīng)用[J];遼寧省交通高等�？茖W(xué)校學(xué)報(bào);2008年03期

3 謝紅俠;惠正運(yùn);;一種面向文檔的XML的索引查詢方法[J];微機(jī)發(fā)展;2005年12期

4 許洪超;袁培燕;;智能搜索引擎系統(tǒng)的建模分析[J];福建電腦;2009年08期

5 陸小麗;何加銘;;基于Map/Reduce的索引數(shù)據(jù)云存儲(chǔ)模型研究[J];寧波大學(xué)學(xué)報(bào)(理工版);2011年03期

6 王路芳;張虎;;一種面向搜索引擎的基于集合模型的搜索算法[J];山西農(nóng)業(yè)大學(xué)學(xué)報(bào)(自然科學(xué)版);2009年06期

7 盧秉亮;朱健;張磊;曹一鵬;;Internet搜索引擎索引數(shù)據(jù)庫的設(shè)計(jì)與實(shí)現(xiàn)[J];微處理機(jī);2006年03期

8 吳啟明;;基于XML的搜索引擎[J];中國新技術(shù)新產(chǎn)品;2008年12期

9 張繼剛;搜索引擎使用技巧[J];網(wǎng)絡(luò)與信息;1999年09期

10 ;關(guān)鍵詞搜索[J];每周電腦報(bào);2000年38期

相關(guān)會(huì)議論文前10條

1 彭軻;廖聞劍;;淺析搜索引擎[A];中國通信學(xué)會(huì)第五屆學(xué)術(shù)年會(huì)論文集[C];2008年

2 李丹;;如何利用搜索引擎查找中醫(yī)藥信息[A];中國中醫(yī)藥信息研究會(huì)第二屆理事大會(huì)暨學(xué)術(shù)交流會(huì)議論文匯編[C];2003年

3 鄧長壽;郭景峰;楊焱林;鄧安遠(yuǎn);;下一代Web搜索引擎初探[A];第十八屆全國數(shù)據(jù)庫學(xué)術(shù)會(huì)議論文集（研究報(bào)告篇）[C];2001年

4 維尼拉·木沙江;吐爾洪·吾司曼;;維、哈、柯文搜索引擎中網(wǎng)頁爬行器的設(shè)計(jì)與實(shí)現(xiàn)[A];少數(shù)民族青年自然語言處理技術(shù)研究與進(jìn)展——第三屆全國少數(shù)民族青年自然語言信息處理、第二屆全國多語言知識(shí)庫建設(shè)聯(lián)合學(xué)術(shù)研討會(huì)論文集[C];2010年

5 何璐;李晉宏;;基于XML的大容量搜索引擎技術(shù)探討[A];2006北京地區(qū)高校研究生學(xué)術(shù)交流會(huì)——通信與信息技術(shù)會(huì)議論文集（下）[C];2006年

6 湯薇;曾艷;;構(gòu)建校園網(wǎng)搜索引擎必要性分析[A];廣西計(jì)算機(jī)學(xué)會(huì)2008年年會(huì)論文集[C];2008年

7 張維華;王國仁;于戈;鄭懷遠(yuǎn);;面向?qū)ο髷?shù)據(jù)庫系統(tǒng)中索引的建立與維護(hù)[A];第十七屆全國數(shù)據(jù)庫學(xué)術(shù)會(huì)議論文集（研究報(bào)告篇）[C];2000年

8 姚樹宇;趙少東;;一種使用分布式技術(shù)的搜索引擎[A];2005年全國開放式分布與并行計(jì)算學(xué)術(shù)會(huì)議論文集[C];2005年

9 倪俊峰;;基于黃頁搜索引擎的關(guān)鍵字排名廣告系統(tǒng)的設(shè)計(jì)與實(shí)現(xiàn)[A];2005年中國索引學(xué)會(huì)年會(huì)暨學(xué)術(shù)研討會(huì)論文集[C];2005年

10 張怡;查貴庭;;SEO在信息服務(wù)中的應(yīng)用研究[A];2010年中國索引學(xué)會(huì)年會(huì)暨學(xué)術(shù)研討會(huì)論文集[C];2010年

相關(guān)重要報(bào)紙文章前10條

1 章森王偉;搜索引擎的工作機(jī)制[N];計(jì)算機(jī)世界;2006年

2 李一鑫;搜索排名的紅與黑[N];財(cái)經(jīng)時(shí)報(bào);2007年

3 周文林;搜狗3.0能否撼動(dòng)搜索市場[N];經(jīng)濟(jì)參考報(bào);2007年

4 惠正一;比爾·蓋茨:微軟不怕Google[N];第一財(cái)經(jīng)日報(bào);2005年

5 賽迪顧問股份有限公司互聯(lián)網(wǎng)與電子商務(wù)咨詢中心常燕杰;搜索，還是門戶[N];中國計(jì)算機(jī)報(bào);2005年

6 陳珊;浙江移動(dòng)推出手機(jī)搜索引擎服務(wù)[N];人民郵電;2005年

7 趙法忠;搜索引擎還需悠著點(diǎn)[N];中國經(jīng)營報(bào);2005年

8 金朝力;搜索引擎火拼搜索質(zhì)量[N];北京商報(bào);2006年

9 本報(bào)記者　趙曉輝孟昭麗;搜索引擎駛?cè)搿氨茱L(fēng)港”[N];中國證券報(bào);2006年

10 孫t;搜索引擎驚喜侵權(quán)官司止于“避風(fēng)港”？[N];第一財(cái)經(jīng)日報(bào);2006年

相關(guān)博士學(xué)位論文前10條

1 張帆;搜索引擎中索引表求交和提前停止技術(shù)優(yōu)化研究[D];南開大學(xué);2012年

2 陳旭毅;基于索引云的企業(yè)搜索引擎實(shí)現(xiàn)研究[D];武漢大學(xué);2011年

3 劉佐達(dá);分布協(xié)作式搜索引擎模型及算法研究[D];清華大學(xué);2011年

4 岑榮偉;基于用戶行為分析的搜索引擎評價(jià)研究[D];清華大學(xué);2010年

5 李群;主題搜索引擎聚類算法的研究[D];北京林業(yè)大學(xué);2011年

6 蘇君華;面向搜索引擎的技術(shù)接受模型研究[D];南京大學(xué);2011年

7 郭眈;中文互聯(lián)網(wǎng)視頻搜索引擎系統(tǒng)策略研究[D];北京交通大學(xué);2012年

8 王昤璞;基于用戶體驗(yàn)的互聯(lián)網(wǎng)搜索引擎醫(yī)學(xué)信息檢索可用性評估研究[D];吉林大學(xué);2010年

9 李莎莎;面向搜索引擎的自然語言處理關(guān)鍵技術(shù)研究[D];國防科學(xué)技術(shù)大學(xué);2011年

10 白玉琪;空間信息搜索引擎研究[D];中國科學(xué)院研究生院（遙感應(yīng)用研究所）;2003年

相關(guān)碩士學(xué)位論文前10條

1 張蕾;基于Lucene的電子檔案檢索系統(tǒng)的設(shè)計(jì)與實(shí)現(xiàn)[D];西安電子科技大學(xué);2010年

2 李浩;分布式教育網(wǎng)信息檢索系統(tǒng)的研究和實(shí)現(xiàn)[D];華南理工大學(xué);2010年

3 張朝斌;企業(yè)級(jí)搜索引擎的優(yōu)化設(shè)計(jì)與實(shí)現(xiàn)[D];華南理工大學(xué);2010年

4 尉建興;基于Lucene搜索引擎的研究與應(yīng)用[D];太原理工大學(xué);2011年

5 張彬;基于lucene的搜索引擎[D];上海師范大學(xué);2010年

6 徐財(cái)應(yīng);基于Lucene的搜索引擎技術(shù)的研究與改進(jìn)[D];長春理工大學(xué);2010年

7 陳艷斐;基于用戶興趣模型的校園網(wǎng)搜索引擎設(shè)計(jì)與應(yīng)用[D];云南大學(xué);2010年

8 劉志偉;數(shù)學(xué)搜索引擎研究[D];蘭州大學(xué);2011年

9 孫華昱;Lucene在醫(yī)學(xué)影像資源檢索平臺(tái)中的應(yīng)用[D];沈陽工業(yè)大學(xué);2011年

10 薛云;Internet上元搜索引擎的研究與設(shè)計(jì)[D];太原理工大學(xué);2003年

本文編號(hào)：2404556

資料下載

論文發(fā)表

支付寶下載

Download by Alipay
微信下載

Download by Wechat
會(huì)員下載

Download by Member

本文鏈接：http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2404556.html

上一篇：基于搜索日志的查詢推薦算法研究
下一篇：網(wǎng)絡(luò)搜索引擎探析

論文發(fā)表

·知網(wǎng)|萬方|維普|龍?jiān)磡省級(jí)|國家級(jí)|科技核心|北大核心|南大核心CSSCI|EI|SCI|SSCI|

天堂国产午夜亚洲专区-少妇人妻综合久久蜜臀-国产成人户外露出视频在线-国产91传媒一区二区三区

搜索引擎中索引表求交和提前停止技術(shù)優(yōu)化研究