

搜索引擎中重復(fù)網(wǎng)頁(yè)檢測(cè)算法研究

Published: 2018-05-12 01:15

Topic keywords: search engine; duplicate web page detection. Source: Master's thesis, Henan University of Technology, 2012.


[Abstract]: With the popularization and rapid development of the Internet, online information is growing exponentially, and search engines have become an effective tool for users to find the information they need among massive web resources. However, because publishing information on the web is easy and follows no clear, uniform standard, the Internet contains large numbers of duplicate and near-duplicate pages. These pages harm search engines in several ways: they degrade the user experience, waste crawling and storage resources, inflate the inverted index, and reduce retrieval efficiency. Duplicate-page detection technology can therefore effectively improve search engine quality.

In recent years, major search engine companies and researchers at home and abroad have proposed a variety of duplicate-page detection algorithms, such as signature-based algorithms, I-Match, the feature-term-based algorithm, and the DSC algorithm. A detailed analysis of the existing algorithms shows that they share a common idea: first extract some information from the text, then use the extracted information to judge similarity. The algorithms differ in their strategies for extracting text information, which in turn leads to different similarity computations; some also compress the extracted information to improve computational efficiency. Whether effective information can be extracted from the text content to represent it accurately is thus the key factor determining the performance of duplicate-page detection.

This thesis analyzes two classical duplicate-page detection algorithms in detail and improves on their shortcomings. The main contributions are as follows:

(1) An improvement to the DSC duplicate-page detection algorithm. DSC (Digital Syntactic Clustering) is a classical algorithm whose basic idea is to split the text into a number of shingles and then select some of them for similarity comparison. Its weakness is that the shingles are selected at random, without fully exploiting the content features of the text. To address this, the improved algorithm maintains a set of feature terms and selects only the shingles that contain a feature term, so the shingles used in the similarity comparison better reflect the structural and content features of the text.

(2) An improvement to the feature-term-based duplicate-page detection algorithm. This algorithm first uses the TFIDF scheme from traditional information retrieval to extract the feature terms of a text, represents the text as a vector over those feature terms, and then judges similarity with the cosine formula. The weakness of TFIDF is that it ignores a feature term's position in the text when computing its weight. Observation of web pages shows that page texts are typically short, usually carry a title, and the title is a concise summary of the content. Exploiting this, the improved algorithm boosts the weight of feature terms that appear in the title.

(3) Performance evaluation of the improved algorithms. A prototype search engine was implemented on top of the open-source indexing and retrieval toolkit Lucene to validate the improvements. Experimental results show that both improved algorithms achieve higher recall and precision in duplicate-page detection than the original algorithms.
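The DSC-style pipeline described in (1) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: text is split into word k-grams (shingles), the improved selection step keeps only shingles containing a feature term, and documents are compared by Jaccard similarity of their shingle sets. The function names, the choice of k, and the example feature terms are all illustrative assumptions.

```python
def shingles(text, k=4):
    """Split text into overlapping word k-grams (shingles)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def select_shingles(shingle_set, feature_terms):
    """Improved selection: keep only shingles containing a feature term,
    instead of sampling shingles at random as in the original DSC."""
    return {s for s in shingle_set if any(t in s for t in feature_terms)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "search engines index a large number of duplicate web pages every day"
doc2 = "search engines index a large number of near duplicate web pages daily"
s1, s2 = shingles(doc1), shingles(doc2)
sim = jaccard(s1, s2)                       # similarity over all shingles
sel = select_shingles(s1, {"duplicate"})    # feature-term-filtered subset
```

In practice the shingle sets of two near-duplicate pages overlap heavily, so a threshold on the Jaccard score (e.g. 0.9) flags the pair as duplicates; the feature-term filter shrinks the sets that must be stored and compared.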
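The title-boosted TFIDF weighting and cosine comparison described in (2) can likewise be sketched. This is an assumed minimal form, not the thesis's exact formula: term weights are term frequency times a smoothed IDF, and terms that also appear in the document's title have their weight multiplied by a boost factor. The `title_boost` value and the smoothing `log(n/df) + 1` are illustrative choices.

```python
import math
from collections import Counter

def tfidf_vectors(docs, titles, title_boost=2.0):
    """docs: list of token lists; titles: parallel list of title-token sets.
    Terms appearing in a document's title get their weight boosted,
    reflecting the positional improvement to plain TFIDF."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d))
    vecs = []
    for d, title in zip(docs, titles):
        tf = Counter(d)
        v = {}
        for term, f in tf.items():
            w = f * (math.log(n / df[term]) + 1.0)  # tf * smoothed idf
            if term in title:
                w *= title_boost                    # title terms weigh more
            v[term] = w
        vecs.append(v)
    return vecs

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(u[t] * v[t] for t in set(u) & set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "search engine duplicate page detection".split(),
    "search engine near duplicate page detection".split(),
    "cooking pasta recipes".split(),
]
titles = [{"duplicate", "page"}, {"duplicate", "page"}, {"pasta"}]
vecs = tfidf_vectors(docs, titles)
sim_dup = cosine(vecs[0], vecs[1])   # near-duplicate pair
sim_diff = cosine(vecs[0], vecs[2])  # unrelated pair
```

A pair is declared a duplicate when its cosine score exceeds a tuned threshold; boosting title terms pushes pages that share a title closer together in the vector space, which is what raises recall on short web texts.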
【Degree-granting institution】: Henan University of Technology
【Degree level】: Master's
【Year awarded】: 2012
【CLC classification】: TP391.3



