Research on a DHT-Based Distributed Price Search Engine
Published: 2019-01-26 21:01
[Abstract]: In recent years, as network resources have diversified and demand for domain-specific information has grown, vertical search engines have attracted increasing research attention. Price-oriented search is one kind of vertical search. Almost all existing price search engines, however, are centralized: when a large number of users issue requests at the same time, the central server becomes a bottleneck and is prone to a single point of failure. As networks continue to grow, research on distributed vertical search becomes increasingly important. This thesis combines P2P technology with vertical search and designs a distributed price search engine based on a distributed hash table (DHT). It discusses the crawling strategy of the topic crawler, the use of URL rules to judge whether a web page is relevant to the topic, and the use of XPath to extract information from web pages. It then discusses how the DHT idea is used to build the index and store it in a distributed manner, which avoids the problems a centralized index may face. Finally, to address the unclear and disorganized presentation of results in existing price search engines, the thesis proposes clustering the search results. Based on a study and analysis of existing clustering algorithms, it improves the k-means algorithm and uses the improved algorithm to cluster the search results so that document similarity is high within a cluster and low between clusters. Each cluster is then described by a class label. Users only need to follow the class labels to the information that interests them, rather than browsing every returned result one by one, which saves considerable browsing and searching time.
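The abstract does not say which DHT protocol the thesis uses or how index entries are keyed. As a minimal sketch of the general idea only, the Python example below assumes a Chord-style consistent-hashing ring in which each index term is stored on the first node whose identifier follows the term's hash. The names `ring_id`, `DhtIndex`, the node names, and the example URLs are all hypothetical and are not taken from the thesis; a real DHT would also route lookups across the network rather than hold every node in one process.

```python
import hashlib
from bisect import bisect_left
from collections import defaultdict


def ring_id(key: str) -> int:
    """Map a string onto the identifier ring using SHA-1 (160-bit ids)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class DhtIndex:
    """Toy in-process model of a term index partitioned across DHT nodes.

    Assumption: each term is owned by its successor node on the ring,
    as in Chord-style consistent hashing.
    """

    def __init__(self, node_names):
        # Place every node on the ring, sorted by identifier.
        self._ring = sorted((ring_id(name), name) for name in node_names)
        self._ids = [node_id for node_id, _ in self._ring]
        # Each node keeps its own partial inverted index: term -> set of URLs.
        self._postings = {name: defaultdict(set) for name in node_names}

    def node_for(self, term: str) -> str:
        """Return the node responsible for a term (its successor on the ring)."""
        # The modulo wraps around to the first node past the largest id.
        pos = bisect_left(self._ids, ring_id(term)) % len(self._ring)
        return self._ring[pos][1]

    def add(self, term: str, doc_url: str) -> None:
        """Store a posting on the node that owns the term."""
        self._postings[self.node_for(term)][term].add(doc_url)

    def lookup(self, term: str) -> set:
        """Fetch the postings for a term from the node that owns it."""
        return set(self._postings[self.node_for(term)].get(term, set()))


if __name__ == "__main__":
    index = DhtIndex(["node-a", "node-b", "node-c"])
    index.add("laptop", "http://example.com/product/1")
    index.add("laptop", "http://example.com/product/2")
    print(index.node_for("laptop"), sorted(index.lookup("laptop")))
```

Because the owner of a term is computed from its hash alone, any peer can locate a posting list without consulting a central server, which is the property the abstract relies on to avoid a centralized bottleneck.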
[Degree-granting institution]: Xihua University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3
Article ID: 2415907
Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2415907.html