Research on a Web Page Ranking Model for Topic Crawlers Based on Ontology Concept Similarity
Published: 2018-10-04 19:47
[Abstract]: Compared with general-purpose search engines, a topic search engine dedicated to one specific domain can collect information with higher precision and give users a better retrieval service. The topic crawler is the core module of such an engine, so improving the domain relevance of the information it retrieves is particularly important. However, because Web resources are enormous in scale and grow in a highly dynamic way, crawl results still contain a large number of irrelevant pages, which lowers collection efficiency. To address this problem, this thesis studies the relevance analysis techniques used in topic crawler design, chiefly web page ranking algorithms, and summarizes the strengths and weaknesses of current ranking algorithms. Taking the characteristics of the salt lake domain into account and exploiting the ability of ontologies to express semantics, it proposes a new web page ranking algorithm based on ontology concept similarity, with the aim of improving the accuracy of topic relevance computation. The method first selects suitable web pages as the initial seed set. It then builds a salt lake domain ontology to obtain an ontology concept set, classifies the concepts in that set, and assigns them weights. A concept similarity measure is used to compute the similarity between every concept found in a web page and the concepts in the ontology concept set. Pages are ranked by their combined score, and high-scoring pages are stored in the topic crawler in preparation for later page collection. Finally, experiments show that the algorithm not only greatly reduces irrelevant results and raises the topical relevance of the collected pages, but also improves retrieval precision.
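The scoring pipeline described in the abstract (weighted ontology concepts, concept-to-concept similarity, a combined page score, then ranking) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's actual algorithm: the abstract gives no formulas, so the toy hierarchy, the concept names, the weights, the Wu-Palmer-style similarity, and the weighted-average aggregation are all hypothetical stand-ins.

```python
# Hypothetical toy is-a hierarchy (child -> parent); not taken from the thesis.
PARENT = {
    "salt_lake": None,
    "brine": "salt_lake",
    "mineral": "salt_lake",
    "lithium": "mineral",
    "potash": "mineral",
}

# Hypothetical weight for each ontology concept.
WEIGHT = {
    "salt_lake": 1.0,
    "brine": 0.8,
    "mineral": 0.6,
    "lithium": 0.9,
    "potash": 0.9,
}


def ancestors(concept):
    """Return the path from a concept up to the root, inclusive."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth in the hierarchy; the root has depth 1."""
    return len(ancestors(concept))


def similarity(a, b):
    """Wu-Palmer-style similarity (an assumed measure, chosen for illustration).

    Returns 0.0 when either concept is not in the ontology.
    """
    if a not in PARENT or b not in PARENT:
        return 0.0
    ancestors_of_b = set(ancestors(b))
    # First shared ancestor walking up from a is the lowest common ancestor.
    lca = next(x for x in ancestors(a) if x in ancestors_of_b)
    return 2 * depth(lca) / (depth(a) + depth(b))


def page_score(page_concepts):
    """Weighted average, over ontology concepts, of the best match in the page."""
    if not page_concepts:
        return 0.0
    total = sum(
        weight * max(similarity(p, concept) for p in page_concepts)
        for concept, weight in WEIGHT.items()
    )
    return total / sum(WEIGHT.values())


def rank_pages(pages):
    """Sort page identifiers by descending score."""
    return sorted(pages, key=lambda url: page_score(pages[url]), reverse=True)


if __name__ == "__main__":
    # Hypothetical pages, each reduced to the concepts extracted from it.
    pages = {
        "page_a": ["lithium", "brine"],
        "page_b": ["weather", "hotel"],
        "page_c": ["mineral"],
    }
    for url in rank_pages(pages):
        print(url, round(page_score(pages[url]), 3))
```

In this sketch a page whose concepts all fall outside the ontology scores zero and sinks to the bottom of the ranking, which mirrors the stated goal of filtering out irrelevant pages. A real system would also need a concept extraction step for each page, which is omitted here.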
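[Degree-granting institution]: Beijing Information Science and Technology University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.1
Article ID: 2251648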
Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2251648.html