Research on Search Result Clustering Based on Label Word Extraction
Published: 2018-09-03 12:29
[Abstract]: We live in an era of "information explosion", and a wide variety of search engines has emerged in response. Because information on the Web is semi-structured or unstructured, retrieval results still contain pages irrelevant to the user's query, despite the many methods used to improve retrieval precision. Relevance ranking and similar methods help, but they still do not present results to users conveniently. To make it easier for users to find the pages that interest them, the results returned by a search engine can be clustered so that users can browse pages by topic category, which reduces the burden of browsing.
Building on a study of the current state of Chinese text clustering, this thesis summarizes its key technologies, including text preprocessing, text representation models, feature extraction, feature dimensionality reduction, text similarity computation, and existing clustering algorithms, and it analyzes and compares the existing clustering algorithms. The thesis then studies text similarity computation, covering document similarity and dissimilarity computation as well as proximity measures between clusters. It also analyzes support vector regression theory and its technical implementation.
The thesis proposes a text clustering method based on label word extraction, whose goal is to cluster the search results returned by a search engine, and it implements a text clustering system. First, the web documents returned in the search results are preprocessed, including noise removal, word segmentation, and stop-word removal. Next, trigram (3-gram) model words are extracted as label words, a scoring method based on a supervised model is proposed, and the label words undergo post-processing such as similar-word substitution and word-string merging. Finally, the corpus is clustered according to the label words using hierarchical clustering.
The thesis designs the clustering system and evaluates it experimentally. The experiments cover label word extraction, support vector regression statistics, and label word clustering. They show that the algorithm performs well when clustering search results and groups documents of similar categories into the same cluster.
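The pipeline described in the abstract (preprocessing, label word extraction, scoring, hierarchical clustering) can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and it makes several assumptions: snippets are already word-segmented and whitespace-separated (a real system for Chinese would need a segmenter), the stop list is illustrative, "3-gram model words" is read as word n-grams of length one to three, and a plain document-frequency score stands in for the supervised (support vector regression) scorer. All function names and the `top_k`, `min_df`, and `threshold` parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "a", "of", "and", "for", "in", "to"}  # illustrative stop list


def preprocess(snippet):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in snippet.lower().split() if t not in STOPWORDS]


def candidate_labels(tokens, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) found in one snippet."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def select_labels(docs_tokens, top_k=20, min_df=2):
    """Score candidates by document frequency (a stand-in for the SVR scorer)."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(candidate_labels(tokens))
    # Prefer frequent candidates, then longer phrases; ties broken alphabetically.
    ranked = sorted(
        (c for c in df if df[c] >= min_df),
        key=lambda c: (-df[c], -len(c.split()), c),
    )
    return set(ranked[:top_k])


def jaccard(a, b):
    """Jaccard similarity of two label sets; two empty sets count as dissimilar."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(doc_labels, threshold=0.2):
    """Average-linkage agglomerative clustering over per-document label sets."""
    clusters = [[i] for i in range(len(doc_labels))]

    def linkage(c1, c2):
        sims = [jaccard(doc_labels[i], doc_labels[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters (x < y always holds here).
        (x, y), best = max(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


def cluster_snippets(snippets, top_k=20, threshold=0.2):
    """Run the whole sketch: preprocess, pick labels, cluster by shared labels."""
    docs_tokens = [preprocess(s) for s in snippets]
    labels = select_labels(docs_tokens, top_k=top_k)
    doc_labels = [candidate_labels(t) & labels for t in docs_tokens]
    return cluster(doc_labels, threshold)
```

Calling `cluster_snippets` on a list of result snippets returns lists of snippet indices, one list per cluster. Merging stops once the best average Jaccard similarity between two clusters falls below `threshold`, so the number of clusters does not have to be fixed in advance.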
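[Degree-granting institution]: Beijing University of Posts and Telecommunications (北京郵電大學)
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.1
Article ID: 2219983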
Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2219983.html