Research on Relevance Feedback Algorithms in Information Retrieval
Published: 2018-10-14 12:25
[Abstract]: Information retrieval is the field concerned with the structure, analysis, organization, storage, searching and retrieval of information. Broadly speaking, information retrieval means finding, in an unstructured collection of information, the information relevant to a user's need. A core issue in information retrieval is its focus on users and their information needs, because the evaluation of search is user-centered. This view has prompted a large body of research on how people interact with search engines, and in particular the development of techniques that help users express their information needs.
In a retrieval process that involves the user, the user submits a short query, the system returns an initial set of results, and the user marks some of those results as relevant or non-relevant. Based on this feedback the system computes a better query to represent the information need and returns a new batch of results that is more likely to satisfy the user. This process is called relevance feedback. Using relevance feedback during retrieval can improve the query results and raise query efficiency.
Starting from the current state of relevance feedback techniques, this thesis presents the associated algorithms, covering relevance feedback in the vector space model, the probabilistic model and the Boolean model. It concentrates on the Rocchio relevance feedback algorithm, which is based on the vector space model, and describes in detail the idea behind the algorithm, how it is executed, and the cases in which it gives poor query results. Two such cases are considered: queries whose answer set inherently has to be made up of documents from different classes, and terms that usually appear as a disjunction of several more specific concepts. The Rocchio relevance feedback algorithm is improved so that it also returns good results in these two special cases.
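As a point of reference for the feedback loop described above, the following is a minimal sketch of the standard textbook Rocchio update, not the improved variant proposed in the thesis. The function names `centroid` and `rocchio`, the toy three-term vectors, and the weights `alpha=1.0`, `beta=0.75`, `gamma=0.15` are all illustrative assumptions; the abstract does not state which weights were used.

```python
from typing import List, Sequence

Vector = List[float]


def centroid(docs: Sequence[Sequence[float]], dim: int) -> Vector:
    """Mean of the document vectors; the zero vector if no documents are given."""
    if not docs:
        return [0.0] * dim
    return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]


def rocchio(
    query: Sequence[float],
    relevant: Sequence[Sequence[float]],
    non_relevant: Sequence[Sequence[float]],
    alpha: float = 1.0,   # weight of the original query (assumed value)
    beta: float = 0.75,   # weight of the relevant centroid (assumed value)
    gamma: float = 0.15,  # weight of the non-relevant centroid (assumed value)
) -> Vector:
    """Move the query toward the relevant centroid and away from the non-relevant one."""
    dim = len(query)
    rel_c = centroid(relevant, dim)
    non_c = centroid(non_relevant, dim)
    # Negative term weights are clipped to zero, as is common practice.
    return [
        max(0.0, alpha * q + beta * r - gamma * n)
        for q, r, n in zip(query, rel_c, non_c)
    ]


# Toy example over a three-term vocabulary (hypothetical data).
query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
non_relevant = [[0.0, 0.0, 1.0]]
new_query = rocchio(query, relevant, non_relevant)
print(new_query)
```

In this toy run the second term gains weight because it occurs in a document marked relevant, while the third term, which occurs only in the non-relevant document, stays at zero after clipping.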
The thesis makes the following contributions:
(1) For queries that contain multiple conditions, and following the idea of the Rocchio relevance feedback algorithm, the set of documents that contain information on both conditions is treated as a new intersection class. Within this intersection class, the search starts at the centroid nearest to the initial query and moves step by step toward the other centroid, finding the desired result along the way. The improved Rocchio relevance feedback algorithm effectively addresses the unsatisfactory results returned for multi-condition queries.
(2) For queries on polysemous words, the results returned by the system are often disordered. The thesis designs an algorithm that clusters the attributes of the results: the hierarchical shrinkage algorithm. The algorithm first obtains the keywords of the results returned by the system and represents them as a Boolean matrix. It then computes the similarity between documents, using the number of keywords that documents share as the measure, and merges documents hierarchically by conjunction according to that similarity. After clustering, the resulting labels are extracted. Leaving recall aside, the final result of the algorithm converges to keywords that are highly representative of the documents in each cluster, and it achieves high precision.
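The steps listed for contribution (2) can be illustrated with a small sketch. This is one plausible reading of the abstract and not the thesis's actual hierarchical shrinkage algorithm: the function name `shrink_clusters`, the greedy choice of the most similar pair, the stopping threshold `min_shared`, and the toy documents are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shrink_clusters(
    doc_keywords: Dict[str, Set[str]],
    min_shared: int = 1,  # assumed stopping threshold, not taken from the thesis
) -> List[Tuple[List[str], List[str]]]:
    """Greedy bottom-up merging; returns (document ids, label keywords) per cluster."""
    vocab = sorted(set().union(*doc_keywords.values()))
    # Boolean matrix: one row per document, row[j] is True if it has vocab[j].
    clusters = [
        ([doc], [kw in kws for kw in vocab])
        for doc, kws in sorted(doc_keywords.items())
    ]

    def shared(pair: Tuple[int, int]) -> int:
        # Similarity = number of keywords the two rows have in common.
        row_a, row_b = clusters[pair[0]][1], clusters[pair[1]][1]
        return sum(a and b for a, b in zip(row_a, row_b))

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2), key=shared)
        # Merge by conjunction: keep only keywords present in both rows.
        merged_row = [a and b for a, b in zip(clusters[i][1], clusters[j][1])]
        if sum(merged_row) < min_shared:
            break
        merged = (clusters[i][0] + clusters[j][0], merged_row)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return [
        (sorted(docs), [kw for kw, on in zip(vocab, row) if on])
        for docs, row in clusters
    ]


# Hypothetical results for the polysemous query "apple".
results = {
    "d1": {"apple", "fruit", "orchard"},
    "d2": {"apple", "fruit", "juice"},
    "d3": {"apple", "iphone", "mac"},
    "d4": {"apple", "iphone", "store"},
}
labelled = shrink_clusters(results, min_shared=2)
print(labelled)
```

Because each merge keeps only the keywords common to both sides, the keyword set of a cluster can only shrink, which is consistent with the abstract's remark that the result converges to keywords representative of the whole cluster. A document that shares too few keywords with any cluster stays unmerged, which is one way recall may be traded for precision under these assumptions.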
[Degree-granting institution]: Henan University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
Article ID: 2270449
Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2270449.html