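Research on Key Technologies of Question Retrieval in Community Question Answering
Topics: community question answering + question retrieval. Source: Harbin Institute of Technology, doctoral dissertation, 2014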
[Abstract]: With the advent of the Web 2.0 era, community question answering (CQA) has gradually become an essential way for people to acquire knowledge and information online. Compared with Internet search engines, community question answering can directly return answers to questions that users pose in natural language, rather than a list of retrieval results that users must filter themselves. Compared with traditional open-domain question answering systems, the answers in community question answering are all generated by real users, and their quality is higher than that of answers that an open-domain system automatically extracts and generates from candidate documents. Moreover, because community question answering services have accumulated a large number of question-answer pairs, their core problem and key technology is to retrieve similar, already answered questions and return the corresponding answers, which we call question retrieval.
However, question retrieval in community question answering faces three main challenges: user intent is hard to understand because user questions are verbose; questions suffer from term mismatch because users phrase them in diverse ways; and the ranking of retrieved questions relies solely on textual relevance because the community attributes of questions are not taken into account. In this thesis, we address these three key problems from the following four aspects, thereby improving the overall performance of question retrieval in community question answering.
Chapter 2 proposes a term-importance weighting method based on a dependency relation graph, which addresses the verbosity of user question queries in community question answering. Specifically, a major problem with existing question retrieval models based on term weighting is that they ignore the relations between terms when computing term weights. To solve this problem, we propose a new term weighting mechanism that uses the syntactic dependency relations between terms as clues. For a given question, we first build a dependency graph to compute the association strength of each term pair, and then update the conventional term weights according to the dependency association strength. We verify that the updated term weights can be effectively integrated into existing question retrieval models, and the experimental results show a significant improvement over existing state-of-the-art question retrieval models.
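The abstract does not give the update rule used in Chapter 2, so the following is only a minimal sketch of one way such a re-weighting could look: conventional weights (for example, IDF values) are smoothed by a damped random walk over an undirected graph whose edges are dependency arcs. The function name `reweight_terms`, the `damping` parameter, the random-walk formulation, and the toy weights and arcs are all assumptions made for illustration, not the thesis's method.

```python
from collections import defaultdict


def reweight_terms(base_weights, dependency_arcs, damping=0.85, iterations=50):
    """Smooth conventional term weights over a dependency graph (illustrative only).

    base_weights: dict mapping term -> positive conventional weight (e.g. IDF).
    dependency_arcs: iterable of (head, dependent, strength) with strength > 0.
    Returns a dict of re-weighted scores that sum to 1.
    """
    # Treat every dependency arc as an undirected, strength-weighted edge.
    neighbours = defaultdict(dict)
    for head, dependent, strength in dependency_arcs:
        if head in base_weights and dependent in base_weights:
            neighbours[head][dependent] = neighbours[head].get(dependent, 0.0) + strength
            neighbours[dependent][head] = neighbours[dependent].get(head, 0.0) + strength

    # Normalise the conventional weights so they act as a prior distribution.
    total = sum(base_weights.values())
    prior = {term: weight / total for term, weight in base_weights.items()}
    scores = dict(prior)

    for _ in range(iterations):
        updated = {}
        for term in prior:
            linked = neighbours.get(term, {})
            if linked:
                # Each neighbour passes on its score in proportion to edge strength.
                incoming = sum(
                    scores[other] * strength / sum(neighbours[other].values())
                    for other, strength in linked.items()
                )
            else:
                # A term with no dependency links keeps its own score.
                incoming = scores[term]
            updated[term] = (1 - damping) * prior[term] + damping * incoming
        scores = updated
    return scores


# Toy question: "please best laptop programming" with hand-written arcs.
base = {"best": 1.0, "laptop": 3.0, "programming": 3.0, "please": 0.5}
arcs = [("laptop", "best", 1.0), ("laptop", "programming", 1.0)]
scores = reweight_terms(base, arcs)
```

In this toy case the term that sits at the centre of the dependency structure gains weight, while a term with no dependency links keeps its original share. In practice the arcs would come from a dependency parser rather than being written by hand.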
Chapter 3 proposes a question reformulation model based on phrase paraphrasing, which improves the overall effect of question query expansion. Specifically, term mismatch in question retrieval, caused by the diversity of language expression, has become a pressing problem in community question answering. To solve this problem, we propose a question reformulation mechanism based on phrase-level paraphrasing, which improves question retrieval. Given a question query, we first identify the key phrases in the question by combining features from corpus statistics and from clues internal to the question. Next, we extract paraphrases of the key phrases by fusing the translation results of multiple online translation engines. Finally, we propose a question reformulation method based on a decoding algorithm, which generates the reformulated question on the basis of fusing the key phrases. The improvement in question retrieval on community question answering data sets verifies the effectiveness of the proposed question reformulation algorithm, which significantly outperforms current state-of-the-art question retrieval models.
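The fusion step in Chapter 3 can be pictured as a vote across engines. The sketch below is an assumption about how such fusion might work, not the procedure in the thesis: each "engine" is a hypothetical callable that returns candidate paraphrases for a phrase (how it produces them, for example by translating into another language and back, is outside the sketch), and a candidate is kept only if enough engines agree on it. The name `extract_paraphrases`, the `min_votes` threshold, and the toy engines are illustrative.

```python
from collections import Counter


def extract_paraphrases(phrase, engines, min_votes=2):
    """Keep candidate paraphrases proposed by at least `min_votes` engines.

    engines: list of callables, each mapping a phrase to an iterable of
    candidate paraphrase strings (hypothetical stand-ins for real services).
    Returns (candidate, votes) pairs, most-voted first.
    """
    original = phrase.strip().lower()
    votes = Counter()
    for engine in engines:
        # Normalise and deduplicate so each engine votes once per candidate.
        candidates = {c.strip().lower() for c in engine(phrase)}
        candidates.discard(original)  # the phrase is not a paraphrase of itself
        votes.update(candidates)
    kept = [(cand, n) for cand, n in votes.items() if n >= min_votes]
    # Sort by descending votes, then alphabetically for a stable order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


# Toy engines with fixed outputs, standing in for online translation services.
toy_engines = [
    lambda p: ["notebook computer", "laptop", "portable computer"],
    lambda p: ["Notebook computer", "laptop computer"],
    lambda p: ["portable computer", "notebook computer"],
]
paraphrases = extract_paraphrases("laptop", toy_engines)
```

Requiring agreement between engines is one simple way to filter noisy candidates; a weighted combination of engine confidences would be an equally plausible design.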
Chapter 4 proposes a topic translation and clustering model to expand the terms in question queries. Specifically, in question retrieval models based on statistical machine translation, the relevance ranking mechanism depends mainly on the translation probabilities between terms. However, existing machine translation models do not control the translation noise between terms well, which leaves current question retrieval models imperfect. We propose a question retrieval model based on the topic translation and clustering model and show theoretically that it uses topic inference and the similarity between topics to control the noise of the translation model, thereby improving question retrieval results. Experimental results show that the proposed model significantly outperforms current state-of-the-art question retrieval models on metrics such as MAP, MRR, and P@1.
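For readers unfamiliar with the three metrics named above, the following sketch computes them from per-query relevance judgments. The function names and the toy judgments are illustrative. One simplifying assumption is labeled in the code: average precision is normalised by the number of relevant questions that appear in the ranked list, since the total number of relevant questions per query is not available here.

```python
def precision_at_1(ranked_relevance):
    """1.0 if the top-ranked result is relevant, else 0.0."""
    return 1.0 if ranked_relevance and ranked_relevance[0] else 0.0


def reciprocal_rank(ranked_relevance):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_relevance):
    """Mean of precision values taken at each relevant result.

    Assumption: normalised by the relevant results present in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def evaluate(runs):
    """Average each metric over all queries; `runs` holds one list per query."""
    n = len(runs)
    return {
        "MAP": sum(average_precision(r) for r in runs) / n,
        "MRR": sum(reciprocal_rank(r) for r in runs) / n,
        "P@1": sum(precision_at_1(r) for r in runs) / n,
    }


# Two toy queries: True marks a retrieved question judged relevant.
metrics = evaluate([[True, False, True], [False, True, False]])
```

Each inner list is the ranked output for one query, in rank order, so the same judgments feed all three metrics.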
Chapter 5 introduces the problem of question popularity prediction and uses it to improve question retrieval results for users. Specifically, as community question answering has developed, it has accumulated a large number of high-quality question-answer pairs. These resources not only let users perform question retrieval but, more importantly, allow users to interact with one another. Most research on question answering communities studies question retrieval based on the textual content of questions, and few studies examine how users' personal information and interaction behavior affect question retrieval results. In community question answering, the popularity of a question reflects users' attention, interests, and interaction behavior. We therefore improve the user's question retrieval experience by predicting question popularity. We first analyze and model the factors that affect question popularity in order to predict the popularity of new questions, and then re-rank the user's question retrieval results by the predicted popularity. Experimental results show that question retrieval results re-ranked by popularity are better than those ranked by retrieval relevance alone.
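The re-ranking step in Chapter 5 can be illustrated with a small sketch. The abstract states only that results are re-ranked by predicted popularity, so the details below are assumptions: the `Candidate` type, the min-max normalisation, and the interpolation weight `alpha` between relevance and popularity are hypothetical, and the popularity values are taken as given rather than predicted by a model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    question: str
    relevance: float   # score from the text-based retrieval model
    popularity: float  # predicted popularity (assumed to be supplied)


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rerank(candidates, alpha=0.5):
    """Re-rank by (1 - alpha) * relevance + alpha * popularity.

    alpha = 0 keeps the relevance order; alpha = 1 ranks by popularity only.
    """
    if not candidates:
        return []
    relevance = min_max([c.relevance for c in candidates])
    popularity = min_max([c.popularity for c in candidates])
    scored = [
        ((1 - alpha) * r + alpha * p, c)
        for r, p, c in zip(relevance, popularity, candidates)
    ]
    # Sort on the combined score only; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]


retrieved = [
    Candidate("A", relevance=0.9, popularity=0.1),
    Candidate("B", relevance=0.8, popularity=0.9),
    Candidate("C", relevance=0.2, popularity=0.5),
]
reranked = rerank(retrieved, alpha=0.5)
```

Setting `alpha` to 1 corresponds most directly to ranking purely by predicted popularity; intermediate values are one possible way to keep textual relevance in play.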
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year degree conferred]: 2014
[Classification number]: TP391.3