Research and System Implementation of Tibetan Search and Search Result Clustering

Published: 2018-07-06 16:45

Topics: Tibetan word segmentation + Tibetan clustering. Source: Southwest Jiaotong University, master's thesis, 2013

[Abstract]: Tibetan has a long history and is the carrier of Tibetan culture and civilization, with more than 6 million users. Tibetan literature is vast in volume and broad in content. Since the Windows operating system began supporting Tibetan, Tibetan speakers' enthusiasm for taking part in online activities has grown steadily. However, there is currently no Tibetan search engine, and the major search engines at home and abroad do not offer Tibetan search, so research on Tibetan search systems is of great significance. This thesis centers on how to build a Tibetan search system. It studies Tibetan word segmentation, Tibetan text collection, text processing, encoding conversion, index search, and result clustering, with the aim of implementing a full-featured Tibetan information retrieval system. The main work is as follows.

First, an AllCut Tibetan word segmentation algorithm is proposed. Tibetan has no delimiters between words, so text must be segmented. Current segmentation algorithms are mainly based on statistical probability, part-of-speech tagging, or grammar rules. These algorithms either need large training corpora or are complex to implement, so under present conditions they are hard to realize or perform poorly. The proposed scheme therefore uses dictionary matching, combined with the grammatical characteristics of Tibetan and its case-particle and continuation features, together with fine-grained segmentation. It achieves good segmentation results and provides a foundation for the subsequent work.

Second, Tibetan clustering is studied. The thesis first examines text representation and Tibetan stop words in the context of Tibetan clustering: documents are represented with the vector model so that text can be stored and processed by computer, and Tibetan stop words are obtained from statistics over a large number of documents, which removes their interference with clustering. It then systematically studies how partitioning and hierarchical clustering algorithms perform on Tibetan.

Third, Tibetan information retrieval is studied and a system is implemented. The retrieval research covers Tibetan web page collection, Tibetan encoding conversion, Tibetan web page preprocessing, and Tibetan text storage, which together address computer processing and retrieval of Tibetan. The search system is then implemented on top of Lucene. It can automatically discover and update Tibetan resources and provides Tibetan search, fulfilling the functions of a Tibetan search engine. Search results are displayed in clusters using the Tibetan clustering work, which improves their pertinence and accuracy.
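
The dictionary-matching idea mentioned in the abstract can be illustrated with a small sketch. This is not the thesis's AllCut algorithm, whose details the abstract does not give. It is a generic forward maximum matching over syllables, assuming that syllables are separated by the Tibetan tsheg mark (U+0F0B) and that a word list is available. The names `split_syllables`, `segment`, and `max_len`, and the toy dictionary, are hypothetical and chosen only for illustration. Case-particle and continuation rules are not modeled.

```python
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark that separates syllables


def split_syllables(text):
    """Split text on the tsheg mark and drop empty pieces."""
    return [s for s in text.split(TSHEG) if s]


def segment(text, dictionary, max_len=4):
    """Forward maximum matching over syllables (illustrative only).

    At each position, try the longest run of syllables (up to max_len)
    that appears in the dictionary. If none matches, emit one syllable.
    """
    syllables = split_syllables(text)
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_len, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = TSHEG.join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


if __name__ == "__main__":
    # Toy dictionary with one two-syllable entry (assumed for the demo).
    toy_dictionary = {TSHEG.join(["བོད", "ཡིག"])}
    sample = TSHEG.join(["བོད", "ཡིག", "འཚོལ"])
    print(segment(sample, toy_dictionary))
```

A longest-match strategy is only one possible choice. A fine-grained variant could instead emit every dictionary word found at each position, which tends to suit indexing for search at the cost of more tokens.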
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.1



Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2103468.html
