
Research and Application of Text Similarity Detection Technology Based on Fingerprint Retrieval

Published: 2018-02-13 01:18

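  Keywords: text similarity detection; fingerprint retrieval; b-bit minwise hashing; fine-grained extraction; clustering. Source: Central South University, master's thesis, 2013. Document type: degree thesis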


[Abstract]: The openness of the network and the ease with which text can be copied make academic resources convenient to share, but they also create opportunities for academic misconduct such as plagiarism. From the standpoint of protecting intellectual property and upholding academic integrity, research on text similarity detection has become necessary.

The thesis takes similarity detection for the application documents of a fund program as its application background. To detect similar documents quickly and accurately in a massive document collection, it studies the key technologies of a similarity detection system based on fingerprint retrieval, including fast fingerprint retrieval algorithms and techniques and fingerprint extraction models and methods. The specific work is as follows:

(1) Massive text similarity retrieval suffers from inaccurate similarity estimates when the number of fingerprints is small, and from time-consuming distance computation over high-dimensional vectors. To address these problems, the thesis proposes a parallel retrieval algorithm based on fingerprint grouping: fingerprints are grouped and indexed, and the low-order fingerprints are retrieved first, which reduces the number of document distance computations. CPU+GPU parallel techniques are also used during fingerprint retrieval, which shortens overall retrieval time and improves retrieval accuracy at low similarity thresholds.

(2) Documents are structured, their sections vary, and users pay markedly different attention to different parts of a document. The thesis therefore studies fine-grained partitioning methods, fuzzy matching of marker words, and Chinese word segmentation in order to extract chapters, paragraphs, and sentences accurately at both coarse and fine granularity. To meet the accuracy requirements of fund project detection, it combines the forward maximum matching and reverse maximum matching algorithms, both based on string matching, to ensure the accuracy of feature fingerprint extraction. The resulting fingerprints support the quality of subsequent detection and allow evidence of similarity to be presented intuitively and clearly.

(3) The thesis describes the functional framework and main workflow of the text similarity checking system, analyzes in detail the techniques for document clustering, similarity estimation, detailed document comparison, and result presentation, and studies an implementation that combines the proposed fingerprint-grouping parallel retrieval algorithm with the fine-grained text extraction technique.

The thesis contains 20 figures, 4 tables, and 56 references.
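The keywords name b-bit minwise hashing as the fingerprinting technique. The sketch below shows the general idea only and is not the thesis's grouped, parallel retrieval algorithm: each document becomes k min-hash values truncated to their lowest b bits, and resemblance is estimated from the fraction of matching positions. The function names, the character 3-gram shingling, the hash family, and the simplified estimator (which assumes sets that are small relative to the hash universe) are all assumptions made for illustration.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the illustrative hash family


def shingles(text, n=3):
    """Return the set of character n-grams of text (assumed shingling scheme)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def base_hash(token):
    """Map a shingle to an integer below MERSENNE_PRIME, deterministically."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % MERSENNE_PRIME


def make_hash_params(k, seed=0):
    """Draw k (a, c) pairs defining hash functions h(x) = (a*x + c) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
            for _ in range(k)]


def fingerprint(text, params, b=2, n=3):
    """Keep only the lowest b bits of each of the k min-hash values."""
    hashed = [base_hash(s) for s in shingles(text, n)]
    mask = (1 << b) - 1
    return [min((a * h + c) % MERSENNE_PRIME for h in hashed) & mask
            for a, c in params]


def estimate_resemblance(fp1, fp2, b=2):
    """Estimate Jaccard resemblance from two b-bit fingerprints.

    Simplified estimator: two unrelated minima still agree on b bits with
    probability about 2**-b, so that chance level is subtracted out.
    """
    matches = sum(x == y for x, y in zip(fp1, fp2))
    p_hat = matches / len(fp1)
    chance = 1 / (1 << b)
    return max(0.0, (p_hat - chance) / (1 - chance))


params = make_hash_params(k=128, seed=1)
doc_a = fingerprint("text similarity detection based on fingerprint retrieval", params)
doc_b = fingerprint("text similarity detection using fingerprint retrieval", params)
print(estimate_resemblance(doc_a, doc_b))
```

Storing b bits per hash instead of a full 64-bit value shrinks each fingerprint considerably, at the cost of needing more hash functions for the same estimation variance.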
[Degree-granting institution]: Central South University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1
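Part (2) of the abstract combines forward and reverse maximum matching for Chinese word segmentation but does not say how the two results are reconciled. The following is a minimal sketch under one common convention, which is an assumption here: prefer the segmentation with fewer words, then the one with fewer single-character words, and fall back to the reverse result on a tie. The toy dictionary and all function names are hypothetical.

```python
def forward_max_match(text, vocab, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def count_singles(words):
    """Count single-character words in a segmentation."""
    return sum(len(w) == 1 for w in words)


def bidirectional_max_match(text, vocab):
    """Pick between the two segmentations using the assumed tie-breaking rules."""
    max_len = max((len(w) for w in vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    return fwd if count_singles(fwd) < count_singles(bwd) else bwd


# Hypothetical toy dictionary; a real system would load a full lexicon.
vocab = {"研究", "研究生", "生命", "的", "起源"}
print(bidirectional_max_match("研究生命的起源", vocab))
```

On this sentence the forward scan greedily takes 研究生 and strands 命 as a single character, while the reverse scan recovers 研究 / 生命, which is why the two directions are worth comparing.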

