

Research on Document Copy Detection Methods and System Implementation

Published: 2018-02-14 18:04

  Keywords: text copy detection; online copy detection; keyword extraction; similarity calculation; inverted index. Source: Harbin Institute of Technology, master's thesis, 2012. Type: degree thesis


[Abstract]: With the rapid development of the Internet, online information resources have grown increasingly rich and exchanging information has become ever more convenient. Because electronic resources such as text, pictures, and video are so easy to copy, however, the web now carries a great deal of redundant content, which lowers the retrieval efficiency of search engines and makes information extraction harder. In recent years, plagiarism of homework and papers has also become frequent at some universities. Document copy detection, which serves to improve retrieval efficiency, protect intellectual property, and uphold academic integrity, has therefore become a research hotspot in natural language processing.

This thesis studies document copy detection in detail. Building on previous work, it improves the document copy detection method based on sentence similarity calculation, and reports that the improvement greatly raises both detection efficiency and detection accuracy.

First, the thesis introduces the background, significance, state of research in China and abroad, and related technologies of document copy detection, and analyzes the advantages and disadvantages of commonly used text copy detection algorithms.

Second, starting from the traditional BSP copy detection algorithm, it proposes a sentence similarity algorithm based on the ordered longest common keyword sequence and a local (partial-sentence) copy detection algorithm based on keyword distance. It also designs a two-level inverted index structure, word-to-sentence and sentence-to-document, which improves detection accuracy and efficiency.

Third, based on the proposed method, a text copy detection system is designed and implemented. Following practical application requirements, its main functions include document registration, document retrieval, synonym maintenance, local copy detection, distributed copy detection, online copy detection, network settings, system settings, and document library management.

Finally, experimental results demonstrate the practicality and effectiveness of the document copy detection method studied in this thesis.
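The abstract names the two core ideas (sentence similarity from an ordered longest common keyword sequence, and a word-to-sentence inverted index) without giving formulas. The Python sketch below is one plausible reading, not the thesis's actual algorithm: the normalisation `2 * LCS / (len(a) + len(b))` and the function names `lcs_length`, `sentence_similarity`, `build_word_index`, and `candidate_sentences` are all assumptions made for illustration. Keyword extraction and word segmentation are assumed to have happened already, so each sentence is a list of keywords.

```python
from collections import defaultdict


def lcs_length(a, b):
    """Length of the longest common subsequence of two keyword lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sentence_similarity(a, b):
    """Assumed normalisation (not from the thesis): 2 * LCS / (|a| + |b|)."""
    if not a or not b:
        return 0.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


def build_word_index(sentences):
    """Word-to-sentence inverted index: keyword -> set of sentence ids."""
    index = defaultdict(set)
    for sid, keywords in sentences.items():
        for word in keywords:
            index[word].add(sid)
    return index


def candidate_sentences(index, query):
    """Ids of registered sentences sharing at least one keyword with the query."""
    found = set()
    for word in query:
        found |= index.get(word, set())
    return found


# Hypothetical registered sentences, already reduced to keyword lists.
registered = {
    0: ["document", "copy", "detection", "method"],
    1: ["inverted", "index", "structure"],
}
word_index = build_word_index(registered)
query = ["copy", "detection", "system", "method"]
candidates = candidate_sentences(word_index, query)
scores = {sid: sentence_similarity(registered[sid], query) for sid in candidates}
```

The index restricts the comparatively expensive LCS comparison to sentences that share at least one keyword with the query, which is the usual reason for pairing the two techniques. A sentence-to-document index could be built the same way, mapping sentence ids to the documents that contain them.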
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1





Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/1511286.html

