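Research on Partial-Duplicate Detection in Documents
Published: 2018-04-16 18:38
Topics: partial-duplicate detection in documents + Low-IDF-SIG feature extraction algorithm; Source: Fudan University, master's thesis, 2012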
[Abstract]: With the explosive growth of data on the Internet, a large amount of duplicate content has appeared online. This duplicate content causes serious problems for many Web applications, such as search engines and opinion mining. Most existing duplicate-detection algorithms work at the document level and cannot effectively detect cases where only part of one document duplicates part of another. This thesis proposes an algorithm for partial-duplicate detection in documents. The approach splits the task into two sub-problems: sentence-level duplicate detection and sequence matching. First, the thesis proposes a fast and effective sentence-level feature extraction method, the Low-IDF-SIG algorithm, and uses it to build a detection system that efficiently finds sentence-level duplicates. To evaluate the accuracy and efficiency of the method, the author compares it with other methods on a real-world corpus. The experimental results show that the proposed method improves both the efficiency and the accuracy of sentence-level duplicate detection. In addition, the thesis proposes PDC-MR-Ⅱ, a partial-duplicate detection algorithm based on the MapReduce paradigm, and uses it to implement an efficient distributed partial-duplicate detection system. The algorithms and systems presented in the thesis can be applied widely to tasks such as plagiarism detection in academic papers, duplicate-topic detection in forums, and duplicate detection for paginated news articles.
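The two sub-problems named in the abstract can be illustrated with a small Python sketch. The abstract does not describe how Low-IDF-SIG or the sequence-matching step actually work, so everything below is an assumption made for illustration: the sketch takes a sentence signature to be the `k` lowest-IDF words of that sentence, and takes sequence matching to be finding the longest run of consecutive sentences whose signatures agree. The function names (`idf_table`, `signature`, `longest_duplicate_run`) and the sample documents are hypothetical, and the MapReduce distribution of PDC-MR-Ⅱ is not modeled.

```python
import math
from collections import defaultdict


def tokenize(sentence):
    """Lowercase whitespace tokenizer (a simplification for this sketch)."""
    return sentence.lower().split()


def idf_table(sentences):
    """Smoothed IDF, treating each sentence as one 'document'."""
    n = len(sentences)
    df = defaultdict(int)
    for s in sentences:
        for w in set(tokenize(s)):
            df[w] += 1
    return {w: math.log((n + 1) / (c + 1)) + 1 for w, c in df.items()}


def signature(sentence, idf, k=3):
    """ASSUMED signature: the k lowest-IDF distinct words, as a sorted tuple.

    This is one plausible reading of the name Low-IDF-SIG, not the
    thesis's definition. Ties are broken alphabetically.
    """
    words = set(tokenize(sentence))
    ranked = sorted(words, key=lambda w: (idf.get(w, float("inf")), w))
    return tuple(sorted(ranked[:k]))


def longest_duplicate_run(sigs_a, sigs_b):
    """Longest run of consecutive sentences with equal signatures.

    Returns (start_a, start_b, length) using 0-based sentence indices.
    Standard longest-common-substring dynamic programming over signatures.
    """
    best = (0, 0, 0)
    prev = [0] * (len(sigs_b) + 1)
    for i, sa in enumerate(sigs_a, 1):
        cur = [0] * (len(sigs_b) + 1)
        for j, sb in enumerate(sigs_b, 1):
            if sa == sb:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


# Hypothetical sample documents, already split into sentences.
doc_a = [
    "the cat sat on the mat",
    "a dog barked at the mailman",
    "rain fell all night long",
    "we went home early",
]
doc_b = [
    "completely different opening line here",
    "a dog barked at the mailman",
    "rain fell all night long",
    "another unrelated closing sentence",
]

idf = idf_table(doc_a + doc_b)
sigs_a = [signature(s, idf) for s in doc_a]
sigs_b = [signature(s, idf) for s in doc_b]
run = longest_duplicate_run(sigs_a, sigs_b)
print(run)  # (start in doc_a, start in doc_b, number of sentences)
```

A signature built from only a few words can collide for unrelated sentences, so a practical system would likely verify candidate pairs with a stricter similarity measure before reporting them; that step is omitted here.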
Article ID: 1760120
Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/1760120.html