Research and Application of Duplicate Data Detection Technology Based on the Hadoop Distributed System

Published: 2018-05-29 10:26

Topic: cloud computing + Hadoop; Reference: Hunan University, 2013 master's thesis

[Abstract]: With the rapid development of information technology, cloud computing and data deduplication have both advanced quickly. Cloud computing, with its powerful distributed computing capability, low cost, and high reliability, plays a dominant role in massive data processing. However, when data is archived in a Hadoop system, a large amount of duplicate data accumulates and reduces the system's processing efficiency. Data deduplication is a popular storage technology that optimizes storage capacity and greatly reduces wasted physical storage space, helping to meet ever-growing storage demand. Combining cloud computing with data deduplication is therefore a win-win solution. To address these problems, this thesis analyzes the characteristics of the Hadoop cloud computing platform and of data deduplication, and uses the Hadoop distributed platform to manage massive data. For the large amount of duplicate data in Hadoop systems, it proposes a duplicate detection technique based on data deduplication: the fingerprint algorithm BLAKE generates a fingerprint for each data block, deduplication is performed at data-block granularity, and duplicates are removed in-line. The SHA-3 family of hash algorithms is recognized in industry for its computational advantages. This thesis is the first to use the SHA-3 candidate BLAKE as the fingerprint function in duplicate data detection, replacing the conventional fingerprint algorithm MD5 for fingerprint generation and matching. The algorithm's software design and implementation are described in detail, and in the experiments it performs considerably better than MD5.
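The block-level, in-line scheme described above can be illustrated with a minimal sketch. It is not the thesis's implementation: Python's standard library has no BLAKE (the SHA-3 candidate), so its successor BLAKE2b from `hashlib` is swapped in, and the fixed block size, the `InlineDedupStore` class, and the in-memory dictionary index are all hypothetical choices made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size; the abstract does not state one


def fingerprint(block: bytes) -> bytes:
    """Return a 32-byte fingerprint of one data block.

    Assumption: BLAKE2b stands in for the BLAKE algorithm named in the abstract.
    """
    return hashlib.blake2b(block, digest_size=32).digest()


class InlineDedupStore:
    """In-memory stand-in for a block store that deduplicates on write."""

    def __init__(self, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}        # fingerprint -> block bytes (unique blocks only)
        self.logical_bytes = 0  # bytes written by callers, before deduplication

    def write(self, data: bytes) -> list:
        """Split data into blocks, store only unseen blocks, return the recipe."""
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            fp = fingerprint(block)
            # In-line: the fingerprint is matched before the block is stored.
            if fp not in self.blocks:
                self.blocks[fp] = block
            recipe.append(fp)
        self.logical_bytes += len(data)
        return recipe

    def read(self, recipe: list) -> bytes:
        """Rebuild the original data from its list of fingerprints."""
        return b"".join(self.blocks[fp] for fp in recipe)

    def stored_bytes(self) -> int:
        """Physical bytes held after deduplication."""
        return sum(len(block) for block in self.blocks.values())
```

Writing a file returns a list of fingerprints (its recipe), and comparing `logical_bytes` with `stored_bytes()` gives a rough measure of the space saved. A real deployment would keep the fingerprint index in persistent, distributed storage rather than a Python dictionary.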
Finally, the research is applied to the Internet of Vehicles, with Hadoop used to store and manage large-scale Internet of Vehicles data. Based on the characteristics of the HBase data model, a distributed storage model for traffic data is designed, including detailed designs of the main table and the reverse table, which supports users' conditional queries to a certain extent. Data deduplication is then used to detect duplicate data that arises when Internet of Vehicles data is archived. Experiments on three sets of vehicle terminal data, with a detailed performance analysis, show that the approach greatly reduces disk storage consumption, improves storage efficiency, and eliminates redundancy in stored data.
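The abstract does not give the actual schema of the main table and the reverse table, so the following is only a sketch of how such a pair could support conditional queries. Everything in it is an assumption: the row-key layouts, the field names (`road`, `speed`), and the `TinyTable` class, which is a plain dictionary standing in for an HBase table with sorted row keys and prefix scans.

```python
import struct


def main_row_key(vehicle_id: str, ts_ms: int) -> bytes:
    """Hypothetical main-table key: vehicle id, then a big-endian timestamp,
    so that one vehicle's rows sort in time order."""
    return vehicle_id.encode("utf-8") + b"|" + struct.pack(">Q", ts_ms)


def reverse_row_key(attribute: str, value: str, main_key: bytes) -> bytes:
    """Hypothetical reverse-table key: 'attribute=value', then the main key."""
    return f"{attribute}={value}".encode("utf-8") + b"|" + main_key


class TinyTable:
    """Dictionary stand-in for a table whose rows are sorted by key."""

    def __init__(self):
        self.rows = {}

    def put(self, key: bytes, value) -> None:
        self.rows[key] = value

    def scan_prefix(self, prefix: bytes) -> list:
        """Return (key, value) pairs whose key starts with prefix, in key order."""
        return [(k, v) for k, v in sorted(self.rows.items()) if k.startswith(prefix)]


def store_record(main: TinyTable, reverse: TinyTable,
                 vehicle_id: str, ts_ms: int, record: dict) -> bytes:
    """Write one record to the main table and index each field in the reverse table."""
    key = main_row_key(vehicle_id, ts_ms)
    main.put(key, record)
    for attribute, value in record.items():
        reverse.put(reverse_row_key(attribute, str(value), key), key)
    return key


def query(main: TinyTable, reverse: TinyTable, attribute: str, value: str) -> list:
    """Conditional query: scan the reverse table, then fetch the main-table rows."""
    prefix = f"{attribute}={value}".encode("utf-8") + b"|"
    return [main.rows[main_key] for _, main_key in reverse.scan_prefix(prefix)]
```

In this layout a lookup by vehicle is a prefix scan on the main table, while a lookup by field value goes through the reverse table first. The cost is one extra reverse-table row per indexed field on every write.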
[Degree-granting institution]: Hunan University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP333


Article ID: 1950538


Article link: http://www.sikaile.net/kejilunwen/jisuanjikexuelunwen/1950538.html
