Research on Data Deduplication Optimization Methods for Archival Storage
Published: 2018-06-05 10:01
Topics: data deduplication + distributed storage; Source: Huazhong University of Science and Technology, master's thesis, 2013
[Abstract]: As society becomes increasingly information-driven, data is becoming more and more important. At the same time, the storage demand of enterprise data centers is growing explosively. Current storage systems are designed mainly around read/write performance and reliability, and they ignore the correlation and redundancy among data. This not only wastes storage space but also makes it hard for users to manage large volumes of structurally complex data effectively. Data deduplication (De-duplication) technology has emerged in recent years in response.

Based on an analysis of metadata access and query characteristics, and of data layout and read/write characteristics, in deduplication systems, this thesis presents a deduplication system architecture that separates metadata from data: (1) it adopts a three-party architecture consisting of clients, a metadata server, and storage nodes; (2) metadata access is confined to the path between the client and the metadata server, and file content access to the path between the client and the storage nodes, which gives the scheme high scalability and high access concurrency. For the deduplication function, (1) data is divided by fixed-size chunking, and hash algorithms such as MD5 and SHA-1 provide the hash fingerprint of each chunk; (2) a two-layer Bloom Filter quickly tests and filters chunk fingerprints, and a B+ tree index structure serves as the persistent store for fingerprint metadata. To further optimize I/O performance, (1) a data layout strategy that stores data in separate regions per data stream is used to obtain spatial locality of data access; (2) client-side metadata and data caching mechanisms are combined to improve the cache hit rate of file access and the performance of file reads and writes.

Finally, a prototype deduplication system with the three-party architecture was designed and implemented, and functional and performance tests were run on it. The functional tests show that the scheme achieves a data compression rate of 130% on a test set of virtual machine images. The performance tests show that the caching mechanism improves file access performance. The fingerprint filtering statistics show that the two-layer Bloom Filter has a high fingerprint filtering rate, and that the measured false-positive rate of 0.071% is within the range allowed by the theoretical false-positive rate of 0.1%.
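The deduplication path described in the abstract (fixed-size chunking, hash fingerprints, Bloom Filter screening, then an authoritative index lookup) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. Several details are assumptions, because the abstract does not give them: the 4 KiB chunk size, the arrangement of the two Bloom Filter layers (here a small first filter followed by a larger second one, both of which must answer "possibly present"), the filter sizes and hash counts, and the class names `BloomFilter` and `DedupStore`. An in-memory `dict` stands in for the persistent B+ tree index.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size; the abstract does not state one


class BloomFilter:
    """Plain Bloom filter over byte strings, using k salted SHA-1 hashes."""

    def __init__(self, num_bits: int, num_hashes: int, salt: bytes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.salt = salt
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by hashing the item with k different prefixes.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(self.salt + bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class DedupStore:
    """Hypothetical single-process stand-in for the fingerprint path."""

    def __init__(self):
        # Assumed layering: a small filter first, then a larger one.
        self.layer1 = BloomFilter(1 << 16, 3, b"L1")
        self.layer2 = BloomFilter(1 << 20, 5, b"L2")
        self.index = {}        # fingerprint -> chunk slot (B+ tree stand-in)
        self.chunks = []       # unique chunk payloads
        self.index_lookups = 0  # how often the filters failed to rule a chunk out

    def _is_duplicate(self, fp: bytes) -> bool:
        if not self.layer1.might_contain(fp):
            return False
        if not self.layer2.might_contain(fp):
            return False
        # Both layers say "possibly present": confirm against the index,
        # since Bloom filters can return false positives.
        self.index_lookups += 1
        return fp in self.index

    def write(self, data: bytes) -> list:
        """Store data and return its recipe (ordered list of fingerprints)."""
        recipe = []
        for off in range(0, len(data), CHUNK_SIZE):
            chunk = data[off:off + CHUNK_SIZE]
            fp = hashlib.sha1(chunk).digest()
            if not self._is_duplicate(fp):
                self.index[fp] = len(self.chunks)
                self.chunks.append(chunk)
                self.layer1.add(fp)
                self.layer2.add(fp)
            recipe.append(fp)
        return recipe

    def read(self, recipe) -> bytes:
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.chunks[self.index[fp]] for fp in recipe)
```

In this sketch, a fingerprint that either filter reports as absent is treated as new without touching the index, which is the case the filters are meant to make cheap. Only fingerprints that pass both layers cost an index lookup, and that lookup is what keeps a Bloom Filter false positive from discarding a chunk that was never stored.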
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP333
Article ID: 1981591
Article link: http://www.sikaile.net/kejilunwen/jisuanjikexuelunwen/1981591.html