Research on Techniques for Optimizing File Storage in HDFS

Published: 2018-06-24 15:04

Topics: Hadoop Distributed File System (HDFS) + storage node selection; Source: Nanjing Normal University, master's thesis, 2013

[Abstract]: To cope with ever-growing volumes of data, the computing field has proposed a new computing model, cloud computing. Hadoop is an open-source framework for large-scale distributed computing; its high throughput, high reliability, and high scalability have led to wide use in cloud computing. HDFS, the distributed file system in Hadoop, is designed to run on commodity hardware. It is highly fault-tolerant and can be deployed on inexpensive machines. HDFS provides high-throughput data access, is well suited to applications with large data sets, and supports streaming reads of file system data. As a distributed file system that is still evolving, however, HDFS has shortcomings in how it stores file data. For example, when storing data replicas, HDFS selects DataNodes at random within a rack, which can leave DataNode load unbalanced and degrade the performance of the whole system. In addition, HDFS was originally designed for streaming storage of large files and is not optimized for small files, so its performance on small files is very poor. This thesis first briefly reviews the development of distributed file systems, then analyzes HDFS in depth, including its architecture, metadata management, and file read and write paths, and examines the performance and limitations of existing approaches to HDFS data storage and small-file storage. The main contributions are as follows:

1. To address the load imbalance that can result from selecting DataNodes at random within a rack for replica storage, the thesis proposes a method that applies multi-objective optimization to the current running state of each DataNode and selects the DataNode with the best overall conditions to store the data. The method spreads replicas evenly across DataNodes and can also improve read and write performance.

2. Real applications generate large numbers of small files. To address the weakness of HDFS in storing small files, the thesis proposes merging small files and caching them at the client. The client merges small files into several large files and stores the large files, together with the related metadata, in HDFS. When a small file is read, the client caches the entire large file that contains it, as returned by the DataNode; later reads of that small file, or of other small files in the same large file, are served directly from the client. This reduces how often the client requests metadata from the NameNode and data blocks from the DataNodes, and substantially reduces small-file access time.
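The small-file strategy in the abstract (merge at the client, store the merged file with its metadata, cache the whole merged file on first read) can be sketched in a few lines. This is a simplified in-memory model and not the thesis implementation: `MergedFileStore` is a stand-in for HDFS that only counts remote fetches, the index layout of (offset, length) per file name is an assumption, and the cache is unbounded, where a real client would need an eviction policy.

```python
class MergedFileStore:
    """Stand-in for HDFS: holds merged files and counts remote fetches."""

    def __init__(self):
        self._blobs = {}
        self.fetch_count = 0

    def put(self, merged_name, data, index):
        self._blobs[merged_name] = (data, index)

    def fetch(self, merged_name):
        self.fetch_count += 1
        return self._blobs[merged_name]


def merge_small_files(files):
    """Concatenate small files into one blob.

    Returns (blob, index), where index maps each file name to its
    (offset, length) inside the blob.
    """
    index = {}
    parts = []
    offset = 0
    for name, content in files.items():
        index[name] = (offset, len(content))
        parts.append(content)
        offset += len(content)
    return b"".join(parts), index


class CachingClient:
    """Hypothetical client that merges on write and caches on read."""

    def __init__(self, store):
        self._store = store
        self._cache = {}       # merged file name -> (blob, index)
        self._locations = {}   # small file name -> merged file name

    def write_batch(self, merged_name, files):
        blob, index = merge_small_files(files)
        self._store.put(merged_name, blob, index)
        for name in files:
            self._locations[name] = merged_name

    def read(self, name):
        merged_name = self._locations[name]
        if merged_name not in self._cache:
            # First access: fetch the whole merged file once and keep it.
            self._cache[merged_name] = self._store.fetch(merged_name)
        blob, index = self._cache[merged_name]
        offset, length = index[name]
        return blob[offset:offset + length]
```

After one `write_batch` call, reading any number of small files from the same merged file triggers a single `fetch` on the store; every later read is sliced out of the cached blob on the client side.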
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP316.4;TP333

Article ID: 2061924

Article link: http://www.sikaile.net/kejilunwen/jisuanjikexuelunwen/2061924.html
