Research and Improvement of a Massive Log Data Processing Model on the Hadoop Platform
Topic: Hadoop + hierarchical job scheduling; Source: Zhejiang Sci-Tech University, master's thesis, 2013
[Abstract]: As computer technology and the Internet spread rapidly into every aspect of production and daily life, data volumes are growing explosively. To meet the processing demands of massive-data applications, parallel computing on large-scale computer clusters has become the main approach, and MapReduce is a framework originally designed by Google for running parallel computations on large clusters. It reduces the complexity of concurrent programming and lets developers write distributed programs without understanding the underlying details of the distributed system.

Hadoop is an open-source cluster platform that implements MapReduce. It is already in use at many Internet companies and is arguably the most widely used open-source cloud computing platform. However, Hadoop is still a relatively young platform and needs improvement in many areas. The main work and contributions of this thesis are as follows:

1) The architecture and core technologies of the Hadoop platform are studied in depth. The design ideas, implementation, strengths, and weaknesses of Hadoop's existing scheduling algorithms (FIFO, the capacity scheduling algorithm, and the fair scheduling algorithm) are described. To address the problems of FIFO, namely its single scheduling policy, its tendency to leave large jobs waiting for a long time, and low cluster CPU utilization, a hierarchical scheduling algorithm based on red-black trees (HSBRB) is proposed and introduced into the Hadoop platform.

2) The HSBRB scheduling algorithm uses a red-black tree as the data structure for storing job information. A red-black tree is a highly efficient binary search tree that is only approximately balanced; insertion and deletion remain fast (O(log n)) as the number of nodes grows, which helps raise the CPU utilization of the whole cluster. HSBRB also adopts a hierarchical scheduling model: when multiple users share the cluster, each user is assigned a pool, and each pool holds multiple jobs. This addresses the low cluster resource utilization that results from FIFO being designed around jobs submitted by a single user.

3) Processing of massive log data. All log data used in this thesis come from the NBER patent data set. To obtain the number of patents at each citation frequency, a small Hadoop cluster is built and a distributed parallel program is developed on it; the results are saved to a specified directory.

4) To verify the performance of HSBRB, two different experimental scenarios are designed to compare Hadoop's existing FIFO scheduler and Fair Scheduler with the HSBRB algorithm proposed here. The experimental results confirm that HSBRB is reasonable and effective and that, compared with the existing scheduling algorithms, it reduces job running time and improves CPU utilization more effectively, making it a fairly suitable task scheduling algorithm.

Finally, the work of the thesis is summarized and directions for further work are discussed.
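The computation in item 3) of the abstract, counting how many patents fall at each citation frequency, can be sketched as two chained MapReduce passes. The following is a minimal single-process illustration in Python, not the thesis's Hadoop program: the helper `run_mapreduce`, the function names, and the assumed input format (one `citing,cited` pair per line with an optional quoted header row) are all hypothetical choices made for this example.

```python
from collections import defaultdict
from itertools import chain


def run_mapreduce(records, mapper, reducer):
    """Run one map -> shuffle -> reduce pass in memory (single process)."""
    groups = defaultdict(list)
    # Map phase: each record yields zero or more (key, value) pairs.
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)  # Shuffle: group values by key.
    # Reduce phase: one output pair per key, in sorted key order.
    return [reducer(key, values) for key, values in sorted(groups.items())]


def map_citation(line):
    """Emit (cited_patent, 1) for each 'citing,cited' line (assumed format)."""
    fields = line.strip().split(",")
    # Skip the header row and malformed rows.
    if len(fields) == 2 and fields[1].isdigit():
        yield fields[1], 1


def reduce_sum(key, values):
    """Sum the values collected for one key."""
    return key, sum(values)


def map_frequency(pair):
    """Emit (citation_count, 1) for each (patent, citation_count) pair."""
    _patent, count = pair
    yield count, 1


def citation_histogram(lines):
    """Pass 1: citations per patent. Pass 2: patents per citation count."""
    per_patent = run_mapreduce(lines, map_citation, reduce_sum)
    return run_mapreduce(per_patent, map_frequency, reduce_sum)


# Made-up sample rows for illustration only (not NBER data).
sample = [
    '"CITING","CITED"',
    "100,1", "101,1", "102,1",
    "100,2",
    "103,3",
]
histogram = citation_histogram(sample)
for count, patents in histogram:
    print(f"{patents} patent(s) cited {count} time(s)")
```

In this sketch, patents that are never cited do not appear in the histogram, because only cited patent numbers are emitted in the first pass. On a real cluster each pass would run as a separate job, with the first job's output directory serving as the second job's input.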
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP301.6;TP338.6