A Performance-Prediction-Based Resource Optimization and Allocation Strategy for Spark

Published: 2018-11-03 16:17

[Abstract]: Spark has become the most popular distributed big data computing platform today and is widely used in industry thanks to its high performance, good fault tolerance, and unified design. However, Spark's concrete operations on data are transparent to the user, and tasks running on Spark are affected by many factors, such as the data partitioning strategy, the design and implementation of the algorithm, and the resource allocation of the nodes. This makes Spark performance very difficult to predict. This thesis builds a performance model based on the structure of Spark tasks, studies the execution time of Spark tasks under different data volumes and partitioning strategies, and on that basis seeks a balance between task execution time and cluster resource consumption, proposing a resource allocation optimization strategy based on dynamic repartitioning.

Building on fine-grained monitoring of cluster resources, the thesis parses the execution information of each stage of a Spark task, establishes a performance model based on Spark task structure, and trains the model parameters on a large amount of historical experimental data, achieving performance prediction for Spark computing tasks with different workload types. We then study how the partitioning strategy affects Spark execution time. We find that although increasing node parallelism can improve the performance of a computing task to some extent, in some cases the improvement is negligible compared with the additional resources consumed. Once the user's requirement on task running time has been met, such small gains can be ignored; instead, the resource configuration should be reduced as far as possible within the time requirement given by the user, in order to save resources. We add dynamic repartitioning to a series of real Spark computing tasks to find the best partitioning scheme for each task, and propose a repartitioning strategy based on task time prediction. Without sacrificing too much task running time, the strategy saves cluster resources, balances task execution time against cluster resource configuration, and guides users toward reasonable use of cluster resources for Spark tasks.

Experiments verify the soundness of the performance model and the accuracy of its task execution time predictions. On this basis we propose a resource optimization allocation strategy based on performance prediction, which applies dynamic repartitioning across a set of Spark workloads to find an optimized cluster resource allocation that balances task execution time against cluster resource consumption. The experimental results show that the strategy saves cluster resources noticeably within the execution time given by the user and achieves a good balance between task execution time and cluster resource consumption.
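The trade-off described in the abstract (stop adding parallelism once the user's time requirement is met) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not give the thesis's performance model, so `predict_time` is a toy stand-in with made-up coefficients, `smallest_partitions_within_deadline` is a hypothetical helper name, and the partition count is used as a rough proxy for resource consumption.

```python
from typing import Callable, Iterable, Optional


def predict_time(partitions: int, data_gb: float) -> float:
    """Toy stand-in for a trained performance model (hypothetical coefficients).

    Returns a predicted execution time in seconds.
    """
    fixed = 20.0                             # cost that does not shrink with parallelism
    parallel = 60.0 * data_gb / partitions   # work that is divided across partitions
    overhead = 0.5 * partitions              # scheduling/shuffle cost that grows with partitions
    return fixed + parallel + overhead


def smallest_partitions_within_deadline(
    candidates: Iterable[int],
    deadline_s: float,
    predictor: Callable[[int], float],
) -> Optional[int]:
    """Return the smallest candidate partition count whose predicted time
    meets the deadline, or None if no candidate does."""
    for p in sorted(candidates):
        if predictor(p) <= deadline_s:
            return p
    return None


if __name__ == "__main__":
    candidates = [4, 8, 16, 32, 64]

    def model(p: int) -> float:
        return predict_time(p, data_gb=10.0)

    for p in candidates:
        print(p, model(p))
    # With a 100 s requirement, 8 partitions already suffice in this toy model.
    print(smallest_partitions_within_deadline(candidates, 100.0, model))
```

In this toy model, going from 16 to 32 partitions doubles the parallelism but shortens the predicted time only from 65.5 s to 54.75 s, which is the kind of diminishing return the abstract refers to. In a real Spark job the chosen count could then be applied with `repartition(n)`; how the thesis actually triggers repartitioning is not stated in the abstract.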
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13


Article ID: 2308292


Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/2308292.html
