Research and Implementation of Memory Optimization in the Cluster Computing Engine Spark
Published: 2018-12-16 15:46
[Abstract]: Parallel computing frameworks that keep data in memory between iterations are an active research topic. Compared with traditional disk- and network-based computation, using memory reduces data transfer time; for data-intensive tasks it can make execution more than ten times faster. As this new generation of frameworks develops rapidly, how to make full use of still relatively scarce memory resources while keeping tasks efficient has become a pressing problem.
Based on the cluster computing engine Spark, this thesis studies how parallel computing clusters use memory. By modeling and analyzing memory behavior, it automates memory usage decisions and optimizes the replacement policy. This improves task efficiency when resources are limited and makes task efficiency more stable across different cluster environments. The main contributions are as follows.
First, the memory policy is automated by analyzing the semantics of the code. The scheduler automatically identifies valuable datasets (RDDs) and places them in the cache, which avoids cache pollution and reduces the programmer's burden.
Second, building on the code semantic analysis and the detailed task information it yields, the memory replacement policy is optimized. This mainly includes computing RDD sizes and weights, reordering operations, a new replacement algorithm that combines a register allocation model with weight information to replace the original LRU algorithm, and a smarter multi-level cache model. A preliminary analysis of memory behavior on heterogeneous clusters is also presented.
Finally, experiments show that the optimized scheme improves the adaptability of tasks to different cluster environments and makes tasks run more efficiently when memory is relatively limited, which makes the system more practical overall. The results are also a useful reference for memory usage in other parallel systems.
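
The two ideas in the abstract, choosing which datasets to cache from how the code reuses them and evicting by weight rather than by recency, can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the abstract does not give the weight formula or the register allocation model, so the weight below (remaining uses times recompute cost, per megabyte) and all names (`cache_candidates`, `Block`, `choose_victims`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


def cache_candidates(lineage):
    """Return datasets read by more than one downstream operation.

    `lineage` maps each dataset name to the list of parent datasets it reads.
    A dataset with several consumers would otherwise be recomputed, so it is
    treated here as worth caching (an assumed heuristic).
    """
    uses = {}
    for parents in lineage.values():
        for parent in parents:
            uses[parent] = uses.get(parent, 0) + 1
    return {name: count for name, count in uses.items() if count > 1}


@dataclass
class Block:
    name: str
    size_mb: float
    remaining_uses: int    # future reads still expected, from lineage analysis
    recompute_cost: float  # hypothetical cost of rebuilding the block

    @property
    def weight(self):
        # Hypothetical weight: value retained per megabyte of cache occupied.
        return self.remaining_uses * self.recompute_cost / self.size_mb


def choose_victims(blocks, needed_mb):
    """Pick the lowest-weight blocks until at least `needed_mb` is freed."""
    victims, freed = [], 0.0
    for block in sorted(blocks, key=lambda b: b.weight):
        if freed >= needed_mb:
            break
        victims.append(block.name)
        freed += block.size_mb
    return victims


# Example: "parsed" feeds two later steps, so it is the only cache candidate.
lineage = {
    "parsed": ["raw"],
    "ranks": ["parsed"],
    "contribs": ["parsed", "ranks"],
    "output": ["contribs"],
}
print(cache_candidates(lineage))  # {'parsed': 2}

blocks = [
    Block("parsed", size_mb=100, remaining_uses=2, recompute_cost=5.0),
    Block("tmp", size_mb=200, remaining_uses=0, recompute_cost=3.0),
    Block("ranks", size_mb=50, remaining_uses=1, recompute_cost=8.0),
]
print(choose_victims(blocks, needed_mb=150))  # ['tmp']
```

Unlike LRU, this policy evicts `tmp` first because it has no remaining uses, regardless of how recently it was touched.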
[Degree-granting institution]: Tsinghua University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP333.1
Article ID: 2382595
Article link: http://www.sikaile.net/kejilunwen/jisuanjikexuelunwen/2382595.html