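Research on Fault-Tolerant Design Techniques for Parallel Programs Based on NR-MPI
Published: 2018-05-28 18:02
Topic: high-performance computing + MPI parallel programs; Source: master's thesis, National University of Defense Technology, 2012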
[Abstract]: With the rapid development of high-performance computing, the scale of high-performance computing (HPC) systems has grown sharply, and the system mean time between failures (MTBF) has dropped accordingly, to well below the running time of large scientific computing programs on HPC systems, which seriously affects system availability. Fault tolerance is an important means of improving the availability of HPC systems. However, system-level checkpointing, the fault-tolerance method in common use today, usually incurs a very large overhead and can no longer meet the needs of HPC applications. Application-level checkpointing keeps the overhead under better control, but it still requires the failed program to be reloaded, which can be costly on large-scale systems. MPI is the most widely used parallel programming model in HPC, and NR-MPI is a new, high-performance fault-tolerant MPI, so research on fault-tolerant design techniques for parallel programs based on NR-MPI is of considerable significance.

Because MPI parallel programs are complex and diverse, it is hard to find a fault-tolerance technique that is both general and efficient. Targeting the widely used class of loop-iterative parallel programs, this thesis studies two fault-tolerance techniques in depth, data redundancy and node redundancy. The main work is as follows.

First, to evaluate fault-tolerance techniques, three metrics are defined: fault-tolerance space overhead, fault-tolerance time overhead, and failure recovery time. To estimate whether a fault-tolerance technique suits a given application on a given HPC system, a fault-tolerance time factor is also defined. This work provides the theoretical basis for the fault-tolerant design of parallel programs based on NR-MPI.
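The abstract's opening observation, that MTBF falls as the system grows, can be illustrated with a short calculation. The sketch below is not taken from the thesis: it assumes node failures are independent and exponentially distributed and that the failure of any single node interrupts the job, in which case the system MTBF is the node MTBF divided by the node count. The function names and the numbers in the usage lines are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """MTBF of a system that is interrupted when any one node fails.

    Assumption (not from the thesis): node failures are independent and
    exponentially distributed, so failure rates add across nodes.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mtbf_hours / num_nodes


def prob_no_failure(run_hours: float, node_mtbf_hours: float, num_nodes: int) -> float:
    """Probability that a run of run_hours completes without any node failure."""
    return math.exp(-run_hours / system_mtbf(node_mtbf_hours, num_nodes))


# Illustrative, made-up numbers: 10,000 nodes, each with a 50,000-hour MTBF.
print(system_mtbf(50_000.0, 10_000))            # 5.0 hours for the whole system
print(prob_no_failure(24.0, 50_000.0, 10_000))  # chance a 24-hour job sees no failure
```

Under these assumptions a job that runs for many multiples of the system MTBF almost never finishes without interruption, which is why the cost of each recovery matters so much at scale.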
Second, a data-redundancy-based fault-tolerant parallel algorithm framework, the Data Redundancy based Fault Tolerant Framework (DRFTF), is proposed, and its key issues are analyzed in detail: the data backup strategy, global consistency, the backup period, and key variables. DRFTF builds on the program's original algorithm, so fault tolerance can be added without major changes to that algorithm, and for algorithms in which key variables make up a small share of the data the fault-tolerance overhead is small.

Third, the algorithms of the NPB and Sweep3D benchmark programs are analyzed, fault-tolerant versions of NPB and Sweep3D are implemented with DRFTF, and the fault-tolerant programs are evaluated experimentally. The results confirm DRFTF's fault-tolerance capability and its low overhead.

Fourth, for algorithms that can maintain a checksum relationship at every loop iteration, a node-redundancy-based fault-tolerant parallel algorithm framework, the Node Redundancy based Fault Tolerant Framework (NRFTF), is proposed. NRFTF builds a checksum of the program data and stores it on redundant nodes; the checksum is updated by redundant processes without pausing the original algorithm, so the overhead is very small.

Finally, the parallel Gaussian elimination algorithm is analyzed and a fault-tolerant version is designed with NRFTF. Taking HPL, the benchmark used for the TOP500 supercomputer ranking, as an example, a fault-tolerant HPL program is implemented and evaluated experimentally. The results confirm NRFTF's fault-tolerance capability and its very low overhead.
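The idea of maintaining a checksum relationship at every iteration of Gaussian elimination can be sketched in a few lines. This is a simplified, single-machine illustration and not the thesis's NRFTF or its HPL implementation: each matrix row stands in for the data of one compute process, a plain column-wise sum stands in for the checksum held by a redundant process, pivoting is omitted, and all function names are hypothetical. Because each elimination step is a linear row operation, applying the matching update to the checksum keeps it equal to the sum of the rows, so one lost row can be rebuilt by subtraction.

```python
def column_sums(rows):
    """Column-wise sum of a list of equal-length rows."""
    return [sum(col) for col in zip(*rows)]


def eliminate_step(rows, checksum, k):
    """Run one Gaussian-elimination step on column k (no pivoting).

    rows[i] stands in for the data held by compute process i and is
    updated in place; the returned list is the updated checksum that a
    (hypothetical) redundant process would hold.
    """
    pivot_row = rows[k]
    multiplier_sum = 0.0
    for i in range(k + 1, len(rows)):
        m = rows[i][k] / pivot_row[k]
        rows[i] = [a - m * p for a, p in zip(rows[i], pivot_row)]
        multiplier_sum += m
    # Apply the same linear update to the checksum so that
    # checksum == column_sums(rows) still holds after this step.
    return [c - multiplier_sum * p for c, p in zip(checksum, pivot_row)]


def recover_row(rows, checksum, lost):
    """Rebuild rows[lost] from the checksum and the surviving rows."""
    survivors = [row for i, row in enumerate(rows) if i != lost]
    partial = column_sums(survivors)
    return [c - s for c, s in zip(checksum, partial)]


rows = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
checksum = column_sums(rows)            # built once, before the loop starts
checksum = eliminate_step(rows, checksum, 0)

lost_copy = rows[2]
rows[2] = None                          # simulate the failure of "process" 2
recovered = recover_row(rows, checksum, 2)
print(recovered == lost_copy)           # the lost row is rebuilt from the checksum
```

In this sketch the checksum update needs only the pivot row and the sum of the multipliers, so it can proceed alongside the elimination itself, which is in the spirit of the abstract's statement that the redundant processes update the checksum without pausing the original algorithm.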
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP302.8