Asynchronous Checkpoint Fault-Tolerance Technique Based on Memory Cache
Published: 2018-11-15 20:35
[Abstract]: As high-performance computing systems grow in scale, system reliability problems become increasingly severe. Checkpointing is the most typical fault-tolerance method, but because parallel file system performance has improved relatively slowly and data write bandwidth is low, traditional checkpointing incurs serious performance problems. Observing that current computer systems have abundant computing and storage resources while parallel file system write bandwidth lags behind, this paper proposes an asynchronous checkpoint fault-tolerance technique based on memory caching. The traditional checkpoint operation is split into two steps: the checkpoint file is first cached in the local memory of the compute node, and an independent helper task then copies the data to the parallel file system. By exploiting high local memory bandwidth and running the helper task in parallel with the compute task, the new method greatly reduces the time overhead introduced by checkpoint-based fault tolerance. Simulations and tests with real applications verify the effectiveness of the asynchronous checkpoint technique.
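The two-step scheme described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the class name `AsyncCheckpointer`, the `max_pending` bound, the file naming, and the use of a background thread writing to a local directory (standing in for the helper task and the parallel file system) are all assumptions made for illustration.

```python
import os
import queue
import tempfile
import threading


class AsyncCheckpointer:
    """Two-step checkpointing: cache in memory, then flush in the background.

    Hypothetical sketch; names and parameters are not from the paper.
    """

    def __init__(self, target_dir, max_pending=2):
        self.target_dir = target_dir
        # Bounded queue: limits how much memory cached checkpoints can occupy.
        self._pending = queue.Queue(maxsize=max_pending)
        self._helper = threading.Thread(target=self._drain, daemon=True)
        self._helper.start()

    def checkpoint(self, step, state):
        # Step 1: snapshot the state into local memory. bytes() copies the
        # buffer, so the caller may keep mutating `state` afterwards.
        snapshot = bytes(state)
        self._pending.put((step, snapshot))  # blocks only if the cache is full

    def _drain(self):
        # Step 2: the helper task copies cached snapshots to slow storage.
        while True:
            item = self._pending.get()
            if item is None:  # sentinel: no more checkpoints
                break
            step, snapshot = item
            final_path = os.path.join(self.target_dir, f"ckpt_{step:06d}.bin")
            tmp_path = final_path + ".tmp"
            with open(tmp_path, "wb") as f:
                f.write(snapshot)
                f.flush()
                os.fsync(f.fileno())
            # Rename last so a crash never leaves a half-written checkpoint
            # under the final name.
            os.replace(tmp_path, final_path)

    def close(self):
        # Signal the helper to stop and wait until all snapshots are written.
        self._pending.put(None)
        self._helper.join()


def run_demo():
    state = bytearray(b"\x00" * 1024)
    with tempfile.TemporaryDirectory() as target_dir:
        ckpt = AsyncCheckpointer(target_dir)
        for step in range(3):
            state[0] = step + 1           # "compute" mutates the state
            ckpt.checkpoint(step, state)  # returns after the in-memory copy
        ckpt.close()                      # wait for the helper to finish
        written = sorted(os.listdir(target_dir))
        first_bytes = []
        for name in written:
            with open(os.path.join(target_dir, name), "rb") as f:
                first_bytes.append(f.read(1)[0])
    return written, first_bytes
```

In this sketch the compute loop pays only for the in-memory copy in `checkpoint`, while the slower file write overlaps with subsequent computation. The bounded queue is one possible way to cap the memory used by cached checkpoints; when it is full, the compute task waits for the helper to catch up.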
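[Author affiliations]: College of Computer, National University of Defense Technology; China North Vehicle Research Institute
[Funding]: National Natural Science Foundation of China (60903059, 61003087, 61170049, 61120106005); National High Technology Research and Development Program of China ("863" Program) (2012AA01A309); "Core Electronic Devices, High-end Generic Chips and Basic Software" National Science and Technology Major Project (2009ZX01036-001-003-001)
[Classification number]: TP302.8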
Article ID: 2334384
Article link: http://www.sikaile.net/kejilunwen/jisuanjikexuelunwen/2334384.html