Evaluation of Resource Management Methods for Large-Scale High Energy Physics Computing Clusters
Published: 2018-05-13 03:24
Topics: resource management system + job scheduler; Source: Computer Science (《計算機科學》), 2017, No. 10
[Abstract]: High energy physics data consist of physics events, and the events are independent of one another. High energy physics computing tasks can therefore be parallelized by running many jobs that simultaneously process many different data files, which makes high energy physics computing a typical high-throughput computing scenario. The computing cluster at the Institute of High Energy Physics (IHEP) uses the open source TORQUE/Maui for resource management and job scheduling, and ensures fairness by partitioning cluster resources into separate queues and capping the maximum number of running jobs per user. However, this also leads to very low overall resource utilization across the cluster. SLURM and HTCondor are both open source resource management systems that have become popular in recent years. The former offers a rich set of job scheduling policies, and the latter is well suited to high-throughput computing. Both can replace the aging, poorly maintained TORQUE/Maui, and both are feasible options for managing computing cluster resources. The job submission behavior of Daya Bay experiment users was simulated on SLURM and HTCondor test clusters to test the resource allocation behavior and efficiency of the two systems, and the results were compared with the actual scheduling results of the same jobs on the IHEP TORQUE/Maui cluster. The advantages and shortcomings of SLURM and HTCondor are analyzed, and the feasibility of using SLURM or HTCondor to manage the IHEP computing cluster is discussed.
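[Author affiliation]: Institute of High Energy Physics, Chinese Academy of Sciences
[Funding]: Supported by the National Natural Science Foundation of China (11475210)
[Classification number]: O572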