Design and Implementation of a Data Center Cluster Monitoring System
Topic: cluster + monitoring; Source: China University of Geosciences (Beijing), master's thesis, 2012
[Abstract]: Ethernet bandwidth keeps rising and the price of commodity computers keeps falling. A cluster system uses ordinary PCs as nodes that form its basic compute units, connects them over a high-speed local area network, and relies on software to make them cooperate. Such systems offer a high cost-performance ratio and good scalability. They have replaced traditional mainframes and supercomputers and are widely used in many industrial fields, such as information retrieval, text analysis, large-scale data mining, machine learning, and the currently popular cloud computing. As clusters come into wider use, operators keep adding nodes to raise computing performance. Because a cluster is built from ordinary PCs whose behavior is not stable, the probability that an individual node fails is high, and monitoring becomes more important as the cluster grows. Monitoring reveals which nodes have failed and stopped working, reports the utilization of every node, and supports analysis of the running trends, performance limits, and job bottlenecks of the whole cluster. This gives system administrators and the cluster task scheduler a basis for their decisions.

This project originates from the Meridian Project Data Center and aims to monitor the cluster system that performs the center's numerical space weather computations. Following the center's specific requirements, this thesis designs and implements a cluster monitoring system with three functions. First, it collects metrics from every node in the cluster: system load, CPU time by category, memory usage, disk usage, network traffic, and other system-related metrics. Second, it aggregates the metrics of all nodes, stores them in a database, and presents them to end users as web pages for querying and use. Third, it checks each metric against a value range set by the user; when a metric is found to be abnormal, it is sent to the responsible personnel according to predefined communication rules so that they can act on it and avoid unnecessary losses. The system has a client/server (C/S) architecture consisting of agent programs distributed across the nodes, a number of aggregation programs, and a front-end display interface. It obtains monitoring data from /proc, transfers data as XML, and uses RRDTool to plot trend graphs for numeric metrics. The back end uses two types of database, RRD and MySQL.

The cluster monitoring system designed in this thesis monitors the Meridian Project Data Center stably and effectively, uses few system resources, and responds quickly.
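The abstract does not include the agent's code, so the following is only a minimal sketch of the pipeline it describes: read metrics from /proc, serialize them as XML, and compare them against user-defined ranges. The function names, the XML element and attribute names, the metric keys, and the threshold values are all hypothetical and are not taken from the thesis.

```python
import xml.etree.ElementTree as ET


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into 1-, 5-, and 15-minute load averages."""
    one, five, fifteen = text.split()[:3]
    return {"load_1": float(one), "load_5": float(five), "load_15": float(fifteen)}


def parse_meminfo(text):
    """Parse the contents of /proc/meminfo and keep total and free memory in kB."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return {"mem_total_kb": fields["MemTotal"], "mem_free_kb": fields["MemFree"]}


def collect(proc_root="/proc"):
    """Read live values on a Linux host (assumes a standard /proc layout)."""
    metrics = {}
    with open(f"{proc_root}/loadavg") as f:
        metrics.update(parse_loadavg(f.read()))
    with open(f"{proc_root}/meminfo") as f:
        metrics.update(parse_meminfo(f.read()))
    return metrics


def to_xml(host, metrics):
    """Serialize one node's metrics as XML (hypothetical schema, not the thesis's)."""
    root = ET.Element("node", name=host)
    for name, value in sorted(metrics.items()):
        ET.SubElement(root, "metric", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")


def find_anomalies(metrics, limits):
    """Return metrics whose values fall outside their (low, high) range."""
    anomalies = {}
    for name, (low, high) in limits.items():
        value = metrics.get(name)
        if value is not None and not (low <= value <= high):
            anomalies[name] = value
    return anomalies
```

On a Linux node, `to_xml("node01", collect())` would produce the message an agent could send to an aggregator, and `find_anomalies(collect(), {"load_1": (0.0, 8.0)})` would list any metric that should trigger an alert. Keeping the parsers separate from the file reads lets them be exercised with sample text on machines that have no /proc.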
[Degree-granting institution]: China University of Geosciences (Beijing)
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP308; TP277