Research on Dynamic Instruction Mapping Algorithms for the EDGE Architecture

Published: 2018-07-24 13:38

[Abstract]: The centralized structures that are widespread in out-of-order superscalar processors have severely limited further gains in microprocessor performance. EDGE (Explicit Data Graph Execution) is one of the models proposed to address this bottleneck; its architectural model discards the power-hungry, poorly scalable centralized structures of superscalar designs. In a distributed EDGE architecture, instructions are mapped onto multiple tiles and executed concurrently. Passing operands between tiles incurs latency, which degrades performance. Instruction mapping algorithms try to eliminate the performance loss introduced by tiling by carefully balancing program parallelism against inter-tile communication latency.

The TRIPS microprocessor combines an asymmetric topological distribution of critical resources with a static instruction mapping algorithm (SPDI, Static Placement Dynamic Issue). This leads to considerable load imbalance across the ETs (Execute Tiles) and to communication hot spots in the operand network, both of which lower IPC. This thesis implements a TRIPS-like EDGE architecture in the M5-EDGE simulator and uses it to study the dynamic Deep instruction mapping algorithm. Without compiler scheduling, the Deep algorithm with cyclic (round-robin) mapping reaches 85% and 98.3% of the IPC of SPDI at issue widths of 1 and 2, respectively. Taking the topological positions of the RTs (Register Tiles) and DTs (Data-cache Tiles) into account, three optimizations of Deep mapping are examined, which give priority to ETs (1) in ET-number order, (2) in zigzag order, and (3) according to the sum of the global communication hop counts computed for the instruction block. At an issue width of 1, the three optimizations reduce the average hop count relative to the basic Deep algorithm by 2.63%, 2.18%, and 4.70%, and raise IPC by 1.07%, 1.21%, and 2.11%, respectively. This shows that optimizing the hop count of inter-instruction communication under Deep mapping can significantly improve IPC.

Under the Deep mapping algorithm, more than 90% of operands are delivered through the operand bypass, which greatly reduces the load on the operand network. When the bypass width is twice the issue width, the local operand delivery latency drops to almost 0, so increasing the local bypass width effectively reduces operand delivery latency. Assigning the RTs to the ETs by number raises the IPC of the basic Deep mapping algorithm by 1.77%. Two optimizations target the DT position: giving priority to ETs close to a DT, and selecting ETs by the sum of the communication hop counts computed for the instruction block. They raise IPC over basic Deep mapping by 1.17% and 1.89%, respectively. Tiling the RTs and DTs into the ETs yields a 4x4 topology. On this structure, at issue widths of 1 and 2, the IPC of Deep mapping is 97.18% and 113.42% of that of SPDI; when ETs are selected by computed hop count, the ratios are 97.32% and 114.06%. When a microarchitectural change shortens topological distances, or when the Deep mapping algorithm optimizes communication hop counts, system IPC improves significantly.
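The abstract's central quantity, the number of operand-network hops between the tiles holding a producer and a consumer instruction, can be illustrated with a small sketch. This is not the thesis's Deep algorithm, whose details the abstract does not give. It assumes a 4x4 mesh of ETs numbered in row-major order, Manhattan-distance routing, and a hypothetical per-ET slot limit. It ignores RTs, DTs, and issue timing, and all function names are invented for illustration. It compares cyclic placement with a greedy placement that picks the ET with the smallest hop sum to an instruction's producers.

```python
GRID = 4  # assumption: 4x4 mesh of execute tiles (ETs), row-major numbering


def tile_coord(et):
    """Return the (row, column) of ET number `et` on the assumed mesh."""
    return divmod(et, GRID)


def hops(et_a, et_b):
    """Manhattan distance between two ETs (assumed routing model)."""
    (ra, ca), (rb, cb) = tile_coord(et_a), tile_coord(et_b)
    return abs(ra - rb) + abs(ca - cb)


def cyclic_map(n_instructions):
    """Cyclic (round-robin) placement: instruction i goes to ET i mod 16."""
    return [i % (GRID * GRID) for i in range(n_instructions)]


def hop_aware_map(producers, slots_per_et):
    """Greedy placement (illustrative only): put each instruction on the
    non-full ET with the smallest hop sum to its producers' ETs,
    breaking ties by lower ET number."""
    placement = []
    load = [0] * (GRID * GRID)
    for srcs in producers:
        candidates = [et for et in range(GRID * GRID) if load[et] < slots_per_et]
        best = min(
            candidates,
            key=lambda et: (sum(hops(et, placement[s]) for s in srcs), et),
        )
        placement.append(best)
        load[best] += 1
    return placement


def total_hops(producers, placement):
    """Sum of hops over every producer-to-consumer operand transfer."""
    return sum(
        hops(placement[i], placement[s])
        for i, srcs in enumerate(producers)
        for s in srcs
    )


# Hypothetical 8-instruction dependence chain: instruction i consumes i-1.
chain = [[]] + [[i - 1] for i in range(1, 8)]

print(total_hops(chain, cyclic_map(len(chain))))        # 10
print(total_hops(chain, hop_aware_map(chain, 2)))       # 3
```

On this serial chain the greedy placement wins easily because there is no parallelism to lose. For code with many independent instructions, packing consumers next to producers competes with spreading work across ETs, which is the trade-off between parallelism and communication latency that the abstract describes.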
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP332; TP301.6

Article ID: 2141553

Article link: http://www.sikaile.net/kejilunwen/jisuanjikexuelunwen/2141553.html
