
Research on Cache Subsystem Optimization for Chip Multiprocessors

發(fā)布時間:2019-01-21 13:17
【Abstract】: Current chip multiprocessors require large-capacity cache systems to bridge the performance gap between fast processors and slow off-chip main memory. This thesis argues that the characteristics of chip multiprocessors can be exploited to optimize the performance and power consumption of their cache subsystems, and it studies several mechanisms for doing so. Specifically, the research covers three topics: 1) designing efficient multicast routing algorithms to improve the performance of the on-chip network; 2) using emerging non-volatile memory to build a low-power cache system for chip multiprocessors; and 3) exploiting thread progress information to design a more efficient cache coherence protocol.

For the first topic, we propose an efficient multicast routing mechanism for networks-on-chip. For chip multiprocessors that integrate ever more cores, the network-on-chip (NoC) provides an efficient and scalable communication infrastructure. One-to-many communication patterns are common in NoCs under multi-core architectures, and without effective multicast routing support, conventional unicast-based NoCs handle such traffic inefficiently. This thesis proposes a multicast routing mechanism based on network partitioning, called DPM, which effectively reduces both the average packet latency and the power consumption of the NoC. Specifically, DPM selects routes dynamically according to the current load-balance level of the network and the link-sharing characteristics of the multicast traffic; a minimal sketch of this idea appears below.
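The abstract does not spell out DPM's routing algorithm, so the following is only a rough behavioral sketch of the general idea under stated assumptions: multicast destinations are split into partitions, each partition reuses a shared trunk path, and the partitioning rule switches on an assumed link-load threshold. The 2D-mesh model, the function names, and the 0.75 threshold are illustrative, not taken from the thesis.

```python
# Behavioral sketch (not the thesis implementation) of partition-based
# multicast routing on a 2D mesh.
from collections import defaultdict

def xy_path(src, dst):
    """Dimension-ordered (X then Y) hop sequence from src to dst on a mesh."""
    (sx, sy), (dx, dy) = src, dst
    path, x, y = [], sx, sy
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def partition_destinations(src, dests, balanced):
    """Split a multicast destination set into sub-network groups.

    When the load is unbalanced, split into the halves left/right of the
    source column so each group shares a trunk; when it is balanced, split
    by destination row to spread traffic more evenly (illustrative rule).
    """
    groups = defaultdict(list)
    for d in dests:
        key = d[1] if balanced else ("left" if d[0] < src[0] else "right")
        groups[key].append(d)
    return list(groups.values())

def multicast_routes(src, dests, link_load, threshold=0.75):
    """Pick one trunk path per partition; nearer members branch off it."""
    balanced = max(link_load.values(), default=0.0) <= threshold
    routes = []
    for group in partition_destinations(src, dests, balanced):
        # Route to the farthest member; the shared prefix of this trunk is
        # where the link sharing (and hence the power saving) comes from.
        group.sort(key=lambda d: abs(d[0] - src[0]) + abs(d[1] - src[1]),
                   reverse=True)
        routes.append((group, xy_path(src, group[0])))
    return routes

if __name__ == "__main__":
    link_load = {"(0,0)->(1,0)": 0.9}   # toy per-link utilisation estimate
    for group, trunk in multicast_routes((0, 0), [(2, 1), (2, 3), (1, 3)], link_load):
        print(group, "->", trunk)
```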
The second topic uses an emerging non-volatile memory, spin-transfer torque RAM (STT-RAM), to build low-power caches for chip multiprocessors. STT-RAM offers fast access, high storage density, and negligible leakage power, but its large-scale adoption as a multiprocessor cache is constrained by its long write latency and high write energy. Recent studies show that relaxing the data retention time of the STT-RAM storage cell (the magnetic tunnel junction, MTJ) can effectively improve its write performance. A retention-relaxed STT-RAM is volatile, however, and its cells must be refreshed periodically to avoid data loss. When such STT-RAM is used as the last-level cache (LLC) of a chip multiprocessor, frequent refresh operations increase energy consumption and also hurt system performance. This thesis proposes an efficient refresh scheme, called CCear, that minimizes refresh operations on this kind of STT-RAM; CCear eliminates unnecessary refreshes mainly by interacting with the cache coherence protocol and the cache management policy, as sketched below.
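The abstract only states that CCear consults coherence and cache-management state to drop needless refreshes. The filter below is a hedged reconstruction of what such a decision might look like, not CCear itself: the coherence states, the retention-window constant, and the skip conditions (invalid lines, recently written lines, clean lines still duplicated in an upper-level cache) are all assumptions.

```python
# Behavioral sketch (not the CCear implementation) of coherence-aware
# refresh filtering for a retention-relaxed STT-RAM last-level cache.
from dataclasses import dataclass

RETENTION_CYCLES = 1_000_000   # assumed retention window of the relaxed MTJ cell

@dataclass
class LLCLine:
    state: str            # simplified coherence state: "I", "S", "E", "M"
    dirty: bool           # line differs from main memory
    last_write: int       # cycle of the last write (a write restores the cell)
    in_upper_cache: bool  # a private upper-level cache still holds a copy

def needs_refresh(line: LLCLine, now: int) -> bool:
    """Refresh only lines whose loss would cost correctness or extra traffic."""
    if line.state == "I":
        return False      # invalid data: nothing worth preserving
    if now - line.last_write < RETENTION_CYCLES:
        return False      # a recent write already restored the cell
    if not line.dirty and line.in_upper_cache:
        return False      # clean and duplicated above: can be re-fetched on demand
    return True           # dirty or sole copy: refresh before retention expires

if __name__ == "__main__":
    now = 2_000_000
    lines = [
        LLCLine("M", True, 500_000, False),   # dirty, stale -> refresh
        LLCLine("S", False, 100_000, True),   # clean, copy in upper cache -> skip
        LLCLine("I", False, 0, False),        # invalid -> skip
    ]
    for i, line in enumerate(lines):
        print(i, needs_refresh(line, now))
```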
Finally, we propose an efficient coherence-tuning mechanism to optimize the performance of parallel programs running on chip multiprocessors. A primary goal of chip multiprocessors is to keep improving application performance by exploiting thread-level parallelism, but for multithreaded programs running on such systems, uneven task distribution and contention for shared resources cause different threads to progress at different rates. This progress imbalance is one of the biggest performance bottlenecks of multithreaded programs: because of their inherent synchronization primitives, such as barriers and locks, cores running faster threads must stall and wait for the slower ones, and such idle waiting degrades system performance and wastes energy. This thesis proposes a thread-progress-aware coherence tuning mechanism, called TEACA. TEACA uses thread progress information to dynamically adjust each thread's coherence strategy, aiming to improve the utilization of on-chip network bandwidth and to reduce power consumption. Specifically, TEACA dynamically classifies threads into two categories, leader threads and laggard threads, and then applies a category-specific coherence policy to each thread's coherence requests; a minimal classification sketch follows.
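How TEACA measures progress and what its per-class policies are is not given in the abstract; the sketch below assumes a simple progress metric (synchronization epochs completed per thread), an assumed slack factor, and two made-up policy labels, purely to illustrate the leader/laggard split.

```python
# Behavioral sketch (not the TEACA implementation) of progress-based thread
# classification and per-class coherence policy selection.
from statistics import mean

def classify_threads(progress: dict, slack: float = 0.95) -> dict:
    """Label each thread a 'leader' or 'laggard' relative to mean progress."""
    avg = mean(progress.values())
    return {tid: ("leader" if p >= slack * avg else "laggard")
            for tid, p in progress.items()}

def coherence_policy(label: str) -> str:
    """Assumed policies: laggards get priority on NoC bandwidth, leaders'
    coherence requests are handled with a cheaper, lower-priority strategy."""
    return "low-priority/limited" if label == "leader" else "high-priority/full"

if __name__ == "__main__":
    # progress = synchronization epochs completed per thread (assumed metric)
    progress = {0: 120, 1: 118, 2: 95, 3: 121}
    labels = classify_threads(progress)
    for tid, label in sorted(labels.items()):
        print(tid, label, coherence_policy(label))
```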
【Degree-granting institution】: University of Science and Technology of China
【Degree level】: Doctoral
【Year conferred】: 2013
【CLC number】: TP332

