
Research on Policy Iteration Algorithms in Bayesian Reinforcement Learning

Published: 2018-07-10 07:50

Topics: Bayesian reinforcement learning + policy iteration; Source: Soochow University, 2016 master's thesis

[Abstract]: Bayesian reinforcement learning applies Bayesian techniques to reinforcement learning tasks by placing probability distributions over quantities such as the value function, the policy, and the environment model. The central idea is to express uncertainty about unknown parameters through a prior distribution and then to learn by computing the posterior distribution from the observations collected. On this basis, and using policy iteration as the framework, this thesis proposes three improved reinforcement learning algorithms that combine Bayesian inference with policy iteration.

(1) Traditional Bayesian reinforcement learning algorithms cannot dynamically control how many times the environment model is learned when that model is unknown. To address this, a policy iteration algorithm based on Bayesian intelligent model learning is proposed. In the model learning part, a threshold on the variance of the Dirichlet distribution decides whether the model needs further learning, which keeps model learning sufficient while reducing wasted learning effort. In the policy learning part, an exploration incentive factor guarantees that exploratory actions are selected and lets model learning visit every state-action pair, which ensures convergence. Model learning and policy learning reinforce each other, so the algorithm converges to the optimal policy.

(2) Traditional reinforcement learning algorithms cannot efficiently balance exploration and exploitation of actions. To address this, an asynchronous policy iteration algorithm based on probabilistic estimation of the action-value function (Q-value function) is proposed. In the policy evaluation part, the Q-value function is modeled with a Gaussian-gamma (normal-gamma) distribution, and its posterior is computed from the prior and the observed data in order to assess the quality of the policy. In the policy improvement part, Myopic-VPI is applied to the posterior of the Q-value function to choose the best action, which balances exploration and exploitation. Finally, the algorithm updates asynchronously and favors computing the action values relevant to the current policy, which speeds up convergence.

(3) Traditional policy iteration algorithms cannot efficiently solve MDPs with continuous states and an unknown environment model. To address this, an online policy iteration algorithm based on Gaussian process temporal difference learning is proposed. The action-value function is modeled with a Gaussian process and the temporal difference equation, and Bayesian inference yields the posterior distribution over the value function space. During learning, the improved policy is evaluated promptly, in keeping with the online nature of the algorithm, so that the policy is improved while learning proceeds. To a certain extent, the proposed algorithm can complete reinforcement learning tasks in continuous state spaces and converges relatively quickly.
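The Dirichlet variance threshold mentioned in contribution (1) can be illustrated with a small sketch. The abstract does not give the exact criterion, so everything here is an assumption made for illustration: the function names, the choice to test the largest per-component variance, the uniform prior, and the threshold value are all hypothetical and are not taken from the thesis. The sketch keeps a Dirichlet posterior over the next-state probabilities of a single state-action pair and stops sampling once the posterior variance falls to the threshold.

```python
import numpy as np


def dirichlet_variance(alpha):
    """Per-component variance of a Dirichlet(alpha) distribution."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    return alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1.0))


def needs_more_samples(alpha, threshold):
    """Hypothetical stopping rule (an assumption, not the thesis's rule):
    keep learning this state-action pair while any next-state
    probability is still more uncertain than the threshold."""
    return bool(dirichlet_variance(alpha).max() > threshold)


def update_counts(alpha, next_state):
    """Bayesian update: an observed transition adds one to its count."""
    alpha = np.array(alpha, dtype=float)  # copy so the caller's array is untouched
    alpha[next_state] += 1.0
    return alpha


# Toy run for one state-action pair with three possible next states.
rng = np.random.default_rng(0)
true_p = np.array([0.7, 0.2, 0.1])  # made-up transition probabilities
alpha = np.ones(3)                  # uniform Dirichlet prior (assumed)
threshold = 1e-3                    # illustrative value only
n_obs = 0
while needs_more_samples(alpha, threshold):
    alpha = update_counts(alpha, rng.choice(3, p=true_p))
    n_obs += 1

posterior_mean = alpha / alpha.sum()  # point estimate of the transition model
```

Because each component's variance is at most 0.25 / (a0 + 1), where a0 is the sum of the counts, the loop is guaranteed to stop after a bounded number of observations. A smaller threshold buys a more accurate model at the cost of more samples, which is the trade-off the abstract describes.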
[Degree-granting institution]: Soochow University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP181

You Shuhua; Research on Policy Iteration Algorithms in Bayesian Reinforcement Learning [D]; Soochow University; 2016
