Research on Policy Gradient Methods Based on Functional Gradients

Published: 2019-04-29 17:02

[Abstract]: Reinforcement learning is an important research direction in machine learning. It aims to let an agent improve its policy through interaction with the environment so as to maximize the cumulative reward it receives. Classical reinforcement learning methods are mostly based on value functions, but value-function-based methods have difficulty handling tasks with continuous actions and suffer from "policy degradation". Policy search methods have therefore developed significantly in recent years. Policy gradient methods are an important class of policy search methods that update the policy using the gradient with respect to the policy parameters. In policy gradient methods the policy is usually represented by a linear model, so the system is constrained by the limited representational power of linear models. In supervised learning, functional gradients can be used to produce nonparametric models, and Boosting-type methods based on functional gradients have become one of the representative approaches to supervised learning. Functional gradients, however, have received little study in reinforcement learning. This thesis studies the use of functional gradients in policy gradient methods and makes the following main contributions. First, it designs PolicyBoost, a policy gradient method based on functional gradients that can learn combinations of complex models such as decision trees, avoiding the earlier need to hand-design linear features. Second, it proves the convergence of PolicyBoost under certain conditions. To address the overfitting that the theoretical analysis shows may occur, it introduces a baseline and builds a sample pool, which alleviates the overfitting problem. Finally, experiments on the classical reinforcement learning tasks Mountain Car and Acrobot, and on the challenging helicopter hover control task, verify that the proposed algorithm performs well and is stable.
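The general idea of a functional-gradient policy update can be illustrated with a toy sketch. The code below is not the thesis's PolicyBoost algorithm, whose details are not given in the abstract. It assumes a hypothetical one-step task (a state `s` drawn uniformly from [-1, 1], two actions, reward 1 when the action matches the sign of `s`), a policy of the form pi(a=1|s) = sigmoid(f(s)), regression stumps as weak learners, and the batch mean reward as baseline. The sample pool mentioned in the abstract is not modeled. All function names and parameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_stump(x, g, thresholds):
    """Least-squares regression stump: returns (threshold, left_value, right_value)."""
    best, best_sse = (0.0, g.mean(), g.mean()), np.inf
    for thr in thresholds:
        left = x <= thr
        if left.all() or not left.any():
            continue  # skip splits that leave one side empty
        lv, rv = g[left].mean(), g[~left].mean()
        sse = np.sum((g - np.where(left, lv, rv)) ** 2)
        if sse < best_sse:
            best, best_sse = (thr, lv, rv), sse
    return best

def potential(s, stumps):
    """Additive potential f(s): the sum of all stump outputs (zero if empty)."""
    f = np.zeros_like(s, dtype=float)
    for thr, lv, rv in stumps:
        f += np.where(s <= thr, lv, rv)
    return f

def train_policy(n_rounds=30, n_samples=500, eta=4.0, seed=0):
    """Toy functional-gradient ascent on a hypothetical one-step task."""
    rng = np.random.default_rng(seed)
    thresholds = np.linspace(-0.9, 0.9, 19)
    stumps = []
    for _ in range(n_rounds):
        s = rng.uniform(-1.0, 1.0, n_samples)          # sampled states
        p = sigmoid(potential(s, stumps))              # pi(a=1 | s)
        a = (rng.random(n_samples) < p).astype(float)  # sampled actions
        r = (a == (s > 0)).astype(float)               # assumed toy reward
        b = r.mean()                                   # baseline: batch mean reward
        # Pointwise gradient of (r - b) * log pi(a|s) with respect to f(s).
        g = (r - b) * (a - p)
        # Fit a weak learner to the gradient samples and add it to the ensemble.
        thr, lv, rv = fit_stump(s, g, thresholds)
        stumps.append((thr, eta * lv, eta * rv))
    return stumps

if __name__ == "__main__":
    stumps = train_policy()
    # Probability of choosing action 1 at a negative and a positive state.
    print(sigmoid(potential(np.array([-0.5, 0.5]), stumps)))
```

The policy here is never written as a fixed parameter vector over hand-designed features. Each round adds one stump fitted to the sampled gradient, so the model grows as an ensemble, which is the property the abstract attributes to functional-gradient methods.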
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP181

[Citation]: Hou Pengfei; Research on Policy Gradient Methods Based on Functional Gradients [D]; Nanjing University; 2017


Article ID: 2468376


Article link: http://www.sikaile.net/kejilunwen/zidonghuakongzhilunwen/2468376.html
