

Research on Learning Methods for Agent-Based Multi-Robot Systems (基于智能體的多機(jī)器人系統(tǒng)學(xué)習(xí)方法研究)

【摘要 / Abstract】: Compared with a single robot, a multi-robot system (MRS) offers many advantages and good prospects for development, and has become a research hotspot in robotics. A multi-robot system is a complex dynamic system: when designing robot control strategies, it is usually impossible to specify all optimal behaviors for every robot in advance. Behavior-based methods allow a multi-robot system to exhibit intelligent characteristics and accomplish relatively complex tasks, and have greatly promoted the development of such systems. However, behavior-based methods alone cannot fully adapt to a continuously changing environment and the requirements of different tasks. Giving a multi-robot system the ability to learn autonomously, while avoiding the limitations of any single learning method, and thereby continuously improving the coordination and cooperation among individual robots, is therefore an important development direction. Combining different machine learning methods with behavior-based multi-robot systems is thus of clear research significance. This thesis studies multi-robot systems from the perspective of agent theory; its main contributions are as follows.

First, the theory of agents and multi-agent systems is studied, several architectures for single-robot and multi-robot systems are analyzed, and a research approach that combines behavior-based and learning-based methods to explore multi-robot cooperation is proposed; behavior-based robot formation and robot soccer systems are also designed. Among the many research topics in multi-robot systems, learning ability occupies an important position. Behavior-based methods are robust and flexible, and enable robots to accomplish tasks better than other methods. Building on the behavior-based approach and combining it with different machine learning methods, behavior-based multi-robot systems are designed for the two main application platforms considered in this thesis, robot formation and robot soccer, on top of the robot simulation packages Mission Lab and Teambots, so that the algorithms proposed in this thesis can be verified.

Second, the particle swarm optimization (PSO) algorithm and case-based reasoning (CBR) are studied, and a hybrid method that fuses PSO with CBR is proposed to exploit the respective strengths of the two techniques. Although traditional behavior-based methods have many advantages, their fixed behavior parameters are difficult to adapt to a complex environment. CBR, an important technique in artificial intelligence, is easy to retrieve from and store into, which makes it well suited to providing parameters for different behaviors; however, traditional CBR lacks an effective learning capability. This thesis therefore uses PSO as the optimizer of CBR, so that CBR continuously obtains better cases, while PSO in turn obtains a better initial population from CBR. Compared with the genetic algorithm (GA), PSO is also a swarm-intelligence method, but it has a simpler structure, better real-time performance, and is well suited to optimizing continuous problems; problems that a genetic algorithm can solve can generally also be solved by PSO. Combining PSO with CBR not only overcomes the shortcomings of CBR but also satisfies the requirements of real-time operation and continuous optimization. Using behavior-based robot formation as the test platform, the effectiveness of the method is verified against the standard PSO algorithm.

Third, the basic theory of reinforcement learning and the classical Q-learning method are studied. To address the shortcomings of traditional Q-learning when applied to multi-robot systems, namely the lack of information exchange and the structural credit assignment problem, an improved Q-learning algorithm using experience sharing and filtering techniques is proposed, which improves both learning performance and learning efficiency. The theoretical basis of Q-learning is the Markov decision process; applying Q-learning directly to a multi-robot system violates this premise, yet Q-learning is still widely used in robot learning because it is computationally simple and its state-action space is small. Compared with multi-agent reinforcement learning methods, traditional Q-learning lacks information exchange with other agents. This thesis therefore adopts experience sharing: each agent shares the Q-value information of the other agents, a gradual learning scheme is used during training, and an ε-greedy policy selects the learning experience of other agents with probability 1-ε. To accelerate the convergence of Q-learning, instead of simply assigning the team reward uniformly to every agent, Kalman filtering is applied to the distribution of the reward signal: the received reward is treated as the combination of the true reward and a noise signal, which solves the structural credit assignment problem to a certain extent. Using robot soccer as the test platform, the effectiveness of the method is verified against traditional Q-learning.

Finally, several typical multi-agent reinforcement learning algorithms, Minimax-Q, Nash-Q, FFQ, and CE-Q, as well as learning methods based on regret theory, are studied. To address the slow convergence of the traditional CE-Q algorithm, which stems from the lack of an effective action exploration strategy, a new CE-Q learning algorithm based on a no-regret strategy is proposed. Markov game theory provides a solid theoretical foundation for multi-agent reinforcement learning, and Nash equilibrium plays an important role in it, so these algorithms are also called equilibrium-based learning algorithms. Compared with computing the Nash equilibrium in Nash-Q learning, computing the correlated equilibrium in CE-Q is easier, so CE-Q has better application prospects; however, traditional CE-Q lacks an effective exploration strategy, which limits its convergence speed. Inspired by no-regret theory, if every agent adopts minimization of its average regret as its exploration strategy, the joint behavior of all agents tends to converge to a set of points with no regret, known as the set of coarse correlated equilibria; analysis further shows that both Nash equilibria and correlated equilibria are, in essence, coarse correlated equilibria. A new CE-Q learning algorithm that reduces the average regret is therefore proposed to speed up the convergence of CE-Q learning. Using robot soccer as the test platform, the effectiveness of the method is verified against the traditional CE-Q learning algorithm.
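To make the coupling of PSO and CBR described above concrete, the following is a minimal Python sketch of the idea of seeding a particle swarm from a case base and writing the optimized behavior parameters back as a new case. The case representation, the toy fitness function, and all parameter values are illustrative assumptions, not the implementation used in the thesis.

```python
import random

# Hypothetical case base: each case stores a behavior-parameter vector and its fitness.
case_base = []  # list of (params, fitness) tuples

def fitness(params):
    """Placeholder objective; in the thesis this would be the formation-keeping
    performance of a behavior-based robot team using these parameters."""
    return -sum((p - 0.5) ** 2 for p in params)

def retrieve_initial_swarm(n_particles, dim):
    """CBR -> PSO: seed the swarm with the best stored cases, fill the rest randomly."""
    seeds = [list(p) for p, _ in sorted(case_base, key=lambda c: -c[1])[:n_particles]]
    while len(seeds) < n_particles:
        seeds.append([random.random() for _ in range(dim)])
    return seeds

def pso_optimize(dim=4, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    x = retrieve_initial_swarm(n_particles, dim)
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [list(p) for p in x]
    pbest_f = [fitness(p) for p in x]
    g = max(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = list(pbest[g]), pbest_f[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (pbest[i][d] - x[i][d])
                           + c2 * r2 * (gbest[d] - x[i][d]))
                x[i][d] += v[i][d]
            f = fitness(x[i])
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = list(x[i]), f
                if f > gbest_f:
                    gbest, gbest_f = list(x[i]), f
    case_base.append((gbest, gbest_f))  # PSO -> CBR: retain the improved case
    return gbest, gbest_f
```

Repeated calls to pso_optimize() would gradually enrich the case base while each run starts from the best experience retrieved so far, which is the mutual-benefit loop the abstract describes.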
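The experience-sharing and reward-filtering ideas of the improved Q-learning can likewise be sketched. In the toy code below, a tabular Q-learning agent blends teammates' Q-values during greedy action selection and passes the received team reward through a scalar Kalman filter before its update; the class names, the blending rule, and the filter parameters are simplifying assumptions for illustration, not the thesis's exact algorithm.

```python
import random
from collections import defaultdict

class ScalarKalman:
    """1-D Kalman filter: treat the received team reward as true reward plus noise."""
    def __init__(self, q=1e-2, r=1.0):
        self.x, self.p = 0.0, 1.0   # estimate and its covariance
        self.q, self.r = q, r       # process and measurement noise
    def update(self, z):
        self.p += self.q                    # predict
        k = self.p / (self.p + self.r)      # Kalman gain
        self.x += k * (z - self.x)          # correct with the received reward z
        self.p *= (1.0 - k)
        return self.x

class SharingQAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.9, eps=0.2):
        self.q = defaultdict(float)         # Q[(state, action)]
        self.actions = actions
        self.alpha, self.gamma, self.eps = alpha, gamma, eps
        self.reward_filter = ScalarKalman()

    def act(self, state, teammates=()):
        # epsilon-greedy exploration; when exploiting, the agent may also draw on
        # teammates' shared Q-values (a simplified reading of the thesis's scheme).
        if random.random() < self.eps:
            return random.choice(self.actions)
        q = self.q
        if teammates and random.random() < 1.0 - self.eps:
            q = {(state, a): (self.q[(state, a)]
                              + sum(t.q[(state, a)] for t in teammates))
                             / (1 + len(teammates))
                 for a in self.actions}
        return max(self.actions, key=lambda a: q[(state, a)])

    def learn(self, s, a, team_reward, s_next):
        r = self.reward_filter.update(team_reward)   # filtered individual reward
        best_next = max(self.q[(s_next, b)] for b in self.actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])
```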
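Finally, the no-regret exploration used to accelerate CE-Q can be illustrated with standard regret matching, under which the agents' empirical joint play converges to the set of coarse correlated equilibria. The class below is a generic sketch of such an exploration policy; how the counterfactual payoffs are obtained from the CE-Q value tables is left abstract and is an assumption of this example.

```python
import random

class RegretMatchingPolicy:
    """No-regret exploration: play each action with probability proportional to its
    positive cumulative regret (regret matching)."""
    def __init__(self, n_actions):
        self.n = n_actions
        self.regret = [0.0] * n_actions

    def select(self):
        positive = [max(r, 0.0) for r in self.regret]
        total = sum(positive)
        if total <= 0.0:
            return random.randrange(self.n)      # no regret yet: explore uniformly
        pick, acc = random.random() * total, 0.0
        for a, w in enumerate(positive):
            acc += w
            if pick <= acc:
                return a
        return self.n - 1

    def update(self, played, payoffs):
        """payoffs[a] = payoff the agent would have received had it played a while
        the other agents kept their actions (counterfactual payoffs)."""
        for a in range(self.n):
            self.regret[a] += payoffs[a] - payoffs[played]
```

In a CE-Q learner, this policy could replace ε-greedy exploration during training, with the counterfactual payoffs read from the agent's current Q-values for the joint state, so that the average regret of every agent shrinks as learning proceeds.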
【Degree-granting institution】: Harbin Institute of Technology
【Degree level】: Doctoral (PhD)
【Year awarded】: 2016
【CLC classification】: TP242

