Option Algorithm Based on Continuous-Time Semi-Markov Decision Processes
  • Abstract

    For large-scale or complex stochastic dynamic programming systems, the "curse of dimensionality" and the "curse of modeling" can be addressed by exploiting the system's hierarchical structure or introducing hierarchical control, with the help of hierarchical reinforcement learning (HRL). HRL belongs to the family of sample-data-driven optimization methods and can effectively accelerate policy learning through spatial/temporal abstraction mechanisms. Among HRL methods, the Option approach decomposes the system's overall task into multiple subtasks to be learned and executed; its clear hierarchical structure makes it one of the representative HRL methods. Traditional Option algorithms, however, are built mainly on discrete-time semi-Markov decision processes (SMDPs) and the discounted performance criterion, and cannot be applied directly to continuous-time infinite-horizon problems. Therefore, within the continuous-time SMDP framework and its performance-potential theory, this paper combines the ideas of existing Option algorithms with the learning formulas of continuous-time SMDPs to establish a unified continuous-time Option hierarchical reinforcement learning model applicable to either the average or the discounted performance criterion, and gives the corresponding online learning and optimization algorithm. Finally, a robot garbage-collection system is used as a simulation example to demonstrate the effectiveness of this HRL algorithm for continuous-time infinite-horizon optimal control problems, and to show that, compared with continuous-time simulated-annealing Q-learning, it saves storage space and achieves higher optimization accuracy and faster optimization speed.
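
    The abstract describes a unified continuous-time Option model covering both the average and the discounted criteria, together with an online learning algorithm. As a rough illustration only, the sketch below shows a generic option-level Q-learning update for a continuous-time SMDP, switching between a discounted target (exponential decay over the sojourn time) and an average-reward target (subtracting the estimated average reward earned over the sojourn). The class name, the toy states and options, and the constants are illustrative assumptions and are not taken from the paper.

```python
import math
import random

# Rough, hypothetical sketch only: an option-level Q-learning update for a
# continuous-time SMDP that switches between a discounted target (beta > 0)
# and an average-reward target (beta = 0).  Class/variable names and the toy
# example below are illustrative assumptions, not the paper's algorithm.

class OptionQLearner:
    def __init__(self, states, options, alpha=0.1, beta=0.0):
        self.Q = {(s, o): 0.0 for s in states for o in options}
        self.options = list(options)
        self.alpha = alpha       # learning-rate (step size)
        self.beta = beta         # discount rate; beta = 0 -> average criterion
        self.total_reward = 0.0  # cumulative reward, for the average-rate estimate
        self.total_time = 0.0    # cumulative sojourn time
        self.rho = 0.0           # estimated average reward rate

    def greedy(self, s):
        """Option with the largest Q-value in state s."""
        return max(self.options, key=lambda o: self.Q[(s, o)])

    def update(self, s, o, reward, tau, s_next):
        """One update after option o, started in s, terminates in s_next.

        reward: reward accumulated while the option ran (already discounted
                within the option if beta > 0)
        tau:    total sojourn time of the option (continuous time)
        """
        best_next = max(self.Q[(s_next, o2)] for o2 in self.options)
        if self.beta > 0.0:
            # Discounted criterion: future values decay by exp(-beta * tau).
            target = reward + math.exp(-self.beta * tau) * best_next
        else:
            # Average criterion: subtract the reward that would be earned at
            # the average rate rho over the sojourn, as in average-reward
            # SMDP Q-learning.
            target = reward - self.rho * tau + best_next
            self.total_reward += reward
            self.total_time += tau
            self.rho = self.total_reward / self.total_time
        self.Q[(s, o)] += self.alpha * (target - self.Q[(s, o)])

# Minimal usage with made-up states, options, and random transitions.
if __name__ == "__main__":
    learner = OptionQLearner(states=["s0", "s1"], options=["collect", "recharge"])
    s = "s0"
    for _ in range(1000):
        explore = random.random() < 0.1
        o = random.choice(learner.options) if explore else learner.greedy(s)
        tau = random.uniform(0.5, 2.0)                 # option duration
        reward = (1.0 if o == "collect" else 0.2) * tau
        s_next = random.choice(["s0", "s1"])
        learner.update(s, o, reward, tau, s_next)
        s = s_next
    print(learner.Q)
```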

  • Authors

    唐昊  张晓艳  韩江洪  周雷 

  • Affiliations

    School of Computer and Information, Hefei University of Technology, Hefei 230009; School of Electrical Engineering and Automation, Hefei University of Technology, Hefei 230009 / School of Computer and Information, Hefei University of Technology, Hefei 230009

  • Issue

    2014, Issue 9 (indexed in ISTIC, EI, PKU)

  • Keywords

    continuous-time semi-Markov decision processes; hierarchical reinforcement learning; Q-learning
