Policy Optimization in Reinforcement Learning: A Tale of Preconditioning and Regularization

Room: M105, Bldg: Muscarelle Center, 1000 River Road, Teaneck, New Jersey, United States, 07666, Virtual: https://events.vtools.ieee.org/m/304875

Policy optimization, which learns the policy of interest by maximizing the value function via large-scale optimization techniques, lies at the heart of modern reinforcement learning (RL). Beyond value maximization, other practical considerations commonly arise, including the need to encourage exploration and to ensure certain structural properties of the learned policy due to safety, resource, and operational constraints. These considerations can often be accounted for by regularized RL, which augments the target value function with a structure-promoting regularization term, such as Shannon entropy, Tsallis entropy, or a log-barrier function. Focusing on an infinite-horizon discounted Markov decision process, this talk first shows that entropy-regularized natural policy gradient methods converge globally at a linear rate that is nearly independent of the dimension of the state-action space, whereas the vanilla softmax policy gradient method may take exponential time to converge. Next, a generalized policy mirror descent algorithm is proposed to accommodate a general class of convex regularizers beyond Shannon entropy, even when the regularizer lacks strong convexity and smoothness. Time permitting, we will discuss how these ideas can be leveraged to solve zero-sum Markov games. Our results accommodate a wide range of learning rates and shed light on the role of regularization in enabling fast convergence in RL.

Co-sponsored by: North Jersey Section

Speaker(s): Dr. Yuejie Chi