You are here

Policy Iteration

Policy iteration is the process of iteratively improving a policy, $\pi_t$, using approximations of a state-value function $V^{\pi_t}$. At each iteration $t$, the approximation from the previous step is used to improve ($\overset{I}{\rightarrow}$) the policy, which in turn is used to update ($\overset{E}{\rightarrow}$) the state-value approximation for the next iteration, $V^{\pi_{t+1}}$. Policy iteration ends when the policy becomes stable ($\pi^*$). This is illustrated as follows:
$$
V^{\pi_0} \overset{I}{\rightarrow} \pi_1 \overset{E}{\rightarrow} V^{\pi_1} \overset{I}{\rightarrow} \pi_2 \overset{E}{\rightarrow}... \overset{I}{\rightarrow} \pi^*
$$

Play Tic-tac-toe with Arthur Cayley! Part Two: Expansion

Marc's picture
Submitted by Marc on Wed, 02/17/2016 - 23:39

In part 1 of this series, the Tic-tac-toe reinforcement learning task was expressed as a Combinatorial Group with the hypothesis that the expansion of the group into a Cayley Graph could be used to learn its associated game tree. In this instalment, the expansion of the group into a Caley Graph will be examined in a bit more detail. Initially, the Tic-tac-toe group will be set aside in favour of a simpler domain which will offer a more compact and pedagogical representation. However, the expansion of the Tic-tac-toe group should follow the same process, this article will circle back to the Tic-tac-toe domain to highlight the equivalences which should ensure that this is so.

Subscribe to RSS - Policy Iteration