In part 1 of this series, the Tic-tac-toe reinforcement learning task was expressed as a Combinatorial Group with the hypothesis that the expansion of the group into a Cayley Graph could be used to learn its associated game tree. In this instalment, the expansion of the group into a Caley Graph will be examined in a bit more detail. Initially, the Tic-tac-toe group will be set aside in favour of a simpler domain which will offer a more compact and pedagogical representation. However, the expansion of the Tic-tac-toe group should follow the same process, this article will circle back to the Tic-tac-toe domain to highlight the equivalences which should ensure that this is so.

Although Tic-tac-toe is a relatively simple problem, its state space makes it intractable for a "back of the napkin" illustration. Therefore, the random walk task proposed by Sutton and Barto (Sutton and Barto, 1998) will be used to discuss the formal expansion into a Cayley Graph. The random walk example consists of a small Markov process with five non-terminal states: $A$, $B$, $C$, $D$, and $E$. In each of the five non-terminal states, two actions with equal probability are possible: move left ($l$), and move right ($r$). An automata describing the random walk domain is illustrated in Figure 1.

**Figure 1**: Diagram of a Markov process for generating random walks on five states plus two terminal states.

Let $\langle R|\cdot\rangle$ represent the random walk group, it can be expressed as a combinatorial group with a generator set $R_G = \{l, r\}$ and associated constraint relations $R_D$. The $l$ and $r$ generators are inverses, therefore the group will have the following constraint: $R_D = \{ l \cdot r = e \}$, where $e$ is the identity element. In light of this constraint, the group expression can be simplified; let $a=r$, and thus $a^{-1} = l$, $R$ can now be expressed as the free group $\langle a | \rangle$. This expresses the composition of all the terms that comprise the group $R$ (e.g.: $aa^{-1}aa^{-1}a$ = a, $a^{-1}a^{-1}a^{-1} = a^{-3}$, $aaa = a^3$...). Given $C$ is the initial state of the random walk, then the following equivalences hold for this group: $C=e$, $D = C \cdot a$, and $A = C \cdot a^{-2}$.

Because the random walk problem has a terminal state (i.e. the task is episodic), two additional constraints are required for a proper group representation to ensure that the random walk does not continue indefinitely:

$$a^{3(-1)^n}\cdot i = a^{3(-1)^n}, \forall i \in R_G \land \forall n \in \mathbb{Z^+}$$

and

$$a^{3} = a^{-3} = F$$

It should be pointed out that although there are an infinite number of random walks that can be taken starting from $C$ to reach the terminal states, the group $R$ is nonetheless a finite group when terms are reduced to their simplest form (i.e. occurrences of an element of the generator set followed by its inverse are elided from the term). The complete set of terms in the random walk group are:

$$

\begin{equation}

R = \{ e, a, a^{-1}, a^2, a^{-2}, a^{3}, a^{-3} \} = \{ C, D, B, E, A, F \}

\end{equation}

$$

The Cayley Graph $\Gamma(R,R_G)$ of the group $R$, illustrated in Figure 2, is constructed as follows:

- Construct the vertex set: $V(\Gamma) = \{ s ~|~ \forall s \in R \}$
- Construct the edge set and partition it into two subsets with colour labels:

$E(\Gamma) = E_\text{red}(\Gamma) \cap E_\text{blue}(\Gamma) = \{ (s_i, s_j) ~|~ a\cdot{s_i} = s_j \} \cap \{ (s_i, s_j) ~|~ a^{-1}\cdot{s_i} = s_j \}$

**Figure 2**: Cayley Graph of the Random Walk group $\langle R | \cdot \rangle$

Note that the set $R$ is the set of all states in the task including the terminal state. In the environment-agent model of reinforcement learning, this is expressed as $S^+$. Additionally, the edge set of the Cayley Graph $E(\Gamma)$ is equivalent to the set of actions $\mathscr{A}(\pi)$ available to a given policy. This graph can therefore serve as the basis of a model for estimating a state-value function which can be improved using a Dynamic Programming implementation of Generalized Policy Iteration. However, some additional information must first be attached to the graph. Let $\mathscr{R}(s,s',a)$ be the function which defines the expected reward for taking action $a$ in state $s$ leading to state $s'$:

$$

\mathscr{R}(s, s', a) = \left\{

\begin{array}{lr}

0 & : s' \neq F \lor a \in E_\text{blue}(\Gamma) \\

1 & : s' = F \land a \in E_\text{red}(\Gamma)

\end{array}

\right .

$$

This will associate a zero weight to all the edges in $\Gamma(R,R_G)$ with the exception of the red edge connecting $E$ to $F$. Additionally, initial value estimations must be assigned to each of the vertices in the graph. All values will initially be set to zero. Given an $\epsilon$-greedy policy, $\pi$, the policy evaluation algorithm described in Figure 3 will be used to get an initial approximation of the value function $V^{\pi}(R)$. The value $\mathscr{P}_{ss'}^{a}$ represents the probability that taking action $a$ in state $s$ will yield state $s'$. For the random walk problem, this is a certainty (probabilty is $1.0$). Therefore the actual value estimation update is calculated as follows:

$$

V^{\pi}(s) \leftarrow \sum_{s'} \mathscr{R}(s, s', \pi(s)) + \gamma V^{\pi}(s')

$$

where $\pi(s)$ will choose either $a$ or $a^{-1}$ with equal probability. Initially, the value estimation will remain zero with the possible exception of $V(E)$ which will have a value of 1 if the policy chooses action $a$ in this pass; which is a 50% probability.

- Repeat
- $\Delta \leftarrow 0$
- For each $s \in R$:
- $t \leftarrow V^{\pi}(s)$
- $V^{\pi}(s) \leftarrow \sum_{s'}{\mathscr{P}_{ss'}^{\pi(s)}[ \mathscr{R}(s,s',\pi(s)) + \gamma V^{\pi}(s') ]}$
- $\Delta \leftarrow \text{max}(\Delta, |t - V^{\pi}(s)|)$

until $\Delta$ < $\theta$ (a small positive number)

**Figure 3**: The Policy Evaluation algorithm

With the updated value estimation, the policy improvement algorithm, described in Figure 4, will update the policy in relation to the new value estimation. As in the previous step, $\mathscr{P}_{ss'}^{a}$ will always be 1.0, therefore the policy update step will be:

$$

\pi(s) \leftarrow \text{arg}~\text{max}_a \sum_{s'}{\mathscr{R}(s, s', a) + \gamma V^{\pi(s')}}

$$

Following the first policy improvement, the policy will randomly choose either $a$ or $a^{-1}$ in all states with a probability of 0.5. The exception is in state $E$ where the policy will chose $a$ with a probability of $1-\epsilon$ (since an $\epsilon$-greedy policy will select an action randomly with a probability of $\epsilon$). From here, it should be fairly easy to verify, by hand calculating the value-estimation and policy, that this converges toward an optimal policy following a large number of iterations of policy evaluation and improvement. The final value-estimation will assign the values $\frac{1}{6}, \frac{2}{6}, \frac{3}{6}, \frac{4}{6}$ and $\frac{5}{6}$ to states $A, B, C, D$, and $E$ respectively. Therefore an $\epsilon$-greedy policy will almost always elect to walk toward $E$ to reach the final destination; which yields a higher reward.

- $\mathit{stable} \leftarrow \text{true}$
- For each $s \in R$:
- $b \leftarrow \pi(s)$
- $\pi(s) \leftarrow \text{arg}~\text{max}_a \sum_{s'}{\mathscr{P}_{ss'}^{a}[ \mathscr{R}(s,s',a) + \gamma V^{\pi}(s')]}, a \in R_G$
- If $b \neq \pi(s)$, then $\mathit{stable} \leftarrow \text{false}$

- If $\mathit{stable}$, then stop; else do PolicyEvaluation

**Figure 4**: The Policy Improvement algorithm

This example illustrates how defining a reinforcement learning task as an combinatorial group yields a suitable model for learning an optimal policy using Dynamic Programming and Generalized Policy Iteration. The same procedure should yield similar results for the Tic-tac-toe domain, although with much greater complexity (it won't be feasible to calculate this by hand). There are a few caveats: 1) there will be multiple possible initial states (depending on whether or not the agent plays first) as opposed to the single initial state in the random walk task described in this article, and 2) the probability value $\mathscr{P}_{ss'}^{a}$ will not be zero because the resulting game tree must account for the various possible moves by the opponent. Aside from this the procedure to define the task should remain the same. Additionally, it should be possible to extend this to even more complex domains if the requirement of constructing the Cayley Graph is relaxed. A more abstract group representation could be used with Monte Carlo methods or Temporal Difference learning which do not require a well-defined model of the environment. These ideas will be explored in future articles.

- Log in to post comments