Thoughts of mind and machine. - Cayley Graph
https://didactronic.vociferousvoid.org/main/taxonomy/term/3
A vertex-transitive graph which encodes the abstract structure of an algebraic group.
enPlay Tic-tac-toe with Arthur Cayley! Part Two: Expansion
https://didactronic.vociferousvoid.org/main/play-tic-tac-toe-with-arthur-cayley-part2-expansion
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><div class="tex2jax"> <p>In <a href="/main/play-tic-tac-toe-with-arthur-cayley">part 1</a> of this series, the Tic-tac-toe reinforcement learning task was expressed as a <a href="/main/lexicon#Combinatorial_Group" title="An algebraic group which is defined by all the possible expressions (e.g. words or terms) that can be built from a generator set. All terms will be considered distinct unless their equality follows from the group axioms (closure, associativity, identity, invertibility). See Combinatorial Group Theory." class="lexicon-term">Combinatorial Group</a> with the hypothesis that the expansion of the group into a Cayley Graph could be used to learn its associated game tree. In this instalment, the expansion of the group into a Caley Graph will be examined in a bit more detail. Initially, the Tic-tac-toe group will be set aside in favour of a simpler domain which will offer a more compact and pedagogical representation. However, the expansion of the Tic-tac-toe group should follow the same process, this article will circle back to the Tic-tac-toe domain to highlight the equivalences which should ensure that this is so.</p>
<!--break--><p>
Although Tic-tac-toe is a relatively simple problem, its state space makes it intractable for a "back of the napkin" illustration. Therefore, the random walk task proposed by Sutton and Barto (<a href="bibliography#Sutton-Barto:1998">Sutton and Barto, 1998</a>) will be used to discuss the formal expansion into a Cayley Graph. The random walk example consists of a small Markov process with five non-terminal states: $A$, $B$, $C$, $D$, and $E$. In each of the five non-terminal states, two actions with equal probability are possible: move left ($l$), and move right ($r$). An automata describing the random walk domain is illustrated in <a href="#Figure1:RandomWalk">Figure 1</a>.</p>
<p><span><br /><a name="Figure1:RandomWalk" id="Figure1:RandomWalk"></a><br /><img src="/main/sites/default/files/RandomWalkAutomata.png" width="542" height="98" alt="Diagram of a Markov process for generating random walks on five states plus one terminal states." title="A small Markov process for generating random walks." /><br /><strong>Figure 1</strong>: Diagram of a Markov process for generating random walks on five states plus two terminal states.<br /></span></p>
<p>Let $\langle R|\cdot\rangle$ represent the random walk group, it can be expressed as a combinatorial group with a generator set $R_G = \{l, r\}$ and associated constraint relations $R_D$. The $l$ and $r$ generators are inverses, therefore the group will have the following constraint: $R_D = \{ l \cdot r = e \}$, where $e$ is the identity element. In light of this constraint, the group expression can be simplified; let $a=r$, and thus $a^{-1} = l$, $R$ can now be expressed as the free group $\langle a | \rangle$. This expresses the composition of all the terms that comprise the group $R$ (e.g.: $aa^{-1}aa^{-1}a$ = a, $a^{-1}a^{-1}a^{-1} = a^{-3}$, $aaa = a^3$...). Given $C$ is the initial state of the random walk, then the following equivalences hold for this group: $C=e$, $D = C \cdot a$, and $A = C \cdot a^{-2}$.</p>
<p>Because the random walk problem has a terminal state (i.e. the task is episodic), two additional constraints are required for a proper group representation to ensure that the random walk does not continue indefinitely:<br />
$$a^{3(-1)^n}\cdot i = a^{3(-1)^n}, \forall i \in R_G \land \forall n \in \mathbb{Z^+}$$<br />
and<br />
$$a^{3} = a^{-3} = F$$<br />
It should be pointed out that although there are an infinite number of random walks that can be taken starting from $C$ to reach the terminal states, the group $R$ is nonetheless a finite group when terms are reduced to their simplest form (i.e. occurrences of an element of the generator set followed by its inverse are elided from the term). The complete set of terms in the random walk group are:</p>
<p>$$<br />
\begin{equation}<br />
R = \{ e, a, a^{-1}, a^2, a^{-2}, a^{3}, a^{-3} \} = \{ C, D, B, E, A, F \}<br />
\end{equation}<br />
$$</p>
<p>The Cayley Graph $\Gamma(R,R_G)$ of the group $R$, illustrated in <a href="#Figure2:RandomWalk-CayleyGraph">Figure 2</a>, is constructed as follows:</p>
<ol><li>Construct the vertex set: $V(\Gamma) = \{ s ~|~ \forall s \in R \}$</li>
<li>Construct the edge set and partition it into two subsets with colour labels:<br />
$E(\Gamma) = E_\text{red}(\Gamma) \cap E_\text{blue}(\Gamma) = \{ (s_i, s_j) ~|~ a\cdot{s_i} = s_j \} \cap \{ (s_i, s_j) ~|~ a^{-1}\cdot{s_i} = s_j \}$</li>
</ol><p><span><br /><a name="Figure2:RandomWalk-CayleyGraph" id="Figure2:RandomWalk-CayleyGraph"></a><br /><img src="/main/sites/default/files/RandomWalk-CayleyGraph.png" width="522" height="179" alt="Cayley Graph of the random walk group" title="Cayley Graph expansion of the random walk group." /><br /><strong>Figure 2</strong>: Cayley Graph of the Random Walk group $\langle R | \cdot \rangle$<br /></span></p>
<p>Note that the set $R$ is the set of all states in the task including the terminal state. In the environment-agent model of reinforcement learning, this is expressed as $S^+$. Additionally, the edge set of the Cayley Graph $E(\Gamma)$ is equivalent to the set of actions $\mathscr{A}(\pi)$ available to a given policy. This graph can therefore serve as the basis of a model for estimating a state-value function which can be improved using a <a href="/main/lexicon#Dynamic_Programming" title="Dynamic Programming (DP) refers to a collection of algorithms which, given a perfect model of an environment, can compute optimal policies for a Markov Decision Process. The classical DP algorithms are of limited use due to their assumption of a perfect model. " class="lexicon-term">Dynamic Programming</a> implementation of Generalized <a href="/main/lexicon#Policy_Iteration" title="Policy iteration is the process of iteratively improving a policy, $\pi_t$, using approximations of a state-value function $V^{\pi_t}$. At each iteration $t$, the approximation from the previous step is used to improve ($\overset{I}{\rightarrow}$) the policy, which in turn is used to update ($\overset{E}{\rightarrow}$) the state-value approximation for the next iteration, $V^{\pi_{t+1}}$. Policy iteration ends when the policy becomes stable ($\pi^*$). This is illustrated as follows:
$$
V^{\pi_0} \overset{I}{\rightarrow} \pi_1 \overset{E}{\rightarrow} V^{\pi_1} \overset{I}{\rightarrow} \pi_2 \overset{E}{\rightarrow}... \overset{I}{\rightarrow} \pi^*
$$" class="lexicon-term">Policy Iteration</a>. However, some additional information must first be attached to the graph. Let $\mathscr{R}(s,s',a)$ be the function which defines the expected reward for taking action $a$ in state $s$ leading to state $s'$:<br />
$$<br />
\mathscr{R}(s, s', a) = \left\{<br />
\begin{array}{lr}<br />
0 & : s' \neq F \lor a \in E_\text{blue}(\Gamma) \\<br />
1 & : s' = F \land a \in E_\text{red}(\Gamma)<br />
\end{array}<br />
\right .<br />
$$<br />
This will associate a zero weight to all the edges in $\Gamma(R,R_G)$ with the exception of the red edge connecting $E$ to $F$. Additionally, initial value estimations must be assigned to each of the vertices in the graph. All values will initially be set to zero. Given an $\epsilon$-greedy policy, $\pi$, the policy evaluation algorithm described in <a href="#alg:PolicyEvaluation">Figure 3</a> will be used to get an initial approximation of the value function $V^{\pi}(R)$. The value $\mathscr{P}_{ss'}^{a}$ represents the probability that taking action $a$ in state $s$ will yield state $s'$. For the random walk problem, this is a certainty (probabilty is $1.0$). Therefore the actual value estimation update is calculated as follows:<br />
$$<br />
V^{\pi}(s) \leftarrow \sum_{s'} \mathscr{R}(s, s', \pi(s)) + \gamma V^{\pi}(s')<br />
$$<br />
where $\pi(s)$ will choose either $a$ or $a^{-1}$ with equal probability. Initially, the value estimation will remain zero with the possible exception of $V(E)$ which will have a value of 1 if the policy chooses action $a$ in this pass; which is a 50% probability.<br /><span><br /><a name="alg:PolicyEvaluation" id="alg:PolicyEvaluation"></a></span></p>
<ul><li>Repeat
<ul><li>$\Delta \leftarrow 0$</li>
<li>For each $s \in R$:
<ul><li>$t \leftarrow V^{\pi}(s)$</li>
<li>$V^{\pi}(s) \leftarrow \sum_{s'}{\mathscr{P}_{ss'}^{\pi(s)}[ \mathscr{R}(s,s',\pi(s)) + \gamma V^{\pi}(s') ]}$</li>
<li>$\Delta \leftarrow \text{max}(\Delta, |t - V^{\pi}(s)|)$</li>
</ul></li>
</ul><p> until $\Delta$ < $\theta$ (a small positive number)
</p></li>
</ul><p><strong>Figure 3</strong>: The Policy Evaluation algorithm<br /></p>
<p>With the updated value estimation, the policy improvement algorithm, described in <a href="#alg:PolicyImprovement">Figure 4</a>, will update the policy in relation to the new value estimation. As in the previous step, $\mathscr{P}_{ss'}^{a}$ will always be 1.0, therefore the policy update step will be:<br />
$$<br />
\pi(s) \leftarrow \text{arg}~\text{max}_a \sum_{s'}{\mathscr{R}(s, s', a) + \gamma V^{\pi(s')}}<br />
$$<br />
Following the first policy improvement, the policy will randomly choose either $a$ or $a^{-1}$ in all states with a probability of 0.5. The exception is in state $E$ where the policy will chose $a$ with a probability of $1-\epsilon$ (since an $\epsilon$-greedy policy will select an action randomly with a probability of $\epsilon$). From here, it should be fairly easy to verify, by hand calculating the value-estimation and policy, that this converges toward an optimal policy following a large number of iterations of policy evaluation and improvement. The final value-estimation will assign the values $\frac{1}{6}, \frac{2}{6}, \frac{3}{6}, \frac{4}{6}$ and $\frac{5}{6}$ to states $A, B, C, D$, and $E$ respectively. Therefore an $\epsilon$-greedy policy will almost always elect to walk toward $E$ to reach the final destination; which yields a higher reward.</p>
<p><span><br /><a href="alg:PolicyImprovement"></a></span></p>
<ul><li>$\mathit{stable} \leftarrow \text{true}$</li>
<li>For each $s \in R$:
<ul><li>$b \leftarrow \pi(s)$</li>
<li>$\pi(s) \leftarrow \text{arg}~\text{max}_a \sum_{s'}{\mathscr{P}_{ss'}^{a}[ \mathscr{R}(s,s',a) + \gamma V^{\pi}(s')]}, a \in R_G$</li>
<li>If $b \neq \pi(s)$, then $\mathit{stable} \leftarrow \text{false}$</li>
</ul></li><li>If $\mathit{stable}$, then stop; else do PolicyEvaluation</li>
</ul><p><strong>Figure 4</strong>: The Policy Improvement algorithm<br /></p>
<p>This example illustrates how defining a reinforcement learning task as an combinatorial group yields a suitable model for learning an optimal policy using <a href="/main/lexicon#Dynamic_Programming" title="Dynamic Programming (DP) refers to a collection of algorithms which, given a perfect model of an environment, can compute optimal policies for a Markov Decision Process. The classical DP algorithms are of limited use due to their assumption of a perfect model. " class="lexicon-term">Dynamic Programming</a> and Generalized <a href="/main/lexicon#Policy_Iteration" title="Policy iteration is the process of iteratively improving a policy, $\pi_t$, using approximations of a state-value function $V^{\pi_t}$. At each iteration $t$, the approximation from the previous step is used to improve ($\overset{I}{\rightarrow}$) the policy, which in turn is used to update ($\overset{E}{\rightarrow}$) the state-value approximation for the next iteration, $V^{\pi_{t+1}}$. Policy iteration ends when the policy becomes stable ($\pi^*$). This is illustrated as follows:
$$
V^{\pi_0} \overset{I}{\rightarrow} \pi_1 \overset{E}{\rightarrow} V^{\pi_1} \overset{I}{\rightarrow} \pi_2 \overset{E}{\rightarrow}... \overset{I}{\rightarrow} \pi^*
$$" class="lexicon-term">Policy Iteration</a>. The same procedure should yield similar results for the Tic-tac-toe domain, although with much greater complexity (it won't be feasible to calculate this by hand). There are a few caveats: 1) there will be multiple possible initial states (depending on whether or not the agent plays first) as opposed to the single initial state in the random walk task described in this article, and 2) the probability value $\mathscr{P}_{ss'}^{a}$ will not be zero because the resulting game tree must account for the various possible moves by the opponent. Aside from this the procedure to define the task should remain the same. Additionally, it should be possible to extend this to even more complex domains if the requirement of constructing the Cayley Graph is relaxed. A more abstract group representation could be used with Monte Carlo methods or Temporal Difference learning which do not require a well-defined model of the environment. These ideas will be explored in future articles.</p>
</div></div></div></div><div class="field field-name-field-tags field-type-taxonomy-term-reference field-label-above clearfix"><h3 class="field-label">Tags: </h3><ul class="links"><li class="taxonomy-term-reference-0" rel="dc:subject"><a href="/main/taxonomy/term/3" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Cayley Graph</a></li><li class="taxonomy-term-reference-1" rel="dc:subject"><a href="/main/taxonomy/term/11" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Value Estimation</a></li><li class="taxonomy-term-reference-2" rel="dc:subject"><a href="/main/taxonomy/term/4" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Combinatorial Group</a></li><li class="taxonomy-term-reference-3" rel="dc:subject"><a href="/main/taxonomy/term/10" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Policy Iteration</a></li></ul></div>Thu, 18 Feb 2016 04:39:14 +0000Marc5 at https://didactronic.vociferousvoid.org/mainhttps://didactronic.vociferousvoid.org/main/play-tic-tac-toe-with-arthur-cayley-part2-expansion#commentsPlay Tic-tac-toe with Arthur Cayley!
https://didactronic.vociferousvoid.org/main/play-tic-tac-toe-with-arthur-cayley
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><div class="tex2jax"> <p><a href="https://en.wikipedia.org/wiki/Tic-tac-toe">Tic-tac-toe</a>, (or <em>noughts and crosses</em> or <em>Xs and Ox</em>), is a turn-based game for two players who alternately tag the spaces of a $3 \times 3$ grid with their respective marker: an X or an O. The object of the game is to place three markers in a row, either horizontally, vertically, or diagonally. Given only the mechanics of Tic-tac-toe, the game can be expressed as <a href="https://en.wikipedia.org/wiki/Combinatorial_group_theory">Combinatorial Group</a> by defining a set $A$ of generators $\{a_i\}$ which describe the actions that can be taken by either player. The <a href="https://en.wikipedia.org/wiki/Cayley_graph">Cayley Graph</a> of this group can be constructed which will express all the possible ways the game can be played. Using the Cayley Graph as a model, it should be possible to learn the Tic-tac-toe game tree using dynamic programming techniques (hint: the game tree is a sub-graph of the Cayley Graph).</p>
<!--break--><p>
Before going any further, it is important to understand the structure of the Tic-tac-toe group. Tic-tac-toe is expressed as a finite combinatorial group on the set, $S$, of $4^9$ possible board positions: the 9 grid locations which can be empty or contain an X, an O, or the superposition of X and O, $\ast$. The generator set, $A$, is a proper subset of $S$ with a cardinality of 10; the tagging of each of the 9 grid locations with a marker, and the empty grid (not playing is also a valid play). The identity element of the group is the empty grid, $\varnothing$, which is also the initial configuration in the game. The group law is the bijective group operation which combines an initial state with an action to produce the final state, and is expressed as follows:</p>
<p>$$ p: S \times S \mapsto S $$</p>
<p>with</p>
<p>$$ p(S,S) = \{ s, s' \in S : s \cdot s' = s_{ij} \cdot s'_{ij} \} $$</p>
<p>In other words, the application of the group law will evaluate the dot-product of each grid cell location. The dot-product of grid cells is defined as follows:<br />
$$ s_{ij} \cdot s'_{ij} = \left\{<br />
\begin{array}{lr}<br />
s_{ij} & \quad s_{ij} \neq \varnothing \land s'_{ij} = \varnothing \\<br />
s'_{ij} & \quad s_{ij} = \varnothing \land s'_{ij} \neq \varnothing \\<br />
\ast & \quad s_{ij} = \overline{s'_{ij}} \\<br />
\varnothing & \quad s_{ij} = s'_{ij} \\<br />
\overline{s_{ij}} & s'_{ij} = \ast \land s_{ij} \neq \varnothing<br />
\end{array}<br />
\right .<br />
$$<br />
The product of a marker with an empty cell tags the cell with the marker, two different markers will tag the cell with the superposition of both ($\ast$). The product of two similar markers will tag the cell as empty, therefore the group law described here is an autoinverse; this means that applying the law to a position with itself will result in the identity element.</p>
<p>The group $E$ is expressed as $\langle A|p \rangle$, and its full state space is specified by repeated applications of the generator. The fact that $E$ is a group can be asserted by verifying that it satisfies the group axioms:</p>
<ul><li>Totality: The set is closed under the operation $p$.</li>
<li>Associativity: The operation $p$ will combine any two positions in $S$ and yield another position in $S$.</li>
<li>Identity: There exists an identity element.</li>
<li>Divisibility: For each element in the group, there exists an inverse which yields the identity element when the group law is applied thereto.</li>
</ul><p>The proof that the group satisfies these axioms should be pretty evident. A formal proof of this fact is left as a future exercise.</p>
<dl><dt><b>NOTE:</b></dt>
<dd>The state space can be further constrained by defining a more intelligent group law. The state set $S$ could be partitioned into two sub-sets: $S = X \cup O$; where $X$ is the set of positions which allow X to play, and $O$ is the set of positions which allow O to play (note that the intersection of $X$ and $O$ is not empty). This would simplify the Cayley Graph and thus reduce the time required to learn the game tree. However, this would greatly increase the complexity of the group law, making it more prone to error.</dd>
</dl><p>The abstract structure of the Tic-tac-toe group can be encoded with a Cayley graph, $\Gamma$, where each of vertices represents a position, and the edges represent that possible transitions resulting from an agent making a move.</p>
<p>The Cayley graph of the Tic-tac-toe group is isomorphic to the backup diagram of the approximate value function, $V^\pi(s)$. By extending the graph -- associating values for each of the vertices (states), and weights for the edges -- it can be used as an initial approximation of the value function. Dynamic programming algorithms will iteratively update the values and weights to obtain a better approximation of the optimal value function. By removing the edges that tend toward a zero probability of being followed, the resulting graph should be isomorphic to the game tree.</p>
<p>Initially, the value of each state will be set to zero with the exception of winning states which will have high values, and losing states which have low values. Given the sets $W$ and $L$ which contain all the winning and losing positions respectively (note: $W \cap L = \emptyset$), the initial values could be assigned as follows:</p>
<p>$$\forall s \in S \quad : \quad V^\pi(s) = \left\{<br />
\begin{array}{lr}<br />
\gg 0 & \quad s \in W \\<br />
\ll 0 & \quad s \in L \\<br />
0 & \quad s \notin W \cup L<br />
\end{array}<br />
\right .<br />
$$</p>
<p>The Tic-tac-toe group allows for positions that are not valid in a regular game (i.e. the states with superpositions). These moves should be suppressed in the process of iteratively improving the approximation of the value function. To do this, the transitions leading to invalid positions could be assigned a very small weight, ensuring that the probability of following the edge tends toward zero. The same could be done to prevent actions which place a marker in a previously occupied grid cell:</p>
<p>$$<br />
P( s \cdot a = s') = \left\{<br />
\begin{array}{lr}<br />
0 & \quad \exists i,j \in \mathbb{Z}/3 : \quad s'_{ij} \neq \varnothing \land a_{ij} \neq \varnothing \\<br />
>0 & \quad \forall i,j \in \mathbb{Z}/3 : \quad s_{ij} = \varnothing \lor a_{ij} = \varnothing<br />
\end{array}<br />
\right .<br />
$$</p>
<p>This will ensure that an agent using the Cayley graph as a value function approximation will generally not take actions leading to invalid states (which would be seen as a newbie error or an attempt at cheating by an opponent).</p>
<p>The simplicity of the Tic-tac-toe problem make it a good pedagogical tool to learn about reinforcement learning.<q>It is straightforward to write a computer program to play Tic-tac-toe perfectly, to enumerate the 765 essentially different positions (the state space complexity), or the 26,830 possible games up to rotations and reflections (the game tree complexity) on this space.</q><sup><a href="https://en.wikipedia.org/wiki/Tic-tac-toe">[1]</a></sup> However, by designing a program which learns how to play rather than manually building the game tree, the relatively small state space makes it easier to validate the techniques and algorithms used. Additionally, the theoretical foundations should also be applicable to more complex problems with state spaces that are too large to hand build the associated game tree.</p>
<p>In this article, the Tic-tac-toe problem was expressed in group theoretic terms. There is an entire body of work on group theory which may provide valuable tools for reasoning about dynamic programming algorithms used to learn approximations of the solutions to modelled problems. In future articles, the ideas developed herein will be tested by implementing them using the Didactronic toolkit. The goals of this endeavour are two-fold: 1) to validate the hypothesis that group theory provides a useful formalism for expressing reinforcement learning systems, and 2) to drive the development of the Didactronic Toolkit to make it more useful as a generalized machine learning framework.</p>
</div></div></div></div><div class="field field-name-field-tags field-type-taxonomy-term-reference field-label-above clearfix"><h3 class="field-label">Tags: </h3><ul class="links"><li class="taxonomy-term-reference-0" rel="dc:subject"><a href="/main/taxonomy/term/3" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Cayley Graph</a></li><li class="taxonomy-term-reference-1" rel="dc:subject"><a href="/main/taxonomy/term/4" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Combinatorial Group</a></li><li class="taxonomy-term-reference-2" rel="dc:subject"><a href="/main/taxonomy/term/5" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Tic-Tac-Toe</a></li><li class="taxonomy-term-reference-3" rel="dc:subject"><a href="/main/taxonomy/term/9" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Dynamic Programming</a></li></ul></div>Sat, 06 Feb 2016 03:51:07 +0000Marc2 at https://didactronic.vociferousvoid.org/mainhttps://didactronic.vociferousvoid.org/main/play-tic-tac-toe-with-arthur-cayley#comments