Thoughts of mind and machine. - Value Estimation
https://didactronic.vociferousvoid.org/main/taxonomy/term/11
Play Tic-tac-toe with Arthur Cayley! Part Two: Expansion
https://didactronic.vociferousvoid.org/main/play-tic-tac-toe-with-arthur-cayley-part2-expansion
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><div class="tex2jax"> <p>In <a href="/main/play-tic-tac-toe-with-arthur-cayley">part 1</a> of this series, the Tic-tac-toe reinforcement learning task was expressed as a <a href="/main/lexicon#Combinatorial_Group" title="An algebraic group which is defined by all the possible expressions (e.g. words or terms) that can be built from a generator set. All terms will be considered distinct unless their equality follows from the group axioms (closure, associativity, identity, invertibility). See Combinatorial Group Theory." class="lexicon-term">Combinatorial Group</a> with the hypothesis that the expansion of the group into a <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a> could be used to learn its associated game tree. In this instalment, the expansion of the group into a Cayley Graph will be examined in a bit more detail. Initially, the Tic-tac-toe group will be set aside in favour of a simpler domain which offers a more compact and pedagogical representation. The expansion of the Tic-tac-toe group should follow the same process, however, so this article will circle back to the Tic-tac-toe domain to highlight the equivalences which should ensure that this is so.</p>
<!--break--><p>
Although Tic-tac-toe is a relatively simple problem, its state space is too large for a "back of the napkin" illustration. Therefore, the random walk task proposed by Sutton and Barto (<a href="bibliography#Sutton-Barto:1998">Sutton and Barto, 1998</a>) will be used to discuss the formal expansion into a <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a>. The random walk example consists of a small Markov process with five non-terminal states: $A$, $B$, $C$, $D$, and $E$. In each of the five non-terminal states, two actions are possible, each with equal probability: move left ($l$), and move right ($r$). An automaton describing the random walk domain is illustrated in <a href="#Figure1:RandomWalk">Figure 1</a>.</p>
<p><span><br /><a name="Figure1:RandomWalk" id="Figure1:RandomWalk"></a><br /><img src="/main/sites/default/files/RandomWalkAutomata.png" width="542" height="98" alt="Diagram of a Markov process for generating random walks on five states plus two terminal states." title="A small Markov process for generating random walks." /><br /><strong>Figure 1</strong>: Diagram of a Markov process for generating random walks on five states plus two terminal states.<br /></span></p>
<p>Let $\langle R|\cdot\rangle$ represent the random walk group. It can be expressed as a combinatorial group with a generator set $R_G = \{l, r\}$ and associated constraint relations $R_D$. The $l$ and $r$ generators are inverses, therefore the group will have the following constraint: $R_D = \{ l \cdot r = e \}$, where $e$ is the identity element. In light of this constraint, the group expression can be simplified: let $a=r$, and thus $a^{-1} = l$; $R$ can now be expressed as the free group $\langle a | \rangle$. This expresses the composition of all the terms that comprise the group $R$ (e.g. $aa^{-1}aa^{-1}a = a$, $a^{-1}a^{-1}a^{-1} = a^{-3}$, $aaa = a^3$, ...). Given that $C$ is the initial state of the random walk, the following equivalences hold for this group: $C=e$, $D = C \cdot a$, and $A = C \cdot a^{-2}$.</p>
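<p>The reduction of terms in $\langle a | \rangle$ can be sketched in a few lines of Python. This is only an illustration: the function names <code>reduce_word</code> and <code>exponent</code>, and the convention of writing $a^{-1}$ as the character <code>A</code>, are assumptions made for this sketch and do not come from the series.</p>

```python
def reduce_word(word):
    """Freely reduce a word over 'a' and 'A' (here 'A' stands for a^-1).

    Assumption of this sketch: words are plain strings of those two characters.
    """
    stack = []
    for symbol in word:
        if symbol not in ("a", "A"):
            raise ValueError("unexpected symbol: %r" % symbol)
        if stack and stack[-1] != symbol:
            # With a single generator, two different adjacent symbols
            # are always a generator next to its inverse, so they cancel.
            stack.pop()
        else:
            stack.append(symbol)
    return "".join(stack)


def exponent(word):
    """Return n such that the word reduces to a^n."""
    reduced = reduce_word(word)
    return len(reduced) if reduced.startswith("a") else -len(reduced)


# The examples from the text: a a^-1 a a^-1 a, a^-1 a^-1 a^-1, and a a a.
examples = {w: exponent(w) for w in ("aAaAa", "AAA", "aaa")}
```

<p>With this encoding, <code>exponent("aAaAa")</code> is 1, <code>exponent("AAA")</code> is -3 and <code>exponent("aaa")</code> is 3, matching the three terms listed above.</p>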
<p>Because the random walk problem has a terminal state (i.e. the task is episodic), two additional constraints are required for a proper group representation to ensure that the random walk does not continue indefinitely:<br />
$$a^{3(-1)^n}\cdot i = a^{3(-1)^n}, \forall i \in R_G \land \forall n \in \mathbb{Z}^+$$<br />
and<br />
$$a^{3} = a^{-3} = F$$<br />
It should be pointed out that although there are an infinite number of random walks that can be taken starting from $C$ to reach the terminal state, the group $R$ is nonetheless a finite group when terms are reduced to their simplest form (i.e. occurrences of an element of the generator set followed by its inverse are elided from the term). The complete set of terms in the random walk group is:</p>
<p>$$<br />
\begin{equation}<br />
R = \{ e, a, a^{-1}, a^2, a^{-2}, a^{3}, a^{-3} \} = \{ C, D, B, E, A, F \}<br />
\end{equation}<br />
$$</p>
<p>The <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a> $\Gamma(R,R_G)$ of the group $R$, illustrated in <a href="#Figure2:RandomWalk-CayleyGraph">Figure 2</a>, is constructed as follows:</p>
<ol><li>Construct the vertex set: $V(\Gamma) = \{ s ~|~ s \in R \}$</li>
<li>Construct the edge set and partition it into two subsets with colour labels:<br />
$E(\Gamma) = E_\text{red}(\Gamma) \cup E_\text{blue}(\Gamma) = \{ (s_i, s_j) ~|~ s_i \cdot a = s_j \} \cup \{ (s_i, s_j) ~|~ s_i \cdot a^{-1} = s_j \}$</li>
</ol><p><span><br /><a name="Figure2:RandomWalk-CayleyGraph" id="Figure2:RandomWalk-CayleyGraph"></a><br /><img src="/main/sites/default/files/RandomWalk-CayleyGraph.png" width="522" height="179" alt="Cayley Graph of the random walk group" title="Cayley Graph expansion of the random walk group." /><br /><strong>Figure 2</strong>: <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a> of the Random Walk group $\langle R | \cdot \rangle$<br /></span></p>
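<p>The two construction steps above can be sketched in Python as follows. The sketch represents each term $a^n$ by its exponent $n$ and maps both $a^{3}$ and $a^{-3}$ to $F$. The names <code>NAMES</code>, <code>name</code> and <code>cayley_graph</code> are hypothetical, and leaving out the absorbing self-loops on $F$ is a choice made here for brevity, not something taken from Figure 2.</p>

```python
# Exponent n of the term a^n mapped to its state label (assumed encoding).
NAMES = {-2: "A", -1: "B", 0: "C", 1: "D", 2: "E"}


def name(n):
    """Label for the term a^n; a^3 and a^-3 both collapse to the terminal F."""
    return NAMES.get(n, "F")


def cayley_graph():
    """Return (vertices, red_edges, blue_edges) for the random walk group.

    Red edges follow right-multiplication by a, blue edges by a^-1.
    Edges leaving F are omitted in this sketch.
    """
    vertices = {name(n) for n in range(-3, 4)}
    red = {(name(n), name(n + 1)) for n in range(-2, 3)}
    blue = {(name(n), name(n - 1)) for n in range(-2, 3)}
    return vertices, red, blue


vertices, red, blue = cayley_graph()
edges = red | blue  # the full edge set is the union of the two colour classes
```

<p>Under these assumptions the graph has six vertices and ten edges, five of each colour, and the red edge from $E$ to $F$ is the one that will later carry the reward.</p>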
<p>Note that the set $R$ is the set of all states in the task including the terminal state. In the environment-agent model of reinforcement learning, this is expressed as $S^+$. Additionally, the edge set of the <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a> $E(\Gamma)$ is equivalent to the set of actions $\mathscr{A}(\pi)$ available to a given policy. This graph can therefore serve as the basis of a model for estimating a state-value function which can be improved using a <a href="/main/lexicon#Dynamic_Programming" title="Dynamic Programming (DP) refers to a collection of algorithms which, given a perfect model of an environment, can compute optimal policies for a Markov Decision Process. The classical DP algorithms are of limited use due to their assumption of a perfect model. " class="lexicon-term">Dynamic Programming</a> implementation of Generalized <a href="/main/lexicon#Policy_Iteration" title="Policy iteration is the process of iteratively improving a policy, $\pi_t$, using approximations of a state-value function $V^{\pi_t}$. At each iteration $t$, the approximation from the previous step is used to improve ($\overset{I}{\rightarrow}$) the policy, which in turn is used to update ($\overset{E}{\rightarrow}$) the state-value approximation for the next iteration, $V^{\pi_{t+1}}$. Policy iteration ends when the policy becomes stable ($\pi^*$). This is illustrated as follows:
$$
V^{\pi_0} \overset{I}{\rightarrow} \pi_1 \overset{E}{\rightarrow} V^{\pi_1} \overset{I}{\rightarrow} \pi_2 \overset{E}{\rightarrow}... \overset{I}{\rightarrow} \pi^*
$$" class="lexicon-term">Policy Iteration</a>. However, some additional information must first be attached to the graph. Let $\mathscr{R}(s,s',a)$ be the function which defines the expected reward for taking action $a$ in state $s$ leading to state $s'$:<br />
$$<br />
\mathscr{R}(s, s', a) = \left\{<br />
\begin{array}{lr}<br />
0 & : s' \neq F \lor a \in E_\text{blue}(\Gamma) \\<br />
1 & : s' = F \land a \in E_\text{red}(\Gamma)<br />
\end{array}<br />
\right .<br />
$$<br />
This will associate a zero weight to all the edges in $\Gamma(R,R_G)$ with the exception of the red edge connecting $E$ to $F$. Additionally, initial value estimations must be assigned to each of the vertices in the graph; all values will initially be set to zero. Given an $\epsilon$-greedy policy, $\pi$, the policy evaluation algorithm described in <a href="#alg:PolicyEvaluation">Figure 3</a> will be used to get an initial approximation of the value function $V^{\pi}(R)$. The value $\mathscr{P}_{ss'}^{a}$ represents the probability that taking action $a$ in state $s$ will yield state $s'$. For the random walk problem, this is a certainty (the probability is $1.0$). Therefore the actual value estimation update is calculated as follows:<br />
$$<br />
V^{\pi}(s) \leftarrow \sum_{s'} \left[ \mathscr{R}(s, s', \pi(s)) + \gamma V^{\pi}(s') \right]
$$<br />
where $\pi(s)$ will choose either $a$ or $a^{-1}$ with equal probability. Initially, the value estimation will remain zero with the possible exception of $V(E)$, which will have a value of 1 if the policy chooses action $a$ in this pass (a 50% probability).<br /><span><br /><a name="alg:PolicyEvaluation" id="alg:PolicyEvaluation"></a></span></p>
<ul><li>Repeat
<ul><li>$\Delta \leftarrow 0$</li>
<li>For each $s \in R$:
<ul><li>$t \leftarrow V^{\pi}(s)$</li>
<li>$V^{\pi}(s) \leftarrow \sum_{s'}{\mathscr{P}_{ss'}^{\pi(s)}[ \mathscr{R}(s,s',\pi(s)) + \gamma V^{\pi}(s') ]}$</li>
<li>$\Delta \leftarrow \text{max}(\Delta, |t - V^{\pi}(s)|)$</li>
</ul></li>
</ul><p> until $\Delta &lt; \theta$ (a small positive number)
</p></li>
</ul><p><strong>Figure 3</strong>: The Policy Evaluation algorithm<br /></p>
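<p>A minimal Python sketch of the evaluation loop in Figure 3, applied to the random walk, might look as follows. Several things here are assumptions rather than part of the article: the discount $\gamma = 1$, the threshold $\theta = 10^{-10}$, the names <code>successor</code>, <code>reward</code> and <code>evaluate_policy</code>, and the decision to take the expectation over both actions explicitly instead of sampling a single $\pi(s)$ on each pass.</p>

```python
STATES = ["A", "B", "C", "D", "E"]   # non-terminal states, left to right
ACTIONS = {"a": +1, "a_inv": -1}     # a moves right, a^-1 moves left


def successor(s, action):
    """Deterministic transition; stepping off either end reaches the terminal F."""
    i = STATES.index(s) + ACTIONS[action]
    return STATES[i] if 0 <= i < len(STATES) else "F"


def reward(s, s_next, action):
    """1 only on the red edge from E into F, 0 on every other edge."""
    return 1.0 if (s_next == "F" and action == "a") else 0.0


def evaluate_policy(policy, gamma=1.0, theta=1e-10):
    """Iterative policy evaluation; policy[s][action] is a selection probability."""
    V = {s: 0.0 for s in STATES}
    V["F"] = 0.0  # the terminal state keeps a value of zero
    while True:
        delta = 0.0
        for s in STATES:
            old = V[s]
            V[s] = sum(
                p * (reward(s, successor(s, act), act) + gamma * V[successor(s, act)])
                for act, p in policy[s].items()
            )
            delta = max(delta, abs(old - V[s]))
        if delta < theta:
            return V


# Equiprobable policy: a and a^-1 are each chosen with probability 0.5.
random_policy = {s: {"a": 0.5, "a_inv": 0.5} for s in STATES}
values = evaluate_policy(random_policy)
```

<p>Under these assumptions, the estimates for $A$ through $E$ approach $\frac{1}{6}, \frac{2}{6}, \frac{3}{6}, \frac{4}{6}$ and $\frac{5}{6}$.</p>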
<p>With the updated value estimation, the policy improvement algorithm, described in <a href="#alg:PolicyImprovement">Figure 4</a>, will update the policy in relation to the new value estimation. As in the previous step, $\mathscr{P}_{ss'}^{a}$ will always be 1.0, therefore the policy update step will be:<br />
$$<br />
\pi(s) \leftarrow \text{arg}~\text{max}_a \sum_{s'}{\left[ \mathscr{R}(s, s', a) + \gamma V^{\pi}(s') \right]}<br />
$$<br />
Following the first policy improvement, the policy will randomly choose either $a$ or $a^{-1}$ in all states with a probability of 0.5. The exception is state $E$, where the policy will choose $a$ with a probability of $1-\epsilon$ (since an $\epsilon$-greedy policy will select an action randomly with a probability of $\epsilon$). From here, it should be fairly easy to verify, by hand-calculating the value estimation and policy, that this converges toward an optimal policy following a large number of iterations of policy evaluation and improvement. The final value estimation will assign the values $\frac{1}{6}, \frac{2}{6}, \frac{3}{6}, \frac{4}{6}$ and $\frac{5}{6}$ to states $A, B, C, D$, and $E$ respectively. Therefore an $\epsilon$-greedy policy will almost always elect to walk toward $E$ to reach the final destination, which yields a higher reward.</p>
<p><span><br /><a name="alg:PolicyImprovement" id="alg:PolicyImprovement"></a></span></p>
<ul><li>$\mathit{stable} \leftarrow \text{true}$</li>
<li>For each $s \in R$:
<ul><li>$b \leftarrow \pi(s)$</li>
<li>$\pi(s) \leftarrow \text{arg}~\text{max}_a \sum_{s'}{\mathscr{P}_{ss'}^{a}[ \mathscr{R}(s,s',a) + \gamma V^{\pi}(s')]}, a \in R_G$</li>
<li>If $b \neq \pi(s)$, then $\mathit{stable} \leftarrow \text{false}$</li>
</ul></li><li>If $\mathit{stable}$, then stop; else do PolicyEvaluation</li>
</ul><p><strong>Figure 4</strong>: The Policy Improvement algorithm<br /></p>
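<p>The greedy step at the heart of Figure 4 can be sketched in Python as well. As before, this is an illustration under stated assumptions: $\gamma = 1$, deterministic transitions, and hypothetical names (<code>successor</code>, <code>reward</code>, <code>improve_policy</code>). The value table passed in is the one quoted in the text, and the stability check and the return to policy evaluation are left out.</p>

```python
STATES = ["A", "B", "C", "D", "E"]   # non-terminal states, left to right
ACTIONS = {"a": +1, "a_inv": -1}     # a moves right, a^-1 moves left


def successor(s, action):
    """Deterministic transition; stepping off either end reaches the terminal F."""
    i = STATES.index(s) + ACTIONS[action]
    return STATES[i] if 0 <= i < len(STATES) else "F"


def reward(s, s_next, action):
    """1 only on the red edge from E into F, 0 on every other edge."""
    return 1.0 if (s_next == "F" and action == "a") else 0.0


def improve_policy(V, gamma=1.0):
    """For each state, pick the action with the largest one-step backup."""
    policy = {}
    for s in STATES:
        def backup(act, s=s):
            s_next = successor(s, act)
            return reward(s, s_next, act) + gamma * V[s_next]
        policy[s] = max(ACTIONS, key=backup)
    return policy


# Value estimates quoted in the text, with the terminal state fixed at zero.
V = {"A": 1 / 6, "B": 2 / 6, "C": 3 / 6, "D": 4 / 6, "E": 5 / 6, "F": 0.0}
greedy = improve_policy(V)
```

<p>With these values the greedy choice is $a$ in every state, which is the walk toward $E$ described above.</p>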
<p>This example illustrates how defining a reinforcement learning task as a combinatorial group yields a suitable model for learning an optimal policy using <a href="/main/lexicon#Dynamic_Programming" title="Dynamic Programming (DP) refers to a collection of algorithms which, given a perfect model of an environment, can compute optimal policies for a Markov Decision Process. The classical DP algorithms are of limited use due to their assumption of a perfect model. " class="lexicon-term">Dynamic Programming</a> and Generalized <a href="/main/lexicon#Policy_Iteration" title="Policy iteration is the process of iteratively improving a policy, $\pi_t$, using approximations of a state-value function $V^{\pi_t}$. At each iteration $t$, the approximation from the previous step is used to improve ($\overset{I}{\rightarrow}$) the policy, which in turn is used to update ($\overset{E}{\rightarrow}$) the state-value approximation for the next iteration, $V^{\pi_{t+1}}$. Policy iteration ends when the policy becomes stable ($\pi^*$). This is illustrated as follows:
$$
V^{\pi_0} \overset{I}{\rightarrow} \pi_1 \overset{E}{\rightarrow} V^{\pi_1} \overset{I}{\rightarrow} \pi_2 \overset{E}{\rightarrow}... \overset{I}{\rightarrow} \pi^*
$$" class="lexicon-term">Policy Iteration</a>. The same procedure should yield similar results for the Tic-tac-toe domain, although with much greater complexity (it won't be feasible to calculate this by hand). There are a few caveats: 1) there will be multiple possible initial states (depending on whether or not the agent plays first) as opposed to the single initial state in the random walk task described in this article, and 2) the probability value $\mathscr{P}_{ss'}^{a}$ will no longer always be 1.0, because the resulting game tree must account for the various possible moves by the opponent. Aside from this, the procedure to define the task should remain the same. Additionally, it should be possible to extend this to even more complex domains if the requirement of constructing the <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a> is relaxed. A more abstract group representation could be used with Monte Carlo methods or Temporal Difference learning, which do not require a well-defined model of the environment. These ideas will be explored in future articles.</p>
</div></div></div></div><div class="field field-name-field-tags field-type-taxonomy-term-reference field-label-above clearfix"><h3 class="field-label">Tags: </h3><ul class="links"><li class="taxonomy-term-reference-0" rel="dc:subject"><a href="/main/taxonomy/term/3" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Cayley Graph</a></li><li class="taxonomy-term-reference-1" rel="dc:subject"><a href="/main/taxonomy/term/11" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Value Estimation</a></li><li class="taxonomy-term-reference-2" rel="dc:subject"><a href="/main/taxonomy/term/4" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Combinatorial Group</a></li><li class="taxonomy-term-reference-3" rel="dc:subject"><a href="/main/taxonomy/term/10" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Policy Iteration</a></li></ul></div>Thu, 18 Feb 2016 04:39:14 +0000Marc5 at https://didactronic.vociferousvoid.org/mainhttps://didactronic.vociferousvoid.org/main/play-tic-tac-toe-with-arthur-cayley-part2-expansion#comments