Cogitationes ex mentis et machina
https://didactronic.vociferousvoid.org/main
Game, Set, Match
https://didactronic.vociferousvoid.org/main/node/23
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>
One of the motivations for creating the Didactronic Framework was to
learn new technology. Ports of the framework have been started in
several languages, including Python, Java, Clojure, and most recently
Rust. Rust was an interesting option because of its promise of speed,
safety and expressiveness. It seemed a good middle ground between
imperative and functional programming. Since this is a completely new
language and development paradigm for me (being primarily a C and Lisp
hacker), the Rust framework will need to be refined over time to make
use of the various constructs that are unique to that language. One
such construct is the match form.
</p>
<div id="outline-container-sec-1" class="outline-2">
<h2 id="sec-1">Match</h2>
<div class="outline-text-2" id="text-1">
<p>
The match form in Rust is similar to a switch statement in C or the
cond form in Lisp. Essentially it will evaluate the given
expression, and find a matching clause defined in the body of the
match statement. This is illustrated in the listing that follows:
</p>
<div class="org-src-container">
<pre class="src src-rust">match rand::random::<u8>() {
1 => println!( "Strike one" ),
2 => println!( "Strike two!" ),
3 => println!( "You're out!!" ),
_ => println!( "Wait! What?!" )
}
</pre>
</div>
<p>
In this example an 8-bit unsigned value is randomly selected. The
result of this function is then passed to the match expression, which
compares it with each of the values on the left side. When a match is
found, the expression that follows the '=>' is evaluated. The
underscore is a placeholder which will match any value. Note that
matches are attempted in the order in which the branches appear,
therefore a random value of 3 will match the branch whose head is the
value 3 before it can reach the underscore.
</p>
<p>
This same code snippet can be implemented in C as follows:
</p>
<div class="org-src-container">
<pre class="src src-c">#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int main( int argc, char* argv[] ) {
srand( time(NULL) ) ;
switch ( rand() ) {
case 1: printf( "Strike one" ) ; break ;
case 2: printf( "Strike two" ) ; break ;
case 3: printf( "Strike three" ) ; break ;
default: printf( "Wait! What?!" ) ;break ;
}
return 0 ;
}
</pre>
</div>
<p>
Or as a Lisp cond form:
</p>
<div class="org-src-container">
<pre class="src src-lisp">;; The limit 256 mirrors the 8-bit value used in the Rust listing.
(let ((count (random 256)))
  (cond
    ((= count 1) (print "Strike one!"))
    ((= count 2) (print "Strike two!"))
    ((= count 3) (print "Strike three!"))
    (t (print "Wait! What?!"))))
</pre>
</div>
<p>
The structure of each of these examples is fairly similar. However,
where the Rust match expression really shines is in binding
sub-expressions.
</p>
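<p>
The ordering rule described for the Rust listing can be checked with a
deterministic variant, where the value is passed in instead of being
drawn at random. This is only a small sketch: the function name call
is an assumption made for illustration and is not part of the
framework.
</p>

```rust
// Deterministic variant of the strike counter: the value is a parameter,
// so the arm that gets selected can be verified.
fn call(pitch: u8) -> &'static str {
    // `match` is an expression: the value of the selected arm is returned.
    match pitch {
        1 => "Strike one",
        2 => "Strike two!",
        3 => "You're out!!",
        // Catch-all arm; only reached when no earlier arm matched.
        _ => "Wait! What?!",
    }
}

fn main() {
    for pitch in 1u8..=4 {
        println!("{} -> {}", pitch, call(pitch));
    }
}
```

<p>
Because the arms are tried from top to bottom, the catch-all arm has
to come last; placed first, it would shadow every other arm.
</p>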
</div>
</div>
<div id="outline-container-sec-2" class="outline-2">
<h2 id="sec-2">Sub-Expression Matching</h2>
<div class="outline-text-2" id="text-2">
<p>
In experimenting with the Didactronic framework, I have created an
example tic-tac-toe program to serve as a reference, as well as to
test out the framework's design. Each player in the game is
associated with a marker which is defined as an enumeration:
</p>
<div class="org-src-container">
<pre class="src src-rust">pub enum Marker {
X,
O
}
</pre>
</div>
<p>
A configuration of the game board will represent the state of the
game. The Configuration structure, which incidentally implements the
State trait from the framework, is defined as follows:
</p>
<div class="org-src-container">
<pre class="src src-rust">pub struct Grid {
states: RefCell<HashMap<u32,Rc<Configuration>>>,
}
pub struct Configuration {
id: u32,
last: Option<Marker>,
value: f32,
grid: Rc<Grid>,
}
</pre>
</div>
<p>
Each configuration has an ID, which uniquely identifies it within the
Grid environment, and a value. The last field indicates the marker
associated with the player who made the move that led to the current
Configuration. This will be one of: Some(Marker::X), Some(Marker::O),
or None. The match expression can be used to determine the marker of
the next player to play as follows:
</p>
<div class="org-src-container">
<pre class="src src-rust">match configuration.last {
Some(Marker::X) => Marker::O,
_ => Marker::X,
}
</pre>
</div>
<p>
In this expression, Rust will attempt to match the configuration's
last field against the pattern Some(Marker::X) and return Marker::O;
otherwise it will match the underscore and return Marker::X.
</p>
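<p>
A self-contained version of this selection might look as follows. The
derive attributes and the function name next_marker are assumptions
added so that the sketch compiles and can be checked on its own; they
are not taken from the framework.
</p>

```rust
// The derives are added for this sketch only (printing and comparison).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Marker {
    X,
    O,
}

// Given the marker of the player who moved last, return the marker of the
// player who moves next. `None` means that no move has been made yet.
fn next_marker(last: Option<Marker>) -> Marker {
    match last {
        Some(Marker::X) => Marker::O,
        // Covers both Some(Marker::O) and None, so X opens the game.
        _ => Marker::X,
    }
}

fn main() {
    println!("{:?}", next_marker(None));
    println!("{:?}", next_marker(Some(Marker::X)));
    println!("{:?}", next_marker(Some(Marker::O)));
}
```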
<p>
This is very similar to the use of match from the previous
section. A more interesting use is in retrieving Configurations
from the Grid's state set. When retrieving a Configuration via the
HashMap::get() function, either some state will be obtained, or None
if no such state exists in the set. When a state is found, we want
to clone its counted reference so that the state set retains
ownership of the original state:
</p>
<div class="org-src-container">
<pre class="src src-rust">// Clone the counted reference when the state exists; otherwise build a new state.
let state = match grid.borrow().get( &id ) {
Some(s) => Rc::clone(s),
None => Rc::new(Configuration{ id, last, value: 0.0, grid: Rc::clone(&self.grid) })
};
</pre>
</div>
<p>
In this example, Rust will attempt to match the result of the get()
operation against Some(s), where s is bound to the value wrapped by
the Option container. By definition of the Grid structure, this
refers to a standard Rc<Configuration> which can be cloned. However,
if the state is not found in the Grid, a new state will be created.
This illustrates the sub-expression binding that is possible using Rust.
</p>
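<p>
The lookup can be sketched as a complete program. The structures
below are simplified stand-ins: the last field and the back-reference
from a Configuration to its Grid are left out, and the helper
get_or_create is a hypothetical name. Unlike the fragment above, the
sketch also stores the newly created state in the set, which is an
assumption about the intended behaviour.
</p>

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

// Simplified stand-in: `last` and the `grid` back-reference are omitted.
pub struct Configuration {
    id: u32,
    value: f32,
}

pub struct Grid {
    states: RefCell<HashMap<u32, Rc<Configuration>>>,
}

impl Grid {
    // Hypothetical helper: return the stored state, or create, store and
    // return a new one whose value is zero.
    fn get_or_create(&self, id: u32) -> Rc<Configuration> {
        // The shared borrow ends with this statement, before borrow_mut().
        let found = match self.states.borrow().get(&id) {
            // `s` is bound to the &Rc<Configuration> inside the Option.
            Some(s) => Some(Rc::clone(s)),
            None => None,
        };
        match found {
            Some(s) => s,
            None => {
                let s = Rc::new(Configuration { id, value: 0.0 });
                self.states.borrow_mut().insert(id, Rc::clone(&s));
                s
            }
        }
    }
}

fn main() {
    let grid = Grid { states: RefCell::new(HashMap::new()) };
    let first = grid.get_or_create(7);
    let second = grid.get_or_create(7);
    // Both handles refer to the same allocation owned by the state set.
    println!(
        "id {} value {} shared {}",
        first.id,
        first.value,
        Rc::ptr_eq(&first, &second)
    );
}
```

<p>
The lookup is split into two match expressions so that the shared
borrow of the RefCell is released before the mutable borrow is taken.
</p>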
<div id="outline-container-sec-3" class="outline-2">
<h2 id="sec-3">Conclusion</h2>
<div class="outline-text-2" id="text-3">
<p>
The match expression in Rust is a very powerful and expressive
form. In many ways, it reminds me of my undergraduate days writing
in Prolog; the binding logic seems to be very similar. Its ability to
do sub-expression matching makes it more powerful than the equivalent
C (switch) or Lisp (cond) forms. That being said, I have not done
any kind of performance analysis to determine how efficient this
expression is. For the time being, I am satisfied with getting
everything to work. There will be time to make it work faster once
that particular summit has been reached. If and when such an
analysis is performed, the results will assuredly appear in this
blog. Until then, the hacking continues: Game (Tic-Tac-Toe), set
(Grid.states), and match (in case the inspiration for the title was
not clear).
</p>
</div></div></div>
Marc, Tue, 21 Aug 2018 12:58:39 +0000
Design of the Didactronic Toolkit.
https://didactronic.vociferousvoid.org/main/node/22
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"> <p>The Didactronic Toolkit came about as a way to investigate ideas about formulating reinforcement learning problems using group theory. Its name, didactronic, is a contraction of the "didact" part of the word didactic, appended with the suffix "-tronic". </p>
<dl class="org-dl"><dt> <a href="https://www.merriam-webster.com/dictionary/didactic" rel="nofollow">didactic</a> </dt>
<dd>a) designed or intended to teach, b) intended to convey instruction and information as well as pleasure and entertainment didactic poetry. </dd>
<dt> <a href="http://wordinfo.info/unit/2190/page:13" rel="nofollow">-tronic</a> </dt>
<dd>Greek: a suffix referring to a device, tool, or instrument; more generally, used in the names of any kind of chamber or apparatus used in experiments. </dd>
</dl><p>Therefore the term Didactronic signifies an instrument or apparatus intended to teach or convey instruction to an agent through experiments; this is the essence of reinforcement learning. The Didactronic Toolkit is meant to provide the basic tools to build such an instrument for an arbitrary task. However, since the toolkit is meant to be independent of domain, it must be both useful enough to simplify the task while being generic enough not to constrain it. The goal of this article is to distill reinforcement learning into its most basic elements to provide insight into the design philosophy behind the toolkit. The secondary objective of this work is to provide a vehicle for learning the Rust language. To that end, the Didactronic Toolkit will be re-implemented as a crate in Rust.</p>
<p>The main elements of reinforcement learning can be grouped into three modules:</p>
<dl class="org-dl"><dt> Environment </dt>
<dd>The environment lays out the rules of the universe in which an agent exists. The environment can be composed by a set of States which describe the domain in which agents are allowed to operate. The environment will also define the rules which prescribe how any State can be reached from any other State.</dd>
<dt> Agent </dt>
<dd>An agent exists in a given environment and is endowed with certain capabilities which enable it to operate therein. An agent's capabilities can be expressed as a set of Actions. An agent may follow some Policy in selecting an Action to take given its current state; making sure to follow the rules of its environment.</dd>
<dt> Task </dt>
<dd>A task describes a goal that an agent is trying to achieve. This can be described by a set of goal States that an agent may want to reach. A Task may also define some anti-goal States which describe situations that an agent may wish to avoid; otherwise the task will be considered failed and incomplete. An agent will attempt to learn a Strategy to reach one or more of the goal States defined for a task while avoiding any anti-goal States.</dd>
</dl><p>The terminology described in this article largely matches that used by <a href="https://books.google.ca/books?id=CAFR6IBF4xYC" rel="nofollow">Sutton and Barto</a> in describing the reinforcement learning problem with one notable addition: Strategy. This concept allows the decoupling of an action selection policy from a strategy employed to accomplish a particular task.</p>
<p>In the sections that follow, the components of the Didactronic Toolkit will be described in more detail and their interfaces defined using the Rust language. The framework is mostly comprised of traits which must be exhibited for each entity of a specific domain. This allows the actual reinforcement learning algorithms to be implemented without consideration for the environments in which they are applied.</p>
<h2 id="sec-1">Strategy</h2>
<p>In the Didactronic Framework, the Strategy represents the thing that an agent is trying to learn. The reinforcement learning formalism proposed by Sutton and Barto prescribes updating a policy followed by an agent for selecting actions in a given state. This can lead to confusion because a Policy should be independent of the task at hand. For example, a greedy policy will always select the action which leads to the state with the greatest reward, regardless of the task. If a policy is updated for a given task, then the greedy policy is only greedy for that particular objective. To address this potential confusion, I propose the concept of the strategy. The advantage of this is two-fold:</p>
<ol class="org-ol"><li>The policy remains agnostic to the task for which it is applied. In other words, a greedy policy will always be a greedy policy irrespective of the task. The next action will depend on both the strategy and the policy being followed. </li>
<li>A learned strategy can be followed using a variety of different policies. This provides opportunities to verify if a learned strategy is effective for various different policy types. </li>
</ol><p>A Strategy is defined as a trait which exposes a function to determine the best action(s) to take given a state. The following listing illustrates the definition of the Strategy trait.</p>
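<pre class="src src-rust">pub trait Strategy<S: State, A: Action> {
// Candidate actions for `state`, most preferred first. A Vec is returned
// because an unsized slice ([A]) cannot be returned by value.
fn next( &self, state: S ) -> Vec<A> ;
// Value of `state` under this strategy; assumes State declares an associated type Value.
fn get_value( &self, state: S ) -> S::Value ;
}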
</pre><p>
The next() function will return an array of one or more actions ordered by preference for a given state; the most preferred action being first. An agent will select one action from this array according to some policy. For example, given an epsilon-greedy policy, the preferred action will be selected with a probability $1-\varepsilon$; otherwise an action will be selected from the array uniformly at random.</p>
<p>The Strategy trait also defines the get_value() function which will determine the value of a given state. This allows the Task to be decoupled from the Environment; the same state may have different values depending on the task. For example, in a game of tic-tac-toe, two agents will have competing goals, therefore a winning state for one player will be a losing state for its opponent. Naturally this state's value will be different for each agent.</p>
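<p>As an illustration of how a policy might consume the ordered actions returned by next(), the following sketch performs an epsilon-greedy selection. It is not the framework's Policy trait, which is not shown in this article; the function name and its parameters are assumptions. The two random samples are passed in by the caller so that the sketch needs no random number crate, and the exploratory choice is assumed to be uniform.</p>

```rust
// Select an action from a preference-ordered slice (most preferred first).
// `explore` and `pick` are uniform samples in [0, 1) supplied by the caller,
// which keeps this sketch deterministic and free of external crates.
fn epsilon_greedy<A: Copy>(ordered: &[A], epsilon: f64, explore: f64, pick: f64) -> Option<A> {
    if ordered.is_empty() {
        return None;
    }
    if explore >= epsilon {
        // Exploit: with probability 1 - epsilon take the preferred action.
        Some(ordered[0])
    } else {
        // Explore: choose uniformly among all listed actions.
        let index = ((pick * ordered.len() as f64) as usize).min(ordered.len() - 1);
        Some(ordered[index])
    }
}

fn main() {
    let actions = ["centre", "corner", "edge"];
    // First call exploits (0.50 >= 0.1); second call explores (0.05 < 0.1).
    println!("{:?}", epsilon_greedy(&actions, 0.1, 0.50, 0.9));
    println!("{:?}", epsilon_greedy(&actions, 0.1, 0.05, 0.9));
}
```

<p>Keeping the selection outside the strategy is what lets the same learned strategy be followed under different policies.</p>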
<h2 id="sec-2">Environment</h2>
<p>The Environment trait defines the functions to describe all of its valid states and the possible transitions between them. The following listing illustrates the Rust definition of the Environment trait.</p>
<pre class="src src-rust">pub trait Environment<S: State, A: Action> {
fn contains( &self, state: S ) -> bool ;
fn initial( &self ) -> &S ;
fn current( &self ) -> &S ;
fn is_valid( &self, state: &S, action: &A ) -> bool ;
fn get_probability( &self, current: &S, action: &A, next: &S ) -> f32 {
1.0
}
}
</pre><p>
The contains() function is used to assert whether or not the given state is part of the current environment. This design pattern is more general than mandating that an environment expose a set of all known states. For very large domains, it may not be possible to fully express this set. It is therefore preferable to simply assert whether or not a state belongs to the environment. This should be a sufficient condition to define the environment's bounds.</p>
<p>Additionally, the environment trait defines functions to retrieve the current state, the initial state, and to determine whether or not an action is valid for a given state. These capabilities can be used by agents to help them determine which actions to take.</p>
<p>The trait also allows for stochastic state transitions by exposing a function, transitionprobability(), which will evaluate the probability that taking an action in a given state will lead to the specified next state. In combination with the apply() function, which returns all possible outcomes of taking a given action in the current state, it is possible to estimate the value of an action.</p>
<p>Finally, the execute() function will apply the given action to update the current environment's state.</p>
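<p>To make the role of these functions concrete, here is a sketch of a toy environment: a walk along seven positions where the two ends are terminal. The trait below is a reduced, hypothetical version of Environment (no State or Action bounds, since those traits are not shown here, and no initial(), apply() or execute()), and the RandomWalk and Move types are assumptions made for the example.</p>

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Move {
    Left,
    Right,
}

// Positions 0 and 6 are terminal; positions 1..=5 are ordinary states.
struct RandomWalk {
    current: u8,
}

// Reduced, hypothetical version of the Environment trait from the text.
trait Environment<S, A> {
    fn contains(&self, state: &S) -> bool;
    fn current(&self) -> &S;
    fn is_valid(&self, state: &S, action: &A) -> bool;
    // Default: transitions are deterministic.
    fn get_probability(&self, _current: &S, _action: &A, _next: &S) -> f32 {
        1.0
    }
}

impl Environment<u8, Move> for RandomWalk {
    // Membership test instead of enumerating every state.
    fn contains(&self, state: &u8) -> bool {
        *state <= 6
    }
    fn current(&self) -> &u8 {
        &self.current
    }
    // Any move is allowed from a non-terminal position.
    fn is_valid(&self, state: &u8, _action: &Move) -> bool {
        (1..=5).contains(state)
    }
    // Probability 1 for the position the move leads to, 0 for any other.
    fn get_probability(&self, current: &u8, action: &Move, next: &u8) -> f32 {
        if !self.is_valid(current, action) {
            return 0.0;
        }
        let target = match action {
            Move::Left => *current - 1,
            Move::Right => *current + 1,
        };
        if target == *next { 1.0 } else { 0.0 }
    }
}

fn main() {
    let env = RandomWalk { current: 3 };
    let here = *env.current();
    println!("right to 4: {}", env.get_probability(&here, &Move::Right, &4));
    println!("left to 4: {}", env.get_probability(&here, &Move::Left, &4));
}
```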
<h2 id="sec-3">Task</h2>
<p>The Task trait describes an agent's objective by specifying its goals and the rewards received for its actions.</p>
<pre class="src src-rust">pub trait Task<S: State, A: Action> {
// Trait objects are written with the `dyn` keyword in current Rust.
fn get_environment( &self ) -> &dyn Environment<S,A> ;
fn is_terminal( &self, state: &S ) -> bool ;
fn is_goal( &self, state: &S ) -> bool ;
fn get_reward( &self, initial: &S, action: &A, next: &S ) -> f32 ;
}
</pre><p>
During the learning process, a task will be tied to a particular environment. The task's environment can be retrieved using the get_environment() function. Additionally, the task provides functions to assert whether or not a given state is terminal (is_terminal()), and whether it represents a goal (is_goal()). It is possible for a state to be terminal without being a goal. For example, a losing state in a game would be terminal without being a goal.</p>
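<p>The distinction between terminal and goal states can be sketched with a reduced version of the trait. Everything here is hypothetical: the Outcome type stands in for a real game state, get_environment() is left out, the action type is a plain grid index, and the reward values are assumptions chosen only for illustration.</p>

```rust
// Coarse stand-in for a game state, seen from one player's side.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    InProgress,
    Win,
    Loss,
    Draw,
}

// Reduced, hypothetical version of the Task trait (no environment accessor).
trait Task<S, A> {
    fn is_terminal(&self, state: &S) -> bool;
    fn is_goal(&self, state: &S) -> bool;
    fn get_reward(&self, initial: &S, action: &A, next: &S) -> f32;
}

struct WinTheGame;

impl Task<Outcome, u8> for WinTheGame {
    // Every finished game is terminal, whatever its result.
    fn is_terminal(&self, state: &Outcome) -> bool {
        *state != Outcome::InProgress
    }
    // Only a win counts as a goal.
    fn is_goal(&self, state: &Outcome) -> bool {
        *state == Outcome::Win
    }
    // Assumed reward scheme: +1 for a win, -1 for a loss, 0 otherwise.
    fn get_reward(&self, _initial: &Outcome, _action: &u8, next: &Outcome) -> f32 {
        match next {
            Outcome::Win => 1.0,
            Outcome::Loss => -1.0,
            Outcome::Draw | Outcome::InProgress => 0.0,
        }
    }
}

fn main() {
    let task = WinTheGame;
    for state in [Outcome::InProgress, Outcome::Win, Outcome::Loss, Outcome::Draw].iter() {
        println!(
            "{:?}: terminal {} goal {}",
            state,
            task.is_terminal(state),
            task.is_goal(state)
        );
    }
}
```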
<h2 id="sec-4">Agent</h2>
<p>The Agent trait represents the entity learning a Strategy to solve a given Task. The Rust definition of this trait is illustrated in the following listing:</p>
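<pre class="src src-rust">pub trait Agent<S: State, A: Action> {
fn get_capabilities( &self ) -> Vec<&A> ;
// The strategy and policy are borrowed as trait objects, since a bare trait cannot be passed by value.
fn next_action( &self, state: S, strategy: &dyn Strategy<S,A>, policy: &dyn Policy<A> ) -> A ;
}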
</pre><p>
The Agent trait is very simple. It defines a function to retrieve the agent's capabilities as well as the action it will take in some state while following a given strategy and policy.</p>
<p>Note that the Agent may also express an agency. This is necessary when trying to define a multi-agent learning system. The agent in this case will represent two or more cooperating agents. The strategy will have to be implemented accordingly.</p>
<p>The reinforcement learning elements described in this article will serve as the basis of what will hopefully become a useful tool for researching reinforcement learning. However, what is described herein is by no means final or complete. This will be an on-going project which will produce a useful crate to allow fellow Rustaceans to create applications therewith.</p>
</div></div></div>
Marc, Fri, 06 Jul 2018 02:38:09 +0000
Play Tic-tac-toe with Arthur Cayley! Part Two: Expansion
https://didactronic.vociferousvoid.org/main/play-tic-tac-toe-with-arthur-cayley-part2-expansion
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><div class="tex2jax"> <p>In <a href="/main/play-tic-tac-toe-with-arthur-cayley">part 1</a> of this series, the Tic-tac-toe reinforcement learning task was expressed as a <a href="/main/lexicon#Combinatorial_Group" title="An algebraic group which is defined by all the possible expressions (e.g. words or terms) that can be built from a generator set. All terms will be considered distinct unless their equality follows from the group axioms (closure, associativity, identity, invertibility). See Combinatorial Group Theory." class="lexicon-term">Combinatorial Group</a> with the hypothesis that the expansion of the group into a <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a> could be used to learn its associated game tree. In this instalment, the expansion of the group into a Cayley Graph will be examined in a bit more detail. Initially, the Tic-tac-toe group will be set aside in favour of a simpler domain which will offer a more compact and pedagogical representation. However, the expansion of the Tic-tac-toe group should follow the same process; this article will circle back to the Tic-tac-toe domain to highlight the equivalences which should ensure that this is so.</p>
<!--break--><p>
Although Tic-tac-toe is a relatively simple problem, the size of its state space makes it intractable for a "back of the napkin" illustration. Therefore, the random walk task proposed by Sutton and Barto (<a href="bibliography#Sutton-Barto:1998">Sutton and Barto, 1998</a>) will be used to discuss the formal expansion into a <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a>. The random walk example consists of a small Markov process with five non-terminal states: $A$, $B$, $C$, $D$, and $E$. In each of the five non-terminal states, two actions with equal probability are possible: move left ($l$), and move right ($r$). An automaton describing the random walk domain is illustrated in <a href="#Figure1:RandomWalk">Figure 1</a>.</p>
<p><span><br /><a name="Figure1:RandomWalk" id="Figure1:RandomWalk"></a><br /><img src="/main/sites/default/files/RandomWalkAutomata.png" width="542" height="98" alt="Diagram of a Markov process for generating random walks on five states plus one terminal states." title="A small Markov process for generating random walks." /><br /><strong>Figure 1</strong>: Diagram of a Markov process for generating random walks on five states plus two terminal states.<br /></span></p>
<p>Let $\langle R|\cdot\rangle$ represent the random walk group. It can be expressed as a combinatorial group with a generator set $R_G = \{l, r\}$ and associated constraint relations $R_D$. The $l$ and $r$ generators are inverses, therefore the group will have the following constraint: $R_D = \{ l \cdot r = e \}$, where $e$ is the identity element. In light of this constraint, the group expression can be simplified: let $a=r$, and thus $a^{-1} = l$; $R$ can now be expressed as the free group $\langle a | \rangle$. This expresses the composition of all the terms that comprise the group $R$ (e.g.: $aa^{-1}aa^{-1}a = a$, $a^{-1}a^{-1}a^{-1} = a^{-3}$, $aaa = a^3$...). Given that $C$ is the initial state of the random walk, the following equivalences hold for this group: $C=e$, $D = C \cdot a$, and $A = C \cdot a^{-2}$.</p>
<p>Because the random walk problem has a terminal state (i.e. the task is episodic), two additional constraints are required for a proper group representation to ensure that the random walk does not continue indefinitely:<br />
$$a^{3(-1)^n}\cdot i = a^{3(-1)^n}, \forall i \in R_G \land \forall n \in \mathbb{Z}^+$$<br />
and<br />
$$a^{3} = a^{-3} = F$$<br />
It should be pointed out that although there are an infinite number of random walks that can be taken starting from $C$ to reach the terminal states, the group $R$ is nonetheless a finite group when terms are reduced to their simplest form (i.e. occurrences of an element of the generator set followed by its inverse are elided from the term). The complete set of terms in the random walk group are:</p>
<p>$$<br />
\begin{equation}<br />
R = \{ e, a, a^{-1}, a^2, a^{-2}, a^{3}, a^{-3} \} = \{ C, D, B, E, A, F \}<br />
\end{equation}<br />
$$</p>
<p>The <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a> $\Gamma(R,R_G)$ of the group $R$, illustrated in <a href="#Figure2:RandomWalk-CayleyGraph">Figure 2</a>, is constructed as follows:</p>
<ol><li>Construct the vertex set: $V(\Gamma) = \{ s ~|~ \forall s \in R \}$</li>
<li>Construct the edge set and partition it into two subsets with colour labels:<br />
$E(\Gamma) = E_\text{red}(\Gamma) \cup E_\text{blue}(\Gamma) = \{ (s_i, s_j) ~|~ a\cdot{s_i} = s_j \} \cup \{ (s_i, s_j) ~|~ a^{-1}\cdot{s_i} = s_j \}$</li>
</ol><p><span><br /><a name="Figure2:RandomWalk-CayleyGraph" id="Figure2:RandomWalk-CayleyGraph"></a><br /><img src="/main/sites/default/files/RandomWalk-CayleyGraph.png" width="522" height="179" alt="Cayley Graph of the random walk group" title="Cayley Graph expansion of the random walk group." /><br /><strong>Figure 2</strong>: <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a> of the Random Walk group $\langle R | \cdot \rangle$<br /></span></p>
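<p>The two construction steps can be sketched in Rust. The names Vertex, vertex, apply and cayley_graph are hypothetical, and the sketch assumes that the terminal vertex $F$ carries a self-loop for each generator, as the absorbing constraint on $a^{3}$ and $a^{-3}$ suggests; Figure 2 may draw this differently.</p>

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Vertex {
    A,
    B,
    C,
    D,
    E,
    F,
}

// Map a reduced exponent of the generator a onto a vertex. Only exponents
// in -3..=3 occur here; both a^3 and a^-3 collapse onto the terminal F.
fn vertex(exponent: i32) -> Vertex {
    match exponent {
        -2 => Vertex::A,
        -1 => Vertex::B,
        0 => Vertex::C,
        1 => Vertex::D,
        2 => Vertex::E,
        _ => Vertex::F,
    }
}

// Apply a generator to a vertex: step is +1 for a and -1 for its inverse.
fn apply(v: Vertex, step: i32) -> Vertex {
    let exponent = match v {
        Vertex::A => -2,
        Vertex::B => -1,
        Vertex::C => 0,
        Vertex::D => 1,
        Vertex::E => 2,
        // F absorbs every generator (assumed self-loop).
        Vertex::F => return Vertex::F,
    };
    vertex(exponent + step)
}

// Returns the vertex set, the red edges (generator a) and the blue edges
// (generator a^-1).
fn cayley_graph() -> (Vec<Vertex>, Vec<(Vertex, Vertex)>, Vec<(Vertex, Vertex)>) {
    let vertices = vec![Vertex::A, Vertex::B, Vertex::C, Vertex::D, Vertex::E, Vertex::F];
    let red: Vec<(Vertex, Vertex)> = vertices.iter().map(|&s| (s, apply(s, 1))).collect();
    let blue: Vec<(Vertex, Vertex)> = vertices.iter().map(|&s| (s, apply(s, -1))).collect();
    (vertices, red, blue)
}

fn main() {
    let (vertices, red, blue) = cayley_graph();
    println!("V    = {:?}", vertices);
    println!("red  = {:?}", red);
    println!("blue = {:?}", blue);
}
```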
<p>Note that the set $R$ is the set of all states in the task including the terminal state. In the environment-agent model of reinforcement learning, this is expressed as $S^+$. Additionally, the edge set of the <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a> $E(\Gamma)$ is equivalent to the set of actions $\mathscr{A}(\pi)$ available to a given policy. This graph can therefore serve as the basis of a model for estimating a state-value function which can be improved using a <a href="/main/lexicon#Dynamic_Programming" title="Dynamic Programming (DP) refers to a collection of algorithms which, given a perfect model of an environment, can compute optimal policies for a Markov Decision Process. The classical DP algorithms are of limited use due to their assumption of a perfect model. " class="lexicon-term">Dynamic Programming</a> implementation of Generalized <a href="/main/lexicon#Policy_Iteration" title="Policy iteration is the process of iteratively improving a policy, $\pi_t$, using approximations of a state-value function $V^{\pi_t}$. At each iteration $t$, the approximation from the previous step is used to improve ($\overset{I}{\rightarrow}$) the policy, which in turn is used to update ($\overset{E}{\rightarrow}$) the state-value approximation for the next iteration, $V^{\pi_{t+1}}$. Policy iteration ends when the policy becomes stable ($\pi^*$). This is illustrated as follows:
$$
V^{\pi_0} \overset{I}{\rightarrow} \pi_1 \overset{E}{\rightarrow} V^{\pi_1} \overset{I}{\rightarrow} \pi_2 \overset{E}{\rightarrow}... \overset{I}{\rightarrow} \pi^*
$$" class="lexicon-term">Policy Iteration</a>. However, some additional information must first be attached to the graph. Let $\mathscr{R}(s,s',a)$ be the function which defines the expected reward for taking action $a$ in state $s$ leading to state $s'$:<br />
$$<br />
\mathscr{R}(s, s', a) = \left\{<br />
\begin{array}{lr}<br />
0 & : s' \neq F \lor a \in E_\text{blue}(\Gamma) \\<br />
1 & : s' = F \land a \in E_\text{red}(\Gamma)<br />
\end{array}<br />
\right .<br />
$$<br />
This will associate a zero weight to all the edges in $\Gamma(R,R_G)$ with the exception of the red edge connecting $E$ to $F$. Additionally, initial value estimations must be assigned to each of the vertices in the graph. All values will initially be set to zero. Given an $\epsilon$-greedy policy, $\pi$, the policy evaluation algorithm described in <a href="#alg:PolicyEvaluation">Figure 3</a> will be used to get an initial approximation of the value function $V^{\pi}(R)$. The value $\mathscr{P}_{ss'}^{a}$ represents the probability that taking action $a$ in state $s$ will yield state $s'$. For the random walk problem, this is a certainty (probability is $1.0$). Therefore the actual value estimation update is calculated as follows:<br />
$$<br />
V^{\pi}(s) \leftarrow \sum_{s'} \left[ \mathscr{R}(s, s', \pi(s)) + \gamma V^{\pi}(s') \right]
$$<br />
where $\pi(s)$ will choose either $a$ or $a^{-1}$ with equal probability. Initially, the value estimation will remain zero with the possible exception of $V(E)$ which will have a value of 1 if the policy chooses action $a$ in this pass; which is a 50% probability.<br /><span><br /><a name="alg:PolicyEvaluation" id="alg:PolicyEvaluation"></a></span></p>
<ul><li>Repeat
<ul><li>$\Delta \leftarrow 0$</li>
<li>For each $s \in R$:
<ul><li>$t \leftarrow V^{\pi}(s)$</li>
<li>$V^{\pi}(s) \leftarrow \sum_{s'}{\mathscr{P}_{ss'}^{\pi(s)}[ \mathscr{R}(s,s',\pi(s)) + \gamma V^{\pi}(s') ]}$</li>
<li>$\Delta \leftarrow \text{max}(\Delta, |t - V^{\pi}(s)|)$</li>
</ul></li>
</ul><p> until $\Delta < \theta$ (a small positive number)
</p></li>
</ul><p><strong>Figure 3</strong>: The Policy Evaluation algorithm<br /></p>
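<p>A minimal Rust sketch of this loop for the random walk follows. It makes several assumptions that are not part of the algorithm as stated: states are indexed 0 to 6, with 0 and 6 both standing for the terminal state $F$ and 1 to 5 for $A$ to $E$; the policy is the equiprobable one, so the backup averages over both actions instead of following a single $\pi(s)$; and the function names are hypothetical.</p>

```rust
// Index 0 and index 6 both stand for the terminal state F;
// indices 1..=5 stand for the states A..E.
const N: usize = 7;

// Reward of 1 only for the rightward step from E (index 5) into F.
fn reward(s: usize, next: usize) -> f64 {
    if s == 5 && next == 6 { 1.0 } else { 0.0 }
}

// Iterative policy evaluation for the equiprobable policy, updating the
// value table in place until the largest change falls below theta.
fn evaluate(gamma: f64, theta: f64) -> [f64; N] {
    let mut v = [0.0; N];
    loop {
        let mut delta: f64 = 0.0;
        for s in 1..=5 {
            let old = v[s];
            // Expected backup: each neighbour is reached with probability 0.5.
            let mut new = 0.0;
            for &next in &[s - 1, s + 1] {
                new += 0.5 * (reward(s, next) + gamma * v[next]);
            }
            v[s] = new;
            delta = delta.max((old - new).abs());
        }
        if delta < theta {
            return v;
        }
    }
}

fn main() {
    let v = evaluate(1.0, 1e-12);
    for (name, value) in ["A", "B", "C", "D", "E"].iter().zip(v[1..=5].iter()) {
        println!("V({}) = {:.4}", name, value);
    }
}
```

<p>With $\gamma = 1$ the estimates settle at multiples of $\frac{1}{6}$, which can be checked by substituting them back into the update.</p>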
<p>With the updated value estimation, the policy improvement algorithm, described in <a href="#alg:PolicyImprovement">Figure 4</a>, will update the policy in relation to the new value estimation. As in the previous step, $\mathscr{P}_{ss'}^{a}$ will always be 1.0, therefore the policy update step will be:<br />
$$<br />
\pi(s) \leftarrow \text{arg}~\text{max}_a \sum_{s'}{\left[ \mathscr{R}(s, s', a) + \gamma V^{\pi}(s') \right]}<br />
$$<br />
Following the first policy improvement, the policy will randomly choose either $a$ or $a^{-1}$ in all states with a probability of 0.5. The exception is in state $E$ where the policy will choose $a$ with a probability of $1-\epsilon$ (since an $\epsilon$-greedy policy will select an action randomly with a probability of $\epsilon$). From here, it should be fairly easy to verify, by hand calculating the value-estimation and policy, that this converges toward an optimal policy following a large number of iterations of policy evaluation and improvement. The final value-estimation will assign the values $\frac{1}{6}, \frac{2}{6}, \frac{3}{6}, \frac{4}{6}$ and $\frac{5}{6}$ to states $A, B, C, D$, and $E$ respectively. Therefore an $\epsilon$-greedy policy will almost always elect to walk toward $E$ to reach the final destination, which yields a higher reward.</p>
<p><span><br /><a name="alg:PolicyImprovement" id="alg:PolicyImprovement"></a></span></p>
<ul><li>$\mathit{stable} \leftarrow \text{true}$</li>
<li>For each $s \in R$:
<ul><li>$b \leftarrow \pi(s)$</li>
<li>$\pi(s) \leftarrow \text{arg}~\text{max}_a \sum_{s'}{\mathscr{P}_{ss'}^{a}[ \mathscr{R}(s,s',a) + \gamma V^{\pi}(s')]}, a \in R_G$</li>
<li>If $b \neq \pi(s)$, then $\mathit{stable} \leftarrow \text{false}$</li>
</ul></li><li>If $\mathit{stable}$, then stop; else do PolicyEvaluation</li>
</ul><p><strong>Figure 4</strong>: The Policy Improvement algorithm<br /></p>
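<p>One sweep of this improvement step can be sketched for the random walk as follows. The indexing (0 and 6 for the terminal state, 1 to 5 for $A$ to $E$), the Step type, the function names and the tie-breaking rule are all assumptions made for the example; the sketch is purely greedy and leaves the $\epsilon$ exploration aside.</p>

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Step {
    Left,
    Right,
}

// Reward of 1 only for the rightward step from E (index 5) into F (index 6).
fn reward(s: usize, next: usize) -> f64 {
    if s == 5 && next == 6 { 1.0 } else { 0.0 }
}

// One sweep of greedy policy improvement over the non-terminal states.
// Returns true when no action changed, i.e. the policy is stable.
fn improve(policy: &mut [Step; 7], v: &[f64; 7], gamma: f64) -> bool {
    let mut stable = true;
    for s in 1..=5 {
        let before = policy[s];
        let left = reward(s, s - 1) + gamma * v[s - 1];
        let right = reward(s, s + 1) + gamma * v[s + 1];
        // Ties are broken in favour of Left; this choice is arbitrary.
        policy[s] = if right > left { Step::Right } else { Step::Left };
        if policy[s] != before {
            stable = false;
        }
    }
    stable
}

fn main() {
    // Value estimates quoted in the text for A..E; terminal entries are zero.
    let v = [0.0, 1.0 / 6.0, 2.0 / 6.0, 3.0 / 6.0, 4.0 / 6.0, 5.0 / 6.0, 0.0];
    let mut policy = [Step::Left; 7];
    let first = improve(&mut policy, &v, 1.0);
    let second = improve(&mut policy, &v, 1.0);
    println!("stable after first sweep: {}, after second: {}", first, second);
    println!("policy for A..E: {:?}", &policy[1..=5]);
}
```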
<p>This example illustrates how defining a reinforcement learning task as a combinatorial group yields a suitable model for learning an optimal policy using <a href="/main/lexicon#Dynamic_Programming" title="Dynamic Programming (DP) refers to a collection of algorithms which, given a perfect model of an environment, can compute optimal policies for a Markov Decision Process. The classical DP algorithms are of limited use due to their assumption of a perfect model. " class="lexicon-term">Dynamic Programming</a> and Generalized <a href="/main/lexicon#Policy_Iteration" title="Policy iteration is the process of iteratively improving a policy, $\pi_t$, using approximations of a state-value function $V^{\pi_t}$. At each iteration $t$, the approximation from the previous step is used to improve ($\overset{I}{\rightarrow}$) the policy, which in turn is used to update ($\overset{E}{\rightarrow}$) the state-value approximation for the next iteration, $V^{\pi_{t+1}}$. Policy iteration ends when the policy becomes stable ($\pi^*$). This is illustrated as follows:
$$
V^{\pi_0} \overset{I}{\rightarrow} \pi_1 \overset{E}{\rightarrow} V^{\pi_1} \overset{I}{\rightarrow} \pi_2 \overset{E}{\rightarrow}... \overset{I}{\rightarrow} \pi^*
$$" class="lexicon-term">Policy Iteration</a>. The same procedure should yield similar results for the Tic-tac-toe domain, although with much greater complexity (it won't be feasible to calculate this by hand). There are a few caveats: 1) there will be multiple possible initial states (depending on whether or not the agent plays first) as opposed to the single initial state in the random walk task described in this article, and 2) the probability value $\mathscr{P}_{ss'}^{a}$ will no longer always be 1.0 because the resulting game tree must account for the various possible moves by the opponent. Aside from this, the procedure to define the task should remain the same. Additionally, it should be possible to extend this to even more complex domains if the requirement of constructing the <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a> is relaxed. A more abstract group representation could be used with Monte Carlo methods or Temporal Difference learning which do not require a well-defined model of the environment. These ideas will be explored in future articles.</p>
</div></div></div></div><div class="field field-name-field-tags field-type-taxonomy-term-reference field-label-above clearfix"><h3 class="field-label">Tags: </h3><ul class="links"><li class="taxonomy-term-reference-0" rel="dc:subject"><a href="/main/taxonomy/term/3" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Cayley Graph</a></li><li class="taxonomy-term-reference-1" rel="dc:subject"><a href="/main/taxonomy/term/11" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Value Estimation</a></li><li class="taxonomy-term-reference-2" rel="dc:subject"><a href="/main/taxonomy/term/4" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Combinatorial Group</a></li><li class="taxonomy-term-reference-3" rel="dc:subject"><a href="/main/taxonomy/term/10" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Policy Iteration</a></li></ul></div>
Marc, Thu, 18 Feb 2016 04:39:14 +0000
Play Tic-tac-toe with Arthur Cayley!
https://didactronic.vociferousvoid.org/main/play-tic-tac-toe-with-arthur-cayley
<div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><div class="tex2jax"> <p><a href="https://en.wikipedia.org/wiki/Tic-tac-toe">Tic-tac-toe</a> (or <em>noughts and crosses</em> or <em>Xs and Os</em>) is a turn-based game for two players who alternately tag the spaces of a $3 \times 3$ grid with their respective marker: an X or an O. The object of the game is to place three markers in a row, either horizontally, vertically, or diagonally. Given only the mechanics of Tic-tac-toe, the game can be expressed as a <a href="https://en.wikipedia.org/wiki/Combinatorial_group_theory">Combinatorial Group</a> by defining a set $A$ of generators $\{a_i\}$ which describe the actions that can be taken by either player. The <a href="https://en.wikipedia.org/wiki/Cayley_graph">Cayley Graph</a> of this group can then be constructed to express all the possible ways the game can be played. Using the <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a> as a model, it should be possible to learn the Tic-tac-toe game tree using dynamic programming techniques (hint: the game tree is a sub-graph of the <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a>).</p>
<!--break--><p>
Before going any further, it is important to understand the structure of the Tic-tac-toe group. Tic-tac-toe is expressed as a finite combinatorial group on the set, $S$, of $4^9$ possible board positions: each of the 9 grid locations can be empty or contain an X, an O, or the superposition of X and O, $\ast$. The generator set, $A$, is a proper subset of $S$ with a cardinality of 10: the tagging of each of the 9 grid locations with a marker, plus the empty grid (not playing is also a valid play). The identity element of the group is the empty grid, $\varnothing$, which is also the initial configuration in the game. The group law is the group operation which combines an initial state with an action to produce the final state, and is expressed as follows:</p>
<p>$$ p: S \times S \to S $$</p>
<p>with</p>
<p>$$ p(s, s') = s \cdot s' \quad \text{where} \quad (s \cdot s')_{ij} = s_{ij} \cdot s'_{ij} \quad \forall s, s' \in S $$</p>
<p>In other words, the application of the group law will evaluate the dot-product of each grid cell location. The dot-product of grid cells is defined as follows:<br />
$$ s_{ij} \cdot s'_{ij} = \left\{<br />
\begin{array}{lr}<br />
s_{ij} & \quad s_{ij} \neq \varnothing \land s'_{ij} = \varnothing \\<br />
s'_{ij} & \quad s_{ij} = \varnothing \land s'_{ij} \neq \varnothing \\<br />
\ast & \quad s_{ij} = \overline{s'_{ij}} \\<br />
\varnothing & \quad s_{ij} = s'_{ij} \\<br />
\overline{s_{ij}} & \quad s'_{ij} = \ast \land s_{ij} \neq \varnothing<br />
\end{array}<br />
\right .<br />
$$<br />
The product of a marker with an empty cell tags the cell with the marker, while the product of two different markers tags the cell with the superposition of both ($\ast$). The product of two identical markers leaves the cell empty, so every position is its own inverse under the group law: applying the law to a position with itself yields the identity element.</p>
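<p>The cell product can be sketched in Rust with the match form. This is a minimal sketch rather than the Didactronic implementation: the names <code>Cell</code>, <code>dot</code>, <code>Board</code> and <code>product</code> are hypothetical, and the two-bit encoding is an assumption. It reads the cases above as symmetric in their arguments and takes $\ast \cdot \ast = \varnothing$, which makes the cell product an exclusive-or of the two encodings.</p>

```rust
/// Contents of one grid cell in the Tic-tac-toe group (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Cell {
    Empty, // the identity element for a single cell
    X,
    O,
    Both, // the superposition of X and O
}

impl Cell {
    // Assumed encoding: bit 0 set when X is present, bit 1 when O is present.
    fn bits(self) -> u8 {
        match self {
            Cell::Empty => 0b00,
            Cell::X => 0b01,
            Cell::O => 0b10,
            Cell::Both => 0b11,
        }
    }

    fn from_bits(bits: u8) -> Cell {
        match bits & 0b11 {
            0b00 => Cell::Empty,
            0b01 => Cell::X,
            0b10 => Cell::O,
            _ => Cell::Both,
        }
    }

    /// Product of two cells: the exclusive-or of their encodings.
    fn dot(self, other: Cell) -> Cell {
        Cell::from_bits(self.bits() ^ other.bits())
    }
}

/// A position is the 3x3 grid stored row by row.
type Board = [Cell; 9];

/// Group law: apply the cell product at every grid location.
fn product(s: &Board, t: &Board) -> Board {
    let mut out = [Cell::Empty; 9];
    for i in 0..9 {
        out[i] = s[i].dot(t[i]);
    }
    out
}
```

<p>Under these assumptions, <code>Cell::X.dot(Cell::O)</code> evaluates to <code>Cell::Both</code>, and the product of any board with itself is the empty grid.</p>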
<p>The group $E$ is expressed as $\langle A|p \rangle$, and its full state space is obtained by repeated applications of the generators. The fact that $E$ is a group can be asserted by verifying that it satisfies the group axioms:</p>
<ul><li>Totality: The set is closed under the operation $p$.</li>
<li>Associativity: For any three positions $a, b, c \in S$, the grouping of the operations does not matter: $p(p(a, b), c) = p(a, p(b, c))$.</li>
<li>Identity: There exists an identity element.</li>
<li>Divisibility: For each element in the group, there exists an inverse which yields the identity element when the group law is applied thereto.</li>
</ul><p>The proof that the group satisfies these axioms should be pretty evident. A formal proof of this fact is left as a future exercise.</p>
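<p>Short of a formal proof, the axioms can be checked by brute force on a single cell. The sketch below assumes a hypothetical two-bit encoding of the cell contents ($\varnothing$ = 0, X = 1, O = 2, $\ast$ = 3) under which the cell product is an exclusive-or; all function names are illustrative. Because the group law acts on each grid location independently, the board positions form a direct product of nine copies of the cell group, so the axioms carry over to $S$.</p>

```rust
// Assumed encoding of the four cell contents: empty, X, O, superposition.
const CELLS: [u8; 4] = [0b00, 0b01, 0b10, 0b11];
const IDENTITY: u8 = 0b00;

// Cell product under the assumed encoding.
fn dot(a: u8, b: u8) -> u8 {
    a ^ b
}

// Totality: every product lands back in the set.
fn is_closed() -> bool {
    CELLS
        .iter()
        .all(|&a| CELLS.iter().all(|&b| CELLS.contains(&dot(a, b))))
}

// Associativity: (a.b).c == a.(b.c) for every triple.
fn is_associative() -> bool {
    CELLS.iter().all(|&a| {
        CELLS.iter().all(|&b| {
            CELLS
                .iter()
                .all(|&c| dot(dot(a, b), c) == dot(a, dot(b, c)))
        })
    })
}

// Identity: the empty cell leaves every element unchanged on both sides.
fn has_identity() -> bool {
    CELLS
        .iter()
        .all(|&a| dot(a, IDENTITY) == a && dot(IDENTITY, a) == a)
}

// Divisibility: every element has some inverse yielding the identity.
fn has_inverses() -> bool {
    CELLS
        .iter()
        .all(|&a| CELLS.iter().any(|&b| dot(a, b) == IDENTITY))
}
```

<p>An exhaustive check like this is only evidence for the assumed encoding; it does not replace the formal proof.</p>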
<dl><dt><b>NOTE:</b></dt>
<dd>The state space can be further constrained by defining a more intelligent group law. The state set $S$ could be split into two overlapping sub-sets, $S = X \cup O$, where $X$ is the set of positions which allow X to play, and $O$ is the set of positions which allow O to play (note that the intersection of $X$ and $O$ is not empty). This would simplify the <a href="/main/lexicon#Cayley_Graph" title="A vertex-transitive graph which encodes the abstract structure of an algebraic group." class="lexicon-term">Cayley Graph</a> and thus reduce the time required to learn the game tree. However, it would also greatly increase the complexity of the group law, making it more prone to error.</dd>
</dl><p>The abstract structure of the Tic-tac-toe group can be encoded with a Cayley graph, $\Gamma$, where each of the vertices represents a position, and the edges represent the possible transitions resulting from an agent making a move.</p>
<p>The Cayley graph of the Tic-tac-toe group is isomorphic to the backup diagram of the approximate value function, $V^\pi(s)$. By extending the graph -- associating values for each of the vertices (states), and weights for the edges -- it can be used as an initial approximation of the value function. Dynamic programming algorithms will iteratively update the values and weights to obtain a better approximation of the optimal value function. By removing the edges that tend toward a zero probability of being followed, the resulting graph should be isomorphic to the game tree.</p>
<p>Initially, the value of each state will be set to zero with the exception of winning states which will have high values, and losing states which have low values. Given the sets $W$ and $L$ which contain all the winning and losing positions respectively (note: $W \cap L = \emptyset$), the initial values could be assigned as follows:</p>
<p>$$\forall s \in S \quad : \quad V^\pi(s) = \left\{<br />
\begin{array}{lr}<br />
\gg 0 & \quad s \in W \\<br />
\ll 0 & \quad s \in L \\<br />
0 & \quad s \notin W \cup L<br />
\end{array}<br />
\right .<br />
$$</p>
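<p>The initial assignment can be sketched as follows. The magnitudes <code>WIN_VALUE</code> and <code>LOSS_VALUE</code> are hypothetical stand-ins for $\gg 0$ and $\ll 0$, and the names <code>Mark</code>, <code>has_line</code> and <code>initial_value</code> are illustrative only. Positions containing superpositions are left out of this sketch.</p>

```rust
/// Marker in a grid cell (superpositions are omitted in this sketch).
#[derive(Clone, Copy, PartialEq, Eq)]
enum Mark {
    Empty,
    X,
    O,
}

/// The 3x3 grid stored row by row.
type Board = [Mark; 9];

/// The eight rows, columns and diagonals that win the game.
const LINES: [[usize; 3]; 8] = [
    [0, 1, 2], [3, 4, 5], [6, 7, 8],
    [0, 3, 6], [1, 4, 7], [2, 5, 8],
    [0, 4, 8], [2, 4, 6],
];

// Hypothetical magnitudes standing in for ">> 0" and "<< 0".
const WIN_VALUE: f64 = 1.0;
const LOSS_VALUE: f64 = -1.0;

/// True when `mark` occupies all three cells of some line.
fn has_line(board: &Board, mark: Mark) -> bool {
    LINES
        .iter()
        .any(|line| line.iter().all(|&i| board[i] == mark))
}

/// Initial estimate of V(s) from the point of view of `agent`.
fn initial_value(board: &Board, agent: Mark) -> f64 {
    let opponent = match agent {
        Mark::X => Mark::O,
        Mark::O => Mark::X,
        Mark::Empty => return 0.0, // not a player
    };
    if has_line(board, agent) {
        WIN_VALUE // s is in W
    } else if has_line(board, opponent) {
        LOSS_VALUE // s is in L
    } else {
        0.0 // s is in neither W nor L
    }
}
```

<p>The sketch checks the agent's lines first, so a board on which both players hold a line (which cannot occur in a regular game) is treated as a win; this ordering is a design choice of the sketch, not something the definition above prescribes.</p>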
<p>The Tic-tac-toe group allows for positions that are not valid in a regular game (i.e. the states with superpositions). Moves leading to such positions should be suppressed in the process of iteratively improving the approximation of the value function. To do this, the transitions leading to invalid positions could be assigned a very small weight, ensuring that the probability of following the edge tends toward zero. The same could be done to prevent actions which place a marker in a previously occupied grid cell:</p>
<p>$$<br />
P( s \cdot a = s') = \left\{<br />
\begin{array}{lr}<br />
0 & \quad \exists i,j \in \mathbb{Z}/3 : \quad s_{ij} \neq \varnothing \land a_{ij} \neq \varnothing \\<br />
>0 & \quad \forall i,j \in \mathbb{Z}/3 : \quad s_{ij} = \varnothing \lor a_{ij} = \varnothing<br />
\end{array}<br />
\right .<br />
$$</p>
<p>This will ensure that an agent using the Cayley graph as a value function approximation will generally not take actions leading to invalid states (which would be seen as a newbie error or an attempt at cheating by an opponent).</p>
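<p>A minimal sketch of this weighting follows. The formula above only requires a positive probability for legal actions; spreading the probability uniformly over them is an assumption made here for illustration, as are the names <code>is_legal</code> and <code>transition_weights</code>.</p>

```rust
/// Marker in a grid cell (superpositions are omitted in this sketch).
#[derive(Clone, Copy, PartialEq, Eq)]
enum Mark {
    Empty,
    X,
    O,
}

/// The 3x3 grid stored row by row; actions are boards too.
type Board = [Mark; 9];

/// An action is legal when, at every grid location, either the state
/// or the action is empty (no marker lands on an occupied cell).
fn is_legal(state: &Board, action: &Board) -> bool {
    state
        .iter()
        .zip(action.iter())
        .all(|(&s, &a)| s == Mark::Empty || a == Mark::Empty)
}

/// Weight of each candidate action: zero for illegal actions and,
/// by assumption, a uniform share for the legal ones.
fn transition_weights(state: &Board, actions: &[Board]) -> Vec<f64> {
    let legal = actions.iter().filter(|a| is_legal(state, a)).count();
    actions
        .iter()
        .map(|a| {
            if is_legal(state, a) {
                1.0 / legal as f64
            } else {
                0.0
            }
        })
        .collect()
}
```

<p>Note that the empty grid (not playing) passes this legality test, in keeping with its membership in the generator set.</p>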
<p>The simplicity of the Tic-tac-toe problem makes it a good pedagogical tool to learn about reinforcement learning. <q>It is straightforward to write a computer program to play Tic-tac-toe perfectly, to enumerate the 765 essentially different positions (the state space complexity), or the 26,830 possible games up to rotations and reflections (the game tree complexity) on this space.</q><sup><a href="https://en.wikipedia.org/wiki/Tic-tac-toe">[1]</a></sup> However, by designing a program which learns how to play rather than manually building the game tree, the relatively small state space makes it easier to validate the techniques and algorithms used. Additionally, the theoretical foundations should also be applicable to more complex problems whose state spaces are too large for the associated game tree to be built by hand.</p>
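<p>The enumeration mentioned in the quotation can be sketched with a depth-first search. This is a hand-built enumeration, not the learning approach pursued here, and it assumes that X always plays first and that play stops as soon as a line is completed. The encoding (0 for empty, 1 for X, 2 for O) and all names are hypothetical.</p>

```rust
use std::collections::HashSet;

/// Grid stored row by row: 0 = empty, 1 = X, 2 = O (assumed encoding).
type Board = [u8; 9];

const LINES: [[usize; 3]; 8] = [
    [0, 1, 2], [3, 4, 5], [6, 7, 8],
    [0, 3, 6], [1, 4, 7], [2, 5, 8],
    [0, 4, 8], [2, 4, 6],
];
// Index permutations for a quarter turn and a left-right reflection.
const ROTATE: [usize; 9] = [6, 3, 0, 7, 4, 1, 8, 5, 2];
const MIRROR: [usize; 9] = [2, 1, 0, 5, 4, 3, 8, 7, 6];

fn has_winner(b: &Board) -> bool {
    LINES
        .iter()
        .any(|l| b[l[0]] != 0 && b[l[0]] == b[l[1]] && b[l[1]] == b[l[2]])
}

fn permute(b: &Board, perm: &[usize; 9]) -> Board {
    let mut out = [0u8; 9];
    for i in 0..9 {
        out[i] = b[perm[i]];
    }
    out
}

/// Smallest board among the 8 rotations and reflections of `b`.
fn canonical(b: &Board) -> Board {
    let mut best = *b;
    let mut current = *b;
    for _ in 0..4 {
        current = permute(&current, &ROTATE);
        let mirrored = permute(&current, &MIRROR);
        best = best.min(current).min(mirrored);
    }
    best
}

/// Visit every position reachable from `b` with `player` to move.
fn explore(b: Board, player: u8, seen: &mut HashSet<Board>) {
    if !seen.insert(b) || has_winner(&b) {
        return; // already visited, or the game is over
    }
    for i in 0..9 {
        if b[i] == 0 {
            let mut next = b;
            next[i] = player;
            explore(next, 3 - player, seen);
        }
    }
}

/// Returns (reachable positions, positions up to symmetry).
fn count_positions() -> (usize, usize) {
    let mut seen = HashSet::new();
    explore([0; 9], 1, &mut seen);
    let distinct: HashSet<Board> = seen.iter().map(canonical).collect();
    (seen.len(), distinct.len())
}
```

<p>The second component returned by <code>count_positions()</code> is the count of essentially different positions, which can be compared against the figure of 765 quoted above.</p>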
<p>In this article, the Tic-tac-toe problem was expressed in group theoretic terms. There is an entire body of work on group theory which may provide valuable tools for reasoning about the dynamic programming algorithms used to learn approximate solutions to modelled problems. In future articles, the ideas developed herein will be tested by implementing them using the Didactronic Toolkit. The goals of this endeavour are two-fold: 1) to validate the hypothesis that group theory provides a useful formalism for expressing reinforcement learning systems, and 2) to drive the development of the Didactronic Toolkit to make it more useful as a generalized machine learning framework.</p>
</div></div></div></div><div class="field field-name-field-tags field-type-taxonomy-term-reference field-label-above clearfix"><h3 class="field-label">Tags: </h3><ul class="links"><li class="taxonomy-term-reference-0" rel="dc:subject"><a href="/main/taxonomy/term/3" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Cayley Graph</a></li><li class="taxonomy-term-reference-1" rel="dc:subject"><a href="/main/taxonomy/term/4" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Combinatorial Group</a></li><li class="taxonomy-term-reference-2" rel="dc:subject"><a href="/main/taxonomy/term/5" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Tic-Tac-Toe</a></li><li class="taxonomy-term-reference-3" rel="dc:subject"><a href="/main/taxonomy/term/9" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Dynamic Programming</a></li></ul></div>Sat, 06 Feb 2016 03:51:07 +0000Marc2 at https://didactronic.vociferousvoid.org/mainhttps://didactronic.vociferousvoid.org/main/play-tic-tac-toe-with-arthur-cayley#comments