Design of the Didactronic Toolkit.

Submitted by Marc on Thu, 07/05/2018 - 22:38

The Didactronic Toolkit came about as a way to investigate ideas about formulating reinforcement learning problems using group theory. Its name, didactronic, is a contraction of the "didact" part of the word didactic, appended with the suffix "-tronic".

didactic
a) designed or intended to teach; b) intended to convey instruction and information as well as pleasure and entertainment (for example, "didactic poetry").
-tronic
Greek: a suffix referring to a device, tool, or instrument; more generally, used in the names of any kind of chamber or apparatus used in experiments.

Therefore the term Didactronic signifies an instrument or apparatus intended to teach or convey instruction to an agent through experiments; this is the essence of reinforcement learning. The Didactronic Toolkit is meant to provide the basic tools to build such an instrument for an arbitrary task. However, since the toolkit is meant to be independent of domain, it must be useful enough to simplify the task yet generic enough not to constrain it. The goal of this article is to distill reinforcement learning into its most basic elements to provide insight into the design philosophy behind the toolkit. A secondary objective of this work is to provide a vehicle for learning the Rust language. To that end, the Didactronic Toolkit will be re-implemented as a crate in Rust.

The main elements of reinforcement learning can be grouped into three modules:

Environment
The environment lays out the rules of the universe in which an agent exists. The environment is composed of a set of States which describe the domain in which agents are allowed to operate. The environment also defines the rules which prescribe how any State can be reached from any other State.
Agent
An agent exists in a given environment and is endowed with certain capabilities which enable it to operate therein. An agent's capabilities can be expressed as a set of Actions. An agent may follow some Policy in selecting an Action to take given its current state, while obeying the rules of its environment.
Task
A task describes a goal that an agent is trying to achieve. This can be described by a set of goal States that an agent may want to reach. A Task may also define some anti-goal States which describe situations that an agent should avoid; reaching one of them means the task has failed. An agent will attempt to learn a Strategy to reach one or more of the goal States defined for a task while avoiding any anti-goal States.

The terminology described in this article largely matches that used by Sutton and Barto in describing the reinforcement learning problem with one notable addition: Strategy. This concept allows the decoupling of an action selection policy from a strategy employed to accomplish a particular task.

In the sections that follow, the components of the Didactronic Toolkit will be described in more detail and their interfaces defined using the Rust language. The framework consists mostly of traits which must be implemented for each entity of a specific domain. This allows the actual reinforcement learning algorithms to be implemented without consideration for the environments in which they are applied.

Strategy

In the Didactronic Toolkit, the Strategy represents the thing that an agent is trying to learn. The reinforcement learning formalism proposed by Sutton and Barto prescribes updating a policy followed by an agent for selecting actions in a given state. This can lead to confusion because a Policy should be independent of the task at hand. For example, a greedy policy will always select the action which leads to the state with the greatest reward, regardless of the task. If a policy is updated for a given task, then the greedy policy is only greedy with respect to that particular objective. To address this potential confusion, I propose the concept of the strategy. The advantage of this is two-fold:

  1. The policy remains agnostic to the task to which it is applied. In other words, a greedy policy will always be a greedy policy irrespective of the task. The next action will depend on both the strategy and the policy being followed.
  2. A learned strategy can be followed using a variety of policies. This provides opportunities to verify whether a learned strategy is effective under different policy types.

A Strategy is defined as a trait which exposes a function to determine the best action(s) to take given a state. The following listing illustrates the definition of the Strategy trait.

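pub trait Strategy<S: State, A: Action> {
    // Actions for `state`, most preferred first. A Vec is returned because
    // an unsized slice `[A]` cannot be returned by value.
    fn next( &self, state: S ) -> Vec<A> ;
    // Value of `state` under this strategy.
    fn get_value( &self, state: S ) -> S::Value ;
}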

The next() function will return a list of one or more actions ordered by preference for a given state, the most preferred action being first. An agent will select one action from this list according to some policy. For example, given an epsilon-greedy policy, the preferred action will be selected with probability $1-\varepsilon$; otherwise an action will be selected at random from the list, conventionally using a uniform distribution.

The Strategy trait also defines the get_value() function which will determine the value of a given state. This allows the Task to be decoupled from the Environment; the same state may have different values depending on the task. For example, in a game of tic-tac-toe, two agents will have competing goals, so a winning state for one player is a losing state for its opponent. Naturally this state's value will be different for each agent.
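
The separation between strategy and policy can be sketched as follows. This is only an illustration under stated assumptions: the State and Action traits are minimal stand-ins (the article does not list them), next() returns a Vec, and Cell, Step, TowardGoal, XorShift and epsilon_greedy are hypothetical names that are not part of the toolkit. The exploratory draw is assumed to be uniform, and a small xorshift generator stands in for a real random number crate.

```rust
// Hypothetical stand-ins: the article does not list State or Action.
pub trait State { type Value; }
pub trait Action {}

pub trait Strategy<S: State, A: Action> {
    fn next(&self, state: S) -> Vec<A>;
    fn get_value(&self, state: S) -> S::Value;
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Cell(pub i32);
impl State for Cell { type Value = f32; }

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Step { Left, Right }
impl Action for Step {}

/// Hand-written strategy for a corridor whose goal cell is `goal`.
pub struct TowardGoal { pub goal: i32 }

impl Strategy<Cell, Step> for TowardGoal {
    // Rank the step that approaches the goal first.
    fn next(&self, state: Cell) -> Vec<Step> {
        if state.0 < self.goal {
            vec![Step::Right, Step::Left]
        } else {
            vec![Step::Left, Step::Right]
        }
    }
    // States closer to the goal are worth more.
    fn get_value(&self, state: Cell) -> f32 {
        -((self.goal - state.0).abs() as f32)
    }
}

/// Tiny xorshift64 generator (seed must be non-zero); a stand-in for a
/// proper random number crate.
pub struct XorShift(pub u64);

impl XorShift {
    /// Next pseudo-random value in [0, 1).
    pub fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Epsilon-greedy choice over a preference-ordered list: explore uniformly
/// with probability `epsilon`, otherwise take the first (preferred) entry.
pub fn epsilon_greedy<A: Clone>(ranked: &[A], epsilon: f64, rng: &mut XorShift) -> Option<A> {
    if ranked.is_empty() {
        return None;
    }
    if rng.next_f64() < epsilon {
        let i = (rng.next_f64() * ranked.len() as f64) as usize;
        Some(ranked[i.min(ranked.len() - 1)].clone())
    } else {
        Some(ranked[0].clone())
    }
}
```

Because epsilon_greedy only sees the ranked list, the same TowardGoal strategy could be followed with any other selection rule without being modified.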

Environment

The Environment trait defines the functions to describe all of its valid states and the possible transitions between them. The following listing illustrates the Rust definition of the Environment trait.

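pub trait Environment<S: State, A: Action> {
    // True if `state` belongs to this environment.
    fn contains( &self, state: S ) -> bool ;
    fn initial( &self ) -> &S ;
    fn current( &self ) -> &S ;
    // True if `action` may be taken in `state`.
    fn is_valid( &self, state: &S, action: &A ) -> bool ;
    // Probability that `action` taken in `current` leads to `next`.
    // Defaults to 1.0, which models deterministic transitions.
    fn get_probability( &self, _current: &S, _action: &A, _next: &S ) -> f32 {
        1.0
    }
}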

The contains() function is used to assert whether or not the given state is part of the current environment. This design pattern is more general than mandating that an environment expose a set of all known states. For very large domains, it may not be possible to fully express this set. It is therefore preferable to simply assert whether or not a state belongs to the environment. This should be a sufficient condition to define the environment's bounds.

Additionally, the environment trait defines functions to retrieve the current state, the initial state, and to determine whether or not an action is valid for a given state. These capabilities can be used by agents to help them determine which actions to take.
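
As a rough illustration of these functions, the sketch below implements the trait for a one-dimensional corridor. Everything specific to the example is an assumption: Cell, Step and Corridor are hypothetical types, and the State and Action traits are empty stand-ins because the article does not list them. The point of interest is that contains() is a range check, so even a very long corridor never has to enumerate its states.

```rust
// Hypothetical stand-ins: the article does not list State or Action.
pub trait State {}
pub trait Action {}

pub trait Environment<S: State, A: Action> {
    fn contains(&self, state: S) -> bool;
    fn initial(&self) -> &S;
    fn current(&self) -> &S;
    fn is_valid(&self, state: &S, action: &A) -> bool;
    fn get_probability(&self, _current: &S, _action: &A, _next: &S) -> f32 {
        1.0
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Cell(pub i64);
impl State for Cell {}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Step { Left, Right }
impl Action for Step {}

/// A corridor made of the cells 0, 1, ..., length - 1.
pub struct Corridor {
    pub length: i64,
    start: Cell,
    now: Cell,
}

impl Corridor {
    pub fn new(length: i64) -> Self {
        Corridor { length, start: Cell(0), now: Cell(0) }
    }
}

impl Environment<Cell, Step> for Corridor {
    // Membership is a range check; no set of states is ever built.
    fn contains(&self, state: Cell) -> bool {
        (0..self.length).contains(&state.0)
    }

    fn initial(&self) -> &Cell {
        &self.start
    }

    fn current(&self) -> &Cell {
        &self.now
    }

    // An action is valid when both the state it starts from and the state
    // it leads to lie inside the corridor.
    fn is_valid(&self, state: &Cell, action: &Step) -> bool {
        let target = match action {
            Step::Left => Cell(state.0 - 1),
            Step::Right => Cell(state.0 + 1),
        };
        self.contains(*state) && self.contains(target)
    }
}
```

An agent standing on the initial cell can then ask is_valid() before stepping left and learn that the move would leave the environment.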

The trait also allows for stochastic state transitions by exposing a function, get_probability(), which will evaluate the probability that taking an action in a given state will lead to the specified next state. In combination with an apply() function (not shown in the listing above), which returns all possible outcomes of taking a given action in the current state, it is possible to estimate the value of an action.
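
One common way to form such an estimate is the one-step expected return, $Q(s,a) = \sum_{s'} p(s' \mid s,a)\,[r(s,a,s') + \gamma V(s')]$. The sketch below computes it from a list of candidate outcomes. It is an assumption about how the pieces could fit together, not the toolkit's implementation: the Environment trait is trimmed to get_probability(), the outcomes are passed in as a slice in place of apply(), the reward and value functions are plain closures, and Cell, Step, Slippery, action_value and the discount factor are hypothetical.

```rust
// Hypothetical stand-ins: the article does not list State or Action.
pub trait State {}
pub trait Action {}

// Trimmed to the single function this sketch needs.
pub trait Environment<S: State, A: Action> {
    fn get_probability(&self, _current: &S, _action: &A, _next: &S) -> f32 {
        1.0
    }
}

/// Expected one-step return of taking `action` in `current`:
/// the sum over outcomes of p * (reward + discount * value(next)).
pub fn action_value<S, A, E>(
    env: &E,
    current: &S,
    action: &A,
    outcomes: &[S],
    reward: impl Fn(&S, &A, &S) -> f32,
    value: impl Fn(&S) -> f32,
    discount: f32,
) -> f32
where
    S: State,
    A: Action,
    E: Environment<S, A>,
{
    outcomes
        .iter()
        .map(|next| {
            let p = env.get_probability(current, action, next);
            p * (reward(current, action, next) + discount * value(next))
        })
        .sum()
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Cell(pub i32);
impl State for Cell {}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Step { Right }
impl Action for Step {}

/// A slippery corridor: a step succeeds with probability 0.8 and leaves
/// the agent where it was with probability 0.2.
pub struct Slippery;

impl Environment<Cell, Step> for Slippery {
    fn get_probability(&self, current: &Cell, _action: &Step, next: &Cell) -> f32 {
        if next.0 == current.0 + 1 {
            0.8
        } else if next.0 == current.0 {
            0.2
        } else {
            0.0
        }
    }
}
```

With a reward of 1 for reaching cell 1, a value of 2 for that cell and a discount of 0.9, stepping right from cell 0 is worth 0.8 × (1 + 0.9 × 2) = 2.24.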

Finally, an execute() function (not shown in the listing above) will apply the given action to update the environment's current state.

Task

The Task trait describes an agent's objective by specifying its goals and the rewards received for its actions.

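pub trait Task<S: State, A: Action> {
    // Environment is a trait, so it is returned as a `dyn` trait object.
    fn get_environment( &self ) -> &dyn Environment<S,A> ;
    fn is_terminal( &self, state: &S ) -> bool ;
    fn is_goal( &self, state: &S ) -> bool ;
    // Reward for moving from `initial` to `next` by taking `action`.
    fn get_reward( &self, initial: &S, action: &A, next: &S ) -> f32 ;
}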

During the learning process, a task will be tied to a particular environment. The task's environment can be retrieved using the get_environment() function. Additionally, the task provides functions to assert whether or not a given state is terminal (is_terminal()), and whether it represents a goal (is_goal()). It is possible for a state to be terminal without being a goal. For example, a losing state in a game would be terminal without being a goal. The get_reward() function returns the reward received for moving from one state to another by taking a given action.
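
The tic-tac-toe example mentioned earlier gives a compact way to see these functions at work: two tasks share one environment but disagree about which states are goals. The sketch below is a simplification under stated assumptions. The board is reduced to its outcome, the Environment trait is trimmed to contains(), the rewards of +1 and -1 are arbitrary choices, and Player, Board, Mark, Game and WinAs are hypothetical names. The trait object is written with `dyn`, which current Rust requires.

```rust
// Hypothetical stand-ins: the article does not list State or Action.
pub trait State {}
pub trait Action {}

// Trimmed to a single function to keep the sketch short.
pub trait Environment<S: State, A: Action> {
    fn contains(&self, state: S) -> bool;
}

pub trait Task<S: State, A: Action> {
    fn get_environment(&self) -> &dyn Environment<S, A>;
    fn is_terminal(&self, state: &S) -> bool;
    fn is_goal(&self, state: &S) -> bool;
    fn get_reward(&self, initial: &S, action: &A, next: &S) -> f32;
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Player { X, O }

/// The board reduced to its outcome; a real state would hold the grid.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Board { InPlay, WonBy(Player), Draw }
impl State for Board {}

/// Placing a mark on one of the nine squares.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Mark(pub usize);
impl Action for Mark {}

pub struct Game;
impl Environment<Board, Mark> for Game {
    fn contains(&self, _state: Board) -> bool {
        true
    }
}

/// The task "win the game as `player`", tied to a shared environment.
pub struct WinAs<'a> {
    pub player: Player,
    pub game: &'a Game,
}

impl<'a> Task<Board, Mark> for WinAs<'a> {
    fn get_environment(&self) -> &dyn Environment<Board, Mark> {
        self.game
    }

    // Any finished game is terminal, whoever won.
    fn is_terminal(&self, state: &Board) -> bool {
        *state != Board::InPlay
    }

    // Only a win for this task's player counts as a goal.
    fn is_goal(&self, state: &Board) -> bool {
        *state == Board::WonBy(self.player)
    }

    // Assumed rewards: +1 for a win, -1 for a loss, 0 otherwise.
    fn get_reward(&self, _initial: &Board, _action: &Mark, next: &Board) -> f32 {
        match *next {
            Board::WonBy(p) if p == self.player => 1.0,
            Board::WonBy(_) => -1.0,
            _ => 0.0,
        }
    }
}
```

A state won by X is terminal for both tasks, a goal only for the task of playing X, and yields opposite rewards to the two players.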

Agent

The Agent trait represents the entity learning a Strategy to solve a given Task. The Rust definition of this trait is illustrated in the following listing:

pub trait Agent<S: State, A: Action> {
    fn get_capabilities( &self ) -> Vec<&A> ;
    // Strategy and Policy are traits, so they are passed as `dyn` trait objects.
    fn next_action( &self, state: S, strategy: &dyn Strategy<S,A>, policy: &dyn Policy<A> ) -> A ;
}

The Agent trait is very simple. It defines one function to retrieve the agent's capabilities and another to choose the action the agent will take in some state while following a given strategy and policy.
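
A minimal sketch of how an agent might combine the two is shown below. The article names a Policy trait without defining it, so the Policy trait here, with its single select() function, is purely an assumption, as are the stand-in State and Action traits and the Cell, Step, TowardGoal, Greedy, LeastPreferred and Walker types. Strategy and policy are passed as `dyn` trait objects, and next() is assumed to return a Vec.

```rust
// Hypothetical stand-ins: the article does not list State or Action.
pub trait State { type Value; }
pub trait Action {}

pub trait Strategy<S: State, A: Action> {
    fn next(&self, state: S) -> Vec<A>;
    fn get_value(&self, state: S) -> S::Value;
}

/// Assumed shape of a policy: pick one entry from a ranked list.
pub trait Policy<A: Action> {
    fn select(&self, ranked: &[A]) -> Option<A>;
}

pub trait Agent<S: State, A: Action> {
    fn get_capabilities(&self) -> Vec<&A>;
    fn next_action(&self, state: S, strategy: &dyn Strategy<S, A>, policy: &dyn Policy<A>) -> A;
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Cell(pub i32);
impl State for Cell { type Value = f32; }

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Step { Left, Right }
impl Action for Step {}

/// Strategy that ranks the step approaching `goal` first.
pub struct TowardGoal { pub goal: i32 }
impl Strategy<Cell, Step> for TowardGoal {
    fn next(&self, state: Cell) -> Vec<Step> {
        if state.0 < self.goal {
            vec![Step::Right, Step::Left]
        } else {
            vec![Step::Left, Step::Right]
        }
    }
    fn get_value(&self, state: Cell) -> f32 {
        -((self.goal - state.0).abs() as f32)
    }
}

/// Always takes the most preferred action.
pub struct Greedy;
impl<A: Action + Clone> Policy<A> for Greedy {
    fn select(&self, ranked: &[A]) -> Option<A> {
        ranked.first().cloned()
    }
}

/// Always takes the least preferred action; included only to show that
/// the strategy does not change when the policy does.
pub struct LeastPreferred;
impl<A: Action + Clone> Policy<A> for LeastPreferred {
    fn select(&self, ranked: &[A]) -> Option<A> {
        ranked.last().cloned()
    }
}

pub struct Walker { pub moves: Vec<Step> }

impl Agent<Cell, Step> for Walker {
    fn get_capabilities(&self) -> Vec<&Step> {
        self.moves.iter().collect()
    }

    // Panics if the strategy proposes nothing this agent can perform.
    fn next_action(&self, state: Cell, strategy: &dyn Strategy<Cell, Step>, policy: &dyn Policy<Step>) -> Step {
        // Keep only the proposals that are among the agent's capabilities.
        let ranked: Vec<Step> = strategy
            .next(state)
            .into_iter()
            .filter(|a| self.moves.contains(a))
            .collect();
        policy.select(&ranked).expect("no applicable action")
    }
}
```

Calling next_action() with the same TowardGoal strategy but different policies yields different moves, which is the decoupling the Strategy concept is meant to provide.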

Note that the Agent may also express an agency. This is necessary when trying to define a multi-agent learning system. The agent in this case will represent two or more cooperating agents. The strategy will have to be implemented accordingly.

The reinforcement learning elements described in this article will serve as the basis of what will hopefully become a useful tool for researching reinforcement learning. However, what is described herein is by no means final or complete. This will be an ongoing project, with the aim of producing a useful crate that fellow Rustaceans can build applications with.