## Esoteric Methods (Ongoing)

Esoteric is the (admittedly a little overdramatic) name I've applied to a suite of methods for machine learning that I've been working with over the past couple of years. The basic notion is that the use of some novel and unintuitive training approaches allows interesting emergent behavior to evolve in trainable agents. In particular, the use of these methods has allowed me to implement agents which learn to solve new problems for which a good or simple metric was unavailable, or for which the immediately intuitive method later turns out to have hidden flaws or nuance that prevent effective learning from occurring. In such cases, the investigation of alternative methods has borne fruit several times, and allowed for solutions to problems for which I previously had no effective method of resolution.

I use the term rather flippantly to refer to a number of techniques I use to extend machine learning to cases beyond those for which the initial algorithms are suited. Typically, I will think up a technique while approaching a problem for which it is unknown whether it even admits a solution, or for which effective quality metrics are non-trivial to determine and capricious to work with. In such cases, it can be difficult to determine a proper critic for the learning problem- in some cases difficult to the point that producing a deterministic algorithm which solves the problem may be more efficient than generating a critic and training an agent, if developing such an algorithm is even feasible.

These techniques have found use when a specific solution is not known, when parameters which would indicate improving performance are difficult to isolate, or when the input data is somehow limited. In all these cases, these peculiar training methods provide a workaround by examining what are essentially more abstract aspects of the problem, such as exploration, relative performance, or variation.

I also want to point out that I'm not claiming to have invented any of these techniques. For instance, using competition to drive reinforcement learning is a well-known method. Although I have not personally seen some of them elsewhere (novel state creation, for instance), I've not done an exhaustive search for them. I'm mostly recording my own work with them and analysis of their utility. If you've used something here, or something similar, I'd love to hear about it.

__Techniques__

1) Input Set Noise Injection

2) Rules-only Training

3) Novel State Creation

## Input Set Noise Injection

I'll start by saying that this technique is very unintuitive. The basic idea is that you have a set of exemplar data points on which your algorithm will train, and you expand the training set by manually corrupting those original samples in various ways. This does two things for you: it increases the effective size of your training set, potentially by a very large margin (say, three different corruptions with three different degrees each- counting 'leave it alone' as a fourth option, that's 4^3 = 64x expansion), with little loss of integrity, and, as discussed below, it breeds a useful resilience to noise into the trained system- all assuming you pick your transformations carefully.
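A minimal sketch of the idea follows. The transformations, toy 2x2 'characters', and degree values here are all illustrative stand-ins, not the original project's:

```python
import random

def flip_pixels(img, d):
    """Salt-and-pepper corruption: flip d randomly chosen binary pixels."""
    cells = [(r, c) for r in range(len(img)) for c in range(len(img[0]))]
    chosen = set(random.sample(cells, d))
    return [[1 - p if (r, c) in chosen else p
             for c, p in enumerate(row)]
            for r, row in enumerate(img)]

def shift_right(img, d):
    """Translation corruption: shift each row right by d cells, zero-padded."""
    return [([0] * d + row)[:len(row)] for row in img]

def augment(samples, corruptions, degrees):
    """Keep the originals and add one corrupted copy per
    (sample, corruption, degree) combination."""
    grown = list(samples)
    for img in samples:
        for corrupt in corruptions:
            for d in degrees:
                grown.append(corrupt(img, d))
    return grown

base = [[[1, 0], [0, 1]],
        [[0, 1], [1, 0]]]   # two toy 2x2 "characters"
grown = augment(base, [flip_pixels, shift_right], [1, 2, 3])
# 2 originals + 2 samples * 2 corruptions * 3 degrees = 14 samples total
```

The expansion factor grows multiplicatively if corruptions are composed rather than applied one at a time, which is where figures like 64x come from.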

The first time I employed this method was to expand the dataset for a character recognition project. The goal was to use a neural network- a Sanger net, which emulates PCA- to recognize digits and letters, essentially to try to interpret CAPTCHA codes. Our method was to linearize the array representation of the digit and use the computed principal components to identify the most likely matches from a pre-computed set of vectors. To do this, we trained a Sanger network on a set of example letters, and built up a database of known components for the characters we wanted to identify. Once we had a letter we wished to classify, we fed it into the network, and used a nearest-neighbor classifier to make the final selection for that instance.
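For reference, Sanger's rule (the generalized Hebbian algorithm) can be sketched in a few lines. The learning rate, synthetic data, and component count below are illustrative assumptions, not the original project's settings:

```python
import numpy as np

def sanger_step(W, x, lr=0.01):
    """One Sanger's-rule update: dW = lr * (y x^T - LT[y y^T] W),
    where LT is the lower-triangular part. Rows of W converge toward
    the leading principal components of the input stream."""
    y = W @ x
    return W + lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

rng = np.random.default_rng(0)
# Synthetic data whose variance lies mostly along axis 0.
X = rng.normal(size=(5000, 3)) * np.array([3.0, 1.0, 0.3])
W = rng.normal(scale=0.1, size=(2, 3))   # learn the top 2 components
for x in X:
    W = sanger_step(W, x)
# W[0] ends up approximately unit-norm and aligned with the first axis.
```

In the project described above, the rows of the converged weight matrix play the role of the pre-computed component vectors against which new letters are matched.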

When we attempted to train the network on the example letters we had, we found very weak convergence, and poor performance even on in-set classification. To cope with this problem, I wrote a few simple corrupting transformations to grow the data set without collecting any more original samples. When we used this material to train, we found not only excellent convergence in the network, but exceptional in-set and out-set classification performance: about 98% success for in-set and 92% for out-set groups.

From these results, it occurred to me that the use of these corrupted samples helped to breed into the system a fair amount of resilience against noise, as well as giving a broader base from which patterns could be drawn. What's more, it seemed (and in other circumstances, this seems to hold as well) to have given the agent a certain level of resistance to superficial patterns, strengthening its response to deeper correlations by washing out the unsubtle relations with noise.

Since working on this project, I've applied this principle to several different projects, one of the most notable being the Murin Tic Tac Toe learning project. In that experiment, I found that the agent learned perfectly well playing against an optimal hard-coded agent, but failed to play effectively against me. I realized that playing a 'perfect' opponent fails to teach generalizable strategy, and so introduced an element of randomness into the teaching agent, after which point the agent played well against humans.
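That fix amounts to a thin wrapper around the hard-coded teacher. The epsilon value and the `optimal_policy` stand-in below are illustrative assumptions:

```python
import random

def legal_moves(board):
    """Indices of empty cells on a flat 3x3 tic-tac-toe board (None = empty)."""
    return [i for i, cell in enumerate(board) if cell is None]

def noisy_teacher(board, optimal_policy, epsilon=0.2):
    """Play the optimal move most of the time, but with probability epsilon
    pick a random legal move, exposing the learner to positions a perfect
    opponent would never create."""
    if random.random() < epsilon:
        return random.choice(legal_moves(board))
    return optimal_policy(board)

# epsilon=0.0 reduces to the original perfect teacher:
move = noisy_teacher([None] * 9, optimal_policy=lambda b: 4, epsilon=0.0)
```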


## Rules-only Training

This particular method will most likely seem, at first, to be of limited worth. The basic idea is that any system with which an autonomous agent will interact has certain well-defined operational rules. These rules may be known empirically, or be enforced by the environment of the agent (the latter being the most common case for learning agents). These rules have known impacts on the agent's performance, and are natural consequences of actions taken in the environment (and by extension, are readily identifiable- if the agent cannot detect the effects readily, they aren't truly rules but rather the more subtle effects the agent learns from other metrics). This clean, first-order relationship can be taken advantage of by a composite critic to drastically improve agent performance, or even applied usefully without other critical input.

For a quick definition, when I mention a 'composite critic' I mean a critic which supplies feedback to an agent in training, composed of two or more independent analytical components, determining feedback based on a priority cascade. You may have a list of factual rules which relate to cleanly delimited states and which you can apply uniformly. Such information is likely to be absolute, but not to span the problem space. Likewise, you may have a metric which applies as a broad generality, but does not necessarily produce exactly correct output for every individual circumstance. That is, it may serve to indicate and drive trends of improvement, but not be 100% reliable step-to-step. These two individual metrics may be combined to teach more effectively by applying the specific knowledge when applicable, and the general knowledge otherwise.
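The priority cascade can be sketched as an ordered list of components, where the first component with a verdict wins and the more general, less reliable ones are consulted only when the stricter ones are silent. The component functions and return-value conventions here are illustrative assumptions:

```python
def composite_critic(components):
    """components: functions state -> feedback (a float) or None,
    ordered from most specific/reliable to most general."""
    def critic(state):
        for component in components:
            feedback = component(state)
            if feedback is not None:   # this component has a verdict
                return feedback
        return 0.0                     # no component had anything to say
    return critic

# Example: a hard rule that vetoes known-bad states, backed by a loose
# heuristic score used only when the rule is silent.
def rule_check(state):
    return -1.0 if state.get("violates_rule") else None

def general_metric(state):
    return 0.5 if state.get("looks_better") else None

critic = composite_critic([rule_check, general_metric])
critic({"violates_rule": True, "looks_better": True})   # -1.0: the rule wins
```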

For instance, consider an agent in an environment that is a sorting problem of some kind. Let it be a known fact that in a proper solution, item C will never be placed ahead of item B; for the sake of argument, consider this a direct result of a cursory mathematical treatment. If this is the case, then the agent may at some point during learning try to place C ahead of B. Some loose, general metric (which we use in addition to the stricter criteria because it is more broadly applicable) may indicate that this was an improvement, and thus want to reward the agent for this maneuver. However, it's a known fact (a *rule*) that this is wrong, and so we ought to penalize the agent. The mathematical knowledge should be checked first, and only if that more reliable check produces no signal should the more general metric be consulted- that is the prioritization element.

I apply this sort of thinking to my agents in almost every problem I work with. Most times, when learning in a simulation, there is at least an enforced set of system boundaries which the simulation itself automatically identifies as improper actions. In the Tower of Hanoi problem, for example, the simulation side will not allow an agent to put a larger disk atop a smaller one, as that is a rule violation. Using these cases as strong cases for punishment can dramatically reduce learning time by immediately eliminating whole classes of potential actions.

Interestingly, a common trend in the problems I approach is that positive feedback from the general critic in known-good cases is effective, while negative feedback is ineffective. This is sensible, as there is typically a wide range of negative cases for every positive case, and negative short-term cases may eventually contribute to a long-range solution (this is especially relevant to the Murin agent, which specifically seeks to enforce these longer-range approaches). This being the case, there is little call for negative reinforcement when general patterns are being exploited for long-range goals. However, in a case where there are known-negative examples, the application of a negative signal is massively beneficial, as it tends to prune long chains from the action-space tree- especially if such pruning can occur early in training.

This makes for a nicely broad approach, with specific negative feedback early on reducing the total exploration space, and general positive trends enforcing preferential traversal of efficient solutions later. As mentioned, this overlap is especially pertinent to the Murin algorithm, as it looks for temporal patterns and is thus highly reliant on efficient exploration, which discourages negative feedback due to its potential long-term impedance of exploration. This long-term impeding effect can be thought of as an emergent effect of goal-seeking behavior- a negative impulse on a specific action in the tree can be regarded as a small potential barrier the stochastic exploration process must overcome to investigate that chain of behaviors.

Since historic updates tend to be weaker than immediate updates, a lucky excursion to a highly effective solution path may not remove that negative barrier, and thus may essentially truncate that path as other, shorter-term (but less effective) solutions are presented. Essentially, negative feedback from a general critic tends to bias the agent towards greedy solutions. That is why positive feedback should be associated with general metrics and negative feedback with specific ones (as a rule of thumb- obviously there are always exceptions).

## Novel State Creation

The utility of this technique is mostly rooted in the fact that it is system-independent. It is an approach designed to give the agent impetus to explore state space efficiently, and therefore to come to find solutions by efficient space-searching. The method is thus: the agent should keep a record of what states it has visited recently (typically this means 'within the current epoch', but it may be within a certain number of time steps), and whenever it experiences a state that is new with respect to that record, it should receive positive feedback. When the agent begins a new epoch (however that is defined) the state queue should be reset and the process of filling it begun again.
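The method above can be sketched as a small critic class, assuming hashable states and positive-only feedback (the reward magnitude is an arbitrary choice):

```python
class NoveltyCritic:
    """Reward states not yet seen this epoch; reset at epoch boundaries."""

    def __init__(self, reward=1.0):
        self.reward = reward
        self.seen = set()

    def begin_epoch(self):
        """Clear the visited-state record at an epoch boundary."""
        self.seen.clear()

    def feedback(self, state):
        """Positive feedback the first time a state appears this epoch."""
        key = tuple(state)          # states must be hashable
        if key in self.seen:
            return 0.0
        self.seen.add(key)
        return self.reward

critic = NoveltyCritic()
critic.begin_epoch()
critic.feedback([0, 1])   # 1.0: new state
critic.feedback([0, 1])   # 0.0: already visited this epoch
```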

In general, negative feedback should not be applied for visiting a previously visited state, because this tends to discourage exploration. Consider the potential chains of actions as a tree: giving negative feedback on a previously visited state will create a bias against exploring paths that would require backtracking through later states. It also tends to discourage entry to that state when you later restart from the initial node. As such, it creates a premature partial pruning of the paths that succeed that node. This is, naturally, inimical to exploration- it effectively makes the exploration wider at the expense of depth.

There are, of course, a number of potential modifications and derivative methods that this critic can lead us to consider. First and foremost, it meshes very nicely with a rules-based critic, forming a fairly general composite critic which can make use of whatever knowledge of the simple dynamics of a problem is known, and then take a combinatorial exploration approach to ferreting out whatever remaining information it can. Other variations include continuing to apply positive reinforcement until each state has been visited a number of times, if it is thought that some level of backtracking will be necessary or useful. Another variant would be to apply reward whenever a frequency plot of the visits to various states is made more uniform, encouraging expansive search of the space. Such an approach would be particularly relevant if there were a stochastic element to the solution, doubly so if there are hidden-variable elements to the domain.

One could also (as mentioned in the first paragraph) keep states in the 'recently visited' memory for a certain amount of time, and then let each state 'expire' and be reinforced when visited again later. This could be implemented either by keeping a timer for each state, or by letting the queue reach a certain size, after which adding new states pops the oldest entry out. This would help in cases in which the actual input space is smaller than the true state space of the problem, and thus likely to have its 'already visited' list fill much more quickly than the state space can be covered.
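The queue-based version of this expiring memory falls out almost for free from a bounded deque; the capacity and reward values here are arbitrary illustrations:

```python
from collections import deque

class ExpiringNoveltyCritic:
    """Remember only the last `capacity` distinct states, so old states
    fall out of the record and earn reward again when revisited."""

    def __init__(self, capacity=3, reward=1.0):
        self.recent = deque(maxlen=capacity)  # oldest entry pops out automatically
        self.reward = reward

    def feedback(self, state):
        key = tuple(state)
        if key in self.recent:
            return 0.0
        self.recent.append(key)   # may evict the oldest remembered state
        return self.reward

critic = ExpiringNoveltyCritic(capacity=2)
critic.feedback([0])   # 1.0: new
critic.feedback([1])   # 1.0: new
critic.feedback([2])   # 1.0: new, evicts [0]
critic.feedback([0])   # 1.0 again: [0] has expired from memory
```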

This critic is best suited to combinatorial problems, in which state-space traversal is an effective model of how the agent should solve the problem. In other situations, such as navigational problems, it may be less effective, as the natural state-space representation (a discretized partition of locations the agent can occupy) is not particularly well suited to resolution by exploration (unless the goal is to locate something whose position is unknown, in which case the problem is really a constrained search problem). In general, the process of exploration becoming limited by an increasing probability of taking paths to goals leads to effective learning, as it is essentially an abstracted form of gradient descent.

Of particular note is the interaction between solution paths of differing lengths under reinforcement learning. Over many epochs, shorter paths will come to be preferred, though occasional coalescence to a long solution may have to be overcome by persistent exploration (see the example in the Tower of Hanoi wherein the agent settles on one solution, then spontaneously identifies a new refinement).

To see why this is the case, consider that a reinforcement learning agent decides what action to take at a given juncture based on some quality factor that represents the benefit associated with taking that action from the current state. Typically the value of this quality factor is determined by the frequency with which that state-action pair receives positive feedback, and we pick the action by making the probability of selection proportional to quality. Now, at a time step after which the qualities are near their steady-state values, consider two paths which both lead to solutions. The total probability associated with taking a given path is the product of the probabilities of all its constituent actions. Near the steady state, all probabilities in a given solution chain will be approximately 1; however, because the learning rate will never exceed 1.0, they will never quite reach 1.0. Therefore, for a number of actions all of roughly uniform probability, a shorter chain will tend to be marginally more likely, and thus more stable in a statistical sense. This is the mechanism that causes this critic to tend to select for faster solutions, and to eventually settle on shorter solutions over longer ones, though, as a stochastic relaxation process, it often takes time to jar the agent from a previous metastable solution.
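The chain-length argument can be made concrete with a couple of lines. The per-action probability of 0.95 is an arbitrary illustrative value; 7 moves is the optimal 3-disk Tower of Hanoi solution, and 9 stands in for a slight detour:

```python
per_action = 0.95                 # settled per-step selection probability, just under 1
short_chain = per_action ** 7     # probability of completing the 7-move solution
long_chain = per_action ** 9      # probability of completing a 9-move detour

# The shorter chain is consistently more likely, hence more stable:
# short_chain ~ 0.698, long_chain ~ 0.630
```

The gap widens as per-action probabilities drop or chain lengths diverge, which is why the pressure toward shorter solutions persists even very close to steady state.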

I've used this particular strategy a few times, most notably with the Tower of Hanoi and the Sorting problem. In each of these cases, the combinatorial nature of the problem made identification of a specific metric (without solving the problem wholesale) tenuous, and effective learning was achieved by exploration training via this mechanism. Additionally, both example cases can be observed to exhibit the settling behavior mentioned above, finding local optima, and in some cases dislodging to superior, or even global, performance maxima.
