__Experiment 6: Tower of Hanoi__

In looking at the previous five experiments, it is easy to see that the testing so far has been focused heavily on navigational problems. (The XOR problem is not navigation, but it is also not a hard problem- not very indicative, by itself, that Murin is good at non-navigational problems.) In the interest of broadening the range of problem types, I explored various puzzles and games, and it occurred to me that the Tower of Hanoi puzzle would be a good example task- it's procedural, strongly associated with recursion, and *definitely* not a navigation task.

This problem ends up being interesting for three related reasons. The first is, as mentioned, that it's a new class of problem on which we can observe learning- combinatorial, rather than navigational. The second is that in developing the critic- a more nuanced task here than it was for the navigation problems- we stumble upon a new, more general method of generating feedback. This technique turns out to be fairly powerful, and will form the basis of later work, for reasons we will explore when we discuss training the agent. Finally, we observe the agent learning quickly while working from an input space vastly reduced from the actual state space of the problem (which, for a combinatorial problem like this, can become staggeringly large very fast), reinforcing the utility of the local nature of the algorithm's design.

The basic idea, for those who have not seen the puzzle, is that you have a number of rings of varying size stacked on one of three pegs, and you want to move the rings, one at a time, from the first peg to the third. The catch is that you cannot place a ring on top of a smaller one. For example, the initial configuration has ring 1 on top, ring 2 below it, ring 3 below ring 2, and so on. If, later on, ring 1 sits on peg 2, you cannot move ring 2 to peg 2, since it would land on top of ring 1.
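The rules above are compact enough to sketch directly. The following is a minimal illustrative model- the names and representation are my own assumptions, not taken from the Murin code- where each peg is a stack of ring sizes with the largest at the bottom:

```python
# Minimal Tower of Hanoi model: pegs are lists of ring sizes, largest at
# the bottom, top of the stack at the end of the list.

def initial_state(n_rings):
    # Peg 0 holds all rings: n_rings at the bottom, ring 1 on top.
    return [list(range(n_rings, 0, -1)), [], []]

def is_legal(state, src, dst):
    # A move is legal if the source peg is non-empty and its top ring is
    # smaller than the destination's top ring (or the destination is empty).
    if not state[src]:
        return False
    return not state[dst] or state[src][-1] < state[dst][-1]

def apply_move(state, src, dst):
    # Mutates state; caller is expected to check legality first.
    state[dst].append(state[src].pop())

def is_solved(state, n_rings):
    return len(state[2]) == n_rings

state = initial_state(3)
assert is_legal(state, 0, 1)        # ring 1 may move to an empty peg
apply_move(state, 0, 1)
assert not is_legal(state, 0, 1)    # ring 2 may not sit on ring 1
```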

For example, see the 4-ring game below:

To solve this puzzle, there are a few questions we need to ask ourselves. First, what exactly are our inputs going to be? In previous problems, this was a trivial question (or nearly trivial- we did have to partition the analog values): the inputs were whatever the sensors reported. In this case, though, there is no intuitive 'sensor' to plug into. We might consider feeding in the abstract state of the problem- an encoding of the positions of all rings, for instance- but this could easily metastasize into a tremendously unwieldy state space.

Consider the 5-ring case (the one we'll be learning from shortly). Naively, we could treat each ring as a partial state, with its position on a peg as another state component. You could encode this as a set of three 5-bit binary states- 2^15 combinations in all- just shy of 33,000 possible positions. Probably too many for such a small problem. We could instead address each ring by its peg, and let the natural sorting properties of the problem allow inference of the full order (a potentially risky approach, but relying on inherent patterns in the world can sometimes be useful). In this case, we'd have 3^5 addresses, about 250 states. Much better, but still reliant on the global positions of the rings, and honestly, we want a more local approach.
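The two counts above can be checked directly:

```python
# One bit per (ring, peg) pair for 5 rings on 3 pegs: 15 bits.
naive_states = 2 ** (3 * 5)
print(naive_states)       # 32768 - just shy of 33,000

# Addressing each ring by its peg alone: 3 choices per ring.
addressed_states = 3 ** 5
print(addressed_states)   # 243 - about 250
```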

The tactic I settled on was to build a state from comparisons of the heights of the towers- whether the stack on peg 1 is taller than, shorter than, or the same height as peg 2; peg 2 than peg 3; and so on. This yields a very pleasant 27-state input space that is genuinely local: it is independent of the total number of rings in use. There is the question, however, of whether this represents sufficient information for the agent to learn from. A test can answer this in one direction- if the agent does learn a solution, then clearly the information is sufficient. If it does not, we cannot conclude the input is insufficient, since the failure could have other causes. In this case, I'll go ahead and spoil the surprise: the information is sufficient, since the agent does indeed solve the problem. The action space is simple- 6 possibilities, one for each ordered pair of source and destination pegs.
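A sketch of this comparison-based encoding may make it concrete. The function names and the trit packing are my own illustrative choices; the essential point is that three three-way comparisons give 3^3 = 27 input states regardless of ring count:

```python
from itertools import permutations

def compare(a, b):
    # Three-way height comparison: 0 shorter, 1 equal, 2 taller.
    return 0 if a < b else (1 if a == b else 2)

def encode(state):
    # Build the input purely from stack heights - a local observation
    # that never depends on how many rings are actually in play.
    h = [len(peg) for peg in state]
    trits = (compare(h[0], h[1]), compare(h[1], h[2]), compare(h[0], h[2]))
    # Pack the three trits into a single index in [0, 27).
    return trits[0] * 9 + trits[1] * 3 + trits[2]

# Action space: one action per ordered (source, destination) peg pair.
ACTIONS = list(permutations(range(3), 2))

state = [[5, 4, 3, 2, 1], [], []]           # 5-ring starting position
assert encode(state) == 2 * 9 + 1 * 3 + 2   # peg 1 taller; pegs 2 and 3 equal
assert len(ACTIONS) == 6
```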

On the left, an example input state; on the right, an illustration of the actions.

The main task which remains is the design of the critic, and this is where the experiment becomes very interesting. I ran into issues almost immediately when trying to generate the feedback signal: questions of how to account for mixed positions of rings are difficult to answer without resorting to a complete description of the procedure for solving the puzzle, which, as we know, is something to avoid. For what it's worth, I tried several loose score-based metrics which attempt to reward both moving rings and building structures generally closer to the final stack, none of which worked very well. I am certain that such a metric can be created, but I am not certain that creating one is not equivalent to writing a deterministic algorithm to solve the problem.

Instead, I switched gears and implemented an off-hand idea I'd toyed with a while back as a possible semi-unsupervised training method: a list of states, initially empty, against which the critic checks whether the current state is a member. If not, positive feedback is applied and the new state is added to the list. Otherwise, if the state has been visited, there is no feedback, and we move on. This approach rewards the agent for exploring its state space rather than for achieving specific goal milestones, which is useful when those milestones are unknown. Additionally, because the simulation enforces the rules of the puzzle, we add one extra piece: if the agent attempts an illegal move, it receives negative feedback. This method is discussed further in the Esoteric Training section.
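As a sketch, the critic might look like the following. The class name, feedback magnitudes, and the use of a set rather than a literal list are my own assumptions; the logic (reward novelty, stay silent on revisits, punish illegal attempts) is the method described above:

```python
class NoveltyCritic:
    # List-of-visited-states critic: positive feedback for new states,
    # null feedback for revisits, negative feedback for illegal attempts.

    def __init__(self, reward=1.0, penalty=-1.0):
        self.visited = set()   # a set for fast membership checks
        self.reward = reward
        self.penalty = penalty

    def feedback(self, state, was_illegal):
        if was_illegal:
            # Illegal moves are punished regardless of novelty.
            return self.penalty
        if state not in self.visited:
            # A newly discovered state earns positive feedback.
            self.visited.add(state)
            return self.reward
        # Revisits get null feedback - not punishment.
        return 0.0

    def reset(self):
        # Called when the goal is reached: re-empty the list for a new epoch.
        self.visited.clear()

critic = NoveltyCritic()
assert critic.feedback(23, was_illegal=False) == 1.0   # first visit
assert critic.feedback(23, was_illegal=False) == 0.0   # revisit: null
assert critic.feedback(23, was_illegal=True) == -1.0   # illegal attempt
```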

Generally speaking, we want to encourage spanning searches with positive feedback for new discoveries, but we do not want to punish visits to previously occupied states. Such punishment tends to discourage exploration by reducing the probability of revisiting earlier nodes on the path tree- nodes which may branch out to many unvisited possibilities- and in extreme cases it may trap the agent for a long time in a region of state space that contains no solution. Consistent negative feedback from visiting old states would eventually wear down the potential barrier preventing further exploration, but that would be a costly experience (time-wise) which null feedback for the revisiting case avoids altogether. Even without trapping the agent, such feedback could have long-reaching effects if it discourages a state that would lead to a superior solution path, entrenching a low-quality solution instead. We want to avoid this, naturally.

We do, however, want to discourage behavior which fails to change the state of the agent, which is what happens when the agent tries an illegal move. Such moves are never allowed to impact the 'real world' (the simulation environment's true state, anyway) and thus are simply time-wasting exercises. Applying feedback to encourage rule-following, implemented before any solution-seeking critic is employed, is a strong technique: it can coax performance gains from nearly all autonomous learners, and it is very nearly computationally free to implement if the simulation simply denies illegal actions, either by ending the epoch (less constructive, but functional) or by waiting for a valid command and returning a fault state (generally preferred, as it allows continued exploration within the same epoch). We address this method further in the Esoteric Training section as well.
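The preferred deny-and-fault handling can be sketched as a step function. This is an illustrative shape, not the Murin implementation: an illegal command never touches the true state, and a fault flag is returned so the epoch continues:

```python
def step(state, src, dst):
    # Deny-and-fault: an illegal command leaves the true state untouched
    # and reports a fault, rather than ending the epoch.
    legal = bool(state[src]) and (not state[dst]
                                  or state[src][-1] < state[dst][-1])
    if not legal:
        return False   # fault: state unchanged, exploration continues
    state[dst].append(state[src].pop())
    return True        # move accepted and applied

state = [[3, 2, 1], [], []]
assert step(state, 0, 2)            # ring 1 to peg 3: legal
assert not step(state, 0, 2)        # ring 2 onto ring 1: fault
assert state == [[3, 2], [], [1]]   # the illegal attempt changed nothing
```

The fault flag pairs naturally with the critic above: the simulation reports `was_illegal`, and the critic converts it into negative feedback without the environment ever entering an invalid configuration.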

This method is particularly fascinating because it presents an opportunity to implement a new style of learning. Being semi-unsupervised means that we do not need, loosely speaking, a specific critic to observe the agent's progress, though we do need some form of termination mechanism to restart the 'epoch' (re-emptying the list) when the goal is achieved, to allow further learning. This naturally means the method is best suited to combinatorial problems, but it nonetheless gives us a powerful tool. Hypothetically, it should also be applicable to other sorts of learning, though I have yet to investigate.

Of course, this approach is most useful when a problem space is progressive- that is, each state occupied during a solution is occupied exactly once, or, put differently, back-tracking is minimal or unnecessary. The class of problems for which this holds is very broad, since most of the time, needing to revisit a specific location in state space is indicative of either information scarcity or the start of a new epoch.

Using this relatively simple critic, the agent does in fact learn to solve the Tower of Hanoi puzzle, and does so rather quickly as well, demonstrating the same sort of exponential learning we've seen in other scenarios. Below is a representative learning curve, operating on the 5-ring case.

On this learning curve, we can see the characteristic initial poor performance, followed by an insight event, after which the exponential descent to the near-optimal level (within 10% in this particular case) occurs.

On longer-running scenarios, we occasionally observe the agent settling on a given solution for a period of time before destabilizing and re-coalescing into a superior minimum. This emergent behavior in the Murin agent presents occasionally, and illustrates that when allowed to continue experimenting, the agent can identify other paths to the goal and free itself from a local minimum. I refer to this as an 'Innovation Phenomenon', in similar parlance to the insight event.

For instance, the graph above shows performance on a 4-ring puzzle. The agent initially identifies a 36-step solution, which is certainly a learned solution, but misses the optimum by more than a factor of 2. Then, around 500 epochs in, it destabilizes and settles on a new 19-step solution- still a bit shy of the 15-step theoretical minimum, but an effective solution and far superior to the prior one. Of course, it is also possible for the agent to leave a good minimum and settle into a new, less efficient local minimum. These cases tend to be less common, however, as higher-quality solutions tend to be more stable: their paths are self-reinforcing, leaving fewer opportunities for random exploration to destabilize the agent.

Another interesting observation is that this structure leads to an emergent minimization of paths. Consider that the list critic gives less and less feedback as exploration continues. Given that the reset comes when the agent achieves its goal, actions that contribute to reaching the goal will receive a high proportion of positive feedback relative to the number of times they are performed without feedback. This leads to a natural statistical progression towards the exclusion of transitions to irrelevant states, eventually resulting in a locally minimal path to the goal condition, similar to the emphasis on 'correct' behavior that QLearning develops over many trials.

This experiment represents the culmination of two very important aspects of the greater investigation into this algorithm. It illustrates the resolution of a fairly complex, non-navigational problem, and does so with a highly reduced input state space (down to 27 from a potential high of 33,000). In doing so, the utility of the algorithm is highlighted by virtue of its flexibility in terms of the sort of problem it can solve. The second salient point is that the training mechanism is not only very loose with respect to the mechanics of the particular problem, but in fact completely general to the actual problem statement- it relies not at all on any analysis of the actual puzzle. It may be worthwhile at this point to note that this is another problem domain in which QLearning fails to identify a solution (or at least fails to demonstrate any evidence of learning on an absurdly large time scale), even given many different possible parameter settings.
