__Experiment 7:__

__Binary Addition__

In continuing the exploration of non-navigational problem spaces, I seek other procedural tasks on which to test the agent. Addition is one such task that piques my interest because it is procedural, sequential, and lends itself to critic generation. It is also interesting to compare addition to multiplication in terms of problem scope- addition is a far more localized problem than multiplication- it requires carrying digits, but not much else in the way of sophistication. Multiplication, on the other hand, is very global- it explores the whole cross-product space of the digits of the numbers involved, and requires subsequent mixing thereof (by addition). This comparison aside, addition provides another interesting problem for us to solve. In this case we are going to do binary addition, on the grounds that it is mathematically equivalent to other bases but reduces the input space, which speeds learning and simplifies the encoding (which we'd have to construct somehow, anyway).

Let's start by defining the problem space. We're going to be adding two numbers together, call them A and B. We don't want to have to specify the maximum length of the numbers, and using all the digits from each addend at once would produce a massive state space. In addition to being much too global for Murin's taste, this is also a pretty ineffective procedure- we'd essentially be asking the agent to memorize an addition table, which is not a particularly useful goal. Instead, we're going to have the agent tick through the digits and progressively construct the whole solution in sequential order, as a human would when doing long arithmetic. The input, then, will in this case be just the two current digits. The 'actions' (really more of an output in this case) are selecting '1' or '0' as the correct entry.
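As a sketch of this setup (in Python, with hypothetical names- the original implementation is not shown), the digit stream presented to the agent might look like:

```python
import random

def digit_pairs(a, b):
    """Yield (a_i, b_i) digit pairs from least- to most-significant bit,
    padded one place so a final carry digit fits."""
    n = max(a.bit_length(), b.bit_length()) + 1
    for i in range(n):
        yield (a >> i) & 1, (b >> i) & 1

# At each time step the agent sees only the current digit pair and must
# output the corresponding bit of A + B.
a, b = random.randrange(256), random.randrange(256)
for i, (da, db) in enumerate(digit_pairs(a, b)):
    correct_bit = ((a + b) >> i) & 1  # the critic's target for this step
```

Note that nothing about the carry appears in the input- the agent receives only the two current digits.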

The simple structure of this problem permits an equally simple critic- this is one of those cases for which the solution is known, and we can apply feedback based on whether or not the given output is correct, so that the reinforcement signal is determined by whether the agent's output equals the correct answer. This simplicity, however, leads to some trouble down the road. Fortunately it will be a learning experience, as we will see.
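A minimal sketch of such a critic (the +1/-1 reward magnitudes are an assumption for illustration, not taken from the original):

```python
def critic(agent_bit, correct_bit):
    """Equality-based critic: the reinforcement signal depends only on
    whether the agent's output matches the correct digit of the sum.
    The +1/-1 magnitudes here are illustrative assumptions."""
    return 1.0 if agent_bit == correct_bit else -1.0
```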

Before we charge into the actual experiment, I want to make one more observation. Note that reinforcement is provided in this case exclusively on whether or not the correct digit is produced by the agent. There is no explicit training which supplies information about the carry operation to the agent. Additionally, since the carry at a given digit depends on all the preceding digits, knowledge of only the immediately preceding addend digits is insufficient. That is to say, knowing the previous digit pair does not tell you whether you have a carry unless you know whether it had a carry, and so on. The agent must therefore learn how to handle this internally. That is the most interesting aspect of this experiment.
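To make the point concrete, here is a small worked example (in Python; the helper name `add_bits` is mine): the same input pair can demand different outputs depending on the carry history.

```python
def add_bits(a_bits, b_bits):
    """Ground-truth long binary addition over aligned LSB-first bit lists."""
    out, carry = [], 0
    for da, db in zip(a_bits, b_bits):
        s = da + db + carry
        out.append(s & 1)   # digit written at this position
        carry = s >> 1      # carry propagated to the next position
    return out

# The same input pair (0, 0) demands different outputs depending on
# history: in 1 + 1 the second-position pair (0, 0) must output 1
# because of the incoming carry; in 0 + 0 it must output 0.
print(add_bits([1, 0], [1, 0]))  # 1 + 1 = 10 -> [0, 1]
print(add_bits([0, 0], [0, 0]))  # 0 + 0 = 00 -> [0, 0]
```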

With that in mind, we'll start looking at some actual data. Below is a first run of the agent on the problem, as outlined above, and for comparison, a QL agent and a random agent:

Here, we've got the Murin agent in red, the QL agent in blue, and a random agent in yellow. The random agent demonstrates the poorest performance, a completely predictable 50% accuracy rate. The QL agent fares a little better, settling to a success level around 57% or so- only marginally better, but a measurable improvement. The Murin agent performs most strongly, reaching about 73% accuracy. Though this is a clear indicator of learning, it is of course dramatically short of what we would want, which is 100% accuracy from a trained agent.

This sort of behavior- stabilization to a partial solution, performing at a less-than-optimal level while still demonstrating measurable learning- is indicative of information starvation: the input and temporal space do not contain sufficient information to generate a complete solution, but do allow some insights to be gleaned. This behavior appears frequently when experimenting with alternative input schemes which attempt to simplify the input space, but remove too much information. Based on our earlier discussion of the method of long addition, and how the carry procedure requires long-ranging relationships, this makes sense- what the agent should do depends explicitly on all previous states. We could add a mechanism to emulate the carry, and apply a deliberate training signal for that output, but that would be against the spirit of the investigation. Instead, we'll experiment with ways of increasing the internal pattern information available to the agent.

Our tactic here is actually rather simple. The agent as initially implemented includes the prior action vector as augmentation to the next input. To expand the available information at each time step, we now augment the input with both the prior action and the prior state, both of which we know to be universally available to agents. When we implement such an agent, we get the following learning curve:
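A sketch of the augmented input construction (the vector layout, one-hot action encoding, and names here are illustrative assumptions, not the original implementation):

```python
def build_input(curr_state, prev_state, prev_action):
    """Concatenate the current digit pair with the prior digit pair and
    a one-hot encoding of the prior action ('0' or '1') to form the
    augmented input vector fed to the agent."""
    one_hot = [1.0, 0.0] if prev_action == 0 else [0.0, 1.0]
    return list(curr_state) + list(prev_state) + one_hot

# e.g. current digits (1, 0), previous digits (1, 1), previous action '1'
x = build_input((1, 0), (1, 1), 1)  # -> [1, 0, 1, 1, 0.0, 1.0]
```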

Which is far more satisfying- railing to 100% with only occasional mistakes. Now, it's reasonable to ask whether this new input provides sufficient information that the agent is just learning a small-scale table of addition answers, and that the temporal aspect is immaterial. That is to say, I claimed earlier that the agent learned the carry mechanism abstractly, but might the new information eliminate the need for a carry at all? To answer this, consider first that knowledge of the preceding digit pair and the action taken is, in principle, sufficient to infer whether a carry is necessary.

This inference works only *if* the prior answer is known to be correct or incorrect. The agent does not know whether the preceding entry is correct, and so there's a bootstrap problem- it must correctly perform addition *before* the supplied information is sufficient without additional abstraction. Consequently, it must be emulating the function of the carry at some level of abstraction.

Though it is not direct corroboration of this, adding this additional information to the QL agent does significantly improve that agent's performance- to the point where it is marginally better than the Murin agent without the additional information, but only to an average of about 78% accuracy. Long-running fine tuning fails to improve this measure, indicating that some portion of the function is still missing; given what's happening with the Murin agent, the suggestion is that the abstracted notion of carrying is what is missing from the QL agent's process.
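The inference in question can be checked exhaustively. A minimal sketch (the function name and the Python framing are mine): given the previous digit pair and a previous output *assumed correct*, the incoming carry is recoverable from prev_output = (prev_a + prev_b + c_in) mod 2, and the carry into the current position is the high bit of the full sum.

```python
def carry_into_next(prev_a, prev_b, prev_output):
    """Infer the carry into the current position from the previous digit
    pair and the previous output, assuming that output was correct."""
    c_in = (prev_output - prev_a - prev_b) % 2  # recover prior incoming carry
    return (prev_a + prev_b + c_in) >> 1        # high bit = outgoing carry

# Exhaustive check against ground-truth long addition: all eight
# combinations of digits and incoming carry agree.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s = a + b + c
            assert carry_into_next(a, b, s & 1) == s >> 1
```

The check holds for all eight digit/carry combinations- but it relies precisely on the assumption that the previous output was correct, which is the bootstrap problem described above.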

What this experiment shows us is a case in which we have a simple problem with a simple metric, yet one which remains resistant to learning without temporal linking- QLearning barely makes any gains over the random agent. The basic Murin agent does better, approaching 75% accuracy, but fails to fully converge to the complete solution. This matches the pathology of information scarcity seen in other cases, and so we add an additional feedback loop in the form of additional augmentations to the input vector. With this modification to increase the volume of temporal information available, we see the more familiar meteoric rise in performance that is typical of successful learning. Furthermore, this investigation has revealed a new method of increasing the information available to the agent, by demonstrating that feeding back the prior state (in addition to the prior action) into the augmented input can supply sufficient information for the agent to solve a problem which it cannot completely resolve otherwise.