## Reinforcement Modified Neural Networks (Incomplete Work)

One of the most pressing limitations of many of my projects in machine learning is the restriction to discrete quantities for inputs and outputs. Most such algorithms require any continuous signal to be partitioned and discretized, with any increase in resolution offset by a dramatic increase in state space size: combinatorial explosion. In most situations, the final result is the compression of the space down to a set of binary variables; a distance sensor becomes a yea/nay presence detector, for example.

In practice, this is often a fairly effective parametrization for real world problems. For instance, in the Mordax implementation, obstacle detection and avoidance are easily accomplished with a binary threshold on the pertinent distances. There are cases in which a continuous range of inputs would be preferable, though. To that end, I've been working on some alternate implementations of reinforcement learning that use a neural network as the basis, replacing the state-action table.

The basic workhorse neural network is a 2-layer Error Back-Propagation (EBP) implementation. There are many, many variants suited to various tasks. Since I prefer to work toward a generalizable model for arbitrary problems, rather than invent another variant (or pick one; there are *hundreds* of them) that's well conditioned to my particular problems, I'm instead going to tack on some extras to the stock model and work with that.
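
As a concrete sketch of the kind of stock 2-layer EBP network I mean (the class name, layer sizes, and learning rate below are my own illustrative choices, not a fixed design), here's a minimal version with sigmoid units trained by backpropagation of squared error:

```python
import numpy as np

class TwoLayerEBP:
    """A minimal 2-layer backprop network: input -> hidden -> output."""

    def __init__(self, n_in, n_hidden, n_out, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(self, x):
        # Cache activations for use in the backward pass.
        self.x = np.asarray(x, dtype=float)
        self.h = self._sigmoid(self.x @ self.W1 + self.b1)
        self.y = self._sigmoid(self.h @ self.W2 + self.b2)
        return self.y

    def backward(self, target):
        """One EBP step: adjust weights to reduce squared error vs. target."""
        t = np.asarray(target, dtype=float)
        # Error deltas for sigmoid units under squared-error loss.
        d_out = (self.y - t) * self.y * (1.0 - self.y)
        d_hid = (d_out @ self.W2.T) * self.h * (1.0 - self.h)
        self.W2 -= self.lr * np.outer(self.h, d_out)
        self.b2 -= self.lr * d_out
        self.W1 -= self.lr * np.outer(self.x, d_hid)
        self.b1 -= self.lr * d_hid
```

Nothing exotic here: `forward` computes the output for an input, and `backward` nudges the weights toward a supplied desired output. The whole trick of this post is in where that desired output comes from.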

My basic approach is to leave all the learning apparatus unchanged. An EBP learner needs two things to learn: sample input data, and desired output data to compare its answer against. The input data will naturally remain essentially unchanged; it will simply be the input I want to train on. I have experimented a little with using Restricted Boltzmann Machines (and other autoencoders) to preprocess inputs, but in my experience the time spent training the encoder, and then the network on the encoded input, rarely improves overall efficiency. But maybe I'm not trying the right kinds of problems. I have a sneaking suspicion that the pairing could be extremely useful, especially considering some observations we'll make in a little bit.

That leaves the desired output data. The way a neural network learns (in brief) is to apply successive adjustments to the weights in its layers, computed so as to reduce the error between the desired output and the computed output for the given input. Essentially, it builds a map from the input data onto the output data. To eliminate the need for known output data, and convert the algorithm to run on reinforcement, I'm going to construct the desired output from the reinforcement signal.

Here's how this is going to work. I'm going to give the network one output for each action the agent can take. The network's output for a given input will be a 'score' representing the value of taking the corresponding action in response to that input. I'll then pick the action based on that score (either by taking the maximum score directly, or by making the probability of selecting each action proportional to its score, just like in Murin). With this interpretation, I know that the 'desired output' simply has the correct action as its greatest value. All I need to do, then, is construct the desired output so that training makes the proper action more likely. For negative feedback, this means reducing the value of the selected action and perhaps increasing (a little bit) the values of the other choices, so they'll be more likely to be picked. For positive feedback, this means reinforcing the current relation, with the chosen action as most probable.
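
The scheme above can be sketched in two small functions. These are a hedged illustration, not a finished design: the function names, the `step` size, and the exact amount by which the other actions are nudged up on negative feedback are all placeholder choices of mine.

```python
import numpy as np

def select_action(scores, rng, proportional=False):
    """Pick an action index from the network's output scores."""
    scores = np.asarray(scores, dtype=float)
    if proportional:
        # Probability proportional to score (assumes non-negative
        # scores, e.g. sigmoid outputs), as in the Murin-style selection.
        p = scores / scores.sum()
        return int(rng.choice(len(scores), p=p))
    return int(np.argmax(scores))

def make_target(scores, chosen, reward, step=0.1):
    """Construct a 'desired output' from the reinforcement signal.

    Positive reward: push the chosen action's score up.
    Negative reward: push it down, and nudge the other actions
    up a little so they become more likely to be picked.
    """
    target = np.asarray(scores, dtype=float).copy()
    if reward > 0:
        target[chosen] = min(1.0, target[chosen] + step)
    else:
        target[chosen] = max(0.0, target[chosen] - step)
        others = np.arange(len(target)) != chosen
        target[others] = np.minimum(1.0, target[others] + step / 4)
    return target
```

The resulting `target` is then fed to the network's ordinary backprop step in place of known output data, so the learning apparatus itself stays untouched.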

This can be a little tricky, as the neural network's global response changes with every single update. We want to reinforce each correct answer strongly enough that it's hard for other training to dislodge it, but not so strongly that it disturbs that other training itself. Likewise, when training down for negative reinforcement, too strong an impulse will destabilize other cases. We must keep this in mind as we develop our imaginary desired-output generator.
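
One simple way to moderate the impulse (an assumption of mine about how to handle this, not a settled part of the design) is to blend the network's current output toward the constructed target rather than jumping all the way there, with a blending factor controlling the update strength:

```python
import numpy as np

def soften(current, hard_target, alpha=0.2):
    """Blend the computed output toward the constructed target.

    alpha controls the update strength: a small alpha makes each
    reinforcement step gentle, so training on one case is less
    likely to destabilize what the network has already learned
    for other inputs.
    """
    current = np.asarray(current, dtype=float)
    hard_target = np.asarray(hard_target, dtype=float)
    return (1.0 - alpha) * current + alpha * hard_target
```

Tuning `alpha` then trades convergence speed on the current case against stability of everything learned so far.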
