__Experiment 5: GRID__

The GRID experiment is one of my favorites, because it highlights the algorithm's ability to solve a novel problem with a very weak, goal-oriented critic, like the one we saw in the obstacle avoidance problem, but in a much more complicated situation. It is especially exciting for me because I did not know an effective solution to this problem at the outset- I applied the Murin agent specifically to get a first solution, rather than to compare the algorithm's performance against a pre-existing one. What's more, not only does it find a highly efficient policy for a new problem, but the problem domain is based on a real-world robotics problem I encountered while experimenting with Mordax control and tracking methods.

The domain is as follows: the agent acts on a world which is a grid. Rather than occupying cells, though, it moves along the edges connecting intersections. The basis of this domain is a line-following world where the agent follows lines to remain bounded. The idea was to localize the agent on the grid in order to simplify self-reported tracking. The robot did this using a simple permutation-group rule that let the limited action set dictate the robot's new location, relative to its previous one, after a given movement action. More concretely, the robot's registered 'location' was the intersection immediately behind it, labeled with Cartesian coordinates. The agent tracks its cardinal direction, and based on that orientation and the action it takes at an intersection, it can determine the coordinates of the new intersection once it completes the maneuver. The specifics of this transform set are discussed in the Mordax section.

The actions that the agent can take are very limited- at an intersection it may turn right or left, or turn about across the line to face the opposite direction. This is where the tricky part of the problem lies- we want to direct the agent to navigate from its current location on the grid to a specified coordinate, but it *cannot go straight through an intersection*. This makes the action space far more peculiar than a plain Cartesian navigation problem, like the Taxi problem. The agent must learn how to combine turns and turnabouts to efficiently reach its goal. Like solving the maze problem, this requires a very procedural approach, making use of the linking property.

What makes this particular experiment so special is the total lack of a known solution (on the part of the designer) when implementing the critic. This of course leads us to ask what kind of critic we could use. To get a handle on this, let's first look at the input and action spaces. Since we supply the agent with a destination, and the current location is effectively traceable locally, we're going to give the agent its current orientation (NESW) and the relative direction to the goal location (8 divisions: N, NW, E, SE, etc.):
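This two-feature input can be sketched in code. The function and feature names below are hypothetical (the text does not give the implementation), and the sector quantization is one plausible way to produce the 8-way relative direction:

```python
# Hypothetical sketch of the agent's input encoding. `encode_state`,
# `bearing_to_goal`, and the sector boundaries are assumptions; only the
# two features (orientation + 8-way bearing to goal) come from the text.
import math

DIRECTIONS8 = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def bearing_to_goal(loc, goal):
    """Quantize the angle from the agent's location to the goal into 8 sectors."""
    dx, dy = goal[0] - loc[0], goal[1] - loc[1]
    angle = math.degrees(math.atan2(dx, dy)) % 360   # 0 deg = North, clockwise
    return DIRECTIONS8[int((angle + 22.5) // 45) % 8]

def encode_state(loc, heading, goal):
    # The input is purely local: orientation plus quantized bearing.
    # No absolute coordinates are exposed to the agent.
    return (heading, bearing_to_goal(loc, goal))
```

Note that nothing in this encoding depends on how far away the goal is, a point that becomes important later when we discuss transfer to larger grids.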

At any given time, the agent has three actions available to it- move to the next intersection and take a left, move and take a right, or turn about the line 180 degrees to face the opposite direction, as shown below:

Note that each action puts the same intersection behind the agent (i.e., sets the same 'current' location), but with a different final orientation.
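Reading the description above, all three actions advance the registered location by one intersection in the direction the agent was facing, and differ only in final orientation. A minimal sketch of that assumed transition model follows; the exact transform set lives in the Mordax section, so treat this as illustrative rather than definitive:

```python
# Assumed transition model for the grid world (the authoritative transform
# set is in the Mordax section). Every action registers the next
# intersection ahead as the new 'current' location.
DIR_VEC = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
TURN = {
    "left":  {"N": "W", "W": "S", "S": "E", "E": "N"},
    "right": {"N": "E", "E": "S", "S": "W", "W": "N"},
    "about": {"N": "S", "S": "N", "E": "W", "W": "E"},
}

def step(loc, heading, action):
    """All three actions put the same intersection behind the agent;
    only the final orientation differs."""
    dx, dy = DIR_VEC[heading]
    return (loc[0] + dx, loc[1] + dy), TURN[action][heading]
```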

With these behaviors, then, we're going to try and design a critic. My first instinct here, as in most navigational tasks, is to minimize the distance to the goal. However, in this case, as in the maze problem, physical Euclidean or city-block distance is not a monotonically decreasing quantity as it would be in an open-grid world with regular Cartesian motion. Let's look at a simple example. We'll consider a subsection of the grid, and say we want the robot to traverse from location (0,0) to (0,2), with an initial orientation of northbound:

By examination, it is easy to see that a maximally efficient traversal can be accomplished with four actions. Consider a turn, followed by a rotation, another turn of the same type, and then a rotation (4 actions), as in the middle image. This maneuver takes us from our initial state to the goal. Another possible route with four actions (depicted on the right) takes a sequence of turns exclusively. The purpose of this exercise is to illustrate that the most efficient path (and in this case, any possible path) does not demonstrate a pattern of monotonic decrease in distance to the goal, in either city-block or Euclidean measures, as the agent progresses along the path.
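The non-monotonicity can be checked directly. The short script below walks the turns-only route under a transition model reconstructed from the description (every action advances one intersection and then turns- an assumption, not the Mordax implementation) and records the city-block distance at each step:

```python
# Worked check that distance to the goal is non-monotonic along a
# four-action path. The transition model is an assumption reconstructed
# from the text; the turns-only route matches the right-hand image.
DIR_VEC = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
TURN = {
    "left":  {"N": "W", "W": "S", "S": "E", "E": "N"},
    "right": {"N": "E", "E": "S", "S": "W", "W": "N"},
    "about": {"N": "S", "S": "N", "E": "W", "W": "E"},
}

def step(loc, heading, action):
    dx, dy = DIR_VEC[heading]
    return (loc[0] + dx, loc[1] + dy), TURN[action][heading]

def cityblock(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

loc, heading, goal = (0, 0), "N", (0, 2)
distances = [cityblock(loc, goal)]
for action in ("left", "right", "right", "right"):   # turns-only route
    loc, heading = step(loc, heading, action)
    distances.append(cityblock(loc, goal))
# distances rises and falls: [2, 1, 2, 1, 0]
```

The distance sequence goes down, up, down, down: the agent must physically move away from the goal mid-path even on an optimal route.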

This is naturally due to the fact that the state-space map does not match the physical map of the operating space. That is to say, if you were to make a graph of the state-space, it would not be a grid like the physical world. A basic deviation is that the state-space would include a node for each orientation the agent can have at each physical location; there would be 4 (0,0) nodes in the problem space, one each for N, S, E, and W.

Clearly, then, reduction of distance is not a reliable metric. Indeed, when we attempt to implement it, we do not observe learning at all. However, we make the observation that the state-changes, though not completely determined by the local shape of the grid, are constrained by it to a certain extent. Unlike in the maze problem, where distance minimization could continue until the agent was just one cell from the goal and yet potentially dozens of moves from it, the connectivity in this world means that if you are physically moving closer, you must, in general, be getting closer in the state-space, with the exception of a typically small amount of backtracking.

Using this observation, we implement a modified reduction-of-distance critic, in which there is a gap between positive and negative reinforcement, with intermediate changes in distance neither rewarded nor punished. In practice, I have found that the optimal structure gives reward for reductions in distance of more than 0.4 grid-width units, negative reinforcement for increases in distance of more than 0.3 units, and no reward between those limits. For the training case, we will set the agent in a 5x5 world, labeled on each axis from -2 to 2, and allow it to explore with this critic. Using this procedure, we observe very effective learning:
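The banded critic is simple enough to state in a few lines. The 0.4 and 0.3 thresholds are the values quoted above; the unit reinforcement magnitudes are an assumption for illustration:

```python
# Sketch of the banded distance critic. Thresholds (0.4 / 0.3 grid-widths)
# come from the text; the +/-1.0 magnitudes are assumed for illustration.
def critic(prev_dist, new_dist):
    """Reward only clear progress, punish only clear regress,
    and stay silent inside the dead band between the two thresholds."""
    if prev_dist - new_dist > 0.4:
        return 1.0
    if new_dist - prev_dist > 0.3:
        return -1.0
    return 0.0
```

The dead band is what tolerates the small amount of unavoidable backtracking: a one-edge detour changes city-block distance by exactly 1.0, so genuine moves are still judged, but any sub-threshold jitter in a fractional distance measure goes unpunished.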

Here, we have a rapid coalescence (the characteristic exponential curve) to an average of about 5 or 6 steps to reach the goal. As this is very near the average linear distance between two points in the grid space, we can infer that it is indeed very close to an optimal policy (a Monte Carlo process finds an average best performance of 5.4 steps to goal).
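One way to obtain such an optimal baseline is exhaustive search: under the same assumed transition model used earlier (every action advances one intersection, then turns), a breadth-first search over (location, orientation) states yields exact shortest paths, which can then be averaged over sampled start/goal pairs:

```python
# Brute-force optimal baseline: BFS over (location, orientation) states.
# The transition model is the same assumption as in the earlier sketches.
from collections import deque

DIR_VEC = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
TURN = {
    "left":  {"N": "W", "W": "S", "S": "E", "E": "N"},
    "right": {"N": "E", "E": "S", "S": "W", "W": "N"},
    "about": {"N": "S", "S": "N", "E": "W", "W": "E"},
}

def step(loc, heading, action):
    dx, dy = DIR_VEC[heading]
    return (loc[0] + dx, loc[1] + dy), TURN[action][heading]

def shortest_path_len(start_loc, start_heading, goal):
    """Minimum number of actions to register the goal intersection."""
    frontier = deque([(start_loc, start_heading, 0)])
    seen = {(start_loc, start_heading)}
    while frontier:
        loc, heading, n = frontier.popleft()
        if loc == goal:
            return n
        for action in ("left", "right", "about"):
            nxt_loc, nxt_heading = step(loc, heading, action)
            if (nxt_loc, nxt_heading) not in seen:
                seen.add((nxt_loc, nxt_heading))
                frontier.append((nxt_loc, nxt_heading, n + 1))
```

For the worked example above, this search confirms that four actions is in fact optimal for (0,0) northbound to (0,2).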

This is of course an exciting case because we used a metric which is, by all rights, very weak and not directly related to the solution space, to teach a very effective policy quickly. However, we have some additional points of interest to look at. First, this is another case in which the QLearning agent (taking the same input) simply fails to learn a functional policy. If the learning rate is made very small, it can make some temporary gains, but invariably diverges eventually, typically 'learning' behaviors which trap it in a loop. What's more, if the Murin agent has a memory depth of less than 2, it also fails to learn.

These observations, together, indicate a dependence on the ability to chain actions together: the failure of the QL agent, in particular the failure by way of settling into a loop (which occurs for any memory depth assigned to it), indicates that non-sequential learning can only account for a certain portion of the state-space path necessary for resolution of this problem. The failure of the Murin agent with a shallow memory, then, suggests that not only are the links between behaviors necessary, but that the historical update, through which we seek to reward behaviors which have positive effects at later time steps, is also a necessity.

The second point of additional interest in this experiment is due to the observation that the information we supply to the agent as input- the relative direction to the goal and the agent's orientation- is locally computed, independent of the actual distance between the agent's starting position and the goal, and therefore the learning in this case can be applied to a world of much broader size than the one in which the testing was done. This is notably a separate advantage from the 'knowledge transference' of the Maze experiment- here, we are noting that training in a small subsection of the world applies to the whole world:

In the graph above, the agent initially trains on a 5x5 map, and then at time = 150 epochs, is transferred to a 40x40 map, on which the approximate optimal average is around 28 steps to goal. As we can clearly see, the immediate performance of the agent is at this near-optimal level. Additionally, I remark that this is the same average performance of an agent trained on a 40x40 grid from the start (though reaching this level does take a bit longer that way). This kind of adaptation is not well defined in a 'classical' QL agent, which would have each location on the grid as a state.

To cap off our discussion of this experiment, let's look at the three main things we observed. We saw that the agent can learn to solve the Grid navigation problem highly efficiently, even though we did not have a solution from which we could teach it- it learned to do so from a very weak approximation of improvement, and did so even when a QL agent cannot properly learn a policy. Essentially, we used the Murin agent to develop a policy for a novel problem (at least, novel with respect to the designer). We also observed, from the particular failure mode the QL agent adopted, and from the memory-parameter-dependent nature of the Murin agent's learning, that this problem requires the historical linking property, and thereby validated its inclusion in the algorithm. Finally, we demonstrated that because the learning was local in nature, and the input locally calculable, the training performed on a small section of the grid could be satisfactorily applied to the entire grid, verifying that the algorithm is not merely learning to correlate the input states with individual cells in the grid and how to navigate from them.
