## Dijkstra-Based Planning Algorithm (Ongoing)

I have recently learned that the modification of Dijkstra's algorithm I present here is in use in networking protocols, which is pretty interesting. That makes the technique described below a synthesis of networking theory and machine learning; parallels between cortical networks abound, I'm sure!

I've since extended the planning and deduction concepts from this work further, in this post, and applied another modification to solve time-varying versions of planning problems, as discussed here.

I recently happened to be reading a book on combinatorial algorithms which included a description of Dijkstra's algorithm. I first encountered it years ago in undergrad, quite likely in a computer science course (but maybe in graph theory), but I hadn't really done anything interesting with it. The central notion is that you have a graph whose edges are labeled with the 'distance' between the two nodes they connect, and the algorithm finds the path of minimal distance between two given nodes. It does this by examining all edges that connect unmarked nodes to the currently selected path (initially containing only the starting node), computing the net distances to those nodes, and adding to the path the edge whose addition creates the shortest such path (the min(u, u+v) relaxation), together with the node that edge connects. Probably the best intuitive explanation I've ever heard for Dijkstra's algorithm is to imagine a bunch of ants starting from the initial node, all walking outwards at a constant rate. As an ant arrives at a node for the first time, you mark that node and add it, along with the edge the ant used to get there, to the path:
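To make that procedure concrete, here is a minimal sketch of the classic algorithm in Python; the graph encoding and function names are my own choices for illustration, not from any particular library:

```python
import heapq

def dijkstra(graph, start, goal):
    """Shortest path on a graph given as {node: [(neighbor, weight), ...]}.
    Assumes the goal is reachable from the start."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]   # frontier of (distance-so-far, node)
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)      # the 'ant' has arrived: this distance is final
        if u == goal:
            break
        for v, w in graph.get(u, []):
            nd = d + w      # the min(u, u+v) relaxation step
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    # Reconstruct the path by walking predecessors back from the goal
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[goal]

graph = {'a': [('b', 1), ('c', 4)], 'b': [('c', 2)], 'c': []}
print(dijkstra(graph, 'a', 'c'))  # (['a', 'b', 'c'], 3.0)
```

Note the two-edge route through 'b' beats the direct edge of weight 4, exactly the relaxation behavior described above.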

This most recent exposure gave me the idea of incorporating it into a learning context. In working through an example problem to get a better feel for the execution of the algorithm (which is truly brilliant, well worth learning just for the sake of knowing it), I was reminded of the similar action of STRIPS planning, a planning paradigm for robotics.

To illustrate: STRIPS is a description of the world of the agent in which there are places, objects, and actions (essentially; there is some nuance, but this gets the idea across). Actions manipulate the states of objects and the robot, allowing it to change the state of the world and achieve a goal. Planning is typically accomplished by a depth-first search for the goal state, using the actions to establish transitions. It can be seen that if the world were represented as a graph, with transitions represented by edges, Dijkstra's algorithm could be used to determine the plan with the least number of steps. It is probably important to mention that such a graph will be of the directed sort, since there is no guarantee that any particular action is reversible.

The first thing that occurred to me was that the world model on which the planning occurs does not necessarily need to be designed by a human. We can set up an array and let the agent learn the transitions probabilistically, by experimentation.

Let's start by imagining the problem space, as represented in the STRIPS model, as a directed graph (the world graph, W, we'll call it). Each possible state of the world is an individual node, and each action that can be taken from a state is represented by an edge from the current state's node to the node for the state resulting from that action. Note that some actions from certain states will leave the agent in its current state. For instance, consider a robot which can drop an object that it is carrying. If it is currently carrying no objects and attempts the 'drop item' action, it will simply remain in the same state, as it has nothing to drop. This is notable because actions which do not change the state of the world lead to sparsity of the world graph; in practice we will simply omit such self-loop edges from the world graph and not process them, which improves the run time of the planning stage.

As a simple example, let's look at a robot which can travel between two places (A and B), pick up an object at one and drop it at the other. Let's say the state of the world is represented by three things: the location of the robot, whether it has the item, and whether there is an item at the current location. The transition graph for this world will look like this:

On the left, we've got a loose, very linguistic representation of this scenario. On the right is a more symbolic, but completely equivalent, representation of the same graph, to illustrate the point of symbolic representation of the states.

So then, to represent this graph, say we have S possible states, and can take A possible actions (so far, it's starting to sound like Q-learning; it's not that different, I'll admit). Then we can craft an <SxAxS> array (call it T), in which the (i,a,j) slot corresponds to the link between states i and j via action a. That is to say, if there exists an edge on W from node i to node j, then there exists an a such that T[i,a,j] = 1. From a given state i, the <1xAxS> slice of T given by T[i,:,:] represents the set of outward-directed edges from node i; each 1 in the slice represents an edge from node i to some node j. The graphic above illustrates this T array, and the mapping from the i-slice onto a portion of the graph, for a hypothetical 5-action, 4-state system. It is simple to see that Dijkstra's algorithm can be used to search this representation and find a path which will lead to the goal state.
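As a minimal sketch of this encoding, here is a T array for the hypothetical 5-action, 4-state system; the specific transitions are made up for illustration:

```python
import numpy as np

S, A = 4, 5  # a hypothetical 4-state, 5-action system, as in the graphic
T = np.zeros((S, A, S), dtype=int)

# Record some known edges of W: T[i, a, j] = 1 means action a takes state i to j
T[0, 1, 2] = 1
T[2, 3, 1] = 1
T[1, 0, 3] = 1

# The <1xAxS> slice for state 0: each 1 marks an outward-directed edge
print(np.argwhere(T[0] == 1))  # [[1 2]] -> action 1 leads from state 0 to state 2
```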

At this point, really all we've done is translate the problem space from a postulate-based linguistic formulation into a graphical model. However, remember that our goal was to adapt the model to a learning system, and now we have the means to do so: the world model is now embedded in the T array, rather than in a predicate language. Consequently, the agent can learn the transitions via experimentation. As the agent moves through the world, it takes actions from various states. So in state i, it may take action a and end up in state j. It can then record a '1' in the corresponding (i,a,j) entry of T.

In this model there may be multiple edges between any pair of nodes, but each action will link one state to exactly one other. This is due to the deterministic nature of the world model (remember, we are working backwards from STRIPS). In highly controlled environments, this may be a fair assumption. However, we can imagine a circumstance in which the agent finds itself in a state from which a given action may have two or more potential consequences. Robots in the real world rarely have environments as clean or behaviors as reliable as the STRIPS model presumes. The net effect of this uncertainty is that the agent's world model will contain an edge which essentially bifurcates, leading to multiple states. Such a structure seems to imply the existence of another, hidden, node.

The problem with this hypothetical 'hidden node' is that the state space of the world model was constructed deliberately to span the robot's ability to perceive its environment. This means that there is an additional decision being made about which result the action causes, and that decision is beyond the knowledge of the agent (the potential use of this mechanism to attempt to grow the perception space automatically is an entirely different project, but noted here). The intuitive approach to correcting this shortfall is to record, instead of simply the existence of a transition, the probability of that transition between the two given states given the action taken. That is to say, T[i,a,j] = P(i->j | a,s=i).
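One simple way to obtain such probabilities is to tally observed outcomes and normalize per (state, action) pair; this counting scheme is my own choice for illustration, not something prescribed by the model:

```python
import numpy as np

S, A = 4, 5
counts = np.zeros((S, A, S))  # counts[i, a, j]: times action a took state i to j

def record(i, a, j):
    """Tally one observed transition i -> j under action a."""
    counts[i, a, j] += 1

def estimate_T():
    """T[i, a, j] = P(i -> j | a, s = i), estimated from the tallies."""
    totals = counts.sum(axis=2, keepdims=True)
    # Safe division: rows with no observations stay at probability 0
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Suppose action 1 from state 0 landed in state 2 three times and state 3 once
for j in (2, 2, 2, 3):
    record(0, 1, j)
T = estimate_T()
print(T[0, 1, 2], T[0, 1, 3])  # 0.75 0.25
```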

Now, using this updated world model, we have a graph in which each edge represents the probability of that transition under a given action, and we want to find the path which is most likely to take us from the current state to the goal. Towards this end, we could apply a cost function to each edge so that the new 'distance' is inversely proportional to the probability associated with the edge, and compute the path. When we try to make up such a function, though, we quickly run into problems with global consistency. For instance, consider a section of the graph in which two nodes are connected by two paths: one consists of two edges each with probability 0.8, the other of a single edge with probability 0.6. If we make the cost function the reciprocal of the probability, then the first path has a total distance of 1.25+1.25, or 2.5. The single-edge path, on the other hand, has a score of 1.67, so it wins. But we know this is wrong, because the net probability of the two-edge section is 0.8*0.8 = 0.64, which is greater than 0.6.

This is only one possible cost function, but it illustrates the problem well. It basically comes down to the fact that costs are combined additively, while probabilities are combined multiplicatively. It's also awkward that we want to maximize the probability, whereas Dijkstra's algorithm minimizes the score of the path (which makes the log function, the natural choice to convert a product to a sum, unattractive here, though minimizing a sum of negative logs would also be a workable route). To this end, let's make a couple of modifications and work directly with the probabilities. Now, when we compute the 'score' for including a given edge, we'll take the product of that edge's probability with the net probability of the prior node: P(x->w) = P(x->y)*P(y->w). And then, when we choose which edge to add to the path, we'll select the one with the greatest net probability out of all candidates.

These changes seem, intuitively, sensible. However, this invites comparison to the longest path problem, in which you want to find the path of greatest distance between two nodes; to solve that problem, you cannot just change the 'min(u,u+v)' term in Dijkstra's algorithm to a 'max'. The core issue is that if you have a loop in the graph, an initially added 'maximal' path may no longer be maximal when the entire loop is considered. What's worse, the maximal path on a loop between two nodes may be disjoint from the maximal paths between other nodes on that loop.

This certainly seems problematic, but there is a crucial piece of information to consider. Because we are working with probabilities, we combine by multiplication, and all probabilities are less than or equal to one. As a result, adding an edge to a path can never increase its net probability; the net probability along any path is monotonically non-increasing:

For a path x->y, we have P(x->y). Extending it by an edge y->w gives

x->w = x->y + y->w, and

P(x->w) = P(x->y)*P(y->w) <= min( P(x->y), P(y->w) )

Consequently, when we search the graph for maxima using this technique, the principle on which Dijkstra's algorithm is based, that any subpath of a shortest path is itself a shortest path, carries over to the related statement: any subpath of a maximal-probability path is itself a maximal-probability path. In essence, the same bounding behavior applies either way; we can look for the maximum of a non-increasing sequence just as well as the minimum of a non-decreasing one.
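Given that observation, the modification amounts to swapping sums for products and min for max in the standard algorithm. A minimal sketch (a max-heap is simulated by negating probabilities; the names and graph encoding are my own):

```python
import heapq

def most_likely_path(probs, start, goal):
    """Max-probability path on {node: [(neighbor, p), ...]}, with 0 < p <= 1.
    Same structure as Dijkstra, but scores combine multiplicatively and we
    expand the highest-probability frontier node first."""
    best = {start: 1.0}
    prev = {}
    heap = [(-1.0, start)]  # negate so heapq's min-heap acts as a max-heap
    done = set()
    while heap:
        negp, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == goal:
            break
        for v, p in probs.get(u, []):
            cand = -negp * p  # P(start->v) = P(start->u) * P(u->v)
            if cand > best.get(v, 0.0):
                best[v] = cand
                prev[v] = u
                heapq.heappush(heap, (-cand, v))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), best[goal]

# The example from above: two 0.8 edges beat a single 0.6 edge
probs = {'x': [('y', 0.8), ('z', 0.6)], 'y': [('z', 0.8)], 'z': []}
print(most_likely_path(probs, 'x', 'z'))  # path ['x', 'y', 'z'], probability ~0.64
```

The monotonicity argument is exactly what justifies the `done` set here: once a node is popped with the highest frontier probability, no later extension can improve on it.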

Now that we can compute the most likely path from our probabilistic world model, there's another change we can make to reduce the amount of work we need to do. Consider that between any two nodes i and j there may be a number of edges (actions) with various transition probabilities, and suppose the computed path includes the i to j transition. Then we know that of all actions that can take i to j, the path will use the single action with the highest transition probability. We are sure of this because any path containing an i->j edge with a non-maximal P(i->j | a,i) can be made more likely, without affecting any other part of the path, by swapping in the action that achieves max{P(i->j)}. So before we build our path, we can project the <SxAxS> shaped T array onto an <SxS> array (call it C) which contains the most probable transition value for each (i,j) pair, plus an additional <SxS> array containing the actions which correspond to those maximal probabilities. Using this new, compressed array allows us to skip over redundant actions which need not be considered. Additionally, if the goal state is different from the current state, we can ignore edges from a node to itself, as these only matter in terms of the proxy effect on other consequences of that action.
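The projection onto C is just a pair of reductions over the action axis; a sketch, using a random stand-in T for illustration:

```python
import numpy as np

S, A = 3, 4
rng = np.random.default_rng(0)
T = rng.random((S, A, S))  # stand-in probabilistic world model, T[i, a, j]

C = T.max(axis=1)               # C[i, j]: most probable transition i -> j
best_action = T.argmax(axis=1)  # best_action[i, j]: the action achieving it

np.fill_diagonal(C, 0.0)  # ignore self-loops when the goal differs from the start

# Spot-check: the recorded action really is the most probable one for (0, 2)
i, j = 0, 2
assert T[i, best_action[i, j], j] == T[i, :, j].max()
```

Planning then runs entirely on C, and the best_action table recovers the concrete action for each edge of the computed path.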

Using this modified version of Dijkstra's algorithm, we can experimentally build world models as a basis from which maximally likely paths can be computed. The final procedure, then, is: first compress the T array to C. Then use Dijkstra's algorithm to find the greatest-likelihood path between the current node and the goal, given the known C array. Take the action corresponding to the first edge of that optimal path. Update T[i,a,j] based on the new state, and repeat until the current state equals the goal state. Alternatively, if circumstances allow, the world model can be learned in a random-walk exploration stage prior to any planning; the method works online or offline. Naturally, deliberate exploration builds the world model faster, but online learning will tend to solve problems faster.

Another potential application might be for a human to fill in some apparently deterministic aspects manually, and then let the agent tune this model as it acts. That is to say, training can be performed in simulation, with the T matrix then updated online as real-world data becomes available. A simpler version could assume uniform probabilities in the world, 'explore' actively by computing paths based on that naive T map, and correct it iteratively as acting reveals the limitations of the model.

Though the theoretical foundation is quite straightforward, an exploration of this model does seem to be in order. The first test is a Cartesian navigation problem, in which movement in the cardinal directions is permitted. This task comes first because it is simple to implement, mostly serving to check that the actual code, as written, works. The agent is allowed to random walk for a short period of time to study the world, and then attempts to solve the traversal problem. It successfully identifies the shortest path on a Cartesian grid after a short random-walk exploration period, long enough to cover the grid once.
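A sketch of that first test's setup; the grid size, seed, and walk length are arbitrary choices of mine, and the self-loop truncation from earlier is applied during recording:

```python
import numpy as np

# 4x4 grid; actions 0=N, 1=S, 2=E, 3=W; states are flattened (row, col) cells
W = 4
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}

def step(s, a):
    """Deterministic world: move one cell, clamped at the grid boundary."""
    r, c = divmod(s, W)
    dr, dc = MOVES[a]
    r2 = min(max(r + dr, 0), W - 1)
    c2 = min(max(c + dc, 0), W - 1)
    return r2 * W + c2

S, A = W * W, 4
T = np.zeros((S, A, S), dtype=int)
rng = np.random.default_rng(0)
s = 0
for _ in range(2000):  # random-walk exploration phase
    a = int(rng.integers(A))
    s2 = step(s, a)
    if s2 != s:            # truncate self-loop edges, as described above
        T[s, a, s2] = 1
    s = s2

print(int(T.sum()))  # discovered directed edges; a 4x4 grid has at most 48
```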

For a second, more sophisticated test, I have implemented a STRIPS-style simulated world for the robot to act in. The world consists of a set of locations, partitioned by a 'door' type location into two rooms. The agent starts in the first room; in the second there is a 'vending machine' type object, from which the agent must fetch a soda before returning to its starting location. Doors and vending machines have an interaction option: opening or closing the door, and receiving a soda from the machine. The door must be in its 'open' state in order to be traversed. Each location can have up to 4 links to other locations, and thus the robot has 5 possible actions: 4 movement directions and an 'interact' action. Graphically, we illustrate this little world like so:

This little world is pulled from one of my graduate courses, in which STRIPS was studied in the context of robot planning algorithms. I like it because it's a neat toy problem, which lends itself to simulation, and also to making 90s-style icons in Paint, as per the above graphic. The idea is that the robot has to navigate from the section on the right to the door, open it, travel to the room on the left, get a soda from the vending machine, and return to its starting location. The state of the door and the possession of the soda are both embedded in the world state.

The agent has an exploration phase during which it acts randomly to learn the transitions, as described above; once learning is complete, it uses the modified path-finding algorithm to generate a solution action set. As expected, it is easily able to find an optimal solution path (i.e. one which contains no unnecessary steps). As mentioned above, it does recompute after each action, though due to the discrete nature of this problem, that is technically unnecessary in this domain. In more dynamic systems, however, recomputation would be an essential path-correction element of the algorithm, as the probabilistic nature of action outcomes could displace the agent off the calculated path, necessitating a new one. In this sense the procedure is fairly robust to errors, as introducing closed-loop feedback is very simple.

Essentially, what this algorithm does is take a state space and transform it into an isomorphic graph representing the world. In that new graph space, path-finding is analogous to deduction in the state space. As a result, we end up with an efficient deduction and planning method which is amenable to probabilistic models, discrete models, and simulation-based analysis. To elaborate on that last point, consider this: in Dijkstra's algorithm, we consider the entire 'neighborhood' of the currently examined optimal subgraph (the nodes and edges in the shortest-path tree it generates). If we have some kind of simulation or model of the world, we can acquire, from any state, the neighborhood of the tree by computing the potential results of actions taken throughout the explored space. Such data can then be inserted into a 'growing' T array and used to build up the necessary graph as the deduction is being made. An advantage of this approach is that, supposing the agent can feed sensed data into the 'simulation', it can use a pre-existing model to build a new T matrix with every path adjustment, taking new data into account.

There are several continuations I want to apply to this approach. The first, which I am currently engaged in, is formalizing an approach for applying Dijkstra's algorithm to time-varying data, to allow for the creation of paths which account for time-dependent changes in path lengths or probabilities. I also want to explore the use of autoencoders and/or Restricted Boltzmann Machines for state-space abstraction, using the map of probabilistic transitions in the T matrix as the criterion for compression. I've also considered implementing the graph representation as a neural network, but honestly consider that a more independent project, focusing not on automated planning but on functional aspects of the world model, though doing so might eventually bridge the gap to a less-discretized implementation.
