__Experiment 8:__

__Tic Tac Toe player__

This one is a little different from the others. The title is pretty straightforward: in this experiment, the agent learns how to play Tic Tac Toe (3T). There are two tricky parts which (aside from the fact that playing a game against an AI is neat) make this experiment interesting. First, the driver of the learning process is competition: rather than try to generate a metric for what constitutes a good move, the agent gets rewarded when it wins (strong positive feedback) or when it ties a game (weak positive feedback, since ties are better than losing but not as good as winning), and punished when it loses (strong negative feedback). The second point of interest is the use of injected noise to improve training. We'll get to how this works in a bit, but in short, we use a corrupted version of an optimal agent to improve the learning that the agent undergoes.
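The reward structure described above can be sketched as a simple mapping. The specific numeric values here are assumptions for illustration, not the experiment's actual constants; only their ordering (win > tie > loss, with tie still positive) reflects the text.

```python
# Hypothetical reward values -- the magnitudes are assumptions;
# only the ordering (win > tie > 0 > loss) comes from the description.
REWARDS = {
    "win":  1.0,    # strong positive feedback
    "tie":  0.25,   # weak positive feedback: better than losing, worse than winning
    "loss": -1.0,   # strong negative feedback
}

def reward_for(outcome):
    """Map a game outcome to the reward signal fed back to the agent."""
    return REWARDS[outcome]
```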

To start, we set up a simulation of the game. Tic Tac Toe is simple: a 3x3 grid, with each cell holding a -1, 0, or 1 to represent player 1, an empty cell, or player 2, respectively. From that, we can make a simple check which sums along each row, each column, and both diagonals. If any of these sums is -3 or 3, the game has been won by the player whose sign matches the winning sum. Otherwise, once the board is full, every cell is -1 or 1, so each row and column must hold either two 1s and one -1 or vice versa, making every sum 1 or -1; if the board is full and no sum reaches ±3, the game is over and tied.
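The check described above can be written in a few lines. This is a sketch of the idea rather than the experiment's actual code; the function name is mine.

```python
import numpy as np

def check_board(board):
    """Return -1 or 1 if that player has won, 0 for a tie,
    or None if the game is still in progress.
    board: 3x3 grid of -1 (player 1), 0 (empty), 1 (player 2)."""
    b = np.asarray(board)
    # Row sums, column sums, and both diagonal sums.
    sums = list(b.sum(axis=1)) + list(b.sum(axis=0))
    sums.append(b.trace())               # main diagonal
    sums.append(np.fliplr(b).trace())    # anti-diagonal
    for s in sums:
        if s == 3:
            return 1
        if s == -3:
            return -1
    if not (b == 0).any():  # board full with no three-in-a-row: tie
        return 0
    return None             # empty cells remain; keep playing
```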

Agents playing can make one of 9 moves, each of which places a mark in one of the 9 cells. Of course, over time, moves become illegal as their cells fill. Making an illegal move nets the agent a small punishment, and it must re-pick. The game itself, then, has 3^9 total states (19,683, just under 20,000), though since most states collapse onto one another via rotation, the effective size of the space is dramatically reduced. We'll start by supplying all possible states individually, and then decrease the space size. This is an interesting procedure because we know that QL can theoretically learn the game from all states, since Tic Tac Toe is completely soluble: it is possible to determine optimal play for all board states, and the next optimal move is determined exclusively by the current board state.
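The state encoding and the collapse via symmetry can be sketched as follows. The helper names are mine, and folding in reflections alongside rotations is one reasonable choice for canonicalization, not necessarily what the experiment does.

```python
import numpy as np

def state_index(board):
    """Encode a 3x3 board of {-1, 0, 1} as a base-3 integer in [0, 3^9)."""
    digits = np.asarray(board).flatten() + 1  # map {-1, 0, 1} -> {0, 1, 2}
    return int(sum(int(d) * 3**i for i, d in enumerate(digits)))

def canonical_index(board):
    """Smallest index over the 4 rotations (and their mirror images),
    so symmetric boards share a single table entry."""
    b = np.asarray(board)
    variants = []
    for k in range(4):
        r = np.rot90(b, k)
        variants.append(state_index(r))
        variants.append(state_index(np.fliplr(r)))
    return min(variants)
```

With this, a tabular learner can key its value table on `canonical_index` instead of `state_index`, shrinking the number of distinct entries it must visit during training.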

Given that it is solvable, this is sort of an oddball problem for Murin. There are two things going on here. First, I wanted to make something that a human could interact with in a meaningful way. That's mostly irrelevant for this report. Second, I wanted to experiment with using reduced state spaces with the Murin learner to improve training performance and speed. For instance, even though the QL agent can learn the full state-space version, what kinds of compressed spaces could be used in lieu of the full 3^9 space with Murin that might be inaccessible to QL? Such spaces might also allow for more rapid training, if functionally viable.

For the continuance of this experiment.