I am working through Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto and porting the Lisp examples to Haskell, mainly to learn RL but also to hone my functional programming skills. I chose the Lisp examples over the C examples because I am coding in Haskell. Not that Lisp and Haskell are similar; far from it. But I find it helpful to work from a language that is not procedural.
The Lisp code I referred to for this example is this.
Apart from the Haskell code, everything I explain is based on the references cited at the end.
What is reinforcement learning?
State Monad
I started with a proper Haskell state-management workflow but soon abandoned it in favour of mutable IOArrays, even though global mutable state is virtually non-existent in idiomatic Haskell code. The abandoned code just manages the board size and is unused now; the board is small and of fixed size.
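A minimal sketch of what that State-based board-size code can look like; the record and field names here are my assumption, not the exact code:

```haskell
import Control.Monad.State (State, evalState, gets)

-- Hypothetical State-carried board configuration, since abandoned.
data BoardConfig = BoardConfig { boardRows :: Int, boardColumns :: Int }

-- Total number of cells, read from the configuration held in State.
boardSize :: State BoardConfig Int
boardSize = gets (\c -> boardRows c * boardColumns c)

-- evalState boardSize (BoardConfig 3 3) == 9
```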
State of the board
Haskell Gloss
I initially used the UI toolkit Gloss to display a grid, hoping that I could eventually debug the program visually. That hope turned out to be misplaced because I misjudged how many times an algorithm like this has to be trained, so this code is not useful for debugging.
Value table
The board state when Player X has moved to position 1 and Player O has learned to move to position 1 is represented as [1] and [1]. That is, the two lists, which hold the positions played by players X and O respectively, each contain 1. In the value table this state is stored as a set of higher-order bits and a set of lower-order bits, as shown in the image; the value of that bit pattern is 513.
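As a rough sketch of that encoding, assume the nine higher-order bits hold X's positions and the nine lower-order bits hold O's (which half belongs to which player is my assumption; only the resulting value 513 comes from the example above):

```haskell
import Data.Bits (setBit)

-- Encode a board state as an 18-bit index: X's positions (1..9) set
-- bits 9..17, O's positions set bits 0..8.
boardIndex :: [Int] -> [Int] -> Int
boardIndex xs os = foldl setBit 0 ([p + 8 | p <- xs] ++ [p - 1 | p <- os])

-- boardIndex [1] [1] == 513   -- bit 9 (512) for X plus bit 0 (1) for O
```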
How do we know that a Player has won?
We just store all the winning combinations and check against them, because there are only a few board positions.
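A sketch of that check with the usual 1..9 numbering of the cells (the function names here are mine, not necessarily those in the repository):

```haskell
import Data.List ((\\))

-- The eight winning lines of a 3x3 board, cells numbered 1..9 row by row.
winningCombinations :: [[Int]]
winningCombinations =
  [ [1,2,3], [4,5,6], [7,8,9]   -- rows
  , [1,4,7], [2,5,8], [3,6,9]   -- columns
  , [1,5,9], [3,5,7]            -- diagonals
  ]

-- A player has won when some winning line is fully contained in the
-- list of positions that player has played.
hasWon :: [Int] -> Bool
hasWon played = any (\line -> null (line \\ played)) winningCombinations
```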
The ReaderT Monad transformer for reading and writing to arrays.
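Roughly along these lines: the environment carries the mutable value table in an IOArray and the ReaderT transformer threads it through the IO code. The Env record, the table bounds and the 0.5 initialisation are assumptions of this sketch.

```haskell
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Reader (ReaderT, ask, runReaderT)
import Data.Array.IO (IOArray, newArray, readArray, writeArray)

-- Hypothetical environment: the value table, indexed by the encoded board state.
newtype Env = Env { valueTable :: IOArray Int Double }

type App a = ReaderT Env IO a

readValue :: Int -> App Double
readValue i = ask >>= \(Env table) -> liftIO (readArray table i)

writeValue :: Int -> Double -> App ()
writeValue i v = ask >>= \(Env table) -> liftIO (writeArray table i v)

-- Run an action against a freshly allocated table.
withFreshTable :: App a -> IO a
withFreshTable action = do
  table <- newArray (0, 2 ^ (18 :: Int) - 1) 0.5   -- one slot per 18-bit state pattern
  runReaderT action (Env table)
```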
The representation of a Player
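Presumably a plain sum type, something like:

```haskell
-- The two players; Eq lets us compare moves and Show helps with debugging.
data Player = X | O
  deriving (Eq, Show)
```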
Calculate the next state in the board.
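With the two-list representation described earlier, the next state can be computed by adding the chosen position to the list of the player who moves. A sketch, reusing the Player type above:

```haskell
-- The board is a pair of position lists: X's moves and O's moves.
type Board = ([Int], [Int])

-- Hypothetical next-state function: record the chosen position for the mover.
nextState :: Player -> Int -> Board -> Board
nextState X pos (xs, os) = (pos : xs, os)
nextState O pos (xs, os) = (xs, pos : os)
```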
Get a list of empty positions in the board.
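Something like this, filtering out everything either player has already played:

```haskell
-- Positions 1..9 that neither player has occupied yet.
emptyPositions :: ([Int], [Int]) -> [Int]
emptyPositions (xs, os) = [p | p <- [1 .. 9], p `notElem` xs, p `notElem` os]
```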
Select an empty position randomly
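The exploratory move just picks one of those empty positions at random, for instance with randomRIO (only an IO-based sketch, not necessarily how the repository threads randomness):

```haskell
import System.Random (randomRIO)

-- Pick a random element from a non-empty list of candidate positions.
randomPosition :: [Int] -> IO Int
randomPosition ps = do
  i <- randomRIO (0, length ps - 1)
  pure (ps !! i)
```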
Greedy move
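The greedy move evaluates every empty position, looks up the value of the resulting state in the value table, and plays the one with the highest value. A sketch that ties together the hypothetical pieces above (boardIndex, nextState, emptyPositions, readValue and the App type):

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Choose the empty position whose resulting board state has the highest
-- value in the value table. Assumes at least one empty position exists.
greedyMove :: Player -> Board -> App Int
greedyMove player board = do
  scored <- mapM score (emptyPositions board)
  pure (fst (maximumBy (comparing snd) scored))
  where
    score pos = do
      let (xs, os) = nextState player pos board
      value <- readValue (boardIndex xs os)
      pure (pos, value)
```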
Abandoning the functional approach with this function
This is basically the original Lisp converted line by line to Haskell. The Haskell programmers I consulted dissuaded me from doing this, but at this point my Haskell knowledge does not measure up to the task of writing it more idiomatically.
Equivalent of ‘run’ function in the example
Equivalent of ‘runs’ function in the example
Main function
Number of times Player O wins over X
I used to think that the reinforcement learning agent should become unbeatable. But my code makes Player O learn to win only roughly 50% to 60% of the time, and I have only played 100 games, 30 times over. I think Player O's winning rate should be higher; it is probably lower because there is a bug somewhere in my code. I will continue to test it, and the latest code will be in my repository.
"Played 100 times 45.0 0.45"
"Played 100 times 53.0 0.53"
"Played 100 times 51.0 0.51"
"Played 100 times 52.0 0.52"
"Played 100 times 44.0 0.44"
"Played 100 times 46.0 0.46"
"Played 100 times 52.0 0.52"
"Played 100 times 51.0 0.51"
"Played 100 times 38.0 0.38"
"Played 100 times 39.0 0.39"
"Played 100 times 39.0 0.39"
"Played 100 times 43.0 0.43"
"Played 100 times 48.0 0.48"
"Played 100 times 49.0 0.49"
"Played 100 times 52.0 0.52"
"Played 100 times 56.0 0.56"
"Played 100 times 50.0 0.5"
"Played 100 times 43.0 0.43"
"Played 100 times 50.0 0.5"
"Played 100 times 51.0 0.51"
"Played 100 times 64.0 0.64"
"Played 100 times 50.0 0.5"
"Played 100 times 61.0 0.61"
"Played 100 times 58.0 0.58"
"Played 100 times 46.0 0.46"
"Played 100 times 60.0 0.6"
"Played 100 times 55.0 0.55"
"Played 100 times 51.0 0.51"
"Played 100 times 51.0 0.51"
"Played 100 times 56.0 0.56"