I started off my summer project with some selected readings from
Reinforcement Learning: An Introduction (Sutton and Barto).
In particular, I read the introductory chapters 1 and 2, some sections from chapters 3 and 4, and chapter 6 on temporal-difference
learning, which is closest to what we will be working on during the summer.
For the project we will be focusing on gridworlds and need an environment to run our tests, so I am writing code
for the example given in the book: Section 6.5,
Example 6.6, cliff walking in a gridworld. The idea of this example is that the agent traverses a gridworld
that gives it rewards. The start state and goal state are separated by a 'cliff', a row of states that each give a large
negative reward and send the agent back to the start.
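To make the setup concrete, here is a minimal sketch of the cliff-walking dynamics, assuming the layout from the book's Example 6.6 (a 4x12 grid, start at the bottom-left, goal at the bottom-right, cliff cells in between); the names and the step function are my own illustration, not the project's code:

```python
# Cliff-walking layout assumed from Sutton & Barto, Example 6.6:
# a 4x12 grid, start bottom-left, goal bottom-right, cliff between them.
START = (3, 0)
GOAL = (3, 11)
CLIFF = {(3, c) for c in range(1, 11)}

def step(state, action):
    """Apply a move ('U', 'D', 'L', 'R'); return (next_state, reward)."""
    moves = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}
    dr, dc = moves[action]
    r = min(max(state[0] + dr, 0), 3)    # clamp to the 4 rows
    c = min(max(state[1] + dc, 0), 11)   # clamp to the 12 columns
    if (r, c) in CLIFF:
        return START, -100               # cliff: big penalty, back to start
    return (r, c), -1                    # ordinary step cost
```

Stepping right from the start lands in the cliff and resets the agent, while any other move just costs the usual -1.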
I completed the first version of the environment, but it was not practical to use and not as flexible as I would
have wanted, so I scrapped several hours' work and started from scratch on a much better path.
The second version of the GridWorld is on its way. The grid the agent is to traverse is implemented (class GridWorld),
as well as the functions setCliff and setGoal, which assign rewards to an area or a single square, respectively. So far I
have a function for ε-greedy move selection and a representation for the state-action values. I have started, but not yet
finished, the update function for the state-action values.
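The two pieces described above could look roughly like the following sketch: ε-greedy selection over a tabular Q representation, plus the Sarsa TD update from chapter 6. All names here are my own assumptions (the post's GridWorld code is not shown), and the update could just as well be Q-learning's:

```python
import random

ACTIONS = ['U', 'D', 'L', 'R']

def epsilon_greedy(Q, state, epsilon=0.1):
    """With probability epsilon explore; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

def sarsa_update(Q, s, a, reward, s2, a2, alpha=0.5, gamma=1.0):
    """Sarsa: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    target = reward + gamma * Q.get((s2, a2), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```

Storing Q as a dict keyed by (state, action) keeps the representation independent of the grid's shape, which fits the flexibility goal mentioned for the second version.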