Summer 2004 - Research Project: Reinforcement Learning

I finished the first version of updateQ (the state-action value update) and coded the missing helper and accessor functions, as well as a main function for testing purposes. I got disgusted with Windows when trying to compile, so I switched to Linux.

Several hours of debugging have finally produced something useful. The agent selects its moves according to an ε-greedy policy and updates its state-action values according to Sarsa learning. The code still needs to be cleaned up to make the function interfaces clearer and more user-friendly in general.
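For reference, here is a minimal sketch of what ε-greedy selection and the Sarsa update look like on a tabular Q array. It is not the project's actual code: the class name SarsaSketch, the integer state and action indices, and the parameters alpha, gamma and epsilon are all assumptions made for illustration.

```java
import java.util.Random;

// Illustrative sketch only; every name and parameter here is hypothetical.
final class SarsaSketch {
    final double[][] q;     // q[state][action], initialised to 0
    final double alpha;     // step size
    final double gamma;     // discount factor
    final double epsilon;   // exploration probability
    final Random rng;

    SarsaSketch(int numStates, int numActions,
                double alpha, double gamma, double epsilon, long seed) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.epsilon = epsilon;
        this.rng = new Random(seed);
    }

    // Action with the highest estimated value (ties go to the lowest index).
    int greedyAction(int state) {
        int best = 0;
        for (int a = 1; a < q[state].length; a++) {
            if (q[state][a] > q[state][best]) {
                best = a;
            }
        }
        return best;
    }

    // With probability epsilon pick a uniformly random action, else the greedy one.
    int selectAction(int state) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(q[state].length);
        }
        return greedyAction(state);
    }

    // Sarsa: the target uses the action actually chosen in the next state.
    void updateQ(int s, int a, double reward, int sNext, int aNext, boolean terminal) {
        double nextValue = terminal ? 0.0 : q[sNext][aNext];
        q[s][a] += alpha * (reward + gamma * nextValue - q[s][a]);
    }
}
```

In an episode loop, the agent would call selectAction for the next state before calling updateQ, so that the update sees the action it will really take next.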

After cleaning up the code (and adding a Pos class to make dealing with the agent's position nicer), and debugging to clean up the 'clean' code, I got it working again. The agent is learning to avoid the cliff, but unlike the results predicted in the book, the agent skims very close to the edge rather than taking a slightly longer, safer route.
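A position class of this kind might look roughly like the sketch below. The fields and the methods moved and toIndex are guesses for illustration, not the real Pos class, and the grid dimensions are passed in as parameters rather than assumed.

```java
// Illustrative sketch of an immutable grid position; all members are hypothetical.
final class Pos {
    final int row;
    final int col;

    Pos(int row, int col) {
        this.row = row;
        this.col = col;
    }

    // New position after a move, clamped so the agent stays on the grid.
    Pos moved(int dRow, int dCol, int numRows, int numCols) {
        int r = Math.max(0, Math.min(numRows - 1, row + dRow));
        int c = Math.max(0, Math.min(numCols - 1, col + dCol));
        return new Pos(r, c);
    }

    // Flattens (row, col) into a single index for a tabular Q array.
    int toIndex(int numCols) {
        return row * numCols + col;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Pos)) {
            return false;
        }
        Pos other = (Pos) o;
        return row == other.row && col == other.col;
    }

    @Override
    public int hashCode() {
        return 31 * row + col;
    }

    @Override
    public String toString() {
        return "(" + row + ", " + col + ")";
    }
}
```

Keeping the class immutable means a position can be stored or compared without worrying that a later move changes it underneath.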

I spent a few more hours tweaking everything to be more readable and to stick as closely as possible to the algorithm in the book (this meant improving my updateQ function). The agent still likes to go close to the edge, but a good part of the time it now selects the safer path after 100 episodes.

I made a few big changes to the structure of the code. Rather than just SarsaAgent, I have an EGreedyAgent as a base class, which SarsaAgent extends (so only updateQ and constructors need to be included in it). I then coded a Q-learning agent, QAgent, which also extends EGreedyAgent. It works perfectly, always choosing the optimal path (except when exploration makes it take sub-optimal moves). I also obsessively commented the code the way I like it. This part is done for now, and we can run some tests using both agents.
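The structure can be sketched as follows. Only the class names EGreedyAgent, SarsaAgent and QAgent and the method name updateQ come from the project; the fields, constructor parameters and method signatures are assumptions. The sketch also assumes terminal states keep all-zero values, so no special terminal case appears in the updates.

```java
import java.util.Random;

// Illustrative sketch; fields and signatures are hypothetical.
abstract class EGreedyAgent {
    final double[][] q;   // q[state][action]; terminal rows are assumed to stay at 0
    final double alpha;
    final double gamma;
    final double epsilon;
    final Random rng = new Random();

    EGreedyAgent(int numStates, int numActions, double alpha, double gamma, double epsilon) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.epsilon = epsilon;
    }

    int greedyAction(int state) {
        int best = 0;
        for (int a = 1; a < q[state].length; a++) {
            if (q[state][a] > q[state][best]) {
                best = a;
            }
        }
        return best;
    }

    // Shared epsilon-greedy policy: both subclasses behave the same way here.
    int selectAction(int state) {
        return rng.nextDouble() < epsilon ? rng.nextInt(q[state].length) : greedyAction(state);
    }

    // The only piece that differs between the two learners.
    abstract void updateQ(int s, int a, double reward, int sNext, int aNext);
}

final class SarsaAgent extends EGreedyAgent {
    SarsaAgent(int numStates, int numActions, double alpha, double gamma, double epsilon) {
        super(numStates, numActions, alpha, gamma, epsilon);
    }

    // On-policy: bootstraps from the action actually selected in sNext.
    @Override
    void updateQ(int s, int a, double reward, int sNext, int aNext) {
        q[s][a] += alpha * (reward + gamma * q[sNext][aNext] - q[s][a]);
    }
}

final class QAgent extends EGreedyAgent {
    QAgent(int numStates, int numActions, double alpha, double gamma, double epsilon) {
        super(numStates, numActions, alpha, gamma, epsilon);
    }

    // Off-policy: bootstraps from the best action in sNext and ignores aNext.
    @Override
    void updateQ(int s, int a, double reward, int sNext, int aNext) {
        q[s][a] += alpha * (reward + gamma * q[sNext][greedyAction(sNext)] - q[s][a]);
    }
}
```

The difference in the update target is what separates the two learners: Sarsa's values include the cost of its own exploratory moves, while Q-learning's values describe the greedy policy regardless of how the agent explores.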

All in all I had a lot of fun coding these two agents, and seeing them actually learn to avoid the cliff in different ways really made it worth the whole effort. It went pretty well, with lots of debugging, but nothing horrible and drastic (like searching for the -same- bug for hours; I was fortunate enough to have a seemingly endless series of small, easy-to-spot bugs in the code). It was a good experience in coding in Java from scratch, and really reinforced (so to speak) the ideas behind Sarsa and Q-learning, as well as what I read in the earlier chapters of the book. I am planning on posting the results of the experiments, probably next week.

As my final work for the week, I built the first version of this website. The sections included are personal information, some information about my mentor (Professor Doina Precup), information about the project, and this journal. It is for the moment a very minimal site, but will hopefully improve as I get some inspiration for the layout.