I finished the first version of updateQ (the state-action value update) and coded the missing helper and accessor functions, as well as a main function for testing purposes. I got disgusted with Windows while trying to compile, so I switched to Linux.
Several hours of debugging have finally produced something useful. The agent selects its moves according to an ε-greedy policy and updates its state-action values with Sarsa learning. The code still needs to be cleaned up to make the function interfaces clearer and generally more user-friendly.
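For the record, the ε-greedy selection boils down to something like the following sketch (the class and field names here are purely illustrative, not the exact ones in my code):

    import java.util.Random;

    // Illustrative sketch of epsilon-greedy action selection over a tabular Q.
    // Q is indexed as Q[state][action]; all names here are hypothetical.
    public class EGreedySketch {
        private final double[][] Q;      // state-action values
        private final double epsilon;    // exploration probability
        private final Random rng = new Random();

        public EGreedySketch(double[][] Q, double epsilon) {
            this.Q = Q;
            this.epsilon = epsilon;
        }

        // With probability epsilon pick a random action, otherwise a greedy one.
        public int selectAction(int state) {
            int numActions = Q[state].length;
            if (rng.nextDouble() < epsilon) {
                return rng.nextInt(numActions);   // explore
            }
            int best = 0;
            for (int a = 1; a < numActions; a++) {
                if (Q[state][a] > Q[state][best]) {
                    best = a;
                }
            }
            return best;                          // exploit
        }
    }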
After cleaning up the code (and adding a Pos class to make dealing with the agent's position nicer), and debugging to clean up the 'clean' code, I got it working again. The agent is learning to avoid the cliff, but unlike the results predicted in the book, it skims very close to the edge rather than taking a slightly longer, safer route.
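The Pos class itself is nothing fancy; something along these lines (a simplified sketch, not my exact code):

    // Minimal sketch of a Pos helper class for grid positions (illustrative names).
    public class Pos {
        public final int row;
        public final int col;

        public Pos(int row, int col) {
            this.row = row;
            this.col = col;
        }

        // Return a new position shifted by the given offsets (used for moves).
        public Pos shifted(int dRow, int dCol) {
            return new Pos(row + dRow, col + dCol);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Pos)) return false;
            Pos p = (Pos) o;
            return row == p.row && col == p.col;
        }

        @Override
        public int hashCode() {
            return 31 * row + col;
        }
    }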
I spent a few more hours tweaking everything to be more readable and to stick as closely as possible to the algorithm in the book (which meant improving my updateQ function). The agent still likes to go close to the edge, but a good part of the time it now selects the safer path after 100 episodes.
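For reference, the core of the Sarsa update from the book is what updateQ computes; a minimal sketch, with illustrative names:

    // Sketch of the Sarsa update (names are illustrative, not my exact code).
    class SarsaUpdateSketch {
        double[][] Q;   // state-action values, Q[state][action]
        double alpha;   // step size
        double gamma;   // discount factor

        // Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
        void updateQ(int s, int a, double r, int sPrime, int aPrime) {
            Q[s][a] += alpha * (r + gamma * Q[sPrime][aPrime] - Q[s][a]);
        }
    }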
I made a few big changes to the structure of the code. Rather than just SarsaAgent, I now have an EGreedyAgent base class, which SarsaAgent extends (so only updateQ and the constructors need to be included in it). I then coded a Q-learning agent, QAgent, that also extends EGreedyAgent. It works perfectly, always choosing the optimal path (except when exploration makes it take sub-optimal moves). I also obsessively commented the code the way I like it. This part is done for now, and we can run some tests using both agents.
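Schematically, the structure now looks something like this (an illustrative sketch; only updateQ differs between the two subclasses, and the names are not necessarily my exact ones):

    // Sketch of the class structure. EGreedyAgent owns the Q table and the
    // epsilon-greedy selection; the subclasses only supply updateQ.
    abstract class EGreedyAgent {
        protected double[][] Q;            // Q[state][action]
        protected double alpha, gamma;     // step size and discount
        protected double epsilon;          // exploration rate

        // selectAction(state) is implemented here, as sketched earlier.
        abstract void updateQ(int s, int a, double r, int sPrime, int aPrime);
    }

    // SarsaAgent.updateQ is the on-policy rule shown above. QAgent instead
    // bootstraps from the best action in s', regardless of the action taken next.
    class QAgent extends EGreedyAgent {
        @Override
        void updateQ(int s, int a, double r, int sPrime, int aPrime) {
            double best = Q[sPrime][0];
            for (int b = 1; b < Q[sPrime].length; b++) {
                best = Math.max(best, Q[sPrime][b]);
            }
            Q[s][a] += alpha * (r + gamma * best - Q[s][a]);
        }
    }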
All in all I had a lot of fun coding these two agents, and seeing them actually learn to avoid the cliff in different ways really made it worth the whole effort. It went pretty well, with lots of debugging but nothing horrible or drastic (like searching for the same bug for hours; I was fortunate enough to have a seemingly endless series of small, easy-to-spot bugs instead). It was a good experience in coding in Java from scratch, and it really reinforced (so to speak) the ideas behind Sarsa and Q-learning, as well as what I read in the earlier chapters of the book. I am planning to post the results of the experiments, probably next week.
As my final work for the week, I built the first version of this website. The sections included are personal information, some information about my mentor, Professor Doina Precup, information about the project, and this journal. For the moment it is a very minimal site, but it will hopefully improve as I get some inspiration for the layout.