Summer 2004 - Research Project: Reinforcement Learning

More and more test results are coming in, and they are not really what I hoped for. The tests with temperature T=15 have finished, as well as those with T=10 and higher values for lambda.
I graphed some of the results. The graphs show how little difference there is between lambda values (0.1 vs. 0.5), both over episodes 1 to 1000 and over episodes 1000 to 70000. The same pattern shows up for T=15, with lambda=0.1, 0.2 and 0.3.

There are a few interesting things to note:
* There is almost no difference between T=10 and T=15 (see graphs below)
* For some reason, lambda=0.6 does not work for either T=10 or T=15. Runs with lambda=0.5 finish fast enough, but after a few days not a single trial with lambda=0.6 had finished.

The first point seems to indicate that something is wrong other than the parameters T, lambda and so on. As more results come in, I see that the agent is not learning anything with T=20 either. The second point seems to indicate some sort of limit on the value of lambda. I can't see exactly what the agent is doing, but I'm under the impression that when the trace is kept too long, it completely warps the learning, and that the agent is probably going around in circles somewhere in a corner of the grid.
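To get a rough feel for how long a trace is "kept" at each lambda, here is a minimal sketch. It assumes the usual eligibility-trace setup in which a state's trace is multiplied by gamma * lambda on every step; the entry does not state the discount factor or the exact algorithm, so `gamma=1.0`, the `threshold` value and the function name are all hypothetical.

```python
def steps_until_trace_below(lam, gamma=1.0, threshold=0.01):
    """Count the steps until a trace that starts at 1.0 decays below threshold.

    Assumes the trace is multiplied by gamma * lam on every step.
    gamma and threshold are illustrative values, not taken from the experiments.
    """
    decay = gamma * lam
    if not 0.0 < decay < 1.0:
        raise ValueError("gamma * lam must be strictly between 0 and 1")
    trace = 1.0
    steps = 0
    while trace >= threshold:
        trace *= decay
        steps += 1
    return steps


# Compare the lambda values mentioned in the tests.
for lam in (0.2, 0.3, 0.5, 0.6):
    print(f"lambda={lam}: trace below 1% after {steps_until_trace_below(lam)} steps")
```

This only shows how far back credit reaches under those assumptions; it does not by itself explain why the lambda=0.6 trials never finish.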

I think I may have figured out the problem. I tried running some tests with a higher reward, and the agent is performing much better. Only a few tests have finished at this point, but with an increase of just 1000, I'm getting positive rewards for T=10, where the rewards had stagnated around -5000 with the smaller reward. My plan for the rest of this week and next week is to see how increasing the reward affects the agent's learning.
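One possible reading of this, assuming the agent selects actions with Boltzmann (softmax) exploration at temperature T (the entry mentions a temperature but does not name the selection rule): what matters is the size of the value differences relative to T. If the values are small compared to T, the policy stays close to uniform whether T is 10 or 15, and a larger reward makes it greedier. The sketch below illustrates this with made-up action values; the function name and the numbers are hypothetical.

```python
import math


def boltzmann_probs(q_values, temperature):
    """Return softmax action probabilities exp(Q/T) / sum(exp(Q/T)).

    The maximum is subtracted before exponentiating for numerical stability;
    this does not change the resulting probabilities.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    best = max(q_values)
    exps = [math.exp((q - best) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical action values: one action is better by 1, then by 100.
small_gap = [0.0, 1.0]
large_gap = [0.0, 100.0]

for temperature in (10, 15, 20):
    p_small = boltzmann_probs(small_gap, temperature)[1]
    p_large = boltzmann_probs(large_gap, temperature)[1]
    print(f"T={temperature}: P(best) small gap={p_small:.3f}, large gap={p_large:.3f}")
```

With the small gap, the probability of picking the better action barely moves across temperatures, which would be consistent with seeing almost no difference between T=10 and T=15.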

I also worked a bit more on my website this week. Just a little bit of tweaking, as well as changing the navigation system in the journal. So now, instead of no navigation, I have a nice bar with Roman numerals before and after each journal entry, which lets you go to any other journal entry or to the main journal page. I think this is much better, since I no longer get annoyed at having to press back all the time to double-check what I wrote.