Well, this is it, the last week of work. I don't have time to start anything new, but I do have time to wrap up all
that I wanted to test. This includes comparing various values of temperature with the same goal reward and the same lambda,
various values of lambda with the same temperature and goal reward, and finally checking whether a higher goal reward really
does tend to stabilize the total reward the agent manages to collect.
First of all, different temperatures: I compared T=10, T=15 and T=20. In all other respects the agents were the same,
with lambda=0.5, fast disaster learning, and a goal reward of +2500. The trend is that smaller temperatures are better. They
learn faster in the beginning and collect much higher total rewards in the first episodes (see first graph below), they reach a
higher maximum total reward (see second graph below), and they show less variance (see third graph below).
These are pretty much the results that were expected. In the 24x15 world the same thing happened, except that the higher
temperature managed to find a hidden goal the lower ones couldn't find, so eventually its reward was greater (see
week 4). The bigger world has no hidden goal, so there's no advantage to the higher temperature.
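For context, a temperature like this normally enters through Boltzmann (softmax) action selection, which is what my observations above are consistent with. A minimal sketch of that selection rule (the function and variable names here are just illustrative, not taken from my actual code):

```python
import numpy as np

def softmax_action(q_values, temperature):
    """Pick an action with Boltzmann (softmax) exploration.

    q_values: 1-D array of action values for the current state.
    temperature: high values flatten the distribution (more exploration),
    low values concentrate it on the greedy action.
    """
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                  # shift for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```

With a rule like this, lowering T makes the agent commit to whatever currently looks best, which is exactly the faster-but-less-exploratory behaviour seen in the graphs.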
Secondly, different values of lambda. I compared L=0.1, L=0.3 and L=0.5. Once again, the agents are the same in all
other respects; in this case the temperature was 15 and the goal reward was +3500. The results weren't very inspiring:
there was hardly any difference at all between the three values. The only difference I can see is that higher values of
lambda (0.6, for instance) don't work at all, which is strange behaviour considering how little difference there is among
the lower values.
The results, though not spectacular, were also expected. In the smaller world, eligibility traces made no difference at
all in stabilizing or increasing the final reward. They helped find the second goal faster, or simply helped find it at all
when the agent did not explore enough to discover it on its own (see
week 5).
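For completeness, here is roughly what a lambda-based update looks like. This is a minimal sketch of tabular SARSA(lambda) with accumulating traces, one standard way of using eligibility traces; the environment interface and all names are assumed for the example (they are not from my code), and it reuses the softmax_action sketch above:

```python
import numpy as np

def run_episode(env, Q, alpha=0.1, gamma=0.95, lam=0.5, temperature=15.0):
    """One episode of tabular SARSA(lambda) with accumulating traces.

    env is assumed to expose reset() -> state and step(action) ->
    (next_state, reward, done); Q is an (n_states, n_actions) float array.
    """
    E = np.zeros_like(Q)                      # eligibility traces
    s = env.reset()
    a = softmax_action(Q[s], temperature)     # softmax_action from the sketch above
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = softmax_action(Q[s2], temperature)
        target = r + (0.0 if done else gamma * Q[s2, a2])
        delta = target - Q[s, a]
        E[s, a] += 1.0                        # mark (s, a) as recently visited
        Q += alpha * delta * E                # credit every traced pair at once
        E *= gamma * lam                      # decay all traces by gamma*lambda
        s, a = s2, a2
    return Q
```

In a sketch like this, the update that sweeps the whole Q table through E on every step is also where the extra computing time of eligibility traces comes from, which I come back to below.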
Lastly, I have another comparison between different values of the goal reward. As I noticed last week, the higher the
reward, the more stable the final reward collected by the agent. It doesn't really increase the total reward, but the extra
stability is nice and gives a higher average. The reason is most likely that a higher goal reward increases the
probability of taking the path to the goal; this is obviously an advantage in the 30x30 world, but it reduces exploration,
which, as we saw in the 24x15 world, is critical for finding hidden goals.
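To see why, here is a toy illustration (nothing from my actual experiments; the two-action setup and the 0.01 factor are made up): with softmax selection, a wider value gap between the goal-directed action and the alternative makes the agent commit to the goal path more often.

```python
import numpy as np

def goal_action_prob(gap, temperature=15.0):
    """Softmax probability of the goal-directed action when its value
    exceeds the other action's value by 'gap' (two actions only)."""
    return 1.0 / (1.0 + np.exp(-gap / temperature))

# Purely illustrative: assume the value gap at some junction grows in
# proportion to the goal reward (the 0.01 factor is invented for the example).
for goal_reward in (2500, 3500):
    gap = 0.01 * goal_reward
    print(goal_reward, round(goal_action_prob(gap), 3))
# A bigger goal reward widens the gap, so the agent takes the goal path more
# consistently: the return is more stable, but there is less exploration.
```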
Well, that is the essence of my findings over the whole summer of work. I learned a lot about reinforcement learning,
and it was all very interesting. I used some neat techniques that sometimes really helped the agent out. All in all, I think the
results were interesting, but something isn't quite right about it all. The fact is, it really depends a lot on the
world the agent is working in. In the larger world, lower temperatures were much better, no question, but it was the opposite
in the smaller world, where there was an optimal temperature, neither too high nor too low. Eligibility traces didn't
have this kind of problem: they were useless but not harmful in the larger world. However, using eligibility traces slows down
the computation considerably, which will be a problem as the environment gets larger and larger. My conclusion is that, while
this is all very useful in small, fairly controlled cases, in a large, completely unknown environment there is just too much
fine-tuning to do on all the parameters. Most likely real progress will be made in another direction; clustering states
together is probably something to look at next.