
Q-learning (QL)

The Q-learning (QL) algorithm is a reinforcement learning algorithm [3]. It learns an optimal policy without first learning a model of the environment [3]. The QL algorithm, as used by the pricebots, is given below; see [3] and [4] for further details.

1. Find the state, s: this is the minimum price of the last round, where each pricebot excludes its own price from the calculation.

2. Find the action, a: the action is chosen in state s. With probability $1-\epsilon$ the action with the maximum value in s is chosen, and with probability $\epsilon$ a random action in s is chosen. For any given state there are several possible actions, that is, several possible prices. The value of a state and action is a number, initially 0, which is updated according to the QL update rule (see step 9). The maximum value for a state is the maximum over the values of all actions in that state.

3. Find the new price, p: the price is a direct mapping from the chosen action.

4. Find the old value from the Q table, indexed by s and a, as given by Q(s,a), the function that maps a given state and action to the corresponding Q value.

5. Find the reward for p, as given by r(s,a), the reward for choosing action a in state s.

6. Find the new decayed learning rate, $\alpha_n$, using


\begin{displaymath}\alpha_n = \frac{\alpha_0 \, (n_0 + 1)}{n_0 + \mathit{currentRound}}\end{displaymath}

where $\alpha_0 = 1.0$ and $n_0 = \mathit{NumberOfRounds}/10$. $\alpha_n$ is a decaying value that determines how much the current Q value is changed by each update. Note that the $\epsilon$ decay function is the same as the $\alpha$ decay function; a worked numerical example of this schedule is given after the algorithm steps.

7. Find the next state, s': this is the minimum price of the current round.

8. Find the maximum value over the actions for state s', as given by maxQ(s', a), the function that finds the maximum Q value over all actions in the next state s'.

9. Find the new value for state s and action a using the update rule [4],

\begin{displaymath}Q(s,a) = Q(s,a) + \alpha_n \, [\, r(s,a) + \gamma \, \mathit{maxQ}(s',a) - Q(s,a) \,]\end{displaymath}

and save it at Q(s,a). $\gamma$ is a constant (the discount factor) determining how much the future effects of the current action are taken into account. As mentioned, the state is the current minimum price; if there are only two pricebots, the current minimum price is simply the other pricebot's price. The action is some price: only six prices were used, from 0.5 to 1.0 in increments of 0.1. A code sketch of one full QL round is given at the end of this section.

The values updated in this way are referred to as Q values [3].
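As a concrete illustration of the decay schedule in step 6, consider a hypothetical run of 1000 rounds (this run length is assumed here for illustration only, not taken from the experiments), so that $n_0 = 100$:

\begin{displaymath}\alpha_1 = \frac{1.0\,(100+1)}{100+1} = 1.0, \qquad \alpha_{500} = \frac{101}{600} \approx 0.17, \qquad \alpha_{1000} = \frac{101}{1100} \approx 0.09 .\end{displaymath}

Early updates therefore replace the old Q value almost entirely, while later updates make only small adjustments; since $\epsilon$ decays with the same schedule, exploration likewise dies away as the run progresses.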



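To make the steps concrete, the following is a minimal Python sketch of one QL pricebot round. It is an illustration only: the class and method names, the value of $\gamma$, and the assumption that states lie on the same six-price grid as the actions are not taken from the original simulation, and the reward is passed in by the caller since the profit calculation is not specified in this section.

\begin{verbatim}
import random

# The six allowed prices, 0.5 to 1.0 in increments of 0.1 (as in step 9).
PRICES = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]


class QLPricebot:
    """Sketch of the QL pricebot in steps 1-9; all names here are assumed."""

    def __init__(self, n_rounds, alpha0=1.0, gamma=0.5):
        self.n0 = n_rounds / 10.0    # n_0 = NumberOfRounds / 10
        self.alpha0 = alpha0         # alpha_0 = 1.0
        self.gamma = gamma           # discount factor; value not given in the text
        # Q table: Q[state][action], every entry initially 0 (step 2).
        # Both states and actions are indices into PRICES.
        self.Q = [[0.0] * len(PRICES) for _ in PRICES]

    def decayed(self, current_round):
        """Step 6: decay schedule, used for both alpha and epsilon."""
        return self.alpha0 * (self.n0 + 1.0) / (self.n0 + current_round)

    def choose_price(self, state, current_round):
        """Steps 2-3: epsilon-greedy choice of an action, mapped to a price."""
        epsilon = self.decayed(current_round)
        if random.random() < epsilon:
            action = random.randrange(len(PRICES))   # explore: random action
        else:
            values = self.Q[state]
            action = values.index(max(values))       # exploit: max-value action
        return action, PRICES[action]

    def update(self, state, action, reward, next_state, current_round):
        """Steps 4-9: apply the Q-learning update rule."""
        alpha = self.decayed(current_round)           # step 6
        old_value = self.Q[state][action]             # step 4: Q(s, a)
        max_next = max(self.Q[next_state])            # step 8: maxQ(s', a)
        # Step 9: Q(s,a) = Q(s,a) + alpha*[r(s,a) + gamma*maxQ(s',a) - Q(s,a)]
        self.Q[state][action] = old_value + alpha * (
            reward + self.gamma * max_next - old_value)
\end{verbatim}

In use, the state passed to choose_price would be the index of the lowest price charged by the other pricebots in the previous round (step 1), next_state would be computed from the current round's minimum price (step 7), and reward would be the pricebot's payoff, for example its profit, for the chosen price.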
 
Victoria Manfredi
2001-08-02