The Q-learning (QL) algorithm is a reinforcement learning algorithm that learns an optimal policy without first learning a model of the environment [3]. The steps of the QL algorithm, following [3] and [4], are given below; a code sketch of one complete round follows the listing.
1. Find the state, s: this is the minimum price of the last round. Each pricebot excludes its own price from this calculation.
2. Find the action, a: the action is chosen from state s. The action with the maximum value for s is chosen with probability 1 − ε, and a random action in s is chosen with probability ε. For any given state there is a fixed set of different actions, that is, different prices, that can be chosen. The value for a state and an action is a number, initially 0, which is updated according to the QL update rule (see step 9). The maximum value for a state is the maximum over the values of all the actions for that state.
3. Find the new price, p: the price is a direct mapping from the action chosen.
4. Find the old value from the Q table, indexed by s and a, as given by Q(s, a), the function that maps a given state and action to the corresponding Q value.
5. Find the reward for p as given by r(s, a), the reward for choosing action a in state s.
6. Find the new decayed ε: after each round the exploration probability ε is decayed, so that the pricebot explores less and exploits its learned values more as the rounds progress.
7. Find the next state, s': this is the minimum price of the current round.
8. Find the maximum value over the actions for state s', as given by max_{a'} Q(s', a'), the maximum Q value over all actions a' available in the next state, s'.
9. Find the new value for state s and action a using the update rule [4],

Q(s, a) ← Q(s, a) + α [ r(s, a) + γ max_{a'} Q(s', a') − Q(s, a) ]

where α is the learning rate and γ is the discount factor.
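As a concrete illustration with hypothetical numbers, not values taken from the experiments: suppose Q(s, a) = 0, r(s, a) = 5, max_{a'} Q(s', a') = 2, α = 0.5, and γ = 0.9. The update then gives Q(s, a) ← 0 + 0.5 [5 + 0.9 × 2 − 0] = 3.4.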
These values being updated are referred to as Q values [3].
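To show how steps 1 to 9 fit together, the following is a minimal sketch of one pricebot round in Python. The discrete price levels, the placeholder profit function, the multiplicative ε decay, and all parameter values are illustrative assumptions rather than the settings of [3] or [4].

import random

# Minimal Q-learning pricebot sketch. Prices are the integers
# 0 .. N_PRICES - 1, used both as states (the minimum competitor
# price of the last round) and as actions (the price to post).
N_PRICES = 10          # number of discrete price levels (assumed)
ALPHA = 0.5            # learning rate, alpha (assumed value)
GAMMA = 0.9            # discount factor, gamma (assumed value)
EPSILON_DECAY = 0.999  # multiplicative epsilon decay (assumed form)

epsilon = 1.0  # exploration probability, decayed each round
# Q table: Q[s][a] holds the Q value for state s and action a, initially 0.
Q = [[0.0] * N_PRICES for _ in range(N_PRICES)]

def choose_action(s):
    """Step 2: epsilon-greedy selection of an action in state s."""
    if random.random() < epsilon:
        return random.randrange(N_PRICES)                 # explore
    return max(range(N_PRICES), key=lambda a: Q[s][a])    # exploit

def profit(p, min_other):
    """Step 5: placeholder reward -- revenue only when not undercut."""
    return float(p) if p <= min_other else 0.0

def play_round(last_min_price, other_prices):
    """Run steps 1-9 of the QL algorithm for one round."""
    global epsilon
    s = last_min_price                 # step 1: state
    a = choose_action(s)               # step 2: action
    p = a                              # step 3: action maps directly to a price
    r = profit(p, min(other_prices))   # step 5: reward for (s, a)
    epsilon *= EPSILON_DECAY           # step 6: decay epsilon
    s_next = min(other_prices)         # step 7: next state
    max_next = max(Q[s_next])          # step 8: max Q value over actions in s'
    # Steps 4 and 9: look up the old value Q[s][a] and apply the update rule.
    Q[s][a] += ALPHA * (r + GAMMA * max_next - Q[s][a])
    return p

# Example usage: one round against competitors posting prices 4 and 7.
posted = play_round(last_min_price=4, other_prices=[4, 7])

Here the greedy action is recomputed from the Q table on every round, matching step 2's definition of the maximum value for a state.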