Q-learning

What is Q-learning?

Q-learning is a reinforcement learning algorithm that falls under the umbrella of machine learning, a subfield of artificial intelligence.

Unlike a supervised learning algorithm (such as a typical neural network classifier trained on labelled examples), Q-learning learns independently by acting in its environment and observing the resulting states and rewards. For example: we might send a Mars-rover-esque robot to another planet with an unknown surface. With reinforcement learning, the robot could potentially learn about the surface through trial and error and use this newfound knowledge to best reach its goals (for example, the most efficient way to move across the surface).

In my opinion this type of learning best resembles how we as humans learn.

How does Q-learning work?

The Q-learning update formula:

Q(s, a) ← Q(s, a) + α [ r + γ · max_a′ Q(s′, a′) - Q(s, a) ]

where Q(s, a) is the value of taking action a in state s, α is the learning rate, γ is the discount rate, r is the reward received and s′ is the state the agent moves into.

The simplest explanation of the equation is that the agent, when in a given state, chooses actions based on its policy so as to maximise its reward.

It works by storing a Q-value for each state/action pair: a numerical value representing, based on experience, how good or bad taking that action in that state turned out to be.

The policy determines the chance that the agent takes the action it has found to be the best, versus trying a different action. If it only ever took the best-known action, it could never find the optimal solution or adapt to a changing environment.

The learning rate, as shown in the equation, controls how fast the agent learns; more precisely, how much weight it places on new information.

Finally, the discount rate determines how concerned the agent is with short-term versus long-term reward. For example, we might get a good reward for moving into a state, but in the bigger scheme of things it may be a poor choice.
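
To make the equation more concrete, here is a minimal Python sketch of the update rule and an epsilon-greedy policy. The names (q_table, alpha, gamma, epsilon) and the dictionary-based Q-table are my own assumptions for illustration; the tool described below is a Flash app, so this is not its actual code.

```python
import random

def choose_action(q_table, state, actions, epsilon=0.1):
    """Epsilon-greedy policy: mostly take the best-known action, sometimes explore."""
    if random.random() < epsilon:
        return random.choice(actions)                       # explore a random action
    return max(actions, key=lambda a: q_table[(state, a)])  # exploit the best-known action

def update_q(q_table, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """One Q-learning update: nudge the old estimate towards
    reward + discounted best future value.
    alpha is the learning rate, gamma is the discount rate.
    q_table is assumed to be a dict-like mapping that defaults to 0."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    old = q_table[(state, action)]
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```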

Q-learning in action

To demonstrate Q-learning I have built a small tool. The problem presented to the agent is to find the optimal path from the start square (white) to the checkered finish square, whilst avoiding any hazards (toggled by clicking on a square). Each square represents a state, which is the x,y coordinate of the agent. Thus in this 5×5 grid we have 25 different states. The agent has up to 4 possible actions in each state, those being up, down, left and right.

By default, the reward for moving onto a blank square is -3, the reward for moving onto a hazard is -200, and the reward for moving onto the finish square is 100.

We can also see that each square has 4 numerical values. These represent the Q-values for that state's four actions. So, for a given state, the Q-value of the up action is the number at the top of the square, and so forth.

Before we begin, you should know the default Q-value for each state/action pair has been set to 0, so the agent has no prior knowledge.
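
As a rough sketch of how this setup could be represented in code (the real tool is a Flash app; the Python below, including the start, finish and hazard coordinates, is purely illustrative):

```python
from collections import defaultdict

GRID_SIZE = 5
ACTIONS = ["up", "down", "left", "right"]

START = (0, 0)     # white start square (coordinates assumed for illustration)
FINISH = (4, 4)    # checkered finish square (coordinates assumed for illustration)
HAZARDS = set()    # filled in by clicking squares in the real tool

def reward(state):
    """Reward for moving onto a square, using the defaults above."""
    if state == FINISH:
        return 100
    if state in HAZARDS:
        return -200
    return -3

# Every Q-value defaults to 0, so the agent starts with no prior knowledge.
q_table = defaultdict(float)
```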

  1. Start by clicking on the step button. This represents 1 iteration of the algorithm. The agent will appear to have chosen a random legal action.
  2. Click step again. This time we see the Q-value updated for the state the agent was just in and the action it took. It should become clear why the reward for moving onto any blank square is negative: it gives the agent an incentive to find the optimal path. If it weren't, there would be no incentive to find the best path.
  3. If you step through a few more times, we can see the agent is choosing arbitrary movements but learning about those actions from their rewards. If it encounters a hazard, it is moved back to the start position, having learnt that acting on that prior state/action pair results in a bad reward. To save you some clicking, press the train button until the numbers converge – that is, until they stop changing much (a rough code sketch of this step/train loop follows the list).
  4. After training we can step through to view what the agent has learned. This time drag the exploration probability to 0. This will force it to choose what it has learnt to be the best path but in doing so inhibits its ability to learn anything new. You should find that if it has been trained for 10,000-15,000 iterations, it will have found the optimal path. Now select one of the squares on the optimal path and turn it into a hazard (make sure that there is still a possible path from start to finish).
  5. If we step through it again, the agent will move onto the new hazard square and record a new Q-value. The negative reinforcement it received might be enough for the other actions to now have higher Q-values, in which case they will be chosen in future; otherwise, slide the exploration probability back up and train it again, or keep stepping through manually. When trained enough, it will have adjusted itself to learn the new path. Pretty cool, huh?
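
For completeness, here is a hedged sketch of what one "step" and a "train" run might look like in code, reusing the hypothetical helpers from the snippets above; the real tool's logic may well differ in its details.

```python
def move(state, action):
    """Apply an action, keeping the agent inside the grid.
    Coordinate convention assumed: up decreases y, down increases y;
    moves off the grid leave the agent where it is."""
    x, y = state
    dx, dy = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}[action]
    nx, ny = x + dx, y + dy
    return (nx, ny) if 0 <= nx < GRID_SIZE and 0 <= ny < GRID_SIZE else state

def step(state, epsilon=0.1):
    """One iteration: choose an action, observe the reward, update the Q-value."""
    action = choose_action(q_table, state, ACTIONS, epsilon)
    next_state = move(state, action)
    update_q(q_table, state, action, reward(next_state), next_state, ACTIONS)
    if next_state in HAZARDS or next_state == FINISH:
        return START   # hazards and the finish both send the agent back to the start
    return next_state

def train(iterations=15000, epsilon=0.1):
    """Repeatedly step until the Q-values have (hopefully) converged."""
    state = START
    for _ in range(iterations):
        state = step(state, epsilon)
```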

When to use Q-learning?

The biggest downfall of Q-learning is its memory requirements and the time it can take to converge. The agent can only learn things that are encapsulated in the state, so the more information we want it to use, the more states we need. A 100×100 grid = 10,000 different states. If you were then to add a moving enemy, its position would need to be included in the state, resulting in 100,000,000 states needing to be stored. That's going to be pretty heavy going on memory usage and will take a while to converge to an optimal solution.
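
As a quick back-of-the-envelope check of those numbers:

```python
agent_states = 100 * 100                  # agent position on a 100x100 grid
print(agent_states)                       # 10000

with_enemy = agent_states * (100 * 100)   # agent position x enemy position
print(with_enemy)                         # 100000000
```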

This can make Q-learning ill-suited to problems where the state contains a lot of information. That being said, techniques can be employed to help reduce the number of states.

Grid-type games lend themselves well to Q-learning, as demonstrated with this tool; in fact, a reinforcement learning agent (TD-Gammon) was famously trained to play backgammon successfully against top players in the world. Apart from grid-based games, it can also be used for decision making, for example strategy AI or controlling the logic of an enemy in a first-person shooter.

So that is pretty much Q-learning in a nutshell.  For more information on Q-Learning you should have a read of the very detailed book by Sutton and Barto that’s available online at http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html

In the upcoming weeks I should hopefully be releasing a Q-learning library :)

  • jss

Doesn’t Q-learning fall foul of the hill-climbing problem, where it will converge on a partial solution and miss the big picture?

  • Anonymous

I am no expert on Q-learning, but I don’t think it suffers from getting stuck in a local minimum in the same way a neural network can.

    I imagine that as long as it balances current immediate reward with possible future reward, and that it explores random actions now and then, given enough time it should not get permanently stuck.

That being said, it sounds reasonable that its learning could be slowed down by a local minimum. If you step through the flash app, you can see that at times it spends a little while in an area that is not the optimal solution, but it will at some point discover a better path (as long as there is some degree of exploration). That’s the beauty of Q-learning, in that it can adapt if the environment changes and not get stuck on the one solution – so long as it’s being trained sufficiently.

  • Ainbpr

hi, i need your source code for this flash.
can u help me?
i am ali nayebpour from iran.
email: ainbpr@yahoo.com