Multi-Armed Bandit Strategies

A multi-armed bandit strategy or solution is an advanced way to decide the most profitable or best outcome out of a number of different choices. 

Mosteller and Bush coined the phrase “multi armed bandits” in 1952 after running trials on mice in a T-shaped maze, and placing food at only one end of the “T”. They then observed whether mice would turn right or left to look for the food when starting at the bottom of the “T”.

The formulae which we will explore here will explain how maximum reward can be achieved which lends itself perfectly to being used in stock market analysis and stock selection. By being able to use different methods and formulae we are able to choose which stocks are likely to be the best investment. 

The Slot Machine Dilemma

In order to see how humans would respond in a similar scenario, a “two armed bandit” machine was created, similar to a slot machine in a casino, but instead of one arm this machine had two arms. People then had the choice of pulling the left arm or the right arm, and there would be a payout. The formula which we will describe below is intended to maximize the reward chances and minimize the chances of regret. 

They way this would be tested was to pull each lever a certain number of times, in this case 15 times. For each pull there would either be a payout or no result. If we take the example that after 10 pulls the left lever has given a total of 4 wins but the right lever has given a total of 6 wins, the question being, for the remainder 5 pulls of each lever how does one decide which one to pull? For example, does one decide that the right lever is the most successful or that it may not deliver any more wins? Or, does one decide that the left lever has been more successful so stay with that one, but it could also mean that it has had its run and luck is about to run out. 

Explore vs. Exploit

The above scenario is an example of the multi armed bandit problem. In other words in a given scenario should one exploit what seems to be a winning formula or option, or explore the choice which may be due a successful run due to the law of probability.

Strategies & Formulae

There are several formulae and strategies that can be used in order to decide whether the explore or exploit options should be considered in a given situation.

An example would be investing in single stocks or even exchange traded funds (ETFs). When choosing an investment, the future is impossible to predict, however by using the strategies below and applying algorithms it can help to increase the chances of an investment being successful and a “win”.

The Epsilon-Greedy Algorithm

Determining whether to explore an option or exploit an already successful option can be difficult as the mistake can be costly, especially in the case of gambling large sums of money. The Epsilon-Greedy algorithm helps to decide whether the explore or exploit option is more likely to produce a winning result. The formula for this algorithm is as follows:

P = (1- epsilon) + (epsilon/k)
P = probability 
Epsilon = a value such as 10% and presented in decimal format therefore in the case of 10% this would be 0.10.
K = different choices or “arms” to select

The methodology behind this algorithm is that based on the results of the formula and a number of “pulls”, the most successful arm is chosen and pulled again, hence choosing to be “greedy” and continuing the process.

The Upper Confidence Bound (UCB) Algorithm

This formula is another method to determine probability. This algorithm uses empirical reward estimates and the number of times the arm has been played as its main criteria. In simple terms, the strategy bases itself on being optimistic in an uncertain situation, and therefore the exploit option should be the way forward as the assumption is that there is a strong likelihood that there will be a positive outcome. 

The formula is outlined below: 

Source: Shipra Agrawal 

Thompson Sampling

Thompson sampling was introduced by William Thompson in 1933. As quoted by Thompson it is whereby one needs to “randomly take action according to the probability you believe it is the optimal action”. 

It is commonly used in AB testing such as website design and advertising whereby it can help ascertain which layout/advert would be the most likely to convert to sales or attract customers. 

If we use the slot machine example where we took a machine with two arms, by keeping track of the beta distribution results from each number of pulls on each arm and plotting them on a chart we will have a curve representing the total distribution. A random point in both graphs for both arms is selected, and whichever has the highest x co-ordinate value effectively “wins” and would then be selected to carry on. 

This methodology is effective due to using beta distribution to accurately determine which machine is likely to produce a higher result therefore higher likelihood to produce a better outcome. It also allows room for maneuver when testing unexplored machines as it creates a “success buffer”.


As we have seen, the multi armed bandit problem can be solved by applying different formulae in order to select the “machine” with the highest probability of success. In the examples used we looked at a slot machine, however this could also be applied to medicine in the trial of experimental drugs and also advertising, marketing and general A/B testing. 

The key is to maximize reward and minimize regret in a given situation and the formula which is most effective will be dependent on the specification scenario. In many cases Thompson Sampling, for example, is effective when assessing binary metrics and Epsilon Greedy is better suited for numeric metrics. Also, we could also argue that the Epsilon-Greedy is the most simple overall, especially compared to say, UCB sampling and has been proven to be a popular method in achieving a result and outcome quickly and accurately.