Jul 12, 2024 - 5 min
Multi Armed Bandits

- Multi Armed Bandits: the name comes from casino slot machines bandits
- the idea is to maximize the reward or in other words minimize the regret
- if you knew whats the best machine you would go ahead and
exploitit to get the best reward. But in order to do that you need toexplorethe machines - exploitation is about maximising the reward from the existing knowledge you have at hand whereas exploration is about trying out different options in order to find better options
- in multi armed bandits, each option is a bandit
- there are two components of MAB system – the environment and the agent. The agent is the MAB model that will take an action. For example in recommendations it will be suggesting title A. The environment then gives a reward. For example did the user click on title A or not or did they buy title A etc
- once the agent receives a reward they update the internal model/distribution
- MAB can also be used for A/B testing and hyperparameter optimization
- they are designed for real time predictions and online learning so can be dynamically adapted
- contextual MAB dont just consider historical data they also consider contextual data like user information etc. By incorporating context, contextual bandits aim to learn how different actions interact with specific contexts and their impact on the expected reward. This enables more personalized and opti
Algorithms
There are multiple ways of implementing the Multi Armed Bandits models
- Epsilon Greedy
- Decay Greedy
- Thompson Sampling
- Upper Confidence Bound
Epsilon greedy
in this case we define epsilon – the exploration parameter. The model selects exploitation with 1-e probability. In case of exploitation, the selection is done randomly
As we get a reward for the choice we make, we update the Q values (accumulated reward values) for the choice. The update is done via
here
This helps us incrementally update the reward for the selected arm at each timestep.
The term
the
Note
In some cases we decay the exploration rate since we start to get stable estimates of the rewards thus reducing the need for exploration. That variant is called Decay Epsilon Greedy
Thompson Sampling
- Here instead of selecting the current candidate via a threshold we use posteriors and likelihoods to select the arm and calculate the posterior to update the our rewards model
- Using the priors/posteriors, we select the candidate with the highest reward. Based on our observations, we updater the posterior
- The key idea is to maintain a probability distribution of rewards over the bandits and sample from these to chose the arm
- the posteriors are updated as we train the model
- here’s how the model works
- we sample the probability of click for each arm using their beta distribution. the
and values are set as the number of successes+1 and (number of pulls-number of scueess+1) to get the probability - we choose the arm with highest prob
- we assign a reward for the action
- we update counter npulls and nsuccess
- we sample the probability of click for each arm using their beta distribution. the
Upper Confidence Bound
- This strategy is also called Optimism in the Face Uncertainity principle
- Its called UCB because it uses an upper confidence bound to decide between exploration and exploitation
- UCB works by selecting the arm that maximises the sim of two terms- average reward, a confidence bound that decreases as more pulls are made from that arm
- the confidence bound allows us to explore unseen or relatively unseen options as well
- here the first term on RHS is the average reward received for arm i upto time t
is the number of times arm i has been pulled upto time t is the exploration parameter that controls the degree of exploration. its often set to or is the natural logarithm of the number of total pulls up to time t
- The arm with the highest UCB is selected as the next action
- The algorithm works as below
- Initialize: try everything once to get the reward values for each function
- Compute: at each time step calculate the UCB for all values
- Action: select the arm with the highest UCB
- Update: update the average reward and the number of pulls for the selected arm
- Repeat until stopping condition