Multi Armed Bandits | shubham's notes

Multi Armed Bandits: the name comes from casino slot machines bandits
the idea is to maximize the reward or in other words minimize the regret
if you knew whats the best machine you would go ahead and exploit it to get the best reward. But in order to do that you need to explore the machines
exploitation is about maximising the reward from the existing knowledge you have at hand whereas exploration is about trying out different options in order to find better options
in multi armed bandits, each option is a bandit
there are two components of MAB system – the environment and the agent. The agent is the MAB model that will take an action. For example in recommendations it will be suggesting title A. The environment then gives a reward. For example did the user click on title A or not or did they buy title A etc
once the agent receives a reward they update the internal model/distribution
MAB can also be used for A/B testing and hyperparameter optimization
they are designed for real time predictions and online learning so can be dynamically adapted
contextual MAB dont just consider historical data they also consider contextual data like user information etc. By incorporating context, contextual bandits aim to learn how different actions interact with specific contexts and their impact on the expected reward. This enables more personalized and opti

Algorithms

There are multiple ways of implementing the Multi Armed Bandits models

Epsilon Greedy
Decay Greedy
Thompson Sampling
Upper Confidence Bound

Epsilon greedy

in this case we define epsilon – the exploration parameter. The model selects exploitation with 1-e probability. In case of exploitation, the selection is done randomly

As we get a reward for the choice we make, we update the Q values (accumulated reward values) for the choice. The update is done via

here is the Q value for action at time step and is the reward earned for action a at time t

This helps us incrementally update the reward for the selected arm at each timestep.

The term acts as a learning rate, which decreases as the number of observations for arm a increases. This leads to early observations having a larger impact on the estimate, while later observations make smaller adjustments, leading to a stable estimate over time.

the term represents the difference between the current reward and the reward estimate. Large differences lead to larger updates.

Note

In some cases we decay the exploration rate since we start to get stable estimates of the rewards thus reducing the need for exploration. That variant is called Decay Epsilon Greedy

Thompson Sampling

Here instead of selecting the current candidate via a threshold we use posteriors and likelihoods to select the arm and calculate the posterior to update the our rewards model
Using the priors/posteriors, we select the candidate with the highest reward. Based on our observations, we updater the posterior
The key idea is to maintain a probability distribution of rewards over the bandits and sample from these to chose the arm
the posteriors are updated as we train the model
here’s how the model works
- we sample the probability of click for each arm using their beta distribution. the and values are set as the number of successes+1 and (number of pulls-number of scueess+1) to get the probability
- we choose the arm with highest prob
- we assign a reward for the action
- we update counter npulls and nsuccess

Upper Confidence Bound

This strategy is also called Optimism in the Face Uncertainity principle
Its called UCB because it uses an upper confidence bound to decide between exploration and exploitation
UCB works by selecting the arm that maximises the sim of two terms- average reward, a confidence bound that decreases as more pulls are made from that arm
the confidence bound allows us to explore unseen or relatively unseen options as well
- here the first term on RHS is the average reward received for arm i upto time t
- is the number of times arm i has been pulled upto time t
- is the exploration parameter that controls the degree of exploration. its often set to or
- is the natural logarithm of the number of total pulls up to time t
The arm with the highest UCB is selected as the next action
The algorithm works as below
- Initialize: try everything once to get the reward values for each function
- Compute: at each time step calculate the UCB for all values
- Action: select the arm with the highest UCB
- Update: update the average reward and the number of pulls for the selected arm
- Repeat until stopping condition