Huber Loss | shubham's notes

In machine learning, handling outliers effectively is crucial for building robust models. Traditional loss functions like Mean Squared Error (MSE) can be overly sensitive to outliers, while Mean Absolute Error (MAE) has a gradient that jumps at zero error, which can make optimization less smooth. Enter Huber Loss: an alternative that blends the best of both worlds. By switching between squared and absolute loss based on a threshold $\delta$, Huber Loss reduces the influence of outliers while maintaining a smooth optimization landscape. Curious to see how it works? Let’s dive in!

What is Huber Loss?

Huber loss combines the best of Mean Squared Error (MSE) and Mean Absolute Error (MAE).

Huber Loss is a piecewise loss function typically used in regression. It’s a mix of squared loss and absolute loss. For smaller errors, it behaves like a squared loss, providing smooth gradients, and for larger errors it behaves like an absolute loss, reducing the impact of outliers. It’s particularly useful for tasks where the data contains outliers.

Huber loss provides robustness by switching to absolute loss when the absolute error crosses a predefined threshold parameter $\delta$. This helps prevent outsized adjustments to the weights in the presence of outliers.

For samples where the absolute error $|y - f(x)|$ is at most $\delta$, we use the quadratic (squared) loss; otherwise we use the linear (absolute) loss.

Huber Loss Formulation

Huber Loss is defined as follows:

$L_{\delta}(y, f(x)) = \begin{cases} \frac{1}{2} (y - f(x))^2 & \text{if } |y - f(x)| \leq \delta \\ \delta |y - f(x)| - \frac{1}{2} \delta^2 & \text{otherwise} \end{cases}$

where $y$ is the target, $f(x)$ is the prediction, and $\delta > 0$ is the threshold parameter.
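As a minimal sketch of the definition above, the piecewise form can be written in NumPy. The function name `huber_loss` and the sample values are illustrative assumptions, not something the article prescribes.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Element-wise Huber loss; `delta` is the quadratic-to-linear threshold."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2                    # used when |error| <= delta
    linear = delta * abs_error - 0.5 * delta ** 2   # used otherwise
    return np.where(abs_error <= delta, quadratic, linear)

# Made-up example: one small error, one exactly at the threshold, one large.
y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 1.0, 3.0])
losses = huber_loss(y_true, y_pred, delta=1.0)
print(losses)  # values: 0.125, 0.5, 2.5
```

The large error of 3 contributes 2.5 rather than the 4.5 that the squared loss $\frac{1}{2}(y - f(x))^2$ would give.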

You’ll notice that when $|y - f(x)|$ is greater than $\delta$, we don’t simply switch to absolute loss. Instead, the error term is multiplied by $\delta$ and we also subtract $\frac{1}{2}\delta^2$ from the loss. The reason is that if we simply switched to $|y - f(x)|$ when $|y - f(x)|$ is greater than $\delta$, the Huber loss wouldn’t be differentiable at $|y - f(x)| = \delta$.

Below is the graph we get if we simply switch to $|y - f(x)|$ when $|y - f(x)|$ is greater than $\delta$. Notice how the function is not differentiable at $|y - f(x)| = \delta$.

Huber Loss when we don’t ensure differentiability

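It’s clear from the graph above that a simple switch won’t give us a smooth, differentiable function. To ensure differentiability, let us first calculate the value of the squared loss at $|y - f(x)| = \delta$.

We substitute $|y - f(x)| = \delta$ into $\frac{1}{2}(y - f(x))^2$. This gives us:

$\frac{1}{2}(y - f(x))^2 = \frac{1}{2}\delta^2$

For the loss to be smooth throughout, the linear branch must also return $\frac{1}{2}\delta^2$ when $|y - f(x)| = \delta$, and its slope there must match the slope of the quadratic branch, which is $\delta$. This can be achieved by adjusting the form of the absolute loss to

$\delta |y - f(x)| - \frac{1}{2}\delta^2$

for larger errors.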

Huber Loss after we modify the absolute loss to ensure differentiability
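The matching argument can be checked numerically. The sketch below compares the uncorrected switch with the adjusted form just past the boundary; the value `delta = 1.5`, the step `eps`, and the function names are assumptions chosen for illustration.

```python
import numpy as np

delta = 1.5   # illustrative threshold, not a value from the article
eps = 1e-6    # small step to either side of the boundary

def naive_switch(e, delta):
    # Squared loss inside the threshold, plain |e| outside (no correction).
    return np.where(np.abs(e) <= delta, 0.5 * e ** 2, np.abs(e))

def huber(e, delta):
    # Squared loss inside the threshold, scaled and shifted |e| outside.
    return np.where(np.abs(e) <= delta, 0.5 * e ** 2,
                    delta * np.abs(e) - 0.5 * delta ** 2)

# Change in loss when stepping just past the boundary.
naive_jump = float(naive_switch(delta + eps, delta) - naive_switch(delta, delta))
huber_jump = float(huber(delta + eps, delta) - huber(delta, delta))

# One-sided finite-difference slopes of the adjusted loss at the boundary.
left_slope = float((huber(delta, delta) - huber(delta - eps, delta)) / eps)
right_slope = float((huber(delta + eps, delta) - huber(delta, delta)) / eps)

print(naive_jump, huber_jump, left_slope, right_slope)
```

With these values the uncorrected switch jumps by roughly $\delta - \frac{1}{2}\delta^2 = 0.375$, while the adjusted form changes only by an amount proportional to `eps` and has approximately the same slope $\delta$ on both sides.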

Gradient of Huber Loss

The gradient of the Huber loss with respect to the predictions comes out to be

$\nabla L_\delta(y, f(x)) = \begin{cases} -(y - f(x)) & \text{if } |y - f(x)| \leq \delta \\ -\delta \cdot \text{sign}(y - f(x)) & \text{if } |y - f(x)| > \delta \end{cases}$

Notice that for cases where $|y - f(x)| > \delta$, the gradient has a fixed magnitude of $\delta$. In the graph below, $\delta$ is held at a fixed value, so the gradient for $|y - f(x)| > \delta$ is either $-\delta$ or $+\delta$, and it scales linearly with the error in between.

The Huber loss caps the gradient magnitude for outliers at $\delta$, which prevents outsized adjustments to the weights. This helps reduce the model’s sensitivity to outliers.
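A minimal NumPy sketch of this gradient follows. The name `huber_grad` and the sample predictions are assumptions for illustration. It also shows that the piecewise gradient is the same as clipping the residual $f(x) - y$ to the interval $[-\delta, \delta]$.

```python
import numpy as np

def huber_grad(y_true, y_pred, delta=1.0):
    """Gradient of the Huber loss with respect to y_pred (element-wise)."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # -(y - f(x)) inside the threshold, -delta * sign(y - f(x)) outside.
    return np.where(np.abs(error) <= delta, -error, -delta * np.sign(error))

# Made-up predictions: two far from the target, three close to it.
y_true = np.zeros(5)
y_pred = np.array([-10.0, -0.5, 0.0, 0.5, 10.0])
grads = huber_grad(y_true, y_pred, delta=1.0)

# Equivalent view: the residual f(x) - y clipped to [-delta, delta].
clipped = np.clip(y_pred - y_true, -1.0, 1.0)
print(grads, clipped)
```

The two predictions that are off by 10 produce gradients of magnitude 1, the same as a prediction that is off by exactly $\delta$.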

Properties

Combination of $L_2$ and $L_1$ loss: Huber loss is $L_2$ (quadratic) for smaller errors and $L_1$ (linear) for larger errors. As a result, the penalty grows quadratically for small errors but only linearly for large ones.
Convexity: Huber Loss is a convex function of the prediction error, which makes it well suited to gradient-based optimization. For models that are linear in their parameters, any local minimum is also a global minimum, so the optimizer cannot get stuck in a poor local one.
Differentiability: Huber Loss is differentiable everywhere. The transition from quadratic to linear at $|y - f(x)| = \delta$ is smooth, so the first derivative is continuous.
Robustness to Outliers: Huber Loss is designed to be robust against outliers. It achieves this by letting the loss grow only linearly for large errors, which caps the gradient magnitude at $\delta$.
The $\delta$ Parameter: The parameter $\delta$ is the point where the loss switches from quadratic to linear. Tuning it adjusts the model’s sensitivity to outliers. If $\delta$ is large, the loss behaves mostly like squared loss. If $\delta$ is small, it behaves more like absolute loss. Choosing a suitable $\delta$ is important: too large a value will fail to mitigate the impact of outliers, while too small a value may desensitize the model to genuine errors and cause underfitting.
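To make the effect of $\delta$ concrete, here is a small sketch that fits a single constant prediction to made-up data containing one outlier, using plain gradient descent on the mean Huber loss. The data, learning rate, step count, and the helper name `fit_constant` are all assumptions for illustration.

```python
import numpy as np

# Made-up data: four inliers near 1.0 and one outlier.
y = np.array([1.0, 1.1, 0.9, 1.0, 50.0])

def fit_constant(y, delta, lr=0.5, steps=200):
    """Gradient descent on the mean Huber loss of a constant prediction c."""
    c = 0.0
    for _ in range(steps):
        # Gradient w.r.t. c is the residual (c - y) clipped to [-delta, delta].
        grad = np.clip(c - y, -delta, delta).mean()
        c -= lr * grad
    return c

c_small_delta = fit_constant(y, delta=1.0)     # the outlier's pull is capped
c_large_delta = fit_constant(y, delta=100.0)   # every point stays in the quadratic region
print(c_small_delta, c_large_delta)
```

With `delta=1.0` the fit settles near 1.25, close to the inliers. With `delta=100.0` no residual ever exceeds the threshold, so the loss acts as squared loss and the fit converges to the sample mean, 10.8.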

How to select Delta?

Selecting $\delta$ is more art than exact science. One way to select $\delta$ is to fit a simple regression model and look at the distribution of its residuals. You can set $\delta$ to a percentile of the absolute residuals from the baseline model:

Compute $|y - f(x)|$ for all training points
Pick $\delta$ as the 75th, 90th, or 99th percentile to cover most inliers while treating the tail as outliers
You can also treat $\delta$ as a hyperparameter to be tuned and selected
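The percentile recipe above can be sketched as follows. The synthetic dataset, the least-squares line from `np.polyfit` as the baseline model, and the variable names are assumptions for illustration; any simple baseline regressor would serve the same role.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): a noisy line with a few large outliers.
x = np.linspace(0.0, 10.0, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[::25] += 20.0  # shift every 25th point upward to act as an outlier

# Baseline model: ordinary least-squares line fit.
slope, intercept = np.polyfit(x, y, deg=1)
abs_residuals = np.abs(y - (slope * x + intercept))

# Candidate deltas taken from percentiles of the absolute residuals.
candidates = {p: float(np.percentile(abs_residuals, p)) for p in (75, 90, 99)}
for p, value in candidates.items():
    print(f"{p}th percentile -> delta = {value:.3f}")
```

A lower percentile treats more points as outliers, and a higher one treats fewer. Each candidate can then be compared on a validation set if $\delta$ is tuned as a hyperparameter.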

