Jul 19, 2024 - 7 min

Huber Loss

ML Fundamentals

In machine learning, handling outliers effectively is crucial for building robust models. Traditional loss functions like Mean Squared Error (MSE) can be overly sensitive to outliers, while Mean Absolute Error (MAE) has a gradient that is not smooth around zero error, which can make optimization harder. Enter Huber Loss, an alternative that blends the best of both worlds. By switching between squared and absolute loss based on a threshold $\delta$, Huber Loss limits the influence of outliers while maintaining a smooth optimization landscape. Curious to see how it works? Let's dive in!

What is Huber Loss?

Huber loss combines the best of Mean Squared Error (MSE) and Mean Absolute Error (MAE).

Huber Loss is a piecewise loss function typically used in regression. It's a mix of squared loss and absolute loss. For smaller errors it behaves like squared loss, providing smooth gradients, and for larger errors it behaves like absolute loss, reducing the impact of outliers. It's particularly useful for datasets that contain outliers.

Huber loss provides robustness by switching to absolute loss when the absolute error crosses a predefined threshold parameter $\delta$. This helps prevent outsized adjustments to the weights in the presence of outliers.

For samples whose absolute error $|y - f(x)|$ is at most $\delta$, we use the quadratic (squared) loss; otherwise we use the linear (absolute) loss.

Huber Loss Formulation

Huber Loss is defined as follows:

$$L_{\delta}(y, f(x)) = \begin{cases} \frac{1}{2} (y - f(x))^2 & \text{if } |y - f(x)| \leq \delta \\ \delta |y - f(x)| - \frac{1}{2} \delta^2 & \text{otherwise} \end{cases}$$

where $\delta > 0$ is the threshold parameter, $y$ is the target and $f(x)$ is the prediction.
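The piecewise definition above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation; the function name `huber_loss` and the sample values are assumptions made for the example.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Element-wise Huber loss; delta is the quadratic-to-linear threshold."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_error = np.abs(error)
    quadratic = 0.5 * error**2                    # branch for |error| <= delta
    linear = delta * abs_error - 0.5 * delta**2   # branch for |error| > delta
    return np.where(abs_error <= delta, quadratic, linear)

# Illustrative (made-up) targets and predictions
y_true = np.array([0.0, 0.0, 0.0, 0.0])
y_pred = np.array([0.5, 1.0, 2.0, -3.0])
losses = huber_loss(y_true, y_pred, delta=1.0)
print(losses)  # per-sample losses: 0.125, 0.5, 1.5, 2.5
```

The first two samples fall in the quadratic branch, while the last two fall in the linear branch and grow only linearly with the error.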

You'll notice that when $|y - f(x)|$ is greater than $\delta$, we don't simply switch to absolute loss. Instead, the absolute error is multiplied by $\delta$ and we also subtract $\frac{1}{2}\delta^2$ from the loss. The reason is that if we simply switched to $|y - f(x)|$ when $|y - f(x)|$ is greater than $\delta$, the Huber loss wouldn't be differentiable at $|y - f(x)| = \delta$.

Below is the graph we get if we simply switch to $|y - f(x)|$ when $|y - f(x)|$ is greater than $\delta$. Notice how the function is not differentiable at $|y - f(x)| = \delta$.

A 2D plot showing the Huber Loss function without differentiability adjustment (δ = 1). The x-axis represents error, and the y-axis represents loss. Unlike the adjusted version, this loss function has a sharp change at ±1 where the quadratic and linear loss components meet, causing a kink in the curve. The red dashed line represents the unadjusted Huber Loss, and the blue vertical dashed lines indicate the transition boundary at δ = 1.

Huber Loss when we don’t ensure differentiability

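It's clear from the graph above that a simple switch won't give us a smooth, differentiable function. To ensure differentiability, let us first calculate the value of the squared loss when $|y - f(x)| = \delta$.

We substitute $|y - f(x)| = \delta$ into the squared loss. This gives us:

$$\frac{1}{2}(y - f(x))^2 = \frac{1}{2}\delta^2$$

For the loss to be smooth throughout, we need to ensure that the linear branch also returns $\frac{1}{2}\delta^2$ when $|y - f(x)| = \delta$, and that its slope there matches the slope $\delta$ of the squared loss. This is easily achieved by adjusting the form of the absolute loss to

$$\delta |y - f(x)| - \frac{1}{2}\delta^2$$

for larger errors.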
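As a quick check of the adjustment, the two branches can be compared at the boundary. The shorthand $e = y - f(x)$ is introduced here only for brevity and is not part of the original notation.

```latex
\begin{aligned}
\text{Value at } |e| = \delta: \quad
  & \tfrac{1}{2}\delta^2 \;\; \text{(quadratic branch)}, \qquad
    \delta \cdot \delta - \tfrac{1}{2}\delta^2 = \tfrac{1}{2}\delta^2 \;\; \text{(linear branch)} \\
\text{Slope at } |e| = \delta: \quad
  & \frac{d}{d|e|}\Big(\tfrac{1}{2}|e|^2\Big)\Big|_{|e| = \delta} = \delta, \qquad
    \frac{d}{d|e|}\Big(\delta |e| - \tfrac{1}{2}\delta^2\Big) = \delta
\end{aligned}
```

Both the value and the slope agree at $|e| = \delta$, so the loss and its first derivative are continuous there. The unadjusted branch $|e|$ has slope $1$, which matches the quadratic slope $\delta$ only in the special case $\delta = 1$, and even then its value $1$ differs from $\frac{1}{2}$.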

A 2D plot showing the Huber Loss function with differentiability adjustment (δ = 1). The x-axis represents error (prediction error), and the y-axis represents loss. The loss function smoothly transitions from a quadratic region in the center (for small errors) to a linear region for larger errors. The blue solid line represents the adjusted Huber Loss, ensuring smooth differentiability at the transition points. The vertical blue dashed lines mark δ = 1, indicating where the loss shifts from quadratic to linear behavior.

Huber Loss after we modify the absolute loss to ensure differentiability

Gradient of Huber Loss

The gradient of the Huber loss with respect to the prediction $f(x)$ is

$$\nabla L_\delta(y, f(x)) = \begin{cases} -(y - f(x)) & \text{if } |y - f(x)| \leq \delta \\ -\delta \cdot \text{sign}(y - f(x)) & \text{if } |y - f(x)| > \delta \end{cases}$$

Notice that when $|y - f(x)| > \delta$, the gradient has a fixed magnitude of $\delta$. In the graph below we set $\delta = 1$, so the gradient for $|y - f(x)| > \delta$ is either $1$ or $-1$, and it scales linearly in between.

A plot showing the gradient of Huber Loss with respect to predictions. The x-axis represents the prediction error (y_true - y_pred) ranging from -3 to 3, while the y-axis represents the gradient. The plot has a piecewise linear shape, with a constant gradient of 1 for large negative errors and -1 for large positive errors, indicating that the Huber loss behaves like the absolute error in these regions. In the central region between -1 and 1 (highlighted by blue dashed lines for δ = 1), the gradient smoothly transitions in a linear fashion, showing the squared loss behavior. The red line represents the gradient of the Huber Loss, and the legend indicates δ = 1

The Huber loss caps the magnitude of the gradient at $\delta$ for outliers, which prevents outsized adjustments to the weights. This helps reduce the model's sensitivity to outliers.
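The gradient formula above can be sketched and sanity-checked against a finite-difference approximation. The function names and test points below are illustrative assumptions, and the test points deliberately avoid the boundary $|y - f(x)| = \delta$.

```python
import numpy as np

def huber_loss_from_error(error, delta=1.0):
    """Huber loss as a function of the error y - f(x)."""
    abs_error = np.abs(error)
    return np.where(abs_error <= delta,
                    0.5 * error**2,
                    delta * abs_error - 0.5 * delta**2)

def huber_grad_wrt_pred(y_true, y_pred, delta=1.0):
    """Gradient of the Huber loss with respect to the prediction f(x)."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Inside the threshold: -(y - f(x)); outside: -delta * sign(y - f(x))
    return np.where(np.abs(error) <= delta, -error, -delta * np.sign(error))

y_true = np.zeros(5)
y_pred = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
grad = huber_grad_wrt_pred(y_true, y_pred, delta=1.0)

# Central finite difference with respect to the prediction
eps = 1e-6
numeric = (huber_loss_from_error(y_true - (y_pred + eps))
           - huber_loss_from_error(y_true - (y_pred - eps))) / (2 * eps)

print(grad)     # magnitudes never exceed delta = 1
print(numeric)  # should agree closely with grad
```

An equivalent way to write the same gradient is `-np.clip(error, -delta, delta)`, which makes the capping behaviour explicit.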

Properties

  1. Combination of $L_1$ and $L_2$ loss: Huber loss is quadratic ($L_2$) for smaller errors and linear ($L_1$) for larger errors. As a result, large errors are penalized less severely than they would be under squared loss
  2. Convexity: Huber Loss is convex in the prediction error. For models that are linear in their parameters, this means every local minimum is a global minimum, which makes it well suited to gradient-based optimization
  3. Differentiability: Huber Loss is differentiable everywhere. The transition from quadratic to linear is smooth, ensuring a continuous first derivative
  4. Robustness to Outliers: Huber Loss is designed to be robust against outliers. It achieves this by growing only linearly for large errors, so the gradient contributed by any single sample is capped at $\delta$
  5. The $\delta$ Parameter: The parameter $\delta$ is the point where the loss switches from quadratic to linear. Tuning it adjusts the model's sensitivity to outliers. If $\delta$ is large, the loss behaves mostly like squared loss. If $\delta$ is small, it behaves more like absolute loss. Choosing a suitable $\delta$ is important: too large a value will fail to mitigate the impact of outliers, while too small a value may overly desensitize the model and cause underfitting.
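The effect of $\delta$ described in the last property can be illustrated numerically. This is a small sketch with made-up error values; `huber_loss_from_error` is a hypothetical helper defined only for this example.

```python
import numpy as np

def huber_loss_from_error(error, delta):
    """Huber loss as a function of the error y - f(x)."""
    abs_error = np.abs(error)
    return np.where(abs_error <= delta,
                    0.5 * error**2,
                    delta * abs_error - 0.5 * delta**2)

errors = np.array([0.1, 0.5, 1.0, 10.0])  # the last value plays the role of an outlier
squared = 0.5 * errors**2

# Large delta: every error is inside the threshold, so Huber equals squared loss
large_delta = huber_loss_from_error(errors, delta=100.0)

# Small delta: every error is outside the threshold, so Huber is roughly delta * |error|
small_delta = huber_loss_from_error(errors, delta=0.01)

print(squared)
print(large_delta)
print(small_delta / 0.01)  # close to |errors|, i.e. a rescaled absolute loss
```

With the large threshold the outlier contributes a loss of 50, exactly as under squared loss, whereas with the small threshold its contribution is proportional to its absolute error.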

How to select Delta?

Selecting $\delta$ is more art than exact science. One way to select it is to fit a simple baseline regression model and look at the distribution of its residuals. You can then set $\delta$ to a percentile of the absolute residuals from the baseline model:

  • Compute $|y_i - f(x_i)|$ for all training points
  • Pick $\delta$ as the 75th, 90th, or 99th percentile to cover most inliers while treating the tail as outliers
  • You can also treat $\delta$ as a hyperparameter to be tuned and selected
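The percentile approach above can be sketched as follows. The data is synthetic and generated only for illustration, and the ordinary least-squares line from `np.polyfit` stands in for "a simple regression model"; any baseline could be used instead.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a noisy line with a few injected outliers
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[::25] += 15.0  # every 25th point becomes an outlier

# Baseline model: ordinary least-squares line
slope, intercept = np.polyfit(x, y, deg=1)
abs_residuals = np.abs(y - (slope * x + intercept))

# Candidate thresholds taken from percentiles of the absolute residuals
candidates = {q: float(np.percentile(abs_residuals, q)) for q in (75, 90, 99)}
for q, delta in candidates.items():
    print(f"{q}th percentile -> delta = {delta:.3f}")
```

Lower percentiles treat more points as outliers, while higher percentiles keep more of the data in the quadratic region. Which candidate works best depends on the data, which is why treating $\delta$ as a tuned hyperparameter is a reasonable alternative.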
