the humble sum of the squared errors

As part of my effort to master statistical theory, I’m deconstructing basic statistics principles in blog posts, on the idea that writing about the principles is the best way to learn them more deeply.

The humble sum of the squared errors (SSE) calculation has been a workhorse of statistics for the past 200 years. Here I examine it in detail, showing how the calculation appears in basic statistical models and how to interpret it. Examples are given in Python.

Squared Errors and Sum of the Squared Errors

The squared error (SE) is the square of the difference between a statistical model’s predicted output for a given input and the actual measured output for that input. It quantifies the error between a prediction and reality. Writing $y$ for the actual value and $\hat{y}$ for the predicted value, the SE looks like:

$$SE = (y - \hat{y})^2$$

The first thing we notice about this equation is that it never produces a negative value. We could imagine using the absolute difference between the actual and predicted values,

$$|y - \hat{y}|$$

as a measure of error instead, but the absolute value is harder to work with algebraically (for one thing, it is not differentiable at zero). Because the squared error is never negative, positive and negative errors cannot cancel each other out when summed together, which we plan to do.

The second thing we notice is that squaring the difference “amplifies” large errors more than small ones: doubling an error quadruples its squared error. An optimization routine working to minimize the SE therefore penalizes large errors more severely.

Given a set of observations and a predicted outcome for each observation, we can calculate the SE for each observation. If we then sum the computed SE values across all $n$ observations, we produce the sum of the squared errors (SSE) for that set of observations and predictions, where $y_i$ is the $i$-th actual value and $\hat{y}_i$ is its prediction:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

By quantifying the error across all the data in this manner, the SSE becomes a measure of the whole predictive performance of the statistical model that generated the predictions.
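To make the two definitions concrete, here is a minimal sketch that computes the SE per observation and then the SSE. The function names `squared_errors` and `sum_squared_errors` and the three made-up data points are my own, chosen only for illustration.

```python
import numpy as np

def squared_errors(y_actual, y_predicted):
    """Return the squared error of each observation (hypothetical helper)."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return (y_actual - y_predicted) ** 2

def sum_squared_errors(y_actual, y_predicted):
    """Return the SSE: the sum of the per-observation squared errors."""
    return float(np.sum(squared_errors(y_actual, y_predicted)))

# Made-up example data: the errors are +1, 0 and -2
y_actual = [3.0, 5.0, 7.0]
y_predicted = [2.0, 5.0, 9.0]

se = squared_errors(y_actual, y_predicted)       # 1, 0 and 4
sse = sum_squared_errors(y_actual, y_predicted)  # 1 + 0 + 4 = 5
print(se, sse)
```

Notice that the +1 and -2 errors do not offset each other, and that the error of size 2 contributes four times as much as the error of size 1.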

SSE and the Mean

Suppose the first statistical “model” we consider is simply the mean of a set of observations, used as the predicted value for every observation. Writing $\bar{y}$ for the mean of the $n$ observations $y_i$, the SSE equation for this model looks like:

$$SSE = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

We will see that using the mean in this way produces the lowest possible SSE for the data set, when compared to the SSE produced by any other single predicted value for the whole set.

Suppose we have 100 samples from Normal(20, 5):

import numpy.random as nprand
import numpy as np
import matplotlib.pyplot as plt

nprand.seed(1)  # fix the seed so the value quoted below is reproducible

# 100 draws from a normal distribution with mean 20 and standard deviation 5
sample = nprand.normal(20, 5, 100)
sample_mean = float(np.mean(sample))

plt.plot(range(len(sample)), sample, 'o')
plt.axhline(y=sample_mean, color='green')
plt.ylabel('Y'); plt.xlabel('Sample #')
plt.title('100 Samples from Normal(20, 5), with Mean Line')
plt.show()

print(sample_mean)    # 20.3029142604 at random seed 1


If we iterate through alternative values for the predicted value and calculate the SSE across the data using each alternative predicted value (instead of using the mean), we see that the SSE is minimized at the mean value for the data (mean = 20.3):

alternative_means = np.arange(19., 21.6, 0.05)
alt_SSE = []
for mu in alternative_means:
    SSE = 0.
    for s in sample:
        SE = (s - mu)**2
        SSE += SE
    alt_SSE.append(SSE)  # record the SSE for this candidate predicted value

plt.plot(alternative_means, alt_SSE)
plt.ylabel('SSE'); plt.xlabel('Y_predicted')
plt.title('SSE Minimized at Mean')
plt.show()


The mean and the SSE of a data set are therefore linked: of all the single values that could serve as the statistical “model” of the data, the mean is the one that minimizes the SSE.
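The plot is not the only evidence. For any candidate value $c$, expanding the square gives the identity $\sum_i (y_i - c)^2 = \sum_i (y_i - \bar{y})^2 + n(c - \bar{y})^2$, and the second term is zero only when $c = \bar{y}$. The sketch below checks that identity numerically. It draws its own sample with NumPy's newer `default_rng` generator (so the numbers differ from those above), and the seed, the helper name `sse` and the candidate grid are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)      # arbitrary seed, chosen for illustration
sample = rng.normal(20, 5, 100)     # mean 20, standard deviation 5
n = sample.size
mean = sample.mean()

def sse(c):
    """SSE when the single value c is used as the prediction for every point."""
    return float(np.sum((sample - c) ** 2))

candidates = np.linspace(19.0, 21.5, 51)

# SSE computed directly for each candidate value
sse_direct = np.array([sse(c) for c in candidates])

# SSE predicted by the identity: SSE at the mean plus n * (c - mean)^2
sse_identity = sse(mean) + n * (candidates - mean) ** 2

print(np.allclose(sse_direct, sse_identity))              # do the two agree?
print(bool(np.all(sse_direct >= sse(mean) - 1e-9)))       # is the mean never beaten?
```

Because the penalty term $n(c - \bar{y})^2$ is a parabola in $c$, this also explains the shape of the curve in the plot.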

SSE and Linear Regression

In the last section, we looked for the single model parameter that minimized the SSE (in that case, the parameter was the mean). In a simple linear regression we estimate two model parameters, a slope and an intercept, but we still seek the minimum SSE. In fact, fitting a linear model by ordinary least squares is an optimization problem: find the slope and intercept that together minimize the SSE.

Consider the following data and its regression line:

import numpy.random as nprand
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np

nprand.seed(1)  # fix the seed so the values quoted below are reproducible

# Simulated data: y = x plus Gaussian noise with standard deviation 0.1
noise = nprand.normal(0, 0.1, 100)
x = [float(i) / 100. for i in range(0, 100)]
y = [x_i + noise_i for x_i, noise_i in zip(x, noise)]

# Fit the regression line and compute its SSE
slope, intercept, r_value, p_value, slope_std_error = stats.linregress(x, y)
regression_line_y = [intercept + slope * i for i in x]
SSE_regression_line = sum((y_i - rl_y_i)**2 for y_i, rl_y_i in zip(y, regression_line_y))

plt.plot(x, y, 'o')
plt.plot(x, regression_line_y)
plt.title('Simulated Data With Regression Line')
plt.show()

print(slope, intercept, r_value)   # 1.03341865281 -0.0104839479333 0.959146803148 at seed 1


Here the computed slope is 1.03 and the computed intercept is -0.01. Now consider what happens to the SSE if we use alternative values for the slope and intercept when computing the SSE’s value:

slope_variations = np.arange(0.8, 1.2, 0.025)
intercept_variations = np.arange(-0.2, 0.2, 0.025)
SSE_list = []
for s in slope_variations:
    SSE_sub_list = []
    for i in intercept_variations:
        rl_y = [i + s * n for n in x]
        SSE = 0.
        for y_i, rl_y_i in zip(y, rl_y):
            SSE += (y_i - rl_y_i)**2
        SSE_sub_list.append(SSE)   # SSE for this (slope, intercept) pair
    SSE_list.append(SSE_sub_list)  # one row of SSE values per slope

S, I = np.meshgrid(slope_variations, intercept_variations)
Z = np.array(SSE_list)  # shape: (number of slopes, number of intercepts)
V = np.arange(0.7, 5.4, 0.05)

plt.contour(S, I, Z.transpose(), V)
plt.xlabel('Slope'); plt.ylabel('Intercept')
plt.plot(slope, intercept, 'o')
plt.title('SSE as Function of Alternative Slopes and Intercepts')


In this contour plot, the lowest point is at the slope and intercept stated above; at all other values of the slope and intercept the SSE is higher.
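For simple linear regression the minimum does not have to be found by search: setting the derivatives of the SSE with respect to the slope and the intercept to zero gives the standard closed-form least-squares solution. The sketch below applies those formulas to freshly simulated data and checks that nudging either parameter never lowers the SSE. The helper names `least_squares_line` and `line_sse`, the seed and the nudge size are my own choices, and the data is generated with `default_rng`, so the fitted values will not match the ones printed above.

```python
import numpy as np

def least_squares_line(x, y):
    """Closed-form slope and intercept that minimize the SSE (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return float(slope), float(intercept)

def line_sse(x, y, slope, intercept):
    """SSE of the line y = intercept + slope * x against the data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((y - (intercept + slope * x)) ** 2))

rng = np.random.default_rng(1)           # arbitrary seed, chosen for illustration
x = np.arange(100) / 100.0
y = x + rng.normal(0, 0.1, 100)          # y = x plus noise, as in the post

slope, intercept = least_squares_line(x, y)
best = line_sse(x, y, slope, intercept)

# Move the slope and intercept a little in every direction and recompute the SSE
nudged = [line_sse(x, y, slope + ds, intercept + di)
          for ds in (-0.05, 0.0, 0.05)
          for di in (-0.05, 0.0, 0.05)]

print(slope, intercept, best)
print(all(v >= best - 1e-12 for v in nudged))   # no nudge should beat the fit
```

The fitted values should agree with what `stats.linregress` or `np.polyfit(x, y, 1)` return for the same data, since they solve the same minimization.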

SSE and Model Comparison

The SSE may be used to compare competing models, as long as the models are tested on the same data. This is because the magnitude of the SSE depends on the data being considered, particularly on the number of data points in the test set. To compare across different test sets (I’m not sure you’d want to), use the mean squared error (MSE) instead, which divides the SSE by the number of observations $n$:

$$MSE = \frac{SSE}{n} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Alternatively, you can use the root mean squared error (RMSE) equation, which is simply the square root of the MSE:

$$RMSE = \sqrt{MSE} = \sqrt{\frac{SSE}{n}}$$

The RMSE has the advantage of being in the same units as the predicted values, rather than the square of those units, much like the standard deviation expresses spread in the same units as the mean. However, the MSE and RMSE each require one or two extra computation steps, which might slow down an optimization procedure.
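A small sketch of the three measures side by side, using made-up numbers. The function names `sse`, `mse` and `rmse` are my own. Repeating the same data twice doubles the SSE but leaves the MSE and RMSE unchanged, which is the sense in which the SSE depends on the number of data points.

```python
import numpy as np

def sse(y_actual, y_predicted):
    """Sum of the squared errors."""
    diff = np.asarray(y_actual, dtype=float) - np.asarray(y_predicted, dtype=float)
    return float(np.sum(diff ** 2))

def mse(y_actual, y_predicted):
    """Mean squared error: the SSE divided by the number of observations."""
    return sse(y_actual, y_predicted) / len(y_actual)

def rmse(y_actual, y_predicted):
    """Root mean squared error: the square root of the MSE."""
    return float(np.sqrt(mse(y_actual, y_predicted)))

# Made-up data: the errors are 2, 0, -2 and 0
y_actual = [3.0, 5.0, 7.0, 9.0]
y_predicted = [1.0, 5.0, 9.0, 9.0]

# The same observations and predictions repeated, giving twice as many points
y_actual_twice = y_actual * 2
y_predicted_twice = y_predicted * 2

print(sse(y_actual, y_predicted), sse(y_actual_twice, y_predicted_twice))    # SSE doubles
print(mse(y_actual, y_predicted), mse(y_actual_twice, y_predicted_twice))    # MSE is unchanged
print(rmse(y_actual, y_predicted), rmse(y_actual_twice, y_predicted_twice))  # RMSE is unchanged
```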

Select the model with the lowest SSE (or MSE/RMSE), after you have prevented overfitting. Overfitting is described here on this blog, and a future post will detail how to prevent it.

Note: When comparing linear models, use $R^2$ instead (described below), which offers a more intuitive scale between zero and one.

SSE and R Squared

The SSE is used to calculate $R^2$, an indicator of linear model fitness:

$$R^2 = 1 - \frac{SSE_{model}}{SSE_{\mu}}$$

Here $SSE_{\mu}$ is the SSE of the “model” where the mean is used as the sole predictor, while $SSE_{model}$ is the SSE of the linear model. The ratio compares the two. If the regression has predictive value beyond the mean of the data by itself, $SSE_{model}$ will be much smaller than $SSE_{\mu}$, the ratio will be low, and $R^2$ will therefore be closer to one.

In code:

mean_y = np.mean(y)
SSE_mean_y = 0.
for yi in y:
    SSE_mean_y += (yi - mean_y)**2.
print(1. - (SSE_regression_line / SSE_mean_y))  # R squared from the two SSE values
print(r_value**2)                               # R squared from linregress, for comparison
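The same calculation can be wrapped in a small function, sketched here with made-up numbers. The name `r_squared` is my own. Two boundary cases show the scale: predicting the mean for every point gives zero, and perfect predictions give one.

```python
import numpy as np

def r_squared(y_actual, y_predicted):
    """R squared as 1 - SSE_model / SSE_mu (hypothetical helper)."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    sse_model = np.sum((y_actual - y_predicted) ** 2)
    sse_mu = np.sum((y_actual - y_actual.mean()) ** 2)
    return float(1.0 - sse_model / sse_mu)

y_actual = [1.0, 2.0, 3.0, 4.0]      # mean is 2.5, so SSE_mu = 5.0

# Predictions that miss two points by 0.5 each: SSE_model = 0.5
close_fit = r_squared(y_actual, [1.5, 2.0, 3.0, 3.5])

# Predicting the mean everywhere: SSE_model equals SSE_mu
mean_only = r_squared(y_actual, [2.5, 2.5, 2.5, 2.5])

# Perfect predictions: SSE_model is zero
perfect = r_squared(y_actual, y_actual)

print(close_fit, mean_only, perfect)
```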


Mixing “Squared Error” with “Standard Error”

A long time ago in a class I mixed up squared error with standard error. Don’t do this.

Post Author: badassdatascience
