Thursday, June 12, 2025

Machine Learning Model Errors

In this post, I describe different machine learning model errors and conduct a simulation to illustrate their behavior as model complexity, or flexibility, increases. Note: Because of my use of \(\LaTeX\) typesetting, this post is best viewed in "web" mode (as opposed to "mobile").

1. Setup and Motivating Problem

Consider two random variables \(X\) and \(Y\) with the following joint probability distribution \(F_{X, Y}\):

\begin{align} X &\sim Uniform(0, 10) \\ Y | X &\sim Normal(f(X), \sigma^2), \end{align}

where \(f(x) = x^2 - 8x + 20\) and \(\sigma^2 = 25\). The quadratic polynomial \(f(x)\) represents the "signal" in the relationship between \(X\) and \(Y\), whereas the variance parameter \(\sigma^2\) quantifies the "noise." Let \(\mathcal{T} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}\) be a random sample, or training set, of size \(n = 25\) drawn from \(F_{X, Y}\).
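As an illustration, the data-generating process can be simulated in a few lines of Python (a sketch; the function names are mine, not part of the setup):

```python
import numpy as np

def f(x):
    """True signal: f(x) = x^2 - 8x + 20."""
    return x**2 - 8*x + 20

def draw_training_set(n=25, sigma=5.0, rng=None):
    """Draw a training set of size n from the joint distribution F_{X,Y}."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(0.0, 10.0, size=n)   # X ~ Uniform(0, 10)
    y = rng.normal(f(x), sigma)          # Y | X ~ Normal(f(X), sigma^2 = 25)
    return x, y
```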

The motivating problem is to use \(\mathcal{T}\) and least squares polynomial regression to build a model that predicts \(Y\) based on \(X\). The figure below shows a scatter plot of one possible training set. Also plotted are \(f(x)\) and regression fits \(\hat{f}(x)\) for six different polynomial models: constant, linear, quadratic, cubic, quartic, and quintic. As the degree of the polynomial increases, so does the number of model parameters (i.e., regression coefficients). Consequently, the model becomes more flexible and fits the training set better.
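The six fits can be reproduced with `np.polyfit`, which solves the least squares problem for a given polynomial degree (a self-contained sketch; variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=25)      # one training set from F_{X,Y}
y = rng.normal(x**2 - 8*x + 20, 5.0)

# Fit polynomials of degree 0 (constant) through 5 (quintic).
fits = {d: np.poly1d(np.polyfit(x, y, deg=d)) for d in range(6)}

# Training MSE for each degree; it can only decrease as the degree grows,
# because the models are nested.
mse = {d: np.mean((y - fits[d](x))**2) for d in range(6)}
```

Because each lower-degree model is a special case of the next, adding a degree can never increase the least squares training fit, which is the flexibility effect described above.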

2. Training Error

The difference between the observed and predicted \(Y\) values on \(\mathcal{T}\) can be summarized by the model's training error. Training error is defined by

\begin{equation} Err_{\mathcal{T}}^{Train} = \frac{1}{n} \sum_{i=1}^{n} L \left(y_i, \hat{f}(x_i)\right),\tag{1} \end{equation}

where \(L\) is a loss function. Throughout this post, I use squared error loss:

\[L \left(y_i, \hat{f}(x_i)\right) = \left(y_i - \hat{f}(x_i)\right)^2.\]

Training error is an optimistic measure of the predictive performance of \(\hat{f}\) because the model is fit and evaluated on the same dataset.
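With squared error loss, equation (1) is just the mean squared residual on \(\mathcal{T}\). A sketch:

```python
import numpy as np

def training_error(x, y, fhat):
    """Equation (1) with squared error loss: mean squared residual on T."""
    return np.mean((y - fhat(x))**2)

# Example on one simulated training set (quadratic fit).
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=25)
y = rng.normal(x**2 - 8*x + 20, 5.0)
fhat = np.poly1d(np.polyfit(x, y, deg=2))
err_train = training_error(x, y, fhat)
```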

3. In-Sample Error

Another type of model error is in-sample error, which quantifies how well \(\hat{f}\) predicts new \(Y\) values for the same \(X\) values observed in \(\mathcal{T}\). This provides insight into the optimism of training error described above. In-sample error is defined by

\begin{equation} Err_{\mathcal{T}}^{In} = \frac{1}{n} \sum_{i=1}^{n} E_{Y_i^{New} \mid \mathcal{T}} \left[ L \left(Y_i^{New}, \hat{f}(x_i)\right) \mid \mathcal{T} \right],\tag{2} \end{equation}

where \(Y_i^{New} \overset{ind}{\sim} F_{Y \mid X=x_i}\) \((i=1, \ldots, n)\). To emphasize, in-sample error is conditional on the training set \(\mathcal{T}\) and the fitted model \(\hat{f}\). In my setup, \(F_{X, Y}\) is both known and simple, so I can calculate \(Err_{\mathcal{T}}^{In}\) exactly.
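The exact calculation is straightforward under squared error loss: the inner expectation in (2) is \(E[(Y_i^{New} - \hat{f}(x_i))^2 \mid \mathcal{T}] = \sigma^2 + (f(x_i) - \hat{f}(x_i))^2\), so in-sample error is \(\sigma^2\) plus the average squared bias of \(\hat{f}\) at the training inputs. A sketch (function names are mine):

```python
import numpy as np

def f(x):
    """True signal: f(x) = x^2 - 8x + 20."""
    return x**2 - 8*x + 20

def in_sample_error(x, fhat, sigma2=25.0):
    """Equation (2) computed exactly:
    sigma^2 plus the average of (f(x_i) - fhat(x_i))^2 over the training inputs."""
    return sigma2 + np.mean((f(x) - fhat(x))**2)
```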

4. Test Error

A superior measure of predictive performance comes from applying \(\hat{f}\) to new, previously unseen observations from \(F_{X, Y}\). Test error, also known as generalization error or out-of-sample error, is defined by

\begin{equation} Err_{\mathcal{T}}^{Test} = E_{X^{New}, Y^{New} \mid \mathcal{T}} \left[ L \left(Y^{New}, \hat{f}(X^{New})\right) \mid \mathcal{T} \right],\tag{3} \end{equation}

where \((X^{New}, Y^{New}) \sim F_{X, Y}\). As with in-sample error, test error is conditional on the training set \(\mathcal{T}\) and the fitted model \(\hat{f}\). In my setup, I can calculate \(Err_{\mathcal{T}}^{Test}\) exactly. In practice, however, test error is typically estimated by the average loss on an independent test set.

5. Expected Errors

Training, in-sample, and test errors in machine learning are defined in terms of a particular training set \(\mathcal{T}\) and fitted model \(\hat{f}\). To understand the expected behavior of these errors, one can consider averaging over all possible training sets of size \(n\) from \(F_{X, Y}\). The theoretical expected training, in-sample, and test errors are defined by the following:

\begin{equation} Err^{Train} = E_{\mathcal{T}} \left[Err_{\mathcal{T}}^{Train} \right]\tag{4} \end{equation} \begin{equation} Err^{In} = E_{\mathcal{T}} \left[Err_{\mathcal{T}}^{In} \right]\tag{5} \end{equation} \begin{equation} Err^{Test} = E_{\mathcal{T}} \left[Err_{\mathcal{T}}^{Test} \right]\tag{6} \end{equation}

In practice, these expected errors are estimated via resampling methods such as cross-validation and the bootstrap.
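For instance, a \(K\)-fold cross-validation estimate of expected test error can be sketched as follows (function and variable names are mine; this is one of several reasonable implementations):

```python
import numpy as np

def cv_error(x, y, degree, k=5, rng=None):
    """K-fold cross-validation estimate of expected test error
    for a least squares polynomial model of the given degree."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)   # hold out this fold
        fhat = np.poly1d(np.polyfit(x[train], y[train], deg=degree))
        errs.append(np.mean((y[fold] - fhat(x[fold]))**2))
    return float(np.mean(errs))
```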

6. Simulation

To study all of these conditional and expected model errors, I conduct a simulation using my setup. I begin by randomly drawing \(B = 5{,}000\) training sets of size \(n = 25\) from \(F_{X, Y}\). For each training set \(\mathcal{T}_b\) \((b = 1, \ldots, B)\), I fit the six polynomial regression models (constant, linear, quadratic, cubic, quartic, and quintic) and calculate their training, in-sample, and test errors (\(Err_{\mathcal{T}_b}^{Train}\), \(Err_{\mathcal{T}_b}^{In}\), and \(Err_{\mathcal{T}_b}^{Test}\)). To estimate the expected errors (\(Err^{Train}\), \(Err^{In}\), and \(Err^{Test}\)), I average the \(B\) corresponding conditional values. The table below displays these estimates of expected error.

Estimates of expected error (models ordered by increasing flexibility):

| Model | Training | In-Sample | Test |
|---|---|---|---|
| Constant | \(109.93\) | \(111.51\) | \(118.27\) |
| Linear | \(73.24\) | \(76.70\) | \(89.35\) |
| Quadratic | \(22.07\) | \(27.99\) | \(28.50\) |
| Cubic | \(21.03\) | \(29.03\) | \(30.79\) |
| Quartic | \(20.04\) | \(30.03\) | \(35.28\) |
| Quintic | \(19.04\) | \(31.02\) | \(53.71\) |
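The training-error part of the simulation can be sketched as follows (a smaller \(B\) than the post's \(5{,}000\), for brevity; variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
B, n, sigma = 500, 25, 5.0
f_true = np.poly1d([1.0, -8.0, 20.0])   # f(x) = x^2 - 8x + 20

train_err = np.zeros((B, 6))
for b in range(B):
    x = rng.uniform(0.0, 10.0, n)        # one training set per replicate
    y = rng.normal(f_true(x), sigma)
    for d in range(6):                   # constant through quintic
        fhat = np.poly1d(np.polyfit(x, y, deg=d))
        train_err[b, d] = np.mean((y - fhat(x))**2)

# Monte Carlo estimates of Err^Train, one per model.
expected_train = train_err.mean(axis=0)
```

For a correctly specified degree-\(d\) model, theory gives \(Err^{Train} = \sigma^2 (n - d - 1)/n\), e.g., \(25 \cdot 22/25 = 22\) for the quadratic model, which matches the tabled \(22.07\).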

The figure below plots training and test error as a function of model flexibility for the first \(100\) training sets. Curves for the estimated expected errors are plotted with extra thickness. For reference, I also include a horizontal line at \(\sigma^2 = 25\), the noise variance. Observe that test error (in orange) always lies above this line; this is no accident, since under squared error loss test error equals \(\sigma^2\) plus the average squared bias of \(\hat{f}\), and so can never fall below the noise variance.

My simulation illustrates the following points:

  • Training error decreases as model flexibility increases.
  • The constant and linear models are simplistic and suffer from underfitting.
  • The minimum expected test error is achieved by the quadratic model. This was anticipated, since the true relationship between \(X\) and \(Y\) is quadratic \(\left(f(x) = x^2 - 8x + 20\right)\).
  • With increased flexibility, the cubic model starts to fit the noise in the data. However, it does not perform too badly in terms of test error.
  • The most flexible models, quartic and quintic, fit the noise even more closely and exhibit overfitting. They do not generalize well when predicting \(Y\) for new values of \(X\).