Learning Curves in Machine Learning

Dr. Jahed Naghipoor
5 min read · Nov 1, 2020


In this article, I want to explain learning curves on the training and dev sets.

A learning curve shows a measure of predictive performance as a function of the amount of learning effort. The most common form in machine learning plots predictive accuracy on held-out test examples as a function of the number of training examples.

The learning curve can be used to detect whether the model has high bias or high variance; for more information, see the bias-variance tradeoff.

A learning curve plots your dev set error against the number of training examples. To plot it, you would run your algorithm using different training set sizes. For example, if you have 1,000 examples, you might train separate copies of the algorithm on 100, 200, 300, …, 1,000 examples. Then you could plot how dev set error varies with the training set size. Here is an example:
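As a concrete starting point, here is a minimal sketch of this procedure using scikit-learn's learning_curve; the synthetic dataset and logistic regression model are illustrative assumptions, not part of the original example:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a 1,000-example dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Train on growing fractions of the cross-validation training split;
# each fold's held-out part plays the role of the dev set.
sizes, train_scores, dev_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="accuracy",
)

dev_error = 1.0 - dev_scores.mean(axis=1)  # error = 1 - accuracy
plt.plot(sizes, dev_error, "r-o", label="dev error")
plt.xlabel("Training set size")
plt.ylabel("Dev set error")
plt.legend()
plt.show()
```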

Normally, as the training set size increases, the dev set error should decrease.

You can also add the desired level of performance to your learning curve, like this:

But if the dev error curve has flattened out, then you can immediately tell that adding more data won’t get you to your goal:
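As a rough way to operationalize "flattened out", you could check whether the tail of the dev error curve has stopped improving. The window size and tolerance below are assumptions you would tune to your problem:

```python
import numpy as np

def has_flattened(dev_error, window=3, tol=1e-3):
    """Crude plateau check: did the last `window` points change by < `tol`?"""
    tail = np.asarray(dev_error)[-window:]
    return tail.max() - tail.min() < tol

# e.g., with dev_error from the sketch above:
# if has_flattened(dev_error): adding data alone probably won't reach the goal
```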

Looking at the learning curve might therefore help you avoid spending months collecting twice as much training data, only to realize it does not help.

Takeaway: Dev set (and test set) error should decrease as the training set size grows.

One downside of this process is that, if you only look at the dev error curve, it can be hard to extrapolate and predict exactly where the red curve will go if you have more data. There is one additional plot that can help you estimate the impact of adding more data: the training error.

Dev error vs. Training error

You can see that the blue “training error” curve increases with the size of the training set. Furthermore, your algorithm usually does better on the training set than on the dev set; thus the red dev error curve usually lies strictly above the blue training error curve.

Suppose we add the training error curve to the plot and get the following:
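Continuing the earlier sketch, you can overlay the training error and the desired performance level on the same axes; the threshold value here is an illustrative assumption:

```python
import matplotlib.pyplot as plt

# Reuses sizes, train_scores, and dev_scores from the earlier sketch.
train_error = 1.0 - train_scores.mean(axis=1)
dev_error = 1.0 - dev_scores.mean(axis=1)
desired_error = 0.05  # illustrative target, not from the article

plt.plot(sizes, train_error, "b-o", label="training error")
plt.plot(sizes, dev_error, "r-o", label="dev error")
plt.axhline(desired_error, color="g", linestyle="--", label="desired performance")
plt.xlabel("Training set size")
plt.ylabel("Error")
plt.legend()
plt.show()
```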

If we have such a plot, we can be sure that adding more data will not, by itself, be sufficient. Why? There are two main reasons:

  1. As we add more training data, training error can only get worse. Thus, the blue training error curve can only stay the same or go higher, and thus it can only get further away from the (green line) level of desired performance.
  2. The red dev error curve is usually higher than the blue training error curve. Thus, there is almost no way that adding more data would bring the red dev error curve down to the desired level of performance when even the training error is above that level.

Takeaway: Examining both the dev error curve and the training error curve on the same plot allows us to more confidently extrapolate the dev error curve.

Consider this learning curve:

The blue training error curve is relatively low, and the red dev error curve is much higher than the blue training error. Thus, the bias is small, but the variance is large. Adding more training data will probably help close the gap between the dev error and the training error.

Now, consider this:

This time, the training error is large, as it is much higher than the desired level of performance. The dev error is also much larger than the training error. Thus, you have significant bias and significant variance. You will have to find a way to reduce both bias and variance in your algorithm.
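To make these readings concrete, here is a rough rule-of-thumb diagnostic; the tolerance and the error values in the usage example are illustrative assumptions, not standard constants:

```python
def diagnose(train_err, dev_err, desired_err, tol=0.02):
    """Classify a learning curve's endpoint into bias/variance regimes."""
    high_bias = train_err > desired_err + tol       # training error far above target
    high_variance = (dev_err - train_err) > tol     # large train/dev gap
    if high_bias and high_variance:
        return "high bias and high variance: reduce both"
    if high_variance:
        return "high variance: more data or regularization should help"
    if high_bias:
        return "high bias: more data alone won't help; increase model capacity"
    return "close to desired performance"

print(diagnose(train_err=0.15, dev_err=0.30, desired_err=0.05))
# -> "high bias and high variance: reduce both"
```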

Plotting a learning curve may be computationally expensive. Thus, instead of evenly spacing out the training set sizes on a linear scale as above, you might train models with 1,000, 2,000, 4,000, 6,000, and 10,000 examples. This should still give you a clear sense of the trends in the learning curves. Of course, this shortcut matters only when the computational cost of training all the additional models is significant.
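A sketch of that spacing, assuming you drive learning_curve (or your own training loop) with absolute set sizes; either form can be passed as train_sizes:

```python
import numpy as np

# The article's hand-picked, roughly log-spaced sizes ...
sizes = [1_000, 2_000, 4_000, 6_000, 10_000]

# ... or a geometric grid generated programmatically:
sizes = np.geomspace(1_000, 10_000, num=5).astype(int)  # [1000, 1778, 3162, 5623, 10000]
```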

Conclusions:

  1. If the model suffers from high bias, then as the sample size increases, the training error will increase and the cross-validation (dev) error will decrease; eventually the two curves come very close together, but both plateau at a high error rate. Increasing the sample size will not help much with a high bias problem.
  2. If the model suffers from high variance, then as you keep increasing the sample size, the training error will keep increasing and the cross-validation (dev) error will keep decreasing, and they will end up at low training and dev error rates. So more samples will help improve the model's prediction performance if the model suffers from high variance (see the simulation sketch after this list).
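A small simulation can illustrate the second behavior; the capacity-limited tree, the label noise, and the sizes are all assumptions chosen to produce a high-variance regime at small sample sizes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data; flip_y injects label noise the tree cannot memorize at scale.
X, y = make_classification(n_samples=5000, n_features=20, flip_y=0.1, random_state=0)

sizes, train_scores, dev_scores = learning_curve(
    DecisionTreeClassifier(max_depth=10, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Train error creeps up toward the noise floor while dev error falls,
# so the gap between them narrows as the sample size grows.
for n, tr, dv in zip(sizes, 1 - train_scores.mean(axis=1), 1 - dev_scores.mean(axis=1)):
    print(f"n={n:5d}  train error={tr:.3f}  dev error={dv:.3f}")
```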
