# The problem of overfitting in machine learning algorithms

Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.


By now in my First approach to machine learning series I have written about two of the most famous building block algorithms: linear regression and logistic regression. We know they can be very powerful, yet as the number of features increases they might both suffer from a common problem: overfitting. Let's take a closer look at it.

## Overfitting: what is it about?

Overfitting, or high variance, happens when your hypothesis function $h_\theta(x)$ tries too hard to fit the training set. The result is that the learned hypothesis function will work great with the training set you initially provided, but will fail to generalize to new examples.

Your algorithm can also face the opposite problem, called underfitting, or high bias: the hypothesis function doesn't fit the data you provided in the training stage well enough.

Figure 1. Examples of underfitting and overfitting in linear regression.

Figure 1 above depicts two cases of wrong fitting during a linear regression task (the famous house price prediction problem we saw in earlier articles). The top-left plot shows underfitting: the hypothesis function is linear and doesn't fit the data well. The top-right plot shows a quadratic function: visually it seems to generalize the data pretty well. This is the way to go. The last plot at the bottom shows instead an example of overfitting, where the curve touches all the points in the training set (which looks good), but with too much wiggling. Such a hypothesis function will struggle to predict prices for new examples outside the training set.
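The three situations can be reproduced in a few lines of numpy. This is a sketch on synthetic data of my own making (not the article's actual house data set): a noisy quadratic trend fitted with polynomials of increasing degree. Degree 1 underfits, degree 2 fits well, and degree 7 interpolates the eight training points exactly, nailing the training set but doing worse on fresh examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "house price"-style data: a noisy quadratic trend,
# split into a small training set and a separate test set.
x_train = np.linspace(0, 4, 8)
y_train = 1 + 2 * x_train - 0.3 * x_train**2 + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0.1, 3.9, 50)
y_test = 1 + 2 * x_test - 0.3 * x_test**2 + rng.normal(0, 0.2, x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Degree 1 underfits, degree 2 generalizes well, degree 7 overfits:
# it drives the training error to (almost) zero but not the test error.
for degree in (1, 2, 7):
    c = np.polyfit(x_train, y_train, degree)
    print(f"degree {degree}: train MSE {mse(c, x_train, y_train):.4f}, "
          f"test MSE {mse(c, x_test, y_test):.4f}")
```

The training error alone is a misleading score: it keeps shrinking as the polynomial degree grows, which is exactly why overfitting is easy to miss without held-out data.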

Logistic regression too can suffer from bad fitting. As in linear regression, the problem lies in the hypothesis function trying too hard to separate the training examples.

Figure 2. Examples of underfitting and overfitting in logistic regression.

The top-left plot in figure 2 above shows an underfitting logistic regression hypothesis function. The data is not well separated by such a simplistic linear function. Conversely, the slightly more complex function in the top-right figure does a good job of fitting the data. The third plot at the bottom shows overfitting: the hypothesis function tries too hard to separate elements in the training set, producing a highly twisted decision boundary.

## How to limit overfitting

In a perfect universe your machine learning problem deals with low-dimensional data, that is, few input features. This is something we saw earlier in our house price prediction examples: we only had "size" and "price" parameters back then, producing a straightforward 2-D plot. Spotting overfitting is easy there: just plot the hypothesis function and adjust the formula accordingly.

However, this solution becomes impractical when your learning problem deals with lots (hundreds) of features. There are two main options here:

• Reduce the number of features — you manually look through the list of features and decide which are the most important ones to keep. A class of algorithms called model selection algorithms can also select the most relevant features automatically: we will take a look at them in future chapters. Either way, reducing the number of features fixes the overfitting problem, but it is a less than ideal solution: you are throwing away precious information you have about the problem.

• Regularization — you keep all the features, but reduce the magnitude (i.e. the value) of each parameter $\theta_j$. This method works well when you have a lot of features, each of which contributes a bit to predicting the output $y$. Let's take a look at this technique and apply it to the learning algorithms we already know, in order to keep overfitting at bay.

## A regularized cost function

Say we have two hypothesis functions for the same data set (take a look at figure 1 as a reference): the first one is $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^{2}$ and it works well; the second one is $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$ and it suffers from overfitting. The training data is the same, so the second function must have something wrong in its formula. It turns out that the two extra parameters $\theta_3$ and $\theta_4$ contribute too much to the curliness of the function.

The core idea: penalize those additional parameters and make them very small, so that they contribute less, or not at all, to the shape of the function. If we set $\theta_3 \approx 0$ and $\theta_4 \approx 0$ (in words: set them to very small values, next to zero) we would basically end up with the first function, which fits the data well.

A real implementation of a regularized cost function pushes the idea even further: all parameter values are reduced by some amount, producing a somewhat simpler, or smoother, hypothesis function (and this can be proven mathematically).

For example, say we are working with the usual linear regression problem of house price prediction. We have:

• $100$ features: $x_1, x_2, \cdots, x_{100}$ (size, number of floors, ...) that produce...
• $101$ parameters: $\theta_0, \theta_1, \cdots, \theta_{100}$ (one per feature, plus the intercept $\theta_0$)

Of course it is nearly impossible to know in advance which parameters contribute more or less to the overfitting issue. So in regularization we modify the cost function to shrink all parameters by some amount.

The original cost function for linear regression is:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

The regularized version adds an extra term, called the regularization term, that shrinks all the parameters:

$$J_{reg}(\theta) = \frac{1}{2m} \bigg[\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2\bigg]$$

The lambda symbol ($\lambda$) is called the regularization parameter and is responsible for a trade-off between fitting the training set well and keeping each parameter small. The second sum runs over the $n$ features. By convention the first parameter $\theta_0$ is left unprocessed, as the loop in the regularization term starts from 1 (i.e. $j=1$).

The regularization parameter must be chosen carefully. If it's too large, it will crush all the parameters except the first one, ending up with a hypothesis function like $h_\theta(x) = \theta_0$ where all the other $\theta$s are next to zero. Such a function is a simple horizontal line, which of course doesn't fit the data well (it suffers from underfitting).
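To make the formula concrete, here is a minimal numpy sketch of the regularized cost (the function and variable names are my own, not from any library): `X` is the $m \times (n+1)$ design matrix whose first column is all ones, and $\theta_0$ is excluded from the penalty, as just discussed.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized cost J_reg for linear regression.

    X: m x (n+1) design matrix, first column all ones (the x_0 = 1 trick).
    theta: n+1 parameters; theta[0] is theta_0 and is never penalized.
    lam: the regularization parameter lambda.
    """
    m = y.size
    residuals = X @ theta - y                # h_theta(x^(i)) - y^(i), all examples
    penalty = lam * np.sum(theta[1:] ** 2)   # the sum starts from j = 1
    return (np.sum(residuals ** 2) + penalty) / (2 * m)
```

With `lam = 0` this reduces to the original cost; cranking `lam` up makes any nonzero $\theta_j$ (except $\theta_0$) expensive, which is exactly the underfitting risk described above.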

There are some advanced techniques to find the regularization parameter automatically. We will take a look at some of those in future episodes. For now, let's see how to apply the regularization process to linear regression and logistic regression algorithms.

## Regularized linear regression

In order to build a regularized linear regression, we have to tweak the gradient descent algorithm. The original one, outlined in a previous chapter, looked like this:

\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \newline \; & \cdots \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} & \newline \rbrace \end{align*}

Nothing new, right? The derivative has already been computed and plugged into the formula. In order to upgrade this into a regularized algorithm, we have to figure out the derivative of the new regularized cost function seen above.

As always, I will spare you the burden of the manual computation. This is how it looks:

$$\frac{\partial}{\partial \theta_j} J_{reg}(\theta) = \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j$$
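If you want to convince yourself the derivative is right without redoing the calculus, a quick numerical sanity check works well. This is my own sketch (random data, names of my own choosing, not part of the original course material): it compares the analytic gradient against a centered finite difference of the regularized cost.

```python
import numpy as np

def cost(theta, X, y, lam):
    """The regularized linear regression cost J_reg."""
    m = y.size
    r = X @ theta - y
    return (r @ r + lam * np.sum(theta[1:] ** 2)) / (2 * m)

def grad(theta, X, y, lam):
    """The analytic derivative written above; theta_0 is not regularized."""
    m = y.size
    g = X.T @ (X @ theta - y) / m
    g[1:] += lam / m * theta[1:]
    return g

# Compare against a centered finite difference on random data.
rng = np.random.default_rng(1)
X = np.c_[np.ones(5), rng.normal(size=(5, 2))]
y = rng.normal(size=5)
theta = rng.normal(size=3)
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y, 2.0) - cost(theta - eps * e, X, y, 2.0)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(numeric - grad(theta, X, y, 2.0))))  # practically zero
```

Gradient checking like this is a standard trick whenever you hand-derive a cost function's derivative: a large discrepancy means either the math or the code is wrong.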

And now we are ready to plug it into the gradient descent algorithm. Note how the first parameter $\theta_0$ is left unprocessed, as we said earlier:

\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \newline \; & \cdots \newline \; & \theta_j := \theta_j - \alpha \left[\frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] & \newline \rbrace \end{align*}

We can do even better. Let's group together all the terms that depend on $\theta_j$ (with $\theta_0$ left untouched, as always):

\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \newline \; & \cdots \newline \; & \theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} & \newline \rbrace \end{align*}

I have simply rearranged things in each $\theta_j$ update, except for $\theta_0$. This alternative way of writing shows the regularization in action: as you can see, the term $(1 - \alpha \frac{\lambda}{m})$ multiplies $\theta_j$ and is responsible for its shrinkage. The rest of the equation is the same as before regularization.
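The grouped update translates almost line by line into code. A numpy sketch (again with names of my own choosing): the shrink factor is applied to every parameter except $\theta_0$, then an ordinary gradient step follows for all of them.

```python
import numpy as np

def gradient_descent_reg(X, y, alpha, lam, iters):
    """Regularized gradient descent for linear regression (a sketch).

    X is the m x (n+1) design matrix whose first column is all ones,
    so theta[0] plays the role of theta_0 and is never shrunk.
    """
    m = y.size
    theta = np.zeros(X.shape[1])
    shrink = 1 - alpha * lam / m              # the (1 - alpha*lambda/m) factor
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m      # gradient of the unregularized cost
        theta[1:] *= shrink                   # shrink every theta_j except theta_0
        theta -= alpha * grad                 # ordinary gradient step for all j
    return theta
```

With `lam = 0` the shrink factor is exactly 1 and the loop reduces to plain gradient descent; any positive `lam` pulls every $\theta_j$ (but not $\theta_0$) toward zero at each iteration.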

## Regularized logistic regression

As seen in the previous article, the gradient descent algorithm for logistic regression looks identical to the linear regression one. So the good news is that we can apply the very same regularization trick as we did above in order to shrink the $\theta$ parameters.

The original, non-regularized cost function for logistic regression looks like:

$$J(\theta) = - \dfrac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1-h_\theta(x^{(i)}))\right]$$

As we did with regularized linear regression, we have to add the regularization term that shrinks all the parameters. It looks slightly different here:

$$J_{reg}(\theta) = - \dfrac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
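In code, this regularized cost differs from the linear regression one only in the hypothesis (the sigmoid) and the cross-entropy part. A minimal numpy sketch with my own naming, where `X` again has a leading column of ones and $\theta_0$ stays out of the penalty:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_reg(theta, X, y, lam):
    """Regularized logistic regression cost; theta_0 is excluded from the penalty."""
    m = y.size
    h = sigmoid(X @ theta)                       # h_theta(x) for every example
    cross_entropy = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return cross_entropy + lam / (2 * m) * np.sum(theta[1:] ** 2)
```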

What is the derivative of this baby?

$$\frac{\partial}{\partial \theta_j} J_{reg}(\theta) = \dfrac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j$$

We are now ready to plug it into the gradient descent algorithm for logistic regression. As always, the first parameter $\theta_0$ is left unprocessed:

\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)} \newline \; & \cdots \newline \; & \theta_j := \theta_j - \alpha \left[\dfrac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] & \newline \rbrace \end{align*}
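As a final sketch, here is that whole loop in numpy (my own naming, not from the course): identical in shape to the linear regression loop, with the hypothesis changed from $X\theta$ to $\text{sigmoid}(X\theta)$, and $\theta_0$ kept out of the regularization term exactly as in the update rule above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd_reg(X, y, alpha, lam, iters):
    """Regularized gradient descent for logistic regression (a sketch).

    X: m x (n+1) design matrix, first column all ones.
    y: labels in {0, 1}.
    """
    m = y.size
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m   # unregularized gradient
        grad[1:] += lam / m * theta[1:]             # lambda/m * theta_j, j >= 1 only
        theta -= alpha * grad
    return theta
```

On a tiny separable data set the learned $\theta$ gives a sensible decision boundary, while the penalty keeps the parameters from growing without bound, which is what produces the overly twisted boundaries of figure 2.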

## Sources

Machine Learning @ Coursera - The Problem of Overfitting
Machine Learning @ Coursera - Cost Function
Machine Learning @ Coursera - Regularized Linear Regression
Machine Learning @ Coursera - Regularized Logistic Regression
