Like it!

Join us on Facebook!

Like it!

How to optimize the gradient descent algorithm

A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.

Other articles from this series

Real-world data can come up in different orders of magnitude. For example, your age ranges from 0 to 100 years, while your yearly income from €10,000 to €10,000,000 (and more). Using such unprocessed data as input features for a linear regression system might slow down the gradient descent algorithm to a crawl.

It happens because — as we will see shortly — such not normalized data warps the cost function the gradient descent has to process, making the minimum point really difficult to reach.

Because of that, an important trick in machine learning and in linear regression is to make sure that all the input features are on a similar scale. This is a preparatory step you do in order to optimize the input data, known as feature scaling.

Feature scaling

In feature scaling you basically normalize your input values. For example, say you have two features:

  • [texi]x_1[texi] as the yearly income (10,000-10,000,000);
  • [texi]x_2[texi] as the age (0-100).

Below you will find a contour plot for the cost function [texi]J(\theta_1, \theta_2)[texi] as if we were using the raw, unprocessed values. As you may see the result is a very thin and stretched version of it. The gradient descent algorithm would oscillate a lot back and forth, taking a long time before finding its way to the minimum point.

Stretched version of the contour plot.
1. A stretched contour plot, due to missing input feature scaling.

With feature scaling we will bring back the original bowl-shaped figure in order to let the gradient descent algorithm do its job efficiently. You have to options here: min-max scaling or standardization.

Min-max scaling

The idea is to get every input feature into approximately a [texi][-1,1][texi] range. The name comes from the use of [texi]\min[texi] and [texi]\max[texi] functions, namely the smallest and greatest values in your dataset. It requires dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable:

[tex] x_i^{\prime} = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)} [tex]

where [texi]x_i[texi] is the original [texi]i[texi]-th input value, [texi]x_i^{\prime}[texi] is the normalized version.

For example, say I'm dealing with the yearly income [texi]x_1[texi] and in particular I want to normalize the value of $30,000:

[tex] x_1^{\prime} = \frac{30,000 - 10,000}{10,000,000 - 10,000} \approx 0.002 [tex]

Just rinse and repeat such normalization for every value in your dataset. Of course if you are in a multivariate scenario remember to skip feature [texi]x_0[texi], since [texi]x_0 = 1[texi] as seen in the previous episode.


This technique goes also under the name of z-score normalization and many other confusing aliases I wish I could forget. In brief, you transform your data set so that the values follow the property of a normal distribution, namely with mean 0 ([texi]\mu = 0[texi]) and standard deviation 1 ([texi]\sigma = 1[texi]). Unlike min-max scaling, with standardization you are thinking in terms of how many standard deviations a value is far from the mean of the entire data set.

The general formula for standardization:

[tex] x_i = \frac{x_i - \mu_i}{\sigma_i} [tex]

Following the links to my previous articles above I'm able to compute the mean and the standard deviation on my data set. I'll show you an example with the yearly income [texi]x_1[texi]. Suppose I have collected five samples: $10,000, $30,000, $32,000, $35,000, $150,000. The mean and the standard deviation are:

[tex] \begin{equation} \mu_1 = 51,400 \qquad \sigma_1 \approx 50,078 \end{equation} [tex]

Now, let's apply the standardization to the value of $30,000 as I did before with the min-max scaling:

[tex] x_1 = \frac{30,000 - 51,400}{50,078} \approx -0.4 [tex]

You can read it as [texi]-0.4[texi] standard deviations ([texi]-0.4\text{STD}[texi]) from the mean.

Rinse and repeat the procedure for every value in your dataset as for the min-max scaling, and remember to skip [texi]x_0[texi] in multivariate problems.

Using standardization is important when you are comparing measurements that have different units, like years and dollars. It is also a general requirement for many machine learning algorithms besides linear regression. As a rule of thumb I'd say: when in doubt, just standardize the data, it shouldn't hurt.

Debug the gradient descent to make sure it is working properly

You want to know if the gradient descent is working correctly. Since the job of the gradient descent is to find the value of [texi]\theta[texi]s that minimize the cost function, you could plot the cost function itself (i.e. its output) and see how it behaves as the algorithm runs.

The image below shows what I mean. The number of iterations on the horizontal axis, the cost function output on the vertical one. On each iteration the gradient descent churns out new [texi]\theta[texi]s values: you take those values and evaluate the cost function [texi]J(\theta)[texi]. You should see a descending curve if the algorithm behaves well: it means that it's minimizing the value of [texi]\theta[texi]s correctly.

More generally, the gradient descent works properly when [texi]J(\theta)[texi] decreases after every iteration.

Cost function plot.
2. Plot of the cost function as it gets minimized by the gradient descent algorithm.

Plotting [texi]J(\theta)[texi] also tells you whether or not the gradient descent has converged. Different problems require different number of iterations until convergence, so in general you can assume that the algorithm has found a minimum when [texi]J(\theta)[texi] decreases less than some small value [texi]\epsilon[texi] in one iteration.

Choosing a proper value [texi]\epsilon[texi] is not an easy task. Some people set it to value [texi]10^-3[texi] and also automatize the task in what is called automatic convergence test: their algorithm stops when [texi]J(\theta)[texi] has decreased less than [texi]\epsilon[texi] in one iteration.

Choose the best values for [texi]\alpha[texi]

If your [texi]J(\theta)[texi] plot seen before starts to look weird — upward curves, dramatically slow decreasing, ... — the gradient descent is not working properly: it is time to fix [texi]\alpha[texi], by using a smaller value.

It has been proved mathematically that for sufficiently small [texi]\alpha[texi], [texi]J(\theta)[texi] decreases on every iteration. On the other hand if [texi]\alpha[texi] is too small the gradient descent can be slow to converge.

The rule of thumb here is to try a range of [texi]\alpha[texi] values. Start with [texi]\alpha = 0,001[texi] and look at the [texi]J(\theta)[texi] plot. Does it decrease properly and rapidly? You are done with it. Otherwise, switch to [texi]\alpha = 0,01[texi] ([texi]\times 10[texi] scale), rinse and repeat until the algorithm works fine.


Machine Learning @ Coursera - Gradient Descent in Practice I - Feature Scaling (link)
Machine Learning @ Coursera - Gradient Descent in Practice II - Learning Rate (link)
Wikipedia - Feature Scaling (link) - About Feature Scaling and Normalization (link)

previous article
Multivariate linear regression
next article
Introduction to classification and logistic regression
Kevin McLaughlin on May 02, 2019 at 20:04
I love the article. The one thing I would ask is, how do you transform the parameters found in the normalized domain back to the original space. I know how to transform the observations and z-scores back and for, but what about the parameters?