# How to optimize the gradient descent algorithm

A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.

Other articles from this series
• What machine learning is about, types of learning and classification algorithms, introductory examples.

• Finding the best-fitting straight line through points of a data set.

• How to find the minimum of a function using an iterative algorithm.

• It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.

• How to upgrade a linear regression algorithm from one to many input variables.

• Get your feet wet with another fundamental machine learning algorithm for binary classification.

• Preparing the logistic regression algorithm for the actual implementation.

• Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.

Real-world data can come in very different orders of magnitude. For example, your age ranges from 0 to 100 years, while your yearly income ranges from €10,000 to €10,000,000 (and more). Using such unprocessed data as input features for a linear regression system might slow the gradient descent algorithm to a crawl.

It happens because, as we will see shortly, such non-normalized data warps the cost function that gradient descent has to process, making the minimum point really difficult to reach.

Because of that, an important trick in machine learning, and in linear regression in particular, is to make sure that all the input features are on a similar scale. This preparatory step, done to optimize the input data, is known as feature scaling.

## Feature scaling

In feature scaling you basically normalize your input values. For example, say you have two features:

• $x_1$ as the yearly income (10,000-10,000,000);
• $x_2$ as the age (0-100).

Below you will find a contour plot of the cost function $J(\theta_1, \theta_2)$ as if we were using the raw, unprocessed values. As you can see, the result is a very thin and stretched bowl. The gradient descent algorithm would oscillate back and forth a lot, taking a long time before finding its way to the minimum point.

Figure 1. A stretched contour plot, due to missing input feature scaling.

With feature scaling we will bring back the original bowl-shaped figure so that the gradient descent algorithm can do its job efficiently. You have two options here: min-max scaling or standardization.

### Min-max scaling

The idea is to get every input feature into approximately a $[-1, 1]$ range. The name comes from the use of the $\min$ and $\max$ functions, namely the smallest and greatest values of the feature in your dataset. You subtract the minimum from each input value and divide by the range (i.e. the maximum value minus the minimum value) of the input variable:

$$x_i^{\prime} = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

where $x_i$ is the original $i$-th input value of the feature $x$, and $x_i^{\prime}$ is its normalized version.
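As a quick illustration, here is a minimal NumPy sketch of the formula above. The `min_max_scale` helper and the sample array are mine, not part of the original series; the incomes simply reuse the bounds from the example that follows.

```python
import numpy as np

def min_max_scale(x):
    """Rescale a feature vector into the [0, 1] range."""
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical yearly incomes, spanning the 10,000-10,000,000 range.
incomes = np.array([10_000, 30_000, 32_000, 35_000, 10_000_000])
print(min_max_scale(incomes))  # the 30,000 entry maps to ~0.002
```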

For example, say I'm dealing with the yearly income $x_1$ and in particular I want to normalize the value of \$30,000:

$$x_1^{\prime} = \frac{30,000 - 10,000}{10,000,000 - 10,000} \approx 0.002$$

Just rinse and repeat that normalization for every value in your dataset. Of course, if you are in a multivariate scenario, remember to skip the feature $x_0$, since $x_0 = 1$ as seen in the previous episode.

### Standardization

This technique also goes under the name of z-score normalization, among many other confusing aliases I wish I could forget. In brief, you transform your dataset so that the values follow the properties of a normal distribution, namely with mean 0 ($\mu = 0$) and standard deviation 1 ($\sigma = 1$). Unlike min-max scaling, with standardization you think in terms of how many standard deviations a value lies from the mean of the entire dataset. The general formula for standardization is:

$$x_i^{\prime} = \frac{x_i - \mu_i}{\sigma_i}$$

where $\mu_i$ and $\sigma_i$ are the mean and the standard deviation of the feature being scaled. Following the links to my previous articles above, I'm able to compute the mean and the standard deviation of my dataset. I'll show you an example with the yearly income $x_1$. Suppose I have collected five samples: \$10,000, \$30,000, \$32,000, \$35,000 and \$150,000. The mean and the standard deviation are:

$$\mu_1 = 51,400 \qquad \sigma_1 \approx 50,078$$

Now, let's apply standardization to the value of \$30,000, as I did before with min-max scaling:

$$x_1^{\prime} = \frac{30,000 - 51,400}{50,078} \approx -0.4$$

You can read it as $-0.4$ standard deviations ($-0.4\,\text{STD}$) from the mean.

Rinse and repeat the procedure for every value in your dataset, as with min-max scaling, and remember to skip $x_0$ in multivariate problems.
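For completeness, here is a matching sketch of standardization on the five income samples above. The `standardize` helper is my own hypothetical name; note that NumPy's `std` uses the population formula by default, which is what reproduces the numbers in this article.

```python
import numpy as np

def standardize(x):
    """Transform a feature vector to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()  # np.std defaults to the population formula

incomes = np.array([10_000, 30_000, 32_000, 35_000, 150_000])
print(incomes.mean())        # 51400.0
print(incomes.std())         # ~50078.3
print(standardize(incomes))  # the 30,000 entry maps to ~-0.43
```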

Using standardization is important when you are comparing measurements that have different units, like years and dollars. It is also a general requirement for many machine learning algorithms besides linear regression. As a rule of thumb I'd say: when in doubt, just standardize the data; it shouldn't hurt.

## Debug the gradient descent to make sure it is working properly

You want to know whether the gradient descent is working correctly. Since its job is to find the $\theta$ values that minimize the cost function, you can plot the cost function itself (i.e. its output) and watch how it behaves as the algorithm runs.

The image below shows what I mean: the number of iterations is on the horizontal axis, the cost function output on the vertical one. On each iteration the gradient descent churns out new $\theta$ values: you take those values and evaluate the cost function $J(\theta)$. If the algorithm behaves well you should see a descending curve: it means that it is finding $\theta$ values that progressively lower $J(\theta)$.

More generally, the gradient descent works properly when $J(\theta)$ decreases after every iteration.

Figure 2. Plot of the cost function as it gets minimized by the gradient descent algorithm.

Plotting $J(\theta)$ also tells you whether or not the gradient descent has converged. Different problems require a different number of iterations until convergence, so in general you can assume that the algorithm has found a minimum when $J(\theta)$ decreases by less than some small value $\epsilon$ in one iteration.

Choosing a proper value for $\epsilon$ is not an easy task. Some people set it to $10^{-3}$ and automate the check in what is called an automatic convergence test: their algorithm stops when $J(\theta)$ has decreased by less than $\epsilon$ in one iteration.
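To make the idea concrete, here is a hedged sketch of batch gradient descent for linear regression with such an automatic convergence test built in. The cost and update rule follow the formulas from the earlier episodes of this series; the function name, defaults, and stopping logic are my own assumptions.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, epsilon=1e-3, max_iters=10_000):
    """Batch gradient descent with an automatic convergence test.

    X is the m x (n+1) design matrix (first column all ones, i.e. x_0 = 1),
    y the vector of m targets. Stops once J(theta) decreases by less than
    epsilon in one iteration.
    """
    m, n = X.shape
    theta = np.zeros(n)

    def cost(t):
        # J(theta) = 1/(2m) * sum((X @ theta - y)^2)
        return ((X @ t - y) ** 2).sum() / (2 * m)

    history = [cost(theta)]
    for _ in range(max_iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
        history.append(cost(theta))
        if history[-2] - history[-1] < epsilon:
            break  # J(theta) barely moved (or went up): stop iterating
    return theta, history
```

Plotting `history` against the iteration index reproduces the descending curve of Figure 2 when everything is working.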

## Choose the best values for $\alpha$

If your $J(\theta)$ plot seen before starts to look weird (upward curves, a dramatically slow decrease, and so on), the gradient descent is not working properly and it is time to fix $\alpha$: a smaller value if the curve goes up, a larger one if it decreases too slowly.

It has been proven mathematically that for a sufficiently small $\alpha$, $J(\theta)$ decreases on every iteration. On the other hand, if $\alpha$ is too small, the gradient descent can be slow to converge.

The rule of thumb here is to try a range of $\alpha$ values. Start with $\alpha = 0.001$ and look at the $J(\theta)$ plot. Does it decrease properly and rapidly? Then you are done. Otherwise, switch to $\alpha = 0.01$ (a $\times 10$ scale), rinse and repeat until the algorithm works fine.
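Below is a minimal sketch of that trial-and-error loop, reusing the hypothetical `gradient_descent` function from the previous section on some synthetic, already-scaled data (the dataset and thresholds are my own):

```python
import numpy as np

# Synthetic data: y = 2 + 3*x plus a little noise, with x already in [0, 1].
rng = np.random.default_rng(0)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(0, 1, m)])  # first column is x_0 = 1
y = 2 + 3 * X[:, 1] + rng.normal(0, 0.1, m)

# Try learning rates on a x10 scale and inspect each cost curve.
for alpha in (0.001, 0.01, 0.1, 1.0):
    theta, history = gradient_descent(X, y, alpha=alpha, epsilon=1e-6)
    decreasing = all(a >= b for a, b in zip(history, history[1:]))
    print(f"alpha={alpha}: {'decreasing' if decreasing else 'NOT decreasing'}, "
          f"{len(history) - 1} iterations, final J={history[-1]:.6f}")
```

The sensible pick is the largest $\alpha$ whose curve still decreases on every iteration, since it reaches the minimum in the fewest iterations.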

## Sources

Machine Learning @ Coursera - Gradient Descent in Practice I - Feature Scaling (link)
Machine Learning @ Coursera - Gradient Descent in Practice II - Learning Rate (link)