Like it!

Join us on Facebook!

Like it!

The gradient descent function

How to find the minimum of a function using an iterative algorithm.

Other articles from this series

During the previous episode of this Machine Learning series we took a look at the theory behind linear regression. We also outlined some paper formulas for minimizing the cost function [texi]J[texi]. Or, in other words, finding the best values of [texi]\theta[texi]'s, the parameters of the hypothesis function.

During that stage we basically guessed those best [texi]\theta[texi]'s by staring at the graphical plot on the screen. That was definitely not an automated task: an algorithm called gradient descent will come in handy.

Gradient descent is a more generic algorithm, used not only in linear regression problems and cost functions. In the following example I will minimize an arbitrary function [texi]J[texi], then in the next chapter I'll apply it to the original house pricing task.

The gradient descent algorithm in a nutshell

You have a generic function [texi]J(\theta_0, \theta_1, ... \theta_n)[texi] with an undefined number of parameters. You want to minimize it, that is

[tex]\displaystyle{\stackrel{\text{minimize}}{{\theta_{{0}},\theta_{{1}},\ldots\theta_{{n}}}}}\ {J}{\left(\theta_{{0}},\theta_{{1}},\ldots\theta_{{n}}\right)}[tex]

What to do:

  1. start with some initial guesses for [texi]\theta_0, \theta_1, ... \theta_n[texi]. The common choice is to set them to [texi]0[texi] but sometimes you want to initialize them to other values as well;

  2. keep changing those [texi]\theta_0, \theta_1, ... \theta_n[texi] values to try to reduce [texi]J[texi] until you find a minimum for that function.

The picture 1. below displays a generic three-dimensional function [texi]J(\theta_0, \theta_1)[texi]. There are only two [texi]\theta[texi]'s there, in order to generate an understandable plot. The height, or the vertical axis shows [texi]J[texi], that is the output. Minimizing that function means to find the lowest values of [texi]J[texi] that correspond to the blue depressed areas.

A generic 3-dimensional function.
1. A generic, 3-dimensional function [texi]J[texi].

Applying the descent algorithm in the real world

Suppose that the function in picture 1. represents a hilly landscape with yourself standing on the central mountain. Your task is to walk downhill as quickly as possible by taking small baby steps until you cannot go down any further. You start looking around you, asking yourself: what direction should I go in order to take a baby step downhill? You find the best one given your current position, you take the step and then ask yourself the same question again. Repeat the process until you cannot go down any further, that is until you find a minimum.

Picture 2. below shows the baby steps experiment with two different routes. Your starting position on the hill corresponds to the initial values given to [texi]\theta_0, \theta_1[texi]. Black route has a slightly different starting point compared to the white one, which reveals an interesting property of the gradient descent algorithm: changing the initial value of theta's might lead you to a different minimum.

Gradient descent in a three-dimensional function.
2. Two different downhill routes in a three-dimensional function, determined by the starting point.

Gradient descent algorithm definition

The complete gradient descent formula looks like the following:

[tex]\displaystyle\theta_{{j}}:=\theta_{{j}}-\alpha\cdot\frac{\partial}{{\partial\theta_{{j}}}}{J}{\left(\theta_{{0}},\theta_{{1}},\ldots\theta_{{n}}\right)}\ \ \ \ \ \ \ \text{for}\ {j}={0},{j}={1},\ldots{j}={n}\ \text{until convergence}[tex]

Let me dissect it a little. First of all the [texi]:=[texi] symbol means assignment. It works like when you assign a value to a variable in a programming language, for example int x = 3.

The term [texi]\alpha[texi] is a number called learning rate. It basically controls the step size while descending the hill. Larger [texi]\alpha[texi] means larger steps. The remaining part is a derivative of function [texi]J[texi]; I will delve into it shortly.

To put it briefly, the gradient descent works as follows: for every [texi]\theta_j[texi] of the [texi]J[texi] function (that is [texi]\theta_0, \theta_1, ... \theta_n[texi]), compute the value of [texi]\theta_j[texi] by subtracting from itself the derivative of the function at point [texi]\theta_j[texi] mupliplied by a number [texi]\alpha[texi]. Rinse and repeat until convergence. We will see the concept of convergence later. For now think of it as a point in your downhill walk in which you haven't reached the optimal result yet, but you are really quite quite near, if not on it.

The correct way to implement a gradient descent algorithm requires simultaneous updates of [texi]\theta_j[texi]. That is, store each value in a temporary variable:





Then assign those temporary values to the actual [texi]\theta_j[texi]:





A gradient descent algorithm at work

Let's try now to understand what each part of the gradient descent algorithm does. I'm going to use a simpler example with only one parameter [texi]\theta_0[texi]: it will help a lot with visualization.

I have a generic function [texi]J(\theta_0)[texi] to minimize, that is [texi]\displaystyle{\stackrel{\text{min}}{{\theta_{{0}}}}}\ {J}{\left(\theta_{{0}}\right)}[texi], with [texi]\displaystyle\theta_{{0}}\in\mathbb{R}[texi] (a real number). The function might look like the one you see in the picture 3. below.

Generic parabola function with shifted points.
3. A generic function [texi]J(\theta_0)[texi] that looks like a parabola. On the left, a randomly choosen point [texi]\theta_0[texi] before the gradient descent. On the right, the point is closer to the minimum after an algorithm iteration.

You may also notice the black point on the curve: that's the initial value of [texi]\theta_0[texi] set during the gradient descent initialization. It's just a random value. The gradient descent function will shift that point until it reaches the minimum, that is the bottom of the parabola. Let's see how.

The core part of the algorithm is the derivative: it basically churns out the slope of the tangent line to the black point. Once the value of the slope is known, the algorithm adds or subtracts that value in order to move the point closer to the minimum.

In the picture 3. the slope of the tangent line is positive, so its derivative will be positive. The derivative at a point is just a number, let's call it [texi]\displaystyle{d},\ {d}\ge{0}[texi]. Then the update to [texi]\theta_0[texi] will be:


We are subtracting a positive number from [texi]\theta_0[texi], which makes it smaller. The point will shift to the left, as you may see in the rightmost plot in picture 3.

Of course the algorithm works also in case you have a tangent with a negative slope, like when you initialize [texi]\theta_0[texi] so that it lands in the leftmost part of the plot. Look at the picture below for an example:

Generic parabola function with shifted points.
4. The same function [texi]J(\theta_0)[texi] seen before, with a different starting point. Here the tangent line has negative slope.

Up there the slope of the tangent line is negative, so its derivative will be negative. Let's call it [texi]\displaystyle{d},\ {d}\le{0}[texi] as before. Then the update to [texi]\theta_0[texi] will be:


that is


We are adding a positive number to [texi]\theta_0[texi], which makes it larger. The point will shift to the right, as you may see in the rightmost plot in picture 4.

The process continues until the algorithm has reached the minimum. But what does it mean? The minimum, namely the bottom of the parabola, is the point where the tangent has slope zero: a perfectly horizontal line. The derivative of such horizontal line is [texi]0[texi] (let's call it [texi]\displaystyle{d},\ {d}={0}[texi]), so the algorithm will eventually add nothing to [texi]\theta_0[texi], like so:


that is


At this point, [texi]\theta_0[texi] does not shift anymore: the minimum has been reached.

The gotchas of [texi]\alpha[texi], the learning rate

The parameter [texi]\alpha[texi] (the learning rate) defines how big the step will be during a gradient descent. It's the number that multiplies the derivative during the [texi]\theta_0[texi] updates. The greater the learning rate, the faster the algorithm will descent to the minimum point.

Defining a good value for [texi]\alpha[texi] requires some planning. If it's too small, the algorithm will take many tiny baby steps to get to the minimum, as shown in picture 5. below, left side. Thus gradient descent can be slow.

If [texi]\alpha[texi] is too large, the algorithm can miss the minimum point, namely it fails to converge, as seen in picture below, right side. Worse, it could diverge, that is going further and further away from the minimum, taking up an infinite amount of time.

Alpha, or the step size in gradient descent.
5. The effects of different sizes for the learning rate [texi]\alpha[texi]. Baby steps on the left (slow processing), failure to converge on the right (divergence).

The good news is that once you've found a proper value for the learning rate, the algorithm will reach the minimum naturally, step by step. As [texi]\theta_0[texi] slides toward the minimum, the slope of the tangent line keeps getting smaller, until it reaches [texi]0[texi]. So there's no need to adjust [texi]\alpha[texi], which can remain fixed for the entire process.

In the next episode I will try to apply the gradient descent algorithm to our initial house pricing and churn out some useful values from it.


StackOverflow - Gradient descent convergence How to decide convergence? (link)
Machine Learning @ Coursera - Gradient Descent (link)
Machine Learning @ Coursera - Gradient Descent Intuition (link)

previous article
Linear regression with one variable
next article
The gradient descent in action
Spora23 on February 24, 2018 at 19:41
Thanks, thanks, thanks. Finally an adequate introduction to the gradient descent formula.
KetanK on February 06, 2019 at 20:39
Awesome article mate. Really Helpful. Many thanks.
Fred on December 02, 2019 at 02:10
Shouldn't the step in theta be smaller in size if the gradient is larger? It seems to me that a steeper gradient means a small change in theta yields a huge change in "J".