# The gradient descent function

How to find the minimum of a function using an iterative algorithm.

Other articles from this series
• What machine learning is about, types of learning and classification algorithms, introductory examples.

• Finding the best-fitting straight line through points of a data set.

• It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.

• How to upgrade a linear regression algorithm from one to many input variables.

• A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.

• Get your feet wet with another fundamental machine learning algorithm for binary classification.

• Preparing the logistic regression algorithm for the actual implementation.

• Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.

In the previous episode of this Machine Learning series we took a look at the theory behind linear regression. We also outlined some pen-and-paper formulas for minimizing the cost function math$J$, or, in other words, for finding the best values of the math$\theta$'s, the parameters of the hypothesis function.

At that stage we basically guessed those best math$\theta$'s by staring at a plot on the screen. That was anything but an automated task: this is where an algorithm called gradient descent comes in handy.

Gradient descent is a generic algorithm, used not only in linear regression problems and their cost functions. In the following example I will minimize an arbitrary function math$J$; in the next chapter I'll apply it to the original house pricing task.

## The gradient descent algorithm in a nutshell

You have a generic function math$J(\theta_0, \theta_1, ... \theta_n)$ with an arbitrary number of parameters. You want to minimize it, that is

math$$\displaystyle{\stackrel{\text{minimize}}{{\theta_{{0}},\theta_{{1}},\ldots\theta_{{n}}}}}\ {J}{\left(\theta_{{0}},\theta_{{1}},\ldots\theta_{{n}}\right)}$$

What to do:

1. start with some initial guesses for math$\theta_0, \theta_1, ... \theta_n$. The common choice is to set them all to math$0$, but sometimes you may want to initialize them to other values;

2. keep changing those math$\theta_0, \theta_1, ... \theta_n$ values to try to reduce math$J$ until you find a minimum for that function.
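The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the final linear regression version; `grad_j` is a hypothetical helper returning all the partial derivatives of math$J$ at the current point:

```python
def gradient_descent(grad_j, theta, alpha=0.1, iterations=1000):
    # step 1: theta holds the initial guesses (commonly all zeros)
    for _ in range(iterations):
        gradient = grad_j(theta)  # partial derivatives of J at the current theta
        # step 2: nudge every theta_j downhill, all from the same old values
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta

# toy example: J(theta_0, theta_1) = theta_0^2 + theta_1^2, minimum at (0, 0)
minimum = gradient_descent(lambda th: [2 * th[0], 2 * th[1]], [5.0, -3.0])
```

Run long enough with a sensible math$\alpha$, the loop drives both parameters toward the minimum at math$(0, 0)$.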

The picture 1. below displays a generic three-dimensional function math$J(\theta_0, \theta_1)$. There are only two math$\theta$'s there, in order to keep the plot understandable. The height, i.e. the vertical axis, shows the output math$J$. Minimizing that function means finding the lowest values of math$J$, which correspond to the blue depressed areas.

1. A generic, 3-dimensional function math$J$.

## Applying the descent algorithm in the real world

Suppose that the function in picture 1. represents a hilly landscape with yourself standing on the central mountain. Your task is to walk downhill as quickly as possible by taking small baby steps until you cannot go down any further. You start looking around you, asking yourself: what direction should I go in order to take a baby step downhill? You find the best one given your current position, you take the step and then ask yourself the same question again. Repeat the process until you cannot go down any further, that is until you find a minimum.

Picture 2. below shows the baby steps experiment with two different routes. Your starting position on the hill corresponds to the initial values given to math$\theta_0, \theta_1$. The black route has a slightly different starting point from the white one, which reveals an interesting property of the gradient descent algorithm: changing the initial values of the math$\theta$'s might lead you to a different minimum.

2. Two different downhill routes in a three-dimensional function, determined by the starting point.

## Gradient descent algorithm definition

The complete gradient descent formula looks like the following:

math$$\displaystyle\theta_{{j}}:=\theta_{{j}}-\alpha\cdot\frac{\partial}{{\partial\theta_{{j}}}}{J}{\left(\theta_{{0}},\theta_{{1}},\ldots\theta_{{n}}\right)}\ \ \ \ \ \ \ \text{for}\ {j}={0},{j}={1},\ldots{j}={n}\ \text{until convergence}$$

Let me dissect it a little. First of all, the math$:=$ symbol means assignment. It works like assigning a value to a variable in a programming language, for example code`int x = 3`.

The term math$\alpha$ is a number called learning rate. It basically controls the step size while descending the hill. Larger math$\alpha$ means larger steps. The remaining part is a derivative of function math$J$; I will delve into it shortly.

To put it briefly, gradient descent works as follows: for every parameter math$\theta_j$ of the function math$J$ (that is math$\theta_0, \theta_1, ... \theta_n$), compute the new value of math$\theta_j$ by subtracting from it the partial derivative of the function with respect to math$\theta_j$, multiplied by the number math$\alpha$. Rinse and repeat until convergence. We will look at the concept of convergence later; for now think of it as the point in your downhill walk where, even if you haven't reached the optimal result yet, you are really quite near, if not on it.

The correct way to implement a gradient descent algorithm requires simultaneous updates of math$\theta_j$. That is, store each value in a temporary variable:

math$$\displaystyle\text{temp}_{{0}}=\theta_{{0}}-\alpha\cdot\frac{\partial}{{\partial\theta_{{0}}}}{J}{\left(\theta_{{0}},\theta_{{1}},\ldots\theta_{{n}}\right)}$$

math$$\displaystyle\text{temp}_{{1}}=\theta_{{1}}-\alpha\cdot\frac{\partial}{{\partial\theta_{{1}}}}{J}{\left(\theta_{{0}},\theta_{{1}},\ldots\theta_{{n}}\right)}$$

math$$\displaystyle\text{...}$$

math$$\displaystyle\text{temp}_{{n}}=\theta_{{n}}-\alpha\cdot\frac{\partial}{{\partial\theta_{{n}}}}{J}{\left(\theta_{{0}},\theta_{{1}},\ldots\theta_{{n}}\right)}$$

Then assign those temporary values to the actual math$\theta_j$:

math$$\displaystyle\theta_{{0}}=\text{temp}_{{0}}$$

math$$\displaystyle\theta_{{1}}=\text{temp}_{{1}}$$

math$$\displaystyle\text{...}$$

math$$\displaystyle\theta_{{n}}=\text{temp}_{{n}}$$
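In code, the temporary variables map onto this pattern directly. A Python sketch of both the correct and the naive update follows (`grad_j` is again a hypothetical function returning the partial derivatives at the given point):

```python
def update_simultaneous(theta, grad_j, alpha):
    # compute every temp_j from the same, untouched old thetas...
    temps = [t - alpha * g for t, g in zip(theta, grad_j(theta))]
    return temps  # ...and only then let them become the new thetas

def update_in_place(theta, grad_j, alpha):
    # wrong: theta_0 is overwritten before theta_1's derivative is
    # computed, so that derivative is evaluated at a mixed point
    for j in range(len(theta)):
        theta[j] = theta[j] - alpha * grad_j(theta)[j]
    return theta

# the two differ as soon as the parameters interact, e.g. J = (theta_0 + theta_1)^2
grad_j = lambda th: [2 * (th[0] + th[1]), 2 * (th[0] + th[1])]
a = update_simultaneous([1.0, 1.0], grad_j, 0.1)
b = update_in_place([1.0, 1.0], grad_j, 0.1)
```

With these toy numbers the simultaneous update moves both parameters by the same amount, while the in-place version moves the second one by a different, inconsistent amount.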

## A gradient descent algorithm at work

Let's try now to understand what each part of the gradient descent algorithm does. I'm going to use a simpler example with only one parameter math$\theta_0$: it will help a lot with visualization.

I have a generic function math$J(\theta_0)$ to minimize, that is math$\displaystyle{\stackrel{\text{min}}{{\theta_{{0}}}}}\ {J}{\left(\theta_{{0}}\right)}$, with math$\displaystyle\theta_{{0}}\in\mathbb{R}$ (a real number). The function might look like the one you see in the picture 3. below.

3. A generic function math$J(\theta_0)$ that looks like a parabola. On the left, a randomly chosen point math$\theta_0$ before the gradient descent. On the right, the point is closer to the minimum after an algorithm iteration.

You may also notice the black point on the curve: that's the initial value of math$\theta_0$ set during the gradient descent initialization. It's just a random value. The gradient descent function will shift that point until it reaches the minimum, that is the bottom of the parabola. Let's see how.

The core part of the algorithm is the derivative: it churns out the slope of the tangent line at the black point. Once the value of the slope is known, the algorithm adds or subtracts that value, scaled by math$\alpha$, in order to move the point closer to the minimum.

In the picture 3. the tangent line has a positive slope, so the derivative will be positive. The derivative at a point is just a number; let's call it math$\displaystyle{d},\ {d}>{0}$. Then the update to math$\theta_0$ will be:

math$$\displaystyle\theta_{{0}}\:=\theta_{{0}}-\alpha\cdot{d}$$

We are subtracting a positive number from math$\theta_0$, which makes it smaller. The point will shift to the left, as you may see in the rightmost plot in picture 3.

Of course the algorithm also works when the tangent has a negative slope, for example when you initialize math$\theta_0$ so that it lands in the leftmost part of the plot. Look at the picture below for an example:

4. The same function math$J(\theta_0)$ seen before, with a different starting point. Here the tangent line has negative slope.

Up there the slope of the tangent line is negative, so the derivative will be negative too. Write it as math$\displaystyle-{d}$, with math$\displaystyle{d}>{0}$. Then the update to math$\theta_0$ will be:

math$$\displaystyle\theta_{{0}}\:=\theta_{{0}}-\alpha\cdot{\left(-{d}\right)}$$

that is

math$$\displaystyle\theta_{{0}}\:=\theta_{{0}}+\alpha\cdot{d}$$

We are adding a positive number to math$\theta_0$, which makes it larger. The point will shift to the right, as you may see in the rightmost plot in picture 4.

The process continues until the algorithm reaches the minimum. But what does that mean? The minimum, namely the bottom of the parabola, is the point where the tangent line has slope zero: a perfectly horizontal line. The derivative along such a horizontal line is math$0$ (that is math$\displaystyle{d}={0}$), so the algorithm eventually subtracts nothing from math$\theta_0$, like so:

math$$\displaystyle\theta_{{0}}\:=\theta_{{0}}-\alpha\cdot{d}$$

that is

math$$\displaystyle\theta_{{0}}\:=\theta_{{0}}-{0}$$

At this point, math$\theta_0$ does not shift anymore: the minimum has been reached.
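The whole one-parameter walk is easy to reproduce. A Python sketch follows, taking math$J(\theta_0) = \theta_0^2$ as the parabola (so the derivative at math$\theta_0$ is math$2\theta_0$; the function and the constants are chosen purely for illustration):

```python
theta0 = 1.0   # initial guess, to the right of the minimum
alpha = 0.25   # learning rate

for _ in range(30):
    d = 2 * theta0               # slope of the tangent line at theta0
    theta0 = theta0 - alpha * d  # positive slope: move left; negative: move right

# with alpha = 0.25 each update halves theta0: 1.0, 0.5, 0.25, ...
```

After a few dozen iterations math$\theta_0$ sits vanishingly close to the minimum at math$0$, and the updates have shrunk to essentially nothing.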

## The gotchas of math$\alpha$, the learning rate

The parameter math$\alpha$ (the learning rate) defines how big each step will be during gradient descent. It's the number that multiplies the derivative during the math$\theta_0$ updates. The greater the learning rate, the faster the algorithm will descend to the minimum point.

Defining a good value for math$\alpha$ requires some planning. If it's too small, the algorithm will take many tiny baby steps to get to the minimum, as shown in picture 5. below, left side. Thus gradient descent can be slow.

If math$\alpha$ is too large, the algorithm can overshoot the minimum point, namely it fails to converge, as seen in the picture below, right side. Worse, it could diverge, that is move further and further away from the minimum, never terminating.

5. The effects of different sizes for the learning rate math$\alpha$. Baby steps on the left (slow processing), failure to converge on the right (divergence).

The good news is that once you've found a proper value for the learning rate, the algorithm will reach the minimum naturally, step by step. As math$\theta_0$ slides toward the minimum, the slope of the tangent line keeps getting shallower, until it reaches math$0$. So there's no need to adjust math$\alpha$ along the way: it can remain fixed for the entire process.
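Both failure modes, and the well-behaved case, can be observed on the same math$J(\theta_0) = \theta_0^2$ parabola, with the values of math$\alpha$ picked just to demonstrate:

```python
def descend(alpha, theta0=1.0, steps=100):
    for _ in range(steps):
        theta0 = theta0 - alpha * 2 * theta0  # derivative of theta0^2 is 2*theta0
    return theta0

slow     = descend(alpha=0.001)  # baby steps: barely moved after 100 iterations
good     = descend(alpha=0.3)    # fixed alpha, yet it settles on the minimum
diverged = descend(alpha=1.1)    # each step multiplies theta0 by -1.2: divergence
```

With math$\alpha = 1.1$ every update flips math$\theta_0$ to the other side of the parabola and makes it larger in magnitude, which is exactly the divergence scenario described above.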

In the next episode I will apply the gradient descent algorithm to our initial house pricing task and churn out some useful values from it.