— Written by Triangles on February 05, 2017 • updated on December 04, 2019 —
How to find the minimum of a function using an iterative algorithm.
Introduction to machine learning — What machine learning is about, types of learning and classification algorithms, introductory examples.
Linear regression with one variable — Finding the best-fitting straight line through points of a data set.
The gradient descent in action — It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.
Multivariate linear regression — How to upgrade a linear regression algorithm from one to many input variables.
How to optimize the gradient descent algorithm — A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.
Introduction to classification and logistic regression — Get your feet wet with another fundamental machine learning algorithm for binary classification.
The cost function in logistic regression — Preparing the logistic regression algorithm for the actual implementation.
The problem of overfitting in machine learning algorithms — Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.
In the previous episode of this Machine Learning series we took a look at the theory behind linear regression. We also outlined, on paper, the formulas for minimizing the cost function [texi]J[texi], that is, for finding the best values of the [texi]\theta[texi]'s, the parameters of the hypothesis function.
During that stage we basically guessed the best [texi]\theta[texi]'s by staring at a plot on the screen. That was definitely not an automated task: this is where an algorithm called gradient descent comes in handy.
Gradient descent is a more general algorithm, used not only for linear regression problems and their cost functions. In the following example I will use it to minimize an arbitrary function [texi]J[texi]; in the next chapter I will apply it to the original house pricing task.
You have a generic function [texi]J(\theta_0, \theta_1, ... \theta_n)[texi] with an arbitrary number of parameters. You want to minimize it, that is
[tex]\displaystyle \min_{\theta_0, \theta_1, \ldots, \theta_n} J(\theta_0, \theta_1, \ldots, \theta_n)[tex]
What to do:
start with some initial guesses for [texi]\theta_0, \theta_1, ... \theta_n[texi]. The common choice is to set them to [texi]0[texi], but sometimes you may want to initialize them to other values;
keep changing those [texi]\theta_0, \theta_1, ... \theta_n[texi] values to try to reduce [texi]J[texi], until you find a minimum for that function.
Picture 1. below displays a generic three-dimensional function [texi]J(\theta_0, \theta_1)[texi]. There are only two [texi]\theta[texi]'s, in order to keep the plot understandable. The height, that is the vertical axis, shows the output of [texi]J[texi]. Minimizing that function means finding the lowest values of [texi]J[texi], which correspond to the blue depressed areas.
Suppose that the function in picture 1. represents a hilly landscape with yourself standing on the central mountain. Your task is to walk downhill as quickly as possible by taking small baby steps until you cannot go down any further. You start looking around you, asking yourself: what direction should I go in order to take a baby step downhill? You find the best one given your current position, you take the step and then ask yourself the same question again. Repeat the process until you cannot go down any further, that is until you find a minimum.
Picture 2. below shows the baby steps experiment with two different routes. Your starting position on the hill corresponds to the initial values given to [texi]\theta_0, \theta_1[texi]. The black route has a slightly different starting point from the white one, which reveals an interesting property of the gradient descent algorithm: changing the initial values of the [texi]\theta[texi]'s might lead you to a different minimum.
The complete gradient descent formula looks like the following:
[tex]\displaystyle \theta_j := \theta_j - \alpha \cdot \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \ldots, \theta_n) \qquad \text{for } j = 0, 1, \ldots, n \text{, until convergence}[tex]
Let me dissect it a little. First of all, the [texi]:=[texi] symbol means assignment. It works like assigning a value to a variable in a programming language, for example int x = 3.
The term [texi]\alpha[texi] is a number called learning rate. It basically controls the step size while descending the hill. Larger [texi]\alpha[texi] means larger steps. The remaining part is a derivative of function [texi]J[texi]; I will delve into it shortly.
To put it briefly, gradient descent works as follows: for every parameter [texi]\theta_j[texi] of the function [texi]J[texi] (that is, [texi]\theta_0, \theta_1, ... \theta_n[texi]), compute the new value of [texi]\theta_j[texi] by subtracting from it the partial derivative of the function with respect to [texi]\theta_j[texi], multiplied by the number [texi]\alpha[texi]. Rinse and repeat until convergence. We will look at the concept of convergence later; for now think of it as the point in your downhill walk where you haven't necessarily reached the optimal result yet, but you are really quite near, if not on it.
The correct way to implement a gradient descent algorithm requires simultaneous updates of all the [texi]\theta_j[texi]'s. That is, first store each new value in a temporary variable:
[tex]\displaystyle\text{temp}_{{0}}=\theta_{{0}}-\alpha\cdot\frac{\partial}{{\partial\theta_{{0}}}}{J}{\left(\theta_{{0}},\theta_{{1}},\ldots\theta_{{n}}\right)}[tex]
[tex]\displaystyle\text{temp}_{{1}}=\theta_{{1}}-\alpha\cdot\frac{\partial}{{\partial\theta_{{1}}}}{J}{\left(\theta_{{0}},\theta_{{1}},\ldots\theta_{{n}}\right)}[tex]
[tex]\displaystyle\text{...}[tex]
[tex]\displaystyle\text{temp}_{{n}}=\theta_{{n}}-\alpha\cdot\frac{\partial}{{\partial\theta_{{n}}}}{J}{\left(\theta_{{0}},\theta_{{1}},\ldots\theta_{{n}}\right)}[tex]
Then assign those temporary values to the actual [texi]\theta_j[texi]:
[tex]\displaystyle\theta_{{0}}=\text{temp}_{{0}}[tex]
[tex]\displaystyle\theta_{{1}}=\text{temp}_{{1}}[tex]
[tex]\displaystyle\text{...}[tex]
[tex]\displaystyle\theta_{{n}}=\text{temp}_{{n}}[tex]
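Just to make the procedure concrete, here is a minimal sketch in Python of a single simultaneous update, with a made-up cost function and numerical approximations of the partial derivatives standing in for the real ones (the exact derivatives for linear regression will appear in the next episode):

```python
import numpy as np

def gradient_descent_step(J, theta, alpha=0.1, eps=1e-6):
    """One simultaneous update of all the parameters of J."""
    temp = np.empty_like(theta)
    # First compute every temp_j using the OLD values of theta...
    for j in range(len(theta)):
        shifted = theta.copy()
        shifted[j] += eps
        d = (J(shifted) - J(theta)) / eps   # approximate dJ/dtheta_j
        temp[j] = theta[j] - alpha * d
    # ...then assign them all at once: the simultaneous update.
    return temp

# Made-up example: minimize J(theta_0, theta_1) = theta_0^2 + theta_1^2.
J = lambda t: t[0] ** 2 + t[1] ** 2
theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = gradient_descent_step(J, theta)
print(theta)   # both parameters end up close to 0, the minimum
```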
Let's try now to understand what each part of the gradient descent algorithm does. I'm going to use a simpler example with only one parameter [texi]\theta_0[texi]: it will help a lot with visualization.
I have a generic function [texi]J(\theta_0)[texi] to minimize, that is [texi]\min_{\theta_0} J(\theta_0)[texi], with [texi]\theta_0 \in \mathbb{R}[texi] (a real number). The function might look like the one you see in picture 3. below.
You may also notice the black point on the curve: that's the initial value of [texi]\theta_0[texi], set during the gradient descent initialization. It's just an arbitrary value. The gradient descent algorithm will shift that point until it reaches the minimum, that is the bottom of the parabola. Let's see how.
The core part of the algorithm is the derivative: it basically churns out the slope of the tangent line at the black point. Once the value of the slope is known, the algorithm adds or subtracts it in order to move the point closer to the minimum.
In picture 3. the slope of the tangent line is positive, so the derivative will be positive as well. The derivative at a point is just a number: let's call it [texi]d[texi], with [texi]d > 0[texi]. Then the update to [texi]\theta_0[texi] will be:
[tex]\displaystyle \theta_0 := \theta_0 - \alpha \cdot d[tex]
We are subtracting a positive number from [texi]\theta_0[texi], which makes it smaller. The point will shift to the left, as you may see in the rightmost plot in picture 3.
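Just to make it concrete with some made-up numbers: say the current value is [texi]\theta_0 = 3[texi], the derivative at that point is [texi]d = 6[texi] and the learning rate is [texi]\alpha = 0.1[texi]. The update would be

[tex]\displaystyle \theta_0 := 3 - 0.1 \cdot 6 = 2.4[tex]

and [texi]\theta_0[texi] has indeed moved a little to the left, toward the minimum.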
Of course the algorithm also works when the tangent line has a negative slope, for example when you initialize [texi]\theta_0[texi] so that it lands in the leftmost part of the plot. Look at picture 4. below for an example:
Up there the slope of the tangent line is negative, so the derivative will be negative as well. Since [texi]d[texi] is a positive number, let's write the derivative as [texi]-d[texi], with [texi]d > 0[texi] as before. Then the update to [texi]\theta_0[texi] will be:
[tex]\displaystyle \theta_0 := \theta_0 - \alpha \cdot (-d)[tex]
that is
[tex]\displaystyle \theta_0 := \theta_0 + \alpha \cdot d[tex]
We are adding a positive number to [texi]\theta_0[texi], which makes it larger. The point will shift to the right, as you may see in the rightmost plot in picture 4.
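Again with made-up numbers: if [texi]\theta_0 = -3[texi], the derivative is [texi]-d = -6[texi] and [texi]\alpha = 0.1[texi], the update would be

[tex]\displaystyle \theta_0 := -3 - 0.1 \cdot (-6) = -2.4[tex]

and [texi]\theta_0[texi] has indeed moved a little to the right, toward the minimum.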
The process continues until the algorithm has reached the minimum. But what does that mean? The minimum, namely the bottom of the parabola, is the point where the tangent line has slope zero: a perfectly horizontal line. The derivative along such a horizontal line is [texi]0[texi] (that is, [texi]d = 0[texi]), so the algorithm will eventually add nothing to [texi]\theta_0[texi], like so:
[tex]\displaystyle \theta_0 := \theta_0 - \alpha \cdot d[tex]
that is
[tex]\displaystyle \theta_0 := \theta_0 - 0[tex]
At this point, [texi]\theta_0[texi] does not shift anymore: the minimum has been reached.
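You can verify the whole one-parameter walk with a few lines of Python, using the made-up function [texi]J(\theta_0) = \theta_0^2[texi], whose derivative is simply [texi]2\theta_0[texi]:

```python
theta0 = 3.0    # arbitrary starting point
alpha = 0.3     # learning rate (made-up value)

for step in range(10):
    d = 2 * theta0               # slope of the tangent line at theta0
    theta0 = theta0 - alpha * d  # the gradient descent update
    print(step, theta0)
```

The printed values shrink toward [texi]0[texi]: as the slope [texi]d[texi] approaches zero, the update [texi]\alpha \cdot d[texi] vanishes and [texi]\theta_0[texi] stops moving.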
The parameter [texi]\alpha[texi] (the learning rate) defines how big the steps are during the gradient descent. It's the number that multiplies the derivative during the [texi]\theta_0[texi] updates. The greater the learning rate, the faster the algorithm will descend to the minimum point.
Defining a good value for [texi]\alpha[texi] requires some planning. If it's too small, the algorithm will take many tiny baby steps to get to the minimum, as shown in picture 5. below, left side. Thus gradient descent can be slow.
If [texi]\alpha[texi] is too large, the algorithm can overshoot the minimum point and fail to converge, as seen in picture 5. below, right side. Worse, it could diverge, that is, move further and further away from the minimum, never reaching it.
The good news is that once you've found a proper value for the learning rate, the algorithm will reach the minimum naturally, step by step. As [texi]\theta_0[texi] slides toward the minimum, the slope of the tangent line keeps getting smaller, until it reaches [texi]0[texi]. So there's no need to adjust [texi]\alpha[texi], which can remain fixed for the entire process.
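The same toy function from before can show all three behaviors. In the sketch below (made-up values again), a tiny [texi]\alpha[texi] is still far from the minimum after 20 steps, a reasonable one has essentially converged, and a too-large one has blown up:

```python
for alpha in (0.01, 0.3, 1.1):
    theta0 = 3.0
    for _ in range(20):
        theta0 = theta0 - alpha * (2 * theta0)  # J(theta0) = theta0**2
    print(alpha, theta0)

# alpha = 0.01 -> roughly 2.0: too slow, still far from the minimum
# alpha = 0.3  -> almost exactly 0: converged
# alpha = 1.1  -> far larger than the starting point: it diverged
```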
In the next episode I will apply the gradient descent algorithm to our initial house pricing problem and churn out some useful values from it.
StackOverflow - Gradient descent convergence: How to decide convergence? (link)
Machine Learning @ Coursera - Gradient Descent (link)
Machine Learning @ Coursera - Gradient Descent Intuition (link)