— Written by Triangles on February 18, 2017 • ID 49 —
It's time to put gradient descent and the cost function together, in order to churn out the final algorithm for linear regression.
Introduction to machine learning — What machine learning is about, types of learning and classification algorithms, introductory examples.
Linear regression with one variable — Finding the best-fitting straight line through points of a data set.
The gradient descent function — How to find the minimum of a function using an iterative algorithm.
Multivariate linear regression — How to upgrade a linear regression algorithm from one to many input variables.
How to optimize the gradient descent algorithm — A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.
Introduction to classification and logistic regression — Get your feet wet with another fundamental machine learning algorithm for binary classification.
The cost function in logistic regression — Preparing the logistic regression algorithm for the actual implementation.
The problem of overfitting in machine learning algorithms — Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.
In the previous article I outlined the theory behind the gradient descent algorithm. It's now time to apply it to our original problem: the house pricing task. Let's make some good use out of it!
More precisely, I'm going to apply gradient descent to the cost function [texi]J(\theta_0, \theta_1)[texi] found in the second episode, in order to minimize it. Let me recap all the formulas found so far. We have a linear hypothesis function
[tex] h_\theta(x) = \theta_0 + \theta_1 x [tex]
and its cost function
[tex] J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1x^{(i)} - y^{(i)})^2 [tex]
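Before moving on, here is a minimal Python/NumPy sketch of those two formulas. This is my own illustration, not code from the original course, and the tiny data set for [texi]x[texi] and [texi]y[texi] is made up:

```python
import numpy as np

def hypothesis(theta_0, theta_1, x):
    # h(x) = theta_0 + theta_1 * x, evaluated element-wise on a NumPy array
    return theta_0 + theta_1 * x

def cost(theta_0, theta_1, x, y):
    # J(theta_0, theta_1) = 1/(2m) * sum over i of (h(x_i) - y_i)^2
    m = len(y)
    errors = hypothesis(theta_0, theta_1, x) - y
    return np.sum(errors ** 2) / (2 * m)

# Made-up data set: house sizes (x) and prices (y)
x = np.array([30.0, 50.0, 70.0, 90.0])
y = np.array([100.0, 150.0, 200.0, 250.0])
print(cost(0.0, 0.0, x, y))   # cost of the "all-zero" hypothesis
```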
I will apply the gradient descent algorithm
[tex] \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \dots, \theta_n) \\ \text{for } j = 0, 1, \dots, n \text{ until convergence} [tex]
to the cost function in order to minimize it, that is
[tex] \min_{\theta_0, \theta_1} J(\theta_0, \theta_1) [tex]
Once we have found the minimizing (i.e. best) values for [texi]\theta_0[texi] and [texi]\theta_1[texi], we can plug them into the hypothesis function and obtain the best-fitting line through our data set.
The key part of our task is the derivative inside the gradient descent function, that is
[tex] \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1x^{(i)} - y^{(i)})^2 [tex]
Nothing fancy here: I just plugged in the definition of the cost function.
According to the gradient descent algorithm, we have to work out the derivative in two cases, one for each [texi]\theta[texi] in our cost function ([texi]\theta_0[texi] and [texi]\theta_1[texi], of course). I don't want to bore you with the step-by-step development of those derivatives, so just trust the following results:
[tex] \begin{align} j = 0 : \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) & = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \\\\ j = 1 : \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) & = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)} \end{align} [tex]
Remember: each of those derivatives gives us the slope of a tangent line, which is just a plain number.
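In code, each derivative is just an averaged sum over the training set. Here is a small sketch that builds on the hypothesis function from the previous snippet:

```python
def gradients(theta_0, theta_1, x, y):
    # Partial derivatives of J(theta_0, theta_1), as listed above
    m = len(y)
    errors = hypothesis(theta_0, theta_1, x) - y   # h(x_i) - y_i for every example
    d_theta_0 = np.sum(errors) / m                 # (1/m) * sum of the errors
    d_theta_1 = np.sum(errors * x) / m             # (1/m) * sum of the errors times x_i
    return d_theta_0, d_theta_1
```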
Now that we have found those derivatives, we can plug them back into the gradient descent function:
[tex] \begin{align} \text{repeat until convergence \{} \\ \theta_0 & := \theta_0 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \\\\ \theta_1 & := \theta_1 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)} \\\\ \text{\}} \end{align} [tex]
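As a rough sketch, the loop above could be coded as follows, reusing the gradients function from the previous snippet. The learning rate alpha and the fixed number of iterations are arbitrary choices of mine; a real implementation would rather check for convergence:

```python
def gradient_descent(x, y, alpha=0.0001, iterations=100000):
    theta_0, theta_1 = 0.0, 0.0                        # arbitrary starting point
    for _ in range(iterations):
        d_theta_0, d_theta_1 = gradients(theta_0, theta_1, x, y)
        # Simultaneous update: both gradients come from the current thetas
        theta_0 = theta_0 - alpha * d_theta_0
        theta_1 = theta_1 - alpha * d_theta_1
    return theta_0, theta_1
```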
This is the heart of linear regression. You may not realize it, but we have implicitly dodged a nasty problem that might arise in applications other than linear regression. In the previous article I mentioned that, when initializing gradient descent, different initial values of the [texi]\theta[texi]'s might lead to a different minimum of the cost function. That can never happen in linear regression: here the cost function is always bowl-shaped, with a single minimum point at the bottom. It is said to be a convex function.
Now that I've put everything together, I can run the algorithm and see how it behaves on the initial data set. The picture below briefly shows how the hypothesis function and the cost function evolve as the algorithm runs.
In the first step, I've initialized the parameters of the cost function at random values (the white dot), say for example [texi]\theta_0 = 100, \theta_1 = 3[texi]. Those values correspond to [texi]h(x) = 100 + 3x[texi], definitely not a good outcome for the hypothesis function, as you can see on the left.
As the algorithm takes further steps of gradient descent, new values for the [texi]\theta[texi]'s pop up in the contour plot. Those new values correspond to improved hypotheses that better fit the data. Eventually the algorithm reaches the minimum of the cost function, which yields the hypothesis function that best fits the data set. We can now proudly use that hypothesis function to predict new values, namely new house prices given new sizes.
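For example, running the sketch above on the made-up data set from the first snippet and asking it for a prediction might look like this (the numbers are purely illustrative, not the actual house pricing data):

```python
theta_0, theta_1 = gradient_descent(x, y, alpha=0.0001, iterations=100000)
print(theta_0, theta_1)                      # the learned parameters
print(hypothesis(theta_0, theta_1, 60.0))    # predicted price for a size-60 house
```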
The gradient descent algorithm used so far is called batch gradient descent, because each step runs over the entire data set. You can see it by looking at the two derivatives we found earlier: their summation part, [texi]\sum_{i=1}^{m} ...[texi], forces the algorithm to scan the whole training set from [texi]1[texi] to [texi]m[texi], where [texi]m[texi] is the total number of training examples. There are other, non-batch versions of gradient descent that look at a small subset of the training set in order to speed up the computation. We will see some examples in the future.
Finally, our algorithm is clearly an iterative one. There are different, non-iterative versions based on normal equations that find the minimum without multiple steps. However, the iterative ones scale better on large data sets. We will take a look at those versions soon.
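Just to give a taste of what those non-iterative versions look like, here is a one-liner based on the normal equation, using NumPy's pseudo-inverse. Again, this is my own sketch; the proper treatment will come in a later article:

```python
# Normal equation: theta = (X^T X)^(-1) X^T y,
# where X gets a column of ones so that theta_0 acts as the intercept.
X = np.column_stack([np.ones_like(x), x])
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)   # [theta_0, theta_1] found in one shot, no iterations
```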
Machine learning @ Coursera - Gradient Descent For Linear Regression (link)
StackOverflow - Why gradient descent when we can solve linear regression analytically (link)