Join us on Facebook!

h1# The gradient descent in action

It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.

Other articles from this series
• What machine learning is about, types of learning and classification algorithms, introductory examples.

• Finding the best-fitting straight line through points of a data set.

• How to find the minimum of a function using an iterative algorithm.

• How to upgrade a linear regression algorithm from one to many input variables.

• A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.

• Get your feet wet with another fundamental machine learning algorithm for binary classification.

• Preparing the logistic regression algorithm for the actual implementation.

• Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.

In the previous article I outlined the theory behind the gradient descent algorithm. It's now time to apply it to our original problem: the house pricing task. Let's make some good use out of it!

In words, I'm going to apply gradient descent to the cost function math$J(\theta_0, \theta_1)$ found in the 2nd episode, in order to minimize it. Let me recap all the formulas found so far. We have a linear hypothesis function

math$$h_\theta(x) = \theta_0 + \theta_1 (x)$$

and its cost function

math$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1x^{(i)} - y^{(i)})^2$$

.

I will apply the gradient descent algorithm

math$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, ... \theta_n) \\ \text{ for } j=0, j=1, ... j=n \text{ until convergence}$$

to the cost function in order to minimize it, that is

math$$\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$$

Once we have found the minimized (i.e. best) values for math$\theta_0$ and math$\theta_1$ we can plug them into the linear hypothesis function and obtain the best fitting line through our data set.

h2## Finding the derivative

The key part of out task is the derivative inside the gradient descent function, that is

math$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1x^{(i)} - y^{(i)})^2$$

Nothing fancy here: I just plugged in the definition of the cost function.

According to the gradient descent algorithm, we have to figure out what is the derivative in two cases, one for each math$\theta$ in our cost function (math$\theta_0$ and math$\theta_1$ of course). I don't want to harass you with the step-by-step development of those derivatives, so trust the following results:

math\begin{align} j = 0 : \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) & = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \\\\ j = 1 : \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) & = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)} \end{align}

Remember: each of those derivatives gives us the slope of a tangent line, which is just a plain number.

h2## The linear regression algorithm in all its splendor

Now that we have found those derivatives, we can plug them back into the gradient descent function:

math\begin{align} \text{repeat until convergence \{} \\ \theta_0 & := \theta_0 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \\\\ \theta_1 & := \theta_1 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)} \\\\ \text{\}} \end{align}

This is the heart of the linear regression. You may not know that we have implicitly solved a nasty problem that might arise in other applications besides linear regression. In the previous article I mentioned that, while initializing the gradient descent, different initial values of math$\theta$'s might lead to a different minimum of the cost function. Such thing will never happen in linear regression: here all cost functions are bowl-shaped, with only one minimum point at the bottom. They are said to be convex functions.

h3### Running the algorithm

Now that I've packed all things together, I am able to run the algorithm and see how it behaves with the initial data set. The picture below briefly shows how the hypothesis function and the cost function progress as the whole thing runs.

1. The gradient descent algorithm at work on the linear regression.

In the first step, I've initialized the parameter of the cost function at a random value (the white dot), say for example math$\theta_0 = 100, \theta_1 = 3$. Those values correspond to math$h(x) = 100 + 3x$, definitely not a good outcome for the hypothesis function, as you may see on the left.

As the algorithm takes further steps in the gradient descent, new values for math$\theta$s pop up in the contour plot. Those new values corresponds to improved hypotheses that better fit to the data. Eventually the algorithm discovers the minimum on the cost function, which yields a hypothesis function that fits best the data set. We can now proudly use that hypothesis function to predict new values, namely new house prices given new sizes.

h2## Some final words on the algorithm

The gradient descent algorithm used so far is called batch gradient descent, because it runs on the entire data set. You may notice that by looking at those two derivatives we've found earlier and their math$\sum_{i=1}^{m} ...$, the summation part. The summation forces the algorithm to parse the entire dataset from math$1$ to math$m$, where math$m$ is the total number of inputs. There are other non-batch versions of the gradient descent that look at a small subset of the training set in order to speed up the computation. We will see some examples in the future.

Finally, our algorithm is clearly an iterative one. There are different, non-iterative versions based on normal equations that find the minimum without multiple steps. However the iterative ones scale better on large data sets. We will take a look at those versions soon.

h2## Sources

Machine learning @ Coursera - Gradient Descent For Linear Regression (link)
StackOverflow - Why gradient descent when we can solve linear regression analytically (link)

previous article
next article
comments