— Written by Triangles on March 05, 2017 • updated on March 05, 2017 • ID 50 —

Multivariate linear regression

How to upgrade a linear regression algorithm from one to many input variables.

The house pricing problem with multiple features

You surely need more than one feature in order to better predict the price of a house, like for example the number of rooms, the number of floors, the age of the house itself and so on. Those are your new input features [texi]x[texi]. Being more than one, we need to update the notation a little bit.

size	# of bedrooms	# of floors	age	price($)
2104	5	1	40	460,000
1416	3	2	30	230,000
1534	3	2	30	315,000
...	...	...	...	...

Improving the notation

This is what I'm going to use:

[texi]n[texi] — number of input features;
[texi]m[texi] — number of training examples;
[texi]x_n[texi] — [texi]n[texi]-th input feature;
[texi]x^{(i)}[texi] — input feature of [texi]i[texi]-th training example;
[texi]x^{(i)}_j[texi] — value of feature [texi]j[texi] in [texi]i[texi]-th training example;

Let me explain it a little bit. The lower-case [texi]n[texi] is the number of input features, the number of columns in the table above. There are four input features here, so [texi]n = 4[texi]. As in the univariate version, [texi]m[texi] denotes the number of training examples, that is the number of rows in the table above.

The [texi]n[texi]-th input feature is written as [texi]x_n[texi]. For example: [texi]x_1[texi] is the size of the house, [texi]x_4[texi] is the age of the house. The output variable is still one, so it remains [texi]y[texi] with no subscript.

I will be writing [texi]x^{(i)}[texi] to refer to all the values of a specific training example, i.e. the [texi]i[texi]-th row in the table. Being more than one input values, the result turns out to be a vector. For example, I grab the values of the third training example:

[tex] \vec{x}^{(3)} = \begin{bmatrix} 1534 \\ 3 \\ 2 \\ 30 \end{bmatrix} [tex]

Being [texi]x^{(i)}[texi] a vector, I'll use [texi]x^{(i)}_{j}[texi] to refer to a specific value of that vector. For example [texi]x^{(3)}_4 = 30[texi].

A multivariate version of the hypothesis function

The original, univariate hypothesis function was like:

[tex] h_\theta(x) = \theta_0 + \theta_1 x [tex]

with one input variable [texi]x[texi]. Obviously it has to be updated in order to work with multiple inputs. It's going to be:

[tex] h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n [tex]

Quite annoying when you have tons of parameters, isn't it? We can simplify it. First of all let me add a fake value [texi]x_0 = 1[texi]:

[tex] h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n [tex]

The magic is going to happen, thanks to that trick: any linear algebra wizard will now recognize the formula above as the inner product between two vectors [texi]\vec{\theta}[texi] and [texi]\vec{x}[texi]. In particular:

[tex] \begin{equation} \vec{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix} \qquad \vec{x} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \end{equation} [tex]

By definition of the inner product, the first argument ([texi]\vec{\theta}[texi]) must be a row vector. However our [texi]\vec{\theta}[texi] is a column vector. That's not a problem: just transpose it, that is make it a row (lay it down). A transposed vector is written as [texi]\vec{\theta}^\top[texi], so I'm ready to beautifully compress the hypothesis function as follows:

[tex] h_\theta(x) = \vec{\theta}^{\top} \vec{x} [tex]

This notation will make the implementation way easier: computing the hypothesis function it's now just a matter of an inner product between two vectors, a simple task you can accomplish with any mathematical package of your favorite programming language.

A multivariate version of the gradient descent function

I've updated the hypothesis function to work with multiple input parameters. Both the gradient descent function and the cost function need some tweaks as well.

Improving the cost function

We know that the cost function takes in input all the parameters [texi]\theta_0, \theta_1, ... \theta_n[texi], but let's now plug in the vector [texi]\vec{\theta}[texi] instead of writing each parameter separately:

[tex] J(\vec{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 [tex]

It's just a matter of notation, as you may see. The equation stays the same.

Improving the gradient descent function

Now that we have a compact cost function we can simplify the look of the gradient descent formula, which becomes:

[tex] \begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\vec{\theta}) & \newline \rbrace \end{align*} [tex]

In particular we have a slightly different kind of derivative, that is:

[tex] \frac{\partial}{\partial \theta_j} J(\vec{\theta}) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j [tex]

As always I lift you from the burden of computing the derivative step by step. So let me re-write the formula with all the pieces glued together:

[tex] \begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j & \newline \rbrace \end{align*} [tex]

Remember to compute each value separately by storing [texi]\theta[texi]s in temporary variables, as we did for the univariate version. The unrolled loop would look like the following:

[tex] \begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & \cdots \newline \rbrace \end{align*} [tex]

In conclusion, adding the little trick of [texi]x_0 = 1[texi] makes the notation easier to read and more compact. You still have to loop through each [texi]\theta_j[texi] in the gradient descent formula until you reach convergence, but the outcome is a practical vector. Plug it into the hypothesis function, compute the inner product as seen above and you have a working implementation of the multivariate linear regression.

Sources

Khan Academy - Vector dot product and vector length (video)
Stat Trek - Vector Multiplication (link)
Machine Learning @ Coursera - Multiple Features (link)
Machine Learning @ Coursera - Gradient Descent for Multiple Variables (link)