# Linear regression with one variable

Finding the best-fitting straight line through points of a data set.

Other articles from this series
• What machine learning is about, types of learning and classification algorithms, introductory examples.

• How to find the minimum of a function using an iterative algorithm.

• It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.

• How to upgrade a linear regression algorithm from one to many input variables.

• A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.

• Get your feet wet with another fundamental machine learning algorithm for binary classification.

• Preparing the logistic regression algorithm for the actual implementation.

• Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.

Linear regression is one of the most famous ways to describe your data and make predictions from it. Picture 1. below, borrowed from the first chapter of this stunning machine learning series, shows the housing prices from a fantasy country somewhere in the world. You are collecting real-estate information because you want to predict house prices given, say, the size in square feet.

1. House prices given their size.

Given your input data, how can you predict the price of a house outside your initial data set? For example, how much is a 1100 square feet house worth? Linear regression helps answer that question: you shrink your data into a line (the dotted one in the picture above), with a corresponding mathematical equation. If you know the equation of that line, you can find any output (y) given any input (x).

## Terminology and notations

When you gathered your initial data, you actually created the so-called training set, which here is the set of housing prices. The algorithm's job is to learn from that data how to predict the price of a new house. You are using input data to train the program: that's where the name comes from.

The training set can be summarized in a table, like the one you see below:

| size (feet²) (x) | price ($) (y) |
| --- | --- |
| 815 | 165,000 |
| 1510 | 310,000 |
| 2100 | 410,000 |
| ... | ... |

The number of training examples, that is the number of rows in the table above, is denoted by m; the input variable x is the house size in the left column and y is the output variable, namely the price, in the right column.

The pair (x, y) denotes a single, generic training example, while (x^{(i)}, y^{(i)}) represents a specific training example. So if I write (x^{(2)}, y^{(2)}) I'm referring to the second row in the table above, where x^{(2)} = 1510 and y^{(2)} = 310,000.
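As a small illustrative sketch, the training set above can be stored in two parallel Python lists; just remember that the text's indices start at 1 while Python's start at 0:

```python
# Training set from the table above: sizes (x) and prices (y).
x = [815, 1510, 2100]          # size in square feet
y = [165000, 310000, 410000]   # price in dollars

m = len(x)  # number of training examples, m = 3

# The text uses 1-based indices, Python lists are 0-based:
# the training example (x^(2), y^(2)) is (x[1], y[1]).
print(x[1], y[1])  # 1510 310000
```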

### Naming the algorithm parts

2. Overview of a linear regression algorithm.

The training set of housing prices is fed into the learning algorithm. Its main job is to produce a function, which by convention is called h (for hypothesis). You then use that hypothesis function to output the estimated house price y, by giving it the size of a house as input x.

### The hypothesis function

The hypothesis function must have a formula, like any other function in the world. That is:

h_theta(x) = theta_0 + theta_1 x

The thetas (theta_i in general) are the parameters of the function. Usually the theta subscript gets dropped and the hypothesis function is simply written as h(x).

That formula might look scary, but if you think about it, it's nothing fancier than the traditional equation of a line, except that we use theta_0 and theta_1 instead of q and m. Do you remember?

 f(x) = q + mx

In fact, the hypothesis function is just the equation of the dotted line you saw in picture 1.

In our humble hypothesis function there is only one variable, x. For this reason our task is often called linear regression with one variable. Experts also call it univariate linear regression, where univariate means "one variable".
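A minimal sketch of the hypothesis function in Python; the parameter values below are made up, just to show a prediction:

```python
def h(x, theta0, theta1):
    """Univariate linear regression hypothesis: a straight line."""
    return theta0 + theta1 * x

# With made-up parameters theta0 = 10000 and theta1 = 200,
# a 1100 square feet house would be estimated at 230,000.
print(h(1100, 10000, 200))  # 230000
```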

## The cost function: a mathematical intuition

Well, at this point we know that there's a hypothesis function to be found. More precisely, we have to find the parameters theta_0 and theta_1 so that the hypothesis function best fits the training data. If you recall how the equation of a line works, those two parameters control the slope and the height of the line. By tweaking theta_0 and theta_1 we want to find a line that best represents our data. Picture 3. below shows what I mean:

3. Varying the values of theta_0 and theta_1 produces different outcomes: from good (left) to bad (right).

We definitely want something like the first example in the picture above. So how do we find proper values for theta_0 and theta_1? Recall that our training set contains several examples where we know the size of the house x and its actual price y, because we previously surveyed that data. So here's the idea in a nutshell: let's choose the hypothesis function's parameters so that, at least on the existing training set, giving x as input to the hypothesis function yields reasonably accurate predictions for the y values. Once we are satisfied, we can use the hypothesis function with its pretty parameters to make predictions on new input data.

From a mathematical point of view I want, for each i-th point in my data set, the difference h_theta(x^{(i)}) - y^{(i)} to be very small. Here h_theta(x^{(i)}) is the prediction of the hypothesis when it is given the size of house number i as input, while y^{(i)} is the actual price of house number i. If that difference is small, the hypothesis has made an accurate prediction, because it's close to the actual data.

The operation I've described so far is part of the so-called mean squared error (MSE), a function that does exactly what we want: it measures how close a fitted line is to a set of data points. The smaller the MSE, the closer the fit is to the data. There are many other functions that work well for this task, but the MSE is the most commonly used one for regression problems.

If I plug our data into the MSE function, the final formula looks like this:

text{MSE} = 1/{2m} sum_{i=1}^{m} (h_theta(x^{(i)}) - y^{(i)})^2

Note the 1/{2m} and the summation part: we are properly computing a mean. The 2 in the denominator will ease some calculations in future steps. Also, the squaring prevents negative errors from canceling out positive ones.
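A tiny numerical check of that last point: without the squaring, an error above the line can cancel an error below it.

```python
errors = [2.0, -2.0]  # one prediction too high, one too low by the same amount

print(sum(errors))                   # 0.0: the raw errors cancel out
print(sum(e ** 2 for e in errors))   # 8.0: the squared errors do not
```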

Let me now expand the above equation. Since

h_theta(x^{(i)}) = theta_0 + theta_1x^{(i)}

then

text{MSE} = 1/{2m} sum_{i=1}^{m} (theta_0 + theta_1x^{(i)} - y^{(i)})^2

By convention we define a cost function (aka loss function) J, which is just the above equation written more compactly:

J(theta_0, theta_1) = 1/{2m} sum_{i=1}^{m} (theta_0 + theta_1x^{(i)} - y^{(i)})^2
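The cost function J maps directly to code; here is a minimal sketch that mirrors the formula above term by term:

```python
def cost(theta0, theta1, x, y):
    """J(theta_0, theta_1): mean squared error with the 1/(2m) factor."""
    m = len(x)
    squared_errors = sum((theta0 + theta1 * xi - yi) ** 2
                         for xi, yi in zip(x, y))
    return squared_errors / (2 * m)

# A line that passes exactly through every point has zero cost.
print(cost(0, 1, [1, 2, 3], [1, 2, 3]))  # 0.0
```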

Now, we want to find values of theta_0 and theta_1 so good that the above cost function produces the smallest possible value (because a small value means fewer errors). This is an optimization problem: the problem of finding the best solution among all feasible solutions. It can be written as

minimize_{theta_0, theta_1} J(theta_0, theta_1)

## Applying the cost function

Let's now feed our theoretical function some real data. To better understand how the cost function works, I will temporarily set theta_0 = 0, so that our hypothesis function and the optimization problem become

h_theta(x) = theta_1 x

minimize_{theta_1} J(theta_1)

This will help a lot with visualizing the cost function: keeping theta_0 != 0 would require a three-dimensional plot that would initially be a source of annoyance. Just remember: with theta_0 = 0 the hypothesis function becomes a line passing through the origin (0, 0), while theta_1 controls the slope.

### Cost function with one variable

Picture 4. shows the relationship between the hypothesis function and the cost function. Let's suppose that our data set is made of the three points (1, 1), (2, 2) and (3, 3), as you can see in the leftmost plot.

4. Hypothesis function (left) and cost function (right) with one variable.

Changing the value of theta_1, namely the slope of the hypothesis function, produces points on the cost function.

For example, with theta_1 = 1:

J(theta_1 = 1) = 1/{2m} sum_{i=1}^{m} (h_theta(x^{(i)}) - y^{(i)})^2

Since theta_0 = 0, I can rewrite the hypothesis function inside the summation like this:

J(theta_1 = 1) = 1/{2m} sum_{i=1}^{m} (theta_1 x^{(i)} - y^{(i)})^2

Now let's plug some numbers in:

J(theta_1 = 1) = 1/6 [(1*1 - 1)^2 + (1*2 - 2)^2 + (1*3 - 3)^2]

J(theta_1 = 1) = 1/6 [(0)^2 + (0)^2 + (0)^2]

J(theta_1 = 1) = 0

In words: for theta_1 = 1, the cost function has produced a value of 0. Let's try with the other two values:

J(theta_1 = 0.5) = 1/6 [(0.5*1 - 1)^2 + (0.5*2 - 2)^2 + (0.5*3 - 3)^2] ~= 0.6

J(theta_1 = 0) = 1/6 [(0*1 - 1)^2 + (0*2 - 2)^2 + (0*3 - 3)^2] ~= 2.3
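These values are easy to verify in code, using the three data points (1, 1), (2, 2) and (3, 3) and theta_0 fixed at 0:

```python
x = [1, 2, 3]
y = [1, 2, 3]
m = len(x)

def J(theta1):
    """Cost with theta_0 fixed at 0, as in the simplified setting."""
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

print(J(1))              # 0.0
print(round(J(0.5), 2))  # 0.58, the ~0.6 in the text
print(round(J(0), 2))    # 2.33, the ~2.3 in the text
```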

In picture 4. you can find the values 0, 0.6 and 2.3 plotted as full dots on the cost function. The empty ones are values from other thetas, not shown in the hypothesis plot but computed separately: they reveal that the cost function is actually a parabola with its minimum at 0.

You can read the whole picture in reverse: every point of the cost function corresponds to a specific slope of the hypothesis function. We want the best value, namely the minimum of the cost function. Looking at the curve, the value that minimizes J(theta_1) is theta_1 = 1: that is the best slope of the hypothesis function for our particular training set.
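Before meeting a proper minimization algorithm, a brute-force sketch can already locate that minimum: try many candidate slopes and keep the one with the lowest cost. This works for one parameter and a toy data set, but is hopeless in general.

```python
x = [1, 2, 3]
y = [1, 2, 3]

def J(theta1):
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * len(x))

# Scan slopes from 0.00 to 2.00 in steps of 0.01 and keep the best one.
candidates = [i / 100 for i in range(201)]
best = min(candidates, key=J)
print(best)  # 1.0, exactly the slope read off the plot
```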

### Cost function with two (or more) variables

I previously simplified the problem by setting theta_0 to zero, a decision that led to a 2-dimensional cost function plot: great for learning purposes but not so realistic. In the real world, 3-dimensional (and higher!) cost functions are quite common. Fortunately there are some neat ways to visualize them without losing too much information, or your sanity.

The first approach is to draw the actual function in three dimensions, as shown in picture 5. Just note that, since we don't know the exact formula yet, the axis values are more or less random.

5. Three-dimensional cost function.

It looks like a cup, and the optimization problem consists of finding the lowest point on its bottom.

Sometimes, when the picture becomes too messy, it's common to switch to another representation: the contour plot. A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant horizontal slices, called contours, in a 2-dimensional format. Picture 6. shows what I'm talking about.

6. Contour plot of cost function.

It's now much easier to read, don't you think? Our optimization task now requires finding the exact center of that figure, where the value of the cost function is at its minimum.

So far, however, we have just guessed the best values of the thetas by looking at the function's plot. What we actually want is a program that finds the values that minimize the cost function. In the next chapters we will see some proper algorithms to do that.

## Sources

Machine learning @ Coursera - Model representation (link)
Machine learning @ Coursera - Cost function (link)
Machine learning @ Coursera - Cost function intuition 1 (link)
Machine learning @ Coursera - Cost function intuition 2 (link)
Wikipedia - Mean squared error (link)