# Linear regression with one variable

Finding the best-fitting straight line through points of a data set.

Other articles from this series
• What machine learning is about, types of learning and classification algorithms, introductory examples.

• How to find the minimum of a function using an iterative algorithm.

• It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.

• How to upgrade a linear regression algorithm from one to many input variables.

• A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.

• Get your feet wet with another fundamental machine learning algorithm for binary classification.

• Preparing the logistic regression algorithm for the actual implementation.

• Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.

Linear regression is one of the most famous ways to describe your data and make predictions from it. Picture 1. below, borrowed from the first chapter of this stunning machine learning series, shows the housing prices from a fantasy country somewhere in the world. You are collecting real-estate information because you want to predict house prices given, say, the size in square feet.

1. House prices given their size.

Given your input data, how can you predict the price of a house outside your initial data set? For example, how much is a 1100 square feet house worth? Linear regression helps answer that question: you shrink your data into a line (the dotted one in the picture above), with a corresponding mathematical equation. If you know the equation of that line, you can find any output (y) given any input (x).

## Terminology and notations

When you gathered your initial data, you actually created the so-called training set, which here is the set of housing prices. The algorithm's job is to learn from that data how to predict the price of a new house. You are using input data to train the program: that's where the name comes from.

The training set can be summarized in a table, like the one you see below:

| size (feet²) (x) | price ($) (y) |
| --- | --- |
| 815 | 165,000 |
| 1510 | 310,000 |
| 2100 | 410,000 |
| ... | ... |

The number of training examples, that is the number of rows in the table above, is denoted by m; the input variable x is the house size in the left column and y is the output variable, namely the price, in the right column.

The pair (x, y) denotes a single, generic training example, while (x^{(i)}, y^{(i)}) represents a specific training example. So if I write (x^{(2)}, y^{(2)}) I'm referring to the second row in the table above, where x^{(2)} = 1510 and y^{(2)} = 310,000.
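As a small illustrative sketch, the training set above can be stored in two parallel Python lists; just remember that the text's indices start at 1 while Python's start at 0:

```python
# Training set from the table above: sizes (x) and prices (y).
x = [815, 1510, 2100]          # size in square feet
y = [165000, 310000, 410000]   # price in dollars

m = len(x)  # number of training examples, m = 3

# The text uses 1-based indices, Python lists are 0-based:
# the training example (x^(2), y^(2)) is (x[1], y[1]).
print(x[1], y[1])  # 1510 310000
```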

### Naming the algorithm parts

2. Overview of a linear regression algorithm.

The training set of housing prices is fed into the learning algorithm. Its main job is to produce a function, which by convention is called h (for hypothesis). You then use that hypothesis function to output the estimated house price y, by giving it the size of a house as input x.

### The hypothesis function

The hypothesis function must have a formula, like any other function in the world. That is:

h_theta(x) = theta_0 + theta_1 x

The thetas (theta_i in general) are the parameters of the function. Usually the theta subscript gets dropped and the hypothesis function is simply written as h(x).

That formula might look scary, but if you think about it, it's nothing fancier than the traditional equation of a line, except that we use theta_0 and theta_1 instead of q and m. Do you remember?

 f(x) = q + mx

In fact, the hypothesis function is just the equation of the dotted line you saw in picture 1.

In our humble hypothesis function there is only one variable, x. For this reason our task is often called linear regression with one variable. Experts also call it univariate linear regression, where univariate means "one variable".
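A minimal sketch of the hypothesis function in Python; the parameter values below are made up, just to show a prediction:

```python
def h(x, theta0, theta1):
    """Univariate linear regression hypothesis: a straight line."""
    return theta0 + theta1 * x

# With made-up parameters theta0 = 10000 and theta1 = 200,
# a 1100 square feet house would be estimated at 230,000.
print(h(1100, 10000, 200))  # 230000
```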

## The cost function: a mathematical intuition

Well, at this point we know that there's a hypothesis function to be found. More precisely, we have to find the parameters theta_0 and theta_1 so that the hypothesis function best fits the training data. If you recall how the equation of a line works, those two parameters control the slope and the height of the line. By tweaking theta_0 and theta_1 we want to find a line that best represents our data. Picture 3. below shows what I mean:

3. Varying the values of theta_0 and theta_1 produces different outcomes: from good (left) to bad (right).

We definitely want something like the first example in the picture above. So how do we find proper values for theta_0 and theta_1? Recall that our training set contains several examples where we know the size of the house x and its actual price y, because we previously surveyed that data. So here's the idea in a nutshell: let's choose the hypothesis function's parameters so that, at least on the existing training set, giving x as input to the hypothesis function yields reasonably accurate predictions for the y values. Once we are satisfied, we can use the hypothesis function with its pretty parameters to make predictions on new input data.

From a mathematical point of view I want, for each i-th point in my data set, the difference h_theta(x^{(i)}) - y^{(i)} to be very small. Here h_theta(x^{(i)}) is the prediction of the hypothesis when it is given the size of house number i as input, while y^{(i)} is the actual price of house number i. If that difference is small, the hypothesis has made an accurate prediction, because it's close to the actual data.

The operation I've described so far is part of the so-called mean squared error (MSE), a function that does exactly what we want: it measures how close a fitted line is to a set of data points. The smaller the MSE, the closer the fit is to the data. There are many other functions that work well for this task, but the MSE is the most commonly used one for regression problems.

If I plug our data into the MSE function, the final formula looks like this:

text{MSE} = 1/{2m} sum_{i=1}^{m} (h_theta(x^{(i)}) - y^{(i)})^2

Note the 1/{2m} and the summation part: we are properly computing a mean. The 2 in the denominator will ease some calculations in future steps. Also, the squaring prevents negative errors from canceling out positive ones.
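A tiny numerical check of that last point: without the squaring, an error above the line can cancel an error below it.

```python
errors = [2.0, -2.0]  # one prediction too high, one too low by the same amount

print(sum(errors))                   # 0.0: the raw errors cancel out
print(sum(e ** 2 for e in errors))   # 8.0: the squared errors do not
```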

Let me now expand the above equation. Since

h_theta(x^{(i)}) = theta_0 + theta_1x^{(i)}

then

text{MSE} = 1/{2m} sum_{i=1}^{m} (theta_0 + theta_1x^{(i)} - y^{(i)})^2

By convention we define a cost function (aka loss function) J, which is just the above equation written more compactly:

J(theta_0, theta_1) = 1/{2m} sum_{i=1}^{m} (theta_0 + theta_1x^{(i)} - y^{(i)})^2
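The cost function J maps directly to code; here is a minimal sketch that mirrors the formula above term by term:

```python
def cost(theta0, theta1, x, y):
    """J(theta_0, theta_1): mean squared error with the 1/(2m) factor."""
    m = len(x)
    squared_errors = sum((theta0 + theta1 * xi - yi) ** 2
                         for xi, yi in zip(x, y))
    return squared_errors / (2 * m)

# A line that passes exactly through every point has zero cost.
print(cost(0, 1, [1, 2, 3], [1, 2, 3]))  # 0.0
```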

Now, we want to find values of theta_0 and theta_1 so good that the above cost function produces the smallest possible value (because a small value means fewer errors). This is an optimization problem: the problem of finding the best solution among all feasible solutions. It can be written as

minimize_{theta_0, theta_1} J(theta_0, theta_1)

## Applying the cost function

Let's now feed our theoretical function some real data. To better understand how the cost function works, I will temporarily set theta_0 = 0, so that our hypothesis function and the optimization problem become

h_theta(x) = theta_1 x

minimize_{theta_1} J(theta_1)

This will help a lot with visualizing the cost function: keeping theta_0 != 0 would require a three-dimensional plot that would initially be a source of annoyance. Just remember: with theta_0 = 0 the hypothesis function becomes a line passing through the origin (0, 0), while theta_1 controls the slope.

### Cost function with one variable

Picture 4. shows the relationship between the hypothesis function and the cost function. Let's suppose that our data set is made of the three points (1, 1), (2, 2) and (3, 3), as you can see in the leftmost plot.

4. Hypothesis function (left) and cost function (right) with one variable.

Changing the value of theta_1, namely the slope of the hypothesis function, produces points on the cost function.

For example, with theta_1 = 1:

J(theta_1 = 1) = 1/{2m} sum_{i=1}^{m} (h_theta(x^{(i)}) - y^{(i)})^2

Since theta_0 = 0, I can rewrite the hypothesis function inside the summation like this:

J(theta_1 = 1) = 1/{2m} sum_{i=1}^{m} (theta_1 x^{(i)} - y^{(i)})^2

Now let's plug some numbers in:

J(theta_1 = 1) = 1/6 [(1*1 - 1)^2 + (1*2 - 2)^2 + (1*3 - 3)^2]

J(theta_1 = 1) = 1/6 [(0)^2 + (0)^2 + (0)^2]

J(theta_1 = 1) = 0

In words: for theta_1 = 1, the cost function has produced a value of 0. Let's try with the other two values:

J(theta_1 = 0.5) = 1/6 [(0.5*1 - 1)^2 + (0.5*2 - 2)^2 + (0.5*3 - 3)^2] ~= 0.6

J(theta_1 = 0) = 1/6 [(0*1 - 1)^2 + (0*2 - 2)^2 + (0*3 - 3)^2] ~= 2.3
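These values are easy to verify in code, using the three data points (1, 1), (2, 2) and (3, 3) and theta_0 fixed at 0:

```python
x = [1, 2, 3]
y = [1, 2, 3]
m = len(x)

def J(theta1):
    """Cost with theta_0 fixed at 0, as in the simplified setting."""
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

print(J(1))              # 0.0
print(round(J(0.5), 2))  # 0.58, the ~0.6 in the text
print(round(J(0), 2))    # 2.33, the ~2.3 in the text
```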

In picture 4. you can find the values 0, 0.6 and 2.3 plotted as full dots on the cost function. The empty ones are values from other thetas, not shown in the hypothesis plot but computed separately: they reveal that the cost function is actually a parabola with its minimum at 0.

You can read the whole picture in reverse: every point of the cost function corresponds to a specific slope of the hypothesis function. We want the best value, namely the minimum of the cost function. Looking at the curve, the value that minimizes J(theta_1) is theta_1 = 1: that is the best slope of the hypothesis function for our particular training set.
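Before meeting a proper minimization algorithm, a brute-force sketch can already locate that minimum: try many candidate slopes and keep the one with the lowest cost. This works for one parameter and a toy data set, but is hopeless in general.

```python
x = [1, 2, 3]
y = [1, 2, 3]

def J(theta1):
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * len(x))

# Scan slopes from 0.00 to 2.00 in steps of 0.01 and keep the best one.
candidates = [i / 100 for i in range(201)]
best = min(candidates, key=J)
print(best)  # 1.0, exactly the slope read off the plot
```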

### Cost function with two (or more) variables

I previously simplified the problem by setting theta_0 to zero, a decision that led to a 2-dimensional cost function plot: great for learning purposes but not so realistic. In the real world, 3-dimensional (and higher!) cost functions are quite common. Fortunately there are some neat ways to visualize them without losing too much information, or your sanity.

The first approach is to draw the actual function in three dimensions, as shown in picture 5. Just note that, since we don't know the exact formula yet, the axis values are more or less random.

5. Three-dimensional cost function.

It looks like a cup, and the optimization problem consists of finding the lowest point on its bottom.

Sometimes, when the picture becomes too messy, it's common to switch to another representation: the contour plot. A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant horizontal slices, called contours, in a 2-dimensional format. Picture 6. shows what I'm talking about.

6. Contour plot of cost function.

It's now much easier to read, don't you think? Our optimization task now requires finding the exact center of that figure, where the value of the cost function is at its minimum.

So far, however, we have just guessed the best values of the thetas by looking at the function's plot. What we actually want is a program that finds the values that minimize the cost function. In the next chapters we will see some proper algorithms to do that.

## Sources

Machine learning @ Coursera - Model representation (link)
Machine learning @ Coursera - Cost function (link)
Machine learning @ Coursera - Cost function intuition 1 (link)
Machine learning @ Coursera - Cost function intuition 2 (link)
Wikipedia - Mean squared error (link)