# Linear regression with one variable

Finding the best-fitting straight line through points of a data set.

Other articles from this series
• What machine learning is about, types of learning and classification algorithms, introductory examples.

• How to find the minimum of a function using an iterative algorithm.

• It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.

• How to upgrade a linear regression algorithm from one to many input variables.

• A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.

• Get your feet wet with another fundamental machine learning algorithm for binary classification.

• Preparing the logistic regression algorithm for the actual implementation.

• Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.

Linear regression is one of the most famous ways to describe your data and make predictions from it. Picture 1. below, borrowed from the first chapter of this stunning machine learning series, shows the housing prices from a fantasy country somewhere in the world. You are collecting real-estate information because you want to predict house prices given, say, the size in square feet.

1. House prices given their size.

Given your input data, how can you predict the price of a house outside your initial data set? For example, how much is a 1100 square feet house worth? Linear regression helps answer that question: you shrink your data into a line (the dotted one in the picture above), with a corresponding mathematical equation. If you know the equation of that line, you can find any output (y) given any input (x).

## Terminology and notations

When you gathered your initial data, you actually created the so-called training set, which here is the set of housing prices. The algorithm's job is to learn from that data how to predict the price of a new house. You are using input data to train the program: that's where the name comes from.

The training set can be summarized in a table, like the one you see below:

| size (feet²) (x) | price ($) (y) |
| --- | --- |
| 815 | 165,000 |
| 1510 | 310,000 |
| 2100 | 410,000 |
| ... | ... |

The number of training examples, that is the number of rows in the table above, is denoted by m; the input variable x is the house size in the left column and y is the output variable, namely the price, in the right column.

The pair (x, y) denotes a single, generic training example, while (x^{(i)}, y^{(i)}) represents a specific training example. So if I write (x^{(2)}, y^{(2)}) I'm referring to the second row in the table above, where x^{(2)} = 1510 and y^{(2)} = 310,000.
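As a small illustrative sketch, the training set above can be stored in two parallel Python lists; just remember that the text's indices start at 1 while Python's start at 0:

```python
# Training set from the table above: sizes (x) and prices (y).
x = [815, 1510, 2100]          # size in square feet
y = [165000, 310000, 410000]   # price in dollars

m = len(x)  # number of training examples, m = 3

# The text uses 1-based indices, Python lists are 0-based:
# the training example (x^(2), y^(2)) is (x[1], y[1]).
print(x[1], y[1])  # 1510 310000
```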

### Naming the algorithm parts

2. Overview of a linear regression algorithm.

The training set of housing prices is fed into the learning algorithm. Its main job is to produce a function, which by convention is called h (for hypothesis). You then use that hypothesis function to output the estimated house price y, by giving it the size of a house as input x.

### The hypothesis function

The hypothesis function must have a formula, like any other function in the world. That is:

h_theta(x) = theta_0 + theta_1 x

The thetas (theta_i in general) are the parameters of the function. Usually the theta subscript gets dropped and the hypothesis function is simply written as h(x).

That formula might look scary, but if you think about it, it's nothing fancier than the traditional equation of a line, except that we use theta_0 and theta_1 instead of q and m. Do you remember?

 f(x) = q + mx

In fact, the hypothesis function is just the equation of the dotted line you saw in picture 1.

In our humble hypothesis function there is only one variable, x. For this reason our task is often called linear regression with one variable. Experts also call it univariate linear regression, where univariate means "one variable".
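A minimal sketch of the hypothesis function in Python; the parameter values below are made up, just to show a prediction:

```python
def h(x, theta0, theta1):
    """Univariate linear regression hypothesis: a straight line."""
    return theta0 + theta1 * x

# With made-up parameters theta0 = 10000 and theta1 = 200,
# a 1100 square feet house would be estimated at 230,000.
print(h(1100, 10000, 200))  # 230000
```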

## The cost function: a mathematical intuition

Well, at this point we know that there's a hypothesis function to be found. More precisely, we have to find the parameters theta_0 and theta_1 so that the hypothesis function best fits the training data. If you recall how the equation of a line works, those two parameters control the slope and the height of the line. By tweaking theta_0 and theta_1 we want to find a line that best represents our data. Picture 3. below shows what I mean:

3. Varying the values of theta_0 and theta_1 produces different outcomes: from good (left) to bad (right).

We definitely want something like the first example in the picture above. So how do we find proper values for theta_0 and theta_1? Recall that our training set contains several examples where we know the size of the house x and its actual price y, because we previously surveyed that data. So here's the idea in a nutshell: let's choose the hypothesis function's parameters so that, at least on the existing training set, giving x as input to the hypothesis function yields reasonably accurate predictions for the y values. Once we are satisfied, we can use the hypothesis function with its pretty parameters to make predictions on new input data.

From a mathematical point of view I want, for each i-th point in my data set, the difference h_theta(x^{(i)}) - y^{(i)} to be very small. Here h_theta(x^{(i)}) is the prediction of the hypothesis when it is given the size of house number i as input, while y^{(i)} is the actual price of house number i. If that difference is small, the hypothesis has made an accurate prediction, because it's close to the actual data.

The operation I've described so far is part of the so-called mean squared error (MSE), a function that does exactly what we want: it measures how close a fitted line is to a set of data points. The smaller the MSE, the closer the fit is to the data. There are many other functions that work well for this task, but the MSE is the most commonly used one for regression problems.

If I plug our data into the MSE function, the final formula looks like this:

text{MSE} = 1/{2m} sum_{i=1}^{m} (h_theta(x^{(i)}) - y^{(i)})^2

Note the 1/{2m} and the summation part: we are properly computing a mean. The 2 in the denominator will ease some calculations in future steps. Also, the squaring prevents negative errors from canceling out positive ones.
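A tiny numerical check of that last point: without the squaring, an error above the line can cancel an error below it.

```python
errors = [2.0, -2.0]  # one prediction too high, one too low by the same amount

print(sum(errors))                   # 0.0: the raw errors cancel out
print(sum(e ** 2 for e in errors))   # 8.0: the squared errors do not
```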

Let me now expand the above equation. Since

h_theta(x^{(i)}) = theta_0 + theta_1x^{(i)}

then

text{MSE} = 1/{2m} sum_{i=1}^{m} (theta_0 + theta_1x^{(i)} - y^{(i)})^2

By convention we define a cost function (aka loss function) J, which is just the above equation written more compactly:

J(theta_0, theta_1) = 1/{2m} sum_{i=1}^{m} (theta_0 + theta_1x^{(i)} - y^{(i)})^2
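The cost function J maps directly to code; here is a minimal sketch that mirrors the formula above term by term:

```python
def cost(theta0, theta1, x, y):
    """J(theta_0, theta_1): mean squared error with the 1/(2m) factor."""
    m = len(x)
    squared_errors = sum((theta0 + theta1 * xi - yi) ** 2
                         for xi, yi in zip(x, y))
    return squared_errors / (2 * m)

# A line that passes exactly through every point has zero cost.
print(cost(0, 1, [1, 2, 3], [1, 2, 3]))  # 0.0
```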

Now, we want to find values of theta_0 and theta_1 so good that the above cost function produces the smallest possible value (because a small value means fewer errors). This is an optimization problem: the problem of finding the best solution among all feasible solutions. It can be written as

minimize_{theta_0, theta_1} J(theta_0, theta_1)

## Applying the cost function

Let's now feed our theoretical function some real data. To better understand how the cost function works, I will temporarily set theta_0 = 0, so that our hypothesis function and the optimization problem become

h_theta(x) = theta_1 x

minimize_{theta_1} J(theta_1)

This will help a lot with visualizing the cost function: keeping theta_0 != 0 would require a three-dimensional plot that would initially be a source of annoyance. Just remember: with theta_0 = 0 the hypothesis function becomes a line passing through the origin (0, 0), while theta_1 controls the slope.

### Cost function with one variable

Picture 4. shows the relationship between the hypothesis function and the cost function. Let's suppose that our data set is made of the three points (1, 1), (2, 2) and (3, 3), as you can see in the leftmost plot.

4. Hypothesis function (left) and cost function (right) with one variable.

Changing the value of theta_1, namely the slope of the hypothesis function, produces points on the cost function.

For example, with theta_1 = 1:

J(theta_1 = 1) = 1/{2m} sum_{i=1}^{m} (h_theta(x^{(i)}) - y^{(i)})^2

Since theta_0 = 0, I can rewrite the hypothesis function inside the summation like this:

J(theta_1 = 1) = 1/{2m} sum_{i=1}^{m} (theta_1 x^{(i)} - y^{(i)})^2

Now let's plug some numbers in:

J(theta_1 = 1) = 1/6 [(1*1 - 1)^2 + (1*2 - 2)^2 + (1*3 - 3)^2]

J(theta_1 = 1) = 1/6 [(0)^2 + (0)^2 + (0)^2]

J(theta_1 = 1) = 0

In words: for theta_1 = 1, the cost function has produced a value of 0. Let's try with the other two values:

J(theta_1 = 0.5) = 1/6 [(0.5*1 - 1)^2 + (0.5*2 - 2)^2 + (0.5*3 - 3)^2] ~= 0.6

J(theta_1 = 0) = 1/6 [(0*1 - 1)^2 + (0*2 - 2)^2 + (0*3 - 3)^2] ~= 2.3
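These values are easy to verify in code, using the three data points (1, 1), (2, 2) and (3, 3) and theta_0 fixed at 0:

```python
x = [1, 2, 3]
y = [1, 2, 3]
m = len(x)

def J(theta1):
    """Cost with theta_0 fixed at 0, as in the simplified setting."""
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

print(J(1))              # 0.0
print(round(J(0.5), 2))  # 0.58, the ~0.6 in the text
print(round(J(0), 2))    # 2.33, the ~2.3 in the text
```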

In picture 4. you can find the values 0, 0.6 and 2.3 plotted as full dots on the cost function. The empty ones are values from other thetas, not shown in the hypothesis plot but computed separately: they reveal that the cost function is actually a parabola with its minimum at 0.

You can read the whole picture in reverse: every point of the cost function corresponds to a specific slope of the hypothesis function. We want the best value, namely the minimum of the cost function. Looking at the curve, the value that minimizes J(theta_1) is theta_1 = 1: that is the best slope of the hypothesis function for our particular training set.
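Before meeting a proper minimization algorithm, a brute-force sketch can already locate that minimum: try many candidate slopes and keep the one with the lowest cost. This works for one parameter and a toy data set, but is hopeless in general.

```python
x = [1, 2, 3]
y = [1, 2, 3]

def J(theta1):
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * len(x))

# Scan slopes from 0.00 to 2.00 in steps of 0.01 and keep the best one.
candidates = [i / 100 for i in range(201)]
best = min(candidates, key=J)
print(best)  # 1.0, exactly the slope read off the plot
```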

### Cost function with two (or more) variables

I previously simplified the problem by setting theta_0 to zero, a decision that led to a 2-dimensional cost function plot: great for learning purposes but not so realistic. In the real world, 3-dimensional (and higher!) cost functions are quite common. Fortunately there are some neat ways to visualize them without losing too much information, or your sanity.

The first approach is to draw the actual function in three dimensions, as shown in picture 5. Just note that, since we don't know the exact formula yet, the axis values are more or less random.

5. Three-dimensional cost function.

It looks like a cup, and the optimization problem consists of finding the lowest point on its bottom.

Sometimes, when the picture becomes too messy, it's common to switch to another representation: the contour plot. A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant horizontal slices, called contours, in a 2-dimensional format. Picture 6. shows what I'm talking about.

6. Contour plot of cost function.

It's now much easier to read, don't you think? Our optimization task now requires finding the exact center of that figure, where the value of the cost function is at its minimum.

So far, however, we have just guessed the best values of the thetas by looking at the function's plot. What we actually want is a program that finds the values that minimize the cost function. In the next chapters we will see some proper algorithms to do that.

## Sources

Machine learning @ Coursera - Model representation (link)
Machine learning @ Coursera - Cost function (link)
Machine learning @ Coursera - Cost function intuition 1 (link)
Machine learning @ Coursera - Cost function intuition 2 (link)
Wikipedia - Mean squared error (link)