Join us on Facebook!
— Written by Triangles on October 29, 2017 • updated on November 10, 2019 • ID 59 —
Preparing the logistic regression algorithm for the actual implementation.
Introduction to machine learning — What machine learning is about, types of learning and classification algorithms, introductory examples.
Linear regression with one variable — Finding the best-fitting straight line through points of a data set.
The gradient descent function — How to find the minimum of a function using an iterative algorithm.
The gradient descent in action — It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.
Multivariate linear regression — How to upgrade a linear regression algorithm from one to many input variables.
How to optimize the gradient descent algorithm — A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.
Introduction to classification and logistic regression — Get your feet wet with another fundamental machine learning algorithm for binary classification.
The problem of overfitting in machine learning algorithms — Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.
In the previous article "Introduction to classification and logistic regression" I outlined the mathematical basics of the logistic regression algorithm, whose task is to separate things in the training example by computing the decision boundary.
The decision boundary can be described by an equation. As in linear regression, the logistic regression algorithm will be able to find the best [texi]\theta[texi]s parameters in order to make the decision boundary actually separate the data points correctly. In this article we'll see how to compute those [texi]\theta[texi]s.
Suppose we have a generic training set
[tex]\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)}) \}[tex]
made of [texi]m[texi] training examples, where [texi](x^{(1)}, y^{(1)})[texi] is the 1st example and so on. More specifically, [texi]x^{(m)}[texi] is the input variable of the [texi]m[texi]-th example, while [texi]y^{(m)}[texi] is its output variable. Being this a classification problem, each example has of course the output [texi]y[texi] bound between [texi]0[texi] and [texi]1[texi]. In other words, [texi]y \in {0,1}[texi].
Each example is represented as usual by its feature vector
[tex] \vec{x} = \begin{bmatrix} x_0 \\ x_1 \\ \dots \\ x_n \end{bmatrix} [tex]
where [texi]x_0 = 1[texi] (the same old trick). This is a generic example, we don't know the exact number of features.
Finally we have the hypothesis function for logistic regression, as seen in the previous article:
[tex] h_\theta(x) = \frac{1}{1 + e^{\theta^{\top} x}} [tex]
Our task now is to choose the best parameters [texi]\theta[texi]s in the equation above, given the current training set, in order to minimize errors. Remember that [texi]\theta[texi] is not a single parameter: it expands to the equation of the decision boundary which can be a line or a more complex formula (with more [texi]\theta[texi]s to guess).
The procedure is similar to what we did for linear regression: define a cost function and try to find the best possible values of each [texi]\theta[texi] by minimizing the cost function output. The minimization will be performed by a gradient descent algorithm, whose task is to parse the cost function output until it finds the lowest minimum point.
You might remember the original cost function [texi]J(\theta)[texi] used in linear regression. I can tell you right now that it's not going to work here with logistic regression. If you try to use the linear regression's cost function to generate [texi]J(\theta)[texi] in a logistic regression problem, you would end up with a non-convex function: a wierdly-shaped graph with no easy to find minimum global point, as seen in the picture below.
This strange outcome is due to the fact that in logistic regression we have the sigmoid function around, which is non-linear (i.e. not a line). With the [texi]J(\theta)[texi] depicted in figure 1. the gradient descent algorithm might get stuck in a local minimum point. That's why we still need a neat convex function as we did for linear regression: a bowl-shaped function that eases the gradient descent function's work to converge to the optimal minimum point.
Let me go back for a minute to the cost function we used in linear regression:
[tex] J(\vec{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 [tex]
which can be rewritten in a slightly different way:
[tex] J(\vec{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2 [tex]
Nothing scary happened: I've just moved the [texi]\frac{1}{2}[texi] next to the summation part. Now let's make it more general by defining a new function
[tex]\mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2[tex]
In words, a function [texi]\mathrm{Cost}[texi] that takes two parameters in input: [texi]h_\theta(x^{(i)})[texi] as hypothesis function and [texi]y^{(i)}[texi] as output. You can think of it as the cost the algorithm has to pay if it makes a prediction [texi]h_\theta(x^{(i)})[texi] while the actual label was [texi]y^{(i)}[texi].
With this new piece of the puzzle I can rewrite the cost function for the linear regression as follows:
[tex] J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) [tex]
However we know that the linear regression's cost function cannot be used in logistic regression problems. So what is this all about? Well, it turns out that for logistic regression we just have to find a different [texi]\mathrm{Cost}[texi] function, while the summation part stays the same.
For logistic regression, the [texi]\mathrm{Cost}[texi] function is defined as:
[tex] \mathrm{Cost}(h_\theta(x),y) = \begin{cases} -\log(h_\theta(x)) & \text{if y = 1} \\ -\log(1-h_\theta(x)) & \text{if y = 0} \end{cases} [tex]
The [texi]i[texi] indexes have been removed for clarity. In words this is the cost the algorithm pays if it predicts a value [texi]h_\theta(x)[texi] while the actual cost label turns out to be [texi]y[texi]. By using this function we will grant the convexity to the function the gradient descent algorithm has to process, as discussed above. There is also a mathematical proof for that, which is outside the scope of this introductory course.
In case [texi]y = 1[texi], the output (i.e. the cost to pay) approaches to 0 as [texi]h_\theta(x)[texi] approaches to 1. Conversely, the cost to pay grows to infinity as [texi]h_\theta(x)[texi] approaches to 0. You can clearly see it in the plot 2. below, left side. This is a desirable property: we want a bigger penalty as the algorithm predicts something far away from the actual value. If the label is [texi]y = 1[texi] but the algorithm predicts [texi]h_\theta(x) = 0[texi], the outcome is completely wrong.
Conversely, the same intuition applies when [texi]y = 0[texi], depicted in the plot 2. below, right side. Bigger penalties when the label is [texi]y = 0[texi] but the algorithm predicts [texi]h_\theta(x) = 1[texi].
What we have just seen is the verbose version of the cost function for logistic regression. We can make it more compact into a one-line expression: this will help avoiding boring if/else statements when converting the formula into an algorithm.
[tex] \mathrm{Cost}(h_\theta(x),y) = -y \log(h_\theta(x)) - (1 - y) \log(1-h_\theta(x)) [tex]
Proof: try to replace [texi]y[texi] with 0 and 1 and you will end up with the two pieces of the original function.
With the optimization in place, the logistic regression cost function can be rewritten as:
[tex] \begin{align} J(\theta) & = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \\ & = - \dfrac{1}{m} [\sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1-h_\theta(x^{(i)}))] \\ \end{align} [tex]
I've moved the minus sign outside to avoid additional parentheses.
What's left? We have the hypothesis function and the cost function: we are almost done. It's now time to find the best values for [texi]\theta[texi]s parameters in the cost function, or in other words to minimize the cost function by running the gradient descent algorithm. The procedure is identical to what we did for linear regression.
More formally, we want to minimize the cost function:
[tex] \min_{\theta} J(\theta) [tex]
Which will output a set of parameters [texi]\theta[texi], the best ones (i.e. with less error). Once done, we will be ready to make predictions on new input examples with their features [texi]x[texi], by using the new [texi]\theta[texi]s in the hypothesis function:
[tex] h_\theta(x) = \frac{1}{1 + e^{\theta^{\top} x}} [tex]
Where [texi]h_\theta(x)[texi] is the output, the prediction, or yet the probability that [texi]y = 1[texi].
The way we are going to minimize the cost function is by using the gradient descent. The good news is that the procedure is 99% identical to what we did for linear regression.
To minimize the cost function we have to run the gradient descent function on each parameter:
[tex] \begin{align} \text{repeat until convergence \{} \\ \theta_j & := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \\ \text{\}} \end{align} [tex]
Remember to simultaneously update all [texi]\theta_j[texi] as we did in the linear regression counterpart: if you have [texi] n[texi] features, that is a feature vector [texi]\vec{\theta} = [\theta_0, \theta_1, \cdots \theta_n][texi], all those parameters have to be updated simultaneously on each iteration:
[tex] \begin{align} \text{repeat until convergence \{} \\ \theta_0 & := \cdots \\ \theta_1 & := \cdots \\ \cdots \\ \theta_n & := \cdots \\ \text{\}} \end{align} [tex]
Back to the algorithm, I'll spare you the computation of the daunting derivative [texi]\frac{\partial}{\partial \theta_j} J(\theta)[texi], which becomes:
[tex] \frac{\partial}{\partial \theta_j} J(\theta) = \dfrac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} [tex]
So the loop above can be rewritten as:
[tex] \begin{align} \text{repeat until convergence \{} \\ \theta_j & := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \\ \text{\}} \end{align} [tex]
Surprisingly, it looks identical to what we were doing for the multivariate linear regression. What's changed however is the definition of the hypothesis [texi]h_\theta(x)[texi]: for linear regression we had [texi]h_\theta(x) = \theta^{\top}{x}[texi], whereas for logistic regression we have [texi]h_\theta(x) = \frac{1}{1 + e^{\theta^{\top} x}}[texi].
From now on you can apply the same techniques to optimize the gradient descent algorithm we have seen for linear regression, to make sure the conversion to the minimum point works correctly. In the next chapter I will delve into some advanced optimization tricks, as well as defining and avoiding the problem of overfitting.
Machine Learning Course @ Coursera - Cost function (video)
Machine Learning Course @ Coursera - Simplified Cost Function and Gradient Descent (video)