Like it!

Join us on Facebook!

Like it!

The cost function in logistic regression

Preparing the logistic regression algorithm for the actual implementation.

Other articles from this series

In the previous article "Introduction to classification and logistic regression" I outlined the mathematical basics of the logistic regression algorithm, whose task is to separate things in the training example by computing the decision boundary.

The decision boundary can be described by an equation. As in linear regression, the logistic regression algorithm will be able to find the best [texi]\theta[texi]s parameters in order to make the decision boundary actually separate the data points correctly. In this article we'll see how to compute those [texi]\theta[texi]s.

Suppose we have a generic training set

[tex]\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)}) \}[tex]

made of [texi]m[texi] training examples, where [texi](x^{(1)}, y^{(1)})[texi] is the 1st example and so on. More specifically, [texi]x^{(m)}[texi] is the input variable of the [texi]m[texi]-th example, while [texi]y^{(m)}[texi] is its output variable. Being this a classification problem, each example has of course the output [texi]y[texi] bound between [texi]0[texi] and [texi]1[texi]. In other words, [texi]y \in {0,1}[texi].

Each example is represented as usual by its feature vector

[tex] \vec{x} = \begin{bmatrix} x_0 \\ x_1 \\ \dots \\ x_n \end{bmatrix} [tex]

where [texi]x_0 = 1[texi] (the same old trick). This is a generic example, we don't know the exact number of features.

Finally we have the hypothesis function for logistic regression, as seen in the previous article:

[tex] h_\theta(x) = \frac{1}{1 + e^{\theta^{\top} x}} [tex]

Our task now is to choose the best parameters [texi]\theta[texi]s in the equation above, given the current training set, in order to minimize errors. Remember that [texi]\theta[texi] is not a single parameter: it expands to the equation of the decision boundary which can be a line or a more complex formula (with more [texi]\theta[texi]s to guess).

The procedure is similar to what we did for linear regression: define a cost function and try to find the best possible values of each [texi]\theta[texi] by minimizing the cost function output. The minimization will be performed by a gradient descent algorithm, whose task is to parse the cost function output until it finds the lowest minimum point.

The cost function used in linear regression won't work here

You might remember the original cost function [texi]J(\theta)[texi] used in linear regression. I can tell you right now that it's not going to work here with logistic regression. If you try to use the linear regression's cost function to generate [texi]J(\theta)[texi] in a logistic regression problem, you would end up with a non-convex function: a wierdly-shaped graph with no easy to find minimum global point, as seen in the picture below.

Non-convex function
1. An example of a non-convex function. The grey point on the right side shows a potential local minimum.

This strange outcome is due to the fact that in logistic regression we have the sigmoid function around, which is non-linear (i.e. not a line). With the [texi]J(\theta)[texi] depicted in figure 1. the gradient descent algorithm might get stuck in a local minimum point. That's why we still need a neat convex function as we did for linear regression: a bowl-shaped function that eases the gradient descent function's work to converge to the optimal minimum point.

A better cost function for logistic regression

Let me go back for a minute to the cost function we used in linear regression:

[tex] J(\vec{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 [tex]

which can be rewritten in a slightly different way:

[tex] J(\vec{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2 [tex]

Nothing scary happened: I've just moved the [texi]\frac{1}{2}[texi] next to the summation part. Now let's make it more general by defining a new function

[tex]\mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2[tex]

In words, a function [texi]\mathrm{Cost}[texi] that takes two parameters in input: [texi]h_\theta(x^{(i)})[texi] as hypothesis function and [texi]y^{(i)}[texi] as output. You can think of it as the cost the algorithm has to pay if it makes a prediction [texi]h_\theta(x^{(i)})[texi] while the actual label was [texi]y^{(i)}[texi].

With this new piece of the puzzle I can rewrite the cost function for the linear regression as follows:

[tex] J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) [tex]

However we know that the linear regression's cost function cannot be used in logistic regression problems. So what is this all about? Well, it turns out that for logistic regression we just have to find a different [texi]\mathrm{Cost}[texi] function, while the summation part stays the same.

Logistic regression cost function

For logistic regression, the [texi]\mathrm{Cost}[texi] function is defined as:

[tex] \mathrm{Cost}(h_\theta(x),y) = \begin{cases} -\log(h_\theta(x)) & \text{if y = 1} \\ -\log(1-h_\theta(x)) & \text{if y = 0} \end{cases} [tex]

The [texi]i[texi] indexes have been removed for clarity. In words this is the cost the algorithm pays if it predicts a value [texi]h_\theta(x)[texi] while the actual cost label turns out to be [texi]y[texi]. By using this function we will grant the convexity to the function the gradient descent algorithm has to process, as discussed above. There is also a mathematical proof for that, which is outside the scope of this introductory course.

In case [texi]y = 1[texi], the output (i.e. the cost to pay) approaches to 0 as [texi]h_\theta(x)[texi] approaches to 1. Conversely, the cost to pay grows to infinity as [texi]h_\theta(x)[texi] approaches to 0. You can clearly see it in the plot 2. below, left side. This is a desirable property: we want a bigger penalty as the algorithm predicts something far away from the actual value. If the label is [texi]y = 1[texi] but the algorithm predicts [texi]h_\theta(x) = 0[texi], the outcome is completely wrong.

Conversely, the same intuition applies when [texi]y = 0[texi], depicted in the plot 2. below, right side. Bigger penalties when the label is [texi]y = 0[texi] but the algorithm predicts [texi]h_\theta(x) = 1[texi].

Cost function for logistic regression
2. How the cost function for logistic regression looks like.

Additional cost function optimizations

What we have just seen is the verbose version of the cost function for logistic regression. We can make it more compact into a one-line expression: this will help avoiding boring if/else statements when converting the formula into an algorithm.

[tex] \mathrm{Cost}(h_\theta(x),y) = -y \log(h_\theta(x)) - (1 - y) \log(1-h_\theta(x)) [tex]

Proof: try to replace [texi]y[texi] with 0 and 1 and you will end up with the two pieces of the original function.

With the optimization in place, the logistic regression cost function can be rewritten as:

[tex] \begin{align} J(\theta) & = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \\ & = - \dfrac{1}{m} [\sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1-h_\theta(x^{(i)}))] \\ \end{align} [tex]

I've moved the minus sign outside to avoid additional parentheses.

Plugging the cost function and the gradient descent together

What's left? We have the hypothesis function and the cost function: we are almost done. It's now time to find the best values for [texi]\theta[texi]s parameters in the cost function, or in other words to minimize the cost function by running the gradient descent algorithm. The procedure is identical to what we did for linear regression.

More formally, we want to minimize the cost function:

[tex] \min_{\theta} J(\theta) [tex]

Which will output a set of parameters [texi]\theta[texi], the best ones (i.e. with less error). Once done, we will be ready to make predictions on new input examples with their features [texi]x[texi], by using the new [texi]\theta[texi]s in the hypothesis function:

[tex] h_\theta(x) = \frac{1}{1 + e^{\theta^{\top} x}} [tex]

Where [texi]h_\theta(x)[texi] is the output, the prediction, or yet the probability that [texi]y = 1[texi].

The way we are going to minimize the cost function is by using the gradient descent. The good news is that the procedure is 99% identical to what we did for linear regression.

To minimize the cost function we have to run the gradient descent function on each parameter:

[tex] \begin{align} \text{repeat until convergence \{} \\ \theta_j & := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \\ \text{\}} \end{align} [tex]

Remember to simultaneously update all [texi]\theta_j[texi] as we did in the linear regression counterpart: if you have [texi] n[texi] features, that is a feature vector [texi]\vec{\theta} = [\theta_0, \theta_1, \cdots \theta_n][texi], all those parameters have to be updated simultaneously on each iteration:

[tex] \begin{align} \text{repeat until convergence \{} \\ \theta_0 & := \cdots \\ \theta_1 & := \cdots \\ \cdots \\ \theta_n & := \cdots \\ \text{\}} \end{align} [tex]

Back to the algorithm, I'll spare you the computation of the daunting derivative [texi]\frac{\partial}{\partial \theta_j} J(\theta)[texi], which becomes:

[tex] \frac{\partial}{\partial \theta_j} J(\theta) = \dfrac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} [tex]

So the loop above can be rewritten as:

[tex] \begin{align} \text{repeat until convergence \{} \\ \theta_j & := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \\ \text{\}} \end{align} [tex]

Surprisingly, it looks identical to what we were doing for the multivariate linear regression. What's changed however is the definition of the hypothesis [texi]h_\theta(x)[texi]: for linear regression we had [texi]h_\theta(x) = \theta^{\top}{x}[texi], whereas for logistic regression we have [texi]h_\theta(x) = \frac{1}{1 + e^{\theta^{\top} x}}[texi].

From now on you can apply the same techniques to optimize the gradient descent algorithm we have seen for linear regression, to make sure the conversion to the minimum point works correctly. In the next chapter I will delve into some advanced optimization tricks, as well as defining and avoiding the problem of overfitting.

Sources

Machine Learning Course @ Coursera - Cost function (video)
Machine Learning Course @ Coursera - Simplified Cost Function and Gradient Descent (video)

previous article
Introduction to classification and logistic regression
next article
The problem of overfitting in machine learning algorithms
comments
Moose on February 08, 2019 at 05:21
Could you please write the hypothesis function with the different theta's described like you did with multivariable linear regression:
hθ(x)=θ0+θ1x1+θ2x2+...+θnxn

What does it look like for logistical regression?
Denson George on March 05, 2019 at 13:03
"There is also a mathematical proof for that, which is outside the scope of this introductory course."
Triangles,do you have a post/another source where I can read up on this? Can you link it to the post for curious people like me.

Many Thanks,
Denson
Triangles on March 08, 2019 at 10:35
@George my last-minute search led me to this: https://math.stackexchange.com/questions/1582452/logistic-regression-prove-that-the-cost-function-is-convex
Not exactly a formal mathematical paper, It could be a start though.
Salem on June 26, 2019 at 12:46
I have suggested a new algorithm to find the global optimum solution for nonlinear functions
yalamuri sreenivasulu reddy on July 03, 2019 at 14:44
hypothesis function for logistic regression is wrong it suppose to be h(theta) = 1/(1+e^(-theta'*x))
Praveen on August 10, 2019 at 16:02
Very clear explanation, Thank you
Carl on August 11, 2019 at 17:34
Do you know of a similar tutorial that is considering multiple classes than this binary case? I.e. using softmax expressions.
Triangles on August 14, 2019 at 12:23
@Carl Not yet, sorry!
hasan on September 03, 2019 at 14:42
You are missing a minus sign in the exponent in the hypothesis function of the logistic regression. The correct form should be:
$h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}}$
mahesh on April 16, 2020 at 16:01
Nice explanation. how does thetas learned using maximum likehood estimation
Zeeshan on May 04, 2020 at 21:34
In the last formula for cost function, the Summation sign should be outside the square bracket
Bharpur Singh on May 07, 2020 at 09:06
The logistic or Sigmoid function is written wrongly it should be negative of theta transpose x.
Roman on August 17, 2020 at 07:27
How do we jump from linear J to logistic J = -ylog(g(x)) - ylog(1-g(x)) ?
dinesh on March 04, 2021 at 16:27
why the cost function changes for logistic why not the gradient descent function ?