Like it!

Join me on Facebook!

Like it!

The cost function in logistic regression

Preparing the logistic regression algorithm for the actual implementation.

Other articles from this series

In the previous article "Introduction to classification and logistic regression" I outlined the mathematical basics of the logistic regression algorithm, whose task is to separate things in the training example by computing the decision boundary.

The decision boundary can be described by an equation. As in linear regression, the logistic regression algorithm will be able to find the best [texi]\theta[texi]s parameters in order to make the decision boundary actually separate the data points correctly. In this article we'll see how to compute those [texi]\theta[texi]s.

Suppose we have a generic training set

[tex]\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)}) \}[tex]

made of [texi]m[texi] training examples, where [texi](x^{(1)}, y^{(1)})[texi] is the 1st example and so on. More specifically, [texi]x^{(m)}[texi] is the input variable of the [texi]m[texi]-th example, while [texi]y^{(m)}[texi] is its output variable. Being this a classification problem, each example has of course the output [texi]y[texi] bound between [texi]0[texi] and [texi]1[texi]. In other words, [texi]y \in {0,1}[texi].

Each example is represented as usual by its feature vector

[tex] \vec{x} = \begin{bmatrix} x_0 \\ x_1 \\ \dots \\ x_n \end{bmatrix} [tex]

where [texi]x_0 = 1[texi] (the same old trick). This is a generic example, we don't know the exact number of features.

Finally we have the hypothesis function for logistic regression, as seen in the previous article:

[tex] h_\theta(x) = \frac{1}{1 + e^{\theta^{\top} x}} [tex]

Our task now is to choose the best parameters [texi]\theta[texi]s in the equation above, given the current training set, in order to minimize errors. Remember that [texi]\theta[texi] is not a single parameter: it expands to the equation of the decision boundary which can be a line or a more complex formula (with more [texi]\theta[texi]s to guess).

The procedure is similar to what we did for linear regression: define a cost function and try to find the best possible values of each [texi]\theta[texi] by minimizing the cost function output. The minimization will be performed by a gradient descent algorithm, whose task is to parse the cost function output until it finds the lowest minimum point.

The cost function used in linear regression won't work here

You might remember the original cost function [texi]J(\theta)[texi] used in linear regression. I can tell you right now that it's not going to work here with logistic regression. If you try to use the linear regression's cost function to generate [texi]J(\theta)[texi] in a logistic regression problem, you would end up with a non-convex function: a wierdly-shaped graph with no easy to find minimum global point, as seen in the picture below.

Non-convex function
1. An example of a non-convex function. The grey point on the right side shows a potential local minimum.

This strange outcome is due to the fact that in logistic regression we have the sigmoid function around, which is non-linear (i.e. not a line). With the [texi]J(\theta)[texi] depicted in figure 1. the gradient descent algorithm might get stuck in a local minimum point. That's why we still need a neat convex function as we did for linear regression: a bowl-shaped function that eases the gradient descent function's work to converge to the optimal minimum point.

A better cost function for logistic regression

Let me go back for a minute to the cost function we used in linear regression:

[tex] J(\vec{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 [tex]

which can be rewritten in a slightly different way:

[tex] J(\vec{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2 [tex]

Nothing scary happened: I've just moved the [texi]\frac{1}{2}[texi] next to the summation part. Now let's make it more general by defining a new function

[tex]\mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2[tex]

In words, a function [texi]\mathrm{Cost}[texi] that takes two parameters in input: [texi]h_\theta(x^{(i)})[texi] as hypothesis function and [texi]y^{(i)}[texi] as output. You can think of it as the cost the algorithm has to pay if it makes a prediction [texi]h_\theta(x^{(i)})[texi] while the actual label was [texi]y^{(i)}[texi].

With this new piece of the puzzle I can rewrite the cost function for the linear regression as follows:

[tex] J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) [tex]

However we know that the linear regression's cost function cannot be used in logistic regression problems. So what is this all about? Well, it turns out that for logistic regression we just have to find a different [texi]\mathrm{Cost}[texi] function, while the summation part stays the same.

Logistic regression cost function

For logistic regression, the [texi]\mathrm{Cost}[texi] function is defined as:

[tex] \mathrm{Cost}(h_\theta(x),y) = \begin{cases} -\log(h_\theta(x)) & \text{if y = 1} \\ -\log(1-h_\theta(x)) & \text{if y = 0} \end{cases} [tex]

The [texi]i[texi] indexes have been removed for clarity. In words this is the cost the algorithm pays if it predicts a value [texi]h_\theta(x)[texi] while the actual cost label turns out to be [texi]y[texi]. By using this function we will grant the convexity to the function the gradient descent algorithm has to process, as discussed above. There is also a mathematical proof for that, which is outside the scope of this introductory course.

In case [texi]y = 1[texi], the output (i.e. the cost to pay) approaches to 0 as [texi]h_\theta(x)[texi] approaches to 1. Conversely, the cost to pay grows to infinity as [texi]h_\theta(x)[texi] approaches to 0. You can clearly see it in the plot 2. below, left side. This is a desirable property: we want a bigger penalty as the algorithm predicts something far away from the actual value. If the label is [texi]y = 1[texi] but the algorithm predicts [texi]h_\theta(x) = 0[texi], the outcome is completely wrong.

Conversely, the same intuition applies when [texi]y = 0[texi], depicted in the plot 2. below, right side. Bigger penalties when the label is [texi]y = 0[texi] but the algorithm predicts [texi]h_\theta(x) = 1[texi].

Cost function for logistic regression
2. How the cost function for logistic regression looks like.

Additional cost function optimizations

What we have just seen is the verbose version of the cost function for logistic regression. We can make it more compact into a one-line expression: this will help avoiding boring if/else statements when converting the formula into an algorithm.

[tex] \mathrm{Cost}(h_\theta(x),y) = -y \log(h_\theta(x)) - (1 - y) \log(1-h_\theta(x)) [tex]

Proof: try to replace [texi]y[texi] with 0 and 1 and you will end up with the two pieces of the original function.

With the optimization in place, the logistic regression cost function can be rewritten as:

[tex] \begin{align} J(\theta) & = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \\ & = - \dfrac{1}{m} [\sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1-h_\theta(x^{(i)}))] \\ \end{align} [tex]

I've moved the minus sign outside to avoid additional parentheses.

Plugging the cost function and the gradient descent together

What's left? We have the hypothesis function and the cost function: we are almost done. It's now time to find the best values for [texi]\theta[texi]s parameters in the cost function, or in other words to minimize the cost function by running the gradient descent algorithm. The procedure is identical to what we did for linear regression.

More formally, we want to minimize the cost function:

[tex] \min_{\theta} J(\theta) [tex]

Which will output a set of parameters [texi]\theta[texi], the best ones (i.e. with less error). Once done, we will be ready to make predictions on new input examples with their features [texi]x[texi], by using the new [texi]\theta[texi]s in the hypothesis function:

[tex] h_\theta(x) = \frac{1}{1 + e^{\theta^{\top} x}} [tex]

Where [texi]h_\theta(x)[texi] is the output, the prediction, or yet the probability that [texi]y = 1[texi].

The way we are going to minimize the cost function is by using the gradient descent. The good news is that the procedure is 99% identical to what we did for linear regression.

To minimize the cost function we have to run the gradient descent function on each parameter:

[tex] \begin{align} \text{repeat until convergence \{} \\ \theta_j & := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \\ \text{\}} \end{align} [tex]

Remember to simultaneously update all [texi]\theta_j[texi] as we did in the linear regression counterpart: if you have [texi] n[texi] features, that is a feature vector [texi]\vec{\theta} = [\theta_0, \theta_1, \cdots \theta_n][texi], all those parameters have to be updated simultaneously on each iteration:

[tex] \begin{align} \text{repeat until convergence \{} \\ \theta_0 & := \cdots \\ \theta_1 & := \cdots \\ \cdots \\ \theta_n & := \cdots \\ \text{\}} \end{align} [tex]

Back to the algorithm, I'll spare you the computation of the daunting derivative [texi]\frac{\partial}{\partial \theta_j} J(\theta)[texi], which becomes:

[tex] \frac{\partial}{\partial \theta_j} J(\theta) = \dfrac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} [tex]

So the loop above can be rewritten as:

[tex] \begin{align} \text{repeat until convergence \{} \\ \theta_j & := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \\ \text{\}} \end{align} [tex]

Surprisingly, it looks identical to what we were doing for the multivariate linear regression. What's changed however is the definition of the hypothesis [texi]h_\theta(x)[texi]: for linear regression we had [texi]h_\theta(x) = \theta^{\top}{x}[texi], whereas for logistic regression we have [texi]h_\theta(x) = \frac{1}{1 + e^{\theta^{\top} x}}[texi].

From now on you can apply the same techniques to optimize the gradient descent algorithm we have seen for linear regression, to make sure the conversion to the minimum point works correctly. In the next chapter I will delve into some advanced optimization tricks, as well as defining and avoiding the problem of overfitting.

Sources

Machine Learning Course @ Coursera - Cost function (video)
Machine Learning Course @ Coursera - Simplified Cost Function and Gradient Descent (video)

previous article
Introduction to classification and logistic regression
next article
The problem of overfitting in machine learning algorithms
comments