
h1The cost function in logistic regression

Preparing the logistic regression algorithm for the actual implementation.

Other articles from this series
• What machine learning is about, types of learning and classification algorithms, introductory examples.

• Finding the best-fitting straight line through points of a data set.

• How to find the minimum of a function using an iterative algorithm.

• It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.

• How to upgrade a linear regression algorithm from one to many input variables.

• A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.

• Get your feet wet with another fundamental machine learning algorithm for binary classification.

• Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.

In the previous article "Introduction to classification and logistic regression" I outlined the mathematical basics of the logistic regression algorithm, whose task is to separate the examples in the training set by computing a decision boundary.

The decision boundary can be described by an equation. As in linear regression, the logistic regression algorithm will be able to find the best parameters math$\theta$ so that the decision boundary actually separates the data points correctly. In this article we'll see how to compute those math$\theta$s.

Suppose we have a generic training set

math$$\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)}) \}$$

made of math$m$ training examples, where math$(x^{(1)}, y^{(1)})$ is the 1st example and so on. More specifically, math$x^{(m)}$ is the input variable of the math$m$-th example, while math$y^{(m)}$ is its output variable. Since this is a classification problem, each example's output math$y$ can only take the value math$0$ or math$1$. In other words, math$y \in \{0,1\}$.

Each example is represented as usual by its feature vector

math$$\vec{x} = \begin{bmatrix} x_0 \\ x_1 \\ \dots \\ x_n \end{bmatrix}$$

where math$x_0 = 1$ (the same old trick). This is a generic example: we don't know the exact number of features.

Finally we have the hypothesis function for logistic regression, as seen in the previous article:

math$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}}$$

Our task now is to choose the best parameters math$\theta$ in the equation above, given the current training set, in order to minimize errors. Remember that math$\theta$ is not a single parameter: it expands to the equation of the decision boundary, which can be a line or a more complex formula (with more math$\theta$s to guess).
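Before moving on, here is a minimal sketch of the hypothesis function in Python with NumPy (the function and variable names are mine, not from the course):

```python
import numpy as np

def hypothesis(theta, x):
    """Sigmoid hypothesis: h_theta(x) = 1 / (1 + e^(-theta^T x)).

    theta and x are NumPy vectors of the same length,
    with x[0] = 1 as the usual bias trick.
    """
    return 1.0 / (1.0 + np.exp(-theta.dot(x)))
```

Whatever the inputs, the output is always squeezed between 0 and 1, which is exactly what we need for a probability-like prediction.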

The procedure is similar to what we did for linear regression: define a cost function and try to find the best possible value of each math$\theta$ by minimizing the cost function output. The minimization will be performed by a gradient descent algorithm, whose task is to explore the cost function output until it finds the lowest point.

h3The cost function used in linear regression won't work here

You might remember the original cost function math$J(\theta)$ used in linear regression. I can tell you right now that it's not going to work here with logistic regression. If you try to use the linear regression's cost function to generate math$J(\theta)$ in a logistic regression problem, you would end up with a non-convex function: a weirdly-shaped graph with no easy-to-find global minimum point, as seen in the picture below.

1. An example of a non-convex function. The grey point on the right side shows a potential local minimum.

This strange outcome is due to the fact that in logistic regression we have the sigmoid function around, which is non-linear (i.e. not a line). With the math$J(\theta)$ depicted in figure 1. the gradient descent algorithm might get stuck in a local minimum point. That's why we need a neat convex function here too, as in linear regression: a bowl-shaped function that eases the gradient descent algorithm's work of converging to the optimal minimum point.

h2A better cost function for logistic regression

Let me go back for a minute to the cost function we used in linear regression:

math$$J(\vec{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

which can be rewritten in a slightly different way:

math$$J(\vec{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2$$

Nothing scary happened: I've just moved the math$\frac{1}{2}$ next to the summation part. Now let's make it more general by defining a new function

math$$\mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2$$

In words, math$\mathrm{Cost}$ is a function that takes two inputs: the prediction math$h_\theta(x^{(i)})$ and the actual label math$y^{(i)}$. You can think of it as the cost the algorithm has to pay if it makes a prediction math$h_\theta(x^{(i)})$ while the actual label was math$y^{(i)}$.

With this new piece of the puzzle I can rewrite the cost function for the linear regression as follows:

math$$J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)})$$

However we know that the linear regression's cost function cannot be used in logistic regression problems. So what is this all about? Well, it turns out that for logistic regression we just have to find a different math$\mathrm{Cost}$ function, while the summation part stays the same.

h3Logistic regression cost function

For logistic regression, the math$\mathrm{Cost}$ function is defined as:

math$$\mathrm{Cost}(h_\theta(x),y) = \begin{cases} -\log(h_\theta(x)) & \text{if y = 1} \\ -\log(1-h_\theta(x)) & \text{if y = 0} \end{cases}$$

The math$i$ indexes have been removed for clarity. In words, this is the cost the algorithm pays if it predicts a value math$h_\theta(x)$ while the actual label turns out to be math$y$. By using this function we guarantee that the function the gradient descent algorithm has to process is convex, as discussed above. There is also a mathematical proof of that, which is outside the scope of this introductory course.

In case math$y = 1$, the output (i.e. the cost to pay) approaches 0 as math$h_\theta(x)$ approaches 1. Conversely, the cost to pay grows to infinity as math$h_\theta(x)$ approaches 0. You can clearly see it in plot 2. below, left side. This is a desirable property: we want a bigger penalty as the algorithm predicts something far away from the actual value. If the label is math$y = 1$ but the algorithm predicts math$h_\theta(x) = 0$, the outcome is completely wrong.

The same intuition applies when math$y = 0$, depicted in plot 2. below, right side: bigger penalties when the label is math$y = 0$ but the algorithm predicts math$h_\theta(x) = 1$.

2. What the cost function for logistic regression looks like.
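To see those penalties numerically, here is a minimal sketch of the piecewise cost in Python (the function name is mine, not from the course):

```python
import math

def cost(prediction, label):
    """Per-example logistic cost:
    -log(h) when y = 1, -log(1 - h) when y = 0."""
    if label == 1:
        return -math.log(prediction)
    return -math.log(1.0 - prediction)

# A confident, correct prediction costs almost nothing...
print(cost(0.99, 1))   # ~0.01
# ...while a confident, wrong one is heavily penalized.
print(cost(0.01, 1))   # ~4.6
```

Swapping the label flips the behavior: `cost(0.01, 0)` is tiny, `cost(0.99, 0)` is huge.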

h3Additional cost function optimizations

What we have just seen is the verbose version of the cost function for logistic regression. We can compress it into a single one-line expression: this will help us avoid clumsy if/else statements when converting the formula into an algorithm.

math$$\mathrm{Cost}(h_\theta(x),y) = -y \log(h_\theta(x)) - (1 - y) \log(1-h_\theta(x))$$

Proof: try to replace math$y$ with 0 and 1 and you will end up with the two pieces of the original function.
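Substituting the two possible values of math$y$ makes the equivalence explicit:

math\begin{align} y = 1: \quad & -(1)\log(h_\theta(x)) - (1 - 1)\log(1-h_\theta(x)) = -\log(h_\theta(x)) \\ y = 0: \quad & -(0)\log(h_\theta(x)) - (1 - 0)\log(1-h_\theta(x)) = -\log(1-h_\theta(x)) \end{align}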

With the optimization in place, the logistic regression cost function can be rewritten as:

math\begin{align} J(\theta) & = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \\ & = - \dfrac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1-h_\theta(x^{(i)})) \right] \end{align}

I've moved the minus sign outside to avoid additional parentheses.
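Putting the compact form and the summation together, the whole math$J(\theta)$ can be sketched in vectorized NumPy as follows (the function names and the matrix layout are my own choices):

```python
import numpy as np

def sigmoid(z):
    """The logistic (sigmoid) function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def cost_function(theta, X, y):
    """Logistic regression cost J(theta).

    X is the (m x n+1) design matrix with a leading column of ones,
    y is the vector of 0/1 labels, theta the parameter vector.
    """
    m = len(y)
    h = sigmoid(X.dot(theta))  # predictions for all m examples at once
    return -(1.0 / m) * (y.dot(np.log(h)) + (1 - y).dot(np.log(1 - h)))
```

The dot products replace the explicit summation over the math$m$ examples, which is both shorter and faster than a Python loop.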

h2Plugging the cost function and the gradient descent together

What's left? We have the hypothesis function and the cost function: we are almost done. It's now time to find the best values for math$\theta$s parameters in the cost function, or in other words to minimize the cost function by running the gradient descent algorithm. The procedure is identical to what we did for linear regression.

More formally, we want to minimize the cost function:

math$$\min_{\theta} J(\theta)$$

This will output the set of parameters math$\theta$ with the least error. Once done, we will be ready to make predictions on new input examples, with their features math$x$, by plugging the new math$\theta$s into the hypothesis function:

math$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}}$$

Where math$h_\theta(x)$ is the output, i.e. the prediction, or the probability that math$y = 1$.

The way we are going to minimize the cost function is by using the gradient descent. The good news is that the procedure is 99% identical to what we did for linear regression.

To minimize the cost function we have to run the gradient descent function on each parameter:

math\begin{align} \text{repeat until convergence \{} \\ \theta_j & := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \\ \text{\}} \end{align}

Remember to simultaneously update all math$\theta_j$ as we did in the linear regression counterpart: if you have math$n$ features, that is, a parameter vector math$\vec{\theta} = [\theta_0, \theta_1, \cdots, \theta_n]$, all those parameters have to be updated simultaneously on each iteration:

math\begin{align} \text{repeat until convergence \{} \\ \theta_0 & := \cdots \\ \theta_1 & := \cdots \\ \cdots \\ \theta_n & := \cdots \\ \text{\}} \end{align}

Back to the algorithm, I'll spare you the computation of the daunting derivative math$\frac{\partial}{\partial \theta_j} J(\theta)$, which becomes:

math$$\frac{\partial}{\partial \theta_j} J(\theta) = \dfrac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$

So the loop above can be rewritten as:

math\begin{align} \text{repeat until convergence \{} \\ \theta_j & := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \\ \text{\}} \end{align}

Surprisingly, it looks identical to what we were doing for the multivariate linear regression. What's changed however is the definition of the hypothesis math$h_\theta(x)$: for linear regression we had math$h_\theta(x) = \theta^{\top}{x}$, whereas for logistic regression we have math$h_\theta(x) = \frac{1}{1 + e^{\theta^{\top} x}}$.
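The loop above translates almost line for line into vectorized NumPy. Here is a rough sketch: the learning rate, the iteration count, and the fixed-iterations stopping rule (in place of a proper convergence test) are my own simplifications:

```python
import numpy as np

def sigmoid(z):
    """The logistic (sigmoid) function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Minimize J(theta) by batch gradient descent.

    X is the (m x n+1) design matrix with a leading column of ones,
    y the vector of 0/1 labels. Each step updates every theta_j
    simultaneously via the vectorized (1/m) * sum((h - y) * x_j).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X.dot(theta))          # predictions, shape (m,)
        gradient = X.T.dot(h - y) / m      # all partial derivatives at once
        theta = theta - alpha * gradient   # simultaneous update
    return theta
```

Computing the whole gradient as `X.T.dot(h - y) / m` is what guarantees the simultaneous update: every math$\theta_j$ is derived from the same old math$\theta$ before any of them changes.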

From now on you can apply the same techniques to optimize the gradient descent algorithm we have seen for linear regression, to make sure the convergence to the minimum point works correctly. In the next chapter I will delve into some advanced optimization tricks, as well as defining and avoiding the problem of overfitting.

h2Sources

Machine Learning Course @ Coursera - Cost function (video)
Machine Learning Course @ Coursera - Simplified Cost Function and Gradient Descent (video)
