Join me on Facebook!
— Written by Triangles on July 17, 2017 • updated on January 24, 2019 • ID 55 —
Get your feet wet with another fundamental machine learning algorithm for binary classification.
Introduction to machine learning — What machine learning is about, types of learning and classification algorithms, introductory examples.
Linear regression with one variable — Finding the best-fitting straight line through points of a data set.
The gradient descent function — How to find the minimum of a function using an iterative algorithm.
The gradient descent in action — It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.
Multivariate linear regression — How to upgrade a linear regression algorithm from one to many input variables.
How to optimize the gradient descent algorithm — A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.
The cost function in logistic regression — Preparing the logistic regression algorithm for the actual implementation.
The problem of overfitting in machine learning algorithms — Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.
Classification is another big family of machine learning algorithms. Unlike regression, where the output can take on continue values, in classification your algorithms produce discrete outcomes: one/zero, yes/no, do/don't and so on.
Spam versus non-spam emails is a traditional example of classification task: you want to predict if the email fed to your program is spammy or not, where usually [texi]0[texi] means not spam and [texi]1[texi] means spam.
Formally, we want to predict a variable [texi]y \in \{0,1\}[texi], where [texi]0[texi] is called negative class, while [texi]1[texi] is called positive class. Such task is known as binary classification.
Other classification problems might require more than a binary output, for example where [texi]y \in \{0,1,2,3\}[texi]. Such task is known as a multiclass classification.
Let's start from how not to do things. In classification problems, linear regression performs very poorly and when it works it's usually a stroke of luck. The main reason is that in classification, unlike in regression, you don't have to choose the best line through a set of points, but rather you want to somehow separate those points.
Say for example that you are playing with image recognition: given a bunch of photos of bananas, you want to tell whether they are ripe or not, given the color. You collect the data and plot it as in the left picture in figure 1. below. Every white dot is an element in the training set; after the linear regression algorithm has been run on it, you end up with the well-known hypothesis function [texi]h_\theta(x)[texi] depicted by the dashed line.
You could now say that when [texi]h_\theta(x) \geq 0.5[texi] the outcome is [texi]1[texi] (ripe), and [texi]0[texi] otherwise, as marked by the empty circle on the hypothesis line. Everything appears to be in order: above a certain threshold of yellowness, the bananas are ripe, below they are not.
However, as more training examples are added (picture 1., right side), they might alter the original slope of the hypothesis line, thus distorting the threshold. That's the main reason why applying linear regression to classification problems often it's not a good idea.
Logistic regression is a more performant algorithm used in classification problems. The most important feature is its ability to produce a sort of hard-limited hypothesis function: [texi]0 \leq h_\theta(x) \leq 1[texi].
The hypothesis function is slightly different from the one used in linear regression. For logistic regression,
[tex]h_\theta(x) = g(\theta^{\top} x)[tex]
which is the traditional hypothesis function processed by a new function [texi]g[texi], defined as:
[tex]g(z) = \frac{1}{1 + e^{-z}}[tex]
It is called sigmoid function or logistic function and looks like the picture 2.:
In a sigmoid function, as the input [texi]z[texi] goes to [texi]- \infty[texi], the output [texi]g(z)[texi] approaches to [texi]0[texi]; as [texi]z[texi] goes to [texi]+ \infty[texi], the output [texi]g(z)[texi] approaches to [texi]1[texi]. This works like an output limiter which makes the hypothesis function bound between [texi]0[texi] and [texi]1[texi].
For clarity let me unroll the hypothesis function plugged into the sigmoid:
[tex]h_\theta(x) = g(\theta^{\top} x)[tex]
becomes
[tex]h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}}[tex]
But what's the meaning of it?
The logistic regression's hypothesis function outputs a number between [texi]0[texi] and [texi]1[texi]. You can think of it as the estimated probability that [texi]y = 1[texi] on a new example [texi]x[texi] in input.
Let's go back to the bananas classification task. It is a single-feature problem (i.e. the "yellowness"), so my feature vector [texi]x[texi] is defined as
[tex] \vec{x} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} [tex]
Where the first feature [texi]x_0 = 1[texi] is just a trick I've explained in one of the previous articles and [texi]x_1[texi] is the "yellowness" of each banana.
It's now time to define whether a new picture of a banana given in input to my algorithm is ripe or not. I grab its feature vector and plug it into the hypothesis function, which magically produces some outcome, say:
[tex]h_\theta(x) = 0.9[tex]
The hypothesis is telling me that for the new banana in input the probability that [texi]y = 1[texi] is [texi]0.9[texi]. In other words, the new banana in the picture has 90% chance of being ripe.
Let me write it more formally and generally:
[tex]h_\theta(x) = P(y = 1 | x; \theta)[tex]
In words, the hypothesis function tells you the probability that [texi]y = 1[texi] given [texi]x[texi] (i.e. given the new banana in input with a yellowness represented by the feature [texi]x[texi]), parametrized by [texi]\theta[texi] (i.e. whose parameters are [texi]\theta[texi]).
Since the outcome [texi]y[texi] is restricted between two values [texi]0[texi] and [texi]1[texi], I can compute the probability that [texi]y = 0[texi] as well. Let's figure out the following statement:
[tex]P(y = 0 | x; \theta) + P(y = 1 | x; \theta) = 1[tex]
The sum of each probability, namely of [texi]y[texi] being [texi]0[texi] and [texi]y[texi] being [texi]1[texi] is of course [texi]1[texi] (or 100%). So by moving one term on the other side of the equation, we will end up with:
[tex]P(y = 0 | x; \theta) = 1 - P(y = 1 | x; \theta)[tex]
Back to the banana example, we had an estimated probability of [texi]0.9[texi], or:
[tex]h_\theta(x) = P(y = 1 | x; \theta) = 0.9[tex]
Let's find the probability of [texi]y[texi] being [texi]0[texi], that is an unripe banana:
[tex]P(y = 0 | x; \theta) = 1 - 0.9 = 0.1[tex]
In other words, the new banana has 10% chance of being unripe.
So far I've talked about percentages: where is the binary output? It's very simple to obtain. We already know that the hypothesis function [texi]h_\theta(x) = g(\theta^{\top} x)[texi] churns out values in range [texi][0 ,1][texi]: let's make the following assumption:
That's kind of intuitive. Now, when exactly is [texi]h_\theta(x)[texi] above or below [texi]0.5[texi]? If you look at the generic sigmoid function above (picture 2.), you'll notice that [texi]g(z) \geq 0.5 [texi] whenever [texi]z \geq 0[texi]. Since the hypothesis function for logistic regression is defined as [texi]h_\theta(x) = g(\theta^{\top} x)[texi], I can clearly state that [texi]h_\theta(x) = g(\theta^{\top} x) \geq 0.5[texi] whenever [texi]\theta^{\top} x \geq 0[texi] (because [texi]z = \theta^{\top} x[texi], right?).
You can easily figure out when [texi]g(z) < 0.5 [texi]. I can now update the assumptions above:
Let's now glue together all those pieces and take a look at a graphical example of logistic regression. Suppose you have a training set as in picture 3. below. Two features [texi]x_1[texi] and [texi]x_2[texi] — size and yellowness of a banana for example — and a bunch of items scattered around the plane — each point is a banana in the training set, where empy points represent unripe bananas.
The task of the logistic regression algorithm is to separate those points, by drawing a line between them. That line is technically called decision boundary and it is used to infer about the data set: everything that is below the decision boundary belongs to category A and the remaining part belongs to category B.
The decision boundary is a line, hence it can be described by an equation. As in linear regression, the logistic regression algorithm will be able to find the best [texi]\theta[texi]s parameters in order to make the decision boundary actually separate the data points correctly.
We still don't know how to compute those parameters — I will talk about it in the next chapter. For now suppose that we know the hypothesis function:
[tex] h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2) [tex]
and through some magical procedures we end up with the following parameters:
[tex] \begin{align} \theta_0 & = -3 \\ \theta_1 & = 1 \\ \theta_2 & = 1 \end{align} [tex]
which form the usual parameter vector like:
[tex] \vec{\theta} = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix} [tex]
We know since the beginning that [texi]h_\theta(x) = g(\theta^{\top} x)[texi] and for the current example [texi]\theta^{\top} x = \theta_0 + \theta_1x_1 + \theta_2x_2[texi] which is, not surpisingly, the equation of a line (the decision boundary). We also know from the previous paragraph that we can predict [texi]y = 1[texi] whenever [texi]\theta^{\top} x \geq 0[texi]. Thus can state that:
[tex] y = 1 \ \ \text{if} \ \ \theta_0 + \theta_1x_1 + \theta_2x_2 \geq 0 [tex]
And for our specific example:
[tex] \begin{align} & y = 1 \ \ \text{if} \ \ \ -3 + x_1 + x_2 \geq 0 \\ & y = 1 \ \ \text{if} \ \ \ x_1 + x_2 \geq 3 \end{align} [tex]
If you draw the equation [texi]x_1 + x_2 = 3[texi] you will obtain the decision boundary line in picture 3. above.
In words: given the current training set, pick a new example from the real world with features [texi]x_1,\ x_2[texi] — a banana with a certain level of yellowness and a certain size. It will be classified as 1 ([texi]y = 1[texi] i.e. ripe) whenever it satisfies the inequation [texi]x_1 + x_2 \geq 3[texi], that is whenever it lies above ([texi]\geq[texi]) the decision boundary. It will be classified as 0 ([texi]y = 0[texi] i.e. unripe) otherwise.
Here the power of the sigmoid function emerges. It does not matter how far the new example lies from the decision boundary: [texi]y = 1[texi] if above; else [texi]y = 0[texi].
I made up the previous example: I already knew the shape of the decision boundary (a line) and its equation. In the real world you will find oddly-shaped decision boundaries, as a simple straight line doesn't always separate things well. Those shapes are usually described by polynomial equations.
Logistic regression has no built-in ability to define the decision boundary's equation. Usually it's good practice to look at data and figure out the appropriate shape to implement: that's the so-called data exploration. The gradient boosted logistic regression, an advanced machine learning algorithm, has the ability to generate the decision boundary by itself. I will look into it in future sessions.
Machine Learning Course @ Coursera - Classification (link)
Machine Learning Course @ Coursera - Hypothesis representation (link)
Machine Learning Course @ Coursera - Decision boundary (link)
Cross Validated - How is the decision boundary's equation determined? (link)
I really enjoyed reading the article as the explanation is clear as the soul of a newborn child, the language is nice, without any mindblowing math words, and even the design of the page is amazing! You did a great job and made my life much easier!
Super tiny but: it seems to me that you missed the minus sign in the formula of the hypothesis function in the "Logistic regression to the rescue" section :p