# Introduction to classification and logistic regression

Get your feet wet with another fundamental machine learning algorithm for binary classification.

Other articles from this series
• What machine learning is about, types of learning and classification algorithms, introductory examples.

• Finding the best-fitting straight line through points of a data set.

• How to find the minimum of a function using an iterative algorithm.

• It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.

• How to upgrade a linear regression algorithm from one to many input variables.

• A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.

• Preparing the logistic regression algorithm for the actual implementation.

• Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.

Classification is another big family of machine learning algorithms. Unlike regression, where the output can take on continuous values, in classification your algorithms produce discrete outcomes: one/zero, yes/no, do/don't and so on.

Spam versus non-spam emails is a traditional example of a classification task: you want to predict whether the email fed to your program is spam or not, where usually $0$ means not spam and $1$ means spam.

Formally, we want to predict a variable $y \in \{0,1\}$, where $0$ is called the negative class, while $1$ is called the positive class. Such a task is known as binary classification.

Other classification problems might require more than a binary output, for example where $y \in \{0,1,2,3\}$. Such a task is known as multiclass classification.

## Linear regression doesn't work with classification

Let's start from how not to do things. In classification problems, linear regression performs very poorly and when it works it's usually a stroke of luck. The main reason is that in classification, unlike in regression, you don't have to choose the best line through a set of points, but rather you want to somehow separate those points.

Say for example that you are playing with image recognition: given a bunch of photos of bananas, you want to tell whether they are ripe or not, given the color. You collect the data and plot it as in the left picture of figure 1 below. Every white dot is an element in the training set; after the linear regression algorithm has been run on it, you end up with the well-known hypothesis function $h_\theta(x)$ depicted by the dashed line.

You could now say that when $h_\theta(x) \geq 0.5$ the outcome is $1$ (ripe), and $0$ otherwise, as marked by the empty circle on the hypothesis line. Everything appears to be in order: above a certain threshold of yellowness, the bananas are ripe, below they are not.

However, as more training examples are added (figure 1, right side), they might alter the original slope of the hypothesis line, thus distorting the threshold. That's the main reason why applying linear regression to classification problems is often not a good idea.

Figure 1. First attempt to handle a classification problem with linear regression. Outliers distort the hypothesis line.
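The effect can be reproduced numerically. The following is a minimal sketch with made-up yellowness scores (the data and the helper name `threshold_from_linear_fit` are assumptions for illustration, not part of the original example): it fits a least-squares line with NumPy and reports the yellowness at which the line crosses $0.5$, before and after a few extra, very yellow bananas are added.

```python
import numpy as np

def threshold_from_linear_fit(x, y):
    """Fit y ~ theta0 + theta1 * x by least squares and
    return the x value where the fitted line crosses 0.5."""
    theta1, theta0 = np.polyfit(x, y, deg=1)  # highest degree first
    return (0.5 - theta0) / theta1

# Made-up yellowness scores (arbitrary scale); 0 = unripe, 1 = ripe.
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
threshold_before = threshold_from_linear_fit(x, y)

# Add three clearly ripe bananas far to the right of the others.
x_more = np.append(x, [30.0, 35.0, 40.0])
y_more = np.append(y, [1.0, 1.0, 1.0])
threshold_after = threshold_from_linear_fit(x_more, y_more)

print(f"threshold before: {threshold_before:.2f}")
print(f"threshold after:  {threshold_after:.2f}")
```

With this particular data the threshold moves to the right of $6$, so a banana with yellowness $6$, labeled ripe in the training set, would now be predicted as unripe even though nothing about it changed.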

## Logistic regression to the rescue

Logistic regression is an algorithm better suited to classification problems. Its most important feature is the ability to produce a bounded hypothesis function: $0 \leq h_\theta(x) \leq 1$.

The hypothesis function is slightly different from the one used in linear regression. For logistic regression,

$$h_\theta(x) = g(\theta^{\top} x)$$

which is the traditional hypothesis function processed by a new function $g$, defined as:

$$g(z) = \frac{1}{1 + e^{-z}}$$

It is called the sigmoid function or logistic function and looks like figure 2:

Figure 2. The sigmoid function. Credits: Wikipedia.

In a sigmoid function, as the input $z$ goes to $- \infty$, the output $g(z)$ approaches $0$; as $z$ goes to $+ \infty$, the output $g(z)$ approaches $1$. This works like an output limiter which keeps the hypothesis function bounded between $0$ and $1$.
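To get a feel for the limiting behavior, here is a small sketch of the sigmoid in plain Python (the function name `sigmoid` is my own choice). It evaluates $g(z)$ at a few inputs; note that this naive form can overflow in `math.exp` for very large negative $z$, which is fine for a demonstration but worth handling in real code.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs give values near 0, large positive inputs
# give values near 1, and g(0) is exactly 0.5.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"g({z:+.0f}) = {sigmoid(z):.5f}")
```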

For clarity let me unroll the hypothesis function plugged into the sigmoid:

$$h_\theta(x) = g(\theta^{\top} x)$$

becomes

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}}$$

But what's the meaning of it?

## Interpretation of the new hypothesis output

The logistic regression's hypothesis function outputs a number between $0$ and $1$. You can think of it as the estimated probability that $y = 1$ for a new input example $x$.

Let's go back to the banana classification task. It is a single-feature problem (the feature being the "yellowness"), so my feature vector $x$ is defined as

$$\vec{x} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}$$

where the first feature $x_0 = 1$ is just a trick I've explained in one of the previous articles and $x_1$ is the "yellowness" of each banana.

It's now time to define whether a new picture of a banana given in input to my algorithm is ripe or not. I grab its feature vector and plug it into the hypothesis function, which magically produces some outcome, say:

$$h_\theta(x) = 0.9$$

The hypothesis is telling me that for the new banana in input the probability that $y = 1$ is $0.9$. In other words, the new banana in the picture has a 90% chance of being ripe.

Let me write it more formally and generally:

$$h_\theta(x) = P(y = 1 | x; \theta)$$

In words, the hypothesis function tells you the probability that $y = 1$ given $x$ (i.e. given the new banana in input with a yellowness represented by the feature $x$), parametrized by $\theta$ (i.e. whose parameters are $\theta$).

Since the outcome $y$ is restricted to the two values $0$ and $1$, I can compute the probability that $y = 0$ as well. Consider the following statement:

$$P(y = 0 | x; \theta) + P(y = 1 | x; \theta) = 1$$

The sum of the two probabilities, namely of $y$ being $0$ and $y$ being $1$, is of course $1$ (or 100%). So by moving one term to the other side of the equation, we end up with:

$$P(y = 0 | x; \theta) = 1 - P(y = 1 | x; \theta)$$

Back to the banana example, we had an estimated probability of $0.9$, or:

$$h_\theta(x) = P(y = 1 | x; \theta) = 0.9$$

Let's find the probability of $y$ being $0$, that is an unripe banana:

$$P(y = 0 | x; \theta) = 1 - 0.9 = 0.1$$

In other words, the new banana has a 10% chance of being unripe.
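The same computation can be sketched in code. The parameter values and the yellowness score below are hypothetical, picked only so that the hypothesis comes out at roughly $0.9$; the article has not yet shown how real parameters are learned.

```python
import math

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): estimated probability that y = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

theta = [-4.0, 1.0]  # hypothetical parameters, not learned from data
x = [1.0, 6.2]       # x0 = 1 (the usual trick), x1 = made-up yellowness

p_ripe = hypothesis(theta, x)  # P(y = 1 | x; theta), roughly 0.9 here
p_unripe = 1.0 - p_ripe        # P(y = 0 | x; theta)

print(f"P(ripe)   = {p_ripe:.3f}")
print(f"P(unripe) = {p_unripe:.3f}")
```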

## Defining the binary output

So far I've talked about probabilities: where is the binary output? It's very simple to obtain. We already know that the hypothesis function $h_\theta(x) = g(\theta^{\top} x)$ churns out values in the range $[0, 1]$, so let's make the following assumption:

• if $h_\theta(x) \geq 0.5$ predict $y = 1$
• if $h_\theta(x) < 0.5$ predict $y = 0$

That's kind of intuitive. Now, when exactly is $h_\theta(x)$ above or below $0.5$? If you look at the generic sigmoid function above (figure 2), you'll notice that $g(z) \geq 0.5$ whenever $z \geq 0$. Since the hypothesis function for logistic regression is defined as $h_\theta(x) = g(\theta^{\top} x)$, I can clearly state that $h_\theta(x) = g(\theta^{\top} x) \geq 0.5$ whenever $\theta^{\top} x \geq 0$ (because $z = \theta^{\top} x$, right?).
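The same fact can also be checked without the plot, as a short derivation from the definition of $g$. Each step is reversible because $1 + e^{-z}$ is always positive:

\begin{align} g(z) \geq 0.5 \ & \iff \ \frac{1}{1 + e^{-z}} \geq \frac{1}{2} \\ & \iff \ 1 + e^{-z} \leq 2 \\ & \iff \ e^{-z} \leq 1 \\ & \iff \ z \geq 0 \end{align}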

By the same reasoning, $g(z) < 0.5$ whenever $z < 0$. I can now update the assumptions above:

• if $h_\theta(x) \geq 0.5$ (i.e. $\theta^{\top} x \geq 0$) predict $y = 1$
• if $h_\theta(x) < 0.5$ (i.e. $\theta^{\top} x < 0$) predict $y = 0$
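The two rules above can be sketched as two tiny functions that should agree with each other: one thresholds the probability at $0.5$, the other only looks at the sign of $\theta^{\top} x$. The parameters and examples are hypothetical, and the function names are my own.

```python
import math

def dot(theta, x):
    """theta^T x for two equally long sequences."""
    return sum(t * xi for t, xi in zip(theta, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_from_probability(theta, x):
    """Predict 1 when h_theta(x) >= 0.5, else 0."""
    return 1 if sigmoid(dot(theta, x)) >= 0.5 else 0

def predict_from_sign(theta, x):
    """Predict 1 when theta^T x >= 0, else 0 (no sigmoid needed)."""
    return 1 if dot(theta, x) >= 0 else 0

theta = [-4.0, 1.0]  # hypothetical parameters
examples = [[1.0, 2.0], [1.0, 4.0], [1.0, 6.2]]  # x0 = 1, made-up x1

for x in examples:
    print(x[1], predict_from_probability(theta, x), predict_from_sign(theta, x))
```

The second form is cheaper, since it skips the exponential entirely; the probability is only needed when you care about how confident the prediction is.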

## The decision boundary

Let's now glue together all those pieces and take a look at a graphical example of logistic regression. Suppose you have a training set as in figure 3 below: two features $x_1$ and $x_2$ (size and yellowness of a banana, for example) and a bunch of items scattered around the plane. Each point is a banana in the training set, where empty points represent unripe bananas.

The task of the logistic regression algorithm is to separate those points by drawing a line between them. That line is technically called the decision boundary and it is used to make inferences about the data set: everything that is below the decision boundary belongs to category A and the remaining part belongs to category B.

Figure 3. A simple logistic regression problem with two features and the decision boundary depicted as the dotted line.

The decision boundary is a line, hence it can be described by an equation. As in linear regression, the logistic regression algorithm will be able to find the best parameters $\theta$ in order to make the decision boundary actually separate the data points correctly.

We still don't know how to compute those parameters; I will talk about it in the next chapter. For now suppose that we know the hypothesis function:

$$h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2)$$

and through some magical procedures we end up with the following parameters:

\begin{align} \theta_0 & = -3 \\ \theta_1 & = 1 \\ \theta_2 & = 1 \end{align}

which form the usual parameter vector like:

$$\vec{\theta} = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$$

We have known since the beginning that $h_\theta(x) = g(\theta^{\top} x)$, and for the current example $\theta^{\top} x = \theta_0 + \theta_1x_1 + \theta_2x_2$. Setting this expression equal to zero gives, not surprisingly, the equation of a line (the decision boundary). We also know from the previous paragraph that we can predict $y = 1$ whenever $\theta^{\top} x \geq 0$. Thus we can state that:

$$y = 1 \ \ \text{if} \ \ \theta_0 + \theta_1x_1 + \theta_2x_2 \geq 0$$

And for our specific example:

\begin{align} & y = 1 \ \ \text{if} \ \ \ -3 + x_1 + x_2 \geq 0 \\ & y = 1 \ \ \text{if} \ \ \ x_1 + x_2 \geq 3 \end{align}

If you draw the equation $x_1 + x_2 = 3$ you will obtain the decision boundary line in figure 3 above.

In words: given the current training set, pick a new example from the real world with features $x_1,\ x_2$ (a banana with a certain size and a certain level of yellowness). It will be classified as 1 ($y = 1$, i.e. ripe) whenever it satisfies the inequality $x_1 + x_2 \geq 3$, that is whenever it lies on or above the decision boundary. It will be classified as 0 ($y = 0$, i.e. unripe) otherwise.
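As a quick check of the rule, here is a minimal sketch that applies the parameters from this example, $\theta = (-3, 1, 1)$, to a handful of made-up points (the points and the function name `classify` are assumptions for illustration).

```python
THETA = (-3.0, 1.0, 1.0)  # theta_0, theta_1, theta_2 from the example

def classify(x1, x2, theta=THETA):
    """Return 1 if theta_0 + theta_1*x1 + theta_2*x2 >= 0, else 0."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0

# Made-up (x1, x2) pairs on both sides of the line x1 + x2 = 3.
points = [(1.0, 1.0), (0.5, 2.0), (1.5, 1.5), (2.0, 2.0), (3.0, 4.0)]

for x1, x2 in points:
    print(f"x1={x1}, x2={x2} -> y={classify(x1, x2)}")
```

The point $(1.5, 1.5)$ sits exactly on the boundary and is classified as $1$, because the rule uses $\geq$ rather than $>$.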

Here the power of the sigmoid function emerges. It does not matter how far the new example lies from the decision boundary: $y = 1$ if above; else $y = 0$.

### How do I find the equation for the decision boundary?

I made up the previous example: I already knew the shape of the decision boundary (a line) and its equation. In the real world you will find oddly-shaped decision boundaries, as a simple straight line doesn't always separate things well. Those shapes are usually described by polynomial equations.
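To illustrate what a polynomial boundary can look like, here is a sketch that is not taken from the example above: I assume squared features are added by hand and pick hypothetical parameters $\theta = (-1, 0, 0, 1, 1)$, so that $\theta^{\top} x = -1 + x_1^2 + x_2^2$ and the decision boundary becomes the circle $x_1^2 + x_2^2 = 1$.

```python
def polynomial_features(x1, x2):
    """Hand-crafted feature vector [1, x1, x2, x1^2, x2^2]."""
    return [1.0, x1, x2, x1 * x1, x2 * x2]

# Hypothetical parameters chosen for illustration, not learned from data.
THETA = [-1.0, 0.0, 0.0, 1.0, 1.0]

def classify(x1, x2, theta=THETA):
    """Predict 1 on or outside the unit circle, 0 inside it."""
    z = sum(t * f for t, f in zip(theta, polynomial_features(x1, x2)))
    return 1 if z >= 0 else 0

for x1, x2 in [(0.0, 0.0), (0.5, 0.5), (1.0, 0.0), (2.0, 2.0)]:
    print(f"x1={x1}, x2={x2} -> y={classify(x1, x2)}")
```

The prediction rule itself is unchanged; only the feature vector grew, which is why the shape of the boundary has to be chosen before the parameters are fitted.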

Logistic regression has no built-in ability to define the decision boundary's equation. Usually it's good practice to look at data and figure out the appropriate shape to implement: that's the so-called data exploration. The gradient boosted logistic regression, an advanced machine learning algorithm, has the ability to generate the decision boundary by itself. I will look into it in future sessions.

## Sources

Machine Learning Course @ Coursera - Classification (link)
Machine Learning Course @ Coursera - Hypothesis representation (link)
Machine Learning Course @ Coursera - Decision boundary (link)
Cross Validated - How is the decision boundary's equation determined? (link)
