h1# Multivariate linear regression

How to upgrade a linear regression algorithm from one to many input variables.

Up to now I've played with linear regression based on a single variable. In the original version of the algorithm I had a single input feature math$x$ (the size of the house in the house pricing problem) that I used to predict the output math$y$ (the price of the house). I eventually ended up with this hypothesis function for that problem:

math$$h_\theta(x) = \theta_0 + \theta_1 x$$

It's now time to introduce a more powerful version that works with multiple variables, called multivariate linear regression, where the term multivariate is a fancy word for more than one variable.

h2## The house pricing problem with multiple features

You surely need more than one feature to better predict the price of a house: for example the number of rooms, the number of floors, the age of the house itself and so on. Those are your new input features math$x$. Since there is more than one, we need to update the notation a little bit.

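| size | # of bedrooms | # of floors | age | price (\$) |
|------|---------------|-------------|-----|------------|
| 2104 | 5 | 1 | 40 | 460,000 |
| 1416 | 3 | 2 | 30 | 230,000 |
| 1534 | 3 | 2 | 30 | 315,000 |
| ... | ... | ... | ... | ... |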

h3### Improving the notation

This is what I'm going to use:

• math$n$ — number of input features;
• math$m$ — number of training examples;
• math$x_j$ — math$j$-th input feature;
• math$x^{(i)}$ — vector of input features of the math$i$-th training example;
• math$x^{(i)}_j$ — value of feature math$j$ in the math$i$-th training example.

Let me explain it a little bit. The lower-case math$n$ is the number of input features, that is the number of columns in the table above, leaving out the price column, which is the output. There are four input features here, so math$n = 4$. As in the univariate version, math$m$ denotes the number of training examples, that is the number of rows in the table above.

The math$j$-th input feature is written as math$x_j$. For example, math$x_1$ is the size of the house and math$x_4$ is its age. There is still a single output variable, so it remains math$y$ with no subscript.

I will write math$x^{(i)}$ to refer to all the input values of a specific training example, i.e. the math$i$-th row of the table. Since there is more than one input value, the result is a vector. For example, here are the values of the third training example:

math$$\vec{x}^{(3)} = \begin{bmatrix} 1534 \\ 3 \\ 2 \\ 30 \end{bmatrix}$$

Since math$x^{(i)}$ is a vector, I'll use math$x^{(i)}_{j}$ to refer to a specific value of that vector. For example, math$x^{(3)}_4 = 30$.
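To make the notation concrete, here is a minimal Python sketch that stores the three rows shown in the table as plain lists. The names `X`, `y`, `example` and `feature_value` are my own choices for illustration. Note that the math notation counts from 1 while Python lists count from 0, so the helpers shift the indices.

```python
# The three rows shown in the table: size, bedrooms, floors, age.
# The price column is the output, so it is stored separately in y.
X = [
    [2104, 5, 1, 40],
    [1416, 3, 2, 30],
    [1534, 3, 2, 30],
]
y = [460000, 230000, 315000]

m = len(X)      # number of training examples (only the rows shown here)
n = len(X[0])   # number of input features


def example(i):
    """Return the vector x^(i), where i is 1-based as in the text."""
    return X[i - 1]


def feature_value(i, j):
    """Return x^(i)_j, where both i and j are 1-based as in the text."""
    return X[i - 1][j - 1]


print(m, n)                 # 3 4
print(example(3))           # [1534, 3, 2, 30]
print(feature_value(3, 4))  # 30
```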

h2## A multivariate version of the hypothesis function

The original, univariate hypothesis function was:

math$$h_\theta(x) = \theta_0 + \theta_1 x$$

with one input variable math$x$. Obviously it has to be updated in order to work with multiple inputs. It's going to be:

math$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n$$

Quite annoying when you have tons of parameters, isn't it? We can simplify it. First of all let me add a fake value math$x_0 = 1$:

math$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n$$

The magic is going to happen, thanks to that trick: any linear algebra wizard will now recognize the formula above as the inner product between two vectors math$\vec{\theta}$ and math$\vec{x}$. In particular:

math$$\vec{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix} \qquad \vec{x} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

To write the inner product as a matrix multiplication, the first factor (math$\vec{\theta}$) must be a row vector. However, our math$\vec{\theta}$ is a column vector. That's not a problem: just transpose it, that is, turn it into a row (lay it down). A transposed vector is written as math$\vec{\theta}^\top$, so I'm ready to beautifully compress the hypothesis function as follows:

math$$h_\theta(x) = \vec{\theta}^{\top} \vec{x}$$

This notation will make the implementation way easier: computing the hypothesis function is now just a matter of an inner product between two vectors, a simple task you can accomplish with any mathematical package of your favorite programming language.
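As a rough illustration, the inner product can be sketched in a few lines of plain Python, without any mathematical package. The helper names `add_bias` and `hypothesis` and the parameter values in `theta` are assumptions made up for this example; they are not fitted to the house data.

```python
def add_bias(features):
    """Prepend the fake feature x_0 = 1 to a list of input features."""
    return [1.0] + list(features)


def hypothesis(theta, x):
    """Inner product theta^T x; both vectors must have n + 1 entries."""
    if len(theta) != len(x):
        raise ValueError("theta and x must have the same length")
    return sum(t * v for t, v in zip(theta, x))


# Hypothetical parameters, chosen only to show the computation.
theta = [50.0, 0.5, 20.0, 10.0, -2.0]

# Third row of the table, with x_0 = 1 added in front.
x = add_bias([1534, 3, 2, 30])

# 50*1 + 0.5*1534 + 20*3 + 10*2 - 2*30
print(hypothesis(theta, x))  # 837.0
```

With a numerical library the same computation would typically be a single dot-product call; the loop-free `sum` over `zip` above is just the definition written out.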

h2## A multivariate version of the gradient descent function

I've updated the hypothesis function to work with multiple input features. Both the gradient descent function and the cost function need some tweaks as well.

h3### Improving the cost function

We know that the cost function takes as input all the parameters math$\theta_0, \theta_1, \dots, \theta_n$, but let's now plug in the vector math$\vec{\theta}$ instead of writing each parameter separately:

math$$J(\vec{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

It's just a matter of notation, as you may see. The equation stays the same.
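Here is a small sketch of that cost function in plain Python. The toy data set is an assumption of mine, built so that math$y = 1 + 2 x_1 + 3 x_2$ holds exactly; every row of `X` already starts with math$x_0 = 1$.

```python
def hypothesis(theta, x):
    """Inner product theta^T x."""
    return sum(t * v for t, v in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1 / (2m) * sum of squared prediction errors.

    Each row of X must already start with the fake feature x_0 = 1.
    """
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        error = hypothesis(theta, x_i) - y_i
        total += error ** 2
    return total / (2 * m)


# Toy data (not from the house table): y = 1 + 2*x_1 + 3*x_2 exactly.
X = [[1, 1, 1], [1, 2, 0], [1, 0, 3]]
y = [6, 5, 10]

print(cost([1, 2, 3], X, y))  # 0.0, the parameters fit the data perfectly
print(cost([0, 0, 0], X, y))  # a positive cost for a poor guess
```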

h3### Improving the gradient descent function

Now that we have a compact cost function we can simplify the look of the gradient descent formula, which becomes:

math\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\vec{\theta}) & \newline \rbrace \end{align*}

In particular we have a slightly different kind of derivative, that is:

math$$\frac{\partial}{\partial \theta_j} J(\vec{\theta}) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j$$
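If you want some reassurance about this formula without going through the calculus, one option is to compare it with a numerical approximation of the slope. The sketch below does that on a toy data set; the data, the starting `theta` and the function names are assumptions made for illustration only.

```python
def hypothesis(theta, x):
    """Inner product theta^T x."""
    return sum(t * v for t, v in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1 / (2m) * sum of squared prediction errors."""
    m = len(X)
    return sum((hypothesis(theta, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y)) / (2 * m)


def partial_derivative(theta, X, y, j):
    """Partial derivative of J with respect to theta_j, from the formula above."""
    m = len(X)
    return sum((hypothesis(theta, x_i) - y_i) * x_i[j] for x_i, y_i in zip(X, y)) / m


def numeric_partial(theta, X, y, j, eps=1e-6):
    """Central finite-difference approximation of the same partial derivative."""
    up = list(theta)
    down = list(theta)
    up[j] += eps
    down[j] -= eps
    return (cost(up, X, y) - cost(down, X, y)) / (2 * eps)


# Toy data; each row already starts with x_0 = 1.
X = [[1, 1, 1], [1, 2, 0], [1, 0, 3]]
y = [6, 5, 10]
theta = [0.5, -1.0, 2.0]

for j in range(len(theta)):
    # The two columns should agree up to a tiny rounding error.
    print(j, partial_derivative(theta, X, y, j), numeric_partial(theta, X, y, j))
```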

As always, I'll spare you the burden of computing the derivative step by step. So let me rewrite the formula with all the pieces glued together:

math\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j & \newline \rbrace \end{align*}

Remember to update all the math$\theta_j$ simultaneously: compute each new value from the old math$\theta$s and store it in a temporary variable, then assign them all at once, as we did for the univariate version. The unrolled loop would look like the following:

math\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & \cdots \newline \rbrace \end{align*}
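The loop above can be sketched in plain Python as follows. This is a minimal illustration under a few assumptions of mine: a fixed number of iterations stands in for "repeat until convergence", all parameters start at zero, and the toy data set follows math$y = 1 + 2 x_1 + 3 x_2$ exactly. The list `new_theta` plays the role of the temporary variables.

```python
def hypothesis(theta, x):
    """Inner product theta^T x."""
    return sum(t * v for t, v in zip(theta, x))


def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent with a simultaneous update of every theta_j.

    Each row of X must already start with the fake feature x_0 = 1.
    """
    m = len(X)
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        # Prediction errors computed once, from the OLD theta.
        errors = [hypothesis(theta, x_i) - y_i for x_i, y_i in zip(X, y)]
        # Temporary storage: every new theta_j uses only the old values.
        new_theta = [
            theta[j] - alpha * sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
            for j in range(len(theta))
        ]
        theta = new_theta  # assign all parameters at once
    return theta


# Toy data (not the house table): y = 1 + 2*x_1 + 3*x_2 exactly.
X = [[1, 1, 1], [1, 2, 0], [1, 0, 3], [1, 3, 2]]
y = [6, 5, 10, 13]

theta = gradient_descent(X, y, alpha=0.1, iterations=5000)
print([round(t, 3) for t in theta])  # values close to 1, 2 and 3
```

The learning rate and iteration count were picked by hand for this tiny data set; other data will generally need different values.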

In conclusion, adding the little trick of math$x_0 = 1$ makes the notation easier to read and more compact. You still have to loop through each math$\theta_j$ in the gradient descent formula until you reach convergence, but the outcome is a single, practical vector math$\vec{\theta}$. Plug it into the hypothesis function, compute the inner product as seen above and you have a working implementation of multivariate linear regression.

h2## Sources

Khan Academy - Vector dot product and vector length (video)
Stat Trek - Vector Multiplication (link)
Machine Learning @ Coursera - Multiple Features (link)