Join me on Facebook!
— Written by Triangles on March 05, 2017 • updated on March 05, 2017 • ID 50 —
How to upgrade a linear regression algorithm from one to many input variables.
Introduction to machine learning — What machine learning is about, types of learning and classification algorithms, introductory examples.
Linear regression with one variable — Finding the best-fitting straight line through points of a data set.
The gradient descent function — How to find the minimum of a function using an iterative algorithm.
The gradient descent in action — It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.
How to optimize the gradient descent algorithm — A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.
Introduction to classification and logistic regression — Get your feet wet with another fundamental machine learning algorithm for binary classification.
The cost function in logistic regression — Preparing the logistic regression algorithm for the actual implementation.
The problem of overfitting in machine learning algorithms — Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.
Up to now I've played with linear regression based on a single variable. In the original version of the algorithm I had a single input feature [texi]x[texi] (the size of the house in the house pricing problem) that I used to predict the output [texi]y[texi] (the price of the house). I eventually ended up with a hypothesis function for such problem:
[tex] h_\theta(x) = \theta_0 + \theta_1 x [tex]
It's now time to introduce a more powerful version that works with multiple variables called multivariate linear regression, where the term multivariate is a fancy word for more than one variable.
You surely need more than one feature in order to better predict the price of a house, like for example the number of rooms, the number of floors, the age of the house itself and so on. Those are your new input features [texi]x[texi]. Being more than one, we need to update the notation a little bit.
size | # of bedrooms | # of floors | age | price($) |
---|---|---|---|---|
2104 | 5 | 1 | 40 | 460,000 |
1416 | 3 | 2 | 30 | 230,000 |
1534 | 3 | 2 | 30 | 315,000 |
... | ... | ... | ... | ... |
This is what I'm going to use:
Let me explain it a little bit. The lower-case [texi]n[texi] is the number of input features, the number of columns in the table above. There are four input features here, so [texi]n = 4[texi]. As in the univariate version, [texi]m[texi] denotes the number of training examples, that is the number of rows in the table above.
The [texi]n[texi]-th input feature is written as [texi]x_n[texi]. For example: [texi]x_1[texi] is the size of the house, [texi]x_4[texi] is the age of the house. The output variable is still one, so it remains [texi]y[texi] with no subscript.
I will be writing [texi]x^{(i)}[texi] to refer to all the values of a specific training example, i.e. the [texi]i[texi]-th row in the table. Being more than one input values, the result turns out to be a vector. For example, I grab the values of the third training example:
[tex] \vec{x}^{(3)} = \begin{bmatrix} 1534 \\ 3 \\ 2 \\ 30 \end{bmatrix} [tex]
Being [texi]x^{(i)}[texi] a vector, I'll use [texi]x^{(i)}_{j}[texi] to refer to a specific value of that vector. For example [texi]x^{(3)}_4 = 30[texi].
The original, univariate hypothesis function was like:
[tex] h_\theta(x) = \theta_0 + \theta_1 x [tex]
with one input variable [texi]x[texi]. Obviously it has to be updated in order to work with multiple inputs. It's going to be:
[tex] h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n [tex]
Quite annoying when you have tons of parameters, isn't it? We can simplify it. First of all let me add a fake value [texi]x_0 = 1[texi]:
[tex] h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n [tex]
The magic is going to happen, thanks to that trick: any linear algebra wizard will now recognize the formula above as the inner product between two vectors [texi]\vec{\theta}[texi] and [texi]\vec{x}[texi]. In particular:
[tex] \begin{equation} \vec{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix} \qquad \vec{x} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \end{equation} [tex]
By definition of the inner product, the first argument ([texi]\vec{\theta}[texi]) must be a row vector. However our [texi]\vec{\theta}[texi] is a column vector. That's not a problem: just transpose it, that is make it a row (lay it down). A transposed vector is written as [texi]\vec{\theta}^\top[texi], so I'm ready to beautifully compress the hypothesis function as follows:
[tex] h_\theta(x) = \vec{\theta}^{\top} \vec{x} [tex]
This notation will make the implementation way easier: computing the hypothesis function it's now just a matter of an inner product between two vectors, a simple task you can accomplish with any mathematical package of your favorite programming language.
I've updated the hypothesis function to work with multiple input parameters. Both the gradient descent function and the cost function need some tweaks as well.
We know that the cost function takes in input all the parameters [texi]\theta_0, \theta_1, ... \theta_n[texi], but let's now plug in the vector [texi]\vec{\theta}[texi] instead of writing each parameter separately:
[tex] J(\vec{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 [tex]
It's just a matter of notation, as you may see. The equation stays the same.
Now that we have a compact cost function we can simplify the look of the gradient descent formula, which becomes:
[tex] \begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\vec{\theta}) & \newline \rbrace \end{align*} [tex]
In particular we have a slightly different kind of derivative, that is:
[tex] \frac{\partial}{\partial \theta_j} J(\vec{\theta}) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j [tex]
As always I lift you from the burden of computing the derivative step by step. So let me re-write the formula with all the pieces glued together:
[tex] \begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j & \newline \rbrace \end{align*} [tex]
Remember to compute each value separately by storing [texi]\theta[texi]s in temporary variables, as we did for the univariate version. The unrolled loop would look like the following:
[tex] \begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & \cdots \newline \rbrace \end{align*} [tex]
In conclusion, adding the little trick of [texi]x_0 = 1[texi] makes the notation easier to read and more compact. You still have to loop through each [texi]\theta_j[texi] in the gradient descent formula until you reach convergence, but the outcome is a practical vector. Plug it into the hypothesis function, compute the inner product as seen above and you have a working implementation of the multivariate linear regression.
Khan Academy - Vector dot product and vector length (video)
Stat Trek - Vector Multiplication (link)
Machine Learning @ Coursera - Multiple Features (link)
Machine Learning @ Coursera - Gradient Descent for Multiple Variables (link)