h1Percentiles and quantiles

Basic tools to split and analyze data sets.

h2Quantiles

Any set of data, arranged in ascending or descending order, can be divided into parts (also known as partitions or subsets) whose boundaries are given by quantiles. Quantile is a generic term for the values that divide the set into n partitions of equal size, so that each part represents 1/n of the set. Quantiles are not the partitions themselves; they are the numbers that define the partitions. You can think of them as a sort of numeric boundary.

There are various kinds of quantiles, like the quartiles (watch out for the different letter!), which divide a list of numbers into quarters. They are also known as 4-quantiles. The list of quantile types is long: you can also have 3-quantiles, 5-quantiles, 6-quantiles, up to 1000-quantiles (and above, I guess). It all depends on the size of your data and what you have to do with it.

What follows is an example with quartiles, where x is a set of numbers and the subscripts 1, 2 and 3 mark the first, second and third quartile:

 x = {5, 6_1, 9, 11_2, 13, 20_3, 26}

• first quartile, or Q1 = 6
• second quartile, or Q2 = 11
• third quartile, or Q3 = 20

The second quartile always corresponds to the median of the set x.
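
The quartile example above can be checked with a short Python sketch. It assumes the standard library's `statistics.quantiles` with its default "exclusive" method, which happens to agree with the values listed here; other quantile conventions can give different cut points for the same data.

```python
import statistics

# The example set from the text, already sorted in ascending order.
x = [5, 6, 9, 11, 13, 20, 26]

# n=4 asks for the three cut points that split the data into quarters.
# Assumption: the default "exclusive" method is the convention used here.
q1, q2, q3 = statistics.quantiles(x, n=4)

print(q1, q2, q3)            # 6.0 11.0 20.0
print(statistics.median(x))  # 11, the same value as the second quartile
```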

h2Percentiles

Percentiles are a kind of quantile (the 100-quantiles), but it is handy to look at them one at a time: a single percentile splits your set into just two partitions. For a generic kth percentile, the lower partition contains k % of the data, and the upper partition contains the rest of the data, which amounts to (100 - k) %, because the total amount of data is 100%. Of course k can be any number between 0 and 100.

h3How to compute percentiles - the percentile ranking

Say I have the following set of numbers:

 x = {1, 2, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 9, 10, 11, 13, 14, 14, 18, 20, 20, 21}

and I want to compute the percentile ranking of the number at a specific position n, for example n = 13. Counting from 1, the number at position 13 in x is 8. I can use the generic formula:

 "percentile(n)" = "number of values below position n" / "size of set x" * 100

which turns into:

 "percentile(13)" = 12 / 23 * 100 ~~ 52.2

In words, the number at position 13 splits the set in two parts: the lower one contains approximately 52.2% of the data, while the upper one contains approximately 47.8% of the data. I can also say that the number at position 13 sits at roughly the 52nd percentile of the set, namely the value that divides the data so that approximately 52.2% is below it.
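
As a minimal sketch, the percentile ranking formula can be written as a small Python function. The name `percentile_rank_at_position` is hypothetical, and the function assumes 1-based positions and counts every position below n, as the formula in the text does.

```python
def percentile_rank_at_position(data, n):
    """Percentile ranking of the value at 1-based position n in the sorted data."""
    ordered = sorted(data)
    if not 1 <= n <= len(ordered):
        raise ValueError("n must be between 1 and the size of the set")
    # Everything at positions 1 .. n-1 lies below position n.
    values_below = n - 1
    return values_below / len(ordered) * 100


x = [1, 2, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 9, 10, 11, 13, 14, 14, 18, 20, 20, 21]

rank = percentile_rank_at_position(x, 13)
print(f"{rank:.2f}")  # 52.17, i.e. 12 / 23 * 100
```

Note that this counts positions, not distinct values: the two other 8s at positions 11 and 12 are counted as "below" even though they are equal to the value at position 13.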

h3How to compute percentiles - the value at r %

This is actually the opposite question. Given a set of numbers, I want to know the position n of the number that has a given percentile ranking r (in other words, the r-th percentile). Given the previous data set, I want to know what value sits at a percentile ranking of, say, 40% (the 40th percentile). I can use the generic formula:

 n = r / 100 * ("size of set x" + 1)

which is:

 n = 40 / 100 * (23 + 1) = 9.6

It tells me that the number I'm looking for lies at position 9.6 of my set, between 6 (position 9) and 7 (position 10). Unfortunately I'm working with a discrete set of numbers, so there is no value at position 9.6. I can easily approximate it: 9.6 means "a little bit more than 9 and a half", so I round up and pick the 10th position, which corresponds to the number 7.

In words, the 40th percentile in x is the number at position 10 (7 in my data set).
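
The lookup above can be sketched in Python as well. The function name `value_at_percentile` is hypothetical, and two choices here are assumptions rather than part of the formula: the fractional position is rounded to the nearest whole position (halves round up), and the result is clamped so that it always stays inside the set.

```python
import math


def value_at_percentile(data, r):
    """Return (position, value) for the r-th percentile, using n = r / 100 * (size + 1)."""
    ordered = sorted(data)
    n = r / 100 * (len(ordered) + 1)
    # Assumption: round to the nearest position, with .5 going up.
    position = math.floor(n + 0.5)
    # Assumption: clamp to valid 1-based positions for very small or very large r.
    position = min(max(position, 1), len(ordered))
    return position, ordered[position - 1]


x = [1, 2, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 9, 10, 11, 13, 14, 14, 18, 20, 20, 21]

print(value_at_percentile(x, 40))  # (10, 7)
```

Rounding is only one way to deal with a fractional position; interpolating between the two neighbouring values is another common choice and would give a number between 6 and 7 here.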