# Percentiles and quantiles

Basic tools to split and analyze data sets.

## Quantiles

Any set of data, arranged in ascending or descending order, can be divided into parts (also known as partitions or subsets) whose boundaries are set by quantiles. Quantile is a generic term for the values that divide the set into n partitions of equal size, so that each part represents 1/n of the set. Quantiles are not the partitions themselves; they are the numbers that define the partitions. You can think of them as a sort of numeric boundary.

There are various kinds of quantiles, like the quartiles (watch out for the different letter!), which divide a list of numbers into quarters. They are also known as 4-quantiles. The list of quantile types is quite big; you can also have 3-quantiles, 5-quantiles, 6-quantiles, up to 1000-quantiles (and above, I guess). It all depends on the size of your data and what you have to do with it.

What follows is an example with quartiles, where x is a set of numbers and the subscripts 1, 2 and 3 mark the first, second and third quartile:

 x = {5, 6_1, 9, 11_2, 13, 20_3, 26}

• first quartile, or Q1 = 6
• second quartile, or Q2 = 11
• third quartile, or Q3 = 20

The second quartile always corresponds to the median of the set x.
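
As a quick check, the example can be reproduced with Python's standard library. This is only a sketch: `statistics.quantiles` (Python 3.8+) is not part of the notes above, and its default "exclusive" method happens to agree with the values picked by hand here. Other methods or libraries can return slightly different quartiles for the same data.

```python
import statistics

# The example set from above, already sorted.
x = [5, 6, 9, 11, 13, 20, 26]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(x, n=4)

print(q1, q2, q3)            # the three quartiles
print(statistics.median(x))  # the median, to compare with Q2
```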

## Percentiles

Percentiles are a kind of quantile (the 100-quantiles), but it is handy to look at them one at a time: a single percentile splits your set into only two partitions. For a generic kth percentile, the lower partition contains k% of the data, and the upper partition contains the rest of the data, which amounts to (100 - k)%, because the total amount of data is 100%. Of course k can be any number between 0 and 100.

### How to compute percentiles - the percentile ranking

Say I have the usual set of numbers:

 x = {1, 2, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 9, 10, 11, 13, 14, 14, 18, 20, 20, 21}

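and I want to compute the percentile ranking of the number at a specific position n, for example n = 13. In x, the number at position 13 is 8 (positions are counted from 1 in the sorted set). I can use the generic formula, where "below n" means the values that come before position n: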

 "percentile(n)" = "number of values below n" / "size of set x" * 100

which turns into:

 "percentile(13)" = 12 / 23 * 100 ~~ 52.2

In words, the number at position 13 splits the set in two parts: the lower one contains approximately 52.2% of the data, while the upper one contains approximately 47.8% of the data. I can also say that the number at position 13 sits at roughly the 52nd percentile of the set, namely the value that divides the data so that approximately 52.2% lies below it.
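
The same calculation as a minimal Python sketch. The function name `percentile_rank` is made up for illustration; it assumes the list is already sorted and that positions are counted from 1, as in the text.

```python
def percentile_rank(sorted_values, position):
    """Percentile ranking of the value at a 1-based position in a sorted list."""
    if not 1 <= position <= len(sorted_values):
        raise ValueError("position must be between 1 and len(sorted_values)")
    # Everything before the given position counts as "below".
    below = position - 1
    return below / len(sorted_values) * 100


x = [1, 2, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 9, 10, 11,
     13, 14, 14, 18, 20, 20, 21]

rank = percentile_rank(x, 13)
print(x[13 - 1], round(rank, 1))  # value at position 13 and its ranking
```

Note that this counts positions, not distinct values: the 8s at positions 11 and 12 are counted as "below" the 8 at position 13.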

### How to compute percentiles - the value at r %

This is actually the opposite question. Given a set of numbers, I want to know the position n of the number that has a percentile ranking r (in other words, the r-th percentile). Given the previous data set, I want to know which value sits at a percentile ranking of, say, 40% (the 40th percentile). I can use the generic formula:

 n = r / 100 * ("size of set x" + 1)

which is:

 n = 40 / 100 * (23 + 1) = 9.6

It tells me that the number I'm looking for lies at roughly position 9.6 of my set, between 6 (position 9) and 7 (position 10). Unfortunately I'm working with a discrete set of numbers and there is no 9.6th value in it. I can easily approximate it: since 9.6 is a little more than nine and a half, I round to the nearest position and pick the 10th, which corresponds to the number 7.

In words, the 40th percentile in x is the number at position 10 (7 in my data set).
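
The lookup can be sketched in Python as follows. The function name `value_at_percentile` is hypothetical, and two details are my own assumptions rather than part of the formula: halves are rounded up, and the position is clamped so that very small or very large r still land inside the list.

```python
import math


def value_at_percentile(sorted_values, r):
    """Return (n, position, value) for the r-th percentile of a sorted list."""
    size = len(sorted_values)
    # Fractional 1-based position from the formula n = r / 100 * (size + 1).
    n = r / 100 * (size + 1)
    # Round to the nearest whole position (halves go up), an assumption.
    position = math.floor(n + 0.5)
    # Keep the position inside the list, also an assumption.
    position = max(1, min(size, position))
    return n, position, sorted_values[position - 1]


x = [1, 2, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 9, 10, 11,
     13, 14, 14, 18, 20, 20, 21]

n, position, value = value_at_percentile(x, 40)
print(n, position, value)
```

Rounding to the nearest position is only one possible choice; interpolating between the values at positions 9 and 10 would be another, and it would give a slightly different answer.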