Average ignoring extremes (outliers) with PostgreSQL
Introduction
In some cases, you may want a "trimmed average" instead of a regular average. For example, if you try to calculate the average net worth in your appartment building and Bill Gates is living in the penthouse, it might throw off the data a little bit. In this specific case, it might make more sense to say that the average net worth in your apartment building excluding Bill Gates is $587,333 and that Bill Gates has a net worth of 79,100,000,000.
Average size banana
Try it yourself
If you'd like to try the above example for yourself, just create this table:
CREATE TEMP TABLE building (
id serial,
name text,
unit_number text,
net_worth bigint
);
INSERT INTO building (name, unit_number, net_worth)
VALUES ('Gabriel', '1a', 326000),
('Tom', '1b', 885000),
('Jason', '1c', 563000),
('Bob', '2a', 145000),
('Roger', '2b', 517000),
('Lucas', '2c', 937000),
('Jerome', '3a', 986000),
('Andrew', '3b', 705000),
('John', '3c', 179000),
('Paul', '4a', 863000),
('Richard', '4b', 467000),
('Joseph', '4c', 475000),
('Bill', 'PH', 79100000000);
Regular average
Here is the average net worth in the building as well as the lowest and maximum net worth:
gab@test> SELECT round(AVG(net_worth)) as avg, MIN(net_worth), MAX(net_worth) FROM building;
avg | min | max
------------+--------+-------------
6085157538 | 145000 | 79100000000
(1 row)
Time: 0.476 ms
Average without the outliers
As you can see above, the average net worth in my apartment building is $608,515,7538 which does not really make sense because 1 tenant is way above that amount and all the other tenants under. We can use standard deviation to remove outliers and get a "trimmed average". In PostgreSQL, the function is called stddev_samp().
The "sample standard deviation" of the net worths is:
gab@test> SELECT STDDEV_SAMP(net_worth) FROM building;
stddev_samp
-------------
21938229866
(1 row)
Time: 0.401 ms
You can, for example, keep only the 95% data closest to the average with this formula: (Average - Standard Deviation * 2) < DATA WE KEEP < (Average + Standard Deviation * 2)
Here is this calculation for our data:
gab@test> SELECT (AVG(net_worth) - STDDEV_SAMP(net_worth) * 2) as lower_bound,
(AVG(net_worth) + STDDEV_SAMP(net_worth) * 2) as upper_bound
FROM building;
lower_bound | upper_bound
-----------------------+----------------------
-37791302193.53846154 | 49961617270.46153846
(1 row)
Time: 0.546 ms
This means that for our trimmed average, we would consider only the tenants with a net worth between $(37,791,302,193) and $49,961,617,270 which would remove Bill Gates.
Here is the trimmed average:
gab@test> WITH bounds AS (
SELECT (AVG(net_worth) - STDDEV_SAMP(net_worth) * 2) as lower_bound,
(AVG(net_worth) + STDDEV_SAMP(net_worth) * 2) as upper_bound
FROM building
)
SELECT round(AVG(net_worth)) as average
FROM building
WHERE net_worth BETWEEN (SELECT lower_bound FROM bounds) AND (SELECT upper_bound FROM bounds);
average
---------
587333
(1 row)
Time: 0.761 ms
The average net worth of our tenants excluding Bill Gates is $587,333.
In this example, we used "Standard Deviation * 2" to keep only the 95% data closest to the average. You can change this value to remove more or less outliers:
Percentage included in the set | Percentage excluded from the set | Value |
---|---|---|
50% | 50% | 0.674 |
68% | 32% | 0.994 |
68.268% | 31.731% | 1 |
80% | 20% | 1.281 |
90% | 10% | 1.644 |
95% | 5% | 1.959 |
95.449% | 4.550% | 2 |
99.730% | 0.269% | 3 |
99.999% | 0.0000% | 6 |
Read more about the rules for normally distributed data.