Window Functions in PostgreSQL
Introduction
Window functions looks at "windows" of your data while processing it. They give you "the ability to perform calculations across sets of rows that are related to the current query row".
In simple english, there are currently 11 window functions in PostgreSQL:
- row_number(): return the number of the current row from 1 to N
- rank(): return the rank of the current row (with gaps)
- dense_rank(): return the rank of the current row (without gaps)
- percent_rank(): relative rank of the current row
- cume_dist(): relative rank of the current row (bis)
- ntile(): group your rows in buckets
- lag(): return the previous row (or one of them)
- lead(): return the next row (or one of them)
- first_value(): return the first row
- last_value(): return the last row
- nth_value(): return the Nth row
My bedroom window does not have a lot of functions but look at the view!
It looks hard but it's easy
To use window functions, you need to use the "window function syntax" with an OVER() clause.
Let's create a sample table that we will use to provide an example for every window function. In our example, we are managing a class with 14 students:
-- Creating the table
CREATE TEMP TABLE students (
id serial,
name text,
grade int DEFAULT NULL,
last_seen_in_class date
);
-- Adding some students
INSERT INTO students (name, grade, last_seen_in_class) VALUES ('Jacob', '9', '2014-08-16'), ('Michael', '6', '2014-08-24'),
('Matthew', '7', '2014-08-24'), ('Emily', '5', '2014-08-17'),
('Emma', '8', '2014-08-17'), ('Christopher', '9', '2014-08-24'),
('Ashley', '10', '2014-08-16'), ('William', '0', '2014-07-18'),
('Grace', '3', '2014-08-21'), ('Tyler', '4', '2014-08-20'),
('Alexis', '4', '2014-08-24'), ('Alexander', '10', '2014-08-22'),
('Victoria', '4', '2014-08-24'), ('Benjamin', '1', '2014-08-24');
We now have a table with a list of students:
gab@gab # SELECT * FROM students;
id | name | grade | last_seen_in_class
----+-------------+-------+--------------------
1 | Jacob | 9 | 2014-08-16
2 | Michael | 6 | 2014-08-24
3 | Matthew | 7 | 2014-08-24
4 | Emily | 5 | 2014-08-17
5 | Emma | 8 | 2014-08-17
6 | Christopher | 9 | 2014-08-24
7 | Ashley | 10 | 2014-08-16
8 | William | 0 | 2014-07-18
9 | Grace | 3 | 2014-08-21
10 | Tyler | 4 | 2014-08-20
11 | Alexis | 4 | 2014-08-24
12 | Alexander | 10 | 2014-08-22
13 | Victoria | 4 | 2014-08-24
14 | Benjamin | 1 | 2014-08-24
(14 rows)
The OVER clause
Every window function needs an OVER clause. A window function will be over some portion of the rows (or all the rows) selected by your query. The OVER clause allows you to specify that. In the beginning, I recommend to simply repeat the ORDER BY clause in your over clause, for example :
SELECT a, b, c, any_window_function() OVER (ORDER BY a)
FROM foo
ORDER BY a
As you can see, my query is ordered by column "a" and I used the same ORDER BY in the OVER clause. Once you are comfortable with window functions, take a few minutes to read the official documentation about Window Function Calls to learn about the complete OVER clause syntax.
Now, let's try some window functions!
1. row_number(): return the number of the current row from 1 to N
You can, for example, sort your students by grade and find the number of each row:
gab@gab # SELECT ROW_NUMBER() OVER (ORDER BY grade), id, name, grade
FROM students
ORDER BY grade;
row_number | id | name | grade
------------+----+-------------+-------
1 | 8 | William | 0
2 | 14 | Benjamin | 1
3 | 9 | Grace | 3
4 | 13 | Victoria | 4
5 | 11 | Alexis | 4
6 | 10 | Tyler | 4
7 | 4 | Emily | 5
8 | 2 | Michael | 6
9 | 3 | Matthew | 7
10 | 5 | Emma | 8
11 | 6 | Christopher | 9
12 | 1 | Jacob | 9
13 | 7 | Ashley | 10
14 | 12 | Alexander | 10
(14 rows)
2. rank(): return the rank of the current row (with gaps)
Within the list of students, you can find the rank of each student depending on their grade (two students with the same grade will have the same rank):
gab@gab # SELECT rank() OVER (ORDER BY grade DESC), id, name, grade
FROM students
ORDER BY grade DESC;
rank | id | name | grade
------+----+-------------+-------
1 | 12 | Alexander | 10
1 | 7 | Ashley | 10
3 | 6 | Christopher | 9
3 | 1 | Jacob | 9
5 | 5 | Emma | 8
6 | 3 | Matthew | 7
7 | 2 | Michael | 6
8 | 4 | Emily | 5
9 | 13 | Victoria | 4
9 | 10 | Tyler | 4
9 | 11 | Alexis | 4
12 | 9 | Grace | 3
13 | 14 | Benjamin | 1
14 | 8 | William | 0
(14 rows)
As you can see, two students are in first position with a grade of 10, then two more are in third position with the grade 9.
3. dense_rank(): return the rank of the current row (without gaps)
dense_rank() is similar as rank() except that there is no gap between the ranks (after rank 1, you will have rank 2, when above we had rank 3):
gab@gab # SELECT dense_rank() OVER (ORDER BY grade DESC), id, name, grade
FROM students
ORDER BY grade DESC;
dense_rank | id | name | grade
------------+----+-------------+-------
1 | 12 | Alexander | 10
1 | 7 | Ashley | 10
2 | 6 | Christopher | 9
2 | 1 | Jacob | 9
3 | 5 | Emma | 8
4 | 3 | Matthew | 7
5 | 2 | Michael | 6
6 | 4 | Emily | 5
7 | 13 | Victoria | 4
7 | 10 | Tyler | 4
7 | 11 | Alexis | 4
8 | 9 | Grace | 3
9 | 14 | Benjamin | 1
10 | 8 | William | 0
(14 rows)
4. percent_rank(): relative rank of the current row
With percent_rank(), instead of getting a rank "position", you will get a rank percentage with the following calculation: (rank - 1) / (total rows - 1).
In the example below, Christopher is in third position and we have 14 rows, so his percent_rank() = (3 - 2) / (14 - 1) = 2 / 13 = 0.153846153846154
Here is an example with our students:
gab@gab # SELECT percent_rank() OVER (ORDER BY grade DESC), id, name, grade
FROM students
ORDER BY grade DESC;
percent_rank | id | name | grade
-------------------+----+-------------+-------
0 | 12 | Alexander | 10
0 | 7 | Ashley | 10
0.153846153846154 | 6 | Christopher | 9
0.153846153846154 | 1 | Jacob | 9
0.307692307692308 | 5 | Emma | 8
0.384615384615385 | 3 | Matthew | 7
0.461538461538462 | 2 | Michael | 6
0.538461538461538 | 4 | Emily | 5
0.615384615384615 | 13 | Victoria | 4
0.615384615384615 | 10 | Tyler | 4
0.615384615384615 | 11 | Alexis | 4
0.846153846153846 | 9 | Grace | 3
0.923076923076923 | 14 | Benjamin | 1
1 | 8 | William | 0
(14 rows)
5. cume_dist(): relative rank of the current row (bis)
cume_dist() offers another way of calculating a relative rank with the following formula: (number of rows preceding or peer with current row) / (total rows)
Christopher is in third position as well as Jacob, so there are 4 students with a grade equal or superior to Christopher (Christopher himself, Jacob, Ashley & Alexander) and there are still 14 rows. Christopher's cume_dist() = 4 / 14 = 0.285714285714286 Jacob has the same grade as Christopher's, he will have the same cume_dist().
Here is an example with all of our students:
gab@gab # SELECT cume_dist() OVER (ORDER BY grade DESC), id, name, grade
FROM students
ORDER BY grade DESC;
cume_dist | id | name | grade
-------------------+----+-------------+-------
0.142857142857143 | 12 | Alexander | 10
0.142857142857143 | 7 | Ashley | 10
0.285714285714286 | 6 | Christopher | 9
0.285714285714286 | 1 | Jacob | 9
0.357142857142857 | 5 | Emma | 8
0.428571428571429 | 3 | Matthew | 7
0.5 | 2 | Michael | 6
0.571428571428571 | 4 | Emily | 5
0.785714285714286 | 13 | Victoria | 4
0.785714285714286 | 10 | Tyler | 4
0.785714285714286 | 11 | Alexis | 4
0.857142857142857 | 9 | Grace | 3
0.928571428571429 | 14 | Benjamin | 1
1 | 8 | William | 0
(14 rows)
6. ntile(): group your rows in buckets
ntile() tries to group rows in buckets with an equal number of rows. If you have 10 rows and want 2 buckets, you will have 5 rows in each bucket.
Here is an example where we group our students in 2 and 4 buckets:
gab@gab # SELECT id, name, grade,
ntile(2) OVER (ORDER BY grade DESC) as two_buckets,
ntile(4) OVER (ORDER BY grade DESC) as four_buckets
FROM students
ORDER BY grade DESC;
id | name | grade | two_buckets | four_buckets
----+-------------+-------+-------------+--------------
12 | Alexander | 10 | 1 | 1
7 | Ashley | 10 | 1 | 1
6 | Christopher | 9 | 1 | 1
1 | Jacob | 9 | 1 | 1
5 | Emma | 8 | 1 | 2
3 | Matthew | 7 | 1 | 2
2 | Michael | 6 | 1 | 2
4 | Emily | 5 | 2 | 2
13 | Victoria | 4 | 2 | 3
10 | Tyler | 4 | 2 | 3
11 | Alexis | 4 | 2 | 3
9 | Grace | 3 | 2 | 4
14 | Benjamin | 1 | 2 | 4
8 | William | 0 | 2 | 4
(14 rows)
If you like ntile(), take a few seconds to check out my article about Weighted ntile, a PostgreSQL function I wrote to help you rank your data in weighted groups.
7. lag(): return the previous row (or one of them)
lag() returns the value of one or more rows before the one currently displayed.
In the following example, you can see the name of the student one and two positions before the current row:
gab@gab # SELECT id, name, grade,
lag(name, 1) OVER (ORDER BY grade DESC) as one_before,
lag(name, 2) OVER (ORDER BY grade DESC) as two_before
FROM students
ORDER BY grade DESC;
id | name | grade | one_before | two_before
----+-------------+-------+-------------+-------------
12 | Alexander | 10 | {NULL} | {NULL}
7 | Ashley | 10 | Alexander | {NULL}
6 | Christopher | 9 | Ashley | Alexander
1 | Jacob | 9 | Christopher | Ashley
5 | Emma | 8 | Jacob | Christopher
3 | Matthew | 7 | Emma | Jacob
2 | Michael | 6 | Matthew | Emma
4 | Emily | 5 | Michael | Matthew
13 | Victoria | 4 | Emily | Michael
10 | Tyler | 4 | Victoria | Emily
11 | Alexis | 4 | Tyler | Victoria
9 | Grace | 3 | Alexis | Tyler
14 | Benjamin | 1 | Grace | Alexis
8 | William | 0 | Benjamin | Grace
(14 rows)
8. lead(): return the next row (or one of them)
lead() is the exact opposite as lag(), it returns the value of one or more rows after the one currently displayed.
In the following example, you can see the name of the student one and two position after the current row:
gab@gab # SELECT id, name, grade,
lead(name, 1) OVER (ORDER BY grade DESC) as one_after,
lead(name, 2) OVER (ORDER BY grade DESC) as two_after
FROM students
ORDER BY grade DESC;
id | name | grade | one_after | two_after
----+-------------+-------+-------------+-------------
12 | Alexander | 10 | Ashley | Christopher
7 | Ashley | 10 | Christopher | Jacob
6 | Christopher | 9 | Jacob | Emma
1 | Jacob | 9 | Emma | Matthew
5 | Emma | 8 | Matthew | Michael
3 | Matthew | 7 | Michael | Emily
2 | Michael | 6 | Emily | Victoria
4 | Emily | 5 | Victoria | Tyler
13 | Victoria | 4 | Tyler | Alexis
10 | Tyler | 4 | Alexis | Grace
11 | Alexis | 4 | Grace | Benjamin
9 | Grace | 3 | Benjamin | William
14 | Benjamin | 1 | William | {NULL}
8 | William | 0 | {NULL} | {NULL}
(14 rows)
9. first_value(): return the first row
first_value() returns the first row from our set of data. For example, you can see the best grade in the class and calculate the difference between the current grade and the best grade:
gab@gab # SELECT id, name, grade,
first_value(grade) OVER (ORDER BY grade DESC) as best_grade,
(first_value(grade) OVER (ORDER BY grade DESC) - grade) as difference
FROM students
ORDER BY grade DESC;
id | name | grade | best_grade | difference
----+-------------+-------+------------+------------
12 | Alexander | 10 | 10 | 0
7 | Ashley | 10 | 10 | 0
6 | Christopher | 9 | 10 | 1
1 | Jacob | 9 | 10 | 1
5 | Emma | 8 | 10 | 2
3 | Matthew | 7 | 10 | 3
2 | Michael | 6 | 10 | 4
4 | Emily | 5 | 10 | 5
13 | Victoria | 4 | 10 | 6
10 | Tyler | 4 | 10 | 6
11 | Alexis | 4 | 10 | 6
9 | Grace | 3 | 10 | 7
14 | Benjamin | 1 | 10 | 9
8 | William | 0 | 10 | 10
(14 rows)
10. last_value(): return the last row
last_value() returns the last row within the "window frame". As the official documentation says, the window frame contains by default "the rows from the start of the partition through the last peer of the current row.". This is likely to give unhelpful results for last_value and nth_value. We can bypass this issue by specifing a range of data like in the example below.
We can now see the worst grade of the class and calculate the difference between the current grade and the worst grade:
gab@gab # SELECT id, name, grade,
last_value(grade) OVER (ORDER BY grade DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as worst_grade,
(last_value(grade) OVER (ORDER BY grade DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) - grade) as difference
FROM students
ORDER BY grade DESC;
id | name | grade | worst_grade | difference
----+-------------+-------+-------------+------------
12 | Alexander | 10 | 0 | -10
7 | Ashley | 10 | 0 | -10
6 | Christopher | 9 | 0 | -9
1 | Jacob | 9 | 0 | -9
5 | Emma | 8 | 0 | -8
3 | Matthew | 7 | 0 | -7
2 | Michael | 6 | 0 | -6
4 | Emily | 5 | 0 | -5
13 | Victoria | 4 | 0 | -4
10 | Tyler | 4 | 0 | -4
11 | Alexis | 4 | 0 | -4
9 | Grace | 3 | 0 | -3
14 | Benjamin | 1 | 0 | -1
8 | William | 0 | 0 | 0
(14 rows)
11. nth_value(): return the Nth row
nth_value() returns the Nth row within the "window frame". It has the same limitation as and we need again to specify a range of data to get useful results.
For example, here is the grade of the student in 4th position (Jacob) and the grade of the student in 8th position (Emily):
gab@gab # SELECT id, name, grade,
nth_value(grade, 4) OVER (ORDER BY grade DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as position_4,
nth_value(grade, 8) OVER (ORDER BY grade DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as position_8
FROM students
ORDER BY grade DESC;
id | name | grade | position_4 | position_8
----+-------------+-------+------------+------------
12 | Alexander | 10 | 9 | 5
7 | Ashley | 10 | 9 | 5
6 | Christopher | 9 | 9 | 5
1 | Jacob | 9 | 9 | 5
5 | Emma | 8 | 9 | 5
3 | Matthew | 7 | 9 | 5
2 | Michael | 6 | 9 | 5
4 | Emily | 5 | 9 | 5
13 | Victoria | 4 | 9 | 5
10 | Tyler | 4 | 9 | 5
11 | Alexis | 4 | 9 | 5
9 | Grace | 3 | 9 | 5
14 | Benjamin | 1 | 9 | 5
8 | William | 0 | 9 | 5
(14 rows)
Read more
You can read more about window functions on PostgreSQL's official website. Here are the most relevant pages:
- General-Purpose Window Functions
- Window Function Calls (application of an aggregate-like function over some portion of the rows selected by a query)