SQL Window Functions
Window functions define ranking and aggregation functions you can execute for a subset of rows. The syntax is a bit weird and takes a bit of a time to get used to; but after some practice, you will appreciate them a lot. They are incredibly useful. You can define window functionality with this syntax:
AGGREGATION|RANKING OVER( ORDERED|PARTITONED TABLE ) AS column_name
To keep it simple, you can construct a window 2 ways: you can order rows or partition them. You can think of window functions as creating a new column in a result set by using functions - such as sum(), avg() etc - that act on a window. For example, for each row, you can project both an employee's salary, and average salary for the entire window. This is not something we can do with the group by function.
I recommend you try out the queries as you follow along, it will be fun!
CREATE TABLE IF NOT EXISTS "employee" (
"id" INT,
"first_name" TEXT,
"department_id" INT,
"salary" INT,
"years_worked" INT
);
INSERT INTO "employee" VALUES
(1,'Diane',1,5330,4),
(2,'Clarence',1,3617,2),
(3,'Eugene',1,4877,2),
(4,'Philip',1,5259,3),
(5,'Ann',2,2094,5),
(6,'Charles',2,5167,5),
(7,'Russell',2,3762,4),
(9,'Larry'3,2796,4),
(10,'Willie',3,4771,5),
(11,'Janet',3,3782,2),
(12,'Doris',3,6419,1),
(13,'Amy',3,6261,1),
(14,'Keith',3,4928,8),
(15,'Karen',4,6347,6),
(16,'Kathy',4,6286,1),
(17,'Joe',5,5639,3),
(18,'Barbara',5,3232,1),
(19,'Todd',5,4653,1),
(20,'Ronald',5,2076,5);
CREATE TABLE IF NOT EXISTS "department" (
"id" INT,
"name" TEXT
);
INSERT INTO "department" VALUES
(1,'IT'),
(2,'Management'),
(3,'Human Resources'),
(4,'Accounting'),
(5,'Help Desk');
-- Join employee and department tables, so we can re-use it for following queries
WITH sample as
(
SELECT employee.id, first_name, salary, years_worked, department.name as department
FROM employee LEFT JOIN department ON employee.department_id=department.id
)
SELECT * from sample;
RANKING
Syntax: RANK() | DENSE_RANK() | ROW_NUMBER() OVER(ORDER BY column_name)
Rank is a simple window function; it doesn't take any arguments. Let's say, we want to order employees based on their seniority, and rank them such that most senior employees get top ranking. This is how you can do it:
SELECT
first_name,
salary,
years_worked,
department,
DENSE_RANK() OVER(ORDER BY years_worked DESC) AS seniority
FROM sample;
This query orders employees based on years_worked, and for each employee a rank is assigned based on this ordering into a new column called 'seniority'. Then this new column is joined with the rest of the result set.
Since we are getting comfortable with window functions, now let's get the seniority ranking for each department:
SELECT first_name,
salary,
years_worked,
department,
DENSE_RANK() OVER(PARTITION BY department ORDER BY years_worked desc) as dept_seniority
FROM sample;
The only difference here is now we are constructing the window by first partitioning the dataset by department, then ordering the rows for each partition based on the years worked.
AGGREGATIONS
Let's say, we want to know how each employee's salary differs from average department salary.
-- You can also do aggregations over windows
SELECT
first_name,
salary,
years_worked,
department,
DENSE_RANK() over(partition by department order by years_worked desc) as seniority,
AVG(salary) over(partition by department) as avg_department_salary
FROM sample;
These all are very simple examples, but I hope it was useful. There is one more important thing to mention: How do we define the range of a window? Imagine that you want to calculate a running sum of salaries; we will assume that id field indicates the hire order.
SELECT
id,
first_name,
department,
salary,
SUM(salary) OVER(order by id) as cumulative_sum
FROM sample;
WINDOW range
Above query is equivalent to:
SELECT
id,
first_name,
department,
salary,
SUM(salary) OVER(order by id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_sum
FROM sample;
What if you want to get moving average of salaries? The lower boundary is 1 row before the current row. The upper boundary is 1 row after the current row.
SELECT
id,
first_name,
department,
salary,
ROUND(AVG(salary) OVER(order by id ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING), 2) as moving_average
FROM sample;
One last example, and we are done:-)
Imagine that you want to calculate more than one aggregate function over the same window in a query, for example, you want to get minimum, maximum, and average salaries per department in the same result set. Instead of repeating ourselves like this:
SELECT
first_name,
department,
salary,
MIN(salary) OVER(partition by department) as min_salary_per_dept,
AVG(salary) OVER(partition by department) as avg_salary_per_dept,
MAX(salary) OVER(partition by department) as max_salary_per_dept
FROM sample;
We can do this:
SELECT
first_name,
department,
salary,
MIN(salary) OVER w as min_salary_per_dept,
AVG(salary) OVER w as avg_salary_per_dept,
MAX(salary) OVER w as max_salary_per_dept
FROM sample
WINDOW w as (partition by department);
I hope you enjoyed this write-up, and it was useful to you! As always, I'm open to constructive feedback, and feel free to connect on LinkedIn.