SQL window functions have been around for some time now (they were part of the SQL 2003 ISO standard). However, adoption has taken time in some database systems. For example, they weren’t introduced in MySQL until MySQL 8 (2018).
Window functions are functions that aggregation data, but are returned to the un-aggregated row. For example, if you wanted to know what percentage of a given order for a customer makes up their total sales over a given period, a window function could return the total sales to each order row. This could also be done through a derived table, joined to the orders table. However, using a window function is simpler, involves far less code and often runs faster. Some database systems also optimize for window functions, which can mean they are a better solution.
I haven’t used them extensively myself. The main reason is that much of this type of data manipulation I prefer to do in R/Python. However, anything that would reduce the size of the dataset or speed up calculations or easier to maintain is always worth taking a look at. I have to admit that there are instances that window functions would have really helped, but I wasn’t keenly aware of them and their usefulness. We all tend to go with what we know (or what’s available), but it’s good to question.
Structure of Window Functions
The basic structure of window functions is the function that you want, such as SUM(), followed by an OVER() function with the column you want to sum over, which goes into a PARTITION BY command, inside the OVER() function. Additionally, if it’s a ranking window function, such as ordering sales from highest to lowest for each customer, then an ORDER BY command can be used inside the OVER() function. Here is an example:
SUM()
SELECT customer_id,
total_sales,
SUM(total_sales) OVER(PARTITION BY customer_id) AS total_sales_by_cust
FROM orders
In this statement, I’m selecting the customer_id & total_sales columns and summing the total sales by customer. Here is a sample of the results:
We can see customer_id C10 & C11 and there are a number of sales for each. With the SUM() window function we are able to sum all of the total_sales for each customer. You can also add a calculated column for the percentage each sale was for total sales by customer. If we wanted to do this without a window function, this is one way to do that:
SELECT o.customer_id,
o.total_sales,
os.total_sales_by_cust
FROM orders o INNER JOIN
(SELECT customer_id, SUM(total_sales) AS total_sales_by_cust
FROM orders
GROUP BY customer_id) os ON o.customer_id = os.customer_id
This requires more code than a window function and might take longer to run and be harder to maintain.
RANK()
Instead, let’s say that you are not interested in the sum of sales, but want the sales ordered in descending order from largest to smallest. That can be done with a window function called RANK(). This is how to write it (I’m just keeping the column names the same):
SELECT customer_id,
total_sales,
RANK() OVER(PARTITION BY customer_id ORDER BY total_sales DESC) AS total_sales_by_cust
FROM orders
I’ve added an ORDER BY command because we want the data ordered, or ranked, by how much the sale was. Also, we don’t put anything inside the RANK parenthesis. This is what the result looks like:
Same data with the same number of rows, but in this case it’s ordered in descending sale value for each customer_id. To do this one without a window function would take a bit more code than the first example, but I think this shows the structure of window functions and their usefulness. Next, I want to go through some of the types of window functions and what they’re good for.
Ranking Window Functions
Ranking functions order the data, making use of the ORDER BY command within the OVER() function. I’ve just used one as an example – the RANK() function. There are three other ranking window functions.
ROW_NUMBER()
ROW_NUMBER() assigns a unique sequential number to each row based on the order. For example, if you wanted to count, in order, the sale based on the date of the sale, you could write:
SELECT customer_id,
order_date,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date) AS order_number
FROM orders
This will give you the following result:
This returns each sale, starting with the first one in the data for each customer_id. It assigns a sequential integer for every sale by the date of order.
There is also DENSE_RANK(), which is like RANK() but without gaps in ranking for ties. That means if a customer placed their first order and a separate order on the same day, both orders would be counted as number 1. However, with RANK(), the next order (on another day) would be counted as number 3, where as with DENSE_RANK(), it would be counted as 2. If you needed it as 1,2,3, then use ROW_NUMBER().
NTILE()
Finally, there is a ranking function that allows you to split the data into n even buckets. For example, if you wanted to split sales into three even buckets, you could write:
SELECT customer_id,
SUM(total_sales) AS total_sales,
NTILE(3) OVER (ORDER BY SUM(total_sales) DESC) AS Bucket
FROM orders
GROUP BY customer_id
That would aggregate the sales by customer, then rank the customer into one of three buckets, depending on the sales amount. This is the result:
This is showing bucket 1 & 2 (there’s of course a 3 as well). You can split the data by however many buckets you need and by whichever column you want to split into n equal parts.
Aggregate Window Functions
These are functions like SUM(), AVG(), MIN(), MAX(), & COUNT(). I’ve already shown an example of SUM(). These do not use the ORDER BY command in the OVER() function because they return the same value for each row in the partition. I think they are fairly self-explanatory.
Value Functions
Value functions return the first (FIRST_VALUE()), last (LAST_VALUE()), or nth value (NTH_VALUE()). For example, to find the third sale for each customer, you could write:
SELECT customer_id,
order_date,
NTH_VALUE(order_date, 3) OVER(PARTITION BY customer_id ORDER BY order_date) AS third_value
FROM orders
Keep in mind that until you get to the nth value, a NULL will be returned. After that, it will just be the third value in the order you chose. Unlike first or last, you need to specify both the column and the integer for the nth value within the NTH_VALUE() function. Here is the returned dataset:
Offset Window Functions
There are two offset functions – LAG() & LEAD(). These can be useful if you want to compare data between periods. For example, to compare today’s sales to yesterday’s sales for each product, this is how that can be done:
SELECT product_id,
EXTRACT(MONTH FROM order_date) AS MONTH,
SUM(total_sales) AS monthly_sales,
LAG(SUM(total_sales), 1) OVER (PARTITION BY product_id ORDER BY EXTRACT(MONTH FROM order_date))
FROM orders
GROUP BY product_id,
EXTRACT(MONTH FROM order_date)
And the results would look like this:
In the lag column we have the previous month’s sales data. You can see that each value in the lag column (I didn’t name it anything else) is the same as the value in the monthly_sales column, one row up. There are also NULL values in the lag column where there is not a prior month (do not leave it like this in production – address the NULLs).
Summary
There are a few other window functions (CUME_DIST(), PERCENT_RANK(), NTH_PERCENTILE) that you can explore. Also, there are a few other features beyond what I showed here that allow additional customization of the data.
Overall, I think window functions are really helpful and I hope you agree. I’ll probably be using more of them in the future.