Unfortunately, these functions are not ubiquitously available in SQL dialects. As of jOOQ 3.11, they are known to work in these dialects:
Dialect | As aggregate function | As window function |
---|---|---|
MariaDB 10.3.3 | No | Yes |
Oracle 18c | Yes | Yes |
PostgreSQL 11 | Yes | No |
SQL Server 2017 | No | Yes |
Teradata 16 | Yes | No |
Oracle has the most sophisticated implementation, which supports both the ordered set aggregate function, and the window function version:
- Aggregate function:
PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x)
- Window function:
PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x) OVER (PARTITION BY y)
Workarounds if the feature is unavailable
Luckily, as soon as an RDBMS supports window functions, we can easily emulate PERCENTILE_DISC
using PERCENT_RANK
and FIRST_VALUE
as follows. We’re using the Sakila database in this example.
Emulating window functions
Let’s emulate these first, as it requires a bit less SQL transformations. This query works out of the box in Oracle:
SELECT DISTINCT rating, percentile_disc(0.5) WITHIN GROUP (ORDER BY length) OVER() x1, percentile_disc(0.5) WITHIN GROUP (ORDER BY length) OVER (PARTITION BY rating) x2 FROM film ORDER BY rating;
Yielding
RATING X1 X2 ------------------- G 114 107 NC-17 114 112 PG 114 113 PG-13 114 125 R 114 115
What we can read from this is that the median length of all films is 114 minutes, and the median lengths of films per rating range from 107 minutes to 125 minutes. I’ve used DISTINCT
because we don’t care about visualising these values on a per-row basis in this case. This also works in SQL Server.
Now, let’s assume we’re using PostgreSQL, which doesn’t support inverse distribution window functions, or MySQL, which doesn’t support inverse distribution functions at all, but both support PERCENT_RANK
and FIRST_VALUE
. Here’s the complete query:
SELECT DISTINCT rating, first_value(length) OVER ( ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1, first_value(length) OVER ( PARTITION BY rating ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2 FROM ( SELECT rating, length, percent_rank() OVER (ORDER BY length) p1, percent_rank() OVER (PARTITION BY rating ORDER BY length) p2 FROM film ) t ORDER BY rating;
So, we’re doing this in two steps (visual example further down):
PERCENT_RANK
: In a derived table, we’re calculating thePERCENT_RANK
value, which attributes a rank to each row ordered by length, going from 0 to 1. This makes sense. When looking for the median value, we’re really looking for the value whosePERCENT_RANK
is 0.5 or less. When looking for the 90% percentile, we’re looking for the value whosePERCENT_RANK
is 0.9 or lessFIRST_VALUE
: Once we’ve found thePERCENT_RANK
, we’re not quite done yet. We need to find the last row whosePERCENT_RANK
is less or equal to the percentile we’re interested in. I could have usedLAST_VALUE
, but then I would have needed to resort to using the quite verbose range clause of window functions. Instead, I when ordering the rows byPERCENT_RANK
(p1 or p2), I translated all ranks higher than the percentile I’m looking for intoNULL
using aCASE
expression, and then I made sure usingNULLS LAST
that the percentile I’m looking for will be the first row in theFIRST_VALUE
function’s window specification. Easy!
To visualise this, let’s run these queries, which also project the p1 and p2 values respectively:
SELECT length, CASE WHEN p1 <= 0.5 THEN p1 END::numeric(3,2) p1, first_value(length) OVER ( ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1 FROM ( SELECT length, percent_rank() OVER (ORDER BY length) p1 FROM film ) t ORDER BY length;
The result is
length |p1 |x1 | -------|-----|----| 46 |0.00 |114 | 46 |0.00 |114 | 46 |0.00 |114 | 46 |0.00 |114 | 46 |0.00 |114 | 47 |0.01 |114 | ... 113 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | <-- Last row whose PERCENT_RANK is <= 0.5 115 | |114 | 115 | |114 | 115 | |114 | 115 | |114 | 115 | |114 | 115 | |114 | ... 185 | |114 | 185 | |114 | 185 | |114 |
So the FIRST_VALUE
function just searches for that first row (descendingly, i.e. bottom up) whose p1 value is non-null.
The same for p2:
SELECT length, rating, CASE WHEN p2 <= 0.5 THEN p2 END::numeric(3,2) p2, first_value(length) OVER ( PARTITION BY rating ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2 FROM ( SELECT rating, length, percent_rank() OVER (PARTITION BY rating ORDER BY length) p2 FROM film ) t ORDER BY rating, length;
Yielding:
length |rating |p2 |x2 | -------|-------|-----|----| 47 |G |0.00 |107 | 47 |G |0.00 |107 | 48 |G |0.01 |107 | 48 |G |0.01 |107 | ... 105 |G |0.47 |107 | 106 |G |0.49 |107 | 107 |G |0.49 |107 | 107 |G |0.49 |107 | <-- Last row in G partition whose 108 |G | |107 | PERCENT_RANK is <= 0.5 108 |G | |107 | 109 |G | |107 | ... 185 |G | |107 | 185 |G | |107 | 46 |PG |0.00 |113 | 47 |PG |0.01 |113 | 47 |PG |0.01 |113 | ... 111 |PG |0.49 |113 | 113 |PG |0.49 |113 | 113 |PG |0.49 |113 | <-- Last row in PG partition whose 114 |PG | |113 | PERCENT_RANK is <= 0.5 114 |PG | |113 | ...
Perfect! Notice if your RDBMS doesn’t support the NULLS LAST
clause in your ORDER BY
clause (e.g. MySQL), you might either hope that it defaults to sorting NULLS LAST
(MySQL does), or you can emulate it as such:
-- This ORDER BY x NULLS LAST -- Is the same as this ORDER BY CASE WHEN x IS NULL THEN 1 ELSE 0 END, x
Emulating aggregate functions
If you’re using SQL Server and want aggregate function behaviour, I recommend using the window function instead and emulate aggregation using DISTINCT
. It will probably be easier than the emulation below. Do check for performance though!
When you’re using e.g. MySQL, which doesn’t have inverse distribution function support at all, then this chapter is for you.
Here’s how to use the aggregate function version in Oracle:
-- Without GROUP BY SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x1 FROM film; -- With GROUP BY SELECT rating, percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x2 FROM film GROUP BY rating ORDER BY rating;
Trivial! The result is the same as before:
X1 --- 114 RATING X2 ----------- G 107 NC-17 112 PG 113 PG-13 125 R 115
Now, let’s emulate these on e.g. MySQL, using window functions.
-- Without GROUP BY SELECT MAX(x1) x1 FROM ( SELECT first_value(length) OVER ( ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1 FROM ( SELECT length, percent_rank() OVER (ORDER BY length) p1 FROM film ) t ) t;
It’s exactly the same technique as before, except we now have to turn the window function behaviour (don’t group, preserve rows, repeat aggregation value on each row) back into aggregate function behaviour (group, collapse rows) by using an aggregate function, such as MAX()
. This is the same as what I did before with DISTINCT
, for illustration purposes.
-- With GROUP BY SELECT rating, MAX(x2) x2 FROM ( SELECT rating, first_value(length) OVER ( PARTITION BY rating ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2 FROM ( SELECT rating, length, percent_rank() OVER ( PARTITION BY rating ORDER BY length) p2 FROM film ) t ) t GROUP BY rating ORDER BY rating;
All we’re really doing (again) is translate the GROUP BY
expression to a PARTITION BY
expression in the window function, and then redo the previous exercise.
Conclusion
Window functions are extremely powerful. They can be used and combined to calculate a variety of other aggregations. With the above approach, we can calculate the PERCENTILE_DISC
inverse distribution function, which is not readily available in most RDBMS using a more verbose but equally powerful approach that uses PERCENT_RANK
and FIRST_VALUE
in all RDBMS that support window functions. A similar exercise could be made with PERCENTILE_CONT
with a slightly more tricky approach to finding that FIRST_VALUE
, which I’ll leave as an exercise to the reader.
A future jOOQ version might emulate this for you, automatically.
Liked this article? You may also like 10 SQL Tricks That You Didn’t Think Were Possible.