Quantcast
Channel: sql – Java, SQL and jOOQ.
Viewing all articles
Browse latest Browse all 426

How to Emulate PERCENTILE_DISC in MySQL and Other RDBMS

$
0
0

In my previous article, I showed what the very useful percentile functions (also known as inverse distribution functions) can be used for.

Unfortunately, these functions are not ubiquitously available in SQL dialects. As of jOOQ 3.11, they are known to work in these dialects:

Dialect As aggregate function As window function
MariaDB 10.3.3 No Yes
Oracle 18c Yes Yes
PostgreSQL 11 Yes No
SQL Server 2017 No Yes
Teradata 16 Yes No

Oracle has the most sophisticated implementation, which supports both the ordered set aggregate function, and the window function version:

  • Aggregate function: PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x)
  • Window function: PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x) OVER (PARTITION BY y)

Workarounds if the feature is unavailable

Luckily, as soon as an RDBMS supports window functions, we can easily emulate PERCENTILE_DISC using PERCENT_RANK and FIRST_VALUE as follows. We’re using the Sakila database in this example.

Emulating window functions

Let’s emulate these first, as it requires a bit less SQL transformations. This query works out of the box in Oracle:

SELECT DISTINCT
  rating,
  percentile_disc(0.5) 
    WITHIN GROUP (ORDER BY length) 
    OVER() x1,
  percentile_disc(0.5) 
    WITHIN GROUP (ORDER BY length) 
    OVER (PARTITION BY rating) x2
FROM film
ORDER BY rating;

Yielding

RATING  X1      X2
-------------------
G       114     107
NC-17   114     112
PG      114     113
PG-13   114     125
R       114     115

What we can read from this is that the median length of all films is 114 minutes, and the median lengths of films per rating range from 107 minutes to 125 minutes. I’ve used DISTINCT because we don’t care about visualising these values on a per-row basis in this case. This also works in SQL Server.

Now, let’s assume we’re using PostgreSQL, which doesn’t support inverse distribution window functions, or MySQL, which doesn’t support inverse distribution functions at all, but both support PERCENT_RANK and FIRST_VALUE. Here’s the complete query:

SELECT DISTINCT
  rating,
  first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1,
  first_value(length) OVER (
    PARTITION BY rating 
    ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
FROM (
  SELECT
    rating,
    length,
    percent_rank() OVER (ORDER BY length) p1,
    percent_rank() OVER (PARTITION BY rating ORDER BY length) p2
  FROM film
) t
ORDER BY rating;

So, we’re doing this in two steps (visual example further down):

  1. PERCENT_RANK: In a derived table, we’re calculating the PERCENT_RANK value, which attributes a rank to each row ordered by length, going from 0 to 1. This makes sense. When looking for the median value, we’re really looking for the value whose PERCENT_RANK is 0.5 or less. When looking for the 90% percentile, we’re looking for the value whose PERCENT_RANK is 0.9 or less
  2. FIRST_VALUE: Once we’ve found the PERCENT_RANK, we’re not quite done yet. We need to find the last row whose PERCENT_RANK is less or equal to the percentile we’re interested in. I could have used LAST_VALUE, but then I would have needed to resort to using the quite verbose range clause of window functions. Instead, I when ordering the rows by PERCENT_RANK (p1 or p2), I translated all ranks higher than the percentile I’m looking for into NULL using a CASE expression, and then I made sure using NULLS LAST that the percentile I’m looking for will be the first row in the FIRST_VALUE function’s window specification. Easy!

To visualise this, let’s run these queries, which also project the p1 and p2 values respectively:

SELECT
  length,
  CASE WHEN p1 <= 0.5 THEN p1 END::numeric(3,2) p1,
  first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1
FROM (
  SELECT
    length,
    percent_rank() OVER (ORDER BY length) p1
  FROM film
) t
ORDER BY length;

The result is

length |p1   |x1  |
-------|-----|----|
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
46     |0.00 |114 |
47     |0.01 |114 |
...
113    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 |
114    |0.49 |114 | <-- Last row whose PERCENT_RANK is <= 0.5
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
115    |     |114 |
...
185    |     |114 |
185    |     |114 |
185    |     |114 |

So the FIRST_VALUE function just searches for that first row (descendingly, i.e. bottom up) whose p1 value is non-null.

The same for p2:

SELECT 
  length,
  rating,
  CASE WHEN p2 <= 0.5 THEN p2 END::numeric(3,2) p2,
  first_value(length) OVER (
    PARTITION BY rating 
    ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
FROM (
  SELECT
    rating,
    length,
    percent_rank() OVER (PARTITION BY rating ORDER BY length) p2
  FROM film
) t
ORDER BY rating, length;

Yielding:

length |rating |p2   |x2  |
-------|-------|-----|----|
47     |G      |0.00 |107 |
47     |G      |0.00 |107 |
48     |G      |0.01 |107 |
48     |G      |0.01 |107 |
...
105    |G      |0.47 |107 |
106    |G      |0.49 |107 |
107    |G      |0.49 |107 |
107    |G      |0.49 |107 | <-- Last row in G partition whose
108    |G      |     |107 |     PERCENT_RANK is <= 0.5
108    |G      |     |107 |
109    |G      |     |107 |
...
185    |G      |     |107 |
185    |G      |     |107 |
46     |PG     |0.00 |113 |
47     |PG     |0.01 |113 |
47     |PG     |0.01 |113 |
...
111    |PG     |0.49 |113 |
113    |PG     |0.49 |113 |
113    |PG     |0.49 |113 | <-- Last row in PG partition whose
114    |PG     |     |113 |     PERCENT_RANK is <= 0.5
114    |PG     |     |113 |
...

Perfect! Notice if your RDBMS doesn’t support the NULLS LAST clause in your ORDER BY clause (e.g. MySQL), you might either hope that it defaults to sorting NULLS LAST (MySQL does), or you can emulate it as such:

-- This
ORDER BY x NULLS LAST

-- Is the same as this
ORDER BY
  CASE WHEN x IS NULL THEN 1 ELSE 0 END,
  x

Emulating aggregate functions

If you’re using SQL Server and want aggregate function behaviour, I recommend using the window function instead and emulate aggregation using DISTINCT. It will probably be easier than the emulation below. Do check for performance though!

When you’re using e.g. MySQL, which doesn’t have inverse distribution function support at all, then this chapter is for you.

Here’s how to use the aggregate function version in Oracle:

-- Without GROUP BY
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x1
FROM film;

-- With GROUP BY
SELECT
  rating,
  percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x2
FROM film
GROUP BY rating
ORDER BY rating;

Trivial! The result is the same as before:

X1
---
114


RATING  X2
-----------
G       107
NC-17   112
PG      113
PG-13   125
R       115

Now, let’s emulate these on e.g. MySQL, using window functions.

-- Without GROUP BY
SELECT
  MAX(x1) x1
FROM (
  SELECT first_value(length) OVER (
    ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1
  FROM (
    SELECT
      length,
      percent_rank() OVER (ORDER BY length) p1
    FROM film
  ) t
) t;

It’s exactly the same technique as before, except we now have to turn the window function behaviour (don’t group, preserve rows, repeat aggregation value on each row) back into aggregate function behaviour (group, collapse rows) by using an aggregate function, such as MAX(). This is the same as what I did before with DISTINCT, for illustration purposes.

-- With GROUP BY
SELECT
  rating,
  MAX(x2) x2
FROM (
  SELECT
    rating,
    first_value(length) OVER (
      PARTITION BY rating 
      ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2
  FROM (
    SELECT
      rating,
      length,
      percent_rank() OVER (
        PARTITION BY rating 
        ORDER BY length) p2
    FROM film
  ) t
) t
GROUP BY rating
ORDER BY rating;

All we’re really doing (again) is translate the GROUP BY expression to a PARTITION BY expression in the window function, and then redo the previous exercise.

Conclusion

Window functions are extremely powerful. They can be used and combined to calculate a variety of other aggregations. With the above approach, we can calculate the PERCENTILE_DISC inverse distribution function, which is not readily available in most RDBMS using a more verbose but equally powerful approach that uses PERCENT_RANK and FIRST_VALUE in all RDBMS that support window functions. A similar exercise could be made with PERCENTILE_CONT with a slightly more tricky approach to finding that FIRST_VALUE, which I’ll leave as an exercise to the reader.

A future jOOQ version might emulate this for you, automatically.

Liked this article? You may also like 10 SQL Tricks That You Didn’t Think Were Possible.


Viewing all articles
Browse latest Browse all 426

Trending Articles