TaskRabbit is Hiring!

We’re a tight-knit team that’s passionate about building a solution that helps people by maximizing their time, talent and skills. We are actively hiring for our Engineering and Design teams. Click To Learn more

Aaron Binns

Of Medians and MySQL

@ 30 Sep 2015

mysql


Here at TaskRabbit, our primary analytics database is MySQL running in Amazon RDS. The company-wide interface to that database is Looker.

Since Looker is a web interface on top of MySQL, the main statistical tools available are those that are provided by MySQL, such as SUM(), MIN(), MAX(), AVG() and so forth.

And when is comes to statistical tools on MySQL, one of the common questions is:

How do I compute a median in MySQL?

Computing a median (or other percentile) for a numeric DB field is a very common situation for us. For example, we want to monitor both the median and mean price that clients are paying for housekeeping services in New York City. We might even want to track the 25th and 75th percentiles as well.

For computing medians in MySQL, the best answer that I found on StackOverflow is this one from Serpiton, excerpt:

...
SUBSTRING_INDEX(
  SUBSTRING_INDEX(
    GROUP_CONCAT(
      (QNTY_Sell/QNTYDelivered)*1000
      ORDER BY (QNTY_Sell/QNTYDelivered)*1000
      SEPARATOR ',')
    , ',', (COUNT(*) + 1) / 2)
  , ',', -1)
...

I like this approach of using GROUP_CONCAT() and SUBSTRING_INDEX() because it’s simple (IMO) and doesn’t rely on vendor extensions or custom libraries/plugins.

However, one problem that I encountered when using this approach is that MySQL has a maximum length on the catenated string. In our MySQL deployment, if that maximum is exceeded, then the string is silently truncated, and the rest of the computation of the median continues with that truncated string, leading to an incorrect value.

Now, MySQL does have a config param to incrase that length, which we set in our MySQL server config

group_concat_max_len = <some large value>

Even so, if one exceeds that value, then the error due to silent truncation can still happen and you might not know it. :(

So, I combined this median computation technique with a few other SQL tricks to come up the following gnarly SQL snippet to compute a percentile while checking for truncation errors. If the string in the GROUP_CONCAT() is truncated, then you’ll get a NULL value, which can alert the analyst that something went wrong.

The whole enchilada

Here’s the full kit:

CAST(
  NULLIF(
    SUBSTRING_INDEX(
      SUBSTRING_INDEX(
        if ( CHAR_LENGTH(          GROUP_CONCAT( x ORDER BY x SEPARATOR ',' )             ) -
             CHAR_LENGTH( REPLACE( GROUP_CONCAT( x ORDER BY x SEPARATOR ',' ), ',' , '' ) )
             = COUNT(x) - 1
             , GROUP_CONCAT( x ORDER BY x SEPARATOR ',' )
             , null
           ) 
           , ','
           , CEILING( COUNT(x) * 2 / 4 ) )
      , ',', -1 )
    , '' )
  AS SIGNED )

Let’s unpack that a bit, from the inside-out.

Guard against silent string truncation

Starting with the innermost bit, we have:

if ( CHAR_LENGTH(          GROUP_CONCAT( x ORDER BY x SEPARATOR ',' )             ) -
     CHAR_LENGTH( REPLACE( GROUP_CONCAT( x ORDER BY x SEPARATOR ',' ), ',' , '' ) )
     = COUNT(x) - 1
     , GROUP_CONCAT( x ORDER BY x SEPARATOR ',' )
     , null
   )

This was inspired by Yannis on StackOverflow answering the question of how to count the number of items in a GROUP_CONCAT string. By taking the length of the string, then taking the length of the string with the separators removed, the difference in those lengths is the number of separators; and the number of separators is one less than the number of items in the string.

Lastly, we compare that number to the expected number of items in the COUNT() and if they differ, return a NULL. If they match, then return the verified catenated string.

Find the median

Now that we have the verified catenated string, or a NULL if there was a problem verifying it, we get the median value

SUBSTRING_INDEX(
  SUBSTRING_INDEX(
         <verified catenated string>
       , ','
       , CEILING( COUNT(x) * 2 / 4 ) )
  , ',', -1 )

This is pretty much exactly as was originally suggested by Serpiton for computing the median. Of course, you can replace 2 with 1 or 3 to get the 25th and 75th percentiles respectively.

Convert to number, handling empty strings

The GROUP_CONCAT() and SUBSTRING_INDEX() functions can produce a string value, but since we are computing the median of numeric values, we want to convert the final value to a numeric.

CAST( 
  NULLIF(
    <median value as string>
    , '' )
  AS SIGNED )

The NULLIF ensures that if the string is empty (which can happen) we get a NULL value. If you don’t do this, and just have

CAST( '' AS SIGNED )

then the CAST() will convert the empty string to 0 – which is bad.

That’s all folks

That’s pretty much it.

Comments

Coments Loading...