Here at TaskRabbit, our primary analytics database is MySQL running in Amazon RDS. The company-wide interface to that database is Looker.

Since Looker is a web interface on top of MySQL, the main statistical
tools available are those that are provided by MySQL, such as `SUM()`

,
`MIN()`

, `MAX()`

, `AVG()`

and so forth.

And when is comes to statistical tools on MySQL, one of the common questions is:

```
How do I compute a median in MySQL?
```

Computing a median (or other percentile) for a numeric DB field is a very common situation for us. For example, we want to monitor both the median and mean price that clients are paying for housekeeping services in New York City. We might even want to track the 25th and 75th percentiles as well.

For computing medians in MySQL, the best answer that I found on StackOverflow is this one from Serpiton, excerpt:

```
...
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(
(QNTY_Sell/QNTYDelivered)*1000
ORDER BY (QNTY_Sell/QNTYDelivered)*1000
SEPARATOR ',')
, ',', (COUNT(*) + 1) / 2)
, ',', -1)
...
```

I like this approach of using `GROUP_CONCAT()`

and `SUBSTRING_INDEX()`

because it’s simple (IMO) and doesn’t rely on vendor extensions or
custom libraries/plugins.

However, one problem that I encountered when using this approach is
that MySQL has a maximum length on the catenated string. In our MySQL
deployment, if that maximum is exceeded, then the string is *silently*
truncated, and the rest of the computation of the median continues
with that truncated string, leading to an incorrect value.

Now, MySQL does have a config param to incrase that length, which we set in our MySQL server config

```
group_concat_max_len = <some large value>
```

Even so, if one exceeds that value, then the error due to silent truncation can still happen and you might not know it. :(

So, I combined this median computation technique with a few other
SQL tricks
to come up the following gnarly SQL snippet to compute a percentile
while checking for truncation errors. If the string in the
`GROUP_CONCAT()`

is truncated, then you’ll get a NULL value, which can
alert the analyst that something went wrong.

### The whole enchilada

Here’s the full kit:

```
CAST(
NULLIF(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
if ( CHAR_LENGTH( GROUP_CONCAT( x ORDER BY x SEPARATOR ',' ) ) -
CHAR_LENGTH( REPLACE( GROUP_CONCAT( x ORDER BY x SEPARATOR ',' ), ',' , '' ) )
= COUNT(x) - 1
, GROUP_CONCAT( x ORDER BY x SEPARATOR ',' )
, null
)
, ','
, CEILING( COUNT(x) * 2 / 4 ) )
, ',', -1 )
, '' )
AS SIGNED )
```

Let’s unpack that a bit, from the inside-out.

### Guard against silent string truncation

Starting with the innermost bit, we have:

```
if ( CHAR_LENGTH( GROUP_CONCAT( x ORDER BY x SEPARATOR ',' ) ) -
CHAR_LENGTH( REPLACE( GROUP_CONCAT( x ORDER BY x SEPARATOR ',' ), ',' , '' ) )
= COUNT(x) - 1
, GROUP_CONCAT( x ORDER BY x SEPARATOR ',' )
, null
)
```

This was inspired by
Yannis on
StackOverflow answering the question of how to count the number of
items in a `GROUP_CONCAT`

string. By taking the length of the string,
then taking the length of the string with the separators removed, the
difference in those lengths is the number of separators; and the
number of separators is one less than the number of items in the
string.

Lastly, we compare that number to the expected number of items in the
`COUNT()`

and if they differ, return a `NULL`

. If they match, then
return the *verified* catenated string.

### Find the median

Now that we have the verified catenated string, or a NULL if there was a problem verifying it, we get the median value

```
SUBSTRING_INDEX(
SUBSTRING_INDEX(
<verified catenated string>
, ','
, CEILING( COUNT(x) * 2 / 4 ) )
, ',', -1 )
```

This is pretty much exactly as was originally suggested by Serpiton for computing the median. Of course, you can replace 2 with 1 or 3 to get the 25th and 75th percentiles respectively.

### Convert to number, handling empty strings

The `GROUP_CONCAT()`

and `SUBSTRING_INDEX()`

functions can produce a
string value, but since we are computing the median of numeric values,
we want to convert the final value to a numeric.

```
CAST(
NULLIF(
<median value as string>
, '' )
AS SIGNED )
```

The `NULLIF`

ensures that if the string is empty (which can happen) we
get a `NULL`

value. If you don’t do this, and just have

```
CAST( '' AS SIGNED )
```

then the `CAST()`

will convert the empty string to `0`

– which is
bad.

### That’s all folks

That’s pretty much it.