Here at TaskRabbit, our primary analytics database is MySQL running in Amazon RDS. The company-wide interface to that database is Looker.
Since Looker is a web interface on top of MySQL, the main statistical
tools available are those that are provided by MySQL, such as
AVG() and so forth.
And when is comes to statistical tools on MySQL, one of the common questions is:
How do I compute a median in MySQL?
Computing a median (or other percentile) for a numeric DB field is a very common situation for us. For example, we want to monitor both the median and mean price that clients are paying for housekeeping services in New York City. We might even want to track the 25th and 75th percentiles as well.
... SUBSTRING_INDEX( SUBSTRING_INDEX( GROUP_CONCAT( (QNTY_Sell/QNTYDelivered)*1000 ORDER BY (QNTY_Sell/QNTYDelivered)*1000 SEPARATOR ',') , ',', (COUNT(*) + 1) / 2) , ',', -1) ...
I like this approach of using
because it’s simple (IMO) and doesn’t rely on vendor extensions or
However, one problem that I encountered when using this approach is that MySQL has a maximum length on the catenated string. In our MySQL deployment, if that maximum is exceeded, then the string is silently truncated, and the rest of the computation of the median continues with that truncated string, leading to an incorrect value.
Now, MySQL does have a config param to incrase that length, which we set in our MySQL server config
group_concat_max_len = <some large value>
Even so, if one exceeds that value, then the error due to silent truncation can still happen and you might not know it. :(
So, I combined this median computation technique with a few other
to come up the following gnarly SQL snippet to compute a percentile
while checking for truncation errors. If the string in the
GROUP_CONCAT() is truncated, then you’ll get a NULL value, which can
alert the analyst that something went wrong.
The whole enchilada
Here’s the full kit:
CAST( NULLIF( SUBSTRING_INDEX( SUBSTRING_INDEX( if ( CHAR_LENGTH( GROUP_CONCAT( x ORDER BY x SEPARATOR ',' ) ) - CHAR_LENGTH( REPLACE( GROUP_CONCAT( x ORDER BY x SEPARATOR ',' ), ',' , '' ) ) = COUNT(x) - 1 , GROUP_CONCAT( x ORDER BY x SEPARATOR ',' ) , null ) , ',' , CEILING( COUNT(x) * 2 / 4 ) ) , ',', -1 ) , '' ) AS SIGNED )
Let’s unpack that a bit, from the inside-out.
Guard against silent string truncation
Starting with the innermost bit, we have:
if ( CHAR_LENGTH( GROUP_CONCAT( x ORDER BY x SEPARATOR ',' ) ) - CHAR_LENGTH( REPLACE( GROUP_CONCAT( x ORDER BY x SEPARATOR ',' ), ',' , '' ) ) = COUNT(x) - 1 , GROUP_CONCAT( x ORDER BY x SEPARATOR ',' ) , null )
This was inspired by
StackOverflow answering the question of how to count the number of
items in a
GROUP_CONCAT string. By taking the length of the string,
then taking the length of the string with the separators removed, the
difference in those lengths is the number of separators; and the
number of separators is one less than the number of items in the
Lastly, we compare that number to the expected number of items in the
COUNT() and if they differ, return a
NULL. If they match, then
return the verified catenated string.
Find the median
Now that we have the verified catenated string, or a NULL if there was a problem verifying it, we get the median value
SUBSTRING_INDEX( SUBSTRING_INDEX( <verified catenated string> , ',' , CEILING( COUNT(x) * 2 / 4 ) ) , ',', -1 )
This is pretty much exactly as was originally suggested by Serpiton for computing the median. Of course, you can replace 2 with 1 or 3 to get the 25th and 75th percentiles respectively.
Convert to number, handling empty strings
SUBSTRING_INDEX() functions can produce a
string value, but since we are computing the median of numeric values,
we want to convert the final value to a numeric.
CAST( NULLIF( <median value as string> , '' ) AS SIGNED )
NULLIF ensures that if the string is empty (which can happen) we
NULL value. If you don’t do this, and just have
CAST( '' AS SIGNED )
CAST() will convert the empty string to
0 – which is
That’s all folks
That’s pretty much it.