At TaskRabbit, like many places, mobile is becoming more and more important all the time. One major result of this is that our Taskers now exclusively use an iOS or Android app to get, book, and report work. When someone wants to hire them, or when something comes up that we think they might be a great fit for, we send them a push notification. In the past, there were also emails and equivalent functionality on a web site, but now we’re all in on mobile.
This situation elevates those push notifications from a feature to a critical piece of infrastructure. When we realized this, we also realized that we weren’t monitoring it as well as our other critical infrastructure. There are plenty of tests in the server code to make sure that we generate push notifications correctly and plenty in the app code to receive and display them. But there was nothing to be absolutely sure that the channel was open on production at any given time.
Here is how push notifications are sent in our flow.
- Some code decides that it wants to send a push notification and publishes the message to Resque Bus.
- Switchboard, our bus app listening for that, picks it up, puts together the right information, and sends it to Urban Airship.
- Urban Airship figures out a bunch of Apple/Google stuff I don’t want to know about and sends it to the appropriate place.
- Apple/Google pushes it down to the device
- The device receives the message and shows it and/or runs code on the device.
It turns out there are plenty of things that can go wrong. A few we’ve seen:
- The message bus is backed up and something with higher priority is starving the messages being processed.
- Urban Airship is having problems of some sort or a minor API change is needed.
- Our certificates between the device/Apple/Urban Airship are not coordinated correctly or have expired.
- The device isn’t registering its information correctly.
When monitoring important things, I’m a big believer in finding the ground truth of what it means to be “working” and monitoring that. For example, you could monitor that some cron job is running by that cron job telling you every time it runs. But I’d rather check the data that the job manipulates and make sure that is up to date.
The only parallel that we could think of for our push situation was actually sending the push notifications. So here’s what we do:
- Every 5 minutes send a “canary” push notification to each app, targeting a specific user made for the purpose.
- On each app, if it receives this “canary” notification, POST its contents (a timestamp of when it was created) back to the server.
- The server writes down in Redis the timestamp for each application.
- Monitor those timestamps. If they are older than 11 minutes or not present, alert.
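The server side of those steps can be sketched roughly as follows. This is a minimal illustration in Python, not our production code: a plain dict stands in for Redis, and the app names and the `record_canary` and `stale_apps` functions are hypothetical names made up for the example. Only the 11-minute threshold comes from the scheme above.

```python
import time

# Threshold from the scheme above: alert if older than 11 minutes.
STALE_AFTER_SECONDS = 11 * 60

# Hypothetical app identifiers, one entry per app being monitored.
APPS = ("ios_app_one", "ios_app_two", "android_app")


def record_canary(store, app, created_at):
    """Handle an app's POST: keep the newest canary timestamp per app.

    `store` is any dict-like object; it stands in for Redis here.
    """
    previous = store.get(app)
    if previous is None or created_at > previous:
        store[app] = created_at


def stale_apps(store, now=None, apps=APPS):
    """Return the apps whose last canary is missing or too old."""
    now = time.time() if now is None else now
    stale = []
    for app in apps:
        last_seen = store.get(app)
        if last_seen is None or now - last_seen > STALE_AFTER_SECONDS:
            stale.append(app)
    return stale
```

A monitor would call `stale_apps` on a schedule and alert whenever the list is non-empty. The 11-minute threshold is a little more than twice the 5-minute send interval, so a single lost canary does not trigger an alert but two in a row do.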
In this scheme, and in a first for us, we actually need real devices. iOS apps only run code if the app is actually running; otherwise the device just shows the notification. So we needed 2 iOS devices to handle our 2 iOS apps. Android allows us to always run the code, so we just have 1 Android device. We took a piece of wood that was left over from building our bar (ha!), zip-tied a power strip to it, and velcro’d the devices so we could pick them up more easily. We logged in as our user, and waited for the pushes to come. And they did! One other note is that we have to update the apps on these devices as they get updated for our users. This is important because we need to test the device registration code as well.