This is a post which we wrote for internal use at TaskRabbit. It is very specific to our infrastructure and topology, but we wanted to share it in the hope that it might provide value to you as well.
When you Move Fast and Break Things, sometimes… you end up breaking things.
A good engineer understands the ecosystem her application runs within, and how to check for and fix deployment errors. Perhaps your most recent deployment added a new gem that ended up consuming far more RAM than you expected, or something went wrong with your asset pipeline and now there is no CSS. This post is a survey of some of the tools we use at TaskRabbit to deploy and roll back.
TaskRabbit uses Capistrano to deploy all of our applications. We have even modified Capistrano so we can use it universally to deploy Rails, Sinatra, and even Node.js applications. This lets all of our developers know the pattern each deploy will follow, and ensures that there are no surprises between applications. Below you can see what each Capistrano deploy does (with our extensions).
At its simplest, Capistrano creates a timestamped ‘release’ directory every time you deploy, and the ‘current’ symlink is eventually pointed at it. On each deploy, you:
- Create a new deploy directory
- Update code (git pull)
- Run precompile (assets, etc)
- Run database update (migrations, seeds)
- Change the symlink (this is what Capistrano calls “commit”)
- Restart the web workers (unicorns)
- Clear the cache
- Restart the background workers (Resque)
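The release-directory and symlink steps above can be sketched in plain Ruby. This is only an illustration of the idea, not Capistrano's own code: `deploy_release`, the `REVISION` file, and the temporary link name are hypothetical, and the rename-based swap is one common way to switch a symlink without leaving a moment where ‘current’ is missing.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: create a timestamped release directory under
# deploy_root/releases and point deploy_root/current at it.
def deploy_release(deploy_root, timestamp)
  release = File.join(deploy_root, "releases", timestamp)
  FileUtils.mkdir_p(release)

  # The code update, precompile, and database steps would run inside
  # `release` here. We only write a marker file to stand in for them.
  File.write(File.join(release, "REVISION"), timestamp)

  # Build the new link under a temporary name, then rename it over
  # `current`, so `current` always points at a complete release.
  current = File.join(deploy_root, "current")
  tmp_link = "#{current}.tmp"
  FileUtils.rm_f(tmp_link)
  File.symlink(release, tmp_link)
  File.rename(tmp_link, current)

  release
end

# Demo in a throwaway directory: two deploys, then show where `current` points.
demo_root = Dir.mktmpdir("deploy-demo")
deploy_release(demo_root, "20130101120000")
deploy_release(demo_root, "20130102120000")
puts File.readlink(File.join(demo_root, "current"))
```

Because every release keeps its own directory, the previous one is still on disk after the symlink changes, which is what makes a quick rollback possible.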
98% of the time, our deployments work without any problems. 1% of the time something goes wrong, and we can recover with cap deploy:rollback. This built-in Capistrano command reverts to the previous release, redoes the symlink, and restarts the workers. It is very simple and very effective, and you can do the same steps manually if you need to. The final 1% of the time, we need to resort to the tools lower in the stack.
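To make the manual version concrete, here is a minimal sketch of the symlink half of a rollback. It assumes a layout with a `releases` directory of timestamp-named folders next to a `current` symlink; `rollback_release` is a hypothetical helper, not part of Capistrano, and it does not restart any workers.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: repoint `current` at the release just before the one
# it references now. Assumes release directory names sort chronologically.
def rollback_release(deploy_root)
  current = File.join(deploy_root, "current")
  releases = Dir.children(File.join(deploy_root, "releases")).sort
  active = File.basename(File.readlink(current))
  index = releases.index(active)
  raise "no previous release to roll back to" if index.nil? || index.zero?

  previous = File.join(deploy_root, "releases", releases[index - 1])

  # Swap the link via a temporary name so `current` never disappears.
  tmp_link = "#{current}.tmp"
  FileUtils.rm_f(tmp_link)
  File.symlink(previous, tmp_link)
  File.rename(tmp_link, current)

  previous
end

# Demo in a throwaway directory with two fake releases.
demo_root = Dir.mktmpdir("rollback-demo")
%w[20130101120000 20130102120000].each do |stamp|
  FileUtils.mkdir_p(File.join(demo_root, "releases", stamp))
end
File.symlink(File.join(demo_root, "releases", "20130102120000"),
             File.join(demo_root, "current"))
puts File.basename(rollback_release(demo_root))
```

After the link is repointed, the web and background workers still have to be restarted before they pick up the older code.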
TaskRabbit uses monit to control our entire application stack, including nginx, unicorn, and resque. This gives us a simple control interface (e.g., monit start unicorn) for all the applications we care about. While these applications may have complex boot arguments, a human can easily control them, which makes controlling the system in an emergency simple. If something goes wrong with monit, we can still inspect the monit.rc boot files to know what we need to run manually. Monit also provides a nice aggregate reporting tool, M/Monit, so we can report on and check the status of all of our servers.
We use Chef to configure our servers. To us, this means not only the underlying packages and apps, but also all of the configuration files we need. This includes the various .yml files most Rails apps need, along with Ruby itself. If something is wrong with your application, it might be worth it to manually run
We use a combination of the Ubuntu and Joyent SmartOS operating systems. It’s important to know how to read load average spikes, tell which processes are running, and peek into the system’s resources. Are you out of RAM or disk space? How can you recover it?
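As a small illustration of reading load averages, the sketch below parses the load figures from the text that the uptime command prints and flags a spike. The helper names are hypothetical, and comparing the one-minute load to the number of CPU cores is only a rule of thumb, not a threshold taken from our own tooling.

```ruby
# Matches both "load average: 0.52, 0.58, 0.59" and
# "load averages: 0.52 0.58 0.59", the two common uptime formats.
LOAD_PATTERN = /load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)/

# Returns the 1, 5, and 15 minute load averages as floats.
def parse_load_averages(uptime_output)
  match = LOAD_PATTERN.match(uptime_output)
  raise ArgumentError, "no load averages found" unless match

  match.captures.map(&:to_f)
end

# Rule of thumb (an assumption): a one-minute load above the core count
# means processes are queueing for CPU time.
def load_spike?(uptime_output, cores)
  parse_load_averages(uptime_output).first > cores
end

# Sample text in the shape uptime prints; the numbers are made up.
sample = " 14:02:11 up 12 days,  3:41,  2 users,  load average: 9.50, 4.20, 2.10"
puts parse_load_averages(sample).inspect
puts load_spike?(sample, 4)
```

On a live server you could pass in the real output of uptime together with `Etc.nprocessors` from Ruby's standard library. A one-minute figure far above the 15-minute one suggests the spike is recent, which is worth checking against the time of the last deploy.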
Did this deploy change an API endpoint or expect a different schema from another application? It’s important to have higher-level metrics about the interplay between your apps. Is your Resque queue backed up, or are new users not getting sent emails?
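One simple higher-level check is to compare queue depths against limits you consider healthy. The sketch below works on a plain hash of queue sizes; the queue names, limits, and the `backed_up_queues` helper are all hypothetical. In a real app the sizes could come from something like `Resque.size(queue_name)`.

```ruby
# Returns the names of queues whose backlog exceeds their limit.
# `limits` maps queue name => largest healthy backlog; queues without an
# entry fall back to `default_limit`. All values here are assumptions.
def backed_up_queues(queue_sizes, limits, default_limit: 100)
  queue_sizes.select { |name, size| size > limits.fetch(name, default_limit) }.keys
end

# Made-up numbers for illustration.
sizes = { "mailer" => 1200, "default" => 12 }
limits = { "mailer" => 500 }
puts backed_up_queues(sizes, limits).inspect
```

A backed-up mailer queue right after a deploy is a strong hint that new users are not getting their emails, even if every process looks healthy.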
At TaskRabbit, we don’t do continuous deployment, but we do want to reduce the time it takes to get a feature up on production as much as possible. To this end, we allow any team lead to deploy as long as they demonstrate that they can recover from a bad deployment. In the image below, you will see the questions on our deployment test. Can you answer them all?