There are many ways to run a DevOps team. Is there a recipe in this post to run the best ops team ever? Certainly not. This is just the starting point: the basic things you need to look for when everything is on fire and you can’t just double or triple the size of the team.
How the ops team at a startup might start
I wish I could start this section by saying “of the 74 ops teams I helped build over the last 20 years…”, but I can’t. I have, however, seen a lot of people start products, some successful, most of them not at all.
Usually it’s either a couple of developers or one person with an idea who hires the smallest number of developers possible. And you’ve got two types of developers: the ones who are in love with Docker, containers, and anything that revolves around them; and the ones who agree it’s an amazing thing we have nowadays, but that’s it.
If you’ve got one developer on your team who is in love with Docker and doesn’t mind spending weekends coding, chances are you’ve automated a lot of things for free; ops will still be a challenge, but you’ll do fine.
If you don’t have one of those developers, then your product was built with all the love and maybe, maybe, some Docker here and there, but the focus was on the product, not on getting it into production properly.
In that case, with the MVP fever running through everyone on the team, here’s what deployment to production is likely to look like the first couple of times:
- Get 2 VPS instances somewhere
- Deploy nginx and Django on one VPS
- Deploy some database on the other VPS
- Make sure you can reach nginx and that both VPSes can talk to each other
- Go back to developing that feature or bug fix while the product hopefully makes money
That’s obviously a simplification, but you get the idea.
Eventually, moving the code to the VPS and restarting the Django instance becomes painful because you’re doing it all the time, so one of the Python devs says “I’ll automate this with Fabric”.
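That first script might look something like this. A minimal sketch, assuming Fabric 2 and a systemd service for the Django app; the host, paths, and service name are made up for illustration:

```python
# deploy.py -- a hypothetical first pass at automating the manual deploy.
# The paths and the "myapp" service name are placeholders, not real values.
from fabric import task


@task
def deploy(c):
    """Pull the latest code on the app VPS and restart Django."""
    with c.cd("/srv/myapp"):
        c.run("git pull origin main")
        c.run("./venv/bin/pip install -r requirements.txt")
    # Assumes the Django app runs as a systemd service called "myapp".
    c.sudo("systemctl restart myapp")
```

Running `fab -H deploy@app-vps deploy` from a laptop already beats SSHing in and typing the same four commands by hand.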
Things run somewhat smoothly, except for those 5 days when the database kept throwing errors and two of the developers didn’t sleep much.
Then the team grows because the product is making a bit more money, and it turns out the new developer also likes to stay up to date with the latest tech in everything. He sees this set of scripts and says “We should probably do this with Puppet or Ansible”.
So he writes an Ansible playbook one day in between tickets, and deployments are now done from his machine because nobody else knows how to use the tool.
Fast forward some time: the playbook is quite complex, you have many VPS instances, and you start considering moving to AWS or some other cloud. The developer who said the word “Ansible” is now one of a group of 3 or 4 DevOps engineers. And they spend their days, and sometimes their nights, making sure everything is running, getting developers what they need to put their shiny new thing online, or, hopefully, automating some tasks here and there.
When things start moving fast, you have no time to think about the best possible option. You need to fix the problem in front of you and continue.
I always say “It’s easy to be a good parent when you’ve had a good night’s sleep”. This is a similar thing. But it’s still a mess and it needs some figuring out. Infrastructure doesn’t “grow up” on its own like kids do, so the typical parenting move of “you just need to wait it out, it’s a phase” doesn’t work here.
What do you do then?
Make the things that suck cyclical
First, just like with parents, let the team sleep peacefully on a predictable schedule. Help them get some work/life balance, even if they don’t like it or say they don’t need it.
The best approach is to set up a tool such as PagerDuty that handles this for you: it routes alerts based on an on-call schedule, escalates issues if the on-call dev doesn’t respond, and so on.
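The nice part is that your own scripts and checks can feed into the same rotation. Here’s a rough sketch using PagerDuty’s Events API v2; the routing key and alert text are placeholders, and `requests` is assumed to be available:

```python
# Hypothetical helper: push an alert into PagerDuty's Events API v2 so it
# follows the on-call schedule and escalation policy instead of a chat ping.
import requests

ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # placeholder from your service's integration settings


def trigger_alert(summary, source, severity="critical"):
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue", json=event, timeout=10
    )
    response.raise_for_status()


# Example: trigger_alert("DB connections above 90% of max", "db-vps-1")
```

Anything that calls `trigger_alert` now wakes up whoever is on call, with escalation handled for free.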
The DevOps team can now have maybe a whole week on a 9-to-5 schedule, focused on things that aren’t fires. They can plan vacations in a way that doesn’t leave the rest of the team toasted, maybe right after an on-call week.
It helps structure an otherwise completely reactive situation.
Measure and set bear traps
Running an online company is like walking through a forest covered in snow: you walk in peaceful silence, just hearing your steps on the snow, maybe some branches moving. Until a polar bear appears out of nowhere and bites your arm off, and now everything is covered in red and there’s no way anybody else can walk around without seeing either your glowing red blood or you trying to find the bandaids or something.
That is, if you don’t have polar bear traps. Polar bear traps are just like regular bear traps but, well… white. And you need to set them up everywhere surrounding the places you want to walk in peace.
NOTE TO THE READER: Please don’t hurt bears of any kind; this is an analogy.
When you first face a team like the one described in the first section, they are likely sleep deprived and, like anybody in that situation, they can’t even begin to think about what’s wrong.
Once they are rested and feel like human beings again, it’s a matter of asking the right questions. In this case, the questions are simple:
- What went wrong?
- How can we detect it?
- What signs can we see when it’s about to happen?
This is basically figuring out where the bears usually start walking towards you so that you can detect them way before they start running.
If you can accomplish that, you will still end up in situations where you need to act quickly, but with no downtime, because you catch the issue early enough.
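A trap can be as small as a cron job that checks a leading indicator and pages before it turns into an outage. A minimal sketch; the 80% threshold and the alerting hook are assumptions, not a prescription:

```python
# A hypothetical "bear trap": warn when disk usage crosses a threshold,
# long before the database starts throwing errors because the disk is full.
import shutil

WARN_AT = 0.80  # assumed threshold: warn at 80% used


def disk_usage_ratio(path="/"):
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    ratio = disk_usage_ratio()
    if ratio >= WARN_AT:
        # Wire this into whatever pages the on-call person
        # (see the PagerDuty sketch above) instead of printing.
        print(f"WARNING: disk is {ratio:.0%} full")
```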
This is not enough though.
Identify problems
Detecting the wave as it’s starting to form is super important, but that’s not how you solve an issue. You need to spend time thinking about what actually happened and what the actual problem is, not just detect the symptoms.
You need to allocate time to turn these analyses into proper postmortems.
Free up time to automate
There’s not a lot to this one: in every situation, it’s important to always have a stage where you focus only on automating tasks.
Sometimes that means just writing a bash script instead of relying on shell history and doing the task with up, up, return, up, up, return.
Sometimes it’s about changing some architectural aspect of the software so deployment is faster.
Sometimes it’s about having a tool that makes sure containers keep running and auto-scales them. This is the way to go, by the way, but depending on the size of your team and how their time is spent, it’s not a simple task. You can certainly plan it in progressive steps, though.
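If you’re not ready for a full orchestrator yet, even a small watchdog gets you part of the way. A rough sketch using the Docker SDK for Python; the container names are made up, and in practice Docker restart policies or an orchestrator like Kubernetes would be the proper answer:

```python
# Hypothetical watchdog: restart app containers that have stopped.
# A real setup would use an orchestrator (or Docker's own restart policies);
# this only illustrates the idea of keeping containers running automatically.
import docker

WATCHED = {"myapp-web", "myapp-worker"}  # assumed container names

client = docker.from_env()
for container in client.containers.list(all=True):
    if container.name in WATCHED and container.status != "running":
        print(f"{container.name} is {container.status}, restarting")
        container.restart()
```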
Repeat
There are always going to be tasks that almost objectively suck; you’ll eventually automate most of them, but there’s always something left. There are always going to be paths you didn’t put traps on, although there will be fewer and fewer of them over time. And there are always things that can be automated even more.
So it’s a never-ending job; the goal needs to be that it becomes easier and easier with time.