How we learned to trust our developers with infrastructure changes
This is a tale of how Bynder engineers went from week-long waits to having the power to change infrastructure on their own.
When I first joined Bynder in 2019, I was assigned to the DevOps team. That was the first time I saw what it actually meant to manage all of a product’s infrastructure as code (IaC). Sure, I had dabbled in Ansible and similar tech, but here I was seeing every single piece of infrastructure written in Terraform, stored in a code repository, proposed in a pull request and approved by another member of the team. All this before it was even deployed into production. “Amazing!” I thought.
Not long after, I was given my first ticket. A team wanted me to make a small change to the infrastructure. “This is no big deal,” I thought. I wrote the code, created the pull request, got it approved and got it deployed. BOOM, done in just one day of work, easy peasy lemon squeezy.
Proud of the work I had just accomplished in such a small amount of time, I went to change the ticket status to done. That is when I realized… the ticket had been open for two weeks.
Let that sink in: a small infrastructure change had blocked a development team for two weeks.
Two weeks of waiting
This kept me up at night… I’m a DevOps Engineer, but not long ago I was a Backend Engineer. At my previous job I had to wait months to get code released into production. Getting infrastructure changes done was even worse. I didn’t want anyone else to go through what I went through just to get a port changed in a firewall rule.
Fast forward six months, and a colleague mentioned Atlantis. No, not the lost underwater city once ruled by Aquaman; I mean the fully open-source project for managing Terraform code in pull requests.
I started looking into this technology and it hit me: this was the key to solving all the problems in the world! Well, at least if you are like me and your world revolves around DevOps and developer efficiency.
After trying it out with a couple of test projects, we saw its benefits:
- It integrates seamlessly with GitHub pull requests;
- It applies the Terraform code on a server, so no more hours spent with your laptop open, waiting for infrastructure to be created;
- It fully integrates with Terraform, so there is no need to change our existing code;
- We can set it to require reviews from the DevOps team;
- And the best part, anyone can propose and apply changes!
After we implemented Atlantis for our infrastructure code, we started using it ourselves within the DevOps team to make sure it worked as we expected. Not long after, we started letting development teams know they could propose their own code changes.
What is trust?
Development teams started with easy configuration changes, like changing database connection strings or adding a couple of new secrets to the secrets manager. As developers adopted the technology, we saw how much faster infrastructure changes were making it into production.
Of course, trust doesn’t mean you have to blindly accept every change a team proposes. That would be like leaving a baby in a room full of candy and expecting them not to get sick.
So we came up with what we consider a good middle ground:
- Almost all code changes must be reviewed by a DevOps team member;
- Some files can be reviewed by the development team owning said files;
- Destructive changes follow a stricter process requiring multi-factor authentication;
- Security-sensitive changes (like IAM) require multi-factor authentication.
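To make the middle ground above more concrete, here is a minimal sketch of how such rules could be expressed as a single decision function. It is an illustration only, not how Atlantis or our setup actually enforces them: the team name, the file patterns, the `Change` and `Requirements` types and the choice to route every MFA-gated change to the DevOps team are all assumptions made up for the example.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

# Hypothetical glob patterns for files a development team may review itself.
TEAM_OWNED_PATTERNS = {
    "team-search": ["projects/search/*.tfvars"],
}

# Hypothetical patterns marking security-sensitive code such as IAM.
SENSITIVE_PATTERNS = ["*iam*.tf"]


@dataclass(frozen=True)
class Change:
    """A proposed infrastructure change (illustrative model)."""
    files: tuple[str, ...]
    destroys_resources: bool = False


@dataclass(frozen=True)
class Requirements:
    """Who must review the change and whether MFA is required."""
    reviewer: str
    mfa: bool


def owning_team(files):
    """Return the team that owns every file in the change, or None."""
    for team, patterns in TEAM_OWNED_PATTERNS.items():
        if files and all(any(fnmatch(f, p) for p in patterns) for f in files):
            return team
    return None


def review_requirements(change):
    """Map a change to its review requirements."""
    sensitive = any(
        fnmatch(f, p) for f in change.files for p in SENSITIVE_PATTERNS
    )
    # Destructive and security-sensitive changes require MFA.
    mfa = change.destroys_resources or sensitive
    # Assumption: MFA-gated changes always go to the DevOps team.
    team = None if mfa else owning_team(change.files)
    # Default: everything else is reviewed by the DevOps team.
    return Requirements(reviewer=team or "devops", mfa=mfa)
```

For example, `review_requirements(Change(files=("projects/search/dev.tfvars",)))` would route the review to the hypothetical `team-search`, while the same change with `destroys_resources=True` would go to the DevOps team with MFA required.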
Three years later…
Now you might be thinking, “Cool story, but you only removed two steps from the process.” While that is true, if you take a closer look you’ll realize that we not only removed the waiting time between ticket creation and pickup, we also removed most of the manual processes involved.
Three years later, our infrastructure code base looks like this:
- ~250 Pull Requests per month
- 213 Terraform projects
- 27 AWS accounts
- 1430 Terraform states
All of this with a DevOps team of only 7 engineers, who can now focus on solving bigger problems for our development teams.
And because of this shift, development teams are much more aware of the infrastructure their applications run on. This leads not only to faster and safer development cycles but also to more experimentation with new technology.
What about you? Do you think development teams should take ownership of their infrastructure?