DevOps: It’s the culture, stupid!

Posted by chetan on October 26, 2013 in development, work

Last week saw the return of DevOpsDays to New York and like many who attended, I went into day one without a solid definition for, or real understanding of, what “DevOps” actually means. Does it mean wearing both dev and ops hats? Is it a new team outside of the existing dev and ops teams? Is it a tool?

My first introduction to DevOps was probably via tools like Puppet and Chef, a new breed of configuration management tool. Though Puppet had already been around for quite a while and was steadily gaining in popularity, Chef arrived at around the same time that the DevOps movement began. So with all the overlapping blog posts on the two, it’s all too easy to take an automation tool like Chef and come up with a very literal (and very wrong) definition: Chef/Puppet = Automation Software = Infrastructure as Code = Development + Operations = DevOps! Got it! Well, not quite…

By the end of day two I think I finally had it sorted out. DevOps isn’t just about fancy tools or automation or “infrastructure as code” or anything of the sort; no, it’s really quite simple: it’s the culture, stupid!

In hindsight it’s very telling that there were nearly no scheduled talks about tools (Etsy’s Morgue being the lone exception, but I’ll come back to that) — this wasn’t a conference for learning about shiny new toys like Ansible or Salt, but instead for learning about DevOps culture.

John Willis probably describes it best with his acronym, CAMS: Culture, Automation, Measurement, Sharing. Measurement and automation facilitate a culture of collaboration and sharing between teams: not just dev and ops, but also QA, product, and really just about every single group in the organization. So tools are important, but culture comes first. If you don’t get the culture right, the tools aren’t going to help. If your teams don’t communicate, the tools alone won’t save you. Remember, the goal is to ship better products, not to ship crap products more quickly.

(Kris Buytaert added lean to the mix to create CLAMS, which is great, but not absolutely necessary. Your org doesn’t have to be lean or agile in order to benefit from a DevOps culture, but DevOps is essential to doing agile efficiently and well.)

What Is DevOps Culture?

Engineering teams have traditionally been siloed in their own little worlds. The usual software development cycle goes something like this: developers work on a release for six months with no involvement from the operations team and, when it’s ready to go into production, simply “throw it over the wall” to ops. So it’s no surprise that memes like this one hit all too close to home:

 

 

DevOps is a reaction to the increasing tensions between various groups in the engineering organization; an attempt to call a ceasefire and get them to work together, rather than continue their petty turf wars. There were repeated calls last week to break down those silos and to collaborate and share more. That’s great in theory, but there was very little in the way of “here’s how you do it,” only examples of how not to do it.

DevOps Culture Is Blameless Culture

One talk which I thought at least indirectly shed some light on the topic was Bethany Macri’s introduction to Morgue, Etsy’s new open source tool for collecting post-mortem information after a production failure. While the tool itself is intriguing, the most interesting part of the talk for me was the description of their “blameless” post-mortem process. I had seen John Allspaw talk about it a couple of years ago (blog post, video), but it only really started to click when presented within the larger DevOps context.

What is it?

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

  • what actions they took at what time,
  • what effects they observed,
  • expectations they had,
  • assumptions they had made,
  • and their understanding of timeline of events as they occurred.

…and that they can give this detailed account without fear of punishment or retribution.

— via Blameless PostMortems and a Just Culture

What it boils down to is this: once you put all blame aside, you get an honest understanding of what happened and why. Willis alludes to the same idea in his anecdote for the definition of Sharing:

Jody Mulkey, the CIO at Shopzilla, told me that when they get in the war room the developers and operations teams describe the problem as the enemy, not each other.

Blameless culture can also be applied to the engineering org as a whole, throughout the entire development lifecycle; not just during post-mortems. If you can get people across various teams to step back for a second and realize that everyone is working towards the same basic goal in the interests of the company, you can hopefully move the group in a more open and collaborative direction. Teams should be willing to accept input from other teams, even if unprompted. It’s not because your code sucks, it’s because we think it can be better, and if the result is better, we all win, right?

Some actionable advice

Etsy gives us some very actionable advice for running post-mortems in a blameless way, but how do we apply this to the rest of the organization? The best answer I’ve been able to come up with is in the form of more questions, asked by people wearing different hats in an organization:

Developers:

  • What can I do to help ops? Are there changes to the build process I can make so my app is easier to deploy? Easier to monitor? A more usable log format? What can we do in the app to mitigate the production incident we just had?
  • Making architecture changes? Need to add a queue? I should involve ops in the discussion.
  • What about a tighter feedback loop with QA?

QA:

  • What’s going on in production? Can I see logs? Metrics?
  • What’s the current build running right now? What’s the configuration?

Operations:

  • What information can I provide back to dev & QA from our various environments? Do they need access to logs or metrics?
  • Do we have clear, standard processes for the build/deploy/postmortem (support) phases?

Business:

  • What’s in production right now?
  • How often do we deploy? Are they features or fixes going out?
  • What’s the error rate?
  • How’s performance?

The common thread here is visibility into what’s going on, not in any one area, but everywhere. This is where automation and tooling play a huge role, because answering a lot of these questions without the right tools would be far too time-consuming to have any chance of actually getting an answer, much less with any regularity.
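To make that a little more concrete, here is a rough sketch of how a few of the business questions above (how often do we deploy, features or fixes, what’s the error rate) could be answered once deploys and request counts are recorded somewhere. Everything in it is an assumption for illustration: the `Deploy` record, the "feature"/"fix" labels, the function names, and the sample data are all made up, and a real setup would pull this from whatever deploy tooling and metrics store you already have.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Deploy:
    """Hypothetical deploy record; real data would come from your deploy tooling."""
    day: date
    kind: str  # "feature" or "fix" (made-up labels)


def deploys_per_week(deploys):
    """Average deploys per 7-day week over the span from first to last deploy."""
    if not deploys:
        return 0.0
    days = [d.day for d in deploys]
    span_days = (max(days) - min(days)).days + 1  # inclusive of both ends
    return len(deploys) / (span_days / 7)


def kind_breakdown(deploys):
    """Count how many deploys were features vs. fixes."""
    return dict(Counter(d.kind for d in deploys))


def error_rate(total_requests, failed_requests):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


# Made-up sample data covering two weeks
deploys = [
    Deploy(date(2013, 10, 1), "feature"),
    Deploy(date(2013, 10, 3), "fix"),
    Deploy(date(2013, 10, 8), "fix"),
    Deploy(date(2013, 10, 14), "feature"),
]

print("deploys per week:", deploys_per_week(deploys))
print("breakdown:", kind_breakdown(deploys))
print("error rate:", error_rate(total_requests=2000, failed_requests=30))
```

None of this is hard, which is sort of the point: the arithmetic is trivial, but only if the data is already being collected automatically and anyone in the org can get at it.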