You are viewing content from a past/completed QCon.

Presentation: Lessons From 300k+ Lines of Infrastructure Code

Track: Operationalizing Microservices: Design, Deliver, Operate

Location: Fleming, 3rd flr.

Duration: 2:55pm - 3:45pm

Day of week: Tuesday


This presentation is now available to view on InfoQ.com


What You’ll Learn

  • Learn how to write infrastructure code that you can rely on in production.
  • Get access to Gruntwork’s production-grade infrastructure checklist, a detailed list of what it takes to go to production.
  • Find out how to write, test, maintain, and release infrastructure code at scale.


This talk is a concise masterclass on how to write infrastructure code. I’ll share key lessons from the “Infrastructure Cookbook” we developed at Gruntwork while creating and maintaining a library of over 300,000 lines of infrastructure code that’s used in production by hundreds of companies. Come and hear our war stories, laugh about all the mistakes we’ve made along the way, and learn what Terraform, Packer, Docker, and Go look like in the wild. Topics include how to design infrastructure APIs, automated tests for infrastructure code, patterns for reuse and composition, refactoring, namespacing, versioning, CI / CD for infrastructure code, and more.


What's the focus of Gruntwork?


What we're trying to do is make it easier to build software. We think software is pretty important. There are a few technologies in human history that have completely changed everything: the wheel, fire, and software. But we also believe that building software is ridiculously hard, completely unreasonably hard, given how important it is. We need to make it more accessible to more people. There are a lot of reasons it's hard to build software. We're focused on the DevOps space, which seems to be especially painful. Even people who are professional programmers seem to struggle enormously with figuring out how to actually run and manage their software. At the vast majority of companies, there are thousands of developers sitting around all doing the same thing. They're all trying to figure out how to deploy MongoDB, or Kafka, or ZooKeeper, and they're all doing it basically the same way, or they could be doing it the same way, but for some reason every company is doing it from scratch, and it's a huge waste of time. So we've built up a library of infrastructure code that's used in production by hundreds of companies and that basically has solutions to all the common infrastructure problems that are out there, including how to run a cluster of servers, how to run a database, how to configure CI/CD, monitoring, alerting, all the things that every company needs. We've done the undifferentiated heavy lifting for you so you can just focus on your actual products rather than doing a bunch of grunt work.


What are we talking about, Ansible scripts that people can deploy and build?


It's a combination of a few technologies. It's all infrastructure as code tools. We use a lot of Terraform, a good amount of code in Go for things that we want to run on multiple operating systems, some Python, and plenty of Bash scripts because that's the state of the art in 2019. And a lot of code built for running in the cloud, AWS, GCP and Azure.


What are you talking about that your library offers?


The library consists of about 50 Git repositories, each one focused on one specific problem. One repo is for setting up your network topology, another for a Docker cluster, another for a database, etc., and inside of each repo are a bunch of standalone, reusable modules. Part of what I'll be discussing in my talk is how you should organize infrastructure code. Our library consists of a bunch of these little reusable modules, and the idea is we want you to be able to combine them like Lego building blocks.

For example, say you have a module that can run an ELK (Elasticsearch, Logstash, Kibana) cluster. There might be one module that runs Kibana, a separate one that configures the firewall settings for Kibana, and another one that configures the permissions it needs to talk to a cloud system, and you can basically combine those in different ways. So maybe you'll run Elasticsearch and Kibana on separate clusters, or maybe Elasticsearch master nodes and data nodes on separate clusters, or in a dev environment you might run them all on one cluster or even a single server. But the point is we have these reusable building blocks that you can combine in a bunch of different ways, plus a lot of example code that shows you these different permutations.
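As a rough illustration of the building-block idea in the ELK example, here is a minimal Python sketch. It is not Gruntwork's code or API: the module names, the `Cluster` type, and the two environment functions are all hypothetical, and real modules would be written in tools such as Terraform rather than Python. It only shows the same small modules being combined one way for dev and another way for production.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Module:
    """A small, standalone, reusable unit of infrastructure code (hypothetical)."""
    name: str


# Hypothetical module names, loosely based on the ELK example in the text.
ELASTICSEARCH = Module("elasticsearch")
KIBANA = Module("kibana")
KIBANA_FIREWALL = Module("kibana-firewall-rules")
KIBANA_PERMISSIONS = Module("kibana-cloud-permissions")


@dataclass
class Cluster:
    """A group of servers that runs one or more modules (hypothetical)."""
    name: str
    modules: List[Module] = field(default_factory=list)


def dev_environment() -> List[Cluster]:
    # Dev: run everything together on a single cluster.
    return [
        Cluster("elk-dev", [ELASTICSEARCH, KIBANA, KIBANA_FIREWALL, KIBANA_PERMISSIONS]),
    ]


def prod_environment() -> List[Cluster]:
    # Prod: the same modules, split across separate clusters.
    return [
        Cluster("elasticsearch-prod", [ELASTICSEARCH]),
        Cluster("kibana-prod", [KIBANA, KIBANA_FIREWALL, KIBANA_PERMISSIONS]),
    ]
```

Both environments reuse exactly the same four modules; only the grouping changes, which is the point of keeping each module small and standalone.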

And so you can assemble your infrastructure out of these pieces, out of these modules. All of them are code. So it's not like we're shipping a platform as a service—it's not Heroku, it's code. And we really think that the only way that companies at scale can be effective is to use code, because you can combine code anywhere you want, you can handle all the different use cases. The only way you can do that is if you have access to the code. That's what we're providing.


Regarding your talk, you're going to go through some of the lessons you've learned over the years putting this together?


Yeah, the library that we've put together, at last count, is around 300,000 lines of code. We spent quite a few years working on this stuff. Along the way we got an awful lot of things wrong, a lot of really fun mistakes. The goal of the talk is to save people a lot of pain. I think in a 40-minute talk you're going to be able to pick up several years' worth of experience really quickly. And the real goal, the thing that I'm trying to get across, is what it takes to build infrastructure code that you can use in production. Not infrastructure code that's great for a five-minute demo, not infrastructure code that looks really pretty, but infrastructure code that you can actually bet your company on. You're going to run your database. You're going to run your search index. You're going to run the most core things that your company does if you're a software company. You're going to put them onto this infrastructure code. What does it take to do that? What are the things you need to actually keep in mind? There are a lot of amazing tools out there these days, but if you don't use them correctly, they're not going to help you that much, and there are far more ways to use them incorrectly than correctly. So that's the goal of the talk, to help you figure out how to use these infrastructure as code tools to go to production successfully.


Can you give me an example of one of the lessons you might talk about that a lot of people get wrong?


There are a few things that I go over. One of the real highlights is what do you need to do to go to production? I'm going to share a checklist of what every piece of production infrastructure needs to do. And some of these things I think most people are aware of. Obviously, your code needs to be able to configure the system, deploy some kind of virtual hardware for it, and then run it. Everyone's aware of that. But what I found is that part of the reason people struggle with going into production is the things they don't know that they don't know.

So they go and they deploy, let's say, Elasticsearch, and they run it on some server, and think, "Great, I've got it running", and they put it in production and everyone assumes that they are good to go. And then three weeks later you have a gigantic outage and you lose all your data and you realize, "Oh, I forgot to set up data backups." OK, cool. So you learn that lesson the hard way. You spend a few more weeks working on a data backup system. And then you realize, "My company processes personally identifiable info for patients. Turns out I need to encrypt everything over the wire." So now you're going to go and spend weeks more figuring out how to do TLS and encrypting disks and so on and so forth.

And it's a long list of things like that. There is a lot to do and we've created a checklist for ourselves. I'll be sharing that checklist with everyone. And I think it makes it fairly clear why infrastructure takes so long. That's the other really painful thing about this DevOps space: it just seems to take so much longer than you expect. You spend a few weeks building your app, you're super excited to release it. You think it's going to take you a day or two: it's 2019, it's the cloud, how hard can it be? And it takes companies six months or even 12 months to really go to production, even in the cloud. I think people are taken aback by that, and I think having a checklist really helps you understand why that is the case.
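To make the "things you don't know that you don't know" concrete, here is a small Python sketch of a checklist check. It is an assumption-laden illustration, not Gruntwork's actual checklist: the item names are paraphrased from the handful of examples mentioned in this interview (the real list is described as much longer), and `missing_items` is a hypothetical helper.

```python
# Items paraphrased from the examples in this interview only; the real
# production-grade checklist is longer and is not reproduced here.
CHECKLIST = (
    "configure the system",
    "deploy virtual hardware",
    "run the system",
    "data backups",
    "encryption over the wire (TLS)",
    "disk encryption",
)


def missing_items(done):
    """Return checklist items not yet covered, in checklist order."""
    return [item for item in CHECKLIST if item not in done]


# The "Great, I've got it running" stage from the Elasticsearch story:
# configured, deployed, and running, but nothing else.
just_running = {
    "configure the system",
    "deploy virtual hardware",
    "run the system",
}

print(missing_items(just_running))
# prints ['data backups', 'encryption over the wire (TLS)', 'disk encryption']
```

Each missing item in that output corresponds to one of the multi-week detours described in the story, which is one way to read why going to production takes so much longer than expected.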

Speaker: Yevgeniy Brikman

Co-founder @gruntwork_io

Yevgeniy (Jim) Brikman is the co-founder of Gruntwork, a company that provides DevOps as a Service. He's also the author of two books published by O'Reilly Media: Hello, Startup and Terraform: Up & Running. Previously, he worked as a software engineer at LinkedIn, TripAdvisor, Cisco Systems, and Thomson Financial, and got his BS and Masters at Cornell University.
