
Presentation: Lessons From 300k+ Lines of Infrastructure Code

Track: Operationalizing Microservices: Design, Deliver, Operate

Location: Fleming, 3rd flr.

Duration: 2:55pm - 3:45pm

Day of week: Tuesday


What You’ll Learn

  • Learn how to write infrastructure code that you can rely on in production.
  • Get access to Gruntwork’s production-grade infrastructure checklist, a detailed list of what it takes to go to production.
  • Find out how to write, test, maintain, and release infrastructure code at scale.


This talk is a concise masterclass on how to write infrastructure code. I’ll share key lessons from the “Infrastructure Cookbook” we developed at Gruntwork while creating and maintaining a library of over 300,000 lines of infrastructure code that’s used in production by hundreds of companies. Come and hear our war stories, laugh about all the mistakes we’ve made along the way, and learn what Terraform, Packer, Docker, and Go look like in the wild. Topics include how to design infrastructure APIs, automated tests for infrastructure code, patterns for reuse and composition, refactoring, namespacing, versioning, CI / CD for infrastructure code, and more.


What's the focus of Gruntwork?


What we're trying to do is make it easier to build software. We think software is pretty important. There are a few technologies in human history that have completely changed everything: the wheel, fire, and software. But we also believe that building software is ridiculously hard, completely unreasonably hard, given how important it is. We need to make it more accessible to more people. There are a lot of reasons it's hard to build software. We're focused on the DevOps space, which seems to be especially painful. Even people who are professional programmers seem to struggle enormously with figuring out how to actually run and manage their software. At the vast majority of companies, there are thousands of developers sitting around all doing the same thing. They're all trying to figure out how to deploy MongoDB, or Kafka, or Zookeeper, and they're all doing it basically the same way, or they could be doing it the same way, but for some reason every company is doing it from scratch, and it's a huge waste of time. So we've built up a library of infrastructure code that's used in production by hundreds of companies and that has solutions to all the common infrastructure problems out there, including how to run a cluster of servers, how to run a database, how to configure CI/CD, monitoring, alerting, all the things that every company needs. We've done the undifferentiated heavy lifting for you so you can just focus on your actual products rather than doing a bunch of grunt work.


What are we talking about, Ansible scripts that people can deploy and build?


It's a combination of a few technologies. It's all infrastructure-as-code tools. We use a lot of Terraform, a good amount of code in Go for things that we want to run on multiple operating systems, some Python, and plenty of Bash scripts because that's the state of the art in 2019. And a lot of the code is built for running in the cloud: AWS, GCP, and Azure.


What are you talking about that your library offers?


The library consists of about 50 Git repositories, each one focused on one specific problem. One repo is for setting up your network topology, another for a Docker cluster, another for a database, etc., and inside each repo are a bunch of standalone, reusable modules. Part of what I'll be discussing in my talk is how you should organize infrastructure code. Our library consists of a bunch of these little reusable modules, and the idea is we want you to be able to combine them like Lego building blocks.

For example, say you have a module that can run an ELK (Elasticsearch, Logstash, Kibana) cluster. There might be one module that runs Kibana, a separate one that configures the firewall settings for Kibana, and another one that configures the permissions it needs to talk to a cloud system, and you can basically combine those in different ways. So maybe you'll run Elasticsearch and Kibana on separate clusters, or maybe Elasticsearch master nodes and data nodes on separate clusters, or in a dev environment you might run them all on one cluster or even a single server. But the point is we have these reusable building blocks that you can combine in a bunch of different ways. Plus there's a lot of example code that shows you these different permutations.

And so you can assemble your infrastructure out of these pieces, out of these modules. All of them are code. So it's not like we're shipping a platform as a service—it's not Heroku, it's code. And we really think that the only way that companies at scale can be effective is to use code, because you can combine code anywhere you want, you can handle all the different use cases. The only way you can do that is if you have access to the code. That's what we're providing.
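The building-block idea described above can be illustrated with a minimal Python sketch. It is not Gruntwork's code: the `Module` and `Cluster` types and the module names are hypothetical, chosen only to mirror the ELK example, and real modules would be Terraform, Go, or Bash rather than Python objects.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Module:
    """A small, standalone unit of infrastructure (hypothetical model)."""
    name: str


@dataclass
class Cluster:
    """A group of servers assembled from reusable modules (hypothetical model)."""
    name: str
    modules: list[Module] = field(default_factory=list)


# Hypothetical building blocks, named after the ELK example in the interview.
ELASTICSEARCH = Module("elasticsearch")
KIBANA = Module("kibana")
KIBANA_FIREWALL = Module("kibana-firewall-rules")
CLOUD_PERMISSIONS = Module("cloud-permissions")


def dev_environment() -> list[Cluster]:
    """Dev: run everything on a single cluster."""
    return [
        Cluster("elk-dev", [ELASTICSEARCH, KIBANA, KIBANA_FIREWALL, CLOUD_PERMISSIONS]),
    ]


def prod_environment() -> list[Cluster]:
    """Prod: run Elasticsearch and Kibana on separate clusters."""
    return [
        Cluster("elasticsearch-prod", [ELASTICSEARCH, CLOUD_PERMISSIONS]),
        Cluster("kibana-prod", [KIBANA, KIBANA_FIREWALL, CLOUD_PERMISSIONS]),
    ]


def describe(clusters: list[Cluster]) -> dict[str, list[str]]:
    """Map each cluster name to the names of the modules it is built from."""
    return {c.name: [m.name for m in c.modules] for c in clusters}


if __name__ == "__main__":
    print(describe(dev_environment()))
    print(describe(prod_environment()))
```

The same four modules appear in both environments; only the way they are combined changes, which is the point of keeping each module small and standalone.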


Regarding your talk, you're going to go through some of the lessons you've learned over the years putting this together?


Yeah, the library that we've put together, at last count, is around 300,000 lines of code. We spent quite a few years working on this stuff. Along the way we got an awful lot of things wrong, a lot of really fun mistakes. The goal of the talk is to save people a lot of pain. I think in a 40-minute talk you're going to be able to pick up several years' worth of experience really quickly. And the real goal, the thing that I'm trying to get across, is what it takes to build infrastructure code that you can use in production. Not infrastructure code that's great for a five-minute demo, not infrastructure code that looks really pretty, but infrastructure code that you can actually bet your company on. You're going to run your database. You're going to run your search index. You're going to run the most core things that your company does if you're a software company. You're going to put them onto this infrastructure code. What does it take to do that? What are the things you need to actually keep in mind? There are a lot of amazing tools out there these days, but if you don't use them correctly, they're not going to help you that much, and there are far more ways to use them incorrectly than correctly. So that's the goal of the talk, to help you figure out how to use these infrastructure-as-code tools to go to production successfully.


Can you give me an example of one of the lessons you might talk about that a lot of people get wrong?


There are a few things that I go over. One of the real highlights is what do you need to do to go to production? I'm going to share a checklist of what every piece of production infrastructure needs to do. And some of these things I think most people are aware of. Obviously, your code needs to be able to configure the system, deploy some kind of virtual hardware for it, and then run it. Everyone's aware of that. But what I've found is that part of the reason people struggle with going into production is the things they don't know that they don't know.

So they go and they deploy, let's say, Elasticsearch, and they run it on some server, and think, "Great, I've got it running," and they put it in production and everyone assumes that they are good to go. And then three weeks later you have a gigantic outage and you lose all your data and you realize, "Oh, I forgot to set up data backups." OK, cool. So you learn that lesson the hard way. You spend a few more weeks working on a data backup system. And then you realize, "My company processes personally identifiable info for patients. Turns out I need to encrypt everything over the wire." So now you're going to go and spend weeks more figuring out how to do TLS and encrypting disks and so on and so forth.

And it's a long list of things like that. There is a lot to do and we've created a checklist for ourselves. I'll be sharing that checklist with everyone. And I think it makes it fairly clear why infrastructure takes so long. That's the other really painful thing about this DevOps space: it just seems to take so much longer than you expect. You spend a few weeks building your app, you're super excited to release it. You think it's going to take you a day or two. It's 2019, it's the cloud, how hard can it be? And it takes companies six months or even 12 months to really go to production, even in the cloud. I think people are taken aback by that, and I think having a checklist really helps you understand why that is the case.
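As a rough illustration of how such a checklist surfaces the "unknown unknowns", here is a minimal Python sketch. The items are only the ones mentioned in this interview, not the full Gruntwork checklist, and the names `CHECKLIST` and `missing_items` are hypothetical.

```python
from __future__ import annotations

# Items taken from the examples in the interview only; the real
# production-grade checklist presented in the talk is longer.
CHECKLIST = (
    "configure the system",
    "provision virtual hardware",
    "run the service",
    "data backups",
    "encryption in transit (TLS)",
    "encryption at rest (disks)",
)


def missing_items(done: set[str]) -> list[str]:
    """Return the checklist items not yet covered, in checklist order."""
    return [item for item in CHECKLIST if item not in done]


if __name__ == "__main__":
    # The Elasticsearch story: it is configured, provisioned, and running,
    # so everyone assumes it is ready for production.
    elasticsearch_done = {
        "configure the system",
        "provision virtual hardware",
        "run the service",
    }
    for item in missing_items(elasticsearch_done):
        print("still to do:", item)
```

Run against the Elasticsearch story above, the sketch lists backups and both kinds of encryption as outstanding before any outage or compliance surprise reveals them.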

Speaker: Yevgeniy Brikman

Co-founder @gruntwork_io

Yevgeniy (Jim) Brikman is the co-founder of Gruntwork, a company that provides DevOps as a Service. He's also the author of two books published by O'Reilly Media: Hello, Startup and Terraform: Up & Running. Previously, he worked as a software engineer at LinkedIn, TripAdvisor, Cisco Systems, and Thomson Financial, and got his BS and Masters at Cornell University.


