The DevOps Toolkit: Catalog, Patterns, And Blueprints
Viktor Farcic and Darin Pope

This book is for sale at http://leanpub.com/the-devops-toolkit-catalog
This version was published on 2020-12-30
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and many iterations to get reader feedback, pivot until you have the right book and build traction once you do. © 2020 Viktor Farcic and Darin Pope
Contents

Introduction
  I Need Your Help
  Who Are We?
  About The Requirements
  Off We Go

Infrastructure as Code (IaC)
  Going Back In Time
  Back To Present
  Using Terraform To Manage Infrastructure As Code (IaC)
  What Are We Going To Do?

Creating And Managing Google Kubernetes Engine (GKE) Clusters With Terraform
  Preparing For The Exercises
  Exploring Terraform Variables
  Creating The Credentials
  Defining The Provider
  Storing The State In A Remote Backend
  Creating The Control Plane
  Exploring Terraform Outputs
  Creating Worker Nodes
  Upgrading The Cluster
  Reorganizing The Definitions
  Destroying The Resources

Creating And Managing AWS Elastic Kubernetes Service (EKS) Clusters With Terraform
  Preparing For The Exercises
  Exploring Terraform Variables
  Creating The Credentials
  Storing The State In A Remote Backend
  Creating The Control Plane
  Exploring Terraform Outputs
  Creating Worker Nodes
  Upgrading The Cluster
  Reorganizing The Definitions
  Destroying The Resources

Creating And Managing Azure Kubernetes Service (AKS) Clusters With Terraform
  Preparing For The Exercises
  Exploring Terraform Variables
  Creating The Credentials
  Storing The State In A Remote Backend
  Creating The Control Plane
  Exploring Terraform Outputs
  Creating Worker Nodes
  Upgrading The Cluster
  Dealing With A Bug That Prevents Upgrade Of Node Pools
  Reorganizing The Definitions
  Destroying The Resources

Packaging, Deploying, And Managing Applications

Using Helm As A Package Manager For Kubernetes
  Defining A Scenario
  Preparing For The Exercises
  Creating Helm Charts
  Adding Application Dependencies
  Deploying Applications To Production
  Deploying Applications To Development And Preview Environments
  Deploying Applications To Permanent Non-Production Environments
  Packaging And Deploying Releases
  Rolling Back Releases
  What Did We Do Wrong?
  Destroying The Resources

Using Helm As A Package Manager For Kubernetes
  Defining A Scenario
  Preparing For The Exercises
  Creating Helm Charts
  Adding Application Dependencies
  Deploying Applications To Production
  Deploying Applications To Development And Preview Environments
  Deploying Applications To Permanent Non-Production Environments
  Packaging And Deploying Releases
  Rolling Back Releases
  What Did We Do Wrong?
  Destroying The Resources

Setting Up A Local Development Environment
  Which Operating System Is The Best For Laptops?
  Installing Windows Subsystem For Linux (WSL)
  Choosing A Shell
  A Short Intermezzo
  Choosing An IDE And A Terminal
  Using Oh My Zsh To Configure Z Shell
  Going For A Test Drive With Oh My Zsh
  What Should We Do Next?
  There Is More

Exploring Serverless Computing

Using Managed Functions As A Service (FaaS)
  Deploying Google Cloud Functions (GCF)
  Deploying Azure Functions (AF)
  Deploying AWS Lambda
  To FaaS Or NOT To FaaS?
  Choosing The Best Managed FaaS Provider
  Personal Thoughts About Managed FaaS

Using Managed Containers As A Service (CaaS)
  Discussing The “Real” Expectations
  Deploying Applications To Google Cloud Run
  Deploying Applications To Amazon Elastic Container Service (ECS) With Fargate
  Deploying Applications To Azure Container Instances
  To CaaS Or NOT To CaaS?
  Personal Thoughts About Managed CaaS

Using Self-Managed Containers As A Service (CaaS)
  Using Knative To Deploy And Manage Serverless Workloads
  Self-Managed Vs. Managed CaaS

There Is More About Serverless

Using Centralized Logging
  About Vadim
  Why Not Using The ELK Stack?
  Using Loki To Store And Query Logs
  Destroying The Resources

Deploying Applications Using GitOps Principles
  Discussing Deployments And Environments
  Off We Go

Applying GitOps Principles Using Argo CD
  Installing And Configuring Argo CD
  Deploying An Application With Argo CD
  Defining Whole Environments
  Creating An Environment As An Application Of Applications
  Updating Applications Through GitOps Principles
  Destroying The Resources

There Is More About GitOps

Applying Progressive Delivery

Using Argo Rollouts To Deploy Applications
  Installing And Configuring Argo Rollouts
  Exploring Argo Rollouts Definitions
  Deploying The First Release
  Deploying New Releases Using The Canary Strategy
  Rolling Back New Releases
  Exploring Prometheus Metrics And Writing Rollout Queries
  Exploring Automated Analysis
  Deploying Releases With Fully Automated Steps
  What Happens Now?

This Is NOT The End
Introduction

Unlike my other books, where I typically dive into a single tool or a single process, this time I chose a different approach. Instead of going to great lengths to help someone become proficient in one thing, I am trying to give you a quick introduction to many different tools and processes. We will skip the potentially lengthy discussions and in-depth exercises. What I want, this time, is to help you make decisions. Which tool works best for a given task? What should we explore in more depth, and what is a waste of time?

The goal is not to learn everything about a tool in detail but rather to dive into many concepts and a plethora of tools right away. The aim is to get you up to speed fast while producing useful “real world” results. Think of each chapter as a crash course in something, with an outcome that you can use right away.

I will assume that you don’t have time to read hundreds of pages to learn something that you are not even sure is useful. Instead, I will guess that you have up to an hour to read a summary and then decide whether a tool is worth a more significant investment.

This is a catalog of the tools and the processes I believe are useful in this day and age. I will try to convey what I think works well and what might have been the right choice in the past but is not optimal anymore.

Nevertheless, even if the scope of this book is different from the others, some things are still the same. This is not a book with lots of theory. Sure, there will be some text you might need to read, but most of the content consists of hands-on exercises. I have always believed that the best way to learn something is through practice, and I am not giving up on that. This is a book full of real-world hands-on examples, and each chapter will let you dive into a different tool or process. At the end of each, you will be able to say, “now I know what this is about, and now I can decide whether it is a worthwhile investment.”

Think of this book as a catalog, combined with patterns and blueprints.
I Need Your Help

I will do my best to accommodate different needs. For example, if we need a Kubernetes cluster, you will find examples for at least a few flavors (e.g., GKE, EKS, AKS, Minikube, and Docker Desktop). If we need a cloud provider, there will be examples for at least the three major ones (AWS, GCP, and Azure). And so on and so forth. Nevertheless, there is always a limit to how many variations I can include. Yet, if you feel that the one you are using is not represented, I will gladly add it (if I can). You just need to let me know.

While we are on the subject of including stuff, I prefer to drive this material by your needs. I want to hear back from you. What is the tool you would like me to explore? What did I miss? What are
you interested in? Please let me know and, if that is something I feel confident working with, I will do my best to extend the material. This book will grow based on your feedback. The critical thing to note is that I want to keep it alive and to keep adding tools. But I will do that only if you tell me to. So, it’s up to you to let me know what to add.

To do that, you will need a way to contact me. So, here it goes. Please join the DevOps20¹ Slack workspace and post your thoughts, ask questions, or participate in a discussion. If you prefer a more one-on-one conversation, you can use Slack to send me a private message or send an email to [email protected]. All the books I have written are very dear to me, and I want you to have a good experience reading them. Part of that experience is the option to reach out to me. Don’t be shy.

If none of those communication channels work for you, just Google my name, and you’ll find others. Honestly, any way you prefer to reach me is OK. You can even send a carrier pigeon.
Who Are We?

Before we dive further, let me introduce you to the team, which consists of me, Viktor, and another guy whom I will introduce later.
Who Is Viktor?

Let’s start with me. My name is Viktor Farcic. I currently work at Codefresh². However, things are changing and, by the time you are reading this, I might be working somewhere else. Change is constant, and one can never know what the future brings. At the time of this writing, I am a Principal DevOps Architect.

What else can I say about myself? I am a member of the Google Developer Experts (GDE)³, Continuous Delivery Foundation Ambassadors⁴, and Docker Captains⁵ groups. You can probably guess from those that I am focused on containers, Cloud, Kubernetes, and quite a few other things.

I’m a published author. I wrote quite a few books under the umbrella of The DevOps Toolkit Series⁶. I also wrote DevOps Paradox⁷ and Test-Driven Java Development⁸. Besides those, there are a few Udemy courses⁹.

I am very passionate about DevOps, Kubernetes, microservices, continuous integration, continuous delivery, and quite a few other topics. I like coding in Go.

¹http://slack.devops20toolkit.com/
²https://codefresh.io/
³https://developers.google.com/community/experts
⁴https://cd.foundation/ambassador-program-overview-application/community-ambassador-cohort20/
⁵https://www.docker.com/community/captains
⁶https://www.devopstoolkitseries.com/
⁷https://amzn.to/2myrYYA
⁸http://www.amazon.com/Test-Driven-Java-Development-Viktor-Farcic-ebook/dp/B00YSIM3SC
⁹https://www.udemy.com/user/viktor-farcic/
I speak at a lot of conferences and community gatherings, and I do a lot of workshops. I have a blog, TechnologyConversations.com¹⁰, where I keep my random thoughts, and I co-host a podcast, DevOps Paradox¹¹.

What really matters is that I’m curious about technology, and I often change what I do. A significant portion of my time is spent helping others (individuals, communities, or companies).

Now, let me introduce the second person who was involved in the creation of this book. His name is Darin, and I will let him introduce himself.
Who Is Darin?

My name is Darin Pope. I’m currently working at CloudBees¹² as a professional services consultant. Along with Viktor, I’m the co-host of DevOps Paradox¹³. Whether it is figuring out the latest changes with Kubernetes or creating more content to share with our listeners and viewers, I’m always learning. Always have, always will.
About The Requirements

You will find the requirements near the beginning of each chapter. They vary from one subject to another, and the only constants are a laptop, Git, and a Bash terminal.

I’m sure that you already have Git¹⁴. If you don’t, you and I are not living in the same century. I would not even mention it if not for GitBash. If you are using Windows, please make sure that you have GitBash (part of the Git setup) and run all the commands from it. Other shells might work as well. Nevertheless, I tested all the commands on Windows with GitBash, so that is your safest bet. If, on the other hand, you are a macOS or Linux user, just fire up your favorite terminal.

Every chapter starts from scratch and ends with the destruction of everything we created. That way, you should be able to go through any chapter independently of the others. Each is self-sufficient. That should allow you to skip the parts you’re not interested in or to revisit others when you need to refresh your memory. An additional benefit of such destruction is that, if you choose to run things in the cloud, you will not waste money when not working on the exercises.
Off We Go

Typically, publishers would tell authors to write an introduction at the end, when all the chapters are written. That way, the author can summarize to the reader what to expect and can give an impression of being in control. However, I don’t have a publisher, so I can ignore such advice. I have no idea what I will write about, and I am not in control. All I know is that I want to transmit the knowledge I have, as well as to use this opportunity to improve myself. As such, I do not have a clue about the scope. I do not even know what the next chapter will be about. I am yet to pick the first tool I will explore. I will probably pick a few tools, and then I’ll wait for your suggestions.

Given all that, I don’t know whether you are reading this book while it is still in progress or you picked it up after it was completed. I hope it is the former. If it is, don’t pay much attention to the index. It is supposed to grow, and the direction greatly depends on whether you suggest something or not. Expect updates. I will be publishing additional chapters as soon as they are finished.

All in all, this is the beginning of an adventure. I do not know where I am going. All I know is that I hope to enjoy the journey and that you will find it useful to retrace the steps I will be taking.

¹⁰https://technologyconversations.com/
¹¹https://www.devopsparadox.com/
¹²https://www.cloudbees.com/
¹³https://www.devopsparadox.com/
¹⁴https://git-scm.com/
Infrastructure as Code (IaC)

You might have already used one of the tools to manage your infrastructure as code. That might have been Terraform, or something else. Or maybe you didn’t. If that’s the case, you might be wondering if there is anything wrong with creating your cluster through a browser-based console provided by your hosting vendor. The short answer is “yes”. It’s very wrong.

Clicking buttons and filling in some forms in a browser is a terrible idea. Such actions result in undocumented and unreproducible processes. You surely want to document what you did so that you can refer to that later. You probably want your colleagues to know what you did so that they can collaborate. Finally, you probably want to be fast. Ad hoc actions in Web-based consoles do not provide any of those things. You’d need to write Wiki pages to document the steps. If you do that, you’ll quickly find out that it is easier to write something like “execute aws ...” than to write pages filled with “go to this page”, “fill in this field with that value”, “click that button”, and similar tedious entries that are often accompanied by screenshots.

We want to define the instructions on how to create and manage infrastructure as code. We want them to be executable, stored in Git, and, potentially, executed whenever we push a change.

If a Web UI is not the right place to manage infrastructure, how about commands? We can surely do everything we need to do with a CLI. We can handle everything related to GCP with gcloud. We could use aws for the tasks in AWS, and Azure is fully covered by the az CLI. While those options are better than the click-click-click type of operations, they are not good options either.

Using a CLI might seem like a good idea at first. We can, for example, create a fully operational GKE cluster by executing gcloud container clusters create. However, that is not idempotent. The next time we want to upgrade it, the command will be different. On top of that, CLIs do not tend to provide dependency management, so we need to make sure that we execute the commands in the right order. They also do not have state management, so we cannot easily know what is done and what is not. They cannot show us which changes will be applied before we execute a command. The list of things that CLI commands often do not do is vast.

Now, if your only choices are click-click-click through a UI and CLI commands, choose the latter. But those are not the only options.
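To make the point about idempotency a bit more tangible, here is a minimal sketch in Python. It is not how gcloud or any other real CLI is implemented. The in-memory `cloud` dictionary and the functions `create_cluster` and `ensure_cluster` are hypothetical names invented for this illustration. The first function behaves like an imperative "create" command that only works once, while the second converges towards whatever we ask for, no matter how many times we run it.

```python
class AlreadyExists(Exception):
    """Raised when an imperative create targets a resource that already exists."""


def create_cluster(cloud: dict, name: str, version: str) -> None:
    """Imperative, CLI-style create (hypothetical): it only works the first time."""
    if name in cloud:
        raise AlreadyExists(f"cluster {name!r} already exists")
    cloud[name] = {"version": version}


def ensure_cluster(cloud: dict, name: str, version: str) -> str:
    """Idempotent alternative (hypothetical): converge to the requested state."""
    current = cloud.get(name)
    if current is None:
        cloud[name] = {"version": version}
        return "created"
    if current["version"] != version:
        current["version"] = version
        return "upgraded"
    return "unchanged"


if __name__ == "__main__":
    cloud = {}  # in-memory stand-in for a hosting provider
    create_cluster(cloud, "demo", "1.17")
    try:
        # Repeating the same kind of command cannot be used to upgrade.
        create_cluster(cloud, "demo", "1.18")
    except AlreadyExists as error:
        print(f"imperative: {error}")

    # The same call is safe to repeat, whatever the current state is.
    for version in ("1.17", "1.18", "1.18"):
        print(f"declarative: {ensure_cluster(cloud, 'demo', version)}")
```

With the imperative function, we would need a different command (and our own knowledge of the current state) for every situation. With the idempotent one, we only ever state what we want.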
Going Back In Time

If you’ve been in this industry for a while, you probably know that it was common to just create infrastructure. One of the sysadmins would get a server and install and configure everything. Once that was done, others would be able to use it. That process could take anything from hours to days or even months.

Those were horrible times. All our servers were different, even though we craved uniformity. Over time, we would end up in such a state that no one knew for sure what was on each of the servers. Everything took so long that
we’d order more than we needed, just to avoid a long waiting time when our needs increased. It was a mess, and it was painful.

Over time, people realized that it is a horrible idea to SSH into a server and “fix it”. We had undocumented and unreliable infrastructure, so we thought that it would be a good idea to document everything. That created even more significant overhead while not improving much. The only tangible effect was an increase in effort and even less reliable infrastructure. We expected the documentation to be accurate while, in reality, it only provided a false sense of confidence. The documentation would become inaccurate in a matter of days, or even hours. It was enough for one person to do something somewhere without updating the Wiki pages, and the whole idea would turn into a miserable failure.

Then we moved to scripts. Instead of reading some instructions and copying and pasting commands and configurations, we’d define (almost) everything as scripts. The instructions became a list of entries like “to do this, execute that script”. But that was a failure as well. Scripts were not idempotent. They would work only on “virgin” servers. They were suitable for installing something, but not for upgrading. So, we started creating more elaborate scripts with a bunch of if and else statements. If this does not exist, then do that. If that does exist, and if it looks like this, then do something else. That was painful as well, and it wasn’t much help. With time, the number of permutations would become so big that maintaining such scripts was harder than SSH-ing into a server and doing everything manually.

Over time, we adopted configuration management tools like CFEngine, Chef, Puppet, Ansible, and a few others. They allowed us to describe the desired state, and that was a game-changer. All we’d have to do is run those tools, and they would converge the actual into the desired state. For example, we could say that the desired state is to have a specific version of the Apache server. Those tools would check the actual state in each of the servers, and, depending on the differences between the two states (desired and actual), they would install, upgrade, reconfigure, or delete resources.

A bit later, virtualization took over. We moved our workloads into virtual machines, and that simplified quite a lot of things. At the same time, the adoption of configuration management tools increased. We thought that the two were a good match. We thought we had a winner, but we were wrong.

The “traditional” configuration management tools are all based on the “promise theory”, and they all tried to accomplish the same objective. Their goal was to converge the actual into the desired state. That objective is as valid today as it was in the past. However, all those tools were based on ideas that were valid when working with bare metal servers. They all tried to modify the state of servers at runtime. While that’s the best we can do when using physical servers, it goes against one of the main benefits of virtualization.

When virtual machines appeared, they provided quite a few benefits. We could split physical servers into virtual machines, and that would give us better utilization of resources, as well as the separation required by different workloads. We could decide how much memory and CPU would be given to each. There were quite a few other benefits, but the critical one was the ability to create virtual servers whenever we needed them. We could also destroy them as soon as their usefulness expired.
However, the real benefit of that was not clear to all for quite some time. The ability to easily create or destroy a VM allowed us to move towards immutable infrastructure. Instead of trying to converge the actual into the desired state by changing what’s running inside the servers, we could adopt the principles of immutability.

All VMs are based on images that many understood as a base setup. Today, we see those images as the full definition of what a server is. They are supposed to be immutable, and no one is supposed to change them in any shape or form. Whenever we need to apply a change, we can create a new image. Instead of modifying servers from the inside, we can converge the actual into the desired state by replacing the old VMs with new ones based on different images.

Conceptually, that approach is the same as the one we’re using with containers today. When we want to upgrade an application, we do not upgrade binaries inside a container. Instead, we build a new image and replace the existing containers with new ones based on that image. The same is valid for infrastructure. If, for example, we want to upgrade servers to a newer version of Kubernetes, we tend to create a new VM image, and we do rolling upgrades of the nodes. We’d shut down a node with the older version and create a new one based on a different image. That process would repeat until all the nodes are new. That allows us to have uniform and reliable servers, while also increasing the velocity of the process.
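The difference between the two approaches described in this section can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how CFEngine, Chef, Puppet, Ansible, or any cloud provider works internally. Servers are plain dictionaries, and the function names, package versions, and image names (like `k8s-1.18`) are made up for the example. The first function mutates a running server until it matches the desired state. The second never touches a node and replaces it, one at a time, with a node created from a new image.

```python
def converge_in_place(server: dict, desired: dict) -> list:
    """Mutable approach: change a running server until it matches the desired state."""
    actions = []
    for package, version in desired.items():
        installed = server.get(package)
        if installed is None:
            actions.append(f"install {package} {version}")
        elif installed != version:
            actions.append(f"upgrade {package} {installed} -> {version}")
        server[package] = version
    # Anything present on the server but absent from the desired state is removed.
    for package in sorted(set(server) - set(desired)):
        actions.append(f"delete {package}")
        del server[package]
    return actions


def rolling_replace(nodes: list, new_image: str) -> list:
    """Immutable approach: replace each outdated node with one based on a new image."""
    steps = []
    for index, node in enumerate(nodes):
        if node["image"] == new_image:
            continue  # already based on the desired image, nothing to do
        steps.append(f"shut down {node['name']} ({node['image']})")
        nodes[index] = {"name": node["name"], "image": new_image}
        steps.append(f"create {node['name']} ({new_image})")
    return steps


if __name__ == "__main__":
    server = {"apache": "2.4.41", "telnet": "0.17"}
    for action in converge_in_place(server, {"apache": "2.4.46", "curl": "7.68.0"}):
        print(action)

    nodes = [
        {"name": "node-1", "image": "k8s-1.17"},
        {"name": "node-2", "image": "k8s-1.17"},
    ]
    for step in rolling_replace(nodes, "k8s-1.18"):
        print(step)
```

Both functions converge the actual into the desired state, and running either of them a second time does nothing. The difference is where the change happens: inside a long-lived server, or in the image from which short-lived servers are created.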
Back To Present

Today, infrastructure as code (IaC) and immutability are closely tied together. IaC allows us to define everything as code, while immutability brings uniformity, reliability, and speed. Servers are defined as code in the form of configurations and scripts used to create VM images. Clusters are defined as code in the form of instructions on how to create and manage VMs based on those images, and how to tie them together through different services.

The “traditional” configuration management tools like CFEngine, Chef, Puppet, Ansible, and others do allow us to define infrastructure as code, but they are not based on the principles of immutability. That does not mean that they cannot be used to create immutable infrastructure, but rather that they are not designed for that. As such, they tend to be suboptimal at performing such processes. We can think of them as the first generation of such tools. Even though some are newer and more advanced than others, I’m putting them in the same bucket since they are all based on the same principles.

Today we have better options at our disposal. We could adopt vendor-specific tools like CloudFormation for AWS, or whatever your hosting vendor provides. But they are, in my opinion, not the right choice. They are closed and focused on a single platform.

Today’s “king of infrastructure as code” is Terraform. It is designed from the ground up as an immutable infrastructure as code solution. It is, by far, the most widely used, and almost every respectable service vendor created modules for their platforms. The breadth of its adoption can be seen from the list of currently available Terraform providers¹⁵. Everyone that matters is there. If your provider is not on that list, you should not look for a different tool, but rather change the provider.

¹⁵https://www.terraform.io/docs/providers/index.html
Infrastructure as Code (IaC)
8
On top of that, Terraform has one of the biggest and most active communities of contributors, and it is highly extensible through its plugin system. Its significance becomes even more apparent once we realize that many of the vendors abandoned their own similar tools in favor of Terraform.
Using Terraform To Manage Infrastructure As Code (IaC)

Terraform’s ability to use different providers and manage various resources is combined with templating. Its system of variables allows us to easily modify aspects of our infrastructure without making changes to definitions of the resources. On top of that, it is idempotent. When we apply a set of definitions, it converges the actual into the desired state, no matter whether that means creation, destruction, or modification of our infrastructure and services.

The feature I like the most is Terraform’s ability to output a plan. It shows us which resources will be created, which will be modified, and which will be destroyed if we choose to apply the changes in the definitions. That allows us to gain insight into “what will happen before it happens.” If we adopt GitOps principles and trigger Terraform only when we make a change to a repository, we can have a pipeline that would output the plan when we create a pull request. That way, we can easily review the changes that will be applied to our infrastructure and decide whether to merge the changes. If we do merge, another pipeline could apply those changes after being triggered through a webhook. That makes it a perfect candidate for a simple, yet very effective mix of infrastructure as code principles combined with GitOps and automated through continuous delivery tools.

Terraform stores the actual state on the local file system by default. That allows it to output plans and apply the changes to definitions by comparing the desired with the actual state. Nevertheless, storing its state locally is insecure, and it could prevent us from working as a team. Fortunately, Terraform allows us to utilize different backends where its state can be stored. Those can be a network drive, a database, or almost any other storage.

Another really nice feature is Terraform’s ability to untangle dependencies between different resources. As such, it is capable of figuring out by itself in which order resources should be created, modified, or destroyed.

Nevertheless, those and other Terraform features are not unique. Other tools have them as well. What makes Terraform truly special is the ease with which we can leverage its features, and the robustness of the platform and the ecosystem around it. It is not an accident that it is the de facto standard and popular choice of many.
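The plan-on-pull-request, apply-on-merge flow described above could be wired into a pipeline roughly as follows. This is a minimal sketch under stated assumptions, not a setup used in the exercises: the function name tf_command_for_event and the event names pull_request and merge are hypothetical, while -input=false, -out, and -auto-approve are regular Terraform CLI flags. The function only prints the command a pipeline step would run, so it can be tried without touching any infrastructure.

```shell
# Hypothetical helper: map a pipeline event to the Terraform command
# that a CI step would execute. It only prints the command.
tf_command_for_event() {
  case "$1" in
    pull_request)
      # Show what would change so that reviewers can inspect it.
      echo "terraform plan -input=false -out=tfplan"
      ;;
    merge)
      # Converge the actual into the desired state without prompting.
      echo "terraform apply -input=false -auto-approve"
      ;;
    *)
      echo "unsupported event: $1" >&2
      return 1
      ;;
  esac
}

tf_command_for_event pull_request
tf_command_for_event merge
```

Keeping the event-to-command mapping in one place makes it easy to see that a pull request never changes anything, while a merge is the only path that does.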
What Are We Going To Do?

I won’t go into an in-depth comparison between Terraform and other solutions. This book is not meant to do that. The goal is to provide you with quick know-how and allow you to make a decision whether a tool or a process is the right choice for your needs. So, we’ll dive straight into Terraform hands-on exercises. If that’s not good enough for you, contact me and let me know which other tools you’d like to explore. I’ll do my best to include them as well.

We’ll create a fully operational Kubernetes cluster defined in a way that it is “production-ready”. If this is your first contact with Terraform, you’ll have the base knowledge required to work with it and, more importantly, to make a decision whether the tool and the processes behind it are the right choice for your needs. Even if you are already familiar with Terraform, you might discover a few new things, like, for example, the proper way to manage its state.

We’ll explore Terraform through practical and “real world” examples. We’ll use it to create a Kubernetes cluster in Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure. Those are the three most commonly used hosting vendors, so there is a good chance that you’re already using one of those. Please let me know if you prefer some other platform, and I’ll do my best to include it.

It is important to note that it doesn’t matter whether you’re planning to use Kubernetes, or you’d like to use Terraform to manage something else. A Kubernetes cluster is just a means to an end. The primary goal is to dive into Terraform.

It would be great if you go through all three platforms since that would give you additional insights into each. That might clarify some doubts about the pros and cons of the “big three”. Nevertheless, that’s not mandatory. It’s OK if you choose only one (or two), and skip the rest.

If you do decide to go through all of the platforms, please note that the structure is the same and that parts of the text are similar. That might sound boring. However, the goal is not only to introduce you to all those but also to keep the same structure so that they can be easily compared.

Before we proceed, there are a few requirements. Please download the Terraform¹⁶ CLI and install it. Besides the obvious need to have the terraform CLI, you’ll also need to install and set up kubectl¹⁷.

That’s it. Let’s dive into how to define, create, manage, and destroy stuff with Terraform.

¹⁶https://www.terraform.io/downloads.html
¹⁷https://kubernetes.io/docs/tasks/tools/install-kubectl/
Creating And Managing Google Kubernetes Engine (GKE) Clusters With Terraform

Since we already had a brief overview of Terraform, we’ll skip the potentially lengthy introduction and go straight into the exercises. That will help you make a decision whether Terraform is the right choice for managing resources in Google Cloud Platform (GCP) and give you sufficient base knowledge that you’ll be able to extend on your own.

We’ll create a Google Kubernetes Engine (GKE) cluster and all the surrounding resources required for optimal usage of it, as well as for proper maintenance of the infrastructure through Terraform. We’re trying to understand how Terraform works, and what it’s suitable for. If you do use Kubernetes, you’ll end up with a reusable way to create and manage it in Google Cloud Platform. Nevertheless, that’s not the primary goal, so it doesn’t matter much whether Kubernetes is your thing or not. The main objective is to learn Terraform through practical examples.

Let’s go.
Preparing For The Exercises

All the commands from this chapter are available in the 01-01-terraform-gke.sh¹⁸ Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.

The code and the configurations that will be used in this chapter are available in the GitHub repository vfarcic/devops-catalog-code¹⁹. Let’s clone it. Feel free to skip the command that follows if you already cloned that repository.

git clone \
    https://github.com/vfarcic/devops-catalog-code.git

¹⁸https://gist.github.com/c83d74ec70b68629b691bab52f5553a6
¹⁹https://github.com/vfarcic/devops-catalog-code
The code for this chapter is located in the terraform-gke directory.

cd devops-catalog-code

git pull

cd terraform-gke
We went to the local copy of the repository. We pulled the latest revision just in case you already had the repository from before, and I changed something in the meantime. Finally, we entered the directory with the code and configurations we’ll use in this chapter.
Exploring Terraform Variables

Generally speaking, entries in Terraform definitions are split into four groups. We have provider, resource, output, and variable entries. That is not to say that there aren’t other types (there are), but that those four are the most important and most commonly used ones. For now, we’ll focus on variables, and leave the rest for later.

All the configuration files we’ll use are in the files sub-directory. We’ll pull them out one by one and explore what they’re used for.

Let’s copy the file that defines the variables we’ll use and take a quick look.

cp files/variables.tf .

cat variables.tf
The output is as follows.

variable "region" {
  type    = string
  default = "us-east1"
}

variable "project_id" {
  type    = string
  default = "devops-catalog-gke"
}

variable "cluster_name" {
  type    = string
  default = "devops-catalog"
}

variable "k8s_version" {
  type = string
}

variable "min_node_count" {
  type    = number
  default = 1
}

variable "max_node_count" {
  type    = number
  default = 3
}

variable "machine_type" {
  type    = string
  default = "e2-medium"
}

variable "preemptible" {
  type    = bool
  default = true
}
If you focus on the names of the variables, you’ll notice that they are self-explanatory. We defined the region where our cluster will run. There’s the ID of the Google Cloud Platform (GCP) project (project_id) and the name of the cluster we want to create (cluster_name). We’re setting the Kubernetes version we’d like to use (k8s_version), and we have the minimum and the maximum number of worker nodes (min_node_count and max_node_count). Finally, we defined the type of machines we’d like to use for worker nodes (machine_type) as well as whether we’d like to use cheap preemptible VMs (preemptible). If you ever created a Google Kubernetes Engine (GKE) cluster, all those should be familiar. If you haven’t, I’m sure that you could have guessed what each of those variables means.

What matters is that each of those variables has a type (e.g., string, number, bool) and a default value. The only exception is k8s_version, which does not have anything set by default. That means that we’ll have to set it to some value at runtime. The reason for that is twofold. To begin with, I wanted to show you the difference between variables with and without default values. Also, I could not be sure which Kubernetes versions will be available at the time you’re going through the exercises in this chapter. Later on, we’ll see the effect of not having a default value. For now, just remember that the variable k8s_version doesn’t have a default value.

I like to think of variables as a blueprint of what I’d like to accomplish. They define the information that I believe could change over time, and they drive the goals of the infrastructure-as-code definitions I tend to write later. For example, I want the cluster to be scalable within specific limits, so I set variables min_node_count and max_node_count. Others tend to take a different approach and refactor parts of resource definitions to use variables. They hard-code values initially and then replace those with variables later.
All in all, now we have the file that defines all the variables. Some represent the aspects of the cluster that are likely going to change over time. Others (e.g., cluster_name) are probably never going to change. I defined them as variables in an attempt to create definitions that could be useful to other teams.
Creating The Credentials

Now is the time to deal with the prerequisites. Even though the goal is to use terraform for all infrastructure-related tasks, we’ll need a few things specific to Google Cloud. To be more precise, we’ll need to create a project and a service account with sufficient permissions. But, before we do that, I need to make sure that you are registered and logged in to Google Cloud Platform, and that you have the gcloud CLI.

If this is the first time you’re using Google Cloud Platform (GCP), please open the GCP console²⁰ and follow the instructions to register. If you’re planning to use an account provided by your company, please make sure that you have the permissions to create all the resources we’ll use in the exercises that follow.

²⁰https://console.cloud.google.com

To install Google SDK (gcloud CLI), please go to the Installing Google Cloud SDK²¹ section of the documentation and follow the instructions for your operating system.

Now that you have a GCP account and gcloud on your laptop, we’ll have to log in before we start creating “stuff”.
gcloud auth application-default login
A new page should have opened in the browser asking for authentication. Follow the instructions.

Now we’re ready, and we can proceed by creating a new project. If you’re new to Google Cloud, you should know that everything is organized in projects. They provide means to group resources. Given that I could not guess whether you already have a project you’d like to use and, if you do, what its ID is, we’ll create a new one for the exercises in this and, potentially, in other chapters.

Google Cloud project IDs need to be globally unique, so we will use the current date and time to create a suffix for the project ID. Feel free to come up with a different, easier to remember ID.
export PROJECT_ID=doc-$(date +%Y%m%d%H%M%S)

gcloud projects create $PROJECT_ID
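Project IDs also have to follow Google Cloud’s format rules: 6 to 30 characters, only lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen. If you pick your own ID, a quick local check such as the sketch below can catch a typo before gcloud rejects it. The function name is_valid_project_id is made up for this illustration, and the check covers only the format. It cannot tell whether the ID is already taken.

```shell
# Hypothetical helper: check a candidate ID against the project ID format
# (6-30 characters, lowercase letters, digits, and hyphens,
# starts with a letter, does not end with a hyphen).
is_valid_project_id() {
  printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
}

# Same naming scheme as in the chapter.
CANDIDATE_ID="doc-$(date +%Y%m%d%H%M%S)"

if is_valid_project_id "$CANDIDATE_ID"; then
  echo "$CANDIDATE_ID looks like a valid project ID"
else
  echo "$CANDIDATE_ID is not a valid project ID" >&2
fi
```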
We created a new project. To be on the safe side, we’ll list all the projects we have and confirm that the newly created one is there.
gcloud projects list
The output is as follows.

PROJECT_ID         NAME               PROJECT_NUMBER
...
doc-20200611104346 doc-20200611104346 933451435002
...
You might have other projects listed in that output. They do not matter in this context. As long as the project is there, we should be able to proceed.

²¹https://cloud.google.com/sdk/install

The second prerequisite we’ll need is a service account. Through it, we’ll be able to identify as a user with sufficient permissions.
gcloud iam service-accounts \
    create devops-catalog \
    --project $PROJECT_ID \
    --display-name devops-catalog
We created a service account devops-catalog inside the newly generated project. Let’s confirm that’s indeed true.

gcloud iam service-accounts list \
    --project $PROJECT_ID
The output should be similar to the one that follows.

NAME           EMAIL             DISABLED
...
devops-catalog [email protected] False
The EMAIL column is essential. It is the unique identifier that we’ll need to generate the keys, which, in turn, are required by Terraform. We need them to authenticate as that service account.

gcloud iam service-accounts \
    keys create account.json \
    --iam-account devops-catalog@$PROJECT_ID.iam.gserviceaccount.com \
    --project $PROJECT_ID
We created the key for the service account devops-catalog and stored it in the local file account.json. If you’re curious to see what’s inside, feel free to output it with cat account.json and explore it yourself.

Since I’m paranoid by nature, we’ll confirm that the key was indeed created by listing all those in the project.

gcloud iam service-accounts \
    keys list \
    --iam-account devops-catalog@$PROJECT_ID.iam.gserviceaccount.com \
    --project $PROJECT_ID
You’ll see at least two keys. The ID of one of those should match the one in the account.json file. Having a service account and a key through which we can access it is of no use if we do not assign it sufficient permissions.
gcloud projects \
    add-iam-policy-binding $PROJECT_ID \
    --member serviceAccount:devops-catalog@$PROJECT_ID.iam.gserviceaccount.com \
    --role roles/owner
The output is as follows.

Updated IAM policy for project [devops-catalog].
bindings:
- members:
  - serviceAccount:[email protected]
  ...
  role: roles/owner
etag: BwWit5QjoSc=
version: 1
What we just did is something no one should ever do. We assigned the owner role to that service account. As you can probably guess from the name, an owner can do anything within that project. That is too permissive, and we should have fine-tuned it by assigning a role with the minimum set of permissions. Or, we could have assigned multiple less permissive roles. That service account should be allowed to create and manage everything we need, but not more. Being an owner capable of doing anything is easier, but it is also less secure. Nevertheless, we went with “easier” given that this chapter is about Terraform and not a crash-course in Google Cloud.

Finally, we are going to create one more environment variable.
export TF_VAR_project_id=$PROJECT_ID
This time, we used a “special” format for the variable name. Environment variables with names that start with TF_VAR_ are automatically converted into Terraform variables. The part of the name after that prefix is the name of the Terraform variable. So, in this case, the value of that environment variable will be used as the value of the Terraform variable project_id. As such, we will be using the ID of the newly created project instead of the default value.
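To see which Terraform variables the current shell would set, we can list the environment variables that carry the TF_VAR_ prefix and strip it. This is a plain shell sketch, not a Terraform feature. The function name list_tf_vars and the sample value are assumptions made for illustration.

```shell
# Hypothetical helper: print "name=value" for every Terraform variable
# that would be populated from a TF_VAR_-prefixed environment variable.
list_tf_vars() {
  env | grep '^TF_VAR_' | sed -e 's/^TF_VAR_//'
}

# Sample value for illustration; in the chapter it comes from $PROJECT_ID.
export TF_VAR_project_id="doc-20200611104346"

list_tf_vars
```

Values supplied this way override a variable’s default, while anything passed explicitly on the command line with -var or -var-file takes precedence over the environment.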
Defining The Provider

Everything we did so far with the service account serves only one purpose. We created it so that we can use it in Terraform’s google provider. If you don’t know what that is, and if you are not good at guessing, I’ll explain it in one sentence. It allows us to configure the credentials used to authenticate with GCP, and a few other things. If that’s not enough of an explanation, please visit the Google Cloud Platform Provider²² section of the Terraform documentation for more details.

²²https://www.terraform.io/docs/providers/google/index.html

Let’s copy the provider definition I prepared, and take a quick look.
cp files/provider.tf .

cat provider.tf
The output is as follows.

provider "google" {
  credentials = file("account.json")
  project     = var.project_id
  region      = var.region
}
We’re specifying the credentials, and the project and region where we’d like to create a cluster.

The credentials field is using the file function that will load the contents of account.json. Having the credentials hard-coded in that definition would be too much of a risk given that it is stored in Git.

The other two fields (project and region) are using variables project_id and region. If your memory serves you, and you paid attention, you know that those are defined in the variables.tf file we explored earlier. We could have hard-coded those values, but we didn’t. To begin with, we are likely going to use them in a few other places. Also, they are probably going to change, and it is not a good idea to “hunt” them throughout all the Terraform definitions we might have. Even if we decide never to change those two, someone else might take the configurations we’re defining, and use them to create a similar cluster in some other project or in a different region.

Now we are ready to apply the Terraform definitions we created so far. In Terraform terms, apply means “create what’s missing, update what’s different, and delete what’s not needed anymore.”
terraform apply
You’ll notice that Terraform asked you to enter a value for k8s_version. That’s the only variable we defined without the default value, so Terraform expects us to provide one. We’ll deal with that later. For now, and until we get to the part that deals with it, press the enter key whenever you’re asked that question. An empty value should be enough for now, since we are not yet using that variable in any of the definitions. The output, limited to the relevant parts, is as follows.
...
Error: Could not satisfy plugin requirements

Plugin reinitialization required. Please run "terraform init".
...
Error: provider.google: no suitable version installed
  version requirements: "(any version)"
  versions installed: none
Most of Terraform’s functionality is provided by plugins. More often than not, we need to figure out which plugin suits our needs and download it. In this case, we’re using the google provider. Fortunately, Terraform will do the job of downloading the plugin. We just need to initialize the project.
terraform init
The output, limited to the relevant parts, is as follows.

Initializing the backend...

Initializing provider plugins...
- Checking for available provider plugins...
- Downloading plugin for provider "google" (hashicorp/google) 3.16.0...
...
We can see that Terraform detected that we want to use the plugin for provider "google" and downloaded it. Now we should be able to apply that definition.
terraform apply
Remember to continue answering with the enter key whenever you’re asked to provide the value for k8s_version.
The output is as follows.
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
The output confirmed that the apply is complete, and we can see that it did not add, change, or destroy anything. That was to be expected. So far, we did not specify that we want to have any GCP resources. We defined some variables, and we specified that we’d like to use the google provider.
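The version requirements: "(any version)" line in the earlier error message means that the definitions accept whichever release of the google plugin happens to be the latest when terraform init runs. One way to make that repeatable is to state a version constraint explicitly. The sketch below is an illustration rather than a part of the exercises. It assumes Terraform 0.13 or newer (the source/version form of required_providers is not the syntax of the 0.12 release shown in the outputs of this chapter), the file name versions.tf is only a common convention, and the file is written to a scratch directory so that it does not touch the files we are working with.

```shell
# Write the sketch into a scratch directory so the chapter's files stay untouched.
SCRATCH_DIR="$(mktemp -d)"

# Assumption: Terraform 0.13 or newer, which understands the
# source/version form of required_providers.
cat > "$SCRATCH_DIR/versions.tf" <<'EOF'
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 3.16"
    }
  }
}
EOF

cat "$SCRATCH_DIR/versions.tf"
```

The ~> 3.16 constraint accepts 3.16 and any later 3.x release, but not 4.0, so terraform init would keep picking compatible versions of the plugin.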
Storing The State In A Remote Backend

Terraform maintains its internal information about the current state. That allows it to deduce what needs to be done and to converge the actual into the desired state defined in *.tf files. Currently, that state is stored locally in the terraform.tfstate file. For now, you shouldn’t see anything exciting in it.
cat terraform.tfstate
The output is as follows.

{
  "version": 4,
  "terraform_version": "0.12.12",
  "serial": 1,
  "lineage": "c818a17f-...",
  "outputs": {},
  "resources": []
}
The field that really matters is resources. It is empty because we did not define any. We will, soon, but it’s likely not going to be something you expect. We are not going to create anything related to our GKE cluster. At least not right away. What we need right now is a storage bucket.

Keeping Terraform’s state locally is a bad idea. If it’s on a laptop, we won’t be able to allow others to modify the state of our resources. We’d need to send them the terraform.tfstate file by email, keep it on some network drive, or something similar. That is impractical. We might be tempted to store it in Git, but that would not be secure. Instead, we’ll tell Terraform to keep the state in a Google bucket.

Since we’re trying to define infrastructure as code, we won’t do that by executing a shell command, nor will we go to the GCP console. We’ll tell Terraform to create the bucket. It will be the first resource managed by it.

But, before we proceed, I want you to confirm that billing for storage is enabled. Otherwise, Google would not allow us to create one.

If you are a Windows user, the open command might not work. If that’s the case, please copy the address from the command that follows, and open it in your favorite browser.

open "https://console.cloud.google.com/storage/browser?project=$PROJECT_ID"
If you do not see the ENABLE BILLING button, you’re okay, and we can move on. If it is there, billing is not enabled. If that’s the case, click it and follow the instructions.

We are about to explore the google_storage_bucket resource. It allows us to manage Google Cloud storage buckets, and you should be able to find more information in the google_storage_bucket documentation²³.

Let’s copy the file I prepared, and take a look at the definition.
cp files/storage.tf .

cat storage.tf
The output is as follows.

resource "google_storage_bucket" "state" {
  name          = var.state_bucket
  location      = var.region
  project       = var.project_id
  storage_class = "NEARLINE"
  labels = {
    environment = "development"
    created-by  = "terraform"
    owner       = "vfarcic"
  }
}
We’re defining a storage bucket referenced as state. All resource entries are followed by a type (e.g., google_storage_bucket) and a reference (e.g., state). We’ll see the usage of a reference later in one of the upcoming definitions.

Just as with the provider, the resource has several fields. Some are mandatory, while others are optional and often have predefined values. We’re defining the name and the location. Further on, we specified that it should be created inside our project. Finally, we selected NEARLINE as the storage class. Please visit the Available storage classes²⁴ section of the documentation to see the full list.

Just as before, the values of some of those fields are defined as variables. Others (those less likely to change) are hard-coded.

The labels are there to provide our team members or other people in the organization with metadata about the resource. If we’ve forgotten about it for months, it’s easy to tell who to bother. We also state that we manage it with Terraform, hopefully preventing people from making manual changes through the UI.

²³https://www.terraform.io/docs/providers/google/r/storage_bucket.html
²⁴https://cloud.google.com/storage/docs/storage-classes#available_storage_classes

There is a tiny problem we need to fix before we proceed. Google Cloud bucket names need to be globally unique. There cannot be two buckets with the same name, anywhere in Google Cloud, among all its users. We will generate a unique name using date.
export TF_VAR_state_bucket=doc-$(date +%Y%m%d%H%M%S)
The name of the bucket is now, more or less, unique, and we will not be in danger that someone else already claimed it. The environment variable we created will be used as the name of the bucket.

Let’s apply the new definition.
terraform apply
The output, limited to the relevant parts, is as follows.

...
Terraform will perform the following actions:

  # google_storage_bucket.state will be created
  + resource "google_storage_bucket" "state" {
      + bucket_policy_only = (known after apply)
      + force_destroy      = false
      + id                 = (known after apply)
      + location           = "US-EAST1"
      + name               = "devops-catalog"
      + project            = "devops-catalog"
      + self_link          = (known after apply)
      + storage_class      = "NEARLINE"
      + url                = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
In this case, we can see the full list of all the resources that will be created. The + sign indicates that something will be created. Under different conditions, we could also observe those that would be modified (~) or destroyed (-). Right now, Terraform deduced that the actual state is missing the google_storage_bucket resource. It also shows us which properties will be used to create that resource. Some were defined by us, while others will be known after we apply that definition.

Finally, we are asked whether we want to perform these actions. Be brave and type yes, followed by the enter key.

From now on, I will assume that there’s no need for me to tell you to confirm that you want to perform some Terraform actions. Answer with yes whenever you’re asked.
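When a plan is produced by a pipeline rather than read by a person, the same information is available through the exit code. Running terraform plan with the -detailed-exitcode flag exits with 0 when there is nothing to change, 1 when an error occurred, and 2 when the plan contains changes. The sketch below shows one way to act on those codes. The function name describe_plan_exit_code is hypothetical, and the commented-out lines show where the real command would go.

```shell
# Hypothetical helper: translate the exit code of
# `terraform plan -detailed-exitcode` into a human-readable message.
describe_plan_exit_code() {
  case "$1" in
    0) echo "no changes" ;;
    1) echo "plan failed" ;;
    2) echo "changes pending" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# In a real pipeline, the code would come from Terraform itself:
#   terraform plan -input=false -detailed-exitcode
#   describe_plan_exit_code "$?"
describe_plan_exit_code 2
```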
After we choose to proceed, the relevant parts of the output should be as follows.

...
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
We can see that 1 resource was added and that nothing was changed or destroyed.

Since this is the first time we created a resource with Terraform, it would be reasonable to be skeptical. So, we’ll confirm that the bucket was indeed created by listing all those available in the project. Over time, we’ll gain confidence in Terraform and will not have to validate that everything works correctly.
gsutil ls -p $PROJECT_ID
Let’s imagine that someone else executed terraform apply and that we are not sure what the state of the resources is. In such a situation, we can consult Terraform by asking it to show us the state.
terraform show
The output is as follows.

# google_storage_bucket.state:
resource "google_storage_bucket" "state" {
    bucket_policy_only       = false
    default_event_based_hold = false
    force_destroy            = false
    id                       = "devops-catalog"
    location                 = "US-EAST1"
    name                     = "devops-catalog"
    project                  = "devops-catalog"
    requester_pays           = false
    self_link                = "https://www.googleapis.com/storage/v1/b/devops-catalog"
    storage_class            = "NEARLINE"
    url                      = "gs://devops-catalog"
}
As you can see, there’s not much to look at. For now, we have only one resource (google_storage_bucket). As we keep progressing, that output will keep growing and, more importantly, it will always reflect the state of the resources managed by Terraform.

The previous output is a human-readable format of the state currently stored in terraform.tfstate. We can inspect that file as well.
cat terraform.tfstate
The output is as follows.

{
  "version": 4,
  "terraform_version": "0.12.12",
  "serial": 3,
  "lineage": "930b4f49-...",
  "outputs": {},
  "resources": [
    {
      "mode": "managed",
      "type": "google_storage_bucket",
      "name": "state",
      "provider": "provider.google",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "bucket_policy_only": false,
            "cors": [],
            "default_event_based_hold": false,
            "encryption": [],
            "force_destroy": false,
            "id": "devops-catalog",
            "labels": null,
            "lifecycle_rule": [],
            "location": "US-EAST1",
            "logging": [],
            "name": "devops-catalog",
            "project": "devops-catalog",
            "requester_pays": false,
            "retention_policy": [],
            "self_link": "https://www.googleapis.com/storage/v1/b/devops-catalog",
            "storage_class": "NEARLINE",
            "url": "gs://devops-catalog",
            "versioning": [],
            "website": []
          },
          "private": "bnVsbA=="
        }
      ]
    }
  ]
}
If we ignore the fields that are currently empty, and the few that are for Terraform’s internal usage, we can see that the state stored in that file contains the same information as what we saw through terraform show. The only important difference is that one is in Terraform’s internal format (terraform.tfstate), while the other (terraform show) is meant to be readable by humans.

Even though that’s not the case right now, the state might easily contain some confidential information. It is currently stored locally, and we already decided to move it to a Google Cloud bucket. That way we’ll be able to share it, it will be stored in a more reliable location, and it will be more secure.

To move the state to the bucket, we’ll create a gcs²⁵ (Google Cloud Storage) backend. As you can probably guess, I already prepared a file just for that.
cp files/backend.tf .

cat backend.tf
²⁵https://www.terraform.io/docs/backends/types/gcs.html

The output is as follows.

terraform {
  backend "gcs" {
    bucket      = "devops-catalog"
    prefix      = "terraform/state"
    credentials = "account.json"
  }
}
There’s nothing special in that definition. We’re setting the name of the bucket, the prefix, which will be prepended to the names of the state files stored in the bucket, and the path to the credentials file.

The bucket entry in that Terraform definition cannot be set to a value of a variable. It needs to be hard-coded. So, we’ll need to replace devops-catalog with the bucket name we used when we created it.

cat backend.tf \
    | sed -e "s@devops-catalog@$TF_VAR_state_bucket@g" \
    | tee backend.tf
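One caveat about the previous command: it reads from and writes to backend.tf within the same pipeline, and tee may truncate the file before cat has finished reading it. With a file this small it usually works, but that is not guaranteed. A more defensive variant writes to a temporary file first and renames it afterward. The sketch below demonstrates that on a copy in a scratch directory. The bucket name is a made-up sample, and in this chapter it would be $TF_VAR_state_bucket.

```shell
# Work on a copy in a scratch directory so the chapter's files stay untouched.
SCRATCH_DIR="$(mktemp -d)"

# Same content as the backend.tf shown above.
cat > "$SCRATCH_DIR/backend.tf" <<'EOF'
terraform {
  backend "gcs" {
    bucket      = "devops-catalog"
    prefix      = "terraform/state"
    credentials = "account.json"
  }
}
EOF

# Sample bucket name (an assumption for this demo).
STATE_BUCKET="doc-20200611104346"

# Write the result to a temporary file, then replace the original
# only if sed succeeded.
sed -e "s@devops-catalog@$STATE_BUCKET@g" "$SCRATCH_DIR/backend.tf" \
    > "$SCRATCH_DIR/backend.tf.tmp" \
  && mv "$SCRATCH_DIR/backend.tf.tmp" "$SCRATCH_DIR/backend.tf"

cat "$SCRATCH_DIR/backend.tf"
```

Another option is to leave bucket out of the backend block and provide it during initialization with terraform init -backend-config="bucket=$TF_VAR_state_bucket", which Terraform calls partial backend configuration.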
Let’s apply the definitions and see what we’ll get.
terraform apply
The output, limited to the relevant parts, is as follows.

Backend reinitialization required. Please run "terraform init".
Reason: Initial configuration of the requested backend "gcs"
...
Since we are changing the location where Terraform should store the state, we have to initialize the project again. The last time we did that, it was because a plugin (google) was missing. This time it’s because the init process will copy the state from the local file to the newly created bucket.
terraform init
The output, limited to the relevant parts, is as follows.
Creating And Managing Google Kubernetes Engine (GKE) Clusters With Terraform 1 2 3 4
26
Initializing the backend... Do you want to copy existing state to the new backend? ... Enter a value: yes
Please confirm that you do want to copy the state by typing yes and pressing the enter key. The process continued. It copied the state to the remote storage, which, from now on, will be used instead of the local file. Now we should be able to apply the definitions. 1
terraform apply
The output is as follows. 1 2
... google_storage_bucket.state: Refreshing state... [id=devops-catalog]
3 4
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
As we can see, there was no need to apply the definitions. The latest addition does not define any new resources. We only added the location for the Terraform state. That change is internal, and it was applied through the init process.
Creating The Control Plane

Now we have all the prerequisites. The provider is set to google, and we have the backend (for the state) pointing to the bucket. We can turn our attention to the GKE cluster itself.

A Kubernetes cluster (almost) always consists of a control plane and one or more pools of worker nodes. In the case of GKE, those two are separate types of resources. We’ll start with the control plane, and move towards worker nodes later.

We can use the google_container_cluster²⁶ resource to create a GKE control plane.

cp files/k8s-control-plane.tf .

cat k8s-control-plane.tf

The output is as follows.

resource "google_container_cluster" "primary" {
  name                     = var.cluster_name
  location                 = var.region
  remove_default_node_pool = true
  initial_node_count       = 1
  min_master_version       = var.k8s_version

  resource_labels = {
    environment = "development"
    created-by  = "terraform"
    owner       = "vfarcic"
  }
}

²⁶https://www.terraform.io/docs/providers/google/r/container_cluster.html
The meaning of some of the fields is probably easy to guess. However, we do have a few that might not be that obvious and intuitive. Specifically, it might be hard to understand the meaning of remove_default_node_pool and initial_node_count.

If we removed them, a GKE cluster would be created with the default node pool for worker nodes. But we don’t want that. More often than not, it is better to have the node pool separate from the control plane. That way, we have better control over it, especially if we choose to have multiple pools. The problem, however, is that the default node pool is mandatory during the creation of the control plane. So, given that we have to have it initially, even though we don’t want to, the best option is to remove it after the control plane is created. That’s the function of the remove_default_node_pool field.

Given that the default node pool is only temporary, we’re setting the initial_node_count to 1. That way, the default node pool will have only one node. To be more precise, it will have three, one in each zone of the region. In any case, we’re setting it to be the smallest possible value (1) so that we save a bit of time, as well as to reduce the cost.

The resource_labels are similar to the labels we used earlier with the google_storage_bucket resource. In this case, labels would mean something else, so the field is called resource_labels instead. It is generally a good idea to always look for this field with Terraform definitions, and supply any helpful metadata about the resource.

Let’s apply the definitions.

terraform apply

Just as before, we’re asked to provide a valid Kubernetes version. We’ll still ignore that question by pressing the enter key.

The output, limited to the relevant parts, is as follows.

...
  # google_container_cluster.primary will be created
  + resource "google_container_cluster" "primary" {
...
Plan: 1 to add, 0 to change, 0 to destroy.
...
We can see that only one resource will be added and that none will be removed or modified. That was to be expected since we did not change any of the other resources. We just added one more to the mix.

After confirming that we want to proceed by typing yes and pressing the enter key, the process continued, only to fail a few moments later.

The output, limited to the relevant parts, is as follows.

...
Error: googleapi: Error 403: Kubernetes Engine API has not been used in project 862682488723 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/container.googleapis.com/overview?project=862682488723 then retry.
...
We did not yet enable Kubernetes Engine API, so Terraform cannot create the cluster for us. Enabling an API for a GCP project is a one-time deal. We will not be asked to enable the same API for the same project twice.
Please follow the link from the output and enable the API.

There’s one more thing we need to do. We will not be able to keep pressing the enter key when asked which version of Kubernetes we want to have. We could get away with that before because we were not creating a Kubernetes cluster. Now, however, we do have to provide a valid version. But, which one is it?

Instead of guessing which Kubernetes versions are available in GKE, we’re going to ask Google to output the list of all those that are currently supported in our region.

gcloud container get-server-config \
    --region us-east1 \
    --project $PROJECT_ID

The output, limited to the relevant parts, is as follows.

...
validMasterVersions:
- 1.15.11-gke.3
- 1.15.11-gke.1
- 1.15.9-gke.26
- 1.15.9-gke.24
- 1.14.10-gke.34
- 1.14.10-gke.32
- 1.14.10-gke.31
- 1.14.10-gke.27
...
Pick any of the valid master versions, except the newest one. You’ll see later why it cannot be the most recent version. If you have difficulty making a decision, the second-newest is a good option.

Since we are likely going to have to provide a valid Kubernetes version to all the commands we’ll execute from now on, we’ll store it in an environment variable.

Please replace [...] with the selected version in the command that follows.

export K8S_VERSION=[...]
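If you prefer not to copy the version by hand, the second-newest entry can be picked from the list with standard tools. This is a minimal sketch, assuming the versions are listed newest first, as in the output above; the VERSIONS variable is a hard-coded sample, while in practice the list would come from the gcloud command.

```shell
# Sample of the validMasterVersions entries, newest first.
VERSIONS="1.15.11-gke.3
1.15.11-gke.1
1.15.9-gke.26
1.15.9-gke.24"

# Print only the second line, which is the second-newest version.
export K8S_VERSION=$(echo "$VERSIONS" | sed -n '2p')

echo "$K8S_VERSION"
```

With the sample above, the variable ends up holding 1.15.11-gke.1.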
Now we should be able to apply the definition and create the control plane. Hopefully, there is nothing else missing.

terraform apply \
    --var k8s_version=$K8S_VERSION

The output, limited to the relevant parts, is as follows.

...
  # google_container_cluster.primary will be created
  + resource "google_container_cluster" "primary" {
...
}
...
Enter a value: yes
...
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

As expected, yet another resource was added, and none were changed or destroyed.
Exploring Terraform Outputs

We’ll retrieve the nodes of the newly created Kubernetes cluster and see what we’ve got. But, before we do that, we need to create a kubeconfig file that will provide kubectl with the information on how to access the cluster. We could do that right away with gcloud, but we’ll make it a bit more complicated.

To create kubeconfig, we need to know the name of the cluster, and the region and project in which it’s running. We might have that information in our heads. But, we’ll imagine that’s not the case. I’ll assume that you forgot it, or that you did not pay attention. That will give me a perfect opportunity to introduce you to yet another Terraform feature.

We can define outputs with the information we need, as long as that information is available in Terraform state.

cp files/output.tf .

cat output.tf

The output is as follows.

output "cluster_name" {
  value = var.cluster_name
}

output "region" {
  value = var.region
}

output "project_id" {
  value = var.project_id
}
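The outputs above simply echo variables. Since an output can expose anything Terraform holds in its state, it can also reference an attribute of a resource. What follows is a minimal sketch, assuming a hypothetical output name (cluster_endpoint) and a hypothetical file name; endpoint is one of the attributes exported by google_container_cluster. The file is written with the .example suffix so that Terraform ignores it until it is renamed.

```shell
# Write a hypothetical extra output to a file that Terraform will not load.
cat > output-endpoint.tf.example <<'EOF'
output "cluster_endpoint" {
  value = google_container_cluster.primary.endpoint
}
EOF

cat output-endpoint.tf.example
```

Renaming the file so that it ends with .tf would make the endpoint appear next to the other outputs.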
We’re specifying which data should be output by Terraform. Such outputs are generated at the end of the terraform apply process, and we’ll see that later. For now, we’re interested only in the outputs themselves. We’ll use them to get the name of the cluster, the project ID, and the region, which we need to retrieve the credentials for kubeconfig.

If we want to see all the outputs, we can simply refresh. That would update the state file with the information about the physical resources Terraform is tracking and, more importantly, show us those outputs.

terraform refresh \
    --var k8s_version=$K8S_VERSION

The output, limited to the relevant parts, is as follows.

...
Outputs:

cluster_name = devops-catalog
project_id = devops-catalog
region = us-east1
We can clearly see the name of the cluster, the project ID, and the region. But that’s not what we really need. We’re not interested in seeing that information, but rather in using it to construct the command that will retrieve the credentials. We can accomplish that with the terraform output command.

terraform output cluster_name

The output is as follows.

devops-catalog

Now we know how to retrieve the output of a single value, so let’s use that to construct the command that will retrieve the credentials.

export KUBECONFIG=$PWD/kubeconfig

gcloud container clusters \
    get-credentials \
    $(terraform output cluster_name) \
    --project \
    $(terraform output project_id) \
    --region \
    $(terraform output region)
We specified that kubeconfig should be in the current directory by exporting the environment variable KUBECONFIG. Further on, we retrieved the credentials using gcloud. What matters, apart from the obvious need to retrieve the credentials, is that we used terraform output to retrieve the data we need and pass it to gcloud.

The only thing left is to give ourselves admin permissions to the cluster.

kubectl create clusterrolebinding \
    cluster-admin-binding \
    --clusterrole \
    cluster-admin \
    --user \
    $(gcloud config get-value account)

Now we should be able to check the cluster that Terraform created for us.

kubectl get nodes
The output states that no resources were found in default namespace. That was to be expected. We retrieved the nodes of the cluster, and we got none. GKE does not allow us to access the control plane. On the other hand, we did not yet create worker nodes, so there are none, for now.
Creating Worker Nodes

We can manage worker nodes through the google_container_node_pool²⁷ resource. As you can expect, I prepared yet another definition that we can use.

cp files/k8s-worker-nodes.tf .

cat k8s-worker-nodes.tf

The output is as follows.

resource "google_container_node_pool" "primary_nodes" {
  name     = var.cluster_name
  location = var.region
  cluster  = google_container_cluster.primary.name
  version  = var.k8s_version
  node_config {
    preemptible  = var.preemptible
    machine_type = var.machine_type
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }
  autoscaling {
    min_node_count = var.min_node_count
    max_node_count = var.max_node_count
  }
  management {
    auto_upgrade = false
  }
  timeouts {
    create = "15m"
    update = "1h"
  }
}

²⁷https://www.terraform.io/docs/providers/google/r/container_node_pool.html
That definition is a bit bigger than those we used before. There are more things we might want to define for worker nodes, so that definition has a few more fields than the others.

At the top, we are defining the name of the node pool and the location. The value of the cluster is interesting, though. Instead of hard-coding it or setting it to the value of a variable, we’re telling Terraform to use the name field of the google_container_cluster.primary resource. Further on, we have the version.

The node_config block defines the properties of the nodes in the pool we’re about to create. Then we’re defining autoscaling of the node pool. Through management, we’re specifying that we do not want the nodes to be upgraded automatically. That would defy the idea that everything is defined as code. Finally, there are a few timeouts, just in case creation or update of the node pool hangs.

Let’s apply the definitions, including the new one, and see what we’ll get.

terraform apply \
    --var k8s_version=$K8S_VERSION

The output, limited to the relevant parts, is as follows.

...
Terraform will perform the following actions:

  # google_container_node_pool.primary_nodes will be created
  + resource "google_container_node_pool" "primary_nodes" {
...
Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value:
The process started by presenting us with all the changes required to converge the actual into the desired state. Since we did not change any of the existing definitions, the only modification to the desired state is the addition of the google_container_node_pool referenced as primary_nodes.

Confirm that you want to proceed by typing yes and pressing the enter key, and the process will continue.

The output, limited to the relevant parts, is as follows.

...
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

cluster_name = devops-catalog
project_id = devops-catalog
region = us-east1
It finished by adding one resource, and without changing or destroying anything. At the end of it, we got the familiar output with the name of the cluster, the project ID, and the region.

Let’s see what we’ll get this time when we retrieve the nodes.

kubectl get nodes

The output is as follows.

NAME           STATUS ROLES  AGE VERSION
gke-devops-... Ready  <none> 37s v1.15.11-gke.1
gke-devops-... Ready  <none> 35s v1.15.11-gke.1
gke-devops-... Ready  <none> 35s v1.15.11-gke.1
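The number of nodes follows from running a regional cluster, where the counts in the autoscaling block apply to each zone. Here is a quick back-of-the-envelope calculation, assuming that the pool is spread over three zones and that the default values of min_node_count (1) and max_node_count (3) are in use.

```shell
# Assumption: a regional cluster spread over three zones.
ZONES=3

# Defaults of the min_node_count and max_node_count variables.
MIN_NODE_COUNT=1
MAX_NODE_COUNT=3

# The autoscaling limits are applied per zone.
MIN_TOTAL=$((ZONES * MIN_NODE_COUNT))
MAX_TOTAL=$((ZONES * MAX_NODE_COUNT))

echo "The pool can scale between $MIN_TOTAL and $MAX_TOTAL nodes"
```

That explains why we got three nodes, even though the minimum is set to one.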
That’s it. We created a cluster using infrastructure as code with Terraform. This is the moment when we should push the changes to Git and ensure that they are available to whoever might need to change our cluster and the surrounding infrastructure. I’ll assume that you know how to work with Git, so we’ll skip this part. Just remember that, from now on, we should be pushing all the changes to Git. Even better, we should be creating pull requests so that others can review them before merging them to the master branch. Ideally, we’d do that through one of the continuous delivery tools. But that’s out of the scope (at least for now).
Upgrading The Cluster

Changing any aspect of the resources we created is easy and straightforward. All we have to do is modify Terraform definitions, and apply the changes. We could add resources, we could remove them, or we could change them in any way we want.

To illustrate that, we’ll upgrade the Kubernetes version. But, before we do that, let’s see which version we’re running right now.

kubectl version --output yaml

The output, limited to the relevant parts, is as follows.

...
serverVersion:
  ...
  gitVersion: v1.15.11-gke.1
  ...
I am currently running Kubernetes version v1.15.11-gke.1 (yours might be different). To upgrade the version, we need to find which newer versions are available.

gcloud container get-server-config \
    --region \
    $(terraform output region) \
    --project \
    $(terraform output project_id)

The output, limited to the relevant parts, is as follows.

...
validMasterVersions:
- 1.15.11-gke.3
- 1.15.11-gke.1
...

In my case, there is indeed a newer version 1.15.11-gke.3 (yours might be different). So, we’ll change the value of the environment variable K8S_VERSION we used so far.

Please replace [...] with the selected newer version in the command that follows.

export K8S_VERSION=[...]
As I already mentioned, we should try to avoid changing aspects of Terraform definitions through --var arguments. Instead, we should modify variables.tf, push the change to Git, and then apply it. But we will use --var for simplicity. The result will be the same as if we changed that value in the variables.tf.
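Another way to keep such values in files, rather than on the command line, is a terraform.tfvars file, which Terraform loads automatically from the working directory. What follows is a minimal sketch, assuming that the version is still in the K8S_VERSION environment variable (the fallback value is only a placeholder). Values passed through --var take precedence over those in the file.

```shell
# Fall back to a placeholder if K8S_VERSION is not set in this shell.
K8S_VERSION="${K8S_VERSION:-1.15.11-gke.3}"

# terraform.tfvars is picked up automatically by terraform plan and apply.
echo "k8s_version = \"$K8S_VERSION\"" \
    | tee terraform.tfvars
```

Unlike an argument typed in a terminal, that file can be pushed to Git and reviewed like any other definition.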
terraform apply \
    --var k8s_version=$K8S_VERSION

The output, limited to the relevant parts, is as follows.

...
  # google_container_cluster.primary will be updated in-place
  ~ resource "google_container_cluster" "primary" {
        ...
        master_version     = "1.15.11-gke.1"
      ~ min_master_version = "1.15.11-gke.1" -> "1.15.11-gke.3"
        monitoring_service = "monitoring.googleapis.com/kubernetes"
        ...
    }

  # google_container_node_pool.primary_nodes will be updated in-place
  ~ resource "google_container_node_pool" "primary_nodes" {
        ...
        project = "devops-catalog"
      ~ version = "1.15.11-gke.1" -> "1.15.11-gke.3"
        ...
    }

Plan: 0 to add, 2 to change, 0 to destroy.
...
Enter a value: yes
This time, we are not adding resources, but updating some of the properties. We can see that through the ~ sign next to those that will be modified. In this case, we are about to change the definition of google_container_cluster.primary and google_container_node_pool.primary_nodes resources. Specifically, we are modifying min_master_version in the former, and version in the latter. All the other properties will stay intact.

We can observe the same in the Plan section that states that there is nothing to add, that there are 2 resources to change, and that there is nothing to destroy.

Type yes and press the enter key.

It will take a while until the rest of the process is finished. It is performing the rolling upgrade by draining and shutting down one node at a time and creating new ones based on the newer version. On top of that, it needs to confirm that the system is healthy before continuing with the next iteration.

The process started with the control plane, and it should take around half an hour until it is upgraded and fully operational. After that, it’ll do a rolling upgrade of the worker nodes. Assuming that you are still running three nodes (one in each zone), that should take around nine minutes (three minutes per node).

The whole process takes more than half an hour, and that might sound like too long, especially since it takes considerably less time to create a cluster. However, creating a new cluster is much easier. Terraform could create all the nodes at once. With upgrades, such a strategy would result in downtime. On top of that, the process is not only performing rolling upgrades but is also immutable. Even though there is the word upgrade in the name, it is not really upgrading existing nodes, but destroying the old ones, and creating new ones in their place. It takes time to do things in a safe way and without downtime.

Once the process is finished, we can confirm that it was indeed successful by outputting the current version.

kubectl version --output yaml

The output, limited to the relevant parts, is as follows.

...
serverVersion:
  ...
  gitVersion: v1.15.11-gke.3
  ...

We can see that, this time, we have a different (newer) version of the cluster. Hurray!
Reorganizing The Definitions

Every resource we defined so far is currently in a different file. That is a perfectly valid way to use Terraform. It doesn’t really care whether we have one or one thousand files. It concatenates all those with the tf extension.

What makes Terraform unique is its dependency management. No matter how we organize definitions of different resources, it will figure out the dependency tree, and it will create, update, or delete resources in the correct order. That way, we do not need to bother planning what should be created first, or in which order those resources are defined. That gives us the freedom to work and to organize in a myriad of ways.

I, for example, tend to have only three files: one for variables, one for outputs, and one for all the providers and resources. Given that I decide what the exercises look like, we’re going to reorganize things my way.

We’ll start by removing all Terraform definitions.

rm -f *.tf

Next, we’ll concatenate all providers and resources into a single file main.tf.

cat \
    files/backend.tf \
    files/k8s-control-plane.tf \
    files/k8s-worker-nodes.tf \
    files/provider.tf \
    files/storage.tf \
    | tee main.tf
The output is as follows.

terraform {
  backend "gcs" {
    bucket      = "devops-catalog"
    prefix      = "terraform/state"
    credentials = "account.json"
  }
}
resource "google_container_cluster" "primary" {
  name                     = var.cluster_name
  location                 = var.region
  remove_default_node_pool = true
  initial_node_count       = 1
  min_master_version       = var.k8s_version

  resource_labels = {
    environment = "development"
    created-by  = "terraform"
    owner       = "vfarcic"
  }
}
resource "google_container_node_pool" "primary_nodes" {
  name     = var.cluster_name
  location = var.region
  cluster  = google_container_cluster.primary.name
  version  = var.k8s_version
  node_config {
    preemptible  = var.preemptible
    machine_type = var.machine_type
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }
  autoscaling {
    min_node_count = var.min_node_count
    max_node_count = var.max_node_count
  }
  management {
    auto_upgrade = false
  }
  timeouts {
    create = "15m"
    update = "1h"
  }
}
provider "google" {
  credentials = file("account.json")
  project     = var.project_id
  region      = var.region
}
resource "google_storage_bucket" "state" {
  name          = var.state_bucket
  location      = var.region
  force_destroy = false
  project       = var.project_id
  storage_class = "NEARLINE"
  labels = {
    environment = "development"
    created-by  = "terraform"
    owner       = "vfarcic"
  }
}
Next, we’ll copy the variables.

cp files/variables.tf .

cat variables.tf

The output is as follows.

variable "region" {
  type    = string
  default = "us-east1"
}

variable "project_id" {
  type    = string
  default = "devops-catalog"
}

variable "cluster_name" {
  type    = string
  default = "devops-catalog"
}

variable "k8s_version" {
  type = string
}

variable "min_node_count" {
  type    = number
  default = 1
}

variable "max_node_count" {
  type    = number
  default = 3
}

variable "machine_type" {
  type    = string
  default = "e2-medium"
}

variable "preemptible" {
  type    = bool
  default = true
}
Finally, we’ll copy the outputs as well.

cp files/output.tf .

cat output.tf

The output is as follows.

output "cluster_name" {
  value = var.cluster_name
}

output "region" {
  value = var.region
}

output "project_id" {
  value = var.project_id
}
That’s it. Everything we need to create and manage our GKE cluster is now neatly organized. It’s split into main.tf (contains the backend, the provider, and all the resources), variables.tf, and output.tf.

To demonstrate that Terraform does not care how we organize the definitions nor their order, we’ll apply them again.

terraform apply \
    --var k8s_version=$K8S_VERSION

The output, limited to the relevant parts, is as follows.

...
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
...
As you can see, there is nothing to add, change, or destroy. We did not change any of the definitions. We only organized them in a way I like.
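The ordering behavior described in this section can be pictured as a topological sort of a dependency graph. The sketch below only illustrates the idea with the standard tsort utility. It is not how Terraform is implemented. The single dependency it contains comes from the cluster reference inside the node pool definition. The real graph can be printed with terraform graph.

```shell
# A pair means that the first resource must exist before the second one.
DEPENDENCIES="google_container_cluster.primary google_container_node_pool.primary_nodes"

# tsort prints the items in an order that respects every pair.
ORDER=$(echo "$DEPENDENCIES" | tsort)

echo "$ORDER"
```

No matter where the two resources sit in main.tf, the cluster comes out first, and the node pool second.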
Destroying The Resources

We’re (almost) finished with the quick exploration of Terraform using GKE as an example. We saw how to add and change resources, and the only thing missing is to see how to destroy them.

If we’d like to delete some of the resources, all we’d have to do is remove their definitions, and execute terraform apply. However, in some cases, we might want to destroy everything. As you probably guessed, there is a command for that as well.

terraform destroy \
    --var k8s_version=$K8S_VERSION

At the end of the process, you might see an error stating that it couldn’t delete the bucket with Terraform state without force_destroy set to true. Don’t be alarmed. That’s normal. After Terraform destroyed everything else, it tried to destroy the bucket where we keep the state. However, we did not specify that the bucket can be removed if it contains files. The process failed to remove that bucket, and that’s a good thing. That will allow us to recreate the same cluster in the sections that follow.

The cluster and all the other resources we defined are now gone. The exception is the storage with the state that we left intact and will continue using in the exercises that follow.

Please note that we removed only the resources created through Terraform, excluding the bucket. Those that were created with gcloud (e.g., project, service account, etc.) are still there. Google will not charge you (almost) anything for them so, unlike those we created with Terraform, there is no good reason to remove them. On the other hand, you might want to use the definitions from this chapter to create a cluster that will be used for the exercises in the others. Keeping those created with gcloud will simplify the process. All you’ll have to do is execute terraform apply.

The last thing we’ll do is leave the local copy of the repository.

cd ../../
Creating And Managing AWS Elastic Kubernetes Service (EKS) Clusters With Terraform

The goal of this book is to guide you through decisions. It aims to get you up-to-speed fast with a technology or a process. Since we already had a brief introduction to Terraform, we’ll skip the potentially lengthy introduction and go straight into the exercises. That will help you decide whether Terraform is the right choice and give you sufficient base knowledge that you’ll be able to extend on your own.

We’ll create an AWS Elastic Kubernetes Service (EKS) cluster and all the surrounding resources required for optimal usage of it, as well as for proper maintenance of the infrastructure through Terraform. We’re trying to understand how Terraform works, and what it’s suitable for. If you do use Kubernetes, you’ll end up with a reusable way to create and manage it in Amazon Web Services (AWS). Nevertheless, that’s not the primary goal, so it doesn’t matter much whether Kubernetes is your thing or not. The main objective is to learn Terraform through practical examples.

Let’s go.
Preparing For The Exercises

All the commands from this chapter are available in the 01-02-terraform-eks.sh²⁸ Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.

The code and the configurations that will be used in this chapter are available in the GitHub repository vfarcic/devops-catalog-code²⁹. Let’s clone it. Feel free to skip the command that follows if you already cloned that repository.

²⁸https://gist.github.com/ad78a643e5ccf7bf5fd87b16b29306eb
²⁹https://github.com/vfarcic/devops-catalog-code

git clone \
    https://github.com/vfarcic/devops-catalog-code.git

The code for this chapter is located in the terraform-eks directory.

cd devops-catalog-code

git pull

cd terraform-eks
We went to the local copy of the repository. We pulled the latest revision just in case you already had the repository from before, and I changed something in the meantime. Finally, we entered the directory with the code and configurations we’ll use in this chapter.
Exploring Terraform Variables

Generally speaking, entries in Terraform definitions are split into four groups. We have provider, resource, output, and variable entries. That is not to say that there aren’t other types (there are), but that those four are the most important and most commonly used ones. For now, we’ll focus on variables, and leave the rest for later.

All the configuration files we’ll use are in the files sub-directory. We’ll pull them out one by one and explore what they’re used for.

Let’s copy the file that defines the variables we’ll use and take a quick look.

cp files/variables.tf .

cat variables.tf

The output is as follows.

variable "region" {
  type    = string
  default = "us-east-1"
}

variable "cluster_name" {
  type    = string
  default = "devops-catalog"
}

variable "k8s_version" {
  type = string
}

variable "release_version" {
  type = string
}

variable "min_node_count" {
  type    = number
  default = 3
}

variable "max_node_count" {
  type    = number
  default = 9
}

variable "machine_type" {
  type    = string
  default = "t2.small"
}
If you focus on the names of the variables, you’ll notice that they are self-explanatory. We defined the region where our cluster will run, and there’s the name of the cluster we want to create (cluster_name). We’re setting the Kubernetes version we’d like to use (k8s_version), as well as the release version (release_version). The latter is the version of the AMI that we’ll use for the worker node pool. Further on, we have the minimum and the maximum number of worker nodes (min_node_count and max_node_count). Finally, we defined the type of machines we’d like to use for worker nodes (machine_type).
Unlike other managed Kubernetes solutions (e.g., GKE, AKS), the minimum and the maximum number of worker nodes do not provide the Cluster Autoscaler feature. They are used only to create an AWS Auto Scaling group. We’d need to deploy Cluster Autoscaler³⁰ separately if we’d like our cluster to scale automatically depending on the requested resources. That is hopefully going to change in the future, but, as of now (May 2020), Kubernetes Cluster Autoscaler is not an integral part of EKS.
What matters is that each of those variables has a type (e.g., string, number, bool), and a default value. The only exceptions are k8s_version and release_version that do not have anything set by default. That means that we’ll have to set them to some value at runtime. The reason for that is twofold. To begin with, I wanted to show you the difference between variables with and without default values. Also, I could not be sure which Kubernetes versions will be available at the time you’re going through the exercises in this chapter.

Later on, we’ll see the effect of not having a default value. For now, just remember that the variables k8s_version and release_version do not have a default value.

I like to think of variables as a blueprint of what I’d like to accomplish. They define the information that I believe could be changed over time, and they drive the goals of infrastructure-as-code I tend to define later. For example, I want the cluster to be scalable within specific limits, so I set variables min_node_count and max_node_count. Others tend to take a different approach and refactor parts of resource definitions to use variables. They hard-code values initially and then replace those with variables later.
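A variable without a default value can still be guarded against obviously wrong input. Terraform 0.13 and newer accept a validation block inside a variable definition. What follows is a minimal sketch, assuming a hypothetical rule that the version must look like major.minor (for example, 1.16). The file gets the .example suffix because defining k8s_version twice in files with the tf extension would result in an error.

```shell
# Write a hypothetical, stricter definition to a file Terraform will not load.
cat > variables-validation.tf.example <<'EOF'
variable "k8s_version" {
  type = string

  validation {
    # Hypothetical rule: accept only values shaped like 1.16
    condition     = can(regex("^[0-9]+\\.[0-9]+$", var.k8s_version))
    error_message = "The k8s_version value must look like major.minor, for example 1.16."
  }
}
EOF

# The same pattern can be tried in the shell before relying on it.
echo "1.16" | grep -E -q '^[0-9]+\.[0-9]+$' \
    && echo "1.16 is accepted"
```

With such a rule in place, a mistyped value would be rejected during terraform plan, before any request is sent to AWS.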
All in all, now we have the file that defines all the variables. Some represent the aspects of the cluster that are likely going to change over time. Others (e.g., cluster_name) are probably never going to change. I defined them as variables in an attempt to create definitions that could be useful to other teams.
Creating The Credentials

Now is the time to deal with the prerequisites. Even though the goal is to use terraform for all infrastructure-related tasks, we’ll need a few things specific to AWS. To be more precise, we’ll need to create an access key ID and a secret access key.

Please open the AWS Console³¹. Register if this is the first time you’re using AWS. Otherwise, log in if your session expired. Expand the menu with your name.

³⁰https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html
³¹https://console.aws.amazon.com
Creating And Managing AWS Elastic Kubernetes Service (EKS) Clusters With Terraform
Figure 1-2-1: AWS Personal menu
Select My Security Credentials, and you will be presented with different options to create credentials. Please expand the Access keys (access key ID and secret access key) tab, and click the Create New Access Key button. You should see the message that your access key has been created successfully. Do NOT dismiss that popup by clicking the Close button. We’ll need the access key, and this is the only time it is available to us.
Figure 1-2-2: Create Access Key confirmation dialog
Please expand the Show Access Key section, and you’ll see the access key ID and the secret access key. We’ll store them in environment variables and construct a file that we’ll be able to source whenever we need those keys. Please replace the first occurrence of [...] with the access key ID, and the second with the secret access key in the commands that follow.
export AWS_ACCESS_KEY_ID=[...]

export AWS_SECRET_ACCESS_KEY=[...]
Now that we have the keys, we’ll store them in a file, together with the region we’re going to use.

echo "export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-east-1" \
    | tee creds

The aws CLI, as well as terraform, will look for those variables, and use them as credentials. From now on, the commands both in this chapter and in those that follow will rely on them. Please execute the command that follows to (re)generate the environment variables.

source creds
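Since everything that follows depends on those variables, it can help to check that they are really set before running any other command. What follows is only a minimal sketch and not part of the book's repository. The function name require_env is made up for this illustration.

```shell
# Hypothetical helper (not from the repository): succeed only when every
# named environment variable is set to a non-empty value.
# The body runs in a subshell, so its variables do not leak to the caller.
require_env() (
    missing=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Missing environment variable: $name" >&2
            missing=1
        fi
    done
    # The exit status of this test becomes the status of the function.
    [ "$missing" -eq 0 ]
)
```

After sourcing creds, something like require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION should return without printing anything. If it complains, the creds file is probably incomplete.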
The key we created serves only one purpose. We created it so that we can use it in Terraform’s aws provider. If you don’t know what that is, and if you are not good at guessing, I’ll explain it in one sentence. It allows us to configure the credentials used to authenticate with AWS, and a few other things. If that’s not enough of an explanation, please visit the AWS Provider³² section of the Terraform documentation for more details.

³²https://www.terraform.io/docs/providers/aws/index.html

One of the ways the provider can authenticate is through environment variables we set a few moments ago. Let’s copy the provider definition I prepared, and take a quick look.

cp files/provider.tf .

cat provider.tf

The output is as follows.

provider "aws" {
  region = var.region
}
We’re specifying only the region. We could have defined quite a few other things, but we didn’t. What we really need is a region where we’ll create the cluster, and authentication. We could have set the latter inside that file, but there was no need for that since the provider can use the environment variables we set a few moments ago.

Now we are ready to apply Terraform definitions we created so far. In Terraform terms, apply means “create what’s missing, update what’s different, and delete what’s not needed anymore.”
terraform apply
You’ll notice that Terraform asked you to enter a value for k8s_version and, further on, for release_version. Those are the only variables we defined without the default value, so Terraform expects us to provide them. We’ll deal with that later. For now, and until we get to the part that deals with it, press the enter key whenever you’re asked those questions. Empty values should be enough for now since we are not yet using those variables in any of the definitions.

The output, limited to the relevant parts, is as follows.

...
Error: Could not satisfy plugin requirements

Plugin reinitialization required. Please run "terraform init".
...
Error: provider.aws: no suitable version installed
  version requirements: "(any version)"
  versions installed: none
Most of Terraform’s functionality is provided by plugins. More often than not, we need to figure out which plugin suits our needs and download it. In this case, we’re using the aws provider. Fortunately, Terraform will do the job of downloading the plugin. We just need to initialize the project.
terraform init
The output, limited to the relevant parts, is as follows.

Initializing the backend...

Initializing provider plugins...
- Checking for available provider plugins...
- Downloading plugin for provider "aws" (hashicorp/aws) 2.57.0...
...
We can see that Terraform detected that we want to use the plugin for provider "aws" and downloaded it. Now we should be able to apply that definition.

terraform apply

Remember to continue answering with the enter key whenever you’re asked to provide the values for k8s_version and release_version.

The output is as follows.
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
The output confirmed that the apply is complete, and we can see that it did not add, change, or destroy anything. That was to be expected. So far, we did not specify that we want to have any AWS resources. We defined some variables, and we specified that we would like to use the aws provider.
Storing The State In A Remote Backend

Terraform maintains its internal information about the current state. That allows it to deduce what needs to be done and to converge the actual into the desired state defined in *.tf files. Currently, that state is stored locally in the terraform.tfstate file. For now, you shouldn’t see anything exciting in it.

cat terraform.tfstate

The output is as follows.

{
  "version": 4,
  "terraform_version": "0.12.12",
  "serial": 1,
  "lineage": "34476f12-...",
  "outputs": {},
  "resources": []
}
The field that really matters is resources. It is empty because we did not define any. We will, soon, but it’s likely not going to be something you expect. We are not going to create anything related to our EKS cluster. At least not right away. What we need right now is a storage bucket.

Keeping Terraform’s state locally is a bad idea. If it’s on a laptop, we won’t be able to allow others to modify the state of our resources. We’d need to send them the terraform.tfstate file by email, keep it on some network drive, or something similar. That is impractical. We might be tempted to store it in Git, but that would not be secure. Instead, we’ll tell Terraform to keep the state in an AWS S3 bucket.

Since we’re trying to define infrastructure as code, we won’t do that by executing a shell command, nor will we go to the AWS console. We’ll tell Terraform to create the bucket. It will be the first resource managed by it.

We are about to explore the aws_s3_bucket resource. As the name suggests, it allows us to manage AWS S3 buckets, and you should be able to find more information in the aws_s3_bucket documentation³³.

Let’s copy the file I prepared, and take a look at the definition.
cp files/storage.tf .

cat storage.tf

The output is as follows.

resource "aws_s3_bucket" "state" {
  bucket = var.state_bucket
  acl    = "private"
  region = var.region
}
We’re defining the storage bucket referenced as state. All resource entries are followed with a type (e.g., aws_s3_bucket) and a reference (e.g., state). We’ll see the usage of a reference later in one of the upcoming definitions.

³³https://www.terraform.io/docs/providers/aws/r/s3_bucket.html

Just as with the provider, the resource has several fields. Some are mandatory, while others are optional and often have predefined values. We’re defining the name of the bucket and the acl. Also, we specified that it should be created inside a specific region. The values of two of those fields are defined as variables (var.state_bucket and var.region). The remaining one (acl), which is less likely to change, is hard-coded.

There is a tiny problem we need to fix before we proceed. AWS S3 bucket names need to be globally unique. There cannot be two buckets with the same name, anywhere within the same partition. We will generate a unique name using date.
export TF_VAR_state_bucket=doc-$(date +%Y%m%d%H%M%S)
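A timestamp makes a collision unlikely, but S3 also restricts which characters and lengths a bucket name may have. The sketch below checks a candidate name against the basic rules only (3 to 63 characters, lowercase letters, digits, dots, and hyphens, starting and ending with a letter or a digit). The function name is_valid_bucket_name is hypothetical, and the S3 documentation lists further restrictions (for example, a name must not be formatted as an IP address) that this sketch does not cover.

```shell
# Hypothetical helper: check a name against the basic S3 bucket naming rules.
# It does NOT check whether someone else already owns the name.
is_valid_bucket_name() {
    name=$1
    # The name must be between 3 and 63 characters long.
    [ "${#name}" -ge 3 ] && [ "${#name}" -le 63 ] || return 1
    # Only lowercase letters, digits, dots, and hyphens are allowed, and
    # the first and the last character must be a letter or a digit.
    printf '%s\n' "$name" \
        | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9.-]*[a-z0-9]$'
}

# Same naming scheme as the command above.
candidate="doc-$(date +%Y%m%d%H%M%S)"

if is_valid_bucket_name "$candidate"; then
    echo "$candidate passes the basic naming checks"
fi
```

Passing those checks says nothing about availability. Only an attempt to create the bucket will tell us whether the name is already taken.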
The name of the bucket is now, more or less, unique, and we will not be in danger that someone else already claimed it. The environment variable we created will be used as the name of the bucket. Let’s apply the new definition.

terraform apply

The output, limited to the relevant parts, is as follows.

...
Terraform will perform the following actions:

  # aws_s3_bucket.state will be created
  + resource "aws_s3_bucket" "state" {
      + acceleration_status         = (known after apply)
      + acl                         = "private"
      + arn                         = (known after apply)
      + bucket                      = "devops-catalog"
      + bucket_domain_name          = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + region                      = "us-east-1"
      + request_payer               = (known after apply)
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)

      + versioning {
          + enabled    = (known after apply)
          + mfa_delete = (known after apply)
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
In this case, we can see the full list of all the resources that will be created. The + sign indicates that something will be created. Under different conditions, we could also observe those that would be modified (~) or destroyed (-). Right now, Terraform deduced that the actual state is missing the aws_s3_bucket resource. It also shows us which properties will be used to create that resource. Some were defined by us, while others will be known after we apply that definition.

Finally, we are asked whether we want to perform these actions. Be brave and type yes, followed with the enter key.

From now on, I will assume that there’s no need for me to tell you to confirm Terraform actions. Answer with yes whenever you’re asked to confirm them.
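When a plan grows beyond a single resource, it can be handy to count how many resources fall into each of those three groups before answering yes. The following is a rough sketch, assuming the plan text was saved to a file without color codes (for example, by redirecting the output of terraform plan -no-color). The function name summarize_plan is made up. The phrase "will be created" comes from the output above, while "will be updated in-place" and "will be destroyed" are assumed to follow the same Terraform 0.12 wording.

```shell
# Hypothetical helper: count the planned actions in a saved plan text.
summarize_plan() {
    plan_file=$1
    # grep -c prints 0 but exits with a non-zero status when nothing
    # matches, so "|| true" keeps the function usable under "set -e".
    created=$(grep -c 'will be created$' "$plan_file" || true)
    updated=$(grep -c 'will be updated in-place$' "$plan_file" || true)
    destroyed=$(grep -c 'will be destroyed$' "$plan_file" || true)
    echo "create=$created update=$updated destroy=$destroyed"
}
```

The numbers should match the Plan line that Terraform prints at the end. Resources that need to be replaced are reported with a different phrase and are not counted by this sketch.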
After we choose to proceed, the relevant parts of the output should be as follows.

...
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

We can see that 1 resource was added and that nothing was changed or destroyed. Since this is the first time we created a resource with Terraform, it would be reasonable to be skeptical. So, we’ll confirm that the bucket was indeed created by listing all those available. Over time, we’ll gain confidence in Terraform and will not have to validate that everything works correctly.

aws s3api list-buckets

In my case, the output is as follows.

{
  "Buckets": [
    {
      "Name": "devops-catalog",
      "CreationDate": "2020-04-13T..."
    }
  ],
  "Owner": {
    "DisplayName": "viktor",
    "ID": "c6673..."
  }
}
Let’s imagine that someone else executed terraform apply and that we are not sure what the state of the resources is. In such a situation, we can consult Terraform by asking it to show us the state.

terraform show

The output is as follows.

# aws_s3_bucket.state:
resource "aws_s3_bucket" "state" {
    acl                         = "private"
    arn                         = "arn:aws:s3:::devops-catalog"
    bucket                      = "devops-catalog"
    bucket_domain_name          = "devops-catalog.s3.amazonaws.com"
    bucket_regional_domain_name = "devops-catalog.s3.amazonaws.com"
    force_destroy               = false
    hosted_zone_id              = "Z3AQBSTGFYJSTF"
    id                          = "devops-catalog"
    region                      = "us-east-1"
    request_payer               = "BucketOwner"

    versioning {
        enabled    = false
        mfa_delete = false
    }
}
As you can see, there’s not much to look at. For now, we have only one resource (aws_s3_bucket). As we keep progressing, that output will be increasing and, more importantly, it will always reflect the state of the resources managed by Terraform. The previous output is a human-readable format of the state currently stored in terraform.tfstate. We can inspect that file as well.
cat terraform.tfstate

The output is as follows.

{
  "version": 4,
  "terraform_version": "0.12.12",
  "serial": 3,
  "lineage": "34476f12-...",
  "outputs": {},
  "resources": [
    {
      "mode": "managed",
      "type": "aws_s3_bucket",
      "name": "state",
      "provider": "provider.aws",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "acceleration_status": "",
            "acl": "private",
            "arn": "arn:aws:s3:::devops-catalog",
            "bucket": "devops-catalog",
            "bucket_domain_name": "devops-catalog.s3.amazonaws.com",
            "bucket_prefix": null,
            "bucket_regional_domain_name": "devops-catalog.s3.amazonaws.com",
            "cors_rule": [],
            "force_destroy": false,
            "grant": [],
            "hosted_zone_id": "Z3AQBSTGFYJSTF",
            "id": "devops-catalog",
            "lifecycle_rule": [],
            "logging": [],
            "object_lock_configuration": [],
            "policy": null,
            "region": "us-east-1",
            "replication_configuration": [],
            "request_payer": "BucketOwner",
            "server_side_encryption_configuration": [],
            "tags": null,
            "versioning": [
              {
                "enabled": false,
                "mfa_delete": false
              }
            ],
            "website": [],
            "website_domain": null,
            "website_endpoint": null
          },
          "private": "bnVsbA=="
        }
      ]
    }
  ]
}
If we ignore the fields that are currently empty, and the few that are for Terraform’s internal usage, we can see that the state stored in that file contains the same information as what we saw through terraform show. The only important difference is that one is in Terraform’s internal format (terraform.tfstate), while the other (terraform show) is meant to be readable by humans.

Even though that’s not the case right now, the state might easily contain some confidential information. It is currently stored locally, and we already decided to move it to an AWS S3 bucket. That way we’ll be able to share it, it will be stored in a more reliable location, and it will be more secure.

To move the state to the bucket, we’ll create an s3³⁴ backend. As you can probably guess, I already prepared a file just for that.

cp files/backend.tf .

cat backend.tf

The output is as follows.

terraform {
  backend "s3" {
    bucket = "devops-catalog"
    key    = "terraform/state"
  }
}

There’s nothing special in that definition. We’re setting the name of the bucket and the key, which is the path under which the state file will be stored inside that bucket.

³⁴https://www.terraform.io/docs/backends/types/s3.html
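The bucket entry in that Terraform definition cannot be set to a value of a variable. It needs to be hard-coded. So, we’ll need to replace devops-catalog with the bucket name we used when we created it. We’ll write the result to a temporary file first, since reading and overwriting the same file within a single pipeline can truncate it before it is fully read.

sed -e "s@devops-catalog@$TF_VAR_state_bucket@g" backend.tf \
    | tee backend.tf.tmp

mv backend.tf.tmp backend.tf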
Let’s apply the definitions and see what we’ll get.

terraform apply

The output, limited to the relevant parts, is as follows.

Backend reinitialization required. Please run "terraform init".
Reason: Initial configuration of the requested backend "s3"
...
Since we are changing the location where Terraform should store the state, we have to initialize the project again. The last time we did that, it was because a plugin (aws) was missing. This time it’s because the init process will copy the state from the local file to the newly created bucket.

terraform init

The output, limited to the relevant parts, is as follows.

Initializing the backend...
Do you want to copy existing state to the new backend?
...
  Enter a value: yes
Please confirm that you do want to copy the state by typing yes and pressing the enter key.

The process continued. It copied the state to the remote storage, which, from now on, will be used instead of the local file. Now we should be able to apply the definitions.

terraform apply

The output is as follows.

...
aws_s3_bucket.state: Refreshing state... [id=devops-catalog]

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
As we can see, there was no need to apply the definitions. The latest addition does not define any new resources. We only added the location for the Terraform state. That change is internal, and it was applied through the init process.
Creating The Control Plane

Now we have all the prerequisites. The provider is set to aws, and we have the backend (for the state) pointing to the bucket. We can turn our attention to the EKS cluster itself.

A Kubernetes cluster (almost) always consists of a control plane and one or more pools of worker nodes. In the case of EKS, those two are separate types of resources. We’ll start with the control plane, and move towards worker nodes later. We can use the aws_eks_cluster³⁵ resource to create an EKS control plane.

However, unlike with other major providers (e.g., GKE, AKS), an EKS cluster cannot be created alone. It requires quite a few other resources. Specifically, we need to create an IAM role (referenced through its ARN), a security group, and subnets. Those, in turn, might require a few other resources.
cp files/k8s-control-plane.tf .

cat k8s-control-plane.tf

The output is as follows.

resource "aws_eks_cluster" "primary" {
  name     = "${var.cluster_name}"
  role_arn = "${aws_iam_role.control_plane.arn}"
  version  = var.k8s_version

  vpc_config {
    security_group_ids = ["${aws_security_group.worker.id}"]
    subnet_ids         = aws_subnet.worker[*].id
  }

  depends_on = [
    "aws_iam_role_policy_attachment.cluster",
    "aws_iam_role_policy_attachment.service",
  ]
}

³⁵https://www.terraform.io/docs/providers/aws/r/eks_cluster.html
resource "aws_iam_role" "control_plane" { name = "devops-catalog-control-plane" assume_role_policy = "1.16.8-20200423" ... ~ version = "1.15" -> "1.16" ... }
Plan: 0 to add, 2 to change, 0 to destroy.
...
  Enter a value: yes
This time, we are not adding resources, but updating some of the properties. We can see that through the ~ sign next to those that will be modified. In this case, we are about to change the definition of aws_eks_cluster.primary and aws_eks_node_group.primary resources. Specifically, we are modifying the version in both, and the release_version for the node group. All the other properties will stay intact. We can observe the same in the Plan section that states that there is nothing to add, that there are 2 resources to change, and that there is nothing to destroy.
Type yes and press the enter key.

It will take a while until the rest of the process is finished. It should be performing rolling updates. The process started with the control plane, and it should take a while until it is upgraded and fully operational. After that, it’ll do a rolling upgrade of the worker nodes.

You might think that the whole process is too long, especially since it takes considerably less time to create a cluster. However, creating a new cluster is much easier. Terraform could create all the nodes at once. With upgrades, such a strategy would result in downtime. On top of that, the process is not only performing rolling upgrades but is also immutable. Even though there is the word upgrade in the name, it is not really upgrading existing nodes, but destroying the old ones, and creating new ones in their place. It takes time to do things in a safe way and without downtime.

Once the process is finished, we can confirm that it was indeed successful by outputting the current version.

kubectl version --output yaml

The output, limited to the relevant parts, is as follows.

...
serverVersion:
  ...
  gitVersion: v1.16.8-eks-af3caf
  ...
We can see that, this time, we have a different (newer) version of the cluster. Hurray!
Reorganizing The Definitions

Every resource we defined so far is currently in a different file. That is a perfectly valid way to use Terraform. It doesn’t really care whether we have one or one thousand files. It concatenates all those with the tf extension.

What makes Terraform unique is its dependency management. No matter how we organize definitions of different resources, it will figure out the dependency tree, and it will create, update, or delete resources in the correct order. That way, we do not need to bother planning what should be created first, or in which order those resources are defined. That gives us the freedom to work and to organize in a myriad of ways.

I, for example, tend to have only three files: one for variables, one for outputs, and one for all the providers and resources. Given that I decide what the exercises look like, we’re going to reorganize things my way.

We’ll start by removing all Terraform definitions.

rm -f *.tf
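The dependency management mentioned above boils down to sorting a graph of resources so that each one comes after everything it depends on. The sketch below illustrates the idea with the standard tsort utility. It is not how Terraform is implemented, and the pairs are partly a guess: the cluster's dependencies on the role and on the two policy attachments come from the definition we saw earlier, while the remaining edges are assumptions made for this illustration. The function name creation_order is hypothetical.

```shell
# Each input line is "A B", meaning that A must exist before B.
# Illustrative only: some of these edges are assumptions, not the
# actual graph Terraform builds.
creation_order() {
    tsort <<'EOF'
aws_iam_role.control_plane aws_iam_role_policy_attachment.cluster
aws_iam_role.control_plane aws_iam_role_policy_attachment.service
aws_iam_role.control_plane aws_eks_cluster.primary
aws_iam_role_policy_attachment.cluster aws_eks_cluster.primary
aws_iam_role_policy_attachment.service aws_eks_cluster.primary
aws_eks_cluster.primary aws_eks_node_group.primary
EOF
}

# Print one valid creation order; the order of the input lines is irrelevant.
creation_order
```

Shuffling the input lines does not change the constraints, which is the same reason Terraform does not care how we split or order the definitions. To see the real graph of a project, the terraform graph command should produce it in DOT format.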
Next, we’ll concatenate all providers and resources into a single file main.tf. As before, we’ll write the replaced content to a temporary file first, so that main.tf is not overwritten while it is still being read.

cat \
    files/backend.tf \
    files/k8s-control-plane.tf \
    files/k8s-worker-nodes.tf \
    files/provider.tf \
    files/storage.tf \
    | tee main.tf

sed -e "s@bucket = \"devops-catalog\"@bucket = \"$TF_VAR_state_bucket\"@g" main.tf \
    | tee main.tf.tmp

mv main.tf.tmp main.tf
The output is too big for a book, so we’ll skip presenting it. You should be able to see it on your screen.

Next, we’ll copy the variables.

cp files/variables.tf .

cat variables.tf

The output is as follows.

variable "region" {
  type    = string
  default = "us-east-1"
}

variable "cluster_name" {
  type    = string
  default = "devops-catalog"
}

variable "k8s_version" {
  type = string
}

variable "release_version" {
  type = string
}

variable "min_node_count" {
  type    = number
  default = 3
}

variable "max_node_count" {
  type    = number
  default = 9
}

variable "machine_type" {
  type    = string
  default = "t2.small"
}
Finally, we’ll copy the outputs as well.

cp files/output.tf .

cat output.tf

The output is as follows.

output "cluster_name" {
  value = var.cluster_name
}

output "region" {
  value = var.region
}
That’s it. Everything we need to create and manage our EKS cluster is now neatly organized. It’s split into main.tf (contains all the providers and resources), variables.tf, and output.tf.

To demonstrate that Terraform does not care how we organize the definitions nor their order, we’ll apply them again.

terraform apply \
    --var k8s_version=$K8S_VERSION \
    --var release_version=$RELEASE_VERSION

The output, limited to the relevant parts, is as follows.

...
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
...
As you can see, there is nothing to add, change, or destroy. We did not change any of the definitions. We only organized them in a way I like.
Destroying The Resources

We are almost finished with the quick exploration of Terraform using EKS as an example. We saw how to add and change resources, and the only thing missing is to see how to destroy them.

If we’d like to delete some of the resources, all we’d have to do is remove their definitions, and execute terraform apply. However, in some cases, we might want to destroy everything. As you probably guessed, there is a command for that as well.

terraform destroy \
    --var k8s_version=$K8S_VERSION \
    --var release_version=$RELEASE_VERSION
At the end of the process, you might see an error stating that it couldn’t delete the bucket with Terraform state without force_destroy set to true. Don’t be alarmed. That’s normal. After Terraform destroyed everything, it tried to destroy the bucket where we keep the state. However, we did not specify that the bucket can be removed if it contains files. The process failed to remove that bucket, and that’s a good thing. That will allow us to recreate the same cluster in the sections that follow. The cluster and all the other resources we defined are now gone. The exception is storage with the state that we left intact and that we will continue using in the exercises that follow. Please note that we removed only the resources created through Terraform. Those that were created without it (e.g., keys) are still there. AWS will not charge you (almost) anything for them so, unlike those we created with Terraform, there is no good reason to remove them. On the other hand, you might want to use the definitions from this chapter to create a cluster that will be used for the exercises in the others. Keeping those created outside Terraform will simplify the process. All you’ll have to do is execute terraform apply. The last thing we’ll do is go out of the local copy of the repository.
cd ../../
Creating And Managing Azure Kubernetes Service (AKS) Clusters With Terraform

The goal of this book is to guide you through decisions. It aims to get you up-to-speed fast with a technology or a process. Since we already had a brief introduction into Terraform, we’ll skip the potentially lengthy introduction and go straight into the exercises. That will help you make a decision whether Terraform is the right choice and give you sufficient base knowledge that you’ll be able to extend on your own.

We’ll create an Azure Kubernetes Service (AKS) cluster and all the surrounding resources required for optimal usage of it, as well as for proper maintenance of the infrastructure through Terraform. We’re trying to understand how Terraform works, and what it’s suitable for. If you do use Kubernetes, you’ll end up with a reusable way to create and manage it in Azure. Nevertheless, that’s not the primary goal, so it doesn’t matter much whether Kubernetes is your thing or not. The main objective is to learn Terraform through practical examples.

Let’s go.
Preparing For The Exercises

All the commands from this chapter are available in the 01-03-terraform-aks.sh³⁹ Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.

The code and the configurations that will be used in this cluster are available in the GitHub repository vfarcic/devops-catalog-code⁴⁰. Let’s clone it. Feel free to skip the command that follows if you already cloned that repository.

³⁹https://gist.github.com/7d7ead3378d65b22eb1d3e13e53dd8d6
⁴⁰https://github.com/vfarcic/devops-catalog-code

git clone \
    https://github.com/vfarcic/devops-catalog-code.git
The code for this chapter is located in the terraform-aks directory.

cd devops-catalog-code

git pull

cd terraform-aks
We went to the local copy of the repository. We pulled the latest revision just in case you already had the repository from before, and I changed something in the meantime. Finally, we entered the directory with the code and configurations we’ll use in this chapter.
Exploring Terraform Variables

Generally speaking, entries in Terraform definitions are split into four groups. We have provider, resource, output, and variable entries. That is not to say that there aren’t other types (there are), but that those four are the most important and most commonly used ones. For now, we’ll focus on variables, and leave the rest for later.

All the configuration files we’ll use are in the files sub-directory. We’ll pull them out one by one and explore what they’re used for.
Let’s copy the file that defines the variables we’ll use and take a quick look.

cp files/variables.tf .

cat variables.tf

The output is as follows.

variable "region" {
  type    = string
  default = "eastus"
}

variable "resource_group" {
  type    = string
  default = "devops-catalog-aks"
}

variable "cluster_name" {
  type    = string
  default = "docatalog"
}

variable "dns_prefix" {
  type    = string
  default = "docatalog"
}

variable "k8s_version" {
  type = string
}

variable "min_node_count" {
  type    = number
  default = 3
}

variable "max_node_count" {
  type    = number
  default = 9
}

variable "machine_type" {
  type    = string
  default = "Standard_D1_v2"
}
If you focus on the names of the variables, you’ll notice that they are self-explanatory. We defined the region where our cluster will run. There’s the resource group (resource_group), the name of the cluster we want to create (cluster_name), and the DNS prefix. We’re setting the Kubernetes version we’d like to use (k8s_version), and we have the minimum and the maximum number of worker nodes (min_node_count and max_node_count). Finally, we defined the type of machines we’d like to use for worker nodes (machine_type). If you ever created an Azure Kubernetes Service (AKS) cluster, all those should be familiar. If you haven’t, I’m sure that you could have guessed what each of those variables means.

What matters is that each of those variables has the type (e.g., string, number, bool), and the default value. The only exception is k8s_version that does not have anything set by default. That means that we’ll have to set it to some value at runtime. The reason for that is twofold. To begin with, I wanted to show you the difference between variables with and without default values. Also, I could not be sure which Kubernetes versions will be available at the time you’re going through the exercises in this chapter. Later on, we’ll see the effect of not having a default value. For now, just remember that the variable k8s_version doesn’t have a default value.

I like to think of variables as a blueprint of what I’d like to accomplish. They define the information that I believe could be changed over time, and they drive the goals of infrastructure-as-code I tend to define later. For example, I want the cluster to be scalable within specific limits, so I set variables min_node_count and max_node_count.

Others tend to take a different approach and refactor parts of resource definitions to use variables. They hard-code values initially and then replace those with variables later.
All in all, now we have the file that defines all the variables. Some represent the aspects of the cluster that are likely going to change over time. Others (e.g., cluster_name) are probably never going to change. I defined them as variables in an attempt to create definitions that could be useful to other teams.
Creating The Credentials

Now is the time to deal with the prerequisites. Even though the goal is to use Terraform for all infrastructure-related tasks, we’ll need a few things specific to Azure. To be more precise, we’ll need to create a resource group. But, before we do that, I need to make sure that you are registered and logged in to Azure, and that you have the az CLI.

If this is the first time you’re using Azure, please open the portal⁴¹ and follow the instructions to sign up. If you’re planning to use an account provided by your company, please make sure that you have the permissions to create all the resources we’ll use in the exercises that follow.

To install the Azure (az) CLI, please go to the Install the Azure CLI⁴² page of the documentation and follow the instructions for your operating system.

⁴¹https://portal.azure.com/
⁴²https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
Creating And Managing Azure Kubernetes Service (AKS) Clusters With Terraform
Now that you have an Azure account and az on your laptop, we’ll have to log in before we start creating “stuff”.
az login
A new page should have opened in the browser asking for authentication. Follow the instructions.

Now we’re ready, and we can proceed by creating a new resource group. If you’re new to Azure, you should know that everything is organized in resource groups. They provide the means to group resources. Given that I could not guess whether you already have a resource group you’d like to use and, if you do, what its name is, we’ll create a new one for the exercises in this and, potentially, in other chapters.

If a resource group is deleted, a new one cannot have the same name right away.

az group create \
    --name devops-catalog-aks \
    --location eastus
We created a resource group called devops-catalog-aks. To be on the safe side, we’ll list all the resource groups we have and confirm that the newly created one is there.

az group list

The output, limited to the relevant parts, is as follows.

[
  ...
  {
    "id": "/subscriptions/.../resourceGroups/devops-catalog-aks",
    "location": "eastus",
    "managedBy": null,
    "name": "devops-catalog-aks",
    "properties": {
      "provisioningState": "Succeeded"
    },
    "tags": null,
    "type": "Microsoft.Resources/resourceGroups"
  }
]
You might have other resource groups listed in that output. They do not matter in this context. As long as devops-catalog-aks is there, we should be able to proceed.

Almost everything we do needs a provider. In this case, we’ll need azurerm. If you don’t know what a provider is, and if you are not good at guessing, I’ll explain it in one sentence. It allows us to configure the credentials used to authenticate with Azure, and a few other things. If that’s not enough of an explanation, please visit the Azure Provider⁴³ section of the Terraform documentation for more details.

Let’s copy the provider definition I prepared, and take a quick look.

cp files/provider.tf .

cat provider.tf

The output is as follows.

provider "azurerm" {
  features {}
}
The provider is almost empty. Normally, we would specify the Client ID, Subscription ID, or Tenant ID, but we’re not going to do that right now. Instead, we’ll rely on the authentication of the az CLI we obtained when we executed az login. That should make things much simpler, at least when running Terraform from a laptop. Just bear in mind that you might change that strategy if you decide to run Terraform as, for example, part of automated pipelines triggered by making changes to definitions in a Git repository.

The only mandatory argument is features. Since we don’t need any, and that argument cannot be skipped, we’re setting it to an empty block ({}).

Now we are ready to apply the Terraform definitions we created so far. In Terraform terms, apply means “create what’s missing, update what’s different, and delete what’s not needed anymore.”

terraform apply

You’ll notice that Terraform asked you to enter a value for k8s_version. That’s the only variable we defined without a default value, so Terraform expects us to provide one. We’ll deal with that later. For now, and until we get to the part that deals with it, press the enter key whenever you’re asked that question. An empty value should be enough for now. We are not yet using that variable in any of the definitions.

⁴³https://www.terraform.io/docs/providers/azurerm/index.html

The output, limited to the relevant parts, is as follows.

...
Error: Could not satisfy plugin requirements

Plugin reinitialization required. Please run "terraform init".
...
Error: provider.azurerm: no suitable version installed
  version requirements: "(any version)"
  versions installed: none
Most of Terraform’s functionality is provided by plugins. More often than not, we need to figure out which plugin suits our needs and download it. In this case, we’re using the azurerm provider. Fortunately, Terraform will do the job of downloading the plugin. We just need to initialize the project.

terraform init

The output, limited to the relevant parts, is as follows.

Initializing the backend...

Initializing provider plugins...
- Checking for available provider plugins...
- Downloading plugin for provider "azurerm" (hashicorp/azurerm) 2.7.0...
...
We can see that Terraform detected that we want to use the plugin for provider "azurerm" and downloaded it. Now we should be able to apply that definition.
terraform apply
Remember to continue answering with the enter key whenever you’re asked to provide the value for k8s_version.
The output is as follows.
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
The output confirmed that the apply is complete, and we can see that it did not add, change, or destroy anything. That was to be expected. So far, we did not specify that we want to have any Azure resources. We defined some variables, and we specified that we’d like to use the azurerm provider.
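For the pipeline scenario mentioned earlier, where az login is not available, the provider could be given explicit credentials instead. The following is only a sketch of that alternative, not the definition used in this chapter. It would replace the block in provider.tf, and the four variable names are hypothetical ones introduced here for illustration.

```hcl
# Sketch of an alternative provider.tf for non-interactive runs.
# The variable names are hypothetical; they are not defined in the
# files used in this chapter.
variable "subscription_id" {
  type = string
}

variable "tenant_id" {
  type = string
}

variable "client_id" {
  type = string
}

variable "client_secret" {
  type = string
}

provider "azurerm" {
  features {}

  # Service principal credentials supplied at runtime,
  # instead of relying on a prior az login.
  subscription_id = var.subscription_id
  tenant_id       = var.tenant_id
  client_id       = var.client_id
  client_secret   = var.client_secret
}
```

Leaving those variables without defaults means the secret never has to be written into a file that ends up in Git. The pipeline would supply the values when it runs Terraform.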
Storing The State In A Remote Backend

Terraform maintains its internal information about the current state. That allows it to deduce what needs to be done and to converge the actual into the desired state defined in *.tf files. Currently, that state is stored locally in the terraform.tfstate file. For now, you shouldn’t see anything exciting in it.

cat terraform.tfstate

The output is as follows.

{
  "version": 4,
  "terraform_version": "0.12.12",
  "serial": 1,
  "lineage": "02968943-...",
  "outputs": {},
  "resources": []
}
The field that really matters is resources. It is empty because we did not define any. We will, soon, but it’s likely not going to be something you expect. We are not going to create anything related to our AKS cluster. At least not right away. What we need right now is a storage bucket, or, to use Azure terms, we need a storage account and a storage container.

Keeping Terraform’s state locally is a bad idea. If it’s on a laptop, we won’t be able to allow others to modify the state of our resources. We would need to send them the terraform.tfstate file by email, keep it on some network drive, or something similar. That is impractical. We might be tempted to store it in Git, but that would not be secure. Instead, we’ll tell Terraform to keep the state in a storage container.

Since we’re trying to define infrastructure as code, we won’t do that by executing a shell command, nor will we go to the Azure console. We’ll tell Terraform to create the container. It will be the first resource managed by it.

We are about to explore the azurerm_storage_account and azurerm_storage_container resources. They allow us to manage Azure storage, and you should be able to find more information in the azurerm_storage_account⁴⁴ and azurerm_storage_container⁴⁵ documentation.

Let’s copy the file I prepared, and take a look at the definition.

⁴⁴https://www.terraform.io/docs/providers/azurerm/r/storage_account.html
⁴⁵https://www.terraform.io/docs/providers/azurerm/r/storage_container.html
cp files/storage.tf .

cat storage.tf

The output is as follows.

resource "azurerm_storage_account" "state" {
  name                     = "devopscatalog"
  resource_group_name      = var.resource_group
  location                 = var.region
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_storage_container" "state" {
  name                  = "devopscatalog"
  storage_account_name  = azurerm_storage_account.state.name
  container_access_type = "blob"
}
We’re defining a storage account and a storage container, and both are referenced as state. All resource entries are followed by a type (e.g., azurerm_storage_account) and a reference (e.g., state). We’ll see the usage of a reference later in one of the upcoming definitions.

Just as with the provider, the resource has several fields. Some are mandatory, while others are optional and often have predefined values. We’re defining the name and the resource_group_name for the account. We also specified that it should be created inside a specific location. Finally, we set the tier (account_tier) and the replication type (account_replication_type).

The container also has a name (almost all resources do). Since a container always lives inside an account, we have to specify the storage_account_name. And that is where resource references come into play. Instead of hard-coding the name of the account, we’re using the name of the azurerm_storage_account referenced as state. Finally, we set the access type (container_access_type) to blob.

Just as before, the values of some of those fields are defined as variables. Others (those less likely to change) are hard-coded.

Feel free to explore those two resource types in more detail through the Terraform documentation. But do that later. For now, we’ll move forward and apply the new definition.

terraform apply

The output, limited to the relevant parts, is as follows.

...
Terraform will perform the following actions:

  # azurerm_storage_account.state will be created
  + resource "azurerm_storage_account" "state" {
  ...
  # azurerm_storage_container.state will be created
  + resource "azurerm_storage_container" "state" {
  ...
Plan: 2 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
In this case, we can see the full list of all the resources that will be created. The + sign indicates that something will be created. Under different conditions, we could also observe those that would be modified (~) or destroyed (-). Right now, Terraform deduced that the actual state is missing the azurerm_storage_account and azurerm_storage_container resources. It also shows us which properties will be used to create those resources. Some were defined by us, while others will be known after we apply that definition.

Finally, we are asked whether we want to perform these actions. Be brave and type yes, followed by the enter key.

From now on, I will assume that there’s no need for me to tell you to confirm Terraform actions. Answer with yes whenever you’re asked to do so.

After we choose to proceed, the relevant parts of the output should be as follows.

...
Apply complete! Resources: 2 added, 0 changed, 0 destroyed.
We can see that 2 resources were added and that nothing was changed or destroyed. Since this is the first time we created a resource with Terraform, it would be reasonable to be skeptical. So, we’ll confirm that the account was indeed created by listing all the storage accounts. Over time, we’ll gain confidence in Terraform and will not have to validate that everything works correctly.
az storage account list
We can see from the output that the storage account devopscatalog exists. You might have others, but they are not relevant for our exercises. Next, let’s see whether the storage container was created inside the account.

az storage container list \
    --account-name devopscatalog

The output is as follows.

[
  {
    "metadata": null,
    "name": "devopscatalog",
    "properties": {
      "etag": "\"0x8D7E2DD42155C0F\"",
      "hasImmutabilityPolicy": "false",
      "hasLegalHold": "false",
      "lastModified": "2020-04-17...",
      "lease": {
        "duration": null,
        "state": null,
        "status": null
      },
      "leaseDuration": null,
      "leaseState": "available",
      "leaseStatus": "unlocked",
      "publicAccess": "blob"
    }
  }
]
The storage container was indeed created. Hurray!

Let’s imagine that someone else executed terraform apply and that we are not sure what the state of the resources is. In such a situation, we can consult Terraform by asking it to show us the state.
terraform show
The output, limited to the relevant parts, is as follows.
# azurerm_storage_account.state:
resource "azurerm_storage_account" "state" {
...
# azurerm_storage_container.state:
resource "azurerm_storage_container" "state" {
...
As you can see, there’s not much to look at. For now, we have only two resources (azurerm_storage_account and azurerm_storage_container). As we keep progressing, that output will keep growing and, more importantly, it will always reflect the state of the resources managed by Terraform.

The previous output is a human-readable format of the state currently stored in terraform.tfstate. We can inspect that file as well.
cat terraform.tfstate
Parts of the output are as follows.

{
  "version": 4,
  "terraform_version": "0.12.12",
  "serial": 4,
  "lineage": "02968943-...",
  "outputs": {},
  "resources": [
    {
      "mode": "managed",
      "type": "azurerm_storage_account",
      "name": "state",
      "provider": "provider.azurerm",
      "instances": [
        ...
      ]
    },
    {
      "mode": "managed",
      "type": "azurerm_storage_container",
      "name": "state",
      "provider": "provider.azurerm",
      "instances": [
        ...
      ]
    }
  ]
}
If we ignore the fields that are currently empty, and the few that are for Terraform’s internal usage, we can see that the state stored in that file contains the same information as what we saw through terraform show. The only important difference is that one is in Terraform’s internal format (terraform.tfstate), while the other (terraform show) is meant to be readable by humans.

Even though that’s not the case right now, the state might easily contain some confidential information. It is currently stored locally, and we already decided to move it to an Azure storage container. That way we’ll be able to share it, it will be stored in a more reliable location, and it will be more secure.

To move the state to the storage container, we’ll create an azurerm⁴⁶ backend. As you can probably guess, I already prepared a file just for that.

cp files/backend.tf .

cat backend.tf

The output is as follows.

terraform {
  backend "azurerm" {
    resource_group_name  = "devops-catalog-aks"
    storage_account_name = "devopscatalog"
    container_name       = "devopscatalog"
    key                  = "terraform.tfstate"
  }
}
There’s nothing special in that definition. We’re setting the resource group (resource_group_name), the storage account name (storage_account_name), the container name (container_name), and the key, which represents the blob where Terraform state will be stored.

Let’s apply the definitions and see what we’ll get.
terraform apply
The output, limited to the relevant parts, is as follows.
⁴⁶https://www.terraform.io/docs/backends/types/azurerm.html
Backend reinitialization required. Please run "terraform init".
Reason: Initial configuration of the requested backend "azurerm"
...
Since we are changing the location where Terraform should store the state, we have to initialize the project again. The last time we did that, it was because a plugin (azurerm) was missing. This time it’s because the init process will copy the state from the local file to the newly created storage container.
terraform init
The output, limited to the relevant parts, is as follows.

Initializing the backend...
Do you want to copy existing state to the new backend?
...
  Enter a value: yes

Please confirm that you do want to copy the state by typing yes and pressing the enter key. The process will continue and copy the state to the remote storage, which, from now on, will be used instead of the local file.

Now we should be able to apply the definitions.
terraform apply
The output is as follows.

...
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
As we can see, there was no need to apply the definitions. The latest addition does not define any new resources. We only added the location for the Terraform state. That change is internal, and it was applied through the init process.
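Since the state can hold confidential information, it may be worth checking how the container that stores it is exposed. In the azurerm provider, container_access_type accepts blob, container, and private, and the blob level permits anonymous read access to individual blobs when the storage account allows public access. The following is a sketch of a more restrictive variation of the container from storage.tf. It is not the definition used in this chapter, and it assumes that everyone who runs Terraform authenticates (for example, through az login) rather than relying on anonymous access.

```hcl
# Sketch of a more restrictive variation of the container in storage.tf.
# Assumption: all access to the state goes through authenticated clients.
resource "azurerm_storage_container" "state" {
  name                  = "devopscatalog"
  storage_account_name  = azurerm_storage_account.state.name
  container_access_type = "private" # no anonymous read access
}
```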
Creating The Control Plane

Now we have all the prerequisites. The provider is set to azurerm, and we have the backend (for the state) pointing to the blob in the storage container. We can turn our attention to the AKS cluster itself.

A Kubernetes cluster (almost) always consists of a control plane and one or more pools of worker nodes. In the case of AKS, we can set the default worker node pool while creating the control plane, and others can be added separately. We’ll start with the control plane and the default node pool, and move towards additional pools later.

We can use the azurerm_kubernetes_cluster⁴⁷ resource to create an AKS control plane and a default worker node pool.
cp files/k8s-control-plane.tf .

cat k8s-control-plane.tf

The output is as follows.

resource "azurerm_kubernetes_cluster" "primary" {
  name                = var.cluster_name
  location            = var.region
  resource_group_name = var.resource_group
  dns_prefix          = var.dns_prefix
  kubernetes_version  = var.k8s_version
  default_node_pool {
    name                = var.cluster_name
    vm_size             = var.machine_type
    enable_auto_scaling = true
    max_count           = var.max_node_count
    min_count           = var.min_node_count
  }
  identity {
    type = "SystemAssigned"
  }
}
The meaning of most (if not all) of the fields is probably easy to guess. What matters is that we are creating not only the control plane but also the default worker node pool through the default_node_pool block. While, right now (May 2020), it is optional, it is scheduled to become mandatory soon, so we’ll keep it in an attempt to be future-proof. The worker node pool will be scaling automatically, and it will oscillate, depending on the workload, between the specified minimum and maximum number of nodes.

There’s one important thing we need to do before we apply the definitions. We will not be able to keep pressing the enter key when asked which version of Kubernetes we want to have. We could get away with that before because we were not creating a Kubernetes cluster. Now, however, we do have to provide a valid version. But, which one is it?

⁴⁷https://www.terraform.io/docs/providers/azurerm/r/kubernetes_cluster.html

Instead of guessing which Kubernetes versions are available in AKS, we’re going to ask Azure to output the list of all those that are currently supported in our region.

az aks get-versions --location eastus
The output, limited to the last few entries, is as follows.

...
    {
      "default": null,
      "isPreview": null,
      "orchestratorType": "Kubernetes",
      "orchestratorVersion": "1.16.7",
      "upgrades": [
        {
          "isPreview": true,
          "orchestratorType": "Kubernetes",
          "orchestratorVersion": "1.17.3"
        }
      ]
    },
    {
      "default": null,
      "isPreview": true,
      "orchestratorType": "Kubernetes",
      "orchestratorVersion": "1.17.3",
      "upgrades": null
    }
  ],
  "type": "Microsoft.ContainerService/locations/orchestrators"
}
We can see that there are quite a few versions (orchestratorVersion) we can choose from. We can also see what the next upgrade version is. Kubernetes recommends that we always upgrade to the next minor version, and that output helps us understand which one it should be. If we choose the latest, we will not be able to upgrade it until Azure adds support for a newer Kubernetes version.

Pick any of the valid versions (orchestratorVersion), except the newest one. You’ll see later why it cannot be the most recent version. If you have difficulty making a decision, the second to newest is a good option.

Since we are likely going to have to provide a valid Kubernetes version to all the commands we’ll execute from now on, we’ll store it in an environment variable.

Please replace [...] with the selected version in the command that follows.
export K8S_VERSION=[...] # e.g., 1.16.7
Now we should be able to apply the definition and create the control plane. Hopefully, there is nothing else missing.

terraform apply \
    --var k8s_version=$K8S_VERSION

The output, limited to the relevant parts, is as follows.

...
  # azurerm_kubernetes_cluster.primary will be created
  + resource "azurerm_kubernetes_cluster" "primary" {
  ...
  }
...
  Enter a value: yes
...
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
As expected, yet another resource was added, and none were changed or destroyed.
Exploring Terraform Outputs

We’ll retrieve the nodes of the newly created Kubernetes cluster and see what we’ve got. But, before we do that, we need to create a kubeconfig file that will provide kubectl the information on how to access the cluster. We could do that right away with az CLI, but we’ll make it a bit more complicated.

To create kubeconfig, we need to know the name of the cluster, and the resource group in which it is running. We might have that information in our heads. But, we’ll imagine that’s not the case. I’ll assume that you forgot it, or that you did not pay attention. That will give me a perfect opportunity to introduce you to yet another Terraform feature.

We can define outputs with the information we need, as long as that information is available in Terraform state.

cp files/output.tf .

cat output.tf

The output is as follows.

output "cluster_name" {
  value = var.cluster_name
}

output "region" {
  value = var.region
}

output "resource_group" {
  value = var.resource_group
}
We’re specifying which data should be output by Terraform. Such outputs are generated at the end of the terraform apply process, and we’ll see that later. For now, we’re interested only in the outputs themselves, since they let us deduce the name of the cluster and the resource group we need to retrieve the credentials for kubeconfig.

If we want to see all the outputs, we can simply refresh. That would update the state file with the information about the physical resources Terraform is tracking and, more importantly, show us those outputs.

terraform refresh \
    --var k8s_version=$K8S_VERSION

The output, limited to the relevant parts, is as follows.

...
Outputs:

cluster_name = docatalog
region = eastus
resource_group = devops-catalog-aks
We can clearly see the name of the cluster, the region, and the resource group. But that’s not what we really need. We’re not interested in seeing that information, but rather in using it to construct the command that will retrieve the credentials. We can accomplish that with the terraform output command.
terraform output cluster_name
The output is as follows.
docatalog
Now we know how to retrieve the output of a single value, so let’s use that to construct the command that will retrieve the credentials.

export KUBECONFIG=$PWD/kubeconfig

az aks get-credentials \
    --name \
    $(terraform output cluster_name) \
    --resource-group \
    $(terraform output resource_group) \
    --file \
    $KUBECONFIG
We specified that kubeconfig should be in the current directory by exporting the environment variable KUBECONFIG. Further on, we retrieved the credentials using az. What matters, apart from the obvious need to retrieve the credentials, is that we used terraform output to retrieve the data we need and pass them to az.

Now we should be able to check the cluster that Terraform created for us.
kubectl get nodes
The output is as follows.

NAME              STATUS ROLES AGE VERSION
aks-docatalog-... Ready  agent 93m v1.16.7
aks-docatalog-... Ready  agent 93m v1.16.7
aks-docatalog-... Ready  agent 92m v1.16.7
We can see that there are three worker nodes in the cluster. That number coincides with the minimum number of nodes we specified for the default worker node pool. Our cluster is now ready for use. Nevertheless, we should explore how to create additional worker node pools.
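The outputs in output.tf echo values we passed in as variables. Outputs can also expose attributes that only exist once a resource has been created. The following is a minimal sketch of that idea, with output names that are hypothetical and not part of the files used in this chapter. The id and fqdn attributes are among those exported by azurerm_kubernetes_cluster.

```hcl
# Hypothetical additions to output.tf; the output names are
# illustrative and not part of the book's files.

# The Azure resource ID of the cluster, known only after creation.
output "cluster_id" {
  value = azurerm_kubernetes_cluster.primary.id
}

# The fully qualified domain name assigned to the cluster.
output "cluster_fqdn" {
  value = azurerm_kubernetes_cluster.primary.fqdn
}
```

Such values could then be retrieved in the same way as before, for example with terraform output cluster_fqdn.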
Creating Worker Nodes

We can manage worker node pools (beyond the default one) through the azurerm_kubernetes_cluster_node_pool⁴⁸ resource. As you can expect, I prepared yet another definition that we can use.

⁴⁸https://www.terraform.io/docs/providers/azurerm/r/kubernetes_cluster_node_pool.html
cp files/k8s-worker-nodes.tf .

cat k8s-worker-nodes.tf

The output is as follows.

resource "azurerm_kubernetes_cluster_node_pool" "secondary" {
  name                  = "${var.cluster_name}2"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.primary.id
  vm_size               = var.machine_type
  enable_auto_scaling   = true
  max_count             = var.max_node_count
  min_count             = var.min_node_count
}
At the top, we are defining the name of the node pool. The ID of the Kubernetes cluster (kubernetes_cluster_id) it should be attached to is interesting, though. Instead of hard-coding it, we’re telling Terraform to use the id field of the azurerm_kubernetes_cluster.primary resource. Further on, we have the size of the VMs (vm_size), whether auto scaling should be enabled (enable_auto_scaling), and the maximum (max_count) and minimum (min_count) number of nodes.

Let’s apply the definitions, including the new one, and see what we’ll get.

terraform apply \
    --var k8s_version=$K8S_VERSION

The output, limited to the relevant parts, is as follows.

...
Terraform will perform the following actions:

  # azurerm_kubernetes_cluster_node_pool.secondary will be created
  + resource "azurerm_kubernetes_cluster_node_pool" "secondary" {
  ...
Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value:
The process started by presenting us with all the changes required to converge the actual into the desired state. Since we did not change any of the existing definitions, the only modification to the desired state is the addition of the azurerm_kubernetes_cluster_node_pool referenced as secondary.

Confirm that you want to proceed by typing yes and pressing the enter key, and the process will continue.

The output, limited to the relevant parts, is as follows.

...
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

cluster_name = docatalog
region = eastus
resource_group = devops-catalog-aks
It finished by adding one resource, and without changing or destroying anything. At the end of it, we got the familiar output with the name of the cluster, the region, and the resource group.

Let’s see what we’ll get this time when we retrieve the nodes.
kubectl get nodes
The output is as follows.

NAME               STATUS ROLES AGE   VERSION
aks-docatalog-...  Ready  agent 103m  v1.16.7
aks-docatalog-...  Ready  agent 103m  v1.16.7
aks-docatalog-...  Ready  agent 102m  v1.16.7
aks-docatalog2-... Ready  agent 2m1s  v1.16.7
aks-docatalog2-... Ready  agent 2m30s v1.16.7
aks-docatalog2-... Ready  agent 112s  v1.16.7
We created a second node pool, and, as a result, we got three new nodes, in addition to those available through the default node pool. That’s it. We created a cluster using infrastructure as code with Terraform. But, now we might have more nodes than what we need for our exercises. I wanted to show you how to create additional node pools. We might end up with a need to have nodes based on VMs of different types. For example, some processes might need faster CPUs than others. There could be many other reasons, but, as I already mentioned, we won’t need more than the nodes provided by the default node pool. So, we’ll delete the additional pool that we just created.
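Before we remove it, here is a sketch of what a pool based on a different type of VM could look like, for the case where some processes need faster CPUs. It is not one of the prepared files. The reference compute, the pool name, and the VM size are illustrative assumptions, and the size would have to be available in your region and subscription.

```hcl
# Hypothetical node pool with a different VM type; not part of the
# book's files. The pool name and VM size are illustrative assumptions.
resource "azurerm_kubernetes_cluster_node_pool" "compute" {
  name                  = "compute"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.primary.id
  vm_size               = "Standard_F8s_v2" # a compute-optimized size
  enable_auto_scaling   = true
  max_count             = var.max_node_count
  min_count             = var.min_node_count
}
```

Apart from the name and vm_size, it mirrors the secondary pool, including the reference to the id of the cluster it attaches to.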
Whenever any part of the definitions changes, Terraform will converge the actual into the desired state. We already saw that in action when we were adding resources. We can accomplish the opposite effect by removing them. For example, if we don’t want to have the non-default node pool, all we have to do is remove the resource from the definitions.

Let’s remove the k8s-worker-nodes.tf file that contains the definition of the additional node pool, and apply the changes.

rm -f k8s-worker-nodes.tf

terraform apply \
    --var k8s_version=$K8S_VERSION

The output is as follows.

...
  # azurerm_kubernetes_cluster_node_pool.secondary will be destroyed
  - resource "azurerm_kubernetes_cluster_node_pool" "secondary" {
  ...

Plan: 0 to add, 0 to change, 1 to destroy.
...
  Enter a value:
We can see from the - sign that the resource azurerm_kubernetes_cluster_node_pool.secondary will be removed. That is further confirmed with the information from the plan stating that 1 resource will be destroyed.

Please type yes, and press the enter key. Be patient until the process finishes.

Let’s retrieve the nodes and confirm that those from the recently removed node pool are now gone.
kubectl get nodes
The output is as follows.

NAME              STATUS ROLES AGE  VERSION
aks-docatalog-... Ready  agent 111m v1.16.7
aks-docatalog-... Ready  agent 110m v1.16.7
aks-docatalog-... Ready  agent 110m v1.16.7
We can see that the nodes from the non-default node pool are now gone.

This is the moment when we should push the changes to Git and ensure that they are available to whoever might need to change our cluster and the surrounding infrastructure. I’ll assume that you know how to work with Git, so we’ll skip this part. Just remember that, from now on, we should be pushing all the changes to Git. Even better, we should be creating pull requests so that others can review them before merging them to the master branch. Ideally, we’d do that through one of the continuous delivery tools. But that’s out of the scope (at least for now).
Upgrading The Cluster

Changing any aspect of the resources we created is easy and straightforward. All we have to do is modify Terraform definitions, and apply the changes. We could add resources, we could remove them, or we could change them in any way we want.

To illustrate that, we’ll upgrade the Kubernetes version. But, before we do that, let’s see which version we’re running right now.
kubectl version --output yaml
The output, limited to the relevant parts, is as follows.

...
serverVersion:
  ...
  gitVersion: v1.16.7
  ...
I am currently running Kubernetes version v1.16.7 (yours might be different). To upgrade the version, we need to find which newer versions are available.

az aks get-versions \
    --location \
    $(terraform output region)

The output, limited to the relevant parts, is as follows.

...
    {
      "default": true,
      "isPreview": null,
      "orchestratorType": "Kubernetes",
      "orchestratorVersion": "1.16.7",
      "upgrades": [
        {
          "isPreview": null,
          "orchestratorType": "Kubernetes",
          "orchestratorVersion": "1.17.3"
        }
      ]
    },
    {
      "default": null,
      "isPreview": true,
      "orchestratorType": "Kubernetes",
      "orchestratorVersion": "1.17.3",
      "upgrades": null
    }
  ],
  "type": "Microsoft.ContainerService/locations/orchestrators"
}
In my case, there is indeed a newer version 1.17.3 (yours might be different). So, we’ll change the value of the environment variable K8S_VERSION we used so far. Please replace [...] with the selected newer version in the command that follows.
export K8S_VERSION=[...]
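
Picking the value for K8S_VERSION by reading the JSON can also be done programmatically. What follows is a minimal sketch in Python, not something taken from the book’s repository. It assumes the list of version entries from az aks get-versions was already loaded (for example, with json.loads). The function name available_upgrades and the sample data are made up for illustration.

```python
def available_upgrades(entries, current_version):
    """Return the versions the entry for current_version can be upgraded to."""
    for entry in entries:
        if entry["orchestratorVersion"] == current_version:
            # "upgrades" is null (None in Python) when nothing newer is offered.
            return [u["orchestratorVersion"] for u in entry["upgrades"] or []]
    return []


# Hand-written sample shaped like the entries in the output shown above.
entries = [
    {"orchestratorVersion": "1.16.7",
     "upgrades": [{"orchestratorVersion": "1.17.3"}]},
    {"orchestratorVersion": "1.17.3", "upgrades": None},
]

print(available_upgrades(entries, "1.16.7"))  # ['1.17.3']
```

Any of the returned versions would be a candidate for K8S_VERSION.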

As I already mentioned, we should try to avoid changing aspects of Terraform definitions through --var arguments. Instead, we should modify variables.tf, push the change to Git, and then apply it. But, we will use --var for simplicity. The result will be the same as if we changed that value in the variables.tf.

terraform apply \
    --var k8s_version=$K8S_VERSION

The output, limited to the relevant parts, is as follows.

...
  # azurerm_kubernetes_cluster.primary will be updated in-place
  ~ resource "azurerm_kubernetes_cluster" "primary" {
        ...
      ~ kubernetes_version = "1.16.7" -> "1.17.3"
        location           = "eastus"
        ...
Plan: 0 to add, 1 to change, 0 to destroy.
...
  Enter a value:

This time, we are not adding resources, but updating some of the properties. We can see that through the ~ sign next to those that will be modified. In this case, we are about to change the definition of the azurerm_kubernetes_cluster.primary resource. Specifically, we are modifying kubernetes_version. All the other properties will stay intact. We can observe the same in the Plan section that states that there is nothing to add, that there is 1 resource to change and that there is nothing to destroy.

Type yes and press the enter key.

It will take a while until the rest of the process is finished. It is performing the rolling upgrade by draining and shutting down one node at a time and creating new ones based on the newer version. On top of that, it needs to confirm that the system is healthy before continuing with the next iteration.

Once the process is finished, we can confirm that it was indeed successful by outputting the current version.

kubectl version --output yaml

The output, limited to the relevant parts, is as follows.

...
serverVersion:
  ...
  gitVersion: v1.17.3
  ...

We can see that, this time, we are running a different (newer) version of the cluster. Hurray!

Dealing With A Bug That Prevents Upgrade Of Node Pools

You might have been a victim of a bug described in the issue 5541⁴⁹. The short version is that the worker node pool might not have been upgraded together with the control plane. Let us see whether that is indeed the case.

kubectl get nodes

The output, in my case, is as follows.

NAME              STATUS ROLES AGE  VERSION
aks-docatalog-... Ready  agent 111m v1.16.7
aks-docatalog-... Ready  agent 110m v1.16.7
aks-docatalog-... Ready  agent 110m v1.16.7

In my case, we can see that the nodes are running the old Kubernetes version. They were not upgraded. If, in your case, the nodes are running the correct (the newer) version, the bug was fixed, and you can skip this section.

The good news is that there is already a pull request 6047⁵⁰ with a potential fix. Until it is merged, we’ll overcome the issue by applying a workaround. We’ll upgrade the node pool manually.

az aks nodepool upgrade \
    --cluster-name \
    $(terraform output cluster_name) \
    --name \
    $(terraform output cluster_name) \
    --resource-group \
    $(terraform output resource_group) \
    --kubernetes-version \
    $K8S_VERSION

It will take a while until all the nodes are upgraded. Once the process is finished, we can confirm that it worked by outputting the nodes (again).
⁴⁹https://github.com/terraform-providers/terraform-provider-azurerm/issues/5541
⁵⁰https://github.com/terraform-providers/terraform-provider-azurerm/pull/6047

kubectl get nodes

The output, in my case, is as follows.

NAME              STATUS ROLES AGE  VERSION
aks-docatalog-... Ready  agent 134m v1.17.3
aks-docatalog-... Ready  agent 133m v1.17.3
aks-docatalog-... Ready  agent 133m v1.17.3

The nodes were indeed upgraded. Please note that this is not the final solution. We should control everything, including upgrades of node pools, through Terraform. The action we just performed is only a quick fix until the issue is resolved.
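
The check we just did by eye (comparing the VERSION column against the version we asked for) can be automated. The snippet below is only a sketch in Python that works on the text printed by kubectl get nodes. The function name outdated_nodes and the node names in the sample are hypothetical.

```python
def outdated_nodes(kubectl_output, expected_version):
    """Return names of nodes whose VERSION column differs from expected_version."""
    lines = kubectl_output.strip().splitlines()
    header = lines[0].split()
    name_col = header.index("NAME")
    version_col = header.index("VERSION")
    stale = []
    for line in lines[1:]:
        fields = line.split()
        if fields[version_col] != expected_version:
            stale.append(fields[name_col])
    return stale


# Made-up sample in the same layout as the output above.
sample = """\
NAME              STATUS   ROLES   AGE    VERSION
aks-docatalog-0   Ready    agent   134m   v1.17.3
aks-docatalog-1   Ready    agent   133m   v1.16.7
aks-docatalog-2   Ready    agent   133m   v1.17.3
"""

print(outdated_nodes(sample, "v1.17.3"))  # ['aks-docatalog-1']
```

An empty list would mean that every node already runs the expected version and that the workaround is not needed.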

Reorganizing The Definitions

Every resource we defined so far is currently in a different file. That is a perfectly valid way to use Terraform. It doesn’t really care whether we have one or one thousand files. It concatenates all those with the tf extension.

What makes Terraform unique is its dependency management. No matter how we organize definitions of different resources, it will figure out the dependency tree, and it will create, update, or delete resources in the correct order. That way, we do not need to bother planning what should be created first, or in which order those resources are defined. That gives us the freedom to work and to organize in a myriad of ways.

I, for example, tend to have only three files; one for variables, one for outputs, and one for all the providers and resources. Given that I decide what the exercises look like, we’re going to reorganize things my way.

We’ll start by removing all Terraform definitions.

rm -f *.tf
Next, we’ll concatenate all providers and resources into a single file main.tf.

cat \
    files/backend.tf \
    files/k8s-control-plane.tf \
    files/provider.tf \
    files/storage.tf \
    | tee main.tf

The output is as follows.

terraform {
  backend "azurerm" {
    resource_group_name  = "devops-catalog-aks"
    storage_account_name = "devopscatalog"
    container_name       = "devopscatalog"
    key                  = "terraform.tfstate"
  }
}
resource "azurerm_kubernetes_cluster" "primary" {
  name                = var.cluster_name
  location            = var.region
  resource_group_name = var.resource_group
  dns_prefix          = var.dns_prefix
  kubernetes_version  = var.k8s_version
  default_node_pool {
    name                = var.cluster_name
    vm_size             = var.machine_type
    enable_auto_scaling = true
    max_count           = var.max_node_count
    min_count           = var.min_node_count
  }
  identity {
    type = "SystemAssigned"
  }
}
provider "azurerm" {
  features {}
}
resource "azurerm_storage_account" "state" {
  name                     = "devopscatalog"
  resource_group_name      = var.resource_group
  location                 = var.region
  account_tier             = "Standard"
  account_replication_type = "LRS"
  lifecycle {
    prevent_destroy = true
  }
}

resource "azurerm_storage_container" "state" {
  name                  = "devopscatalog"
  storage_account_name  = azurerm_storage_account.state.name
  container_access_type = "blob"
  lifecycle {
    prevent_destroy = true
  }
}

Next, we’ll copy the variables.

cp files/variables.tf .

cat variables.tf

The output is as follows.

variable "region" {
  type    = string
  default = "eastus"
}

variable "resource_group" {
  type    = string
  default = "devops-catalog-aks"
}

variable "cluster_name" {
  type    = string
  default = "docatalog"
}

variable "dns_prefix" {
  type    = string
  default = "docatalog"
}

variable "k8s_version" {
  type = string
}

variable "min_node_count" {
  type    = number
  default = 3
}

variable "max_node_count" {
  type    = number
  default = 9
}

variable "machine_type" {
  type    = string
  default = "Standard_D1_v2"
}

Finally, we’ll copy the outputs as well.

cp files/output.tf .

cat output.tf

The output is as follows.

output "cluster_name" {
  value = var.cluster_name
}

output "region" {
  value = var.region
}

output "resource_group" {
  value = var.resource_group
}

That’s it. Everything we need to create and manage our AKS cluster is now neatly organized. It’s split into main.tf (contains the backend, the provider, and all the resources), variables.tf, and output.tf.

To demonstrate that Terraform does not care how we organize the definitions nor their order, we’ll apply them again.

terraform apply \
    --var k8s_version=$K8S_VERSION

The output, limited to the relevant parts, is as follows.

...
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
...

As you can see, there is nothing to add, change, or destroy. We did not change any of the definitions. We only organized them in a way I like.
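
The reason the order and the layout of definitions do not matter is that the execution order is derived from references between resources, not from their position in files. The snippet that follows is a simplified illustration of that idea in Python (3.9 or newer, for the standard-library graphlib module). It is not how Terraform is implemented. The dependencies mapping is written by hand from main.tf, where the storage container references azurerm_storage_account.state.name, and the function name creation_order is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def creation_order(dependencies):
    """Order resources so that each one comes after everything it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


# Keys are resources, values are the resources they reference.
# The entries are deliberately listed "out of order"; it makes no difference.
dependencies = {
    "azurerm_storage_container.state": {"azurerm_storage_account.state"},
    "azurerm_kubernetes_cluster.primary": set(),
    "azurerm_storage_account.state": set(),
}

order = creation_order(dependencies)
print(order)
```

Whatever the order of the entries in the mapping, the storage account always ends up before the container that references it. Destroying would walk the same graph in reverse.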

Destroying The Resources

We’re (almost) finished with the quick exploration of Terraform using AKS as an example. We saw how to add and change resources, and the only thing missing is to see how to destroy them.

If we’d like to delete some of the resources, all we’d have to do is remove their definitions, and execute terraform apply. However, in some cases, we might want to destroy everything. As you probably guessed, there is a command for that as well.

However, in this specific case, we might not want to destroy everything. We need to keep the storage where the Terraform state is stored. That will allow us to re-create the same cluster for the rest of the exercises.

If we were using AWS or GCP, we would simply execute terraform destroy because those two do not allow us to destroy storage if there are files inside it. Or, to be more precise, that’s the default behavior, and we would need to specify explicitly that we do want to destroy storage even if it contains files, by setting the argument force_destroy to true. However, Azure does not have such a flag. So, if we executed terraform destroy, everything would be gone, including the storage with Terraform state. Since we want to keep that storage, we will need to tell Terraform which targets to destroy, instead of wiping out everything.

All in all, we do want to destroy all the resources except the storage with Terraform state. We can do that through the --target argument. Fortunately for us, AKS is simple, and the whole cluster is defined as a single resource azurerm_kubernetes_cluster.

terraform destroy \
    --var k8s_version=$K8S_VERSION \
    --target azurerm_kubernetes_cluster.primary

The output, limited to the relevant parts, is as follows.

...
  # azurerm_kubernetes_cluster.primary will be destroyed
  - resource "azurerm_kubernetes_cluster" "primary" {
...
Plan: 0 to add, 0 to change, 1 to destroy.

Warning: Resource targeting is in effect
...
  Enter a value:

We can see that only the AKS cluster will be destroyed. Please type yes and press the enter key.

The cluster and all the other resources we defined are now gone. The exception is storage with the state that we left intact and will continue using in the exercises that follow.

Please note that we removed only the resources created through Terraform. Those that were created with az (e.g., resource group) are still there. Azure will not charge you (almost) anything for them so, unlike those we created with Terraform, there is no good reason to remove them. On the other hand, you might want to use the definitions from this chapter to create a cluster that will be used for the exercises in the others. Keeping those created with az will simplify the process. All you’ll have to do is execute terraform apply.

The last thing we’ll do is go out of the local copy of the repository.

cd ../../

Packaging, Deploying, And Managing Applications

A long time ago in a galaxy far, far away… we were copying files into servers. That wasn’t that bad, mostly because our needs were relatively simple. We had a couple of servers, we were releasing once a year (or even less frequently), and we did not worry that much about many of the things that we consider bare minimum today. If we were advanced enough, we might go crazy and create a ZIP file with all the files we’d need, use SSH to copy them to servers, and uncompress them. That was about it. That’s what we did, and that worked just fine. But that was many years ago. You, dear reader, might not have been even born at that time.

In the meantime, we developed different mechanisms and formats to package, deploy, and manage our applications. We got RPM, APT, Brew, Chocolatey, and a plethora of others. And those were badly needed since the requirements changed. It was not enough anymore to simply copy some files. We needed to be able to start, restart, and shut down processes. More importantly, we needed different ways to deploy applications in different operating systems. That greatly increased complexity, especially when packaging applications. We had to build binaries for each operating system, package them in different formats native to those systems, distribute them, and so on, and so forth.

Containers changed all that, mostly through Docker’s effort that brought an already existing capability to the masses. It made containers easy, which raised their popularity. With increased usage, containers extended beyond Linux systems. We can now run them on Windows, Raspberry Pis, and quite a few other systems. And all we have to do is build a container image, and let people run it through Docker or any other container engine. The world became a better place, and the processes became much simpler. We got one format to rule them all. Nevertheless, that turned out not to be enough.

We quickly understood that a container image alone is only a fraction of what we need to run an application. Images contain processes that run inside containers. But processes alone hardly do what we need them to do. Containers do not scale by themselves, they do not enable external communication, and they do not magically attach external volumes. Thinking that all we need is a command like docker container run is a misunderstanding, at best. Running applications successfully (by today’s standards) requires much more than the capability to execute a binary. We need to set up networking, we need to enable communication, and, more often than not, we need some form of TLS. We often need to replicate the application in an attempt to make it highly available. We might need to define when and how it should scale up and down. We might need to attach external storage. We might need to provide it with environment-specific configuration and, potentially, inject a secret or two. The list of “we might need to” type of things can be quite large.
Today, the definition of an application and everything needed to deploy, run, and maintain it, is fairly complex and is far from a single container image. We can choose from a myriad of schedulers, like Docker Swarm, Mesos, Nomad, and a few others. But they do not matter much anymore. There was a war, and they all lost. The one that matters today (May 2020) is Kubernetes. It is the de facto standard, and it allows us to define everything we need for an application in YAML format. But that is also not enough. We need to package Kubernetes YAML files in a way that they can be retrieved and used easily. We need some form of templating so that the differences can be defined and modified effortlessly, instead of constantly modifying endless lines in YAML files. That’s where Helm comes in.

Using Helm As A Package Manager For Kubernetes

Every operating system has at least one packaging mechanism. Some Linux distributions use RPM or APT. Windows uses Chocolatey, and macOS uses Brew. Those are all packaging mechanisms for operating systems. But why does that matter in the context of Kubernetes?

You might say that Kubernetes is a scheduler, while Windows, Linux, and macOS are operating systems. You would be right in thinking so, but that would be only partly true. We can think of Kubernetes as an operating system for clusters, while Linux, Windows, and macOS are operating systems of individual machines. The scheduler is only one of the many features of Kubernetes, and saying that it is an operating system for a cluster would be a more precise way to define it. It serves a similar purpose as, let’s say, Linux, except that it operates over a group of servers. As such, it needs a packaging mechanism designed to leverage its modus operandi.

While there are quite a few tools we can use to package, install, and manage applications in Kubernetes, none is as widely used as Helm. It is the de facto standard in the Kubernetes ecosystem. Helm uses “charts” to help us define, install, upgrade, and manage apps and all the surrounding resources. It simplifies versioning, publishing, and sharing of applications. It was donated to the Cloud Native Computing Foundation (CNCF)⁵¹, thus landing in the same place as most other projects that matter in the Kubernetes ecosystem.

Helm is mighty, yet simple to use. If we’d need to explain Helm in a single sentence, we could say that it is a templating and packaging mechanism for Kubernetes resources. But it is much more than that, even though those are the primary objectives of the project. It organizes packages into charts. We can create them from scratch with the helm create command, or we can convert existing Kubernetes definitions into Helm templates.

⁵¹https://www.cncf.io/

Helm uses a naming convention, which, as long as we spend a few minutes learning it, simplifies not only creation and maintenance of packages, but also navigation through those created by others. In its simplest form, the bulk of the work in Helm is about defining variables in values.yaml, and injecting them into Kubernetes YAML definitions by using entries like {{ .Values.ingress.host }}, instead of hard-coded values.

But Helm is not only about templating Kubernetes definitions but also about adding dependencies as additional charts. Those can be charts we developed and packaged, or charts maintained by others, usually for third-party software. The latter case is compelling since it allows us to leverage collective effort and knowledge of a much wider community.

One of the main advantages of Helm is that it empowers us to have variations of our applications, without the need to change their definitions. It accomplishes that by allowing us to define different properties of our application running in different environments. It does that through the ability to overwrite default values, which, in turn, modify the definitions propagated to Kubernetes. As such, it greatly simplifies the process of deploying variations of our applications to different environments, which can be separate Namespaces or even other clusters.

Once a chart is packaged, it can be stored in a registry and easily retrieved and used by people and teams that have sufficient permissions to access it.

On top of those main features, it offers many other goodies, like, for example, the mechanism to roll back releases. There are many others, and it can take quite some time to learn them all. Fortunately, it takes almost no effort to get up-to-speed with those that are most commonly used. It is a helpful and easy-to-learn tool that solves quite a few problems that were not meant to be solved with Kubernetes alone.

Now, I could continue rambling about Helm for quite some time, but that would be too boring. Instead, we will define a few requirements and see whether we can fulfill them with Helm.
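
To make the idea of default values, overrides, and injection more tangible, here is a toy model written in Python. It is a deliberate simplification and not Helm’s implementation: Helm uses Go templates, while this sketch understands only placeholders of the form {{ .Values.some.key }}. The names deep_merge and render, as well as the replicaCount key and the sample values, are assumptions made for illustration.

```python
import re


def deep_merge(defaults, overrides):
    """Return a copy of defaults in which overrides replace matching keys."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Matches entries such as {{ .Values.ingress.host }} and captures "ingress.host".
PLACEHOLDER = re.compile(r"\{\{\s*\.Values\.([A-Za-z0-9_.]+)\s*\}\}")


def render(template, values):
    """Replace every placeholder with the value found under the dotted path."""
    def lookup(match):
        node = values
        for part in match.group(1).split("."):
            node = node[part]
        return str(node)
    return PLACEHOLDER.sub(lookup, template)


# Defaults play the role of values.yaml; the override is environment-specific.
defaults = {"ingress": {"host": "go-demo-9.acme.com"}, "replicaCount": 1}
staging = deep_merge(defaults, {"ingress": {"host": "staging.go-demo-9.acme.com"}})

manifest = render("host: {{ .Values.ingress.host }}", staging)
print(manifest)  # host: staging.go-demo-9.acme.com
```

The template stays the same for every environment. Only the small set of overridden values differs, and everything that is not overridden (replicaCount in this sketch) keeps its default.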

Defining A Scenario

I like to start with objectives before diving into any specific tools. That helps me evaluate whether a tool fits my needs, or if I should look elsewhere. Yours are likely going to be different than mine. Nevertheless, there is a common theme for (almost) all of us, and the differences tend to be smaller than we think. They are often details, rather than substantial differentiation.

Most of us are deploying applications to multiple environments. We might have some personal development environment. We might have others for previews generated when creating pull requests, and a few permanent environments like, for example, staging and production. The number might differ from one company to another, and we might call them differently. Nevertheless, we all tend to have more than one environment, with some being dynamic (e.g., personal environments) and others being permanent (e.g., production). What matters is that applications in different environments tend to have different needs.

On the one hand, it would be great if a release in each of the environments is exactly the same. That would give us an additional peace of mind knowing that what we’re testing is exactly the same as what will run in production. That’s the promise of containers. A container based on an image should be the same no matter where it runs.

On the other hand, we also want to be pragmatic and cost-effective. Those two needs are often at odds.
We cannot use the same address when we want to access an application in, let’s say, staging and production. One could be staging.acme.com, while the other might be only acme.com. Even though that’s an insignificant difference, it already illustrates that what we run in one environment is not exactly the same as what we’re running in another. Similarly, it would be too expensive to run the same number of replicas in a personal environment as in production. If the latter has a hundred replicas, it would be too expensive to use the same amount while developing. There are quite a few other differences. We tend to keep applications in different environments as similar as possible while being realistic in what makes sense and which trade-offs are worth making.

So, let’s try to define a scenario of what we might need for an application. That might not fully fit your needs, but it should be close enough for you to get an understanding that will serve as a base that you should be able to extend on your own.

To begin with, we will need three different types of environments. We’ll create one for personal development, and we’ll call it dev. Every member of our team would have their own, and they would all be temporary. That would allow us to be productive while being cost-conscious at the same time. We’ll create such environments when we need them, and destroy them when we don’t.

Normally, we will also need environments where we will deploy applications as a result of making pull requests. They would also need to be temporary. Just like dev environments, they would be generated when pull requests are created and destroyed when PRs are merged or closed. Conceptually, those would be almost the same as dev environments, so we’ll skip them in our examples.

Further on, we will need at least one permanent environment that would serve as production.
While that could be enough, more often than not, we need at least another similar environment where we will deploy applications for, let’s say, integration tests. The goal is often to promote an application from one environment to another until it reaches production. There could be more than one permanent environment, but we should be able to illustrate the differences with two. So, we will have staging and production environments.

Now that we established that we’ll have three environments (dev, staging, and production), we should explore what should be the differences in how we run an application in each of those. But, before we do that, let’s go quickly through the application itself.

The demo application we will use in our examples will be a relatively simple one. It’s an API that needs to be accessible from outside the cluster. It is scalable, even though we might not always run multiple replicas. It uses a database, which is scalable. Since the database is stateful, it needs storage.

The app is called go-demo-9. The reason for such a name lies in my inability to be creative. So, I call all my demo applications go-demo with an index as a suffix. The one I used in the previous book was called go-demo-8, so the one we’ll use in this one is go-demo-9. My creativity does not go beyond increasing a number as a suffix to an already dull name.

Now, let’s see how that application should behave in each of the environments.

When running in a personal development or a preview environment, the domain of the application needs to be dynamic. If, for example, my GitHub username is vfarcic, the domain through which I might want to access the application could be vfarcic.go-demo-9.acme.com. That way, it would be
unique, and it would not clash with the same application running in John Doe’s environment since he’d use a domain jdoe.go-demo-9.acme.com. Further on, there should be no need to have more than one replica of the application or the database. I wouldn’t expect to have the production setup for the development. Given that one replica is enough, there’s probably no need to have HorizontalPodAutoscaler either. Finally, there’s no need to have persistent storage for the database in the development environment. That database would be shut down when I finish working on the application, and it should be OK if I always start with a fresh default dataset. Let’s move into permanent environments. The domains could be fixed to staging.go-demo-9.acme.com when running in staging, and go-demo-9.acme.com when in production. Those are permanent environments, so the domains can be permanent as well. The application running in staging should be almost the same as when running in production. That would give us higher confidence that what we test is what will run in production. There’s no need to be on the same scale, but it should have all the elements that constitute production. As such, the API should have HorizontalPodAutoscaler (HPA) enabled, and the database should have persistent storage in both environments. On the other hand, if the API in production could oscillate between three and six replicas, two should be enough as the minimum in staging. That way, we will be able to confirm that HPA works in both and that running multiple replicas works as expected, while, at the same time, we will not spend more money than needed for staging. The database in production will run as two replicas, and that means that we should have an equal number in staging as well. Otherwise, we’d risk not being able to validate whether database replication works, before we promote changes to production. The summarized requirements can be seen in the table that follows. 

Feature         Dev/PR                        Staging                     Production
Ingress host    [GH_USER].go-demo-9.acme.com  staging.go-demo-9.acme.com  go-demo-9.acme.com
HPA             false                         true                        true
App replicas    1                             2                           3-6
DB replicas     1                             2                           2
DB persistence  false                         true                        true

Those are all the requirements. You might have others for your application, but those should be enough to demonstrate how Helm works, and how to make applications behave differently in different environments. Now, let us see whether we can package, deploy, and manage our application in a way that fulfills those requirements. But, first, we need to make sure that we have the prerequisites required for the exercises that follow.
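
Before moving on, it can help to see the requirements from the table written down as data. The Python snippet below is just one possible encoding, not something from the repository we’ll use. The names REQUIREMENTS, BASE_DOMAIN, and ingress_host are made up. The upper limit of application replicas in staging is left as None because the requirements above specify only the minimum.

```python
BASE_DOMAIN = "go-demo-9.acme.com"

# One entry per environment, mirroring the table above.
# app_replicas is a (minimum, maximum) pair; None means "not specified".
REQUIREMENTS = {
    "dev": {
        "hpa": False, "app_replicas": (1, 1),
        "db_replicas": 1, "db_persistence": False,
    },
    "staging": {
        "hpa": True, "app_replicas": (2, None),
        "db_replicas": 2, "db_persistence": True,
    },
    "production": {
        "hpa": True, "app_replicas": (3, 6),
        "db_replicas": 2, "db_persistence": True,
    },
}


def ingress_host(environment, gh_user=None):
    """Return the domain through which the application should be reachable."""
    if environment == "dev":
        if not gh_user:
            raise ValueError("dev environments need a GitHub username")
        return f"{gh_user}.{BASE_DOMAIN}"
    if environment == "staging":
        return f"staging.{BASE_DOMAIN}"
    if environment == "production":
        return BASE_DOMAIN
    raise ValueError(f"unknown environment: {environment}")


print(ingress_host("dev", "vfarcic"))  # vfarcic.go-demo-9.acme.com
```

Each row of the table becomes a value that differs between environments, which is precisely the kind of difference we’ll try to express through Helm.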

Preparing For The Exercises

All the commands from this chapter are available in the 02-helm.sh⁵² Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.
The code and the configurations that will be used in this chapter are available in the GitHub repository vfarcic/devops-catalog-code⁵³. Let’s clone it. Feel free to skip the command that follows if you already cloned that repository.

git clone \
    https://github.com/vfarcic/devops-catalog-code.git

Next, we’ll go into the local copy of the repository, and, to be on the safe side, we’ll pull the latest revision just in case you already had the repository from before, and I changed something in the meantime.

cd devops-catalog-code

git pull

⁵²https://gist.github.com/c9e05ce1b744c0aad5d10ee5158099fa
⁵³https://github.com/vfarcic/devops-catalog-code

We’ll need a Kubernetes cluster with the NGINX Ingress controller, and the environment variable INGRESS_HOST with the address through which we can access applications that we’ll deploy inside the cluster. If you meet those requirements, you should be able to use any Kubernetes cluster. However, bear in mind that I tested everything in Docker Desktop, Minikube, Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), and Azure Kubernetes Service (AKS). That does not mean that you cannot use a different Kubernetes flavor. You most likely can, but I cannot guarantee that without testing it myself.

The Gists for GKE, EKS, and AKS assume that you followed the exercises for using Terraform. If you didn’t, you might want to go through the Infrastructure as Code (IaC) chapter first. Or, if you are confident in your Terraform skills, you might skip that chapter, but, in that case, you might need to make a few modifications to the Gist you choose.

For your convenience, I created scripts that will create a Kubernetes cluster in the flavors I mentioned. All you have to do is follow the instructions from one of the Gists that follow.

• Docker Desktop: docker.sh⁵⁴
• Minikube: minikube.sh⁵⁵
• GKE: gke.sh⁵⁶
• EKS: eks.sh⁵⁷
• AKS: aks.sh⁵⁸

You will also need the Helm CLI. If you do not have it already, please visit the Installing Helm⁵⁹ page and follow the instructions for your operating system.

The only thing missing is to go to the helm directory, which contains all definitions we’ll use in this chapter.

cd helm

Creating Helm Charts

Helm uses a packaging format called charts. A chart is a collection of files that describe a related set of Kubernetes resources. Helm relies heavily on naming conventions, so charts are created as files laid out in a particular directory tree, and with some of the files using pre-defined names. Charts can be packaged into versioned archives to be deployed. But that’s not our current focus. We’ll explore packaging later. For now, our goal is to create a chart.

We can create a basic one through the CLI.

helm create my-app

Helm created a directory with the same name as the one we specified. Let’s see what we got.

ls -1 my-app

The output is as follows.

Chart.yaml
charts
templates
values.yaml

⁵⁴https://gist.github.com/9f2ee2be882e0f39474d1d6fb1b63b83
⁵⁵https://gist.github.com/2a6e5ad588509f43baa94cbdf40d0d16
⁵⁶https://gist.github.com/68e8f17ebb61ef3be671e2ee29bfea70
⁵⁷https://gist.github.com/200419b88a75f7a51bfa6ee78f0da592
⁵⁸https://gist.github.com/0e28b2a9f10b2f643502f80391ca6ce8
⁵⁹https://helm.sh/docs/intro/install/

You probably expect me to explain what each of those files and directories means. I will do that, but not through the chart we created. It is too simple for our use case, and we’d need to change quite a few things. Instead of doing that, we’ll explore a chart I prepared. For now, just remember that you can easily create new charts. We’ll explore through my example the most important files and directories, and what they’re used for.

Given that we will not use the my-app chart we created, we’ll delete the whole directory. It served its purpose of demonstrating how to create new charts, and not much more.

rm -rf my-app

The chart we’ll use is in the go-demo-9 subdirectory. Let’s see what’s inside.

ls -1 go-demo-9

The output is as follows.

Chart.yaml
charts
requirements.yaml
templates
values.yaml

You’ll notice that the files and the directories are the same as those we got when we created a new chart. As I already mentioned, Helm uses a naming convention and expects things to be named the way it likes. The only difference between my chart and the one we created earlier is that now we have an additional file requirements.yaml. We’ll get to it later.

In most cases, a repo like the one we’re using is not a good place for Helm charts. More often than not, they should be stored in the same repositories as the applications they define. We have it in the same repository as all the other examples we’re using in this book, mostly for simplicity reasons.
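
Since the layout is a convention, it can be checked mechanically. What follows is a small Python sketch that reports which of the entries we saw in the listings above are missing from a directory. It only checks names, and it makes no claim about which of them Helm itself treats as mandatory. The name missing_entries is hypothetical, and the demonstration runs against a throwaway directory rather than the real chart.

```python
import tempfile
from pathlib import Path

# The names listed by "ls -1 my-app" earlier in this section.
EXPECTED_ENTRIES = ("Chart.yaml", "charts", "templates", "values.yaml")


def missing_entries(chart_dir):
    """Return the expected names that do not exist inside chart_dir."""
    root = Path(chart_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


# Build an incomplete, temporary "chart" to show what the function reports.
with tempfile.TemporaryDirectory() as tmp:
    chart = Path(tmp) / "my-app"
    (chart / "templates").mkdir(parents=True)
    (chart / "Chart.yaml").write_text("name: my-app\n")
    demo_missing = missing_entries(chart)

print(demo_missing)  # ['charts', 'values.yaml']
```

Running the same function against the go-demo-9 directory should return an empty list, given the listing above.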
Let’s take a look at the first file.
cat go-demo-9/Chart.yaml
The output is as follows.

apiVersion: v1
description: A Helm chart
name: go-demo-9
version: 0.0.1
appVersion: 0.0.1

Chart.yaml contains meta-information about the chart. It is mostly for Helm’s internal use, and it does not define any Kubernetes resources.

The apiVersion is set to v1. The alternative would be to set it to v2, which would indicate that the chart is compatible only with Helm 3. Everything we’ll use is compatible with earlier Helm versions, so we’re keeping v1 as a clear indication that the chart can be used with any Helm, at least at the time of this writing (May 2020).

The description and the name should be self-explanatory, so we’ll skip those.

The version field is mandatory, and it defines the version of the chart. The appVersion, on the other hand, is optional, and it contains the version of the application that this chart defines. What matters is that the version must use semantic versioning 2⁶⁰. Helm does not enforce that for appVersion, but following the same scheme there keeps things consistent.

There are a few other fields that we could have defined, but we didn’t. I’m assuming that you will read the full documentation later on. This is a quick dive into Helm, and it is not meant to show you everything you can do with it.

Let’s see what’s inside the templates directory.
ls -1 go-demo-9/templates
The output is as follows.

NOTES.txt
_helpers.tpl
deployment.yaml
hpa.yaml
ingress.yaml
service.yaml

That is the directory where the action is defined.

There’s NOTES.txt, which contains templated usage information that will be output when we deploy the chart. The _helpers.tpl file defines “template” partials or, to put it in other words, functions that can be used in templates. We’ll skip explaining both and leave them as homework for you to explore later on your own.

⁶⁰https://semver.org/

The rest of the files define templates that will be converted into Kubernetes resource definitions. Unlike most other Helm files, those can be named any way we like. If you’re already familiar with Kubernetes, you should be able to guess what’s in those files from their names. There is a Kubernetes Deployment (deployment.yaml), HorizontalPodAutoscaler (hpa.yaml), Ingress (ingress.yaml), and Service (service.yaml).

Let’s take a look at the deployment.yaml file.
cat go-demo-9/templates/deployment.yaml
The output is as follows.

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ template "fullname" . }}
  labels:
    app: {{ template "fullname" . }}
spec:
  selector:
    matchLabels:
      app: {{ template "fullname" . }}
  template:
    metadata:
      labels:
        app: {{ template "fullname" . }}
{{- if .Values.podAnnotations }}
      annotations:
{{ toYaml .Values.podAnnotations | indent 8 }}
{{- end }}
    spec:
      containers:
      - name: {{ .Chart.Name }}
        image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        env:
        - name: DB
          value: {{ template "fullname" . }}-db
        - name: VERSION
          value: {{ .Values.image.tag }}
        ports:
        - containerPort: {{ .Values.service.internalPort }}
        livenessProbe:
          httpGet:
            path: {{ .Values.probePath }}
            port: {{ .Values.service.internalPort }}
          initialDelaySeconds: {{ .Values.livenessProbe.initialDelaySeconds }}
          periodSeconds: {{ .Values.livenessProbe.periodSeconds }}
          successThreshold: {{ .Values.livenessProbe.successThreshold }}
          timeoutSeconds: {{ .Values.livenessProbe.timeoutSeconds }}
        readinessProbe:
          httpGet:
            path: {{ .Values.probePath }}
            port: {{ .Values.service.internalPort }}
          periodSeconds: {{ .Values.readinessProbe.periodSeconds }}
          successThreshold: {{ .Values.readinessProbe.successThreshold }}
          timeoutSeconds: {{ .Values.readinessProbe.timeoutSeconds }}
        resources:
{{ toYaml .Values.resources | indent 12 }}
      terminationGracePeriodSeconds: {{ .Values.terminationGracePeriodSeconds }}
We will not discuss specifics of that Deployment, or anything else directly related to Kubernetes. I will assume that you have at least basic Kubernetes knowledge. If that’s not the case, you might want to learn a bit about it first. One possible source of information could be The DevOps 2.3 Toolkit: Kubernetes⁶¹.
If you ignore the entries surrounded with curly braces ({{ and }}), that would be a typical definition of a Kubernetes Deployment. The twist is that we replaced parts of it with variables, functions, and conditionals, all of them surrounded by curly braces.

The templates (like the one in front of you) will be converted into Kubernetes manifest files that are YAML-formatted resource descriptions. In other words, those templates will be converted into typical Kubernetes YAML files.

Templating itself is a combination of the Go template language⁶² and the Sprig template library⁶³. It might take a while to learn them both, but the good news is that you might never have to go that deep. Most of the Helm definitions you will find, and most of those you will define, will use only a few simple constructs.

Most of the values inside curly braces start with .Values. For example, we have an entry like path: {{ .Values.probePath }}. That means that the value of the path entry will be, by default, the value of the variable probePath defined in values.yaml. Note that the previous sentence said “by default”. We’ll see later what that really means.

⁶¹https://www.devopstoolkitseries.com/posts/devops-23/
⁶²https://godoc.org/text/template
⁶³https://masterminds.github.io/sprig/

Let’s take a look at values.yaml, and try to locate probePath.
cat go-demo-9/values.yaml
The output, limited to the relevant parts, is as follows.

...
probePath: /
...

If we do not overwrite that variable, the entry path: {{ .Values.probePath }} in templates/deployment.yaml will be converted into path: /.

We could easily spend a few chapters only on Helm templating, but we won’t. This is supposed to be a quick dive into Helm. We have a few objectives we decided to accomplish, so let’s get back to them.

We already said that the Ingress host should be different depending on the environment where the application will run. Let’s see how we can accomplish that through Helm.
cat go-demo-9/templates/ingress.yaml
The output is as follows.

---

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: {{ template "fullname" . }}
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
  - host: {{ .Values.ingress.host }}
    http:
      paths:
      - backend:
          serviceName: {{ template "fullname" . }}
          servicePort: {{ .Values.service.externalPort }}

As you can see, that is a typical Ingress definition. I’m still assuming that you have basic Kubernetes knowledge and that, therefore, you know what Ingress is. What makes that definition “special” is that a few values that would typically be hard-coded are changed to variables and templates (functions). The important part, in the context of our objectives, is that the first (and only) entry of the spec.rules array has host set to {{ .Values.ingress.host }}. Let’s try to locate that in the values.yaml file.
cat go-demo-9/values.yaml
The output, limited to the relevant parts, is as follows.

...
ingress:
  host: go-demo-9.acme.com
...

We can see that there is the variable host nested inside the ingress entry and that it is set to go-demo-9.acme.com. That is the host we decided to use for production, and it represents the pattern I prefer to follow. More often than not, I define charts in a way that the default values (those defined in values.yaml) always represent an application in production. That allows me to have easy insight into the “final” state of the application while keeping the option to overwrite those values for other environments. We’ll see later how to do that.

Another objective we have is to enable or disable HorizontalPodAutoscaler, depending on the environment of the application. So, let’s take a quick look at its definition.
cat go-demo-9/templates/hpa.yaml
The output is as follows.

---

{{- if .Values.hpa.enabled }}
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: {{ template "fullname" . }}
  labels:
    app: {{ template "fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ template "fullname" . }}
  minReplicas: {{ .Values.hpa.minReplicas }}
  maxReplicas: {{ .Values.hpa.maxReplicas }}
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: {{ .Values.hpa.cpuTargetAverageUtilization }}
  - type: Resource
    resource:
      name: memory
      targetAverageUtilization: {{ .Values.hpa.memoryTargetAverageUtilization }}
{{- end }}

Since we want to be able to control whether HPA should or shouldn’t be created depending on the environment, we wrapped the whole definition in the block defined with the {{- if .Values.hpa.enabled }} and {{- end }} entries. It is the Go template equivalent of a simple if statement. The definition inside that block will be used only if the hpa.enabled value is set to true.

We also said that, when HPA is indeed used, it should be configured to have a minimum of two replicas in staging, and that it should oscillate between three and six in production. We’re accomplishing that through the minReplicas and maxReplicas entries set to the hpa.minReplicas and hpa.maxReplicas values.

Let’s confirm that those variables are indeed defined in values.yaml, and that the values are set to what we expect to have in production.
cat go-demo-9/values.yaml
The output, limited to the relevant parts, is as follows.

...
hpa:
  enabled: true
  minReplicas: 3
  maxReplicas: 6
...
As we can see, if we don’t overwrite the default values, HPA will be enabled, the minimum number of replicas will be 3, and the maximum will be 6. That’s what we said we should have in production. There’s one more thing we need to explore before we deploy our application. We need to figure out how to add the database as a dependency.
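The effect of the if block can be observed without a cluster by rendering the chart locally and counting the resources of a given kind. The following is a minimal sketch: count_kind is a hypothetical helper, while helm template and --set are standard Helm 3 features. It assumes Helm 3 is installed, that the commands are run from the directory containing go-demo-9, and that the chart's dependencies were already downloaded (we do that later with helm dependency update), since Helm refuses to render a chart with missing dependencies.

```shell
#!/usr/bin/env bash

# count_kind (hypothetical helper): read rendered manifests from stdin
# and print how many of them declare the given kind.
count_kind() {
  local kind="$1"
  # grep -c prints 0 but exits non-zero when nothing matches,
  # so the exit status is neutralized.
  grep -c "^kind: ${kind}\$" || true
}

# Render the chart with the defaults and with HPA switched off,
# and compare the number of HorizontalPodAutoscaler resources.
if command -v helm >/dev/null 2>&1 && [ -f go-demo-9/Chart.yaml ]; then
  helm template go-demo-9 go-demo-9 \
      | count_kind HorizontalPodAutoscaler
  helm template go-demo-9 go-demo-9 \
      --set hpa.enabled=false \
      | count_kind HorizontalPodAutoscaler
fi
```

If only our own template defines an HPA, the second count should be lower than the first by one. Nothing is sent to the cluster in either case.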

Adding Application Dependencies

As you already saw, Helm allows us to define templates. While that works great for our applications, it might not be the best idea for third-party apps.

Our application requires a database. To be more precise, it needs MongoDB. Now, you might say, “it should be easy to define the resources for MongoDB”, but that could quickly turn into a false statement. Running MongoDB is not only about creating a Kubernetes StatefulSet and a Service. It’s much more than that. We might need to have a Deployment when it is running as a single replica, and a StatefulSet when it is running in ReplicaSet mode. We might need to set up autoscaling. We might need an operator that will join replicas into a ReplicaSet. We might need different storage options, and we might need to be able to choose not to use storage at all. There are many other things that we might need to define or, even worse, we might need to write a custom operator that would tie different resources together and make sure that they are working as expected.

But, the real question is not whether we could define everything we need to deploy and manage MongoDB. Rather, the question is whether that is a worthwhile investment. More often than not, it is not worth our time. Whenever possible, we should focus on what brings differentiating value. That, in most cases, means that we should focus mostly on developing, releasing, and running our own applications, and using services from other vendors and community knowledge for everything else. MongoDB is not an exception.

Helm contains a massive library of charts maintained both by the community and vendors. All we have to do is find the one we need, and add it as a dependency.

Let’s see yet another YAML file.
cat go-demo-9/requirements.yaml
The output is as follows.

dependencies:
- name: mongodb
  alias: go-demo-9-db
  version: 7.13.0
  repository: https://charts.bitnami.com/bitnami

The requirements.yaml file is optional. If we do have it, we can use it to specify one or more dependencies. In our case, there’s only one.

The name of the dependency must match the name of the chart that we want to use as a dependency. However, that alone might result in conflicts, given that multiple applications running in the same Namespace might use the same chart as a dependency. To avoid such a potential issue, we specified the alias that provides the unique identifier that should be used when deploying that chart instead of the “official” name of the chart.

Further on, we can see that we are using a specific version of that chart and that it is defined in a specific repository.

That is indeed an easy way to add almost any third-party application as a dependency. Furthermore, we could use the same mechanism to add internal applications as dependencies.

But the real question is: how did I know which values to add? How did I know that the name of the chart is mongodb? How did I figure out that the version is indeed 7.13.0, and that the chart is in that repository? Let’s go a few steps back and explore the process that made me end up with that specific config.

The first step in finding a chart is often to search for it. That can be done through a simple Google search like “MongoDB Helm chart”, or through the helm search command. I tend to start with the latter and resort to Google only if what I need is not that easy to find.
helm repo add stable \
    https://kubernetes-charts.storage.googleapis.com

helm search repo mongodb

We added the stable repository. We’ll explore repositories in more detail soon. For now, the only thing that matters is that it is the location where most of the charts are located. Further on, we searched for mongodb in all the repositories we currently have. The output, in my case, is as follows (yours might differ).
NAME                                CHART VERSION  APP VERSION  DESCRIPTION
stable/mongodb                      7.8.10         4.2.4        DEPRECATED NoSQL document-oriented database tha...
stable/mongodb-replicaset           3.15.0         3.6          NoSQL document-oriented database that stores JS...
stable/prometheus-mongodb-exporter  2.4.0          v0.10.0      A Prometheus exporter for MongoDB metrics
stable/unifi                        0.7.0          5.12.35      Ubiquiti Network's Unifi Controller
There are quite a few things we can observe from that output. To begin with, all the charts, at least those related to mongodb, are coming from the stable repo. That’s the repository that is often the best starting point when searching for a Helm chart. We can see that there are at least four charts that contain the word mongodb. But, judging from the names, the stable/mongodb chart sounds like something we might need.
Further on, we have the latest version of the chart (CHART VERSION) and the version of the application it uses (APP VERSION). Finally, we have a description, and the one for the mongodb chart is kind of depressing. It starts with DEPRECATED, giving us a clear indication that it is no longer maintained.

Let’s take a quick look at the chart’s README and see whether we can get a clue as to why it was deprecated.
helm show readme stable/mongodb
If we scroll up to the top, we’ll see that there is a whole sub-section with the header This Helm chart is deprecated. In a nutshell, it tells us that the maintenance of the chart has moved to Bitnami’s repository, followed by short instructions on how to add and use their repository.

While that might be seen as an additional complication, such a situation is great, since it provides me with the perfect opportunity to show you how to add and use additional repositories.

The first step is to add the bitnami repository to the Helm CLI, just as the instructions tell us.

helm repo add bitnami \
    https://charts.bitnami.com/bitnami

Next, we will confirm that the repo was indeed added by listing all those available locally through the Helm CLI.
helm repo list
The output is as follows.

NAME     URL
stable   https://kubernetes-charts.storage.googleapis.com/
bitnami  https://charts.bitnami.com/bitnami

Let’s see what happens if we search for MongoDB again.
helm search repo mongodb
The output is as follows.

NAME                                CHART VERSION  APP VERSION  DESCRIPTION
bitnami/mongodb                     7.13.0         4.2.6        NoSQL document-oriented database that stores JS...
bitnami/mongodb-sharded             1.1.4          4.2.6        NoSQL document-oriented database that stores JS...
stable/mongodb                      7.8.10         4.2.4        DEPRECATED NoSQL document-oriented database tha...
stable/mongodb-replicaset           3.15.0         3.6          NoSQL document-oriented database that stores JS...
stable/prometheus-mongodb-exporter  2.4.0          v0.10.0      A Prometheus exporter for MongoDB metrics
bitnami/mean                        6.1.1          4.6.2        MEAN is a free and open-source JavaScript softw...
stable/unifi                        0.7.0          5.12.35      Ubiquiti Network's Unifi Controller

We can see that, besides those from the stable repo, we got a few new results from bitnami. Please note that one of the charts is bitnami/mongodb and that the chart version is, in my case, 7.13.0 (yours might be newer).

Let’s take another look at the requirements.yaml file we explored earlier.
cat go-demo-9/requirements.yaml
The output is as follows.

dependencies:
- name: mongodb
  alias: go-demo-9-db
  version: 7.13.0
  repository: https://charts.bitnami.com/bitnami

The dependency we explored earlier should now make more sense. The repository is the same as the one we added earlier, and the version matches the latest one we observed through my output of helm search.

But we’re not yet done with the mongodb dependency. We might need to customize it to serve our needs. Given that we’re trying to define production values as defaults in values.yaml, and that we said that the database should be replicated, we might need to add a value or two to make that happen. So, the next step is to explore which values we can use to customize the mongodb chart.

We can easily retrieve all the values available in a chart through yet another helm command.
helm show values bitnami/mongodb \
    --version 7.13.0

The output, limited to the relevant parts, is as follows.

...
replicaSet:
  ## Whether to create a MongoDB replica set for high availability or not
  enabled: false
...

We can see that MongoDB replication (ReplicaSet) is disabled by default. All we’d have to do to make it replicated is to change the replicaSet.enabled value to true. And I already did that for you in the values.yaml file, so let’s take another look at it.
cat go-demo-9/values.yaml
The output, limited to the relevant parts, is as follows.

...
go-demo-9-db:
  replicaSet:
    enabled: true

This time, we are not trying to define a value of the main application, but of one of its dependencies. So, the values related to MongoDB are nested under go-demo-9-db. That matches the alias we defined for the dependency. Within that segment, we set replicaSet.enabled to true. As you will see soon, when we deploy the application with the go-demo-9-db dependency, the database will be replicated.
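The rule that dependency values are nested under the alias (or under the chart name when no alias is set) can be sketched as a small script. This is only an illustration under stated assumptions: dependency_value_keys is a hypothetical helper, and the awk program handles only the simple, flat layout of requirements.yaml shown above rather than arbitrary YAML.

```shell
#!/usr/bin/env bash

# dependency_value_keys (hypothetical helper): for each dependency in a
# requirements.yaml file, print the top-level values.yaml key under which
# its values belong: the alias when present, otherwise the chart name.
dependency_value_keys() {
  awk '
    # Print the key for the dependency collected so far, then reset.
    function emit() {
      if (name != "") print (alias != "" ? alias : name)
      name = ""; alias = ""
    }
    # A new list item starts a new dependency.
    /^[[:space:]]*-[[:space:]]/                      { emit() }
    /^[[:space:]]*(-[[:space:]]+)?name:[[:space:]]/  { name = $NF }
    /^[[:space:]]*(-[[:space:]]+)?alias:[[:space:]]/ { alias = $NF }
    END { emit() }
  ' "$1"
}

# Run it against the chart from this chapter when the file is present.
if [ -f go-demo-9/requirements.yaml ]; then
  dependency_value_keys go-demo-9/requirements.yaml
fi
```

For the file we explored, the script prints go-demo-9-db, which is the key we used in values.yaml.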

Deploying Applications To Production

It might sound strange that we’re starting with production. It would make much more sense to deploy it first to a development environment, from there to promote it to staging, and only then to run it in production. From the application lifecycle perspective, that would be, more or less, the correct flow of events. Nevertheless, we’ll start with production, because that is the easiest use case. Since the default values match what we want to have in production, we can deploy the application to production as-is, without worrying about the tweaks like those we’ll have to make for development and staging environments.

Normally, we’d split environments into different Namespaces or even different clusters. Since the latter would require us to create new clusters, and given that I want to keep the cost to the bare minimum, we’ll stick with Namespaces as a way to separate the environments. The process would be mostly the same if we’d run multiple clusters, and the only substantial difference would be in kubeconfig, which would need to point to the desired Kube API.

We’ll start by creating the production Namespace.
kubectl create namespace production
We’ll imagine that we do not know whether the application has any dependencies, so we will retrieve the list and see what we might need.
helm dependency list go-demo-9
The output is as follows.

NAME     VERSION  REPOSITORY                          STATUS
mongodb  7.13.0   https://charts.bitnami.com/bitnami  missing

We can see that the chart has only one dependency. We already knew that. What is distinguishing in that list is that the status is missing. We need to download the missing dependency before we try to apply the chart. One way to do that is by updating all the dependencies of the chart.
helm dependency update go-demo-9
The output is as follows.

Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "bitnami" chart repository
...Successfully got an update from the "stable" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 1 charts
Downloading mongodb from repo https://charts.bitnami.com/bitnami
Deleting outdated charts

We can confirm that it is now available by re-listing the dependencies.
helm dependency list go-demo-9
The output is as follows.

NAME     VERSION  REPOSITORY                          STATUS
mongodb  7.13.0   https://charts.bitnami.com/bitnami  ok

We can see that the status is now ok, meaning that the only dependency of the chart is ready to be used.

We can install charts through helm install, or we can upgrade them with helm upgrade. But I don’t like using those commands since that would force me to find out the current status. If the application is already installed, helm install will fail. Similarly, we would not be able to upgrade the app if it was not already installed. In my experience, it is best if we don’t worry about the current status and tell Helm to upgrade the chart if it’s already installed or to install it if it isn’t. That would be equivalent to the kubectl apply command.

helm --namespace production \
    upgrade --install \
    go-demo-9 go-demo-9 \
    --wait \
    --timeout 10m

The output is as follows.

Release "go-demo-9" does not exist. Installing it now.
NAME: go-demo-9
LAST DEPLOYED: Tue Apr 21 21:41:16 2020
NAMESPACE: production
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Get the application URL by running these commands:

kubectl get ingress go-demo-9-go-demo-9
We applied the chart. Since we used the --install argument, Helm figured out whether it should upgrade it or install it. We named the applied chart go-demo-9, and we told it to use the directory with the same name as the source. The --wait argument made the process last longer because it forced Helm to wait until all the Pods are running and healthy. It is important to note that Helm is Namespace-scoped. We had to specify --namespace production to ensure that’s where the resources are created. Otherwise, it would use the default Namespace. If you are in doubt which charts are running in a specific Namespace, we can retrieve them through the helm list command.
helm --namespace production list

The output is as follows.

NAME       NAMESPACE   REVISION  UPDATED         STATUS    CHART            APP VERSION
go-demo-9  production  1         2020-04-21 ...  deployed  go-demo-9-0.0.1  0.0.1

You should be able to figure out the meaning of each of the columns, so we’ll move on right away.

We set some objectives for the application running in production. Let’s see whether we fulfilled them. As a refresher, they are as follows.

• Ingress host should be go-demo-9.acme.com.
• HPA should be enabled.
• The number of replicas of the application should oscillate between three and six.
• The database should have two replicas.
• The database should have persistent volumes attached to all the replicas.
Let’s see whether we accomplished those objectives. Is our application accessible through the host go-demo-9.acme.com?

kubectl --namespace production \
    get ingresses

The output is as follows.

NAME                 HOSTS               ADDRESS        PORTS  AGE
go-demo-9-go-demo-9  go-demo-9.acme.com  192.168.64.56  80     8m21s

The host does seem to be correct, and we should probably double-check that the application is indeed accessible through it by sending a simple HTTP request.

curl -H "Host: go-demo-9.acme.com" \
    "http://$INGRESS_HOST"
Since you probably do not own the domain acme.com, we’re “faking” it by injecting the Host header into the request.
The output is as follows.
Version: 0.0.1; Release: unknown

We got a response, so we can move to the next validation. Do we have HorizontalPodAutoscaler (HPA), and is it configured to scale between three and six replicas?

kubectl --namespace production \
    get hpa

The output is as follows.

NAME                 REFERENCE                       TARGETS     MINPODS  MAXPODS  REPLICAS  AGE
go-demo-9-go-demo-9  Deployment/go-demo-9-go-demo-9  /80%, /80%  3        6        3         9m58s
Is the database replicated as well, and does it have persistent volumes?

kubectl get persistentvolumes

The output is as follows.

NAME             CAPACITY  ACCESS MODES  RECLAIM POLICY  STATUS  CLAIM                   STORAGECLASS  REASON  AGE
pvc-86dc84f4...  8Gi       RWO           Delete          Bound   production/datadir-...  standard              10m
pvc-f834886d...  8Gi       RWO           Delete          Bound   production/datadir-...  standard              10m
We don’t need to waste our time by checking whether there are two replicas of the database. We can conclude that from persistent volumes since each of the two is attached to a different replica. Let’s move into development and pull request environments and see whether we can deploy the application there, but with different characteristics.
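The manual checks above can also be scripted. The following is a minimal sketch, assuming the release from this section is running in the production Namespace and that the resources are named go-demo-9-go-demo-9, as in the outputs above. The expect_eq function is a hypothetical helper, while the kubectl calls rely only on the standard --output jsonpath option.

```shell
#!/usr/bin/env bash

# expect_eq (hypothetical helper): compare an expected value with the
# actual one and report the result. Returns non-zero on a mismatch.
expect_eq() {
  local label="$1" expected="$2" actual="$3"
  if [ "$expected" = "$actual" ]; then
    echo "ok: $label"
  else
    echo "FAIL: $label (expected '$expected', got '$actual')"
    return 1
  fi
}

# Run the checks only when kubectl can reach a cluster that has
# the production Namespace.
if command -v kubectl >/dev/null 2>&1 \
    && kubectl get namespace production >/dev/null 2>&1; then
  ns=production
  name=go-demo-9-go-demo-9
  expect_eq "ingress host" "go-demo-9.acme.com" \
    "$(kubectl --namespace $ns get ingress $name \
        --output jsonpath='{.spec.rules[0].host}')"
  expect_eq "hpa minReplicas" "3" \
    "$(kubectl --namespace $ns get hpa $name \
        --output jsonpath='{.spec.minReplicas}')"
  expect_eq "hpa maxReplicas" "6" \
    "$(kubectl --namespace $ns get hpa $name \
        --output jsonpath='{.spec.maxReplicas}')"
fi
```

The same helper could be reused for the other environments by changing the Namespace and the expected values.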

Deploying Applications To Development And Preview Environments

We saw how we can deploy an application and its dependencies to production. We saw that using Helm to deploy something to production is easy, given that we set all the default values to be those we want to use in production. We’ll move into a slightly more complicated scenario next.

We need to figure out how to deploy the application in temporary development and preview (pull request) environments. The challenge is that the requirements of our application in those environments are not the same. The host name should be dynamic. Given that every developer might have a different environment, and that there might be any number of open pull requests, the host of the application needs to be auto-generated, and always unique. We also need to disable HPA, to have a single replica of both the application and the DB dependency, and we don’t want persistent storage for something that is temporary and can exist for anything from minutes to days or weeks.

While the need to have different hosts is aimed at allowing multiple releases of the application to run in parallel, the rest of the requirements are mostly focused on cost reduction. There is probably no need to run a production-size application in development environments. Otherwise, we might go bankrupt, or we might need to reduce the number of such environments. We probably do not want either of the outcomes. Going bankrupt is terrible for obvious reasons, while not having the freedom to get an environment whenever we need it might severely impact our (human) performance.

Let’s get going. Let’s create a personal development environment just for you. We’ll start by defining a variable with your GitHub username. Given that each username is unique, that will allow us to create a Namespace without worrying whether it will clash with someone else’s. We’ll also use it to generate a unique host. Please replace [...] with your GitHub username or with any other unique identifier.
export GH_USER=[...]

kubectl create namespace $GH_USER

We created a new Kubernetes Namespace based on your GitHub username. Next, we’ll take another look at the values.yaml file.
cat go-demo-9/values.yaml
The output, limited to the relevant parts, is as follows.

image:
  repository: vfarcic/go-demo-9
  tag: 0.0.1
  pullPolicy: IfNotPresent
...
ingress:
  host: go-demo-9.acme.com

hpa:
  enabled: true
...
go-demo-9-db:
  replicaSet:
    enabled: true

We’ll need to change quite a few variables. To begin with, we probably do not want to use a specific tag of the image, but rather the latest one that we’ll be building whenever we want to see the results of our changes to the code. To avoid collisions with others, we might also want to change the repository. But, we’ll skip that part since it would introduce unnecessary complexity to our examples. While we’re on the subject of images, we should consider changing pullPolicy to Always. Otherwise, we’d need to build a different tag every time we create a new image. We should also define a unique host and disable hpa. Finally, we should disable the database replicaSet as well as persistence.

All those changes are likely going to be useful to everyone working on this application. So, we have a separate values.yaml file that can be used by anyone in need of a personal development environment.
cat dev/values.yaml
The output is as follows.

image:
  tag: latest
  pullPolicy: Always
hpa:
  enabled: false
go-demo-9-db:
  replicaSet:
    enabled: false
  persistence:
    enabled: false

That file contains all the values that we want to use to overwrite those defined as defaults in go-demo-9/values.yaml. The only variable missing is ingress.host. We could not pre-define it since it will differ from one person to another. Instead, we will assign it to an environment variable that we will use to set the value at runtime.
export ADDR=$GH_USER.go-demo-9.acme.com
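Both the Namespace name and the host embed the identifier stored in GH_USER, so it has to be usable as a DNS label: Kubernetes accepts only lowercase letters, digits, and dashes in Namespace names, with a limit of 63 characters. If your identifier contains uppercase letters or other characters, a small normalization step helps. The following is a minimal sketch, and dns_label is a hypothetical helper that is not part of the book's repository.

```shell
#!/usr/bin/env bash

# dns_label (hypothetical helper): turn an arbitrary identifier into a
# string that can serve as a Namespace name and as a host name segment.
dns_label() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9-]/-/g' -e 's/^-*//' \
    | cut -c1-63 \
    | sed -e 's/-*$//'
}

# Example with a made-up identifier.
dns_label "Jane_Doe"   # prints jane-doe
```

With such a helper in place, the variable could be set with export GH_USER=$(dns_label "[...]") before the Namespace is created and ADDR is defined.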
Now we are ready to create the resources in the newly created personal Namespace.

helm --namespace $GH_USER \
    upgrade --install \
    --values dev/values.yaml \
    --set ingress.host=$ADDR \
    go-demo-9 go-demo-9 \
    --wait \
    --timeout 10m

We used the --namespace argument to ensure that the resources are created inside the correct place. Moreover, we added two new arguments to the mix. The --values argument specified the path to dev/values.yaml that contains the environment-specific variables that should be overwritten. Further on, we passed the value of ingress.host through --set.

The output is almost the same as when we installed the application in production, so we can safely skip commenting on it. What matters is that the application is now (probably) running and that we can focus on confirming that the requirements for the development environment were indeed fulfilled. As a refresher, we’re trying to accomplish the following objectives.

• Ingress host should be unique.
• HPA should be disabled.
• The application should have only one replica.
• The database should have only one replica.
• The database should NOT have persistent volumes attached.
Let’s see whether we accomplished those objectives.
kubectl --namespace $GH_USER \
    get ingresses

The output, in my case, is as follows.

NAME                 HOSTS                       ADDRESS        PORTS  AGE
go-demo-9-go-demo-9  vfarcic.go-demo-9.acme.com  192.168.64.56  80     85s

We can confirm that the host is indeed unique (vfarcic.go-demo-9.acme.com), so let’s check whether the application is reachable through it.
curl -H "Host: $ADDR" \ "http://$INGRESS_HOST"
We received the response, thus confirming that the application is indeed accessible. Let’s move on. 1 2
kubectl --namespace $GH_USER \ get hpa
The output claims that no resources were found, so we can confirm that HPA was not created. How about the number of replicas? 1 2
kubectl --namespace $GH_USER \ get pods
The output is as follows. 1 2 3
NAME READY STATUS RESTARTS AGE go-demo-9-go-demo-9-... 1/1 Running 0 7m47s go-demo-9-go-demo-9-db-... 1/1 Running 0 7m47s
That also seems to be in line with our objectives. We’re running one replica for the API and one for the database. Finally, the only thing left to validate is whether the database in the personal development environment did not attach any persistent volumes. 1
kubectl get persistentvolumes
The output is as follows.
NAME             CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM                  STORAGECLASS REASON AGE
pvc-86dc84f4-... 8Gi      RWO          Delete         Bound  production/datadir-... standard            25m
pvc-f834886d-... 8Gi      RWO          Delete         Bound  production/datadir-... standard            25m

PersistentVolumes are not Namespaced, so we got all those in the cluster. What matters is that there are only two and that both are from the claims from the production Namespace. None were created from the database in the personal development environment.

That’s it. We saw how to deploy the application in a development or a preview (pull request) environment. The only thing left to do is to delete it. Those environments, or, at least, the resources in those environments, should be temporary. If, for example, we create resources for the purpose of developing something, it stands to reason that we should delete them when we’re finished. Since it’s so easy and fast to deploy anything we need, there’s no need to keep things running longer than needed and to unnecessarily increase costs.

helm --namespace $GH_USER \
    delete go-demo-9

We deleted the application. We can just as well keep the empty Namespace, or delete it as well. We will not need it anymore, so we will do the latter.
kubectl delete namespace $GH_USER
That’s it. It’s gone as if it never existed. There was no need to delete the application. If we’re planning to delete the whole Namespace, it is not necessary to start with helm delete. Removal of a Namespace means the removal of everything in it.
Let’s move forward and see what we’d have to do to run the application in a permanent non-production environment.

Deploying Applications To Permanent Non-Production Environments

We already learned everything we need to know to deploy an application to a permanent non-production environment like, for example, staging. It’s similar to deploying it to a development environment. As a matter of fact, it should be even easier since it is permanent, so there should be no dynamic values involved. All we have to do is define yet another set of values.

With that in mind, we could just as well skip this part. But we won’t, mostly so that we can close the circle and go through all the commonly used permutations. Think of this section as a refresher of what we learned. I promise to go through it fast without wasting too much of your time.

We’ll start by creating the Namespace.
kubectl create namespace staging
Next, we’ll take a quick look at the values.yaml specific to the staging environment.

cat staging/values.yaml

The output is as follows.

image:
  tag: 0.0.2
ingress:
  host: staging.go-demo-9.acme.com
hpa:
  minReplicas: 2

The values should be obvious and easy to understand. The only one that might be confusing is that we are setting minReplicas to 2, but we are not changing the maxReplicas (the default is 6). We did say that the staging environment should have only two replicas, but there should be no harm done if more are created under certain circumstances. We are unlikely to generate sufficient traffic in the staging environment for the HPA to scale up the application. If we do, for example, run some form of load testing, jumping above 2 replicas would be welcome, and it would be temporary anyway. The number of replicas should return to 2 soon after the load returns to normal.

Let’s apply the go-demo-9 chart with those staging-specific values.

helm --namespace staging \
    upgrade --install \
    --values staging/values.yaml \
    go-demo-9 go-demo-9 \
    --wait \
    --timeout 10m
All that’s left is to validate that the objectives for the staging environment are met. But, before we do that, let’s have a quick refresher.
• Ingress host should be staging.go-demo-9.acme.com.
• HPA should be enabled.
• The number of replicas of the application should be two (unless more is required).
• The database should have two replicas.
• The database should have persistent volumes attached to all the replicas.

Let’s see whether we fulfilled those objectives.

kubectl --namespace staging \
    get ingresses

The output is as follows.

NAME                HOSTS                      ADDRESS       PORTS AGE
go-demo-9-go-demo-9 staging.go-demo-9.acme.com 192.168.64.56 80    2m15s

We can see that the host is indeed staging.go-demo-9.acme.com, so let’s check whether the application is reachable through it.

curl -H "Host: staging.go-demo-9.acme.com" \
    "http://$INGRESS_HOST"

The output should be a “normal” response, so we can move on and validate whether the HPA was created and whether the minimum number of replicas is indeed two.

kubectl --namespace staging \
    get hpa

The output is as follows.

NAME                REFERENCE                      TARGETS    MINPODS MAXPODS REPLICAS AGE
go-demo-9-go-demo-9 Deployment/go-demo-9-go-demo-9 /80%, /80% 2       6       2        2m59s

We can see that the HPA was created and that the minimum number of Pods is 2. The only thing missing is to confirm whether the persistent volumes were created as well.
kubectl get persistentvolumes
The output is as follows.
NAME             CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM                  STORAGECLASS REASON AGE
pvc-11c49634-... 8Gi      RWO          Delete         Bound  staging/datadir-...    standard            3m45s
pvc-86dc84f4-... 8Gi      RWO          Delete         Bound  production/datadir-... standard            37m
pvc-bf1c2970-... 8Gi      RWO          Delete         Bound  staging/datadir-...    standard            3m45s
pvc-f834886d-... 8Gi      RWO          Delete         Bound  production/datadir-... standard            37m
We can observe that two new persistent volumes were added through the claims from the staging Namespace. That’s it. We saw how we can deploy the application in different environments, each with different requirements. But that’s not all we should do.
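Since PersistentVolumes are cluster-scoped, reading the CLAIM column by eye gets tedious once more environments are involved. The sketch that follows is one possible way to count the volumes claimed from a given Namespace. The count_claims_in function is a hypothetical helper, and the sample input simply mirrors the listing from the staging validation (two staging and two production claims).

```shell
# Hypothetical helper: reads one claim Namespace per line from stdin and
# prints how many of those lines equal the Namespace passed as $1.
count_claims_in() {
    # -c counts matches, -x matches whole lines, -F treats $1 literally.
    # grep exits non-zero when the count is 0, hence the "|| true".
    grep -c -x -F -- "$1" || true
}

# Sample input mirroring the listing above. Against a live cluster, the
# same lines could come from:
#   kubectl get persistentvolumes \
#       -o jsonpath='{range .items[*]}{.spec.claimRef.namespace}{"\n"}{end}'
printf '%s\n' staging production staging production \
    | count_claims_in staging
```

With the sample input, the function reports two claims for staging. Running it with the name of the deleted personal Namespace should report zero.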
Packaging And Deploying Releases

If we are going to follow the GitOps principles, each release should result in a change of, at least, the tag of the image and the version of the chart. Also, we might want to package the chart in a way that it can be distributed to all those who need it. We’ll explore that through a simulation of the development of a new feature that could result in a new release.

Let’s start by creating a new branch, just as you would normally do when working on something new.

git checkout -b my-new-feature

Now, imagine that we spent some time writing code and tests and that we validated that the new feature works as expected in a personal development environment and/or in a preview environment created through a pull request. Similarly, please assume that we decided to make a release of that feature.

The next thing we would probably want to do is change the version of the chart and the application. As you already saw, that information is stored in Chart.yaml, so let’s output it as a refresher.
cat go-demo-9/Chart.yaml
The output is as follows.
apiVersion: v1
description: A Helm chart
name: go-demo-9
version: 0.0.1
appVersion: 0.0.1

All we’d have to do is change the version and appVersion values. Normally, we’d probably do that by opening the file in an editor and changing it there. But, since I’m a freak for automation, and since I prefer doing as much as possible from a terminal, we’ll accomplish the same with a few sed commands.

cat go-demo-9/Chart.yaml \
    | sed -e "s@version: 0.0.1@version: 0.0.2@g" \
    | sed -e "s@appVersion: 0.0.1@appVersion: 0.0.2@g" \
    | tee go-demo-9/Chart.yaml

We retrieved the content of go-demo-9/Chart.yaml and piped the output to two sed commands that replaced 0.0.1 with 0.0.2 for both the version and the appVersion fields. The final output was sent to tee that stored it in the same Chart.yaml file.

The output is as follows.

apiVersion: v1
description: A Helm chart
name: go-demo-9
version: 0.0.2
appVersion: 0.0.2
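A word of caution about the pipeline above: it reads from and writes to the same file, and tee may truncate Chart.yaml before cat finishes reading it. With files this small it typically works, but the shell does not guarantee it. What follows is a minimal sketch of a more defensive variant, assuming a hypothetical helper named bump_chart that writes to a temporary file first and overwrites the original only if sed succeeds.

```shell
# Hypothetical helper: replace the chart and app versions in a Chart.yaml
# without reading and writing the same file within one pipeline.
# Usage: bump_chart <path-to-Chart.yaml> <old-version> <new-version>
bump_chart() {
    file="$1"; old="$2"; new="$3"
    tmp="$(mktemp)" || return 1
    # Escape the dots so that the old version is matched literally.
    old_re="$(printf '%s' "$old" | sed 's/\./\\./g')"
    # Anchor both patterns to whole lines so only those two fields change.
    if sed -e "s@^version: ${old_re}\$@version: ${new}@" \
           -e "s@^appVersion: ${old_re}\$@appVersion: ${new}@" \
           "$file" > "$tmp"; then
        # Overwrite in place to keep the original file's permissions.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Demonstration on a throwaway copy that mirrors the file shown above.
demo="$(mktemp)"
printf '%s\n' 'apiVersion: v1' 'description: A Helm chart' \
    'name: go-demo-9' 'version: 0.0.1' 'appVersion: 0.0.1' > "$demo"
bump_chart "$demo" 0.0.1 0.0.2
cat "$demo"
rm -f "$demo"
```

In the exercises, the equivalent call would be bump_chart go-demo-9/Chart.yaml 0.0.1 0.0.2. The same approach could be applied to the image tag in values.yaml.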
Now that we replaced the versions, we should turn our attention to the tag that we want to use. As you already know, it is one of the variables in values.yaml, so let’s output it.

cat go-demo-9/values.yaml

The output, limited to the relevant parts, is as follows.

image:
  repository: vfarcic/go-demo-9
  tag: 0.0.1
...

We’ll replace the value of the image.tag variable with yet another sed command.

cat go-demo-9/values.yaml \
    | sed -e "s@tag: 0.0.1@tag: 0.0.2@g" \
    | tee go-demo-9/values.yaml

The output, limited to the relevant parts, is as follows.

image:
  repository: vfarcic/go-demo-9
  tag: 0.0.2
...
That’s it. Those are all the modifications we had to do, even though they were not mandatory. We could have accomplished the same by specifying those values as --set arguments at runtime. But that would result in undocumented and hard to track changes. This way, we could (and we should) push those changes to a Git repository. Normally, we would need to build a new image, and we would probably execute quite a few other steps like testing, creating release notes, etc. But that’s not the subject of this chapter, so we’re skipping them. I already built the image vfarcic/go-demo-9:0.0.2, so it’s ready for us to use it.
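Given that the release number now lives in two files (appVersion in Chart.yaml and image.tag in values.yaml), it is easy to update one and forget the other. The sketch below is one possible pre-commit check, not something the chapter prescribes. The check_release_versions function is hypothetical, and it assumes the layout shown in this chapter: unquoted values, and image.tag being the first tag: entry in values.yaml.

```shell
# Hypothetical pre-commit check: succeeds only when appVersion in Chart.yaml
# equals the first "tag:" entry in values.yaml of the given chart directory.
# Assumes unquoted values and that image.tag is the first "tag:" key.
check_release_versions() {
    chart_dir="$1"
    app_version="$(sed -n 's/^appVersion:[[:space:]]*//p' \
        "$chart_dir/Chart.yaml" | head -n 1)"
    image_tag="$(sed -n 's/^[[:space:]]*tag:[[:space:]]*//p' \
        "$chart_dir/values.yaml" | head -n 1)"
    if [ -n "$app_version" ] && [ "$app_version" = "$image_tag" ]; then
        echo "OK: appVersion and image tag are both $app_version"
    else
        echo "MISMATCH: appVersion=$app_version image.tag=$image_tag" >&2
        return 1
    fi
}
```

From the helm directory, it could be invoked as check_release_versions go-demo-9 before committing the changes.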
Before we commit to making a new release based on, among other things, that chart, we should validate whether the syntax we’re using is correct and that there are no obvious issues with it. We’ll do that by “linting” the chart.

helm lint go-demo-9

The output is as follows.

==> Linting go-demo-9
[INFO] Chart.yaml: icon is recommended

1 chart(s) linted, 0 chart(s) failed

If we ignore the fact that the icon of the chart does not exist, we can conclude that the chart was linted successfully. Now that the chart seems to be defined correctly, we can package it.
helm package go-demo-9
We should have added --sign to the helm package command. That would provide a way to confirm its authenticity. However, we’d need to create the private key, and I do not want to lead us astray from the main subject. I’ll leave that up to you as homework for later.
We can see, from the output, that the chart was packaged, and that it was saved as go-demo-9-0.0.2.tgz. From now on, we can use that file to apply the chart, instead of pointing to a directory.

helm --namespace production \
    upgrade --install \
    go-demo-9 go-demo-9-0.0.2.tgz \
    --wait \
    --timeout 10m

We upgraded the go-demo-9 in production with the new release 0.0.2.

Right now, you might be wondering what’s the advantage of packaging a chart. It’s just as easy to reference a directory as to use a tgz file. And that would be true if we would always be looking for charts locally. With a package, we can store it remotely. That can be a network drive, or it can be an artifact repository, just like the one we used to add MongoDB as a dependency. Those repositories can be purposely built to store Helm charts, like, for example, ChartMuseum⁶⁴. We could also use a more generic artifacts repository like Artifactory⁶⁵, and many others. I’ll leave you to explore repositories for Helm charts later. Think of it as yet another homework.

Let’s move on and check whether the application was indeed upgraded to the new release.
helm --namespace production list
The output is as follows.

NAME      NAMESPACE  REVISION UPDATED       STATUS   CHART           APP VERSION
go-demo-9 production 2        2020-04-21... deployed go-demo-9-0.0.2 0.0.2

We can see that go-demo-9 is now running the second revision and that the app version is 0.0.2. If you’re still uncertain whether the application was indeed upgraded, we can retrieve all the information of the chart running in the cluster, and confirm that it looks okay.

helm --namespace production \
    get all go-demo-9

⁶⁴https://chartmuseum.com/
⁶⁵https://jfrog.com/artifactory/

We probably got more information than we need. We could have retrieved only hooks, manifests, notes, or values if we were looking for something more specific. In this case, we retrieved all because I wanted to show you that we can get all the available information.

If, for example, we would like to confirm that the image of the Deployment is indeed correct, we can observe that from the fragment of the output, that is as follows.

...
apiVersion: apps/v1
kind: Deployment
...
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: go-demo-9
        image: vfarcic/go-demo-9:0.0.2
...

If you’re curious, feel free to explore that output in its entirety. You should be able to navigate through it since most of it is either Kubernetes definitions that you’re already familiar with or Helm-specific things like variables that we explored earlier.

Finally, before we move on, we’ll confirm that the new release is indeed reachable by sending a simple curl request.

curl -H "Host: go-demo-9.acme.com" \
    "http://$INGRESS_HOST"
This time, the output states that the version is 0.0.2.
Rolling Back Releases

Now that we figured out how to move forward by installing or upgrading releases, let’s see whether we can change the direction. Can we roll back in case there is a problem that we did not detect before applying a new release?

We’ll start by taking a quick look at the go-demo-9 releases we deployed so far in the production Namespace.

helm --namespace production \
    history go-demo-9

The output is as follows.

REVISION UPDATED       STATUS     CHART           APP VERSION DESCRIPTION
1        Tue Apr 21... superseded go-demo-9-0.0.1 0.0.1       Install complete
2        Tue Apr 21... deployed   go-demo-9-0.0.2 0.0.2       Upgrade complete

We can see that we made two releases of the application. The revision 1 was the initial install of the version 0.0.1, while the revision 2 was the upgrade to the version 0.0.2.

Now, let’s imagine that there is something terribly wrong with the second revision, and that, for whatever reason, we cannot roll forward with a fix. In such a case, we’d probably choose to roll back.

We can roll back to a specific revision, or we can go to the previous release, whichever it is. We’ll choose the latter, with a note that rolling back to a specific revision would require that we add the revision number to the command we are about to execute.

helm --namespace production \
    rollback go-demo-9

The response clearly stated that the rollback was a success, so let’s take another look at history.

helm --namespace production \
    history go-demo-9

The output is as follows.

REVISION UPDATED       STATUS     CHART           APP VERSION DESCRIPTION
1        Tue Apr 21... superseded go-demo-9-0.0.1 0.0.1       Install complete
2        Tue Apr 21... superseded go-demo-9-0.0.2 0.0.2       Upgrade complete
3        Tue Apr 21... deployed   go-demo-9-0.0.1 0.0.1       Rollback to 1

We can see that it did not roll back literally. Instead, it rolled forward (upgraded the app), but to the older release. The alternative way to see the current revision is through the status command.

helm --namespace production \
    status go-demo-9

We can see from the output that the current revision is 3. Finally, to be on the safe side, we’ll send a request to the application and confirm that we’re getting the response from the older release 0.0.1.

curl -H "Host: go-demo-9.acme.com" \
    "http://$INGRESS_HOST"
The output is Version: 0.0.1, so we can confirm that the rollback was indeed successful.
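We rolled back to the previous release, but the text also noted that a specific revision can be targeted by adding its number to the command. Finding that number by eye is easy with two or three revisions and less so with dozens. The sketch that follows shows one way it could be extracted from the history table. The latest_revision_for function is a hypothetical helper, and the sample input is the history listing from before the rollback.

```shell
# Hypothetical helper: reads `helm history` table output from stdin and
# prints the highest revision whose CHART column equals the argument.
latest_revision_for() {
    awk -v chart="$1" '
        NR > 1 {                       # skip the header row
            for (i = 2; i <= NF; i++)  # UPDATED contains spaces, so scan fields
                if ($i == chart) rev = $1
        }
        END { if (rev != "") print rev; else exit 1 }
    '
}

# Sample input copied from the history listing shown before the rollback.
latest_revision_for go-demo-9-0.0.1 <<'EOF'
REVISION UPDATED       STATUS     CHART           APP VERSION DESCRIPTION
1        Tue Apr 21... superseded go-demo-9-0.0.1 0.0.1       Install complete
2        Tue Apr 21... deployed   go-demo-9-0.0.2 0.0.2       Upgrade complete
EOF
```

Against a live cluster, the input would come from helm --namespace production history go-demo-9, and the printed number would be appended to the rollback command (for example, rollback go-demo-9 1).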
What Did We Do Wrong?

There are quite a few things that we could have done better, but we didn’t. That would have forced us to go astray from the main subject. Instead, we will go through a few notes that might prove to be useful in your usage of Helm.

We upgraded the release directly in production, without going through staging, or whichever other environments we might have. Think of that as a shortcut meant to avoid distracting us from the subject of this chapter, and not as something you should do. When working on “real” projects, you should follow the process, whatever it is, instead of taking shortcuts. If the process is not right, change it, instead of skipping the steps.

We should not apply changes to clusters directly. Instead, we should be pushing them to Git and applying only those things that were reviewed and merged to the master branch.

While on the subject of Git, we should keep Helm charts close to applications, preferably in the same repository where the code and other application-specific files are located. Or, to be more precise, we should be doing that for the “main” chart, while environment-specific values and dependencies should probably be in the repositories that define those environments, no matter whether they are Namespaces or full-blown clusters.

Finally, we should not run helm commands at all, except, maybe, while developing. They should all be automated through whichever continuous delivery (CD) tool you’re using. The commands are simple and straightforward, so you shouldn’t have a problem extending your pipelines.
Destroying The Resources

We will not need the changes we made in the my-new-feature branch, so let’s stash them, check out the master branch, and remove the feature branch.

git stash

git checkout master

git branch -d my-new-feature
The next steps will depend on whether you’re planning to destroy the cluster or to keep it running. If you choose the latter, please execute the commands that follow to delete the Namespaces we created and everything that’s in them.

If you’re using Docker Desktop and you’re not planning to reset the Kubernetes cluster, it is better to execute the commands that follow regardless of whether you’ll keep the cluster running or shut it down.

kubectl delete namespace staging

kubectl delete namespace production

Let’s go back to the root of the local repository.

cd ../

If you chose to destroy the cluster, feel free to use the commands at the bottom of the Gist you used to create it. Finally, let’s get out of the local repository, and get back to where we started.
cd ../
Using Helm As A Package Manager For Kubernetes

Every operating system has at least one packaging mechanism. Some Linux distributions use RPM or APT. Windows uses Chocolatey, and macOS uses Brew. Those are all package mechanisms for operating systems. But why does that matter in the context of Kubernetes?

You might say that Kubernetes is a scheduler, while Windows, Linux, and macOS are operating systems. You would be right in thinking so, but that would be only partly true. We can think of Kubernetes as an operating system for clusters, while Linux, Windows, and macOS are operating systems of individual machines. The scheduler is only one of the many features of Kubernetes and saying that it is an operating system for a cluster would be a more precise way to define it. It serves a similar purpose as, let’s say, Linux, except that it operates over a group of servers. As such, it needs a packaging mechanism designed to leverage its modus operandi.

While there are quite a few tools we can use to package, install, and manage applications in Kubernetes, none is as widely used as Helm. It is the de facto standard in the Kubernetes ecosystem. Helm uses “charts” to help us define, install, upgrade, and manage apps and all the surrounding resources. It simplifies versioning, publishing, and sharing of applications. It was donated to the Cloud Native Computing Foundation (CNCF)⁶⁶, thus landing in the same place as most other projects that matter in the Kubernetes ecosystem.

⁶⁶https://www.cncf.io/

Helm is mighty, yet simple to use. If we’d need to explain Helm in a single sentence, we could say that it is a templating and packaging mechanism for Kubernetes resources. But it is much more than that, even though those are the primary objectives of the project. It organizes packages into charts. We can create them from scratch with the helm create command, or we can convert existing Kubernetes definitions into Helm templates.

Helm uses a naming convention, which, as long as we spend a few minutes learning it, simplifies not only creation and maintenance of packages, but also navigation through those created by others. In its simplest form, the bulk of the work in Helm is about defining variables in values.yaml, and injecting them into Kubernetes YAML definitions by using entries like {{ .Values.ingress.host }}, instead of hard-coded values.

But Helm is not only about templating Kubernetes definitions but also about adding dependencies as additional charts. Those can be charts we developed and packaged, or charts maintained by others, usually for third-party software. The latter case is compelling since it allows us to leverage collective effort and knowledge of a much wider community.

One of the main advantages of Helm is that it empowers us to have variations of our applications, without the need to change their definitions. It accomplishes that by allowing us to define different properties of our application running in different environments. It does that through the ability to overwrite default values, which, in turn, modify the definitions propagated to Kubernetes. As such, it greatly simplifies the process of deploying variations of our applications to different environments, which can be separate Namespaces or even other clusters.

Once a chart is packaged, it can be stored in a registry and easily retrieved and used by people and teams that have sufficient permissions to access it.

On top of those main features, it offers many other goodies, like, for example, the mechanism to roll back releases. There are many others, and it can take quite some time to learn them all. Fortunately, it takes almost no effort to get up-to-speed with those that are most commonly used. It is a helpful and easy to learn tool that solves quite a few problems that were not meant to be solved with Kubernetes alone.

Now, I could continue rambling about Helm for quite some time, but that would be too boring. Instead, we will define a few requirements and see whether we can fulfill them with Helm.
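To make the idea of injecting values into definitions a bit more tangible before we touch a cluster, here is a deliberately simplified sketch. It is not how Helm renders charts (Helm uses Go templates). It only imitates the substitution of a single placeholder with sed, through a hypothetical render_host function and a made-up one-line template, to show how one definition can produce different results depending on the value that is supplied.

```shell
# Toy illustration only: Helm itself renders charts with Go templates.
# render_host is a hypothetical function that substitutes one placeholder.
render_host() {
    # $1 = value to inject; the template is read from stdin.
    sed "s@{{ \.Values\.ingress\.host }}@$1@g"
}

# Made-up, one-line "template" using the placeholder mentioned in the text.
template='- host: {{ .Values.ingress.host }}'

# A default value, as it might be defined in the chart itself.
printf '%s\n' "$template" | render_host "go-demo-9.acme.com"

# An environment-specific override, as it might be used in staging.
printf '%s\n' "$template" | render_host "staging.go-demo-9.acme.com"
```

The template stays the same in both cases. Only the value changes, which is the essence of running variations of an application without changing its definition.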
Defining A Scenario

I like to start with objectives before diving into any specific tools. That helps me evaluate whether it fits my needs, or if I should look elsewhere. Yours are likely going to be different than mine. Nevertheless, there is a common theme for (almost) all of us, and the differences tend to be smaller than we think. They are often details, rather than substantial differentiation.

Most of us are deploying applications to multiple environments. We might have some personal development environment. We might have others for previews generated when creating pull requests, and a few permanent environments like, for example, staging and production. The number might differ from one company to another, and we might call them differently. Nevertheless, we all tend to have more than one environment, with some being dynamic (e.g., personal environments) and others being permanent (e.g., production). What matters is that applications in different environments tend to have different needs.

On the one hand, it would be great if a release in each of the environments is exactly the same. That would give us an additional peace of mind knowing that what we’re testing is exactly the same as what will run in production. That’s the promise of containers. A container based on an image should be the same no matter where it runs.

On the other hand, we also want to be pragmatic and cost-effective. Those two needs are often at odds. We cannot use the same address when we want to access an application in, let’s say, staging and production. One could be staging.acme.com, while the other might be only acme.com. Even though that’s an insignificant difference, it already illustrates that what we run in one environment is not exactly the same as what we’re running in another.

Similarly, it would be too expensive to run the same number of replicas in a personal environment as in production. If the latter has a hundred replicas, it would be too expensive to use the same amount while developing.
There are quite a few other differences. We tend to keep applications in different environments as similar as possible while being realistic in what makes sense and which trade-offs are worth making. So, let’s try to define a scenario of what we might need for an application. That might not fully fit your needs, but it should be close enough for you to get an understanding that will serve as a base that you should be able to extend on your own. To begin with, we will need three different types of environments. We’ll create one for personal development, and we’ll call it dev. Every member of our team would have its own, and they would all be temporary. That would allow us to be productive while being cost-conscious at the same time. We’ll create such environments when we need them, and destroy them when we don’t. Normally, we will also need environments where we will deploy applications as a result of making pull requests. They would also need to be temporary. Just like dev environments, they would be generated when pull requests are created and destroyed when PRs are merged or closed. Conceptually, those would be almost the same as dev environments, so we’ll skip them in our examples. Further on, we will need at least one permanent environment that would serve as production. While that could be enough, more often than not, we need at least another similar environment where we will deploy applications for, let’s say, integration tests. The goal is often to promote an application from one environment to another until it reaches production. There could be more than one permanent environment, but we should be able to illustrate the differences with two. So, we will have staging and production environments. Now that we established that we’ll have three environments (dev, staging, and production), we should explore what should be the differences in how we run an application in each of those. But, before we do that, let’s go quickly through the application itself. 
The demo application we will use in our examples will be a relatively simple one. It’s an API that needs to be accessible from outside the cluster. It is scalable, even though we might not always run multiple replicas. It uses a database, which is scalable. Since the database is stateful, it needs storage.

The app is called go-demo-9. The reason for such a name lies in my inability to be creative. So, I call all my demo applications go-demo with an index as a suffix. The one I used in the previous book was called go-demo-8, so the one we’ll use in this one is go-demo-9. My creativity does not go beyond increasing a number used as a suffix to an already dull name.

Now, let’s see how that application should behave in each of the environments.

When running in a personal development or a preview environment, the domain of the application needs to be dynamic. If, for example, my GitHub username is vfarcic, the domain through which I might want to access the application could be vfarcic.go-demo-9.acme.com. That way, it would be unique, and it would not clash with the same application running in John Doe’s environment since he’d use a domain jdoe.go-demo-9.acme.com.

Further on, there should be no need to have more than one replica of the application or the database. I wouldn’t expect to have the production setup for the development. Given that one replica is enough, there’s probably no need to have HorizontalPodAutoscaler either.
Finally, there’s no need to have persistent storage for the database in the development environment. That database would be shut down when I finish working on the application, and it should be OK if I always start with a fresh default dataset.

Let’s move into permanent environments.

The domains could be fixed to staging.go-demo-9.acme.com when running in staging, and go-demo-9.acme.com when in production. Those are permanent environments, so the domains can be permanent as well.

The application running in staging should be almost the same as when running in production. That would give us higher confidence that what we test is what will run in production. There’s no need to be on the same scale, but it should have all the elements that constitute production. As such, the API should have HorizontalPodAutoscaler (HPA) enabled, and the database should have persistent storage in both environments.

On the other hand, if the API in production could oscillate between three and six replicas, two should be enough as the minimum in staging. That way, we will be able to confirm that HPA works in both and that running multiple replicas works as expected, while, at the same time, we will not spend more money than needed for staging.

The database in production will run as two replicas, and that means that we should have an equal number in staging as well. Otherwise, we’d risk not being able to validate whether database replication works, before we promote changes to production.

The summarized requirements can be seen in the table that follows.

Feature        Dev/PR                        Staging                     Production
Ingress host   [GH_USER].go-demo-9.acme.com  staging.go-demo-9.acme.com  go-demo-9.acme.com
HPA            false                         true                        true
App replicas   1                             2                           3-6
DB replicas    1                             2                           2
DB persistence false                         true                        true
Those are all the requirements. You might have others for your application, but those should be enough to demonstrate how Helm works, and how to make applications behave in different environments. Now, let us see whether we can package, deploy, and manage our application in a way that fulfills those requirements. But, first, we need to make sure that we have the prerequisites required for the exercises that follow.
Preparing For The Exercises

All the commands from this chapter are available in the 02-helm.sh⁶⁷ Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.

⁶⁷https://gist.github.com/c9e05ce1b744c0aad5d10ee5158099fa

The code and the configurations that will be used in this chapter are available in the GitHub repository vfarcic/devops-catalog-code⁶⁸. Let’s clone it.

Feel free to skip the command that follows if you already cloned that repository.

git clone \
    https://github.com/vfarcic/devops-catalog-code.git

Next, we’ll go into the local copy of the repository, and, to be on the safe side, we’ll pull the latest revision just in case you already had the repository from before, and I changed something in the meantime.

cd devops-catalog-code

git pull
We’ll need a Kubernetes cluster, with NGINX Ingress controller, and the environment variable INGRESS_HOST with the address through which we can access applications that we’ll deploy inside the cluster. If you meet those requirements, you should be able to use any Kubernetes cluster. However, bear in mind that I tested everything in Docker Desktop, Minikube, Google Kubernetes Engine (GKE), Amazon Kubernetes Service (EKS), and Azure Kubernetes Service (AKS). That does not mean that you cannot use a different Kubernetes flavor. You most likely can, but I cannot guarantee that without testing it myself. For your convenience, I created scripts that will create a Kubernetes cluster in the flavors I mentioned. All you have to do is follow the instructions from one of the Gists that follows. The Gists for GKE, EKS, and AKS, assume that you followed the exercises for using Terraform. If you didn’t, you might want to go through the Infrastructure as Code (IaC) chapter first. Or, if you are confident in your Terraform skills, you might skip that chapter, but, in that case, you might need to make a few modifications to the Gist you choose.
• Docker Desktop: docker.sh⁶⁹
• Minikube: minikube.sh⁷⁰
• GKE: gke.sh⁷¹
• EKS: eks.sh⁷²
• AKS: aks.sh⁷³

⁶⁸https://github.com/vfarcic/devops-catalog-code
⁶⁹https://gist.github.com/9f2ee2be882e0f39474d1d6fb1b63b83
⁷⁰https://gist.github.com/2a6e5ad588509f43baa94cbdf40d0d16
⁷¹https://gist.github.com/68e8f17ebb61ef3be671e2ee29bfea70
⁷²https://gist.github.com/200419b88a75f7a51bfa6ee78f0da592

You will also need Helm CLI. If you do not have it already, please visit the Installing Helm⁷⁴ page and follow the instructions for your operating system.

The only thing missing is to go to the helm directory, which contains all definitions we’ll use in this chapter.
cd helm
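Before going further, the prerequisites listed above can be checked from the terminal. What follows is only a minimal sketch and is not part of the Gists: the missing_tools helper is a hypothetical name, and the check confirms only that the CLIs are on the PATH and that INGRESS_HOST is set, not that the cluster itself is reachable.

```shell
# Hypothetical helper (not from the Gists): prints each tool that is
# NOT available on the PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The chapter relies on these three CLIs.
MISSING=$(missing_tools git kubectl helm)

if [ -n "$MISSING" ]; then
    # Left unquoted on purpose so that the names end up on a single line.
    echo "Missing tools:" $MISSING
fi

# INGRESS_HOST must hold the address through which the cluster is reachable.
if [ -z "${INGRESS_HOST:-}" ]; then
    echo "INGRESS_HOST is not set"
fi
```

If the script prints nothing, the tools are installed and the variable is defined.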
Creating Helm Charts

Helm uses a packaging format called charts. A chart is a collection of files that describe a related set of Kubernetes resources. Helm relies heavily on naming conventions, so charts are created as files laid out in a particular directory tree, and with some of the files using pre-defined names. Charts can be packaged into versioned archives to be deployed. But that’s not our current focus. We’ll explore packaging later. For now, our goal is to create a chart. We can create a basic one through the CLI.
helm create my-app
Helm created a directory with the same name as the one we specified. Let’s see what we got.

ls -1 my-app

The output is as follows.

Chart.yaml
charts
templates
values.yaml
You probably expect me to explain what each of those files and directories means. I will do that, but not through the chart we created. It is too simple for our use case, and we’d need to change quite a few things. Instead of doing that, we’ll explore a chart I prepared. For now, just remember that you can easily create new charts. Through my example, we’ll explore the most important files and directories, and what they’re used for.

Given that we will not use the my-app chart we created, we’ll delete the whole directory. It served its purpose of demonstrating how to create new charts, and not much more.

rm -rf my-app

⁷³https://gist.github.com/0e28b2a9f10b2f643502f80391ca6ce8
⁷⁴https://helm.sh/docs/intro/install/
The chart we’ll use is in the go-demo-9 subdirectory. Let’s see what’s inside.

ls -1 go-demo-9

The output is as follows.

Chart.yaml
charts
requirements.yaml
templates
values.yaml
You’ll notice that the files and the directories are the same as those we got when we created a new chart. As I already mentioned, Helm relies on naming conventions and expects things to be named the way it likes. The only difference between my chart and the one we created earlier is that now we have an additional file, requirements.yaml. We’ll get to it later.

In most cases, a repo like the one we’re using is not a good place for Helm charts. More often than not, charts should be stored in the same repository as the applications they define. We have this one in the same repository as all the other examples we’re using in this book, mostly for simplicity.
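Since Helm depends on those names, a quick look at a directory is enough to tell whether it resembles a chart. The sketch below is an illustration only: looks_like_chart is a hypothetical helper, it treats Chart.yaml as the one file it insists on, and it merely reports which of the other conventional entries are present. It does not validate their contents.

```shell
# Hypothetical helper: succeeds if the directory contains Chart.yaml,
# and lists which of the other conventional entries are present.
looks_like_chart() {
    dir=$1
    [ -f "$dir/Chart.yaml" ] || { echo "$dir: no Chart.yaml"; return 1; }
    for entry in values.yaml templates charts; do
        [ -e "$dir/$entry" ] && echo "$dir: found $entry"
    done
    return 0
}
```

Running looks_like_chart go-demo-9 from the helm directory should list the entries we just saw. For an actual validation of a chart, helm lint go-demo-9 is the better tool.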
Let’s take a look at the first file.

cat go-demo-9/Chart.yaml

The output is as follows.

apiVersion: v1
description: A Helm chart
name: go-demo-9
version: 0.0.1
appVersion: 0.0.1

Chart.yaml contains meta-information about the chart. It is mostly for Helm’s internal use, and it does not define any Kubernetes resources.

The apiVersion is set to v1. The alternative would be to set it to v2, which would indicate that the chart is compatible only with Helm 3. Everything we’ll use is compatible with earlier Helm versions, so we’re keeping v1 as a clear indication that the chart can be used with any Helm, at least at the time of this writing (May 2020).

The description and the name should be self-explanatory, so we’ll skip those.

The version field is mandatory, and it defines the version of the chart. The appVersion, on the other hand, is optional, and it contains the version of the application that this chart defines. What matters is that the version must use semantic versioning 2⁷⁵. The appVersion is free-form as far as Helm is concerned, although following the same scheme keeps things tidy.

There are a few other fields that we could have defined, but we didn’t. I’m assuming that you will read the full documentation later on. This is a quick dive into Helm, and it is not meant to show you everything you can do with it.

Let’s see what’s inside the templates directory.
ls -1 go-demo-9/templates
The output is as follows.

NOTES.txt
_helpers.tpl
deployment.yaml
hpa.yaml
ingress.yaml
service.yaml
That is the directory where the action is defined. There’s NOTES.txt, which contains templated usage information that will be output when we deploy the chart. The _helpers.tpl file defines “template” partials or, to put it in other words, functions that can be used in templates. We’ll skip explaining both by giving you the homework to explore them later on your own.

The rest of the files define templates that will be converted into Kubernetes resource definitions. Unlike most other Helm files, those can be named any way we like. If you’re already familiar with Kubernetes, you should be able to guess what’s in those files from their names. There is a Kubernetes Deployment (deployment.yaml), HorizontalPodAutoscaler (hpa.yaml), Ingress (ingress.yaml), and Service (service.yaml).

Let’s take a look at the deployment.yaml file.
cat go-demo-9/templates/deployment.yaml
The output is as follows.

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ template "fullname" . }}
  labels:
    app: {{ template "fullname" . }}
spec:
  selector:
    matchLabels:
      app: {{ template "fullname" . }}
  template:
    metadata:
      labels:
        app: {{ template "fullname" . }}
{{- if .Values.podAnnotations }}
      annotations:
{{ toYaml .Values.podAnnotations | indent 8 }}
{{- end }}
    spec:
      containers:
      - name: {{ .Chart.Name }}
        image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        env:
        - name: DB
          value: {{ template "fullname" . }}-db
        - name: VERSION
          value: {{ .Values.image.tag }}
        ports:
        - containerPort: {{ .Values.service.internalPort }}
        livenessProbe:
          httpGet:
            path: {{ .Values.probePath }}
            port: {{ .Values.service.internalPort }}
          initialDelaySeconds: {{ .Values.livenessProbe.initialDelaySeconds }}
          periodSeconds: {{ .Values.livenessProbe.periodSeconds }}
          successThreshold: {{ .Values.livenessProbe.successThreshold }}
          timeoutSeconds: {{ .Values.livenessProbe.timeoutSeconds }}
        readinessProbe:
          httpGet:
            path: {{ .Values.probePath }}
            port: {{ .Values.service.internalPort }}
          periodSeconds: {{ .Values.readinessProbe.periodSeconds }}
          successThreshold: {{ .Values.readinessProbe.successThreshold }}
          timeoutSeconds: {{ .Values.readinessProbe.timeoutSeconds }}
        resources:
{{ toYaml .Values.resources | indent 12 }}
      terminationGracePeriodSeconds: {{ .Values.terminationGracePeriodSeconds }}

⁷⁵https://semver.org/
We will not discuss specifics of that Deployment, or anything else directly related to Kubernetes. I will assume that you have at least basic Kubernetes knowledge. If that’s not the case, you might want to learn a bit about it first. One possible source of information could be The DevOps 2.3 Toolkit: Kubernetes⁷⁶.
If you ignore the entries surrounded with curly braces ({{ and }}), that would be a typical definition of a Kubernetes Deployment. The twist is that we replaced parts of it with variables, functions, and conditionals, each surrounded by curly braces. The templates (like the one in front of you) will be converted into Kubernetes manifests, which are YAML-formatted resource descriptions. In other words, those templates will be converted into typical Kubernetes YAML files.

Templating itself is a combination of the Go template language⁷⁷ and the Sprig template library⁷⁸. It might take a while to learn them both, but the good news is that you might never have to go that deep. Most of the Helm definitions you will find, and most of those you will define, use only a few simple constructs.

Most of the values inside curly braces start with .Values. For example, we have an entry like path: {{ .Values.probePath }}. That means that the value of the path entry will be, by default, the value of the variable probePath defined in values.yaml. Pay attention that the previous sentence said “by default”. We’ll see later what that really means.

Let’s take a look at values.yaml, and try to locate probePath.
cat go-demo-9/values.yaml
The output, limited to the relevant parts, is as follows.

...
probePath: /
...

⁷⁶https://www.devopstoolkitseries.com/posts/devops-23/
⁷⁷https://godoc.org/text/template
⁷⁸https://masterminds.github.io/sprig/

If we do not overwrite that variable, the entry path: {{ .Values.probePath }} in templates/deployment.yaml will be converted into path: /.

We could easily spend a few chapters only on Helm templating, but we won’t. This is supposed to be a quick dive into Helm. We have a few objectives we decided to accomplish, so let’s get back to them.

We already said that the Ingress host should be different depending on the environment where the application will run. Let’s see how we can accomplish that through Helm.
cat go-demo-9/templates/ingress.yaml
The output is as follows.

---

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: {{ template "fullname" . }}
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
  - host: {{ .Values.ingress.host }}
    http:
      paths:
      - backend:
          serviceName: {{ template "fullname" . }}
          servicePort: {{ .Values.service.externalPort }}
As you can see, that is a typical Ingress definition. I’m still assuming that you have basic Kubernetes knowledge and that, therefore, you know what Ingress is. What makes that definition “special” is that a few values that would typically be hard-coded are changed to variables and templates (functions). The important part, in the context of our objectives, is that the first (and only) entry of the spec.rules array has host set to {{ .Values.ingress.host }}. Let’s try to locate that in the values.yaml file.
cat go-demo-9/values.yaml
The output, limited to the relevant parts, is as follows.

...
ingress:
  host: go-demo-9.acme.com
...
We can see that there is the variable host nested inside the ingress entry and that it is set to go-demo-9.acme.com. That is the host we decided to use for production, and it represents the pattern I prefer to follow. More often than not, I define charts in a way that the default values (those defined in values.yaml) always represent an application in production. That allows me to have easy insight into the “final” state of the application while keeping the option to overwrite those values for other environments. We’ll see later how to do that.

Another objective we have is to enable or disable HorizontalPodAutoscaler, depending on the environment of the application. So, let’s take a quick look at its definition.
cat go-demo-9/templates/hpa.yaml
The output is as follows.

---

{{- if .Values.hpa.enabled }}
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: {{ template "fullname" . }}
  labels:
    app: {{ template "fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ template "fullname" . }}
  minReplicas: {{ .Values.hpa.minReplicas }}
  maxReplicas: {{ .Values.hpa.maxReplicas }}
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: {{ .Values.hpa.cpuTargetAverageUtilization }}
  - type: Resource
    resource:
      name: memory
      targetAverageUtilization: {{ .Values.hpa.memoryTargetAverageUtilization }}
{{- end }}
Since we want to be able to control whether HPA should or shouldn’t be created depending on the environment, we wrapped the whole definition in a block that starts with {{- if .Values.hpa.enabled }} and ends with {{- end }}. It is the Go template equivalent of a simple if statement. The definition inside that block will be used only if the hpa.enabled value is set to true.

We also said that, when HPA is indeed used, it should be configured to have a minimum of two replicas in staging, and that it should oscillate between three and six in production. We’re accomplishing that through the minReplicas and maxReplicas entries set to the hpa.minReplicas and hpa.maxReplicas values.

Let’s confirm that those variables are indeed defined in values.yaml, and that the values are set to what we expect to have in production.
cat go-demo-9/values.yaml
The output, limited to the relevant parts, is as follows.

...
hpa:
  enabled: true
  minReplicas: 3
  maxReplicas: 6
...
As we can see, if we don’t overwrite the default values, HPA will be enabled, the minimum number of replicas will be 3, and the maximum will be 6. That’s what we said we should have in production. There’s one more thing we need to explore before we deploy our application. We need to figure out how to add the database as a dependency.
Adding Application Dependencies

As you already saw, Helm allows us to define templates. While that works great for our applications, it might not be the best idea for third-party apps. Our application requires a database. To be more precise, it needs MongoDB. Now, you might say, “it should be easy to define the resources for MongoDB”, but that could quickly turn into a false statement. Running MongoDB is not only about creating a Kubernetes StatefulSet and a Service. It’s much more than that. We might need to have a Deployment when it is running as a single replica, and a StatefulSet when it is running in ReplicaSet mode. We might need to set up autoscaling. We might need an operator that will join replicas into a ReplicaSet. We might need different storage options, and we might need to be able to choose not to use storage at all. There are many other things that we might need to define. Even worse, we might need to write a custom operator that would tie different resources together and make sure that they are working as expected.

But, the real question is not whether we could define everything we need to deploy and manage MongoDB. Rather, the question is whether that is a worthwhile investment. More often than not, it is not worth our time. Whenever possible, we should focus on what brings differentiating value. That, in most cases, means that we should focus mostly on developing, releasing, and running our own applications, and using services from other vendors and community knowledge for everything else. MongoDB is not an exception.

Helm has access to a massive library of charts maintained both by the community and vendors. All we have to do is find the one we need, and add it as a dependency.

Let’s see yet another YAML file.
cat go-demo-9/requirements.yaml
The output is as follows.

dependencies:
- name: mongodb
  alias: go-demo-9-db
  version: 7.13.0
  repository: https://charts.bitnami.com/bitnami
The requirements.yaml file is optional. If we do have it, we can use it to specify one or more dependencies. In our case, there’s only one.

The name of the dependency must match the name of the chart that we want to use as a dependency. However, that alone might result in conflicts, given that multiple applications running in the same Namespace might use the same chart as a dependency. To avoid such a potential issue, we specified the alias that provides the unique identifier that should be used when deploying that chart instead of the “official” name of the chart.

Further on, we can see that we are using a specific version of that chart and that it is defined in a specific repository.

That is indeed an easy way to add almost any third-party application as a dependency. Furthermore, we could use the same mechanism to add internal applications as dependencies. But the real question is how I knew which values to add. How did I know that the name of the chart is mongodb? How did I figure out that the version is indeed 7.13.0, and that the chart is in that repository? Let’s go a few steps back and explore the process that made me end up with that specific config.
The first step in finding a chart is often to search for it. That can be done through a simple Google search like “MongoDB Helm chart”, or through the helm search command. I tend to start with the latter and resort to Google only if what I need is not that easy to find.

helm repo add stable \
    https://kubernetes-charts.storage.googleapis.com

helm search repo mongodb
We added the stable repository. We’ll explore repositories in more detail soon. For now, the only thing that matters is that it is the location where most of the charts are located. (The stable repository has since been archived. If the URL above no longer responds, the same charts are served from https://charts.helm.sh/stable.) Further on, we searched for mongodb in all the repositories we currently have.

The output, in my case, is as follows (yours might differ).
NAME                               CHART VERSION APP VERSION DESCRIPTION
stable/mongodb                     7.8.10        4.2.4       DEPRECATED NoSQL document-oriented database tha...
stable/mongodb-replicaset          3.15.0        3.6         NoSQL document-oriented database that stores JS...
stable/prometheus-mongodb-exporter 2.4.0         v0.10.0     A Prometheus exporter for MongoDB metrics
stable/unifi                       0.7.0         5.12.35     Ubiquiti Network's Unifi Controller
There are quite a few things we can observe from that output. To begin with, all the charts, at least those related to mongodb, are coming from the stable repo. That’s the repository that is often the best starting point when searching for a Helm chart.

We can see that there are at least four charts that contain the word mongodb. But, judging from the names, the stable/mongodb chart sounds like something we might need. Further on, we have the latest version of the chart (CHART VERSION) and the version of the application it uses (APP VERSION).

Finally, we have a description, and the one for the mongodb chart is kind of depressing. It starts with DEPRECATED, giving us a clear indication that it is no longer maintained.

Let’s take a quick look at the chart’s README and see whether we can get a clue as to why it was deprecated.
helm show readme stable/mongodb
If we scroll up to the top, we’ll see that there is a whole sub-section with the header This Helm chart is deprecated. In a nutshell, it tells us that the maintenance of the chart has moved to Bitnami’s repository, followed by short instructions on how to add and use their repository.

While that might be seen as an additional complication, such a situation is great, since it provides me with the perfect opportunity to show you how to add and use additional repositories.

The first step is to add the bitnami repository to the Helm CLI, just as the instructions tell us.

helm repo add bitnami \
    https://charts.bitnami.com/bitnami
Next, we will confirm that the repo was indeed added by listing all those available locally through the Helm CLI.

helm repo list

The output is as follows.

NAME    URL
stable  https://kubernetes-charts.storage.googleapis.com/
bitnami https://charts.bitnami.com/bitnami
Let’s see what happens if we search for MongoDB again.
helm search repo mongodb
The output is as follows.

NAME                               CHART VERSION APP VERSION DESCRIPTION
bitnami/mongodb                    7.13.0        4.2.6       NoSQL document-oriented database that stores JS...
bitnami/mongodb-sharded            1.1.4         4.2.6       NoSQL document-oriented database that stores JS...
stable/mongodb                     7.8.10        4.2.4       DEPRECATED NoSQL document-oriented database tha...
stable/mongodb-replicaset          3.15.0        3.6         NoSQL document-oriented database that stores JS...
stable/prometheus-mongodb-exporter 2.4.0         v0.10.0     A Prometheus exporter for MongoDB metrics
bitnami/mean                       6.1.1         4.6.2       MEAN is a free and open-source JavaScript softw...
stable/unifi                       0.7.0         5.12.35     Ubiquiti Network's Unifi Controller
We can see that, besides those from the stable repo, we got a few new results from bitnami. Please note that one of the charts is bitnami/mongodb and that the chart version is, in my case, 7.13.0 (yours might be newer).

Let’s take another look at the requirements.yaml file we explored earlier.

cat go-demo-9/requirements.yaml

The output is as follows.

dependencies:
- name: mongodb
  alias: go-demo-9-db
  version: 7.13.0
  repository: https://charts.bitnami.com/bitnami
The dependency we explored earlier should now make more sense. The repository is the same as the one we added earlier, and the version matches the latest one we observed through my output of helm search.

But we’re not yet done with the mongodb dependency. We might need to customize it to serve our needs. Given that we’re trying to define production values as defaults in values.yaml, and that we said that the database should be replicated, we might need to add a value or two to make that happen. So, the next step is to explore which values we can use to customize the mongodb chart. We can easily retrieve all the values available in a chart through yet another helm command.

helm show values bitnami/mongodb \
    --version 7.13.0
The output, limited to the relevant parts, is as follows.

...
replicaSet:
  ## Whether to create a MongoDB replica set for high availability or not
  enabled: false
...
We can see that MongoDB replication (ReplicaSet) is disabled by default. All we’d have to do to make it replicated is to change the replicaSet.enabled value to true. And I already did that for you in the values.yaml file, so let’s take another look at it.
cat go-demo-9/values.yaml
The output, limited to the relevant parts, is as follows.

...
go-demo-9-db:
  replicaSet:
    enabled: true
This time, we are not trying to define a value of the main application, but of one of its dependencies. So, the values related to MongoDB are prefixed with go-demo-9-db. That matches the alias we defined for the dependency. Within that segment, we set replicaSet.enabled to true. As you will see soon, when we deploy the application with the go-demo-9-db dependency, the database will be replicated.
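The requirements.yaml file belongs to charts with apiVersion set to v1. For a chart that targets only Helm 3 (apiVersion v2), the same dependency is declared directly in Chart.yaml. The sketch below writes such a file to a scratch directory so that the go-demo-9 chart stays untouched. It is an illustration of the v2 layout and not a file from the book’s repository.

```shell
# Write an illustrative Helm 3 (apiVersion: v2) Chart.yaml to a scratch
# directory. Nothing in the go-demo-9 chart is modified.
SCRATCH=$(mktemp -d)

cat > "$SCRATCH/Chart.yaml" <<'EOF'
apiVersion: v2
description: A Helm chart
name: go-demo-9
version: 0.0.1
appVersion: 0.0.1
# With apiVersion v2, dependencies live here instead of requirements.yaml.
dependencies:
- name: mongodb
  alias: go-demo-9-db
  version: 7.13.0
  repository: https://charts.bitnami.com/bitnami
EOF

cat "$SCRATCH/Chart.yaml"
```

The values for the dependency would still be prefixed with the alias (go-demo-9-db) in values.yaml, exactly as in the v1 chart.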
Deploying Applications To Production

It might sound strange that we’re starting with production. It would make much more sense to deploy it first to a development environment, from there to promote it to staging, and only then to run it in production. From the application lifecycle perspective, that would be, more or less, the correct flow of events. Nevertheless, we’ll start with production, because that is the easiest use case. Since the default values match what we want to have in production, we can deploy the application to production as-is, without worrying about the tweaks like those we’ll have to make for development and staging environments.

Normally, we’d split environments into different Namespaces or even different clusters. Since the latter would require us to create new clusters, and given that I want to keep the cost to the bare minimum, we’ll stick with Namespaces as a way to separate the environments. The process would be mostly the same if we’d run multiple clusters, and the only substantial difference would be in kubeconfig, which would need to point to the desired Kube API.

We’ll start by creating the production Namespace.
kubectl create namespace production
We’ll imagine that we do not know whether the application has any dependencies, so we will retrieve the list and see what we might need.

helm dependency list go-demo-9

The output is as follows.

NAME    VERSION REPOSITORY                         STATUS
mongodb 7.13.0  https://charts.bitnami.com/bitnami missing

We can see that the chart has only one dependency. We already knew that. What stands out in that list is that the status is missing. We need to download the missing dependency before we try to apply the chart. One way to do that is by updating all the dependencies of the chart.

helm dependency update go-demo-9

The output is as follows.

Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "bitnami" chart repository
...Successfully got an update from the "stable" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 1 charts
Downloading mongodb from repo https://charts.bitnami.com/bitnami
Deleting outdated charts

We can confirm that it is now available by re-listing the dependencies.

helm dependency list go-demo-9

The output is as follows.

NAME    VERSION REPOSITORY                         STATUS
mongodb 7.13.0  https://charts.bitnami.com/bitnami ok
We can see that the status is now ok, meaning that the only dependency of the chart is ready to be used. We can install charts through helm install, or we can upgrade them with helm upgrade. But I don’t like using those commands since that would force me to find out the current status. If the application is already installed, helm install will fail. Similarly, we would not be able to upgrade the app if it was not already installed. In my experience, it is best if we don’t worry about the current status and tell Helm to upgrade the chart if it’s already installed or to install it if it isn’t. That would be equivalent to the kubectl apply command.
helm --namespace production \
    upgrade --install \
    go-demo-9 go-demo-9 \
    --wait \
    --timeout 10m
The output is as follows.

Release "go-demo-9" does not exist. Installing it now.
NAME: go-demo-9
LAST DEPLOYED: Tue Apr 21 21:41:16 2020
NAMESPACE: production
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Get the application URL by running these commands:

kubectl get ingress go-demo-9-go-demo-9
We applied the chart. Since we used the --install argument, Helm figured out whether it should upgrade it or install it. We named the applied chart go-demo-9, and we told it to use the directory with the same name as the source. The --wait argument made the process last longer because it forced Helm to wait until all the Pods are running and healthy.

It is important to note that Helm is Namespace-scoped. We had to specify --namespace production to ensure that’s where the resources are created. Otherwise, it would use the default Namespace.

If you are in doubt about which charts are running in a specific Namespace, you can retrieve them through the helm list command.
helm --namespace production list
The output is as follows.

NAME      NAMESPACE  REVISION UPDATED        STATUS   CHART           APP VERSION
go-demo-9 production 1        2020-04-21 ... deployed go-demo-9-0.0.1 0.0.1
You should be able to figure out the meaning of each of the columns, so we’ll move on right away. We set some objectives for the application running in production. Let’s see whether we fulfilled them. As a refresher, they are as follows.
• Ingress host should be go-demo-9.acme.com.
• HPA should be enabled.
• The number of replicas of the application should oscillate between three and six.
• The database should have two replicas.
• The database should have persistent volumes attached to all the replicas.
Let’s see whether we accomplished those objectives. Is our application accessible through the host go-demo-9.acme.com?

kubectl --namespace production \
    get ingresses

The output is as follows.

NAME                HOSTS              ADDRESS       PORTS AGE
go-demo-9-go-demo-9 go-demo-9.acme.com 192.168.64.56 80    8m21s

The host does seem to be correct, and we should probably double-check that the application is indeed accessible through it by sending a simple HTTP request.

curl -H "Host: go-demo-9.acme.com" \
    "http://$INGRESS_HOST"
Since you probably do not own the domain acme.com, we’re “faking” it by injecting the Host header into the request.
The output is as follows.

Version: 0.0.1; Release: unknown

We got a response, so we can move to the next validation. Do we have HorizontalPodAutoscaler (HPA), and is it configured to scale between three and six replicas?

kubectl --namespace production \
    get hpa

The output is as follows.

NAME                REFERENCE                      TARGETS    MINPODS MAXPODS REPLICAS AGE
go-demo-9-go-demo-9 Deployment/go-demo-9-go-demo-9 /80%, /80% 3       6       3        9m58s

Is the database replicated as well, and does it have persistent volumes?

kubectl get persistentvolumes

The output is as follows.

NAME            CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM                  STORAGECLASS REASON AGE
pvc-86dc84f4... 8Gi      RWO          Delete         Bound  production/datadir-... standard            10m
pvc-f834886d... 8Gi      RWO          Delete         Bound  production/datadir-... standard            10m
We don’t need to waste our time by checking whether there are two replicas of the database. We can conclude that from persistent volumes since each of the two is attached to a different replica. Let’s move into development and pull request environments and see whether we can deploy the application there, but with different characteristics.
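The upgrade --install invocation we used for production can be reused for other Namespaces by passing additional arguments. The sketch below wraps it in a small function. The deploy name and the DRY_RUN switch are hypothetical conveniences, not something Helm or the book’s Gists provide. With DRY_RUN=1 the function only prints the command it would run, which makes it safe to try without a cluster.

```shell
# Hypothetical wrapper around the command used above.
# Usage: deploy NAMESPACE [extra helm arguments...]
# With DRY_RUN=1 it only prints the command instead of running it.
deploy() {
    namespace=$1
    shift
    # Rebuild the argument list as the full helm command line.
    set -- helm --namespace "$namespace" \
        upgrade --install \
        go-demo-9 go-demo-9 \
        "$@" \
        --wait \
        --timeout 10m
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Print the production command without executing it.
DRY_RUN=1 deploy production
```

Any extra arguments, such as --set or --values, are placed between the chart and the --wait argument, so the production call stays identical to the one we ran by hand.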
Deploying Applications To Development And Preview Environments

We saw how we can deploy an application and its dependencies to production. We saw that using Helm to deploy something to production is easy, given that we set all the default values to be those we want to use in production. We’ll move into a slightly more complicated scenario next.

We need to figure out how to deploy the application in temporary development and preview (pull request) environments. The challenge is that the requirements of our application in those environments are not the same. The host name should be dynamic. Given that every developer might have a different environment, and that there might be any number of open pull requests, the host of the application needs to be auto-generated, and always unique. We also need to disable HPA, to have a single replica of both the application and the DB dependency, and we don’t want persistent storage for something that is temporary and can exist for anything from minutes to days or weeks.

While the need to have different hosts is aimed at allowing multiple releases of the application to run in parallel, the rest of the requirements are mostly focused on cost reduction. There is probably no need to run a production-size application in development environments. Otherwise, we might go bankrupt, or we might need to reduce the number of such environments. We probably do not want either of those outcomes. Going bankrupt is terrible for obvious reasons, while not having the freedom to get an environment whenever we need it might severely impact our (human) performance.

Let’s get going. Let’s create a personal development environment just for you. We’ll start by defining a variable with your GitHub username. Given that each is unique, that will allow us to create a Namespace without worrying whether it will clash with someone else’s. We’ll also use it to generate a unique host.

Please replace [...] with your GitHub username or with any other unique identifier.
export GH_USER=[...]

kubectl create namespace $GH_USER
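One thing to watch for: Namespace names must be lowercase DNS labels, so the command above is rejected if the identifier contains uppercase letters or characters such as underscores. The sketch below normalizes a value before it is used. The to_dns_label helper is a hypothetical name, and it only lowercases and replaces unsupported characters. It does not enforce the length limit or the rule that a label must start and end with an alphanumeric character.

```shell
# Hypothetical helper: lowercases a name and replaces anything outside
# [a-z0-9-] with a dash, so it can be used as a Namespace name
# and as part of a host name.
to_dns_label() {
    printf '%s' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -c 'a-z0-9-' '-'
}

# Example with a made-up identifier.
to_dns_label "Jane_Doe"
```

If needed, the variable can be normalized in place with export GH_USER=$(to_dns_label "$GH_USER") before creating the Namespace.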
We created a new Kubernetes Namespace based on your GitHub username. Next, we’ll take another look at the values.yaml file.
cat go-demo-9/values.yaml
The output, limited to the relevant parts, is as follows.

image:
  repository: vfarcic/go-demo-9
  tag: 0.0.1
  pullPolicy: IfNotPresent
...
ingress:
  host: go-demo-9.acme.com

hpa:
  enabled: true
...
go-demo-9-db:
  replicaSet:
    enabled: true
We’ll need to change quite a few variables. To begin with, we probably do not want to use a specific tag of the image, but rather the latest one that we’ll be building whenever we want to see the results of our changes to the code. To avoid collisions with others, we might also want to change the repository. But, we’ll skip that part since it would introduce unnecessary complexity to our examples.

While we’re on the subject of images, we should consider changing pullPolicy to Always. Otherwise, we’d need to build a different tag every time we create a new image.

We should also define a unique host and disable hpa. Finally, we should disable the database replicaSet as well as persistence.

All those changes are likely going to be useful to everyone working on this application. So, we have a separate values.yaml file that can be used by anyone in need of a personal development environment.
cat dev/values.yaml
The output is as follows.

image:
  tag: latest
  pullPolicy: Always
hpa:
  enabled: false
go-demo-9-db:
  replicaSet:
    enabled: false
  persistence:
    enabled: false
That file contains all the values that we want to use to overwrite those defined as defaults in go-demo-9/values.yaml. The only variable missing is ingress.host. We could not pre-define it since it will differ from one person to another. Instead, we will assign it to an environment variable that we will use to set the value at runtime.
export ADDR=$GH_USER.go-demo-9.acme.com
Now we are ready to create the resources in the newly created personal Namespace.
helm --namespace $GH_USER \
    upgrade --install \
    --values dev/values.yaml \
    --set ingress.host=$ADDR \
    go-demo-9 go-demo-9 \
    --wait \
    --timeout 10m
We used the --namespace argument to ensure that the resources are created in the correct place. Moreover, we added two new arguments to the mix. The --values argument specified the path to dev/values.yaml that contains the environment-specific variables that should be overwritten. Further on, we passed the value of ingress.host through --set.

The output is almost the same as when we installed the application in production, so we can safely skip commenting on it. What matters is that the application is now (probably) running and that we can focus on confirming that the requirements for the development environment were indeed fulfilled.

As a refresher, we’re trying to accomplish the following objectives.

• Ingress host should be unique.
• HPA should be disabled.
• The application should have only one replica.
• The database should have only one replica.
• The database should NOT have persistent volumes attached.
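If you would like to see how the overrides combine before anything reaches the cluster, helm template renders the chart locally and prints the resulting manifests. Values passed through --set take precedence over those from --values, which in turn take precedence over the chart defaults. What follows is only a sketch. The render_dev function is a hypothetical wrapper, and it assumes the same chart directory and dev/values.yaml file we used above.

```shell
# Hypothetical helper: print the manifests Helm would generate for a personal
# environment, without creating anything in the cluster.
# $1 = Namespace, $2 = Ingress host
render_dev() {
    helm --namespace "$1" \
        template go-demo-9 go-demo-9 \
        --values dev/values.yaml \
        --set ingress.host="$2"
}
```

Running render_dev $GH_USER $ADDR and searching the output for HorizontalPodAutoscaler or for the host is a quick way to check the objectives above without touching the cluster.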
Let’s see whether we accomplished those objectives.

kubectl --namespace $GH_USER \
    get ingresses
The output, in my case, is as follows.

NAME                HOSTS                      ADDRESS       PORTS AGE
go-demo-9-go-demo-9 vfarcic.go-demo-9.acme.com 192.168.64.56 80    85s
We can confirm that the host is indeed unique (vfarcic.go-demo-9.acme.com), so let’s check whether the application is reachable through it.

curl -H "Host: $ADDR" \
    "http://$INGRESS_HOST"
We received the response, thus confirming that the application is indeed accessible. Let’s move on.
kubectl --namespace $GH_USER \
    get hpa
The output claims that no resources were found, so we can confirm that HPA was not created. How about the number of replicas?

kubectl --namespace $GH_USER \
    get pods
The output is as follows.

NAME                       READY STATUS  RESTARTS AGE
go-demo-9-go-demo-9-...    1/1   Running 0        7m47s
go-demo-9-go-demo-9-db-... 1/1   Running 0        7m47s
That also seems to be in line with our objectives. We’re running one replica for the API and one for the database. Finally, the only thing left to validate is that the database in the personal development environment did not attach any persistent volumes.
kubectl get persistentvolumes
The output is as follows.

NAME             CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM                  STORAGECLASS REASON AGE
pvc-86dc84f4-... 8Gi      RWO          Delete         Bound  production/datadir-... standard            25m
pvc-f834886d-... 8Gi      RWO          Delete         Bound  production/datadir-... standard            25m
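A narrower check is also possible. PersistentVolumeClaims, unlike PersistentVolumes, belong to a Namespace, so we can count the claims in a single Namespace instead of reading through a cluster-wide list. The function below is a minimal sketch of that idea. The count_pvcs name is hypothetical, and it assumes that kubectl is already pointing at the right cluster.

```shell
# Hypothetical helper: count the PersistentVolumeClaims in a Namespace.
# $1 = Namespace
count_pvcs() {
    # Errors and informational messages are discarded, so only claims are counted.
    kubectl --namespace "$1" \
        get persistentvolumeclaims \
        --no-headers 2>/dev/null \
        | wc -l \
        | tr -d ' '
}
```

For the personal development environment, count_pvcs $GH_USER should print 0, while count_pvcs production should match the two volumes listed above.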
PersistentVolumes are not Namespaced, so we got all those in the cluster. What matters is that there are only two and that both are from the claims from the production Namespace. None were created from the database in the personal development environment. That’s it. We saw how to deploy the application in a development or a preview (pull request) environment. The only thing left to do is to delete it. Those environments, or, at least, the resources in those environments, should be temporary. If, for example, we create resources for the purpose of developing something, it stands to reason that we should delete them when we’re finished. Since it’s so easy and fast to deploy anything we need, there’s no need to keep things running longer than needed and to unnecessarily increase costs.
helm --namespace $GH_USER \
    delete go-demo-9
We deleted the application. We can just as well keep the empty Namespace, or delete it as well. We will not need it anymore, so we will do the latter.
kubectl delete namespace $GH_USER
That’s it. It’s gone as if it never existed. There was no need to delete the application. If we’re planning to delete the whole Namespace, it is not necessary to start with helm delete. Removal of a Namespace means the removal of everything in it.
Let’s move forward and see what we’d have to do to run the application in a permanent non-production environment.

Deploying Applications To Permanent Non-Production Environments

We already learned everything we need to know to deploy an application to a permanent non-production environment like, for example, staging. It’s similar to deploying it to a development environment. As a matter of fact, it should be even easier since it is permanent, so there should be no dynamic values involved. All we have to do is define yet another set of values.

With that in mind, we could just as well skip this part. But we won’t, mostly so that we can close the circle and go through all the commonly used permutations. Think of this section as a refresher of what we learned. I promise to go through it fast without wasting too much of your time.

We’ll start by creating the Namespace.
kubectl create namespace staging
Next, we’ll take a quick look at the values.yaml specific to the staging environment.
cat staging/values.yaml
The output is as follows.

image:
  tag: 0.0.2
ingress:
  host: staging.go-demo-9.acme.com
hpa:
  minReplicas: 2
The values should be obvious and easy to understand. The only one that might be confusing is that we are setting minReplicas to 2, but we are not changing maxReplicas (the default is 6). We did say that the staging environment should have only two replicas, but there should be no harm done if more are created under certain circumstances. We are unlikely to generate sufficient traffic in the staging environment for the HPA to scale up the application. If we do, for example, run some form of load testing, jumping above 2 replicas would be welcome, and it would be temporary anyway. The number of replicas should return to 2 soon after the load returns to normal.

Let’s apply the go-demo-9 chart with those staging-specific values.

helm --namespace staging \
    upgrade --install \
    --values staging/values.yaml \
    go-demo-9 go-demo-9 \
    --wait \
    --timeout 10m
All that’s left is to validate that the objectives for the staging environment are met. But, before we do that, let’s have a quick refresher.

• Ingress host should be staging.go-demo-9.acme.com.
• HPA should be enabled.
• The number of replicas of the application should be two (unless more is required).
• The database should have two replicas.
• The database should have persistent volumes attached to all the replicas.
Let’s see whether we fulfilled those objectives.

kubectl --namespace staging \
    get ingresses
The output is as follows.

NAME                HOSTS                      ADDRESS       PORTS AGE
go-demo-9-go-demo-9 staging.go-demo-9.acme.com 192.168.64.56 80    2m15s
We can see that the host is indeed staging.go-demo-9.acme.com, so let’s check whether the application is reachable through it.

curl -H "Host: staging.go-demo-9.acme.com" \
    "http://$INGRESS_HOST"
The output should be a “normal” response, so we can move on and validate whether the HPA was created and whether the minimum number of replicas is indeed two.

kubectl --namespace staging \
    get hpa
The output is as follows.

NAME                REFERENCE                      TARGETS    MINPODS MAXPODS REPLICAS AGE
go-demo-9-go-demo-9 Deployment/go-demo-9-go-demo-9 /80%, /80% 2       6       2        2m59s
We can see that the HPA was created and that the minimum number of Pods is 2. The only thing missing is to confirm whether the persistent volumes were created as well.
kubectl get persistentvolumes
The output is as follows.

NAME             CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM                  STORAGECLASS REASON AGE
pvc-11c49634-... 8Gi      RWO          Delete         Bound  staging/datadir-...    standard            3m45s
pvc-86dc84f4-... 8Gi      RWO          Delete         Bound  production/datadir-... standard            37m
pvc-bf1c2970-... 8Gi      RWO          Delete         Bound  staging/datadir-...    standard            3m45s
pvc-f834886d-... 8Gi      RWO          Delete         Bound  production/datadir-... standard            37m
We can observe that two new persistent volumes were added through the claims from the staging Namespace. That’s it. We saw how we can deploy the application in different environments, each with different requirements. But that’s not all we should do.
Packaging And Deploying Releases

If we are going to follow the GitOps principles, each release should result in a change of, at least, the tag of the image and the version of the chart. Also, we might want to package the chart in a way that it can be distributed to all those who need it. We’ll explore that through a simulation of the development of a new feature that could result in a new release.

Let’s start by creating a new branch, just as you would normally do when working on something new.
git checkout -b my-new-feature
Now, imagine that we spent some time writing code and tests and that we validated that the new feature works as expected in a personal development environment and/or in a preview environment created through a pull request. Similarly, please assume that we decided to make a release of that feature.

The next thing we would probably want to do is change the version of the chart and the application. As you already saw, that information is stored in Chart.yaml, so let’s output it as a refresher.
cat go-demo-9/Chart.yaml
The output is as follows.

apiVersion: v1
description: A Helm chart
name: go-demo-9
version: 0.0.1
appVersion: 0.0.1
All we’d have to do is change the version and appVersion values. Normally, we’d probably do that by opening the file in an editor and changing it there. But, since I’m a freak for automation, and since I prefer doing as much as possible from a terminal, we’ll accomplish the same with a few sed commands.

cat go-demo-9/Chart.yaml \
    | sed -e "s@version: 0.0.1@version: 0.0.2@g" \
    | sed -e "s@appVersion: 0.0.1@appVersion: 0.0.2@g" \
    | tee go-demo-9/Chart.yaml
We retrieved the content of go-demo-9/Chart.yaml and piped the output to two sed commands that replaced 0.0.1 with 0.0.2 for both the version and the appVersion fields. The final output was sent to tee that stored it in the same Chart.yaml file.

The output is as follows.

apiVersion: v1
description: A Helm chart
name: go-demo-9
version: 0.0.2
appVersion: 0.0.2
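One caveat about pipelines that read from and write to the same file: tee truncates the file when it opens it, and all the commands in a pipeline start at roughly the same time. With a tiny file like Chart.yaml that tends to work, but it is not guaranteed, and an unlucky run could leave the file empty. If you plan to put something similar into a script, a more defensive variant writes to a temporary file first. The bump_version function below is a hypothetical sketch, not something that exists in the repository. It assumes that the versions are unquoted and sit at the start of a line, as they do in our Chart.yaml.

```shell
# Hypothetical helper: replace the version and appVersion fields of a
# Chart.yaml without reading and writing the same file in one pipeline.
# $1 = file, $2 = old version, $3 = new version
bump_version() (
    tmp="$(mktemp)"
    # Escape the dots so that "0.0.1" is matched literally.
    old="$(printf '%s' "$2" | sed -e 's/\./\\./g')"
    sed -e "s@^version: $old\$@version: $3@" \
        -e "s@^appVersion: $old\$@appVersion: $3@" \
        "$1" > "$tmp" \
        && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    exit "$status"
)
```

With that in place, bump_version go-demo-9/Chart.yaml 0.0.1 0.0.2 should produce the same result as the pipeline above.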
Now that we replaced the versions, we should turn our attention to the tag that we want to use. As you already know, it is one of the variables in values.yaml, so let’s output it.
cat go-demo-9/values.yaml
The output, limited to the relevant parts, is as follows.

image:
  repository: vfarcic/go-demo-9
  tag: 0.0.1
...
We’ll replace the value of the image.tag variable with yet another sed command.

cat go-demo-9/values.yaml \
    | sed -e "s@tag: 0.0.1@tag: 0.0.2@g" \
    | tee go-demo-9/values.yaml
The output, limited to the relevant parts, is as follows.

image:
  repository: vfarcic/go-demo-9
  tag: 0.0.2
...
That’s it. Those are all the modifications we had to do, even though they were not mandatory. We could have accomplished the same by specifying those values as --set arguments at runtime. But that would result in undocumented and hard to track changes. This way, we could (and we should) push those changes to a Git repository. Normally, we would need to build a new image, and we would probably execute quite a few other steps like testing, creating release notes, etc. But that’s not the subject of this chapter, so we’re skipping them. I already built the image vfarcic/go-demo-9:0.0.2, so it’s ready for us to use it.
Before we commit to making a new release based on, among other things, that chart, we should validate whether the syntax we’re using is correct and that there are no obvious issues with it. We’ll do that by “linting” the chart.
helm lint go-demo-9
The output is as follows.

==> Linting go-demo-9
[INFO] Chart.yaml: icon is recommended

1 chart(s) linted, 0 chart(s) failed
If we ignore the fact that the icon of the chart does not exist, we can conclude that the chart was linted successfully. Now that the chart seems to be defined correctly, we can package it.
helm package go-demo-9
We should have added --sign to the helm package command. That would provide a way to confirm its authenticity. However, we’d need to create the private key, and I do not want to lead us astray from the main subject. I’ll leave that up to you as homework for later.
We can see, from the output, that the chart was packaged, and that it was saved as go-demo-9-0.0.2.tgz. From now on, we can use that file to apply the chart, instead of pointing to a directory.

helm --namespace production \
    upgrade --install \
    go-demo-9 go-demo-9-0.0.2.tgz \
    --wait \
    --timeout 10m
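The archive name used in the command above is not arbitrary. It is composed of the name and the version fields from Chart.yaml. If you end up scripting the packaging step, you might prefer to derive the file name instead of hard-coding it. What follows is a small sketch of that idea. The chart_package_name function is hypothetical, and it assumes that both fields are unquoted and start at the beginning of a line.

```shell
# Hypothetical helper: derive the archive name that `helm package` produces
# (<name>-<version>.tgz) from the chart's Chart.yaml.
# $1 = chart directory
chart_package_name() (
    name="$(sed -n -e 's/^name: *//p' "$1/Chart.yaml")"
    version="$(sed -n -e 's/^version: *//p' "$1/Chart.yaml")"
    printf '%s-%s.tgz\n' "$name" "$version"
)
```

In our case, chart_package_name go-demo-9 should print go-demo-9-0.0.2.tgz, which could then be passed to helm upgrade.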
We upgraded go-demo-9 in production to the new release 0.0.2.

Right now, you might be wondering what the advantage of packaging a chart is. It’s just as easy to reference a directory as to use a tgz file. And that would be true if we were always looking for charts locally. With a package, we can store it remotely. That can be a network drive, or it can be an artifact repository, just like the one we used to add MongoDB as a dependency. Those repositories can be purposely built to store Helm charts, like, for example, ChartMuseum⁷⁹. We could also use a more generic artifacts repository like Artifactory⁸⁰, and many others. I’ll leave you to explore repositories for Helm charts later. Think of it as yet another homework.

Let’s move on and check whether the application was indeed upgraded to the new release.

⁷⁹https://chartmuseum.com/
⁸⁰https://jfrog.com/artifactory/
helm --namespace production list
The output is as follows.

NAME      NAMESPACE  REVISION UPDATED       STATUS   CHART           APP VERSION
go-demo-9 production 2        2020-04-21... deployed go-demo-9-0.0.2 0.0.2
We can see that go-demo-9 is now running the second revision and that the app version is 0.0.2. If you’re still uncertain whether the application was indeed upgraded, we can retrieve all the information of the chart running in the cluster, and confirm that it looks okay.

helm --namespace production \
    get all go-demo-9
We probably got more information than we need. We could have retrieved only hooks, manifests, notes, or values if we were looking for something more specific. In this case, we retrieved all because I wanted to show you that we can get all the available information.
If, for example, we would like to confirm that the image of the Deployment is indeed correct, we can observe that in the following fragment of the output.

...
apiVersion: apps/v1
kind: Deployment
...
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: go-demo-9
        image: vfarcic/go-demo-9:0.0.2
...
If you’re curious, feel free to explore that output in its entirety. You should be able to navigate through it since most of it is either Kubernetes definitions that you’re already familiar with or Helm-specific things like variables that we explored earlier.

Finally, before we move on, we’ll confirm that the new release is indeed reachable by sending a simple curl request.

curl -H "Host: go-demo-9.acme.com" \
    "http://$INGRESS_HOST"
This time, the output states that the version is 0.0.2.
Rolling Back Releases

Now that we figured out how to move forward by installing or upgrading releases, let’s see whether we can change the direction. Can we roll back in case there is a problem that we did not detect before applying a new release?

We’ll start by taking a quick look at the go-demo-9 releases we deployed so far in the production Namespace.

helm --namespace production \
    history go-demo-9
The output is as follows.

REVISION UPDATED       STATUS     CHART           APP VERSION DESCRIPTION
1        Tue Apr 21... superseded go-demo-9-0.0.1 0.0.1       Install complete
2        Tue Apr 21... deployed   go-demo-9-0.0.2 0.0.2       Upgrade complete
We can see that we made two releases of the application. Revision 1 was the initial install of version 0.0.1, while revision 2 was the upgrade to version 0.0.2.

Now, let’s imagine that there is something terribly wrong with the second revision, and that, for whatever reason, we cannot roll forward with a fix. In such a case, we’d probably choose to roll back.

We can roll back to a specific revision, or we can go to the previous release, whichever it is. We’ll choose the latter, with a note that rolling back to a specific revision would require that we add the revision number to the command we are about to execute.

helm --namespace production \
    rollback go-demo-9
The response clearly stated that the rollback was a success, so let’s take another look at history.

helm --namespace production \
    history go-demo-9
The output is as follows.

REVISION UPDATED       STATUS     CHART           APP VERSION DESCRIPTION
1        Tue Apr 21... superseded go-demo-9-0.0.1 0.0.1       Install complete
2        Tue Apr 21... superseded go-demo-9-0.0.2 0.0.2       Upgrade complete
3        Tue Apr 21... deployed   go-demo-9-0.0.1 0.0.1       Rollback to 1
We can see that it did not roll back literally. Instead, it rolled forward (upgraded the app), but to the older release. The alternative way to see the current revision is through the status command.

helm --namespace production \
    status go-demo-9
We can see from the output that the current revision is 3. Finally, to be on the safe side, we’ll send a request to the application and confirm that we’re getting the response from the older release 0.0.1.

curl -H "Host: go-demo-9.acme.com" \
    "http://$INGRESS_HOST"
The output is Version: 0.0.1, so we can confirm that the rollback was indeed successful.
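Rolling back to a specific revision, which we mentioned but did not try, only requires the revision number as an additional argument after the release name. The function below is a sketch of how that could be wrapped for use in scripts. The rollback_to name is hypothetical, and adding --wait is my assumption about what you would typically want, not something we used above.

```shell
# Hypothetical helper: roll a release back to a given revision.
# $1 = Namespace, $2 = release, $3 = revision (optional)
# When the revision is omitted, Helm rolls back to the previous release.
rollback_to() {
    helm --namespace "$1" \
        rollback "$2" ${3:+"$3"} \
        --wait
}
```

As an illustration only, rollback_to production go-demo-9 2 would bring back the release with the app version 0.0.2, and helm history would show it as yet another (fourth) revision.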
What Did We Do Wrong?

There are quite a few things that we could have done better, but we didn’t. That would have forced us to go astray from the main subject. Instead, we will go through a few notes that might prove to be useful in your usage of Helm.

We upgraded the release directly in production, without going through staging, or whichever other environments we might have. Think of that as a shortcut meant to avoid distracting us from the subject of this chapter, and not as something you should do. When working on “real” projects, you should follow the process, whatever it is, instead of taking shortcuts. If the process is not right, change it, instead of skipping the steps.

We should not apply changes to clusters directly. Instead, we should be pushing them to Git and applying only those things that were reviewed and merged to the master branch.

While on the subject of Git, we should keep Helm charts close to applications, preferably in the same repository where the code and other application-specific files are located. Or, to be more precise, we should be doing that for the “main” chart, while environment-specific values and dependencies should probably be in the repositories that define those environments, no matter whether they are Namespaces or full-blown clusters.

Finally, we should not run helm commands at all, except, maybe, while developing. They should all be automated through whichever continuous delivery (CD) tool you’re using. The commands are simple and straightforward, so you shouldn’t have a problem extending your pipelines.
Destroying The Resources

We will not need the changes we made in the my-new-feature branch, so let’s stash them, check out master, and remove that branch.

git stash

git checkout master

git branch -d my-new-feature
The next steps will depend on whether you’re planning to destroy the cluster or to keep it running. If you choose the latter, please execute the commands that follow to delete the Namespaces we created and everything that’s in them.

If you’re using Docker Desktop and you’re not planning to reset the Kubernetes cluster, it is better to execute the commands that follow no matter whether you’ll keep the cluster running or shut it down.

kubectl delete namespace staging

kubectl delete namespace production
Let’s go back to the root of the local repository.
cd ../
If you chose to destroy the cluster, feel free to use the commands at the bottom of the Gist you used to create it. Finally, let’s get out of the local repository, and get back to where we started.
cd ../
Setting Up A Local Development Environment

This might be an excellent place to take a break from infrastructure, Kubernetes, Helm, and other topics we explored previously. This time, we will create a productive local development environment.

Let’s start with the requirements. What do we need as a local development environment? Different people will give different answers to that question. In my case, it all boils down to four key ingredients. We need a browser. That’s our source of information (Google, internal Wiki pages, etc.). We need an IDE which we use to navigate through files and write code, configurations, documentation, and almost everything else. Finally, we need a terminal. Any self-respecting engineer understands the advantages of using a terminal to execute commands instead of clicking buttons, so we’ll consider that a requirement. For a terminal to make sense, we need a Shell.

We might need to install a few other applications, mostly those dedicated to communication like Slack or Zoom. I’m sure that you know how to install an app, so we will not bother with those. Similarly, I’m sure that you already have a browser. So, we will not bother with it either. We will concentrate on IDE, terminal, and Shell. Those are the three pillars of a local development environment. Anything else we might need (e.g., compilers, utilities, CLIs, etc.) can and should be installed through a Shell.

We’ll go through a setup that works great for me. As a matter of fact, if we exclude Slack and a browser, what we are about to set up is everything I run on my machines. But, before we do that, let’s discuss which operating system is the best for your laptop. Don’t worry, it’s going to be a short one.
Which Operating System Is The Best For Laptops?

Which operating system should software engineers use on their laptops? The answer is that it does not matter. They are all, more or less, the same, with only two significant differentiators.

One significant difference is the ease of the initial setup. That’s not an issue since Windows and macOS laptops already come with an operating system pre-installed. Some come with Linux as well. Some people tend to wipe their OS and start over by installing their favorite Linux distribution, but, these days, that’s often limited to people who “like tweaking it to be exactly as they like it.” In other words, laptops tend to come with an OS pre-installed, so ease of setup is not that important anymore unless you are one of those who like to start fresh with a “special” OS. If that’s you, you are probably doing that because you want it, and not because you are looking for something that works out of the box.

Now, at this point, people would usually start screaming at me. They would begin listing things that are better in their favorite OS and claim that everything else is trash. That’s, in most cases, nonsense. Most of us click a shortcut, and an application opens. That application tends to be the same regardless of whether it is running in macOS, Windows, or Linux. So it does not really matter. We tend to spend most of our time working inside applications that behave, more or less, the same, no matter the underlying OS. One of those applications is a terminal, and that also tends to be the source of heated debates. Again, it does not matter, especially since I am about to introduce you to my favorite terminal, which happens to work (almost) the same in any OS.

While we are on the subject of terminals, Shells tend to be different. While macOS and Linux have, more or less, the same Shells (Bash, ZSH, KSH, etc.), Windows uses the Command Prompt and PowerShell. That’s where the critical difference lies. You should not use PowerShell, at least not exclusively.

I could start this discussion by saying that the Windows command prompt is horrible and that PowerShell is not good enough. But that would be missing the point. It does not really matter whether PowerShell is fantastic or not. What matters is that the world is running on Unix Shell. All your servers are based on some Linux distribution. Even if that’s not the case, they are likely mixed with Windows servers. There is hardly any company these days that has only Windows servers. So, some or all your servers are using Linux. That means that Linux is inevitable, while Windows servers are optional and slowly fading away. Consequently, that also means that you have to use a Linux Shell when working with servers. So why not adopt it for local development as well?
Why would we use different Shells if one can work in all circumstances and if it works well? The basic operations, those that we use most of the time, are the same no matter whether we are using Bash, ZSH, KSH, or any other Linux-based Shell. So, even if your servers have Bash, and your laptop is using ZSH, you should be able to work seamlessly since they are very similar anyway. There is one significant difference between operating systems, and that’s package management. The way we install applications from the command line in Windows, macOS, and various Linux distributions is different. Yet, even that does not matter much since it’s mostly about running a different command. That can be brew, apt, rpm, etc.
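To illustrate how small that difference is in practice, here is a sketch of a function that figures out which package manager is available, so that a setup script can pick the right install command. The detect_pkg_manager name and the list of candidates are my assumptions. Extend or reorder the list to match the machines you care about.

```shell
# Hypothetical helper: print the first package manager found on this machine,
# or "unknown" (with a non-zero exit code) if none of the candidates exist.
detect_pkg_manager() {
    for candidate in brew apt-get dnf yum; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    echo "unknown"
    return 1
}
```

A script could then branch on the result, for example by running brew install on macOS and sudo apt-get install on Ubuntu, while the rest of the commands stay the same.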
To summarize, we start applications by clicking shortcuts, and we use terminals to execute commands. The former is more or less the same everywhere. The latter should be the same as well if only Windows would not insist on PowerShell.

That last statement is actually not true. Windows embraced Linux in the form of WSL, and that means that we can have a productive local development environment no matter which operating system is running on our laptops. So, what we’re about to set up will work the same, no matter which operating system you’re using. We just need to set up a few pre-requirements on Windows.

All in all, it does not matter which operating system you are running on your laptop. It all boils down to personal preferences. All operating systems are mature, and none are releasing new game-changing features. That’s it. The war is over. No one won. All operating systems are good enough and do, more or less, the same job on our laptops. Servers are a different story, and we will not enter into that discussion right now. Instead, we’ll set up the local development environment I tend to use.

All the commands from this chapter are available in the 03-local-dev.sh⁸¹ Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.
Before we get to the actual setup, I prepared a “special” sub-section only for Windows users. Feel free to skip it if you are not using Windows.
Installing Windows Subsystem For Linux (WSL)

This is only for Windows users. Skip to the next section if that is not you.

I already explained why Linux Shell is essential. I did not go to great lengths to explain the reasons, given that I am assuming that you already know that. So, we’ll jump straight into Windows Subsystem for Linux (WSL). It lets us run a Linux environment directly on Windows, and without the overhead of a virtual machine. It is part of Windows (not a replacement), so you can keep doing “Windows stuff”, while still benefiting from Linux Shells and quite a few other things. It is a much better solution than using Shell emulators like Cygwin or GitBash, and it saves you from wasting resources on a virtual machine. WSL might be the best thing that happened to Windows, at least during the last couple of years.

Before we proceed, please note that I am assuming that you are using Windows 10 or newer.
The setup is relatively easy and straightforward, so let’s get down to it right away. We’ll start by getting you up-to-date.

⁸¹https://gist.github.com/7b9ce0f066a209b66fd2efe9d1f5ba06

Please open Windows Update Settings. You can do that by typing it in the search field, or any other way you prefer opening applications. If you’re running an older build of Windows, this might be the right moment to upgrade. I’ll assume that you know how to do that.

Next, we need to turn on the Developer mode. Open Settings followed by Update & Security, and select the For developers tab. Once inside, turn on the Developer mode.

Now we are ready to install the Windows Subsystem for Linux (WSL). Open OptionalFeatures, select Windows Subsystem for Linux, and restart.

I recommend using WSL2 since it provides a full Linux kernel, which frees us from workarounds for some Linux commands. Please consult WSL1 vs. WSL2⁸² for a detailed comparison of the two. You may need to upgrade your Windows OS since WSL2 is only available in Windows 10, Version 2004, Build 19041, or higher. Follow the instructions to upgrade to WSL2⁸³.
WSL2 uses Hyper-V architecture to enable its virtualization, which means that software will not be able to run in WSL2 if it requires virtualization (e.g., VirtualBox). It would take us a while to make WSL2 and VirtualBox co-exist. Fortunately, Docker Desktop integrates seamlessly with WSL2.
The only thing missing is to install Linux. Wait until Windows is restarted, and open Microsoft Store. You should be able to install any Linux offered in the store. However, for simplicity, I recommend that you start with Ubuntu since that’s the one we’ll use in the examples that follow. Otherwise, you might need to modify the commands in the examples. Most of them should be the same no matter which Linux you choose, and the significant change would be in installing and managing packages.

Please search for Ubuntu, select it, and follow the instructions to install it. I’m using Ubuntu 20.04 LTS. Launch it once it is installed. After a while, you will be asked for credentials. Type your user name, enter the password, and confirm it.

From now on, you should be able to use Bash. Let’s confirm whether that’s indeed true. Type exit to get out of Ubuntu’s terminal and open bash.

⁸²https://docs.microsoft.com/en-us/windows/wsl/compare-versions
⁸³https://docs.microsoft.com/en-us/windows/wsl/install-win10#update-to-wsl-2

What you see in front of you is Bash Shell. From now on, you can run the same commands as those using macOS or Linux, and you do not need to emulate anything or create a virtual machine.

Nevertheless, we are not yet done getting to the same level as those using other operating systems. Ubuntu installed as WSL does not come with all the tools that are typically installed in a stand-alone Ubuntu. For now, we’ll focus only on the two essential tools. We will need curl to send requests or download files. We will also need git, and I have no intention of explaining what it does. You’re in deep trouble if you never used Git.

We’ll start by adding the git-core package repository.
sudo apt-add-repository ppa:git-core/ppa
You will be asked for the password. Use the same one you provided during the Ubuntu installation process. From now on, I will not be telling you to type the password whenever you’re asked, or to confirm that you want to continue with some process.
Next, we need to update the local copy of the package definitions.
sudo apt update
Now we are ready to install curl and git.
sudo apt install curl git
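If you want to confirm that both tools ended up on the PATH, a check along the following lines might help. It is only a minimal sketch; the check_tools helper is a name I made up for illustration and is not part of any package.

```shell
# Hypothetical helper: report whether each named tool is on the PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
            missing=1
        fi
    done
    # Non-zero exit status when at least one tool is missing.
    return "$missing"
}

# Check the two tools we just installed.
check_tools curl git || echo "Install the missing tools before continuing."
```

The same helper can be reused later for any other tool by passing its name as an argument.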
We will not need the terminal any more in this section, so let’s exit.
exit
That’s it. You are still using Windows for whatever reasons you like using it, but you also have Ubuntu running in the Windows Subsystem for Linux, and you can use it for all Shell-related tasks. Now you should be able to run all the commands from this book without resorting to magic, emulations, or virtual machines.
Please note that you might need other packages. I’ll provide the instructions for those required in this book, and you are on your own for the others. I am sure you will figure them out.
What matters is that, from now on, you should follow the instructions for Linux and not for Windows. As a matter of fact, there will not be any Windows-specific instructions anymore. You’re a Linux guru now, at least when commands and scripting are concerned.
Let me make sure that the message is clear. From now on, follow the instructions for Linux whenever something is executed from a command line.
Let’s join the others and set up a “proper” Shell.
Choosing A Shell
Most Shells are based on Bourne Shell (sh). Most operating systems default to Bourne-Again Shell (bash), which replaces sh. To be more precise, bash is, for the most part, a superset of sh. It adds features that we will not have time to discuss, especially since it is not the option we’ll choose. Instead, we’ll go with zsh.
Just like bash, Z Shell (zsh) extends sh. However, it brings a few additional features that bash does not have. Specifically, it has a theme and plugin mechanism that allows us to extend it. That enables it to do things that cannot be (easily) done with bash.
Let’s install it and see it in action. First, we will open a bash terminal. If you are a Windows user with WSL, open bash. Otherwise, open whichever terminal is available.
The installation differs depending on your operating system. Please execute the command that follows if you are an Ubuntu or Debian user. That includes Windows users with WSL since you are now running Ubuntu as well.
sudo apt install \
    zsh powerline fonts-powerline
Please execute the command that follows if you are a macOS user.
brew install zsh
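Since the installation command depends on the operating system, the choice can be expressed as a small function. What follows is a sketch under the assumption that Linux means Ubuntu or Debian (including WSL) and that macOS has Homebrew available; zsh_install_command is a made-up name, and it only prints the command instead of running it.

```shell
# Print the zsh installation command for the current operating system.
# Assumes Debian/Ubuntu on Linux (including WSL) and Homebrew on macOS.
zsh_install_command() {
    case "$(uname -s)" in
        Linux)  echo "sudo apt install zsh powerline fonts-powerline" ;;
        Darwin) echo "brew install zsh" ;;
        *)      echo "unsupported operating system" ;;
    esac
}

# Show which command applies to this machine.
zsh_install_command
```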
I already mentioned that the main advantage of zsh is its ability to use themes and plugins. While we could create them ourselves, there is already a fantastic project that simplifies the setup and provides a vast library of plugins. That project is called Oh My Zsh.
Oh My Zsh is an open-source community-driven framework for managing zsh configuration. It makes things simple with a fantastic out-of-the-box experience, which can be fine-tuned to our liking. Let’s install it.
sh -c "$(curl -fsSL \
    https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
Follow the instructions, if there are any. That’s it. We have zsh set up and managed with Oh My Zsh. It will be our Shell of choice, and we’ll explore it soon. But, before we do that, we still need to figure out which IDE and terminal to use.
A Short Intermezzo
We’ll make a small detour if you are an Ubuntu or a Debian user. As you already saw, that includes Windows users with WSL. If you are using macOS, you can skip this section since it does not apply to you.
I often use the open command in my book. It is a handy way to open “stuff”, which, in my case, tends to be URLs. The issue is that Ubuntu (including WSL) does not have it. To be more precise, it does not have the open command, but it does have xdg-open, which is almost the same. We just need to install it and create an alias.
As I mentioned, this will be quick since there’s not much to do but execute a few commands.
First, we are going to install xdg-utils.
sudo apt install xdg-utils
Next, we’ll create an alias so that it acts in the same way as open in macOS. If nothing else, it is shorter to write open than xdg-open, and, at the same time, it will save me from providing different commands depending on the operating system.
alias open='xdg-open'
Let’s see whether it works.
open https://www.devopstoolkitseries.com
A new tab in a browser should have opened with the home page of devopstoolkitseries.com. Brilliant! Now I don’t need to tell you anymore to open one page or another. Instead, I can provide the instructions as executable commands. But we are not yet done. The alias we created is temporary and will exist only until we close the terminal session. To avoid creating the same alias over and over again, we’ll make it permanent by adding it to .zshrc, which is executed every time we start a new zsh session.
echo "alias open='xdg-open'" \
    | tee -a $HOME/.zshrc
That’s it. All the commands will now be the same, no matter the operating system you are using. The exceptions are only those used to install packages. Let’s get back to the task at hand and explore which IDE and terminal we are going to use.
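One caveat with appending to .zshrc is that running the same command twice leaves two copies of the alias in the file. A guarded variant could look like the sketch below. The add_line_once function is my own invention for illustration, and it relies only on standard grep options.

```shell
# Hypothetical helper: append a line to a file only if the exact
# same line is not already there.
add_line_once() {
    line="$1"
    file="$2"
    # Make sure the file exists so that grep does not complain.
    touch "$file"
    if ! grep -qxF -- "$line" "$file"; then
        echo "$line" >> "$file"
    fi
}
```

Executing add_line_once "alias open='xdg-open'" "$HOME/.zshrc" any number of times should leave a single alias entry in the file.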
Choosing An IDE And A Terminal
Discussions about editors and integrated development environments (IDEs) are a common cause of heated debates that can last days and lead only to frustration and the realization that everyone thinks that their favorite is the only right solution. I will not enter into such a debate. Instead, I’ll just say that I believe that Visual Studio Code is the best IDE. I will not even provide arguments in favor of such a statement. It’s fantastic, and we are about to install it.
open https://code.visualstudio.com/download
Choose the distribution for your favorite operating system, download it, and follow the instructions on how to install it. Open it. If you are a Windows user that I hopefully convinced to use WSL, you will be asked whether you want to install the WSL plugin. Do it!
I’ll let you explore all the goodies Visual Studio Code provides. Or, even better, there is no strong need to explore anything. Just use it. One of the reasons I like VS Code is that it is intuitive, and it makes the right choices. For example, you do not need to look for “special” plugins. As soon as it detects a file that would benefit from a plugin, it will pop up a message asking you whether you want to install it. Most of the time, you need to let it guide you, instead of figuring out how to do things. Later on, after using it for a while, you might want to check out some “hidden” features. For now, just use it and let it educate you. Next, we need to choose a terminal that we are going to use. Those pre-installed in operating systems tend to be very rudimentary and are often not the best choice. That is especially true in Windows, which probably has the worst out-of-the-box terminal I ever saw. Do not think that pre-installed terminals in macOS or Linux are great. They are just better than the one in Windows, but not exceptionally good.
Just as with the IDE, I will not enter into a lengthy debate justifying why the terminal I prefer to use is better than others. I will just pause and let you guess which one it is. Did you guess it? If not, let me give you a clue. We already have it installed no matter which operating system you are using. That should be an obvious clue. How about now? Visual Studio Code comes with a terminal, and it happens to be the one I like the most. On top of that, it helps a lot having (almost) everything we need in a single app. I have two monitors, and one always has Visual Studio Code occupying the whole screen. It allows me to browse files, to edit whatever I need editing, and it contains a terminal that I use to run whichever commands I need to execute. Apart from the apparent need for a browser and whichever communication tools I use, Visual Studio Code provides everything I need to develop software, operate my systems, and do whatever else engineers are doing. We’ll open the terminal soon. Before we do that, we need to let it know that it should use zsh as the default Shell. From inside Visual Studio Code, click the View item from the top menu, and select Command Palette…. Almost everything in Visual Studio Code is accessible through keyboard shortcuts. I rarely navigate the menus, except now. The shortcuts tend to differ from one operating system to another, so it would be difficult to provide instructions based on keyboard shortcuts that would work for all.
The Command Palette provides access to Visual Studio Code commands, which can do almost anything. In this case, we are trying to redefine the default Shell. Please type select default shell and open it. If you are a WSL user, select WSL Bash. Otherwise, choose zsh. That’s it. We have our IDE, and it contains an excellent terminal that is configured to use zsh, which, in turn, is managed by Oh My Zsh. The only thing left is to open the terminal. Select Terminal from the top menu and open New Terminal. We are not yet done, though. There is still one more thing we need to set up.
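Once a terminal is open, it can be reassuring to confirm which Shell it is actually running. The sketch below relies on two variables: ZSH_VERSION, which zsh sets by itself, and ZSH, which the .zshrc generated by Oh My Zsh points to its installation directory. The describe_shell function is a name I picked for illustration.

```shell
# Describe the shell that is running this code.
# ZSH_VERSION is set by zsh itself; ZSH is set by the .zshrc that
# Oh My Zsh generates and points to its installation directory.
describe_shell() {
    if [ -n "${ZSH_VERSION:-}" ]; then
        echo "zsh ${ZSH_VERSION}"
    else
        echo "not zsh"
    fi
    if [ -n "${ZSH:-}" ]; then
        echo "Oh My Zsh directory: ${ZSH}"
    fi
}

describe_shell
```

If the output says that the Shell is not zsh, the default Shell selection in Visual Studio Code is the first thing to double-check.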
Using Oh My Zsh To Configure Z Shell
Oh My Zsh is great without tweaking anything. It already applied the default theme, and it has the Git plugin pre-configured.
Nevertheless, we can do better than that. We will add a few plugins, while I’ll leave you to explore themes alone. Let’s start by checking which plugins are already installed.
ls -1 $HOME/.oh-my-zsh/plugins
The output, limited to the last few entries, is as follows.
...
yum
z
zeus
zsh-interactive-cd
zsh-navigation-tools
zsh_reload
That’s a huge list. The last time I counted, there were 280 plugins available. Almost anything we need is already there, and all we have to do is tell Oh My Zsh which ones we would like to use. Let’s see, for example, whether kubectl is there.
ls -1 $HOME/.oh-my-zsh/plugins \
    | grep kubectl
We can see that the kubectl plugin is available. Nevertheless, sometimes we might want to add plugins that are not pre-installed. In that case, all we have to do is find them, and clone their repositories to the .oh-my-zsh/custom/plugins directory. Let’s see what’s inside it.
ls -1 $HOME/.oh-my-zsh/custom/plugins
The output is as follows.
example
There is only one custom plugin which does not do much. As the name suggests, it is there mostly to provide an example of custom plugins that we can add. The one that is, in my opinion, indispensable, and yet not available out of the box, is the autosuggestions plugin. So, let’s install it.
git clone \
    https://github.com/zsh-users/zsh-autosuggestions \
    $HOME/.oh-my-zsh/custom/plugins/zsh-autosuggestions
That’s it. All we had to do is clone the repository with the plugin inside the .oh-my-zsh/custom/plugins directory. But we are not yet finished. Installing a plugin or, to be more precise, downloading it is not enough. We need to tell zsh that we want to use it.
Almost all Shells have an rc file that defines the variables, init processes, and everything else we need when a Shell starts. Z Shell is not an exception. If we want to add plugins installed in Oh My Zsh, we need to change the .zshrc file.
I will use vim to edit the file. The chances are that you do not like vim or that you do not even know what it is. Feel free to edit the .zshrc file any other way you like.
vim $HOME/.zshrc
You’ll notice that one of the variables near the top is ZSH_THEME set to robbyrussell. It tells zsh which theme we would like to use. We’ll keep the default one since it happens to be my favorite. Later on, you might want to try others. If you do, I’m sure you’ll be able to figure it out from the documentation. This is not a deep dive, so we’re defaulting to what I like.
We’ll focus on plugins instead. Please find the plugins=(git) entry. We are about to add a few additional entries. To be more specific, we will add kubectl, helm, minikube, and zsh-autosuggestions. The first two tend to be used in many of the examples in this book. Minikube is useful only if that’s the Kubernetes platform you are using. Otherwise, you might want to skip that one. Finally, the zsh-autosuggestions plugin provides an auto-complete feature for the commands that are not covered by other plugins. It’s a sort of catch-all mechanism.
If you did choose to use vim as I did, press the i key to enter the insert mode. Replace plugins=(git) with the snippet that follows.
plugins=(
  git
  kubectl
  minikube
  zsh-autosuggestions
  helm
)
If you are still in vim, press the escape key to exit the insert mode, type :wq to save and quit, and press the enter key. The only thing left is to exit the terminal and start a new session which will have the changes to the rc file applied.
exit
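If you prefer scripts over manual edits, the two steps we just performed (cloning a custom plugin and changing the plugins entry) could be sketched as a pair of functions. Treat this as a rough illustration, not the official way to do it. The function names are made up, ZSH_CUSTOM is the variable Oh My Zsh uses for its custom directory, and the sed expression assumes that the plugins entry fits on a single line, as it does in a fresh .zshrc.

```shell
# Directory for custom plugins. ZSH_CUSTOM is set when Oh My Zsh is
# loaded; the default location is used as a fallback.
custom_plugins_dir() {
    echo "${ZSH_CUSTOM:-$HOME/.oh-my-zsh/custom}/plugins"
}

# Clone a plugin repository unless it is already present.
install_custom_plugin() {
    repo_url="$1"
    name="$2"
    dest="$(custom_plugins_dir)/$name"
    if [ -d "$dest" ]; then
        echo "$name is already installed"
    else
        git clone "$repo_url" "$dest"
    fi
}

# Replace the single-line plugins=(...) entry in an rc file with the
# plugins passed as arguments. The original file is kept as a backup
# with the .bak suffix.
set_plugins() {
    rc_file="$1"
    shift
    cp "$rc_file" "$rc_file.bak"
    sed "s/^plugins=(.*)\$/plugins=($*)/" "$rc_file.bak" > "$rc_file"
}
```

With those in place, install_custom_plugin https://github.com/zsh-users/zsh-autosuggestions zsh-autosuggestions followed by set_plugins "$HOME/.zshrc" git kubectl minikube zsh-autosuggestions helm should produce roughly the same result as the manual steps. A new terminal session is still needed for the changes to take effect.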
Going For A Test Drive With Oh My Zsh
We will not have time to go through everything we could do with zsh and Oh My Zsh. That would be a futile attempt that would result in a never-ending story. Instead, we’ll take a quick look at it through a few examples. From there on, you should be able to explore it on your own. Just like Visual Studio Code, zsh is intuitive, as long as you’re already familiar with Linux Shells. There’s not much to learn beyond simply using it. To be more precise, there is a lot to learn before becoming proficient in Shells in general, but that’s more related to the syntax of the commands than things specific to zsh.
Let’s say that we want to clone the vfarcic/devops-catalog-code repository. Select Terminal from the top menu and open New Terminal.
First, we’ll create a new directory. But, before executing the command that follows, you should be aware of two essential features of zsh combined with Oh My Zsh and the zsh-autosuggestions plugin. You can start typing a command and press the tab key to receive suggestions. Instead of typing the command that follows fully, you can type mkd (the first few characters) and press the tab key. It will give you suggestions unless that is the only command that begins with those letters, in which case it will complete it. Also, if there is a command in the history that starts with those letters, it will output the rest in gray, and you can just press the right arrow key to complete it.
We’ll practice the tab key later. Autosuggestions without tabs, on the other hand, will work for me since I already ran those commands, but will not yet be happening for you. You’ll see them later as you start building your history.
Now, let’s get back to the task at hand and create a directory code where we will clone the repo.
mkdir code
Next, we’ll enter the newly created directory.
cd code
Now we can clone the repository.
git clone https://github.com/vfarcic/devops-catalog-code.git
None of the things we did so far showed any advantage of zsh, unless you experimented with it without my instructions. Let’s change that. Instead of typing the whole command that follows, type cd de (the command and the first few letters of the name of the directory), and press the tab key.
cd devops-catalog-code
You’ll notice that it auto-populated the rest. It figured out that there is only one directory that starts with de and, after we pressed the tab key, it filled in the rest. Press the enter key to execute the command.
You’ll notice that the prompt starts with the name of the directory, just as it did before. However, since we are using the git plugin, it also shows git so that we are aware that we are inside a repository. That is followed by the name of the branch (master). That feature is a combination of the theme we are using (robbyrussell) and the git plugin.
Let’s check out a new branch.
git checkout -b something
You’ll notice that, this time, as soon as you started typing the first letter (g), it auto-populated it with the last executed command that matches what we are typing. As we continue typing, it keeps filtering. That’s very useful if we are trying to run a command that we executed before. We can keep typing until we reach the part that matches the command we want and press the right arrow key to complete it. But that’s not what you might want since you never checked out that branch. I, on the other hand, already practiced the examples, so, in my case, it suggested the full command. After executing the command, we can see that the branch in the prompt changed to something. Like that, we are always aware of whether we are in a git repository, and what is the current branch. Let’s go back to the master branch.
git checkout master
Observe how it auto-completes the commands when we have them in our history. Also, please note that the branch in the prompt changed again.
Now that we cloned the repo, we can open it in Visual Studio Code and (pretend to) work on it. The instructions will differ slightly depending on your operating system. Choose File from the top menu and select Open Folder… if you are a Windows user, or just Open… if you are using Linux or macOS.
Next, we need to select the path to the directory that contains the repository we cloned. First, we are going to navigate to the home directory of your local user. If you are a Windows user, that would be This PC / C: / Users / YOUR_USER. If you are using macOS, the path is /Users/YOUR_USER. Finally, Linux users should be able to find it in /home/YOUR_USER.
Please note that the above path might not be the correct one in your case. The one I listed is the likely one, but might still differ depending on your hardware and how you set up your operating system.
Once inside the user’s home directory, enter into code and select the devops-catalog-code directory. All that is left is to click the Select Folder or the Open button. You can observe that the list of all the files is available in one of the tabs on the left-hand side of the application.
Visual Studio Code remembers the history of the folders we worked in, so we do not need to navigate to them every time. We can access the history by selecting File from the top menu and opening the Open Recent section. Right now, only the devops-catalog-code is there, since that’s the only directory we worked in so far.
Let’s open one of the files in the project and see what happens. Open any of the Terraform files we explored in one of the previous chapters. I will, for example, expand the terraform-aks directory, followed by files. Inside it, I will open backend.tf.
As soon as we opened one of the files with the tf extension, Visual Studio Code saw that we want to work on a type of file that it does not know how to handle. As a result, the suggestion to install a plugin popped up. Such a pop-up might suggest a single plugin or, as in this case, there might be multiple ones that could help us with Terraform.
Click the Search Marketplace button, and a list of plugins will appear in the left part of the screen. Select the Terraform plugin and click the Install button. You can choose additional Terraform plugins, but, as a demonstration, the one we just installed should be enough.
Click the Explorer icon from the left-hand menu to go back to the list of the files. Select any tf file.
You will notice that the tf file is now nicely formatted and full of colors. We also got a code completion feature and a few other goodies. I will leave you to explore the plugins and the editor by yourself. But not right away. Do that later. For now, we will take another look at how the Visual Studio Code terminal with zsh behaves.
Please click the Terminal item from the top menu, and select New Terminal.
We are about to see the benefits of Oh My Zsh plugins by experimenting with kubectl. But, before we do that, please make sure that you have kubectl installed. You probably do if you followed the exercises from the previous chapters, except if you just installed WSL. In that case, you almost certainly do not have kubectl, at least not inside the Ubuntu subsystem. In any case, install it if you do not have it. I’m sure you’ll be able to do that without my help. Google is your friend.
Remember that if you are a Windows user and you installed WSL, you should be looking for instructions on how to install kubectl in Ubuntu.
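A quick way to find out whether kubectl is available in the current terminal might look like the sketch that follows. The kubectl_status function is a hypothetical helper, while kubectl version --client is the regular command that prints the version of the client without contacting a cluster.

```shell
# Report the kubectl client version, or say that kubectl is missing.
kubectl_status() {
    if command -v kubectl >/dev/null 2>&1; then
        kubectl version --client
    else
        echo "kubectl is not installed"
        return 1
    fi
}

kubectl_status || echo "Install kubectl before continuing."
```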
We will also need a Kubernetes cluster. Any should do. We will not make any change to it, so if you already have one, make sure that your Kube config is pointing to it. If you don’t, create a new one using the instructions from the previous chapters, or any other way you like.
The command we are about to execute is as follows (do not run it yet).
kubectl --namespace kube-system \
    describe pod [...]
Start typing kubectl --namespace. Normally, this would be the moment when we would need to know the Namespace we want to use. Typically, that would either mean that we memorized which one we need, or that we executed kubectl get namespace to get the list. Remembering is not my strong suit, and executing a command to list the Namespaces sounds like a waste of energy. Fortunately, we can do better now that we are using zsh and the kubectl plugin.
Make sure that there is a space character after --namespace and press the tab key. You should see the list of all the Namespaces from the cluster. We can use the arrow keys to select any. Choose kube-system or any other Namespace that has at least one Pod and press the enter key.
That was much easier than wasting brain capacity memorizing which Namespace we want to use. Wasn’t it?
Next, we want to describe a Pod. So, continue typing describe pod, make sure that there is a space at the end, and press the tab key again. Z Shell outputs the list of all the Pods in that Namespace, and, once again, it allows us to select what we need.
Choose any Pod, press the enter key to select it, and the enter key one more time to execute the command. The output is the description of the Pod. It doesn’t matter what was displayed. We did this to show one of the benefits of the plugin, and not because we are really interested in that Pod.
Now, let’s say that we want to execute the same command again. We’ll ignore that the easiest way to do that would be to press the up arrow to repeat the previous command. Instead, we will start typing the command again, and zsh will begin outputting the rest of it as a suggestion. The more we type, the more filtered the suggestion is.
Start typing the previous command. As soon as the suggestion is what you want to execute, press the right arrow key to complete it, followed by the enter key to execute the command.
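For comparison, the lookups that the tab key spared us from could be performed manually with commands similar to those in the sketch below. I am not claiming that this is how the completion is implemented; it is only an illustration of the information it retrieves. The list_namespaces and list_pods functions are made-up names, and both require a working connection to a cluster.

```shell
# List the Namespaces, roughly what the first tab press looked up.
list_namespaces() {
    kubectl get namespaces --output name
}

# List the Pods in the Namespace passed as the first argument,
# roughly what the second tab press looked up.
list_pods() {
    kubectl --namespace "$1" get pods --output name
}
```

Running list_namespaces, followed by list_pods kube-system, should output the same names we were choosing from a moment ago, prefixed with the type of the resource.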
What Should We Do Next?
We did not deep-dive into terminals, IDEs, Shells, and other stuff. Instead, we dived straight into setting up what I believe is a very productive combination. Now it’s up to you to make a decision. Do you think that using an IDE and a terminal for most of your work is a good idea? If you do, what do you think about Visual Studio Code, zsh, and Oh My Zsh? Please let me know or suggest an alternative setup. I will do my best to include your suggestions in this chapter.
The next subject we will explore is serverless deployments. Some say that it is the future, while others claim that it is just hype. Let’s see what we can learn and whether you should be using them.
There Is More
There is more material about this subject waiting for you on the YouTube channel The DevOps Toolkit Series⁸⁴. Please consider watching one of the videos that follow, and bear in mind the list will likely grow over time.
• HashiCorp Waypoint Review - Is It Any Good?⁸⁵
• Gitpod - Instant Development Environment Setup⁸⁶
⁸⁴https://www.youtube.com/c/TheDevOpsToolkitSeries
⁸⁵https://youtu.be/7qrovZjdgz8
⁸⁶https://youtu.be/QV1fYt-7SLU
Exploring Serverless Computing
Serverless computing is the future. Or, maybe, it’s just the hype. It’s hard to tell right now, mostly because we do not yet know what serverless is. For some, it is about developing functions (nano services?). For others, it is about not worrying about infrastructure. Some claim that serverless is all about reducing cost, while some say that it is about scalability.
I do not think it is about the size of the applications, even though the first attempts at serverless were mostly focused on functions. Forcing everyone to design their systems around functions is unrealistic and counter-productive. Instead, I think that serverless is mostly about externalizing maintenance of infrastructure and the up-time of the applications.
Now, you might say that we had that before, and that would be true. Many were using third-party companies to manage their infrastructure and to monitor their systems. But that was ineffective. It required too many people who, by the nature of the business, had no incentive to improve. Now we are getting a similar result, but instead of employing an army of people to manage and maintain our infrastructure and applications, we have systems capable of doing that. As a result, we need only a fraction of people to support those systems, when compared to what we needed before.
Initially, I thought that the word “serverless” was the worst name the software industry gave to anything. Now, however, I believe the name is close to what it really is. It’s not that there are no servers (there certainly are), but that, from the user’s perspective, servers are irrelevant. For a user, it is as if servers do not exist. Hence, it is serverless from the user’s perspective. But that is only part of the story. It’s not only about infrastructure, but the whole management of our applications.
I tend to think about serverless as “my job is to write code, everything else should just happen.” That is certainly a worthy goal.
If we can get that far, we can enable developers to focus on writing code and assume that everything else is just working. But such attempts are not new. In the past, we could accomplish that goal in many different ways. We had Platform-as-a-service solutions, but most of them are now considered obsolete, if not failures. Maybe they emerged too early when the industry was not mature enough, or perhaps we misused them. Heroku was a great example of a great idea that failed to gain traction, and it is far from being the only one. There are other examples from the past, but I’ll skip commenting on them. Instead, we’ll focus on what we have as viable options today.
I just noticed that I started rambling again, so let me try to regain focus by defining serverless computing.
Serverless computing is an execution model where a provider or a platform is responsible for executing code by dynamically allocating resources. We give code (or binaries) to someone, and that
someone (or something) ensures that it is running, that it is highly available, that it is responsive, and so on.
I will categorize serverless computing into two big categories.
• Functions as a Service (FaaS)
• Containers as a Service (CaaS)
Those two models are mostly the same since, in both cases, our code, applications, functions, or whatever we want to call them, are likely running in containers. I used the word “likely” because the information on how vendor-specific FaaS is implemented is not always public. Nevertheless, it is almost certainly based on containers, and the main question is rather what is orchestrating them. In some cases, it is Kubernetes, and in others, it is Docker Swarm. Some solutions use custom platforms, while, in some cases, we do not know what is behind.
Still, all that does not matter. For you, as a user, what matters is that your functions are running, and not what is running them.
What matters for FaaS is that there are severe limitations. The code needs to be tiny (a function), it cannot run forever, it might need to be stateless, and so on and so forth. The specific limitations depend on the provider, so it’s hard to list them all. Nevertheless, the most important thing to note is that it is a function or, to put it in other words, a minuscule application. In some cases, tiny is good, while in others it is not.
Most likely, even if you do find a use case for functions, they are unlikely to be your whole system. Do not be discouraged since that does not mean that you should not adopt FaaS. There is nothing wrong with having different mechanisms to accomplish different goals. It will be up to you to decide whether FaaS makes sense for your use cases. Actually, that is the purpose of every section in this book. Wait until we explore FaaS through practical examples before you make any decisions.
Further on, we have Containers as a Service or CaaS as the second most common flavor of serverless computing. It is almost the same as FaaS, and the significant difference is that quite a few of the limitations are lifted. The real differentiator between FaaS and CaaS is how we package deliverables. While FaaS expects code or binaries and is limited to whatever languages are supported, CaaS can, theoretically, run anything packaged as a container image. That does not mean that any image is an equally good candidate for CaaS. Stateless is better than stateful. Those that initialize faster are better than those that are slow to boot. And so on and so forth.
Further on, we have a distinction between proprietary solutions, usually employed by public Cloud vendors, and open source. We can also split the solutions between services provided by others and those you would maintain, hopefully as a service to other teams in your company. We’ll call the
former managed and the latter self-managed. There are quite a few other ways we can slice it and dice it. But we won’t go there.
All in all, there are a couple of different ways we can group serverless computing. So we will separate them into categories. We will split the solutions into Functions as a Service (FaaS) and Containers as a Service (CaaS). The problem is that those are often mixed, so I’ll make an easy distinction between the two. The difference is in the form of the deliverables we need to provide to a service. They can be code or binaries, and, in that case, we will call those FaaS. On the other hand, the service might ask us to provide a container image, and, in that case, we’ll call it CaaS.
As I already mentioned, the division between FaaS and CaaS is blurred since almost all solutions do use containers. So, in the rest of this book, I will not be making the separation based on how it is running. That is almost always a container. Instead, the distinction will be made based on our input (code or container images).
Another distinction we will make is whether you or someone else is in charge of maintaining the platform. We can choose a managed solution from one of the vendors (e.g., AWS, Azure, Google Cloud) or assemble and manage it ourselves as part of our platform of choice. Since almost all serverless computing platforms are based on containers, the latter choice almost inevitably leads to Kubernetes. In any case, if it is you who is managing the platform, it is self-managed, and if it is someone else, it is managed (without the word self).
All in all, we will split serverless computing into FaaS and CaaS and managed and self-managed. You are likely to choose one from both groups, and that will give you a clearer indication of what would be the right solution for you. There is always a possibility that none of the serverless flavors fit your use cases. Still, the best way to know that is to learn more about serverless.
This was already more theory than I like, so let’s dive into practical implementations and examples. We’ll start with managed Functions as a Service or simply managed FaaS.
Using Managed Functions As A Service (FaaS)
We will not go into a detailed explanation of what managed Functions as a Service (FaaS) means. We already explained it briefly, and you are soon going to get hands-on experience that will likely provide more insight than any theory would. For now, just remember that we are focused on managed services, which are those maintained by others, and that we are exploring FaaS defined as “give me code as input.” Other serverless flavors might come later.
Before we dive into practical examples, I should probably comment on a potentially big problem that almost all managed FaaS solutions have. They are mostly proprietary, and they can differ a lot from one solution to another. Lack of uniformity might not be an issue if you already chose to use one and only one provider. But, if you are still exploring what might be the best solution to commit to, or, if you are planning to span multiple vendors, a common denominator might come in handy.
Fortunately, Serverless Framework⁸⁷ might be the missing piece that brings all managed FaaS solutions under a single umbrella. It is not limited to managed FaaS, though; it can be used for quite a few other flavors. Nevertheless, we are focusing on managed FaaS, at least for now, so that’s what matters.
The Serverless Framework claims that it gives us everything we need to develop, deploy, monitor, and secure serverless applications on any cloud. That claim might have gone a bit too far, though. Still, it is a handy tool that we are going to use heavily. Not only is it useful under “normal conditions”, but it will also allow us to have the same process no matter whether we are exploring the solutions provided by AWS, Azure, or Google Cloud. Think of it as a prerequisite for the next few sections.
If you are a Windows user, I will assume that you are running the commands from a Bourne Again Shell (Bash) or a Z Shell (Zsh) and not PowerShell.
That should not be a problem if you followed the instructions on setting up Windows Subsystem for Linux (WSL) explained in the Setting Up A Local Development Environment chapter. If you do not like WSL, a Bash emulator like GitBash should do. If none of those is an acceptable option, you might need to modify some of the commands in the examples that follow.
Let’s install the Serverless Framework CLI. All the commands from this chapter are available in the 04-01-managed-faas.sh⁸⁸ Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.
⁸⁷https://www.serverless.com/ ⁸⁸https://gist.github.com/5d309852d42475202f4cfb6fdf4e8894
curl -o- -L https://slss.io/install \
    | bash
The output is uneventful, so we’ll skip commenting on it. For the newly installed binary to be usable, we need to start a new terminal session. Given that there is no point in keeping the existing one open, we’ll exit from it first.
exit
That’s it. We installed the serverless CLI, and all we have to do to use it is open a new terminal session. To be on the safe side, we’ll confirm that it works by outputting help from any of its commands.
serverless create --help
The output is as follows.
Plugin: Create
create ........................ Create new Serverless service
    --template / -t .................... Template for the service. Available templates: "aws-clojure-gradle", "aws-clojurescript-gradle", "aws-nodejs", "aws-nodejs-typescript", "aws-alexa-typescript", "aws-nodejs-ecma-script", "aws-python", "aws-python3"
                                         "aws-groovy-gradle", "aws-java-maven", "aws-java-gradle", "aws-kotlin-jvm-maven", "aws-kotlin-jvm-gradle", "aws-kotlin-nodejs-gradle", "aws-scala-sbt", "aws-csharp"
                                         "aws-fsharp", "aws-go", "aws-go-dep", "aws-go-mod", "aws-ruby", "aws-provided"
                                         "tencent-go", "tencent-nodejs", "tencent-python", "tencent-php"
                                         "azure-csharp", "azure-nodejs", "azure-nodejs-typescript", "azure-python"
                                         "cloudflare-workers", "cloudflare-workers-enterprise", "cloudflare-workers-rust"
                                         "fn-nodejs", "fn-go"
                                         "google-nodejs", "google-python", "google-go"
                                         "kubeless-python", "kubeless-nodejs"
                                         "knative-docker"
                                         "openwhisk-java-maven", "openwhisk-nodejs", "openwhisk-php", "openwhisk-python", "openwhisk-ruby", "openwhisk-swift"
                                         "spotinst-nodejs", "spotinst-python", "spotinst-ruby", "spotinst-java8"
                                         "twilio-nodejs"
                                         "aliyun-nodejs"
                                         "plugin"
                                         "hello-world"
    --template-url / -u ................ Template URL for the service. Supports: GitHub, BitBucket
    --template-path .................... Template local path for the service.
    --path / -p ........................ The path where the service should be created (e.g. --path my-service)
    --name / -n ........................ Name for the service. Overwrites the default name of the created service.
I had an ulterior motive for executing that particular command. Not only did it confirm that the CLI works, but it also showed us the available templates. At the time of this writing, the Serverless Framework allows us to create and operate around fifty different serverless computing variations. We can use it to quick-start us on AWS, Azure, Google Cloud, and quite a few other providers. Most of those support different languages.
In the sections that follow, we will explore three flavors of managed FaaS. We will use Google Cloud Functions, AWS Lambda, and Azure Functions. Feel free to jump to whichever provider you are using, or, even better, go through all three. I recommend the latter option. You will not spend any money since none of the three charges anything until we pass a certain number of requests. On the other hand, understanding how managed FaaS works on all of them will give you a better understanding of the challenges and the pros and cons we might encounter. I will try to keep all managed FaaS examples as similar as possible. That might be a bit boring if you’re going through all of them. Nevertheless, I believe that keeping them similar to one another will allow us to compare them better.
The first managed FaaS we will explore is Google Cloud Functions.
Deploying Google Cloud Functions (GCF)
Google Cloud Functions⁸⁹ service ticks the usual managed FaaS boxes. It is a service that allows us to deploy and manage functions while ignoring the existence of servers. Functions are not running on thin air, but, this time, it’s Google and not you who is making sure that they are up and running. The solution scales automatically based on the load. It has monitoring, logging, debugging, security, and so on and so forth.
⁸⁹https://cloud.google.com/functions
All managed FaaS solutions we’ll explore are doing, more or less, the same things, so we can skip that part and jump straight into practical hands-on examples. But, before we start deploying Google Cloud Functions, we need to set up a few things. We need a Google project, a service account with a corresponding key, a policy, and a few APIs enabled. We could go through the instructions on how to set that up, but that would be boring. Instead, I prepared a Terraform definition that will do all that for us. You can use it as-is, modify it to your liking, or ignore it altogether. If you do choose to ignore what I prepared, you might still want to consult the Terraform definition to find out the requirements, which you can then create any way you like. Keep in mind that using my definition is likely the easiest option, and I am assuming that you are sufficiently familiar with Terraform to explore it on your own.
All in all, install Google Cloud SDK (gcloud)⁹⁰ and Terraform⁹¹, and follow the instructions in the gcf.sh⁹² Gist, or roll out the requirements any other way. What matters is that you will need to export PROJECT_ID and REGION variables. Their names should be self-explanatory. Also, we’ll need a path to the account.json file with the credentials. It should be stored in the environment variable PATH_TO_ACCOUNT_JSON. We’ll see later why we need those. If you are using my Gist, they are already included. I’m mentioning them just in case you choose to go rogue and ignore the Gist.
If you are getting confused with the word “Terraform”, you either skipped the Infrastructure as Code (IaC) chapter or did not pay attention. If that’s what’s going on, I suggest you go back to that chapter first.
We’ll start by creating a new Google Cloud Functions project. Instead of following whichever means Google provides, we’ll use serverless to create a new project. As a minimum, we need to select a programming language for our functions, and a directory we want to use to keep the project files. While the project directory can be anything we want, languages are limited. At the time of this writing, Google Cloud Functions support Node.js (versions 8 and 10), Python, Go, and Java. By the time you read this, others might have been added to the mix, and you should check the official documentation. To make things a bit more complicated, the Serverless Framework might not yet have templates for all the languages Google supports.
We will use Node.js in our examples. I chose it for a few reasons. To begin with, it is a widely used language. At the same time, I thought that you might want a break from me injecting Go everywhere. Another critical factor is that Node.js is supported in almost all FaaS solutions, so using it will allow us to compare them easily. Do not be discouraged if Node.js is not your language of choice. Everything we will do applies to any other language, as long as it is supported by Google Cloud Functions.
⁹⁰https://cloud.google.com/sdk/install
⁹¹https://learn.hashicorp.com/terraform/getting-started/install.html
⁹²https://gist.github.com/a18d6b7bf6ec9516a6af7aa3bd27d7c9
Let’s create a Google Cloud Functions project. We’ll use the google-nodejs template and create the files in the gcp-function directory.
serverless create \
    --template google-nodejs \
    --path gcp-function
If you are interested in which other templates are available, you can get the list by executing serverless create --help and observing the possible values of the --template argument. Those related to GCF are all prefixed with google.
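Scanning the full help text for a single provider can be tedious, so here is a minimal sketch that filters the list by prefix. The function name templates_for is hypothetical and not part of the Serverless Framework, and the approach assumes that template names keep appearing as double-quoted, lowercase tokens in the help output, as they do in the listing shown earlier.

```shell
# Hypothetical helper (not part of the Serverless Framework): print the
# template names that start with a given prefix. It expects the text of
# `serverless create --help` on standard input.
templates_for() {
    prefix="$1"
    # Extract every double-quoted token, drop the quotes, keep the matches.
    grep -o '"[a-z0-9-]*"' \
        | tr -d '"' \
        | grep "^${prefix}" \
        | sort -u
}

# Demonstration on a fragment of the help output.
printf '%s\n' \
    '"fn-nodejs", "fn-go"' \
    '"google-nodejs", "google-python", "google-go"' \
    '"kubeless-python", "kubeless-nodejs"' \
    | templates_for google
```

Against the real CLI, the equivalent call would be `serverless create --help | templates_for google`.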
Next, we will enter the newly-created directory gcp-function, and list the files that were created for us.
cd gcp-function

ls -1
The output is as follows.
index.js
package.json
serverless.yml
There are only three files in there, so this should be a rapid overview. Let’s start with index.js.
cat index.js
The output is as follows.
'use strict';

exports.http = (request, response) => {
  response.status(200).send('Hello World!');
};

exports.event = (event, callback) => {
  callback();
};
As you can see, there isn’t much to look at. The function responds with Hello World!. It could hardly be any simpler than that. We will not dive into extending the function to do something more complicated than spitting out a “hello world” message. I’m sure you know how to write an application, especially when it is as tiny as a single function.
The next in line is package.json.
cat package.json
The output is as follows.
{
  "name": "gcp-function",
  "version": "0.1.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "serverless.com",
  "license": "MIT",
  "devDependencies": {
    "serverless-google-cloudfunctions": "*"
  }
}
It’s a typical npm package config that, among other things, defines the dependencies which, in this case, are serverless-google-cloudfunctions. The serverless.yml file is the key. It defines everything serverless needs to deploy and manage our functions.
cat serverless.yml
The output, without the commented parts, is as follows.
service: gcp-function

provider:
  name: google
  stage: dev
  runtime: nodejs8
  region: us-central1
  project: my-project
  ...
  credentials: ~/.gcloud/keyfile.json

plugins:
  - serverless-google-cloudfunctions
...
package:
  exclude:
    - node_modules/**
    - .gitignore
    - .git/**

functions:
  first:
    handler: http
    events:
      - http: path
...
Most of the entries should be self-explanatory and do not need any particular explanation. Later on, if you choose to use Google Cloud Functions through the Serverless Framework, you might want to dig deeper into the available options. We will need to change a few entries in the serverless.yml file. Specifically, we’ll set the region, the project, and the location of the credentials file. As you already know from the previous sections, we will use sed magic to replace those values. Make sure that you do have the environment variables used in the command that follows. If you don’t, refer to the gcf.sh⁹³ Gist for instructions.
⁹³https://gist.github.com/a18d6b7bf6ec9516a6af7aa3bd27d7c9
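Since the replacement depends on REGION, PROJECT_ID, and PATH_TO_ACCOUNT_JSON being set, a quick guard can catch a missing one before serverless.yml is touched. What follows is only a sketch; the function name require_vars is hypothetical.

```shell
# Hypothetical helper: report every named variable that is unset or empty,
# and return a non-zero status if at least one is missing.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect variable lookup that also works in plain POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It could be called as `require_vars REGION PROJECT_ID PATH_TO_ACCOUNT_JSON` right before the command that follows.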
cat serverless.yml \
    | sed -e "s@us-central1@$REGION@g" \
    | sed -e "s@my-project@$PROJECT_ID@g" \
    | sed -e "s@~/.gcloud/keyfile.json@$PATH_TO_ACCOUNT_JSON@g" \
    | tee serverless.yml
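The pipeline above reads from and writes to the same file. Depending on timing, tee can truncate serverless.yml before cat has finished reading it. If you ever end up with an empty file, a variation that goes through a temporary file avoids that. The sketch below is an alternative under that assumption, not the command from the Gist, and the function name set_gcf_values is hypothetical. It also assumes that none of the values contain the @ or & characters, which sed would treat specially.

```shell
# Hypothetical helper: apply the same three substitutions, but write the
# result to a temporary file first and copy it back only if sed succeeded.
set_gcf_values() {
    file="$1"
    tmp="$(mktemp)"
    sed -e "s@us-central1@$REGION@g" \
        -e "s@my-project@$PROJECT_ID@g" \
        -e "s@~/.gcloud/keyfile.json@$PATH_TO_ACCOUNT_JSON@g" \
        "$file" > "$tmp" \
        && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

With the three variables exported, it would be invoked as `set_gcf_values serverless.yml`.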
We are almost ready to deploy our function. But, before we do that, we need to install the dependencies. For that, we will need npm, which, more often than not, is installed through Node.js. Hopefully, I do not need to explain what npm is. You likely already have npm on your laptop. If that’s not the case, you can install it through the commands that follow. Please execute the command that follows if you are using Ubuntu, and you do not already have npm. Remember, if you are a Windows user, I assume that you are using WSL and that you have Ubuntu installed as a subsystem.
sudo apt update && \
    sudo apt install nodejs npm
Please execute the command that follows if you are a macOS user, and you do not already have npm.
open https://nodejs.org/en/download/
If you are neither an Ubuntu, nor a WSL, nor a macOS user, you are on your own. Google (search) is your friend.
Now we can install the dependencies.
npm install
The output, limited to, in my opinion, the most fascinating line, is as follows.
...
added 48 packages from 43 contributors and audited 48 packages in 4.145s
...
The process added 48 dependencies, and we can proceed to deploy the function. Node.js dependency model and practices are something I have difficulties justifying. In this example, we need 48 dependencies for a simple “hello world” example. That’s such a significant number that I decided to make it a competition between the “big three” implementations of managed FaaS. Let’s see who will win by having the most dependencies.
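To keep score in that competition without reading the whole npm output, the number can be pulled out of the summary line. This is a small sketch with a hypothetical function name (added_packages). It assumes that the summary line starts with the word added followed by the count, as it does in the output above. Other npm versions might phrase it differently.

```shell
# Hypothetical helper: extract the number of added packages from the
# summary line that npm prints at the end of an installation.
added_packages() {
    sed -n 's/^added \([0-9][0-9]*\) packages.*/\1/p'
}

# Demonstration with the line from the Google Cloud Functions example.
echo 'added 48 packages from 43 contributors and audited 48 packages in 4.145s' \
    | added_packages
```

On a fresh project, it could be combined with the installation itself, as in `npm install | added_packages`.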
Let’s deploy the function. That’s what we are here for, isn’t it?
serverless deploy
The output, limited to the last few lines, is as follows.
...
Deployed functions
first
  https://us-east1-doc-20200619182405.cloudfunctions.net/my-service-dev-main
We can see that only one function was deployed (first) and exposed through a publicly accessible address. Let’s try to invoke it and confirm whether it works. We can use the serverless invoke command for that.
serverless invoke --function first
The output is as follows.
Serverless: zbia1ws9w56q Hello World!
While that confirmed that our function is accessible through the serverless CLI, we need to verify that it is available to everyone through an API or a browser. So, we need the address that we got when we deployed the function. We can retrieve almost any information related to the functions in a project by outputting the info.
serverless info
The output is as follows.
Service Information
service: my-service
project: doc-20200619182405
stage: dev
region: us-east1

Deployed functions
first
  https://us-east1-doc-20200619182405.cloudfunctions.net/my-service-dev-main
As you can see, that is the same info we got at the end of the deploy process. What matters, for now, is the address. Please copy it, and use it to replace [...] in the command that follows.
export ADDR=[...]
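If you prefer not to copy and paste, the address can be extracted from the info output instead. The sketch below uses a hypothetical function named first_url and assumes that the only https address in that output is the one of the function, as in the output shown above. With more than one deployed function, it would return only the first address.

```shell
# Hypothetical helper: print the first https URL found on standard input.
first_url() {
    grep -o 'https://[^[:space:]]*' | head -n 1
}

# Demonstration with the relevant part of the output shown above.
printf '%s\n' \
    'Deployed functions' \
    'first' \
    '  https://us-east1-doc-20200619182405.cloudfunctions.net/my-service-dev-main' \
    | first_url
```

With the real CLI, that becomes `export ADDR=$(serverless info | first_url)`.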
Now we can test whether we can send a request to the newly deployed function without any help from the serverless CLI. After all, users of our function are likely not going to use it.
curl $ADDR
The output is as follows.
403 Forbidden

Error: Forbidden
Your client does not have permission to get URL /my-service-dev-main from this server.
Your output might have been less disappointing. If you got Hello World!, the missing feature that I am about to comment on was likely added to the Serverless Framework, and you can skip the next few paragraphs.
Sometime around the end of 2019, Google introduced the support for IAM policies for Cloud Functions, and, since then, they are inaccessible by default. That feature is a good thing, given that it allows us to decide who can invoke our functions. However, that is also inconvenient since we need to figure out how to make it public to test it.
Typically, the Serverless Framework should have the capability to help us decide who can access our functions, but that is not yet the case. To be more precise, it wasn’t at the time of this writing (June 2020). So, until that feature is added to the Serverless Framework, we’ll have to grant permissions to the function using gcloud. You might want to follow issues 179⁹⁴ and 205⁹⁵ if you want to stay up-to-date with the status of the “missing feature”.
Fortunately, adding a policy for the function is relatively easy, so let’s get down to it.
gcloud functions \
    add-iam-policy-binding \
    gcp-function-dev-first \
    --member "allUsers" \
    --role "roles/cloudfunctions.invoker" \
    --region $REGION \
    --project $PROJECT_ID
Now we should be able to invoke the function directly.
curl $ADDR
The output is Hello World!, and we can congratulate ourselves for deploying our first Google Cloud Function. Google promised that it would be easy, and it indeed was.
There’s still one more thing we might want to check before moving to the next managed FaaS flavor. Or, to be more precise, there are many other things we might want to explore, but I will limit our curiosity to only one more potentially significant test. We are about to see how our function behaves under moderate (not even heavy) load. We’ll use Siege⁹⁶ for that. It can be described as a “poor man’s” load testing and benchmark utility. While there are better tools out there, they are usually more complicated. Since this will be a “quick and dirty” test of availability, siege should do.
I will run siege as a Pod. Feel free to follow along if you have a Kubernetes cluster at your disposal. Otherwise, you can skip executing the commands that follow and observe the results I will present.
We will be sending a thousand concurrent requests for thirty seconds.
⁹⁴https://github.com/serverless/serverless-google-cloudfunctions/issues/179
⁹⁵https://github.com/serverless/serverless-google-cloudfunctions/issues/205
⁹⁶https://github.com/JoeDog/siege
kubectl run siege \
    --image yokogawa/siege \
    --generator run-pod/v1 \
    -it --rm \
    -- --concurrent 1000 --time 30S "$ADDR"
The output, limited to the relevant parts, is as follows.
...
Transactions:              9819 hits
Availability:             99.96 %
Elapsed time:             30.02 secs
Data transferred:          0.44 MB
Response time:             1.91 secs
Transaction rate:        327.08 trans/sec
Throughput:                0.01 MB/sec
Concurrency:             626.36
Successful transactions:   9725
Failed transactions:          4
Longest transaction:      22.20
Shortest transaction:      0.20
...
Siege sent almost ten thousand requests during thirty seconds, and 99.96 % were successful. That’s a decent figure. It’s not great, but it could be worse as well. If you did run siege, your results are almost certainly going to differ from mine.
Let’s run it one more time and check whether the results are consistent.
kubectl run siege \
    --image yokogawa/siege \
    --generator run-pod/v1 \
    -it --rm \
    -- --concurrent 1000 --time 30S "$ADDR"
The output, limited to the relevant parts, is as follows.
...
Transactions:              9438 hits
Availability:             99.87 %
Elapsed time:             29.59 secs
Data transferred:          0.46 MB
Response time:             1.66 secs
Transaction rate:        318.96 trans/sec
Throughput:                0.02 MB/sec
Concurrency:             529.84
Successful transactions:   9323
Failed transactions:         12
Longest transaction:      17.58
Shortest transaction:      0.21
...
The results are similar. This time, 99.87 % of the requests were successful. Later on, we’ll compare the results with those from AWS and Azure and see whether there is any significant difference.
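The availability percentages can be reproduced from the other figures in the two reports. The sketch below assumes that availability is the share of completed transactions among completed plus failed ones. That is my reading of the numbers rather than something the reports state, and the function name availability is hypothetical.

```shell
# Hypothetical helper: availability in percent, computed from the number
# of completed transactions and the number of failed ones.
availability() {
    awk -v ok="$1" -v failed="$2" \
        'BEGIN { printf "%.2f\n", ok / (ok + failed) * 100 }'
}

availability 9819 4     # first run: prints 99.96
availability 9438 12    # second run: prints 99.87
```

Both values match what Siege reported, so the same helper can be used to compare the runs against AWS and Azure later on.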
The last thing we will check is whether there is some kind of a dashboard we can use. I’m not fond of UIs for managing applications, but having a dashboard for monitoring purposes and for insights is usually welcome.
open "https://console.cloud.google.com/functions/list?project=$PROJECT_ID"
Open the gcp-function-dev-first function, and explore it yourself. We’ll remove the function once you’re done “playing”. I’ll assume you’ll know how to follow links and click buttons, so you probably do not need my help.
That’s it. We will not need the function anymore, so we will remove it. While we are at it, we will also get out of the directory with the project, and delete it.
serverless remove

cd ..

rm -rf gcp-function
The function is no more, and we should probably remove the infrastructure resources as well, unless you plan to keep “playing” with the Serverless Framework and Google Cloud Functions. If you used my Gist to create all the required resources using Terraform, you will find the instructions on how to destroy the resources at the bottom. If you created them by yourself, you are on your own.
We’re done. The function is gone, as well as all the pre-requisite resources. It’s as if we haven’t done anything. Next, we will explore Azure Functions.
Deploying Azure Functions (AF)
Azure Functions⁹⁷ service ticks the usual managed FaaS boxes. It is a service that allows us to deploy and manage functions while ignoring the existence of servers. Functions are not running on thin air, but, this time, it is Azure and not you who is making sure that they are up and running. The solution scales automatically based on the load. It has monitoring, logging, debugging, security, and so on and so forth.
You might have noticed that this started with similar or even the same sentences as those from the other managed FaaS chapters. That is intentional. I wanted to make each presented solution alike so that we can compare them easily if you choose to go through all of them, but also so that you do not miss anything if you decided to follow the examples only for Azure.
All managed FaaS solutions we are exploring are doing, more or less, the same things, so we can skip that part and jump straight into practical hands-on examples. But, before we start deploying Azure Functions, we need to set up a few things. We need Azure credentials and a resource group. To be more precise, we do not have to have a resource group. Serverless Framework can create it for us. However, I don’t like leaving such things to the tools that are designed to perform deployments. I believe that Terraform is a much better choice to create infrastructure, even if that is limited to a single resource. So, I prepared a Terraform definition that will create a resource group for us. You can use it as-is, modify it to your liking, or ignore it altogether. If you do choose to ignore what I prepared, you can create the resource group any other way you like, or leave it to the Serverless Framework to do it for you. Just keep in mind that using my definition is likely the easiest option, and I am assuming that you are sufficiently familiar with Terraform to explore it on your own.
All in all, install Azure CLI (az)⁹⁸ and Terraform⁹⁹, and follow the instructions in the af.sh¹⁰⁰ Gist, or roll out the requirements any other way. There’s one more requirement, though. We’ll use jq¹⁰¹, and you will see later on why we need it. Just make sure to install it.
If you are a Windows user, remember that I am assuming that you are using Ubuntu as Windows Subsystem for Linux (WSL), so you should follow the instructions for Linux or Ubuntu, and not for Windows.
⁹⁷https://azure.microsoft.com/services/functions/
What matters is that you will need to export REGION, AZURE_SUBSCRIPTION_ID, and RESOURCE_GROUP variables. Their names should be self-explanatory. We’ll see later why we need those. If you are using my Gist, they are already included. I’m mentioning them just in case you choose to go rogue and ignore the Gist. If you are getting confused with the word “Terraform”, you either skipped the Infrastructure as Code (IaC) chapter or did not pay attention. If that’s what’s going on, I suggest you go back to that chapter first.
We’ll start by creating a new Azure Functions project. Instead of following whichever means Azure provides, we’ll use serverless to create a new project. As a minimum, we need to select a programming language for our functions, and a directory we want to use to keep the project files. While the project directory can be anything we want, languages are limited. At the time of this writing, Azure Functions support C#, JavaScript, F#, Java, PowerShell, Python, and TypeScript. By the time you read this, others might have been added to the mix, and you should check the official documentation. To make things a bit more complicated, the Serverless Framework might not yet have templates for all the languages Azure supports.
We will use Node.js (JavaScript) in our examples. I chose it for a few reasons. To begin with, it is a widely used language. At the same time, I thought that you might want a break from me injecting Go everywhere. Another critical factor is that Node.js is supported in almost all FaaS solutions, so using it will allow us to compare them easily. Do not be discouraged if Node.js is not your language of choice. Everything we will do applies to any other language, as long as it is supported by Azure Functions.
⁹⁸https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
⁹⁹https://learn.hashicorp.com/terraform/getting-started/install.html
¹⁰⁰https://gist.github.com/432db6a1ee651834aee7ef5ec4c91eee
¹⁰¹https://stedolan.github.io/jq/download/
Let’s create an Azure Functions project. We’ll use the azure-nodejs template and create the files in the azure-function directory.
serverless create \
    --template azure-nodejs \
    --path azure-function
If you are interested in which other templates are available, you can get the list by executing serverless create --help and observing the possible values of the --template argument. Those related to Azure Functions are all prefixed with azure.
Next, we will enter the newly-created directory azure-function, and list the files that were created for us.
cd azure-function

ls -1
The output is as follows.
README.md
host.json
package.json
serverless.yml
src
There are only a few files in there, so this should be a rapid overview. The src directory contains a subdirectory handlers, where the source code is located. Let’s take a look at what is inside.
ls -1 src/handlers
The output is as follows.
goodbye.js
hello.js
The Serverless Framework’s template comes with two sample functions. Both are almost the same, so it should be enough to take a quick look at only one of them.
cat src/handlers/hello.js
The output is as follows.
'use strict';

module.exports.sayHello = async function(context, req) {
  context.log('JavaScript HTTP trigger function processed a request.');

  if (req.query.name || (req.body && req.body.name)) {
    context.res = {
      // status: 200, /* Defaults to 200 */
      body: 'Hello ' + (req.query.name || req.body.name),
    };
  } else {
    context.res = {
      status: 400,
      body: 'Please pass a name on the query string or in the request body',
    };
  }
};
As you can see, there isn’t much to look at. The function responds with Hello and a name passed as a query or a message body. It could hardly be any simpler than that. We will not dive into extending the function to do something more complicated than spitting out a “hello” message. I’m sure you know how to write an application, especially when it is as tiny as a single function.
The next in line is package.json.
cat package.json
The output is as follows.
{
  "name": "azure-function",
  "version": "1.0.0",
  "description": "Azure Functions sample for the Serverless framework",
  "scripts": {
    "test": "echo \"No tests yet...\"",
    "start": "func host start"
  },
  "keywords": [
    "azure",
    "serverless"
  ],
  "dependencies": {},
  "devDependencies": {
    "serverless-azure-functions": "^1.0.0"
  }
}
It’s a typical npm package config that, among other things, defines the dependencies which, in this case, are serverless-azure-functions. The serverless.yml file is the key. It defines everything serverless needs to deploy and manage our functions.
cat serverless.yml
The output, limited to the relevant parts, is as follows.
...
service: azure-function
...
provider:
  name: azure
  region: West US 2
  runtime: nodejs12.x
  # os: windows # windows is default, linux is available
  # prefix: "sample" # prefix of generated resource name
  # subscriptionId: A356AC8C-E310-44F4-BF85-C7F29044AF99
  # stage: dev
  # type: premium # premium azure functions
...
functions:
  hello:
    handler: src/handlers/hello.sayHello
    events:
      - http: true
        methods:
          - GET
        authLevel: anonymous # can also be `function` or `admin`

  goodbye:
    handler: src/handlers/goodbye.sayGoodbye
    events:
      - http: true
        methods:
          - GET
        authLevel: anonymous
...
Most of the entries should be self-explanatory and do not need any particular explanation. Later on, if you choose to use Azure Functions through the Serverless Framework, you might want to dig deeper into the available options.
We will need to change a few entries in the serverless.yml file. Specifically, we’ll set the region, the Node.js version, the subscription ID, the resource group, and the prefix. The meaning behind all those should be self-explanatory, except, maybe, the prefix. When deploying Azure Functions without a custom domain, the URL through which we can access them is auto-generated. A combination of the region, the stage, and the function name is used as a subdomain of azurewebsites.net. Given that many are likely using the same region and the same stage (dev), and that azure-function is probably not an uncommon name, we would run a risk that others might be using the same address. Even if no one else came up with a brilliant idea to call it azure-function, there are other readers of this book who follow the same examples. Given that each function has to have a unique address, we’ll use the prefix field. But, given the nature of the problem, it cannot be hard-coded. It needs to be “random”. We’ll generate the prefix using a timestamp. As long as no one executes the command that follows in precisely the same second, it should be sufficiently unique.
export PREFIX=$(date +%Y%m%d%H%M%S)
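To make the shape of that value concrete, the sketch below prints a prefix together with its length. The format string is the one used above: a four-digit year followed by two digits each for the month, the day, the hour, the minute, and the second, fourteen digits in total. The function name timestamp_prefix is hypothetical.

```shell
# The same timestamp command as above, wrapped in a hypothetical function.
timestamp_prefix() {
    date +%Y%m%d%H%M%S
}

PREFIX="$(timestamp_prefix)"
# Show the generated value and how many characters it has.
echo "Prefix: $PREFIX (${#PREFIX} characters)"
```

Two readers get the same prefix only if they run it within the same second, which is the limitation mentioned above.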
Now we can change the values in serverless.yml. As you already know from the previous sections, we will use sed magic to replace those values. Make sure that you do have the environment variables used in the command that follows. If you don’t, refer to the af.sh¹⁰² Gist for instructions.
¹⁰²https://gist.github.com/432db6a1ee651834aee7ef5ec4c91eee
Using Managed Functions As A Service (FaaS) 1 2 3 4 5 6 7 8
223
cat serverless.yml \
    | sed -e "s@West US 2@$REGION@g" \
    | sed -e "[email protected]@nodejs12@g" \
    | sed -e "s@# os@subscriptionId: $AZURE_SUBSCRIPTION_ID\\
  resourceGroup: $RESOURCE_GROUP\\
  prefix: \"$PREFIX\"\\
  # os@g" \
    | tee serverless.yml
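A pipeline that reads a file and writes to it with tee usually works for small files, but tee may truncate the file before cat has finished reading it. If that worries you, a more defensive variant goes through a temporary copy. What follows is only a sketch: replace_in_file is a hypothetical helper, not something the Serverless Framework provides, and it is demonstrated on a scratch file with sample values rather than on the real serverless.yml.

```shell
# Hypothetical helper: run a sed expression against a file through a temporary copy,
# so that the file is never read and truncated at the same time
replace_in_file() {
    rif_tmp="$(mktemp)" || return 1
    # Overwrite the original only if sed succeeded
    sed -e "$1" "$2" > "$rif_tmp" && cat "$rif_tmp" > "$2"
    rif_status=$?
    rm -f "$rif_tmp"
    return "$rif_status"
}

# Demonstration on a scratch file instead of the real serverless.yml
DEMO_FILE="$(mktemp)"
printf 'provider:\n  name: azure\n  region: West US 2\n' > "$DEMO_FILE"
replace_in_file "s@West US 2@East US@g" "$DEMO_FILE"
cat "$DEMO_FILE"
```

The same helper could be called once per substitution, at the price of being a bit more verbose than a single pipeline.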
We are almost ready to deploy our function. But, before we do that, we need to install the dependencies. For that, we will need npm, which, more often than not, is installed through Node.js. Hopefully, I do not need to explain what npm is. You likely already have npm on your laptop. If that’s not the case, you can install it through the commands that follow. Please execute the command that follows if you are using Ubuntu, and you do not already have npm. Remember, if you are a Windows user, I assume that you are using WSL and that you have Ubuntu installed as a subsystem.
sudo apt update && \
    sudo apt install nodejs npm
Please execute the command that follows if you are a macOS user, and you do not already have npm.
open https://nodejs.org/en/download/
If you use neither Ubuntu, nor WSL, nor macOS, you are on your own. Google (search) is your friend.
Now we can install the dependencies.
npm install
The output, limited to, in my opinion, the most fascinating line, is as follows.
...
added 615 packages from 546 contributors and audited 615 packages in 20.635s
...
Installing dependencies might fail with the error gyp: No Xcode or CLT version detected!. I experienced it only on macOS. If that happens to you, and you are indeed using macOS, the solution is to execute the commands that follow: sudo rm -rf /Library/Developer/CommandLineTools && xcode-select --install.
The process added quite a few dependencies. The exact number varies depending on the operating system. In my case, on macOS, 615 dependencies were downloaded. The Node.js dependency model and its practices are something I have difficulty justifying. In this example, we need more than five hundred dependencies for a simple “hello” example. That’s such a significant number that I decided to make it a competition between the “big three” implementations of managed FaaS. Let’s see who will win by having the most dependencies.
There are a few more things that we need to do before we deploy the functions. For Azure Functions to work correctly, we need to install a Serverless Framework plugin that enhances its usage with Azure by adding commands to the serverless CLI. More importantly, without it, we would not be able to deploy to Azure. Let’s see which Azure-related plugins are available. Since the number of plugins is vast, we’ll list only those that contain azure in their name.
serverless plugin list | grep azure
The output is as follows.
serverless-azure-functions - A Serverless plugin that adds Azure Functions support to the Serverless Framework.
As you can see, there is only one plugin, or, to be more precise, there was only one at the time of this writing (June 2020). That makes it easy to choose, doesn’t it? Let’s install it.
serverless plugin install \
    --name serverless-azure-functions
Finally, the last thing missing is the credentials. We need to give the Serverless Framework the means to authenticate to our Azure account. There are a couple of ways to pass authentication to the CLI, and we’ll use what is, in my opinion, the easiest method. We’ll define a few environment variables. Specifically, we will create AZURE_TENANT_ID, AZURE_CLIENT_ID, and AZURE_CLIENT_SECRET.
To get the values we need, we’ll set the account of the local az CLI to the subscription ID, and then we’ll create a service principal. If you are already using Azure, you probably know how to do that. Nevertheless, I’ll list the commands just in case.
Please execute the command that follows to set the account to the subscription ID we got from Terraform, or whichever other means you used if you chose to go rogue and ignore my Gist.
az account set -s $AZURE_SUBSCRIPTION_ID
Next, we’ll create the service principal, and store the output in an environment variable.
export SERVICE_PRINCIPAL=$(\
    az ad sp create-for-rbac)
Let’s take a quick look at the output we stored in SERVICE_PRINCIPAL.
echo $SERVICE_PRINCIPAL
The output is as follows (with password and tenant removed for my own safety).
{
  "appId": "523998df-ced6-4b5f-aff0-5ddc2afafa0d",
  "displayName": "azure-cli-2020-06-28-19-23-04",
  "name": "http://azure-cli-2020-06-28-19-23-04",
  "password": "...",
  "tenant": "..."
}
That is the info we need for serverless to authenticate with Azure. We’ll extract the tenant, the name, and the password using jq, and store the values as environment variables serverless expects.
export AZURE_TENANT_ID=$(
    echo "$SERVICE_PRINCIPAL" | \
    jq -r ".tenant")

export AZURE_CLIENT_ID=$(
    echo "$SERVICE_PRINCIPAL" | \
    jq -r ".name")

export AZURE_CLIENT_SECRET=$(
    echo "$SERVICE_PRINCIPAL" | \
    jq -r ".password")

The -r flag tells jq to print raw strings, without the surrounding JSON quotes, so that the variables hold only the values.
You probably do not want to go through the same steps to create a service principal and the variables again. To save us from doing that more than once, we’ll store those in a file so that we can source them whenever we need them.
echo "export AZURE_TENANT_ID='$AZURE_TENANT_ID'
export AZURE_CLIENT_ID='$AZURE_CLIENT_ID'
export AZURE_CLIENT_SECRET='$AZURE_CLIENT_SECRET'" \
    | tee creds
Now that creds contains confidential information that you can retrieve whenever you need it, we had better make sure that it does not get committed accidentally to Git.
echo "
/creds" | tee -a .gitignore
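If you run the command above more than once, .gitignore ends up with duplicate entries. A small sketch of an idempotent variant follows. The helper add_ignore_entry is hypothetical, and the example works in a scratch directory so that it does not touch your project. It also restricts the permissions of creds, which is my own addition rather than something the exercise requires.

```shell
# Work in a scratch directory so that the sketch does not touch a real project
WORK_DIR="$(mktemp -d)"
cd "$WORK_DIR" || exit 1
touch creds .gitignore

# Hypothetical helper: append an entry only if that exact line is not there yet
add_ignore_entry() {
    grep -qxF "$1" .gitignore 2>/dev/null || echo "$1" >> .gitignore
}

add_ignore_entry "/creds"
add_ignore_entry "/creds"   # the second call changes nothing

# Make the file readable and writable only by its owner
chmod 600 creds

cat .gitignore
```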
From now on, whenever you start a new terminal session and need those credentials, you can retrieve them through the command that follows.
source creds
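Before deploying from a fresh terminal, it can be handy to confirm that the variables are really there. The sketch below uses a hypothetical helper named check_vars that reports which of the given variables are unset or empty. The variable names are the ones used in this section, and the helper itself is not part of any CLI.

```shell
# Hypothetical helper: report which of the named variables are unset or empty
check_vars() {
    missing=0
    for name in "$@"; do
        # Indirectly read the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

check_vars AZURE_SUBSCRIPTION_ID AZURE_TENANT_ID AZURE_CLIENT_ID AZURE_CLIENT_SECRET \
    || echo "Run 'source creds' (and set the subscription ID) before deploying"
```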
Let’s deploy the function. That’s what we are here for, isn’t it?
serverless deploy
The output, limited to the last few lines, is as follows.
...
Serverless: -> Function package uploaded successfully
Serverless: Deployed serverless functions:
Serverless: -> goodbye: [GET] 20200628211206-eus-dev-azure-function.azurewebsites.net/api/goodbye
Serverless: -> hello: [GET] 20200628211206-eus-dev-azure-function.azurewebsites.net/api/hello
We can see that two functions were deployed (goodbye and hello) and exposed through publicly accessible addresses. Let’s try to invoke hello and confirm whether it works. We can use the serverless invoke command for that.
serverless invoke \
    --function hello \
    --data '{"name": "Viktor"}'
The output is as follows.
Serverless: Initializing provider configuration...
Serverless: Logging into Azure
Serverless: Using subscription ID: 7f9f9b08-7d00-43c9-9d30-f10bb79e9a61
Serverless: Invocation url: http://20200628211206-eus-dev-azure-function.azurewebsites.net/api/hello?name=Viktor
Serverless: Invoking function hello with GET request
Serverless: "Hello Viktor"
The last line is the output we were looking for. We got the Hello Viktor message confirming that the response is indeed coming from the newly deployed function. However, while that confirmed that our function is accessible through the serverless CLI, we need to verify that it is available to everyone through an API or a browser. So, we need the address that we got when we deployed the function. We can retrieve almost any information related to the functions in a project by outputting the info.
serverless info
The output is as follows.
Resource Group Name: c1sciafd
Function App Name: 20200628211206-eus-dev-azure-function
Functions:
  goodbye
  hello
Azure Resources:
{
  "name": "20200628211206-eus-dev-d6a0ee-appinsights",
  "resourceType": "microsoft.insights/components",
  "region": "eastus"
},
{
  "name": "202eusdevd6a0ee",
  "resourceType": "Microsoft.Storage/storageAccounts",
  "region": "eastus"
},
{
  "name": "EastUSPlan",
  "resourceType": "Microsoft.Web/serverFarms",
  "region": "eastus"
},
{
  "name": "20200628211206-eus-dev-azure-function",
  "resourceType": "Microsoft.Web/sites",
  "region": "eastus"
}
As you can see, that is the same info we got at the end of the deploy process, only in a different format. It is a “strange” combination of something at the top (it looks like YAML, but it is not), and JSON at the bottom. What matters, for now, is the name of the function which we can use to “guess” what the URL of the function is. It is available at the top next to Function App Name, as well as at the bottom under the block with the resourceType set to Microsoft.Web/sites. Please copy it, and use it to replace [...] in the command that follows. In my case, that would be 20200628211206-eus-dev-azure-function.
export FUNC_NAME=[...]
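If you prefer not to copy and paste, the name can probably be pulled out of the output with sed. The sketch below runs against a sample of the text shown above instead of the live CLI, and it assumes that the Function App Name: label keeps that exact format. INFO is a variable made up for the example.

```shell
# Sample copied from the `serverless info` output shown above
INFO='Resource Group Name: c1sciafd
Function App Name: 20200628211206-eus-dev-azure-function'

# Print whatever follows the "Function App Name: " label
FUNC_NAME=$(printf '%s\n' "$INFO" | sed -n 's/^Function App Name: //p' | head -n 1)
echo "$FUNC_NAME"
```

Against a real project, the same sed expression would be fed from serverless info through a pipe. I have not verified that against other versions of the plugin, so treat it as a convenience rather than a guarantee.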
Next, we will generate the full address using the name of the function.
export ADDR=http://$FUNC_NAME.azurewebsites.net/api/hello?name=Viktor
Now we can test whether we can send a request to the newly deployed function without any help from the serverless CLI. After all, users of our function are unlikely to have that CLI.
curl $ADDR
The output is Hello Viktor, and we can congratulate ourselves for deploying our first Azure Function. Microsoft promised that it would be easy, and it indeed was.
Azure Functions are “insecure by default”. Anyone can access them. In some cases, that is the desired state. In others, you might want to have the functions accessible only to internal users or other internal processes. The good news is that Azure does allow RBAC for the functions. The bad news is that it is out of the scope of this chapter.
There’s still one more thing we might want to check before moving to the next managed FaaS flavor. Or, to be more precise, there are many other things we might want to explore, but I will limit our curiosity to only one more potentially significant test. We are about to see how our function behaves under moderate (not even heavy) load. We’ll use Siege¹⁰³ for that. It can be described as a “poor man’s” load testing and benchmarking utility. While there are better tools out there, they are usually more complicated. Since this will be a “quick and dirty” test of availability, siege should do.
I will run siege as a Pod. Feel free to follow along if you have a Kubernetes cluster at your disposal. Otherwise, you can skip executing the commands that follow and observe the results I will present.
We will send a thousand concurrent requests for thirty seconds.
kubectl run siege \
    --image yokogawa/siege \
    --generator run-pod/v1 \
    -it --rm \
    -- --concurrent 1000 --time 30S "$ADDR"
The output, limited to the relevant parts, is as follows.
¹⁰³https://github.com/JoeDog/siege
...
Transactions:                   9788 hits
Availability:                 100.00 %
Elapsed time:                  29.69 secs
Data transferred:               1.15 MB
Response time:                  1.88 secs
Transaction rate:             329.67 trans/sec
Throughput:                     0.04 MB/sec
Concurrency:                  618.31
Successful transactions:        9250
Failed transactions:               0
Longest transaction:           22.40
Shortest transaction:           0.23
...
Siege sent almost ten thousand requests during thirty seconds, and 100 % were successful. That’s a perfect result. Well done, Azure! If you did run siege, your results are almost certainly going to differ from mine.
Let’s run it one more time and check whether the results are consistent.
kubectl run siege \
    --image yokogawa/siege \
    --generator run-pod/v1 \
    -it --rm \
    -- --concurrent 1000 --time 30S "$ADDR"
The output, limited to the relevant parts, is as follows.
...
Transactions:                   9309 hits
Availability:                 100.00 %
Elapsed time:                  30.08 secs
Data transferred:               1.13 MB
Response time:                  2.33 secs
Transaction rate:             309.47 trans/sec
Throughput:                     0.04 MB/sec
Concurrency:                  722.49
Successful transactions:        9154
Failed transactions:               0
Longest transaction:           26.14
Shortest transaction:           0.23
...
The results are similar. There were still almost ten thousand transactions, and 100 % of the requests were successful. Later on, we’ll compare the results with those from AWS and Google and see whether there is any significant difference.
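As a sanity check on how to read those reports, the transaction rate is simply the number of transactions divided by the elapsed time. The sketch below recomputes it for both runs with awk. The function name rate is made up for the example, and the inputs are the numbers from my two runs.

```shell
# rate TRANSACTIONS ELAPSED_SECONDS prints transactions per second with two decimals
rate() {
    LC_ALL=C awk -v t="$1" -v s="$2" 'BEGIN { printf "%.2f\n", t / s }'
}

FIRST_RUN=$(rate 9788 29.69)
SECOND_RUN=$(rate 9309 30.08)

echo "First run:  $FIRST_RUN trans/sec"
echo "Second run: $SECOND_RUN trans/sec"
```

Both values match the Transaction rate lines that siege printed, so comparing that single number across providers later on should be a fair shortcut.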
The last thing we will check is whether there is some kind of a dashboard we can use. I’m not fond of UIs for managing applications, but having a dashboard for monitoring purposes and for insights is usually welcome.
open "https://portal.azure.com"
Search for function app, and select it. You’ll see the function. Open it and explore the dashboard yourself. We’ll remove the function once you’re done “playing”. I’ll assume you’ll know how to follow links and click buttons, so you probably do not need my help.
That’s it. We will not need the function anymore, so we will remove it.
serverless remove
The output is as follows.
Serverless: Initializing provider configuration...
Serverless: Logging into Azure
Serverless: Using subscription ID: 7f9f9b08-7d00-43c9-9d30-f10bb79e9a61
Serverless: This command will delete your ENTIRE resource group (c1sciafd). and ALL the Azure resources that it contains Are you sure you want to proceed? If so, enter the full name of the resource group :
We are asked whether we want to delete your ENTIRE resource group. That’s a bit silly, isn’t it? Serverless Framework assumes that each functions project is in a separate resource group, so it tries to delete the functions by removing the whole resource group. It would make a lot of sense to proceed if we let the Serverless Framework create the resource group for us. But we didn’t. We are managing it through Terraform, or, to be more precise, we are doing that if you followed the instructions from my Gist.
If we are to remove the whole resource group, we had better do it the “right way” by executing terraform destroy. We’ll get to that soon. For now, cancel the execution of serverless remove by pressing ctrl+c. Let’s get out of the directory with the project, and delete it.
cd ..

rm -rf azure-function
The project directory is no more, and we should probably remove the infrastructure resources as well unless you plan to keep “playing” with the Serverless Framework and Azure Functions. If you used my Gist to create all the required resources using Terraform, you will find the instructions on how to destroy the resources at the bottom. If you created them by yourself, you are on your own.
We’re done. The function is gone, as well as all the pre-requisite resources. It’s as if we haven’t done anything. Next, we will explore AWS Lambda.
Deploying AWS Lambda

AWS Lambda¹⁰⁴ ticks the usual managed FaaS boxes. It is a service that allows us to deploy and manage functions while ignoring the existence of servers. Functions are not running on thin air, but, this time, it is AWS and not you that is making sure that they are up and running. The solution scales automatically based on the load. It has monitoring, logging, debugging, security, and so on and so forth. You might have noticed that this started with similar or even the same sentences as those from the other managed FaaS chapters. That is intentional. I wanted to make each presented solution alike so that we can compare them easily if you choose to go through all of them, but also so that you do not miss anything if you decide to follow the examples only for AWS.
All managed FaaS solutions we are exploring are doing, more or less, the same things, so we can skip that part and jump straight into practical hands-on examples. Before we start deploying AWS Lambda, we need to get credentials, and we might need to install a few tools. Serverless Framework will create everything else.
¹⁰⁴https://aws.amazon.com/lambda/
We’ll deal with the credentials later. For now, install AWS CLI (aws)¹⁰⁵ if you do not have it already. We’ll also need jq¹⁰⁶, and you will see later on why. Just make sure to install it. If you are a Windows user, remember that I am assuming that you are using Ubuntu as Windows Subsystem for Linux (WSL), so you should follow the instructions for Linux or Ubuntu, and not for Windows.
We’ll start by creating a new AWS Lambda project. Instead of following whichever means AWS provides, we’ll use serverless to create a new project. As a minimum, we need to select a programming language for our functions, and a directory we want to use to keep the project files. While the project directory can be anything we want, languages are limited. At the time of this writing, AWS Lambda supports Java, Go, PowerShell, Node.js, C#, Python, and Ruby. By the time you read this, others might have been added to the mix, and you should check the official documentation. To make things a bit more complicated, the Serverless Framework might not yet have templates for all the languages Lambda supports. On top of the already supported ones, AWS provides an API that allows us to use additional languages for our Lambda functions. We will use Node.js in our examples. I chose it for a few reasons. To begin with, it is a widely used language. At the same time, I thought that you might want a break from me injecting Go everywhere. Another critical factor is that Node.js is supported in almost all FaaS solutions, so using it will allow us to compare them easily. Do not be discouraged if Node.js is not your language of choice. Everything we will do applies to any other language, as long as it is supported by AWS Lambda.
Let’s create an AWS Lambda project. We’ll use the aws-nodejs template and create the files in the aws-function directory.
serverless create \
    --template aws-nodejs \
    --path aws-function
If you are interested in which other templates are available, you can get the list by executing serverless create --help and observing the possible values of the --template argument. Those related to AWS Lambda are all prefixed with aws.
Next, we will enter the newly-created directory aws-function, and list the files that were created for us.
¹⁰⁵https://aws.amazon.com/cli/
¹⁰⁶https://stedolan.github.io/jq/download/
cd aws-function

ls -1
The output is as follows.
handler.js
serverless.yml
There are only two files in there, so this should be a rapid overview. If you are familiar with Node.js, you probably noticed that there is no package.json with the list of dependencies. AWS expects us to deliver a ZIP file with the source code as well as the dependencies. As you will see soon, we will not have any, so package.json is not needed. To be more precise, we will not have any dependency directly related to our function. Those required by Lambda will be injected at runtime.
The handler.js file contains a sample function. Let’s take a quick look at it.
cat handler.js
The output is as follows.
'use strict';

module.exports.hello = async event => {
  return {
    statusCode: 200,
    body: JSON.stringify(
      {
        message: 'Go Serverless v1.0! Your function executed successfully!',
        input: event,
      },
      null,
      2
    ),
  };

  // Use this code if you don't use the http event with the LAMBDA-PROXY integration
  // return { message: 'Go Serverless v1.0! Your function executed successfully!', event };
};
As you can see, there isn’t much to look at. The function responds with Go Serverless v1.0! Your function executed successfully!. It could hardly be any simpler than that. We will not dive into extending the function to do something more complicated than spitting out a simple hard-coded message. I’m sure you know how to write an application, especially when it is as tiny as a single function.
The serverless.yml file is the key. It defines everything serverless needs to deploy and manage our functions.
cat serverless.yml
The output, limited to the relevant parts, is as follows.
...
service: aws-function
...
provider:
  name: aws
  runtime: nodejs12.x
...
functions:
  hello:
    handler: handler.hello
...
Most of the entries should be self-explanatory and do not need any particular explanation. Later on, if you choose to use AWS Lambda through the Serverless Framework, you might want to dig deeper into the available options. We will need to add a few entries in the serverless.yml file. Specifically, we will inject events that will enable us to send GET requests to the path hello. To do that, we’ll insert a few lines right below handler.hello. As you already know from the previous sections, we will use sed magic to replace those values.
cat serverless.yml \
    | sed -e "s@handler.hello@handler.hello\\
    events:\\
      - http:\\
          path: hello\\
          method: get@g" \
    | tee serverless.yml
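To see what that substitution produces, here is a small sketch that applies the same sed expression to a three-line sample instead of the real file. SAMPLE and RESULT are names made up for the illustration, and the sample contains only the part of serverless.yml that matters here.

```shell
# A minimal stand-in for the functions section of serverless.yml
SAMPLE="$(mktemp)"
cat > "$SAMPLE" <<'EOF'
functions:
  hello:
    handler: handler.hello
EOF

# Same expression as in the book's command, applied to the sample
RESULT="$(sed -e "s@handler.hello@handler.hello\\
    events:\\
      - http:\\
          path: hello\\
          method: get@g" "$SAMPLE")"

printf '%s\n' "$RESULT"
rm -f "$SAMPLE"
```

The printed YAML keeps the handler line and adds an events block nested under hello, with the http event accepting GET requests on the hello path.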
Next, we’ll need to let serverless CLI know how to authenticate itself before deploying the function. There are a couple of ways to pass authentication to the CLI, and we’ll use what is, in my opinion, the easiest method. We’ll define a few environment variables. Specifically, we will create AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. If you are already using AWS, you probably know how to get the keys. You likely already created them in one of the previous sections. If you didn’t, and you don’t know how to do that, please visit the Creating And Managing AWS Elastic Kubernetes Service (EKS) Clusters With Terraform chapter for instructions. Come back here once you have the keys. Please replace the first occurrence of [...] with the access key ID, and the second with the secret access key in the commands that follow.
export AWS_ACCESS_KEY_ID=[...]

export AWS_SECRET_ACCESS_KEY=[...]
You probably do not want to go through the same steps to create access keys again. To save us from doing that more than once, we’ll store those in a file so that we can source them whenever we need them.
echo "export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
    | tee creds
Now that creds contains confidential information that you can retrieve whenever you need it, we had better make sure that it does not get committed accidentally to Git.
echo "
/creds" | tee -a .gitignore
From now on, whenever you start a new terminal session and need those keys, you can retrieve them through the command that follows.
source creds
Let’s deploy the function. That’s what we are here for, isn’t it?
serverless deploy
The output, limited to the relevant parts, is as follows.
...
endpoints:
  GET - https://vwnr8n39vb.execute-api.us-east-1.amazonaws.com/dev/hello
functions:
  hello: aws-function-dev-hello
layers:
  None
...
We can see that the function was deployed and exposed through a publicly accessible address. Let’s try to invoke hello and confirm whether it works. We can use the serverless invoke command for that.
serverless invoke --function hello
The output is as follows.
{
    "statusCode": 200,
    "body": "{\n  \"message\": \"Go Serverless v1.0! Your function executed successfully!\",\n  \"input\": {}\n}"
}
We got the Go Serverless v1.0! Your function executed successfully! message confirming that the response is indeed coming from the newly deployed function. However, while that confirmed that our function is accessible through the serverless CLI, we need to verify that it is available to everyone through an API or a browser. So, we need the address that we got when we deployed the function. We can retrieve almost any information related to the functions in a project by outputting the info.
serverless info
The output is as follows.
Service Information
service: aws-function
stage: dev
region: us-east-1
stack: aws-function-dev
resources: 11
api keys:
  None
endpoints:
  GET - https://vwnr8n39vb.execute-api.us-east-1.amazonaws.com/dev/hello
functions:
  hello: aws-function-dev-hello
layers:
  None
As you can see, that is the same info we got at the end of the deploy process. What matters, for now, is the GET endpoint. Please copy it, and use it to replace [...] in the command that follows. In my case, that would be https://vwnr8n39vb.execute-api.us-east-1.amazonaws.com/dev/hello.
export ADDR=[...]
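The endpoint can likely be extracted without copying and pasting. The sketch that follows works on a sample of the output shown above, not on the live CLI, and it assumes that the endpoint line keeps the GET - prefix. INFO is a variable made up for the example.

```shell
# Sample copied from the `serverless info` output shown above
INFO='endpoints:
  GET - https://vwnr8n39vb.execute-api.us-east-1.amazonaws.com/dev/hello
functions:
  hello: aws-function-dev-hello'

# Print whatever follows "GET - " on the first matching line
ADDR=$(printf '%s\n' "$INFO" | sed -n 's/^ *GET - //p' | head -n 1)
echo "$ADDR"
```

With a deployed project, the input would come from serverless info through a pipe instead of the INFO variable.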
Now we can test whether we can send a request to the newly deployed function without any help from the serverless CLI. After all, users of our function are unlikely to have that CLI.
curl $ADDR
The output, limited to the first few lines, is as follows.
{
  "message": "Go Serverless v1.0! Your function executed successfully!",
  "input": {
    "resource": "/hello",
    "path": "/hello",
    "httpMethod": "GET",
    "headers": {
      "Accept": "*/*",
      ...
    },
    ...
  }
}
As you can see, it’s a relatively big JSON for a tiny function. We’ll ignore most of it and focus on the message field. It contains the output of our function, and we can confirm that it is indeed accessible to anyone.
AWS Lambda functions are “insecure by default”. Anyone can access them. In some cases, that is the desired state. In others, you might want to have the functions accessible only to internal users or other internal processes. The good news is that AWS does allow RBAC for the functions. The bad news is that it is out of the scope of this chapter.
If you went through similar exercises to deploy functions in Azure and Google Cloud, you probably remember that, at this point, we ran Siege¹⁰⁷ to test performance and availability. We will not do that with AWS Lambda. I could not make it work, and I do not have a good explanation for why I failed. Please let me know if it worked for you and what you did to fix it.
The last thing we will check is whether there is some kind of a dashboard we can use. I’m not fond of UIs for managing applications, but having a dashboard for monitoring purposes and for insights is usually welcome.
open "https://console.aws.amazon.com/lambda/home?region=us-east-1#/functions"
Search for aws-function-dev-hello, and select it. Explore the dashboard yourself. We’ll remove the function once you’re done “playing”. I’ll assume you’ll know how to follow links and click buttons, so you probably do not need my help.
That’s it. We will not need the function anymore, so we will remove it. While we are at it, we will also get out of the directory with the project, and delete it.
serverless remove

cd ..

rm -rf aws-function
We’re done. The function is gone. It’s as if we haven’t done anything. Next, we will see what we learned, discuss the pros and cons of using managed Functions as a Service (FaaS), compare the solutions we explored, and go through a few other useful things that might help you make a decision.
¹⁰⁷https://github.com/JoeDog/siege
To FaaS Or NOT To FaaS?

We should ask two significant questions when contemplating whether we should use the managed Functions as a Service (FaaS) flavor of serverless computing. Should we use them? If we should, shall it be AWS Lambda, Azure Functions, Google Cloud Functions, or something completely different?
So, should we use managed FaaS? We probably should. But that’s not the right question. We can almost certainly find at least one good use case. A more important question is whether managed FaaS can be the solution for a significant percentage of our workload. That question is much more difficult to tackle. To answer it, we might first need to establish good use cases for deploying and running functions.
By now, most of us are using or are experimenting with microservices. They are smaller applications that are loosely coupled, independently deployable, organized around business capabilities, and owned by small teams. I will not dive deeper into microservices. You probably know what they are, and you likely have at least one in your system. You might have even converted everything into microservices by now. What matters is that you do develop and deploy microservices. If you do not, you can just as well forget about functions. If you failed to transition to microservices or you do not see value in them, functions will almost certainly not work for you. From a very simplistic point of view, they are smaller microservices. We could just as well call them nano services.
On the other hand, if you do have good use cases for microservices, you might benefit from functions. You might just as well continue the trend of shrinking your applications into very focused and small entities which we call functions. Some use cases are natural candidates for functions, some might produce a questionable return on investment, while others can be discarded right away. Let’s start with the situations that do not fit well into the FaaS model.
If the workload of an application is, more or less, constant, FaaS is not a good idea. The model is based around a mechanism capable of spinning up an instance of a function for each request. Such an operation is expensive both in the time required to make the function operational and in cost, which is often calculated per request or per time of execution. Even though initialization time is usually measured in milliseconds, if we have relatively constant and massive traffic, wasting time on initialization is not a good idea. If, for example, you have an API that is receiving thousands of concurrent requests and the volume does not change drastically from one minute to another, you are likely better off with a different deployment and execution model. Initialization or the cold-start time has been improved. Service providers managed to reduce it drastically, and they might keep your functions “warm” for a while. Nevertheless, all that only improved the situation, but did not entirely remove the issues with initialization.
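To make the per-request and per-execution-time billing concrete, here is a rough sketch of the arithmetic. Every price in it is an invented placeholder and not any provider's real rate, the formula is a simplification (memory multiplied by execution time, plus a fee per million requests), and faas_cost is a hypothetical helper. Plug in the numbers from your provider's pricing page before drawing any conclusions.

```shell
# faas_cost REQUESTS SECONDS_PER_REQUEST MEMORY_GB PRICE_PER_GB_SECOND PRICE_PER_MILLION_REQUESTS
# Prints the estimated bill with two decimals
faas_cost() {
    LC_ALL=C awk -v r="$1" -v d="$2" -v m="$3" -v pg="$4" -v pr="$5" \
        'BEGIN { printf "%.2f\n", r * d * m * pg + (r / 1000000) * pr }'
}

# All prices below are invented placeholders, not any provider's real rates
PRICE_GB_SECOND=0.00002
PRICE_MILLION_REQUESTS=0.20

# Sporadic use: 100,000 requests a month, 0.1 s each, 0.5 GB of memory
SPORADIC=$(faas_cost 100000 0.1 0.5 "$PRICE_GB_SECOND" "$PRICE_MILLION_REQUESTS")

# Constant use: 1,000 requests per second for 30 days (2,592,000,000 requests)
CONSTANT=$(faas_cost 2592000000 0.1 0.5 "$PRICE_GB_SECOND" "$PRICE_MILLION_REQUESTS")

echo "Sporadic: $SPORADIC"
echo "Constant: $CONSTANT"
```

The bill grows linearly with the number of requests, so the constant case is the one to compare against the price of always-on servers that could handle the same traffic.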
There is one more crucial reason why we are better off not using FaaS when we have a more or less constant workload. It is too expensive. As a matter of fact, if we calculate the cost based on CPU or memory utilization, FaaS is much more costly than most of the alternatives. It can be several times, or even an order of magnitude, more expensive. And that’s where the real importance of variable load comes in. If an application’s workload changes drastically from one moment to another, or if your application is used sporadically, FaaS is way cheaper since we are usually paying for CPU per millisecond, second, or anything in between. But, if we do have constant usage of a resource (e.g., CPU), that is usually much cheaper without FaaS. Under certain conditions, the difference can be expressed with a multiplier of five, ten, or even more.
Let me make it clear. FaaS is very expensive when used extensively. It is likely the most costly way to run our applications at scale. From the cost perspective, managed FaaS tends to make sense only when functions are used sporadically, or when there is a considerable variation in the amount of workload from one moment to another. That does not necessarily mean that FaaS is a bad idea. Those costs might be overshadowed by the benefits and operational or development savings. Nevertheless, the price is not something to be ignored. It should be factored into the decision whether to use FaaS.
Another potentially problematic issue is related to vendor lock-in. How important is it to avoid being locked into a single provider? Think about it and, while you are at it, remember that it is not only about being locked into a specific compute vendor but also that there is not yet a widely accepted standard. You will be locked not only into a vendor but into a particular product of a vendor that is likely very different from other FaaS solutions. Now, vendor lock-in is not necessarily a bad thing. It is all about a calculation in which you put benefits on one side, and the cost on the other. If benefits are higher than the cost, it is okay.
If the return on investment you will gain by kick-starting your project fast is sufficiently high to warrant the potential loss incurred by being locked in, you should go for it. Just remember that managed FaaS is probably one of the best examples of vendor lock-in we have seen so far.

Now, you might say that you do not see what is so difficult. Why would it be hard to switch to a different vendor or a different service? Everything we did with the Serverless Framework seems to be simple enough, no matter the vendor. That might lead you to conclude that it should be easy to move from one provider to another. But that would be misleading. We saw a simple example of a single function. When your functions increase in number, so will the number of auxiliary services around them. At some point, you will realize that there’s much more code required to glue all the functions together than there is in the functions themselves.

All in all, FaaS is rarely the right solution if any of the following cases are true.

• Workloads are, more or less, constant
• Workloads are high
• Lock-in is not desirable

We can turn those upside down, and say that a use case could be the right candidate for FaaS, if workloads are variable and not high, and if the benefits are greater than the downsides of being locked in. Now that we know which situations are not a good fit for FaaS, let’s see a few examples that might be good candidates.
Static content is an excellent candidate for conversion into Functions as a Service. To be more precise, the interface between users and static content can benefit significantly from being functions. Static content, which is not accessed all the time by many, could be on the top of the list for conversion to FaaS. For example, we could store artifacts (binaries) in an external drive, and spin up an instance of a function whenever we want to push or pull something.

Batch processing is another excellent example of something that could be served well with FaaS. It could be, for example, an ETL process initiated when we push a change to storage. Or it could be image processing triggered by an upload. It could also be scheduled creation of backups, which we would typically do through a CronJob.

All in all, the best candidates for FaaS are variable or occasional executions of processes that can be encapsulated into small units (functions).

Does that mean that we cannot build whole systems based on FaaS? It doesn’t. We can undoubtedly split everything we have into thousands of functions, deploy them all independently from each other, and glue them all together through whichever means our provider gives us. But that would probably turn out to be a costly decision. The bill from your compute provider would likely be much bigger than it tends to be, and you would spend more time in maintenance than you’re used to. That would defeat some of the main reasons for using FaaS. It is supposed to be easier to manage and cheaper to run.

Do not go crazy. Do not make a plan to convert everything you have into functions. Heck, we are still debating whether microservices are a good thing or not. Functions applied to everything would be an extreme that would likely result in a miserable failure. Choose the candidates for conversion into functions carefully. In my experience, the cases when functions are the right choice are relatively small in number, when compared with other ways to run applications.
But, if you do find a good use case, the results will likely be gratifying. For now, I will imagine that you have at least one good use-case to create a function. Let’s talk where that function should run.
Choosing The Best Managed FaaS Provider

Which compute provider is the one that offers the best managed FaaS solution? Is it Azure, AWS, or Google Cloud? We will exclude on-prem solutions. Similarly, we will not cover all compute providers, but only the “big three”. Please let me know if you are interested in running functions in your datacenters or in any other provider, and I’ll do my best to evaluate additional solutions.
There are quite a few criteria we could use to compare FaaS services. Many will depend on your use cases, experience, needs, and so on and so forth.
Let’s start with the supported languages. The number of supported languages might have increased since the time of this writing (July 2020).
Google Cloud Functions service supports the languages that follow.

• Node.js
• Python
• Go
• Java

Azure Functions service supports the languages that follow.

• C#
• JavaScript
• F#
• Java
• PowerShell
• Python
• TypeScript

Finally, AWS Lambdas can be written in the languages that follow.

• Java
• Go
• PowerShell
• Node.js
• C#
• Python
• Ruby
According to those lists, Google Cloud Functions is the apparent loser, and both Azure and AWS share the trophy. Nevertheless, that often does not matter. The question is not who supports more languages, but whether the language you prefer is available. Also, it is not only about your preferred language. Some are better suited to be functions than others. Think twice before using a language that needs significant time to initialize processes. In a “normal” situation, waiting for a second for a Java or C# process to start might be okay. After all, one second is almost nothing considering that the process might run for hours, days, or even months. But, in the situation when each request initializes a new instance of your application (function), initialization
time is essential. Even though FaaS providers managed to overcome some of the hurdles around that issue by keeping functions “warm” for a while, those are still improvements and not “real” solutions. I prefer using JavaScript, Python, and Go for functions. They start fast and work well on a small codebase. Still, the choice is yours, and not mine, and the supported languages might influence your decision where to run your functions. We can compare managed FaaS solutions by the time it takes to deploy a function. Initial installation in AWS is the fastest with around 1 minute and 20 seconds, Google being the second with approximately 2 minutes, while in Azure, it takes almost 3 minutes to install a brand new function. Updates of existing functions tend to be fast in Azure and AWS and last between 20 and 30 seconds, while in Google, the time is, more or less, the same as when installing them for the first time. Taking all that into account, we can say that AWS is the clear winner. Nevertheless, that probably does not matter much. I doubt that anyone will choose one over the other because it takes a minute longer to install or update a function. While we are at the subject of installations and updates, we might comment on the complexity of performing those operations. But that is also irrelevant since we did not use a provider-native way to create and deploy functions. Instead, we used the Serverless Framework¹⁰⁸, which works, more or less, the same everywhere. To be more precise, with the Serverless Framework, deploying a function is as easy as it can get wherever we run them. So, we cannot use the complexity of deployment operations as an argument that FaaS of one provider is better than the other. How about availability? No matter where we run our functions, they are highly available with, more or less, the same results across all three providers. 
To be more precise, Google tends to be a bit behind by giving us only one nine after the decimal point (around 99.9%), while others (AWS and Azure) are closer to 100% with at least a few more nines. Still, all three are relatively reliable, and we can say that all provide a highly-available service.

We can also speak about scalability, security, state persistence, and so on and so forth. The truth is that FaaS solutions are, more or less, the same, everywhere. And that leaves us with only one crucial thing that might influence our decision where to run functions.

The cost of running functions is probably on top of everyone’s list. But, in the context of comparing the providers, that is not an essential criterion. Simply put, all are similarly priced. It is difficult to say who is cheaper since many factors can influence the cost. Nevertheless, no matter the usage patterns of our functions, they will cost us, more or less, the same everywhere. The differences are usually not bigger than ten percent. So, we cannot use the price as a differentiator either.

Since we are talking about the cost, I must stress that FaaS tends to be very expensive when running at scale. The previous statements were aimed at comments about the differences in pricing between the major providers, and not about the cost-effectiveness of using managed FaaS in general.

¹⁰⁸https://www.serverless.com/
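To make the point about cost at scale a bit more tangible, here is a rough sketch that compares a pay-per-use function with an always-on virtual machine at different levels of utilization. Both prices are made-up placeholders and not the rates of any provider, and the function name monthly_costs is invented for this illustration. Replace the numbers with the rates from your own bill before drawing any conclusions.

```shell
# Compare the monthly cost of a pay-per-use function with an always-on VM.
# ALL PRICES ARE HYPOTHETICAL placeholders that only illustrate the shape of
# the trade-off; substitute the real rates of your provider.
FAAS_PRICE_PER_BUSY_HOUR=0.30   # assumed cost of one hour of function execution
VM_PRICE_PER_HOUR=0.05          # assumed cost of one hour of a comparable VM
HOURS_PER_MONTH=730

# monthly_costs <utilization-percent>
# Prints "<faas-cost> <vm-cost>" for a workload that is busy for that share
# of the month. The VM is billed for every hour, busy or not.
monthly_costs() {
  awk -v u="$1" -v f="$FAAS_PRICE_PER_BUSY_HOUR" \
      -v v="$VM_PRICE_PER_HOUR" -v h="$HOURS_PER_MONTH" \
      'BEGIN { printf "%.2f %.2f\n", f * h * u / 100, v * h }'
}

for utilization in 5 20 100; do
  echo "$utilization% busy (FaaS VM): $(monthly_costs "$utilization")"
done
```

With these assumed rates, the function is the cheaper option when the workload is busy only 5% of the time and the more expensive one when it is busy all the time. Where the two lines cross depends entirely on the real prices.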
This is a depressing story. I cannot make it exciting. I wanted to get into a position from which I could say something like “use provider X because their service is superior to others.” But I can’t. There are no significant differences. I’m sure that, at this moment, there is at least one of you screaming, “you’re wrong, X is so much better than Y.” Go ahead. Ping me, send me an email, contact me on Slack, or Tweet. I would love to hear from you what makes a FaaS service from one provider so much better than the other.
Truth be told, the comparison between FaaS solutions is (almost) irrelevant. There are no vast differences, and, more importantly, you probably already chose which provider you’re using. It is highly unlikely that you will switch your provider without a very compelling reason, and I do not have one. All in all, managed FaaS solutions are very similar, at least between the “big three” providers. Pros and cons between the solutions probably do not matter since you’re likely going to use whatever your vendor provides if you do choose to use managed FaaS. I can conclude that all FaaS solutions are similarly good or, depending on the point of view, that they are all equally bad. So far, I did my best to be neutral and not “infect” you with my personal views on that subject. Let’s see what I really think about FaaS.
Personal Thoughts About Managed FaaS

Personally, I do not think that managed Functions as a Service are a good idea. Functions are too small for my taste. The execution model in which each request is served by a fresh instance is deeply flawed. The pricing is too high for my budget. All that being said, I can see use cases where managed FaaS is a perfect fit, but only if that would be the only flavor of serverless deployments. But it’s not, even though many equate FaaS with serverless computing.

Functions as a Service became very popular with the introduction of AWS Lambda. That is not to say that Lambda is the first implementation of serverless computing (it is not), but that Lambda is what brought the actual implementation of some of the principles to the masses. Lambda is to serverless what Docker is to containers. Both brought the implementation of existing concepts to the mainstream.

As a result of the popularization of Lambda, people associate serverless with functions. That is wrong. It is a misunderstanding of what serverless is. FaaS is only one flavor of serverless. It is one possible path we can take towards the same goal. But there are others. There are other ways to do serverless deployments, and we will explore them soon.

More often than not, FaaS is not the right solution for the vast majority of workloads. Writing only functions is limiting. Having limits to how long a process can run, which language it is written
in, how many requests it can handle, and so on and so forth, is too constraining for my taste. I understand the need to restrict choices. That’s the best and the easiest way to simplify things and provide a service everyone can use. But it is deeply flawed.

Serverless computing, on the other hand, has much more to offer. Companies and communities are experimenting with different ways we can fulfill the promise of making deployment and management of our applications easier.

A few years from now, I am confident that we will look at managed FaaS like Lambda, Azure Functions, and Google Cloud Functions, as pioneers. They will be services that made serverless computing popular, but also that were failed experiments. Almost all initial attempts at something new turn out to be failures. Mesos is an excellent example of a scheduler that was a pioneer, but failed and was replaced with Kubernetes. Similarly, I believe that the same future is waiting for us with managed FaaS. It will be replaced with a better flavor of serverless computing. As a matter of fact, we already have better implementations of serverless computing, and we will explore them soon.

Use FaaS if you have a good use case for it. But limit your usage of it only to “special” cases. Do not go crazy and convert everything into functions. If you are eager to use serverless deployments, give me a bit more time to introduce you to other options before deciding which is the right one for you, if any.

All in all, I do believe that the future is in serverless computing. However, Functions as a Service (FaaS) is not it. FaaS is not going to be the most commonly used nor the best serverless flavor. We are at the beginning. Other solutions will prevail. While we might not yet know what those other solutions are, there are a few common properties that will likely become a norm across most serverless computing solutions.
Next, we’ll explore Containers as a Service (CaaS) as an alternative flavor of serverless deployments.
Using Managed Containers As A Service (CaaS)

Serverless is the future! You probably heard me repeat that sentence quite a few times. However, you also probably heard me saying that managed Functions as a Service (FaaS) is not it. Or, at least, it is unlikely to represent a significant percentage of your workload.

We went through the exercises of deploying Azure Functions, Google Cloud Functions, and AWS Lambdas. You gained first-hand experience using managed FaaS, and you heard me providing a few reasons why I do not want to put my future into the hands of such services. While I offered reasons why I do not like FaaS, I never explained what I expect from serverless deployments, beyond the obvious things. Understanding the expectations could shine some light on my statements, and likely influence the choices and provide a potential path we might want to take.

Let’s start with the requirements that are obvious and apply to any serverless implementation. We already explored them, but it might be worthwhile refreshing our memory.

Serverless deployments remove the need for us to manage the underlying infrastructure. They do not eliminate application management but simplify it greatly. Any serverless service should provide out-of-the-box high-availability and scaling. Finally, instead of paying for what we use, we are paying for what our users are using. That last sentence might require further explanation.

In a “traditional” Cloud setup, we pay our vendor for virtual machines, storage, networking, and quite a few other things. It does not matter whether someone is using those. We are paying for all the resources we create. To be more precise, we pay for every minute of the existence of those resources, and it is our responsibility to shut down things we do not use. Therefore, we pay for what we use. That, by itself, is a considerable improvement compared to the traditional setup, usually employed on-prem. We have an incentive to shut down the nodes that are not used.
We are responsible for scaling servers and applications, and it is in our interest to always have just the right amount of resources. We need to pay for idle VMs. It does not matter if none of our users are using resources.

On the other hand, the serverless model is based on the principle that we should pay our vendor for the resources our users are consuming. If, for example, we use Functions as a Service, and, at one moment, there are a thousand concurrent requests, a thousand functions will be spun up, and we will pay for the time our users are using them. If the next moment no one wants to use our applications, none of the functions will exist, and there will be nothing we need to pay to our vendor.

It’s not necessarily true that “none of the functions will exist” when no one uses them. Our service provider might keep them “warm”. From our perspective, they do not exist or, to be more precise, we do not pay for them. Also, users in the context of serverless can be people, but also processes. A user is anyone or anything using our function.
Hardly anyone can complain about not having to manage infrastructure. The ability to scale applications automatically and keep them always highly available is something no one can argue against. That is happening automatically without our involvement, making it a “dream come true”. Finally, being able to pay based on usage is tempting. It does not always turn out to be cheap, though. The pricing model behind serverless services can turn out to be a burden rather than a benefit. Nevertheless, we’ll ignore the potentially massive bill for now.

The sentence that says that “hardly anyone can complain about not having to manage infrastructure” is likely not true for some. I understand that sysadmins and operators might feel threatened thinking that such an approach would take away their jobs. I do not think that is the case. If you believe that’s you, all I can say is that you would do other things and be more productive. The alternative is to become obsolete.
All in all, the characteristic features present in (almost) all managed serverless services are as follows.

• No need to manage infrastructure
• Out-of-the-box scalability and high-availability
• “Pay what your users use” model
If we are running serverless in our own datacenter (on-prem), the last point (“Pay what your users use”) can be translated into do not waste resources like CPU and memory for no good reason. However, this section is dedicated to managed CaaS and managed serverless services, so we’ll ignore it.
The problem is that infrastructure tasks delegated to others, out-of-the-box scalability and availability, and paying what our users use are often not enough. We need more.
Discussing The “Real” Expectations

What do I expect from serverless, or for that matter, any type of deployment services?

I expect us to be able to develop our applications locally and to run them in clusters. Ideally, local environments should be the same as clusters, but that is not critical. It’s okay if they are similar. As long as we can easily spin up the application we are working on, together with the direct dependencies, we should be able to work locally. Our laptops tend to have quite a few processors and gigabytes of memory, so why not use them? That does not mean that I exclude development that entirely relies on servers in the cloud, but, instead, that I believe that the ability to work locally is still essential. That might change in the future, but that future is still not here.

I want to develop and run applications locally before deploying them to other “real” environments. It does not matter much whether those applications are monoliths, microservices, or functions.
Similarly, it should be irrelevant whether they will be deployed as serverless, or as Kubernetes resources, or by running as processes on servers, or anything else. I want to be able to develop locally, no matter what that something is.

We also need a common denominator before we switch to higher-level specific implementations. Today, that common denominator is a container image. We might argue whether we should run on bare-metal or VMs, whether we should deploy to servers or clusters, whether we should use a scheduler or not, and so on, and so forth. One of the things that almost no one argues about anymore is that our applications should be packaged as container images. That is the highest common denominator we have today. It does not matter whether we use Docker, Docker Compose, Mesos, Kubernetes, a service that does not provide visibility to what is below it, or anything else. What matters is that it is always based on container images. We can even convert those into VMs and skip running containers altogether. Container images are a universal packaging mechanism for our applications.

I just realized that container images are NOT a common denominator. There are still those using mainframes. I’ll ignore them. There are also those developing for macOS. They are the exception that proves the rule.
Container images are so beneficial and commonly used that I want to say, “here’s my image, run it.” The primary question is whether that should be done by executing docker-compose, kubectl, or something else. There is nothing necessarily wrong in adding additional layers of abstraction if that results in the alleviation of some of the complexities.

Then there is the emergence of standards. We can say that having a standard in an area of software engineering is the sign of maturity. Such standards are often de facto, and not something decided by a few people. One such standard is container images and container runners. No matter which tool you are using to build a container image or run containers, most use the same formats and the same API. Standards often emerge when a sufficient number of people use something for a sufficient period. That does not mean that standards are something that everyone uses, but rather that the adoption is so high, that we can say that the majority is using it.

So, I want to have some sort of a standard, and let service providers compete on top of it. I do not want to be locked in more than necessary. That’s why we love Kubernetes. It provides a common API that is, more or less, the same, no matter who is in charge of it, or where it is running. It does not matter whether Kubernetes is running in AWS, Google, Azure, DigitalOcean, Linode, in my own datacenter, or anywhere else. It is the same API. I can learn it, and I can have the confidence that I can use that knowledge no matter where I work, or where my servers are running. Can we have something similar for serverless deployments? Can’t we get a common API, and let service vendors compete on top of it with lower prices, more reliable service, additional features, or any other way they see fit?

Then there is the issue with restrictions. They are unavoidable. There is no such thing as an unlimited and unrestricted platform. Still, some of the limitations are acceptable, while others are not.
I don’t want anyone to tell me which language to use to write my applications. That does not mean that I do not want to accept advice or admit that some are better than others. I do. Still, I do not want to be
constrained either. If I feel that Rust is the right choice for a given task, I want to use it. The platform I’m going to use to deploy my application should not dictate which language I will use. It can “suggest” that something is a better choice than something else, but it should not restrict my “creativity”. To put it bluntly, it should not matter which language I use to write my applications.

I also might want to choose the level of involvement I want to have. For example, having a single replica of an application for each request might fit some use cases. But there can be (and usually are) those in which I might want to serve up to a thousand concurrent requests with a single replica. That cannot be a decision only of the platform where my application is running. It is part of the architecture of an application as well. I do believe that the number of choices given to users by serverless service providers must be restricted. It cannot be limited only by our imagination. Nevertheless, there should be a healthy balance between simplicity, reliability, and freedom to tweak a service to meet specific use cases’ goals.

Then there is the issue of types of applications. Functions are great, but they are not the solution to all the problems in the universe. For some use-cases, microservices are a better fit, while in others, we might be better off with monoliths. Should we be restricted to functions when performing serverless deployments? Is there a “serverless manifesto” that says that it must be a function?

I am fully aware that some types of applications are better candidates to be serverless than others. That is not much different than, let’s say, Kubernetes. Some applications benefit more from running in Kubernetes than others. Still, it is my decision which applications go where.
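The difference between “one replica per request” and “one replica for many requests” is easy to express as arithmetic. The sketch below is only an illustration; the function name replicas_needed is invented here, and it assumes that requests are spread evenly across replicas, which real platforms do not guarantee.

```shell
# replicas_needed <concurrent-requests> <requests-per-replica>
# Rounds up, because a partially filled replica is still a whole replica.
# Assumes requests are distributed evenly (a simplification).
replicas_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

replicas_needed 1000 1      # one request per replica, FaaS-style
replicas_needed 1000 1000   # a single replica absorbing the whole burst
```

The first call asks for a thousand replicas and the second for only one. Which of the two models is appropriate depends on how the application handles concurrency, and that is why the choice should not belong to the platform alone.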
I want to be able to leverage serverless deployments for my applications, no matter their size, or even whether they are stateless or stateful. I want to give someone else the job to provision and manage the infrastructure and to take care of the scaling and make the applications highly available. That allows me to focus on my core business and deliver that next “killer” feature as fast as possible.

Overall, the following list represents features and abilities I believe are essential when evaluating serverless solutions.

• It should allow local development
• It should leverage common and widely accepted denominators like, for example, container images
• It should be based on some sort of a standard
• It should not be too restrictive
• It should support (almost) any type of applications

None of those items from my “wish list” exclude those we mentioned earlier. Instead, they complement the basic features of managed serverless services that allow us to avoid dealing with infrastructure, scaling, and high-availability, and to pay for what we use. We can say that those
are table-stakes, while the items in my “wish list” are things that I value a lot, and that can be used to evaluate which solution is a better fit for my needs. That was enough mumbling on my part. It’s time to jump into practical examples of yet another flavor of serverless deployments and see which use-cases it might serve well. Let’s explore managed Containers as a Service (CaaS). I made a bold statement saying that CaaS is a flavor of serverless. That might not be the case, so take that statement with a healthy dose of skepticism.
In the sections that follow, we will explore three flavors of managed CaaS. We will use Google Cloud Run, AWS ECS with Fargate, and Azure Container Instances. Feel free to jump to whichever provider you are using, or, even better, go through all three. I recommend the latter option. You will not spend any money since all three do not charge anything until we pass a certain usage. On the other hand, understanding how managed CaaS works on all of them will give you a better understanding of the challenges and the pros and cons we might encounter. I will try to keep all managed CaaS examples as similar as possible. That might be a bit boring if you’re going through all of them. Nevertheless, I believe that having it similar to one another will allow us to compare it better.
The first managed CaaS we will explore is Google Cloud Run.
Deploying Applications To Google Cloud Run

Google Cloud Run is a fully managed compute platform for deploying and scaling containerized applications quickly and securely. Since it uses container images, we can write code in any language we want. We can use any application server and any dependencies. Long story short, if an application and everything it needs can be packaged into a container image, it can be deployed to Google Cloud Run.

The service abstracts infrastructure management, allowing us to focus on our applications and the business value we are trying to create. It provides automated scaling and redundancy, which results in high availability. It has integrated logging, monitoring, and strict isolation. The pricing is based on pay-per-use.

The best part of Google Cloud Run is that it is based on Knative¹⁰⁹. It is an open-source project that aims at becoming a standard for running serverless applications in Kubernetes. As a result, using Google Cloud Run simplifies accomplishing objectives that we could accomplish without it as well. It’s a service on top of an open standard and open source. As such, we might not be locked into the service nearly as much as with other managed serverless solutions, especially FaaS.

¹⁰⁹https://knative.dev/
Knative has a large community and is backed by some of the most important software companies like Google, VMware, RedHat, and IBM. A large community and the support of major players combined with massive adoption means that the technology is likely here to stay. But, in the case of Knative, it is much more than that. It aims to define a standard and become the de facto implementation of serverless deployments in Kubernetes. Nevertheless, in this case, we are interested in Google Cloud Run as a potentially right choice for a Container as a Service implementation of managed serverless. So, we’ll skip the examples of using Knative directly, and focus on what Google offers as a service on top of it. We’ll discuss the pros and cons of using Cloud Run later. Right now, we need a bit of hands-on experience with it. All the commands from this section are available in the 04-02-gcr.sh¹¹⁰ Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.
Before we begin, you’ll need to ensure that you have a few prerequisites set up. If you are a Windows user, I will assume that you are running the commands from a Bourne Again Shell (Bash) or a Z Shell (Zsh) and not PowerShell. That should not be a problem if you followed the instructions on setting up Windows Subsystem for Linux (WSL) explained in the Setting Up A Local Development Environment chapter. If you do not like WSL, a Bash emulator like GitBash should do. If none of those is an option, you might need to modify some of the commands in the examples that follow.
You will need a Google Cloud¹¹¹ account with sufficient permissions. Preferably, you should be the owner of the account or have admin permissions. You will also need to install Google Cloud SDK (gcloud)¹¹². You will need to create a Google Cloud project with a billing account, container registry, and Cloud Run services enabled.

I prepared a Gist gcr.sh¹¹³ that you can use to set up the resources we need. Everything is defined as Terraform configs. I will assume that, by now, you feel comfortable working with Terraform and that you should be able to explore the files from that Gist on your own. Even if you decide to set up everything yourself, you can use those Terraform definitions to figure out what is needed.

If you got scared at the word “Terraform”, you likely missed the section Infrastructure as Code (IaC), or you are very forgetful. If that’s the case, go through the exercises there, and come back when you’re ready.

¹¹⁰https://gist.github.com/59f647c62db7502a2ad9e21210f38c63
¹¹¹https://cloud.google.com/
¹¹²https://cloud.google.com/sdk/install
¹¹³https://gist.github.com/2aa8ee4a6451fd762b1a10799bbeac88
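If you would like to confirm that your environment is ready before continuing, a check along the following lines might help. It is only a sketch. The helper names missing_tools and project_id_is_set are invented for this illustration, and the list of tools (gcloud, plus docker and jq, which the exercises in this section also rely on) is the only thing it verifies. It does not check your account permissions or the resources in the project.

```shell
# missing_tools <command>...
# Prints the name of every command that is not available on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# project_id_is_set succeeds only when PROJECT_ID is set and non-empty.
project_id_is_set() {
  [ -n "${PROJECT_ID:-}" ]
}

missing=$(missing_tools gcloud docker jq)
# $missing is left unquoted on purpose so the names end up on a single line.
[ -z "$missing" ] || echo "Missing tools:" $missing
project_id_is_set || echo "PROJECT_ID is not set; export it before continuing"
```

If the snippet prints nothing, the tools are installed and the variable is set.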
What matters is that you will need to export the PROJECT_ID variable. The name should be self-explanatory. If you are using my Gist, the creation of the variable is already included. I’m mentioning it just in case you choose to go rogue and ignore the Gist.

Finally, you will also need Docker¹¹⁴ running on your laptop, and jq¹¹⁵. You’ll see later why we need those.

All in all, you will need the following.

• Google Cloud¹¹⁶ account
• Google Cloud SDK (gcloud)¹¹⁷
• A few Google Cloud resources and the environment variable PROJECT_ID. Follow the instructions from gcr.sh¹¹⁸.
• Docker¹¹⁹
• jq¹²⁰

Cloud Run can use container images stored in any registry. For simplicity, we’ll use the one provided by Google Cloud. If that’s not your registry of choice, you should be able to modify the examples to use whichever you prefer. But do that later. It will be easier if you follow the examples as they are, and that means that you will be working with Google Container Registry (GCR).

We’ll push the image using Docker, and the first step is to provide it with authentication so that it can access Google Container Registry.
gcloud auth configure-docker
Confirm that you do want to continue by typing Y and pressing the enter key. Your Docker configuration file should be updated.
We will not build container images. That is not the goal of this section. We are trying to deploy applications packaged as container images. So, I will assume that you know how to build container images, and we will take a shortcut by using an image that I already built. Let’s pull it.
docker image pull \
    vfarcic/devops-toolkit-series
Next, we’ll push that image into your registry. To do that, we’ll tag the image we just pulled by injecting gcr.io and your project’s ID. That way, Docker will know that we want to push it to that specific registry.
¹¹⁴https://docs.docker.com/get-docker/
¹¹⁵https://stedolan.github.io/jq/download/
¹¹⁶https://cloud.google.com/
¹¹⁷https://cloud.google.com/sdk/install
¹¹⁸https://gist.github.com/2aa8ee4a6451fd762b1a10799bbeac88
¹¹⁹https://docs.docker.com/get-docker/
¹²⁰https://stedolan.github.io/jq/download/
export IMAGE=gcr.io/$PROJECT_ID/devops-toolkit-series:0.0.1

docker image tag \
    vfarcic/devops-toolkit-series \
    $IMAGE
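The tag we just created follows the pattern gcr.io/PROJECT_ID/NAME:TAG. As a minimal sketch of how that reference is composed, the hypothetical gcr_image helper below (it is not part of the Gist or of gcloud) builds the string and refuses to continue when the project ID is empty, which would otherwise produce an invalid reference such as gcr.io//devops-toolkit-series:0.0.1. The project ID my-project is a placeholder.

```shell
# Compose a Container Registry image reference: gcr.io/<project>/<name>:<tag>.
# gcr_image is a hypothetical helper used only for illustration.
gcr_image() {
  project=$1
  name=$2
  tag=$3
  if [ -z "$project" ]; then
    # An empty project ID would yield "gcr.io//<name>:<tag>", so stop early.
    echo "project ID is empty" >&2
    return 1
  fi
  echo "gcr.io/$project/$name:$tag"
}

# my-project is a placeholder for the value of PROJECT_ID.
gcr_image my-project devops-toolkit-series 0.0.1
```

With the placeholder values, the function prints gcr.io/my-project/devops-toolkit-series:0.0.1.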
Now we can push the image to your Google Container Registry.
docker image push $IMAGE
To be on the safe side, we will list all the images in the container registry associated with the project we are using.
gcloud container images list \
    --project $PROJECT_ID
The output, limited to the relevant parts, is as follows.
NAME
gcr.io/doc-l6ums4bgo3sq4pip/devops-toolkit-series
...
We can see that the image is indeed stored in the registry. If you are scared of terminals and feel dizzy without the occasional presence of a wider color palette, you can visit GCR from Google Console. If you are a Linux or a Windows WSL user, I will assume that you created the alias open and set it to the xdg-open command. If that’s not the case, you will find instructions on how to do that in the Setting Up A Local Development Environment chapter. If you do use Windows, but with a bash emulator (e.g., GitBash) instead of WSL, the open command might not work. You should replace open with echo and copy and paste the output into your favorite browser.
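For shells where the open alias is not set up, the fallback described in the note can be sketched as a small function. This is only an illustration under the assumption that one of open, xdg-open, or plain echo is acceptable; url_opener is a hypothetical name, and the instructions in the Setting Up A Local Development Environment chapter remain the reference.

```shell
# Print the name of a command that can be used to open a URL.
# Prefer `open`, fall back to `xdg-open`, and otherwise use `echo`
# so that the URL can be copied and pasted into a browser.
# url_opener is a hypothetical helper, not something the book defines.
url_opener() {
  if command -v open >/dev/null 2>&1; then
    echo open
  elif command -v xdg-open >/dev/null 2>&1; then
    echo xdg-open
  else
    echo echo
  fi
}

echo "URLs will be opened with: $(url_opener)"
```

A URL could then be opened with something like "$(url_opener)" followed by the address.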
open https://console.cloud.google.com/gcr/images/$PROJECT_ID
Follow the devops-toolkit-series link. We can see the tag 0.0.1. That should be enough of a confirmation that it was indeed pushed to the registry.
Now that we have the container image we want to deploy stored in a registry from which Google Cloud Run can pull it, we can proceed to the most important part of the story. We are about to deploy our application, packaged as a container image, using Google Cloud Run. To do that, we need to make a couple of decisions.
What is the image we want to run? That one is easy since we just pushed the image we’re interested in. We also need to choose a region. To simplify things, we’ll store it in an environment variable.
export REGION=us-east1
Next, we need to decide whether we want to allow unauthenticated requests to the application. For simplicity, we will let everyone see it. Further on, we need to define a port of the process that will be running inside the containers. In this case, it is port 80.
We might want to specify how many concurrent requests a replica of the application should handle. This is a very powerful feature that is typically not available in Functions as a Service solutions. Functions are based on the idea that each request is handled by a single replica. The number of replicas is equal to the number of concurrent requests. If there are ten thousand requests at one moment, FaaS will spin up ten thousand replicas of a function. Such a model is deeply flawed. Most applications can easily handle hundreds, thousands, or even more concurrent requests, without affecting performance, and without a substantial increase of CPU and memory usage. As such, FaaS is ineffective.
On the other hand, Containers as a Service (CaaS) typically do not impose such limitations. We can specify how many concurrent requests can be handled by a single replica. In our case, we will set the concurrency to 100. Cloud Run, in turn, will ensure that the number of replicas is roughly equal to the number of concurrent requests divided by the concurrency. It will monitor the number of requests and scale the application to accommodate the load. If no one wants to use it, none of the replicas will run. On the other hand, if it becomes a huge success, it will scale up.
Knative can use many other parameters to decide when to scale up or down. It is limited only by the scope of the metrics we have. We’ll (probably) explore Knative separately and use that opportunity to see other scaling options.
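The relation between requests, concurrency, and replicas can be illustrated with a bit of arithmetic. The sketch below is only a back-of-the-envelope calculation (ceiling division) and not how the Cloud Run autoscaler is implemented; the replicas function is a hypothetical name.

```shell
# Estimate how many replicas are needed for a given number of concurrent
# requests, assuming each replica handles at most `concurrency` of them.
# This is plain ceiling division, used only as an illustration.
replicas() {
  requests=$1
  concurrency=$2
  echo $(( (requests + concurrency - 1) / concurrency ))
}

replicas 10000 1    # FaaS-like model: one request per replica
replicas 10000 100  # concurrency of 100, as used in this section
replicas 0 100      # no traffic, no replicas
```

The three calls print 10000, 100, and 0. The same ten thousand concurrent requests that would need ten thousand function replicas fit into a hundred replicas when each one accepts a hundred requests.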
We might also need to choose the platform. As I already mentioned, Cloud Run is based on Knative, which can run on any Kubernetes cluster. Similarly, gcloud allows us to deploy serverless applications almost anywhere. We can choose between managed, gke, and kubernetes. The managed platform is Cloud Run, which we are exploring right now. I’m sure you can guess what gke and kubernetes platforms are. Finally, we will have to specify the Google Cloud project in which we will run the application. We could specify quite a few other things. I’ll let you explore them on your own if, after you finish with this section, you still feel that Google Cloud Run might be the right solution for some of your use-cases.
With all that in mind, the command we will execute is as follows.
gcloud run deploy \
    devops-toolkit-series \
    --image $IMAGE \
    --region $REGION \
    --allow-unauthenticated \
    --port 80 \
    --concurrency 100 \
    --platform managed \
    --project $PROJECT_ID
After approximately one minute, our application has been deployed and is available. That’s all it took. A single command, with a few simple arguments, deployed our application packaged as a container image. Let’s confirm that by outputting all the applications managed by Cloud Run inside that project.
gcloud run services list \
    --region $REGION \
    --platform managed \
    --project $PROJECT_ID
The output is as follows.
SERVICE               REGION   URL                                     LAST DEPLOYED BY  LAST DEPLOYED AT
devops-toolkit-series us-east1 https://devops-toolkit-series...run.app viktor@farcic.com 2020-07-06T13...
Similarly, we can retrieve all the revisions of an application.
gcloud run revisions list \
    --region $REGION \
    --platform managed \
    --project $PROJECT_ID
The output is as follows.
REVISION                        ACTIVE SERVICE               DEPLOYED      DEPLOYED BY
devops-toolkit-series-00001-yeb yes    devops-toolkit-series 2020-07-06... viktor@farcic.com
There is not much to look at since we deployed only one revision of the application. The revisions will start piling up if we start deploying new releases. Finally, let’s describe the application and see what we’ll get.
gcloud run services describe \
    devops-toolkit-series \
    --region $REGION \
    --platform managed \
    --project $PROJECT_ID \
    --format yaml
The output, limited to the relevant parts, is as follows.
apiVersion: serving.knative.dev/v1
kind: Service
...
spec:
  template:
    ...
    spec:
      containerConcurrency: 100
      containers:
      - image: gcr.io/doc-l6ums4bgo3sq4pip/devops-toolkit-series:0.0.1
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 1000m
            memory: 256Mi
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100
...
That might seem like yet another YAML output, but it shows something significant. It comes from Kubernetes, and it demonstrates that the resource is a Knative Service. That is a confirmation that Cloud Run is an additional layer on top of Kubernetes. Google is saving us from the hassle of managing our own infrastructure and maintaining our clusters. But that is no different from what every other managed serverless solution is offering. What makes Cloud Run unique is a reliance on an open standard and open source.
By the time you read this, Knative as a backbone of a managed Containers as a Service solution might not be that unique anymore. It was at the time of this writing (July 2020).
With Google Cloud Run, our dependency on Google is minimal. If we feel that someone else can offer us a better service or lower cost, we could move there with relative ease. As long as that someone allows us to use Knative, the transition should be painless. Even if no one comes up with a managed serverless solution based on Containers as a Service and Knative, we could always spin up our own Kubernetes cluster and install Knative. Given that quite a few companies are investing heavily in Knative, we can expect them to come up with their own managed Containers as a Service solution based on it.
Now that we are relatively confident that the application is running or, at least, that Google thinks it is, let’s check whether it is accessible. We’ll retrieve the address by describing the service, just as we did before. This time, we’ll output JSON, and use jq to retrieve the .status.url field that contains the URL.
export ADDR=$(gcloud run services \
    describe devops-toolkit-series \
    --region $REGION \
    --platform managed \
    --project $PROJECT_ID \
    --format json \
    | jq -r ".status.url")
Now that we retrieved the URL auto-generated by Google Cloud Run, we can open the application in our favorite browser.
open $ADDR
You should see the home screen of a simple Web application with the list of all the books and courses I published. Feel free to navigate through the app and purchase as many books and courses as you can. Once you’re done, expense them to your manager. It’s a win-win situation (except for your manager).
As a side note, I am using Google Cloud Run for that application, making it permanently available on devopstoolkitseries.com¹²¹. That might give you an early insight into my opinion on who offers the best serverless deployment service.
¹²¹https://devopstoolkitseries.com/
Besides using gcloud to observe Cloud Run instances, we can visit Google’s console if we need a UI.
open https://console.cloud.google.com/run?project=$PROJECT_ID
You will see the list of all Cloud Run instances. If this is the first time you’re using Google Cloud Run, you should have only one. Follow the devops-toolkit-series link and observe the dashboard. As before, I will not provide instructions on how to click links and navigate inside a dashboard. You can do that yourself.
We are about to see how our application behaves under moderate (not even heavy) load. We’ll use Siege¹²² for that. It can be described as a “poor man’s” load testing and benchmark utility. While there are better tools out there, they are usually more complicated. Since this will be a “quick and dirty” test of availability, siege should do. I will run siege as a Pod. Feel free to follow along if you have a Kubernetes cluster at your disposal. Otherwise, you can skip executing the commands that follow and observe the results I will present.
We will be sending a thousand concurrent requests for thirty seconds.
kubectl run siege \
    --image yokogawa/siege \
    --generator run-pod/v1 \
    -it --rm \
    -- --concurrent 1000 --time 30S "$ADDR"
The output, limited to the relevant parts, is as follows.
¹²²https://github.com/JoeDog/siege
...
Transactions:            9724 hits
Availability:            99.98 %
Elapsed time:            29.10 secs
Data transferred:        67.25 MB
Response time:           1.89 secs
Transaction rate:        334.16 trans/sec
Throughput:              2.31 MB/sec
Concurrency:             632.72
Successful transactions: 9730
Failed transactions:     2
Longest transaction:     23.69
Shortest transaction:    0.20
...
We can see that, in my case, almost ten thousand requests were sent, and the resulting availability was 99.98 %. We’ll save the commentary on that information for later, when we compare Google Cloud Run with similar solutions available in other providers. I will only say that it is a bit disappointing that we got only one nine as a decimal in availability. I would have expected a result that is closer to a hundred percent.
While the availability of 99.98 % is not that good, it is also not a valid number. Our sample size was too small. We would need to have a much larger number of requests to evaluate the availability. Within such a small sample, anything that reached 99.9 % is a good enough number.
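To see why the sample is too small, we can recompute the percentage from the counters in the output. The sketch below assumes that availability is the number of successful transactions divided by the sum of successful and failed ones; that formula is an assumption that happens to reproduce the reported 99.98 %, and the availability function is a hypothetical helper.

```shell
# Availability as a percentage: successful / (successful + failed) * 100.
# The formula is an assumption that matches the figure reported by siege above.
availability() {
  awk -v ok="$1" -v failed="$2" \
    'BEGIN { printf "%.2f\n", 100 * ok / (ok + failed) }'
}

availability 9730 2  # the counters from the run above
availability 9731 1  # the same run with one failure fewer
```

The first call prints 99.98, and the second prints 99.99. With fewer than ten thousand requests, a single failed request moves the result by roughly 0.01 percentage points, so the second decimal says very little.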
Let’s run the siege again and check whether the results are consistent.
kubectl run siege \
    --image yokogawa/siege \
    --generator run-pod/v1 \
    -it --rm \
    -- --concurrent 1000 --time 30S "$ADDR"
The output is as follows.
...
Transactions:            9823 hits
Availability:            99.91 %
Elapsed time:            29.41 secs
Data transferred:        67.92 MB
Response time:           1.74 secs
Transaction rate:        334.00 trans/sec
Throughput:              2.31 MB/sec
Concurrency:             582.35
Successful transactions: 9827
Failed transactions:     9
Longest transaction:     20.69
Shortest transaction:    0.19
...
We can see that the results are similar. In my case, the number of transactions is almost the same, and the availability decreased slightly.
There’s not much more to know about Google Cloud Run, except to explore the additional arguments we can use when executing gcloud run deploy. I’ll leave you to explore them alone.
Actually, there is one crucial thing you should know. We can define Cloud Run as YAML. To be more precise, Cloud Run uses Knative, which extends Kubernetes through Custom Resource Definitions (CRDs). Instead of using gcloud commands to create or update Cloud Run apps, we can define them as Knative YAML. From there on, we could execute a command similar to gcloud beta run services replace service.yaml to tell Cloud Run to use whatever is defined in a YAML file (e.g., service.yaml). That is, actually, a preferable method since we could store the definitions in a Git repository, and hook them into whichever CI/CD solution we are using. However, I will not go into those details just yet. I will reserve that for a separate exploration of Knative, which can run in any Kubernetes cluster, not only as Google Cloud Run.
You’ll notice that I used beta in the aforementioned command. The situation might have changed by the time you read this, and the option to deploy Knative services defined in a YAML file might have reached general availability (GA).
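To give an idea of what such a file might contain, the sketch below writes a minimal service.yaml. The fields mirror the output of gcloud run services describe we saw earlier; it is an assumption that this subset is sufficient, and my-project is a placeholder used when PROJECT_ID is not set.

```shell
# Write a minimal Knative Service definition to service.yaml.
# The values mirror the output of `gcloud run services describe` shown earlier.
# my-project is a placeholder used only when PROJECT_ID is not set.
cat > service.yaml <<EOF
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: devops-toolkit-series
spec:
  template:
    spec:
      containerConcurrency: 100
      containers:
      - image: gcr.io/${PROJECT_ID:-my-project}/devops-toolkit-series:0.0.1
        ports:
        - containerPort: 80
EOF

# The file could then be handed to Cloud Run with the command mentioned above:
# gcloud beta run services replace service.yaml
```

The gcloud command is left commented out so that the snippet only creates the file. The region and the project would likely need to be specified as in the other commands.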
That’s it. We’re done with the quick exploration of Google Cloud Run, as one of many possible solutions for using managed Containers as a Service (CaaS) flavor of serverless deployments. We’ll remove the application we deployed.
gcloud run services \
    delete devops-toolkit-series \
    --region $REGION \
    --platform managed \
    --project $PROJECT_ID
All that is left is to remove whichever Google Cloud resources we created. If you created them using the gcr.sh¹²³ Gist, the instructions on how to destroy everything are at the bottom. Next, we’ll explore AWS ECS with Fargate as an alternative managed Containers as a Service (CaaS) implementation.
Deploying Applications To Amazon Elastic Container Service (ECS) With Fargate
Unless you skipped the previous section, you already saw how to use Google Cloud Run as a Containers as a Service (CaaS) solution. Let’s see whether we can do something similar in AWS. For that, the best candidate is probably ECS combined with Fargate.
Amazon Elastic Container Service (ECS)¹²⁴ is a scheduler and orchestrator of containers in a cluster. Unlike Kubernetes, ECS is proprietary and closed source. We do not know what is running the containers. Depending on the importance you give to understanding the underlying services, that might or might not matter. Just like other container orchestrators, it has a control plane and agents that take instructions from it. In turn, those agents are in charge of running containers, and quite a few other auxiliary tasks.
ECS aims at making management of clusters and containers easy. The main ECS concepts are Tasks and Services.
A task is one or more containers that are to be scheduled together by ECS. It is, in a way, equivalent to a Kubernetes Pod.
An ECS service is in charge of fault tolerance, high-availability, and scaling. It is similar to AWS Auto Scaling groups but focused on ECS tasks. It defines the number of tasks that should run across the cluster. It also decides where they should be running (e.g., in which availability zones), and it associates them with load balancers. On top of all that, it can scale tasks automatically based on metrics (e.g., CPU and memory utilization). We could say that it performs similar operations to the Kubernetes Scheduler, with Pods, Services, and HorizontalPodAutoscalers.
Given that AWS has Elastic Kubernetes Service (EKS), you might be wondering why I am providing examples in ECS. The reason is simple. We are exploring Containers as a Service solutions which abstract underlying technology (e.g., ECS, EKS, etc.). At the same time, AWS is still pushing for ECS and considers it a preferable way to run containers, so I am just going with the flow and presenting what AWS considers the best option.
¹²³https://gist.github.com/2aa8ee4a6451fd762b1a10799bbeac88
¹²⁴https://aws.amazon.com/ecs/
ECS, however, might be too complex to manage alone, so AWS introduced Fargate as a layer on top of it. Later on, it enabled it to work with EKS as well, but that’s not part of this story.
AWS Fargate¹²⁵ is a layer on top of ECS or EKS that simplifies cluster management and the deployment and management of the applications. With Fargate, we can specify a container image we want to deploy, with a few additional arguments like the amount of CPU and memory we expect it to use. Fargate tries to take care of everything else, like updating and securing the underlying servers, scaling the infrastructure, and so on. Fargate is very similar in its approach to EC2 combined with autoscaling groups and a few other “standard” AWS services, except that it does not use VMs as inputs, but rather container images.
We’ll discuss the pros and cons of using ECS and Fargate later. Right now, we need a bit of hands-on experience with it.
All the commands from this section are available in the 04-02-ecs-fargate.sh¹²⁶ Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.
Before we begin, you’ll need to ensure that you have a few pre-requisites set up. If you are a Windows user, I will assume that you are running the commands from a Bourne Again Shell (Bash) or a Z Shell (Zsh) and not PowerShell. That should not be a problem if you followed the instructions on setting up Windows Subsystem for Linux (WSL) explained in the Setting Up A Local Development Environment chapter. If you do not like WSL, a Bash emulator like GitBash should do. If none of those is an acceptable option, you might need to modify some of the commands in the examples that follow.
You will need an AWS¹²⁷ account with sufficient permissions. Preferably, you should be the owner of the account or have admin privileges. You will also need to install AWS CLI (aws)¹²⁸. On top of that, you will need an AWS access key ID and a secret access key. If you are not sure how to get them, please return to the Creating And Managing AWS Elastic Kubernetes Service (EKS) Clusters With Terraform section for instructions.
We will need to create an ECS cluster and quite a few other resources like subnets, internet gateway, route, elastic IP, NAT gateway, route table, security groups, application load balancer, and a few others. We will not go into details of all those. Instead, I prepared a Gist ecs-fargate.sh¹²⁹ that you can use to set up the resources we need. Everything is defined as Terraform configs. I will assume that, by now, you feel comfortable working with Terraform and that you should be able to explore the files from that Gist on your own. Even if you decide to set up everything yourself, you can use those Terraform definitions to figure out what is needed.
¹²⁵https://aws.amazon.com/fargate/
¹²⁶https://gist.github.com/2ef4e1933d7c46fb1ddc41a633e1e7c7
¹²⁷https://aws.amazon.com/
¹²⁸https://aws.amazon.com/cli/
¹²⁹https://gist.github.com/fa047ab7bb34fdd185a678190798ef47
If you got scared at the word “Terraform”, you likely missed the section Infrastructure as Code (IaC), or you are very forgetful. If that’s the case, go through the exercises there, and come back when you’re ready.
What matters is that you will need to export environment variables LB_ARN, SECURITY_GROUP_ID, SUBNET_IDS, CLUSTER_ID, and DNS. The names should be self-explanatory if you are familiar with AWS. If you are using my Gist, the creation of the variables is already included. I’m mentioning it just in case you choose to go rogue and ignore the Gist.
Finally, you will also need Docker¹³⁰ running on your laptop.
All in all, you will need the following.
• AWS¹³¹ account
• AWS CLI (aws)¹³²
• Quite a few AWS resources and the environment variables LB_ARN, SECURITY_GROUP_ID, SUBNET_IDS, CLUSTER_ID, and DNS. Follow the instructions from ecs-fargate.sh¹³³.
• Docker¹³⁴
If you are confused by the sheer number of resources we need to create to deploy Containers as a Service in AWS, you are not alone. If you went through the exercise with Google Cloud Run, you probably already noticed that AWS is much more complicated. That is a common theme in AWS. It does many things very well, but simplicity is not one of them. AWS ECS is, by far, the most complicated managed CaaS solution we will explore.
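Since the commands that follow depend on all five variables, it might be worth confirming that none of them is missing before continuing. What follows is a minimal sketch; missing_vars is a hypothetical helper and is not part of the Gist.

```shell
# Print the name of every variable from the argument list that is unset or empty.
# missing_vars is a hypothetical helper used only for illustration.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

missing_vars LB_ARN SECURITY_GROUP_ID SUBNET_IDS CLUSTER_ID DNS
```

If the command prints nothing, all the variables are set. Otherwise, it lists those that still need to be exported.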
Even though AWS Elastic Container Registry (ECR) is the container image registry most commonly used in AWS, we will not use it. Instead, we’ll use Docker Hub¹³⁵ for simplicity reasons. If that’s not your registry of choice, you should be able to modify the examples to use whichever you prefer. But do that later. It will be easier if you follow the examples as they are, and that means that you will be working with Docker Hub. I already built and pushed an image, so there is nothing for you to do, except to tell Fargate to deploy it.
Let’s proceed to the most important part of the story. We are about to deploy our application, packaged as a container image, into ECS using Fargate. In the case of most of the other Containers as a Service solutions, we could deploy the application with a simple command. But, with ECS, things are a bit more complicated. We need to create a few resources. To be more precise, we need an ECS task definition, an ECS service, a role, and a container definition.
¹³⁰https://docs.docker.com/get-docker/
¹³¹https://aws.amazon.com/
¹³²https://aws.amazon.com/cli/
¹³³https://gist.github.com/fa047ab7bb34fdd185a678190798ef47
¹³⁴https://docs.docker.com/get-docker/
¹³⁵https://hub.docker.com/
To make things relatively simple, I already prepared Terraform definitions with all those, except the container definition, which is defined as a JSON file. We can say that I kept all infrastructure-related definitions in Terraform and the application definition in a JSON file. But that might not be entirely true since the lines are a bit blurred. For example, the ECS service is just as much a definition of the application as a reference to other infrastructure resources. ECS is strange. Nevertheless, that’s what I prepared, so that’s what we’ll use. The definitions we’ll need are in the GitHub repository vfarcic/devops-catalog-code¹³⁶, so let’s clone it. You almost certainly already have the repository cloned since it is the same one used in other sections. If that’s the case, feel free to skip the command that follows. It’s there just in case you chose to jump straight into this section, and you skipped following the instructions from the ecs-fargate.sh¹³⁷ Gist.
git clone \
    https://github.com/vfarcic/devops-catalog-code.git
Let’s get into the local copy of the repository and pull the latest version, just in case you already had the repo from before, and I made some changes since the last time you used it.
cd devops-catalog-code

git pull
The files we are going to use are located in the terraform-ecs-fargate/app directory, so let’s go there and see what we have.
cd terraform-ecs-fargate/app

ls -1
The output is as follows.
devops-toolkit-series.json
main.tf
variables.tf
We can see that we have two Terraform files and a JSON. Let’s start with the variables.
¹³⁶https://github.com/vfarcic/devops-catalog-code
¹³⁷https://gist.github.com/fa047ab7bb34fdd185a678190798ef47
cat variables.tf
The output is as follows.
variable "desired_count" {
  type    = number
  default = 1
}

variable "memory" {
  type    = string
  default = "512"
}

variable "cpu" {
  type    = string
  default = "256"
}

variable "port" {
  type    = number
  default = 80
}

variable "lb_arn" {
  type = string
}

variable "security_group_id" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

variable "cluster_id" {
  type = string
}
All those variables should be self-explanatory, at least for those with some familiarity with AWS and containers in general. The desired_count specifies how many instances of the task we want to run. The memory and cpu entries define the resources we want to allocate, while the port maps to what the process inside a container will listen to. The rest of the variables (lb_arn, security_group_id, subnet_ids, and cluster_id) provide references to the dependent resources we created at the very start. Let’s take a look at the definitions of the resources.
cat main.tf
The output is as follows.
resource "aws_ecs_task_definition" "dts" {
  family                   = "devops-toolkit-series"
  requires_compatibilities = ["FARGATE"]
  container_definitions    = file("devops-toolkit-series.json")
  network_mode             = "awsvpc"
  memory                   = var.memory
  cpu                      = var.cpu
  execution_role_arn       = data.aws_iam_role.ecs_task_execution_role.arn
}

resource "aws_ecs_service" "dts" {
  name            = "devops-toolkit-series"
  launch_type     = "FARGATE"
  task_definition = aws_ecs_task_definition.dts.arn
  cluster         = var.cluster_id
  desired_count   = var.desired_count
  network_configuration {
    subnets         = var.subnet_ids
    security_groups = [var.security_group_id]
  }
  load_balancer {
    target_group_arn = var.lb_arn
    container_name   = "devops-toolkit-series"
    container_port   = var.port
  }
}

data "aws_iam_role" "ecs_task_execution_role" {
  name = "ecsTaskExecutionRole"
}
The aws_ecs_task_definition entry defines the ECS task, which contains properties like, for example, cpu and memory allocations. The most important part of the task is the container_definitions value, which references JSON representing information about the containers we want to run. We’ll explore it soon.
The second resource is aws_ecs_service, which provides the association between the cluster and the task, the number of replicas (desired_count), the network, and the load balancer.
The most important definition is in the devops-toolkit-series.json file referenced through container_definitions in aws_ecs_task_definition. It is set to the contents of that file, which the file() function reads from the given path. Let’s take a look at it.
cat devops-toolkit-series.json
The output is as follows.
[
  {
    "name": "devops-toolkit-series",
    "image": "vfarcic/devops-toolkit-series",
    "portMappings": [
      {
        "containerPort": 80,
        "hostPort": 80
      }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group" : "/ecs/devops-toolkit-series",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }
]
There are a couple of things worth noting. First, you can see that the whole definition is an array ([ and ]). It means that we can define one or more containers. In this case, it is only one, but that should not prevent you from having more. Additional containers often act in a similar way as side-cars in Kubernetes and tend to be focused on initialization of the “main” container. The fields in the definition of a container should be self-explanatory. We have a name and a reference to the container image we want to deploy. The portMappings entry defines the relation between the port of the process inside a container (containerPort) and how it will be exposed (hostPort). The last entry contains a set of fields that define how we want to ship logs.
That definition uses vfarcic/devops-toolkit-series as the image without a specific tag. When in the “real world”, you should always be specific instead of using whatever is the latest. You will also notice that we are not using AWS Elastic Container Registry (ECR), which is the likely candidate where you would store images if you chose AWS. Instead, we are using an image I created and stored in Docker Hub. Both deviations are for no other reason but for the simplicity of the exercises.
Since everything we need to deploy is defined (directly or indirectly) as Terraform, we just need to apply the definitions.
terraform init

terraform apply \
    --var lb_arn=$LB_ARN \
    --var security_group_id=$SECURITY_GROUP_ID \
    --var subnet_ids="$SUBNET_IDS" \
    --var cluster_id=$CLUSTER_ID
You will be presented with the output that follows (limited to the relevant parts).
...
  + resource "aws_ecs_service" "dts" {
...
  + resource "aws_ecs_task_definition" "dts" {
...

Plan: 2 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
As expected, the plan states that 2 resources will be added (a service and a task definition). Confirm that you want to proceed by typing yes and pressing the enter key. We used variables for LB ARN, security group, subnet IDs, and cluster ID since I could not know what those values will be in your case. In the “real world” situation, you should add those as default values of Terraform variables.
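If you prefer not to change the definitions, one alternative to repeating the --var arguments (a different technique from the default values suggested above) is to rely on the TF_VAR_ prefix, which Terraform uses to read input variables from the environment. The sketch below assumes that the variables from the Gist are already exported; export_tf_vars is a hypothetical helper.

```shell
# Expose the values as TF_VAR_<name> environment variables, which Terraform
# reads as input variables, so that `terraform apply` needs no --var arguments.
# export_tf_vars is a hypothetical helper; the source variables come from the Gist.
export_tf_vars() {
  export TF_VAR_lb_arn="${LB_ARN:-}"
  export TF_VAR_security_group_id="${SECURITY_GROUP_ID:-}"
  export TF_VAR_subnet_ids="${SUBNET_IDS:-}"
  export TF_VAR_cluster_id="${CLUSTER_ID:-}"
}

export_tf_vars

# With the variables in place, the command would shrink to:
# terraform apply
```

The terraform command is left commented out so that the snippet does not change any infrastructure on its own.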
After only a few seconds, our application has been deployed. It might not be available right away since the status of the containers is not monitored by Terraform. It should not take long, though.
I would like to say that it was simple, but it wasn’t. To be more precise, it wasn’t when compared to other providers (e.g., Google Cloud, Azure). But, from the AWS perspective, it was as simple (or as hard) as anything else. In the beginning, we had to set up a cluster by creating a bunch of resources, and now we had to create a service, a task definition, and a container definition. I’ll let you judge how simple and straightforward that is, at least for now. Later on, we’ll compare CaaS in AWS with similar solutions in Google Cloud and Azure and see, among other things, which one is easiest to work with.
To be on the safe side and, at the same time, to learn a few commands, we’ll list all the ECS services and confirm that devops-toolkit-series was indeed created.
aws ecs list-services \
    --cluster $CLUSTER_ID
The output is as follows.
{
    "serviceArns": [
        "arn:aws:ecs:us-east-1:036548781187:service/devops-toolkit-series"
    ]
}
Similarly, we can also list the ECS tasks in that cluster.
aws ecs list-tasks \
    --cluster $CLUSTER_ID
The output is as follows.
{
    "taskArns": [
        "arn:aws:ecs:us-east-1:036548781187:task/72921efe-8ce2-4977-a720-61fcfce3baea"
    ]
}
Unlike Google Cloud Run, the dependency on AWS is vast. If we feel that someone else can offer us a better service or lower cost, we could not move there easily. We could indeed use the same container images in any other CaaS solution, but all the definitions around them would need to be rewritten. That is not necessarily the end of the world, but rather something you might want to consider when choosing what to use. Still, within the AWS ecosystem, ECS with Fargate is one of the places where you can deploy your applications with less lock-in than with other AWS services. All that’s left for us to be fully confident that everything works as expected is to open the application in a browser and confirm that it indeed works as expected.
If you are a Linux or a Windows WSL user, I will assume that you created the alias open and set it to the xdg-open command. If that’s not the case, you will find instructions on how to do that in the Setting Up A Local Development Environment chapter. If you do use Windows, but with a bash emulator (e.g., GitBash) instead of WSL, the open command might not work. You should replace open with echo and copy and paste the output into your favorite browser.
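For those who would rather not maintain an alias, the same idea can be sketched as a tiny function. This is only an illustration under my own assumptions: pick_opener is a hypothetical name, and the function merely reports which command is available, falling back to echo so that the address can be copied by hand.

```shell
# Prints the name of the first command found that can open a URL.
# xdg-open is checked first because, on some Linux systems, "open"
# exists but does something unrelated to browsers.
pick_opener() {
    for candidate in xdg-open open; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    # Nothing suitable was found, so fall back to printing the address.
    echo echo
}

# Hypothetical usage:
# "$(pick_opener)" "http://$DNS"
```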
open http://$DNS
You should see the home screen of a simple Web application with the list of all the books and courses I published. Feel free to navigate through the app and purchase as many books and courses as you can. Once you’re done, expense them to your manager. It’s a win-win situation (except for your manager).
Besides using the aws CLI to observe ECS resources, we can visit the AWS console if we need a UI.
open https://console.aws.amazon.com/ecs/home?region=us-east-1
As before, I will not provide instructions on how to click links and navigate inside a dashboard. You can do that yourself.
We are about to see how our application behaves under moderate (not even heavy) load. We’ll use Siege¹³⁸ for that. It can be described as a “poor man’s” load testing and benchmark utility. While there are better tools out there, they are usually more complicated. Since this will be a “quick and dirty” test of availability, siege should do. I will run siege as a Pod. Feel free to follow along if you have a Kubernetes cluster at your disposal. Otherwise, you can skip executing the commands that follow and observe the results I will present.
We will be sending a thousand concurrent requests for thirty seconds.
¹³⁸https://github.com/JoeDog/siege
kubectl run siege \
    --image yokogawa/siege \
    --generator run-pod/v1 \
    -it --rm \
    -- --concurrent 1000 --time 30S "http://$DNS"

The output, limited to the relevant parts, is as follows.

...
Transactions: 7768 hits
Availability: 100.00 %
Elapsed time: 31.89 secs
Data transferred: 53.69 MB
Response time: 1.96 secs
Transaction rate: 243.59 trans/sec
Throughput: 1.68 MB/sec
Concurrency: 477.37
Successful transactions: 7768
Failed transactions: 0
Longest transaction: 28.38
Shortest transaction: 0.20
...
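A report like the one above is easy to read by eye, but it can also be checked in a script. The following is a rough sketch, assuming the report was captured into a variable. The siege_value helper is hypothetical, and the sample text is a shortened copy of the output above.

```shell
# A shortened copy of the report above. In practice, this would be
# the captured output of the siege command.
report='Transactions: 7768 hits
Availability: 100.00 %
Failed transactions: 0'

# Prints the first word that follows "LABEL:" in the report.
siege_value() {
    printf '%s\n' "$report" | awk -F':' -v label="$1" '
        $1 == label { split($2, parts, " "); print parts[1] }'
}

availability=$(siege_value "Availability")
failed=$(siege_value "Failed transactions")

echo "availability=$availability failed=$failed"

if [ "$failed" -eq 0 ]; then
    echo "no failed transactions"
fi
```

Splitting on the colon and then on whitespace keeps the helper indifferent to how many spaces or tabs siege places between the label and the value.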
We can see that, in my case, around seven thousand requests were sent, and the resulting availability was 100 %. We’ll comment on that information later when we compare AWS ECS and Fargate with similar solutions available in other providers. For now, I will only say that having 100% availability is encouraging.

Let’s run the siege again and check whether the results are consistent.

kubectl run siege \
    --image yokogawa/siege \
    --generator run-pod/v1 \
    -it --rm \
    -- --concurrent 1000 --time 30S "http://$DNS"
The output is as follows.
...
Transactions: 7102 hits
Availability: 100.00 %
Elapsed time: 32.20 secs
Data transferred: 49.09 MB
Response time: 2.34 secs
Transaction rate: 220.56 trans/sec
Throughput: 1.52 MB/sec
Concurrency: 515.05
Successful transactions: 7103
Failed transactions: 0
Longest transaction: 20.37
Shortest transaction: 0.21
...
We can see that the results are similar. In my case, the number of transactions is a bit smaller, and the availability is still 100%.

There’s not much more to know about ECS and Fargate, except to explore the additional fields we can use in Terraform and in container definition JSON. I’ll leave you to explore them alone.

One important thing that we skipped is service discovery and domain resolution. We should probably use Route53 for that. We skipped it since it is not directly related to ECS. More importantly, I could not assume that you have a free domain ready to be used for the exercises.

Another important thing we skipped is auto-scaling of the applications deployed to ECS with Fargate. It’s a subject in itself that you should explore alone. For now, I’d like to stress that auto-scaling in ECS is limiting, unintuitive, and far from easy, at least when compared with similar services provided by other vendors.

That’s it. We’re done with the quick exploration of AWS ECS with Fargate as one of many possible solutions for using the managed Containers as a Service (CaaS) flavor of serverless deployments.

We’ll remove the application and the related resources. Since it was deployed through Terraform, a simple destroy command should do.
terraform destroy \
    --var lb_arn=$LB_ARN \
    --var security_group_id=$SECURITY_GROUP_ID \
    --var subnet_ids="$SUBNET_IDS" \
    --var cluster_id=$CLUSTER_ID
Let’s get out of the directory with the definitions of the resources we just created.
cd ../../../
All that is left is to remove whichever AWS resources (e.g., the cluster) we created initially. If you created them using the ecs-fargate.sh¹³⁹ Gist, the instructions on how to destroy everything are at the bottom. Next, we’ll explore Azure Container Instances as an alternative managed Containers as a Service (CaaS) implementation.
Deploying Applications To Azure Container Instances

Unless you skipped the previous sections, you already saw how to use Google Cloud Run and AWS ECS with Fargate as managed Containers as a Service (CaaS) solutions. Let’s see whether we can do something similar in Azure. For that, the best candidate is probably the Container Instances service.

Azure Container Instances is a solution for scenarios where we would like to run isolated containers without orchestration. The last two words are the key to understanding the service. It is without orchestration. That piece of information will be critical for understanding the upsides and downsides of the service.

Since it uses container images, we can write code in any language we want. We can use any application server and any dependencies. Long story short, if an application and everything it needs can be packaged into a container image, it can be deployed to Azure Container Instances. The service abstracts infrastructure management, allowing us to focus on our applications and the business value we are trying to create.

We’ll discuss the pros and cons of using Azure Container Instances later. Right now, we need a bit of hands-on experience with it.

All the commands from this section are available in the 04-02-aci.sh¹⁴⁰ Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.
Before we begin, you’ll need to ensure that you have a few prerequisites set up.

If you are a Windows user, I will assume that you are running the commands from a Bourne Again Shell (Bash) or a Z Shell (Zsh) and not PowerShell. That should not be a problem if you followed the instructions on setting up Windows Subsystem for Linux (WSL) explained in the Setting Up A Local Development Environment chapter. If you do not like WSL, a Bash emulator like GitBash should do. If none of those is an acceptable option, you might need to modify some of the commands in the examples that follow.

¹³⁹https://gist.github.com/fa047ab7bb34fdd185a678190798ef47
¹⁴⁰https://gist.github.com/6d6041896ef1243233c11b51d082eb6e
You will need an Azure¹⁴¹ account with sufficient permissions. Preferably, you should be the owner of the account or have admin permissions. You will also need to install Azure CLI (az)¹⁴².

You will need to create a resource group and an Azure Container Registry (ACR). I prepared a Gist aci.sh¹⁴³ that you can use to set up the resources we need. Everything is defined as Terraform configs. I will assume that, by now, you feel comfortable working with Terraform and that you should be able to explore the files from that Gist on your own. Even if you decide to set up everything yourself, you can use those Terraform definitions to figure out what is needed.

If you got scared at the word “Terraform”, you likely missed the section Infrastructure as Code (IaC), or you are very forgetful. If that’s the case, go through the exercises there, and come back when you’re ready.
What matters is that you will need to export environment variables REGISTRY_NAME, RESOURCE_GROUP, and REGION. The names should be self-explanatory. If you are using my Gist, the creation of the variables is already included. I’m mentioning it just in case you choose to go rogue and ignore the Gist.

Finally, you will also need Docker¹⁴⁴ running on your laptop, and jq¹⁴⁵. You’ll see later why we need those. We might even choose to ditch the “standard” Azure way of deploying applications in favor of Docker.

All in all, you will need the following.

• Azure¹⁴⁶ account
• Azure CLI (az)¹⁴⁷
• A few Azure resources and the environment variables REGISTRY_NAME, RESOURCE_GROUP, and REGION. Follow the instructions from aci.sh¹⁴⁸.
• Docker¹⁴⁹
• jq¹⁵⁰

Azure Container Instances (ACI) can use container images stored in any registry. For simplicity, we’ll use the one provided by Azure. If that’s not your registry of choice, you should be able to modify the examples to use whichever you prefer. But do that later. It will be easier if you follow the examples as they are, and that means that you will be working with Azure Container Registry (ACR).

We’ll use Docker to push the image, and the first step is to provide it with authentication so that it can access the Azure Container Registry.

¹⁴¹https://azure.microsoft.com
¹⁴²https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
¹⁴³https://gist.github.com/34009f4c65683dd3a82081fa8d76cd85
¹⁴⁴https://docs.docker.com/get-docker/
¹⁴⁵https://stedolan.github.io/jq/download/
¹⁴⁶https://azure.microsoft.com
¹⁴⁷https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
¹⁴⁸https://gist.github.com/34009f4c65683dd3a82081fa8d76cd85
¹⁴⁹https://docs.docker.com/get-docker/
¹⁵⁰https://stedolan.github.io/jq/download/
az acr login --name $REGISTRY_NAME
We will not build container images. That is not the goal of this section. We are trying to deploy applications packaged as container images. So, I will assume that you know how to build container images, and we will take a shortcut by using an image that I already built. Let’s pull it.

docker image pull \
    vfarcic/devops-toolkit-series
Next, we’ll push that image into your registry. To do that, we’ll tag the image we just pulled by injecting the name of your registry as a subdomain of azurecr.io. That way, Docker will know that we want to push it to that specific registry.

export IMAGE=$REGISTRY_NAME.azurecr.io/devops-toolkit-series:0.0.1

docker image tag \
    vfarcic/devops-toolkit-series \
    $IMAGE
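The value of IMAGE follows the usual pattern of a registry host, a repository, and a tag. As a small illustration, the sketch below takes such a reference apart with plain shell parameter expansion. The registry name myregistry is made up, and the variable names are mine, chosen so that they do not overwrite the ones used in the exercises.

```shell
# A made-up reference that mimics the shape of $IMAGE.
sample_image=myregistry.azurecr.io/devops-toolkit-series:0.0.1

registry_host=${sample_image%%/*}   # everything before the first slash
remainder=${sample_image#*/}        # repository and tag
repository=${remainder%%:*}         # everything before the colon
tag=${remainder##*:}                # everything after the colon

echo "host:       $registry_host"
echo "repository: $repository"
echo "tag:        $tag"
```

Docker relies on the host part to decide where a push should go, which is why the registry name has to be a part of the tag.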
Now we can push the image to your Azure Container Registry.
docker image push $IMAGE
To be on the safe side, we will list all the images in the container registry.

az acr repository list \
    --name $REGISTRY_NAME \
    --output table

The output, limited to the relevant parts, is as follows.

Result
--------------------
devops-toolkit-series
That showed us that there is at least one devops-toolkit-series image. Let’s check whether the tag we pushed is there as well.
az acr repository show-tags \
    --name $REGISTRY_NAME \
    --repository devops-toolkit-series \
    --output table

The output is as follows.

Result
-------
0.0.1
We can see that the image with the tag 0.0.1 is indeed stored in the registry.

If you are scared of terminals and feel dizzy without the occasional presence of a broader color palette, you can visit ACR from the Azure console. Let’s see what the name of the registry is.
echo $REGISTRY_NAME
In my case, it is 001xymma. Remember what your output is. You’ll need it soon. Unfortunately, Azure URLs are too complicated for me to provide a direct link, so we’ll open the console’s home screen, and navigate to the registry manually. If you are a Linux or a Windows WSL user, I will assume that you created the alias open and set it to the xdg-open command. If that’s not the case, you will find instructions on how to do that in the Setting Up A Local Development Environment chapter. If you do use Windows, but with a bash emulator (e.g., GitBash) instead of WSL, the open command might not work. You should replace open with echo and copy and paste the output into your favorite browser.
open https://portal.azure.com
Please type container registries in the search box and select it. You should see the list of all the container registries you have. Click the one we are using. If you already forgot the name, go back to your terminal to see the output of echo $REGISTRY_NAME.

Azure treats base images as repositories. So, in our case, there should be a repository devops-toolkit-series. Select Repositories from the left-hand menu, and click the devops-toolkit-series repository. Confirm that the tag 0.0.1 is there. Feel free to open it, and you will see the Docker pull command as well as the full manifest.
Now that we have the container image we want to deploy stored in a registry from which Azure can pull it, we can proceed to the most important part of the story. We are about to deploy our application, packaged as a container image, using Azure Container Instances. To do that, we need to make a couple of decisions.

What is the image we want to run? That one is easy since we just pushed the image we’re interested in. We’ll also need to tell the service which resource group and which region it should use. We already have that information in environment variables initialized when we created the resources with Terraform.

We also need to choose a subdomain. In “real world” situations, you would associate your applications with your domains or subdomains. But, for simplicity, we’ll use a subdomain of azurecontainer.io. To avoid potential conflicts created by others coming up with the same subdomain, we’ll make it, more or less, unique by using a timestamp.
export SUBDOMAIN=devopstoolkitseries$(date +%Y%m%d%H%M%S)
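To get a feeling for what that command produces, the sketch below builds a label in the same way and runs two generic sanity checks on it. The checks reflect the general DNS rule that a single label is limited to 63 characters, not any Azure-specific validation, and the variable names are hypothetical so that SUBDOMAIN itself stays untouched.

```shell
# Built the same way as SUBDOMAIN, but stored under a different name.
label=devopstoolkitseries$(date +%Y%m%d%H%M%S)

# The fixed prefix has 19 characters and the timestamp adds 14 digits.
length=${#label}
echo "$label is $length characters long"

# Generic DNS constraints: at most 63 characters per label, and only
# lowercase letters, digits, and hyphens.
verdict=ok
if [ "$length" -gt 63 ]; then
    verdict="too long"
fi
case $label in
    *[!a-z0-9-]*) verdict="unexpected characters" ;;
esac
echo "verdict: $verdict"
```

A timestamp with a resolution of one second is unique enough for an exercise, but two people running the command within the same second would still end up with the same label.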
Finally, we will also need to retrieve the Azure Container Registry credentials we created early on.

az acr credential show \
    --name $REGISTRY_NAME

The output is as follows.

{
    "passwords": [
        {
            "name": "password",
            "value": "..."
        },
        {
            "name": "password2",
            "value": "..."
        }
    ],
    "username": "sbzz2mll"
}
We need the username and one of the passwords from that output. We’ll extract those using jq and store them in environment variables.
export ACR_USER=$(
    az acr credential show \
    --name $REGISTRY_NAME \
    | jq -r '.username')

export ACR_PASS=$(
    az acr credential show \
    --name $REGISTRY_NAME \
    | jq -r '.passwords[0].value')
Now we have everything we need to deploy the container image into Azure Container Instances.

az container create \
    --resource-group $RESOURCE_GROUP \
    --name devops-toolkit-series \
    --location $REGION \
    --image $IMAGE \
    --dns-name-label $SUBDOMAIN \
    --ports 80 \
    --registry-username $ACR_USER \
    --registry-password $ACR_PASS
The output is a very long JSON with much more information than what we need for this exercise. I’ll leave you to explore it yourself. For now, we’ll focus on checking whether the container is indeed running.

We can, for example, output the status of the container we just deployed.

az container show \
    --resource-group $RESOURCE_GROUP \
    --name devops-toolkit-series \
    --output table
The most important part of the output is probably the Status column. It shows that the container is Running.

Now that we are relatively confident that the application is running or, at least, that Azure thinks it is, let’s check whether it is accessible.

Since we are not using a custom domain, the application is accessible through a subdomain of azurecontainer.io. The full address is predictable. It is a subdomain constructed using the DNS label we have in the SUBDOMAIN variable and the region. For simplicity, we’ll store the full address in the environment variable ADDR.
export ADDR=http://$SUBDOMAIN.$REGION.azurecontainer.io
Now that we know the URL, we can open the application in our favorite browser.
open $ADDR
You should see the home screen of a simple Web application with the list of all the books and courses I published. Feel free to navigate through the app and purchase as many books and courses as you can. Once you’re done, expense them to your manager. It’s a win-win situation (except for your manager).
We are about to see how our application behaves under moderate (not even heavy) load. We’ll use Siege¹⁵¹ for that. It can be described as a “poor man’s” load testing and benchmark utility. While there are better tools out there, they are usually more complicated. Since this will be a “quick and dirty” test of availability, siege should do. I will run siege as a Pod. Feel free to follow along if you have a Kubernetes cluster at your disposal. Otherwise, you can skip executing the commands that follow and observe the results I will present.
We will be sending a thousand concurrent requests for thirty seconds.

kubectl run siege \
    --image yokogawa/siege \
    --generator run-pod/v1 \
    -it --rm \
    -- --concurrent 1000 --time 30S "$ADDR"
The output, limited to the relevant parts, is as follows.
¹⁵¹https://github.com/JoeDog/siege
...
Transactions: 7779 hits
Availability: 100.00 %
Elapsed time: 30.04 secs
Data transferred: 53.76 MB
Response time: 2.00 secs
Transaction rate: 258.95 trans/sec
Throughput: 1.79 MB/sec
Concurrency: 518.44
Successful transactions: 7779
Failed transactions: 0
Longest transaction: 28.60
Shortest transaction: 0.18
...
We can see that, in my case, over seven thousand requests were sent, and the resulting availability was 100 %. We’ll save the comments on that information for later, when we compare Azure Container Instances with similar solutions available in other providers. For now, I will only say that having 100% availability is encouraging.

Let’s run the siege again and check whether the results are consistent.

kubectl run siege \
    --image yokogawa/siege \
    --generator run-pod/v1 \
    -it --rm \
    -- --concurrent 1000 --time 30S "$ADDR"
The output is as follows.

...
Transactions: 8407 hits
Availability: 100.00 %
Elapsed time: 29.81 secs
Data transferred: 58.10 MB
Response time: 2.07 secs
Transaction rate: 282.02 trans/sec
Throughput: 1.95 MB/sec
Concurrency: 583.96
Successful transactions: 8407
Failed transactions: 0
Longest transaction: 27.79
Shortest transaction: 0.18
...
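Whether two runs are “consistent” is easier to judge with a number. The snippet below is a simple sketch that computes the relative change between the transaction counts of the two reports above. The counts are copied from my output, and yours will differ.

```shell
# Transaction counts copied from the two reports above.
first=7779
second=8407

# Relative change between the runs, in percent, with two decimals.
change=$(awk -v a="$first" -v b="$second" \
    'BEGIN { printf "%.2f", (b - a) / a * 100 }')

echo "the second run handled $change% more transactions"
```

The shell handles only integers, so the division is delegated to awk. A difference of a few percent between two short runs is the kind of variation we should expect from a test like this one.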
We can see that, in my case, this time, it handled over eight thousand requests. The availability is still 100 %.

There’s not much more to know about Azure Container Instances, except to explore the additional arguments we can use when executing az container create. I’ll leave you to explore them alone.

Actually, there is one crucial thing you should know. There is no orchestrator behind Azure Container Instances. That means that there is no horizontal scaling. We can run only one replica of an application. As a result, it cannot be highly-available, and it does not scale automatically. There are quite a few other things we could not do. A simple description of the service is that it allows us to run containers in isolation. If that is not good enough, we would need to switch to Azure Kubernetes Service (AKS) or change the provider.

Azure Container Instances is a way for us to do, more or less, the same as what we would do using Docker on a local machine. Whether that is useful or not is up to you to decide. Later on, we’ll compare it with other Containers as a Service (CaaS) solutions and see in more detail what is what.

We’ll remove the application we deployed.

az container delete \
    --resource-group $RESOURCE_GROUP \
    --name devops-toolkit-series
You will be asked whether you are sure you want to perform this operation. Type y and press the enter key.

You might think that we are finished exploring Azure Container Instances, but we are not. There might be a better way to use it than the execution of az container commands. The bad news is that the exploration of the alternative, and potentially better, way to deploy applications to Azure Container Instances is not part of this book. The good news is that it is available as a YouTube video titled Using Docker To Deploy Applications To Azure Container Instances¹⁵². Please take a look and let me know what you think.

All that is left is to remove whichever Azure resources we created. If you created them using the aci.sh¹⁵³ Gist, the instructions on how to destroy everything are at the bottom.

Now that we explored managed Containers as a Service in the three major Cloud computing providers, we should probably comment on what we learned and try to compare them. Stay tuned.
¹⁵²https://youtu.be/9n4I_IJYndc
¹⁵³https://gist.github.com/34009f4c65683dd3a82081fa8d76cd85

To CaaS Or NOT To CaaS?

Should we use managed Containers as a Service (CaaS)? That must be the most crucial question we should try to answer. Unfortunately, it is hard to provide a universal answer since the solutions differ significantly from one provider to another. Currently (July 2020), CaaS can be described as the wild west, with solutions ranging from amazing to useless.

Before we attempt to answer the big question, let’s go through some of the things we learned by exploring Google Cloud Run, AWS ECS with Fargate, and Azure Container Instances.

We can compare those three from different angles. One of those can be simplicity. After all, ease of use is one of the most essential benefits of serverless computing. It is supposed to allow engineers to provide code or binaries (in one form or another) with a reasonable expectation that the platform of choice will do most of the rest of the work.

From the simplicity perspective, both Google Cloud Run and Azure Container Instances are exceptional. They allow us to deploy our container images with almost no initial setup. Google needs only a project, while Azure requires only a resource group. On the other hand, AWS needs over twenty different bits and pieces (resources) to be assembled before we can even start thinking about deploying something to ECS. Even after all the infrastructure is set up, we need to create a task definition, a service, and a container definition. If simplicity is what you’re looking for, ECS is not it. It’s horrifyingly complicated, and it’s far from the “give us a container image, we’ll take care of the rest” approach we are all looking for when switching to serverless deployments.

Surprisingly, a company that provides such an amazing Functions as a Service solution (Lambda) did not do something similar with ECS. If AWS took the same approach with ECS as with Lambda, it would likely be the winner. But it didn’t, so I am going to give it a huge negative point. From the simplicity of setup and deployment perspective, Azure and Google are clear winners.

Now that we mentioned infrastructure in the context of the initial setup, we might want to take that as a criterion as well.
There is no infrastructure for us to manage when using CaaS in Google Cloud or Azure. They take care of all the details. AWS, on the other hand, forces us to create a full-blown cluster. That alone can disqualify AWS ECS with Fargate from being considered as a serverless solution. I’m not even sure whether we could qualify it as Containers as a Service. As a matter of fact, I would prefer using Elastic Kubernetes Service (EKS). It’s just as easy as ECS, if not easier, and, at least, it adheres to widely accepted standards and does not lock us into a suboptimal proprietary solution from which there is no escape.

How about scalability? Do our applications scale when deployed into managed Containers as a Service solutions? The answer to that question changes the rhythm of this story.

Google Cloud Run is scalable by design. It is based on Knative, which is a Kubernetes-based platform designed for serverless workloads. It scales without us even specifying anything. Unless we overwrite the default behavior, it will create a replica of our application for every hundred concurrent requests. If there are no requests, no replicas will run. If the number of concurrent requests jumps to three hundred, it will scale to three replicas. It will queue requests if none of the replicas can handle them, and scale up and down to accommodate fluctuations in traffic. All that will happen without us providing any specific information. It has sane defaults while still providing the ability to fine-tune the behavior to match our particular needs.
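The arithmetic behind that default can be written down in a few lines. The function below is a deliberately simplified model, not a description of how the autoscaler is implemented: it assumes the limit of a hundred concurrent requests per replica mentioned above and ignores queueing, time windows, and any upper bound on replicas. The name replicas_for is made up.

```shell
# replicas_for CONCURRENT [PER_REPLICA]
# Prints how many replicas are needed so that none of them handles
# more than PER_REPLICA concurrent requests (default: 100).
replicas_for() {
    concurrent=$1
    per_replica=${2:-100}
    # Integer ceiling division.
    echo $(( (concurrent + per_replica - 1) / per_replica ))
}

replicas_for 0      # prints 0, nothing runs when there is no traffic
replicas_for 300    # prints 3
replicas_for 250    # prints 3, the last replica is only half used
```

The second argument stands for the fine-tuning mentioned above. Calling replicas_for 300 50 prints 6, which is what we would get if each replica were limited to fifty concurrent requests.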
Applications deployed to ECS are scalable as well. But it is not easy. Scaling applications deployed to ECS is complicated and limiting. Even if we can overlook those issues, it does not scale to zero replicas. At least one replica of our application needs to run at all times since there is no built-in mechanism to queue requests and spin up new replicas. From that perspective, scaling applications in ECS is not what we would expect from serverless computing. It is similar to what we would get from HorizontalPodAutoscaler in Kubernetes. It can go up and down, but never to zero replicas. Given that there is a scaling mechanism of sorts, but that it cannot go down to zero replicas and that it is limiting in what it can actually do, I can only say that ECS only partially fulfills the scalability needs of our applications, at least in the context of serverless computing.

How about Azure Container Instances? Unlike Google Cloud Run and ECS, it does not use a scheduler. There is no scaling of any kind. All we can do is run single-replica containers isolated from each other. That alone means that Azure Container Instances cannot be used in production for anything but small businesses. Even in those cases, it is still not a good idea to use ACI for production workloads. The only use-case I can imagine would be for situations in which your application cannot scale. If you have one of those old, often stateful applications that can run only in single-replica mode, you might consider Azure Container Instances. For anything else, the inability to scale is a show stopper.

Simply put, Azure Container Instances provide a way to run Docker containers in the Cloud. There is not much more to it, and we know that Docker alone is not enough for anything but development purposes. I would say that even development with Docker alone is not a good idea, but that would open a discussion that I want to leave for another time.
Another potentially important criterion is the level of lock-in. ECS (with or without Fargate) is fully proprietary and forces us to rely entirely on AWS. The number of resources we need to create and the format for writing application definitions ensure that we are locked into AWS. If you choose to use it, you will not be able to move anywhere else, at least not easily. That does not necessarily mean that the benefits do not outweigh the potential cost of being locked in, but, instead, that we need to be aware of it when deciding whether to use it or not.

The issue with ECS is not lock-in itself. There is nothing wrong with using proprietary solutions that solve problems in a better way than open alternatives. The problem is that ECS is by no means any better than Kubernetes. As a matter of fact, it is a worse solution. So, the problem with being locked into ECS is that you are locked into a service that is not as good as the more open counterpart provided by the same company (AWS EKS). That does not mean that EKS is the best managed Kubernetes service (it is not), but that, within the AWS ecosystem, it is probably a better choice.
Azure Container Instances are also fully proprietary but, given that all the investment is in creating container images and running a single command, you will not be locked in. The investment is very low, so if you choose to switch to another solution or a different provider, you should be able to do that with relative ease.

Google Cloud Run is based on Knative, which is open source and open standard. Google is only providing a layer on top of it. You can even deploy it using Knative definitions, which can be installed in any Kubernetes cluster. From the lock-in perspective, there is close to none.

How about high-availability? Google Cloud Run was the only solution that did not produce 100 % availability in our tests with siege. So far, that is the first negative point we could give it. That is a severe downside. That does not mean that it is not highly available, but rather that it tends to produce only a few nines after the decimal (e.g., 99.99). That’s not a bad result by any means. If we did more serious testing, we would see that over a more extended period and with a higher number of requests, the other solutions would also drop below 100 % availability. Nevertheless, with a smaller sample, Azure Container Instances and AWS ECS did produce better results than Google Cloud Run, and that is not something we should ignore.

Azure Container Instances, on the other hand, can handle only limited traffic. The inability to scale horizontally inevitably leads to failure to be highly-available. We did not experience that in our tests with siege, mostly because a single replica was able to handle a thousand concurrent requests. If we increased the load, it would start collapsing by reaching the limit of what one replica can handle.

On the other hand, ECS provides the highest availability, as long as we set up horizontal scaling. We need to work for it.

Finally, the most important question to answer is whether any of those services is production-ready.
We already saw that Azure Container Instances should not be used in production, except for very specific use-cases. Google Cloud Run and AWS ECS, on the other hand, are production-ready. Both provide all the features you might need when running production workloads. The significant difference is that ECS has existed for much longer, while Google Cloud Run is a relatively new service, at least at the time of this writing (July 2020). Nevertheless, it is based on Google Kubernetes Engine (GKE), which is considered the most mature and stable managed Kubernetes we can use today. Given that Google Cloud Run is only a layer on top of GKE, we can safely assume that it is stable enough. The bigger potential problem is in Knative itself. It is a relatively new project that did not yet reach the first GA release (at the time of this writing, the latest release is 0.16.0). Nevertheless, major software vendors are behind it. Even though it might not yet be battle-tested, it is getting very close to being the preferred way to run serverless computing in Kubernetes.

To summarize, Azure Container Instances are not, and never will be, production-ready. AWS ECS is fully there, and Google Cloud Run is very close to being production-ready.

Finally, can any of those services be qualified as serverless? To answer that question, let’s define the features we expect from managed serverless computing.
It is supposed to remove the need to manage infrastructure or, at least, to simplify it greatly. It should provide scalability and high-availability, and it should charge us for what our users use while making sure that our apps are running only when needed. We can summarize those as follows.

• No need to manage infrastructure
• Out-of-the-box scalability and high-availability
• “Pay what your users use” model

If we take those three as the basis for evaluating whether something is serverless or not, we can easily discard both Azure Container Instances and AWS ECS with Fargate.

The Azure Container Instances service does not have out-of-the-box scalability and high-availability. As a matter of fact, it has no scalability of any kind and, therefore, it cannot be highly available. On top of that, we do not pay what our users use since it cannot scale to zero replicas, so our app is always running, no matter whether someone is consuming it. As such, our bill will be based on the amount of pre-assigned resources (memory and CPU). The only major serverless computing feature that it does provide is hands-off infrastructure.

AWS ECS with Fargate does provide some sort of scalability. It’s not necessarily an out-of-the-box experience, but it is there. Nevertheless, it fails to abstract infrastructure management, and it suffers from the same problem as ACI when billing is concerned. Given that we cannot scale our applications to zero replicas when they are not used, we have to pay for the resources they are consuming independently of our users’ needs.

Google Cloud Run is, by all accounts, a serverless implementation of Containers as a Service. It removes the need to manage infrastructure, and it provides horizontal scaling as an out-of-the-box solution while still allowing us to fine-tune the behavior. It scales to zero when not in use, so it does adhere to the “pay what your users use” model. Google Cloud Run is, without doubt, the best of the three.
Before we jump into my personal thoughts about managed CaaS, here’s a table that summarizes the findings.

                          ACI   ECS       GCR
Easy to use               Yes   No        Yes
Hands-off infrastructure  Yes   No        Yes
Horizontal scaling        No    Partial   Yes
Open (no lock-in)         Yes   No        Yes
High-availability         No    Yes       Almost
Production-ready          No    Yes       Almost
Serverless                No    No        Yes
Personal Thoughts About Managed CaaS

I am confident that the future of serverless computing is in Containers as a Service. The ability to deploy any container image combined with letting compute providers handle infrastructure, scaling, and other demanding tasks is the winning combination. It allows us to focus on our business goals while letting others handle the rest of the work.

Container images themselves enable us to have parity between local development and production. Or, to be more precise, they allow us to use the same artifacts (images), no matter where we’re running them. Unlike Functions as a Service solutions that are very limiting, Containers as a Service gives us much more freedom and serves a much wider range of use cases. Finally, on top of all that, CaaS can apply the “pay what your users use” model, one of the most exciting capabilities of serverless computing.

All in all, serverless is the future, and Containers as a Service is the best candidate to get us to that future.

But there is a problem. Not everything is painted with unicorn colors. Today (July 2020), Containers as a Service solutions are either immature or do not comply with the serverless paradigm.

Azure Container Instances are too simple for any serious use. It is Docker running in the cloud, and not much more. As such, it’s not even worth commenting on.

ECS with Fargate fails on too many levels when considered as a serverless computing solution. It is too complicated, it does not remove the need to handle infrastructure, it does not have out-of-the-box scalability and availability, it does not scale to zero replicas when not in use, and it does not adhere to the “pay what your users use” model. If we use a lot of imagination, we could say that it is a managed Containers as a Service solution, but it is by no means serverless. From my perspective, ECS does not bring any additional value that would make me choose it over a good managed Kubernetes solution.
Given that AWS has Elastic Kubernetes Service (EKS), you might be wondering why we are even discussing ECS. The reason is simple. We are exploring Containers as a Service solutions which abstract underlying technology (e.g., ECS, EKS, etc.). At the same time, AWS is still pushing for ECS and considers it a preferable way to run containers, so I am just going with the flow and presenting what AWS considers the best option.
The only solution that shines as managed Containers as a Service is Google Cloud Run. That is not a surprise. Google is at the forefront of Kubernetes, and it has, by far, the best managed Kubernetes service. For now, everyone else is trying to catch up on that front. Given that Kubernetes is the best candidate for Containers as a Service, it is no wonder that Google has the best and, potentially, the only CaaS solution worth evaluating. While others were trying to catch up with Kubernetes, Google was building a layer on top of GKE, and the result is Cloud Run.

Today (July 2020), the only managed Containers as a Service worth considering, at least among the “big three providers”, is Google Cloud Run. Everything else is either going in the wrong direction (ECS with Fargate), has hit a dead end, or was never meant to be used in production (Azure Container Instances).

If you are using Google, you should consider Cloud Run for some, if not all, of your applications. If you can move from wherever you are to Google, you might consider that option as well. If you have decided to stay in Azure or AWS, the Containers as a Service solutions over there are not good choices. But do not be discouraged by that. CaaS is the natural evolution of containers and schedulers like Kubernetes. New solutions will come, and the existing ones will improve. That much is certain.

Even if you do not want to wait, there is always the option to switch from a managed to a self-managed Containers as a Service solution. That happens to be the subject of the next section.
Using Self-Managed Containers As A Service (CaaS)

We already explored a few Containers as a Service solutions. We saw the pros and cons of using AWS ECS with Fargate, Azure Container Instances, and Google Cloud Run. All those solutions were managed CaaS. They are provided and managed by other vendors. When using them, we do not need to think about the underlying infrastructure, services, plumbing, scaling, and other things not directly related to developing applications. By using managed CaaS, we are externalizing most of the work required for the successful deployment and management of our applications.

However, managed solutions are not always the best path forward. There are cases when using a service is not enough. Sometimes we need more control over our applications. In other instances, we might not even be allowed to use services. There might be regulatory or other restrictions imposed.

In this section, we will explore self-managed Containers as a Service. We will try to accomplish the same goals as with any other serverless Containers as a Service solution, but without relinquishing control. Instead of letting providers like AWS, Azure, and Google Cloud manage our applications, we will do it ourselves.

The self-managed serverless computing solution we choose should be simple. It should not require much more than a container image and a few essential arguments or parameters. It should scale automatically, even to zero replicas, if there is no traffic. We’ll deploy container images in a similar way as, let’s say, we would do with Google Cloud Run. The only substantial difference is that we’ll have to manage the infrastructure and the platform ourselves.
Using Knative To Deploy And Manage Serverless Workloads

Knative¹⁵⁴ is a platform sitting on top of Kubernetes. It tries to help with the deployment and management of serverless workloads. It is split into two major components: Serving and Eventing. For our purposes, we’ll focus on Knative Serving, and leave Eventing for some other time. From now on, I will refer to Knative Serving as simply Knative. Do not take that as a sign that Eventing is not worth exploring, but rather that it is not in the scope of this section.

¹⁵⁴https://knative.dev/
Knative Build was a third component under the Knative umbrella. However, over time, it split away from Knative, and, today, it is known as Tekton Pipelines¹⁵⁵.
A short version of the description of Knative would be that it provides means to run serverless workloads in Kubernetes. We can qualify it as a self-managed serverless solution. It allows anyone to run applications as Containers as a Service, except that it is not a service managed by a vendor. It is a solution for self-managed serverless deployments of applications packaged as container images. That is not to say that Knative has to be self-managed. Quite a few providers adopted (or are adopting) Knative as a way to provide a managed Containers as a Service solution.

What makes Knative a very interesting solution, besides its technical capabilities, is that it is becoming the de facto standard for serverless computing, especially when narrowed down to the solutions running on top of Kubernetes. Google is using it as the base for Cloud Run, which we already explored. Other providers (big and small) are in the process of adopting it as well. It is open source and developed by a vast community. Big names like Google, Pivotal, IBM, SAP, and Red Hat are investing heavily in it. As an additional bonus, given that it can run in any Kubernetes cluster, you will not be locked to any single vendor.

If I had to make a bet on which solution will become the de facto standard for serverless workloads, it would be Knative. Time will tell whether I’m right. But, for now, that seems like the safest bet we could make.

That’s it. That’s all the description I will give before we jump into practical examples. I believe that practice is the best way to learn something and that there is no better way to evaluate the feasibility of using something than to use it. Let’s start with the requirements and the setup.
Installing And Configuring Knative

I already stated that Knative is a platform running on top of Kubernetes, so it should be no surprise that it is the first requirement. However, it cannot be any Kubernetes. To be more precise, it can be any Kubernetes, as long as it is version 1.16 or above.

All the commands from this section are available in the 04-03-knative.sh¹⁵⁶ Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.
¹⁵⁵https://github.com/tektoncd/pipeline
¹⁵⁶https://gist.github.com/dc4ba562328c1d088047884026371f1f

To make things simpler, I created Gists for creating a Kubernetes cluster in different flavors. You’ll find the requirements and the commands for Docker Desktop, Minikube, Google Kubernetes Engine (GKE), AWS Elastic Kubernetes Service (EKS), and Azure Kubernetes Service (AKS). That does not mean that you have to use any of those Gists, nor that other Kubernetes flavors are not supported. Instead, those Gists are only helpers that you might choose to use or to ignore. All you really need is a Kubernetes cluster version 1.16+ with sufficient capacity. Just bear in mind that I tested all the commands in clusters available in the Gists. So, if you choose to go rogue and roll out your own, you might need to change a thing or two.

• Docker Desktop: docker-5gb-4cpu.sh¹⁵⁷
• Minikube: minikube-5gb-4cpu.sh¹⁵⁸
• GKE: gke-simple.sh¹⁵⁹
• EKS: eks-simple.sh¹⁶⁰
• AKS: aks-simple.sh¹⁶¹
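If you roll out your own cluster, the 1.16+ requirement is the first thing worth verifying. What follows is a minimal sketch of such a check, not something taken from the Gists. The function name version_at_least is hypothetical, and it assumes you pass it a version string such as the one reported by kubectl version for the server. It compares only the major and minor numbers.

```shell
# Hypothetical helper (not part of kubectl): succeeds when version $1
# is at least version $2, comparing only major and minor numbers.
version_at_least() {
    have="${1#v}"                        # drop an optional leading "v"
    want="${2#v}"
    have_major="${have%%.*}"
    want_major="${want%%.*}"
    have_minor="${have#*.}"              # e.g. "18.3" or "16+"
    want_minor="${want#*.}"
    have_minor="${have_minor%%[!0-9]*}"  # keep only the leading digits
    want_minor="${want_minor%%[!0-9]*}"
    if [ "$have_major" -ne "$want_major" ]; then
        [ "$have_major" -gt "$want_major" ]
    else
        [ "$have_minor" -ge "$want_minor" ]
    fi
}

# Example (the version string is an assumption; use the one from your cluster):
# version_at_least "v1.18.6" "1.16" && echo "Kubernetes version is supported"
```

The patch level is ignored on purpose since the requirement is expressed only as a major.minor version.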
Now that we have a Kubernetes cluster, the first thing we should do is install Knative. The initial installation of Knative is straightforward. All we have to do is apply YAML definitions that will create Custom Resource Definitions (CRDs) and deploy the core components. Later on, you’ll see that the situation complicates a bit with service mesh. For now, we’ll focus on deploying the CRDs and the core components. If you are a Windows user, I will assume that you are running the commands from a Bourne Again Shell (Bash) or a Z Shell (Zsh) and not PowerShell. That should not be a problem if you followed the instructions on setting up Windows Subsystem for Linux (WSL) explained in the Setting Up A Local Development Environment chapter. If you do not like WSL, a Bash emulator like GitBash should do. If none of those is an acceptable option, you might need to modify some of the commands in the examples that follow.
kubectl apply \
    --filename https://github.com/knative/serving/releases/download/v0.19.0/serving-crds.yaml

kubectl apply \
    --filename https://github.com/knative/serving/releases/download/v0.19.0/serving-core.yaml
Knative was deployed inside the knative-serving Namespace, so let’s take a quick look at the Pods running there.
¹⁵⁷https://gist.github.com/bf30b06cbec9f784c4d3bb9ed1c63236
¹⁵⁸https://gist.github.com/1a2ffc52a53f865679e86b646502c93b
¹⁵⁹https://gist.github.com/ebe4ad31d756b009b2e6544218c712e4
¹⁶⁰https://gist.github.com/8ef7f6cb24001e240432cd6a82a515fd
¹⁶¹https://gist.github.com/f3e6575dcefcee039bb6cef6509f3fdc

kubectl --namespace knative-serving \
    get pods

The output is as follows.

NAME           READY STATUS  RESTARTS AGE
activator-...  1/1   Running 0        74s
autoscaler-... 1/1   Running 0        73s
controller-... 1/1   Running 0        73s
webhook-...    1/1   Running 0        73s
We’ll explore some of those Pods later on. For now, the important thing is to confirm that they are all running. If one of them is still pending, please wait for a few moments, and repeat the previous command. Knative depends on a service mesh for parts of its functionality. Currently (August 2020), it supports Ambassador, Contour, Gloo, Istio, Kong, and Kourier. Given that this is not a section dedicated to service mesh, we’ll choose Istio since it is the most commonly used. Functionally, Knative works the same no matter which service mesh we use, so the choice of using Istio should not result in a different experience than using any other. I already prepared a YAML to install Istio and stored it in the vfarcic/devops-catalog-code repo. Let’s take a look at the definition we are about to apply. You can skip the git clone command from the snippet that follows if you already have the local copy of the repo from the exercises in one of the other chapters.
git clone \
    https://github.com/vfarcic/devops-catalog-code.git

cd devops-catalog-code

git pull

We cloned the repo, entered into the directory with the local copy, and pulled the latest version, just in case you cloned the repo earlier, and I changed something in the meantime. Now we can take a look at the definition of the IstioOperator we are about to deploy.

cd knative/istio
Since this chapter is dedicated to Knative, we are not going to comment on that YAML, nor on any other Istio-specific topics. That could be a subject for some other time. Even if you are not familiar with Istio, the only thing that matters in the context of Knative is that we are about to create IstioOperator, which, in turn, will deploy and configure Istio inside the cluster. We will not use Istio directly. It will only be the enabler for Knative to do what it needs to do. Feel free to explore the YAML on your own, or just trust me that it will set up Istio, and move on.

Before we create the IstioOperator, you will need to install istioctl CLI¹⁶², unless you have it already. Now that we have the YAML definition and istioctl, we can finally deploy the Istio operator, which, in turn, will deploy and configure everything we need.

istioctl install --skip-confirmation

It will take a few moments until everything is up and running. Be patient. To be on the safe side, we’ll list the Pods in the istio-system Namespace and confirm that they are running.

kubectl --namespace istio-system \
    get pods

The output is as follows.

NAME                     READY STATUS  RESTARTS AGE
istio-ingressgateway-... 1/1   Running 0        36s
istiod-...               1/1   Running 0        64s
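Eyeballing the STATUS column works, but the same check can be scripted. The sketch below is only an illustration. The function name all_pods_running is hypothetical, and it assumes the default table layout shown above, with STATUS as the third column. It looks only at STATUS, so a Pod that is Running but not yet fully ready (for example, READY 0/1) would still pass.

```shell
# Hypothetical helper: reads the table printed by `kubectl get pods` on stdin
# and succeeds only if there is at least one Pod and every Pod is Running.
all_pods_running() {
    awk '
        NR == 1 { next }               # skip the header row
        { seen = 1 }                   # at least one Pod row was found
        $3 != "Running" { bad = 1 }    # STATUS is assumed to be the third column
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Example usage against a cluster:
# kubectl --namespace istio-system get pods | all_pods_running \
#     && echo "All Pods are running"
```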
Assuming that the status of all the Pods is Running, we are finished installing Istio. However, now we need to make a few modifications to the Knative setup. We should enable mutual TLS (mTLS) so that Knative and Istio can communicate securely. Fortunately, that is very easy. All we have to do is add the label istio-injection with the value enabled to the knative-serving Namespace.

kubectl label namespace knative-serving \
    istio-injection=enabled
We’ll also set the mTLS mode to PERMISSIVE through the PeerAuthentication resource. As always, I prepared a YAML with the definition we need.
¹⁶²https://istio.io/latest/docs/ops/diagnostic-tools/istioctl/
cat peer-auth.yaml

The output is as follows.

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: knative-serving
spec:
  mtls:
    mode: PERMISSIVE

Let’s apply that definition.

kubectl apply --filename peer-auth.yaml

Finally, we’ll need to install the Knative Istio controller. Its installation is separate from Knative core since Istio is only one out of quite a few service meshes it supports.

kubectl apply \
    --filename https://github.com/knative/net-istio/releases/download/v0.19.0/release.yaml
Now that we have both Istio and Knative installed and configured, we need to figure out through which address we’ll be able to access the application we will deploy. This is the part where instructions differ depending on the Kubernetes platform you’re using. If you are using a Kubernetes cluster sitting below an external load balancer accessible through a public IP, we can use xip.io¹⁶³ to simulate a domain. In those cases, all we need is that IP. That would be the case of GKE and AKS. You’ll notice that I did not mention EKS, even though it also creates an external load balancer. Unlike GKE and AKS, it provides a domain with IP that might change over time. While we could work around that, we’ll go with a simpler version for EKS and “fake” the domain by injecting a header into requests. We’ll use xip.io¹⁶⁴ since I could not assume that you have a “real” domain that you can use for the exercises or, if you do, that you did not configure its DNSes to point to the cluster. I might be wrong. If you do have a domain, you’ll see later how to tell Knative to use it, and you should be able to skip the “discovery” of the IP. ¹⁶³http://xip.io/ ¹⁶⁴http://xip.io/
Minikube and Docker Desktop are a “special” case. Unless you employ witchcraft and wizardry, those are running on your laptop and are usually not accessible from outside. So, we won’t be able to leverage xip.io¹⁶⁵ to create ad-hoc domains. No matter which Kubernetes platform you’re using, the instructions on how to retrieve the IP are coming next. Please execute the commands that follow only if you are using Minikube. They will retrieve the IP of the virtual machine, and the port through which the istio-ingressgateway Service is exposed. The two will be combined into a full address and assigned to the environment variable INGRESS_HOST.
export INGRESS_IP=$(minikube ip)

export INGRESS_PORT=$(kubectl \
    --namespace istio-system \
    get service istio-ingressgateway \
    --output jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')

export INGRESS_HOST=$INGRESS_IP:$INGRESS_PORT

Please execute the command that follows only if you are using Docker Desktop. It will assign 127.0.0.1 (localhost) as the address to the environment variable INGRESS_HOST.

export INGRESS_HOST=127.0.0.1
Please execute the commands that follow only if you are using GKE or AKS. The first will retrieve the IP of the load balancer through which we can access the istio-ingressgateway Service. The second command will use that IP to generate an xip.io address that we will use to access the application deployed through Knative.
¹⁶⁵http://xip.io/
export INGRESS_IP=$(kubectl \
    --namespace istio-system \
    get service istio-ingressgateway \
    --output jsonpath='{.status.loadBalancer.ingress[0].ip}')

export INGRESS_HOST=$INGRESS_IP.xip.io
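One thing to watch for with the commands above: if the external load balancer has not been assigned an IP yet, the jsonpath expression returns an empty string, and INGRESS_HOST ends up as just .xip.io. A small guard can catch that. The sketch below is an illustration only, and the function name xip_domain is hypothetical.

```shell
# Hypothetical helper: turns a load balancer IP into an xip.io address,
# and fails loudly when the IP is empty (the load balancer is not ready yet).
xip_domain() {
    if [ -z "$1" ]; then
        echo "The load balancer has no IP yet. Wait a moment and retry." >&2
        return 1
    fi
    printf '%s.xip.io\n' "$1"
}

# Example usage, assuming INGRESS_IP was retrieved as shown above:
# export INGRESS_HOST=$(xip_domain "$INGRESS_IP")
```

If the function fails, repeating the command that retrieves INGRESS_IP after a short wait is usually all that is needed.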
Please execute the command that follows only if you are using EKS. It will retrieve the hostname of the load balancer through which we can access the istio-ingressgateway Service.
export INGRESS_HOST=$(kubectl \
    --namespace istio-system \
    get service istio-ingressgateway \
    --output jsonpath='{.status.loadBalancer.ingress[0].hostname}')
Only one more thing is left for us to do before we start deploying applications using Knative. We might need to tell it which base address to use. Later on, you’ll see that Knative auto-generates a unique address for each application. We’ll get there. For now, let’s take a look at the ConfigMap config-domain.

kubectl --namespace knative-serving \
    get configmap config-domain \
    --output yaml

The output, limited to the relevant parts, is as follows.

apiVersion: v1
data:
  _example: |
    ...
    example.com: |
    ...
    example.org: |
      selector:
        app: nonprofit
    ...
    svc.cluster.local: |
      selector:
        app: secret
kind: ConfigMap
...
If we focus on the data section of that ConfigMap and read the comments, we can see that we can define the patterns used to generate our applications’ addresses. We can have a catch-all address (example.com) used only if none of the other patterns are matched. We can assign domains (example.org) to applications with specific labels (app: nonprofit). We can even define which apps will not be exposed through Ingress (svc.cluster.local). Feel free to consult the documentation for additional use cases. Since we will deploy only one application, we can set a single catch-all domain to whatever is the value we stored in the environment variable INGRESS_HOST. But there’s a catch. As I already mentioned, xip.io works only with publicly accessible IPs, so that’ll work only for GKE and AKS. If you’re using Minikube, Docker Desktop, or EKS, you do not need to execute the following command. We’ll keep using the default domain example.com, and you’ll see later how we’ll “fake it”. Please execute the command that follows only if you are using GKE or AKS.
echo "apiVersion: v1
kind: ConfigMap
metadata:
  name: config-domain
  namespace: knative-serving
data:
  $INGRESS_HOST: |
" | kubectl apply --filename -

Finally, to be on the safe side, let’s confirm that all the Pods related to Knative are indeed running.

kubectl --namespace knative-serving \
    get pods

The output is as follows.

NAME                 READY STATUS  RESTARTS AGE
activator-...        1/1   Running 2        7m28s
autoscaler-...       1/1   Running 3        7m28s
controller-...       1/1   Running 2        7m28s
istio-webhook-...    2/2   Running 3        4m56s
networking-istio-... 1/1   Running 2        4m56s
webhook-...          1/1   Running 3        7m27s
All the Pods are running, at least in my case, so we can move on. That’s it. I’ll admit that it might not have been the most straightforward setup in the world. However, you’ll see soon that it is worth it. It will be smooth sailing from now on.
Painting The Big Picture

Before we dive into the actual usage of Knative, let’s see which components we got and how they interact with each other. We’ll approach the subject by trying to figure out the flow of a request. It starts with a user. It could also be internal traffic but, for simplicity, we’ll focus on users. The differences are trivial.

When we send a request, it goes to the external load balancer, which, in our case, forwards it to the Istio Gateway accessible through a Kubernetes Service created when we installed Istio. That’s the same Service that created the external load balancer if you are using GKE, EKS, or AKS. In the case of Minikube and Docker Desktop, there is no external load balancer, so you should use your imagination.
From the external LB, requests are forwarded to the cluster and picked up by the Istio Gateway. Its job is to forward requests to the destination Service associated with our application. However, we do not yet have the app, so let’s deploy something.
Figure 4-3-1: Flow of a request from a user to the Istio Gateway
We’ll simulate that this is a deployment of a serverless application to production, so we’ll start by creating a Namespace.

kubectl create namespace production

Since we are using Istio, we might just as well tell it to auto-inject Istio proxy sidecars (Envoy). That is not a requirement. We could just as well use Istio only for Knative internal purposes, but since we already have it, why not go all the way in and use it for our applications? As you already saw when we installed Knative, all we have to do is add the istio-injection label to the Namespace.

kubectl label namespace production \
    istio-injection=enabled
Now comes the big moment. We are about to deploy our first application using Knative. To simplify the process, we’ll use the kn CLI for that. Please visit Installing the Knative CLI¹⁶⁶ for instructions on how to install it.

Remember that if you are using Windows Subsystem For Linux (WSL), you should follow the Linux instructions.
In the simplest form, all we have to do is execute kn service create and provide info like the Namespace, the container image, and the port of the process inside the container.

kn service create devops-toolkit \
    --namespace production \
    --image vfarcic/devops-toolkit-series \
    --port 80
You might receive an error message similar to RevisionFailed: Revision "devops-toolkit-...-1" failed with message: 0/3 nodes are available: 3 Insufficient cpu. If you did, your cluster does not have enough capacity. If you have Cluster Autoscaler, that will correct itself soon. If you created a GKE or AKS cluster using my Gist, you already have it. If you don’t, you might need to increase the capacity by adding more nodes to the cluster or increasing the size of the existing nodes. Please re-run the previous command after increasing the capacity (yourself or through Cluster Autoscaler).
The output is as follows.

Creating service 'devops-toolkit' in namespace 'production':

  0.030s The Configuration is still working to reflect the latest desired specification.
  0.079s The Route is still working to reflect the latest desired specification.
  0.126s Configuration "devops-toolkit" is waiting for a Revision to become ready.
 31.446s ...
 31.507s Ingress has not yet been reconciled.
 31.582s Waiting for load balancer to be ready
 31.791s Ready to serve.

Service 'devops-toolkit' created to latest revision 'devops-toolkit-...-1' is available at URL:
http://devops-toolkit.production.34.75.214.7.xip.io

¹⁶⁶https://knative.dev/docs/install/install-kn/
We can see that the Knative service is ready to serve and that, in my case, it is available through the subdomain devops-toolkit.production. It is a combination of the name of the Knative service (devops-toolkit), the Namespace (production), and the base domain (34.75.214.7.xip.io). If we ever forget which address was assigned to a service, we can retrieve it through the routes.

kubectl --namespace production \
    get routes

The output is as follows.

NAME           URL                                    READY REASON
devops-toolkit http://devops-toolkit.production.... True
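The way the address is put together (service name, Namespace, and base domain) can be sketched in a few lines of shell. This is only an illustration of the naming scheme described above, assuming the setup used in this section. The function name knative_url is hypothetical and is not part of kn or kubectl.

```shell
# Hypothetical helper: composes the address Knative assigned in this section
# from the service name ($1), the Namespace ($2), and the base domain ($3).
knative_url() {
    printf 'http://%s.%s.%s\n' "$1" "$2" "$3"
}

# Example with the values from my cluster (your base domain will differ):
# knative_url devops-toolkit production 34.75.214.7.xip.io
```

When in doubt, the routes remain the source of truth since they report the address Knative actually assigned.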
Finally, let’s see whether we can access the application through that URL. The commands will differ depending on whether you assigned xip.io as the base domain or kept example.com. If it is xip.io, we can open it in a browser. On the other hand, if the base domain is set to example.com, we’ll have to inject the URL as the header of a request. We can use curl for that. The alternative is to change your hosts file. If you do, you should be able to use open commands. Please execute the command that follows if you are using Minikube, Docker Desktop, or EKS. It will send a simple HTTP request using curl. Since the base domain is set to example.com, but the service through which the app is accessible is set to a different host, we’ll “fake” the domain by adding the header into the request.
curl -H "Host: devops-toolkit.production.example.com" \
    http://$INGRESS_HOST
Please execute the command that follows if you are using GKE or AKS.
If you are a Linux or a WSL user, I will assume that you created the alias open and set it to the xdg-open command. If that’s not the case, you will find instructions on how to do that in the Setting Up A Local Development Environment chapter. If you do not have the open command (or the alias), you should replace open with echo and copy and paste the output into your favorite browser.
open http://devops-toolkit.production.$INGRESS_HOST
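The choice between the two commands above boils down to whether the application’s domain can be resolved. The sketch below captures that decision, as an illustration only. The function name request_command is hypothetical. It prints the command instead of running it, and it assumes that any address ending with example.com is the “fake” one that needs the Host header.

```shell
# Hypothetical helper: prints the command for reaching an application.
# $1 is the full domain of the application, $2 is the ingress address.
request_command() {
    case "$1" in
        *.example.com)
            # The domain does not resolve to the cluster, so the request goes
            # to the ingress address and the domain travels in the Host header.
            printf 'curl -H "Host: %s" http://%s\n' "$1" "$2"
            ;;
        *)
            # The domain (e.g., xip.io) resolves to the load balancer.
            printf 'open http://%s\n' "$1"
            ;;
    esac
}

# Example:
# request_command devops-toolkit.production.example.com "$INGRESS_HOST"
```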
If you used curl, you should see the HTML of the application as the output in your terminal. On the other hand, if you executed open, the home screen of the Web app we just deployed should have opened in your default browser.

How did that happen? How did we manage to have a fully operational application through a single command? We know that any application running in Kubernetes needs quite a few types of resources. Since this is a stateless application, there should be, as a minimum, a Deployment, which creates a ReplicaSet, which creates Pods. We also need a HorizontalPodAutoscaler to ensure that the correct number of replicas is running. We need a Service through which other processes can access our applications. Finally, if an application should be accessible from outside the cluster, we would need an Ingress configured to use a specific (sub)domain and associate it with the Service. We might, and often do, need even more than those resources.

Yet, all we did was execute a single kn command with a few arguments. The only explanation could be that the command created all those resources. We’ll explore them later. For now, trust me when I say that a Deployment, a Service, and a Pod Autoscaler were created. On top of that, the Ingress Gateway we already commented on was reconfigured to forward all requests coming from a specific (sub)domain to our application. It also created a few other resources like a route, a configuration, an Istio VirtualService, and others. Finally, and potentially most importantly, it enveloped all those resources in a revision. Each new version of our app would create a new revision with all those resources. That way, Knative can employ rolling updates, rollbacks, separate which requests go to which version, and so on.
Figure 4-3-2: The application deployed through Knative
Creating all the resources we usually need to run an application in Kubernetes is already a considerable advantage. We removed the clutter and were able to focus only on the things that matter. All we specified was the image, the Namespace, and the port. In a “real world” situation, we would likely specify more. Still, the fact is that Knative allows us to skip defining things that Kubernetes needs, and focus on what differentiates one application from another. We’ll explore that aspect of Knative in a bit more detail later. For now, I hope you already saw that simplicity is one of the enormous advantages of Knative, even without diving into the part that makes our applications serverless.

Now that sufficient time passed, we might want to take a look at the Pods running in the production Namespace.

kubectl --namespace production \
    get pods
The output states that no resources were found in production namespace. If, in your case, there is still a Pod, you are indeed a fast reader, and you did not give Knative sufficient time. Wait for a few moments, and re-run the previous command. Knative detected that no one was using our application for a while and decided that it is pointless to keep it running. That would be a massive waste of resources (e.g., memory and CPU). As a result, it scaled the app to zero replicas. Typically, that would mean that our users, when they decide to continue interacting with the application, would start receiving 5XX responses. That’s what would usually happen when none of the replicas are running. But, as you can probably guess, there’s much more to it than scaling to zero replicas and letting our users have a horrible experience. Knative is a solution for serverless workloads, and, as such, it not only scales our application, but it also queues the requests when there are no replicas to handle incoming requests. Let’s confirm that.
Please execute the command that follows if you are using Minikube, Docker Desktop, or EKS.
curl -H "Host: devops-toolkit.production.example.com" \
    http://$INGRESS_HOST
Please execute the command that follows if you are using GKE or AKS.
open http://devops-toolkit.production.$INGRESS_HOST
As you can see, the application is available. From the user’s perspective, it’s as if it was never scaled to zero replicas.

When we sent a request, it was forwarded to the Ingress Gateway. But, since none of the replicas were available, instead of forwarding it to the associated Service, the Gateway sent it to the Knative Activator. The Activator, in turn, instructed the Autoscaler to increase the number of replicas of the Deployment. As you probably already know, the Deployment modified the ReplicaSet, which, in turn, created the missing Pod. Once a Pod was operational, the Activator forwarded the queued requests to the Service, and we got the response.

The Autoscaler knew what to do because it was configured by the PodAutoscaler created when we deployed the application. In our case, only one Pod was created since the amount of traffic was very low. If the traffic increased, it could have been scaled to two, three, or any other number of replicas. The exact amount depends on the volume of concurrent requests.
Figure 4-3-3: Knative Activator and Autoscaler
We’ll explore the components and the scaling abilities in a bit more detail soon. For now, we’ll remove the application we created with Knative CLI since we are about to see a better way to define it.

kn service delete devops-toolkit \
    --namespace production
That’s it. The application is no more. We are back where we started.
Defining Knative Applications As Code

Executing commands like kn service create is great because it’s simple. But it is the wrong approach to deploying any type of application, Knative included. Maintaining a system created through ad-hoc commands is a nightmare. The initial benefits of that approach are often overshadowed by the cost that comes later. But you already know that. You already understand the benefits of defining everything as code, storing everything in Git, and reconciling the actual and the desired state. I’m sure that you know the importance of the everything-as-code approach combined with GitOps. I hope you do since that is not the subject of this chapter.

We’ll move on with the assumption that you want to have a YAML file that defines your application. It could be some other format but, given that almost everything is YAML in the Kubernetes world, I will assume that’s what you need.

So, let’s take a look at how we would define our application. As you can probably guess, I already prepared a sample definition for us to use.
cat devops-toolkit.yaml
The output is as follows.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: devops-toolkit
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "0"
        autoscaling.knative.dev/maxScale: "3"
    spec:
      containerConcurrency: 100
      containers:
      - image: vfarcic/devops-toolkit-series
        ports:
        - containerPort: 80
        resources:
          limits:
            memory: 256Mi
            cpu: 100m
That definition could be shorter. If we wanted to accomplish the same result as what we had with the kn service create command, we wouldn’t need the annotations and the resources section. But I wanted to show you that we can be more precise. That’s one of the big advantages of Knative. It can be as simple or as complicated as we need it to be. But we do not have time to go into details of everything we might (or might not) want to do. Instead, we are trying to gain just enough knowledge to decide whether Knative is worth exploring in more detail and potentially adopting it as a way to define, deploy, and manage some (if not all) of our applications.

You can probably guess what that definition does. The annotations tell Knative that we want to scale to 0 replicas if there is no traffic and that there should never be more than 3 replicas. We could have chosen different values. For example, we could choose never to scale below 2 replicas and go way above 3. That would give us scalability and high availability without making our applications serverless, since they would never scale down to zero replicas.

The containerConcurrency field is set to 100, meaning that, in a simplified form, there should be one replica for every hundred concurrent requests, while never going above the maxScale value.

The image, ports, and resources fields should be self-explanatory since those are the same ones we would typically use in, let’s say, a Deployment.
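To make the scaling rule a bit more tangible, here is a minimal sketch of the simplified arithmetic described above. It is not how the Knative Autoscaler is implemented. It assumes that the target is the number of concurrent requests divided by containerConcurrency, rounded up, and then kept between minScale and maxScale. The function name expected_replicas is made up for this illustration.

```shell
# Simplified model of the number of replicas Knative aims for.
# Assumption: replicas = ceil(concurrent / containerConcurrency),
# kept between minScale and maxScale. The real Autoscaler also looks
# at metrics over time, which this sketch ignores.
# Arguments: concurrent requests, containerConcurrency, minScale, maxScale
expected_replicas() {
  # Integer ceiling division
  er_wanted=$(( ($1 + $2 - 1) / $2 ))
  if [ "$er_wanted" -lt "$3" ]; then er_wanted=$3; fi
  if [ "$er_wanted" -gt "$4" ]; then er_wanted=$4; fi
  echo "$er_wanted"
}

# The values from devops-toolkit.yaml with 50 concurrent requests
expected_replicas 50 100 0 3    # prints 1

# Never below 2 replicas, and way above 3 if needed
expected_replicas 0 100 2 20    # prints 2
```

The second call shows the non-serverless variation mentioned above. With minScale set to 2, the result never drops below two replicas, even when there is no traffic at all.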
There are also some limitations we might need to be aware of. The most important one is that we can have only one container for each application managed by Knative. If you try to add additional entries to the containers array, you’d see that kubectl apply would throw an error. That might change in the future, but, for now (August 2020), it is something you should be aware of.

That’s it. Let’s apply that definition and see what we’ll get.

kubectl --namespace production apply \
    --filename devops-toolkit.yaml
We created a single resource. We did not specify a Deployment, nor did we create a Service. We did not define a HorizontalPodAutoscaler. We did not create any of the things we usually do. Still, our application should have all those and quite a few others. It should be fully operational, it should be scalable, and it should be serverless. Knative created all those resources, and it made our application serverless through that single short YAML definition. That is a very different approach from what we typically expect from Kubernetes.

Kubernetes is, in a way, a platform to build platforms. It allows us to create very specialized resources that provide value only when combined. An application runs in Pods, Pods need ReplicaSets to scale, and ReplicaSets need Deployments for applying new revisions. Communication is done through Services. External access is provided through Ingress. And so on and so forth. Usually, we need to create and maintain all those, and quite a few other resources, ourselves. So, we end up with many YAML files, a lot of repetition, and a lot of definitions that are not valuable to end-users but are required for Kubernetes’ internal operations. Knative simplifies all that by requiring us to define only the differentiators and only the things that matter to us. It provides a layer on top of Kubernetes that, among other things, aims to simplify the way we define our applications.

We’ll take a closer look at some (not all) of the resources Knative created for us. But, before we do that, let’s confirm that our application is indeed running and accessible.

Please execute the command that follows if you are using Minikube, Docker Desktop, or EKS.
curl -H "Host: devops-toolkit.production.example.com" \
    http://$INGRESS_HOST
Please execute the command that follows if you are using GKE or AKS.
open http://devops-toolkit.production.$INGRESS_HOST
You already saw a similar result before. The major difference is that, this time, we applied a YAML definition instead of relying on kn service create to do the work. As such, we can store that definition in a Git repository. We can apply whichever process we use to make changes to the code, and we can hook it into whichever CI/CD tool we are using.

Now, let’s see which resources were created for us. The right starting point is kservice since that is the only one we created. Whatever else might be running in the production Namespace was created by Knative and not us.

kubectl --namespace production \
    get kservice
The output is as follows.

NAME           URL                      LATESTCREATED      LATESTREADY        READY REASON
devops-toolkit http://devops-toolkit... devops-toolkit-... devops-toolkit-... True
As I already mentioned, that single resource created quite a few others. For example, we have revisions. But, to get to revisions, we might need to talk about Knative Configuration.

kubectl --namespace production \
    get configuration
The output is as follows.

NAME           LATESTCREATED      LATESTREADY        READY REASON
devops-toolkit devops-toolkit-... devops-toolkit-... True
The Configuration resource contains and maintains the desired state of our application. Whenever we change a Knative Service, we are effectively changing the Configuration, which, in turn, creates a new Revision.

kubectl --namespace production \
    get revisions
The output is as follows.

NAME                 CONFIG NAME    K8S SERVICE NAME     GENERATION READY REASON
devops-toolkit-k8j9j devops-toolkit devops-toolkit-k8j9j 1          True
Each time we deploy a new version of our application, a new immutable revision is created. It is a collection of almost all the application-specific resources. Each revision has a separate Service, a Deployment, a Knative PodAutoscaler, and, potentially, a few other resources. Creating revisions allows Knative to decide which request goes where, how to roll back, and a few other things.
Figure 4-3-4: Knative Configuration and Revisions
Now that we mentioned Deployments, Services, and other resources, let’s confirm that they were indeed created. Let’s start with Deployments.

kubectl --namespace production \
    get deployments
The output is as follows.

NAME                          READY UP-TO-DATE AVAILABLE AGE
devops-toolkit-...-deployment 0/0   0          0         13m
The Deployment is indeed there. The curious thing is that 0 out of 0 replicas are ready. Since it’s been a while since we interacted with the application, Knative decided that there is no point in running it. So, it scaled it to zero replicas. As you already saw, it will scale back up when we start sending requests to the associated Service. Let’s take a look at the Services as well.

kubectl --namespace production \
    get services,virtualservices
The output is as follows.

NAME                               TYPE         CLUSTER-IP    EXTERNAL-IP            PORT(S)    AGE
service/devops-toolkit             ExternalName               cluster-local-gateway. ...        2m47s
service/devops-toolkit-...         ClusterIP    10.23.246.205                        80/TCP     3m6s
service/devops-toolkit-...-private ClusterIP    10.23.242.13                         80/TCP,... 3m6s

NAME               GATEWAYS              HOSTS               AGE
virtualservice.... [knative-serving/...] [devops-...]        2m48s
virtualservice.... [mesh]                [devops-toolkit...] 2m48s
We can see that Knative created Kubernetes Services, but also Istio VirtualServices. Since we told it that we want to combine it with Istio, it understood that we need not only Kubernetes core resources, but also those specific to Istio. If we chose a different service mesh, it would create whatever makes sense for it.

Further on, we got the PodAutoscaler.

kubectl --namespace production \
    get podautoscalers
The output is as follows.

NAME               DESIREDSCALE ACTUALSCALE READY REASON
devops-toolkit-... 0            0           False NoTraffic
PodAutoscaler is, as you can guess by its name, in charge of scaling the Pods to comply with the changes in traffic, or whichever other criteria we might use. By default, it measures the incoming traffic, but it can be extended to use formulas based on queries from, for example, Prometheus.

Finally, we got a Route.

kubectl --namespace production \
    get routes
The output is as follows.

NAME           URL                       READY REASON
devops-toolkit http://devops-toolkit.... True
Routes map endpoints (e.g., a subdomain) to one or more revisions of the application. They can be configured in quite a few different ways, but, in essence, a Route is the entity that routes the traffic to our applications.

We are almost finished. There is only one crucial thing left to observe, at least from the perspective of a quick overview of Knative. What happens when many requests are “bombing” our application? We saw that when we do not interact with the app, it is scaled down to zero replicas. We also saw that when we send a request to it, it scales up to one replica. But, what would happen if we start sending five hundred concurrent requests? Take another look at devops-toolkit.yaml and try to guess. It shouldn’t be hard.

Did you guess how many replicas we should have if we start sending five hundred concurrent requests? Let’s assume that you did, and let’s see whether you were right.

We’ll use Siege¹⁶⁷ to send requests to our application. To be more specific, we’ll use it to send a stream of five hundred concurrent requests over sixty seconds. We’ll also retrieve all the Pods from the production Namespace right after siege is finished “bombing” the application.

As before, the commands will differ slightly depending on the Kubernetes platform you’re using.

You will NOT be able to use Siege with Docker Desktop. That should not be a big deal since the essential thing is the output, which you can see here.
Please execute the command that follows if you are using minikube or EKS.
kubectl run siege \
    --image yokogawa/siege \
    --generator run-pod/v1 \
    -it --rm \
    -- --concurrent 500 --time 60S \
    --header "Host: devops-toolkit.production.example.com" \
    "http://$INGRESS_HOST" \
    && kubectl --namespace production \
    get pods
Please execute the command that follows if you are using GKE or AKS.
¹⁶⁷https://github.com/JoeDog/siege
kubectl run siege \
    --image yokogawa/siege \
    --generator run-pod/v1 \
    -it --rm \
    -- --concurrent 500 --time 60S \
    "http://devops-toolkit.production.$INGRESS_HOST" \
    && kubectl --namespace production \
    get pods
The output, in my case, is as follows.

...
Transactions:            40697 hits
Availability:            100.00 %
Elapsed time:            59.53 secs
Data transferred:        83.72 MB
Response time:           0.22 secs
Transaction rate:        683.64 trans/sec
Throughput:              1.41 MB/sec
Concurrency:             149.94
Successful transactions: 40699
Failed transactions:     0
Longest transaction:     5.30
Shortest transaction:    0.00
...
NAME                              READY STATUS  RESTARTS AGE
devops-toolkit-...-deployment-... 3/3   Running 0        58s
devops-toolkit-...-deployment-... 3/3   Running 0        60s
devops-toolkit-...-deployment-... 3/3   Running 0        58s
We can see that, in my case, over forty thousand requests were sent, and the availability is 100.00 %. That might not always be the situation, so don’t be alarmed if, in your case, it’s a slightly lower figure. Your cluster might not even have enough capacity to handle the increase in workload and might need to scale up. In such a case, the time required to scale up the cluster might have been too long for all the requests to be processed. You can always wait for a while for all the Pods to terminate and try again with increased cluster capacity. For now, Knative does not give 100% availability. I was lucky. If you have huge variations in traffic, you can expect something closer to 99.9% availability. But that is only when there is a huge difference like the one we just had. Our traffic jumped from zero to a continuous stream of five hundred concurrent requests within milliseconds. For the “normal” usage, it should be closer to 100% (e.g., 99.99%) availability.
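To put those percentages into perspective, the sketch that follows translates an availability figure into a rough number of failed requests. It is plain arithmetic and not something Siege or Knative provides. The function name failed_requests is made up for this illustration, and the total is the number of transactions from my output.

```shell
# How many failed requests does a given availability imply?
# Arguments: total number of requests, availability in percent
failed_requests() {
  awk -v total="$1" -v avail="$2" \
    'BEGIN { printf "%.0f\n", total * (100 - avail) / 100 }'
}

failed_requests 40697 99.9     # prints 41
failed_requests 40697 99.99    # prints 4
```

In other words, with the volume of requests we just sent, the difference between 99.9% and 99.99% is roughly forty failed requests versus four.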
What truly matters is that the number of Pods was increased from zero to three. Without a limit, there would have been five Pods since we set the containerConcurrency value to 100, and we were streaming 500 concurrent requests. But we also set the maxScale annotation to 3, so it reached the limit of the allowed number of replicas.

While you’re reading this, Knative probably already started scaling down the application. It probably scaled it to one replica, to keep it warm in case new requests come in. After a while, it should scale down to nothing (zero replicas) as long as there is no traffic.

The vital thing to note is that Knative does not interpret the traffic based only on the current metrics. It will not scale up when the first request that cannot be handled with the existing replicas kicks in. It will also not scale down to zero replicas the moment all requests stop coming in. It changes things gradually, and it uses both current and historical metrics to figure out what to do next.

Assuming that you are not a very fast reader, the number of Pods should have dropped to zero by now. Let’s confirm that.

kubectl --namespace production \
    get pods
The output states that no resources were found in production namespace. In your case, a Pod might still be running, or the status might be terminating. If that’s the case, wait for a while longer and repeat the previous command. There are many aspects of Knative that we did not explore. This chapter’s goal was to introduce you to Knative so that you can see whether it is a tool worth investing in. I tried to provide as much essential information as I could while still being quick and concise. If you think that Knative is the right choice for some, if not all of your applications, please visit its documentation and start digging deeper. Now, let’s undo everything we did.
Destroying The Resources

We are finished with our quick overview of Knative, and all that’s left is to remove the application.

kubectl --namespace production delete \
    --filename devops-toolkit.yaml
Since that is the only app we deployed to the production Namespace, we might just as well eliminate it completely.
kubectl delete namespace production
Finally, let’s go a few directories back and get to the same place where we started.
cd ../../../
Please note that we did not uninstall Knative and Istio. I assumed that you used a throw-away cluster for the exercises and that there is no need to provide the commands to remove them. We are about to destroy the cluster we created unless you choose to stick with it for a while longer. If you are using EKS, you will not be able to destroy the cluster right away. The Ingress Gateway deployed with Istio created an external load balancer. We need to destroy it first. Otherwise, some of the cluster resources might not be removable since they might be tied to the external LB. So, we’ll need to remove the Istio Ingress Gateway first.
kubectl --namespace istio-system \
    delete service istio-ingressgateway
If you created a cluster using one of my Gists, you will find the instructions on how to destroy the cluster at the bottom. Feel free to use those commands unless you plan on keeping the cluster up and running for other reasons.
Self-Managed Vs. Managed CaaS

We explored both managed and self-managed Containers as a Service (CaaS). Examples of the former were Azure Container Instances, AWS ECS with Fargate, and Google Cloud Run. The results were mixed, ranging from useless to amazing. Now, on the other hand, we explored Knative as a potential candidate for running self-managed serverless CaaS workloads. The potentially most important question left unanswered is whether we should use managed or self-managed CaaS.

For the sake of not repeating myself, I’ll skip explaining which managed CaaS solutions are worth using. We already went through that at the end of the Using Managed Containers As A Service (CaaS) section. For the sake of comparison, I’ll assume that all managed CaaS is equally good, even though that’s not the case, and generalize the discussion into a simple managed vs. self-managed choice.

More often than not, I recommend using something-as-a-service, as long as that fulfills the requirements and the needs. But, in the case of CaaS, the choice is not that clear. I could list reasons like “do not waste time on managing infrastructure” and “let others worry about it so that you can focus on your business goals.” In the case of managed CaaS, those statements are true, but mostly if you can avoid having a Kubernetes cluster altogether. For many, that is not an option. No matter whether CaaS becomes 10% or 90% of the workload running in containers, there is almost always the need to run something in your Kubernetes cluster.

Now, if you do have to run a Kubernetes cluster, the overhead of using self-managed CaaS like, for example, Knative, is negligible. That is, more or less, equally valid no matter whether that is a
managed Kubernetes cluster like GKE, EKS, or AKS, or an on-prem cluster. Most of the work involves managing infrastructure and the cluster as a whole. If you are using Kubernetes, you have that overhead, no matter whether you are using Knative, Google Cloud Run, or something completely different.

That is not to say that there are no benefits in using managed CaaS if we already have a Kubernetes cluster (and we want to keep it). But there are potential reasons to go the other way as well. Integrating self-managed serverless CaaS workloads with the rest of the system tends to be easier if everything is in one place. Or, at least, if parts of the rest are in the same place. For example, if we have a stateful app running in a Kubernetes cluster, running a stateless app that depends on it inside the same cluster might be easier, better, or faster than running it somewhere else.

Those benefits are usually not sufficient for us to choose one solution over the other. If the overhead of managing your infrastructure has to be paid no matter the choice, either option is valid.

The good news is that deploying applications as Containers as a Service is, most of the time, simple and straightforward. There is very little code involved, so switching from one solution to another is not a big deal. For example, you could use Knative in your Kubernetes cluster and later switch to Google Cloud Run while using almost the same YAML definitions. That’s the advantage of using a service from a vendor (e.g., Google Cloud Run) that relies on an open-source project (e.g., Knative). Even if what your vendor is offering is not based on anything that you could run yourself (e.g., Azure Container Instances), the effort is still small. Serverless CaaS solutions usually involve a relatively small amount of work, no matter which solution we choose and where we are running it.

All in all, if you are running on-prem, the choice is a no-brainer. You have to use self-managed CaaS.
If you are running in Cloud and do not have or do not want to manage a Kubernetes cluster, managed CaaS is the way to go. The gray area is in the cases when you do have a Kubernetes cluster, and you’re running in Cloud. In such a case, I am not yet sure which option to choose. It will likely depend on personal preferences more than anything else. Before we move to the next subject, I will use this opportunity for a call to action. Would you like me to explore other self-managed serverless Containers as a Service solutions? If you do, please let me know which one you’d like me to add to the mix, and I’ll do my best to include it.
There Is More About Serverless

There is more material about this subject waiting for you on the YouTube channel The DevOps Toolkit Series¹⁶⁸. Please consider watching one of the videos that follow, and bear in mind the list will likely grow over time.

• Copilot - What AWS ECS and Fargate Container Management Should Have Been All Along¹⁶⁹
• Amazon Lambda Containers - How to Package AWS Functions as Container Images¹⁷⁰

¹⁶⁸https://www.youtube.com/c/TheDevOpsToolkitSeries
¹⁶⁹https://youtu.be/YCCFK2RRm7U
¹⁷⁰https://youtu.be/DsQbBVr-GwU
Using Centralized Logging

A long time ago in a galaxy far, far away… We used to store logs in files. That wasn’t so bad when we had a few applications, and they were running on dedicated servers. Finding logs was easy given the nature of static infrastructure. Locating issues was easy as well when there was only an application or two.

But times changed. Infrastructure grew in size. Now we have tens, hundreds, or even thousands of nodes. The number of applications increased as well. More importantly, everything became dynamic. Nodes are being created and destroyed. We started scaling them up and down. We began using schedulers, and that means that our applications started “floating” inside our clusters. Everything became dynamic and volatile. Everything increased in size. As a result, “hunting” logs and going through endless entries to find issues became unacceptable. We needed a better way.

Today, the solution is in centralized logging. As our systems became distributed, we had to centralize the location where we store logs. As a result, we got “log collectors” that would gather logs from all the parts of the systems, and ship them to a central database. Centralized logging became the de-facto standard and a must in any modern infrastructure.

There are many managed third party solutions like Datadog¹⁷¹, Splunk¹⁷², and others. Cloud providers started offering solutions like Google Cloud Operations¹⁷³ (formerly known as Stackdriver), AWS CloudWatch¹⁷⁴, and others. Almost any Cloud provider has a solution.

When it comes to self-hosted options, the ELK stack¹⁷⁵ (Elasticsearch, Logstash, and Kibana) became the de-facto standard. Elasticsearch is used as the database, Logstash for transferring logs, and Kibana for querying and visualizing them.

We are about to explore self-managed logging and leave managed logging aside. Given that I already stated that the ELK stack is the de-facto standard, you probably think it will be our choice. But it will not.
We will go down a different route. But, before we continue, let me introduce you to Vadim. This chapter is based on his work, and I am eternally grateful that he chose to help out.
About Vadim

Using his own words…

“Hello, my name is Vadim Gusev. I’m a DevOps engineer at Digitalpine, internet ads, and marketing startup. I’m fascinated by ever-growing Cloud Native Landscape and continuously looking for tools, technologies, and practices that would make my life easier and provide my team with a more robust, more efficient, and worry-free experience. As a long time DevOps Toolkit Series reader, I was happy to know that the new Catalog book is in the making. I’ve gladly accepted Viktor’s invitation to contribute my report on tools that I’ve had a positive experience in production. Hope you like this chapter!”

¹⁷¹https://www.datadoghq.com/
¹⁷²https://www.splunk.com/
¹⁷³https://cloud.google.com/products/operations
¹⁷⁴https://aws.amazon.com/cloudwatch/
¹⁷⁵https://www.elastic.co/elastic-stack
Why Not Using The ELK Stack?

The problem with the ELK stack lies in its ad-hoc nature. Elasticsearch is a database built around Lucene. It is a full-text search engine. It is designed for fuzzy search across vast amounts of human-written text. It handles complex tasks, resulting in a CPU and memory intensive solution. In particular, it creates a sizeable in-memory index that “eats” resources for breakfast. Elasticsearch can quickly become the most demanding application in a system. That is especially true if we introduce clustering, sharding, and other scaling options.

What I’m really trying to say is that the ELK stack is over-qualified for the task. It provides a full-text analytics platform at high resource and administrative cost, while the essential logging requirement is just a “distributed grep”.

There was increasing demand for such tools and several attempts at creating distributed “grep-style” log aggregation platforms. Many failed due to the lack of “their own Kibana”. Even when aggregation and storage were done right, querying and visualization were lackluster or completely absent.

All in all, the ELK stack is demanding on resources and hard to manage. It might be overkill if our needs focus on logs, and not all the other tasks it can perform. The problem is that, within self-hosted solutions, there wasn’t much choice until recently.
Using Loki To Store And Query Logs

Grafana Labs¹⁷⁶ started working on the project Loki¹⁷⁷ somewhere around mid-2019. It describes itself as “like Prometheus, but for Logs.” Given that Prometheus is the most promising tool for storing metrics, at least among open source projects, that description certainly sparks interest. The idea is to abandon text indexing and, instead, attach labels to log lines in a way similar to metric labels in Prometheus. In fact, it is using precisely the same Prometheus labeling library.

It might not be apparent why this is a brilliant idea, so let’s elaborate a bit. Indexing a massive amount of data is very compute intensive and tends to create issues when running at scale. By using labels instead of indexing logs, Loki made itself compute-efficient. But that’s not the only benefit. Having the same set of labels on application logs and metrics helps immensely in correlating those two during investigations. On top of that, logs are queried in much the same way as metrics stored in Prometheus. To be more precise, Loki uses LogQL, a query language modeled on PromQL that borrows its label selectors. Given that querying metrics tends to need a richer query language, the decision to keep the language for logs smaller makes sense. On top of all that, if you adopted Prometheus, you will not need to learn a completely different query language. Even if Prometheus is not your choice, its metrics format and query language are being adopted by other tools, so it is an excellent investment to learn it.

Further on, the UI for exploring logs is based on Grafana, which happens to be the de-facto standard in the observability world. The release 6.0 of Grafana added the Explore screen with the support for Loki included from day one. It allows us to query and correlate both logs from Loki and metrics from Prometheus in the same view. As a result of those additions, we suddenly have not only a good log querying solution but one that is arguably better than Kibana.

That’s enough talk. Let’s get our hands dirty. The first step is to set up a cluster and install the Loki stack and a few other tools.

¹⁷⁶https://grafana.com/
¹⁷⁷https://github.com/grafana/loki
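To give you a feel for what “labels plus grep” looks like in practice, here is a minimal sketch that assembles a basic LogQL query. The part in curly braces selects log streams by label, just like a Prometheus selector, and the |= operator keeps only the lines containing the given text. The function name logql_query, the app label, and its value are assumptions made for this illustration. Which labels exist in your cluster depends on how the logs are collected.

```shell
# Builds a basic LogQL query: a label selector followed by a line filter.
# Arguments: label name, label value, text the log line must contain
logql_query() {
  printf '{%s="%s"} |= "%s"\n' "$1" "$2" "$3"
}

logql_query app devops-toolkit error
# prints {app="devops-toolkit"} |= "error"
```

The resulting string is what we would type into a query field. There is no full-text index behind it. The labels narrow down the streams, and the filter scans only the lines in those streams.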
Installing Loki, Grafana, Prometheus, And The Demo App

As you can probably guess, we’ll run the Loki stack inside a Kubernetes cluster. That is our first requirement. We’ll need a Kubernetes cluster with the NGINX Ingress controller. The address through which we can access Ingress should be stored in the environment variable INGRESS_HOST. That’s it. Those are all the prerequisites.

All the commands from this section are available in the 05-logging.sh¹⁷⁸ Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.
As always, I prepared Gists for creating clusters based on Docker Desktop, Minikube, Google Kubernetes Engine (GKE), AWS Elastic Kubernetes Service (EKS), and Azure Kubernetes Service (AKS). Use them, or roll out your own cluster. Just remember that, if you choose to go rogue with your own, you might need to change a command or two in the following examples.

• Docker Desktop: docker-3gb-2cpu.sh¹⁷⁹
• Minikube: minikube.sh¹⁸⁰
• GKE: gke-simple-ingress.sh¹⁸¹
• EKS: eks-simple-ingress.sh¹⁸²
• AKS: aks-simple-ingress.sh¹⁸³

¹⁷⁸https://gist.github.com/838a3a716cd9eb3c1a539a8d404d2077
¹⁷⁹https://gist.github.com/0fff4fe977b194f4e9208cde54c1aa3c
¹⁸⁰https://gist.github.com/2a6e5ad588509f43baa94cbdf40d0d16
¹⁸¹https://gist.github.com/925653c9fbf8cce23c35eedcd57de86e
¹⁸²https://gist.github.com/2fc8fa1b7c6ca6b3fefafe78078b6006
¹⁸³https://gist.github.com/e24b00a29c66d5478b4054065d9ea156
Now that you have a cluster with Ingress and the address stored in the environment variable INGRESS_HOST, you’ll need to ensure that you have helm CLI. We’ll use it to deploy all the tools we’ll need as well as a demo app. I’m sure you already have it. In case you don’t, please visit the Installing Helm¹⁸⁴ page for the information. Remember that if you are using Windows Subsystem For Linux (WSL), you should follow the Linux instructions to install Helm CLI.
Now that we have all the pre-requisites, we can jump into the installation of the Loki stack and the tools it will interact with. To simplify the process, I created a few definitions to help us out. They are stored in the vfarcic/devops-catalog-code repository, so let’s clone it. If you are a Windows user, I will assume that you are running the commands from a Bourne Again Shell (Bash) or a Z Shell (Zsh) and not PowerShell. That should not be a problem if you followed the instructions on setting up Windows Subsystem for Linux (WSL) explained in the Setting Up A Local Development Environment chapter. If you do not like WSL, a Bash emulator like GitBash should do. If none of those is an acceptable option, you might need to modify some of the commands in the examples that follow.
git clone \
    https://github.com/vfarcic/devops-catalog-code.git
Don’t worry if git clone threw an error stating that the destination path 'devops-catalog-code' already exists and is not an empty directory. You likely already have it from the previous exercises.
Next, we’ll get inside the local copy of the repo and pull the latest revision, just in case you cloned it before and I made some changes in the meantime.

cd devops-catalog-code

git pull
The definitions we’ll use are in the monitoring directory, so let’s get there.

    cd monitoring
We’ll deploy all the tools we’ll use inside the Namespace monitoring, so let’s create it.

    kubectl create namespace monitoring

¹⁸⁴https://helm.sh/docs/intro/install/
Let’s start with the Loki stack. It is available in Grafana’s Helm repo, so the first step is to add it.

    helm repo add loki \
        https://grafana.github.io/loki/charts

    helm repo update
Now we can install it. Default values will do.

    helm upgrade --install \
        loki loki/loki-stack \
        --namespace monitoring \
        --version 0.40.0 \
        --wait
Next, we will install Grafana. This time, we’ll need to tweak the defaults. Specifically, we’ll need to add both Loki and Prometheus as data sources and to enable Ingress so that we can access it. We could set up the data sources manually through the UI, but that would be boring. Besides, I already prepared the YAML with the Helm values we can use for those customizations.

    cat grafana-loki.yaml
The output is as follows.

    ingress:
      enabled: true
    service:
      type: LoadBalancer
    datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
        - name: Loki
          type: loki
          url: http://loki:3100
          access: proxy
          isDefault: true
        - name: Prometheus
          type: prometheus
          url: http://prometheus-server
          access: proxy
We won’t comment on that YAML, except to say that it contains the values we’ll use to configure the Helm deployment of Grafana. Please check the documentation of the Chart if you’re interested in the details. For now, I’ll assume that the values are self-explanatory.

There’s one thing missing from that YAML. It does not contain the ingress.hosts value. I could not know in advance the address through which your cluster is accessible, so we’ll set that one through the --set argument.

We’ll use xip.io¹⁸⁵ since I could not assume that you have a “real” domain that you can use for the exercises or, if you do, that you configured its DNS to point to the cluster.
    helm repo add stable \
        https://kubernetes-charts.storage.googleapis.com

    helm upgrade --install \
        grafana stable/grafana \
        --namespace monitoring \
        --version 5.5.5 \
        --values grafana-loki.yaml \
        --set ingress.hosts="{grafana.$INGRESS_HOST.xip.io}" \
        --wait
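If the value passed to ingress.hosts looks cryptic, the sketch below shows how it is assembled. It is only an illustration: the IP address is a made-up placeholder for whatever your INGRESS_HOST holds, and the variables GRAFANA_HOST and HOSTS_ARG are hypothetical names that do not appear in the exercises.

```shell
# Hypothetical address; in the exercises, INGRESS_HOST comes from your cluster.
INGRESS_HOST=192.168.64.2

# The hostname embeds the IP, which is how xip.io-style names
# can be mapped back to the address of the cluster.
GRAFANA_HOST="grafana.$INGRESS_HOST.xip.io"

# Helm's --set treats {a,b} as a list, so a single host is wrapped in braces.
HOSTS_ARG="ingress.hosts={$GRAFANA_HOST}"

echo "$GRAFANA_HOST"   # → grafana.192.168.64.2.xip.io
echo "$HOSTS_ARG"      # → ingress.hosts={grafana.192.168.64.2.xip.io}
```

The quotes around the value in the helm command serve the same purpose as here: they prevent the shell from interpreting the braces before Helm receives them.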
Two are done, one is left to go. We are missing Prometheus.

    helm upgrade --install \
        prometheus stable/prometheus \
        --namespace monitoring \
        --version 11.12.0 \
        --wait
You might wonder why we also have a Prometheus data source. If that’s the case, keep wondering, since I’ll postpone revealing the reasons for a while longer.
Let’s see what’s running in the monitoring Namespace.

    kubectl --namespace monitoring get pods
The output is as follows.

    NAME                              READY STATUS  RESTARTS AGE
    grafana-...                       1/1   Running 0        2m31s
    loki-0                            1/1   Running 0        5m32s
    loki-promtail-...                 1/1   Running 0        5m32s
    loki-promtail-...                 1/1   Running 0        5m32s
    loki-promtail-...                 1/1   Running 0        5m32s
    prometheus-alertmanager-...       2/2   Running 0        90s
    prometheus-kube-state-metrics-... 1/1   Running 0        90s
    prometheus-node-exporter-...      1/1   Running 0        91s
    prometheus-node-exporter-...      1/1   Running 0        91s
    prometheus-node-exporter-...      1/1   Running 0        91s
    prometheus-pushgateway-...        1/1   Running 0        90s
    prometheus-server-...             2/2   Running 0        90s

¹⁸⁵http://xip.io/
Let’s go through all those components and paint a picture of how they all fit together.

We have a Kubernetes cluster with a control plane and worker nodes. Inside those nodes are Pods running our applications.

On each of the nodes, we installed Loki Promtail Pods (loki-promtail-...) deployed through a DaemonSet. Each of those ships the contents of local logs. It does that by discovering targets using the same service discovery mechanism as Prometheus. It also attaches labels to all the log streams before pushing them to either a Loki instance (as in our case) or to Grafana Cloud.

Further on, we got the aforementioned Loki instance (loki-0). It is the primary server responsible for storing logs and processing queries. We are currently running a single replica, but it could be scaled horizontally to meet whichever needs we might have. What matters is that, in our case, Promtail running in each node is pushing logs to the Loki server.

Next, we got Grafana (grafana-...). Initially, it was designed as a tool to visualize metrics from quite a few sources. But, since version 6, it can also visualize logs. We’ll use it as the user interface to query and visualize information collected across the whole cluster.

The logs themselves are stored in a storage bucket. It can be S3, Azure Storage, Google Cloud Storage, NFS, or any other type of storage.
Figure 5-1: The Loki stack
Together, Loki Server, Loki Promtail, and Grafana serve a purpose similar to ElasticSearch combined with LogStash and Kibana, except that Loki is not a general-purpose database but is focused on logs. Its design is based on past experience of working with similar tools and, hopefully, it managed to avoid some of the pitfalls and provide an improved experience. As such, it is more efficient, requires fewer resources, and is easier to maintain.

In Kubernetes terms, the Loki setup can be boiled down to a log-gathering DaemonSet, a query API Deployment, and a storage bucket. The API can be split into Ingester and Querier, and each can be scaled independently if needed.

Finally, we also installed a few Prometheus components. We got the Prometheus Server (prometheus-server-...) that serves a similar purpose as the Loki server but is focused on metrics. Prometheus is capable of discovering all the apps running in a cluster and scraping metrics, if available. On top of that, it also collects information from Kubernetes itself. In our case, Grafana is already configured to use both the Loki server and Prometheus as data sources.
Figure 5-2: The Loki stack with Prometheus
The Node Exporter (prometheus-node-exporter) provides endpoints for Prometheus to pull system-level data related to nodes. Just like Loki Promtail, it runs as a DaemonSet, resulting in a Pod in each node.

Further on, we got the AlertManager (prometheus-alertmanager-...), which is in charge of sending alerts based on queries in Prometheus. Finally, PushGateway helps Prometheus pull metrics from locations not designed for such a mechanism.

The important thing to note is that we will be exploring only the Loki stack and Grafana. We installed Prometheus mostly to demonstrate that both logs and metrics can coexist. We will not explore it further than that.

The only thing missing, before we see Loki in action, is to deploy a demo application. After all, we need an entity that produces logs. We’ll deploy the go-demo-9 application and use it as a source of the logs we’ll explore soon. You should already be familiar with the app from the previous chapters, so we can go through the setup process quickly and without detailed explanations.
    cd ../helm

    helm repo add bitnami \
        https://charts.bitnami.com/bitnami

    helm dependency update go-demo-9

    kubectl create namespace production

    helm upgrade --install \
        go-demo-9 go-demo-9 \
        --namespace production \
        --wait \
        --timeout 10m
We moved to the directory with the Chart of the application, added the Bitnami repo containing the chart of the MongoDB dependency used by the app, and updated the local copy of the dependencies. Further on, we created the production Namespace and installed the application inside it.

To be on the safe side, we’ll send a request to the newly deployed app to validate that it works and can be accessed.

    curl -H "Host: go-demo-9.acme.com" \
        "http://$INGRESS_HOST"
Now we are ready to “play” with Loki.
Playing With The Loki Stack

We made all the preparations. We deployed all the apps, and we applied all the configuration we need to give Loki a spin. Now we are ready to dive in. We will open Grafana since that will be our user interface.

If you are a Linux or a WSL user, I will assume that you created the alias open and set it to the xdg-open command. If that’s not the case, you will find instructions on how to do that in the Setting Up A Local Development Environment chapter. If you do not have the open command (or the alias), you should replace open with echo and copy and paste the output into your favorite browser.

    open http://grafana.$INGRESS_HOST.xip.io/explore
I just realized that xip.io works with local addresses. That’s good news since it means that it can work with Docker Desktop and Minikube. The bad news is that I might have told you otherwise in the previous chapters. It did not work before, and it’s my fault for not checking that lately. I should have known that it had improved since the last time I experimented with it.
We are presented with the login screen. The username is admin, but the default password is not a hard-coded value like admin or 123456. That’s a good thing in general, but it also means that we need to retrieve it. The password is stored in the field admin-password inside the Secret grafana. All we have to do is retrieve it and decode it.

    kubectl --namespace monitoring \
        get secret grafana \
        --output jsonpath="{.data.admin-password}" \
        | base64 --decode ; echo
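The decoding step is needed because Kubernetes keeps the values of a Secret base64-encoded. The sketch below reproduces that round trip locally, without a cluster. The password is made up for the occasion, and PLAIN, ENCODED, and DECODED are hypothetical variable names.

```shell
# Hypothetical password, used only for illustration.
PLAIN="s3cr3t-passw0rd"

# Emulate how the value would be stored in the data field of a Secret.
ENCODED=$(printf '%s' "$PLAIN" | base64)

# Decoding mirrors the last step of the kubectl pipeline above.
DECODED=$(printf '%s' "$ENCODED" | base64 --decode)

echo "$ENCODED"
echo "$DECODED"
```

The trailing echo in the kubectl command only adds a newline after the decoded password, which makes it easier to copy from the terminal.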
Please copy the password and go back to the login screen. Type admin as the username and paste the password. I probably do not need to tell you that the next step is to click the Log in button.

We are presented with the input line (the one next to the drop-down with Log labels selected). That’s where we can type queries, so let’s try something simple.

    {job="production/go-demo-9-go-demo-9"}
If you are familiar with Prometheus, you should have no problems with that syntax. Actually, it’s straightforward enough that you shouldn’t have an issue with it, no matter how familiar you are with Prometheus.

We are asking it to retrieve all the logs belonging to the job called production/go-demo-9-go-demo-9. The job is one of the labels automatically injected by Promtail. You can think of it as the Deployment (go-demo-9-go-demo-9) with a prefix containing the Namespace (production). The two combined provide a unique ID since there cannot be two Deployments in the same Namespace with the same name.

Press the shift and enter keys to execute the query. If keyboard shortcuts are not your thing, click the Run Query button in the top-right corner of the screen.

I will skip reminding you to run queries. Whenever I instruct you to type or change a query, I will assume that you will execute it without me saying anything.
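Since the job label is just the Namespace and the workload name joined with a slash, the selector for any other workload can be assembled the same way. The helper below is a small sketch of that convention; job_selector is a hypothetical function written for this example, not something provided by Loki or Grafana.

```shell
# Hypothetical helper: builds a log stream selector for the job label.
# $1 = Namespace, $2 = name of the workload (e.g., a Deployment)
job_selector() {
  printf '{job="%s/%s"}' "$1" "$2"
}

SELECTOR=$(job_selector production go-demo-9-go-demo-9)

echo "$SELECTOR"   # → {job="production/go-demo-9-go-demo-9"}
```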
You should see all the logs coming from go-demo-9-go-demo-9 running in the Namespace production. Or, to be more precise, you won’t see all the logs, but rather those generated during the last hour. The period of the logs retrieved can be adjusted through the drop-down in the top menu. It is currently set to the Last 1 hour.
Figure 5-3: The list of all the logs belonging to go-demo-9 and generated during the last hour
Feel free to “play” with the drop-down lists and the buttons. Explore it for a while. If you’re out of ideas, you can, for example, expand one of the log entries to see the additional information.

The output contains logs of our application, but we are swarmed with GET request to / lines, so it is hard to see whatever we might want to see. Please modify the query with the snippet that follows.

    {job="production/go-demo-9-go-demo-9"} != "GET request to /"
Voila! The annoying noise is gone. The output is limited to the log entries that do NOT contain GET request to /. We told Loki that we want the log entries with the label job set to production/go-demo-9-go-demo-9, but limited to those that do not contain GET request to /. If you have a programming background, you probably recognized != as the not-equal operator. In a filter expression, it means that the log line does not contain the given string.
You might not see any logs. If that’s the case, you are likely too slow. Those matching the query were generated over an hour ago. Please increase the duration if that’s the case.
Figure 5-4: The list of all the logs belonging to go-demo-9, generated during the last hour, and filtered to output only those that do not contain a specific string
If you are familiar with Linux and would be looking to accomplish a similar outcome based on logs stored in a file, you would probably execute a command similar to the one that follows.

Do NOT run the command that follows. It is meant to show the equivalent “old” way of dealing with logs. You probably do not have the log file it uses, so the command would fail anyway.

    cat /var/log/go-demo-9-go-demo-9.log \
        | grep -v "GET request to /"
That is precisely what we’ve got in Loki, but in a dynamic environment of a Kubernetes cluster with possibly hundreds of replicas and dozens of release versions of our application. We can think of Loki as a solution for “distributed grep” commands.
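If you would like to see the grep equivalent in action without a real log file, the sketch below creates a throwaway one. The log lines are made up for illustration (only the error message resembles what the demo app produces), and LOG_FILE and FILTERED are hypothetical names.

```shell
# Create a temporary file with a few made-up log lines.
LOG_FILE=$(mktemp)
cat > "$LOG_FILE" <<'EOF'
GET request to /
GET request to /demo/hello
ERROR: Something, somewhere, went wrong!
GET request to /
EOF

# Keep only the lines that do NOT contain the string.
# It is a substring match, so "GET request to /demo/hello" is dropped as well.
FILTERED=$(grep -v "GET request to /" "$LOG_FILE")

echo "$FILTERED"   # → ERROR: Something, somewhere, went wrong!

rm -f "$LOG_FILE"
```

The same substring behavior applies to the != filter in the Loki query, which is worth keeping in mind when choosing the string to exclude.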
Now that you’ve got a taste of what Loki does, let’s step back and get a more formal introduction to its syntax and capabilities.
Exploring Loki Query Syntax

As I mentioned earlier, LogQL is a slightly modified subset of the well-known PromQL many of us are using daily. That does not mean that you need to be an “expert” in PromQL, but rather that there is a similarity that some can leverage. Those with no prior knowledge should still have an easy time grasping it. It’s relatively simple.

That being said, I will not teach you LogQL nor PromQL. You can do that yourself through the documentation. Our goal is to evaluate whether Loki might be the right choice for you, so we’ll explore it briefly, without going into details, especially not on the query language level.

Let’s take a look at the last expression, and compare it to PromQL used by Prometheus.

    {job="production/go-demo-9-go-demo-9"} != "GET request to /"
First of all, there is no metric name you might be used to when working with Prometheus. Actually, our beloved some_metric{foo="bar"} is just a shorthand for {__name__="some_metric", foo="bar"}. So, there is not a really big difference there. We are selecting a log stream by specifying labels, just as we do with metrics in Prometheus. That part of the query is called the log stream selector. The usual equal and not equal operators (=, !=) are present, alongside their regex counterparts (=~, !~).

What is notable in Loki is the job label. It is a convenience label that consists of a Namespace and the name of the controller that manages the Pods. As you hopefully know, such controllers include Deployment, StatefulSet, and DaemonSet. Basically, it is the name of the workload we want to investigate.

What is really different is the latter part (!= "GET request to /"), called the filter expression. The log exploration routine usually involves narrowing down the log stream to relevant parts. In the good old days of Linux servers, we used to grep logs. For example, to get all problematic requests for example.com, we would do something like the command that follows.

Do NOT run the command that follows. It is meant to show the equivalent “old” way of dealing with logs.

    cat access.log \
        | grep example.com \
        | grep -v 200
That command would output the content of the access.log file and pipe it to the grep command. In turn, it would search for lines containing example.com, and pipe it further to yet another grep. The latter grep would filter out (remove) all lines that contain 200 (HTTP OK code). The equivalent LogQL query would look like the snippet that follows.
    {job="kube-system/nginx-ingress-controller"} |= "example.com" != "200"
Feel free to type that query in Grafana and execute it. The result should be all the entries containing example.com that do not include 200. In other words, that query returns all entries that mention example.com and contain errors (non-200 response codes). Do not be confused if you do not see any results. That’s normal. It means that the app did not produce error entries that also contain example.com. We’ll simulate errors soon.

The query is not really correct since response codes above 200 and below 400 are also not errors, but let’s not be picky about it.
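For the picky among us, the sketch below runs the grep pipeline against a throwaway file and compares it with a stricter variant that looks at the status code itself. The log format (host, method, path, status) and all the entries are invented for this example, and so are the names ACCESS_LOG, NAIVE, and STRICT.

```shell
# A made-up access log: host, method, path, and response code.
ACCESS_LOG=$(mktemp)
cat > "$ACCESS_LOG" <<'EOF'
example.com GET /index.html 200
example.com GET /old-page 301
example.com GET /missing 404
other.org GET /index.html 500
example.com GET /checkout 500
EOF

# The approach from the text: keep example.com, drop lines containing 200.
# The redirect (301) is reported as if it were a problem.
NAIVE=$(grep -F "example.com" "$ACCESS_LOG" | grep -v "200")

# A stricter variant: treat the last field as the response code
# and keep only codes of 400 and above.
STRICT=$(grep -F "example.com" "$ACCESS_LOG" | awk '$NF >= 400')

echo "$NAIVE"
echo "---"
echo "$STRICT"

rm -f "$ACCESS_LOG"
```

The first result contains three lines (301, 404, and 500), while the second contains only the 404 and 500 entries.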
Up to now, we were executing log queries that are supposed to return log lines. A more recent addition to LogQL are metric queries, which are more in line with PromQL. They calculate values based on log entries. For example, we might want to count log lines or calculate the rate at which they are produced. We can perform such tasks in the Explore mode, but we can also create real dashboards based on metric queries.

Let’s return to Grafana and create a new dashboard.

    open http://grafana.$INGRESS_HOST.xip.io/dashboard/new
You should be presented with a screen with a prominent blue button saying + Add new panel. Click it.

Search for the field next to the drop-down list with the Log labels selected, type the query that follows, and execute it.

    topk(10, sum(rate({job=~".+"}[5m])) by (job))
If you are familiar with PromQL, you might get an idea of what we did. We retrieved the 10 noisiest workloads in our cluster, grouped by the job.

You might see only the result on the far right of the graph. That’s normal since, by default, the graph shows the last six hours, and you likely have Loki running for a much shorter period. If you see “pretty colors” across the whole graph, you are a slow reader or live in Spain, so you took a lunch break and returned four hours later.
Figure 5-5: A graph with metric queries
That is quite noisy, so let’s exclude built-in components by excluding workloads from the kube-system Namespace. Please execute the query that follows.

    topk(10, sum(rate({namespace!="kube-system"}[5m])) by (job))
You should see fewer jobs since those from the kube-system Namespace are now excluded.

That was already quite useful and indistinguishable from a PromQL query. But we are working with log lines, so let’s try to utilize that. Please execute the query that follows.

    topk(10, sum(rate(({namespace!="kube-system"} |= "error")[5m])) by (job))
This time, we excluded all workloads in kube-system (just as before), but we also filtered results to only those containing the word error. In my case, that is only the promtail running in the monitoring Namespace. It probably failed the first time it ran. It depends on the Loki server, which tends to take more time to boot.
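To make the logic of that query more tangible, here is a rough file-based imitation of it. It is only a sketch under generous assumptions: the log lines are invented, the first field of each line stands in for the job label, and we count matching lines instead of calculating a rate over five minutes. STREAMS and TOP are hypothetical names.

```shell
# Made-up log lines; the first field plays the role of the job label.
STREAMS=$(mktemp)
cat > "$STREAMS" <<'EOF'
monitoring/loki-promtail error connecting to loki
monitoring/loki-promtail error connecting to loki
production/go-demo-9-go-demo-9 GET request to /
kube-system/kube-proxy error syncing rules
monitoring/grafana error loading dashboard
monitoring/loki-promtail retrying
EOF

# 1. Exclude the kube-system Namespace (the log stream selector).
# 2. Keep only lines containing "error" (the filter expression).
# 3. Count the lines per job (a stand-in for sum(rate(...)) by (job)).
# 4. Sort by the count and keep the first 10 (a stand-in for topk(10, ...)).
TOP=$(grep -v '^kube-system/' "$STREAMS" \
    | grep "error" \
    | awk '{count[$1]++} END {for (job in count) print count[job], job}' \
    | sort -rn \
    | head -n 10)

echo "$TOP"

rm -f "$STREAMS"
```

With the sample data, the first line of the output is 2 monitoring/loki-promtail, followed by 1 monitoring/grafana. Loki does the same kind of work, except that it does it across all the Pods in the cluster and over a sliding time window.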
That query combined features of both PromQL and LogQL. We used the topk() function from PromQL and the filter expression (|= "error") from LogQL. Please visit the Log Query Language¹⁸⁶ section of the documentation for more info.

Most of the time, we are not interested in application logs. If you’re looking at logs often, either your system is unstable most of the time, or you are bored and cannot afford a subscription to Netflix. Most of us tend to look at logs only when things go wrong. We find out that there is an issue by receiving alerts based on metrics. Only after that, we might explore logs to deduce what the problem is. Even in those cases, logs rarely provide value alone. We can have meaningful insights only when logs are combined with metrics.

In those dire times, when things do go wrong, we need all the help we can get to investigate the problem. Luckily, Grafana’s Explore mode allows us to create a split view. We can, for example, combine results from querying logs with those coming from metrics. Let’s try it out.

    open http://grafana.$INGRESS_HOST.xip.io/explore
Just as before, type the query that follows and press the shift and enter keys to execute it.

    {job="production/go-demo-9-go-demo-9"} |= "ERROR"
As you can probably guess, that query returns logs from Loki. It contains only those with the word ERROR and associated with go-demo-9-go-demo-9 running in the production Namespace.

Next, we’ll create a second view. Please press the Split button in the top-right corner of the page and select Prometheus as the source for that panel (it is currently set to Loki). Type the query that follows into the second view and press the shift and enter keys to execute it.

    sum(rate(http_server_resp_time_count[2m])) by(path)
That query returns metrics from Prometheus. To be more precise, it returns the response rate grouped by the path.

Those two queries might not be directly related. However, they demonstrate the ability to correlate different queries that can even come from various sources. Through those two, we might be able to deduce whether there is a relation between the errors from a specific application and the responses from the whole cluster.

Do not take those as a suggestion that they are the most useful queries. They are not. They are just a demonstration that multiple queries from different sources can be presented together.

¹⁸⁶https://github.com/grafana/loki/blob/master/docs/sources/logql/_index.md
The output of both queries is likely empty or very sparse. That’s normal since our demo app has not been receiving any traffic for a while now. We’ll change that soon.
Figure 5-6: Grafana Explore page with split view without results
Since we are trying to correlate request metrics with errors recorded in logs, we should generate a bit of traffic to make that a bit more meaningful. We will use Siege¹⁸⁷ to storm our application with requests. We’ll do it twice, once with “normal” requests, and once with the endpoint that produces errors.

First, let’s get the baseline address with the path pointing to /demo/hello. That one never returns errors.

    export ADDR=http://go-demo-9-go-demo-9.production/demo/hello
Now we can run Siege. We’ll be sending 10 concurrent requests for 30 seconds.

    kubectl run siege \
        --image yokogawa/siege \
        --generator run-pod/v1 \
        -it --rm \
        -- --concurrent 10 --time 30S "$ADDR"

¹⁸⁷https://github.com/JoeDog/siege
The output is not that important. What matters is that we sent hundreds of requests to the endpoint that should be responding with the status code 200. The availability should be 100%, or slightly lower.

After the Pod is done, go back to Grafana and re-run both queries by clicking the buttons with blue circled arrows or selecting the fields with the queries and pressing the shift and enter keys.

Prometheus should draw a spike, and Loki will show nothing. That is as expected since the query with logs from Loki should display only the entries with errors, and we did not generate any.

Let’s repeat Siege but, this time, with the path /demo/random-error. As you can guess, it generates random errors.

    export ADDR=http://go-demo-9-go-demo-9.production/demo/random-error

    kubectl run siege \
        --image yokogawa/siege \
        --generator run-pod/v1 \
        -it --rm \
        -- --concurrent 10 --time 30S "$ADDR"
And again, after the Pod is finished executing, go to Grafana and refresh both panels. As you can see, the Prometheus graph got a new group of entries for the new path, and Loki returned Something, somewhere, went wrong! messages.
Figure 5-7: Grafana Explore page with split view
If we re-run Siege by sending requests to /demo/random-error several times, we will clearly see the correlation between red bars in Loki’s graph and request rate spikes to a specific path in the Prometheus panel.

As you can see, Loki, combined with Grafana, is a solution for log aggregation that combines a low administration footprint with integration with other observability tools like Grafana and Prometheus.

A lot of capabilities are left for you to explore. You can graph metric queries and set thresholds and alerts in Grafana. If your application is instrumented (if it exposes internal metrics), you can combine them with the logs. You could even add request tracing and teach Loki to parse trace IDs and highlight them as links to a UI specialized in tracing.
Destroying The Resources

We are done with this chapter, so let’s clean up all the tools we installed. There’s probably no need to comment on what we’ll do since it’s similar to what we did at the end of all the previous sections. We’ll just do it.

    helm --namespace monitoring \
        delete grafana loki prometheus

    helm --namespace production \
        delete go-demo-9

    kubectl delete namespace monitoring

    kubectl delete namespace production

    cd ../../
If you created a cluster using one of my Gists, you will find the instructions on how to destroy the cluster at the bottom. Feel free to use those commands unless you plan to keep the cluster up and running for other reasons.
Deploying Applications Using GitOps Principles

Almost everything we do ends up with a deployment of an application or a suite of apps. Today, that looks like an easy task. Virtual machines and, later on, Cloud greatly simplified deployments, and Kubernetes brought them to the next level. All we have to do is execute kubectl apply and voila. The application is running together with all the associated resources. If it needs storage, it is mounted. If it needs to be used by other apps, it is discoverable. Metrics are exposed, and logs are shipped. Our apps became fault-tolerant and scalable. They are deployed without any downtime, and they can roll back automatically if there are potential issues.

We learned that deploying Kubernetes YAML files was not enough. We needed some sort of templating, so we got Helm and Kustomize. We managed to get to the point of executing the deployment commands from continuous delivery pipelines. We advanced a lot.

Only a few years ago, the situation was very different. If we went further back in time and brought today’s processes and tooling with us, it would look like “magic”. Yet, we are still moving forward and constantly improving the way we work, including how we deploy applications.

Today, running commands ourselves is considered a bad practice. No one is supposed to ever deploy anything by executing kubectl apply, helm upgrade, or any other command, especially not in production. Processes are supposed to be run by machines, and not by us.

That might lead you to think that we should define the commands as steps in our continuous deployment pipelines and that those commands should use definitions stored in Git repositories. That is indeed better than what we were doing in the past. It is compliant with some of the core principles behind GitOps. Specifically, we are trying to establish Git as a boundary between the tasks performed by us (humans) and machines.

I will not go into details of what GitOps is, and what it isn’t. I will not explain the principles behind it, nor the main benefits of applying them. I will not dive into GitOps simply because I already did that and published it on YouTube. So, long story short, go and watch What Is GitOps And Why Do We Want It?¹⁸⁸. Come back here when you’re done. It is essential that you do watch it since the rest of the text assumes that you did. Don’t worry. I’ll wait for you to come back.
We are supposed to write code, including declarative definitions of the state of infrastructure and applications. Today, those declarative definitions are usually in the YAML format. So, we define the desired state and store it in Git. That’s where our job ends, and that’s where machines start theirs.

¹⁸⁸https://youtu.be/qwyRJlmG5ew
Once we push something to Git, notifications are sent to tools that initiate a set of processes executed by machines, and without our involvement. Those notifications are usually sent to continuous delivery tools like Jenkins, Spinnaker, and Codefresh. However, that poses a few problems on its own. Or, to give it a positive spin, there is still room for improvement.

Letting Git send webhooks to notify production clusters that there is a change is insecure. It means that Git needs to have access to production clusters. If a CD tool runs inside a production cluster and Git can communicate with it, others could do the same. Therefore, our cluster is exposed.

Now, you could say that your CD tool is not running inside the production cluster. Indeed, that is a better option. It could be somewhere else. You might even be using a CD-as-a-Service solution like, let’s say, Codefresh¹⁸⁹. Still, the problem is the same. If a CD tool runs somewhere else, it’s still almost the same as with Git sending webhooks. Something needs to have access to your cluster or an API of the application that runs in it.

What if I told you that we can configure clusters so that no one has access to them and we can still deploy applications as frequently as we want? What if I say that neither Git, nor a CD platform, nor any other tool should be able to reach your production clusters? What if I say that nobody, including you, should be able to access them beyond, maybe, the initial setup? Wouldn’t that be ideal?

Now, if you’ve been in this industry for a while, you probably remember the days when almost no one had access to production clusters. There were usually a few “privileged” people, and that was about it. Those were horrible times. We had to open JIRA issues and write a ton of documents and justifications of why something should be deployed to production so that one of the “privileged” few would do it. We were deploying once a year or even less frequently. If you remember those times, saying that fewer people and tools should have access to the cluster probably evokes nightmares you were trying to forget.

But I did not say that. I didn’t say that we should go back to reducing the number of people and tools with access to production. I said that nothing and no one should have access. That sounds even worse, if not impossible. It almost certainly feels like a bad idea that is not even doable. Yet, that is precisely what I’m saying. That is what we are supposed to do, and that is what we should do.

We should be defining deployments through a declarative format. We should be defining the desired state and storing it in Git. But there’s much more to deployments than defining state. We also need to be able to describe the state of whole environments, no matter whether those are Namespaces, clusters, or something else. Environments are the backbone of deployments, so let’s spend a few minutes defining what they are and what we need to manage them effectively.
Discussing Deployments And Environments

Let’s start by defining what an environment is. An environment can be a Namespace, a whole cluster, or even a federation of clusters.

¹⁸⁹https://codefresh.io/
That didn’t help, so let’s try a different approach. An environment is a collection of applications and their associated resources. Production can be considered an environment. It can be located in a single Namespace, it could span the whole cluster, or it can be distributed across a fleet of clusters. The scope of what production is differs from one organization to another. What matters is that it is a collection of applications. The same can be said for staging, pre-production, integration, previews, and many other environments. What they all have in common is that there are one or more logically grouped applications.

To manage environments, we need quite a few things. We need to be able to define permissions, restrictions, limitations, and similar policies and constraints. We also need to be able to deploy and manage applications both individually and as a group. If we make a new release of a single independently deployable application (e.g., a microservice), we need to be able to update the state of an environment to reflect the desire to have a new release of an application in it.

On the other hand, an environment is an entity by itself. It fulfills certain expectations when all the pieces are put together. That might be a front-end application that communicates with several backends that use a few databases to store their state. It could be many different things. What matters is that the environment fulfills its purpose only when the whole puzzle is assembled. As such, we need to be able to manage an environment as a whole as well. We might need to deploy multiple applications at, more or less, the same time. We might need to be able to recreate the whole environment from scratch. Or, we might have to duplicate it so that it serves as, let’s say, staging used for testing before promoting a release to production.

Given that we are focused on GitOps principles, we also need to have different sources of truth. The obvious one is an application. A repository that contains an app contains its code, its build scripts, its tests, and its deployment definitions. That allows the team in charge of that application to be in full control and develop it effectively.

On the other hand, given that desired states of whole environments also need to be stored in Git, we tend to use separate repositories. Those usually contain manifests. However, they might not be the type that you are used to. If each repo containing an application has the manifest of that application, it would be silly to copy those same definitions into environment-specific repositories. Instead, environment repos often contain references to all the manifests of individual applications and environment-specific parameters applied to those apps.

An application running in production will almost certainly not be accessible through the same address as the same app running in staging. Scaling patterns might be different, as well. There are many other variations that we can observe from one environment to another. Since everything is defined as code, and the goal is to have the desired state of everything, we must have a mechanism to define those differences. As a result, we tend to have environment-specific repositories that contain the references to manifests of individual applications, combined with all the things that are unique to those environments.
So, as a general rule, we can distinguish the source of truth being split into two groups. We have individual applications, and we have environments. Each group represents different types of the desired state. It’s the logical separation. How we apply that separation technically is an entirely different matter. We might have one repository for each application and one repo for each environment. We might split environments into various branches of the same repo or use directories inside the master branch. We might even go with a monorepo. Personally, I prefer to keep each app and each environment in different repositories. I believe that it provides more decoupling and that it allows teams to work with more independence. Also, I prefer to treat master branches as the only branches that contain the source of truth, and all others as temporary. Still, my personal preferences should not deter you from organizing your work in whichever way fits the best. The only thing that matters is that everything is code. The code is the source of truth. It is the desired state, and it is stored in Git repositories. Your job is to push changes to Git, and it’s up to machines to figure out how to convert the actual state into the desired one. As long as we follow that logic, the rest can differ. I will not hold it against you if you choose to use a monorepo or prefer branches to distinguish environments. As a matter of fact, everything we will explore applies equally no matter whether there are one or a thousand repositories.
Off We Go

We will try to figure out how to apply GitOps principles in their purest form. We will see how we can focus only and exclusively on defining the desired state in Git and letting the processes inside the cluster figure out what to do. We’ll do that without any communication from the outside world, except for the initial setup. At least, that is the intention. We are yet to see whether that makes sense or not.

Initially, I thought to start the journey exploring how we were deploying in the distant past, and then to move into explanations on how we were doing that recently. But I chose to skip all that. I am sure that you already know how to deploy without containers. I’m sure that you have practical knowledge of executing kubectl, helm, or whichever tool you are using. I am positive that you already switched from doing that manually and already automated all that through CD pipelines. So, I’ll skip the past. I’ll go straight into the present and provide a glimpse of the future.
Let me summarize what is waiting for us or, to be more precise, what will be our mission. We will try to set up a system in which no one will ever execute kubectl apply, helm install, or any similar command, at least not after the initial setup. Now, to be clear, when I say no one, I mean precisely that. Neither you nor your colleagues will be running those commands. No one will be able to install or update an application directly. Now, I bet I know what is going through your head. You probably think that I will show you how to tell Jenkins, Spinnaker, Codefresh, or any other continuous delivery tool to do that by defining
those same commands in CD pipelines. Maybe you’re thinking that we will run one of those inside the same cluster, or that we’ll let them control remote agents. If that’s what you’re thinking, you’re wrong. We will neither run those commands ourselves nor create scripts and let some tools run them. We will take a completely different approach, and you’ll love it. I bet that, in the end, you will ask yourself: “why didn’t I do this years ago?” Let’s go down the rabbit hole and see what is at the bottom.
Applying GitOps Principles Using Argo CD

Argo CD¹⁹⁰ describes itself as “a declarative, GitOps continuous delivery tool for Kubernetes.” That is, in my opinion, wrong and misleading. Argo CD is not a “continuous delivery tool”. Much more is needed for it to be able to claim that. Continuous delivery (CD) is the ability to deploy changes to production as fast as possible, without sacrificing quality and security. On the surface, it might sound as if a tool that can deploy changes to production is a CD tool, but that is not the case. Continuous delivery is about automating all the steps from a push to a code repository all the way until a release is deployed to production. As such, Argo CD does not fit that description. I wanted to get that out of the way right away so that you do not end up having false expectations.

Please watch the Continuous Delivery (CD) Is Not What Some Are Trying To Sell You¹⁹¹ video on YouTube if you’d like to get a better understanding of what CD is and what it isn’t.
So, I’ll change the official description of Argo CD into being a “declarative GitOps deployment tool for Kubernetes.”

That initial negativity should not diminish the value of Argo CD. It is one of the best, if not the best tool we have today to deploy applications inside Kubernetes clusters. It is based on GitOps principles, and it is a perfect fit to be a part of continuous delivery pipelines. It provides all the building blocks we might need if we would like to adopt GitOps principles for deployments and inject them inside the process of application lifecycle management.

Now, let me give you a different explanation of what Argo CD is. Argo CD is a tool that helps us forget the existence of kubectl apply, helm install, and similar commands. It is a mechanism that allows us to focus on defining the desired state of our environments and pushing definitions to Git. It is up to Argo CD to figure out how to converge our desires into reality.

That’s all the description I’ll provide for now. We’ll jump straight into examples, and, through them, we’ll discuss the process, the patterns, and the architecture. Let’s go!

¹⁹⁰https://argoproj.github.io/argo-cd/
¹⁹¹https://youtu.be/hxJP1JoG4zM
Installing And Configuring Argo CD

As you can probably guess, we’ll run Argo CD inside a Kubernetes cluster. That is our first requirement. We’ll need a Kubernetes cluster with the NGINX Ingress controller. The address through which we can access Ingress should be stored in the environment variable INGRESS_HOST. There are a few other things we’ll need. We’ll get to them soon. For now, let’s focus on a cluster.

All the commands from this section are available in the 06-01-deploy-argo.sh¹⁹² Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.
As always, I prepared Gists for creating clusters based on Docker Desktop, Minikube, Google Kubernetes Engine (GKE), AWS Elastic Kubernetes Service (EKS), and Azure Kubernetes Service (AKS). Use them, or roll out your own cluster. Just remember that if you choose to go rogue with your own, you might need to change a command or two in the following examples. If you are a Windows user, I will assume that you are running the commands from a Bourne Again Shell (Bash) or a Z Shell (Zsh) and not PowerShell. That should not be a problem if you followed the instructions on setting up Windows Subsystem for Linux (WSL) explained in the Setting Up A Local Development Environment chapter. If you do not like WSL, a Bash emulator like GitBash should do. If none of those is an acceptable option, you might need to modify some of the commands in the examples that follow.
• Docker Desktop: docker-3gb-2cpu.sh¹⁹³
• Minikube: minikube.sh¹⁹⁴
• GKE: gke-simple-ingress.sh¹⁹⁵
• EKS: eks-simple-ingress.sh¹⁹⁶
• AKS: aks-simple-ingress.sh¹⁹⁷
Now that you have a cluster with Ingress and the address stored in the environment variable INGRESS_HOST, we’ll need to ensure that you have the helm CLI. We’ll use it to deploy all the tools we’ll need as well as a demo app. I’m sure you already have it. In case you don’t, please visit the Installing Helm¹⁹⁸ page for the information.

¹⁹²https://gist.github.com/ae00efa6892fcb0b295bbdba73bef3ad
¹⁹³https://gist.github.com/0fff4fe977b194f4e9208cde54c1aa3c
¹⁹⁴https://gist.github.com/2a6e5ad588509f43baa94cbdf40d0d16
¹⁹⁵https://gist.github.com/925653c9fbf8cce23c35eedcd57de86e
¹⁹⁶https://gist.github.com/2fc8fa1b7c6ca6b3fefafe78078b6006
¹⁹⁷https://gist.github.com/e24b00a29c66d5478b4054065d9ea156
¹⁹⁸https://helm.sh/docs/intro/install/
Remember that if you are using Windows Subsystem For Linux (WSL), you should follow the Linux instructions to install Helm CLI.
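If you would like to double-check the requirements before continuing, something along these lines might help. It is only a sketch: the function name missing_prereqs is made up for this example and is not part of the Gists, and it checks nothing beyond the INGRESS_HOST variable and the kubectl and helm CLIs mentioned so far.

```shell
# Hypothetical helper (not part of the book's Gists): prints one line for
# every missing requirement and prints nothing when all are in place.
missing_prereqs() {
    # The address of the Ingress must be available as INGRESS_HOST
    if [ -z "${INGRESS_HOST:-}" ]; then
        echo "INGRESS_HOST is not set"
    fi
    # Both CLIs are used throughout the exercises
    for cli in kubectl helm; do
        if ! command -v "$cli" >/dev/null 2>&1; then
            echo "$cli is not installed"
        fi
    done
}
```

Executing missing_prereqs in the same terminal session should produce no output when everything is ready. It does not verify that the cluster is reachable or that Ingress works.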
Now that we have all the pre-requisites, we can jump into the installation of Argo CD. To simplify the process, I created a few definitions to help us out. They are stored in the vfarcic/devops-catalog-code¹⁹⁹ repository, so let’s clone it.

git clone \
    https://github.com/vfarcic/devops-catalog-code.git

Don’t worry if git clone threw an error stating that the destination path 'devops-catalog-code' already exists and is not an empty directory. You likely already have it from the previous exercises.

Next, we’ll get inside the local copy of the repo and pull the latest revision, in case you cloned it before and I made some changes in the meantime.

cd devops-catalog-code

git pull
Before we proceed, we’ll install argocd CLI. It is not mandatory. We can do everything without it by using kubectl. Still, argocd CLI can simplify some operations. You already have quite a few CLIs, so one more should not be an issue. As you can probably guess, the installation instructions differ from one operating system to another. Please execute the commands that follow if you are using macOS.
brew tap argoproj/tap

brew install argoproj/tap/argocd
Please execute the commands that follow if you are using Linux or Windows with WSL.
¹⁹⁹https://github.com/vfarcic/devops-catalog-code
VERSION=$(curl --silent \
    "https://api.github.com/repos/argoproj/argo-cd/releases/latest" \
    | grep '"tag_name"' \
    | sed -E 's/.*"([^"]+)".*/\1/')

sudo curl -sSL -o /usr/local/bin/argocd \
    https://github.com/argoproj/argo-cd/releases/download/$VERSION/argocd-linux-amd64

sudo chmod +x /usr/local/bin/argocd
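If the grep and sed combination used to populate VERSION looks cryptic, the snippet that follows runs the same pipeline against a small, made-up sample instead of the real GitHub API response. The tag value is a placeholder I invented, not an actual Argo CD release.

```shell
# A trimmed-down, made-up sample of the JSON returned by the GitHub API.
# The tag value is a placeholder, not a real Argo CD release.
SAMPLE_RESPONSE='{
  "tag_name": "v0.0.0-example",
  "name": "v0.0.0-example"
}'

# grep keeps only the line with "tag_name"; sed then replaces the whole
# line with the content of the last pair of double quotes on it.
SAMPLE_VERSION=$(echo "$SAMPLE_RESPONSE" \
    | grep '"tag_name"' \
    | sed -E 's/.*"([^"]+)".*/\1/')

echo "$SAMPLE_VERSION"
```

The output should be v0.0.0-example. Once the real commands are executed, argocd version --client is a quick way to confirm that the binary is in place.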
Now we are ready to install Argo CD. We could install it with kubectl or Kustomize, but, given that we already have helm, and that the demo applications we’ll use are based on Helm charts, it is probably the best choice, at least within the context of our exercises. As you will see later, this might be the last application you’ll ever install by executing ad-hoc commands from a terminal. Once we’re finished exploring Argo CD, you might decide to remove helm from your laptop.

The first step is to create a Namespace where Argo CD will reside.

kubectl create namespace argocd

Next, we need to add the repo with Argo charts to the local Helm repository.

helm repo add argo \
    https://argoproj.github.io/argo-helm

We would not be able to install the chart as-is. We’ll need to make a few tweaks so that it works in our cluster. We’ll do that by providing helm with a few additional values. All but one are already available in a file inside the vfarcic/devops-catalog-code²⁰⁰ repo we cloned earlier. So, let’s take a quick look at what we have.

cat argo/argocd-values.yaml
The output is as follows.
²⁰⁰https://github.com/vfarcic/devops-catalog-code
server:
  ingress:
    enabled: true
  extraArgs:
    insecure: true
installCRDs: false
We are enabling ingress so that we can access Argo CD UI from a browser. Given that we do not have SSL certificates, we will let it know that it is okay to be insecure. If you choose to use Argo CD “for real”, you should not do that. You should be using certificates for all public-facing applications and mutual TLS for internal traffic.
Further on, we set installCRDs to false. Helm 3 removed the install-crds hook, so CRDs need to be installed as if they are “normal” Kubernetes resources. Think of it as a workaround. There’s one thing that is missing from that YAML. It does not contain the server.ingress.hosts value. I could not know in advance the address through which your cluster is accessible, so we’ll set that one through the --set argument. We’ll use xip.io²⁰¹ since I could not assume that you have a “real” domain that you can use for the exercises or, if you do, that you configured its DNS to point to the cluster.
That’s it. Now we’re ready to install Argo CD.

helm upgrade --install \
    argocd argo/argo-cd \
    --namespace argocd \
    --version 2.8.0 \
    --set server.ingress.hosts="{argocd.$INGRESS_HOST.xip.io}" \
    --values argo/argocd-values.yaml \
    --wait
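To make the host value passed through --set less mysterious, here is a small sketch of how it is assembled. The helper argocd_host is a name I made up, and 192.0.2.10 is a documentation-only address standing in for whatever your INGRESS_HOST contains.

```shell
# Hypothetical helper: builds the xip.io host name for a given IP
argocd_host() {
    echo "argocd.$1.xip.io"
}

# 192.0.2.10 is a documentation-only address used here as a stand-in
# for whatever is stored in INGRESS_HOST
EXAMPLE_HOST=$(argocd_host 192.0.2.10)

# Helm treats a value wrapped in curly braces as a list, which is why
# the --set argument uses "{...}"
echo "server.ingress.hosts={$EXAMPLE_HOST}"
```

The output should be server.ingress.hosts={argocd.192.0.2.10.xip.io}. In other words, the chart receives a list with a single host that embeds the IP of the Ingress.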
The process should finish a few moments later, and you should be presented with information on how to access the UI and how to retrieve the initial password. Don’t waste time trying to memorize it. I’ll walk you through it.

Before we start using Argo CD, we might want to retrieve the password that was generated during the installation. It happens to be the same as the name of the Pod, so all we have to do is retrieve the one with the specific label, output the name, and do a bit of cutting to get what we need.

export PASS=$(kubectl --namespace argocd \
    get pods \
    --selector app.kubernetes.io/name=argocd-server \
    --output name \
    | cut -d'/' -f 2)

²⁰¹http://xip.io/
We stored the password in the environment variable PASS. Now we can use it to log in to Argo CD from the CLI.

argocd login \
    --insecure \
    --username admin \
    --password $PASS \
    --grpc-web \
    argocd.$INGRESS_HOST.xip.io

Let’s take a look at the password itself.

echo $PASS

The output, in my case, is as follows.

argocd-server-745949fb6d-p6shn
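If you are wondering what the “bit of cutting” did, the sketch below repeats that step in isolation. It uses the Pod name from my output as sample input, so there is no need for a cluster to try it out. The variable names are mine.

```shell
# --output name returns the resource type and the name separated by a slash
POD_REF="pod/argocd-server-745949fb6d-p6shn"

# cut splits on "/" and keeps the second field, which is the Pod name
POD_NAME=$(echo "$POD_REF" | cut -d'/' -f 2)

echo "$POD_NAME"
```

The output should be argocd-server-745949fb6d-p6shn, the same value we stored in PASS.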
You will probably not be able to remember that password. Even if you could, there is no good reason to waste your brain capacity on such futile memories. Let’s change the password to something else.

argocd account update-password
You will be asked to provide the existing password. Copy and paste the output of echo $PASS. Further on, you will be requested to enter a new password twice. We are finally ready to open Argo CD UI. If you are a Linux or a WSL user, I will assume that you created the alias open and set it to the xdg-open command. If that’s not the case, you will find instructions on doing that in the Setting Up A Local Development Environment chapter. If you do not have the open command (or the alias), you should replace open with echo and copy and paste the output into your favorite browser.
open http://argocd.$INGRESS_HOST.xip.io
Please type admin as the Username and whatever you chose for the Password.
Figure 6-1-1: Argo CD sign in screen
Click the SIGN IN button.

I had issues with the Argo CD UI in Internet Explorer. If you are a Windows user, I strongly suggest switching to Microsoft Edge²⁰² instead.

²⁰²https://www.microsoft.com/en-us/edge

After signing in, you will be presented with the Argo CD home screen that lists all the applications. We have none, so there is not much to look at. We’ll change that soon. Let’s take a look at the Pods that constitute Argo CD.

kubectl --namespace argocd get pods

The output is as follows.

NAME                              READY STATUS  RESTARTS AGE
argocd-application-controller-... 1/1   Running 0        43s
argocd-dex-server-...             1/1   Running 0        43s
argocd-redis-...                  1/1   Running 0        43s
argocd-repo-server-...            1/1   Running 0        43s
argocd-server-...                 1/1   Running 0        43s
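If some of the Pods are not yet running in your cluster, you could repeat the command until they are, or let kubectl do the waiting. The function below is a sketch of the latter. The name wait_for_argocd and the two-minute timeout are my own choices, not something the chart or this chapter prescribes.

```shell
# Hypothetical helper: blocks until every Pod in the argocd Namespace
# reports the Ready condition, or fails after two minutes
wait_for_argocd() {
    kubectl --namespace argocd \
        wait pods \
        --all \
        --for=condition=Ready \
        --timeout=120s
}
```

Executing wait_for_argocd should return once all the Pods are ready, or exit with an error if the timeout is reached first.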
We got a few distinct components. I will not go into details of what they are and what they’re used for. That’s reserved for a different occasion. For now, the important thing to note is that they are all Running, so we can assume that Argo CD was installed correctly.

So far, we have a Kubernetes cluster with Argo CD running inside. It can be accessed through the UI or through Kube API. The only thing left before we try to deploy a few applications is to get out of the local copy of the vfarcic/devops-catalog-code²⁰³ repository.

cd ../
With Argo CD up-and-running, we can move to the more exciting part of the exercises.
Deploying An Application With Argo CD

Now that we have Argo CD up-and-running, let’s take a look at one of the demo applications we will deploy. It’s in the vfarcic/devops-toolkit²⁰⁴ repo.

git clone \
    https://github.com/vfarcic/devops-toolkit.git

cd devops-toolkit

Kubernetes YAML files that define the application are in the k8s directory. Let’s take a peek at what’s inside.

ls -1 k8s

The output is as follows.

deployment.yaml
ing.yaml
service.yaml

²⁰³https://github.com/vfarcic/devops-catalog-code
²⁰⁴https://github.com/vfarcic/devops-toolkit
You can probably guess from the names of those files that there is a Deployment, an Ingress, and a Service. There’s probably no need to look at the content. They are as ordinary and uneventful as they can be.

We could execute something like kubectl apply --filename k8s to deploy that application, but we will not. We’ll take a different approach. Instead of telling Kube API that we’d like to have the resources defined in those files, we will inform Argo CD that there is a Git repository vfarcic/devops-toolkit²⁰⁵ it should use as the desired state of that application. But, before we do that, let’s create a Namespace where we’d like that application to reside.

kubectl create namespace devops-toolkit

Instead of creating the application defined in the k8s directory, we will create an Argo CD app that will contain the address of the repository and the path to the app.

argocd app create devops-toolkit \
    --repo https://github.com/vfarcic/devops-toolkit.git \
    --path k8s \
    --dest-server https://kubernetes.default.svc \
    --dest-namespace devops-toolkit
That was uneventful, wasn’t it? If you retrieve the Pods inside the devops-toolkit Namespace, you’ll see that there are none. We did not yet deploy anything. So far, all we did was establish a relation between the repository where the definition of the application resides and Argo CD. We can confirm that by opening the UI.

open http://argocd.$INGRESS_HOST.xip.io
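If you prefer to stay in the terminal, roughly the same information can be retrieved with the commands wrapped in the function below. Treat it as a sketch. The name show_app_state is made up, and it assumes that the Namespace is named the same as the Argo CD app, which happens to be the case here.

```shell
# Hypothetical helper: shows what Argo CD knows about an app next to
# what is actually running in the Namespace with the same name
show_app_state() {
    # Argo CD's record of the app, including its sync status
    argocd app get "$1"
    # The Pods that currently exist in that Namespace
    kubectl --namespace "$1" get pods
}
```

Executing show_app_state devops-toolkit at this point should show the app as OutOfSync, followed by a message stating that there are no Pods in the Namespace.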
As you can see, there is one application. Or, to be more precise, there is a definition of an application in Argo CD records. But, there is none in the cluster. The reason for that lies in the Status that is, currently, set to OutOfSync. We told Argo CD about the existence of a repository where the desired state is defined, but we never told it to sync it. So, it did not yet even attempt to converge the actual into the desired state.

Now you must feel like a kid in front of a cookie jar. You are tempted to click that SYNC button, aren’t you? Do it. Press it, and let’s see what happens.

²⁰⁵https://github.com/vfarcic/devops-toolkit

Figure 6-1-2: Argo CD synchronization dialog
You’ll see a new dialog window with, among other things, the list of resources that can be synchronized. You can see that it figured out that we have a Service, a Deployment, and an Ingress resource defined in the associated Git repository. Click the SYNCHRONIZE button to start the process. You should be able to observe that the status changed to Progressing. After a while, it should switch to Healthy and Synced.

We could have accomplished the same result through the argocd CLI. Everything we can do through the UI can be done with the CLI, and vice versa. As a matter of fact, we could do everything we need through kubectl as well. Any of the three should work. But, given that I do not want to influence you too much at this early stage, I will be mixing Argo CD UI with argocd and kubectl commands. That way, you will be able to experience different approaches to accomplish the same result. On the other hand, that will save me from being too opinionated and insisting on using only one of those three. That might come later.
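For reference, the CLI route might look like the sketch that follows. The function sync_and_wait is a name I made up. It wraps argocd app sync and argocd app wait so that the command returns only after the app is reported as healthy.

```shell
# Hypothetical helper: the CLI counterpart of clicking SYNC and
# SYNCHRONIZE, followed by waiting for the app to become healthy
sync_and_wait() {
    argocd app sync "$1" \
        && argocd app wait "$1" --health
}
```

Executing sync_and_wait devops-toolkit instead of clicking the buttons should lead to the same Healthy and Synced state.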
Whatever was defined in the k8s directory in the vfarcic/devops-toolkit²⁰⁶ repository is now synced inside the cluster. The actual state is now the same as the desired one, and we can confirm that by listing all the resources in the devops-toolkit Namespace.

kubectl --namespace devops-toolkit \
    get all

The output is as follows.

NAME                   READY STATUS  RESTARTS AGE
pod/devops-toolkit-... 1/1   Running 0        2m1s

NAME                   TYPE      CLUSTER-IP   EXTERNAL-IP PORT(S) AGE
service/devops-toolkit ClusterIP 10.3.253.103 <none>      80/TCP  2m2s

NAME                           READY UP-TO-DATE AVAILABLE AGE
deployment.apps/devops-toolkit 1/1   1          1         2m2s

NAME                               DESIRED CURRENT READY AGE
replicaset.apps/devops-toolkit-... 1       1       1     2m2s

²⁰⁶https://github.com/vfarcic/devops-toolkit
We could do quite a few other things, but we will not, at least not with the current setup. I showed you a quick and dirty way to deploy one application with Argo CD. If we would like to treat each application in isolation and deploy them manually by clicking buttons, we could just as well stop here. But, more often than not, we want to control whole environments by following GitOps principles. We might also want to use pull requests to decide what should be deployed and when. We might want to control who can do what. The problem is that we might have gone in the wrong direction. So, let’s delete the application we deployed and start over.

argocd app delete devops-toolkit

That command deleted the Argo CD application. We can confirm that by going back to the UI.

open http://argocd.$INGRESS_HOST.xip.io

As you can see, the application is gone. We confirmed that it was removed from Argo CD. However, that does not necessarily mean that all the resources of that application are gone as well. Let’s confirm that as well.

kubectl --namespace devops-toolkit \
    get all

The output claims that no resources were found in devops-toolkit namespace. The app is gone. It’s wiped out.

Finally, before we explore a potentially better way to deploy and manage applications with Argo CD, let’s remove the whole devops-toolkit Namespace.

kubectl delete namespace devops-toolkit
There’s one more thing I almost forgot to mention. This application repository also contains a Helm chart located in the directory helm. Let’s take a quick look at what’s inside.

ls -1 helm

The output is as follows.

Chart.yaml
README.md
templates
values.yaml

As you can see, it is a “standard” Helm chart. We won’t go deeper into it than that. Helm is not the subject of this section, and I will assume that you are familiar with it. If you’re not, you probably skipped over the Packaging, Deploying, And Managing Applications section.

The “real” reason I showed you that chart is that we will switch to Helm. I wanted to show you that Argo CD can use Kubernetes YAML files so that you understand that’s possible. It could have been Kustomize or Jsonnet as well. The logic is the same, only the definitions of the applications differ. Anyway, we will switch to Helm, and we will not need to do anything else with the application repo, so let’s get out.

cd ..
That’s it when single applications are concerned. Let’s dive into the “real deal”.
Defining Whole Environments

We need at least three distinct types of definitions. We need to define a manifest for each of the applications we are working on. We already got that. We saw the manifest of the devops-toolkit app. As a matter of fact, we saw two manifests of the same app, one in the “pure” Kubernetes YAML format, and one as a Helm chart. However, we are missing one more. We need manifests of whole environments (e.g., production). Those can be split into two groups. We need a way to define references to all the apps running in an environment with all the environment-specific parameters. We also need environment-specific policies like, for example, the Namespaces in which the apps can run, quotas and limits, allowed types of resources, and so on. All in all, we need:

• Application-specific manifests
• Environment-specific manifests

We already have the first group, so let’s explore the second. While we’re at it, let’s try to automate everything instead of relying on the SYNCHRONIZE and similar buttons. After all, UIs are supposed to help us gain insights, not convert us into button-clicking machines.

So, we need environment-specific manifests. Given that an environment usually contains more than one application, we cannot have those in a repo of one of the apps. So, it makes sense to keep separate repositories for environments. It could be one repo for all the environments. In that case, they can be split into branches or directories. If we’d use directories, we would effectively have a monorepo. Neither of those two strategies is a bad one. Those are as valid as any other, but we will not use them. Instead, we’ll create a separate repository for each environment. As you will see later, it is easy to configure Argo CD to use a different strategy, so do not take one-repo-per-env as the only option. Think of it as my preference, and not much more. I had to choose one for the examples.

I already created a repo with the definition of a production environment, so let’s open it.

open https://github.com/vfarcic/argocd-production
We’ll need to make some changes to the manifests in that repo, and I’m not very eager to let you take control of my repository. So, please fork the repo. If you do not know how to fork a GitHub repo, the only thing I can say is “shame on you”. Google how to do it. I will not spend time explaining that.
Next, we’ll clone the newly forked repository. Please replace [...] with your GitHub organization in the command that follows. If you forked the repo into your personal account, then the organization is your GitHub username.
# Replace `[...]` with the GitHub organization
export GH_ORG=[...]

git clone \
    https://github.com/$GH_ORG/argocd-production.git

cd argocd-production
I already mentioned that we need to find a way to restrict the environment. We might need to be able to define which Namespaces can be used, which resources we are allowed to create, what the quotas and limits are, and so on. We can do some of those things using “standard” Kubernetes resources. For example, we can define resource limits and quotas for each Namespace. I’m sure you know how to do that. But some things are missing in Kubernetes. Specifically, there is no easy way to define policies for a whole environment without restricting it to a single Namespace. That’s where Argo CD Projects come in.

We already used an Argo CD project without even knowing. When we created the first Argo CD application, it was placed into the default Project. That’s similar to Kubernetes Namespaces. If we do not specify any, it is the one called default. That is okay initially, but, as the number of applications managed by Argo CD increases, we might want to start organizing them inside different projects.

So, what are Argo CD Projects? Projects provide a logical grouping of applications. They are useful when Argo CD is used by multiple teams. That becomes evident when we take a look at the features they provide. A Project can restrict what may be deployed through the concept of trusted Git repositories. It can also be used to define in which clusters and Namespaces we are allowed to deploy apps. Further on, it allows us to specify the kinds of permitted objects (e.g., CRDs, Deployments, DaemonSets, etc.) and role-based access control (RBAC).

Now, we will not go through all the possibilities Projects enable, nor through all the permutations. That would take too much time. Instead, we’ll take a look at a definition I prepared, assuming that, later on, you will consult the documentation if you choose to adopt Argo CD. For now, I believe that a simple Project should suffice to give you an idea of how it works.

cat project.yaml
The output is as follows.

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
  finalizers:
  - resources-finalizer.argocd.argoproj.io
spec:
  description: Production project
  sourceRepos:
  - '*'
  destinations:
  - namespace: production
    server: https://kubernetes.default.svc
  - namespace: argocd
    server: https://kubernetes.default.svc
  clusterResourceWhitelist:
  - group: ''
    kind: Namespace
  namespaceResourceBlacklist:
  - group: ''
    kind: ResourceQuota
  - group: ''
    kind: LimitRange
  - group: ''
    kind: NetworkPolicy
  namespaceResourceWhitelist:
  - group: 'apps'
    kind: Deployment
  - group: 'apps'
    kind: StatefulSet
  - group: 'extensions/v1beta1'
    kind: Ingress
  - group: 'v1'
    kind: Service

We are specifying that the production Project can use any of the repositories (sourceRepos set to '*'). Further on, we defined that the applications can be deployed only to two destinations. Those are Namespaces production and argocd. We also white-listed the Namespace as the only cluster-wide allowed resource (clusterResourceWhitelist). On the Namespace level, we are black-listing ResourceQuota, LimitRange, and NetworkPolicy. Within that project, we will not be able to create any of those resources. Finally, we white-listed Deployment, StatefulSet, Ingress, and Service. In this hypothetical scenario, those are the only resources we can create within the Namespaces production and argocd.

Let’s create the project and see what we’ll get.

kubectl apply \
    --filename project.yaml
Please note that we could have accomplished the same outcome through the argocd proj create command. We’re using kubectl apply mostly to demonstrate that everything related to Argo CD can be done by applying Kubernetes YAML manifests. When working with manifests, I feel that there is no sufficient advantage in using the argocd CLI. I tend to use it mostly for observing the outcomes of what Argo CD did, and even that not very often. Still, you should explore the argocd CLI in more detail later and decide how useful it is.
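To give you an idea of what the CLI alternative could look like, here is a sketch that covers only the description, the source repositories, and the destinations from project.yaml. The white-lists and black-lists are left out. The function name is made up, and the flags are written from my recollection of the CLI, so please confirm them with argocd proj create --help before relying on them.

```shell
# Hypothetical helper: a partial CLI counterpart of project.yaml.
# It does NOT configure the resource white-lists and black-lists.
create_production_project() {
    argocd proj create production \
        --description "Production project" \
        --src '*' \
        --dest https://kubernetes.default.svc,production \
        --dest https://kubernetes.default.svc,argocd
}
```

Each --dest value is a pair made of the server and the Namespace, which mirrors the entries in the destinations section of the manifest. There is no need to execute it since we already created the project with kubectl apply.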
To be on the safe side, we’ll list all the Argo CD projects and confirm that the newly created one is indeed there. As you saw from the definition, they are AppProject resources. Knowing that, we can use a typical kubectl get to retrieve them.

kubectl --namespace argocd \
    get appprojects

The output is as follows.

NAME       AGE
default    23m
production 6s

Now, if you get sick of watching the terminal for too long, you can retrieve the projects through the UI as well.

open http://argocd.$INGRESS_HOST.xip.io/settings/projects
You should see the same two projects we saw in the terminal. Please expand production if you’d like to see more info.
Figure 6-1-3: Argo CD project screen
Now that we established that we’ll allow the apps in the Argo CD production project to be deployed to the production Namespace, we should probably create it.
kubectl create namespace production
Next, let’s take a look at the manifests that will define our applications. They are located in the helm directory.
ls -1 helm
The output is as follows.

Chart.yaml
templates
At first glance, this looks like a typical minimalistic Helm chart. Actually, it does not just look like one. It is a Helm chart. What makes it “special” are the resources defined in the templates.
ls -1 helm/templates
The output is as follows.

devops-paradox.yaml
devops-toolkit.yaml
Judging by the names, you can probably guess that those two files define two applications. Let’s take a look at one of those.
cat helm/templates/devops-toolkit.yaml
The output is as follows.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: devops-toolkit
  namespace: argocd
  finalizers:
  - resources-finalizer.argocd.argoproj.io
spec:
  project: production
  source:
    path: helm
    repoURL: https://github.com/vfarcic/devops-toolkit.git
    targetRevision: HEAD
    helm:
      values: |
        image:
          tag: latest
        ingress:
          host: devopstoolkitseries.com
      version: v3
  destination:
    namespace: production
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      selfHeal: true
      prune: true
The definition is accomplishing a similar objective as the argocd app create command we executed earlier. The major difference is that this time, it is defined as a file and stored in Git. By doing that, we are complying with the most important principle of GitOps. Instead of executing ad-hoc commands, we are defining the desired state and storing it in Git.

What matters is that the definition in front of us specifies an Argo CD Application, and not much more. It belongs to the project called production, and it contains a reference to the repository (repoURL) and the path inside it. That is the reference to the location of the definition. It will link the Application to a specific repo. As you will see later, every time we change the content of the helm directory inside the vfarcic/devops-toolkit²⁰⁷ repo, Argo CD will apply those changes. However, that will not happen often. We do not tend to change definitions of our applications frequently, at least not those residing in application repos.

We also got the helm section. Inside it are a few variables that will overwrite those defined in the Chart stored in vfarcic/devops-toolkit²⁰⁸. The idea is to replace the values of variables that are specific to this environment. In this particular case, we have the image.tag. Every time we want to promote a new release to production, we can just change that value and let Argo CD do whatever needs to be done for the state of the production environment to converge to it. The ingress.host value is yet another one that is specific to the environment. Assuming that this application might run in environments other than production (e.g., staging, integration, etc.), we need the ability to define it in each.

Moving on… The destination should be self-explanatory. We want to run that application in the production Namespace inside the local server. In this context, https://kubernetes.default.svc means that it will run in the same cluster as Argo CD.
²⁰⁷https://github.com/vfarcic/devops-toolkit ²⁰⁸https://github.com/vfarcic/devops-toolkit
Then there is syncPolicy. It is set to automated and, within it, selfHeal and prune are set to true. Having syncPolicy.automated means that, no matter the sub-values, Argo CD will sync an application whenever there is a change in manifests stored in Git. Its function is the same as the function of the SYNCHRONIZE button we clicked earlier.

When syncPolicy.automated.selfHeal is set to true, it will synchronize the actual state (the one in the cluster) with the desired state (the Git repo) whenever there is a drift. It will synchronize not only when we change a manifest in a Git repo, but also if we make any changes to the actual state. If we manually change what is running in the cluster, it will auto-correct that. Effectively, if we keep this option enabled, we cannot change anything related to that application directly. It will assume that any manual intervention is unintentional and will undo the changes by converging the actual into the desired state. You, or any other human, will effectively lose the ability to “play” with the live cluster, at least when that application is concerned. The selfHeal option might sound “dangerous”, but it is a good thing to enable. If nothing else, it enforces the rule that no one should change anything in a cluster manually through direct access.
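To make those options more tangible, here is a toy model in plain shell. It is only a sketch under loud assumptions: two directories stand in for the desired state (the Git repo) and the actual state (the cluster), and reconcile is a hypothetical function, not anything Argo CD ships. Its third argument mimics prune: when it is true, files missing from the desired directory are deleted from the actual one.

```shell
# Toy model only. Directories stand in for Git (desired) and the cluster (actual).
# reconcile DESIRED_DIR ACTUAL_DIR PRUNE
reconcile() {
    desired=$1
    actual=$2
    prune=$3
    # Create or overwrite everything that differs from the desired state.
    for f in "$desired"/*; do
        [ -e "$f" ] || continue
        name=$(basename "$f")
        if ! cmp -s "$f" "$actual/$name" 2>/dev/null; then
            cp "$f" "$actual/$name"
            echo "synced $name"
        fi
    done
    # With prune enabled, remove whatever the desired state no longer contains.
    if [ "$prune" = "true" ]; then
        for f in "$actual"/*; do
            [ -e "$f" ] || continue
            name=$(basename "$f")
            if [ ! -e "$desired/$name" ]; then
                rm "$f"
                echo "pruned $name"
            fi
        done
    fi
}
```

In this model, running reconcile only after a change to the desired directory corresponds to plain automated sync. Running it also after someone edits the actual directory by hand corresponds to selfHeal, since the manual edit gets overwritten.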
Finally, when syncPolicy.automated.prune is set to true, Argo CD will sync even if we delete files from the repo. It will assume that the deletion of the files is an expression of the desire to remove the resources defined in them from the live cluster.

Both selfHeal and prune are set to false by default. Actually, even syncPolicy.automated is disabled by default. As a security precaution, the project authors decided that we need to be explicit about whether we want automation and which level of it we desire. Today, we are being brave and going all in. That is the ultimate goal, isn’t it?

Having a single application in the production environment would not allow me to show you some of the features of Argo CD, so we have a second one as well. Let’s take a quick look.
cat helm/templates/devops-paradox.yaml
I will not show the output nor comment on it. It’s almost the same as the first one. It’s there mostly to demonstrate that we can have any number of applications defined in an environment. There’s not much more to it beyond what we saw when we explored the definition of devops-toolkit.

Now we could use a command like helm upgrade --install, and Argo CD would learn about those two apps and start managing them. But we will not do that. Applying that Chart would not produce the effect we are hoping to get. To begin with, it would monitor the repositories vfarcic/devops-toolkit²⁰⁹ and vfarcic/devops-paradox²¹⁰. Those repos do have manifests of the applications, but not the overwrites specific to production.

²⁰⁹https://github.com/vfarcic/devops-toolkit
²¹⁰https://github.com/vfarcic/devops-paradox
In other words, devops-paradox.yaml and devops-toolkit.yaml definitions are referencing app repos, so if we apply them directly, those are the repos that would be synced on changes. If, on the other hand, we make changes to devops-paradox.yaml and devops-toolkit.yaml, those would not be synchronized since there is no reference to the argocd-production repo where those files reside.
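One way to double-check which repository each Application manifest tracks is to pull the repoURL values out of the files. The helper below is a sketch with a made-up name (list_repo_urls). It assumes that each manifest contains a single repoURL entry on its own line, as the definition we explored does.

```shell
# Hypothetical helper: print the repoURL found in each manifest of a directory.
list_repo_urls() {
    dir=$1
    for manifest in "$dir"/*.yaml; do
        [ -e "$manifest" ] || continue
        # Take the value of the first "repoURL:" key in the file.
        url=$(awk '$1 == "repoURL:" { print $2; exit }' "$manifest")
        printf '%s -> %s\n' "$(basename "$manifest")" "${url:-(none)}"
    done
}

# Usage, from the root of the argocd-production repo:
# list_repo_urls helm/templates
```

If every line of the result points to an application repo and none points to argocd-production, nothing is watching the directory where these manifests themselves live.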
Creating An Environment As An Application Of Applications

What we need is to define an Argo CD app of the apps. We’ll create an application that will reference those two applications (or any others we add there later). Those, in turn, will be referencing the “base” app manifests stored in application repositories. I just realized all that might be confusing, so let me show you yet another Argo CD Application definition.
cat apps.yaml
The output is as follows.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production
  namespace: argocd
  finalizers:
  - resources-finalizer.argocd.argoproj.io
spec:
  project: production
  source:
    repoURL: https://github.com/vfarcic/argocd-production.git
    targetRevision: HEAD
    path: helm
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      selfHeal: true
      prune: true
That one is almost the same as the previous Application definition. The major difference is that it references the helm directory of the vfarcic/argocd-production²¹¹ repository. That’s the directory with the Chart that has the applications defined in devops-paradox.yaml and devops-toolkit.yaml files. As a result, Argo CD will synchronize changes to vfarcic/argocd-production²¹² and the repos referenced in devops-paradox.yaml and devops-toolkit.yaml.

However, before we proceed, we need to make a change to that file. If you paid attention, you probably noticed that the repoURL is set to https://github.com/vfarcic/argocd-production.git. That’s my repo. It belongs to the vfarcic organization. We need to change that address to be your fork of that repo, or, to be more precise, to use your GitHub organization. As usual, we’ll use a bit of sed magic for that.
cat apps.yaml \
    | sed -e "s@vfarcic@$GH_ORG@g" \
    | tee apps.yaml
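A word of caution about that pipeline: it reads apps.yaml and writes to it at the same time. tee truncates the file when it starts, so, depending on timing, cat may find it already empty. With a small file it usually works, but if you ever end up with an empty apps.yaml, a variant that goes through a temporary file avoids the race. The sketch below is one possible way to do that. The replace_in_file name is made up, and it assumes that neither the search text nor the replacement contains the @ character.

```shell
# Hypothetical helper: replace every occurrence of a string in a file
# without reading and writing the same file in one pipeline.
replace_in_file() {
    file=$1
    from=$2
    to=$3
    tmp=$(mktemp)
    # Write the result to a temporary file first...
    if sed -e "s@$from@$to@g" "$file" > "$tmp"; then
        # ...and only then copy it over the original, keeping its permissions.
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}

# Usage, from the root of the argocd-production fork:
# replace_in_file apps.yaml vfarcic "$GH_ORG"
```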
Let’s persist those changes by pushing them to the repo.

git add .

git commit -m "Changed the org"

git push
We are finally ready to apply that definition. We do not need to tell Argo CD about all the definitions we explored. By applying the definition of the app of the apps, Argo CD will follow the references set in repoURL fields and figure out all the rest.
kubectl --namespace argocd apply \
    --filename apps.yaml
Let’s see what happened through the UI.
open http://argocd.$INGRESS_HOST.xip.io
We can see that there are three applications. We got production. It is the app of the apps that references the helm directory inside the argocd-production repo. There, in turn, is the Chart that defines two applications. Those are devops-paradox and devops-toolkit. As you already know, those two are also Argo CD apps. Each references an application repo where Deployments, Services, Ingresses, and other resource definitions are stored.

²¹¹https://github.com/vfarcic/argocd-production
²¹²https://github.com/vfarcic/argocd-production

Let’s see what happens if we click the production application box. You should see that, from the Argo CD point of view, the production app consists of two resources. Expand them by clicking the show 2 hidden resources link. The relation should now be clearer. We can see that production consists of devops-paradox and devops-toolkit applications.

In this context, even though Argo CD sees production as yet another application, it is realistically the full production environment. Right now, production consists of two applications. If we pushed additional apps to the helm directory inside the argocd-production repo, those would be synchronized automatically. Similarly, any changes to any of the referenced repositories will be applied automatically to the cluster.

Now, let’s say that we want to see the details of the devops-toolkit app. For example, we might want to see which resources are defined as part of that application, the status of each, and so on. We can do that easily by clicking the Open application icon in devops-toolkit. That’s the one represented as a square with an arrow pointing towards the top-right corner. Don’t be afraid. Click it.

That is the picture worth looking at. We can see that the Argo CD application consists of a Service, a Deployment, and an Ingress. The Service has an endpoint, and the Deployment created a ReplicaSet, which created a Pod.
Figure 6-1-4: Argo CD application view
We can even open the application in a browser by clicking the Open application icon in the ingress resource devops-toolkit-devops-toolkit. You should see the home page of devopstoolkitseries.com²¹³. That’s both good and bad news. On the bright side, it worked. It opened the address defined in Ingress. However, that’s not the address we should have. The devopstoolkitseries.com²¹⁴ URL is where my “real” production of that app is running. Ingress of that app is configured with that domain. So, even though the app runs in your cluster, the Ingress has my hostname.

²¹³https://devopstoolkitseries.com

We’ll use that as an excuse to see how to update applications controlled with Argo CD. But first, let’s see what we got by querying the cluster. For all we know, the Argo CD UI might be “faking” it.
kubectl --namespace production get all
The output is as follows.

NAME                                  READY STATUS  RESTARTS AGE
pod/devops-paradox-devops-paradox-... 1/1   Running 0        5m31s
pod/devops-toolkit-devops-toolkit-... 1/1   Running 0        5m31s

NAME                   TYPE      CLUSTER-IP   EXTERNAL-IP PORT(S) AGE
service/devops-paradox ClusterIP 10.3.241.233 <none>      80/TCP  5m31s
service/devops-toolkit ClusterIP 10.3.243.125 <none>      80/TCP  5m32s

NAME                                          READY UP-TO-DATE AVAILABLE AGE
deployment.apps/devops-paradox-devops-paradox 1/1   1          1         5m31s
deployment.apps/devops-toolkit-devops-toolkit 1/1   1          1         5m31s

NAME                                              DESIRED CURRENT READY AGE
replicaset.apps/devops-paradox-devops-paradox-... 1       1       1     5m32s
replicaset.apps/devops-toolkit-devops-toolkit-... 1       1       1     5m32s
We can see that all the resources were indeed created and that the application’s Pods are indeed running. Given that I already commented on the “issue” with the Ingresses, let’s see what we got.
kubectl --namespace production get ingresses
The output is as follows.

NAME                          HOSTS                   ADDRESS       PORTS AGE
devops-paradox-devops-paradox devopsparadox.com       35.237.185.59 80    7m
devops-toolkit-devops-toolkit devopstoolkitseries.com 35.237.185.59 80    7m1s
Now that’s something we need to change. Those two apps are running in your cluster, but they are using my domains. We’ll change that soon.

²¹⁴https://devopstoolkitseries.com
The best part is that we accomplished all that by executing a single command. All we did was deploy a single resource that defines a single application. Argo CD figured out the rest by following repo references recursively.

Given that we have only two applications, that might not seem like much. But imagine a “real” production with tens, hundreds, or even thousands of applications. The process and simplicity would still be the same. All we would need to do is create a single application of applications. From there on, our job would be to push changes to manifests in associated repositories and let Argo CD do the rest.

Let’s see what managing applications through Argo CD looks like beyond the initial setup.
Updating Applications Through GitOps Principles

As we already discussed, any change to the cluster should start with a change of the desired state and end with the push of that change to Git. After that, it’s up to the system to converge the actual into the desired state. We already have such a system set up, and all we’re missing is to see it in action.

We are going to correct two potential issues. As we already discussed, we’ll need to change the address of the applications. On top of that, and assuming that you paid attention, both apps are currently using the latest tag of their respective images. That’s bad. One should never do that, except when running in a local development environment. We need to be specific, or we risk losing control of what is running where.

With those two issues in mind, we can fix our problems by changing latest to a specific version and devopstoolkitseries.com to a domain that will be pointing to your cluster. Given that both the tag and the Ingress host are defined as Helm variables, we do not need to touch the manifest in the application’s repo. More importantly, those two values are specific to the production environment, so the logical place to apply those changes is in the manifest inside the argocd-production repo.

We are already inside the local copy of that repo in our terminal session, so we can simply overwrite those two variables with new values. Since you already know that I prefer changing files through commands, we’ll resort to sed magic one more time.
cat helm/templates/devops-toolkit.yaml \
    | sed -e "s@latest@2.9.17@g" \
    | sed -e "s@devopstoolkitseries.com@devops-toolkit.$INGRESS_HOST.xip.io@g" \
    | tee helm/templates/devops-toolkit.yaml
The output is as follows.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: devops-toolkit
  namespace: argocd
  finalizers:
  - resources-finalizer.argocd.argoproj.io
spec:
  project: production
  source:
    path: helm
    repoURL: https://github.com/vfarcic/devops-toolkit.git
    targetRevision: HEAD
    helm:
      values: |
        image:
          tag: 2.9.17
        ingress:
          host: devops-toolkit.35.237.185.59.xip.io
      version: v3
  destination:
    namespace: production
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      selfHeal: true
      prune: true
As you can see, we updated the Argo CD Application that is defined and stored in the local copy of the argocd-production repository. We did not touch the “real” application manifest in the vfarcic/devops-toolkit²¹⁵ repo. We are focusing on the changes to the devops-toolkit app and leaving devops-paradox intact. Changing one application in production should be enough to demonstrate how everything works.
Let’s push those changes and see what happens.
²¹⁵https://github.com/vfarcic/devops-toolkit
git add .

git commit -m "New release"

git push
We changed the tag of the image from latest to 2.9.17 and updated the Ingress host. We did not touch the live system. We only made changes to a file and pushed it to the Git repository. The rest should be done by gremlins working under the hood of Argo CD.

Let’s see which image the container running inside the cluster is currently using.
kubectl --namespace production get \
    deployment devops-toolkit-devops-toolkit \
    --output jsonpath="{.spec.template.spec.containers[0].image}"
The output, in my case, is as follows.
vfarcic/devops-toolkit-series:latest
As you can see, it is still the latest tag. That can be explained in two ways. Maybe something went wrong, or maybe I was too impatient. For now, we’ll assume that it is the latter case, so I’ll wait for a while. If you’re getting the same output, I suggest you take this opportunity to stretch your legs. Take a short walk, go grab a coffee, or check your emails.

After a while, we can repeat the same command hoping that “gremlins” did their job.
kubectl --namespace production get \
    deployment devops-toolkit-devops-toolkit \
    --output jsonpath="{.spec.template.spec.containers[0].image}"
The output, in my case, is as follows.
vfarcic/devops-toolkit-series:2.9.17
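Repeating the command by hand until the tag changes works, but the waiting can also be scripted. The helper below is a sketch, not part of any of the tools we use here. wait_for_output is a hypothetical function that reruns any command until its output matches the expected string or the attempts run out.

```shell
# Hypothetical helper: rerun a command until it prints the expected value.
# wait_for_output EXPECTED ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for_output() {
    expected=$1
    attempts=$2
    delay=$3
    shift 3
    i=1
    while [ "$i" -le "$attempts" ]; do
        # A failing command counts as "no output yet" rather than an error.
        actual=$("$@" 2>/dev/null) || actual=""
        if [ "$actual" = "$expected" ]; then
            echo "matched after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "gave up after $attempts attempt(s); last output: $actual" >&2
    return 1
}

# Usage with the command from the text (30 attempts, 10 seconds apart):
# wait_for_output "vfarcic/devops-toolkit-series:2.9.17" 30 10 \
#     kubectl --namespace production get \
#     deployment devops-toolkit-devops-toolkit \
#     --output jsonpath="{.spec.template.spec.containers[0].image}"
```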
As you can see, the tag of the image used by the container changed. Now it is 2.9.17, thus proving that Argo CD found out that we changed the desired state and synced the state of the cluster.

We also changed the definition of the Ingress host. Let’s see whether that was applied as well.
kubectl --namespace production get ingresses
The output is as follows.

NAME                          HOSTS                               ADDRESS       PORTS AGE
devops-paradox-devops-paradox devopsparadox.com                   35.237.185.59 80    10m
devops-toolkit-devops-toolkit devops-toolkit.35.237.185.59.xip.io 35.237.185.59 80    10m
We can see that, at least in my case, the host of devops-toolkit is now set to devops-toolkit.35.237.185.59.xip.io.

To be on the safe side, let’s open the newly assigned host in a browser and confirm that we can indeed access devops-toolkit running inside our cluster.
open http://devops-toolkit.$INGRESS_HOST.xip.io
You should see the page with the books and the courses. Unlike before, this time, you are seeing the response from the app running inside your cluster.

We saw that we can create and update resources in an environment. How about removing something? What happens, for example, if we remove the whole devops-paradox.yaml file from the Git repo?
rm helm/templates/devops-paradox.yaml

git add .

git commit -m "Removed DOP"

git push
Let’s go back to the UI and observe what’s going on.
open http://argocd.$INGRESS_HOST.xip.io
Initially, you should see all three apps (production, devops-toolkit, and devops-paradox). Go stretch your legs. By the time you come back, devops-paradox should disappear.
Figure 6-1-5: Argo CD home screen with an environment and an application
We can confirm that devops-paradox is indeed gone by retrieving all the pods from the production Namespace.
kubectl --namespace production get pods
The output is as follows.

NAME                              READY STATUS  RESTARTS AGE
devops-toolkit-devops-toolkit-... 1/1   Running 0        7m50s
It’s gone as if it never existed.

One of the essential things to note is that we did not have to execute commands like kubectl apply or helm upgrade beyond the initial creation of a few resources. We also did not create a webhook so that Git can notify some in-cluster processes that there were changes to the source code. We could have blocked all incoming traffic to the cluster, and everything would still work. We would still be able to change any aspect of the production or any other environment.

It all happened without any outside intervention because Argo CD is based on a “pull model”. Instead of waiting for notifications that the desired state changed, it is monitoring the repositories and waiting for us to make a change. That makes it a much more secure solution given that we can block all ingress traffic.

That’s it. The process is simple yet very powerful. Even though Argo CD is a very misleading name, it does a great job. It’s not doing CD, as the name suggests. Instead, it is in charge of deployments through GitOps principles. It is a piece of the puzzle that makes application lifecycle automation much easier. From now on, we do not need to worry about figuring out what to deploy and how to do it. All that is expected from us is to modify manifests and push them to Git repositories. Everything else related to deployments will be handled by Argo CD.
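The pull model can be reduced to a very small loop: look at the source of truth, compare it with what was seen last time, and act only when something changed. The following is a toy sketch of that idea, not a description of how Argo CD is implemented. As assumptions, a local directory stands in for the Git repo, a checksum stands in for a commit ID, and the fingerprint and poll_once names are made up.

```shell
# Toy model only: a directory plays the role of the Git repo.
# Produce a single value that changes when any file is edited, added, or removed.
fingerprint() {
    (cd "$1" && find . -type f -exec cksum {} + | sort | cksum)
}

# poll_once REPO_DIR STATE_FILE
# Prints "changed" when the repo differs from what was recorded last time.
poll_once() {
    repo_dir=$1
    state_file=$2
    current=$(fingerprint "$repo_dir")
    previous=$(cat "$state_file" 2>/dev/null || true)
    if [ "$current" = "$previous" ]; then
        echo "in-sync"
    else
        # Remember what we saw so that the next poll compares against it.
        printf '%s\n' "$current" > "$state_file"
        echo "changed"
    fi
}

# Usage: the process doing the polling only needs outbound access to the repo.
# while true; do
#     [ "$(poll_once ./desired-state .last-seen)" = "changed" ] && echo "time to sync"
#     sleep 60
# done
```

Nothing in that loop listens for incoming requests, which is why blocking ingress traffic does not get in its way.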
Destroying The Resources

We are done with this chapter, so let’s clean up all the tools we installed. There’s probably no need to comment on what we’ll do since it’s similar to what we did at the end of all the previous sections. So let’s just do it.
kubectl delete namespace argocd

kubectl delete namespace production

cd ..
If you created a cluster using one of my Gists, you will find the instructions on how to destroy the cluster at the bottom. Feel free to use those commands unless you plan to keep the cluster up and running for other reasons.
There Is More About GitOps

There is more material about this subject waiting for you on the YouTube channel The DevOps Toolkit Series²¹⁶. Please consider watching one of the videos that follow, and bear in mind the list will likely grow over time.

• What Is GitOps And Why Do We Want It?²¹⁷
• Environments Based On Pull Requests (PRs): Using Argo CD To Apply GitOps Principles On Previews²¹⁸
• Flux CD v2 With GitOps Toolkit - Kubernetes Deployment And Sync Mechanism²¹⁹

²¹⁶https://www.youtube.com/c/TheDevOpsToolkitSeries
²¹⁷https://youtu.be/qwyRJlmG5ew
²¹⁸https://youtu.be/cpAaI8p4R60
²¹⁹https://youtu.be/R6OeIgb7lUI
Applying Progressive Delivery

We will go through progressive delivery, and we will go through it fast so that we can get to the “real deal” quickly. This might be the fastest introduction to a subject, so get ready.
Progressive delivery is a deployment practice that aims at rolling out new features gradually. It enforces the gradual release of a feature while, at the same time, trying to avoid any downtime. It is an iterative approach to deployments.

There you go. That’s the definition. It is intentionally broad since progressive delivery encompasses quite a few practices, like blue-green deployments, rolling updates, canary deployments, and so on. We’ll soon see what it looks like in practice. For now, the most crucial question is not what it is, but why we want it and which problems it solves.

The “traditional” deployment mechanism consists of shutting down the old release and deploying a new one in its place. I call it “big bang” deployments, even though the more commonly used term is “recreate strategy”. The major problem with the “big bang” deployments is downtime. Between shutting down the old release and deploying a new one in its place, there is a period during which the application is not available. That’s the time during which neither the old nor the new release is running. Users do not like that, and business hates the idea of not being able to serve users.

Such downtime might be one of the main reasons why we had infrequent releases in the past. If there is inevitable downtime associated with the deployment of new releases, it makes sense not to do it often. The fewer releases we do during a year, the less downtime caused by new deployments we have. But that is also bad. Users do not like it when a service is not available, but they also do not like not getting new features, nor are they thrilled with the prospect of not having the bugs fixed.

Right now, you might be asking yourself, “why would anyone use the ‘big bang’ deployment strategy if it produces downtime?” There are two common answers to that question. To begin with, you might not know that there are better ways to deploy software. That is an easy problem to solve. All you have to do is continue reading, and you’ll soon find out how to do it better. But, there is a more problematic reason for using that strategy. Sometimes, deploying new releases in a way that produces downtime is the only option we have. Sometimes, the architecture of our applications does not permit anything but the shut-it-down-first-and-deploy-later approach.
To begin with, if an application cannot scale, there is no other option. It is impossible to deploy a new release with zero-downtime without running at least two replicas of an application in parallel. No matter which zero-downtime deployment strategy we choose, the old and new releases will run concurrently, even if only for a few milliseconds. If an application cannot scale horizontally, it cannot have more than one replica.

“Horizontal scaling is easy to solve,” you might say. “All we have to do is set the replicas field of a Kubernetes Deployment or a StatefulSet to a value higher than 1, and voila, the problem is solved.” Right?

To begin with, stateful applications that cannot replicate data between replicas cannot scale horizontally. Otherwise, the effects of scaling would be catastrophic. That means that we either have to change our application to be stateless or figure out how to replicate data. The latter option is a horrible one, and close to impossible to do well. Consistent and reliable replication is hard. It’s so complicated that even some databases have a hard time doing that. Accomplishing that for our apps is a waste of time. It’s much easier and better to use an external database for all the state of our applications, and, as a result, they will become stateless.

So, we might be inclined to think that if an application can scale, it can use progressive delivery. That would be too hasty. There’s more to it than being stateless. Progressive delivery means not only that multiple replicas of an application need to run in parallel, but also that two releases will run in parallel. That means that there is no guarantee which version a user will see, nor which release other applications are communicating with. As a result, each release needs to be backward compatible. It does not matter whether it is a change to the schema of a database, a change to the API contract, a change to the front-end, etc.
If it is going to be deployed without downtime, two releases will run in parallel, even if only for a few milliseconds, and there is no telling which one will be “hit” by a user or a process. Even if we employ feature toggles (feature flags, or whatever else we call it these days), backward compatibility is a must.

Where have we gotten so far? We know that horizontal scaling and backward compatibility are requirements. They are unavoidable. Those are not the only requirements for progressive delivery, but they should be a good start. Others are mostly based on specific processes and architecture of our applications. Nevertheless, we’ll skip commenting on them since they often vary from one case to another. Also, I promised that this will be the shortest introduction to a subject. As you can see, I’m already failing on that promise, so let’s move on.

Optionally, you might need to have continuous delivery or continuous deployment pipelines. You might need to have a firm grasp of traffic management. You might have to invest in observability and alerting. Many things are not strict requirements for progressive delivery, but your life will only become harder without them. That’s not the outcome we should strive for. We’ll go through some of those when we reach the practical examples.

For now, what matters is that progressive delivery is NOT a practice well suited for immature teams. It requires a high level of experience. It might not look that way initially, but when we reach production and, especially, when we’re dealing with a large scale, things can quickly get out of hand if the processes and the tools we are typically using are not accompanied by extensive experience and high maturity. You’ve been warned.

All in all, progressive delivery is an advanced technique of deploying new releases that lowers the risk, reduces the blast radius of potential issues, allows us to test in production, and so on and so forth. I won’t bore you with details, since you’ll see them in action soon.

Now, let’s go back to the initial promise of going through theory fast. There is only one crucial question left to answer before moving to the “fun part”. Which types of progressive delivery do we have?

Progressive delivery is an umbrella for different deployment practices, with only one thing in common. They are all rolling out new releases progressively. That can be over a few milliseconds, a few minutes, or even a few days. The duration varies from one case to another. What matters is that progressive delivery is an iterative approach to the deployment process. It can be rolling updates, blue-green deployments, canary deployments, and quite a few others. They are all variations of the same idea.

I will not even attempt to explain in detail how each of those deployment strategies works. If you are familiar with them, great. If you are not, I prepared a video that you can watch on YouTube. Please visit Progressive Delivery Explained - Big Bang (Recreate), Blue-Green, Rolling Updates, Canaries²²⁰. Through it, I can indeed claim that this was a quick introduction to the subject, and we can go into practical examples right away.

But, before we do, there is a warning. We will not explore the “big bang” (the recreate) strategy since you are either already using it, and hence you know what it is, or you do not need it because you are lucky to work in a company that does not have any legacy application. Similarly, I will not go through the rolling updates strategy because you are almost certainly using it, even if you might not know that you do.
That is the default deployment strategy in Kubernetes. Nearly all the examples from the other sections used it. Finally, we will not explore blue-green deployments because they are pointless today. They made sense in the past when we had static infrastructure, when applications were mutable, when it was expensive to roll back, and so on and so forth. It is the first commonly used progressive delivery strategy that made a lot of sense in the past, but not anymore.

That leaves us with only one progressive delivery strategy worth exploring, and you probably already guessed which one it is. We are about to dive into canary deployments.

If you read the book or watched the Udemy course Canary Deployments To Kubernetes Using Istio and Friends²²¹, you might think that what’s coming will be based on the same material. That’s not the case. We’ll explore a completely different tool. That book/course was based on Flagger, and now we will explore what I believe is a better tool. We’ll dive into canary deployments with Argo Rollouts.

²²⁰https://youtu.be/HKkhD6nokC8
²²¹https://www.devopstoolkitseries.com/posts/canary/
Using Argo Rollouts To Deploy Applications

We are going to explore one of the flavors of progressive delivery using Argo Rollouts²²². So, it probably stands to reason that a quick explanation of the project is in order.

Argo Rollouts provides advanced deployment capabilities. It supports blue-green and canary strategies. Given that I already “trashed” blue-green, we’ll focus on canaries.

Saying that Argo Rollouts does deployments using blue-green or canary strategies would be an understatement. It integrates with Ingress controllers like NGINX and AWS Application Load Balancer (ALB), and with service meshes like Istio and those supporting the Service Mesh Interface (SMI) like, for example, Linkerd. Through them, it can control traffic, making sure that only requests matching specific criteria are reaching new releases. On top of that, it can query metrics from various providers and decide whether to roll forward or to roll back based on the results. Those metrics can come from Kubernetes, Prometheus, Wavefront, Kayenta, and a few other sources.

Long story short, Argo Rollouts is a robust and comprehensive solution that encompasses many different combinations of processes and tools. At the same time, it is very simple and intuitive to use.

Now, I could continue for a long time explaining the benefits and the downsides of Argo Rollouts. We could start comparing it with other solutions, or debate for a while why we should adopt it instead of others. But I will not do any of those things. Instead, we’ll jump straight into practical examples. Once you get the “feel” of how it works, you should be able to decide whether it is a tool worth adopting.

I am assuming that you are looking for hands-on examples instead of lengthy theory. Given that this is not a conversation and that you cannot tell me that I’m wrong, I’ll have to assume that you agree with me. So, let’s get going.
Installing And Configuring Argo Rollouts

I already mentioned that Argo Rollouts can integrate with Ingress or with service meshes. Given that a service mesh allows a few additional possibilities, and that Istio is probably the most commonly used one, we’ll choose it for the examples. That does not mean, in any form or way, that you must use Istio. All the examples we’ll explore can be easily applied to other solutions with only a few trivial tweaks.

²²²https://argoproj.github.io/argo-rollouts/

As you can guess, we’ll need a Kubernetes cluster. Given that I already stated that the examples will use Istio, we’ll need to install it as well. I tested all the examples with Istio 1.7.3, but other versions should work as well. Just bear in mind that I did not test them, so you might need to make a few tweaks.

The examples that follow will assume that you created the environment variable ISTIO_HOST with the IP through which Istio Gateway is accessible.

That’s it. Those are all the requirements. We’ll need a Kubernetes cluster with Istio and the environment variable ISTIO_HOST.

All the commands from this section are available in the 07-01-progressive-argo-rollouts.sh²²³ Gist. Feel free to use it if you’re too lazy to type. There’s no shame in copy & paste.
As always, you can create what we need yourself, or you can use one of the Gists I prepared. If you choose the latter, they are as follows.

• Docker Desktop: docker-istio.sh²²⁴
• Minikube: minikube-istio.sh²²⁵
• GKE: gke-istio.sh²²⁶
• EKS: eks-istio.sh²²⁷
• AKS: aks-istio.sh²²⁸

If you are a Windows user, I will assume that you are running the commands from a Bourne Again Shell (Bash) or a Z Shell (Zsh) and not PowerShell. That should not be a problem if you followed the instructions on setting up Windows Subsystem for Linux (WSL) explained in the Setting Up A Local Development Environment chapter. If you do not like WSL, a Bash emulator like GitBash should do. If none of those is an acceptable option, you might need to modify some of the commands in the examples that follow.
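Since every example that follows depends on ISTIO_HOST, it might be worth confirming that the variable is set before going further. What follows is only a minimal sketch. The require_env function is a hypothetical helper made up for this illustration. It is not part of Istio, Argo Rollouts, or the Gists.

```shell
# Sketch: fail early when a required environment variable is missing.
# `require_env` is a hypothetical helper, not part of any tool used in this chapter.
require_env() {
  name="$1"
  # Indirect lookup that also works in plain POSIX shells (no bash-only ${!name}).
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "ERROR: environment variable $name is not set" >&2
    return 1
  fi
  # Print the variable so that we can eyeball the value.
  echo "$name=$value"
}
```

Running require_env ISTIO_HOST should print the variable together with its value, or complain and return a non-zero exit code if it is empty.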
Argo Rollouts offers a kubectl plugin that might come in handy, so let’s install it as well. Please execute the command that follows if you are using macOS.

brew install \
    argoproj/tap/kubectl-argo-rollouts

²²³https://gist.github.com/4e75e84de9e0f503fb95fdf312de1051
²²⁴https://gist.github.com/a3025923ad025215fe01594f937d4298
²²⁵https://gist.github.com/1ab5f877852193e8ebd33a97ae170612
²²⁶https://gist.github.com/d5c93afc83535f0b5fec93bd03e447f4
²²⁷https://gist.github.com/2ebbabc3ff515ed27b2e46c0201fb1f8
²²⁸https://gist.github.com/2ec945256e3901fee1a62bb04d8b53b0
Please execute the commands that follow if you are using Linux or Windows with WSL.
curl -LO \
    https://github.com/argoproj/argo-rollouts/releases/download/v0.9.1/kubectl-argo-rollouts-linux-amd64

chmod +x kubectl-argo-rollouts-linux-amd64

sudo mv ./kubectl-argo-rollouts-linux-amd64 \
    /usr/local/bin/kubectl-argo-rollouts
To be on the safe side, we can validate that the plugin indeed works as expected by listing all the additional commands we just added to kubectl.

kubectl argo rollouts --help

The output, limited to the Available Commands, is as follows.

...
Available Commands:
  abort       Abort a rollout
  create      Create a Rollout, Experiment, AnalysisTemplate, ClusterAnalysisTemplate, or AnalysisRun resource
  get         Get details about rollouts and experiments
  help        Help about any command
  list        List rollouts or experiments
  pause       Pause a rollout
  promote     Promote a rollout
  restart     Restart the pods of a rollout
  retry       Retry a rollout or experiment
  set         Update various values on resources
  terminate   Terminate an AalysisRun or Experiment
  version     Print version
...
There’s only one more thing missing. We need to install Argo Rollouts inside the Kubernetes cluster.
kubectl create namespace argo-rollouts

kubectl --namespace argo-rollouts apply \
    --filename https://raw.githubusercontent.com/argoproj/argo-rollouts/stable/manifests/install.yaml
Now we have everything we need. Let’s deploy a demo application and see Argo Rollouts in action.
Exploring Argo Rollouts Definitions

We will continue using the devops-toolkit application that we used in a few examples before. If you happened to skip the sections that use that app, the first step is to clone the repository with the code.

git clone https://github.com/vfarcic/devops-toolkit.git
Let’s get inside the local repo and pull the latest version, just in case you cloned it before and I changed something in the meantime.

cd devops-toolkit

git pull
Just as before, the whole application definition is in the helm directory. It contains the templates of all the definitions that we’ll need, as well as a few that we’ll ignore given that they are used in other examples. Everything directly related to Argo Rollouts is in the rollout.yaml file, so let it be the first one we’ll look at. While it might be easier to explore Argo Rollouts through “pure” Kubernetes YAML, I believe that it is better to use Helm templates since they will allow us to apply different variations of the strategies by changing a few values instead of creating new definitions.
cat helm/templates/rollout.yaml
The output, limited to the relevant parts, is as follows.
{{- if .Values.rollout.enabled }}
---

apiVersion: argoproj.io/v1alpha1
kind: Rollout
...
spec:
  ...
  strategy:
    canary:
      canaryService: {{ template "fullname" . }}-canary
      stableService: {{ template "fullname" . }}
      trafficRouting:
        istio:
          virtualService:
            name: {{ template "fullname" . }}
            routes:
            - primary
      steps:
{{ toYaml .Values.rollout.steps | indent 6 }}
      {{- if .Values.rollout.analysis.enabled }}
      ...
      {{- end }}

{{- if .Values.rollout.analysis.enabled }}
---

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
...
{{- end }}

---

apiVersion: v1
kind: Service
metadata:
  name: {{ template "fullname" . }}-canary
  labels:
    chart: "{{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}"
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
    name: http
  selector:
    app: {{ template "fullname" . }}
{{- end }}
There are three kinds of definitions in that file. We have the Rollout, the AnalysisTemplate, and the Service.

The Rollout is almost the same definition as what we’d expect from a Kubernetes Deployment. As a matter of fact, everything that we can define as a Deployment can be defined as a Rollout, with a few differences. The apiVersion should be argoproj.io/v1alpha1 and the kind should be Rollout. Those are the obvious differences. Aren’t they?

The more important change, when compared with the Deployment, is the addition of two new strategies. Instead of the typical Recreate and RollingUpdate strategies, Rollout supports the spec.strategy.blueGreen and spec.strategy.canary fields. The current example uses canary but, as you will see soon, it could be easily changed to blue-green if that’s what you prefer (and you ignored me when I said that it is pointless).

Inside the spec.strategy.canary entry are a few fields that tell the Rollout how to perform the process. There is a reference to the canaryService. That’s the one it will use to redirect part of the traffic to the new release. Further on, we have the stableService, which is where the bulk of the traffic goes. It’s the Service used for the release that is rolled out fully. Next, we have trafficRouting which, in our case, contains the reference to the istio.virtualService. We’ll see later what Rollout does with that VirtualService and how it controls the traffic through it. Then we have the steps which are set to the Helm value rollout.steps. We’ll explore them soon when we take a closer look at the Helm values file we’ll use. Finally, there is the analysis entry. It is enveloped inside an if conditional based on the rollout.analysis.enabled value. We will disable it initially, so we’ll skip commenting on it now and get back to it later when we enable it.

The second definition in that file is AnalysisTemplate. It is also inside an if statement based on the rollout.analysis.enabled value which, just as with the analysis field in the previous definition, will be set to false. So, we’ll leave the explanation of that whole definition for later.

Finally, we have the Service. The primary Service is defined in service.yaml. We will not go through that Service since it is as simple as it can get, and you are almost certainly already familiar with how Kubernetes Services work. The one we see right now is the Service referenced in the canaryService field of the Rollout. It will be used only during the process of rolling out a new release. For all other cases, the one defined in service.yaml will be used. As a matter of fact, the only difference between the Service defined in rollout.yaml and the one in service.yaml is in the name.
There are a few other changes we might need to make to the typical definitions. For example, the target of a HorizontalPodAutoscaler needs to reference the Rollout instead of a Deployment or whichever other resource we might normally use. Let’s take a quick look at the one we will use.

cat helm/templates/hpa.yaml

The output is as follows.

{{- if .Values.hpa }}
---

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
...
spec:
  minReplicas: 2
  maxReplicas: 6
  scaleTargetRef:
    {{- if .Values.rollout.enabled }}
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    {{- else }}
    apiVersion: apps/v1
    kind: Deployment
    {{- end }}
    name: {{ template "fullname" . }}
  targetCPUUtilizationPercentage: 80
{{- end }}
That is the same HorizontalPodAutoscaler we would use for any other type of application. The only difference is in the apiVersion and kind fields set inside spec.scaleTargetRef. In this specific case, we have different values depending on whether it is a Rollout or a Deployment. As you can probably guess, we’ll have the value rollout.enabled set to true, so the kind field will be set to Rollout.

Finally, the last set of definitions are those specific to Istio. Let’s output them.

cat helm/templates/istio.yaml
The output, limited to the relevant parts, is as follows.
{{- if .Values.istio.enabled }}
---

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
...
spec:
  ...
  http:
  - name: primary
    route:
    - destination:
        host: {{ template "fullname" . }}
        port:
          number: 80
      weight: 100
    - destination:
        host: {{ template "fullname" . }}-canary
        port:
          number: 80
      weight: 0
...
{{- end }}
The only important field, in the context of Argo Rollouts, is the route with two destinations inside the VirtualService. Each is referencing one of the Services. That’s typical for Istio, so we won’t spend time on them, except to note that the weight of the primary destination is set to 100, while the weight of the one with the -canary suffix is 0. That means that all the traffic will go to the primary Service which will be the release that is fully rolled out. Later on, we’ll see how Argo Rollouts manipulates that field at run time during the process of canary deployments.

Finally, we have the values.yaml that defines all the variables we can use to fine-tune the behavior of that application.

cat helm/values.yaml
I will not comment on that output since we will not use most of it. Those are the default values that we can consider “production ready”, even though this is only a demo application. As a matter of fact, we will use those default values later when we gain enough confidence to fully automate the process. For now, I want us to go through a simpler scenario defined through another set of values that will overwrite the default ones.
Deploying The First Release

Now that we explored the Rollout definition and the associated resources, let’s take a look at the specific customization we’ll use for the first release. The goal is to start simple, and progress towards a more complicated setup later. Simple, in this context, means manual approvals for rolling forward releases.

Let’s take a look at the set of values we’ll use to overwrite those set by default in the chart of the demo app.

cat rollout/values-pause-x2.yaml

The output is as follows.

ingress:
  enabled: false
istio:
  enabled: true
hpa: true
rollout:
  enabled: true
  steps:
  - setWeight: 20
  - pause: {}
  - setWeight: 40
  - pause: {}
  - setWeight: 60
  - pause: {duration: 10}
  - setWeight: 80
  - pause: {duration: 10}
  analysis:
    enabled: false
We can see that ingress.enabled is set to false. Normally, we might not want to have an Ingress definition if we are using Istio Gateway. However, this app is used for many different examples and is not specific to Argo Rollouts, so we are disabling NGINX Ingress since we will not need it today.

We have istio.enabled set to true. The reason should be obvious. We’ll need Istio for the examples that follow.

Similarly, we’ll use HorizontalPodAutoscaler, so hpa is set to true as well.

Please go back to the definitions in helm/templates to see how those and other variables are actually used. I am assuming that you have at least a basic understanding of Helm and that you should have no problem matching the values with their usage inside the templates.
Now we are getting to the important part. We are enabling the definitions specific to Argo Rollouts by setting the value of rollout.enabled to true.

The “real” action is happening through the rollout.steps entry. It will be injected as spec.strategy.canary.steps values inside the Rollout definition in helm/templates/rollout.yaml. Go back to that file if you need to take another look at it.

The steps can be described as follows.

1. Redirect 20% of the requests to the new release
2. Pause the process indefinitely
3. Redirect 40% of the requests to the new release
4. Pause the process indefinitely
5. Redirect 60% of the requests to the new release
6. Pause the process for 10 seconds
7. Redirect 80% of the requests to the new release
8. Pause the process for 10 seconds
9. Roll out the new release fully and shut down the old one
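To make the meaning of those steps a bit more tangible, here is a toy walk-through written as a shell function. It is only a sketch and has nothing to do with how Argo Rollouts is implemented. The describe_steps name and the simplified one-step-per-line notation (setWeight N, pause, pause N) are assumptions made for this illustration.

```shell
# Toy walk-through of canary steps; this is NOT how Argo Rollouts is implemented.
# Input (stdin): one step per line, in a simplified, made-up notation:
#   "setWeight N"  -> route N percent of the requests to the new release
#   "pause"        -> wait until someone promotes the rollout manually
#   "pause N"      -> wait N seconds, then continue automatically
describe_steps() {
  n=0
  while read -r kind arg; do
    n=$((n + 1))
    case "$kind" in
      setWeight)
        echo "$n. canary=$arg% stable=$((100 - $arg))%"
        ;;
      pause)
        if [ -z "$arg" ]; then
          echo "$n. paused until promoted manually"
        else
          echo "$n. paused for ${arg}s"
        fi
        ;;
      *)
        echo "$n. unknown step: $kind" >&2
        return 1
        ;;
    esac
  done
  # Once all the steps are done, the new release takes over completely.
  echo "$((n + 1)). canary fully rolled out, old release scaled down"
}
```

Feeding it the lines setWeight 20, pause, setWeight 60, and pause 10 should produce five lines, with the second one reporting that the process waits for a manual promotion and the fourth one reporting a ten-second pause.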
The phrase “pause the process indefinitely” might sound confusing. We do not really want to pause forever. Instead, that step means that the process will be paused until we “promote” that release to the next step manually. When the process reaches the pause step without any arguments ({}), it will wait there until we perform a manual action of “promotion” which, in turn, will be a signal for Argo Rollouts to move to the next step of the process.

The steps entry acts in a similar way as the steps of a continuous integration or continuous delivery pipeline, but limited to a deployment process. Steps define what should be done, and in which order.

Finally, the last value sets analysis.enabled to false. We are not yet ready to explore it, so we are disabling it for now.

Let’s deploy the first release of the application and see Argo Rollouts in action.

Before we continue, I must warn you that the command for Minikube will be slightly different from the rest of the Kubernetes flavors. Since Istio Gateway is not accessible through the port 80 on Minikube, we will not be able to use xip.io²²⁹ to simulate a domain. As a result, the app running in Minikube will use the default domain devopstoolkitseries.com. Given that I know that you do not own that domain (I do), we’ll need to inject it into the header of the requests. I’ll explain what needs to be done when we get to that part.
Please execute the command that follows if you are NOT using Minikube.
²²⁹http://xip.io/
helm upgrade --install \
    devops-toolkit helm \
    --namespace devops-toolkit \
    --create-namespace \
    --values rollout/values-pause-x2.yaml \
    --set ingress.host=devops-toolkit.$ISTIO_HOST.xip.io \
    --set image.tag=2.6.2 \
    --wait
Please execute the command that follows if you are using Minikube.
helm upgrade --install \
    devops-toolkit helm \
    --namespace devops-toolkit \
    --create-namespace \
    --values rollout/values-pause-x2.yaml \
    --set image.tag=2.6.2 \
    --wait
That’s it. We deployed the first release of the application defined in the helm directory inside the devops-toolkit Namespace. We used the values defined in the rollout/values-pause-x2.yaml file and we set the tag of the image to 2.6.2. I chose one of the older releases of the application so that we can see, later on, how it behaves when we try to upgrade it.

Please note that we are not doing this the right way. We should have a repository associated with that environment. We should have changed the image.tag value in that repo, and we should probably let Argo CD do the deployment. Instead, we are executing helm upgrade commands. We’re doing that mostly for brevity. I am assuming that you went through the Deploying Applications Using GitOps Principles chapters and that you should have no trouble translating manual execution of the commands into definitions stored in Git and applied through Argo CD or any other similar tool.
We can see the information of the rollout process through the command that follows.

kubectl argo rollouts \
    --namespace devops-toolkit \
    get rollout devops-toolkit-devops-toolkit \
    --watch
You might see the InvalidSpec STATUS for a few moments. Don’t panic! That’s (usually) normal if you were too fast and retrieved the rollout before the process started progressing.
The output is as follows.

Name:            devops-toolkit-devops-toolkit
Namespace:       devops-toolkit
Status:          ✔ Healthy
Strategy:        Canary
  Step:          8/8
  SetWeight:     100
  ActualWeight:  100
Images:          vfarcic/devops-toolkit-series:2.6.2 (stable)
Replicas:
  Desired:       2
  Current:       2
  Updated:       2
  Ready:         2
  Available:     2

NAME                                                       KIND        STATUS     AGE  INFO
⟳ devops-toolkit-devops-toolkit                            Rollout     ✔ Healthy  34s
└──# revision:1
   └──⧉ devops-toolkit-devops-toolkit-849fcb5f44           ReplicaSet  ✔ Healthy  19s  stable
      ├──□ devops-toolkit-devops-toolkit-849fcb5f44-fgl5v  Pod         ✔ Running  19s  ready:1/1
      └──□ devops-toolkit-devops-toolkit-849fcb5f44-klcrh  Pod         ✔ Running  19s  ready:1/1
We can see that the application is Healthy and that all 8 steps of the specified Canary strategy were executed. The ActualWeight is set to 100, meaning that all the requests are being redirected to the release we just deployed. This was the first release we deployed, so the current revision is set to 1.

One of the interesting observations from that output is that there is a ReplicaSet with two Pods. Even though we deployed a Rollout resource, it behaves the same as a Deployment, except for the strategy it uses to roll out new releases. Just as a Deployment would do, the Rollout created a ReplicaSet which, in turn, created a Pod. Later on, the HorizontalPodAutoscaler kicked in and scaled the Rollout to two Pods since that is the minimum number of replicas we defined.
An even more interesting observation is that the Rollout ignored the steps we defined. It did not start by redirecting only twenty percent of the requests to the release, and it did not pause waiting for us to “promote” the process to the next step. Do not panic! We did not forget anything, and we did not find a bug.

When deploying the first release of an application, Argo Rollouts ignores the steps. It would be pointless to start rolling out a new release to a fraction of the users if that is the only release we have. The canary deployment strategy makes sense only when applied to the second release, and all those coming afterward. Argo Rollouts applies the canary strategy only if the application is already running. So, it rolled out the first release fully without any delay, without pauses, and without any analysis. It rolled it out as quickly as it could, completely ignoring that we told it to use the canary strategy.

The previous command is running in the watch mode. It is constantly updating the output to reflect the current state. Nevertheless, nothing new will happen since the application was rolled out and we did not yet instruct it to deploy the second release. So, feel free to stop watching the rollout status by pressing ctrl+c.

We will see the canary deployment in action soon when we initiate the deployment of the second release. But, before we do that, we will confirm that the application is indeed up-and-running and accessible to our users. We’ll do that by opening the app in the default browser.

Please execute the command that follows if you are NOT using Minikube. If you are using it, you’ll need to trust me when I say that the app is indeed running, but is not accessible since it is configured to use the domain devopstoolkitseries.com with DNS entries that do not point to the cluster.
open http://devops-toolkit.$ISTIO_HOST.xip.io
If you are a Linux or a WSL user, I will assume that you created the alias open and set it to the xdg-open command. If that’s not the case, you will find instructions on doing that in the Setting Up A Local Development Environment chapter. If you do not have the open command (or the alias), you should replace open with echo and copy and paste the output into your favorite browser.
We should be able to see the Web app with the books I published a long time ago. It is an old release after all. What matters more is that The DevOps Toolkit: Catalog, Patterns, And Blueprints is not listed. I’m mentioning that since we’ll be able to tell whether we are looking at the new or the old release by observing whether that title is there or not.
Deploying New Releases Using The Canary Strategy

What we did up to now was boring. We haven’t yet seen any advantage of using Argo Rollouts. That is about to change. We will deploy a second release, and that should kick off the canary deployment process. Let’s go!

helm upgrade devops-toolkit helm \
    --namespace devops-toolkit \
    --reuse-values \
    --set image.tag=2.9.9
We changed the tag of the image to 2.9.9 while reusing all the other values.

Let’s watch the rollout and see what’s going on.

kubectl argo rollouts \
    --namespace devops-toolkit \
    get rollout devops-toolkit-devops-toolkit \
    --watch
The output is as follows.

Name:            devops-toolkit-devops-toolkit
Namespace:       devops-toolkit
Status:          ॥ Paused
Strategy:        Canary
  Step:          1/8
  SetWeight:     20
  ActualWeight:  20
Images:          vfarcic/devops-toolkit-series:2.6.2 (stable)
                 vfarcic/devops-toolkit-series:2.9.9 (canary)
Replicas:
  Desired:       2
  Current:       3
  Updated:       1
  Ready:         3
  Available:     3

NAME                                                       KIND        STATUS     AGE    INFO
⟳ devops-toolkit-devops-toolkit                            Rollout     ॥ Paused   2m56s
├──# revision:2
│  └──⧉ devops-toolkit-devops-toolkit-6785bfb67b           ReplicaSet  ✔ Healthy  36s    canary
│     └──□ devops-toolkit-devops-toolkit-6785bfb67b-zrzff  Pod         ✔ Running  36s    ready:1/1
└──# revision:1
   └──⧉ devops-toolkit-devops-toolkit-849fcb5f44           ReplicaSet  ✔ Healthy  2m41s  stable
      ├──□ devops-toolkit-devops-toolkit-849fcb5f44-fgl5v  Pod         ✔ Running  2m41s  ready:1/1
      └──□ devops-toolkit-devops-toolkit-849fcb5f44-klcrh  Pod         ✔ Running  2m41s  ready:1/1
The process executed the first step which set the weight of the new release to twenty percent. After that, it executed the second step which is set to pause: {}. So, the Status changed to Paused, and the process is waiting for us to promote it to the next step.

We can also observe that a new ReplicaSet for the revision:2 was created with, currently, only one Pod. Based on the current traffic (which is none), one is more than enough.

All in all, the new release is receiving twenty percent of the traffic, and the process is paused waiting for manual intervention.

Press ctrl+c to stop watching the rollout.

Let’s send some requests and see whether the new release indeed gets approximately twenty percent of the traffic.

Please execute the command that follows if you are NOT using Minikube.
for i in {1..100}; do
    curl -s http://devops-toolkit.$ISTIO_HOST.xip.io \
        | grep -i "catalog, patterns, and blueprints"
done | wc -l
Please execute the command that follows if you are using Minikube.
for i in {1..100}; do
    curl -s -H "Host: devopstoolkitseries.com" \
        "http://$ISTIO_HOST" \
        | grep -i "catalog, patterns, and blueprints"
done | wc -l
We sent 100 requests to the application and piped each response to the grep command that looks for catalog, patterns, and blueprints. That happens to be the title of the course and the book available only in the release we started deploying. At the end of the loop, we piped the whole output to wc -l, effectively counting the number of lines. As a result, the output should show how many of those 100 requests were sent to the new release.

In my case, the output is 17. If you are wondering why it is not a round number of twenty, you should know that the weight is not a quota. It is applied to each request independently, so a sample of a hundred requests will rarely be split into exactly eighty and twenty. Seventeen is close enough to twenty percent. If we send more requests, the sample would be larger, and the result would be closer to twenty percent.

Let’s see what is happening with the VirtualService we created. As a reminder, it was initially configured to send 100 percent to the main destination, and nothing to canary.

kubectl --namespace devops-toolkit \
    get virtualservice \
    devops-toolkit-devops-toolkit \
    --output yaml
The output, limited to the relevant parts, is as follows.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
...
spec:
  ...
  http:
  - name: primary
    route:
    - destination:
        host: devops-toolkit-devops-toolkit
        ...
      weight: 80
    - destination:
        host: devops-toolkit-devops-toolkit-canary
        ...
      weight: 20
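If the full YAML is too noisy, the weights can be pulled out with a bit of awk. The route_weights function below is a hypothetical helper, not something that comes with kubectl or Istio. It is a sketch that assumes every destination lists host before weight, just as in the output above.

```shell
# Sketch: print "<host> <weight>" for every destination found in VirtualService YAML on stdin.
# Assumes each destination has `host:` before `weight:`, as in the output above.
route_weights() {
  awk '
    $1 == "host:"   { host = $2 }        # remember the most recent destination host
    $1 == "weight:" { print host, $2 }   # emit it together with its weight
  '
}
```

Piping the output of the kubectl get virtualservice command we just executed into route_weights should, at this stage of the rollout, print the primary host with 80 and the canary host with 20.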
We can see that Argo Rollouts changed the weight of each of the destinations. It set the main (the one associated with the old release) to 80 percent, and the canary (the one associated with the new release) to 20 percent.

As we already commented, the process paused waiting for us to promote it to the next step. Let’s roll forward the release.

kubectl argo rollouts \
    --namespace devops-toolkit \
    promote devops-toolkit-devops-toolkit
Just as before, we can watch the rollout to observe what’s going on.

kubectl argo rollouts \
    --namespace devops-toolkit \
    get rollout devops-toolkit-devops-toolkit \
    --watch
The output, limited to the relevant parts, is as follows.

Name:            devops-toolkit-devops-toolkit
Namespace:       devops-toolkit
Status:          ॥ Paused
Strategy:        Canary
  Step:          3/8
  SetWeight:     40
  ActualWeight:  40
...
We can see that the process paused again but, this time, with the weight set to 40 percent. From now on, the new release should be receiving forty percent of the traffic, while the rest should be going to the old release. We can confirm that by sending another round of requests. But, before we do, please stop watching the rollout by pressing ctrl+c.

Please execute the command that follows if you are NOT using Minikube.
for i in {1..100}; do
    curl -s http://devops-toolkit.$ISTIO_HOST.xip.io \
        | grep -i "catalog, patterns, and blueprints"
done | wc -l
Please execute the command that follows if you are using Minikube.
for i in {1..100}; do
    curl -s -H "Host: devopstoolkitseries.com" \
        "http://$ISTIO_HOST" \
        | grep -i "catalog, patterns, and blueprints"
done | wc -l
The output, in my case, is as follows.

38
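Neither 17 nor 38 matches the configured weight exactly, so “close enough” is a judgment call. The sketch below makes that judgment explicit. The within_tolerance function and the tolerance of 5 percentage points used in the example are assumptions of mine, not something Argo Rollouts provides or requires.

```shell
# Sketch: is an observed hit count close enough to the configured canary weight?
# Usage: within_tolerance HITS TOTAL WEIGHT TOLERANCE
# (TOLERANCE is in percentage points; which value is acceptable is an arbitrary choice)
within_tolerance() {
  hits="$1"; total="$2"; weight="$3"; tolerance="$4"
  # Integer percentage of the requests that were served by the new release.
  percent=$(($hits * 100 / $total))
  # Absolute difference between the observed and the configured percentage.
  diff=$(($percent - $weight))
  if [ "$diff" -lt 0 ]; then
    diff=$((0 - $diff))
  fi
  echo "observed=${percent}% expected=${weight}% diff=${diff}"
  # The exit code of the function is the result of this comparison.
  [ "$diff" -le "$tolerance" ]
}
```

As an example, within_tolerance 38 100 40 5 prints observed=38% expected=40% diff=2 and returns success, while a count of 17 checked against a weight of 40 would fail.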
This was the second time the process paused. That was to be expected given that’s what we set as the step after increasing the weight to forty percent. Just as before, we need to unpause the process by executing the promote command. While we’re at it, we’ll run the command to watch the rollout right after unpausing.

kubectl argo rollouts \
    --namespace devops-toolkit \
    promote devops-toolkit-devops-toolkit

kubectl argo rollouts \
    --namespace devops-toolkit \
    get rollout devops-toolkit-devops-toolkit \
    --watch
After a while, the output of the latter command should be similar to the one that follows.
Name:            devops-toolkit-devops-toolkit
Namespace:       devops-toolkit
Status:          ✔ Healthy
Strategy:        Canary
  Step:          8/8
  SetWeight:     100
  ActualWeight:  100
Images:          vfarcic/devops-toolkit-series:2.9.9 (stable)
Replicas:
  Desired:       2
  Current:       2
  Updated:       2
  Ready:         2
  Available:     2

NAME                                                       KIND        STATUS        AGE  INFO
⟳ devops-toolkit-devops-toolkit                            Rollout     ✔ Healthy     12m
├──# revision:2
│  └──⧉ devops-toolkit-devops-toolkit-6785bfb67b           ReplicaSet  ✔ Healthy     10m  stable
│     ├──□ devops-toolkit-devops-toolkit-6785bfb67b-zrzff  Pod         ✔ Running     10m  ready:1/1
│     └──□ devops-toolkit-devops-toolkit-6785bfb67b-4xmxc  Pod         ✔ Running     75s  ready:1/1
└──# revision:1
   └──⧉ devops-toolkit-devops-toolkit-849fcb5f44           ReplicaSet  • ScaledDown  12m
The process continued executing the steps we specified. After we promoted the second pause, it set the weight to 60 percent, paused for 10 seconds, set the weight to 80 percent, and paused for another 10 seconds. After the last step was executed, it rolled out the new release fully, and it scaled down the old one (hence the ScaledDown status).

Let’s confirm that the application was indeed rolled out completely. Press ctrl+c to stop watching the rollout and execute the commands that follow to send a hundred requests to the application.

Please execute the command that follows if you are NOT using Minikube.
Using Argo Rollouts To Deploy Applications 1 2 3 4
393
for i in {1..100}; do curl -s http://devops-toolkit.$ISTIO_HOST.xip.io \ | grep -i "catalog, patterns, and blueprints" done | wc -l
Please execute the command that follows if you are using Minikube.
1 2 3 4 5
for i in {1..100}; do curl -s -H "Host: devopstoolkitseries.com" \ "http://$ISTIO_HOST" \ | grep -i "catalog, patterns, and blueprints" done | wc -l
The output is 100 meaning that every single request got a response from the new release. That should be enough of a confirmation that the new release was rolled out fully. Now that we saw how we can roll forward, let’s explore how we can reverse the process.
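The count printed by the loop above can also be read as a share of the traffic, which is handy when the same check is run in the middle of a canary rollout. What follows is only a sketch: new_release_share is a hypothetical helper, and the numbers passed to it are made-up examples rather than output captured from the cluster.

```shell
# Hypothetical helper: percentage of responses served by the new release.
# $1 = number of responses that matched the grep, $2 = total requests sent.
new_release_share() {
    if [ "$2" -le 0 ]; then
        echo "total must be greater than zero" >&2
        return 1
    fi
    # Integer arithmetic is precise enough for a quick sanity check.
    echo $(( $1 * 100 / $2 ))
}

new_release_share 100 100   # all hundred requests reached the new release
new_release_share 40 100    # roughly what a 40 percent canary weight could produce
```

With a partial rollout the result will only approximate the configured weight, since a hundred requests is a small sample.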
Rolling Back New Releases

Hopefully, every single release will be successful. That is a worthy goal, but also one that we are probably never going to reach. No matter how well we do our job, bad things happen. We might introduce a bug, we might have a “broken” release, or we might forget something critical. Sooner or later, a release will be inadequate and will need to be replaced as soon as possible. That might mean that we will create a new “patch” release and roll it forward, or that we will need to roll back.

Fixing a problematic release by rolling forward is the same as deploying any other release. It’s just a new release that follows the same process. The only difference we might apply is to skip all the steps and go forward to a full rollout of the new release right away. You already know how to do that, since it is the same as what we just did (with, maybe, removing the steps).

While I do prefer to roll forward, rolling back to the previous release is something we have to be prepared for. We should hope that we will never need to go back, but be prepared just in case. “Hope for the best, prepare for the worst” should be the motto.

Let’s see how we can roll back to the previous release. We’ll repeat the same process again, but with a different tag.

helm upgrade devops-toolkit helm \
    --namespace devops-toolkit \
    --reuse-values \
    --set image.tag=2.9.17

kubectl argo rollouts \
    --namespace devops-toolkit \
    get rollout devops-toolkit-devops-toolkit \
    --watch

The output of the latter command is as follows.
Name:            devops-toolkit-devops-toolkit
Namespace:       devops-toolkit
Status:          Paused
Strategy:        Canary
  Step:          1/8
  SetWeight:     20
  ActualWeight:  20
Images:          vfarcic/devops-toolkit-series:2.9.17 (canary)
                 vfarcic/devops-toolkit-series:2.9.9 (stable)
Replicas:
  Desired:       2
  Current:       3
  Updated:       1
  Ready:         3
  Available:     3

NAME                                                     KIND        STATUS        AGE    INFO
devops-toolkit-devops-toolkit                            Rollout     Paused        16m
├──# revision:3
│  └──devops-toolkit-devops-toolkit-7dd4875c79           ReplicaSet  Healthy       45s    canary
│     └──devops-toolkit-devops-toolkit-7dd4875c79-5vbgq  Pod         Running       45s    ready:1/1
├──# revision:2
│  └──devops-toolkit-devops-toolkit-6785bfb67b           ReplicaSet  Healthy       13m    stable
│     ├──devops-toolkit-devops-toolkit-6785bfb67b-zrzff  Pod         Running       13m    ready:1/1
│     └──devops-toolkit-devops-toolkit-6785bfb67b-4xmxc  Pod         Running       4m47s  ready:1/1
└──# revision:1
   └──devops-toolkit-devops-toolkit-849fcb5f44           ReplicaSet  • ScaledDown  15m
As expected, the rollout paused after setting the weight to 20 percent. That is not different from what happened the last time.

Please press ctrl+c to stop watching the rollout.

Now, before you start thinking that we are going to repeat the same process, let me stress that, this time, I want us to see how to roll back. For now, we’ll do it manually, just as we promoted steps manually. Later on, we’ll explore how to automate everything, including rolling forward and rolling back.

So, let’s imagine that we detected that there is an issue with the new release, and that we would like to abort the process and roll back. One way we could do that is through the abort command.

Do NOT run the command that follows. I do not want us to use it. I’m showing it only so that you can see one possible option for rolling back. Better alternatives are coming soon.

kubectl argo rollouts \
    --namespace devops-toolkit \
    abort devops-toolkit-devops-toolkit

The reason I said that we should not execute that command lies in the fact that it would break one of the most important principles of GitOps. The desired state should be stored in Git, and the actual state should converge to it. Right now, the desired state in Git states that we want to have a new release, and the actual state is somewhere in between. Eighty percent of requests are going to the old release, and twenty percent to the new one.

A better way to roll back would be to change the desired state so that it specifies that we want the previous release after all. That means that we should revert a commit in Git, or simply set the image.tag value to whichever version we want to roll back to, and push the change to Git. But, as I already explained, we’re not involving GitOps in this chapter, so we’ll accomplish the same through a helm upgrade command.
helm upgrade devops-toolkit helm \
    --namespace devops-toolkit \
    --reuse-values \
    --set image.tag=2.9.9

kubectl argo rollouts \
    --namespace devops-toolkit \
    get rollout devops-toolkit-devops-toolkit \
    --watch

The output, limited to the relevant parts, is as follows.
Name:            devops-toolkit-devops-toolkit
Namespace:       devops-toolkit
Status:          Healthy
Strategy:        Canary
  Step:          8/8
  SetWeight:     100
  ActualWeight:  100
Images:          vfarcic/devops-toolkit-series:2.9.9 (stable)
...
NAME                                                     KIND        STATUS        AGE    INFO
devops-toolkit-devops-toolkit                            Rollout     Healthy       16m
├──# revision:4
│  └──devops-toolkit-devops-toolkit-6785bfb67b           ReplicaSet  Healthy       14m    stable
│     ├──devops-toolkit-devops-toolkit-6785bfb67b-zrzff  Pod         Running       14m    ready:1/1
│     └──devops-toolkit-devops-toolkit-6785bfb67b-4xmxc  Pod         Running       5m22s  ready:1/1
├──# revision:3
│  └──devops-toolkit-devops-toolkit-7dd4875c79           ReplicaSet  • ScaledDown  80s
└──# revision:1
   └──devops-toolkit-devops-toolkit-849fcb5f44           ReplicaSet  • ScaledDown  16m
We can see that Argo rolled back to the previous release right away. It aborted the rollout progress and went back to the tag 2.9.9. Even though technically we rolled forward, it detected that we want to deploy the same tag as the one used in the previous release and decided that it would be pointless to go through all the steps we specified. It understood that it is a “crisis situation” and did not lose any time on pleasantries.

This approach to rolling forward new releases might be useful for those not yet secure in their processes and automation. It provides manual “approval gates”. We can pause the process indefinitely. We can monitor the behavior of the partial rollout, run tests, and do whatever else we need to do to gain confidence that the new release is indeed a good one. We can even involve management and wait for their approval to continue the process. They can get a big red shiny button with the label “we need your blessing to proceed” that would execute the kubectl argo rollouts promote command. But all that might not be the best way to approach the challenge.

We’ll take a short break from Argo Rollouts, with a quick jump into metrics. We’ll need them if we are ever to fully automate the rollout process. But, before we go there, you should press ctrl+c to stop watching the rollout. We’ll need the terminal to run a few other commands.
Exploring Prometheus Metrics And Writing Rollout Queries

We are about to deploy Prometheus. But, before we do, let’s start generating some traffic so that there are metrics we can explore.

If you are new to monitoring and alerting with Prometheus, you might want to consider getting The DevOps Toolkit: Monitoring, Logging, and Auto-Scaling Kubernetes²³⁰.

²³⁰https://www.devopstoolkitseries.com/posts/devops-25/

We will open a new terminal and create an infinite loop that will be sending requests to the devops-toolkit app. That way, we’ll have a constant stream of metrics related to requests and responses. Let’s start by outputting the Istio Gateway host.

echo $ISTIO_HOST

Please copy the output. We’ll need it soon.

Open a second terminal session.

Now we can re-declare the ISTIO_HOST variable in the new terminal session. We’ll use it for constructing the address to which we will be sending a stream of requests. Please replace [...] with the output of the ISTIO_HOST variable you copied from the first terminal session.

export ISTIO_HOST=[...]

Now we are ready to execute the loop.

Please execute the command that follows if you are NOT using Minikube.

while true; do
    curl -I http://devops-toolkit.$ISTIO_HOST.xip.io
    sleep 1
done

Please execute the command that follows if you are using Minikube.

while true; do
    curl -I -H "Host: devopstoolkitseries.com" \
        "http://$ISTIO_HOST"
    sleep 1
done
You should see a steady stream of responses with 200 OK statuses, with a one-second pause between each.

If you are using WSL, you might see errors like sleep: cannot read realtime clock: Invalid argument. If that’s the case, you have probably encountered a bug in WSL. The solution is to upgrade Ubuntu to 20.04 or a later version. Please stop the loop with ctrl+c, execute the commands that follow, and repeat the while loop command.

sudo apt-mark hold libc6

sudo apt -y --fix-broken install

sudo apt update

sudo apt -y full-upgrade
Now that we are sending requests and, through them, generating metrics, we can deploy Prometheus and see them in action. Please go back to the first terminal session and execute the commands that follow.

helm repo add prometheus \
    https://prometheus-community.github.io/helm-charts

helm upgrade --install \
    prometheus prometheus/prometheus \
    --namespace monitoring \
    --create-namespace \
    --wait

Normally, we would configure the Istio Gateway, NGINX Ingress, or something similar to forward requests to Prometheus based on a domain. However, since Prometheus is not the main subject of this chapter, we’ll take the easier route and port-forward to the prometheus-server Deployment. That will allow us to open it through localhost on a specific port.

kubectl --namespace monitoring \
    port-forward deployment/prometheus-server \
    9090 &

The output is as follows.

Forwarding from [::1]:9090 -> 9090

You might need to press the enter key to be released back to the terminal prompt.

Now we can open Prometheus in the default browser.

open http://localhost:9090
Next, we’ll explore a few metrics and queries. Bear in mind that we are not going to do a deep dive into Prometheus. If that’s what you need, The DevOps Toolkit: Monitoring, Logging, and Auto-Scaling Kubernetes²³¹ might be a good source of info. Instead, we’ll focus only on creating a query that we might want to use to instruct Argo Rollouts whether to move forward or to roll back a release.

²³¹https://www.devopstoolkitseries.com/posts/devops-25/

If we would like to automate rollout decision making, the first step is to define the criteria. One simple, yet effective, strategy could be to measure the error rate of requests. We can use the istio_requests_total metric for that. It provides the total number of requests. Given that Prometheus metrics have labels we can use to filter the results, and that one of those attached to istio_requests_total is response_code, we should be able to distinguish the requests that do not fall into the 2xx range.

Please type the query that follows in the Expression field.

istio_requests_total

Press the Execute button and select the Graph field. You should see a graph with requests processed by Istio. Given that you installed Prometheus only a few minutes ago and, therefore, it started collecting metrics only recently, you might want to adjust the timeframe to 5m or some other shorter duration.
Figure 7-1-1: Prometheus graph with a single metric query
Retrieving raw metrics is not very useful by itself, so we might want to make the query a bit more complex. Instead of querying a metric alone, we should calculate the sum of the rate of requests passing through a specific Service, calculated over a specific interval. We can do that by replacing the existing query with the one that follows.

sum(irate(
  istio_requests_total{
    reporter="source",
    destination_service=~"devops-toolkit-devops-toolkit.devops-toolkit.svc.cluster.local"
  }[2m]
))

Remember to press the Execute button.

That’s not very useful given that our goal is to see the percentage of errors, or to calculate the percentage of successful requests. We’ll choose the latter approach, and for that we’ll need a slightly more elaborate query. We should retrieve the sum of the rate of all successful requests (those in the 2xx range) and divide it by the sum of the rate of all requests. Both expressions should be limited to a specific Service. We can accomplish that through the query that follows.

sum(irate(
  istio_requests_total{
    reporter="source",
    destination_service=~"devops-toolkit-devops-toolkit.devops-toolkit.svc.cluster.local",
    response_code=~"2.*"
  }[2m]
)) / sum(irate(
  istio_requests_total{
    reporter="source",
    destination_service=~"devops-toolkit-devops-toolkit.devops-toolkit.svc.cluster.local"
  }[2m]
))
Type (or copy & paste) that query into the Expression field and press the execute button.
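The alternative mentioned earlier, tracking errors rather than successes, needs only a different label matcher. What follows is a minimal sketch that is not part of the chart used in this chapter: the non_2xx_rate_query function name is hypothetical, and the expression it prints simply swaps response_code=~"2.*" for the negated matcher response_code!~"2.*".

```shell
# Print a PromQL expression for the per-second rate of requests that were
# answered with something other than a 2xx code.
# $1 = fully qualified name of the destination service.
# The function name is an assumption made for this example.
non_2xx_rate_query() {
    printf 'sum(irate(istio_requests_total{reporter="source",destination_service=~"%s",response_code!~"2.*"}[2m]))\n' "$1"
}

non_2xx_rate_query "devops-toolkit-devops-toolkit.devops-toolkit.svc.cluster.local"
```

The printed expression could be pasted into the Expression field. Dividing it by the sum of the rate of all requests would give the error ratio, which is the complement of the success ratio produced by the query above.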
Figure 7-1-2: Prometheus graph with the percentage of successful requests
That is, finally, a query that produces useful results. The graph is kind of boring, but that is a good thing. It shows that the results are constantly 1. Since the percentage is expressed as a fraction of 1, it means that a hundred percent of the requests were successful during that whole period. That was to be expected since we have a constant loop of requests that are returning response code 200.

That’s all I’ll show in Prometheus. As I already stated, it is a means to an end, and not the main subject of this chapter.

We’ll get back to deploying releases using Argo Rollouts. But, before we do, we will not need to access the Prometheus UI any more, so let’s kill port forwarding.

pkill kubectl
Exploring Automated Analysis

We explored how to promote rollouts manually. That might be a great solution for many, but it shouldn’t be the end goal. We should strive for more. In this context, and most of the others, “more” means removal of manual repetitive actions.

We are going to try to automate the whole deployment process, including potential rollbacks. We’ll try to instruct the machines how to judge whether the process is progressing in the right direction, and whether there is an issue that might require them to roll back. We will do all that by adding analysis based on metrics stored in Prometheus.

Prometheus is not the only supported analysis engine. It could be Wavefront²³², Datadog²³³, and a few others. Even if your metrics are not in one of the supported engines, you can always use the web provider that allows fetching metrics from any service reachable through a URL. We are using Prometheus mostly because I had to pick one for the examples.
Let’s take a look at a new Helm values file.

cat rollout/values-analysis.yaml

The output, limited to the rollout entries, is as follows.

...
rollout:
  enabled: true

That one is much shorter than values-pause-x2.yaml we used before. Specifically, the rollout.steps and rollout.analysis.enabled entries are missing. That’s intentional since the default values already have the steps we will use, and the analysis is enabled by default. So, let’s take a quick look at the default values instead.

cat helm/values.yaml

The output, limited to the rollout entries, is as follows.

...
rollout:
  enabled: false
  steps:
  - setWeight: 10
  - pause: {duration: 2m}
  - setWeight: 30
  - pause: {duration: 30s}
  - setWeight: 50
  - pause: {duration: 30s}
  analysis:
    enabled: true

²³²https://www.wavefront.com/
²³³https://www.datadoghq.com/
One important difference is that none of the steps have pause: {}. Instead, there is a duration in each. That means that the process will not pause indefinitely waiting for a manual promotion. Instead, it will wait for the specified duration before moving to the next step. During that duration, the system will be evaluating metrics. We’ll see them soon. For now, what matters is that there are no manual interventions. The system will not wait for us to do anything.

If you were wondering why the first pause is set to 2m while the others are much shorter (30s), the explanation is relatively simple. Prometheus does not collect metrics in real time. Instead, it fetches them periodically. On top of that, the queries, which we will explore soon, are measuring sums over two-minute periods. So, I set the first pause to 2m to ensure that the metrics are collected before proceeding further. As you will see soon, we’ll run analysis only once we reach the second step (the second setWeight). That way, we’ll be sure that analysis is running against data collected since the canary process started, and not data from before. As for the rest of the pause durations, I set them to much lower values (30s), so that we do not waste too much time.
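Since every pause now has a fixed duration, the shortest possible time for a successful rollout can be estimated by adding the pauses together. The sketch below does that for the default steps; to_seconds is a hypothetical helper written for this example, and it handles only the s and m suffixes that appear in this chapter.

```shell
# Convert a duration such as "30s" or "2m" to seconds.
# Only the "s" and "m" suffixes are handled (an assumption made to keep
# the sketch short).
to_seconds() {
    case "$1" in
        *m) echo $(( ${1%m} * 60 )) ;;
        *s) echo "${1%s}" ;;
        *)  echo "unsupported duration: $1" >&2; return 1 ;;
    esac
}

# Add up the pauses from the default values: 2m, 30s, and 30s.
total=0
for pause in 2m 30s 30s; do
    total=$(( total + $(to_seconds "$pause") ))
done
echo "sum of pauses: ${total}s"   # prints "sum of pauses: 180s"
```

The real rollout will take longer than that sum, since starting Pods and shifting traffic take time as well.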
Let’s output rollout.yaml again, and take a closer look at the definitions inside the rollout.analysis.enabled conditions. So far, that value was set to false, so we did not bother exploring them. Now that we are setting it to true, we might just as well see what will be created.

cat helm/templates/rollout.yaml

The output, limited to the entries relevant for the analysis, is as follows.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
...
spec:
  ...
  template:
    ...
    spec:
      ...
      {{- if .Values.rollout.analysis.enabled }}
      analysis:
        templates:
        - templateName: {{ template "fullname" . }}
        startingStep: 2
        args:
        - name: service-name
          value: "{{ template "fullname" . }}-canary.{{ .Release.Namespace }}.svc.cluster.local"
      {{- end }}

{{- if .Values.rollout.analysis.enabled }}
---

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: {{ template "fullname" . }}
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 10s
    successCondition: result[0] >= 0.8
    failureCondition: result[0] < 0.8
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus-server.monitoring
        query: |
          sum(irate(
            istio_requests_total{
              reporter="source",
              destination_service=~"{{ "{{args.service-name}}" }}",
              response_code=~"2.*"
            }[2m]
          )) / sum(irate(
            istio_requests_total{
              reporter="source",
              destination_service=~"{{ "{{args.service-name}}" }}"
            }[2m]
          ))
  - name: avg-req-duration
    interval: 10s
    successCondition: result[0] <= 1000
    failureCondition: result[0] > 1000
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus-server.monitoring
        query: |
          sum(irate(
            istio_request_duration_milliseconds_sum{
              reporter="source",
              destination_service=~"{{ "{{args.service-name}}" }}"
            }[2m]
          )) / sum(irate(
            istio_request_duration_milliseconds_count{
              reporter="source",
              destination_service=~"{{ "{{args.service-name}}" }}"
            }[2m]
          ))
{{- end }}
...
There’s a lot to digest there.

To begin with, we will be adding the spec.template.spec.analysis section of the Rollout. It uses a template that is a reference to the AnalysisTemplate definition which we will explore soon. The startingStep means that the analysis will start only after the process reaches the second step. Finally, we are passing service-name to the template through the args entry.

To be honest, we do not need the args entry. We could have hard-coded the name of the service directly inside the template since we are defining it specifically for that application. Nevertheless, I wanted to show that we can pass arguments to templates. As a result, we could define templates that could be used by multiple applications. Some, if not all, will be applicable across the board, so it would be pointless to repeat the same ones over and over again. However, since we have only one application today, that is not particularly useful, so we are defining the template directly in the app.

Let’s move to the definition of the AnalysisTemplate referenced by the Rollout.

We can see a single entry in spec.args. That is the same service-name that we provided the value for in the Rollout.

The bulk of the definition is in the spec.metrics section. We have two entries over there.

There is success-rate that queries the percentage of successful requests (those with responses in the 2xx range). Rollouts will run that analysis at an interval of 10s (ten seconds). It will be considered successful (successCondition) if the result is equal to or higher than 0.8 (eighty percent). Similarly, it will be considered a failure (failureCondition) if the result is below 0.8 (eighty percent).

We do not really need to define failureCondition if it is the direct opposite of the successCondition. Both are used mostly in cases when we want to have values in between that would be interpreted as inconclusive results. Nevertheless, I wanted to show you that both can be set.
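The interplay between the two conditions can be sketched locally. The classify function below is a hypothetical illustration, not how Argo Rollouts is implemented: it treats a value that satisfies neither condition as inconclusive, and the 0.95 threshold in the last call is a made-up example of leaving a gap between success and failure.

```shell
# Classify one measurement against a success and a failure threshold.
# $1 = measured value, $2 = success threshold, $3 = failure threshold.
# Mirrors the shape of "successCondition: result >= $2" and
# "failureCondition: result < $3"; anything in between is inconclusive.
classify() {
    awk -v r="$1" -v ok="$2" -v bad="$3" 'BEGIN {
        if (r >= ok)      print "success"
        else if (r < bad) print "failure"
        else              print "inconclusive"
    }'
}

classify 1    0.8  0.8   # prints "success"
classify 0.5  0.8  0.8   # prints "failure"
classify 0.85 0.95 0.8   # prints "inconclusive" (hypothetical wider gap)
```

With both thresholds set to 0.8, as in the template above, the inconclusive branch can never be reached, which is why defining failureCondition is optional in that case.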
More often than not, we should not consider a rollout a failure at the first sign of trouble. There are many reasons why a single failure might be temporary and not an indication that the whole release is bad and should be rolled back. That’s why we are setting the failureLimit to 3. The analysis would need to fail three times for the Rollout to abort the process and roll back.

Finally, the last entry in the success-rate metric is the provider. In our case, it is prometheus, with the address and the query. It is almost the same expression as the one we executed manually in Prometheus, except that the value of the destination_service is not hard-coded. Instead, we are using the value of the service-name argument passed to the AnalysisTemplate from the Rollout.

The second metric is avg-req-duration which, as you can guess from the name, calculates the average duration of the requests. It follows the same pattern as the first metric, so there is probably no need to go through it.

Now, before we see the new definition of the Rollout in action, let’s remove the whole devops-toolkit Namespace, and start over with a clean slate.

kubectl delete namespace devops-toolkit
Deploying Releases With Fully Automated Steps

We are about to deploy the first release. The only difference from before is that, this time, we’ll use the values defined in rollout/values-analysis.yaml.

Please execute the command that follows if you are NOT using Minikube.

helm upgrade --install \
    devops-toolkit helm \
    --namespace devops-toolkit \
    --create-namespace \
    --values rollout/values-analysis.yaml \
    --set ingress.host=devops-toolkit.$ISTIO_HOST.xip.io \
    --set image.tag=2.6.2 \
    --wait

Please execute the command that follows if you are using Minikube.

helm upgrade --install \
    devops-toolkit helm \
    --namespace devops-toolkit \
    --create-namespace \
    --values rollout/values-analysis.yaml \
    --set image.tag=2.6.2 \
    --wait

Let’s watch the rollout.

kubectl argo rollouts \
    --namespace devops-toolkit \
    get rollout devops-toolkit-devops-toolkit \
    --watch
As you already know from past experience, the first deployment always rolls out right away without going through all the steps. When there is no previous release, there is no point in executing the canary deployment strategy. As a result, you should see the Status become Healthy and the ActualWeight set to 100 almost right away.

Please stop watching the rollout by pressing ctrl+c.

Let’s see what happens if there is an issue with a new release. Does it indeed roll back if the failureCondition is reached at least 3 times?

We’ll do the simulation by sending a constant stream of requests to a non-existing path. That way, the analysis of the percentage of successful requests will surely discover that the successCondition is not met.

Please go to the second terminal session, stop the loop that is currently executing by pressing ctrl+c, and execute the commands that follow.

Please execute the command that follows if you are NOT using Minikube.

while true; do
    curl -I http://devops-toolkit.$ISTIO_HOST.xip.io/this-does-not-exist
    sleep 1
done

Please execute the command that follows if you are using Minikube.

while true; do
    curl -I -H "Host: devopstoolkitseries.com" \
        "http://$ISTIO_HOST/this-does-not-exist"
    sleep 1
done
The output should be a stream of 404 Not Found responses.

Now that we are simulating issues, please go back to the first terminal session and initiate the deployment of the second release.

helm upgrade devops-toolkit helm \
    --namespace devops-toolkit \
    --reuse-values \
    --set image.tag=2.9.9

kubectl argo rollouts \
    --namespace devops-toolkit \
    get rollout devops-toolkit-devops-toolkit \
    --watch
The process should set the weight to 10 percent right away and pause for the duration of 2m. After that, it should continue by executing the second step, which sets the weight to 30 percent.

The steps are a bit misleading. In the get rollout output (the one you’re watching right now), both setWeight and pause count as steps. However, the startingStep field in the Rollout definition refers only to setWeight as steps.

The process started the analysis as soon as it set the weight to 30. We can see that through the AnalysisRun in the get rollout output. It should show that one is successful, and one is failing. A bit later, it will change both to 2, and then to 3. That means that the analysis of one of the metrics is passing (avg-req-duration), while the other is failing (success-rate).

Soon after the AnalysisRun reached the failureLimit currently set to 3, the process was aborted and the rollback started. The Status changed to Degraded, the ReplicaSet of the new release was ScaledDown, and the ActualWeight changed to 0. As a result, all the traffic is redirected to the old release. The canary deployment failed the analysis, and the process reverted to the previous state.

Now that we experienced what unsuccessful rollouts look like, we can just as well confirm that a “good” release will be rolled out fully.

Press ctrl+c to stop watching the rollout. Go to the second terminal session, stop the loop that sends requests to the non-existing path, and execute the commands that follow to start streaming “good” requests.
Please execute the command that follows if you are NOT using Minikube.

while true; do
    curl -I http://devops-toolkit.$ISTIO_HOST.xip.io
    sleep 1
done

Please execute the command that follows if you are using Minikube.

while true; do
    curl -I -H "Host: devopstoolkitseries.com" \
        "http://$ISTIO_HOST"
    sleep 1
done

We are simulating a scenario in which everything goes as planned, and we can confirm that by observing the responses. It is a steady stream of 200 OK messages.

Next, go back to the first terminal session, and execute the commands that follow to upgrade the application to a newer release and watch the rollout.

helm upgrade devops-toolkit helm \
    --namespace devops-toolkit \
    --reuse-values \
    --set image.tag=2.9.17

kubectl argo rollouts \
    --namespace devops-toolkit \
    get rollout devops-toolkit-devops-toolkit \
    --watch
The process will set the weight of the new release to 10 percent almost instantly and pause for two minutes. Once that period passes, it will increase the weight to 30 percent, and pause again. Given that we specified that analysis should be delayed, that’s when it will start evaluating metrics. We should see the AnalysisRun being set to 2 successful executions.

The process will keep progressing by moving through all the steps. It will keep increasing the weight and keep pausing for the specified durations. In parallel, it will keep running new analysis queries. They should all be successful since all the requests are getting 200 responses, and their average duration is below the threshold we specified.

Once all the steps are executed, the new release should be fully rolled out without any manual intervention on our part. Instead of promoting from one step to another manually and deciding whether to roll back ourselves, we instructed the machines to perform those tasks for us.

There’s no reason to keep the loop that sends requests running, so go to the second terminal session and press ctrl+c to stop it. Similarly, go back to the first terminal and press ctrl+c to stop watching the rollout.
What Happens Now?

No matter whether we (humans) or the machines make decisions when to roll forward and when to roll back, those are always made based on some information. Someone has to consult the state and make decisions. When we do those tasks, the decisions are not made based on arbitrary choices. We always make them based on some data, and there is no better information than metrics.

When you think about it, no matter how much we are involved in making decisions, they (almost) always follow a pattern. The beauty of patterns is that machines are very good at following them. That’s what we just did by fully automating canary deployments with Argo Rollouts.

Truth be told, the two metrics we used are almost never sufficient. Relying on a few simple queries is far from enough to give us confidence that machines can make decisions for us. Nevertheless, the “real world” high-level patterns are the same as the one we used in the demo. Figure out the patterns you use to make decisions, convert those patterns into queries, instruct Argo Rollouts how to use them, and spend your valuable time on more productive and creative tasks.

That being said, I suggest you start with manual promotions and rollbacks. That’s a very good way to gain the experience you will need to automate the whole process. Over time, you will understand which queries are reliable, and which are not. You’ll understand which metrics are missing. You will learn how to go beyond those provided out of the box, and start instrumenting your applications to provide fine-tuned metrics. Those are crucial for the decision-making process. All in all, automate everything only after you become confident in your manual processes and their reliability.

Before we move on, there is one important note I might have forgotten to mention. All our examples were based on manual execution of helm commands. That should not be the end goal. Instead, we should combine Argo Rollouts with Argo CD or some similar tool. We should be pushing changes of the desired state to Git. Argo CD should be watching for such changes and initiate deployments. If those are defined as Argo Rollouts resources, Argo CD will initiate canary or blue-green deployment strategies and let Argo Rollouts take care of the process. Finally, all that should be wrapped in continuous delivery pipelines so that changes to Git repos initiate pipeline builds that will create binaries, images, and releases, run tests, push changes to the desired state in Git, and other steps associated with application lifecycle.
That’s it. Now you should have more than enough information to decide whether Argo Rollouts is a worthwhile investment. If it is, you should have the base knowledge needed to set up a rollout strategy that works well for your use cases.

There isn’t much left to do but to destroy everything we created. We can delete all the Namespaces and everything in them with the commands that follow.

kubectl delete namespace devops-toolkit

kubectl delete namespace argo-rollouts

kubectl delete namespace monitoring

cd ..

Please destroy the whole cluster if it was created only for the purpose of running the hands-on exercises from this chapter. If you used one of my Gists to create the cluster in the first place, you will find the instructions on how to destroy it at the bottom.
This Is NOT The End

Do not open the champagne just yet. This is NOT the end. This is a work in progress, and by reading it you are taking part in an experiment. This book is not yet finished. There are other tools and processes I will add. This material will grow.

In the meantime, while waiting for me to add more “stuff”, please contact me and let me know what is missing. What do you need? Which tool would you like me to add? Which process are you interested in exploring? Which area of software development is not covered? What would you like to explore next?

This book is a community effort. As such, I would love you to get involved. Please contact me by sending a public or a private message on the DevOps20 Slack Workspace²³⁴. You can also send an email to [email protected], reach me on Twitter²³⁵ or LinkedIn²³⁶, or contact me through any other means of communication you feel comfortable with. Don’t be a stranger! The direction of this book depends on your participation and your suggestions.
²³⁴http://slack.devops20toolkit.com/
²³⁵https://twitter.com/vfarcic
²³⁶https://www.linkedin.com/in/viktorfarcic/